Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding
Posted by an anonymous user on 2025-05-31 10:56:26

1. Introduction
Pretrained backbones with fine-tuning have been widely
applied to various 2D vision and NLP tasks [13, 2, 10, 3],
where a backbone network pretrained on a large dataset is
combined with a task-specific back-end and then fine-tuned
for different downstream tasks. This approach demonstrates
superior performance and great advantages in reducing
the workload of network design and training, as well as the
amount of labeled data required for different vision tasks.

* Interns at Microsoft Research Asia. † Contact person.

In this work, we present a pretrained 3D backbone, named
SWIN3D, for 3D indoor scene understanding tasks. Our
method represents the 3D point cloud of an input 3D scene as
sparse voxels in 3D space and adapts the Swin Transformer
[30] designed for regular 2D images to unorganized 3D
points as the 3D backbone. We analyze the key issues that
prevent the naïve 3D extension of Swin Transformer from
exploring large models and achieving high performance,
i.e., the high memory complexity and the ignorance of signal
irregularity. Based on our analysis, we develop a novel
3D self-attention operator to compute the self-attentions of
sparse voxels within each local window. The operator reduces the
memory cost of self-attention from quadratic to linear with
respect to the number of sparse voxels within a window,
computes efficiently, and enhances self-attention by capturing
various signal irregularities through our generalized contextual
relative positional embedding [48, 26].
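
To make the sparse-voxel and local-window setting concrete, the following is a minimal Python sketch. It is not the paper's operator: the function names, the voxel size, and the window size are assumptions chosen for illustration. It maps points to occupied voxels, groups the voxels into non-overlapping cubic windows, and counts how many attention scores would be stored if a full n x n matrix were kept per window, compared with a count that grows linearly in n.

```python
from collections import defaultdict

def voxelize(points, voxel_size):
    """Map each 3D point to an integer voxel coordinate; keep only occupied voxels."""
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        voxels[key].append((x, y, z))
    return dict(voxels)

def group_into_windows(voxel_keys, window_size):
    """Assign each occupied voxel to a non-overlapping window of window_size^3 cells."""
    windows = defaultdict(list)
    for i, j, k in voxel_keys:
        window_key = (i // window_size, j // window_size, k // window_size)
        windows[window_key].append((i, j, k))
    return dict(windows)

def attention_entries(windows):
    """Return (sum of n^2, sum of n) over windows, n = occupied voxels per window."""
    quadratic = sum(len(v) ** 2 for v in windows.values())
    linear = sum(len(v) for v in windows.values())
    return quadratic, linear

# Toy input (assumed values): four points, two of which share a voxel.
points = [(0.1, 0.2, 0.3), (0.15, 0.22, 0.31), (0.9, 0.1, 0.1), (2.6, 0.4, 0.2)]
voxels = voxelize(points, voxel_size=0.5)
windows = group_into_windows(voxels.keys(), window_size=4)
quadratic, linear = attention_entries(windows)
```

Because only occupied voxels are stored, the number of voxels per window varies from window to window, which is one reason a dense 2D-style implementation does not carry over directly. The counts above only illustrate what quadratic versus linear growth means; how the proposed operator achieves the linear cost is not described in this section.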
The novel design of our SWIN3D backbone enables us to
scale up the backbone model and the amount of data used
for pretraining. To this end, we pretrained a large SWIN3D
model with 60M parameters via a 3D semantic segmentation
task over a synthetic 3D indoor scene dataset [60] that
includes 21K rooms and is about ten times larger than the
ScanNet dataset. After pretraining, we cascade the pretrained
SWIN3D backbone with task-specific back-end decoders
and fine-tune the models for various downstream 3D indoor
scene understanding tasks.
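
The backbone-plus-decoder arrangement described above can be sketched structurally as follows. This is a framework-free toy under stated assumptions: `ToyBackbone`, `PerVoxelHead`, `SceneLevelHead`, and `DownstreamModel` are hypothetical names, the backbone is a single random linear map rather than a Transformer, and the fine-tuning loop itself is omitted. The sketch only shows that each downstream model starts from its own copy of the same pretrained weights and attaches a different task-specific head.

```python
import copy
import random

class ToyBackbone:
    """Hypothetical stand-in for a pretrained backbone: one linear map per voxel."""
    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
                       for _ in range(feat_dim)]

    def __call__(self, voxel_inputs):
        return [[sum(w * x for w, x in zip(row, v)) for row in self.weight]
                for v in voxel_inputs]

class PerVoxelHead:
    """Hypothetical decoder that produces class scores for every voxel."""
    def __init__(self, feat_dim, num_classes, seed=1):
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-1.0, 1.0) for _ in range(feat_dim)]
                       for _ in range(num_classes)]

    def __call__(self, feats):
        return [[sum(w * f for w, f in zip(row, feat)) for row in self.weight]
                for feat in feats]

class SceneLevelHead:
    """Hypothetical decoder that averages voxel features into one scene vector."""
    def __call__(self, feats):
        n = len(feats)
        return [sum(col) / n for col in zip(*feats)]

class DownstreamModel:
    """Backbone followed by a task-specific head."""
    def __init__(self, pretrained_backbone, head):
        # Each task starts from its own copy of the pretrained weights,
        # so fine-tuning one task would not alter another.
        self.backbone = copy.deepcopy(pretrained_backbone)
        self.head = head

    def __call__(self, voxel_inputs):
        return self.head(self.backbone(voxel_inputs))

pretrained = ToyBackbone(in_dim=3, feat_dim=4)
seg_model = DownstreamModel(pretrained, PerVoxelHead(feat_dim=4, num_classes=2))
scene_model = DownstreamModel(pretrained, SceneLevelHead())

voxel_inputs = [[0.0, 0.5, 1.0], [1.0, 0.0, 0.5]]
seg_scores = seg_model(voxel_inputs)      # one list of class scores per voxel
scene_vector = scene_model(voxel_inputs)  # one feature vector for the whole scene
```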