
Intern 2020-2021 - Transformers for object segmentation/depth estimation

Location: Paris, France
Job Category: Internship

AI/Machine Learning Internship - Samsung Strategy and Innovation Center - Paris - 6 months
Keywords: machine learning, computer vision, deep learning, segmentation, depth estimation, transformers
--------------------------------------------------------
Transformers for object segmentation/depth estimation
--------------------------------------------------------
SOTA methods for object segmentation/depth estimation on single images have achieved impressive results, which are now widely used in
domains such as autonomous driving and indoor mapping.
However, performing the same tasks on a continuous stream is different from performing them on each frame independently:
- Unwanted errors may appear in a given frame.
- Flickering between one map and the next may occur, due to errors on e.g. small objects or object borders.
- Information that is easy to infer in one frame can be challenging in the next (occlusion, illumination, etc.).
Moreover, performing segmentation/depth estimation on videos opens the door to weak supervision (leveraging the information in the unlabeled
parts of a video).
Known approaches:
- STD2P (https://scalable.mpi-inf.mpg.de/files/2017/04/cvpr2017.pdf):
  - Compute independent maps for each frame.
  - With the aid of optical flow, train a dedicated pooling module that fuses the maps into a smooth result for each keyframe.
- STGRU (https://openaccess.thecvf.com/content_cvpr_2018/papers/Nilsson_Semantic_Video_Segmentation_CVPR_2018_paper.pdf):
  - Compute independent maps for each frame.
  - Integrate optical flow (via a spatial transformer).
  - Use recurrent networks to aggregate and fuse information over time.
A simplified sketch of this shared flow-warp-and-fuse recipe is given below.
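Both approaches share the same skeleton: per-frame maps, aligned over time with optical flow, then fused. The following is a minimal sketch of that common recipe, not the exact architecture of either paper; the module names, shapes and gating layout are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(prev, flow):
    """Warp the previous frame's map toward the current frame using a
    backward optical flow field (B, 2, H, W), via bilinear grid sampling."""
    b, _, h, w = prev.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(prev.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                             # (B, 2, H, W)
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects.
    cx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    cy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(prev, torch.stack((cx, cy), dim=-1), align_corners=True)

class ConvGRUFusion(nn.Module):
    """Aggregate per-frame segmentation logits over time with a
    convolutional GRU, warping the hidden state with optical flow."""
    def __init__(self, channels):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, frame_logits, flows):
        # frame_logits: list of T tensors (B, C, H, W); flows: T-1 backward flows.
        h = frame_logits[0]
        fused = [h]
        for t in range(1, len(frame_logits)):
            h = warp(h, flows[t - 1])                  # align history to frame t
            x = frame_logits[t]
            z, r = torch.chunk(torch.sigmoid(self.gates(torch.cat([x, h], 1))), 2, 1)
            h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
            h = (1 - z) * h + z * h_tilde              # gated temporal update
            fused.append(h)
        return fused
```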
The objective of the internship is to study how transformers can be useful for this task, in several ways:
- Can we feed N segmentations/depths to a transformer encoder that computes N pooled segmentations/depths end to end? (A minimal sketch of this setup is given below.)
- How do we enforce pixel-level spatio-temporal attention?
- Can we feed N segmentations/depths to a transformer (encoder-decoder) that outputs map N+1, conditioned on the single-image segmentation at the same timestamp?
- How do we manage memory consumption? Do we need to under-represent our image features to fit within memory limits? What is the impact on the results?
- Can we get rid of the optical flow input?
- Can we train weakly? Part of the video is unlabeled, but the unlabeled frames can help in two ways: by reconstructing segmentation maps at other timestamps, which can be compared against a GT, and by making weak predictions on the unlabeled frames themselves, through soft unsupervised loss terms (see the loss sketch below).
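For the first question above, here is a minimal sketch of one possible design, under assumed shapes and names (nothing here is a prescribed architecture): N per-frame maps are tokenized into patch embeddings, joint spatio-temporal self-attention is applied across all frames, and N refined maps are decoded back. The patch tokenization is also one concrete answer to the memory question, since attention cost grows as the square of N * (H/patch) * (W/patch); coarser patches trade resolution for memory.

```python
import torch
import torch.nn as nn

class TemporalMapEncoder(nn.Module):
    """Hypothetical encoder: N per-frame maps in, N pooled maps out."""
    def __init__(self, in_ch, dim=256, patch=8, n_frames=4, layers=4, heads=8):
        super().__init__()
        # Tokenize each frame's map into (H/patch * W/patch) patch embeddings.
        self.to_tokens = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.time_emb = nn.Parameter(torch.zeros(n_frames, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        # Project tokens back to full-resolution maps.
        self.to_map = nn.ConvTranspose2d(dim, in_ch, kernel_size=patch, stride=patch)

    def forward(self, maps):                             # maps: (B, N, C, H, W)
        b, n, c, h, w = maps.shape
        toks = self.to_tokens(maps.flatten(0, 1))        # (B*N, D, h', w')
        d, hp, wp = toks.shape[1:]
        toks = toks.flatten(2).transpose(1, 2)           # (B*N, h'*w', D)
        toks = toks.reshape(b, n, hp * wp, d) + self.time_emb  # temporal embedding
        # Joint spatio-temporal attention: every patch token of every frame
        # attends to every other one; memory grows as (N*h'*w')^2.
        toks = self.encoder(toks.reshape(b, n * hp * wp, d))
        toks = toks.reshape(b * n, hp, wp, d).permute(0, 3, 1, 2)
        return self.to_map(toks).reshape(b, n, c, h, w)  # N pooled maps
```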
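For the weak-supervision question, one possible soft unsupervised loss term, sketched under the same assumptions as above (it reuses the hypothetical warp() helper from the first sketch; the KL consistency term and its weighting are illustrative choices, not a prescribed recipe):

```python
import torch.nn.functional as F

def video_loss(fused_logits, labels, flows, lam=0.1):
    """fused_logits: list of T tensors (B, C, H, W); labels[t] is a (B, H, W)
    class map for labeled frames and None for unlabeled ones."""
    sup, cons = 0.0, 0.0
    for t, logits in enumerate(fused_logits):
        # Supervised cross-entropy only where a ground-truth map exists.
        if labels[t] is not None:
            sup = sup + F.cross_entropy(logits, labels[t])
        if t > 0:
            # Soft unsupervised term: KL between frame t's prediction and the
            # previous prediction warped into frame t (gradient stopped).
            warped = warp(fused_logits[t - 1].detach(), flows[t - 1])
            cons = cons + F.kl_div(F.log_softmax(logits, dim=1),
                                   F.softmax(warped, dim=1),
                                   reduction="batchmean")
    return sup + lam * cons
```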
-------------------------------------------------------
Samsung Strategy and Innovation Center
-------------------------------------------------------
With offices in San Jose (US), Menlo Park (US), New York (US), Paris (France), Tel Aviv (Israel) and Seoul (Korea), the goal of the Samsung
Strategy and Innovation Center (SSIC) is to smartly integrate artificial intelligence into Samsung products and to promote innovation. Our main
lines of work are automated mobility and the Internet of Things, where we seek and develop high-impact solutions that transform usage. We are
customer-centric and build our technologies to respect privacy. In collaboration with Samsung's business teams, SSIC brings the latest research
innovations to create AI-optimized products that are quickly accessible to users.
https://www.samsung.com/us/ssic/