Recovering the 3D geometry of a scene from 2D images is one of the most fundamental and challenging problems in computer vision. On one hand, traditional geometry-based algorithms such as structure from motion (SfM) and simultaneous localization and mapping (SLAM) are fragile in certain environments, such as those dominated by textureless surfaces or repetitive patterns, and the noisy point clouds they produce are hard to process and interpret. On the other hand, recent learning-based neural networks for 3D understanding parse scenes by extrapolating patterns seen in their training data, an approach that often suffers from limited generalizability and accuracy.
In my dissertation, I address these shortcomings by combining the advantages of geometry-based and data-driven approaches into an integrated framework. More specifically, I have applied learning-based methods to extract high-level geometric structures from images and to use them for 3D parsing. To this end, I have designed specialized neural networks that understand geometric structures such as lines, junctions, planes, vanishing points, and symmetry, and detect them accurately in images; I have created large-scale 3D datasets with structural annotations to support data-driven approaches; and I have demonstrated how to use these high-level abstractions to parse and reconstruct scenes. By combining the power of data-driven approaches with geometric principles, this framework makes 3D systems more accurate, more reliable, and easier to implement, and yields clean, compact, and interpretable scene representations.
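To give a concrete sense of what such a high-level geometric structure looks like in practice, the following minimal Python sketch estimates a dominant vanishing point from an image using classical (non-learned) line-segment detection and RANSAC voting over pairwise intersections. It is illustrative only: the dissertation's detectors are learned neural networks, and the function names, thresholds, and input file below are assumptions of this sketch, not the dissertation's code.

```python
# Illustrative sketch only: a classical pipeline for one structure mentioned
# above -- a dominant vanishing point -- found by RANSAC voting over pairwise
# intersections of detected line segments. Names/thresholds are illustrative.
import cv2
import numpy as np

def detect_segments(image_path: str) -> np.ndarray:
    """Detect 2D line segments with Canny edges + probabilistic Hough."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(gray, 50, 150)
    segs = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=60,
                           minLineLength=40, maxLineGap=5)
    return segs.reshape(-1, 4) if segs is not None else np.empty((0, 4))

def homogeneous_lines(segs: np.ndarray) -> np.ndarray:
    """Turn each segment (x1, y1, x2, y2) into a homogeneous line l = p1 x p2."""
    ones = np.ones((len(segs), 1))
    p1 = np.hstack([segs[:, 0:2], ones])
    p2 = np.hstack([segs[:, 2:4], ones])
    return np.cross(p1, p2)

def ransac_vanishing_point(lines: np.ndarray, iters: int = 500,
                           thresh: float = 0.02, seed: int = 0):
    """Return the homogeneous point consistent with the most lines.

    The residual |l . v| / |l_xy| (with v unit-normalized) is an algebraic
    point-on-line error that stays well defined even for points at infinity.
    """
    if len(lines) < 2:
        return None, 0
    rng = np.random.default_rng(seed)
    best_v, best_count = None, 0
    for _ in range(iters):
        i, j = rng.choice(len(lines), size=2, replace=False)
        v = np.cross(lines[i], lines[j])   # intersection of the sampled pair
        if np.linalg.norm(v) < 1e-8:
            continue                       # (near-)identical lines: skip
        v = v / np.linalg.norm(v)
        residual = np.abs(lines @ v) / (np.linalg.norm(lines[:, :2], axis=1) + 1e-8)
        count = int(np.sum(residual < thresh))
        if count > best_count:
            best_v, best_count = v, count
    return best_v, best_count

segs = detect_segments("room.jpg")         # any indoor photo with strong lines
vp, support = ransac_vanishing_point(homogeneous_lines(segs))
```

Representing segments as homogeneous lines makes the sketch uniform: intersections, vanishing points, and even points at infinity all arise from the same cross-product operation, which is one reason such structures are convenient inputs for downstream 3D parsing.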