Accurate pose estimation and 3D reconstruction are important in a variety of applications, such as autonomous navigation or mapping in uncertain or unknown environments. Bundle adjustment (BA) and simultaneous localization and mapping (SLAM) are commonly used approaches to these and related problems. Given a sequence of images, BA is the problem of simultaneously inferring the camera poses and the observed 3D landmarks. It is typically solved by minimizing the re-projection error between image observations (image features) and their predictions, obtained by projecting the corresponding landmarks onto the image plane. This optimization is usually realized with non-linear least-squares (NLS) methods. Different techniques exist for detecting image features, including the well-known SIFT features. Yet, state-of-the-art BA and visual SLAM approaches formulate the constraints in the NLS optimization using only image feature coordinates.
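The re-projection-error formulation above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a pinhole model with identity camera rotations, a hypothetical focal length, and uses `scipy.optimize.least_squares` as a stand-in NLS solver, refining only a single landmark given known camera poses.

```python
import numpy as np
from scipy.optimize import least_squares

f = 500.0  # assumed focal length in pixels (hypothetical value)

def project(point_cam):
    """Pinhole projection of a point given in the camera frame."""
    x, y, z = point_cam
    return np.array([f * x / z, f * y / z])

# Two camera centers in the world frame (identity rotations for simplicity).
camera_centers = [np.zeros(3), np.array([1.0, 0.0, 0.0])]

def residuals(landmark, observations):
    """Stacked re-projection residuals over all cameras."""
    return np.concatenate(
        [project(landmark - c) - uv for c, uv in zip(camera_centers, observations)]
    )

# Synthetic measurements from a known landmark, then a perturbed initial guess.
true_landmark = np.array([0.3, -0.2, 5.0])
observations = [project(true_landmark - c) for c in camera_centers]
sol = least_squares(residuals, true_landmark + 0.5, args=(observations,))
```

In a full BA problem the camera poses would be optimized jointly with the landmarks, and robust loss functions would typically down-weight outlier matches; the structure of the residual, however, is the same.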
In this work we propose to incorporate within BA a new type of constraint that uses feature scale information readily available from typical image feature detectors (e.g., SIFT or SURF). While feature scales (and orientations) play an important role in image matching, they have not thus far been utilized for estimation purposes in a BA framework. Our approach exploits feature scale information to enhance the accuracy of bundle adjustment, especially along the optical axis of the camera in a monocular setup. Specifically, we formulate constraints between the measured and predicted feature scales, where the latter depends on the distance between the camera and the corresponding 3D point, and we optimize the system variables to minimize the residual error in these constraints in addition to the standard re-projection error. We study our approach both in synthetic environments and on real-image ground and aerial datasets (KITTI and Kagaru), demonstrating a significant reduction in positioning error.
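The idea of augmenting the re-projection residual with a scale residual can be sketched as follows. This is only an illustration under stated assumptions, not the paper's formulation: it assumes the detected scale falls off inversely with the camera-to-landmark distance, with hypothetical reference values `sigma_ref` and `d_ref`, a single camera at the origin, and `scipy.optimize.least_squares` as the solver.

```python
import numpy as np
from scipy.optimize import least_squares

f = 500.0                     # assumed focal length in pixels
sigma_ref, d_ref = 4.0, 2.0   # hypothetical canonical scale at a reference distance

def project(p):
    """Pinhole projection of a point in the camera frame."""
    return np.array([f * p[0] / p[2], f * p[1] / p[2]])

def predicted_scale(p):
    """Predicted feature scale; assumed inversely proportional to the
    camera-to-landmark distance (illustrative model)."""
    return sigma_ref * d_ref / np.linalg.norm(p)

def residuals(p, uv, scale):
    """Re-projection residual stacked with the scale residual."""
    return np.concatenate([project(p) - uv, [predicted_scale(p) - scale]])

# Synthetic single-view measurement: re-projection alone leaves the depth
# unconstrained along the optical ray; the scale term pins it down.
true_p = np.array([0.3, -0.2, 5.0])
uv, scale = project(true_p), predicted_scale(true_p)
sol = least_squares(residuals, true_p + np.array([0.1, 0.1, 1.5]),
                    args=(uv, scale))
```

The single-view case makes the benefit visible: the two re-projection residuals alone cannot determine three landmark coordinates, whereas adding the scale residual constrains the point along the optical axis, which is exactly the direction where monocular BA is weakest.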