KITTI Stereo Visual Odometry

This script implements a stereo visual odometry pipeline for KITTI-style sequences. It estimates the camera trajectory from stereo image pairs using disparity-based depth, ORB feature matching, and PnP pose estimation.

Overview

The pipeline works as follows:

  1. Load stereo calibration from calib.txt
  2. Read left and right image sequences from image_0/ and image_1/
  3. Compute a disparity map and depth map from each stereo pair
  4. Detect and match ORB features between consecutive left images
  5. Reconstruct 3D points from the previous frame using stereo depth
  6. Estimate camera motion with solvePnPRansac
  7. Accumulate poses to build the trajectory
  8. Plot the estimated trajectory, optionally together with ground truth

Expected Folder Structure

The input sequence directory should contain:

frames/
├── image_0/
│   ├── 000000.png
│   ├── 000001.png
│   └── ...
├── image_1/
│   ├── 000000.png
│   ├── 000001.png
│   └── ...
├── calib.txt
└── poses.txt   # optional ground-truth poses

Usage

Run the script from the command line:

python stereo_vo.py frames --max-frames 500

Arguments

  • sequence: Path to the sequence folder. Defaults to frames
  • --max-frames: Maximum number of frames to process
  • --no-debug: Disable match visualization

Example:

python stereo_vo.py frames --max-frames 300 --no-debug

Main Class

KITTIStereoVO

This class contains the full stereo visual odometry pipeline.

Constructor

vo = KITTIStereoVO(seq_path)

Parameters

  • seq_path: Path to a KITTI-style sequence folder

During initialization, the class:

  • stores paths to the left and right image folders
  • loads projection matrices P0 and P1
  • extracts intrinsic camera parameters
  • computes the stereo baseline
  • loads image file lists
  • creates the ORB detector and brute-force matcher
  • creates the StereoSGBM disparity estimator

Calibration Loading

Calibration is loaded from calib.txt using _load_calib().

Method

@staticmethod
def _load_calib(calib_path: str):
    ...

Description

The function reads projection matrices from the calibration file and extracts:

  • P0: projection matrix for the left camera
  • P1: projection matrix for the right camera

Each matrix is expected to contain 12 values and is reshaped into a 3 x 4 matrix.

Important Parameters

The camera intrinsics are derived from P0:

fx = P0[0, 0]
fy = P0[1, 1]
cx = P0[0, 2]
cy = P0[1, 2]

The stereo baseline is derived from P1:

baseline = abs(P1[0, 3] / P1[0, 0])

This uses the stereo projection model:

P1 = [fx 0 cx -fx*b; ...]

Depth Estimation

Depth is computed from the disparity map obtained using StereoSGBM.

Stereo Matcher

The script creates the stereo matcher like this:

self.stereo = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=16 * 8,
    blockSize=7,
    P1=8 * 3 * 7**2,
    P2=32 * 3 * 7**2,
    disp12MaxDiff=1,
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=32,
)

Method

def compute_depth(self, img_left_gray, img_right_gray):
    ...

Description

The disparity is computed from the left and right grayscale images:

disparity = self.stereo.compute(img_left_gray, img_right_gray).astype(np.float32) / 16.0

Depth is then calculated from disparity using:

depth = (fx * baseline) / disparity

Only disparities greater than 1.0 are considered valid. Invalid pixels are assigned NaN.

Returns

  • disparity: disparity map
  • depth: depth map in meters

3D Point Reconstruction

A pixel with a valid depth value is projected into 3D camera coordinates.

Method

def pixel_to_3d(self, u, v, depth):
    ...

Formula

x = (u - cx) * z / fx
y = (v - cy) * z / fy
z = depth

Return Value

A NumPy array:

[x, y, z]

representing the 3D point in the camera coordinate system.

Motion Estimation Pipeline

The main visual odometry loop is implemented in run().

Method

def run(self, max_frames=None, show_debug=True):
    ...

Parameters

  • max_frames: maximum number of frames to process
  • show_debug: whether to display feature match visualizations

Initialization

The method starts by:

  • setting the initial pose to the identity matrix
  • storing the initial trajectory position at the origin
  • loading the first stereo pair
  • detecting ORB features in the first left image
  • computing the first depth map
pose = np.eye(4, dtype=np.float64)
trajectory = [pose[:3, 3].copy()]

Feature Detection and Matching

For each new frame:

  1. Load the current stereo pair
  2. Detect ORB keypoints and descriptors in the current left image
  3. Match descriptors between previous and current images
matches = self.matcher.match(prev_des, curr_des)
matches = sorted(matches, key=lambda m: m.distance)

The matcher used is:

self.matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

Because ORB produces binary descriptors, Hamming distance is used.

3D-2D Correspondence Construction

For each feature match:

  • take the matched keypoint in the previous image
  • look up its depth value in the previous depth map
  • reconstruct the 3D point in the previous camera frame
  • pair it with the matched 2D point in the current image

Only correspondences are kept if:

  • the pixel lies inside the image bounds
  • the depth is valid
  • the depth is positive
  • the depth is less than 30 meters

This produces:

  • obj_points: 3D points
  • img_points: corresponding 2D image points

If fewer than 8 correspondences are available, the frame is skipped.

Pose Estimation with PnP

Camera motion is estimated using OpenCV's RANSAC-based Perspective-n-Point solver:

success, rvec, tvec, inliers = cv2.solvePnPRansac(
    obj_points,
    img_points,
    K,
    None,
    iterationsCount=100,
    reprojectionError=3.0,
    confidence=0.999,
    flags=cv2.SOLVEPNP_ITERATIVE
)

Camera Matrix

The intrinsic matrix used is:

K = np.array([
    [self.fx, 0, self.cx],
    [0, self.fy, self.cy],
    [0, 0, 1]
], dtype=np.float64)

Success Conditions

The pose estimate is accepted only if:

  • success is true
  • inliers is not None
  • at least 8 inliers are found

Otherwise the frame is skipped.

Relative Transform

The rotation vector from PnP is converted into a rotation matrix:

R, _ = cv2.Rodrigues(rvec)

Then the relative transform from the previous camera to the current camera is built:

T_prev_to_curr = np.eye(4, dtype=np.float64)
T_prev_to_curr[:3, :3] = R
T_prev_to_curr[:3, 3] = tvec[:, 0]

Pose Accumulation

The current world pose is updated by composing the inverse of the relative motion:

pose = pose @ np.linalg.inv(T_prev_to_curr)
trajectory.append(pose[:3, 3].copy())

This stores the camera center of each processed frame in the trajectory.

Debug Visualization

If show_debug=True, the script displays up to 100 feature matches:

dbg = cv2.drawMatches(
    prev_left, prev_kp,
    curr_left, curr_kp,
    good_matches[:100],
    None,
    flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS
)
cv2.imshow("Matches", dbg)

Press Esc to stop early.

Return Value

The run() method returns:

np.array(trajectory)

This is an N x 3 array containing the estimated camera positions.

Trajectory Plotting

The function plot_trajectory() visualizes the estimated path.

Function

def plot_trajectory(traj, gt_path=None):
    ...

Description

The trajectory is plotted in the X-Z plane:

plt.plot(traj[:, 0], traj[:, 2], label="Estimated", linewidth=2)

If a ground-truth poses.txt file exists, it is also loaded and plotted for comparison.

Ground Truth Format

Each line in poses.txt is expected to contain 12 values representing a flattened 3 x 4 pose matrix:

r11 r12 r13 tx r21 r22 r23 ty r31 r32 r33 tz

The translation column T[:, 3] is extracted from each pose.

Example Console Output

During processing, the script prints per-frame statistics:

Frame 0001 | matches= 842 | 3D-2D= 196 | inliers= 154
Frame 0002 | matches= 801 | 3D-2D= 188 | inliers= 149
Frame 0003 | matches= 790 | 3D-2D= 173 | inliers= 132

If something fails, messages such as the following are printed:

Frame 12: not enough 3D-2D correspondences (5)
Frame 18: PnP failed

Script Entry Point

The script can also be run directly via:

if __name__ == "__main__":
    ...

It parses command-line arguments, creates the KITTIStereoVO instance, runs the pipeline, and plots the final trajectory.

Complete Source Code

import os
import cv2
import numpy as np
import matplotlib.pyplot as plt


class KITTIStereoVO:
    def __init__(self, seq_path: str):
        self.seq_path = seq_path
        self.left_dir = os.path.join(seq_path, "image_0")
        self.right_dir = os.path.join(seq_path, "image_1")
        self.calib_path = os.path.join(seq_path, "calib.txt")

        self.P0, self.P1 = self._load_calib(self.calib_path)

        self.fx = self.P0[0, 0]
        self.fy = self.P0[1, 1]
        self.cx = self.P0[0, 2]
        self.cy = self.P0[1, 2]

        # Baseline from projection matrices:
        # P1 = [fx 0 cx -fx*b; ...]
        self.baseline = abs(self.P1[0, 3] / self.P1[0, 0])

        self.left_images = sorted(
            [os.path.join(self.left_dir, f) for f in os.listdir(self.left_dir) if f.endswith(".png")]
        )
        self.right_images = sorted(
            [os.path.join(self.right_dir, f) for f in os.listdir(self.right_dir) if f.endswith(".png")]
        )

        self.orb = cv2.ORB_create(nfeatures=2500)
        self.matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

        # Stereo matcher
        self.stereo = cv2.StereoSGBM_create(
            minDisparity=0,
            numDisparities=16 * 8,   # must be divisible by 16
            blockSize=7,
            P1=8 * 3 * 7**2,
            P2=32 * 3 * 7**2,
            disp12MaxDiff=1,
            uniquenessRatio=10,
            speckleWindowSize=100,
            speckleRange=32,
        )

    @staticmethod
    def _load_calib(calib_path: str):
        data = {}
        with open(calib_path, "r") as f:
            for line in f:
                if ":" not in line:
                    continue
                key, value = line.strip().split(":", 1)
                nums = np.fromstring(value, sep=" ")
                if nums.size == 12:
                    data[key] = nums.reshape(3, 4)

        P0 = data["P0"]
        P1 = data["P1"]
        return P0, P1

    def compute_depth(self, img_left_gray, img_right_gray):
        disparity = self.stereo.compute(img_left_gray, img_right_gray).astype(np.float32) / 16.0

        # invalid disparity => no depth
        depth = np.full_like(disparity, np.nan, dtype=np.float32)
        valid = disparity > 1.0
        depth[valid] = (self.fx * self.baseline) / disparity[valid]
        return disparity, depth

    def pixel_to_3d(self, u, v, depth):
        z = depth
        x = (u - self.cx) * z / self.fx
        y = (v - self.cy) * z / self.fy
        return np.array([x, y, z], dtype=np.float32)

    def run(self, max_frames=None, show_debug=True):
        if max_frames is None:
            max_frames = len(self.left_images)

        pose = np.eye(4, dtype=np.float64)
        trajectory = [pose[:3, 3].copy()]

        prev_left = cv2.imread(self.left_images[0], cv2.IMREAD_GRAYSCALE)
        prev_right = cv2.imread(self.right_images[0], cv2.IMREAD_GRAYSCALE)

        prev_kp, prev_des = self.orb.detectAndCompute(prev_left, None)
        _, prev_depth = self.compute_depth(prev_left, prev_right)

        for i in range(1, min(max_frames, len(self.left_images))):
            curr_left = cv2.imread(self.left_images[i], cv2.IMREAD_GRAYSCALE)
            curr_right = cv2.imread(self.right_images[i], cv2.IMREAD_GRAYSCALE)

            curr_kp, curr_des = self.orb.detectAndCompute(curr_left, None)
            if prev_des is None or curr_des is None:
                prev_left, prev_right = curr_left, curr_right
                prev_kp, prev_des = curr_kp, curr_des
                _, prev_depth = self.compute_depth(prev_left, prev_right)
                continue

            matches = self.matcher.match(prev_des, curr_des)
            matches = sorted(matches, key=lambda m: m.distance)

            obj_points = []
            img_points = []
            good_matches = []

            for m in matches:
                u_prev, v_prev = prev_kp[m.queryIdx].pt
                u_curr, v_curr = curr_kp[m.trainIdx].pt

                u_i = int(round(u_prev))
                v_i = int(round(v_prev))

                if not (0 <= u_i < prev_depth.shape[1] and 0 <= v_i < prev_depth.shape[0]):
                    continue

                z = prev_depth[v_i, u_i]
                if np.isnan(z) or z <= 0 or z > 30:
                    continue

                p3d = self.pixel_to_3d(u_prev, v_prev, z)
                obj_points.append(p3d)
                img_points.append([u_curr, v_curr])
                good_matches.append(m)

            obj_points = np.asarray(obj_points, dtype=np.float32)
            img_points = np.asarray(img_points, dtype=np.float32)

            if len(obj_points) < 8:
                print(f"Frame {i}: not enough 3D-2D correspondences ({len(obj_points)})")
                prev_left, prev_right = curr_left, curr_right
                prev_kp, prev_des = curr_kp, curr_des
                _, prev_depth = self.compute_depth(prev_left, prev_right)
                continue

            K = np.array([
                [self.fx, 0, self.cx],
                [0, self.fy, self.cy],
                [0, 0, 1]
            ], dtype=np.float64)

            success, rvec, tvec, inliers = cv2.solvePnPRansac(
                obj_points,
                img_points,
                K,
                None,
                iterationsCount=100,
                reprojectionError=3.0,
                confidence=0.999,
                flags=cv2.SOLVEPNP_ITERATIVE
            )

            if not success or inliers is None or len(inliers) < 8:
                print(f"Frame {i}: PnP failed")
                prev_left, prev_right = curr_left, curr_right
                prev_kp, prev_des = curr_kp, curr_des
                _, prev_depth = self.compute_depth(prev_left, prev_right)
                continue

            R, _ = cv2.Rodrigues(rvec)

            # Transform from prev camera to current camera
            T_prev_to_curr = np.eye(4, dtype=np.float64)
            T_prev_to_curr[:3, :3] = R
            T_prev_to_curr[:3, 3] = tvec[:, 0]

            # Accumulate as world pose of current frame
            pose = pose @ np.linalg.inv(T_prev_to_curr)
            trajectory.append(pose[:3, 3].copy())

            print(
                f"Frame {i:04d} | matches={len(matches):4d} "
                f"| 3D-2D={len(obj_points):4d} | inliers={len(inliers):4d}"
            )

            if show_debug:
                dbg = cv2.drawMatches(
                    prev_left, prev_kp,
                    curr_left, curr_kp,
                    good_matches[:100],
                    None,
                    flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS
                )
                cv2.imshow("Matches", dbg)
                if cv2.waitKey(1) == 27:
                    break

            prev_left, prev_right = curr_left, curr_right
            prev_kp, prev_des = curr_kp, curr_des
            _, prev_depth = self.compute_depth(prev_left, prev_right)

        cv2.destroyAllWindows()
        return np.array(trajectory)


def plot_trajectory(traj, gt_path=None):
    plt.figure(figsize=(8, 6))
    plt.plot(traj[:, 0], traj[:, 2], label="Estimated", linewidth=2)

    if gt_path is not None and os.path.exists(gt_path):
        gt = []
        with open(gt_path, "r") as f:
            for line in f:
                vals = np.fromstring(line.strip(), sep=" ")
                if vals.size == 12:
                    T = vals.reshape(3, 4)
                    gt.append(T[:, 3])
        gt = np.array(gt)
        plt.plot(gt[:, 0], gt[:, 2], label="Ground Truth", linewidth=2)

    plt.xlabel("X [m]")
    plt.ylabel("Z [m]")
    plt.title("KITTI Stereo VO")
    plt.legend()
    plt.axis("equal")
    plt.grid(True)
    plt.show()


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="KITTI-format Stereo Visual Odometry")
    parser.add_argument("sequence", nargs="?", default="frames",
                        help="Path to sequence folder (must contain image_0/, image_1/, calib.txt)")
    parser.add_argument("--max-frames", type=int, default=500)
    parser.add_argument("--no-debug", action="store_true")
    args = parser.parse_args()

    gt_path = os.path.join(args.sequence, "poses.txt")

    vo = KITTIStereoVO(args.sequence)
    trajectory = vo.run(max_frames=args.max_frames, show_debug=not args.no_debug)
    plot_trajectory(trajectory, gt_path=gt_path)

Notes

  • The script assumes rectified stereo images
  • Depth quality depends strongly on disparity quality
  • The maximum accepted depth is currently limited to 30 meters
  • Drift accumulates over time because this is pure VO without loop closure
  • Ground truth plotting is optional and depends on poses.txt

Add to mkdocs.yml

site_name: My Project Documentation

theme:
  name: material

nav:
  - Home: index.md
  - GPS to KITTI: gps_to_kitti.md
  - Stereo Visual Odometry: stereo_vo.md

Minimal Home Page

You can put this in docs/index.md:

# Project Documentation

This site documents the KITTI data conversion and stereo visual odometry pipeline.

## Pages

- GPS to KITTI Pose Conversion
- Stereo Visual Odometry

If you want, I can also format this into a better Material for MkDocs version with: - callout boxes - code tabs - parameter tables - cleaner API-style sections