KITTI Stereo Visual Odometry

This script implements a stereo visual odometry pipeline for KITTI-style sequences. It estimates the camera trajectory from stereo image pairs using disparity-based depth, ORB feature matching, and PnP pose estimation.

Overview

The pipeline works as follows:

Load stereo calibration from calib.txt
Read left and right image sequences from image_0/ and image_1/
Compute a disparity map and depth map from each stereo pair
Detect and match ORB features between consecutive left images
Reconstruct 3D points from the previous frame using stereo depth
Estimate camera motion with solvePnPRansac
Accumulate poses to build the trajectory
Plot the estimated trajectory, optionally together with ground truth

Expected Folder Structure

The input sequence directory should contain:

frames/
├── image_0/
│   ├── 000000.png
│   ├── 000001.png
│   └── ...
├── image_1/
│   ├── 000000.png
│   ├── 000001.png
│   └── ...
├── calib.txt
└── poses.txt   # optional ground-truth poses

Usage

Run the script from the command line:

python stereo_vo.py frames --max-frames 500

Arguments

sequence: Path to the sequence folder. Defaults to frames
--max-frames: Maximum number of frames to process
--no-debug: Disable match visualization

Example:

python stereo_vo.py frames --max-frames 300 --no-debug

Main Class

`KITTIStereoVO`

This class contains the full stereo visual odometry pipeline.

Constructor

vo = KITTIStereoVO(seq_path)

Parameters

seq_path: Path to a KITTI-style sequence folder

During initialization, the class:

stores paths to the left and right image folders
loads projection matrices P0 and P1
extracts intrinsic camera parameters
computes the stereo baseline
loads image file lists
creates the ORB detector and brute-force matcher
creates the StereoSGBM disparity estimator

Calibration Loading

Calibration is loaded from calib.txt using _load_calib().

Method

@staticmethod
def _load_calib(calib_path: str):
    ...

Description

The function reads projection matrices from the calibration file and extracts:

P0: projection matrix for the left camera
P1: projection matrix for the right camera

Each matrix is expected to contain 12 values and is reshaped into a 3 x 4 matrix.

Important Parameters

The camera intrinsics are derived from P0:

fx = P0[0, 0]
fy = P0[1, 1]
cx = P0[0, 2]
cy = P0[1, 2]

The stereo baseline is derived from P1:

baseline = abs(P1[0, 3] / P1[0, 0])

This uses the stereo projection model:

P1 = [fx 0 cx -fx*b; ...]

Depth Estimation

Depth is computed from the disparity map obtained using StereoSGBM.

Stereo Matcher

The script creates the stereo matcher like this:

self.stereo = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=16 * 8,
    blockSize=7,
    P1=8 * 3 * 7**2,
    P2=32 * 3 * 7**2,
    disp12MaxDiff=1,
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=32,
)

Method

def compute_depth(self, img_left_gray, img_right_gray):
    ...

Description

The disparity is computed from the left and right grayscale images:

disparity = self.stereo.compute(img_left_gray, img_right_gray).astype(np.float32) / 16.0

Depth is then calculated from disparity using:

depth = (fx * baseline) / disparity

Only disparities greater than 1.0 are considered valid. Invalid pixels are assigned NaN.

Returns

disparity: disparity map
depth: depth map in meters

3D Point Reconstruction

A pixel with a valid depth value is projected into 3D camera coordinates.

Method

def pixel_to_3d(self, u, v, depth):
    ...

Formula

x = (u - cx) * z / fx
y = (v - cy) * z / fy
z = depth

Return Value

A NumPy array:

[x, y, z]

representing the 3D point in the camera coordinate system.

Motion Estimation Pipeline

The main visual odometry loop is implemented in run().

Method

def run(self, max_frames=None, show_debug=True):
    ...

Parameters

max_frames: maximum number of frames to process
show_debug: whether to display feature match visualizations

Initialization

The method starts by:

setting the initial pose to the identity matrix
storing the initial trajectory position at the origin
loading the first stereo pair
detecting ORB features in the first left image
computing the first depth map

pose = np.eye(4, dtype=np.float64)
trajectory = [pose[:3, 3].copy()]

Feature Detection and Matching

For each new frame:

Load the current stereo pair
Detect ORB keypoints and descriptors in the current left image
Match descriptors between previous and current images

matches = self.matcher.match(prev_des, curr_des)
matches = sorted(matches, key=lambda m: m.distance)

The matcher used is:

self.matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

Because ORB produces binary descriptors, Hamming distance is used.

3D-2D Correspondence Construction

For each feature match:

take the matched keypoint in the previous image
look up its depth value in the previous depth map
reconstruct the 3D point in the previous camera frame
pair it with the matched 2D point in the current image

Only correspondences are kept if:

the pixel lies inside the image bounds
the depth is valid
the depth is positive
the depth is less than 30 meters

This produces:

obj_points: 3D points
img_points: corresponding 2D image points

If fewer than 8 correspondences are available, the frame is skipped.

Pose Estimation with PnP

Camera motion is estimated using OpenCV's RANSAC-based Perspective-n-Point solver:

success, rvec, tvec, inliers = cv2.solvePnPRansac(
    obj_points,
    img_points,
    K,
    None,
    iterationsCount=100,
    reprojectionError=3.0,
    confidence=0.999,
    flags=cv2.SOLVEPNP_ITERATIVE
)

Camera Matrix

The intrinsic matrix used is:

K = np.array([
    [self.fx, 0, self.cx],
    [0, self.fy, self.cy],
    [0, 0, 1]
], dtype=np.float64)

Success Conditions

The pose estimate is accepted only if:

success is true
inliers is not None
at least 8 inliers are found

Otherwise the frame is skipped.

Relative Transform

The rotation vector from PnP is converted into a rotation matrix:

R, _ = cv2.Rodrigues(rvec)

Then the relative transform from the previous camera to the current camera is built:

T_prev_to_curr = np.eye(4, dtype=np.float64)
T_prev_to_curr[:3, :3] = R
T_prev_to_curr[:3, 3] = tvec[:, 0]

Pose Accumulation

The current world pose is updated by composing the inverse of the relative motion:

pose = pose @ np.linalg.inv(T_prev_to_curr)
trajectory.append(pose[:3, 3].copy())

This stores the camera center of each processed frame in the trajectory.

Debug Visualization

If show_debug=True, the script displays up to 100 feature matches:

dbg = cv2.drawMatches(
    prev_left, prev_kp,
    curr_left, curr_kp,
    good_matches[:100],
    None,
    flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS
)
cv2.imshow("Matches", dbg)

Press Esc to stop early.

Return Value

The run() method returns:

np.array(trajectory)

This is an N x 3 array containing the estimated camera positions.

Trajectory Plotting

The function plot_trajectory() visualizes the estimated path.

Function

def plot_trajectory(traj, gt_path=None):
    ...

Description

The trajectory is plotted in the X-Z plane:

plt.plot(traj[:, 0], traj[:, 2], label="Estimated", linewidth=2)

If a ground-truth poses.txt file exists, it is also loaded and plotted for comparison.

Ground Truth Format

Each line in poses.txt is expected to contain 12 values representing a flattened 3 x 4 pose matrix:

r11 r12 r13 tx r21 r22 r23 ty r31 r32 r33 tz

The translation column T[:, 3] is extracted from each pose.

Example Console Output

During processing, the script prints per-frame statistics:

Frame 0001 | matches= 842 | 3D-2D= 196 | inliers= 154
Frame 0002 | matches= 801 | 3D-2D= 188 | inliers= 149
Frame 0003 | matches= 790 | 3D-2D= 173 | inliers= 132

If something fails, messages such as the following are printed:

Frame 12: not enough 3D-2D correspondences (5)
Frame 18: PnP failed

Script Entry Point

The script can also be run directly via:

if __name__ == "__main__":
    ...

It parses command-line arguments, creates the KITTIStereoVO instance, runs the pipeline, and plots the final trajectory.

Complete Source Code

import os
import cv2
import numpy as np
import matplotlib.pyplot as plt


class KITTIStereoVO:
    def __init__(self, seq_path: str):
        self.seq_path = seq_path
        self.left_dir = os.path.join(seq_path, "image_0")
        self.right_dir = os.path.join(seq_path, "image_1")
        self.calib_path = os.path.join(seq_path, "calib.txt")

        self.P0, self.P1 = self._load_calib(self.calib_path)

        self.fx = self.P0[0, 0]
        self.fy = self.P0[1, 1]
        self.cx = self.P0[0, 2]
        self.cy = self.P0[1, 2]

        # Baseline from projection matrices:
        # P1 = [fx 0 cx -fx*b; ...]
        self.baseline = abs(self.P1[0, 3] / self.P1[0, 0])

        self.left_images = sorted(
            [os.path.join(self.left_dir, f) for f in os.listdir(self.left_dir) if f.endswith(".png")]
        )
        self.right_images = sorted(
            [os.path.join(self.right_dir, f) for f in os.listdir(self.right_dir) if f.endswith(".png")]
        )

        self.orb = cv2.ORB_create(nfeatures=2500)
        self.matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

        # Stereo matcher
        self.stereo = cv2.StereoSGBM_create(
            minDisparity=0,
            numDisparities=16 * 8,   # must be divisible by 16
            blockSize=7,
            P1=8 * 3 * 7**2,
            P2=32 * 3 * 7**2,
            disp12MaxDiff=1,
            uniquenessRatio=10,
            speckleWindowSize=100,
            speckleRange=32,
        )

    @staticmethod
    def _load_calib(calib_path: str):
        data = {}
        with open(calib_path, "r") as f:
            for line in f:
                if ":" not in line:
                    continue
                key, value = line.strip().split(":", 1)
                nums = np.fromstring(value, sep=" ")
                if nums.size == 12:
                    data[key] = nums.reshape(3, 4)

        P0 = data["P0"]
        P1 = data["P1"]
        return P0, P1

    def compute_depth(self, img_left_gray, img_right_gray):
        disparity = self.stereo.compute(img_left_gray, img_right_gray).astype(np.float32) / 16.0

        # invalid disparity => no depth
        depth = np.full_like(disparity, np.nan, dtype=np.float32)
        valid = disparity > 1.0
        depth[valid] = (self.fx * self.baseline) / disparity[valid]
        return disparity, depth

    def pixel_to_3d(self, u, v, depth):
        z = depth
        x = (u - self.cx) * z / self.fx
        y = (v - self.cy) * z / self.fy
        return np.array([x, y, z], dtype=np.float32)

    def run(self, max_frames=None, show_debug=True):
        if max_frames is None:
            max_frames = len(self.left_images)

        pose = np.eye(4, dtype=np.float64)
        trajectory = [pose[:3, 3].copy()]

        prev_left = cv2.imread(self.left_images[0], cv2.IMREAD_GRAYSCALE)
        prev_right = cv2.imread(self.right_images[0], cv2.IMREAD_GRAYSCALE)

        prev_kp, prev_des = self.orb.detectAndCompute(prev_left, None)
        _, prev_depth = self.compute_depth(prev_left, prev_right)

        for i in range(1, min(max_frames, len(self.left_images))):
            curr_left = cv2.imread(self.left_images[i], cv2.IMREAD_GRAYSCALE)
            curr_right = cv2.imread(self.right_images[i], cv2.IMREAD_GRAYSCALE)

            curr_kp, curr_des = self.orb.detectAndCompute(curr_left, None)
            if prev_des is None or curr_des is None:
                prev_left, prev_right = curr_left, curr_right
                prev_kp, prev_des = curr_kp, curr_des
                _, prev_depth = self.compute_depth(prev_left, prev_right)
                continue

            matches = self.matcher.match(prev_des, curr_des)
            matches = sorted(matches, key=lambda m: m.distance)

            obj_points = []
            img_points = []
            good_matches = []

            for m in matches:
                u_prev, v_prev = prev_kp[m.queryIdx].pt
                u_curr, v_curr = curr_kp[m.trainIdx].pt

                u_i = int(round(u_prev))
                v_i = int(round(v_prev))

                if not (0 <= u_i < prev_depth.shape[1] and 0 <= v_i < prev_depth.shape[0]):
                    continue

                z = prev_depth[v_i, u_i]
                if np.isnan(z) or z <= 0 or z > 30:
                    continue

                p3d = self.pixel_to_3d(u_prev, v_prev, z)
                obj_points.append(p3d)
                img_points.append([u_curr, v_curr])
                good_matches.append(m)

            obj_points = np.asarray(obj_points, dtype=np.float32)
            img_points = np.asarray(img_points, dtype=np.float32)

            if len(obj_points) < 8:
                print(f"Frame {i}: not enough 3D-2D correspondences ({len(obj_points)})")
                prev_left, prev_right = curr_left, curr_right
                prev_kp, prev_des = curr_kp, curr_des
                _, prev_depth = self.compute_depth(prev_left, prev_right)
                continue

            K = np.array([
                [self.fx, 0, self.cx],
                [0, self.fy, self.cy],
                [0, 0, 1]
            ], dtype=np.float64)

            success, rvec, tvec, inliers = cv2.solvePnPRansac(
                obj_points,
                img_points,
                K,
                None,
                iterationsCount=100,
                reprojectionError=3.0,
                confidence=0.999,
                flags=cv2.SOLVEPNP_ITERATIVE
            )

            if not success or inliers is None or len(inliers) < 8:
                print(f"Frame {i}: PnP failed")
                prev_left, prev_right = curr_left, curr_right
                prev_kp, prev_des = curr_kp, curr_des
                _, prev_depth = self.compute_depth(prev_left, prev_right)
                continue

            R, _ = cv2.Rodrigues(rvec)

            # Transform from prev camera to current camera
            T_prev_to_curr = np.eye(4, dtype=np.float64)
            T_prev_to_curr[:3, :3] = R
            T_prev_to_curr[:3, 3] = tvec[:, 0]

            # Accumulate as world pose of current frame
            pose = pose @ np.linalg.inv(T_prev_to_curr)
            trajectory.append(pose[:3, 3].copy())

            print(
                f"Frame {i:04d} | matches={len(matches):4d} "
                f"| 3D-2D={len(obj_points):4d} | inliers={len(inliers):4d}"
            )

            if show_debug:
                dbg = cv2.drawMatches(
                    prev_left, prev_kp,
                    curr_left, curr_kp,
                    good_matches[:100],
                    None,
                    flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS
                )
                cv2.imshow("Matches", dbg)
                if cv2.waitKey(1) == 27:
                    break

            prev_left, prev_right = curr_left, curr_right
            prev_kp, prev_des = curr_kp, curr_des
            _, prev_depth = self.compute_depth(prev_left, prev_right)

        cv2.destroyAllWindows()
        return np.array(trajectory)


def plot_trajectory(traj, gt_path=None):
    plt.figure(figsize=(8, 6))
    plt.plot(traj[:, 0], traj[:, 2], label="Estimated", linewidth=2)

    if gt_path is not None and os.path.exists(gt_path):
        gt = []
        with open(gt_path, "r") as f:
            for line in f:
                vals = np.fromstring(line.strip(), sep=" ")
                if vals.size == 12:
                    T = vals.reshape(3, 4)
                    gt.append(T[:, 3])
        gt = np.array(gt)
        plt.plot(gt[:, 0], gt[:, 2], label="Ground Truth", linewidth=2)

    plt.xlabel("X [m]")
    plt.ylabel("Z [m]")
    plt.title("KITTI Stereo VO")
    plt.legend()
    plt.axis("equal")
    plt.grid(True)
    plt.show()


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="KITTI-format Stereo Visual Odometry")
    parser.add_argument("sequence", nargs="?", default="frames",
                        help="Path to sequence folder (must contain image_0/, image_1/, calib.txt)")
    parser.add_argument("--max-frames", type=int, default=500)
    parser.add_argument("--no-debug", action="store_true")
    args = parser.parse_args()

    gt_path = os.path.join(args.sequence, "poses.txt")

    vo = KITTIStereoVO(args.sequence)
    trajectory = vo.run(max_frames=args.max_frames, show_debug=not args.no_debug)
    plot_trajectory(trajectory, gt_path=gt_path)

Notes

The script assumes rectified stereo images
Depth quality depends strongly on disparity quality
The maximum accepted depth is currently limited to 30 meters
Drift accumulates over time because this is pure VO without loop closure
Ground truth plotting is optional and depends on poses.txt

Add to `mkdocs.yml`

site_name: My Project Documentation

theme:
  name: material

nav:
  - Home: index.md
  - GPS to KITTI: gps_to_kitti.md
  - Stereo Visual Odometry: stereo_vo.md

Minimal Home Page

You can put this in docs/index.md:

# Project Documentation

This site documents the KITTI data conversion and stereo visual odometry pipeline.

## Pages

- GPS to KITTI Pose Conversion
- Stereo Visual Odometry

If you want, I can also format this into a better Material for MkDocs version with: - callout boxes - code tabs - parameter tables - cleaner API-style sections

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search

KITTI Stereo Visual Odometry

Overview

Expected Folder Structure

Usage

Arguments

Main Class

KITTIStereoVO

Constructor

Parameters

Calibration Loading

Method

Description

Important Parameters

Depth Estimation

Stereo Matcher

Method

Description

Returns

3D Point Reconstruction

Method

Formula

Return Value

Motion Estimation Pipeline

Method

Parameters

Initialization

Feature Detection and Matching

3D-2D Correspondence Construction

Pose Estimation with PnP

Camera Matrix

Success Conditions

Relative Transform

Pose Accumulation

Debug Visualization

Return Value

Trajectory Plotting

Function

Description

Ground Truth Format

Example Console Output

Script Entry Point

Complete Source Code

Notes

Add to mkdocs.yml

Minimal Home Page

`KITTIStereoVO`

Add to `mkdocs.yml`