KITTI Stereo Visual Odometry
This script implements a stereo visual odometry pipeline for KITTI-style sequences. It estimates the camera trajectory from stereo image pairs using disparity-based depth, ORB feature matching, and PnP pose estimation.
Overview
The pipeline works as follows:
- Load stereo calibration from
calib.txt - Read left and right image sequences from
image_0/andimage_1/ - Compute a disparity map and depth map from each stereo pair
- Detect and match ORB features between consecutive left images
- Reconstruct 3D points from the previous frame using stereo depth
- Estimate camera motion with
solvePnPRansac - Accumulate poses to build the trajectory
- Plot the estimated trajectory, optionally together with ground truth
Expected Folder Structure
The input sequence directory should contain:
frames/
├── image_0/
│ ├── 000000.png
│ ├── 000001.png
│ └── ...
├── image_1/
│ ├── 000000.png
│ ├── 000001.png
│ └── ...
├── calib.txt
└── poses.txt # optional ground-truth poses
Usage
Run the script from the command line:
python stereo_vo.py frames --max-frames 500
Arguments
sequence: Path to the sequence folder. Defaults toframes--max-frames: Maximum number of frames to process--no-debug: Disable match visualization
Example:
python stereo_vo.py frames --max-frames 300 --no-debug
Main Class
KITTIStereoVO
This class contains the full stereo visual odometry pipeline.
Constructor
vo = KITTIStereoVO(seq_path)
Parameters
seq_path: Path to a KITTI-style sequence folder
During initialization, the class:
- stores paths to the left and right image folders
- loads projection matrices
P0andP1 - extracts intrinsic camera parameters
- computes the stereo baseline
- loads image file lists
- creates the ORB detector and brute-force matcher
- creates the StereoSGBM disparity estimator
Calibration Loading
Calibration is loaded from calib.txt using _load_calib().
Method
@staticmethod
def _load_calib(calib_path: str):
...
Description
The function reads projection matrices from the calibration file and extracts:
P0: projection matrix for the left cameraP1: projection matrix for the right camera
Each matrix is expected to contain 12 values and is reshaped into a 3 x 4 matrix.
Important Parameters
The camera intrinsics are derived from P0:
fx = P0[0, 0]
fy = P0[1, 1]
cx = P0[0, 2]
cy = P0[1, 2]
The stereo baseline is derived from P1:
baseline = abs(P1[0, 3] / P1[0, 0])
This uses the stereo projection model:
P1 = [fx 0 cx -fx*b; ...]
Depth Estimation
Depth is computed from the disparity map obtained using StereoSGBM.
Stereo Matcher
The script creates the stereo matcher like this:
self.stereo = cv2.StereoSGBM_create(
minDisparity=0,
numDisparities=16 * 8,
blockSize=7,
P1=8 * 3 * 7**2,
P2=32 * 3 * 7**2,
disp12MaxDiff=1,
uniquenessRatio=10,
speckleWindowSize=100,
speckleRange=32,
)
Method
def compute_depth(self, img_left_gray, img_right_gray):
...
Description
The disparity is computed from the left and right grayscale images:
disparity = self.stereo.compute(img_left_gray, img_right_gray).astype(np.float32) / 16.0
Depth is then calculated from disparity using:
depth = (fx * baseline) / disparity
Only disparities greater than 1.0 are considered valid. Invalid pixels are assigned NaN.
Returns
disparity: disparity mapdepth: depth map in meters
3D Point Reconstruction
A pixel with a valid depth value is projected into 3D camera coordinates.
Method
def pixel_to_3d(self, u, v, depth):
...
Formula
x = (u - cx) * z / fx
y = (v - cy) * z / fy
z = depth
Return Value
A NumPy array:
[x, y, z]
representing the 3D point in the camera coordinate system.
Motion Estimation Pipeline
The main visual odometry loop is implemented in run().
Method
def run(self, max_frames=None, show_debug=True):
...
Parameters
max_frames: maximum number of frames to processshow_debug: whether to display feature match visualizations
Initialization
The method starts by:
- setting the initial pose to the identity matrix
- storing the initial trajectory position at the origin
- loading the first stereo pair
- detecting ORB features in the first left image
- computing the first depth map
pose = np.eye(4, dtype=np.float64)
trajectory = [pose[:3, 3].copy()]
Feature Detection and Matching
For each new frame:
- Load the current stereo pair
- Detect ORB keypoints and descriptors in the current left image
- Match descriptors between previous and current images
matches = self.matcher.match(prev_des, curr_des)
matches = sorted(matches, key=lambda m: m.distance)
The matcher used is:
self.matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
Because ORB produces binary descriptors, Hamming distance is used.
3D-2D Correspondence Construction
For each feature match:
- take the matched keypoint in the previous image
- look up its depth value in the previous depth map
- reconstruct the 3D point in the previous camera frame
- pair it with the matched 2D point in the current image
Only correspondences are kept if:
- the pixel lies inside the image bounds
- the depth is valid
- the depth is positive
- the depth is less than
30meters
This produces:
obj_points: 3D pointsimg_points: corresponding 2D image points
If fewer than 8 correspondences are available, the frame is skipped.
Pose Estimation with PnP
Camera motion is estimated using OpenCV's RANSAC-based Perspective-n-Point solver:
success, rvec, tvec, inliers = cv2.solvePnPRansac(
obj_points,
img_points,
K,
None,
iterationsCount=100,
reprojectionError=3.0,
confidence=0.999,
flags=cv2.SOLVEPNP_ITERATIVE
)
Camera Matrix
The intrinsic matrix used is:
K = np.array([
[self.fx, 0, self.cx],
[0, self.fy, self.cy],
[0, 0, 1]
], dtype=np.float64)
Success Conditions
The pose estimate is accepted only if:
successis trueinliersis notNone- at least 8 inliers are found
Otherwise the frame is skipped.
Relative Transform
The rotation vector from PnP is converted into a rotation matrix:
R, _ = cv2.Rodrigues(rvec)
Then the relative transform from the previous camera to the current camera is built:
T_prev_to_curr = np.eye(4, dtype=np.float64)
T_prev_to_curr[:3, :3] = R
T_prev_to_curr[:3, 3] = tvec[:, 0]
Pose Accumulation
The current world pose is updated by composing the inverse of the relative motion:
pose = pose @ np.linalg.inv(T_prev_to_curr)
trajectory.append(pose[:3, 3].copy())
This stores the camera center of each processed frame in the trajectory.
Debug Visualization
If show_debug=True, the script displays up to 100 feature matches:
dbg = cv2.drawMatches(
prev_left, prev_kp,
curr_left, curr_kp,
good_matches[:100],
None,
flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS
)
cv2.imshow("Matches", dbg)
Press Esc to stop early.
Return Value
The run() method returns:
np.array(trajectory)
This is an N x 3 array containing the estimated camera positions.
Trajectory Plotting
The function plot_trajectory() visualizes the estimated path.
Function
def plot_trajectory(traj, gt_path=None):
...
Description
The trajectory is plotted in the X-Z plane:
plt.plot(traj[:, 0], traj[:, 2], label="Estimated", linewidth=2)
If a ground-truth poses.txt file exists, it is also loaded and plotted for comparison.
Ground Truth Format
Each line in poses.txt is expected to contain 12 values representing a flattened 3 x 4 pose matrix:
r11 r12 r13 tx r21 r22 r23 ty r31 r32 r33 tz
The translation column T[:, 3] is extracted from each pose.
Example Console Output
During processing, the script prints per-frame statistics:
Frame 0001 | matches= 842 | 3D-2D= 196 | inliers= 154
Frame 0002 | matches= 801 | 3D-2D= 188 | inliers= 149
Frame 0003 | matches= 790 | 3D-2D= 173 | inliers= 132
If something fails, messages such as the following are printed:
Frame 12: not enough 3D-2D correspondences (5)
Frame 18: PnP failed
Script Entry Point
The script can also be run directly via:
if __name__ == "__main__":
...
It parses command-line arguments, creates the KITTIStereoVO instance, runs the pipeline, and plots the final trajectory.
Complete Source Code
import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
class KITTIStereoVO:
def __init__(self, seq_path: str):
self.seq_path = seq_path
self.left_dir = os.path.join(seq_path, "image_0")
self.right_dir = os.path.join(seq_path, "image_1")
self.calib_path = os.path.join(seq_path, "calib.txt")
self.P0, self.P1 = self._load_calib(self.calib_path)
self.fx = self.P0[0, 0]
self.fy = self.P0[1, 1]
self.cx = self.P0[0, 2]
self.cy = self.P0[1, 2]
# Baseline from projection matrices:
# P1 = [fx 0 cx -fx*b; ...]
self.baseline = abs(self.P1[0, 3] / self.P1[0, 0])
self.left_images = sorted(
[os.path.join(self.left_dir, f) for f in os.listdir(self.left_dir) if f.endswith(".png")]
)
self.right_images = sorted(
[os.path.join(self.right_dir, f) for f in os.listdir(self.right_dir) if f.endswith(".png")]
)
self.orb = cv2.ORB_create(nfeatures=2500)
self.matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
# Stereo matcher
self.stereo = cv2.StereoSGBM_create(
minDisparity=0,
numDisparities=16 * 8, # must be divisible by 16
blockSize=7,
P1=8 * 3 * 7**2,
P2=32 * 3 * 7**2,
disp12MaxDiff=1,
uniquenessRatio=10,
speckleWindowSize=100,
speckleRange=32,
)
@staticmethod
def _load_calib(calib_path: str):
data = {}
with open(calib_path, "r") as f:
for line in f:
if ":" not in line:
continue
key, value = line.strip().split(":", 1)
nums = np.fromstring(value, sep=" ")
if nums.size == 12:
data[key] = nums.reshape(3, 4)
P0 = data["P0"]
P1 = data["P1"]
return P0, P1
def compute_depth(self, img_left_gray, img_right_gray):
disparity = self.stereo.compute(img_left_gray, img_right_gray).astype(np.float32) / 16.0
# invalid disparity => no depth
depth = np.full_like(disparity, np.nan, dtype=np.float32)
valid = disparity > 1.0
depth[valid] = (self.fx * self.baseline) / disparity[valid]
return disparity, depth
def pixel_to_3d(self, u, v, depth):
z = depth
x = (u - self.cx) * z / self.fx
y = (v - self.cy) * z / self.fy
return np.array([x, y, z], dtype=np.float32)
def run(self, max_frames=None, show_debug=True):
if max_frames is None:
max_frames = len(self.left_images)
pose = np.eye(4, dtype=np.float64)
trajectory = [pose[:3, 3].copy()]
prev_left = cv2.imread(self.left_images[0], cv2.IMREAD_GRAYSCALE)
prev_right = cv2.imread(self.right_images[0], cv2.IMREAD_GRAYSCALE)
prev_kp, prev_des = self.orb.detectAndCompute(prev_left, None)
_, prev_depth = self.compute_depth(prev_left, prev_right)
for i in range(1, min(max_frames, len(self.left_images))):
curr_left = cv2.imread(self.left_images[i], cv2.IMREAD_GRAYSCALE)
curr_right = cv2.imread(self.right_images[i], cv2.IMREAD_GRAYSCALE)
curr_kp, curr_des = self.orb.detectAndCompute(curr_left, None)
if prev_des is None or curr_des is None:
prev_left, prev_right = curr_left, curr_right
prev_kp, prev_des = curr_kp, curr_des
_, prev_depth = self.compute_depth(prev_left, prev_right)
continue
matches = self.matcher.match(prev_des, curr_des)
matches = sorted(matches, key=lambda m: m.distance)
obj_points = []
img_points = []
good_matches = []
for m in matches:
u_prev, v_prev = prev_kp[m.queryIdx].pt
u_curr, v_curr = curr_kp[m.trainIdx].pt
u_i = int(round(u_prev))
v_i = int(round(v_prev))
if not (0 <= u_i < prev_depth.shape[1] and 0 <= v_i < prev_depth.shape[0]):
continue
z = prev_depth[v_i, u_i]
if np.isnan(z) or z <= 0 or z > 30:
continue
p3d = self.pixel_to_3d(u_prev, v_prev, z)
obj_points.append(p3d)
img_points.append([u_curr, v_curr])
good_matches.append(m)
obj_points = np.asarray(obj_points, dtype=np.float32)
img_points = np.asarray(img_points, dtype=np.float32)
if len(obj_points) < 8:
print(f"Frame {i}: not enough 3D-2D correspondences ({len(obj_points)})")
prev_left, prev_right = curr_left, curr_right
prev_kp, prev_des = curr_kp, curr_des
_, prev_depth = self.compute_depth(prev_left, prev_right)
continue
K = np.array([
[self.fx, 0, self.cx],
[0, self.fy, self.cy],
[0, 0, 1]
], dtype=np.float64)
success, rvec, tvec, inliers = cv2.solvePnPRansac(
obj_points,
img_points,
K,
None,
iterationsCount=100,
reprojectionError=3.0,
confidence=0.999,
flags=cv2.SOLVEPNP_ITERATIVE
)
if not success or inliers is None or len(inliers) < 8:
print(f"Frame {i}: PnP failed")
prev_left, prev_right = curr_left, curr_right
prev_kp, prev_des = curr_kp, curr_des
_, prev_depth = self.compute_depth(prev_left, prev_right)
continue
R, _ = cv2.Rodrigues(rvec)
# Transform from prev camera to current camera
T_prev_to_curr = np.eye(4, dtype=np.float64)
T_prev_to_curr[:3, :3] = R
T_prev_to_curr[:3, 3] = tvec[:, 0]
# Accumulate as world pose of current frame
pose = pose @ np.linalg.inv(T_prev_to_curr)
trajectory.append(pose[:3, 3].copy())
print(
f"Frame {i:04d} | matches={len(matches):4d} "
f"| 3D-2D={len(obj_points):4d} | inliers={len(inliers):4d}"
)
if show_debug:
dbg = cv2.drawMatches(
prev_left, prev_kp,
curr_left, curr_kp,
good_matches[:100],
None,
flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS
)
cv2.imshow("Matches", dbg)
if cv2.waitKey(1) == 27:
break
prev_left, prev_right = curr_left, curr_right
prev_kp, prev_des = curr_kp, curr_des
_, prev_depth = self.compute_depth(prev_left, prev_right)
cv2.destroyAllWindows()
return np.array(trajectory)
def plot_trajectory(traj, gt_path=None):
plt.figure(figsize=(8, 6))
plt.plot(traj[:, 0], traj[:, 2], label="Estimated", linewidth=2)
if gt_path is not None and os.path.exists(gt_path):
gt = []
with open(gt_path, "r") as f:
for line in f:
vals = np.fromstring(line.strip(), sep=" ")
if vals.size == 12:
T = vals.reshape(3, 4)
gt.append(T[:, 3])
gt = np.array(gt)
plt.plot(gt[:, 0], gt[:, 2], label="Ground Truth", linewidth=2)
plt.xlabel("X [m]")
plt.ylabel("Z [m]")
plt.title("KITTI Stereo VO")
plt.legend()
plt.axis("equal")
plt.grid(True)
plt.show()
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="KITTI-format Stereo Visual Odometry")
parser.add_argument("sequence", nargs="?", default="frames",
help="Path to sequence folder (must contain image_0/, image_1/, calib.txt)")
parser.add_argument("--max-frames", type=int, default=500)
parser.add_argument("--no-debug", action="store_true")
args = parser.parse_args()
gt_path = os.path.join(args.sequence, "poses.txt")
vo = KITTIStereoVO(args.sequence)
trajectory = vo.run(max_frames=args.max_frames, show_debug=not args.no_debug)
plot_trajectory(trajectory, gt_path=gt_path)
Notes
- The script assumes rectified stereo images
- Depth quality depends strongly on disparity quality
- The maximum accepted depth is currently limited to
30meters - Drift accumulates over time because this is pure VO without loop closure
- Ground truth plotting is optional and depends on
poses.txt
Add to mkdocs.yml
site_name: My Project Documentation
theme:
name: material
nav:
- Home: index.md
- GPS to KITTI: gps_to_kitti.md
- Stereo Visual Odometry: stereo_vo.md
Minimal Home Page
You can put this in docs/index.md:
# Project Documentation
This site documents the KITTI data conversion and stereo visual odometry pipeline.
## Pages
- GPS to KITTI Pose Conversion
- Stereo Visual Odometry
If you want, I can also format this into a better Material for MkDocs version with: - callout boxes - code tabs - parameter tables - cleaner API-style sections