Video to 3D Reconstruction with 3D Gaussian Splatting

Posted on Apr 26, 2025
Last Updated on May 4, 2025

Neural rendering techniques have greatly changed how we create 3D models and novel views from images. One powerful approach, 3D Gaussian Splatting, stands out because it offers both high-quality visuals and fast performance, making it possible to render scenes in real time from regular image collections. The repository video-3d-reconstruction-gsplat provides a hands-on example of this method, building on recent ideas in implicit 3D modeling and multi-view geometry.

This blog delves into the theoretical foundations of the project, placing it within the wider field of neural 3D reconstruction. We will reference the repository’s structure, highlight how it builds on key principles from foundational works like Schoenberger et al. (2016) on Structure-from-Motion (SfM), and explore its connection to the seminal paper on 3D Gaussian Splatting.


1. Pipeline Overview: Bridging SfM and Neural Rendering

The reconstruction pipeline is conceptually divided into three stages, preceded by a video preprocessing step:

1.0. Video Frame Extraction using FFmpeg

Before any camera pose estimation or 3D reconstruction can begin, the input video must be converted into individual image frames. This task is efficiently handled by FFmpeg, a powerful multimedia processing tool, which extracts frames at a user-defined frame rate.

Command example:

ffmpeg -i input_video.mp4 -vf fps=10 -qscale:v 2 frames/frame_%04d.jpg

This command samples the video at 10 frames per second and writes the frames into a frames/ directory, naming them sequentially as frame_0001.jpg, frame_0002.jpg, etc., while preserving good visual quality (-qscale:v 2).

The extracted frames then serve as input to the Structure-from-Motion (SfM) pipeline. It’s critical to maintain consistent naming and ordering, as the downstream SfM tools like COLMAP rely on sequential frame indexing for feature matching and bundle adjustment.

This FFmpeg-based preprocessing step ensures that videos of arbitrary formats are converted into high-quality image sequences, forming the bridge between raw video and multi-view geometry. A key consideration in this process is choosing an appropriate frame rate (FPS) that ensures sufficient overlap between frames, which is crucial for accurate 3D reconstruction.

For this demo, 52 frames were extracted at 10 FPS at a resolution of 1920x1080.
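
If you'd rather script this step, a minimal Python wrapper around the same FFmpeg call could look like the sketch below. The function name, paths, and default FPS are illustrative assumptions, not taken from the repository.

# Minimal sketch: extract frames from a video at a chosen frame rate with FFmpeg.
# The function name, paths, and default fps are illustrative assumptions.
import subprocess
from pathlib import Path

def extract_frames(video: str, out_dir: str, fps: int = 10) -> None:
    """Dump sequentially numbered JPEG frames at `fps` frames per second."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", video,
            "-vf", f"fps={fps}",          # sample at the chosen frame rate
            "-qscale:v", "2",             # keep high JPEG quality
            f"{out_dir}/frame_%04d.jpg",  # sequential, zero-padded names
        ],
        check=True,
    )

extract_frames("input_video.mp4", "frames", fps=10)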

1.1. Camera Pose Estimation via Structure-from-Motion

The process begins with extracting camera parameters from the input video stream. This task is achieved using a Structure-from-Motion (SfM) pipeline, an approach popularized by Schoenberger et al. (2016) in their paper Structure-from-Motion Revisited. The project uses the open-source tool COLMAP, which implements feature detection, matching, and bundle adjustment to recover camera intrinsics and extrinsics.

This initial step ensures that a calibrated camera matrix is available for each video frame, a prerequisite for accurate 3D point estimation and subsequent rendering. Since the images come from video frames, they are already in sequential order, so features can be found and matched quickly with COLMAP's sequential matcher, which has $O(n)$ time complexity in the number of images; otherwise, an exhaustive matcher with $O(n^2)$ complexity would be needed.
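
For reference, a typical COLMAP run over sequential video frames might look roughly like the sketch below, invoked here from Python for consistency with the other examples. The directory and database names are assumptions; the repository may organize its inputs and outputs differently.

# Rough sketch of a COLMAP SfM run on sequential video frames.
# Directory and database names are illustrative assumptions.
import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Detect features in every extracted frame.
run(["colmap", "feature_extractor",
     "--database_path", "database.db",
     "--image_path", "frames"])

# 2. Match features only between neighboring frames: O(n) instead of O(n^2).
run(["colmap", "sequential_matcher",
     "--database_path", "database.db"])

# 3. Incremental reconstruction: recover camera poses + a sparse 3D point cloud.
run(["colmap", "mapper",
     "--database_path", "database.db",
     "--image_path", "frames",
     "--output_path", "sparse"])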

Sequential process of SfM aka camera poses + 3D point cloud

In the context of 3D space, each image frame is referred to as a ‘camera’ because it encapsulates not just pixel information, but also the intrinsic and extrinsic parameters that define the viewpoint. This includes properties such as focal length, principal point, and lens distortion (intrinsic parameters), as well as the camera’s position and orientation in 3D space relative to a world coordinate system (extrinsic parameters). These parameters enable the image to be used for geometric reasoning, such as projecting 3D points into 2D image space or reconstructing 3D structure from multiple views.
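
To make the 'camera' abstraction concrete, here is a minimal pinhole-projection sketch that maps a 3D world point into pixel coordinates using intrinsics K and extrinsics [R | t]. All values are made up for illustration and lens distortion is ignored.

# Minimal pinhole projection: 3D world point -> 2D pixel coordinates.
# K, R, t, and the 3D point are fabricated; no lens distortion is modeled.
import numpy as np

K = np.array([[1000.0,    0.0, 960.0],   # fx,  0, cx
              [   0.0, 1000.0, 540.0],   #  0, fy, cy (1920x1080 principal point)
              [   0.0,    0.0,   1.0]])
R = np.eye(3)                            # camera orientation (world -> camera)
t = np.array([0.0, 0.0, 2.0])            # camera translation

X_world = np.array([0.1, -0.2, 1.0])     # a 3D point in world coordinates

X_cam = R @ X_world + t                  # extrinsics: world frame -> camera frame
x = K @ X_cam                            # intrinsics: camera frame -> image plane
u, v = x[:2] / x[2]                      # perspective divide -> pixel coordinates
print(f"pixel: ({u:.1f}, {v:.1f})")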

Here are the SfM Point clouds from different view points:

Point Clouds from SfM with cameras (Front View)


Side View with increased point size


Top View

Challenges:

In SfM, a 3D point cloud is generated by triangulating corresponding 2D feature points detected across multiple image frames (cameras). The complexity of the point cloud arises from several factors: the density of detected features, the number and diversity of viewpoints, and the inherent noise in feature detection and matching. Each 3D point represents the intersection of back-projected rays from at least two camera positions; errors in feature localization, mismatches across images, or poor baseline geometry (small parallax) can lead to inaccurate triangulation, resulting in points that are misaligned, floating incorrectly in space, or forming noisy artifacts.
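
For context, the core triangulation step can be written as a small linear (DLT) solve: given the projection matrices of two cameras and the pixel where the same feature appears in each image, the 3D point is the least-squares intersection of the two back-projected rays. The helper below is a generic textbook formulation, not code from the repository.

# Generic linear (DLT) triangulation of one matched feature seen by two cameras.
import numpy as np

def triangulate(P1, P2, x1, x2):
    """P1, P2: 3x4 projection matrices; x1, x2: (u, v) pixels of the same feature."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)          # least-squares solution = last right-singular vector
    X = Vt[-1]
    return X[:3] / X[3]                  # dehomogenize -> 3D point

With noisy pixel coordinates or a short baseline (small parallax), this solve becomes ill-conditioned, which is exactly where the floating and misplaced points come from.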

Point Cloud anomalies in Structure from Motion Reconstruction

Additionally, areas with low texture, repetitive patterns, occlusions, or motion blur reduce the reliability of feature matching, further contributing to misplaced or spurious points. Bundle adjustment attempts to optimize camera poses and 3D points jointly, but local minima or insufficient constraints can prevent full correction of these errors, leaving parts of the reconstructed point cloud distorted or incomplete.

How does it affect 3D GS?
Noisy or incomplete SfM results limit the initialization and optimization stages of 3DGS, making it difficult for the method to fully recover fine details or correct structural errors. In essence, the limitations of SfM propagate into 3DGS, constraining its accuracy and robustness in challenging scenes.

To mitigate these issues, I plan to explore post-processing techniques such as outlier filtering (e.g., statistical outlier removal or radius-based filtering) to clean up the point cloud. I’ll write a separate blog post to share the results and link it here.
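
As a preview of the kind of clean-up I have in mind, a statistical outlier removal pass with Open3D might look like the sketch below. The input file name and thresholds are assumptions, not values from the repository.

# Sketch: clean an SfM point cloud with statistical outlier removal (Open3D).
# File names and thresholds are illustrative assumptions.
import open3d as o3d

pcd = o3d.io.read_point_cloud("sparse_points.ply")

# Drop points whose mean distance to their 20 nearest neighbors deviates from
# the global average by more than 2 standard deviations.
filtered, kept = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

print(f"kept {len(kept)} of {len(pcd.points)} points")
o3d.io.write_point_cloud("sparse_points_filtered.ply", filtered)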

1.2. 3D Gaussian Initialization

With camera poses established, the next phase involves initializing the set of 3D Gaussians. Initial positions are typically seeded from the sparse point cloud obtained from SfM, with default Gaussian parameters assigned for scale, orientation, and color. This initialization acts as a geometric prior, enabling faster convergence during optimization.

In the repository, this initialization logic is embedded in the data preparation scripts and model initialization classes. While the code does not expose this directly as a standalone function, references to initialization parameters such as initial_scale, initial_opacity, or point cloud loaders reflect this design.
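
To make the idea concrete, here is a rough sketch of how Gaussians could be seeded from an SfM point cloud. The tensor names and default values are assumptions for illustration; the repository's actual initialization code may differ.

# Illustrative sketch: seed 3D Gaussians from a sparse SfM point cloud.
# Names and defaults are assumptions, not the repository's actual code.
import torch

def init_gaussians(xyz: torch.Tensor, rgb: torch.Tensor) -> dict:
    """xyz: (N, 3) SfM points, rgb: (N, 3) colors in [0, 1]."""
    n = xyz.shape[0]
    # Scale each Gaussian roughly to its nearest-neighbor distance, so splats
    # start small in dense regions and larger in sparse ones.
    # (O(N^2) pairwise distances are fine for a sparse SfM cloud.)
    dists = torch.cdist(xyz, xyz)
    dists.fill_diagonal_(float("inf"))
    nn_dist = dists.min(dim=1).values.clamp(min=1e-4)
    return {
        "means":     xyz.clone(),                                   # position
        "scales":    nn_dist[:, None].repeat(1, 3).log(),           # log-scale per axis
        "rotations": torch.tensor([1., 0., 0., 0.]).repeat(n, 1),   # identity quaternions
        "opacities": torch.full((n, 1), 0.1),                       # start mostly transparent
        "colors":    rgb.clone(),                                   # per-point color
    }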

Training:
The model was trained on an RTX 3060 12GB for 30k iterations, which took about 30 minutes. Despite the moderate hardware, the optimization process still converged efficiently.

In 3D Gaussian training, monitoring convergence involves tracking loss reduction over iterations. During training, adjustments such as the learning rate and batch size help ensure smoother convergence. For effective optimization, early stopping or adaptive learning rates can prevent overfitting and help the model reach its best performance in the shortest time. At first, do a sanity check on the dataset with default values; after seeing the output quality, tweak the parameters and re-train the model.
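
As a crude illustration of this kind of convergence monitoring, one could track the best loss seen so far and stop once it has not improved for a while. The helper below is hypothetical and not part of the repository.

# Hypothetical helper: stop training when the loss stops improving.
class PlateauStopper:
    def __init__(self, patience: int = 2000, min_delta: float = 1e-4):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_steps = float("inf"), 0

    def should_stop(self, loss: float) -> bool:
        if loss < self.best - self.min_delta:   # meaningful improvement
            self.best, self.bad_steps = loss, 0
        else:
            self.bad_steps += 1                 # another step without progress
        return self.bad_steps >= self.patience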

What’s it trying to do?
The Gaussian Model here is trying to represent a 3D scene as a collection of 3D Gaussians (splat primitives) instead of using meshes or voxels. Each Gaussian represents a small blob in 3D space, with parameters like:

  1. position: where it is in space
  2. scale/covariance: its size and shape
  3. rotation: its orientation
  4. opacity: how transparent it is
  5. features: color and light-related properties

Splat Optimisation (Source: https://arxiv.org/abs/2308.04079)

The optimization process adjusts these parameters so that, when all Gaussians are projected from 3D to 2D, the rendered image matches the target images aka our scene.

In simple terms, the model learns where to put these blurry blobs, how big, what color, and how opaque, to collectively recreate the input images from different viewpoints.
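
The key geometric step in this projection is mapping each Gaussian's 3D covariance into a 2D screen-space covariance, following $\Sigma' = J W \Sigma W^T J^T$ from the 3DGS paper, where $W$ is the rotation part of the view transform and $J$ is the local Jacobian of the perspective projection. Below is a tiny numeric sketch with made-up values.

# Sketch: project one 3D Gaussian's covariance into 2D screen space (EWA splatting).
# All numbers are fabricated for illustration.
import numpy as np

Sigma = np.diag([0.02, 0.01, 0.03])       # 3D covariance of one Gaussian (world frame)
W = np.eye(3)                             # rotation part of the world -> camera transform
mean_cam = np.array([0.1, -0.2, 2.0])     # Gaussian center in camera coordinates
fx, fy = 1000.0, 1000.0                   # focal lengths in pixels

x, y, z = mean_cam
# Jacobian of the perspective projection, linearized at the Gaussian center.
J = np.array([[fx / z, 0.0, -fx * x / z**2],
              [0.0, fy / z, -fy * y / z**2]])

Sigma_2d = J @ W @ Sigma @ W.T @ J.T      # 2x2 screen-space covariance of the splat
print(Sigma_2d)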

Here are the renderings at different splat sizes:

GSplat 0% fill (only point clouds are visible now)


GSplat 20% fill


GSplat 50% fill


GSplat 100% fill


Point Cloud and GSplat from 5th Camera

1.3. Optimization through Differentiable Rendering

The heart of the pipeline is the gradient-based optimization loop that updates the Gaussian parameters to minimize a photometric loss. This process involves rendering the current Gaussian set into each training image using a differentiable rasterizer that implements Gaussian splatting.

During each iteration, gradients are propagated from the rendered image back to the Gaussian parameters, enabling adjustments to position, covariance, color, and opacity to better match the target views.

We will find references to loss functions (typically a pixel-wise reconstruction loss), optimizers (often Adam), and learning rate schedulers in the training script. The differentiable rasterizer is implemented using CUDA or PyTorch custom operations to handle the non-trivial task of splatting anisotropic Gaussians in 3D space.
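
Putting the pieces together, the optimization boils down to something like the skeleton below. Here `sample_training_view` and `rasterize_gaussians` are placeholders standing in for the repository's data loader and differentiable rasterizer (their real names and signatures will differ), and only a plain L1 loss is shown, whereas the original 3DGS paper combines L1 with a D-SSIM term.

# Skeleton of the photometric optimization loop (illustrative only).
import torch

# Learnable per-Gaussian parameters, here randomly initialized for brevity;
# in practice they are seeded from the SfM point cloud as described earlier.
n = 100_000
params = {
    "means":     torch.randn(n, 3, requires_grad=True),
    "scales":    torch.zeros(n, 3, requires_grad=True),
    "rotations": torch.tensor([[1., 0., 0., 0.]]).repeat(n, 1).requires_grad_(True),
    "opacities": torch.full((n, 1), 0.1, requires_grad=True),
    "colors":    torch.rand(n, 3, requires_grad=True),
}
optimizer = torch.optim.Adam(params.values(), lr=1e-3)

for step in range(30_000):
    cam, target = sample_training_view()           # placeholder: pick a training camera + image
    rendered = rasterize_gaussians(params, cam)    # placeholder for the differentiable rasterizer
    loss = torch.nn.functional.l1_loss(rendered, target)  # pixel-wise reconstruction loss
    optimizer.zero_grad()
    loss.backward()                                # gradients flow back to every Gaussian
    optimizer.step()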


2. Efficiency and Real-Time Considerations: Insights from SpeedySplat

The project’s implementation explicitly benefits from optimizations proposed by the SpeedySplat project, which addresses the computational bottlenecks of Gaussian splatting at scale.

What’s the contribution?

  1. Precise Gaussian localization: They propose two new algorithms—SnugBox (tight bounding box) and AccuTile (exact tile intersection)—to better estimate which image tiles each Gaussian overlaps, making rendering faster by reducing unnecessary pixel-Gaussian computations.
  2. Efficient pruning during training: They develop a memory-efficient pruning score that enables pruning Gaussians during training (instead of only after), via Soft Pruning (mid-training) and Hard Pruning (post-densification), reducing Gaussian counts by ~90% while maintaining image quality (a generic sketch of the pruning mechanics follows the results list below).

As a result, Speedy-Splat achieves:

  • ~6× faster rendering
  • ~10× fewer primitives (a primitive being a single 3D Gaussian)
  • Faster training, less tea break time :(
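
To give a feel for what pruning means mechanically, here is a generic sketch that simply drops Gaussians whose opacity falls below a threshold. This is not SpeedySplat's pruning score (theirs is a memory-efficient measure of each Gaussian's contribution to rendered pixels); it only shows how a boolean mask shrinks the whole parameter set at once.

# Generic illustration of pruning: keep only Gaussians above an opacity threshold.
# Not SpeedySplat's actual scoring, just the mechanics of applying a pruning mask.
import torch

def prune_by_opacity(params: dict, min_opacity: float = 0.005) -> dict:
    keep = params["opacities"].squeeze(-1) > min_opacity   # boolean mask over Gaussians
    return {name: tensor[keep] for name, tensor in params.items()}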

What’s the difference I found?
The primary purpose of SpeedySplat is to enable real-time rendering by reducing the number of splats while maintaining visual quality. Impressively, this rendering model doesn’t even require a GPU: I tested the unoptimized code and model on my Ryzen 3600 CPU and achieved over 80 FPS. This approach makes it possible to get decent frame rates even on mobile devices via the web, and with further optimization and fine-tuning, performance could be improved even more.


3. Conclusion

The video-3d-reconstruction-gsplat repository exemplifies an integration of classical computer vision (SfM) with neural rendering (Gaussian Splatting), providing a pathway for efficient 3D reconstruction from monocular videos. By bridging geometry-driven initialization with differentiable rasterization, it offers a practical tool for applications ranging from digital twins to virtual cinematography.

The codebase is modular, allowing users to plug in different pose estimation or Gaussian splatting pipelines depending on their data quality or computational constraints.

Recently, there’s been a growing push in research to eliminate traditional Structure-from-Motion (SfM) tools like COLMAP and instead integrate the entire reconstruction process directly into neural networks. While this shift promises an end-to-end, differentiable pipeline, it also raises questions about reliability, scalability, and interpretability. Classical SfM methods have been battle-tested for robustness across diverse datasets, whereas neural alternatives may struggle with generalization, require more data, and obscure the geometric foundations behind the reconstruction. Although exciting, this trend may be trading proven stability for experimental convenience, and whether these neural approaches can fully replace established tools in real-world applications remains an open question.

3D GS Resources:

  1. 3D Gaussian Splatting! - Computerphile (great channel, check it out)
  2. 3D Gaussian Splatting (official project page)
  3. video-3d-reconstruction-gsplat (hands-on repo)
  4. Nerfstudio’s gsplat implementation
  5. CF-3DGS (COLMAP-Free 3D Gaussian Splatting)
  6. SpeedySplat (optimization-focused project)
  7. COLMAP-Free 3D Gaussian Splatting
  8. 3D Gaussian Splatting Introduction - OpenCV

COLMAP Alternatives for SfM (OSS):

  1. Meshroom
  2. openMVG (Open Multiple View Geometry library)
  3. MicMac photogrammetry suite

Supporting Tools:

  1. FFmpeg (video processing)
  2. nvdiffrast (NVIDIA differentiable rasterizer)