Lance: A Columnar Data Format for Computer Vision

Lance is a cloud-native columnar data format designed for managing large-scale computer vision datasets in production environments. Lance delivers blazing-fast performance for image and video workloads, from analytics to point queries to training scans.

What problems does Lance solve?

Today, the data tooling stack for computer vision is insufficient to serve the needs of the ML engineering community.

Working with vision data for ML is different from working with tabular data:

  • Training, analytics, and labeling use different tools, each requiring a different format

  • Data annotations are almost always deeply nested

  • Images and videos are large blobs that are difficult for existing engines to query

This results in some major pain points:

  • Too much time is spent on low-level data munging

  • Multiple copies create data quality issues, even for well-known datasets

  • Reproducibility and data versioning are extremely difficult to achieve

Lance to the rescue

To solve these pain points, we are building Lance, an open-source columnar data format optimized for computer vision with the following goals:

  • Blazing-fast performance for analytical scans and random access to individual records (for visualization and annotation)

  • Rich ML data types and integrations to eliminate manual data conversions

  • Support for vector and search indices, versioning, and schema evolution
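
To make the first goal concrete, here is a toy sketch of the two access patterns a columnar layout has to serve: scanning a single column for analytics and assembling one record by position. It uses plain Python lists as stand-in columns. The names `columns`, `scan_column`, and `take_record` are hypothetical and exist only for illustration; this is not Lance's API or its on-disk layout.

```python
# Toy illustration only: plain Python lists stand in for columns.
# This is NOT Lance's API or storage format.

# Each key is a column; position i across all columns forms record i.
columns = {
    "image_id": [101, 102, 103],
    "label": ["cat", "dog", "cat"],
    "width": [640, 1024, 800],
}


def scan_column(table, name):
    """Analytical scan: read one column without touching the others."""
    return table[name]


def take_record(table, index):
    """Random access: assemble a single record from every column."""
    return {name: values[index] for name, values in table.items()}


# Analytics-style query: count the images labeled "cat".
cat_count = sum(1 for label in scan_column(columns, "label") if label == "cat")

# Point query: fetch the second record, e.g. to visualize or annotate it.
record = take_record(columns, 1)
```

The scan touches only the `label` column, which is what makes analytical queries cheap in a columnar layout. The point lookup has to visit every column, which is why fast random access is called out as a separate goal.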
