Extension Arrays

Lance provides extensions for Arrow arrays and Pandas Series to represent data types for machine learning applications.

BFloat16

BFloat16 is a 16-bit floating-point format designed for machine learning use cases. It keeps the 8 exponent bits of a 32-bit float but only 7 mantissa bits, so it has just 2-3 decimal digits of precision while covering the same range as a 32-bit float: ~1e-38 to ~1e38. By comparison, a standard 16-bit float (IEEE half precision) has a range of ~5.96e-8 to 65504.
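The precision and range figures above can be checked without any third-party package. The following is a minimal sketch in plain Python, not Lance's implementation: the helper name round_to_bfloat16 is hypothetical, it assumes round-to-nearest-even conversion from a 32-bit float, and it does not handle NaN inputs.

```python
import math
import struct


def round_to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value and return it as a Python float.

    Hypothetical helper for illustration only; NaN inputs are not handled.
    """
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Round to nearest, ties to even, on the 16 low bits that will be dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    # Keep only the upper 16 bits: 1 sign bit, 8 exponent bits, 7 mantissa bits.
    bits &= 0xFFFF0000
    (value,) = struct.unpack("<f", struct.pack("<I", bits))
    return value


# Precision: only the first 2-3 significant digits survive.
print(round_to_bfloat16(1.1))  # 1.1015625
print(round_to_bfloat16(2.1))  # 2.09375
print(round_to_bfloat16(3.4))  # 3.40625

# Range: values beyond the 16-bit float maximum (65504) are still finite.
print(round_to_bfloat16(100000.0))  # 99840.0
print(math.isfinite(round_to_bfloat16(1e38)))  # True
```

The rounded values match the ones shown in the examples that follow, which is consistent with bfloat16 being the upper 16 bits of a 32-bit float.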

Lance provides an Arrow extension array (lance.arrow.BFloat16Array) and a Pandas extension array (lance.pandas.BFloat16Dtype) for BFloat16. These are compatible with the ml_dtypes bfloat16 NumPy extension array.

If you are using Pandas, you can use the lance.bfloat16 dtype string to create the array:

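import pandas as pd
# lance.arrow is not referenced directly below; it is imported so that
# the "lance.bfloat16" dtype string is available to Pandas.
import lance.arrow

series = pd.Series([1.1, 2.1, 3.4], dtype="lance.bfloat16")
series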

0    1.1015625
1      2.09375
2      3.40625
dtype: lance.bfloat16

To create an Arrow array, use the lance.arrow.bfloat16_array() function:

from lance.arrow import bfloat16_array

array = bfloat16_array([1.1, 2.1, 3.4])
array

<lance.arrow.BFloat16Array object at 0x.+>
[1.1015625, 2.09375, 3.40625]

Finally, if you have a pre-existing NumPy array, you can convert it into either array type with from_numpy:

import numpy as np
from ml_dtypes import bfloat16
from lance.arrow import PandasBFloat16Array, BFloat16Array

np_array = np.array([1.1, 2.1, 3.4], dtype=bfloat16)
# Pandas extension array
PandasBFloat16Array.from_numpy(np_array)
# Arrow extension array
BFloat16Array.from_numpy(np_array)

<PandasBFloat16Array>
[1.1015625, 2.09375, 3.40625]
Length: 3, dtype: lance.bfloat16
<lance.arrow.BFloat16Array object at 0x.+>
[1.1015625, 2.09375, 3.40625]

When reading, these can be converted back to the NumPy bfloat16 dtype using each array class’s to_numpy method.