Using Polars on results from DuckDB's Arrow interface in Rust

Vikram Oberoi

December 4, 2022

vikramoberoi.com

7 min read

I found myself wanting to manipulate/compute a large Arrow result set from a DuckDB query in Rust. I first wrote code to iterate over these results and compute what I needed, but the result was a lot of code that ran slowly and was cumbersome to write. I decided to reach for Polars instead.

The end result is way less code that is much more performant. I also have the Polars API to work with on any result set from DuckDB, which lets me iterate more quickly. It’s handy!

That said, this was really painful to figure out. There are a lot of documentation gaps in these projects, I still don't quite understand the Arrow APIs, and I'm new to Rust. If you're looking for a shortcut, I published a GitHub repo showing how to glue these APIs together.

A duck giving a polar bear an arrow. The bear is confused.
This whole situation is very confusing. credit: katiebcartoons.com

If you just want to use these APIs, skip the rest of this post and visit the repo. Below, I share more about how these APIs work, my journey to figuring them out, and questions that came up along the way.

Querying DuckDB

I started out using duckdb-rs, “… an ergonomic wrapper for using duckdb from Rust. It attempts to expose an interface similar to rusqlite.” The API is friendly, but as of this post’s publishing, duckdb-rs doesn’t support any nested data types, which I needed (my query results include lists). I looked into submitting a PR to support nested data types, but my inexperience with Rust quickly brought that to a halt.

The library code also starts from a fork of rusqlite and I think that makes things a bit challenging. SQLite does not support nested data types as far as I can tell (and I don’t see evidence of any such support in rusqlite). Reading through the duckdb-rs source, it seemed like one would need to shoehorn first class support for nested data types into the library or redesign a chunk of the library’s core.

I, uh, am not comfortable enough with Rust to attempt that in any reasonable amount of time.

Thankfully, the author of duckdb-rs also makes libduckdb_sys available: Rust bindings to DuckDB’s C API. And there’s a bunch of example code for how to use it in the duckdb-rs source.

This involves running a bunch of unsafe functions and dereferencing raw pointers, so you need to wrap a lot of this code in an unsafe block. Reading results from DuckDB’s Arrow interface involves two steps.

First, execute the query using the Arrow interface:

let mut result: duckdb_arrow = ptr::null_mut();
let state = duckdb_query_arrow(conn, sql.as_ptr(), &mut result);

// An example of error handling with this API. I'll skip this everywhere else.
if state == duckdb_state_DuckDBError {
    let error_message: *const c_char = duckdb_query_arrow_error(result);
    let error_message = CStr::from_ptr(error_message).to_str().unwrap();
    panic!("{}", error_message);
}
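One detail the snippet above glosses over: sql.as_ptr() only yields a valid *const c_char if sql is a NUL-terminated C string, not a plain Rust &str. Here's a minimal sketch of how I assume sql is prepared (the query text is made up):

```rust
use std::ffi::CString;

fn main() {
    // duckdb_query_arrow takes a *const c_char, so the SQL has to be
    // converted from a Rust &str into a NUL-terminated CString first.
    let sql = CString::new("SELECT id, username FROM users").unwrap();

    // as_ptr() borrows from `sql`: the pointer is only valid while
    // `sql` is alive, so `sql` must outlive the duckdb_query_arrow call.
    let ptr = sql.as_ptr();
    assert!(!ptr.is_null());
}
```

CString::new returns a Result because it fails on input containing an interior NUL byte, which C strings can't represent.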

… then fetch batches of results:

let mut ffi_arrow_array: arrow2::ffi::ArrowArray = arrow2::ffi::ArrowArray::empty();
let state = duckdb_query_arrow_array(
    result,
    &mut &mut ffi_arrow_array as *mut _ as *mut *mut c_void, // I don't get this. I got it from duckdb-rs.
);

I do not understand why ffi_arrow_array needs to be cast this way. The cast is baffling to me. What is as *mut _ doing in the middle of this? What does it do, and why is it necessary? If you know, please tell me.
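For what it's worth, here's my best mechanical reading of that incantation, demonstrated on a made-up Demo struct instead of an ArrowArray: &mut &mut x is a reference to a reference; as *mut _ turns the outer reference into a raw pointer (with the pointee type inferred); and the final cast reinterprets that as the void ** the C API expects, which works because &mut T and *mut T share a representation. Treat this as my reading, not gospel:

```rust
use std::ffi::c_void;

struct Demo {
    x: i32,
}

fn main() {
    let mut value = Demo { x: 7 };

    // Keep the inner &mut alive in a named binding. (In the DuckDB
    // call, the temporary lives for the duration of the call instead.)
    let mut inner_ref = &mut value;

    // The same shape as the duckdb-rs cast: reference-to-reference,
    // to raw pointer, reinterpreted as a C-style pointer-to-pointer.
    let pp: *mut *mut c_void = &mut inner_ref as *mut _ as *mut *mut c_void;

    // Reading back through both levels lands on `value` again, which
    // is what lets the C side write into Rust-owned memory.
    unsafe {
        let demo = *pp as *mut Demo;
        assert_eq!((*demo).x, 7);
    }
}
```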

On another note, piecing all this together is painful. I had to read DuckDB’s C API docs, DuckDB’s C header, and the arrow2 crate docs to figure out why this particular incantation works. I am not sure there is a good solution to this. These are disparate, relatively immature projects. Maybe it just takes time? (And blog posts like these?)

Turning an arrow2::ffi::ArrowArray into a Polars DataFrame

With the code above we’ve got a batch of results from DuckDB populated in these Arrow C structs. But we’d like to go from these to a Polars DataFrame. How?

I used the arrow2 crate instead of arrow because it looks like Polars uses arrow2, and I figured I might have an easier on-ramp to creating a Polars DataFrame.

We need to do the following: convert these raw C structs into a Rust-native arrow2::array::Array, downcast it to the concrete array types our query returns, build a Polars Series from each, and assemble those Series into a DataFrame.

To start, we also need the Arrow array’s schema. DuckDB provides an API for this too:

let mut schema = arrow2::ffi::ArrowSchema::empty();
let state =
    duckdb_query_arrow_schema(result, &mut &mut schema as *mut _ as *mut *mut c_void);

(There’s that baffling cast again.)

Now we’re going to exit C API-land and convert these raw C structs into a Rust-native arrow2::array::Array.

let field = arrow2::ffi::import_field_from_c(schema).unwrap();
let arrow_array =
    arrow2::ffi::import_array_from_c(ffi_arrow_array, field.data_type).expect("ok");

arrow_array is an arrow2::array::Array, which is a trait object that can be downcast to the specific Arrow array type you want. I don’t know what a Rust trait object is yet, but I do know I want to iterate over an array of integers or whatever my query actually returns.
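As far as I can tell, a trait object is a value used through an interface (a vtable) with its concrete type known only at runtime, and arrow2's as_any() hands back the standard library's &dyn Any so you can recover that concrete type with a checked downcast. A self-contained sketch with a plain Vec standing in for an Arrow array:

```rust
use std::any::Any;

fn main() {
    // A trait object: the concrete type (Vec<i32>) is erased behind
    // the `dyn Any` interface, much like arrow2's `dyn Array`.
    let erased: Box<dyn Any> = Box::new(vec![1i32, 2, 3]);

    // Downcasting to the right concrete type succeeds...
    let ints = erased
        .downcast_ref::<Vec<i32>>()
        .expect("should be a Vec<i32>");
    assert_eq!(ints, &vec![1, 2, 3]);

    // ...and asking for the wrong type returns None instead of
    // garbage, which is why the arrow2 downcasts end in unwrap().
    assert!(erased.downcast_ref::<Vec<u64>>().is_none());
}
```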

Let’s assume this is our query:

CREATE TABLE users (
    id INTEGER,
    username VARCHAR
);

-- Insert a bunch of data --

SELECT id, username FROM users;

It turns out that the arrow_array DuckDB’s Arrow interface returns is a StructArray, which is just a struct that contains multiple Arrow arrays of the same length. For our query above, it’ll be a StructArray containing two arrays: one for id and one for username.

So we first downcast to a StructArray:

let struct_array = arrow_array
    .as_any()
    .downcast_ref::<StructArray>()
    .expect("This Arrow Array should be a StructArray.");

Then we downcast each array in struct_array to its appropriate type. This was another area that took a while to grok. What are the types DuckDB returns in struct_array? Are those user IDs signed or unsigned? Are they 32-bit or 64-bit?

Print out struct_array.fields() to view the data types you’re getting back. If you wanted to construct a DataFrame dynamically from any query, you’d have to inspect these fields then downcast each array to the corresponding type.

Here’s what the code looks like to fetch and downcast to the specific Arrow array types we need for the query above:

let id_array = struct_array.values()[0]
    .as_any()
    .downcast_ref::<Int32Array>()
    .unwrap();

let username_array = struct_array.values()[1]
    .as_any()
    .downcast_ref::<Utf8Array<i32>>()
    .unwrap();

Int32Array is an alias provided by arrow2 for a PrimitiveArray containing i32-type data. All primitive types are stored in Arrow arrays of type PrimitiveArray.

Utf8Array is not an alias. I think the generic argument to Utf8Array is the type used for Arrow offsets: the offsets that identify where the UTF-8 string at a given index starts. If you have the array ["hello", "world"], the offset for "hello" is 0 and the offset for "world" is 5.

(If that is incorrect, holler.)
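That matches my understanding of the Arrow layout: a Utf8Array holds one concatenated byte buffer plus an offsets array with one more entry than there are strings, so string i lives at values[offsets[i]..offsets[i + 1]]. A sketch of how those offsets would be built:

```rust
fn main() {
    // Build an Arrow-style values buffer and offsets for two strings.
    let strings = ["hello", "world"];
    let mut values = String::new();
    let mut offsets: Vec<i32> = vec![0]; // offsets always start at 0

    for s in &strings {
        values.push_str(s);
        offsets.push(values.len() as i32); // where this string ends
    }

    assert_eq!(values, "helloworld");
    assert_eq!(offsets, vec![0, 5, 10]);

    // String i is values[offsets[i]..offsets[i + 1]].
    assert_eq!(&values[offsets[0] as usize..offsets[1] as usize], "hello");
    assert_eq!(&values[offsets[1] as usize..offsets[2] as usize], "world");
}
```

With i32 offsets the buffer tops out around 2 GiB; i64 offsets give you the Arrow spec's large-string variant (LargeUtf8) instead.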

A good place to start reading more about these types is the docs for arrow2::array.

Next, we create a Series from each of these arrays:

let id_series = Series::try_from(("id", id_array.to_boxed())).unwrap();
let username_series =
    Series::try_from(("username", username_array.to_boxed())).unwrap();

to_boxed() turns the downcast arrays into a boxed arrow2::array::Array, the trait object all Arrow arrays are represented as in arrow2. This is what a Polars Series expects.

Finally, our DataFrame:

let df = DataFrame::new(vec![id_series, username_series]).unwrap();

… and that’s how you start with a DuckDB query and end up with a Polars dataframe in Rust.

Further reading