Re: [C++][C Data][Table]ChunkedArray and Table

Aldrin Wed, 08 Mar 2023 11:22:07 -0800

oops, I did misread it. After seeing Weston's response, I realized you're
casting to a duckdb_arrow_schema, which I had read as ArrowSchema (I
thought it was a weird cast).


You're calling `duckdb_query_arrow_schema` and `duckdb_query_arrow_array`
correctly, as it's apparently a double pointer ([1] and [2], respectively).
But, looking at how it's used in duckdb ([3]), I think you're not releasing
memory correctly; namely, you're missing calls to `release` ([4]). I would
expect it to be a memory leak rather than a segfault, so there may be some
other related issue lurking in surrounding code outside of your excerpt. Or
possibly you're getting a weird situation where `delete` is deallocating
top-level objects, and the call to `new` is allocating directly over it (or
nearby memory) and your memory accesses are spanning regions that it
shouldn't.

Anyways, Weston's code sample is what I was thinking and it should be
straightforward to adapt it for the double pointer. If helpful, [5] seems
to be where the `*duckdb_arrow_schema` and `*duckdb_arrow_array` types are
defined. As a matter of fact, I find it suspicious that the
`duckdb_query_arrow_schema` function just dereferences to access the first
element of the struct, but I don't know enough low-level details to know if
that and not calling `release` are combining to be the root cause of your
segfaults.


[1]:
https://github.com/duckdb/duckdb/blob/master/src/main/capi/arrow-c.cpp#L22-L30
[2]:
https://github.com/duckdb/duckdb/blob/master/src/main/capi/arrow-c.cpp#L32-L46
[3]:
https://github.com/duckdb/duckdb/blob/master/test/api/capi/test_capi_arrow.cpp#L33-L38
[4]:
https://github.com/duckdb/duckdb/blob/master/test/api/capi/test_capi_arrow.cpp#L37
[5]:
https://github.com/duckdb/duckdb/blob/master/src/include/duckdb.h#L268-L273

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Wed, Mar 8, 2023 at 8:12 AM Weston Pace <[email protected]> wrote:

> I agree with Aldrin.  `arrow_schema` has type ArrowSchema*. You are
> passing in `&arrow_schema` which has type `ArrowSchema**` but you are
> casting it to `duckdb_arrow_schema*` so it at least compiles.  You can also
> simplify things by getting rid of some of this allocation.
>
> ```
>     ArrowSchema arrow_schema;
>     duckdb_query_arrow_schema(arrow_result, (duckdb_arrow_schema
> *)&arrow_schema);
>     auto schema = arrow::ImportSchema(&arrow_schema);
>     auto output_stream = arrow::io::BufferOutputStream::Create();
>     auto batch_writer = arrow::ipc::MakeStreamWriter(*output_stream,
> *schema);
>     while (true)
>     {
>       ArrowArray arrow_array;
>       duckdb_query_arrow_array(arrow_result, (duckdb_arrow_array
> *)&arrow_array);
>       if (arrow_array.length > 0)
>       {
>         (*batch_writer)->WriteRecordBatch((const arrow::RecordBatch
> &)**arrow::ImportRecordBatch(&arrow_array, *schema));
>       }
>       else
>       {
>         arrow_array.release(&arrow_array);
>         break;
>       }
>     }
>     auto buffer = (*output_stream)->Finish();
> ```
>
>
> On Tue, Mar 7, 2023 at 11:26 AM Aldrin <[email protected]> wrote:
>
>> I might be misreading, but do you need to get the address of your object
>> pointers, `*arrow_schema` and `*arrow_array`? If they're already pointers,
>> I'm not sure you need to get their address and cast that to a pointer.
>> otherwise, I don't see anything else that stands out as problematic to me.
>>
>> You could try to check the Result objects so that you get better error
>> information before the segfault occurs; as it stands now you're blindly
>> dereferencing. The other thing that might be helpful is that I think you
>> don't have to call `new` every time. In all of these instances you are
>> allocating memory for the objects, but you could just create them directly
>> and then pass references to them. For example:
>>
>>    ArrowSchema arrow_schema;
>>    duckdb_query_arrow_schema(arrow_result, &arrow_schema);
>>    ...
>>
>> Since ArrowSchema seems to be default constructible, I believe the
>> declaration alone will do the default construction (`ArrowSchema
>> arrow_schema`). But, my cpp knowledge comes and goes, so apologies if those
>> suggestions aren't accurate.
>>
>> Aldrin Montana
>> Computer Science PhD Student
>> UC Santa Cruz
>>
>>
>> On Sun, Mar 5, 2023 at 8:56 AM Kimmo Linna <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I noticed that my previous code worked only for one ArrowArray from
>>> DuckDB. I change a code this but this is still working only for one
>>> ArrowArray. The second “loop” causes a segmentation fault. Could someone
>>> guide me a bit? What should I change to get this work?
>>>
>>>
>>> ArrowSchema *arrow_schema = new ArrowSchema();
>>> duckdb_query_arrow_schema(arrow_result, (duckdb_arrow_schema *)&
>>> arrow_schema);
>>> auto schema = arrow::ImportSchema(arrow_schema);
>>> auto output_stream = arrow::io::BufferOutputStream::Create();
>>> auto batch_writer = arrow::ipc::MakeStreamWriter(*output_stream, *schema
>>> );
>>> while (true) {
>>> ArrowArray *arrow_array = new ArrowArray();
>>> duckdb_query_arrow_array(arrow_result, (duckdb_arrow_array *)&
>>> arrow_array);
>>> if( arrow_array->length > 0) {
>>> (*batch_writer)->WriteRecordBatch((const arrow::RecordBatch&) **arrow::
>>> ImportRecordBatch(arrow_array,*schema));
>>> delete arrow_array;
>>> }else{
>>> break;
>>> }
>>> }
>>> auto buffer = (*output_stream)->Finish();
>>> Best regards,
>>>
>>> Kimmo
>>>
>>> --
>>> Kimmo Linna
>>> Nihtisalontie 3 as 1
>>> 02630  ESPOO
>>> [email protected]
>>> +358 40 590 1074
>>>
>>>
>>>
>>>

Re: [C++][C Data][Table]ChunkedArray and Table

Reply via email to