Hi,

I've merged it.

Note that you need to install Apache Arrow C++ (master) before you
install Apache Arrow GLib (master). Apache Arrow GLib
depends on Apache Arrow C++.

Thanks,
--
kou


In <ch2pr20mb309530614d045c970a50449deb...@ch2pr20mb3095.namprd20.prod.outlook.com>
  "Re: [C-GLib] reading values quickly from a list array " on Mon, 7 Sep 2020 04:54:24 +0000,
  Ishan Anand <anand.is...@outlook.com> wrote:

> Thank you very much for the commit Kouhei-san. I'd love to use it sooner so 
> I'll use the source code directly to build Arrow-glib once this PR is in.
> 
> 
> Thank you,
> Ishan
> ________________________________
> From: Sutou Kouhei <k...@clear-code.com>
> Sent: Monday, September 7, 2020 6:44 AM
> To: user@arrow.apache.org <user@arrow.apache.org>
> Subject: Re: [C-GLib] reading values quickly from a list array
> 
> Hi,
> 
> garrow_list_array_get_value() is a relatively expensive function
> because it creates a sub list array on every call. It doesn't copy
> the array data (the data is shared), but it does create a new sub
> array object (a container for the data) at both the C++ level and
> the C level.
> 
> Apache Arrow GLib 1.0.1 doesn't have low-level APIs to access
> list array values. Sorry. I've implemented them:
> https://github.com/apache/arrow/pull/8119
> 
> They'll be included in Apache Arrow GLib 2.0.0, which will be
> released in a few months.
> 
> (Can you wait for 2.0.0?)
> 
> With these APIs, you can write like the following:
> 
> ----
> #include <stdlib.h>
> #include <arrow-glib/arrow-glib.h>
> 
> int
> main(void)
> {
>   GError *error = NULL;
> 
>   GArrowMemoryMappedInputStream *input;
>   input = garrow_memory_mapped_input_stream_new("/tmp/batch.arrow", &error);
>   if (!input) {
>     g_print("failed to open file: %s\n", error->message);
>     g_error_free(error);
>     return EXIT_FAILURE;
>   }
> 
>   {
>     GArrowRecordBatchFileReader *reader;
>     reader =
>       garrow_record_batch_file_reader_new(GARROW_SEEKABLE_INPUT_STREAM(input),
>                                           &error);
> 
>     if (!reader) {
>       g_print("failed to open file reader: %s\n", error->message);
>       g_error_free(error);
>       g_object_unref(input);
>       return EXIT_FAILURE;
>     }
> 
>     {
>       guint i;
>       guint num_batches = 100;
>       for (i = 0; i < num_batches; i++) {
>         GArrowRecordBatch *record_batch;
>         record_batch =
>           garrow_record_batch_file_reader_read_record_batch(reader, i, &error);
> 
>         GArrowArray *column =
>           garrow_record_batch_get_column_data(record_batch, 1);
>         guint length_list = garrow_array_get_length(column);
> 
>         GArrowListArray* list_arr = (GArrowListArray*)column;
> 
>         GArrowInt64Array *list_values =
>           GARROW_INT64_ARRAY(garrow_list_array_get_values(list_arr));
>         gint64 n_list_values;
>         const gint64 *raw_list_values =
>           garrow_int64_array_get_values(list_values, &n_list_values);
>         gint64 n_value_offsets;
>         const gint32 *value_offsets =
>           garrow_list_array_get_value_offsets(list_arr, &n_value_offsets);
>         guint j;
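>         /* Iterate over the lists, not the offsets: reading
>          * value_offsets[j + 1] needs one more offset than lists. */
>         for (j = 0; j < length_list; ++j) {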
>           gint32 value_offset = value_offsets[j];
>           gint32 value_length = value_offsets[j + 1] - value_offset;
>           gint32 k;
>           for (k = 0; k < value_length; ++k) {
>             raw_list_values[value_offset + k];
>           }
>         }
>         g_object_unref(list_values);
> 
>         g_object_unref(column);
> 
>         g_object_unref(record_batch);
>       }
>     }
>     g_object_unref(reader);
>   }
> 
>   g_object_unref(input);
> 
>   return EXIT_SUCCESS;
> }
> ----
> 
> It takes 0.5 seconds on my machine.
> 
> 
> Thanks,
> --
> kou
> 
> In <ch2pr20mb30959cc8165932970cd856c6eb...@ch2pr20mb3095.namprd20.prod.outlook.com>
>   "[C-GLib] reading values quickly from a list array " on Sun, 6 Sep 2020 07:40:06 +0000,
>   Ishan Anand <anand.is...@outlook.com> wrote:
> 
>> Hi
>>
>> I am trying to use the Arrow GLib API to read/write from C. Specifically,
>> while Arrow is a columnar format, I'm really excited to be able to write a
>> lot of rows from a C-like runtime and access them from Python for analytics
>> as an array per column, and vice versa.
>>
>> To get a quick example running, I created an Arrow table in Python with 100
>> million entries as follows:
>> ```py
>> import numpy as np
>> import pyarrow as pa
>>
>> foo = {
>>     "colA": np.arange(0, 1000_000),
>>     "colB": [np.arange(1, 5)] * 1000_000
>> }
>>
>> table = pa.table(foo)
>> with pa.RecordBatchFileWriter("/tmp/batch.arrow", table.schema) as writer:
>>     for _ in range(100):
>>         writer.write_table(table)
>> ```
>>
>> However, using the GLib API to read the ListArray column data looks really
>> slow. It takes about 5 seconds per record batch with a million entries,
>> while the integer column over the entire table can be iterated over in
>> under 2 seconds.
>>
>> The relevant snippet is this:
>> ```C
>>     guint num_batches = 100;
>>     for (i = 0; i < num_batches; i++) {
>>         GArrowRecordBatch *record_batch;
>>         record_batch =
>>             garrow_record_batch_file_reader_read_record_batch(reader, i, &error);
>>
>>         GArrowArray* column =
>>             garrow_record_batch_get_column_data(record_batch, 1);
>>         guint length_list = garrow_array_get_length(column);
>>         GArrowListArray* list_arr = (GArrowListArray*)column;
>>
>>         guint j;
>>         GArrowArray* list_elem;
>>         for (j = 0; j < length_list; j++) {
>>             list_elem = garrow_list_array_get_value(list_arr, j);
>>         }
>>     }
>> ```
>>
>> I can't seem to find a quicker alternative in the public GLib API to read
>> data out of a list array. Is there a way to speed up this loop?
>>
>>
>> Thank you,
>> Ishan
>>
>>
>>
