Hi, I've merged it.

Note that you need to install Apache Arrow C++ (master)
before you install Apache Arrow GLib (master). Apache Arrow
GLib depends on Apache Arrow C++.

Thanks,
--
kou

In <ch2pr20mb309530614d045c970a50449deb...@ch2pr20mb3095.namprd20.prod.outlook.com>
  "Re: [C-GLib] reading values quickly from a list array " on Mon, 7 Sep 2020 04:54:24 +0000,
  Ishan Anand <anand.is...@outlook.com> wrote:

> Thank you very much for the commit Kouhei-san. I'd love to use it sooner so
> I'll use the source code directly to build Arrow-glib once this PR is in.
>
>
> Thank you,
> Ishan
> ________________________________
> From: Sutou Kouhei <k...@clear-code.com>
> Sent: Monday, September 7, 2020 6:44 AM
> To: user@arrow.apache.org <user@arrow.apache.org>
> Subject: Re: [C-GLib] reading values quickly from a list array
>
> Hi,
>
> garrow_list_array_get_value() is a bit high cost function
> because it creates a sub list array. It doesn't copy array
> data (it shares array data) but it creates a new sub array
> (container for data) in C++ level and C level.
>
> Apache Arrow GLib 1.0.1 doesn't have low level APIs to access
> list array values. Sorry. I've implemented them:
> https://github.com/apache/arrow/pull/8119
>
> It'll be included in Apache Arrow GLib 2.0.0 that will be
> released in a few months.
>
> (Can you wait 2.0.0?)
>
> With these APIs, you can write like the following:
>
> ----
> #include <stdlib.h>
> #include <arrow-glib/arrow-glib.h>
>
> int
> main(void)
> {
>   GError *error = NULL;
>
>   GArrowMemoryMappedInputStream *input;
>   input = garrow_memory_mapped_input_stream_new("/tmp/batch.arrow", &error);
>   if (!input) {
>     g_print("failed to open file: %s\n", error->message);
>     g_error_free(error);
>     return EXIT_FAILURE;
>   }
>
>   {
>     GArrowRecordBatchFileReader *reader;
>     reader =
>       garrow_record_batch_file_reader_new(GARROW_SEEKABLE_INPUT_STREAM(input),
>                                           &error);
>
>     if (!reader) {
>       g_print("failed to open file reader: %s\n", error->message);
>       g_error_free(error);
>       g_object_unref(input);
>       return EXIT_FAILURE;
>     }
>
>     {
>       guint i;
>       guint num_batches = 100;
>       for (i = 0; i < num_batches; i++) {
>         GArrowRecordBatch *record_batch;
>         record_batch =
>           garrow_record_batch_file_reader_read_record_batch(reader, i, &error);
>
>         GArrowArray* column =
>           garrow_record_batch_get_column_data(record_batch, 1);
>         guint length_list = garrow_array_get_length(column);
>
>         GArrowListArray* list_arr = (GArrowListArray*)column;
>
>         GArrowInt64Array *list_values =
>           GARROW_INT64_ARRAY(garrow_list_array_get_values(list_arr));
>         gint64 n_list_values;
>         const gint64 *raw_list_values =
>           garrow_int64_array_get_values(list_values, &n_list_values);
>         gint64 n_value_offsets;
>         const gint32 *value_offsets =
>           garrow_list_array_get_value_offsets(list_arr, &n_value_offsets);
>         guint j;
>         /* Assuming n_value_offsets counts every offset (one more than
>          * the number of lists), stop one short so that
>          * value_offsets[j + 1] stays in bounds. */
>         for (j = 0; j + 1 < n_value_offsets; ++j) {
>           gint32 value_offset = value_offsets[j];
>           gint32 value_length = value_offsets[j + 1] - value_offset;
>           gint32 k;
>           for (k = 0; k < value_length; ++k) {
>             /* Element k of list j; read it here. */
>             raw_list_values[value_offset + k];
>           }
>         }
>         g_object_unref(list_values);
>
>         g_object_unref(column);
>
>         g_object_unref(record_batch);
>       }
>     }
>     g_object_unref(reader);
>   }
>
>   g_object_unref(input);
>
>   return EXIT_SUCCESS;
> }
> ----
>
> It takes 0.5sec on my machine.
>
>
> Thanks,
> --
> kou
>
> In <ch2pr20mb30959cc8165932970cd856c6eb...@ch2pr20mb3095.namprd20.prod.outlook.com>
>   "[C-GLib] reading values quickly from a list array " on Sun, 6 Sep 2020 07:40:06 +0000,
>   Ishan Anand <anand.is...@outlook.com> wrote:
>
>> Hi
>>
>> I am trying to use the Arrow Glib API to read/write from C. Specifically,
>> while Arrow is a columnar format, I'm really excited to be able to write a
>> lot of rows from a C like runtime and access it from python for analytics as
>> an array per column. And vice versa.
>>
>> To get a quick example running, I created an Arrow table in python with 100
>> million entries as follows:
>> ```py
>> import numpy as np
>> import pyarrow as pa
>>
>> foo = {
>>     "colA": np.arange(0, 1000_000),
>>     "colB": [np.arange(1, 5)] * 1000_000
>> }
>>
>> table = pa.table(foo)
>> with pa.RecordBatchFileWriter("/tmp/batch.arrow", table.schema) as writer:
>>     for _ in range(100):
>>         writer.write_table(table)
>> ```
>>
>> However, using the Glib API to read the ListArray column data looks really
>> slow. It takes like 5 seconds per record batch with a million entries. While
>> the integer column over the entire table can be iterated over under 2
>> seconds.
>>
>> The relevant snippet is this:
>> ```C
>>     guint num_batches = 100;
>>     for (i = 0; i < num_batches; i++) {
>>         GArrowRecordBatch *record_batch;
>>         record_batch =
>>             garrow_record_batch_file_reader_read_record_batch(reader, i, &error);
>>
>>         GArrowArray* column =
>>             garrow_record_batch_get_column_data(record_batch, 1);
>>         guint length_list = garrow_array_get_length(column);
>>         GArrowListArray* list_arr = (GArrowListArray*)column;
>>
>>         guint j;
>>         GArrowArray* list_elem;
>>         for (j = 0; j < length_list; j++) {
>>             list_elem = garrow_list_array_get_value(list_arr, j);
>>         }
>>     }
>> ```
>>
>> I can't seem to find a quicker alternative in the public Glib API to read
>> data out of a list array. Is there a way to speed up this loop?
>>
>>
>> Thank you,
>> Ishan
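
For reference, the offsets-based traversal discussed in this thread can be
sketched in plain C without Arrow GLib. This is only an illustration of the
layout, assuming a flat values buffer plus an offsets buffer with one more
entry than the number of lists; sum_list_values() and its parameters are
hypothetical names, not part of any Arrow API.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum every element of a list column stored as a flat values buffer
 * plus an offsets buffer. List j occupies
 * values[offsets[j]] .. values[offsets[j + 1] - 1], so n_offsets
 * offsets describe n_offsets - 1 lists.
 * Hypothetical helper for illustration, not an Arrow GLib function. */
int64_t
sum_list_values(const int64_t *values,
                const int32_t *offsets,
                size_t n_offsets)
{
  int64_t sum = 0;
  size_t j;
  /* Stop one short so that offsets[j + 1] is always a valid read. */
  for (j = 0; j + 1 < n_offsets; ++j) {
    int32_t k;
    for (k = offsets[j]; k < offsets[j + 1]; ++k) {
      sum += values[k];
    }
  }
  return sum;
}
```

With values {1, 2, 3, 4, 1, 2, 3, 4} and offsets {0, 4, 8} there are two
lists of four elements each, and the function returns 20. No per-list
container object is created, which is the cost the per-element
garrow_list_array_get_value() loop pays.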