Hello Matthieu,

On Tue, 15 Mar 2022 12:28:17 +0100
Matthieu Bolt <[email protected]> wrote:
> Dear Arrow developers,
>
> I'm investigating if the Arrow library would be useful in our server
> backend application and I am having some questions:
>
> 1) How can a value in an Array/Table be updated? In the examples that I
> have seen a table is constructed using ArrayBuilders, which results in
> Arrays that can be used to construct a Table with a Schema. It is unclear
> to me how to update a value once this process has been executed. Perhaps
> updating should be implemented in terms of Slicing/RecordBatches instead of
> Tables? Or is Arrow more suitable for static data and updating values does
> not fit into the general idea of Arrow.

Arrow C++ is built around the idea of immutable data, so indeed the
Array/Table/etc. objects are not suitable for updating values once you
have generated them. Immutable data greatly simplifies data access and
eliminates synchronization costs (contention on locks etc.)

> 2) If updating is not possible to implement for all types of Arrays, is
> this a reasonable feature request for a DictionaryArray?

Neither. A dictionary array is just another kind of Arrow array and is
immutable like the others.

> 3) Does the StringDictionaryBuilder execute some fancy run length
> encoding/zipping in Finish? If not, is this a reasonable feature request?

The ArrayBuilders produce data conformant to the Arrow in-memory format
specification (*), which doesn't have a run length encoding. So the
answer is "no" to both questions :-)

(if the Arrow spec ever gets a run length encoding option, then of
course it will have to be implemented in the Arrow C++ library)

(*) https://arrow.apache.org/docs/format/Columnar.html

> 4) Do all memory allocations occur in a given MemoryPool? More specifically
> if a (NUMA aware) allocator is provided where possible in the API (by
> subclassing MemoryPool?) will this allocator then be used for all
> allocations?

It will be used whenever you pass that MemoryPool to Arrow C++ APIs,
yes. (if not, it's a bug which you should report on our bug tracker)

Before you write your own MemoryPool implementation, though, I suggest
you try the "standard" memory pools provided by Arrow C++ (jemalloc,
mimalloc, system) to see if one of them already fits the bill.

> 5) Does the Arrow library have static variables (not constexpr) that are
> frequently accessed or that allocate memory during compute function
> execution?

The question is a bit unspecific. Most existing compute functions do
not need persistent state, so the answer would be no. What are you
concerned about?

> 6) How can application threads be provided to the compute framework,
> something like asio::io_context? Other than a bool async_mode in ExecPlan I
> couldn't find anything in the API related to multi threading.

The ExecContext you pass to ExecPlan can be customized with an Executor.

Regards

Antoine.
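
P.S. To make the answer to question 1 a bit more concrete, here is a
minimal sketch of the "immutable data, update by building a new object"
pattern. It deliberately uses only the C++ standard library and not the
Arrow API: Column, MakeColumn and WithValue are hypothetical names made
up for illustration, and the O(n) copy is simply the most naive way to
derive a new column from an old one.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for an immutable array (NOT an Arrow type): once
// built, the values are only reachable through a pointer-to-const, so any
// number of threads can read them without locking.
using Column = std::shared_ptr<const std::vector<std::int64_t>>;

// Freeze a vector of values into an immutable column.
Column MakeColumn(std::vector<std::int64_t> values) {
  return std::make_shared<std::vector<std::int64_t>>(std::move(values));
}

// "Update" by copying: returns a new column with the value at `index`
// replaced. The input column is left untouched, so readers that still
// hold it are unaffected.
Column WithValue(const Column& input, std::size_t index, std::int64_t value) {
  if (index >= input->size()) {
    throw std::out_of_range("index out of range");
  }
  std::vector<std::int64_t> copy(*input);  // O(n) copy of the whole column
  copy[index] = value;
  return MakeColumn(std::move(copy));
}
```

Calling WithValue(col, 1, 42) hands back a separate column, while col
itself keeps its old contents. Whether a copy like this is affordable
depends on how large the columns are and how often values change.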
