Hello Matthieu,

On Tue, 15 Mar 2022 12:28:17 +0100
Matthieu Bolt <[email protected]> wrote:
> Dear Arrow developers,
> 
> I'm investigating if the Arrow library would be useful in our server
> backend application and I am having some questions:
> 
> 1) How can a value in an Array/Table be updated? In the examples that I
> have seen a table is constructed using ArrayBuilders, which results in
> Arrays that can be used to construct a Table with a Schema. It is unclear
> to me how to update a value once this process has been executed. Perhaps
> updating should be implemented in terms of Slicing/RecordBatches instead of
> Tables? Or is Arrow more suitable for static data and updating values does
> not fit into the general idea of Arrow.

Arrow C++ is built around the idea of immutable data, so indeed the
Array/Table/etc. objects are not suitable for updating values once you
have generated them.  Immutable data greatly simplifies concurrent data
access and eliminates synchronization costs (lock contention, etc.).
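
To illustrate what "immutable" means in practice, here is a minimal
plain C++ sketch of the pattern. It does not use any Arrow API;
ImmutableColumn and WithValue are hypothetical names made up for this
example. An "update" builds a new object and leaves the original
untouched, so readers holding the old one never need a lock.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an immutable column; this is NOT an Arrow class.
class ImmutableColumn {
 public:
  explicit ImmutableColumn(std::vector<std::int64_t> values)
      : values_(std::make_shared<const std::vector<std::int64_t>>(
            std::move(values))) {}

  std::size_t size() const { return values_->size(); }

  // Read-only element access; there is deliberately no setter.
  std::int64_t at(std::size_t i) const {
    assert(i < values_->size());
    return (*values_)[i];
  }

  // "Update" by copying the values into a brand new column.
  // The column this is called on is left unchanged.
  ImmutableColumn WithValue(std::size_t i, std::int64_t v) const {
    assert(i < values_->size());
    std::vector<std::int64_t> copy(*values_);
    copy[i] = v;
    return ImmutableColumn(std::move(copy));
  }

 private:
  // Copies of an ImmutableColumn share this const storage.
  std::shared_ptr<const std::vector<std::int64_t>> values_;
};
```

With this sketch, a.WithValue(1, 42) returns a new column while a keeps
its old contents. The copy costs O(n) per update, which is one reason
this model suits mostly-static data better than frequent point updates.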

> 2) If updating is not possible to implement for all types of Arrays, is
> this a reasonable feature request for a DictionaryArray?

No. A dictionary array is just another kind of Arrow array and is
immutable like the others.

> 3) Does the StringDictionaryBuilder execute some fancy run length
> encoding/zipping in Finish? If not, is this a reasonable feature request?

The ArrayBuilders produce data conformant to the Arrow in-memory format
specification (*), which doesn't have a run length encoding. So the
answer is "no" to both questions :-)

(if the Arrow spec ever gets a run length encoding option, then of
course it will have to be implemented in the Arrow C++ library)

(*) https://arrow.apache.org/docs/format/Columnar.html

> 4) Do all memory allocations occur in a given MemoryPool? More specifically
> if a (NUMA aware) allocator is provided where possible in the API (by
> subclassing MemoryPool?) will this allocator then be used for all
> allocations?

Yes, it will be used whenever you pass that MemoryPool to Arrow C++
APIs.
(If not, it's a bug which you should report on our bug tracker.)

Before you write your own MemoryPool implementation, though, I suggest
you try the "standard" memory pools provided by Arrow C++ (jemalloc,
mimalloc, system) to see if one of them already fits the bill.

> 5) Does the Arrow library have static variables (not constexpr) that are
> frequently accessed or that allocate memory during compute function
> execution?

The question is a bit unspecific. Most existing compute functions do not
need persistent state, so the answer would be no. What are you concerned
about?

> 6) How can application threads be provided to the compute framework,
> something like asio::io_context? Other than a bool async_mode in ExecPlan I
> couldn't find anything in the API related to multi threading.

The ExecContext you pass to ExecPlan can be customized with an Executor.

Regards

Antoine.

