Re: [DISCUSS] [C++] custom allocator for large objects

2020-07-11 Thread Wes McKinney
On Sat, Jul 11, 2020 at 4:10 AM Rémi Dettai wrote: > > Hi Micah, > > Thanks for the answer! But it seems your email got split in half in some > way ;-) > > My use case mainly focuses on aggregations (with group by), and after > fighting quite a bit with the allocators I ended up thinking that it

Re: [DISCUSS] [C++] custom allocator for large objects

2020-07-11 Thread Rémi Dettai
Hi Micah, Thanks for the answer! But it seems your email got split in half in some way ;-) My use case mainly focuses on aggregations (with group by), and after fighting quite a bit with the allocators I ended up thinking that it might not be worth materializing the raw data as arrow tables i

Re: [DISCUSS] [C++] custom allocator for large objects

2020-07-09 Thread Micah Kornfield
Sorry for the delay. Clearing through my inbox backlog ... We should double-check the code, but one thing that has bitten me in the past with variable-width data is that the binary array builder ReserveData call [1] does not act the same way Reserve works. The former only grows the buffer by the exa
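The distinction above matters for reallocation counts. The following is a toy sketch (not Arrow's actual `BinaryBuilder`; the class and method names here are illustrative) contrasting exact-fit growth, which the email says `ReserveData` does, with geometric growth, which is what `Reserve`-style amortized appends rely on:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy byte buffer contrasting the two growth policies discussed above.
// Names are illustrative only; this is not Arrow's builder API.
struct ToyBuffer {
    std::vector<char> data;
    std::size_t reallocs = 0;

    // Grow capacity only by the exact number of bytes needed, so every
    // append that exceeds capacity triggers a reallocation (the hazard
    // the email attributes to calling ReserveData per batch).
    void append_exact(const char* bytes, std::size_t n) {
        if (data.size() + n > data.capacity()) {
            data.reserve(data.size() + n);  // exact fit: realloc every time
            ++reallocs;
        }
        data.insert(data.end(), bytes, bytes + n);
    }

    // Grow capacity geometrically (doubling), so the number of
    // reallocations is logarithmic in the final size.
    void append_geometric(const char* bytes, std::size_t n) {
        if (data.size() + n > data.capacity()) {
            data.reserve(std::max(data.capacity() * 2, data.size() + n));
            ++reallocs;
        }
        data.insert(data.end(), bytes, bytes + n);
    }
};
```

Appending 100 chunks of 10 bytes costs 100 reallocations under the exact-fit policy but only a handful under doubling, which is why exact-fit growth inside a per-batch decode loop is painful.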

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-16 Thread Rémi Dettai
Hi Antoine and all! Sorry for the delay, I wanted to understand things a bit better before getting back to you. As discussed, I focused on the Parquet case. I've looked into parquet/encoding.cc to see what could be done to have a better memory reservation with ByteArrays. On my journey, I noti

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Antoine Pitrou
On 05/06/2020 at 17:09, Rémi Dettai wrote: > I looked into the details of why the decoder could not estimate the target > Arrow array size for my Parquet column. It's because I am decoding from > Parquet-Dictionary to Arrow-Plain (which is the default when loading > Parquet). In this case the s

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Antoine Pitrou
On 05/06/2020 at 16:25, Uwe L. Korn wrote: > > On Fri, Jun 5, 2020, at 3:13 PM, Rémi Dettai wrote: >> Hi Antoine! >>> I would indeed have expected jemalloc to do that (remap the pages) >> I have no idea about the performance gain this would provide (if any). >> Could be interesting to explore

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Rémi Dettai
I looked into the details of why the decoder could not estimate the target Arrow array size for my Parquet column. It's because I am decoding from Parquet-Dictionary to Arrow-Plain (which is the default when loading Parquet). In this case the size prediction is impossible :-( > This would actually

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Uwe L. Korn
On Fri, Jun 5, 2020, at 3:13 PM, Rémi Dettai wrote: > Hi Antoine! > > I would indeed have expected jemalloc to do that (remap the pages) > I have no idea about the performance gain this would provide (if any). > Could be interesting to explore. This would actually be the most interesting thing.

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Rémi Dettai
Hi Antoine! > I would indeed have expected jemalloc to do that (remap the pages) I have no idea about the performance gain this would provide (if any). Could be interesting to explore. > do you know that Arrow also supports integration with another allocator, mimalloc I only tried Jemalloc and th
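The "remap the pages" idea discussed here can be sketched with the Linux-specific `mremap` syscall: growing a large anonymous mapping lets the kernel extend or relocate page-table entries instead of `memcpy`ing the payload. This is only a sketch of the mechanism under discussion (Linux only; sizes are illustrative), not how jemalloc or Arrow actually behaves:

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#include <cstring>

// Maps old_size anonymous bytes, fills them with data, then grows the
// mapping to new_size with mremap. MREMAP_MAYMOVE lets the kernel
// relocate the mapping if it cannot be extended in place; either way,
// no user-space copy of the payload occurs. Returns true if the data
// survived the resize.
bool mremap_grow_demo(std::size_t old_size, std::size_t new_size) {
    void* p = mmap(nullptr, old_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return false;
    std::memset(p, 0xAB, old_size);  // commit the pages with real data

    void* q = mremap(p, old_size, new_size, MREMAP_MAYMOVE);
    if (q == MAP_FAILED) {
        munmap(p, old_size);
        return false;
    }

    bool preserved =
        static_cast<unsigned char*>(q)[old_size - 1] == 0xAB;
    munmap(q, new_size);
    return preserved;
}
```

Whether an allocator can actually use this path depends on the allocation being page-aligned and backed by its own mapping, which is one reason the thread focuses on large (multi-megabyte) buffers.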

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Antoine Pitrou
On 05/06/2020 at 14:25, Rémi Dettai wrote: > Hi Uwe! > >> As your suggestions don't seem to be specific to Arrow, why not > contribute them directly to jemalloc? They are much better at reviewing > allocator code than we are. > I mentioned this idea in the jemalloc gitter. The first response w

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Rémi Dettai
Hi Uwe! > As your suggestions don't seem to be specific to Arrow, why not contribute them directly to jemalloc? They are much better at reviewing allocator code than we are. I mentioned this idea in the jemalloc gitter. The first response was that it should work but workloads with realloc aren't v

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Uwe L. Korn
Hello Rémi, under the hood jemalloc does quite similar things to what you describe. I'm not sure what the offset is in the current version but in earlier releases, it used a different allocation strategy for objects above 4MB. For the initial large allocation, you will see quite some copies as

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-04 Thread Antoine Pitrou
On 04/06/2020 at 18:11, Rémi Dettai wrote: > > Ideally, we should be able to presize the array to a good enough > estimate. > You should be able to get away with a correct estimation because parquet > column metadata contains the uncompressed size. But is there anything wrong > with this idea

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-04 Thread Rémi Dettai
> Ideally, we should be able to presize the array to a good enough estimate. You should be able to get away with a correct estimation because parquet column metadata contains the uncompressed size. But is there anything wrong with this idea of mmaping huge "runways" for our larger allocations?
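The "runway" idea can be sketched as follows, assuming a POSIX `mmap`: reserve a very large anonymous virtual range up front (cheap, since the kernel commits pages lazily on first write) and let the buffer grow into it, so growing never copies data. The class below is a hypothetical illustration, not Arrow's `MemoryPool`; a hardened version would reserve with `PROT_NONE` and `mprotect` pages in as needed:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Minimal "runway" buffer: map a huge anonymous range once, then treat
// growth as a bounds check. Pages are only backed by physical memory
// when first touched, so an oversized reservation costs little.
class RunwayBuffer {
public:
    explicit RunwayBuffer(std::size_t runway_bytes) : runway_(runway_bytes) {
        base_ = static_cast<char*>(mmap(nullptr, runway_,
                                        PROT_READ | PROT_WRITE,
                                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    }
    ~RunwayBuffer() {
        if (ok()) munmap(base_, runway_);
    }

    bool ok() const { return base_ != MAP_FAILED; }

    // "Growing" moves the logical end; the pages are already mapped,
    // so no reallocation and no copy ever happens.
    bool grow_to(std::size_t new_size) {
        if (new_size > runway_) return false;  // runway exhausted
        size_ = new_size;
        return true;
    }

    char* data() const { return base_; }
    std::size_t size() const { return size_; }

private:
    char* base_ = static_cast<char*>(MAP_FAILED);
    std::size_t runway_ = 0;
    std::size_t size_ = 0;
};
```

The trade-off the thread is weighing: this consumes address space rather than memory (fine on 64-bit), but it bypasses the allocator, so the buffer cannot be shrunk back into or reused by the general-purpose heap.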

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-04 Thread Antoine Pitrou
On Thu, 4 Jun 2020 17:48:16 +0200 Rémi Dettai wrote: > When creating large arrays, Arrow uses realloc quite intensively. > > I have an example where I read a gzipped parquet column (strings) that > expands from 8MB to 100+MB when loaded into Arrow. Of course Jemalloc > cannot anticipate this and

[DISCUSS] [C++] custom allocator for large objects

2020-06-04 Thread Rémi Dettai
When creating large arrays, Arrow uses realloc quite intensively. I have an example where I read a gzipped parquet column (strings) that expands from 8MB to 100+MB when loaded into Arrow. Of course Jemalloc cannot anticipate this and every reallocate call above 1MB (the most critical ones) ends up
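A back-of-envelope model of the 8 MB to 100+ MB growth described above shows why the growth policy dominates the cost. The function below (pure arithmetic, no allocation; the fixed 1 MB step is an illustrative assumption) counts reallocations and the bytes a moving `realloc` would copy under doubling versus fixed-step growth:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Returns {number of reallocations, total bytes copied} needed to grow
// a buffer from `start` to at least `target` capacity. A realloc that
// moves the block must copy all bytes currently held, so each step
// charges the pre-growth capacity to the copy total.
std::pair<std::size_t, std::size_t>
growth_cost(std::size_t start, std::size_t target, bool doubling) {
    std::size_t cap = start, reallocs = 0, copied = 0;
    while (cap < target) {
        copied += cap;                       // bytes moved by this realloc
        cap = doubling ? cap * 2             // geometric growth
                       : cap + (1 << 20);    // fixed 1 MiB step (assumed)
        ++reallocs;
    }
    return {reallocs, copied};
}
```

Growing 8 MiB to 100 MiB takes 4 reallocations copying 120 MiB total under doubling, versus 92 reallocations copying several gigabytes with 1 MiB steps, which matches the thread's observation that repeated large reallocs are where the time goes.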