On Sat, Jul 11, 2020 at 4:10 AM Rémi Dettai wrote:
>
> Hi Micah,
>
> Thanks for the answer! But it seems your email got split in half in some
> way ;-)
>
> My use case mainly focuses on aggregations (with group by), and after
> fighting quite a bit with the allocators I ended up thinking that it
Hi Micah,
Thanks for the answer! But it seems your email got split in half in some
way ;-)
My use case mainly focuses on aggregations (with group by), and after
fighting quite a bit with the allocators I ended up thinking that it might
not be worth materializing the raw data as arrow tables i
Sorry for the delay. Clearing through my inbox backlog ...
We should double check the code, but one thing that has bitten me in the
past with variable-width data is that the binary array builder's ReserveData
call [1] does not act the same way Reserve does. The former only grows the
buffer by the exa
Hi Antoine and all!
Sorry for the delay, I wanted to understand things a bit better before
getting back to you. As discussed, I focussed on the Parquet case. I've
looked into parquet/encoding.cc to see what could be done to have a better
memory reservation with ByteArrays.
On my journey, I noti
Le 05/06/2020 à 17:09, Rémi Dettai a écrit :
> I looked into the details of why the decoder could not estimate the target
> Arrow array size for my Parquet column. It's because I am decoding from
> Parquet-Dictionary to Arrow-Plain (which is the default when loading
> Parquet). In this case the s
Le 05/06/2020 à 16:25, Uwe L. Korn a écrit :
>
> On Fri, Jun 5, 2020, at 3:13 PM, Rémi Dettai wrote:
>> Hi Antoine!
>>> I would indeed have expected jemalloc to do that (remap the pages)
>> I have no idea about the performance gain this would provide (if any).
>> Could be interesting to explore
I looked into the details of why the decoder could not estimate the target
Arrow array size for my Parquet column. It's because I am decoding from
Parquet-Dictionary to Arrow-Plain (which is the default when loading
Parquet). In this case the size prediction is impossible :-(
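To illustrate why the size cannot be predicted up front, here is a minimal Python sketch (not Arrow or Parquet code; `plain_size`, the dictionary and the index lists are made up for illustration). Two columns that share the same dictionary and the same number of rows decode to very different plain sizes, because the result depends on which entries the indices reference:

```python
def plain_size(dictionary, indices):
    """Total bytes of string data once every index is expanded to its value."""
    return sum(len(dictionary[i]) for i in indices)


# Hypothetical dictionary page: one short and one long value.
dictionary = [b"a", b"a-much-longer-string-value"]

# Same dictionary, same row count, different index distributions.
mostly_short = [0, 0, 0, 1]
mostly_long = [1, 1, 1, 0]

size_short = plain_size(dictionary, mostly_short)
size_long = plain_size(dictionary, mostly_long)
```

The decoded size is only known after walking all the indices, which is why a decoder in this situation can at best guess the capacity to reserve.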
> This would actually
On Fri, Jun 5, 2020, at 3:13 PM, Rémi Dettai wrote:
> Hi Antoine!
> > I would indeed have expected jemalloc to do that (remap the pages)
> I have no idea about the performance gain this would provide (if any).
> Could be interesting to explore.
This would actually be the most interesting thing.
Hi Antoine!
> I would indeed have expected jemalloc to do that (remap the pages)
I have no idea about the performance gain this would provide (if any).
Could be interesting to explore.
> do you know that Arrow also supports integration with another allocator,
mimalloc
I only tried Jemalloc and th
Le 05/06/2020 à 14:25, Rémi Dettai a écrit :
> Hi Uwe!
>
>> As your suggestions don't seem to be specific to Arrow, why not
> contribute them directly to jemalloc? They are much better at reviewing
> allocator code than we are.
> I mentioned this idea in the jemalloc gitter. The first response w
Hi Uwe!
> As your suggestions don't seem to be specific to Arrow, why not
contribute them directly to jemalloc? They are much better at reviewing
allocator code than we are.
I mentioned this idea in the jemalloc gitter. The first response was that
it should work but workloads with realloc aren't v
Hello Rémi,
under the hood jemalloc does quite similar things to what you describe. I'm not
sure what the offset is in the current version but in earlier releases, it used
a different allocation strategy for objects above 4MB. For the initial large
allocation, you will see quite some copies as
Le 04/06/2020 à 18:11, Rémi Dettai a écrit :
> > Ideally, we should be able to presize the array to a good enough
> estimate.
> You should be able to get away with a correct estimation because parquet
> column metadata contains the uncompressed size. But is there anything wrong
> with this idea
> Ideally, we should be able to presize the array to a good enough
estimate.
You should be able to get away with a correct estimation because parquet
column metadata contains the uncompressed size. But is there anything wrong
with this idea of mmaping huge "runways" for our larger allocations?
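As a rough illustration of the "runway" idea, here is a minimal Python sketch, not Arrow or jemalloc code. The `Runway` class and its methods are hypothetical. It reserves one large anonymous mapping up front and appends into it, so growing never copies existing data. It assumes an OS such as Linux, where anonymous pages are typically backed by physical memory only once they are touched.

```python
import mmap


class Runway:
    """Hypothetical growable buffer backed by one large anonymous mapping."""

    def __init__(self, capacity):
        # Anonymous mapping (fd = -1): the address range is reserved now.
        # Assumption: the OS commits physical pages lazily on first write.
        self._map = mmap.mmap(-1, capacity)
        self.capacity = capacity
        self.size = 0

    def append(self, data):
        """Write data at the current end; existing bytes are never moved."""
        end = self.size + len(data)
        if end > self.capacity:
            raise MemoryError("runway exhausted")
        self._map[self.size:end] = data
        self.size = end

    def tobytes(self):
        """Copy out the used part of the buffer (for inspection only)."""
        return self._map[:self.size]

    def close(self):
        self._map.close()


# Reserve 1 GiB of address space, then grow without any reallocation.
runway = Runway(1 << 30)
runway.append(b"hello ")
runway.append(b"arrow")
content = runway.tobytes()
runway.close()
```

The trade-off is that the capacity has to be chosen generously in advance, and address space (though not necessarily physical memory) stays reserved for the lifetime of the buffer.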
On Thu, 4 Jun 2020 17:48:16 +0200
Rémi Dettai wrote:
> When creating large arrays, Arrow uses realloc quite intensively.
>
> I have an example where I read a gzipped parquet column (strings) that
> expands from 8MB to 100+MB when loaded into Arrow. Of course Jemalloc
> cannot anticipate this and
When creating large arrays, Arrow uses realloc quite intensively.
I have an example where I read a gzipped parquet column (strings) that
expands from 8MB to 100+MB when loaded into Arrow. Of course Jemalloc
cannot anticipate this and every reallocate call above 1MB (the most
critical ones) ends up