Actually, you have provided too little information to tell.

Will you store data locally or over the network?

Do you want to optimize for speed or data size?

A locally stored, memory-mapped Arrow IPC file should be fast - you may want
to benchmark it against HDF5 - and since it is memory mapped, slicing
should work great.
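
For example (a rough sketch, untested; the file name and toy data are just
placeholders, with one file per field since your fields have different lengths):

import pyarrow as pa
import pyarrow.ipc as ipc

# Write one field to its own Arrow IPC file.
table = pa.table({'field1': [0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7]})
with pa.OSFile('field1.arrow', 'wb') as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Memory-map the file and slice without pulling the whole column into RAM.
with pa.memory_map('field1.arrow', 'r') as source:
    column = ipc.open_file(source).read_all().column('field1')
    chunk = column.slice(2, 4)  # zero-copy view of rows 2..5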

On the other hand, a Parquet file should give decent compression if you
want to save on storage and/or bandwidth.
But I do not know how to slice it efficiently in your case.
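
Writing with compression is the easy part, e.g. reusing the table from the
sketch above (the file name and codec are just examples):

import pyarrow.parquet as pq

# Write the table with an explicit compression codec.
pq.write_table(table, 'field1.parquet', compression='zstd')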

Question to others - is there an option to read rows 1000 through 2000
with the current Parquet interface?
There is head in dataset.Scanner, but slice is not there.
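
The closest thing I can think of is Dataset.take with an explicit index
range, though I have not checked how efficient it is for large offsets
(the path is a placeholder):

import pyarrow as pa
import pyarrow.dataset as ds

# Pull rows 1000..1999 by position from a Parquet dataset.
dataset = ds.dataset('data.parquet', format='parquet')
rows = dataset.take(pa.array(range(1000, 2000)))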

BR,

Jacek

On Wed, 23 Nov 2022 at 18:20, Ramón Casero Cañas <[email protected]> wrote:
>
> Hi Jacek,
>
> Thanks for your reply, but that looks like it would be a complicated
> workaround. I have been looking into this some more, and it seems that HDF5
> would be a good file format for this problem.
>
> It naturally supports slicing like fp['field1'][1000:5000], provides chunking
> and compression, and new arrays can be appended... Maybe Arrow is just not the
> right tool for this specific problem.
>
> Kind regards,
>
> Ramon.
>
>
> On Wed, 23 Nov 2022 at 15:54, Jacek Pliszka <[email protected]> wrote:
>>
>> Hi!
>>
>> I am not sure if this would solve your problem:
>>
>> # One table: column v holds the values, column f labels the source field.
>> pa.concat_tables([pa.Table.from_pydict({'v': v}).append_column('f', [len(v)*[f]])
>>                   for f, v in x.items()])
>>
>> pyarrow.Table
>> v: double
>> f: string
>> ----
>> v: [[0.2,0.2,0.2,0.1,0.2,0,0.8,0.7],[0.3,0.5,0.1],[0.9,nan,nan,0.1,0.5]]
>> f: 
>> [["field1","field1","field1","field1","field1","field1","field1","field1"],["field2","field2","field2"],["field3","field3","field3","field3","field3"]]
>>
>> The f column should compress very well, or you can make it a dictionary type
>> from the start.
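>>
>> For the dictionary variant, something like this (a sketch, same idea as
>> above but with the label array dictionary-encoded):
>>
>> # Dictionary-encode f so each field name is stored only once per chunk.
>> pa.concat_tables([
>>     pa.Table.from_pydict({'v': v})
>>       .append_column('f', pa.array(len(v)*[f]).dictionary_encode())
>>     for f, v in x.items()
>> ])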
>>
>> To get the data back, you can do a couple of things: take together with pc.equal, to_batches, or group_by.
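>>
>> For example, to recover one field (a sketch only; I use Table.filter with a
>> pc.equal mask rather than take, 'stacked' stands for the concatenated table
>> above, and this assumes f is a plain string column):
>>
>> import pyarrow.compute as pc
>>
>> # Keep only the rows labelled field2, then take the v column.
>> mask = pc.equal(stacked['f'], 'field2')
>> field2_values = stacked.filter(mask)['v']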
>>
>> BR
>>
>> Jacek
>>
>>
>>
>> On Wed, 23 Nov 2022 at 13:12, Ramón Casero Cañas <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I'm trying to figure out whether PyArrow could efficiently store and slice
>> > large Python dictionaries that contain NumPy arrays of variable length,
>> > e.g.
>> >
>> > x = {
>> > 'field1': [0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7],
>> > 'field2': [0.3, 0.5, 0.1],
>> > 'field3': [0.9, NaN, NaN, 0.1, 0.5]
>> > }
>> >
>> > Arrow seems to be designed for Tables, but I was wondering whether there's
>> > a way to do this (probably not with a Table or RecordBatch, because those
>> > require all columns to have the same length).
>> >
>> > The vector in each dictionary key would have on the order of 1e4 - 1e9
>> > elements. There are some NaN gaps in the data (which would map well to
>> > Arrow's null elements, I guess), but especially many repeated values, which
>> > make the data quite compressible.
>> >
>> > Apart from writing that data to disk quickly and with compression, then I 
>> > need to slice it efficiently, e.g.
>> >
>> > fp = open('file', 'r')
>> > v = fp['field1'][1000:5000]
>> >
>> > Is this something that can be done with PyArrow?
>> >
>> > Kind regards,
>> >
>> > Ramon.
