You could store it in a List<float64> column:
```
>>> import pyarrow as pa
>>> x = pa.array([[1.2, 2.3], [3.4]])
>>> x
<pyarrow.lib.ListArray object at 0x7f08d0b1f9a0>
[
  [
    1.2,
    2.3
  ],
  [
    3.4
  ]
]
>>> x[0]
<pyarrow.ListScalar: [1.2, 2.3]>
>>> x[0][1]
<pyarrow.DoubleScalar: 2.3>
>>> x[0].values.slice(0, 1)
<pyarrow.lib.DoubleArray object at 0x7f08d0b1fc40>
[
  1.2
]
```
This will be stored in Parquet as the LIST logical type and should give you
reasonable compression (though I have not tested it personally).
Slicing is O(1) once the data is loaded into memory.
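For example, something along these lines should write it out and slice it back
(a rough, untested sketch; the column names and the file name are just
placeholders I made up):
```
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Variable-length vectors keyed by name (as in the original question).
x = {
    'field1': [0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7],
    'field2': [0.3, 0.5, 0.1],
    'field3': [0.9, None, None, 0.1, 0.5],
}

# One row per field: a string name and a List<float64> of values.
table = pa.table({
    'name': list(x.keys()),
    'values': pa.array(list(x.values()), type=pa.list_(pa.float64())),
})

# Stored as the Parquet LIST logical type; snappy compression is the default.
pq.write_table(table, 'vectors.parquet')

# Read back, pick one field, and slice its values; slice() is zero-copy.
t = pq.read_table('vectors.parquet')
row = t.filter(pc.equal(t['name'], 'field1'))
vec = row['values'][0].values    # DoubleArray with field1's values
window = vec.slice(2, 4)         # elements 2:6; with real data e.g. slice(1000, 4000)
```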
On Wed, Nov 23, 2022 at 9:20 AM Ramón Casero Cañas <[email protected]> wrote:
>
> Hi Jacek,
>
> Thanks for your reply, but that looks like a rather complicated workaround.
> I have been looking around some more, and HDF5 seems like a good file format
> for this problem.
>
> It naturally supports slicing like fp['field1'][1000:5000], provides chunking
> and compression, and lets new arrays be appended... Maybe Arrow is just not
> the right tool for this specific problem.
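>
> For instance, a rough sketch of what I have in mind with h5py (untested, and
> the file and dataset names are made up):
>
> import h5py
> import numpy as np
>
> with h5py.File('data.h5', 'w') as f:
>     # Chunked, compressed, resizable 1-D dataset.
>     f.create_dataset('field1', data=np.random.rand(10_000),
>                      maxshape=(None,), chunks=True, compression='gzip')
>
> with h5py.File('data.h5', 'a') as f:
>     v = f['field1'][1000:5000]      # reads only the chunks it needs
>     f['field1'].resize((15_000,))   # grow the dataset to append new data
>     f['field1'][10_000:] = np.random.rand(5_000)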
>
> Kind regards,
>
> Ramon.
>
>
> On Wed, 23 Nov 2022 at 15:54, Jacek Pliszka <[email protected]> wrote:
>>
>> Hi!
>>
>> I am not sure if this would solve your problem:
>>
>> pa.concat_tables([pa.Table.from_pydict({'v': v}).append_column('f',
>> [len(v)*[f]]) for f, v in x.items()])
>>
>> pyarrow.Table
>> v: double
>> f: string
>> ----
>> v: [[0.2,0.2,0.2,0.1,0.2,0,0.8,0.7],[0.3,0.5,0.1],[0.9,nan,nan,0.1,0.5]]
>> f: [["field1","field1","field1","field1","field1","field1","field1","field1"],["field2","field2","field2"],["field3","field3","field3","field3","field3"]]
>>
>> The f column should compress very well, or you can make it a dictionary
>> type from the start.
>>
>> To get the data back you can do a couple of things: take on indices from
>> pc.equal, to_batches, or groupby.
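>>
>> For example, something like this with filter (rough, untested sketch; take
>> on indices from pc.equal would work the same way):
>>
>> import pyarrow as pa
>> import pyarrow.compute as pc
>>
>> t = pa.concat_tables([pa.Table.from_pydict({'v': v}).append_column('f',
>>     [len(v)*[f]]) for f, v in x.items()])
>>
>> # all v values belonging to field1, as a ChunkedArray of doubles
>> field1 = t.filter(pc.equal(t['f'], 'field1'))['v']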
>>
>> BR
>>
>> Jacek
>>
>>
>>
>> On Wed, 23 Nov 2022 at 13:12, Ramón Casero Cañas <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I'm trying to figure out whether PyArrow could efficiently store and slice
>> > large Python dictionaries that contain NumPy arrays of variable length,
>> > e.g.
>> >
>> > x = {
>> > 'field1': [0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7],
>> > 'field2': [0.3, 0.5, 0.1],
>> > 'field3': [0.9, np.nan, np.nan, 0.1, 0.5]
>> > }
>> >
>> > Arrow seems to be designed for Tables, but I was wondering whether there's
>> > a way to do this (probably not with a Table or RecordBatch, because those
>> > require all columns to have the same length).
>> >
>> > The vector in each dictionary key would have on the order of 1e4 - 1e9
>> > elements. There are some NaN gaps in the data (which would map well to
>> > Arrow's null values, I guess), but above all many repeated values, which
>> > make the data quite compressible.
>> >
>> > Apart from writing the data to disk quickly and with compression, I then
>> > need to slice it efficiently, e.g.
>> >
>> > fp = open('file', 'r')
>> > v = fp['field1'][1000:5000]
>> >
>> > Is this something that can be done with PyArrow?
>> >
>> > Kind regards,
>> >
>> > Ramon.