You could store it in a List<float64> column:
```
>>> import pyarrow as pa
>>> x = pa.array([[1.2, 2.3], [3.4]])
>>> x
<pyarrow.lib.ListArray object at 0x7f08d0b1f9a0>
[
  [
    1.2,
    2.3
  ],
  [
    3.4
  ]
]
>>> x[0]
<pyarrow.ListScalar: [1.2, 2.3]>
>>> x[0][1]
<pyarrow.DoubleScalar: 2.3>
>>> x[0].values.slice(0, 1)
<pyarrow.lib.DoubleArray object at 0x7f08d0b1fc40>
[
  1.2
]
```
This will be stored in Parquet as the LIST logical type and should give you
reasonable compression (though I have not tested it personally).
Slicing is O(1) once the data is loaded into memory.
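For example, something along these lines should write it out and slice it back
(a rough, untested sketch; the column names and the file name are just
placeholders I made up):
```
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Variable-length vectors keyed by name (as in the original question).
x = {
    'field1': [0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7],
    'field2': [0.3, 0.5, 0.1],
    'field3': [0.9, None, None, 0.1, 0.5],
}

# One row per field: a string name and a List<float64> of values.
table = pa.table({
    'name': list(x.keys()),
    'values': pa.array(list(x.values()), type=pa.list_(pa.float64())),
})

# Stored as the Parquet LIST logical type; snappy compression is the default.
pq.write_table(table, 'vectors.parquet')

# Read back, pick one field, and slice its values; slice() is zero-copy.
t = pq.read_table('vectors.parquet')
row = t.filter(pc.equal(t['name'], 'field1'))
vec = row['values'][0].values    # DoubleArray with field1's values
window = vec.slice(2, 4)         # elements 2:6; with real data e.g. slice(1000, 4000)
```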
On Wed, Nov 23, 2022 at 9:20 AM Ramón Casero Cañas <[email protected]> wrote:
>
> Hi Jacek,
>
> Thanks for your reply, but that looks like a rather complicated workaround.
> I have been looking around some more, and HDF5 seems like a good file format
> for this problem.
>
> It naturally supports slicing like fp['field1'][1000:5000], provides chunking
> and compression, and lets new arrays be appended... Maybe Arrow is just not
> the right tool for this specific problem.
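>
> For instance, a rough sketch of what I have in mind with h5py (untested, and
> the file and dataset names are made up):
>
> import h5py
> import numpy as np
>
> with h5py.File('data.h5', 'w') as f:
>     # Chunked, compressed, resizable 1-D dataset.
>     f.create_dataset('field1', data=np.random.rand(10_000),
>                      maxshape=(None,), chunks=True, compression='gzip')
>
> with h5py.File('data.h5', 'a') as f:
>     v = f['field1'][1000:5000]      # reads only the chunks it needs
>     f['field1'].resize((15_000,))   # grow the dataset to append new data
>     f['field1'][10_000:] = np.random.rand(5_000)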
>
> Kind regards,
>
> Ramon.
>
>
> On Wed, 23 Nov 2022 at 15:54, Jacek Pliszka <[email protected]> wrote:
>>
>> Hi!
>>
>> I am not sure if this would solve your problem:
>>
>> pa.concat_tables([pa.Table.from_pydict({'v': v}).append_column('f',
>> [len(v)*[f]]) for f, v in x.items()])
>>
>> pyarrow.Table
>> v: double
>> f: string
>> ----
>> v: [[0.2,0.2,0.2,0.1,0.2,0,0.8,0.7],[0.3,0.5,0.1],[0.9,nan,nan,0.1,0.5]]
>> f: [["field1","field1","field1","field1","field1","field1","field1","field1"],["field2","field2","field2"],["field3","field3","field3","field3","field3"]]
>>
>> The f column should compress very well, or you can make it a dictionary
>> type from the start.
>>
>> To get the data back you can do a couple of things: take on indices from
>> pc.equal, to_batches, or groupby.
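>>
>> For example, something like this with filter (rough, untested sketch; take
>> on indices from pc.equal would work the same way):
>>
>> import pyarrow as pa
>> import pyarrow.compute as pc
>>
>> t = pa.concat_tables([pa.Table.from_pydict({'v': v}).append_column('f',
>>     [len(v)*[f]]) for f, v in x.items()])
>>
>> # all v values belonging to field1, as a ChunkedArray of doubles
>> field1 = t.filter(pc.equal(t['f'], 'field1'))['v']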
>>
>> BR
>>
>> Jacek
>>
>>
>>
>> On Wed, 23 Nov 2022 at 13:12, Ramón Casero Cañas <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I'm trying to figure out whether PyArrow could efficiently store and slice
>> > large Python dictionaries that contain NumPy arrays of variable length,
>> > e.g.
>> >
>> > x = {
>> > 'field1': [0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7],
>> > 'field2': [0.3, 0.5, 0.1],
>> > 'field3': [0.9, np.nan, np.nan, 0.1, 0.5]
>> > }
>> >
>> > Arrow seems to be designed for Tables, but I was wondering whether there's
>> > a way to do this (probably not with a Table or RecordBatch, because those
>> > require all columns to have the same length).
>> >
>> > The vector in each dictionary key would have on the order of 1e4 - 1e9
>> > elements. There are some NaN gaps in the data (which would map well to
>> > Arrow's null values, I guess), but above all many repeated values, which
>> > make the data quite compressible.
>> >
>> > Apart from writing the data to disk quickly and with compression, I then
>> > need to slice it efficiently, e.g.
>> >
>> > fp = open('file', 'r')
>> > v = fp['field1'][1000:5000]
>> >
>> > Is this something that can be done with PyArrow?
>> >
>> > Kind regards,
>> >
>> > Ramon.