Weston's answer works. Same answer here, but the final syntax is closer to your
question:
>>> import pyarrow as pa
>>>
>>> x = {
... 'field1': [[0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7]],
... 'field2': [[0.3, 0.5, 0.1]],
... 'field3': [[0.9, None, None, 0.1, 0.5]]
... }
>>>
>>> arrow_data = pa.Table.from_pydict(x)
>>>
>>> arrow_data['field1'][0].values.slice(3, 2)
<pyarrow.lib.DoubleArray object at 0x0000023CD929C1C0>
[
0.1,
0.2
]
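For completeness, the same table can be written to Parquet with compression and a
single field read back and sliced; a rough sketch, not benchmarked (file name and
codec are arbitrary choices):
>>> import pyarrow.parquet as pq
>>>
>>> pq.write_table(arrow_data, 'x.parquet', compression='zstd')
>>> t = pq.read_table('x.parquet', columns=['field1'])
>>> sliced = t['field1'][0].values.slice(3, 2)   # -> [0.1, 0.2], same as above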
-----Original Message-----
From: Weston Pace <[email protected]>
Sent: Wednesday, November 23, 2022 11:01 AM
To: [email protected]
Subject: Re: [python] Using Arrow for storing compressable python dictionaries
You could store it as a List<float64> column:
```
>>> x = pa.array([[1.2, 2.3], [3.4]])
>>> x
<pyarrow.lib.ListArray object at 0x7f08d0b1f9a0>
[
[
1.2,
2.3
],
[
3.4
]
]
>>> x[0]
<pyarrow.ListScalar: [1.2, 2.3]>
>>> x[0][1]
<pyarrow.DoubleScalar: 2.3>
>>> x[0].values.slice(0, 1)
<pyarrow.lib.DoubleArray object at 0x7f08d0b1fc40>
[
1.2
]
```
This will be stored in parquet as LIST and should give you reasonable
compression (though I have not personally tested it).
Slicing is O(1) once it is loaded in memory.
On Wed, Nov 23, 2022 at 9:20 AM Ramón Casero Cañas <[email protected]> wrote:
>
> Hi Jacek,
>
> Thanks for your reply, but it looks like that would be a complicated
> workaround. I have been looking around some more, and it looks like HDF5 would
> be a good file format for this problem.
>
> It naturally supports slicing like fp['field1'][1000:5000], provides chunking
> and compression, new arrays can be appended... Maybe Arrow is just not the
> right tool for this specific problem.
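>
> Roughly what I have in mind with h5py (just a sketch; dataset layout and
> options are illustrative, not something I have benchmarked):
>
> import h5py
> import numpy as np
>
> with h5py.File('data.h5', 'w') as f:
>     # one chunked, compressed, resizable dataset per field
>     f.create_dataset('field1',
>                      data=np.array([0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7]),
>                      maxshape=(None,), chunks=True, compression='gzip')
>
> with h5py.File('data.h5', 'r') as f:
>     v = f['field1'][3:5]   # reads only the requested slice from disk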
>
> Kind regards,
>
> Ramon.
>
>
> On Wed, 23 Nov 2022 at 15:54, Jacek Pliszka <[email protected]> wrote:
>>
>> Hi!
>>
>> I am not sure if this would solve your problem:
>>
>> pa.concat_tables([pa.Table.from_pydict({'v': v}).append_column('f',
>> [len(v)*[f]]) for f, v in x.items()])
>>
>> pyarrow.Table
>> v: double
>> f: string
>> ----
>> v:
>> [[0.2,0.2,0.2,0.1,0.2,0,0.8,0.7],[0.3,0.5,0.1],[0.9,nan,nan,0.1,0.5]]
>> f:
>> [["field1","field1","field1","field1","field1","field1","field1","field1"],
>> ["field2","field2","field2"],
>> ["field3","field3","field3","field3","field3"]]
>>
>> The f column should compress very well, or you can make it a dictionary type
>> from the start.
>>
>> To get the data back you can do a couple of things: take using pc.equal,
>> to_batches, or groupby.
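>>
>> A minimal sketch of the pc.equal route (with Table.filter; toy data, untested):
>>
>> import pyarrow as pa
>> import pyarrow.compute as pc
>>
>> x = {'field1': [0.2, 0.1], 'field2': [0.3, 0.5, 0.1]}
>> t = pa.concat_tables([pa.Table.from_pydict({'v': v}).append_column('f', [len(v)*[f]])
>>                       for f, v in x.items()])
>> field2 = t.filter(pc.equal(t['f'], 'field2'))['v']   # -> [0.3, 0.5, 0.1]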
>>
>> BR
>>
>> Jacek
>>
>>
>>
>> On Wed, 23 Nov 2022 at 13:12, Ramón Casero Cañas <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I'm trying to figure out whether pyArrow could efficiently store and slice
>> > large python dictionaries that contain numpy arrays of variable length,
>> > e.g.
>> >
>> > x = {
>> > 'field1': [0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7],
>> > 'field2': [0.3, 0.5, 0.1],
>> > 'field3': [0.9, NaN, NaN, 0.1, 0.5] }
>> >
>> > Arrow seems to be designed for Tables, but I was wondering whether there's
>> > a way to do this (probably not with a Table or RecordBatch because those
>> > require the same lengths).
>> >
>> > The vector in each dictionary key would have in the order of 1e4 - 1e9
>> > elements. There are some NaN gaps in the data (which would go well with
>> > Arrow's null elements, I guess), but especially many repeated values that
>> > make the data quite compressible.
>> >
>> > Apart from writing that data to disk quickly and with compression, I then
>> > need to slice it efficiently, e.g.
>> >
>> > fp = open('file', 'r')
>> > v = fp['field1'][1000:5000]
>> >
>> > Is this something that can be done with pyArrow?
>> >
>> > Kind regards,
>> >
>> > Ramon.