Weston's answer works. Same answer here, but the final syntax is closer to your
question:
>>> import pyarrow as pa
>>>
>>> x = {
... 'field1': [[0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7]],
... 'field2': [[0.3, 0.5, 0.1]],
... 'field3': [[0.9, None, None, 0.1, 0.5]]
... }
>>>
>>> arrow_data = pa.Table.from_pydict(x)
>>>
>>> arrow_data['field1'][0].values.slice(3, 2)
<pyarrow.lib.DoubleArray object at 0x0000023CD929C1C0>
[
0.1,
0.2
]
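For completeness, the same table can be written to Parquet with compression and a
single field read back and sliced; a rough sketch, not benchmarked (file name and
codec are arbitrary choices):
>>> import pyarrow.parquet as pq
>>>
>>> pq.write_table(arrow_data, 'x.parquet', compression='zstd')
>>> t = pq.read_table('x.parquet', columns=['field1'])
>>> sliced = t['field1'][0].values.slice(3, 2)   # -> [0.1, 0.2], same as above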
-----Original Message-----
From: Weston Pace <[email protected]>
Sent: Wednesday, November 23, 2022 11:01 AM
To: [email protected]
Subject: Re: [python] Using Arrow for storing compressable python dictionaries
You could store it as a List<float64> column:
```
>>> x = pa.array([[1.2, 2.3], [3.4]])
>>> x
<pyarrow.lib.ListArray object at 0x7f08d0b1f9a0>
[
[
1.2,
2.3
],
[
3.4
]
]
>>> x[0]
<pyarrow.ListScalar: [1.2, 2.3]>
>>> x[0][1]
<pyarrow.DoubleScalar: 2.3>
>>> x[0].values.slice(0, 1)
<pyarrow.lib.DoubleArray object at 0x7f08d0b1fc40>
[
1.2
]
```
This will be stored in parquet as LIST and should give you reasonable
compression (though I have not personally tested it).
Slicing is O(1) once it is loaded in memory.
On Wed, Nov 23, 2022 at 9:20 AM Ramón Casero Cañas <[email protected]> wrote:
>
> Hi Jacek,
>
> Thanks for your reply, but it looks like that would be a complicated
> workaround. I have been looking around some more, and it looks like HDF5 would
> be a good file format for this problem.
>
> It naturally supports slicing like fp['field1'][1000:5000], provides chunking
> and compression, new arrays can be appended... Maybe Arrow is just not the
> right tool for this specific problem.
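>
> Roughly what I have in mind with h5py (just a sketch; dataset layout and
> options are illustrative, not something I have benchmarked):
>
> import h5py
> import numpy as np
>
> with h5py.File('data.h5', 'w') as f:
>     # one chunked, compressed, resizable dataset per field
>     f.create_dataset('field1',
>                      data=np.array([0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7]),
>                      maxshape=(None,), chunks=True, compression='gzip')
>
> with h5py.File('data.h5', 'r') as f:
>     v = f['field1'][3:5]   # reads only the requested slice from disk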
>
> Kind regards,
>
> Ramon.
>
>
> On Wed, 23 Nov 2022 at 15:54, Jacek Pliszka <[email protected]> wrote:
>>
>> Hi!
>>
>> I am not sure if this would solve your problem:
>>
>> pa.concat_tables([pa.Table.from_pydict({'v': v}).append_column('f',
>> [len(v)*[f]]) for f, v in x.items()])
>>
>> pyarrow.Table
>> v: double
>> f: string
>> ----
>> v:
>> [[0.2,0.2,0.2,0.1,0.2,0,0.8,0.7],[0.3,0.5,0.1],[0.9,nan,nan,0.1,0.5]]
>> f:
>> [["field1","field1","field1","field1","field1","field1","field1","field1"],
>> ["field2","field2","field2"],
>> ["field3","field3","field3","field3","field3"]]
>>
>> The f column should compress very well, or you can make it a dictionary type
>> from the start.
>>
>> To get the data back you can do a couple of things: take using pc.equal,
>> to_batches, or groupby.
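>>
>> A minimal sketch of the pc.equal route (with Table.filter; toy data, untested):
>>
>> import pyarrow as pa
>> import pyarrow.compute as pc
>>
>> x = {'field1': [0.2, 0.1], 'field2': [0.3, 0.5, 0.1]}
>> t = pa.concat_tables([pa.Table.from_pydict({'v': v}).append_column('f', [len(v)*[f]])
>>                       for f, v in x.items()])
>> field2 = t.filter(pc.equal(t['f'], 'field2'))['v']   # -> [0.3, 0.5, 0.1]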
>>
>> BR
>>
>> Jacek
>>
>>
>>
>> On Wed, 23 Nov 2022 at 13:12, Ramón Casero Cañas <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I'm trying to figure out whether pyArrow could efficiently store and slice
>> > large python dictionaries that contain numpy arrays of variable length,
>> > e.g.
>> >
>> > x = {
>> > 'field1': [0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7],
>> > 'field2': [0.3, 0.5, 0.1],
>> > 'field3': [0.9, NaN, NaN, 0.1, 0.5] }
>> >
>> > Arrow seems to be designed for Tables, but I was wondering whether there's
>> > a way to do this (probably not with a Table or RecordBatch because those
>> > require the same lengths).
>> >
>> > The vector in each dictionary key would have in the order of 1e4 - 1e9
>> > elements. There are some NaN gaps in the data (which would go well with
>> > Arrow's null elements, I guess), but especially many repeated values that
>> > make the data quite compressible.
>> >
>> > Apart from writing that data to disk quickly and with compression, I then
>> > need to slice it efficiently, e.g.
>> >
>> > fp = open('file', 'r')
>> > v = fp['field1'][1000:5000]
>> >
>> > Is this something that can be done with pyArrow?
>> >
>> > Kind regards,
>> >
>> > Ramon.