Hi!

I am not sure if this would solve your problem:

import pyarrow as pa

pa.concat_tables([pa.Table.from_pydict({'v': v}).append_column('f', [len(v) * [f]])
                  for f, v in x.items()])

pyarrow.Table
v: double
f: string
----
v: [[0.2,0.2,0.2,0.1,0.2,0,0.8,0.7],[0.3,0.5,0.1],[0.9,nan,nan,0.1,0.5]]
f: [["field1","field1","field1","field1","field1","field1","field1","field1"],["field2","field2","field2"],["field3","field3","field3","field3","field3"]]

The f column should compress very well, or you can make it a dictionary-encoded column from the start.
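
For example, a minimal sketch of the dictionary-encoded variant, assuming x is the dict from your message (the helper name to_table is just for illustration):

import pyarrow as pa

def to_table(x):
    # One small Table per field; dictionary-encode the repeated field name.
    tables = []
    for f, v in x.items():
        f_col = pa.array([f] * len(v)).dictionary_encode()
        tables.append(pa.Table.from_pydict({'v': v, 'f': f_col}))
    return pa.concat_tables(tables)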

To get the data back you can do a couple of things: take/filter with pc.equal, iterate with to_batches, or use group_by.
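
A rough sketch, assuming table holds the result of the first snippet (plain string f column) and a pyarrow recent enough to have Table.group_by:

import pyarrow.compute as pc

# Pull out one field's values by filtering on the f column.
mask = pc.equal(table['f'], 'field1')
field1 = table.filter(mask)['v']

# Or regroup everything into one list per field in a single pass.
per_field = table.group_by('f').aggregate([('v', 'list')])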

BR

Jacek



On Wed, 23 Nov 2022 at 13:12, Ramón Casero Cañas <[email protected]> wrote:
>
> Hi,
>
> I'm trying to figure out whether pyArrow could efficiently store and slice 
> large python dictionaries that contain numpy arrays of variable length, e.g.
>
> x = {
>     'field1': [0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7],
>     'field2': [0.3, 0.5, 0.1],
>     'field3': [0.9, float('nan'), float('nan'), 0.1, 0.5]
> }
>
> Arrow seems to be designed for Tables, but I was wondering whether there's a 
> way to do this (probably not with a Table or RecordBatch because those 
> require the same lengths).
>
> The vector in each dictionary key would have on the order of 1e4 - 1e9 
> elements. There are some NaN gaps in the data (which would go well with 
> Arrow's null elements, I guess), but especially many repeated values that 
> make the data quite compressible.
>
> Apart from writing that data to disk quickly and with compression, I then 
> need to slice it efficiently, e.g.
>
> fp = open('file', 'r')
> v = fp['field1'][1000:5000]
>
> Is this something that can be done with pyArrow?
>
> Kind regards,
>
> Ramon.
