The Arrow version of a nested structure will use significantly less
memory than the nested-Python-dictionary version.
We don't have a 100% complete converter from JSON-like data to Arrow's
in-memory format -- the main thing that's missing is automatic creation
of Unions. For example, the array
[700, 800, {'random string53': 900, 'random string54': 'random string55'}]
would need to be a union of an integer and a struct.
Assuming you don't have heterogeneous arrays and the types of the values
don't change from record to record, you can simply pass a list of
records to pyarrow.array.
- Wes
On Tue, Sep 24, 2019 at 1:26 PM Luke <[email protected]> wrote:
>
> This is a simplified example but trying to figure out what gains can be had
> using arrow vice straight nested python dictionaries for something like the
> following:
>
> {'random string 1': {'field1': {'field11': 'random string 2',
> 'field12': 100},
> 'field2': 200,
> 'field3': [300,
> 400,
> {'random string 3': 500}]
> },
> 'random string 4': {'field5': {'field51': 600,
> 'field52 ': [700,
> 800,
> {'random string53': 900,
> 'random string54': 'random string55'}
> ]
> }
> }
> }
>
> I didn't see anything that would convert an arbitrary nested dictionary into
> some arrow structure -- did I miss something? If there isn't, what are some
> suggestions? I am doing pretty heavy data analysis where I am handed some
> nested python dictionaries or nested json that I am loading into a nested
> python dictionary. The memory footprint of these is large, and I have
> individual json files that, when loaded by json.load, become a 5-6 GB python
> dictionary (which is a little crazy when the actual json file is like 700MB).
>
> curious,
> Luke