> Question to others - is there an option to read rows 1000 to 2000 in
> the current Parquet interface?

I'm pretty sure the answer today is "no".  In theory, however, you
should be able to narrow down which pages to load with page-level
skipping.  I believe there is some active work in this area [1].
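
That said, you can approximate it today at row-group granularity by
reading only the row groups that overlap the requested range and
slicing off the excess.  A rough sketch (read_row_range is just a
helper name I made up, not an existing API):

import pyarrow.parquet as pq

def read_row_range(path, start, stop):
    """Read rows [start, stop) by loading only the overlapping row groups."""
    pf = pq.ParquetFile(path)
    groups, first_selected_row, row = [], None, 0
    for i in range(pf.metadata.num_row_groups):
        n = pf.metadata.row_group(i).num_rows
        if row < stop and row + n > start:
            if first_selected_row is None:
                first_selected_row = row
            groups.append(i)
        row += n
    table = pf.read_row_groups(groups)
    return table.slice(start - first_selected_row, stop - start)

This only helps if the file was written with more than one row group,
and the granularity is the row group rather than the page.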

> There is head in dataset.Scanner, but slice is not there.

The head in the scanner today is a sort of best-effort "stop reading
when we have enough data" and cannot accomplish any kind of skipping.
I am working on adding support in the scanner for skipping record
batches based on a limit & offset, but that isn't ready yet.
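
In the meantime, something along these lines emulates limit & offset,
although it still reads (and then discards) the first 1000 rows rather
than skipping them ("data.parquet" is a placeholder):

import pyarrow.dataset as ds

dataset = ds.dataset("data.parquet", format="parquet")
# head() stops scanning once ~2000 rows have been produced; slice()
# then drops the first 1000 of them.
rows_1000_to_2000 = dataset.head(2000).slice(1000, 1000)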

[1] https://issues.apache.org/jira/browse/PARQUET-2210

On Wed, Nov 23, 2022 at 11:18 AM Jacek Pliszka <[email protected]> wrote:
>
> Actually you provided too little information to tell.
>
> Will you store data locally or over the network?
>
> Do you want to optimize for speed or data size?
>
> Locally stored, memory-mapped Arrow IPC should be fast - you may want
> to test it against HDF5 - and since it is memory-mapped, slicing
> should work great.
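>
> A minimal sketch of that approach (the file name and column name are
> placeholders; the file is written uncompressed so it can be mapped
> directly - Feather V2 is the Arrow IPC file format):
>
> import pyarrow as pa
> import pyarrow.feather as feather
>
> # t is whatever pyarrow.Table you end up building.
> feather.write_feather(t, "data.arrow", compression="uncompressed")
>
> # Memory-map and slice; only the parts you touch are actually read.
> with pa.memory_map("data.arrow", "r") as source:
>     t = pa.ipc.open_file(source).read_all()
>     v = t["field1"].slice(1000, 4000)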
>
> On the other hand, a Parquet file should have decent compression if
> you want to save on storage and/or bandwidth.
> But I do not know how to slice it efficiently in your case.
>
> Question to others - is there an option to read rows 1000 to 2000 in
> the current Parquet interface?
> There is head in dataset.Scanner, but slice is not there.
>
> BR,
>
> Jacek
>
> On Wed, 23 Nov 2022 at 18:20, Ramón Casero Cañas <[email protected]> wrote:
> >
> > Hi Jacek,
> >
> > Thanks for your reply, but that looks like a complicated workaround.
> > I have been looking some more, and it seems HDF5 would be a good file
> > format for this problem.
> >
> > It naturally supports slicing like fp['field1'][1000:5000], provides
> > chunking and compression, and new arrays can be appended... Maybe
> > Arrow is just not the right tool for this specific problem.
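> >
> > For example, with h5py (the file name is just a placeholder):
> >
> > import h5py
> >
> > with h5py.File('data.h5', 'r') as fp:
> >     v = fp['field1'][1000:5000]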
> >
> > Kind regards,
> >
> > Ramon.
> >
> >
> > On Wed, 23 Nov 2022 at 15:54, Jacek Pliszka <[email protected]> wrote:
> >>
> >> Hi!
> >>
> >> I am not sure if this would solve your problem:
> >>
> >> pa.concat_tables([
> >>     pa.Table.from_pydict({'v': v}).append_column('f', [len(v) * [f]])
> >>     for f, v in x.items()
> >> ])
> >>
> >> pyarrow.Table
> >> v: double
> >> f: string
> >> ----
> >> v: [[0.2,0.2,0.2,0.1,0.2,0,0.8,0.7],[0.3,0.5,0.1],[0.9,nan,nan,0.1,0.5]]
> >> f: 
> >> [["field1","field1","field1","field1","field1","field1","field1","field1"],["field2","field2","field2"],["field3","field3","field3","field3","field3"]]
> >>
> >> The f column should compress very well, or you can make it a
> >> dictionary from the start.
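> >>
> >> A minimal sketch of the dictionary option (assuming the table above
> >> is bound to t):
> >>
> >> import pyarrow.compute as pc
> >>
> >> t = t.set_column(1, 'f', pc.dictionary_encode(t['f']))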
> >>
> >> To get the data back you can do a couple of things: take/filter on
> >> pc.equal, to_batches, or a groupby.
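> >>
> >> For instance, to pull one field back out (a sketch using pc.equal
> >> and filter, assuming t is the table above with the plain string f
> >> column):
> >>
> >> import pyarrow.compute as pc
> >>
> >> field1 = t.filter(pc.equal(t['f'], 'field1'))['v']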
> >>
> >> BR
> >>
> >> Jacek
> >>
> >>
> >>
> >> On Wed, 23 Nov 2022 at 13:12, Ramón Casero Cañas <[email protected]> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I'm trying to figure out whether PyArrow could efficiently store
> >> > and slice large Python dictionaries that contain NumPy arrays of
> >> > variable length, e.g.
> >> >
> >> > import numpy as np
> >> >
> >> > x = {
> >> >     'field1': np.array([0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7]),
> >> >     'field2': np.array([0.3, 0.5, 0.1]),
> >> >     'field3': np.array([0.9, np.nan, np.nan, 0.1, 0.5]),
> >> > }
> >> >
> >> > Arrow seems to be designed for Tables, but I was wondering whether
> >> > there's a way to do this (probably not with a Table or RecordBatch,
> >> > because those require all columns to have the same length).
> >> >
> >> > The vector in each dictionary key would have on the order of 1e4 -
> >> > 1e9 elements. There are some NaN gaps in the data (which would map
> >> > well to Arrow's null elements, I guess) and, especially, many
> >> > repeated values that make the data quite compressible.
> >> >
> >> > Apart from writing that data to disk quickly and with compression,
> >> > I need to slice it efficiently, e.g.
> >> >
> >> > fp = open('file', 'r')
> >> > v = fp['field1'][1000:5000]
> >> >
> >> > Is this something that can be done with pyArrow?
> >> >
> >> > Kind regards,
> >> >
> >> > Ramon.
