> Question to others - is there an option to read rows from the 1000th to the
> 2000th in the current parquet interface?

I'm pretty sure the answer today is "no". In theory, however, you should be
able to narrow down which pages to load using a skip. I believe there is some
active work in this area [1].

> There is head in dataset.Scanner but slice is not there.

The head in the scanner today is a sort of "best effort stop reading when we
have enough data" and cannot accomplish any kind of skipping. I am working on
adding support in the scanner for skipping record batches based on a limit &
offset, but that isn't ready yet.

[1] https://issues.apache.org/jira/browse/PARQUET-2210
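A rough sketch of the closest workaround available today, assuming a
hypothetical data.parquet file and a made-up helper name: skip the row groups
that don't overlap the requested range and trim the rest after decoding (row
groups are the coarsest unit that can be skipped here; pages inside a selected
group are still read in full).

    import pyarrow as pa
    import pyarrow.parquet as pq

    def read_row_range(path, start, stop):
        """Read logical rows [start, stop) by skipping non-overlapping row groups."""
        pf = pq.ParquetFile(path)
        pieces, seen = [], 0
        for i in range(pf.num_row_groups):
            n = pf.metadata.row_group(i).num_rows
            if seen < stop and seen + n > start:
                # Overlapping row groups are still decoded whole; we only trim afterwards.
                rg = pf.read_row_group(i)
                lo = max(start - seen, 0)
                hi = min(stop - seen, n)
                pieces.append(rg.slice(lo, hi - lo))
            seen += n
        if not pieces:
            raise ValueError("requested range is past the end of the file")
        return pa.concat_tables(pieces)

    # e.g. rows 1000-1999 from the hypothetical file:
    # t = read_row_range("data.parquet", 1000, 2000)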
On Wed, Nov 23, 2022 at 11:18 AM Jacek Pliszka <[email protected]> wrote:
>
> Actually, you provided too little information to tell.
>
> Will you store the data locally or over the network?
>
> Do you want to optimize for speed or for data size?
>
> Locally stored, memory-mapped Arrow IPC should be fast - you may want
> to test it against HDF5, and since it is memory mapped, slicing should
> work great.
>
> On the other hand, a parquet file should have decent compression if you
> want to save on storage and/or bandwidth.
> But I do not know how to slice it efficiently in your case.
>
> Question to others - is there an option to read rows from the 1000th to the
> 2000th in the current parquet interface?
> There is head in dataset.Scanner but slice is not there.
>
> BR,
>
> Jacek
>
> On Wed, 23 Nov 2022 at 18:20, Ramón Casero Cañas <[email protected]> wrote:
> >
> > Hi Jacek,
> >
> > Thanks for your reply, but it looks like that would be a complicated
> > workaround. I have been looking some more, and it looks like HDF5 would
> > be a good file format for this problem.
> >
> > It naturally supports slicing like fp['field1'][1000:5000], provides
> > chunking and compression, and new arrays can be appended... Maybe Arrow
> > is just not the right tool for this specific problem.
> >
> > Kind regards,
> >
> > Ramon.
> >
> >
> > On Wed, 23 Nov 2022 at 15:54, Jacek Pliszka <[email protected]> wrote:
> >>
> >> Hi!
> >>
> >> I am not sure if this would solve your problem:
> >>
> >> pa.concat_tables([pa.Table.from_pydict({'v': v}).append_column('f',
> >> [len(v)*[f]]) for f, v in x.items()])
> >>
> >> pyarrow.Table
> >> v: double
> >> f: string
> >> ----
> >> v: [[0.2,0.2,0.2,0.1,0.2,0,0.8,0.7],[0.3,0.5,0.1],[0.9,nan,nan,0.1,0.5]]
> >> f:
> >> [["field1","field1","field1","field1","field1","field1","field1","field1"],["field2","field2","field2"],["field3","field3","field3","field3","field3"]]
> >>
> >> The f column should compress very well, or you can make it a dictionary
> >> from the start.
> >>
> >> To get back, you can do a couple of things: take with pc.equal,
> >> to_batches, groupby.
> >>
> >> BR
> >>
> >> Jacek
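For the "get back" step, a minimal sketch of one of those options, filtering
on the f column with pc.equal and Table.filter; this assumes the long-format
table built exactly as in the snippet above, and the variable names are made
up:

    import pyarrow as pa
    import pyarrow.compute as pc

    x = {
        'field1': [0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7],
        'field2': [0.3, 0.5, 0.1],
        'field3': [0.9, float('nan'), float('nan'), 0.1, 0.5],
    }

    # Long format: one value column 'v' plus a field-name column 'f'.
    table = pa.concat_tables([
        pa.Table.from_pydict({'v': v}).append_column('f', [len(v) * [f]])
        for f, v in x.items()
    ])

    # Recover a single field by filtering on 'f'; the result is a ChunkedArray.
    field1 = table.filter(pc.equal(table['f'], 'field1'))['v']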
> >>
> >> On Wed, 23 Nov 2022 at 13:12, Ramón Casero Cañas <[email protected]> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I'm trying to figure out whether PyArrow could efficiently store and
> >> > slice large Python dictionaries that contain numpy arrays of variable
> >> > length, e.g.
> >> >
> >> > x = {
> >> >     'field1': [0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7],
> >> >     'field2': [0.3, 0.5, 0.1],
> >> >     'field3': [0.9, NaN, NaN, 0.1, 0.5]
> >> > }
> >> >
> >> > Arrow seems to be designed for Tables, but I was wondering whether
> >> > there's a way to do this (probably not with a Table or RecordBatch,
> >> > because those require the same lengths).
> >> >
> >> > The vector in each dictionary key would have on the order of 1e4 - 1e9
> >> > elements. There are some NaN gaps in the data (which would go well with
> >> > Arrow's null elements, I guess), but especially many repeated values,
> >> > which makes the data quite compressible.
> >> >
> >> > Apart from writing that data to disk quickly and with compression, I
> >> > then need to slice it efficiently, e.g.
> >> >
> >> > fp = open('file', 'r')
> >> > v = fp['field1'][1000:5000]
> >> >
> >> > Is this something that can be done with PyArrow?
> >> >
> >> > Kind regards,
> >> >
> >> > Ramon.
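As a rough sketch of the memory-mapped IPC route mentioned earlier, assuming
one uncompressed Arrow IPC file per field (the file names are made up, and the
toy arrays stand in for the 1e4-1e9 element ones, so the slice bounds here are
tiny):

    import pyarrow as pa
    import pyarrow.ipc as ipc

    x = {
        'field1': [0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7],
        'field2': [0.3, 0.5, 0.1],
        'field3': [0.9, float('nan'), float('nan'), 0.1, 0.5],
    }

    # One IPC file per field, since the arrays have different lengths.
    for name, values in x.items():
        t = pa.table({name: pa.array(values, type=pa.float64())})
        with ipc.new_file(f'{name}.arrow', t.schema) as writer:
            writer.write_table(t)

    # Memory-map one file and slice; with an uncompressed IPC file the read is
    # zero-copy, so data is only paged in when the slice is actually touched.
    with pa.memory_map('field1.arrow', 'r') as source:
        col = ipc.open_file(source).read_all()['field1']
        v = col.slice(2, 4)   # stands in for fp['field1'][1000:5000]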
