1. conversion - data coming from SQL is often Decimal, and pandas handles Decimal objects very inefficiently - you can cast them to int/float while still in Arrow and only then pass the table to pandas, which uses much less memory (see the first sketch below)
2. filters - check the filters option of ParquetDataset or read_parquet - with it you read only the rows you need instead of the whole file (second sketch below)
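
For tip 1, a minimal sketch - the column name "amount" and the literal values are made up for illustration, and it assumes a pyarrow version whose compute layer supports the decimal128 -> float64 cast:

    from decimal import Decimal

    import pyarrow as pa

    # Decimal values, as they often arrive from a SQL result set
    table = pa.table({
        "amount": pa.array([Decimal("1.10"), Decimal("2.25")],
                           type=pa.decimal128(10, 2)),
    })

    # Cast to float64 while the data is still in Arrow...
    idx = table.schema.get_field_index("amount")
    table = table.set_column(idx, "amount", table["amount"].cast(pa.float64()))

    # ...so pandas receives a plain float64 column instead of
    # Python Decimal objects boxed in an object column
    df = table.to_pandas()
    print(df.dtypes)  # amount    float64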
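
And for tip 2 - the file name "f.pq" and the columns "A", "B" and "year" are again made up; filters takes the same DNF-style list of (column, op, value) tuples that pd.read_parquet forwards to pyarrow:

    import pyarrow.parquet as pq

    # Only rows with year == 2020 (and only two columns) are read -
    # the filter is applied while reading, so the skipped data never
    # has to fit in memory
    table = pq.read_table(
        "f.pq",
        columns=["A", "B"],
        filters=[("year", "=", 2020)],
    )
    df = table.to_pandas()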

BR,

Jacek

On Fri, 12 Feb 2021 at 15:45, jonathan mercier <[email protected]> wrote:
>
> Oh yes, I can do this too.
> Thanks
> Now when I see parquet I think pyarrow :-)
>
> What did you mean by conversion or filtering?
> Could you provide a little example please?
>
> Anyway
>
> Have a good day
>
> On Friday, 12 February 2021 at 15:26 +0100, Jacek Pliszka wrote:
> > Sure - I believe you can do it even in pandas - you have the columns
> > parameter: pd.read_parquet('f.pq', columns=['A', 'B'])
> >
> > arrow is more useful if you need to do some conversion or filtering.
> >
> > BR,
> >
> > Jacek
> >
> > On Fri, 12 Feb 2021 at 15:21, jonathan mercier <[email protected]> wrote:
> > >
> > > Dear,
> > > I have a parquet file with 300,000 columns and 30,000 rows.
> > > If I load such a file into a pandas dataframe (with pyarrow), it
> > > takes around 100 GB of RAM.
> > >
> > > As I perform pairwise comparisons between columns, I could load
> > > the data N columns at a time.
> > >
> > > So is it possible to load only a few columns from a parquet file
> > > by their names? That would save some memory.
> > >
> > > Thanks