Thanks for the hint.
I did not see a to_numpy method on the Table object, so I think I have
to do it manually in Python.

something like:

#### python3

import pyarrow.parquet as pq
import numpy as np

data = pq.read_table(dataset_path)
matrix = np.zeros((data.num_rows, data.num_columns), dtype=np.bool_)
for i, col in enumerate(data.columns):
    # each col is a pyarrow.ChunkedArray; convert it before assigning
    matrix[:, i] = col.to_numpy()
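
If memory or speed is still a problem, maybe a batch-wise version would
also help (a rough, untested sketch, assuming pyarrow >= 3.0; the file
name is a placeholder):

import pyarrow.parquet as pq
import numpy as np

pf = pq.ParquetFile("dataset.parquet")  # placeholder path
matrix = np.empty((pf.metadata.num_rows, pf.metadata.num_columns),
                  dtype=np.bool_)
row = 0
for batch in pf.iter_batches():
    # each batch is a pyarrow.RecordBatch whose columns are plain Arrays
    for i, col in enumerate(batch.columns):
        matrix[row:row + batch.num_rows, i] = col.to_numpy(zero_copy_only=False)
    row += batch.num_rows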




On Monday, 1 March 2021 at 11:31 +0100, Jacek Pliszka wrote:
> Others will probably give you better hints, but:
> 
> You do not need to convert to Pandas. Read it in with Arrow and
> convert to NumPy directly if NumPy is what you want.
> 
> BR,
> 
> Jacek
> 
> On Mon, 1 Mar 2021 at 11:24, jonathan mercier <[email protected]>
> wrote:
> > 
> > Dear all,
> > 
> > I am studying 300,000 samples of SARS-CoV-2 with parquet/pyarrow,
> > so I have a table with 300,000 columns and around 45,000 rows of
> > presence/absence values (0/1). It is a file of ~150 MB.
> > 
> > I read this file like this:
> > 
> > import numpy
> > import pyarrow.parquet as pq
> > data = pq.read_table(dataset_path).to_pandas().to_numpy().astype(numpy.bool_)
> > 
> > And this statement takes 1 hour…
> > So is there a trick to speed up loading those data into memory?
> > Is it possible to distribute the loading with a library such as ray?
> > 
> > thanks
> > 
> > Best regards
> > 
> > 
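Regarding my other question about distributing the loading: since a
parquet file is split into row groups, maybe each group could be read
in its own process (an untested sketch; the path is a placeholder and
it assumes the file was written with several row groups):

import numpy as np
import pyarrow.parquet as pq
from concurrent.futures import ProcessPoolExecutor

PATH = "dataset.parquet"  # placeholder path

def load_row_group(idx):
    # reopen the file in the worker and read one row group as a Table
    table = pq.ParquetFile(PATH).read_row_group(idx)
    return np.column_stack([col.to_numpy() for col in table.columns])

if __name__ == "__main__":
    n_groups = pq.ParquetFile(PATH).metadata.num_row_groups
    with ProcessPoolExecutor() as pool:
        parts = list(pool.map(load_row_group, range(n_groups)))
    matrix = np.vstack(parts).astype(np.bool_)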

-- 
                Researcher computational biology
                PhD, Jonathan MERCIER

                Bioinformatics (LBI)
                2, rue Gaston Crémieux
                91057 Evry Cedex

                Tel: (+33)1 60 87 83 44
                Email: [email protected]
