I suspect you can reap significant performance benefits without going to the engineering lengths that we've gone to for general-purpose CSV parsing.
On Tue, Mar 24, 2020 at 6:04 AM jonathan mercier <[email protected]> wrote:
>
> Hi Wes
>
> Thanks for your quick answer. I took a look at the pyarrow CSV reader:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/reader.cc
> and
> https://github.com/apache/arrow/blob/master/python/pyarrow/_csv.pyx
>
> I have a lot of code to understand and write in order to expose a *.bed
> reader in Python.
>
> I will try to do my best.
>
> Thanks
>
> Have a nice day
>
>
> On Monday, March 23, 2020 at 18:24 -0500, Wes McKinney wrote:
> > hi Jonathan -- generally my approach would be to write some Cython or
> > C/C++ code to create the file loader. Any time you are writing a file
> > loader that deals with individual table cells in pure Python it's
> > going to suffer from some performance problems.
> >
> > We've talked about exposing the Arrow C++ incremental builder classes
> > in Python or Cython -- I didn't find a JIRA issue about this so I
> > created
> >
> > https://issues.apache.org/jira/browse/ARROW-8189
> >
> > Hope this helps
> > Wes
> >
> > On Mon, Mar 23, 2020 at 3:10 PM jonathan mercier
> > <[email protected]> wrote:
> > > Dear,
> > >
> > > I would like to parse *.bed files into pyarrow.
> > >
> > > A BED file looks like this:
> > > #This is a comment
> > > chr1 10000 69091
> > > chr1 80608 106842
> > > chr3 70008 207666
> > > chr14 257666 297968
> > >
> > > So we can see it is a tab-separated text file with 3 columns. A line
> > > can be a comment if it starts with a #.
> > >
> > > My way of handling such a file is not efficient and I would like your
> > > insight on how to load such data.
> > >
> > > My way: I read the file line by line with Python's builtin open();
> > > if the line does not start with a #, I split the line, convert each
> > > column to the expected column type (i.e. str, int, …), and append
> > > each value to its column. Finally I create a pyarrow table and
> > > write it to parquet.
> > >
> > > import pyarrow as pa
> > > from pyarrow.parquet import ParquetWriter
> > >
> > > bed3_schema = pa.schema([('chr', pa.string()),
> > >                          ('start', pa.int64()),
> > >                          ('end', pa.float64())])
> > > bed3_column_type = [str, int, int]
> > >
> > >
> > > def bed_to_parquet(bed_path: str, parquet_path: str, dataset=None):
> > >     columns = [[], [], []]
> > >     with open(bed_path) as stream:
> > >         for row in stream:
> > >             if not row.startswith('#'):
> > >                 cols = row.split('\t')
> > >                 for i, item in enumerate(cols):
> > >                     casted_value = bed3_column_type[i](item)
> > >                     columns[i].append(casted_value)
> > >     arrays = [pa.array(column) for column in columns]
> > >     table = pa.Table.from_arrays(arrays, schema=bed3_schema)
> > >     with ParquetWriter(parquet_path, table.schema,
> > >                        use_dictionary=True,
> > >                        version='2.0') as writer:
> > >         if dataset:
> > >             writer.write_to_dataset(table, dataset)
> > >         else:
> > >             writer.write_table(table)
