Hi Jonathan,

Generally my approach would be to write some Cython or C/C++ code to create the file loader. Any file loader that handles individual table cells in pure Python is going to suffer from performance problems, because every cell pays Python interpreter overhead.

We've talked about exposing the Arrow C++ incremental builder classes in Python or Cython. I didn't find a JIRA issue about this, so I created https://issues.apache.org/jira/browse/ARROW-8189

Hope this helps
Wes

On Mon, Mar 23, 2020 at 3:10 PM jonathan mercier wrote:
>
> Dear,
>
> I would like to parse a *.bed file into pyarrow.
>
> A BED file looks like this:
> #This is a comment
> chr1 10000 69091
> chr1 80608 106842
> chr3 70008 207666
> chr14 257666 297968
>
>
> So it is a tab-separated text file with 3 columns. A line is a
> comment if it starts with a #.
>
>
> My way of handling such a file is not efficient, and I would like your
> insight on how to load such data.
>
> Currently I read the file line by line with Python's built-in open. If a
> line does not start with a #, I split it, convert each column to its
> expected type (i.e. str, int, ...) and append each value to its
> column. Finally I create a pyarrow table and write it to Parquet.
>
>
>
> import pyarrow as pa
> from pyarrow.parquet import ParquetWriter, write_to_dataset
> bed3_schema = pa.schema([('chr', pa.string()),
>                          ('start', pa.int64()),
>                          ('end', pa.int64())])
> bed3_column_type = [str, int, int]
>
>
> def bed_to_parquet(bed_path: str, parquet_path: str, dataset=None):
>     columns = [[], [], []]
>     with open(bed_path) as stream:
>         for row in stream:
>             if not row.startswith('#'):
>                 # strip the trailing newline before splitting on tabs
>                 cols = row.rstrip('\n').split('\t')
>                 for i, item in enumerate(cols):
>                     casted_value = bed3_column_type[i](item)
>                     columns[i].append(casted_value)
>     arrays = [pa.array(column) for column in columns]
>     table = pa.Table.from_arrays(arrays, schema=bed3_schema)
>     if dataset:
>         # write_to_dataset is a module-level function, not a ParquetWriter method
>         write_to_dataset(table, dataset)
>     else:
>         with ParquetWriter(parquet_path, table.schema,
>                            use_dictionary=True, version='2.0') as writer:
>             writer.write_table(table)
>
