Dear all,
I would like to parse a *.bed file into pyarrow.
A BED file looks like this:
#This is a comment
chr1 10000 69091
chr1 80608 106842
chr3 70008 207666
chr14 257666 297968
So it is a tab-separated text file with 3 columns. A line is a comment
if it starts with a #.
My way of handling such a file is not efficient, and I would like your
insight on how to load such data.
Currently I read the file line by line with Python's builtin open. If a
line does not start with a #, I split it, convert each column to its
expected type (i.e. str, int …) and append each value to its column.
Finally I create a pyarrow table and write it to parquet.
import pyarrow as pa
from pyarrow.parquet import ParquetWriter, write_to_dataset

bed3_schema = pa.schema([('chr', pa.string()),
                         ('start', pa.int64()),
                         ('end', pa.int64())])
bed3_column_type = [str, int, int]


def bed_to_parquet(bed_path: str, parquet_path: str, dataset=None):
    columns = [[], [], []]
    with open(bed_path) as stream:
        for row in stream:
            # Skip comment lines and blank lines.
            if row.strip() and not row.startswith('#'):
                cols = row.rstrip('\n').split('\t')
                for i, item in enumerate(cols):
                    casted_value = bed3_column_type[i](item)
                    columns[i].append(casted_value)
    arrays = [pa.array(column) for column in columns]
    table = pa.Table.from_arrays(arrays, schema=bed3_schema)
    if dataset:
        # write_to_dataset is a module-level function, not a ParquetWriter method.
        write_to_dataset(table, root_path=dataset)
    else:
        with ParquetWriter(parquet_path, table.schema,
                           use_dictionary=True, version='2.0') as writer:
            writer.write_table(table)
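
One alternative that might avoid the per-value Python loop is to let
pyarrow's own CSV reader do the parsing and type conversion. The sketch
below is only an illustration under some assumptions: the helper names
strip_bed_comments and read_bed3 are made up for this example, the file
is assumed to be tab-separated with exactly 3 columns, and it is assumed
to fit in memory, since the comment lines are filtered out in Python
before the data is handed to pyarrow.csv.read_csv.

```python
import io


def strip_bed_comments(raw: bytes) -> bytes:
    """Drop comment lines (starting with '#') and blank lines from raw BED bytes."""
    kept = [line for line in raw.splitlines(keepends=True)
            if line.strip() and not line.startswith(b'#')]
    return b''.join(kept)


def read_bed3(bed_path: str):
    """Parse a 3-column BED file into a pyarrow Table (hypothetical helper)."""
    # Imported here so strip_bed_comments stays usable without pyarrow installed.
    import pyarrow as pa
    from pyarrow import csv

    # Assumption: the whole file fits in memory.
    with open(bed_path, 'rb') as stream:
        data = strip_bed_comments(stream.read())

    return csv.read_csv(
        io.BytesIO(data),
        # BED files have no header row, so the column names are given explicitly.
        read_options=csv.ReadOptions(column_names=['chr', 'start', 'end']),
        parse_options=csv.ParseOptions(delimiter='\t'),
        convert_options=csv.ConvertOptions(column_types={
            'chr': pa.string(), 'start': pa.int64(), 'end': pa.int64()}),
    )
```

The resulting table could then be written with writer.write_table(table)
as in the function above. Whether this is actually faster on real BED
files would need to be measured.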