Dear,

I would like to parse a *.bed file into pyarrow.

A BED file looks like this:
#This is a comment
chr1    10000   69091
chr1    80608   106842
chr3    70008   207666
chr14   257666  297968


So we can see it is a tab-separated text file with 3 columns. A line
is a comment if it starts with a #.
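
To make the format concrete, here is a minimal sketch of how a single
line maps to typed fields. The helper name parse_bed3_line is
hypothetical, and it assumes the columns are separated by tabs:

```python
from typing import Optional, Tuple


def parse_bed3_line(line: str) -> Optional[Tuple[str, int, int]]:
    """Return (chr, start, end) for a data line, or None for a comment/blank line."""
    # Comment lines start with '#'; blank lines carry no data.
    if line.startswith('#') or not line.strip():
        return None
    # Assumption: fields are tab-separated; only the first 3 columns are used.
    # A line with fewer than 3 fields raises ValueError here.
    chrom, start, end = line.rstrip('\n').split('\t')[:3]
    return chrom, int(start), int(end)
```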


My way of handling such files is not efficient, and I would like your
insight on how to load such data.

My way: I read the file line by line with Python's builtin open. If a
line does not start with a #, I split it, convert each column to its
expected type (e.g. str, int …), and append each value to its column.
Finally, I create a pyarrow table and write it to parquet.



import pyarrow as pa
import pyarrow.parquet as pq

# BED coordinates are integers, so 'end' uses int64 like 'start'
bed3_schema = pa.schema([('chr', pa.string()),
                         ('start', pa.int64()),
                         ('end', pa.int64())])
bed3_column_type = [str, int, int]


def bed_to_parquet(bed_path: str, parquet_path: str, dataset=None):
    columns = [[], [], []]
    with open(bed_path) as stream:
        for row in stream:
            # skip comment lines and blank lines
            if row.startswith('#') or not row.strip():
                continue
            cols = row.rstrip('\n').split('\t')
            for i, item in enumerate(cols):
                casted_value = bed3_column_type[i](item)
                columns[i].append(casted_value)
    arrays = [pa.array(column) for column in columns]
    table = pa.Table.from_arrays(arrays, schema=bed3_schema)
    if dataset:
        # write_to_dataset is a function of pyarrow.parquet,
        # not a method of ParquetWriter
        pq.write_to_dataset(table, root_path=dataset)
    else:
        # note: recent pyarrow releases deprecate version='2.0'
        # in favour of '2.4' / '2.6'
        with pq.ParquetWriter(parquet_path, table.schema,
                              use_dictionary=True, version='2.0') as writer:
            writer.write_table(table)
