Hello,

Apologies if this question is answered directly somewhere in the docs. After
poking around the arrow and parquet crates I wasn't completely sure of the
answer, and I was hoping a dev could briefly summarize the tooling available
for our use case.

The group I'm working with has a function that generates rows of a ginormous
dataset and appends them to a CSV, which we then want to post-process.
However, the dataset is too big to fit in main memory. We were hoping that by
storing everything in Parquet files we could cut down on our disk space
consumption and also run a reasonably efficient file-backed columnar sort.
However, it looks like some of the high-level tools for writing and
interacting with Parquet files are still in development.
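
For concreteness, the write side we had in mind looks roughly like the sketch
below: generate one record batch at a time and stream it through the parquet
crate's ArrowWriter, so only the current batch (plus the writer's buffered row
group) is ever in memory. The schema, batch contents, and file name are
made-up placeholders, and I'm assuming ArrowWriter supports incremental batch
writes along these lines:

    use std::fs::File;
    use std::sync::Arc;

    use arrow::array::{Float64Array, Int64Array};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;
    use parquet::arrow::ArrowWriter;
    use parquet::basic::Compression;
    use parquet::file::properties::WriterProperties;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Placeholder two-column schema standing in for our real dataset.
        let schema = Arc::new(Schema::new(vec![
            Field::new("id", DataType::Int64, false),
            Field::new("value", DataType::Float64, false),
        ]));

        let file = File::create("data.parquet")?;
        // Compression is part of why we want Parquet over CSV.
        let props = WriterProperties::builder()
            .set_compression(Compression::SNAPPY)
            .build();
        let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

        // Generate and write one batch at a time, so only the current
        // batch needs to fit in memory.
        for chunk in 0i64..1_000 {
            let ids = Int64Array::from_iter_values(
                (0i64..10_000).map(|i| chunk * 10_000 + i),
            );
            let values = Float64Array::from_iter_values(
                (0..10_000).map(|i| i as f64 * 0.5),
            );
            let batch = RecordBatch::try_new(
                schema.clone(),
                vec![Arc::new(ids), Arc::new(values)],
            )?;
            writer.write(&batch)?;
        }

        writer.close()?; // finalizes the Parquet footer
        Ok(())
    }

If that pattern works, our row-by-row CSV append would just become a
batch-by-batch Parquet write.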

Is this something that can easily (or with some surmountable degree of
difficulty) be done with the tools available in the parquet/arrow crates? And
if we ran a query with DataFusion, would it be able to find and sort the top
k rows on a column without reading everything into memory?
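
For concreteness, this is roughly the query we have in mind (again just a
sketch: the table name, column names, and path are placeholders, and I'm
assuming a recent DataFusion where SessionContext is the entry point):

    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();

        // Register the Parquet file (or a directory of them) as a table.
        ctx.register_parquet("events", "data.parquet", ParquetReadOptions::default())
            .await?;

        // ORDER BY + LIMIT is the top-k pattern we'd like to run out-of-core.
        let df = ctx
            .sql("SELECT id, value FROM events ORDER BY value DESC LIMIT 10")
            .await?;
        df.show().await?;

        Ok(())
    }

The hope is that the ORDER BY ... LIMIT k can be satisfied by tracking only k
rows at a time rather than materializing a full in-memory sort.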

Thanks so much for your help!
Josh
