Hi Joshua,

TL;DR: DataFusion, or possibly Ballista (the Rust query engines), may do
what you want, but you would have to test it. I can't recall offhand
whether a sort + limit is applied per partition followed by a final
sort + limit (which would avoid reading everything into memory at once).

To maximize the chance of it working, write the parquet file in several
different "row groups" -- readers can then process one row group at a
time rather than the whole file.

The Apache Arrow Rust ecosystem has several related crates, so depending
on exactly what you are doing you may want to pick a higher- or
lower-level one:
1. parquet (lower-level APIs for reading / writing Parquet-encoded data)
2. arrow (higher-level APIs for manipulating in-memory arrays)
3. datafusion (an in-memory query engine)
4. ballista (a Spark-like distributed query system)

Hope that helps,
Andrew

On Tue, Sep 28, 2021 at 1:33 PM Joshua Abrams <[email protected]>
wrote:

> Hello,
>
> I am really terribly sorry if this question is directly answered somewhere
> in the docs, but after poking around the crates for arrow and parquet, I
> wasn't completely sure of the answer and was hoping it wouldn't be too much
> of a hassle for a dev to just summarize the tooling available for the
> use-case.
>
> The group that I'm working for has a function that generates rows of a
> ginormous dataset and appends them to a CSV. We would then like to do some
> post-processing. However, the dataset is too big to fit in main memory. We
> were hoping that if we stored everything in parquets, we would be able to
> cut down on our disk space consumption and also be able to run a
> semi-efficient file-backed columnar sort. However, it looks like some of
> the high-level tools for writing and interacting with parquets are still in
> development.
>
> Is this something that can easily (or with some surmountable degree of
> difficulty) be done with the tools available in the Parquet/Arrow crate? If
> we ran a query with Datafusion would it be able to collect and sort the top
> k columns without reading everything into memory?
>
> Thanks so much for your help!
> Josh
>
>
