How many rows are you including in a batch? You might want to try smaller row batches, since your columns are so wide.
The other thing you can try, instead of Parquet, is testing files with progressively more columns. If the width of your tables is the problem, you'll be able to see the point at which it becomes a problem, and what the peak load time is, so you know how much you may be missing out on.
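
A rough sketch of that width test in R, in case it is useful. It assumes the arrow package and uses made-up random data; `time_width` is a hypothetical helper, not part of arrow, and the widths are only a starting point that you would extend towards your real 32000 columns.

```r
library(arrow)

# Hypothetical helper: write one feather file with two ID columns plus
# `n_col` numeric columns (V1..Vn), then time reading back only the two
# IDs and the first 100 value columns.
time_width <- function(n_col, n_row = 500) {
  values <- as.data.frame(matrix(rnorm(n_row * n_col), nrow = n_row))
  df <- data.frame(ID1 = seq_len(n_row), ID2 = seq_len(n_row), values)
  path <- tempfile(fileext = ".feather")
  on.exit(unlink(path))
  write_feather(df, path)

  wanted <- c("ID1", "ID2", paste0("V", 1:100))
  elapsed <- system.time(
    chunk <- read_feather(path, col_select = tidyselect::all_of(wanted))
  )[["elapsed"]]

  data.frame(n_col = n_col, elapsed = elapsed, cols_read = ncol(chunk))
}

# Assumed widths; extend towards 32000 to match the real files.
widths <- c(100, 1000, 10000)
timings <- do.call(rbind, lapply(widths, time_width))
print(timings)
```

If the elapsed time for the same 102 columns climbs steeply with `n_col`, that would point at table width rather than data volume.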
Hi Richard,
I tried to reproduce [1] something akin to what you describe and I
also see worse-than-expected performance. I did find a GitHub Issue
[2] describing performance issues with wide record batches which might
be relevant here, though I'm not sure.
Have you tried the same kind of workflow but with Parquet as your
on-disk format instead of Feather?
[1] https://gist.github.com/amoeba/38591e99bd8682b60779021ac57f146b
[2] https://github.com/apache/arrow/issues/16270
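
For what it's worth, a Parquet comparison could be sketched roughly like this in R. This is only an illustration on a toy dataset (3 small files, 200 value columns), not the code from the gist; the directory names and `time_subset` are hypothetical, and the real test would point `feather_dir` at the existing 86 files.

```r
library(arrow)
library(dplyr)

feather_dir <- tempfile("feather_ds")
parquet_dir <- tempfile("parquet_ds")
dir.create(feather_dir)

# Toy stand-in for the real data: 3 files, 50 rows, 200 value columns each.
for (i in 1:3) {
  values <- as.data.frame(matrix(rnorm(50 * 200), nrow = 50))
  df <- data.frame(ID1 = i, ID2 = seq_len(50), values)
  write_feather(df, file.path(feather_dir, sprintf("measure_%02d.feather", i)))
}

# Rewrite the Feather dataset as Parquet, then open both for comparison.
feather_ds <- open_dataset(feather_dir, format = "feather")
write_dataset(feather_ds, parquet_dir, format = "parquet")
parquet_ds <- open_dataset(parquet_dir)

# Hypothetical helper: time collecting all rows for a small column subset.
time_subset <- function(ds) {
  system.time(
    ds %>% select(ID1, ID2, V1, V2) %>% collect()
  )[["elapsed"]]
}

print(c(feather = time_subset(feather_ds), parquet = time_subset(parquet_ds)))
```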
On Wed, Sep 27, 2023 at 10:14 PM Richard Beare <[email protected]> wrote:
>
> Hi,
> I have created (from R) an arrow dataset consisting of 86 files (feather). Each of them is 895M, with about 500 rows and 32000 columns. The natural structure of the complete dataframe is a 86*500 row dataframe.
>
> My aim is to load a chunk consisting of all rows and a subset of columns (two ID columns + 100 other columns), I'll do some manipulation and modelling on that chunk, then move to the next and repeat.
>
> Each row in the dataframe corresponds to a flattened image, with two ID columns. Each feather file contains the set of images corresponding to a single measure.
>
> I want to run a series of collect(arrowdataset[, c("ID1", "ID2", "V1", "V2")])
>
> However the load time seems very slow (10 minutes+), and I'm wondering what I've done wrong. I've tested on hosts with SSD.
>
> I can see a saving in which ID1 becomes part of the partitioning instead of storing it with the data, but that sounds like a minor change.
>
> Any thoughts on what I've missed?
