I'm collating the problems I've observed while testing this.

1) Probable memory leak in the parquet writer workflow.

My workflow for creating the dataset is pretty simple - it loops through a
set of 500 images, loads each one, and extracts a subset of values based on
a mask. The resulting vector becomes a row in the table. When I use the
approach at
https://gist.github.com/thisisnic/f73196d490a9f661269b5403292dddc3, with a
gc() call added to the loop, memory usage reported by top sits at 3.2%
on my test machine. However, if I use the version for parquet files (
https://gist.github.com/thisisnic/5bdb85d2742bc318433f2f14b8bd77cf), there
is clearly a leak of some sort - after 160 of the 500 images RAM use has
grown to 17% (5.3G), and it continues to grow.
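
For context, the core of my loop looks roughly like the sketch below.
load_image() and extract_masked_values() are placeholders for my own code,
and the writer setup follows my reading of the parquet gist rather than
being a verbatim copy:

    library(arrow)

    sink <- FileOutputStream$create("images.parquet")
    writer <- NULL

    for (f in image_files) {
      img <- load_image(f)                      # placeholder: read one image
      vec <- extract_masked_values(img, mask)   # placeholder: values under the mask
      tbl <- arrow_table(as.data.frame(t(vec))) # a single wide row, columns V1..Vn

      if (is.null(writer)) {
        writer <- ParquetFileWriter$create(tbl$schema, sink)
      }
      writer$WriteTable(tbl, chunk_size = 1)    # append this row to the parquet file
      gc()                                      # keeps the feather version flat, but not this one
    }

    writer$Close()
    sink$close()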

2) Probable excess RAM use when loading the dataset - I can open individual
parquet files, but opening the dataset and then doing
collect(select(parquetdataset, ID, V1, V2, V3, V4)) crashes the R session,
most likely due to running out of RAM (unconfirmed).
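
For clarity, the calls in question are roughly as follows (directory name
illustrative):

    library(arrow)
    library(dplyr)

    parquetdataset <- open_dataset("parquet_dir", format = "parquet")
    collect(select(parquetdataset, ID, V1, V2, V3, V4))   # crashes the R session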

3) I returned to the feather format and saved only the first 100 data
columns plus the ID columns - the number of rows remains the same as before.
Now collect(select(arrowdataset, ID1, ID2, V1, V2)) takes 3.6 seconds,
while collect(select(arrowdataset, ID1, ID2, starts_with("V"))) takes 7.3
seconds - i.e. much faster than extracting the same columns from the much
wider dataset.
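
To spell out the comparison (a sketch; arrowdataset here is the 100-column
feather dataset opened with open_dataset(), path illustrative):

    library(arrow)
    library(dplyr)

    arrowdataset <- open_dataset("feather_dir", format = "feather")

    system.time(collect(select(arrowdataset, ID1, ID2, V1, V2)))            # ~3.6 s
    system.time(collect(select(arrowdataset, ID1, ID2, starts_with("V"))))  # ~7.3 s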

It looks like I should try pivoting the data to long format before saving
and see whether I can analyse it that way. I was previously fetching a set of
columns and pivoting/nesting them anyway, so perhaps that isn't a bad thing.
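
Roughly what I have in mind (a sketch only; wide_row stands for one row of
the current wide table, and the column names are illustrative):

    library(arrow)
    library(tidyr)

    # one wide row (ID columns + V1..Vn) becomes n long rows
    long <- pivot_longer(wide_row, cols = starts_with("V"),
                         names_to = "variable", values_to = "value")

    # the long rows would then be written out as the dataset
    write_dataset(long, "long_dataset", format = "parquet")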


On Sun, Oct 1, 2023 at 9:03 PM Richard Beare <[email protected]>
wrote:

> Thanks for all the suggestions, I'm working through them.
>
> One problem I've discovered relates to creating the dataset in the
> first place. For the parquet version I'm using the approach described here:
>
>  https://gist.github.com/thisisnic/5bdb85d2742bc318433f2f14b8bd77cf
>
> in response to an earlier question. However, I'm finding that the R RAM
> usage steadily grows through the loop, which it shouldn't. I suspect I
> should be forcing a flush to disk at regular intervals, but I can't see how
> to achieve that from R. There's no obvious reason why the RAM use needs to
> increase with iterations.
>
> On Sat, Sep 30, 2023 at 1:23 AM Aldrin <[email protected]> wrote:
>
>> How many rows are you including in a batch? You might want to try
>> smaller row batches since your columns are so wide.
>>
>> The other thing you can try, instead of parquet, is testing files with
>> progressively more columns. If the width of your tables is the problem,
>> then you'll be able to see when it becomes a problem and what the peak
>> load time is, to know how much you may be missing out on.
>>
>>
>>
>> On Thu, Sep 28, 2023 at 22:34, Bryce Mecum <[email protected]> wrote:
>>
>> Hi Richard,
>>
>> I tried to reproduce [1] something akin to what you describe and I
>> also see worse-than-expected performance. I did find a GitHub Issue
>> [2] describing performance issues with wide record batches which might
>> be relevant here, though I'm not sure.
>>
>> Have you tried the same kind of workflow but with Parquet as your
>> on-disk format instead of Feather?
>>
>> [1] https://gist.github.com/amoeba/38591e99bd8682b60779021ac57f146b
>> [2] https://github.com/apache/arrow/issues/16270
>>
>> On Wed, Sep 27, 2023 at 10:14 PM Richard Beare <[email protected]>
>> wrote:
>> >
>> > Hi,
>> > I have created (from R) an arrow dataset consisting of 86 files
>> (feather). Each of them is 895M, with about 500 rows and 32000 columns. The
>> natural structure of the complete dataframe is an 86*500 row dataframe.
>> >
>> > My aim is to load a chunk consisting of all rows and a subset of
>> columns (two ID columns + 100 other columns); I'll do some manipulation and
>> modelling on that chunk, then move to the next and repeat.
>> >
>> > Each row in the dataframe corresponds to a flattened image, with two ID
>> columns. Each feather file contains the set of images corresponding to a
>> single measure.
>> >
>> > I want to run a series of collect(arrowdataset[, c("ID1", "ID2", "V1",
>> "V2")])
>> >
>> > However, the load time seems very slow (10+ minutes), and I'm wondering
>> what I've done wrong. I've tested on hosts with SSDs.
>> >
>> > I can see a saving in which ID1 becomes part of the partitioning
>> instead of storing it with the data, but that sounds like a minor change.
>> >
>> > Any thoughts on what I've missed?
>>
>>
