I'm collating the problems I observe while testing this.

1) Probable memory leak in the parquet writer workflow. The workflow that creates my dataset is pretty simple - it loops through a set of 500 images, loads each one and extracts a subset based on a mask, and the resulting vector becomes a row in the table. When I use the feather approach at https://gist.github.com/thisisnic/f73196d490a9f661269b5403292dddc3, with a gc() call added to my loop, memory usage reported by top sits at 3.2% on my test machine. However, if I use the parquet version (https://gist.github.com/thisisnic/5bdb85d2742bc318433f2f14b8bd77cf), there is clearly a leak of some sort - after 160 of the 500 images RAM use has grown to 17% (5.3G) and it continues to grow. A sketch of the loop is below, after 3).

2) Probable excess RAM use when loading the parquet dataset. I can open individual parquet files, but opening the dataset and then doing collect(select(parquetdataset, ID, V1, V2, V3, V4)) crashes the R session, most likely due to running out of RAM (unconfirmed).

3) I returned to feather format and saved only the first 100 data columns plus the ID columns; the number of rows remains the same as before. Now collect(select(arrowdataset, ID1, ID2, V1, V2)) takes 3.6 seconds and collect(select(arrowdataset, ID1, ID2, starts_with("V"))) takes 7.3 seconds - i.e. much faster than extracting the same columns from the much wider dataset (the query shape is sketched below). It looks like I should try pivoting the data to a long format before saving and see if I can analyse it that way (also sketched below); I was previously fetching a set of columns and pivoting/nesting anyway, so perhaps it isn't a bad thing.
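
For 1), the dataset-creation loop is roughly the sketch below. load_image() and apply_mask() are just stand-ins for my actual image-handling code, and the writer setup may not match the gist exactly, but the shape is the same: one wide row per image, written incrementally, with an explicit gc().

    library(arrow)

    image_files <- list.files("images", full.names = TRUE)   # the 500 input images
    mask <- readRDS("mask.rds")                               # placeholder for the precomputed mask
    sink <- FileOutputStream$create("parquet_dir/part-0.parquet")
    writer <- NULL

    for (f in image_files) {
      img <- load_image(f)            # stand-in: load one image
      vec <- apply_mask(img, mask)    # stand-in: masked subset as a numeric vector
      # one wide row: ID plus V1..Vn
      row <- cbind(data.frame(ID = basename(f)), as.data.frame(t(vec)))
      tbl <- as_arrow_table(row)
      if (is.null(writer)) {
        # create the writer lazily so the schema comes from the first row
        writer <- ParquetFileWriter$create(tbl$schema, sink)
      }
      writer$WriteTable(tbl, chunk_size = 1L)
      rm(img, vec, row, tbl)
      gc()    # keeps the feather version flat; the parquet version still grows
    }

    writer$Close()
    sink$close()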
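
For 2) and 3), the queries are all of this shape (dataset paths here are illustrative); the 3.6 and 7.3 second figures came from timing the narrow feather dataset like this:

    library(arrow)
    library(dplyr)

    arrowdataset <- open_dataset("feather_dir", format = "feather")

    # two ID columns plus two data columns
    system.time(
      chunk <- arrowdataset |> select(ID1, ID2, V1, V2) |> collect()
    )

    # two ID columns plus all 100 data columns
    system.time(
      chunk_all <- arrowdataset |> select(ID1, ID2, starts_with("V")) |> collect()
    )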
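
The pivot-to-long idea from 3) would look something like the sketch below, where wide_chunk stands for one chunk already in memory (ID columns plus V1..Vn) and partitioning on ID1 is just a first guess:

    library(arrow)
    library(dplyr)
    library(tidyr)

    # wide_chunk: ID1, ID2, V1 ... Vn (placeholder for one in-memory chunk)
    long_chunk <- wide_chunk |>
      pivot_longer(
        cols      = starts_with("V"),
        names_to  = "variable",
        values_to = "value"
      )

    # write a partitioned long-format dataset instead of one very wide table
    write_dataset(long_chunk, "long_dir", format = "feather", partitioning = "ID1")

Selecting a measure would then become a filter() on the variable column rather than a wide select().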

On Sun, Oct 1, 2023 at 9:03 PM Richard Beare <[email protected]> wrote:

> Thanks for all the suggestions, I'm working through them.
>
> One problem I've discovered relates to creating the dataset in the
> first place. For the parquet version I'm using the approach described here:
>
> https://gist.github.com/thisisnic/5bdb85d2742bc318433f2f14b8bd77cf
>
> in response to an earlier question. However I'm finding that the R RAM
> usage steadily grows through the loop, which it shouldn't. I suspect I
> should be forcing a flush to disk at regular intervals, but I can't see how
> to achieve that from R. There's no obvious reason why the RAM use needs to
> increase with iterations.
>
> On Sat, Sep 30, 2023 at 1:23 AM Aldrin <[email protected]> wrote:
>
>> How many rows are you including in a batch? You might want to try
>> smaller row batches since your columns are so wide.
>>
>> The other thing you can try instead of parquet is testing files with
>> progressively more columns. If the width of your tables is the problem
>> then you'll be able to see when it becomes a problem and what the peak load
>> time is, to know how much you may be missing out on.
>>
>> Sent from Proton Mail <https://proton.me/mail/home> for iOS
>>
>>
>> On Thu, Sep 28, 2023 at 22:34, Bryce Mecum <[email protected]> wrote:
>>
>> Hi Richard,
>>
>> I tried to reproduce [1] something akin to what you describe and I
>> also see worse-than-expected performance. I did find a GitHub Issue
>> [2] describing performance issues with wide record batches which might
>> be relevant here, though I'm not sure.
>>
>> Have you tried the same kind of workflow but with Parquet as your
>> on-disk format instead of Feather?
>>
>> [1] https://gist.github.com/amoeba/38591e99bd8682b60779021ac57f146b
>> [2] https://github.com/apache/arrow/issues/16270
>>
>> On Wed, Sep 27, 2023 at 10:14 PM Richard Beare <[email protected]> wrote:
>> >
>> > Hi,
>> > I have created (from R) an arrow dataset consisting of 86 files
>> > (feather). Each of them is 895M, with about 500 rows and 32000 columns.
>> > The natural structure of the complete dataframe is an 86*500 row dataframe.
>> >
>> > My aim is to load a chunk consisting of all rows and a subset of
>> > columns (two ID columns + 100 other columns). I'll do some manipulation and
>> > modelling on that chunk, then move to the next and repeat.
>> >
>> > Each row in the dataframe corresponds to a flattened image, with two ID
>> > columns. Each feather file contains the set of images corresponding to a
>> > single measure.
>> >
>> > I want to run a series of collect(arrowdataset[, c("ID1", "ID2", "V1",
>> > "V2")])
>> >
>> > However the load time seems very slow (10 minutes+), and I'm wondering
>> > what I've done wrong. I've tested on hosts with SSD.
>> >
>> > I can see a saving in which ID1 becomes part of the partitioning
>> > instead of storing it with the data, but that sounds like a minor change.
>> >
>> > Any thoughts on what I've missed?
>>
