[jira] [Created] (ARROW-7520) Arrow / R - too many batches causes a crash
Christian created ARROW-7520:

             Summary: Arrow / R - too many batches causes a crash
                 Key: ARROW-7520
                 URL: https://issues.apache.org/jira/browse/ARROW-7520
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 0.15.1
         Environment:

- Session info ---
 setting  value
 version  R version 3.6.1 (2019-07-05)
 os       Windows 10 x64
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_United States.1252
 ctype    English_United States.1252
 tz       America/New_York
 date     2020-01-08

- Packages ---
 ! package     * version     date       lib source
   acepack       1.4.1       2016-10-29 [1] CRAN (R 3.6.1)
   arrow       * 0.15.1.1    2019-11-05 [1] CRAN (R 3.6.2)
   askpass       1.1         2019-01-13 [1] CRAN (R 3.6.1)
   assertthat    0.2.1       2019-03-21 [1] CRAN (R 3.6.1)
   backports     1.1.5       2019-10-02 [1] CRAN (R 3.6.1)
   base64enc     0.1-3       2015-07-28 [1] CRAN (R 3.6.0)
   bit           1.1-14      2018-05-29 [1] CRAN (R 3.6.0)
   bit64         0.9-7       2017-05-08 [1] CRAN (R 3.6.0)
   blob          1.2.0       2019-07-09 [1] CRAN (R 3.6.1)
   callr         3.3.1       2019-07-18 [1] CRAN (R 3.6.1)
   cellranger    1.1.0       2016-07-27 [1] CRAN (R 3.6.1)
   checkmate     1.9.4       2019-07-04 [1] CRAN (R 3.6.1)
   cli           1.1.0       2019-03-19 [1] CRAN (R 3.6.1)
   cluster       2.1.0       2019-06-19 [2] CRAN (R 3.6.1)
   codetools     0.2-16      2018-12-24 [2] CRAN (R 3.6.1)
   colorspace    1.4-1       2019-03-18 [1] CRAN (R 3.6.1)
   commonmark    1.7         2018-12-01 [1] CRAN (R 3.6.1)
   crayon        1.3.4       2017-09-16 [1] CRAN (R 3.6.1)
   credentials   1.1         2019-03-12 [1] CRAN (R 3.6.2)
   curl        * 4.2         2019-09-24 [1] CRAN (R 3.6.1)
   data.table    1.12.2      2019-04-07 [1] CRAN (R 3.6.1)
   DBI         * 1.0.0       2018-05-02 [1] CRAN (R 3.6.1)
   desc          1.2.0       2018-05-01 [1] CRAN (R 3.6.1)
   devtools    * 2.2.0       2019-09-07 [1] CRAN (R 3.6.1)
   digest        0.6.23      2019-11-23 [1] CRAN (R 3.6.1)
   dplyr       * 0.8.3       2019-07-04 [1] CRAN (R 3.6.1)
   DT            0.9         2019-09-17 [1] CRAN (R 3.6.1)
   ellipsis      0.3.0       2019-09-20 [1] CRAN (R 3.6.1)
   evaluate      0.14        2019-05-28 [1] CRAN (R 3.6.1)
   foreign       0.8-71      2018-07-20 [2] CRAN (R 3.6.1)
   Formula     * 1.2-3       2018-05-03 [1] CRAN (R 3.6.0)
   fs            1.3.1       2019-05-06 [1] CRAN (R 3.6.1)
   fst         * 0.9.0       2019-04-09 [1] CRAN (R 3.6.1)
   future      * 1.15.0-9000 2019-11-19 [1] Github (HenrikBengtsson/future@bc241c7)
   ggplot2     * 3.2.1       2019-08-10 [1] CRAN (R 3.6.1)
   globals       0.12.4      2018-10-11 [1] CRAN (R 3.6.0)
   glue        * 1.3.1       2019-03-12 [1] CRAN (R 3.6.1)
   gridExtra     2.3         2017-09-09 [1] CRAN (R 3.6.1)
   gt          * 0.1.0       2019-11-27 [1] Github (rstudio/gt@284bbe5)
   gtable        0.3.0       2019-03-25 [1] CRAN (R 3.6.1)
   Hmisc       * 4.3-0       2019-11-07 [1] CRAN (R 3.6.1)
   htmlTable     1.13.2      2019-09-22 [1] CRAN (R 3.6.1)
 D htmltools     0.3.6.9004  2019-09-20 [1] Github (rstudio/htmltools@c49b29c)
   htmlwidgets   1.3         2018-09-30 [1] CRAN (R 3.6.1
Re: Arrow / R - too many batches causes a crash
Can you please open a JIRA issue?

On Wed, Jan 8, 2020 at 12:37 PM Christian Klar wrote:

> Hi,
>
> At the bottom please find the session_info.
>
> When creating north of 200-300 batches, the writing to the arrow file
> crashes R – it doesn’t even show an error message. RStudio just aborts.
>
> I have the feeling that maybe each batch becomes a stream and R has issues
> with the connections, but that’s a total guess.
>
> Any help would be appreciated.
>
> ##
>
> Here is the function. When running it with 3000 it crashes immediately.
>
> Before that I ran it with 100, and then increased it slowly, and then it
> randomly crashed again.
>
> write_arrow_custom(data.frame(A=c(1:10),B=c(1:10)),'C:/Temp/test.arrow',3000)
>
> write_arrow_custom <- function(df,targetarrow,nrbatches) {
>   ct <- nrbatches
>   idxs <- c(0:ct)/ct*nrow(df)
>   idxs <- round(idxs,0) %>% as.integer()
>   idxs[length(idxs)] <- nrow(df)
>   df_nav <- idxs %>% as.data.frame() %>% rename(colfrom=1) %>%
>     mutate(colto=lead(colfrom)) %>% mutate(colfrom=colfrom+1) %>%
>     filter(!is.na(colto)) %>% mutate(R=row_number())
>   stopifnot(df_nav %>% mutate(chk=colto-colfrom+1) %>% '$'('chk') %>%
>     sum()==nrow(df))
>   table_df <- Table$create(name=rownames(df[1,]),df[1,])
>   writer <- RecordBatchFileWriter$create(targetarrow,table_df$schema)
>   df_nav %>% dlply(c('R'),function(df_nav){
>     catl(glue('{df_nav$colfrom[1]}:{df_nav$colto[1]} / {df_nav$R[1]}...'))
>     tmp <- df[df_nav$colfrom[1]:df_nav$colto[1],]
>     writer$write_batch(record_batch(name = rownames(tmp), tmp))
>     NULL
>   }) -> batch_lst
>   writer$close()
>   rm(batch_lst)
>   gc()
> }
>
> ##
>
> - Session info ---
>  setting  value
>  version  R version 3.6.1 (2019-07-05)
>  os       Windows 10 x64
>  system   x86_64, mingw32
>  ui       RStudio
>  language (EN)
>  collate  English_United States.1252
>  ctype    English_United States.1252
>  tz       America/New_York
>  date     2020-01-08
>
> - Packages ---
>  ! package     * version     date       lib source
>    acepack       1.4.1       2016-10-29 [1] CRAN (R 3.6.1)
>    arrow       * 0.15.1.1    2019-11-05 [1] CRAN (R 3.6.2)
>    askpass       1.1         2019-01-13 [1] CRAN (R 3.6.1)
>    assertthat    0.2.1       2019-03-21 [1] CRAN (R 3.6.1)
>    backports     1.1.5       2019-10-02 [1] CRAN (R 3.6.1)
>    base64enc     0.1-3       2015-07-28 [1] CRAN (R 3.6.0)
>    bit           1.1-14      2018-05-29 [1] CRAN (R 3.6.0)
>    bit64         0.9-7       2017-05-08 [1] CRAN (R 3.6.0)
>    blob          1.2.0       2019-07-09 [1] CRAN (R 3.6.1)
>    callr         3.3.1       2019-07-18 [1] CRAN (R 3.6.1)
>    cellranger    1.1.0       2016-07-27 [1] CRAN (R 3.6.1)
>    checkmate     1.9.4       2019-07-04 [1] CRAN (R 3.6.1)
>    cli           1.1.0       2019-03-19 [1] CRAN (R 3.6.1)
>    cluster       2.1.0       2019-06-19 [2] CRAN (R 3.6.1)
>    codetools     0.2-16      2018-12-24 [2] CRAN (R 3.6.1)
>    colorspace    1.4-1       2019-03-18 [1] CRAN (R 3.6.1)
>    commonmark    1.7         2018-12-01 [1] CRAN (R 3.6.1)
>    crayon        1.3.4       2017-09-16 [1] CRAN (R 3.6.1)
>    credentials   1.1         2019-03-12 [1] CRAN (R 3.6.2)
>    curl        * 4.2         2019-09-24 [1] CRAN (R 3.6.1)
>    data.table    1.12.2      2019-04-07 [1] CRAN (R 3.6.1)
>    DBI         * 1.0.0       2018-05-02 [1] CRAN (R 3.6.1)
>    desc          1.2.0       2018-05-01 [1] CRAN (R 3.6.1)
>    devtools    * 2.2.0       2019-09-07 [1] CRAN (R 3.6.1)
>    digest        0.6.23      2019-11-23 [1] CRAN (R 3.6.1)
>    dplyr       * 0.8.3       2019-07-04 [1] CRAN (R 3.6.1)
>    DT            0.9         2019-09-17 [1] CRAN (R 3.6.1)
>    ellipsis      0.3.0       2019-09-20 [1] CRAN (R 3.6.1)
>    evaluate      0.14        2019-05-28 [1] CRAN (R 3.6.1)
>    foreign       0.8-71      2018-07-20 [2] CRAN (R 3.6.1)
>    Formula     * 1.2-3       2018-05-03 [1] CRAN (R 3.6.0)
>    fs            1.3.1       2019-05-06 [1] CRAN (R 3.6.1)
>    fst         * 0.9.0       2019-04-09 [1] CRAN (R 3.6.1)
>    future      * 1.15.0-9000 2019-11-19 [1] Github (HenrikBengtsson/future@bc241c7)
>    ggplot2     * 3.2.1       2019-08-10 [1] CRAN (R 3.6.1)
>    globals       0.12.4      2018-10-11 [1] CRAN (R 3.6.0)
>    glue        * 1.3.1       2019-03-12 [1] CRAN (R 3.6.1)
>    gridExtra     2.3         2017-09-09 [1] CRAN (R 3.6.1)
>
Arrow / R - too many batches causes a crash
Hi,

At the bottom please find the session_info.

When creating north of 200-300 batches, the writing to the arrow file crashes R – it doesn’t even show an error message. RStudio just aborts.

I have the feeling that maybe each batch becomes a stream and R has issues with the connections, but that’s a total guess.

Any help would be appreciated.

##

Here is the function. When running it with 3000 it crashes immediately.

Before that I ran it with 100, and then increased it slowly, and then it randomly crashed again.

write_arrow_custom(data.frame(A=c(1:10),B=c(1:10)),'C:/Temp/test.arrow',3000)

write_arrow_custom <- function(df,targetarrow,nrbatches) {
  ct <- nrbatches
  idxs <- c(0:ct)/ct*nrow(df)
  idxs <- round(idxs,0) %>% as.integer()
  idxs[length(idxs)] <- nrow(df)
  df_nav <- idxs %>% as.data.frame() %>% rename(colfrom=1) %>%
    mutate(colto=lead(colfrom)) %>% mutate(colfrom=colfrom+1) %>%
    filter(!is.na(colto)) %>% mutate(R=row_number())
  stopifnot(df_nav %>% mutate(chk=colto-colfrom+1) %>% '$'('chk') %>%
    sum()==nrow(df))
  table_df <- Table$create(name=rownames(df[1,]),df[1,])
  writer <- RecordBatchFileWriter$create(targetarrow,table_df$schema)
  df_nav %>% dlply(c('R'),function(df_nav){
    catl(glue('{df_nav$colfrom[1]}:{df_nav$colto[1]} / {df_nav$R[1]}...'))
    tmp <- df[df_nav$colfrom[1]:df_nav$colto[1],]
    writer$write_batch(record_batch(name = rownames(tmp), tmp))
    NULL
  }) -> batch_lst
  writer$close()
  rm(batch_lst)
  gc()
}

##

- Session info ---
 setting  value
 version  R version 3.6.1 (2019-07-05)
 os       Windows 10 x64
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_United States.1252
 ctype    English_United States.1252
 tz       America/New_York
 date     2020-01-08

- Packages ---
 ! package     * version     date       lib source
   acepack       1.4.1       2016-10-29 [1] CRAN (R 3.6.1)
   arrow       * 0.15.1.1    2019-11-05 [1] CRAN (R 3.6.2)
   askpass       1.1         2019-01-13 [1] CRAN (R 3.6.1)
   assertthat    0.2.1       2019-03-21 [1] CRAN (R 3.6.1)
   backports     1.1.5       2019-10-02 [1] CRAN (R 3.6.1)
   base64enc     0.1-3       2015-07-28 [1] CRAN (R 3.6.0)
   bit           1.1-14      2018-05-29 [1] CRAN (R 3.6.0)
   bit64         0.9-7       2017-05-08 [1] CRAN (R 3.6.0)
   blob          1.2.0       2019-07-09 [1] CRAN (R 3.6.1)
   callr         3.3.1       2019-07-18 [1] CRAN (R 3.6.1)
   cellranger    1.1.0       2016-07-27 [1] CRAN (R 3.6.1)
   checkmate     1.9.4       2019-07-04 [1] CRAN (R 3.6.1)
   cli           1.1.0       2019-03-19 [1] CRAN (R 3.6.1)
   cluster       2.1.0       2019-06-19 [2] CRAN (R 3.6.1)
   codetools     0.2-16      2018-12-24 [2] CRAN (R 3.6.1)
   colorspace    1.4-1       2019-03-18 [1] CRAN (R 3.6.1)
   commonmark    1.7         2018-12-01 [1] CRAN (R 3.6.1)
   crayon        1.3.4       2017-09-16 [1] CRAN (R 3.6.1)
   credentials   1.1         2019-03-12 [1] CRAN (R 3.6.2)
   curl        * 4.2         2019-09-24 [1] CRAN (R 3.6.1)
   data.table    1.12.2      2019-04-07 [1] CRAN (R 3.6.1)
   DBI         * 1.0.0       2018-05-02 [1] CRAN (R 3.6.1)
   desc          1.2.0       2018-05-01 [1] CRAN (R 3.6.1)
   devtools    * 2.2.0       2019-09-07 [1] CRAN (R 3.6.1)
   digest        0.6.23      2019-11-23 [1] CRAN (R 3.6.1)
   dplyr       * 0.8.3       2019-07-04 [1] CRAN (R 3.6.1)
   DT            0.9         2019-09-17 [1] CRAN (R 3.6.1)
   ellipsis      0.3.0       2019-09-20 [1] CRAN (R 3.6.1)
   evaluate      0.14        2019-05-28 [1] CRAN (R 3.6.1)
   foreign       0.8-71      2018-07-20 [2] CRAN (R 3.6.1)
   Formula     * 1.2-3       2018-05-03 [1] CRAN (R 3.6.0)
   fs            1.3.1       2019-05-06 [1] CRAN (R 3.6.1)
   fst         * 0.9.0       2019-04-09 [1] CRAN (R 3.6.1)
   future      * 1.15.0-9000 2019-11-19 [1] Github (HenrikBengtsson/future@bc241c7)
   ggplot2     * 3.2.1       2019-08-10 [1] CRAN (R 3.6.1)
   globals       0.12.4      2018-10-11 [1] CRAN (R 3.6.0)
   glue        * 1.3.1       2019-03-12 [1] CRAN (R 3.6.1)
   gridExtra     2.3         2017-09-09 [1] CRAN (R 3.6.1)
   gt          * 0.1.0       2019-11-27 [1] Github (rstudio/gt@284bbe5)
   gtable        0.3.0       2019-03-25 [1] CRAN (R 3.6.1)
   Hmisc       * 4.3-0       2019-11-07 [1] CRAN (R 3.6.1)
   htmlTable     1.13.2      2019-09-22 [1] CRAN (R 3.6.1)
 D htmltools     0.3.6.9004  2019-09-20 [1] Github (rstudio/htmltools@c49b29c)
   htmlwidgets   1.3         2018-09-30 [1] CRAN (R 3.6.1)
   jsonlite    * 1.6         2018-12-07 [1] CRAN (R 3.6.1)
   knitr         1.25        2019-09-18 [1] CRAN
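
A side note on the index arithmetic in write_arrow_custom, offered as a sketch and not as a diagnosis of the crash: when nrbatches is larger than nrow(df), as in the 10-row, 3000-batch call, the rounded boundaries repeat, so colfrom can end up greater than colto. In R, `a:b` with a > b counts downward, so `df[colfrom:colto, ]` does not select zero rows even though the stopifnot check counts that range as empty. The helper below, `batch_ranges`, is a hypothetical name that does not appear in the original report. It uses base R only and caps the batch count at the row count, so every range is non-empty.

```r
# Hypothetical helper (not from the original report): split n_rows rows into
# at most n_batches contiguous, non-empty ranges, using base R only.
batch_ranges <- function(n_rows, n_batches) {
  stopifnot(n_rows >= 1, n_batches >= 1)
  # Never ask for more batches than there are rows.
  n_batches <- min(n_batches, n_rows)
  # Upper bound of each range; the last one is always n_rows.
  to <- as.integer(floor(seq_len(n_batches) * n_rows / n_batches))
  # Each range starts one row after the previous range ended.
  from <- c(1L, head(to, -1L) + 1L)
  data.frame(from = from, to = to)
}

ranges <- batch_ranges(10, 3000)
# Every range is non-empty and the ranges cover each row exactly once.
stopifnot(all(ranges$from <= ranges$to))
stopifnot(sum(ranges$to - ranges$from + 1L) == 10L)

# For comparison: a descending range is not empty in R.
length(3:2)  # 2
```

Each row of the result could then drive one writer$write_batch() call in place of df_nav. Whether that changes the crash behaviour has not been tested here.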