[ https://issues.apache.org/jira/browse/ARROW-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson resolved ARROW-7520.
------------------------------------
    Fix Version/s: 0.17.0
         Assignee: Neal Richardson
       Resolution: Fixed

This has been addressed in ARROW-5501; RecordBatch*Writer now requires that you
pass an {{OutputStream}} so that you can manage the file connection. The
previously supported behavior would let you open connections you couldn't close.

> [R] Writing many batches causes a crash
> ---------------------------------------
>
>                 Key: ARROW-7520
>                 URL: https://issues.apache.org/jira/browse/ARROW-7520
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 0.15.1
>         Environment: - Session info ------------------------------------------
>  setting  value
>  version  R version 3.6.1 (2019-07-05)
>  os       Windows 10 x64
>  system   x86_64, mingw32
>  ui       RStudio
>  language (EN)
>  collate  English_United States.1252
>  ctype    English_United States.1252
>  tz       America/New_York
>  date     2020-01-08
>
> - Packages ------------------------------------------------------------------
> ! package      * version     date       lib source
>   acepack        1.4.1       2016-10-29 [1] CRAN (R 3.6.1)
>   arrow        * 0.15.1.1    2019-11-05 [1] CRAN (R 3.6.2)
>   askpass        1.1         2019-01-13 [1] CRAN (R 3.6.1)
>   assertthat     0.2.1       2019-03-21 [1] CRAN (R 3.6.1)
>   backports      1.1.5       2019-10-02 [1] CRAN (R 3.6.1)
>   base64enc      0.1-3       2015-07-28 [1] CRAN (R 3.6.0)
>   bit            1.1-14      2018-05-29 [1] CRAN (R 3.6.0)
>   bit64          0.9-7       2017-05-08 [1] CRAN (R 3.6.0)
>   blob           1.2.0       2019-07-09 [1] CRAN (R 3.6.1)
>   callr          3.3.1       2019-07-18 [1] CRAN (R 3.6.1)
>   cellranger     1.1.0       2016-07-27 [1] CRAN (R 3.6.1)
>   checkmate      1.9.4       2019-07-04 [1] CRAN (R 3.6.1)
>   cli            1.1.0       2019-03-19 [1] CRAN (R 3.6.1)
>   cluster        2.1.0       2019-06-19 [2] CRAN (R 3.6.1)
>   codetools      0.2-16      2018-12-24 [2] CRAN (R 3.6.1)
>   colorspace     1.4-1       2019-03-18 [1] CRAN (R 3.6.1)
>   commonmark     1.7         2018-12-01 [1] CRAN (R 3.6.1)
>   crayon         1.3.4       2017-09-16 [1] CRAN (R 3.6.1)
>   credentials    1.1         2019-03-12 [1] CRAN (R 3.6.2)
>   curl         * 4.2         2019-09-24 [1] CRAN (R 3.6.1)
>   data.table     1.12.2      2019-04-07 [1] CRAN (R 3.6.1)
>   DBI          * 1.0.0       2018-05-02 [1] CRAN (R 3.6.1)
>   desc           1.2.0       2018-05-01 [1] CRAN (R 3.6.1)
>   devtools     * 2.2.0       2019-09-07 [1] CRAN (R 3.6.1)
>   digest         0.6.23      2019-11-23 [1] CRAN (R 3.6.1)
>   dplyr        * 0.8.3       2019-07-04 [1] CRAN (R 3.6.1)
>   DT             0.9         2019-09-17 [1] CRAN (R 3.6.1)
>   ellipsis       0.3.0       2019-09-20 [1] CRAN (R 3.6.1)
>   evaluate       0.14        2019-05-28 [1] CRAN (R 3.6.1)
>   foreign        0.8-71      2018-07-20 [2] CRAN (R 3.6.1)
>   Formula      * 1.2-3       2018-05-03 [1] CRAN (R 3.6.0)
>   fs             1.3.1       2019-05-06 [1] CRAN (R 3.6.1)
>   fst          * 0.9.0       2019-04-09 [1] CRAN (R 3.6.1)
>   future       * 1.15.0-9000 2019-11-19 [1] Github (HenrikBengtsson/future@bc241c7)
>   ggplot2      * 3.2.1       2019-08-10 [1] CRAN (R 3.6.1)
>   globals        0.12.4      2018-10-11 [1] CRAN (R 3.6.0)
>   glue         * 1.3.1       2019-03-12 [1] CRAN (R 3.6.1)
>   gridExtra      2.3         2017-09-09 [1] CRAN (R 3.6.1)
>   gt           * 0.1.0       2019-11-27 [1] Github (rstudio/gt@284bbe5)
>   gtable         0.3.0       2019-03-25 [1] CRAN (R 3.6.1)
>   Hmisc        * 4.3-0       2019-11-07 [1] CRAN (R 3.6.1)
>   htmlTable      1.13.2      2019-09-22 [1] CRAN (R 3.6.1)
> D htmltools      0.3.6.9004  2019-09-20 [1] Github (rstudio/htmltools@c49b29c)
>   htmlwidgets    1.3         2018-09-30 [1] CRAN (R 3.6.1)
>   jsonlite     * 1.6         2018-12-07 [1] CRAN (R 3.6.1)
>   knitr          1.25        2019-09-18 [1] CRAN (R 3.6.1)
>   lattice      * 0.20-38     2018-11-04 [2] CRAN (R 3.6.1)
>   latticeExtra   0.6-28      2016-02-09 [1] CRAN (R 3.6.1)
>   lazyeval       0.2.2       2019-03-15 [1] CRAN (R 3.6.1)
>   lifecycle      0.1.0       2019-08-01 [1] CRAN (R 3.6.1)
>   listenv        0.7.0       2018-01-21 [1] CRAN (R 3.6.1)
>   lubridate    * 1.7.4       2018-04-11 [1] CRAN (R 3.6.1)
>   magrittr     * 1.5         2014-11-22 [1] CRAN (R 3.6.1)
>   Matrix         1.2-17      2019-03-22 [2] CRAN (R 3.6.1)
>   memoise        1.1.0       2017-04-21 [1] CRAN (R 3.6.1)
>   munsell        0.5.0       2018-06-12 [1] CRAN (R 3.6.1)
>   nnet           7.3-12      2016-02-02 [2] CRAN (R 3.6.1)
>   openssl        1.4.1       2019-07-18 [1] CRAN (R 3.6.1)
>   outliers     * 0.14        2011-01-24 [1] CRAN (R 3.6.0)
>   pillar         1.4.2       2019-06-29 [1] CRAN (R 3.6.1)
>   pkgbuild       1.0.5       2019-08-26 [1] CRAN (R 3.6.1)
>   pkgconfig      2.0.2       2018-08-16 [1] CRAN (R 3.6.1)
>   pkgload        1.0.2       2018-10-29 [1] CRAN (R 3.6.1)
>   plyr         * 1.8.4       2016-06-08 [1] CRAN (R 3.6.1)
>   prettyunits    1.0.2       2015-07-13 [1] CRAN (R 3.6.1)
>   processx       3.4.1       2019-07-18 [1] CRAN (R 3.6.1)
>   pryr         * 0.1.4       2018-02-18 [1] CRAN (R 3.6.1)
>   ps             1.3.0       2018-12-21 [1] CRAN (R 3.6.1)
>   purrr        * 0.3.2       2019-03-15 [1] CRAN (R 3.6.1)
>   R6           * 2.4.1       2019-11-12 [1] CRAN (R 3.6.1)
>   RColorBrewer   1.1-2       2014-12-07 [1] CRAN (R 3.6.0)
>   Rcpp           1.0.3       2019-11-08 [1] CRAN (R 3.6.1)
>   readxl       * 1.3.1       2019-03-13 [1] CRAN (R 3.6.1)
>   remotes        2.1.0       2019-06-24 [1] CRAN (R 3.6.1)
>   rlang        * 0.4.2       2019-11-23 [1] CRAN (R 3.6.1)
>   rmarkdown    * 2.0.3       2019-12-19 [1] Github (rstudio/rmarkdown@26cc3b1)
>   RODBC        * 1.3-16      2019-09-03 [1] CRAN (R 3.6.1)
>   roxygen2     * 6.1.1       2018-11-07 [1] CRAN (R 3.6.1)
>   rpart          4.1-15      2019-04-12 [2] CRAN (R 3.6.1)
>   rprojroot      1.3-2       2018-01-03 [1] CRAN (R 3.6.1)
>   RSQLite      * 2.1.2       2019-07-24 [1] CRAN (R 3.6.1)
>   rstudioapi     0.10        2019-03-19 [1] CRAN (R 3.6.1)
>   scales         1.0.0       2018-08-09 [1] CRAN (R 3.6.1)
>   sessioninfo    1.1.1       2018-11-05 [1] CRAN (R 3.6.1)
>   slide        * 0.0.0.9002  2019-11-27 [1] Github (DavisVaughan/slide@92e8e02)
>   ssh            0.6         2019-04-09 [1] CRAN (R 3.6.2)
>   stringi        1.4.3       2019-03-12 [1] CRAN (R 3.6.0)
>   stringr      * 1.4.0       2019-02-10 [1] CRAN (R 3.6.1)
>   survival     * 2.44-1.1    2019-04-01 [2] CRAN (R 3.6.1)
>   testthat       2.2.1       2019-07-25 [1] CRAN (R 3.6.1)
>   tibble         2.1.3       2019-06-06 [1] CRAN (R 3.6.1)
>   tidyr        * 1.0.0       2019-09-11 [1] CRAN (R 3.6.1)
>   tidyselect     0.2.5       2018-10-11 [1] CRAN (R 3.6.1)
>   usethis      * 1.5.1       2019-07-04 [1] CRAN (R 3.6.1)
>   varhandle    * 2.0.3       2018-07-04 [1] CRAN (R 3.6.0)
>   vctrs          0.2.0.9007  2019-11-27 [1] Github (r-lib/vctrs@945809e)
>   withr          2.1.2       2018-03-15 [1] CRAN (R 3.6.1)
>   xfun           0.9         2019-08-21 [1] CRAN (R 3.6.1)
>   xml2         * 1.2.2       2019-08-09 [1] CRAN (R 3.6.1)
>   xts          * 0.11-2      2018-11-05 [1] CRAN (R 3.6.1)
>   zoo          * 1.8-6       2019-05-28 [1] CRAN (R 3.6.1)
>
> [1] C:/Users/cklar/Desktop/R packages
> [2] C:/Program Files/R/R-3.6.1/library
>
> P -- Loaded and on-disk path mismatch.
> D -- DLL MD5 mismatch, broken installation.
>            Reporter: Christian
>            Assignee: Neal Richardson
>            Priority: Trivial
>             Fix For: 0.17.0
>
>
> Hi,
> When creating north of 200-300 batches, the writing to the arrow file crashes
> R - it doesn't even show an error message. Rstudio just aborts.
> I have the feeling that maybe each batch becomes a stream and R has issues
> with the connections, but that's a total guess.
> Any help would be appreciated.
>
> ##
>
> Here is the function. When running it with 3000 it crashes immediately.
> Before that I ran it with 100, and then increased it slowly, and then it
> randomly crashed again.
>
> ##
> Now I received this error message after writing 30 batches.
> Error in ipc___RecordBatchWriter__WriteRecordBatch(self, batch) :
> Invalid: Invalid operation on closed file
> Error in ipc___RecordBatchWriter__WriteRecordBatch(self, batch) :
> Invalid: Invalid operation on closed file
> ##
> write_arrow_custom(data.frame(A=c(1:100000),B=c(1:100000)),'C:/Temp/test.arrow',3000)
>
> write_arrow_custom <- function(df,targetarrow,nrbatches) {
> ct <- nrbatches
> idxs <- c(0:ct)/ct*nrow(df)
> idxs <- round(idxs,0) %>% as.integer()
> idxs[length(idxs)] <- nrow(df)
> df_nav <- idxs %>% as.data.frame() %>% rename(colfrom=1) %>%
> mutate(colto=lead(colfrom)) %>% mutate(colfrom=colfrom+1) %>%
> filter(!is.na(colto)) %>% mutate(R=row_number())
> stopifnot(df_nav %>% mutate(chk=colto-colfrom+1) %>% '$'('chk') %>%
> sum()==nrow(df))
> table_df <- Table$create(name=rownames(df[1,]),df[1,])
> writer <- RecordBatchFileWriter$create(targetarrow,table_df$schema)
> df_nav %>% dlply(c('R'),function(df_nav)
> { catl(glue('\\{df_nav$colfrom[1]}
> :{df_nav$colto[1]} / {df_nav$R[1]}...'))
> tmp <- df[df_nav$colfrom[1]:df_nav$colto[1],]
> writer$write_batch(record_batch(name = rownames(tmp), tmp))
> NULL
> })  -> batch_lst
> writer$close()
> rm(batch_lst)
> gc()
> }
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
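
For readers following the resolution note: it says that RecordBatch*Writer now takes an OutputStream so that the caller owns the file connection. A minimal sketch of that pattern is below. It assumes arrow 0.17.0 or later; `batch_ranges()` and `write_in_batches()` are hypothetical helper names made up for this illustration (they are not part of arrow and not the reporter's function), and the output goes to a temporary file rather than the path used in the report.

```r
library(arrow)

# Hypothetical helper (not part of arrow): split rows 1..n_rows into
# n_batches contiguous, non-empty ranges.
batch_ranges <- function(n_rows, n_batches) {
  stopifnot(n_batches >= 1, n_batches <= n_rows)
  # Integer division keeps the cut points strictly increasing and makes the
  # last one equal to n_rows.
  cuts <- (seq(0, n_batches) * as.numeric(n_rows)) %/% n_batches
  data.frame(from = cuts[-length(cuts)] + 1, to = cuts[-1])
}

# Hypothetical helper: write `df` to `path` as `n_batches` record batches.
write_in_batches <- function(df, path, n_batches) {
  ranges <- batch_ranges(nrow(df), n_batches)

  # Open the file connection explicitly so that it can be closed explicitly.
  sink <- FileOutputStream$create(path)
  on.exit(sink$close(), add = TRUE)

  # Take the schema from the first chunk; every later batch must match it.
  first <- record_batch(df[ranges$from[1]:ranges$to[1], , drop = FALSE])
  writer <- RecordBatchFileWriter$create(sink, first$schema)

  for (i in seq_len(nrow(ranges))) {
    chunk <- df[ranges$from[i]:ranges$to[i], , drop = FALSE]
    writer$write_batch(record_batch(chunk))
  }

  # Closing the writer marks end-of-file; the sink itself is closed by the
  # on.exit() handler above when the function returns.
  writer$close()
  invisible(path)
}

# Usage: same data shape and batch count as in the report.
df <- data.frame(A = 1:100000, B = 1:100000)
out <- write_in_batches(df, tempfile(fileext = ".arrow"), 3000)

# Read back the number of batches that ended up in the file.
infile <- ReadableFile$create(out)
reader <- RecordBatchFileReader$create(infile)
n_written <- reader$num_record_batches
infile$close()
```

Registering `sink$close()` with `on.exit()` releases the file handle even if a write fails partway through, which is the case the original behavior could not handle.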