[jira] [Created] (ARROW-7520) Arrow / R - too many batches causes a crash

2020-01-08 Thread Christian (Jira)
Christian created ARROW-7520:


 Summary: Arrow / R - too many batches causes a crash
 Key: ARROW-7520
 URL: https://issues.apache.org/jira/browse/ARROW-7520
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 0.15.1
 Environment: - Session info 
---
 setting  value
 version  R version 3.6.1 (2019-07-05)
 os       Windows 10 x64
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_United States.1252
 ctype    English_United States.1252
 tz       America/New_York
 date     2020-01-08

- Packages 
---
 ! package     * version     date       lib source
   acepack       1.4.1       2016-10-29 [1] CRAN (R 3.6.1)
   arrow       * 0.15.1.1    2019-11-05 [1] CRAN (R 3.6.2)
   askpass       1.1         2019-01-13 [1] CRAN (R 3.6.1)
   assertthat    0.2.1       2019-03-21 [1] CRAN (R 3.6.1)
   backports     1.1.5       2019-10-02 [1] CRAN (R 3.6.1)
   base64enc     0.1-3       2015-07-28 [1] CRAN (R 3.6.0)
   bit           1.1-14      2018-05-29 [1] CRAN (R 3.6.0)
   bit64         0.9-7       2017-05-08 [1] CRAN (R 3.6.0)
   blob          1.2.0       2019-07-09 [1] CRAN (R 3.6.1)
   callr         3.3.1       2019-07-18 [1] CRAN (R 3.6.1)
   cellranger    1.1.0       2016-07-27 [1] CRAN (R 3.6.1)
   checkmate     1.9.4       2019-07-04 [1] CRAN (R 3.6.1)
   cli           1.1.0       2019-03-19 [1] CRAN (R 3.6.1)
   cluster       2.1.0       2019-06-19 [2] CRAN (R 3.6.1)
   codetools     0.2-16      2018-12-24 [2] CRAN (R 3.6.1)
   colorspace    1.4-1       2019-03-18 [1] CRAN (R 3.6.1)
   commonmark    1.7         2018-12-01 [1] CRAN (R 3.6.1)
   crayon        1.3.4       2017-09-16 [1] CRAN (R 3.6.1)
   credentials   1.1         2019-03-12 [1] CRAN (R 3.6.2)
   curl        * 4.2         2019-09-24 [1] CRAN (R 3.6.1)
   data.table    1.12.2      2019-04-07 [1] CRAN (R 3.6.1)
   DBI         * 1.0.0       2018-05-02 [1] CRAN (R 3.6.1)
   desc          1.2.0       2018-05-01 [1] CRAN (R 3.6.1)
   devtools    * 2.2.0       2019-09-07 [1] CRAN (R 3.6.1)
   digest        0.6.23      2019-11-23 [1] CRAN (R 3.6.1)
   dplyr       * 0.8.3       2019-07-04 [1] CRAN (R 3.6.1)
   DT            0.9         2019-09-17 [1] CRAN (R 3.6.1)
   ellipsis      0.3.0       2019-09-20 [1] CRAN (R 3.6.1)
   evaluate      0.14        2019-05-28 [1] CRAN (R 3.6.1)
   foreign       0.8-71      2018-07-20 [2] CRAN (R 3.6.1)
   Formula     * 1.2-3       2018-05-03 [1] CRAN (R 3.6.0)
   fs            1.3.1       2019-05-06 [1] CRAN (R 3.6.1)
   fst         * 0.9.0       2019-04-09 [1] CRAN (R 3.6.1)
   future      * 1.15.0-9000 2019-11-19 [1] Github (HenrikBengtsson/future@bc241c7)
   ggplot2     * 3.2.1       2019-08-10 [1] CRAN (R 3.6.1)
   globals       0.12.4      2018-10-11 [1] CRAN (R 3.6.0)
   glue        * 1.3.1       2019-03-12 [1] CRAN (R 3.6.1)
   gridExtra     2.3         2017-09-09 [1] CRAN (R 3.6.1)
   gt          * 0.1.0       2019-11-27 [1] Github (rstudio/gt@284bbe5)
   gtable        0.3.0       2019-03-25 [1] CRAN (R 3.6.1)
   Hmisc       * 4.3-0       2019-11-07 [1] CRAN (R 3.6.1)
   htmlTable     1.13.2      2019-09-22 [1] CRAN (R 3.6.1)
 D htmltools     0.3.6.9004  2019-09-20 [1] Github (rstudio/htmltools@c49b29c)
   htmlwidgets   1.3         2018-09-30 [1] CRAN (R 3.6.1

Re: Arrow / R - too many batches causes a crash

2020-01-08 Thread Wes McKinney
Can you please open a JIRA issue?

On Wed, Jan 8, 2020 at 12:37 PM Christian Klar wrote:

> Hi,
>
> At the bottom please find the session_info.
>
> When creating north of 200-300 batches, the writing to the arrow file
> crashes R – it doesn’t even show an error message. Rstudio just aborts.
>
> I have the feeling that maybe each batch becomes a stream and R has issues
> with the connections, but that’s a total guess.
>
> Any help would be appreciated.
>

Arrow / R - too many batches causes a crash

2020-01-08 Thread Christian Klar
Hi,

At the bottom please find the session_info.

When I write more than roughly 200-300 batches, writing to the Arrow file 
crashes R without showing any error message. RStudio just aborts.

I have the feeling that maybe each batch becomes a stream and R has issues with 
the connections, but that’s a total guess.

Any help would be appreciated.

##

Here is the function. Running it with 3000 batches crashes immediately.

Before that I ran it with 100 batches and then increased the count slowly, and 
at some point it crashed again, seemingly at random.

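# Needs arrow, dplyr, glue, and plyr (for dlply()). catl() is not defined in
# this message; it is presumably a small printing helper.
write_arrow_custom <- function(df, targetarrow, nrbatches) {
  ct <- nrbatches
  # Row boundaries that cut df into ct slices
  idxs <- c(0:ct) / ct * nrow(df)
  idxs <- round(idxs, 0) %>% as.integer()
  idxs[length(idxs)] <- nrow(df)
  # One row per batch: first row (colfrom), last row (colto), batch number (R)
  df_nav <- idxs %>%
    as.data.frame() %>%
    rename(colfrom = 1) %>%
    mutate(colto = lead(colfrom)) %>%
    mutate(colfrom = colfrom + 1) %>%
    filter(!is.na(colto)) %>%
    mutate(R = row_number())
  # Sanity check: the slice lengths must add up to nrow(df)
  stopifnot(df_nav %>% mutate(chk = colto - colfrom + 1) %>% '$'('chk') %>%
    sum() == nrow(df))
  # Take the schema from the first row, then write one record batch per slice
  table_df <- Table$create(name = rownames(df[1, ]), df[1, ])
  writer <- RecordBatchFileWriter$create(targetarrow, table_df$schema)
  df_nav %>% dlply(c('R'), function(df_nav) {
    catl(glue('{df_nav$colfrom[1]}:{df_nav$colto[1]} / {df_nav$R[1]}...'))
    tmp <- df[df_nav$colfrom[1]:df_nav$colto[1], ]
    writer$write_batch(record_batch(name = rownames(tmp), tmp))
    NULL
  }) -> batch_lst
  writer$close()
  rm(batch_lst)
  gc()
}

# Call that crashes immediately
write_arrow_custom(data.frame(A = c(1:10), B = c(1:10)), 'C:/Temp/test.arrow', 3000)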
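
A side note on the reproduction above, as a sketch rather than a diagnosis: with a 10-row data frame and 3000 requested batches, most of the computed slices end up with colto one less than colfrom. The snippet below recomputes the same boundaries in base R only (no dplyr, no arrow) and counts those slices. batch_ranges is a hypothetical helper name used for illustration; it is not part of the original function or of the arrow package. Whether these slices are related to the crash is not established in this thread.

```r
# Recompute the slice boundaries the same way write_arrow_custom() does,
# using base R only. batch_ranges() is a hypothetical helper for illustration.
batch_ranges <- function(n_rows, n_batches) {
  idxs <- as.integer(round((0:n_batches) / n_batches * n_rows, 0))
  idxs[length(idxs)] <- as.integer(n_rows)
  data.frame(
    colfrom = idxs[-length(idxs)] + 1L,  # first row of each slice
    colto   = idxs[-1L]                  # last row of each slice
  )
}

ranges <- batch_ranges(n_rows = 10, n_batches = 3000)

# One range per requested batch
print(nrow(ranges))

# Ranges where colfrom:colto would run backwards (colto < colfrom)
print(sum(ranges$colto < ranges$colfrom))

# The stopifnot() check in the original function still passes,
# because the signed slice lengths add up to the number of rows
print(sum(ranges$colto - ranges$colfrom + 1L) == 10L)
```

In R, an expression such as 4:3 is the descending sequence c(4, 3), so df[4:3, ] returns rows rather than an empty slice. The batches written for these ranges are therefore small but not empty.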


##



- Session info 
---
 setting  value
 version  R version 3.6.1 (2019-07-05)
 os       Windows 10 x64
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_United States.1252
 ctype    English_United States.1252
 tz       America/New_York
 date     2020-01-08

- Packages 
---
 ! package     * version     date       lib source
   acepack       1.4.1       2016-10-29 [1] CRAN (R 3.6.1)
   arrow       * 0.15.1.1    2019-11-05 [1] CRAN (R 3.6.2)
   askpass       1.1         2019-01-13 [1] CRAN (R 3.6.1)
   assertthat    0.2.1       2019-03-21 [1] CRAN (R 3.6.1)
   backports     1.1.5       2019-10-02 [1] CRAN (R 3.6.1)
   base64enc     0.1-3       2015-07-28 [1] CRAN (R 3.6.0)
   bit           1.1-14      2018-05-29 [1] CRAN (R 3.6.0)
   bit64         0.9-7       2017-05-08 [1] CRAN (R 3.6.0)
   blob          1.2.0       2019-07-09 [1] CRAN (R 3.6.1)
   callr         3.3.1       2019-07-18 [1] CRAN (R 3.6.1)
   cellranger    1.1.0       2016-07-27 [1] CRAN (R 3.6.1)
   checkmate     1.9.4       2019-07-04 [1] CRAN (R 3.6.1)
   cli           1.1.0       2019-03-19 [1] CRAN (R 3.6.1)
   cluster       2.1.0       2019-06-19 [2] CRAN (R 3.6.1)
   codetools     0.2-16      2018-12-24 [2] CRAN (R 3.6.1)
   colorspace    1.4-1       2019-03-18 [1] CRAN (R 3.6.1)
   commonmark    1.7         2018-12-01 [1] CRAN (R 3.6.1)
   crayon        1.3.4       2017-09-16 [1] CRAN (R 3.6.1)
   credentials   1.1         2019-03-12 [1] CRAN (R 3.6.2)
   curl        * 4.2         2019-09-24 [1] CRAN (R 3.6.1)
   data.table    1.12.2      2019-04-07 [1] CRAN (R 3.6.1)
   DBI         * 1.0.0       2018-05-02 [1] CRAN (R 3.6.1)
   desc          1.2.0       2018-05-01 [1] CRAN (R 3.6.1)
   devtools    * 2.2.0       2019-09-07 [1] CRAN (R 3.6.1)
   digest        0.6.23      2019-11-23 [1] CRAN (R 3.6.1)
   dplyr       * 0.8.3       2019-07-04 [1] CRAN (R 3.6.1)
   DT            0.9         2019-09-17 [1] CRAN (R 3.6.1)
   ellipsis      0.3.0       2019-09-20 [1] CRAN (R 3.6.1)
   evaluate      0.14        2019-05-28 [1] CRAN (R 3.6.1)
   foreign       0.8-71      2018-07-20 [2] CRAN (R 3.6.1)
   Formula     * 1.2-3       2018-05-03 [1] CRAN (R 3.6.0)
   fs            1.3.1       2019-05-06 [1] CRAN (R 3.6.1)
   fst         * 0.9.0       2019-04-09 [1] CRAN (R 3.6.1)
   future      * 1.15.0-9000 2019-11-19 [1] Github (HenrikBengtsson/future@bc241c7)
   ggplot2     * 3.2.1       2019-08-10 [1] CRAN (R 3.6.1)
   globals       0.12.4      2018-10-11 [1] CRAN (R 3.6.0)
   glue        * 1.3.1       2019-03-12 [1] CRAN (R 3.6.1)
   gridExtra     2.3         2017-09-09 [1] CRAN (R 3.6.1)
   gt          * 0.1.0       2019-11-27 [1] Github (rstudio/gt@284bbe5)
   gtable        0.3.0       2019-03-25 [1] CRAN (R 3.6.1)
   Hmisc       * 4.3-0       2019-11-07 [1] CRAN (R 3.6.1)
   htmlTable     1.13.2      2019-09-22 [1] CRAN (R 3.6.1)
 D htmltools     0.3.6.9004  2019-09-20 [1] Github (rstudio/htmltools@c49b29c)
   htmlwidgets   1.3         2018-09-30 [1] CRAN (R 3.6.1)
   jsonlite    * 1.6         2018-12-07 [1] CRAN (R 3.6.1)
   knitr         1.25        2019-09-18 [1] CRAN