RE: spark.lapply

2018-09-27 Thread Junior Alvarez
Around 500 KB each time I call the function (~150 times).




Re: spark.lapply

2018-09-26 Thread Felix Cheung
It looks like the native R process was terminated by a buffer overflow. Do you
know how much data is involved?
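
A quick way to estimate that on the driver is to measure the serialized size of
whatever gets shipped per call. The sketch below does that with base R's
serialize(); 'inputData' is a hypothetical stand-in for the object the
spark.lapply closure captures, not a name from this thread:

# Measure how many bytes R would serialize for the shipped object.
# 'inputData' is a hypothetical stand-in.
sizeBytes <- length(serialize(inputData, NULL))
sprintf("~%.1f KB serialized per call", sizeBytes / 1024)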






spark.lapply

2018-09-26 Thread Junior Alvarez
Hi!

I'm using spark.lapply() in SparkR on a Mesos service, and I get the following
crash randomly (the spark.lapply() function is called around 150 times;
sometimes it crashes after 16 calls, other times after 25, and so on. It is
completely random, even though the data used in the actual call is the same
all 150 times):

...

18/09/26 07:30:42 INFO TaskSetManager: Finished task 129.0 in stage 78.0 (TID 1192) in 98 ms on 10.255.0.18 (executor 0) (121/143)
18/09/26 07:30:42 WARN TaskSetManager: Lost task 128.0 in stage 78.0 (TID 1191, 10.255.0.18, executor 0): org.apache.spark.SparkException: R computation failed with

7f327f4dd000-7f327f500000 r-xp 00000000 08:11 174916727   /lib/x86_64-linux-gnu/ld-2.19.so
7f327f51c000-7f327f6f2000 rw-p 00000000 00:00 0
7f327f6fc000-7f327f6fd000 rw-p 00000000 00:00 0
7f327f6fd000-7f327f6ff000 rw-p 00000000 00:00 0
7f327f6ff000-7f327f700000 r--p 00022000 08:11 174916727   /lib/x86_64-linux-gnu/ld-2.19.so
7f327f700000-7f327f701000 rw-p 00023000 08:11 174916727   /lib/x86_64-linux-gnu/ld-2.19.so
7f327f701000-7f327f702000 rw-p 00000000 00:00 0
7fff6070f000-7fff60767000 rw-p 00000000 00:00 0   [stack]
7fff6077f000-7fff60781000 r-xp 00000000 00:00 0   [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0   [vsyscall]

*** buffer overflow detected ***: /usr/local/lib/R/bin/exec/R terminated

======= Backtrace: =========

/lib/x86_64-linux-gnu/libc.so.6(+0x7329f)[0x7f327db9529f]
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7f327dc3087c]
/lib/x86_64-linux-gnu/libc.so.6(+0x10d750)[0x7f327dc2f750]

...

If I use the native R lapply() instead, everything of course works fine.

I wonder if this is a known issue, and/or whether there is a way to avoid it
when using SparkR.
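
For reference, the calling pattern is roughly the sketch below; the master URL
and the doWork(), inputList and inputData names are hypothetical stand-ins, not
the actual service code:

library(SparkR)
sparkR.session(master = "mesos://leader.mesos:5050")  # assumed Mesos master URL
for (i in seq_len(150)) {
   # the same input data is captured by the closure on every call
   results <- spark.lapply(inputList, function(x) doWork(x, inputData))
}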

B r
/Junior



Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-25 Thread Felix Cheung
Your second example works because of closure-capture behavior. It should be OK
for a small amount of data.
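
A minimal sketch of that capture behavior, reusing the thread's own pseudocode
names (score(), dat, parameterList):

dat <- read.csv("file.csv")        # lives in the driver's R session
scoreModel <- function(parameters) {
   # 'dat' is a free variable here, so SparkR serializes it together with
   # the closure and ships a copy to the executors
   score(dat, parameters)
}
modelScores <- spark.lapply(parameterList, scoreModel)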

You could also use SparkR:::broadcast, but please keep in mind that it is not a
public API we actively support.
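
A hedged, untested sketch of how that internal API could be combined with the
thread's pseudocode; SparkR:::value is assumed here to be the non-exported
accessor for Broadcast objects:

sc <- SparkR:::getSparkContext()
bcDat <- SparkR:::broadcast(sc, dat)   # ship dat to the executors once
scoreModel <- function(parameters) {
   d <- SparkR:::value(bcDat)          # look up the broadcast value on the worker
   score(d, parameters)
}
modelScores <- spark.lapply(parameterList, scoreModel)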

Thank you for the information on the formula; I will test that out. Please note
that SparkR code is now at

https://github.com/apache/spark/tree/master/R

RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-25 Thread Cinquegrana, Piero
I tested both local and cluster mode, and '<<-' seemed to work, at least for
small data. Or am I missing something? Is there a way for me to test? If that
does not work, can I use something like this?

sc <- SparkR:::getSparkContext()
bcStack <- SparkR:::broadcast(sc, stack)

I realized that the error (Error in writeBin(batch, con, endian = "big")) was
due to an object within the 'parameters' list that was an R formula.

When spark.lapply calls the parallelize method, it splits the list and calls
SparkR:::writeRaw, which tries to serialize the formula to binary, exploding
the size of the object being passed.

https://github.com/amplab-extras/SparkR-pkg/blob/master/pkg/R/serialize.R
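
The formula blow-up is easy to reproduce in plain R: a formula keeps a reference
to the environment it was created in, and serialize() drags that whole
environment along. A small illustration (all names made up):

makeParams <- function() {
   bigObject <- rnorm(1e7)     # ~80 MB vector sitting in the enclosing frame
   y ~ x1 + x2                 # the formula's environment is this frame
}
f <- makeParams()
length(serialize(f, NULL))     # tens of megabytes, not just the formula text
environment(f) <- globalenv()  # dropping the captured frame avoids the blow-up
length(serialize(f, NULL))     # now only a few hundred bytes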


Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-25 Thread Felix Cheung
Hmm, <<- wouldn't work in cluster mode. Are you running Spark in local mode?

In any case, I tried running your earlier code and it worked for me on a
250 MB CSV:

scoreModel <- function(parameters){
   library(data.table)  # I assume this should be data.table, since fread() comes from it
   dat <- data.frame(fread("file.csv"))
   score(dat, parameters)
}
parameterList <- lapply(1:100, function(i) getParameters(i))
modelScores <- spark.lapply(parameterList, scoreModel)

Could you provide more information on your actual code?


RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-24 Thread Cinquegrana, Piero
Hi Spark experts,

I was able to get around the broadcast issue by using a global assignment '<<-' 
instead of reading the data locally. However, I still get the following error:

Error in writeBin(batch, con, endian = "big") :
  attempting to add too many elements to raw vector


Pseudo code:

scoreModel <- function(parameters){
   library(score)
   score(dat, parameters)
}

dat <<- read.csv('file.csv')
modelScores <- spark.lapply(parameterList, scoreModel)


RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-23 Thread Cinquegrana, Piero
The output from score() is very small, just a float. The input, however, could
be as big as several hundred MB. I would like to broadcast the dataset to all
executors.

Thanks,
Piero




Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-22 Thread Felix Cheung
How big is the output from score()?

Also could you elaborate on what you want to broadcast?








spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-22 Thread Cinquegrana, Piero
Hello,

I am using the new R API in SparkR, spark.lapply (Spark 2.0). I am defining a
complex function to be run across executors, and I have to send the entire
dataset, but there is no way (that I could find) to broadcast a variable in
SparkR. I am thus reading the dataset from disk in each executor, but I get
the following error:

Error in writeBin(batch, con, endian = "big") :
  attempting to add too many elements to raw vector

Any idea why this is happening?

Pseudo code:

scoreModel <- function(parameters){
   library(read.table)
   dat <- data.frame(fread("file.csv"))
   score(dat, parameters)
}

parameterList <- lapply(1:numModels, function(i) getParameters(i))

modelScores <- spark.lapply(parameterList, scoreModel)


Piero Cinquegrana
MarketShare: A Neustar Solution / Data Science
Mobile: +39.329.17.62.539 / www.neustar.biz


