The documentation for the NumPy dependency for MLlib seems somewhat vague [1].
Is NumPy only a dependency on the driver node, or must it also be installed on
every worker node?
Thanks,
Alek
[1] -- http://spark.apache.org/docs/latest/mllib-guide.html#dependencies
Hi Serge,
The broadcast function was made private when SparkR merged into Apache
Spark for the 1.4.0 release. You can still use broadcast by specifying the
private namespace though.
SparkR:::broadcast(sc, obj)
The RDD methods were considered very low-level, and the SparkR devs are still
deciding which of them to make public. Besides that, they will still be much
slower than the Scala ones (because Python is slower, and there is overhead in
calling into Python).
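As a rough, non-Spark illustration of that calling overhead (a sketch using the standard pickle module; the names here are my own, not Spark's): every batch of data that crosses the JVM/Python boundary pays a serialize/deserialize cost on top of the actual work.

```python
import pickle

def double(x):
    return x * 2

# Conceptually: data is serialized on the JVM side, deserialized in the
# Python worker, processed, then serialized back again.
outbound = pickle.dumps([1, 2, 3])                   # JVM -> Python
values = pickle.loads(outbound)
inbound = pickle.dumps([double(v) for v in values])  # Python -> JVM
result = pickle.loads(inbound)
print(result)  # [2, 4, 6]
```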
On Mon, Jul 6, 2015 at 12:55 PM, Eskilson,Aleksander
alek.eskil...@cerner.com wrote:
Hi there,
I’m trying to get a feel for how User Defined Functions from SparkSQL (as
written in Python and registered using the udf function from
pyspark.sql.functions) are run behind the scenes. Trying to grok the source, it
seems that the native Python function is serialized for distribution.

with a file with all columns as String, but the real data I want to process
are all doubles. I'm just exploring what SparkR can do versus regular Scala
Spark, as I am at heart an R person.
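On how a Python UDF gets to the workers: a plausible sketch of the general mechanism using the standard pickle module (PySpark actually uses a cloudpickle variant that ships the function's bytecode, not just a name reference):

```python
import pickle

def celsius_to_fahrenheit(c):
    return c * 9.0 / 5.0 + 32.0

# Plain pickle records a top-level function by qualified name; the
# cloudpickle variant used by Spark serializes the code object itself, so
# workers can run functions they have never imported.
blob = pickle.dumps(celsius_to_fahrenheit)
restored = pickle.loads(blob)
print(restored(100.0))  # 212.0
```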
2015-06-25 14:26 GMT-07:00 Eskilson,Aleksander
alek.eskil...@cerner.com:
Sure, I
Hi there,
Parallelize is part of the RDD API which was made private for Spark v.
1.4.0. Some functions in the RDD API were considered too low-level to
expose, so only most of the DataFrame API is currently public. The
original rationale for this decision can be found on the issue's JIRA [1].
The simple answer is that SparkR does support map/reduce operations over RDD’s
through the RDD API, but since Spark v 1.4.0, those functions were made private
in SparkR. They can still be accessed by prepending the function with the
namespace, like SparkR:::lapply(rdd, func). It was thought that these
functions were too low-level to expose publicly.
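For intuition, `SparkR:::lapply(rdd, func)` has the same shape as R's base `lapply`: apply `func` to every element. A tiny plain-Python sketch of that shape (no Spark involved, hypothetical data):

```python
def lapply(elements, func):
    # Apply func to each element, like R's lapply / Spark's map.
    return [func(x) for x in elements]

squares = lapply([1, 2, 3], lambda x: x * x)
print(squares)  # [1, 4, 9]
```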
Hi there,
The tutorial you’re reading there was written before the merge of SparkR for
Spark 1.4.0.
For the merge, the RDD API (which includes the textFile() function) was made
private, as the devs felt many of its functions were too low-level. They
focused instead on finishing the DataFrame API.
wondering what I did wrong. Thanks in advance.
Wei
2015-06-25 13:44 GMT-07:00 Wei Zhou
zhweisop...@gmail.com:
Hi Alek,
Thanks for the explanation, it is very helpful.
Cheers,
Wei
2015-06-25 13:40 GMT-07:00 Eskilson,Aleksander
alek.eskil...@cerner.com:
---
From: Eskilson,Aleksander alek.eskil...@cerner.com
Sent: June 25, 2015 5:57 AM
To: Felix C felixcheun...@hotmail.com, user@spark.apache.org
Subject: Re: SparkR parallelize not found with 1.4.1?
memory, but it's hard to say without more diagnostic information.
Thanks
Shivaram
On Tue, May 26, 2015 at 7:28 AM, Eskilson,Aleksander
alek.eskil...@cerner.com wrote:
I’ve been attempting to run a SparkR translation of a similar Scala job that
identifies words from a corpus not existing in a newline delimited dictionary.
The R code is:
dict <- SparkR:::textFile(sc, src1)
corpus <- SparkR:::textFile(sc, src2)
words <- distinct(SparkR:::flatMap(corpus,
  function(line) { strsplit(line, " ")[[1]] }))  # split each line into words
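For reference, the logic that job expresses, sketched in plain Python over in-memory lists (hypothetical sample data; a real run would go through Spark's distributed equivalents):

```python
# Find corpus words absent from the dictionary: the set-based analogue
# of distinct(flatMap(corpus, split)) followed by subtracting the dict.
dictionary = {"the", "cat", "sat"}        # hypothetical dictionary words
corpus = ["the cat sat", "the dog sat"]   # hypothetical corpus lines

words = set()
for line in corpus:            # flatMap: line -> words; set gives distinct
    words.update(line.split(" "))

unknown = words - dictionary   # words not found in the dictionary
print(sorted(unknown))  # ['dog']
```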