Re: NA value handling in sparkR

2016-01-26 Thread Deborah Siegel
When fitting the currently available SparkR models, such as glm for linear
and logistic regression, columns which contain strings are one-hot encoded
behind the scenes as part of parsing the RFormula. Does that help, or did
you have something else in mind?
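
For illustration, a rough sketch with the iris data (SparkR 1.5/1.6-era
API; untested here, and the exact dummy-coefficient names are an
assumption):

df <- createDataFrame(sqlContext, iris)   # "." in column names becomes "_", e.g. Sepal_Length
model <- glm(Sepal_Width ~ Sepal_Length + Species, data = df, family = "gaussian")
summary(model)
# The string column Species shows up as dummy coefficients (e.g.
# Species_versicolor, Species_virginica), produced automatically when the
# RFormula is parsed; one level is kept as the reference category.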




> Thank you so much for your mail. It is working.
>   I have another small question in sparkR - can we create dummy
> variables for categorical columns (like the "dummies" package in R)?
> E.g., in the iris dataset we have Species as a categorical column, so 3
> dummy variable columns such as setosa and virginica would be created,
> with 0 and 1 as values.


On Mon, Jan 25, 2016 at 12:37 PM, Deborah Siegel <deborah.sie...@gmail.com>
wrote:

> Maybe not ideal, but since read.df infers any column from the csv that
> contains "NA" as string type, one could filter those values rather than
> using dropna().
>
> filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
> head(filtered_aq)
>
> Perhaps it would be better to have an option for read.df to convert any
> "NA" it encounters into null types, like createDataFrame does for R's NA
> values, and then one would be able to use dropna() etc.
>
>
>
> On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh <raj.deves...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Yes you are right.
>>
>> I think the problem is with reading the csv files. read.df is not
>> recognizing NAs in the CSV file.
>>
>> So what would be a workable solution for dealing with NAs in csv files?
>>
>>
>>
>> On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <deborah.sie...@gmail.com
>> > wrote:
>>
>>> Hi Devesh,
>>>
>>> I'm not certain why that's happening, and it looks like it doesn't
>>> happen if you use createDataFrame directly:
>>> aq <- createDataFrame(sqlContext,airquality)
>>> head(dropna(aq,how="any"))
>>>
>>> If I had to guess.. dropna(), I believe, drops null values. I suppose
>>> it's possible that createDataFrame converts R's NA values to null, so
>>> dropna() works with that. But perhaps read.df() does not convert R NAs to
>>> null, as those are most likely interpreted as strings when they come in
>>> from the csv. Just a guess, can anyone confirm?
>>>
>>> Deb
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh <
>>> raj.deves...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have applied the following code on the airquality dataset available
>>>> in R, which has some missing values. I want to omit the rows which have
>>>> NAs.
>>>>
>>>> library(SparkR)
>>>> Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>>>>
>>>> sc <- sparkR.init("local",sparkHome =
>>>> "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>>>>
>>>> sqlContext <- sparkRSQL.init(sc)
>>>>
>>>> path<-"/Users/devesh/work/airquality/"
>>>>
>>>> aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv",
>>>> header="true", inferSchema="true")
>>>>
>>>> head(dropna(aq,how="any"))
>>>>
>>>> I am getting the output as
>>>>
>>>>   Ozone Solar_R Wind Temp Month Day
>>>> 1    41     190  7.4   67     5   1
>>>> 2    36     118  8.0   72     5   2
>>>> 3    12     149 12.6   74     5   3
>>>> 4    18     313 11.5   62     5   4
>>>> 5    NA      NA 14.3   56     5   5
>>>> 6    28      NA 14.9   66     5   6
>>>>
>>>> The NAs still exist in the output. Am I missing something here?
>>>>
>>>> --
>>>> Warm regards,
>>>> Devesh.
>>>>
>>>
>>>
>>
>>
>> --
>> Warm regards,
>> Devesh.
>>
>
>


Re: NA value handling in sparkR

2016-01-25 Thread Deborah Siegel
Maybe not ideal, but since read.df infers any column from the csv that
contains "NA" as string type, one could filter those values rather than
using dropna().

filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
head(filtered_aq)

Perhaps it would be better to have an option for read.df to convert any
"NA" it encounters into null types, like createDataFrame does for R's NA
values, and then one would be able to use dropna() etc.
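
In the meantime, another possible workaround (a sketch, untested: it
assumes the affected columns were inferred as strings, and relies on
Spark's cast turning unparseable values into null):

aq$Ozone   <- cast(aq$Ozone, "integer")
aq$Solar_R <- cast(aq$Solar_R, "integer")
head(dropna(aq, how = "any"))   # the literal "NA" strings are now nulls, so dropna() applies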



On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh <raj.deves...@gmail.com>
wrote:

> Hi,
>
> Yes you are right.
>
> I think the problem is with reading the csv files. read.df is not
> recognizing NAs in the CSV file.
>
> So what would be a workable solution for dealing with NAs in csv files?
>
>
>
> On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <deborah.sie...@gmail.com>
> wrote:
>
>> Hi Devesh,
>>
>> I'm not certain why that's happening, and it looks like it doesn't happen
>> if you use createDataFrame directly:
>> aq <- createDataFrame(sqlContext,airquality)
>> head(dropna(aq,how="any"))
>>
>> If I had to guess.. dropna(), I believe, drops null values. I suppose
>> it's possible that createDataFrame converts R's NA values to null, so
>> dropna() works with that. But perhaps read.df() does not convert R NAs to
>> null, as those are most likely interpreted as strings when they come in
>> from the csv. Just a guess, can anyone confirm?
>>
>> Deb
>>
>>
>>
>>
>>
>>
>> On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh <
>> raj.deves...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have applied the following code on the airquality dataset available
>>> in R, which has some missing values. I want to omit the rows which have
>>> NAs.
>>>
>>> library(SparkR)
>>> Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>>>
>>> sc <- sparkR.init("local",sparkHome =
>>> "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>>>
>>> sqlContext <- sparkRSQL.init(sc)
>>>
>>> path<-"/Users/devesh/work/airquality/"
>>>
>>> aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv",
>>> header="true", inferSchema="true")
>>>
>>> head(dropna(aq,how="any"))
>>>
>>> I am getting the output as
>>>
>>>   Ozone Solar_R Wind Temp Month Day
>>> 1    41     190  7.4   67     5   1
>>> 2    36     118  8.0   72     5   2
>>> 3    12     149 12.6   74     5   3
>>> 4    18     313 11.5   62     5   4
>>> 5    NA      NA 14.3   56     5   5
>>> 6    28      NA 14.9   66     5   6
>>>
>>> The NAs still exist in the output. Am I missing something here?
>>>
>>> --
>>> Warm regards,
>>> Devesh.
>>>
>>
>>
>
>
> --
> Warm regards,
> Devesh.
>


SparkR pca?

2015-09-18 Thread Deborah Siegel
Hi,

Can PCA be implemented in a SparkR-MLLib integration?

Perhaps two separate issues:

1) Having the methods in SparkRWrapper and RFormula which will send the
right input types through the pipeline. MLlib PCA operates either on a
RowMatrix or on the feature vectors of an RDD[LabeledPoint]. The labels
aren't used, though in the second case it may be useful to be able to keep
the label.

2) Formula parsing from R
In R syntax you can, for example in prcomp, have a formula which has no
label (response variable), e.g. prcomp(~ Col1 + Col2 + Col3, data =
myDataFrame).
Can RFormula currently parse this type of formula?
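
For reference, a base-R illustration of the behaviour such an integration
would presumably mirror (plain R on the driver, just to make the request
concrete):

# A response-free formula selects only feature columns; no label is involved.
pca <- prcomp(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
              data = iris, scale. = TRUE)
summary(pca)   # variance explained per principal component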


Thanks for listening / ideas.
Deb


Re: SparkR - can't create spark context - JVM not ready

2015-08-20 Thread Deborah Siegel
Thanks Shivaram. You got me wondering about the path so I put it in full
and it worked. R does not, of course, expand a ~.
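
For anyone hitting the same thing, a small sketch of the fix (assuming
path.expand() is an acceptable substitute for hard-coding the full path;
untested here):

spark_home <- path.expand("~/software/spark-1.4.1-bin-hadoop2.4")  # expand ~ in R, not in the shell
Sys.setenv(SPARK_HOME = spark_home)
.libPaths(c(file.path(spark_home, "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "local[2]", appName = "SparkR-example", sparkHome = spark_home)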

On Thu, Aug 20, 2015 at 4:35 PM, Shivaram Venkataraman 
shiva...@eecs.berkeley.edu wrote:

 Can you check if the file
 `~/software/spark-1.4.1-bin-hadoop2.4/bin/spark-submit` exists ? The
 error message seems to indicate it is trying to pick up Spark from
 that location and can't seem to find Spark installed there.

 Thanks
 Shivaram

 On Thu, Aug 20, 2015 at 3:30 PM, Deborah Siegel
 deborah.sie...@gmail.com wrote:
  Hello,
 
  I have previously successfully run SparkR in RStudio, with:
 
 Sys.setenv(SPARK_HOME = "~/software/spark-1.4.1-bin-hadoop2.4")
 .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
 library(SparkR)
 sc <- sparkR.init(master = "local[2]", appName = "SparkR-example")
 
 
  Then I tried putting some of it into an .Rprofile. It seemed to work to
 load
  the paths and SparkR, but I got an error when trying to create the sc. I
  then removed my .Rprofile, as well as .rstudio-desktop. However, I still
  cannot create the sc. Here is the error
 
  sc <- sparkR.init(master = "local[2]", appName = "SparkR-example")
  Launching java with spark-submit command
  ~/software/spark-1.4.1-bin-hadoop2.4/bin/spark-submit   sparkr-shell
  /var/folders/p7/k1bpgmx93yd6pjq7dzf35gk8gn/T//RtmpOitA28/backend_port23377046db
  sh: ~/software/spark-1.4.1-bin-hadoop2.4/bin/spark-submit: No such file or
  directory
  Error in sparkR.init(master = "local[2]", appName = "SparkR-example") :
  JVM is not ready after 10 seconds
 
  I suspected there was an incomplete process or something. I checked for
 any
  running R or Java processes and there were none. Has someone seen this
 type
  of error? I have the same error in both RStudio and in R shell (but not
  sparkR wrapper).
 
  Thanks,
  Deb
 
 



SparkR - can't create spark context - JVM not ready

2015-08-20 Thread Deborah Siegel
Hello,

I have previously successfully run SparkR in RStudio, with:

Sys.setenv(SPARK_HOME = "~/software/spark-1.4.1-bin-hadoop2.4")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "local[2]", appName = "SparkR-example")


Then I tried putting some of it into an .Rprofile. It seemed to work to
load the paths and SparkR, but I got an error when trying to create the sc.
I then removed my .Rprofile, as well as .rstudio-desktop. However, I still
cannot create the sc. Here is the error

 sc <- sparkR.init(master = "local[2]", appName = "SparkR-example")
Launching java with spark-submit command
~/software/spark-1.4.1-bin-hadoop2.4/bin/spark-submit   sparkr-shell
/var/folders/p7/k1bpgmx93yd6pjq7dzf35gk8gn/T//RtmpOitA28/backend_port23377046db
sh: ~/software/spark-1.4.1-bin-hadoop2.4/bin/spark-submit: No such file or
directory
Error in sparkR.init(master = "local[2]", appName = "SparkR-example") :
JVM is not ready after 10 seconds
I suspected there was an incomplete process or something. I checked for any
running R or Java processes and there were none. Has someone seen this type
of error? I have the same error in both RStudio and in R shell (but not
sparkR wrapper).

Thanks,
Deb


Re: SparkR broadcast variables

2015-08-03 Thread Deborah Siegel
I think I just answered my own question. The privatization of the RDD API
might have resulted in my error, because this worked:

 randomMatBr <- SparkR:::broadcast(sc, randomMat)
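
A quick sanity check of the handle (a sketch; it assumes value() is also
still reachable in the now-private RDD API):

 randomMat   <- matrix(nrow = 10, ncol = 10, data = rnorm(100))
 randomMatBr <- SparkR:::broadcast(sc, randomMat)
 identical(SparkR:::value(randomMatBr), randomMat)   # expected TRUE
 # Inside an RDD closure the broadcast would be read the same way, e.g.
 #   function(x) sum(SparkR:::value(randomMatBr)) + x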

On Mon, Aug 3, 2015 at 4:59 PM, Deborah Siegel deborah.sie...@gmail.com
wrote:

 Hello,

 In looking at the SparkR codebase, it seems as if broadcast variables
 ought to be working based on the tests.

 I have tried the following in sparkR shell, and similar code in RStudio,
 but in both cases got the same message

  randomMat <- matrix(nrow=10, ncol=10, data=rnorm(100))
  randomMatBr <- broadcast(sc, randomMat)

 *Error: could not find function "broadcast"*
 Does someone know how to use broadcast variables on SparkR?
 Thanks,
 Deb



SparkR broadcast variables

2015-08-03 Thread Deborah Siegel
Hello,

In looking at the SparkR codebase, it seems as if broadcast variables ought
to be working based on the tests.

I have tried the following in sparkR shell, and similar code in RStudio,
but in both cases got the same message

 randomMat <- matrix(nrow=10, ncol=10, data=rnorm(100))
 randomMatBr <- broadcast(sc, randomMat)

*Error: could not find function "broadcast"*
Does someone know how to use broadcast variables on SparkR?
Thanks,
Deb


contributing code - how to test

2015-04-24 Thread Deborah Siegel
Hi,

I selected a starter task in JIRA, and made changes to my github fork of
the current code.

I assumed I would be able to build and test.
% mvn clean compile was fine,
but
% mvn package failed:

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-surefire-plugin:2.18:test (default-test) on
project spark-launcher_2.10: There are test failures.

I then reverted my changes, but same story. Any advice is appreciated!

Deb


ec2 persistent-hdfs with ebs using spot instances

2015-03-10 Thread Deborah Siegel
Hello,

I'm new to ec2. I've set up a spark cluster on ec2 and am using
persistent-hdfs with the data nodes mounting ebs. I launched my cluster
using spot-instances

./spark-ec2 -k mykeypair -i ~/aws/mykeypair.pem -t m3.xlarge -s 4 -z
us-east-1c --spark-version=1.2.0 --spot-price=.0321
--hadoop-major-version=2  --copy-aws-credentials --ebs-vol-size=100
launch mysparkcluster

My question is, if the spot instances get dropped and I try to attach new
slaves to the existing master with --use-existing-master, can I mount those
new slaves to the same ebs volumes? I'm guessing not. If somebody has
experience with this, how is it done?

Thanks.
Sincerely,
Deb


Re: Setting up Spark with YARN on EC2 cluster

2015-03-10 Thread Deborah Siegel
Harika,

I think you can modify an existing spark-ec2 cluster to run YARN MapReduce;
not sure if this is what you are looking for.
To try:

1) logon to master

2) go into either  ephemeral-hdfs/conf/  or persistent-hdfs/conf/
and add this to mapred-site.xml :

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

3) use copy-dir to copy this file over to the slaves (don't know if this
step is necessary)
eg.
~/spark-ec2/copy-dir.sh ~/ephemeral-hdfs/conf/mapred-site.xml

4) stop and restart hdfs (for persistent-hdfs it wasn't started to begin
with)
ephemeral-hdfs]$  ./sbin/stop-all.sh
ephemeral-hdfs]$  ./sbin/start-all.sh

HTH
Deb





On Wed, Feb 25, 2015 at 11:46 PM, Harika matha.har...@gmail.com wrote:

 Hi,

 I want to set up a Spark cluster with a YARN dependency on Amazon EC2. I
 was reading this document,
 https://spark.apache.org/docs/1.2.0/running-on-yarn.html, and I understand
 that Hadoop has to be set up for running Spark with YARN. My questions -

 1. Do we have to setup Hadoop cluster on EC2 and then build Spark on it?
 2. Is there a way to modify the existing Spark cluster to work with YARN?

 Thanks in advance.

 Harika



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Setting-up-Spark-with-YARN-on-EC2-cluster-tp21818.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Number of cores per executor on Spark Standalone

2015-03-01 Thread Deborah Siegel
Hi,

Someone else will have a better answer. I think that for standalone mode,
executors will grab whatever cores they can based on either configurations
on the worker or application-specific configurations. Could be wrong, but I
believe Mesos is similar to this, and that YARN is alone in the ability to
specify a specific number of cores given to each executor.

For Standalone Mode, configurations on the workers can limit the number of
cores they make available, and applications can limit the number of cores
they will grab across the entire cluster.

1) An environment variable on each worker - SPARK_WORKER_CORES - or pass
--cores when you manually start each worker. This will affect how many
cores are available on the worker for all applications.
2) A cluster-level property - spark.deploy.defaultCores - which limits the
number of cores any single application can grab in the case that the
application has not set spark.cores.max (or --total-executor-cores as a
flag to spark-submit). If the application has not set spark.cores.max, and
spark.deploy.defaultCores is not set, the application can grab all
available cores on the cluster. Could be an issue for a shared cluster.

Sincerely,
Deb









On Fri, Feb 27, 2015 at 11:13 PM, bit1...@163.com bit1...@163.com wrote:

 Hi ,

 I know that Spark on YARN has a configuration parameter (executor-cores
 NUM) to specify the number of cores per executor.
 How about Spark standalone? I can specify the total cores, but how could I
 know how many cores each executor will take (presuming one node, one
 executor)?


 --
 bit1...@163.com



documentation - graphx-programming-guide error?

2015-03-01 Thread Deborah Siegel
Hello,

I am running through examples given on
http://spark.apache.org/docs/1.2.1/graphx-programming-guide.html

The section for Map Reduce Triplets Transition Guide (Legacy) indicates
that one can run the following .aggregateMessages code

val graph: Graph[Int, Float] = ...
def msgFun(triplet: EdgeContext[Int, Float, String]) {
  triplet.sendToDst("Hi")
}
def reduceFun(a: Int, b: Int): Int = a + b
val result = graph.aggregateMessages[String](msgFun, reduceFun)

I created a graph of the indicated type, and get an error

scala> val result = graph.aggregateMessages[String](msgFun, reduceFun)
<console>:23: error: type mismatch;
 found   : Int
 required: String
Error occurred in an application involving default arguments.
       val result = graph.aggregateMessages[String](msgFun, reduceFun)
                                                    ^
What is this example supposed to do? The following would work, although
I'll admit I am perplexed by the example's intent.

def msgFun(triplet: EdgeContext[Int, Float, (Int, String)]) {
  triplet.sendToDst((1, "Hi"))
}
def reduceFun(a: (Int, String), b: (Int, String)): (Int, String) =
  (a._1 + b._1, a._2)
val result = graph.aggregateMessages[(Int, String)](msgFun, reduceFun)

Sincerely,
Deb


Re: Running spark function on parquet without sql

2015-02-27 Thread Deborah Siegel
Hi Michael,

Would you help me understand the apparent difference here:

The Spark 1.2.1 programming guide indicates:

Note that if you call schemaRDD.cache() rather than
sqlContext.cacheTable(...), tables will *not* be cached using the in-memory
columnar format, and therefore sqlContext.cacheTable(...) is strongly
recommended for this use case.

Yet the API doc
(https://spark.apache.org/docs/1.2.0/api/scala/org/apache/spark/sql/SchemaRDD.html)
shows:
def cache(): SchemaRDD.this.type
"Overridden cache function will always use the in-memory columnar caching."


links
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD

Thanks
Sincerely
Deb

On Fri, Feb 27, 2015 at 2:13 PM, Michael Armbrust mich...@databricks.com
wrote:

 From Zhan Zhang's reply, yes I still get the parquet's advantage.


 You will need to at least use SQL or the DataFrame API (coming in Spark
 1.3) to specify the columns that you want in order to get the parquet
 benefits.   The rest of your operations can be standard Spark.

 My next question is, if I operate on a SchemaRDD, will I get the advantage
 of Spark SQL's in-memory columnar store when caching the table using
 cacheTable()?


 Yes, SchemaRDDs always use the in-memory columnar cache for cacheTable and
 .cache() since Spark 1.2+



Re: Why can't Spark find the classes in this Jar?

2015-02-12 Thread Deborah Siegel
Hi Abe,
I'm new to Spark as well, so someone else could answer better. A few
thoughts which may or may not be the right line of thinking:

1) Spark properties can be set on the SparkConf, and with flags in
spark-submit, but settings on the SparkConf take precedence. I think your
--jars flag for spark-submit may be redundant.

2) Is there a chance that stanford-corenlp-3.5.0.jar relies on other
dependencies? I could be wrong, but perhaps if there is no other reason not
to, try building your application as an uber-jar with a build tool like
Maven, which will package the transitive dependencies into one jar. You can
find stanford-corenlp on Maven Central. I think you would add the below
dependencies to your pom.xml. After building simple-project-1.0.jar with
these dependencies, you would not set jars on the SparkConf or the --jars
flag on spark-submit.

<dependencies>
  <dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.5.0</version>
  </dependency>
  <dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.5.0</version>
    <classifier>models</classifier>
  </dependency>
</dependencies>

HTH.
Deb

On Tue, Feb 10, 2015 at 1:12 PM, Abe Handler akh2...@gmail.com wrote:

 I am new to spark. I am trying to compile and run a spark application that
 requires classes from an (external) jar file on my local machine. If I open
 the jar (on ~/Desktop) I can see the missing class in the local jar but
 when
 I run spark I get

 NoClassDefFoundError: edu/stanford/nlp/ie/AbstractSequenceClassifier

 I add the jar to the spark context like this

 String[] jars = {"/home/pathto/Desktop/stanford-corenlp-3.5.0.jar"};
 SparkConf conf = new SparkConf().setAppName("Simple Application").setJars(jars);
 Then I try to run a submit script like this

 /home/me/Downloads/spark-1.2.0-bin-hadoop2.4/bin/spark-submit \
   --class SimpleApp \
   --master local[4] \
   target/simple-project-1.0.jar \
   --jars local[4] /home/abe/Desktop/stanford-corenlp-3.5.0.jar
 and hit the NoClassDefFoundError.

 I get that this means that the worker threads can't find the class from the
 jar. But I am not sure what I am doing wrong. I have tried different
 syntaxes for the last line (below) but none works.

   --addJars local[4] /home/abe/Desktop/stanford-corenlp-3.5.0.jar
   --addJars local:/home/abe/Desktop/stanford-corenlp-3.5.0.jar
   --addJars local:/home/abe/Desktop/stanford-corenlp-3.5.0.jar

 How can I fix this error?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Why-can-t-Spark-find-the-classes-in-this-Jar-tp21584.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
