Re: NA value handling in sparkR
While fitting the currently available SparkR models, such as glm for linear and logistic regression, columns which contain strings are one-hot encoded behind the scenes as part of the parsing of the RFormula. Does that help, or did you have something else in mind?

> Thank you so much for your mail. It is working.
> I have another small question in sparkR - can we create dummy
> variables for categorical columns (like in R we have the "dummies" package)?
> e.g. in the iris dataset we have Species as a categorical column, so 3 dummy
> variable columns like setosa, virginica would be created with 0 and 1 as
> values.

On Mon, Jan 25, 2016 at 12:37 PM, Deborah Siegel <deborah.sie...@gmail.com> wrote:

> Maybe not ideal, but since read.df is inferring all columns from the csv
> containing "NA" as type of strings, one could filter them rather than using
> dropna().
>
> filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
> head(filtered_aq)
>
> Perhaps it would be better to have an option for read.df to convert any
> "NA" it encounters into null types, like createDataFrame does for NA, and
> then one would be able to use dropna() etc.
>
> On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh <raj.deves...@gmail.com> wrote:
>
>> Hi,
>>
>> Yes, you are right.
>>
>> I think the problem is with the reading of csv files. read.df is not
>> considering NAs in the CSV file.
>>
>> So what would be a workable solution for dealing with NAs in csv files?
>>
>> On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <deborah.sie...@gmail.com> wrote:
>>
>>> Hi Devesh,
>>>
>>> I'm not certain why that's happening, and it looks like it doesn't
>>> happen if you use createDataFrame directly:
>>>
>>> aq <- createDataFrame(sqlContext, airquality)
>>> head(dropna(aq, how="any"))
>>>
>>> If I had to guess: dropna(), I believe, drops null values. I suppose
>>> it's possible that createDataFrame converts R's NA values to null, so
>>> dropna() works with that. But perhaps read.df() does not convert R NAs to
>>> null, as those are most likely interpreted as strings when they come in
>>> from the csv. Just a guess, can anyone confirm?
>>>
>>> Deb
>>>
>>> On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh <raj.deves...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have applied the following code on the airquality dataset available in R,
>>>> which has some missing values. I want to omit the rows which have NAs.
>>>>
>>>> library(SparkR)
>>>> Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>>>>
>>>> sc <- sparkR.init("local", sparkHome = "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>>>>
>>>> sqlContext <- sparkRSQL.init(sc)
>>>>
>>>> path <- "/Users/devesh/work/airquality/"
>>>>
>>>> aq <- read.df(sqlContext, path, source = "com.databricks.spark.csv",
>>>>               header="true", inferSchema="true")
>>>>
>>>> head(dropna(aq, how="any"))
>>>>
>>>> I am getting the output as:
>>>>
>>>>   Ozone Solar_R Wind Temp Month Day
>>>> 1    41     190  7.4   67     5   1
>>>> 2    36     118  8.0   72     5   2
>>>> 3    12     149 12.6   74     5   3
>>>> 4    18     313 11.5   62     5   4
>>>> 5    NA      NA 14.3   56     5   5
>>>> 6    28      NA 14.9   66     5   6
>>>>
>>>> The NAs still exist in the output. Am I missing something here?
>>>>
>>>> --
>>>> Warm regards,
>>>> Devesh.
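For the dummy-variables question quoted above, the kind of encoding RFormula applies to string columns can be illustrated with a small plain-Python sketch. This is not the SparkR API; the `one_hot` helper is illustrative, and the convention of dropping one reference level (which many encoders use to avoid collinearity with an intercept) may differ by Spark version.

```python
# Minimal sketch of dummy-variable (one-hot) encoding, the idea RFormula
# applies behind the scenes to string columns. Plain Python, not SparkR.

def one_hot(values):
    """Map each categorical value to a dict of 0/1 indicator columns.

    One level is dropped as the reference category, a common convention
    to avoid perfect collinearity with a model intercept.
    """
    levels = sorted(set(values))[:-1]  # last level becomes the reference
    return [{lvl: int(v == lvl) for lvl in levels} for v in values]

species = ["setosa", "versicolor", "virginica", "setosa"]
encoded = one_hot(species)
# 'virginica' is the dropped reference level, so each row keeps two indicators.
print(encoded[0])  # {'setosa': 1, 'versicolor': 0}
```

A row's original category is recoverable from the indicators: all zeros means the reference level.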
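The behavior discussed in this thread, where dropna() removes only true nulls while the csv's literal "NA" strings survive, can be sketched in plain Python. The rows below are illustrative, not the SparkR API.

```python
# Rows as read.df sees them: "NA" came in from the csv as a plain string,
# not as a null, so a null-dropping pass leaves those rows alone.
rows = [
    {"Ozone": "41", "Solar_R": "190"},
    {"Ozone": "NA", "Solar_R": "NA"},   # string "NA", not null
    {"Ozone": "28", "Solar_R": None},   # a real null
]

# dropna()-style pass: drops only real nulls, so the "NA" row survives.
drop_nulls = [r for r in rows if all(v is not None for v in r.values())]

# The workaround from the thread: filter the literal string instead.
# This misses real nulls, the mirror-image gap.
filter_na = [r for r in rows if all(v != "NA" for v in r.values())]

# Converting "NA" to null on read, then dropping nulls, catches both --
# which is what the thread suggests read.df could optionally do.
normalized = [{k: (None if v == "NA" else v) for k, v in r.items()} for r in rows]
both = [r for r in normalized if all(v is not None for v in r.values())]

print(len(drop_nulls))  # 2
print(len(filter_na))   # 2
print(len(both))        # 1
```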
SparkR pca?
Hi,

Can PCA be implemented in a SparkR-MLlib integration? There are perhaps 2 separate issues:

1) Having the methods in SparkRWrapper and RFormula which will send the right input types through the pipeline. MLlib PCA operates either on a RowMatrix or on the feature vectors of an RDD[LabeledPoint]. The labels aren't used, though in the second case it may be useful to be able to keep the label.

2) Formula parsing from R. In R syntax you can, for example in prcomp, have a formula which has no label (response variable), e.g.:

prcomp(~ Col1 + Col2 + Col3, data = myDataFrame)

Can RFormula currently parse this type of formula?

Thanks for listening / ideas.
Deb
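To make issue 1 concrete, the computation PCA performs on the (label-free) feature vectors can be sketched in plain Python for the two-column case: center the data, form the covariance matrix, and take its leading eigenvector (closed form for 2x2). The data points are illustrative; this is the mathematical idea, not the MLlib API.

```python
import math

# PCA sketch on 2-D data. Labels never enter the computation, which is
# why MLlib's PCA only needs the feature vectors.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centered = [(x - mx, y - my) for x, y in data]

# Sample covariance matrix [[sxx, sxy], [sxy, syy]]
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Leading eigenvalue of the symmetric 2x2 matrix, then its eigenvector
# (valid here because sxy != 0; a full implementation handles all cases).
lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
v = (lam - syy, sxy)                  # unnormalized eigenvector
norm = math.hypot(*v)
pc1 = (v[0] / norm, v[1] / norm)      # first principal component

# Project each centered row onto the first component.
scores = [x * pc1[0] + y * pc1[1] for x, y in centered]
```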
Re: SparkR - can't create spark context - JVM not ready
Thanks Shivaram. You got me wondering about the path, so I put it in full and it worked. R does not, of course, expand a ~ here.

On Thu, Aug 20, 2015 at 4:35 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:

> Can you check if the file `~/software/spark-1.4.1-bin-hadoop2.4/bin/spark-submit`
> exists? The error message seems to indicate it is trying to pick up Spark
> from that location and can't seem to find Spark installed there.
>
> Thanks
> Shivaram
>
> On Thu, Aug 20, 2015 at 3:30 PM, Deborah Siegel <deborah.sie...@gmail.com> wrote:
>
>> Hello,
>>
>> I have previously successfully run SparkR in RStudio, with:
>>
>> Sys.setenv(SPARK_HOME="~/software/spark-1.4.1-bin-hadoop2.4")
>> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
>> library(SparkR)
>> sc <- sparkR.init(master="local[2]", appName="SparkR-example")
>>
>> Then I tried putting some of it into an .Rprofile. It seemed to work to
>> load the paths and SparkR, but I got an error when trying to create the sc.
>> I then removed my .Rprofile, as well as .rstudio-desktop. However, I still
>> cannot create the sc. Here is the error:
>>
>> sc <- sparkR.init(master="local[2]", appName="SparkR-example")
>> Launching java with spark-submit command
>> ~/software/spark-1.4.1-bin-hadoop2.4/bin/spark-submit sparkr-shell
>> /var/folders/p7/k1bpgmx93yd6pjq7dzf35gk8gn/T//RtmpOitA28/backend_port23377046db
>> sh: ~/software/spark-1.4.1-bin-hadoop2.4/bin/spark-submit: No such file or directory
>> Error in sparkR.init(master = "local[2]", appName = "SparkR-example") :
>>   JVM is not ready after 10 seconds
>>
>> I suspected there was an incomplete process or something. I checked for any
>> running R or Java processes and there were none. Has someone seen this type
>> of error? I have the same error in both RStudio and in the R shell (but not
>> the sparkR wrapper).
>>
>> Thanks,
>> Deb
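The underlying gotcha generalizes beyond R: a `~` inside a plain string is expanded by an interactive shell, not by the programs the string is later handed to, so something must expand it explicitly before the path is used. A Python sketch of the same idea, reusing the path string from the post:

```python
import os.path

# A "~" in a plain string stays a literal "~" until something expands it;
# R's path.expand() and Python's os.path.expanduser() do that job.
raw = "~/software/spark-1.4.1-bin-hadoop2.4/bin/spark-submit"

expanded = os.path.expanduser(raw)   # replaces the leading ~ with $HOME

print(raw.startswith("~"))           # True  -- still a literal tilde
print(expanded.startswith("~"))      # False -- now rooted at the home dir
```

Passing `raw` to a subprocess or `sh` would fail exactly as in the error above, because no shell expansion happens on the way.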
Re: SparkR broadcast variables
I think I just answered my own question. The privatization of the RDD API might have resulted in my error, because this worked:

randomMatBr <- SparkR:::broadcast(sc, randomMat)

On Mon, Aug 3, 2015 at 4:59 PM, Deborah Siegel <deborah.sie...@gmail.com> wrote:

> Hello,
>
> Looking at the SparkR codebase, it seems as if broadcast variables ought
> to be working, based on the tests. I have tried the following in the
> sparkR shell, and similar code in RStudio, but in both cases got the same
> message:
>
> randomMat <- matrix(nrow=10, ncol=10, data=rnorm(100))
> randomMatBr <- broadcast(sc, randomMat)
> Error: could not find function "broadcast"
>
> Does someone know how to use broadcast variables in SparkR?
>
> Thanks,
> Deb
contributing code - how to test
Hi,

I selected a starter task in JIRA and made changes to my GitHub fork of the current code. I assumed I would be able to build and test.

% mvn clean compile

was fine, but

% mvn package

failed:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18:test (default-test) on project spark-launcher_2.10: There are test failures.

I then reverted my changes, but same story. Any advice is appreciated!

Deb
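One way to iterate faster while debugging is to separate the build from the failing test run, using standard Maven flags. The module selector below is illustrative; the exact module path depends on the Spark source tree being built.

```shell
# Build everything without running the test suite (standard Maven flag):
mvn clean package -DskipTests

# Then run tests for just the module you touched, e.g. the launcher module
# (-pl selects a module by its directory path; path is an assumption here):
mvn test -pl launcher
```

If the same failures appear on an untouched checkout, they predate the change, which is worth stating when asking the list for help.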
ec2 persistent-hdfs with ebs using spot instances
Hello,

I'm new to EC2. I've set up a Spark cluster on EC2 and am using persistent-hdfs, with the data nodes mounting EBS. I launched my cluster using spot instances:

./spark-ec2 -k mykeypair -i ~/aws/mykeypair.pem -t m3.xlarge -s 4 -z us-east-1c --spark-version=1.2.0 --spot-price=.0321 --hadoop-major-version=2 --copy-aws-credentials --ebs-vol-size=100 launch mysparkcluster

My question is: if the spot instances get dropped, and I try to attach new slaves to the existing master with --use-existing-master, can I mount those new slaves to the same EBS volumes? I'm guessing not. If somebody has experience with this, how is it done?

Thanks.

Sincerely,
Deb
Re: Setting up Spark with YARN on EC2 cluster
Harika,

I think you can modify an existing Spark-on-EC2 cluster to run YARN MapReduce; not sure if this is what you are looking for. To try:

1) Log on to the master.

2) Go into either ephemeral-hdfs/conf/ or persistent-hdfs/conf/ and add this to mapred-site.xml:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

3) Use copy-dir to copy this file over to the slaves (don't know if this step is necessary), e.g.:

~/spark-ec2/copy-dir.sh ~/ephemeral-hdfs/conf/mapred-site.xml

4) Stop and restart HDFS (for persistent-hdfs it wasn't started to begin with):

ephemeral-hdfs]$ ./sbin/stop-all.sh
ephemeral-hdfs]$ ./sbin/start-all.sh

HTH
Deb

On Wed, Feb 25, 2015 at 11:46 PM, Harika <matha.har...@gmail.com> wrote:

> Hi,
>
> I want to set up a Spark cluster with a YARN dependency on Amazon EC2. I
> was reading this document:
> https://spark.apache.org/docs/1.2.0/running-on-yarn.html
> and I understand that Hadoop has to be set up for running Spark with YARN.
>
> My questions -
> 1. Do we have to set up a Hadoop cluster on EC2 and then build Spark on it?
> 2. Is there a way to modify the existing Spark cluster to work with YARN?
>
> Thanks in advance.
> Harika
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Setting-up-Spark-with-YARN-on-EC2-cluster-tp21818.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Number of cores per executor on Spark Standalone
Hi,

Someone else will have a better answer. I think that for standalone mode, executors will grab whatever cores they can, based on either configurations on the worker or application-specific configurations. Could be wrong, but I believe Mesos is similar to this, and that YARN is alone in the ability to specify a specific number of cores given to each executor.

For standalone mode, configurations on the workers can limit the number of cores available on themselves, and applications can limit the number of cores they will grab across the entire cluster:

1) An environment variable on each worker, SPARK_WORKER_CORES (or --cores as you manually start each worker). This affects how many cores are available on that worker for all applications.

2) The spark.deploy.defaultCores property on the standalone master, which limits the number of cores any single application can grab in the case that the application has not set spark.cores.max (or --total-executor-cores as a flag to spark-submit). If the application has not set spark.cores.max, and spark.deploy.defaultCores is not set either, the application can grab unlimited cores across the cluster. Could be an issue for a shared cluster.

Sincerely,
Deb

On Fri, Feb 27, 2015 at 11:13 PM, bit1...@163.com <bit1...@163.com> wrote:

> Hi,
> I know that Spark on YARN has a configuration parameter (executor-cores
> NUM) to specify the number of cores per executor. How about Spark
> standalone? I can specify the total cores, but how could I know how many
> cores each executor will take (presume one node, one executor)?
>
> --
> bit1...@163.com
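A sketch of where those standalone-mode knobs live, assuming a stock Spark layout; the file paths and the numeric values are illustrative:

```shell
# conf/spark-env.sh on each worker: cap the cores this worker offers to
# all applications (same effect as --cores when starting the worker).
SPARK_WORKER_CORES=8

# conf/spark-env.sh on the master: default per-application cap, used when
# an application does not set spark.cores.max itself.
SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"

# Per-application cap across the whole cluster, set at submit time:
#   spark-submit --total-executor-cores 8 ...
# or in the application itself:
#   conf.set("spark.cores.max", "8")
```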
documentation - graphx-programming-guide error?
Hello,

I am running through examples given on
http://spark.apache.org/docs/1.2.1/graphx-programming-guide.html

The section "Map Reduce Triplets Transition Guide (Legacy)" indicates that one can run the following aggregateMessages code:

val graph: Graph[Int, Float] = ...
def msgFun(triplet: EdgeContext[Int, Float, String]) {
  triplet.sendToDst("Hi")
}
def reduceFun(a: Int, b: Int): Int = a + b
val result = graph.aggregateMessages[String](msgFun, reduceFun)

I created a graph of the indicated type, and get an error:

scala> val result = graph.aggregateMessages[String](msgFun, reduceFun)
<console>:23: error: type mismatch;
 found   : Int
 required: String
Error occurred in an application involving default arguments.
       val result = graph.aggregateMessages[String](msgFun, reduceFun)
                                                            ^

What is this example supposed to do? The following would work, although I'll admit I am perplexed by the example's intent:

def msgFun(triplet: EdgeContext[Int, Float, (Int, String)]) {
  triplet.sendToDst((1, "Hi"))
}
def reduceFun(a: (Int, String), b: (Int, String)): (Int, String) = ((a._1 + b._1), a._2)
val result = graph.aggregateMessages[(Int, String)](msgFun, reduceFun)

Sincerely,
Deb
Re: Running spark function on parquet without sql
Hi Michael,

Would you help me understand the apparent difference here?

The Spark 1.2.1 programming guide indicates:

"Note that if you call schemaRDD.cache() rather than sqlContext.cacheTable(...), tables will *not* be cached using the in-memory columnar format, and therefore sqlContext.cacheTable(...) is strongly recommended for this use case."

Yet the API doc for SchemaRDD shows:

def cache(): SchemaRDD.this.type
Overridden cache function will always use the in-memory columnar caching.

links:
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD

Thanks
Sincerely,
Deb

On Fri, Feb 27, 2015 at 2:13 PM, Michael Armbrust <mich...@databricks.com> wrote:

>> From Zhan Zhang's reply, yes I still get the parquet's advantage.
>
> You will need to at least use SQL or the DataFrame API (coming in Spark
> 1.3) to specify the columns that you want in order to get the parquet
> benefits. The rest of your operations can be standard Spark.
>
>> My next question is, if I operate on SchemaRDD will I get the advantage
>> of Spark SQL's in-memory columnar store when I cache the table using
>> cacheTable()?
>
> Yes, SchemaRDDs always use the in-memory columnar cache for cacheTable
> and .cache() since Spark 1.2+.
Re: Why can't Spark find the classes in this Jar?
Hi Abe,

I'm new to Spark as well, so someone else could answer better. A few thoughts which may or may not be the right line of thinking:

1) Spark properties can be set on the SparkConf and with flags in spark-submit, but settings on the SparkConf take precedence. I think your --jars flag for spark-submit may be redundant.

2) Is there a chance that stanford-corenlp-3.5.0.jar relies on other dependencies? I could be wrong, but perhaps, if there is no other reason not to, try building your application as an uber-jar with a build tool like Maven, which will package the whole transitive set of dependencies. You can find stanford-corenlp on Maven Central. I think you would add the below dependencies to your pom.xml. After building simple-project-1.0.jar with these dependencies, you would not set jars on the sc or --jars flags on spark-submit.

<dependencies>
  <dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.5.0</version>
  </dependency>
  <dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.5.0</version>
    <classifier>models</classifier>
  </dependency>
</dependencies>

HTH.
Deb

On Tue, Feb 10, 2015 at 1:12 PM, Abe Handler <akh2...@gmail.com> wrote:

> I am new to Spark. I am trying to compile and run a Spark application that
> requires classes from an (external) jar file on my local machine. If I
> open the jar (on ~/Desktop) I can see the missing class in the local jar,
> but when I run Spark I get:
>
> NoClassDefFoundError: edu/stanford/nlp/ie/AbstractSequenceClassifier
>
> I add the jar to the Spark context like this:
>
> String[] jars = {"/home/pathto/Desktop/stanford-corenlp-3.5.0.jar"};
> SparkConf conf = new SparkConf().setAppName("Simple Application").setJars(jars);
>
> Then I try to run a submit script like this:
>
> /home/me/Downloads/spark-1.2.0-bin-hadoop2.4/bin/spark-submit \
>   --class SimpleApp \
>   --master local[4] \
>   target/simple-project-1.0.jar \
>   --jars local[4] /home/abe/Desktop/stanford-corenlp-3.5.0.jar
>
> and hit the NoClassDefFoundError. I get that this means that the worker
> threads can't find the class from the jar. But I am not sure what I am
> doing wrong. I have tried different syntaxes for the last line (below) but
> none works:
>
> --addJars local[4] /home/abe/Desktop/stanford-corenlp-3.5.0.jar
> --addJars local:/home/abe/Desktop/stanford-corenlp-3.5.0.jar
>
> How can I fix this error?
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-can-t-Spark-find-the-classes-in-this-Jar-tp21584.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
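One likely culprit in the quoted submit command itself: spark-submit treats everything after the application jar as arguments to the application, so --jars (a comma-separated list of jar paths) must come before the jar, and local[4] is the value for --master, not for --jars. A corrected sketch, reusing the paths from the post:

```shell
# All spark-submit flags go BEFORE the application jar; anything after the
# jar is passed through to the application's main() as arguments.
/home/me/Downloads/spark-1.2.0-bin-hadoop2.4/bin/spark-submit \
  --class SimpleApp \
  --master "local[4]" \
  --jars /home/abe/Desktop/stanford-corenlp-3.5.0.jar \
  target/simple-project-1.0.jar
```

With the uber-jar approach above instead, the --jars flag is dropped entirely, since the dependency classes are already inside target/simple-project-1.0.jar.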