saving or visualizing PCA
Hi, I am generating a PCA using Spark, but I don't know how to save it to disk or visualize it. Can someone give me some pointers please? Thanks -Roni
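One possible approach (a sketch, not the only way; `rows` and the output path are placeholders): compute the principal components with MLlib's RowMatrix, project the data onto them, and write the projected points out as plain text so they can be plotted in R, Python, gnuplot, etc.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(rows)                   // rows: RDD[Vector], e.g. built with Vectors.dense(...)
    val pc  = mat.computePrincipalComponents(2)     // small local Matrix (numCols x 2)
    val projected = mat.multiply(pc)                // each row projected onto the top 2 PCs

    // the loading matrix is tiny, so it can simply be printed or written from the driver
    println(pc)
    // the projected points can be large, so save them as comma-separated text for plotting elsewhere
    projected.rows.map(_.toArray.mkString(",")).saveAsTextFile("hdfs:///tmp/pca_projected")

Spark itself has no plotting support, so the usual pattern is to save the (much smaller) projected data and draw the scatter plot in R or matplotlib.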
FetchFailedException: Adjusted frame length exceeds 2147483647: 12716268407 - discarded
I get 2 types of errors: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0, and FetchFailedException: Adjusted frame length exceeds 2147483647: 12716268407 - discarded. Spark keeps retrying the stages and keeps getting this error. The file on which I am finding the sliding-window strings is 500 MB and I am doing it with length = 150. It works fine until the length is 100. This is my code -
val hgfasta = sc.textFile(args(0)) // read the fasta file
val kCount = hgfasta.flatMap(r => r.sliding(args(2).toInt))
val kmerCount = kCount.map(x => (x, 1)).reduceByKey(_ + _).map { case (x, y) => (y, x) }.sortByKey(false).map { case (i, j) => (j, i) }
val filtered = kmerCount.filter(kv => kv._2 > 5)
filtered.map(kv => kv._1 + "," + kv._2.toLong).saveAsTextFile(args(1))
It gets stuck in the flatMap and saveAsTextFile stages and throws org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 and org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 2147483647: 12716268407 - discarded at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
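The "Adjusted frame length exceeds 2147483647" message usually means a single shuffle block grew past 2 GB, which in a k-mer count like this tends to come from having too few partitions for the amount of intermediate data; the missing-output-location error is typically a downstream symptom of executors dying under the same pressure. A hedged adjustment of the same logic with more (and therefore smaller) shuffle blocks - the partition counts 200 and 400 are guesses to tune for the cluster:

    import org.apache.spark.SparkContext._          // pair-RDD functions on Spark 1.2/1.3

    val hgfasta = sc.textFile(args(0), 200)                    // more input partitions
    val kCount  = hgfasta.flatMap(r => r.sliding(args(2).toInt))
    val counts  = kCount.map(x => (x, 1))
                        .reduceByKey(_ + _, 400)               // explicit partition count for the shuffle
    val filtered = counts.filter { case (_, c) => c > 5 }
    filtered.map { case (kmer, c) => kmer + "," + c }
            .saveAsTextFile(args(1))

If the sort by count is really needed, sortBy(_._2, ascending = false, numPartitions = 400) keeps the same idea; raising spark.default.parallelism works the same way.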
distcp problems on ec2 standalone spark cluster
I got past the issue with the cluster not starting by adding yarn to mapreduce.framework.name. But when I try to distcp, if I use a URI with s3://path-to-my-bucket I get an invalid-path error even though the bucket exists. If I use s3n:// it just hangs. Did anyone else face anything like that? I also noticed that this script puts on the Cloudera Hadoop image. Does that matter? Thanks -R
spark-ec2 script problems
Hi, I used the spark-ec2 script to create an EC2 cluster. Now I am trying to copy data from S3 into HDFS. I am doing this: [root@ip-172-31-21-160 ephemeral-hdfs]$ bin/hadoop distcp s3://xxx/home/mydata/small.sam hdfs://ec2-52-11-148-31.us-west-2.compute.amazonaws.com:9010/data1 and I get the following error - 2015-03-06 01:39:27,299 INFO tools.DistCp (DistCp.java:run(109)) - Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[s3://xxX/home/mydata/small.sam], targetPath=hdfs://ec2-52-11-148-31.us-west-2.compute.amazonaws.com:9010/data1} 2015-03-06 01:39:27,585 INFO mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid mapreduce.jobtracker.address configuration value for LocalJobRunner : ec2-52-11-148-31.us-west-2.compute.amazonaws.com:9001 2015-03-06 01:39:27,585 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121) at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83) at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76) at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352) at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146) at org.apache.hadoop.tools.DistCp.run(DistCp.java:118) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374) I tried running start-all.sh, start-dfs.sh and start-yarn.sh. What should I do? Thanks -roni
Re: distcp on ec2 standalone spark cluster
Did you get this to work? I got past the issue with the cluster not starting. I am having a problem where distcp with an s3:// URI says incorrect folder path, and s3n:// hangs. Stuck for 2 days :( Thanks -R
Re: Setting up Spark with YARN on EC2 cluster
Hi Harika, Did you get any solution for this? I want to use YARN, but the spark-ec2 script does not support it. Thanks -Roni
Re: difference in PCA of MLlib vs H2O in R
Reza, That svd.V matches the H2O and R prcomp (non-centered) results. Thanks -R On Tue, Mar 24, 2015 at 11:38 AM, Sean Owen so...@cloudera.com wrote: (Oh sorry, I've only been thinking of TallSkinnySVD) On Tue, Mar 24, 2015 at 6:36 PM, Reza Zadeh r...@databricks.com wrote: If you want to do a nonstandard (or uncentered) PCA, you can call computeSVD on RowMatrix, and look at the resulting 'V' Matrix. That should match the output of the other two systems. Reza On Tue, Mar 24, 2015 at 3:53 AM, Sean Owen so...@cloudera.com wrote: Those implementations are computing an SVD of the input matrix directly, and while you generally need the columns to have mean 0, you can turn that off with the options you cite. I don't think this is possible in the MLlib implementation, since it is computing the principal components by computing eigenvectors of the covariance matrix. The means inherently don't matter either way in this computation. On Tue, Mar 24, 2015 at 6:13 AM, roni roni.epi...@gmail.com wrote: I am trying to compute PCA using computePrincipalComponents. I also computed PCA using H2O in R and with R's prcomp. The answers I get from H2O and R's prcomp (non-H2O) are the same when I set the option standardized=FALSE for H2O and center=FALSE for R's prcomp. How do I make sure that the settings for MLlib PCA are the same as the ones I am using for H2O or prcomp? Thanks Roni
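A minimal Scala sketch of what Reza describes, assuming the data is already an RDD[Vector] called rows (the name and the value of k are placeholders): build a RowMatrix, compute the SVD without any centering, and take the right singular vectors V as the uncentered principal directions.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(rows)                   // rows: RDD[Vector], not mean-centered
    val svd = mat.computeSVD(10, computeU = false)  // keep the top 10 singular vectors
    val V   = svd.V                                 // local Matrix; columns are the right singular vectors
    println(V)

V should line up with prcomp(x, center = FALSE) in R and with H2O's non-standardized PCA, up to sign flips of individual columns, which are arbitrary in any SVD.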
upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems
I have an EC2 cluster created using Spark version 1.2.1, and I have an SBT project. Now I want to upgrade to Spark 1.3 and use the new features. Below are the issues. Sorry for the long post. Appreciate your help. Thanks -Roni
Question - Do I have to create a new cluster using Spark 1.3?
Here is what I did - in my SBT file I changed to
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
But then I started getting compilation errors, along with these eviction warnings:
[warn] * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
[warn] * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) -> 2.6.0
[warn] Run 'evicted' to see detailed eviction warnings
constructor cannot be instantiated to expected type;
[error] found : (T1, T2)
[error] required: org.apache.spark.sql.catalyst.expressions.Row
[error] val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) => KmerIntesect(word, file1Counts,"xyz")}
[error] ^
Here is my SBT and code --
SBT -
version := "1.0"
scalaVersion := "2.10.4"
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
resolvers += "Maven Repo1" at "https://repo1.maven.org/maven2"
resolvers += "Maven Repo" at "https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/"
/* Dependencies - %% appends Scala version to artifactId */
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
libraryDependencies += "org.bdgenomics.adam" % "adam-core" % "0.16.0"
libraryDependencies += "ai.h2o" % "sparkling-water-core_2.10" % "0.2.10"
CODE --
import org.apache.spark.{SparkConf, SparkContext}
case class KmerIntesect(kmer: String, kCount: Int, fileName: String)
object preDefKmerIntersection {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("preDefKmer-intersect")
    val sc = new SparkContext(sparkConf)
    import sqlContext.createSchemaRDD
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val bedFile = sc.textFile("s3n://a/b/c", 40)
    val hgfasta = sc.textFile("hdfs://a/b/c", 40)
    val hgPair = hgfasta.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
    val filtered = hgPair.filter(kv => kv._2 == 1)
    val bedPair = bedFile.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
    val joinRDD = bedPair.join(filtered)
    val ty = joinRDD.map { case (word, (file1Counts, file2Counts)) => KmerIntesect(word, file1Counts, "xyz") }
    ty.registerTempTable("KmerIntesect")
    ty.saveAsParquetFile("hdfs://x/y/z/kmerIntersect.parquet")
  }
}
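For reference, the compile error above comes from the Spark SQL side of the job: in 1.3 the old import sqlContext.createSchemaRDD implicit was replaced by sqlContext.implicits._, and an RDD of case classes has to be converted to a DataFrame explicitly. A hedged sketch of the 1.3-style version of the tail of this program (paths and names as in the original; spark-sql added as a dependency):

    libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.0"

and in the code:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._                     // replaces createSchemaRDD in 1.3

    val ty = joinRDD.map { case (word, (file1Counts, file2Counts)) =>
      KmerIntesect(word, file1Counts, "xyz")
    }.toDF()                                          // RDD[KmerIntesect] -> DataFrame

    ty.registerTempTable("KmerIntesect")
    ty.saveAsParquetFile("hdfs://x/y/z/kmerIntersect.parquet")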
Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems
Even if H2O and ADAM are dependent on 1.2.1, it should be backward compatible, right? So using 1.3 should not break them. And the code is not using the classes from those libs. I tried sbt clean compile -- same error. Thanks _R On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath nick.pentre...@gmail.com wrote: What version of Spark do the other dependencies rely on (Adam and H2O)? That could be it. Or try sbt clean compile — Sent from Mailbox https://www.dropbox.com/mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote: I have an EC2 cluster created using Spark version 1.2.1, and I have an SBT project. Now I want to upgrade to Spark 1.3 and use the new features. Question - Do I have to create a new cluster using Spark 1.3? Here is what I did - in my SBT file I changed to libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0", but then I started getting compilation errors.
Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems
Thanks Dean and Nick. So, I removed ADAM and H2O from my SBT since I was not using them. I got the code to compile - only for it to fail while running with: SparkContext: Created broadcast 1 from textFile at kmerIntersetion.scala:21 Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD$ at preDefKmerIntersection$.main(kmerIntersetion.scala:26) This line is where I do a JOIN operation:
val hgPair = hgfasta.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
val filtered = hgPair.filter(kv => kv._2 == 1)
val bedPair = bedFile.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
val joinRDD = bedPair.join(filtered)
Any idea what's going on? I have data on the EC2 cluster, so I am avoiding creating a new cluster and am instead just upgrading and changing the code to use 1.3 and Spark SQL. Thanks Roni On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler deanwamp...@gmail.com wrote: For the Spark SQL parts, 1.3 breaks backwards compatibility, because before 1.3, Spark SQL was considered experimental where API changes were allowed. So, H2O and ADAM compatible with 1.2.X might not work with 1.3. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25, 2015 at 9:39 AM, roni roni.epi...@gmail.com wrote: Even if H2O and ADAM are dependent on 1.2.1, it should be backward compatible, right? So using 1.3 should not break them. And the code is not using the classes from those libs. I tried sbt clean compile -- same error. Thanks _R On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath nick.pentre...@gmail.com wrote: What version of Spark do the other dependencies rely on (Adam and H2O)? That could be it. Or try sbt clean compile — Sent from Mailbox https://www.dropbox.com/mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote: I have an EC2 cluster created using Spark version 1.2.1, and I have an SBT project. Now I want to upgrade to Spark 1.3 and use the new features. Question - Do I have to create a new cluster using Spark 1.3? In my SBT file I changed spark-core to 1.3.0, but then I started getting compilation errors.
Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems
My cluster is still on Spark 1.2 and in SBT I am using 1.3. So it is probably compiling with 1.3 but running with 1.2? On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler deanwamp...@gmail.com wrote: Weird. Are you running using the SBT console? It should have the spark-core jar on the classpath. Similarly, spark-shell or spark-submit should work, but be sure you're using the same version of Spark when running as when compiling. Also, you might need to add spark-sql to your SBT dependencies, but that shouldn't be this issue. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25, 2015 at 12:09 PM, roni roni.epi...@gmail.com wrote: Thanks Dean and Nick. So, I removed ADAM and H2O from my SBT since I was not using them. I got the code to compile - only for it to fail while running with: Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD$ at preDefKmerIntersection$.main(kmerIntersetion.scala:26) on the line where I do the JOIN operation (val joinRDD = bedPair.join(filtered)). Any idea what's going on? I have data on the EC2 cluster, so I am avoiding creating a new cluster and am instead just upgrading and changing the code to use 1.3 and Spark SQL. Thanks Roni On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler deanwamp...@gmail.com wrote: For the Spark SQL parts, 1.3 breaks backwards compatibility, because before 1.3, Spark SQL was considered experimental where API changes were allowed. So, H2O and ADAM compatible with 1.2.X might not work with 1.3. dean
Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems
Is there any way that I can install the new version and remove the previous one? I installed Spark 1.3 on my EC2 master and set SPARK_HOME to the new one. But when I start the spark-shell I get - java.lang.UnsatisfiedLinkError: org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V at org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native Method) Is there no way to upgrade without creating a new cluster? Thanks Roni On Wed, Mar 25, 2015 at 1:18 PM, Dean Wampler deanwamp...@gmail.com wrote: Yes, that's the problem. The RDD class exists in both binary jar files, but the signatures probably don't match. The bottom line, as always for tools like this, is that you can't mix versions. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25, 2015 at 3:13 PM, roni roni.epi...@gmail.com wrote: My cluster is still on Spark 1.2 and in SBT I am using 1.3. So it is probably compiling with 1.3 but running with 1.2? On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler deanwamp...@gmail.com wrote: Weird. Are you running using the SBT console? It should have the spark-core jar on the classpath. Similarly, spark-shell or spark-submit should work, but be sure you're using the same version of Spark when running as when compiling. Also, you might need to add spark-sql to your SBT dependencies, but that shouldn't be this issue.
Re: Cannot run spark-shell command not found.
I think you must have downloaded the Spark source code gz file. It is a little confusing. You have to select the Hadoop version as well, and the actual tgz file will have both the Spark version and the Hadoop version in its name. -R On Mon, Mar 30, 2015 at 10:34 AM, vance46 wang2...@purdue.edu wrote: Hi all, I'm a newbie trying to set up Spark for my research project on a RedHat system. I've downloaded spark-1.3.0.tgz and untarred it, and installed Python, Java and Scala. I've set JAVA_HOME and SCALA_HOME and then tried to use sudo sbt/sbt assembly according to https://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_CentOS6. It pops up with "sbt command not found". I then tried to directly start spark-shell in ./bin using sudo ./bin/spark-shell and still get "command not found". I appreciate your help in advance.
joining multiple parquet files
Hi, I have 4 parquet files and I want to find the data that is common to all of them, e.g. SELECT TableA.*, TableB.*, TableC.*, TableD.* FROM (TableB INNER JOIN TableA ON TableB.aID = TableA.aID) INNER JOIN TableC ON (TableB.cID = TableC.cID) INNER JOIN TableD ON (TableA.dID = TableD.dID) WHERE (DATE(TableC.date) = DATE(NOW())) I can do a 2-file join like - val joinedVal = g1.join(g2, g1.col("kmer") === g2.col("kmer")) But I am trying to find the common kmer strings from all 4 files. Thanks Roni
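One way to express the four-way intersection with the DataFrame API (a sketch; g1..g4 stand for the four parquet files and "kmer" is assumed to be the shared key column - adjust both to the real schema):

    val g1 = sqlContext.parquetFile("file1.parquet")
    val g2 = sqlContext.parquetFile("file2.parquet")
    val g3 = sqlContext.parquetFile("file3.parquet")
    val g4 = sqlContext.parquetFile("file4.parquet")

    // chain inner joins on the common column; only kmers present in all four survive
    val common = g1.join(g2, g1.col("kmer") === g2.col("kmer"))
                   .join(g3, g1.col("kmer") === g3.col("kmer"))
                   .join(g4, g1.col("kmer") === g4.col("kmer"))

    common.select(g1.col("kmer")).show()

Alternatively, register each file as a temp table with registerTempTable and write the same chain of INNER JOINs as a single sqlContext.sql(...) statement.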
spark.sql.Row manipulation
I have 2 parquet files with the format e.g. name, age, town. I read them and then join them to get all the names which are in both towns. The resultant dataset is res4: Array[org.apache.spark.sql.Row] = Array([name1, age1, town1, name2, age2, town2]) name1 and name2 are the same since I am joining on name. Now I want to get only the format (name, age1, age2), but I can't seem to manipulate the spark.sql.Row. Trying something like map(_.split(",")).map(a => (a(0), a(1).trim().toInt)) does not work. Can you suggest a way? Thanks -R
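The rows that come back from a join are not delimited strings, so split(",") has nothing to split; the fields have to be pulled out of the Row by position, or the projection done on the DataFrame before collecting. A small sketch, assuming the joined rows really look like [name1, age1, town1, name2, age2, town2] and that age is an Int in the schema (both assumptions to verify against the actual data):

    // easiest: select the wanted columns on the DataFrame itself, e.g.
    //   joined.select(df1("name"), df1("age"), df2("age"))

    // or, if you already have Rows, extract the fields positionally:
    val slim = joined.map { row =>
      (row.getString(0), row.getInt(1), row.getInt(4))   // (name, age1, age2)
    }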
Re: Resource manager UI for Spark applications
Hi Ted, I used s3://support.elasticmapreduce/spark/install-spark to install Spark on my EMR cluster. It is 1.2.0. When I click on the link for history or logs it takes me to http://ip-172-31-43-116.us-west-2.compute.internal:9035/node/containerlogs/container_1424105590052_0070_01_01/hadoop and I get - The server at ip-172-31-43-116.us-west-2.compute.internal can't be found, because the DNS lookup failed. DNS is the network service that translates a website's name to its Internet address. This error is most often caused by having no connection to the Internet or a misconfigured network. It can also be caused by an unresponsive DNS server or a firewall preventing Google Chrome from accessing the network. I tried changing the internal address to the external one, but it still does not work. Thanks _roni On Tue, Mar 3, 2015 at 9:05 AM, Ted Yu yuzhih...@gmail.com wrote: bq. spark UI does not work for Yarn-cluster. Can you be a bit more specific on the error(s) you saw? What Spark release are you using? Cheers On Tue, Mar 3, 2015 at 8:53 AM, Rohini joshi roni.epi...@gmail.com wrote: Sorry for the half email - here it is again in full. Hi, I have 2 questions - 1. I was trying to use the Resource Manager UI for my Spark application using yarn-cluster mode, as I observed that the Spark UI does not work for yarn-cluster. Is that correct or am I missing some setup? 2. When I click on Application Monitoring or History, I get redirected to some link with an internal IP address. Even if I replace that address with the public IP, it still does not work. What kind of setup changes are needed for that? Thanks -roni
Re: Resource manager UI for Spark applications
Ted, If the application is running then the logs are not available. Plus, what I want to view are the details about the running app, as in the Spark UI. Do I have to open some ports or make some other setting changes? On Tue, Mar 3, 2015 at 10:08 AM, Ted Yu yuzhih...@gmail.com wrote: bq. changing the address with internal to the external one, but still does not work. Not sure what happened. For the time being, you can use the yarn command line to pull the container log (put in your appId and container Id): yarn logs -applicationId application_1386639398517_0007 -containerId container_1386639398517_0007_01_19 Cheers On Tue, Mar 3, 2015 at 9:50 AM, roni roni.epi...@gmail.com wrote: Hi Ted, I used s3://support.elasticmapreduce/spark/install-spark to install Spark on my EMR cluster. It is 1.2.0. When I click on the link for history or logs it takes me to an internal address and the DNS lookup fails. I tried changing the internal address to the external one, but it still does not work. Thanks _roni
Re: Resource manager UI for Spark applications
ah!! I think I know what you mean. My job was just in the accepted state for a long time as it was processing a huge file. But now that it is in the running state, I can see the UI. I can see it on port 9046 though, instead of 4040, but I can see it. Thanks -roni On Tue, Mar 3, 2015 at 1:19 PM, Zhan Zhang zzh...@hortonworks.com wrote: In YARN (cluster or client), you can access the Spark UI when the app is running. After the app is done, you can still access it, but it needs some extra setup for the history server. Thanks. Zhan Zhang On Mar 3, 2015, at 10:08 AM, Ted Yu yuzhih...@gmail.com wrote: bq. changing the address with internal to the external one, but still does not work. Not sure what happened. For the time being, you can use the yarn command line to pull the container log (put in your appId and container Id): yarn logs -applicationId application_1386639398517_0007 -containerId container_1386639398517_0007_01_19 Cheers
Re: issue Running Spark Job on Yarn Cluster
Look at the logs: yarn logs --applicationId <applicationId> That should give the error. On Wed, Mar 4, 2015 at 9:21 AM, sachin Singh sachin.sha...@gmail.com wrote: Not yet. Please let me know if you find a solution. Regards Sachin On 4 Mar 2015 21:45, mael2210 [via Apache Spark User List] wrote: Hello, I am facing the exact same issue. Could you solve the problem? Kind regards
difference in PCA of MLlib vs H2O in R
I am trying to compute PCA using computePrincipalComponents. I also computed PCA using H2O in R and with R's prcomp. The answers I get from H2O and R's prcomp (non-H2O) are the same when I set the option standardized=FALSE for H2O and center=FALSE for R's prcomp. How do I make sure that the settings for MLlib PCA are the same as the ones I am using for H2O or prcomp? Thanks Roni
Storing data in MySQL from spark hive tables
Hi, I am trying to set up the Hive metastore and a MySQL DB connection. I have a Spark cluster, I ran some programs, and I have data stored in some Hive tables. Now I want to store this data in MySQL so that it is available for further processing. I set up the hive-site.xml file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.semantic.analyzer.factory.impl</name>
    <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
  </property>
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://*ip address*:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value></value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/${user.name}/hive-warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
</configuration>
My MySQL server is on a separate server from my Spark master. If I use MySQLWorkbench, I use an SSH connection with a certificate file to connect. How do I specify all that information from Spark to the DB? I want to store the data generated by my Spark program in MySQL. Thanks _R
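Spark and Hive will not open an SSH connection themselves; one common workaround (not Spark-specific, sketched here with placeholder key, user and host names) is to open an SSH tunnel from the Spark master to the MySQL host and point the JDBC URL at the local end of the tunnel:

    ssh -i /path/to/cert.pem -N -f -L 3307:localhost:3306 ec2-user@mysql-host.example.com

With the tunnel up, javax.jdo.option.ConnectionURL becomes jdbc:mysql://localhost:3307/metastore_db?createDatabaseIfNotExist=true. The other option is to skip SSH entirely and open port 3306 of the MySQL host's security group to the Spark master, keeping a direct jdbc:mysql://<mysql-host>:3306/... URL.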
Re: which database for gene alignment data ?
Sorry for the delay. The files (called .bed files) have a format like:
Chromosome  start   end     feature  score  strand
chr1        713776  714375  peak.1   599    +
chr1        752401  753000  peak.2   599    +
The mandatory fields are 1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671). 2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0. 3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99. There can be more data, as described at https://genome.ucsc.edu/FAQ/FAQformat.html#format1 Many times the use cases are like: 1. find the features between given start and end positions; 2. find features which have overlapping start and end points with another feature; 3. read external (reference) data which will have a similar format (chr10 48514785 49604641 MAPK8 49514785 +) and find all the data points which overlap with the other .bed files. The data is huge: .bed files can range from 0.5 GB to 5 GB (or more). I was thinking of using Cassandra, but not sure if the overlapping queries can be supported and will be fast enough. Thanks for the help -Roni On Sat, Jun 6, 2015 at 7:03 AM, Ted Yu yuzhih...@gmail.com wrote: Can you describe your use case in a bit more detail, since not all people on this mailing list are familiar with gene sequencing alignment data? Thanks On Fri, Jun 5, 2015 at 11:42 PM, roni roni.epi...@gmail.com wrote: I want to use Spark for reading compressed .bed files of gene sequencing alignment data. I want to store the bed file data in a DB and then use external gene expression data to find overlaps etc. Which database is best for it? Thanks -Roni
which database for gene alignment data ?
I want to use Spark for reading compressed .bed files of gene sequencing alignment data. I want to store the bed file data in a DB and then use external gene expression data to find overlaps etc. Which database is best for it? Thanks -Roni
Re: which database for gene alignment data ?
Hi Frank, Thanks for the reply. I downloaded ADAM and built it, but it does not seem to list this function among the command-line options. Are these exposed as a public API that I can call from code? Also, I need to save all my intermediate data. It seems like ADAM stores data in Parquet on HDFS. I want to save something in an external database, so that the saved data can be re-used in multiple ways by multiple people. Any suggestions on the DB selection, or on keeping data centralized for use by multiple distinct groups? Thanks -Roni On Mon, Jun 8, 2015 at 12:47 PM, Frank Austin Nothaft fnoth...@berkeley.edu wrote: Hi Roni, We have a full suite of genomic feature parsers that can read BED, narrowPeak, GATK interval lists, and GTF/GFF into Spark RDDs in ADAM https://github.com/bigdatagenomics/adam Additionally, we have support for efficient overlap joins (query 3 in your email below). You can load the genomic features with ADAMContext.loadFeatures https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala#L438. We have two tools for the overlap computation: you can use a BroadcastRegionJoin https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/BroadcastRegionJoin.scala if one of the datasets you want to overlap is small, or a ShuffleRegionJoin https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ShuffleRegionJoin.scala if both datasets are large. Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Jun 8, 2015, at 9:39 PM, roni roni.epi...@gmail.com wrote: Sorry for the delay. The files (called .bed files) have a format like - Chromosome start end feature score strand ... The data is huge: .bed files can range from 0.5 GB to 5 GB (or more). I was thinking of using Cassandra, but not sure if the overlapping queries can be supported and will be fast enough. Thanks for the help -Roni
Re: Is anyone using Amazon EC2? (second attempt!)
Hi, Any update on this? I am not sure if the issue I am seeing is related. I have 8 slaves, and when I created the cluster I specified an EBS volume of 100 GB. On EC2 I see 8 volumes created, each attached to the corresponding slave. But when I try to copy data onto it, it complains: /root/ephemeral-hdfs/bin/hadoop fs -cp /intersection hdfs://ec2-54-149-112-136.us-west-2.compute.amazonaws.com:9010/ 2015-05-28 23:40:35,447 WARN hdfs.DFSClient (DFSOutputStream.java:run(562)) - DataStreamer Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /intersection/kmer150/commonGoodKmers/_temporary/_attempt_201504010056_0004_m_000428_3948/part-00428._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation. It shows only 1 datanode, but ephemeral-hdfs shows 8 datanodes. Any thoughts? Thanks _R On Sat, May 23, 2015 at 7:24 AM, Joe Wass jw...@crossref.org wrote: I used Spark on EC2 a while ago, but recent revisions seem to have broken the functionality. Is anyone actually using Spark on EC2 at the moment? The bug in question is: https://issues.apache.org/jira/browse/SPARK-5008 It makes it impossible to use persistent HDFS without a workaround on each slave node. No-one seems to be interested in the bug, so I wonder if other people aren't actually having this problem. If this is the case, any suggestions? Joe
Re: Do I really need to build Spark for Hive/Thrift Server support?
Hi All, Any explanation for this? As Reece said, I can do operations with Hive, but val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) gives an error. I have already created the Spark EC2 cluster with the spark-ec2 script. How can I build it again? Thanks _Roni On Tue, Jul 28, 2015 at 2:46 PM, ReeceRobinson re...@therobinsons.gen.nz wrote: I am building an analytics environment based on Spark and want to use HIVE in multi-user mode, i.e. not use the embedded Derby database but use Postgres and HDFS instead. I am using the included Spark Thrift Server to process queries using Spark SQL. The documentation gives me the impression that I need to create a custom build of Spark 1.4.1. However I don't think this is either accurate now OR it is for some different context I'm not aware of? I used the pre-built Spark 1.4.1 distribution today with my hive-site.xml for Postgres and HDFS and it worked! I see the warehouse files turn up in HDFS and I see the metadata inserted into Postgres when I created a test table. I can connect to the Thrift Server using beeline and perform queries on my data. I also verified using the Spark UI that the SQL is being processed by Spark SQL. So I guess I'm asking: is the document out-of-date or am I missing something? Cheers, Reece
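In case the Spark build on the EC2 cluster really does lack the Hive classes, one option (a sketch, assuming a Spark 1.4.x source checkout and Maven; the Hadoop profile should match your cluster) is to build a new distribution with the Hive and Thrift-server profiles enabled and unpack it in place of the existing /root/spark on the master and slaves:

    ./make-distribution.sh --name hive-build --tgz -Pyarn -Phadoop-2.4 -Phive -Phive-thriftserver

The pre-built binaries on the Apache download page generally include Hive support already, so dropping one of those onto the cluster is an alternative to compiling.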
No suitable driver found for jdbc:mysql://
Hi All, I have a cluster with Spark 1.4. I am trying to save data to MySQL but am getting the error: Exception in thread main java.sql.SQLException: No suitable driver found for jdbc:mysql://.rds.amazonaws.com:3306/DAE_kmer?user=password= I looked at https://issues.apache.org/jira/browse/SPARK-8463 and added the connector jar to the same location as on the master using the copy-dir script, but I am still getting the same error. This used to work with 1.3. This is my command to run the program: $SPARK_HOME/bin/spark-submit --jars /root/spark/lib/mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar --conf spark.executor.extraClassPath=/root/spark/lib/mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar --conf spark.executor.memory=55g --driver-memory=55g --master=spark://ec2-52-25-191-999.us-west-2.compute.amazonaws.com:7077 --class saveBedToDB target/scala-2.10/adam-project_2.10-1.0.jar What else can I do? Thanks -Roni
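Two things that have helped with this class of error (both hedged, since the exact behavior changed between 1.3 and 1.4): pass the connector jar to the driver as well with --driver-class-path (the --jars/executor extraClassPath settings above only cover the executors), and register the driver explicitly before writing. A minimal sketch, assuming a DataFrame df and placeholder host, table name and credentials:

    import java.util.Properties

    // make sure the MySQL driver is registered with java.sql.DriverManager in the driver JVM
    Class.forName("com.mysql.jdbc.Driver")

    val props = new Properties()
    props.put("user", "myuser")                       // placeholder
    props.put("password", "mypassword")               // placeholder
    props.put("driver", "com.mysql.jdbc.Driver")      // hint for Spark's JDBC code path

    df.write.jdbc("jdbc:mysql://<host>.rds.amazonaws.com:3306/DAE_kmer", "kmer_counts", props)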
Upgrade spark cluster to latest version
Hi Spark experts, This may be a very naive question, but can you please point me to the proper way to upgrade the Spark version on an existing cluster? Thanks Roni > Hi, > I have a cluster currently running Spark 1.4 and want to upgrade to the latest > version. > How can I do it without creating a new cluster, so that all my other > settings don't get erased? > Thanks > _R >
connecting to remote spark and reading files on HDFS or s3 in sparkR
I have Spark installed on an EC2 cluster. Can I connect to it from my local SparkR in RStudio? If yes, how? Can I read files which I have saved as parquet files on HDFS or S3 in SparkR? If yes, how? Thanks -Roni
reading files on HDFS /s3 in sparkR -failing
I am trying this - ddf <- parquetFile(sqlContext, "hdfs://ec2-52-26-180-130.us-west-2.compute.amazonaws.com:9000/IPF_14_1.parquet") and I get path[1]="hdfs://ec2-52-26-180-130.us-west-2.compute.amazonaws.com:9000/IPF_14_1.parquet": No such file or directory When I read a file on s3, I get - java.io.IOException: No FileSystem for scheme: s3 Thanks in advance. -Roni
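A minimal SparkR sketch of the reading side, assuming the master/namenode hosts and bucket names below are placeholders and that the cluster's Hadoop has the s3n connector plus AWS credentials configured; plain s3:// is what produces "No FileSystem for scheme: s3" on a spark-ec2 cluster, so s3n:// is used instead:

    library(SparkR)
    sc <- sparkR.init(master = "spark://<master-host>:7077")
    sqlContext <- sparkRSQL.init(sc)

    # parquet on HDFS - use the namenode host and port the cluster actually runs on
    ddf <- read.df(sqlContext, "hdfs://<namenode-host>:9000/IPF_14_1.parquet", source = "parquet")

    # parquet on S3 through the s3n filesystem
    sdf <- read.df(sqlContext, "s3n://<bucket>/path/to/file.parquet", source = "parquet")

    head(ddf)

If the HDFS path still comes back as "No such file or directory", it is worth checking with hadoop fs -ls whether the file was written to the ephemeral or the persistent HDFS, since they listen on different ports.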
Re: connecting to remote spark and reading files on HDFS or s3 in sparkR
Thanks Akhil. Very good article. On Mon, Sep 14, 2015 at 4:15 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote: > You can look into this doc > <https://www.sigmoid.com/how-to-run-sparkr-with-rstudio/> regarding the > connection (it's for GCE though, but it should be similar). > > Thanks > Best Regards > > On Thu, Sep 10, 2015 at 11:20 PM, roni <roni.epi...@gmail.com> wrote: > >> I have Spark installed on an EC2 cluster. Can I connect to it from my >> local SparkR in RStudio? If yes, how? >> >> Can I read files which I have saved as parquet files on HDFS or S3 in >> SparkR? If yes, how? >> >> Thanks >> -Roni >> >> >
Re: cannot coerce class "data.frame" to a DataFrame - with spark R
Thanks Felix . I tried that but I still get the same error . Am I missing something? dds <- DESeqDataSetFromMatrix(countData = collect(countMat), colData = collect(colData), design = design) Error in DataFrame(colData, row.names = rownames(colData)) : cannot coerce class "data.frame" to a DataFrame On Thu, Feb 18, 2016 at 9:03 PM, Felix Cheung <felixcheun...@hotmail.com> wrote: > Doesn't DESeqDataSetFromMatrix work with data.frame only? It wouldn't work > with Spark's DataFrame - try collect(countMat) and others to convert them > into data.frame? > > > _ > From: roni <roni.epi...@gmail.com> > Sent: Thursday, February 18, 2016 4:55 PM > Subject: cannot coerce class "data.frame" to a DataFrame - with spark R > To: <user@spark.apache.org> > > > > Hi , > I am trying to convert a bioinformatics R script to use spark R. It uses > external bioconductor package (DESeq2) so the only conversion really I have > made is to change the way it reads the input file. > > When I call my external R library function in DESeq2 I get error cannot > coerce class "data.frame" to a DataFrame . > > I am listing my old R code and new spark R code below and the line giving > problem is in RED. > ORIGINAL R - > library(plyr) > library(dplyr) > library(DESeq2) > library(pheatmap) > library(gplots) > library(RColorBrewer) > library(matrixStats) > library(pheatmap) > library(ggplot2) > library(hexbin) > library(corrplot) > > sampleDictFile <- "/160208.txt" > sampleDict <- read.table(sampleDictFile) > > peaks <- read.table("/Atlas.txt") > countMat <- read.table("/cntMatrix.txt", header = TRUE, sep = "\t") > > colnames(countMat) <- sampleDict$sample > rownames(peaks) <- rownames(countMat) <- paste0(peaks$seqnames, ":", > peaks$start, "-", peaks$end, " ", peaks$symbol) > peaks$id <- rownames(peaks) > > # > SPARK R CODE > peaks <- (read.csv("/Atlas.txt",header = TRUE, sep = "\t"))) > sampleDict<- (read.csv("/160208.txt",header = TRUE, sep = "\t", > stringsAsFactors = FALSE)) > countMat<- (read.csv("/cntMatrix.txt",header = TRUE, sep = "\t")) > --- > > COMMON CODE for both - > > countMat <- countMat[, sampleDict$sample] > colData <- sampleDict[,"group", drop = FALSE] > design <- ~ group > > * dds <- DESeqDataSetFromMatrix(countData = countMat, colData = > colData, design = design)* > > This line gives error - dds <- DESeqDataSetFromMatrix(countData = > countMat, colData = (colData), design = design) > Error in DataFrame(colData, row.names = rownames(colData)) : > cannot coerce class "data.frame" to a DataFrame > > I tried as.data.frame or using DataFrame to wrap the defs , but no luck. > What Can I do differently? > > Thanks > Roni > > > >
Re: sparkR issues ?
Hi, Is there a workaround for this? Do I need to file a bug for this? Thanks -R On Tue, Mar 15, 2016 at 12:28 AM, Sun, Rui <rui@intel.com> wrote: > It seems as.data.frame() defined in SparkR covers the version in the R base > package. > > We can try to see if we can change the implementation of as.data.frame() > in SparkR to avoid such covering. > > *From:* Alex Kozlov [mailto:ale...@gmail.com] > *Sent:* Tuesday, March 15, 2016 2:59 PM > *To:* roni <roni.epi...@gmail.com> > *Cc:* user@spark.apache.org > *Subject:* Re: sparkR issues ? > > This seems to be a very unfortunate name collision. SparkR defines its > own DataFrame class which shadows what seems to be your own definition. > > Is DataFrame something you define? Can you rename it? > > -- > Alex Kozlov > (408) 507-4987 > (650) 887-2135 efax > ale...@gmail.com >
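Given Sun Rui's diagnosis (SparkR masking functions that the Bioconductor side expects), one workaround that may avoid the collision without touching SparkR is to qualify the Bioconductor constructors explicitly instead of relying on whichever package was attached last. A sketch, assuming DESeq2 and its S4Vectors dependency are installed (this sidesteps the masked generic rather than fixing SparkR itself):

    library(SparkR)
    library(DESeq2)

    countData <- matrix(1:100, ncol = 4)
    condition <- factor(c("A", "A", "B", "B"))

    # call the Bioconductor constructor with an explicit namespace so SparkR's
    # DataFrame / as.data.frame do not shadow it
    dds <- DESeqDataSetFromMatrix(countData,
                                  S4Vectors::DataFrame(condition),
                                  ~ condition)

    # base::as.data.frame(condition) is the equivalent explicit call when a plain
    # data.frame is what the function expects

Loading SparkR after the Bioconductor packages, or detaching it while running the DESeq2 steps, is the other common way around this kind of masking.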
Re: sparkR issues ?
Alex, No, I have not defined the "DataFrame"; it's the default Spark DataFrame. That line is just casting the factor as a data frame to send to the function. Thanks -R On Mon, Mar 14, 2016 at 11:58 PM, Alex Kozlov <ale...@gmail.com> wrote: > This seems to be a very unfortunate name collision. SparkR defines its > own DataFrame class which shadows what seems to be your own definition. > > Is DataFrame something you define? Can you rename it? > > On Mon, Mar 14, 2016 at 10:44 PM, roni <roni.epi...@gmail.com> wrote: > >> Hi, >> I am working with bioinformatics and trying to convert some scripts to >> SparkR to fit into other Spark jobs. >> >> I tried a simple example from a bioinf lib and as soon as I start the SparkR >> environment it does not work. >> >> code as follows - >> countData <- matrix(1:100,ncol=4) >> condition <- factor(c("A","A","B","B")) >> dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ >> condition) >> >> Works if I don't initialize the SparkR environment. >> if I do library(SparkR) and sqlContext <- sparkRSQL.init(sc) it gives the >> following error >> >> > dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~ >> condition) >> Error in DataFrame(colData, row.names = rownames(colData)) : >> cannot coerce class "data.frame" to a DataFrame >> >> I am really stumped. I am not using any Spark function, so I would >> expect it to work as simple R code. >> Why does it not work? >> >> Appreciate the help >> -R >> >> > > > -- > Alex Kozlov > (408) 507-4987 > (650) 887-2135 efax > ale...@gmail.com >
sparkR issues ?
Hi, I am working in bioinformatics and trying to convert some scripts to SparkR to fit into other Spark jobs. I tried a simple example from a bioinformatics library, and as soon as I start the SparkR environment it stops working. Code as follows - countData <- matrix(1:100, ncol=4) condition <- factor(c("A","A","B","B")) dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition) This works if I don't initialize the SparkR environment. If I do library(SparkR) and sqlContext <- sparkRSQL.init(sc), it gives the following error > dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~ condition) Error in DataFrame(colData, row.names = rownames(colData)) : cannot coerce class "data.frame" to a DataFrame I am really stumped. I am not using any Spark function, so I would expect it to work as plain R code. Why does it not work? Appreciate the help -R
bisecting kmeans model tree
Hi, I want to get the bisecting k-means tree structure to show a dendrogram on the heatmap I am generating based on hierarchical clustering of the data. How do I get that using MLlib? Thanks -Roni
bisecting kmeans tree
Hi, I want to get the bisecting k-means tree structure to show on the heatmap I am generating based on hierarchical clustering of the data. How do I get that using MLlib? Thanks -R
Re: bisecting kmeans model tree
Hi Spark/MLlib experts, Can anyone shed light on this? Thanks -R On Thu, Apr 21, 2016 at 12:46 PM, roni <roni.epi...@gmail.com> wrote: > Hi, > I want to get the bisecting k-means tree structure to show a dendrogram on > the heatmap I am generating based on hierarchical clustering of the data. > How do I get that using MLlib? > Thanks > -Roni >
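As far as the public API in the Spark versions discussed here goes, neither the RDD-based nor the DataFrame-based BisectingKMeansModel appears to expose the internal cluster tree; only the leaf cluster centers and per-row assignments are accessible. A minimal Scala sketch of what the public API does offer (the column names and k = 8 are assumptions for illustration, not from the thread):

import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.feature.VectorAssembler

// Assume df is a DataFrame with numeric columns c1, c2, c3 (hypothetical names).
val features = new VectorAssembler()
  .setInputCols(Array("c1", "c2", "c3"))
  .setOutputCol("features")
  .transform(df)

// Fit bisecting k-means; k is the number of leaf clusters.
val model = new BisectingKMeans().setK(8).setSeed(1L).fit(features)

// Public API: leaf cluster centers and per-row assignments. The binary
// tree built during bisection is internal to MLlib and not returned.
model.clusterCenters.foreach(println)
val assigned = model.transform(features)  // adds a "prediction" column

To draw a dendrogram next to a heatmap, one workaround is to collect the cluster centers (a small local array) and run ordinary hierarchical clustering on them on the driver, since the bisection tree itself cannot be read back.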
MLIB and R results do not match for SVD
Hi All, Some time back I asked about PCA results not matching between R and MLlib. It was suggested that I use svd.V instead of PCA to match the uncentered PCA. But the SVD results from MLlib and R do not match either. I can understand the numbers not matching exactly, but the distribution of the data when plotted does not match. Is there something I can do to fix this? I mostly have tall-skinny matrix data with 60+ rows and up to 100+ columns. Any help is appreciated. Thanks -R
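Two things commonly make MLlib and R SVD plots look different even when both are correct: singular vectors are only determined up to a per-column sign flip, and R's prcomp() centers the columns before decomposing, whereas RowMatrix.computeSVD decomposes the matrix exactly as given. A minimal Scala sketch of the MLlib side to compare against R's svd() on the same uncentered matrix (the toy data below is made up; in practice it would be the tall-skinny matrix loaded from file):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Hypothetical toy matrix, one observation per row.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)
))
val mat = new RowMatrix(rows)

// Top-k SVD of the matrix as-is (no centering): mat ~= U * diag(s) * V^T
val svd = mat.computeSVD(k = 2, computeU = true)
println(svd.s)  // singular values, descending
println(svd.V)  // right singular vectors; compare with svd(A)$v in R,
                // allowing for per-column sign flips

If the R side is prcomp(), subtract each column mean from the matrix before computeSVD (or compare against svd(scale(A, center = TRUE, scale = FALSE)) in R); otherwise the projected distributions will genuinely differ.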
support vector regression in spark
Hi All, I want to know how I can do support vector regression in Spark. Thanks R
SVM regression in Spark
Hi All, I am trying to port my R code to Spark. I am using SVM regression in R. It seems that Spark provides SVM classification only. How can I get regression results? In my R code I call the svm() function in library("e1071") (ftp://cran.r-project.org/pub/R/web/packages/e1071/vignettes/svmdoc.pdf): svrObj <- svm(x, y, scale = TRUE, type = "nu-regression", kernel = "linear", nu = .9) Once I get the svm object back, I get the - from the values. How can I do this in Spark? Thanks in advance Roni
Re: SVM regression in Spark
Hi Spark experts, Can anyone help with doing SVR (support vector regression) in Spark? Thanks R On Tue, Nov 29, 2016 at 6:50 PM, roni <roni.epi...@gmail.com> wrote: > Hi All, > I am trying to port my R code to Spark. I am using SVM regression in R. > It seems that Spark provides SVM classification only. > How can I get regression results? > In my R code I call the svm() function in library("e1071") ( > ftp://cran.r-project.org/pub/R/web/packages/e1071/vignettes/svmdoc.pdf): > svrObj <- svm(x, > y, > scale = TRUE, > type = "nu-regression", > kernel = "linear", > nu = .9) > > Once I get the svm object back, I get the - > from the values. > > How can I do this in Spark? > Thanks in advance > Roni
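Spark's built-in SVM support (for example mllib's SVMWithSGD) is classification-only; there is no built-in support vector regression, so e1071's nu-regression has no direct equivalent in Spark ML/MLlib. If an approximate stand-in is acceptable, one option is to use another Spark ML regressor instead. The sketch below is not SVR, just a built-in regressor used in its place, and the column names are assumptions:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.GBTRegressor

// Assume df has numeric predictor columns x1..x3 and a numeric label y
// (hypothetical names). Gradient-boosted trees stand in for the SVR model.
val data = new VectorAssembler()
  .setInputCols(Array("x1", "x2", "x3"))
  .setOutputCol("features")
  .transform(df)
  .select("features", "y")

val model = new GBTRegressor()
  .setLabelCol("y")
  .setFeaturesCol("features")
  .setMaxIter(50)
  .fit(data)

val predictions = model.transform(data)  // adds a "prediction" column

The other common route is to keep the e1071 svm() call as-is on a sample of the data collected to the driver, and use Spark only for the heavy preprocessing.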
Re: calculate diff of value and median in a group
I was using this function percentile_approx on 100GB of compressed data and it just hangs there. Any pointers? On Wed, Mar 22, 2017 at 6:09 PM, ayan guha wrote: > For median, use percentile_approx with 0.5 (50th percentile is the median) > > On Thu, Mar 23, 2017 at 11:01 AM, Yong Zhang wrote: > >> He is looking for median, not mean/avg. >> >> >> You have to implement the median logic yourself, as there is no direct >> implementation in Spark. You can use the RDD API if you are using >> 1.6.x, or the Dataset API if 2.x. >> >> >> The following example gives you an idea how to calculate the median using the >> Dataset API. You can even change the code to add additional logic to >> calculate the diff of every value from the median. >> >> >> scala> spark.version >> res31: String = 2.1.0 >> >> scala> val ds = Seq((100,0.43),(100,0.33),(100,0.73),(101,0.29),(101,0.96), >> (101,0.42),(101,0.01)).toDF("id","value").as[(Int, Double)] >> ds: org.apache.spark.sql.Dataset[(Int, Double)] = [id: int, value: double] >> >> scala> ds.show >> +---+-----+ >> | id|value| >> +---+-----+ >> |100| 0.43| >> |100| 0.33| >> |100| 0.73| >> |101| 0.29| >> |101| 0.96| >> |101| 0.42| >> |101| 0.01| >> +---+-----+ >> >> scala> def median(seq: Seq[Double]) = { >> | val size = seq.size >> | val sorted = seq.sorted >> | size match { >> | case even if size % 2 == 0 => (sorted((size-2)/2) + >> sorted(size/2)) / 2 >> | case odd => sorted((size-1)/2) >> | } >> | } >> median: (seq: Seq[Double])Double >> >> scala> ds.groupByKey(_._1).mapGroups((id, iter) => (id, >> median(iter.map(_._2).toSeq))).show >> +---+-----+ >> | _1| _2| >> +---+-----+ >> |101|0.355| >> |100| 0.43| >> +---+-----+ >> >> >> Yong >> >> >> >> >> -- >> *From:* ayan guha >> *Sent:* Wednesday, March 22, 2017 7:23 PM >> *To:* Craig Ching >> *Cc:* Yong Zhang; user@spark.apache.org >> *Subject:* Re: calculate diff of value and median in a group >> >> I would suggest using a window function with partitioning. >> >> select group1, group2, name, value, avg(value) over (partition by group1, group2 >> order by name) m >> from t >> >> On Thu, Mar 23, 2017 at 9:58 AM, Craig Ching >> wrote: >> >>> Are the element counts big per group? If not, you can group them and use >>> the code to calculate the median and diff. >>> >>> >>> They're not big, no. Any pointers on how I might do that? The part I'm >>> having trouble with is the grouping, I can't seem to see how to do the >>> median per group. For mean, we have the agg feature, but not for median >>> (and I understand the reasons for that). >>> >>> Yong >>> >>> -- >>> *From:* Craig Ching >>> *Sent:* Wednesday, March 22, 2017 3:17 PM >>> *To:* user@spark.apache.org >>> *Subject:* calculate diff of value and median in a group >>> >>> Hi, >>> >>> When using pyspark, I'd like to be able to calculate the difference >>> between grouped values and their median for the group. Is this possible? >>> Here is some code I hacked up that does what I want except that it >>> calculates the grouped diff from mean.
Also, please feel free to comment >>> on how I could make this better if you feel like being helpful :) >>> >>> from pyspark import SparkContext >>> from pyspark.sql import SparkSession >>> from pyspark.sql.types import ( >>> StringType, >>> LongType, >>> DoubleType, >>> StructField, >>> StructType >>> ) >>> from pyspark.sql import functions as F >>> >>> >>> sc = SparkContext(appName='myapp') >>> spark = SparkSession(sc) >>> >>> file_name = 'data.csv' >>> >>> fields = [ >>> StructField( >>> 'group2', >>> LongType(), >>> True), >>> StructField( >>> 'name', >>> StringType(), >>> True), >>> StructField( >>> 'value', >>> DoubleType(), >>> True), >>> StructField( >>> 'group1', >>> LongType(), >>> True) >>> ] >>> schema = StructType(fields) >>> >>> df = spark.read.csv( >>> file_name, header=False, mode="DROPMALFORMED", schema=schema >>> ) >>> df.show() >>> means = df.select([ >>> 'group1', >>> 'group2', >>> 'name', >>> 'value']).groupBy([ >>> 'group1', >>> 'group2' >>> ]).agg( >>> F.mean('value').alias('mean_value') >>> ).orderBy('group1', 'group2') >>> >>> cond = [df.group1 == means.group1, df.group2 == means.group2] >>> >>> means.show() >>> df = df.select([ >>> 'group1', >>> 'group2', >>> 'name', >>> 'value']).join( >>> means, >>> cond >>> ).drop( >>> df.group1 >>> ).drop( >>> df.group2 >>> ).select('group1', >>> 'group2', >>> 'name', >>> 'value', >>> 'mean_value') >>> >>> final = df.withColumn( >>> 'diff', >>> F.abs(df.value - df.mean_value))
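The grouped median plus diff can also be expressed with the built-in approximate percentile aggregate instead of collecting each group, staying in the DataFrame API throughout. A minimal sketch, in Scala to match Yong's example, using the column names from Craig's schema (group1, group2, value); the accuracy argument of 10000 is an assumption that trades memory for precision:

import org.apache.spark.sql.functions.{abs, col, expr}

// Per-group approximate median via the percentile_approx SQL aggregate.
val medians = df
  .groupBy("group1", "group2")
  .agg(expr("percentile_approx(value, 0.5, 10000)").as("median_value"))

// Join the medians back and compute the absolute difference per row.
val withDiff = df
  .join(medians, Seq("group1", "group2"))
  .withColumn("diff", abs(col("value") - col("median_value")))

withDiff.show()

Whether this helps with the 100 GB hang reported above cannot be told from the thread alone; the usual things to check are spark.sql.shuffle.partitions, skewed group keys, and lowering the accuracy argument.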