saving or visualizing PCA

2015-03-18 Thread roni
Hi,
 I am generating a PCA using Spark, but I don't know how to save it to disk
or visualize it.
Can someone give me some pointers please?
Thanks
-Roni
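
A minimal sketch of one way to persist the MLlib PCA output so it can be plotted
elsewhere - this assumes the data is already in a RowMatrix named mat; k and the
output path are illustrative:

import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val pc: Matrix = mat.computePrincipalComponents(2)   // principal components as a small local matrix
val projected: RowMatrix = mat.multiply(pc)          // project the rows into the PCA space
// write the projected rows out as CSV text so they can be loaded into R/Python for plotting
projected.rows.map(_.toArray.mkString(",")).saveAsTextFile("hdfs:///tmp/pca_projection")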


FetchFailedException: Adjusted frame length exceeds 2147483647: 12716268407 - discarded

2015-03-19 Thread roni
I get 2 types of error -
-org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 0 and
FetchFailedException: Adjusted frame length exceeds 2147483647: 12716268407
- discarded

Spark keeps retrying to submit the code and keeps getting this error.

My file on which I am finding the sliding-window strings is 500 MB, and I
am doing it with length = 150.
It works fine while the length is up to 100.

This is my code -
 val hgfasta = sc.textFile(args(0)) // read the fasta file
 val kCount = hgfasta.flatMap(r => r.sliding(args(2).toInt))          // sliding windows of length args(2)
 val kmerCount = kCount.map(x => (x, 1)).reduceByKey(_ + _)           // count each kmer
   .map { case (x, y) => (y, x) }.sortByKey(false)                    // sort by count, descending
   .map { case (i, j) => (j, i) }

   val filtered = kmerCount.filter(kv => kv._2 > 5)                   // keep kmers seen more than 5 times
   filtered.map(kv => kv._1 + "," + kv._2.toLong).saveAsTextFile(args(1))

  }
It gets stuck at the flatMap and saveAsTextFile stages and throws
-org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 0 and

org.apache.spark.shuffle.FetchFailedException: Adjusted frame length
exceeds 2147483647: 12716268407 - discarded
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
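
One common mitigation for the 2147483647-byte (Int.MaxValue) frame limit is to
raise the shuffle parallelism so that no single shuffle block approaches 2 GB -
a sketch only, with an illustrative partition count:

val kmerCount = kCount.map(x => (x, 1))
  .reduceByKey(_ + _, 2000)   // explicit numPartitions => more, smaller shuffle blocks

Repartitioning the input (e.g. hgfasta.repartition(2000)) before the flatMap is
another way to spread the same data over more tasks.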


distcp problems on ec2 standalone spark cluster

2015-03-09 Thread roni
I got past the 'cluster not started' problem by setting mapreduce.framework.name
to yarn.
But when I try to distcp with a URI of the form s3://path-to-my-bucket, I
get an invalid-path error even though the bucket exists.
If I use s3n:// it just hangs.
Did anyone else face anything like that?

I also noticed that this script installs the Cloudera Hadoop image. Does
that matter?
Thanks
-R


spark-ec2 script problems

2015-03-05 Thread roni
Hi ,
 I used the spark-ec2 script to create an EC2 cluster.

 Now I am trying to copy data from S3 into HDFS.
I am doing this:
[root@ip-172-31-21-160 ephemeral-hdfs]$ bin/hadoop distcp
s3://xxx/home/mydata/small.sam
hdfs://ec2-52-11-148-31.us-west-2.compute.amazonaws.com:9010/data1

and I get the following error -

2015-03-06 01:39:27,299 INFO  tools.DistCp (DistCp.java:run(109)) - Input
Options: DistCpOptions{atomicCommit=false, syncFolder=false,
deleteMissing=false, ignoreFailures=false, maxMaps=20,
sslConfigurationFile='null', copyStrategy='uniformsize',
sourceFileListing=null, sourcePaths=[s3://xxX/home/mydata/small.sam],
targetPath=hdfs://
ec2-52-11-148-31.us-west-2.compute.amazonaws.com:9010/data1}
2015-03-06 01:39:27,585 INFO  mapreduce.Cluster
(Cluster.java:initialize(114)) - Failed to use
org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid
mapreduce.jobtracker.address configuration value for LocalJobRunner : 
ec2-52-11-148-31.us-west-2.compute.amazonaws.com:9001
2015-03-06 01:39:27,585 ERROR tools.DistCp (DistCp.java:run(126)) -
Exception encountered
java.io.IOException: Cannot initialize Cluster. Please check your
configuration for mapreduce.framework.name and the correspond server
addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

I tried running start-all.sh, start-dfs.sh and start-yarn.sh.

What should I do?
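
For reference, a hedged sketch of forcing distcp onto the YARN framework from
the command line (distcp accepts generic Hadoop -D options; the paths and host
below just reuse the ones above):

bin/hadoop distcp -Dmapreduce.framework.name=yarn s3://xxx/home/mydata/small.sam hdfs://ec2-52-11-148-31.us-west-2.compute.amazonaws.com:9010/data1

Setting mapreduce.framework.name permanently in mapred-site.xml achieves the
same thing; the 2015-03-09 message above reports that this setting got past the
LocalJobRunner error.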
Thanks
-roni


Re: distcp on ec2 standalone spark cluster

2015-03-07 Thread roni
Did you get this to work?
I got past the 'cluster not started' problem, but
I am having a problem where distcp with an s3:// URI says incorrect folder path and
s3n:// just hangs.
I've been stuck for 2 days :(
Thanks
-R






Re: Setting up Spark with YARN on EC2 cluster

2015-03-10 Thread roni
Hi Harika,
Did you get any solution for this?
I want to use yarn , but the spark-ec2 script does not support it.
Thanks 
-Roni







Re: difference in PCA of MLlib vs H2O in R

2015-03-24 Thread roni
Reza,
That SVD V matches the H2O and R prcomp (non-centered) output.
Thanks
-R
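
For reference, a minimal sketch of the uncentered approach described below -
computeSVD on a RowMatrix, keeping the V matrix (assumes an RDD[Vector] named
rows; k is illustrative):

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)
val svd = mat.computeSVD(10, computeU = false)   // k = 10 right singular vectors
val v = svd.V                                    // compare against the uncentered prcomp/H2O loadings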

On Tue, Mar 24, 2015 at 11:38 AM, Sean Owen so...@cloudera.com wrote:

 (Oh sorry, I've only been thinking of TallSkinnySVD)

 On Tue, Mar 24, 2015 at 6:36 PM, Reza Zadeh r...@databricks.com wrote:
  If you want to do a nonstandard (or uncentered) PCA, you can call
  computeSVD on RowMatrix, and look at the resulting 'V' Matrix.
 
  That should match the output of the other two systems.
 
  Reza
 
  On Tue, Mar 24, 2015 at 3:53 AM, Sean Owen so...@cloudera.com wrote:
 
  Those implementations are computing an SVD of the input matrix
  directly, and while you generally need the columns to have mean 0, you
  can turn that off with the options you cite.
 
  I don't think this is possible in the MLlib implementation, since it
  is computing the principal components by computing eigenvectors of the
  covariance matrix. The means inherently don't matter either way in
  this computation.
 
  On Tue, Mar 24, 2015 at 6:13 AM, roni roni.epi...@gmail.com wrote:
   I am trying to compute PCA  using  computePrincipalComponents.
   I  also computed PCA using h2o in R and R's prcomp. The answers I get
   from
   H2o and R's prComp (non h2o) is same when I set the options for H2o as
   standardized=FALSE and for r's prcomp as center = false.
  
   How do I make sure that the settings for MLib PCA is same as I am
 using
   for
   H2o or prcomp.
  
   Thanks
   Roni
 
 
 



upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
I have an EC2 cluster created using Spark version 1.2.1,
and I have an SBT project.
Now I want to upgrade to Spark 1.3 and use the new features.
Below are the issues.
Sorry for the long post.
Appreciate your help.
Thanks
-Roni

Question - Do I have to create a new cluster using spark 1.3?

Here is what I did -

In my SBT file I changed to -
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"

But then I started getting compilation errors, along with:
Here are some of the libraries that were evicted:
[warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
[warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) -> 2.6.0
[warn] Run 'evicted' to see detailed eviction warnings

 constructor cannot be instantiated to expected type;
[error]  found   : (T1, T2)
[error]  required: org.apache.spark.sql.catalyst.expressions.Row
[error] val ty = joinRDD.map{case(word,
(file1Counts, file2Counts)) => KmerIntesect(word, file1Counts, "xyz")}
[error]  ^

Here is my SBT and code --
SBT -

version := "1.0"

scalaVersion := "2.10.4"

resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
resolvers += "Maven Repo1" at "https://repo1.maven.org/maven2"
resolvers += "Maven Repo" at "https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/"

/* Dependencies - %% appends Scala version to artifactId */
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
libraryDependencies += "org.bdgenomics.adam" % "adam-core" % "0.16.0"
libraryDependencies += "ai.h2o" % "sparkling-water-core_2.10" % "0.2.10"


CODE --
import org.apache.spark.{SparkConf, SparkContext}

case class KmerIntesect(kmer: String, kCount: Int, fileName: String)

object preDefKmerIntersection {
  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("preDefKmer-intersect")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD
    val bedFile = sc.textFile("s3n://a/b/c", 40)
    val hgfasta = sc.textFile("hdfs://a/b/c", 40)
    val hgPair = hgfasta.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
    val filtered = hgPair.filter(kv => kv._2 == 1)
    val bedPair = bedFile.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
    val joinRDD = bedPair.join(filtered)
    val ty = joinRDD.map { case (word, (file1Counts, file2Counts)) => KmerIntesect(word, file1Counts, "xyz") }
    ty.registerTempTable("KmerIntesect")
    ty.saveAsParquetFile("hdfs://x/y/z/kmerIntersect.parquet")
  }
}


Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Even if H2O and ADAM are dependent on 1.2.1, it should be backward
compatible, right?
So using 1.3 should not break them,
and the code is not using the classes from those libs.
I tried sbt clean compile - same error.
Thanks
_R

On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 What version of Spark do the other dependencies rely on (Adam and H2O?) -
 that could be it

 Or try sbt clean compile

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote:

 I have a EC2 cluster created using spark version 1.2.1.
 And I have a SBT project .
 Now I want to upgrade to spark 1.3 and use the new features.
 Below are issues .
 Sorry for the long post.
 Appreciate your help.
 Thanks
 -Roni

 Question - Do I have to create a new cluster using spark 1.3?

 Here is what I did -

 In my SBT file I  changed to -
 libraryDependencies += org.apache.spark %% spark-core % 1.3.0

 But then I started getting compilation error. along with
 Here are some of the libraries that were evicted:
 [warn]  * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0
 [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) - 2.6.0
 [warn] Run 'evicted' to see detailed eviction warnings

  constructor cannot be instantiated to expected type;
 [error]  found   : (T1, T2)
 [error]  required: org.apache.spark.sql.catalyst.expressions.Row
 [error] val ty = joinRDD.map{case(word,
 (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)}
 [error]  ^

 Here is my SBT and code --
 SBT -

 version := 1.0

 scalaVersion := 2.10.4

 resolvers += Sonatype OSS Snapshots at 
 https://oss.sonatype.org/content/repositories/snapshots;;
 resolvers += Maven Repo1 at https://repo1.maven.org/maven2;;
 resolvers += Maven Repo at 
 https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/;;

 /* Dependencies - %% appends Scala version to artifactId */
 libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0
 libraryDependencies += org.apache.spark %% spark-core % 1.3.0
 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0
 libraryDependencies += ai.h2o % sparkling-water-core_2.10 % 0.2.10


 CODE --
 import org.apache.spark.{SparkConf, SparkContext}
 case class KmerIntesect(kmer: String, kCount: Int, fileName: String)

 object preDefKmerIntersection {
   def main(args: Array[String]) {

  val sparkConf = new SparkConf().setAppName(preDefKmer-intersect)
  val sc = new SparkContext(sparkConf)
 import sqlContext.createSchemaRDD
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val bedFile = sc.textFile(s3n://a/b/c,40)
  val hgfasta = sc.textFile(hdfs://a/b/c,40)
  val hgPair = hgfasta.map(_.split (,)).map(a= (a(0),
 a(1).trim().toInt))
  val filtered = hgPair.filter(kv = kv._2 == 1)
  val bedPair = bedFile.map(_.split (,)).map(a= (a(0),
 a(1).trim().toInt))
  val joinRDD = bedPair.join(filtered)
 val ty = joinRDD.map{case(word, (file1Counts,
 file2Counts)) = KmerIntesect(word, file1Counts,xyz)}
 ty.registerTempTable(KmerIntesect)
 ty.saveAsParquetFile(hdfs://x/y/z/kmerIntersect.parquet)
   }
 }





Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Thanks Dean and Nick.
So, I removed the ADAM and H2o from my SBT as I was not using them.
I got the code to compile - only to have it fail while running with -
SparkContext: Created broadcast 1 from textFile at kmerIntersetion.scala:21
Exception in thread main java.lang.NoClassDefFoundError:
org/apache/spark/rdd/RDD$
at preDefKmerIntersection$.main(kmerIntersetion.scala:26)

This line is where I do a JOIN operation.
val hgPair = hgfasta.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
 val filtered = hgPair.filter(kv => kv._2 == 1)
 val bedPair = bedFile.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
* val joinRDD = bedPair.join(filtered) *
Any idea what's going on?
I have data on the EC2 cluster, so I am avoiding creating a new cluster and just
upgrading and changing the code to use 1.3 and Spark SQL.
Thanks
Roni



On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler deanwamp...@gmail.com wrote:

 For the Spark SQL parts, 1.3 breaks backwards compatibility, because
 before 1.3, Spark SQL was considered experimental where API changes were
 allowed.

 So, H2O and ADA compatible with 1.2.X might not work with 1.3.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Wed, Mar 25, 2015 at 9:39 AM, roni roni.epi...@gmail.com wrote:

 Even if H2o and ADA are dependent on 1.2.1 , it should be backword
 compatible, right?
 So using 1.3 should not break them.
 And the code is not using the classes from those libs.
 I tried sbt clean compile .. same errror
 Thanks
 _R

 On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath nick.pentre...@gmail.com
  wrote:

 What version of Spark do the other dependencies rely on (Adam and H2O?)
 - that could be it

 Or try sbt clean compile

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote:

 I have a EC2 cluster created using spark version 1.2.1.
 And I have a SBT project .
 Now I want to upgrade to spark 1.3 and use the new features.
 Below are issues .
 Sorry for the long post.
 Appreciate your help.
 Thanks
 -Roni

 Question - Do I have to create a new cluster using spark 1.3?

 Here is what I did -

 In my SBT file I  changed to -
 libraryDependencies += org.apache.spark %% spark-core % 1.3.0

 But then I started getting compilation error. along with
 Here are some of the libraries that were evicted:
 [warn]  * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0
 [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) -
 2.6.0
 [warn] Run 'evicted' to see detailed eviction warnings

  constructor cannot be instantiated to expected type;
 [error]  found   : (T1, T2)
 [error]  required: org.apache.spark.sql.catalyst.expressions.Row
 [error] val ty = joinRDD.map{case(word,
 (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)}
 [error]  ^

 Here is my SBT and code --
 SBT -

 version := 1.0

 scalaVersion := 2.10.4

 resolvers += Sonatype OSS Snapshots at 
 https://oss.sonatype.org/content/repositories/snapshots;;
 resolvers += Maven Repo1 at https://repo1.maven.org/maven2;;
 resolvers += Maven Repo at 
 https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/;;

 /* Dependencies - %% appends Scala version to artifactId */
 libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0
 libraryDependencies += org.apache.spark %% spark-core % 1.3.0
 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0
 libraryDependencies += ai.h2o % sparkling-water-core_2.10 % 0.2.10


 CODE --
 import org.apache.spark.{SparkConf, SparkContext}
 case class KmerIntesect(kmer: String, kCount: Int, fileName: String)

 object preDefKmerIntersection {
   def main(args: Array[String]) {

  val sparkConf = new SparkConf().setAppName(preDefKmer-intersect)
  val sc = new SparkContext(sparkConf)
 import sqlContext.createSchemaRDD
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val bedFile = sc.textFile(s3n://a/b/c,40)
  val hgfasta = sc.textFile(hdfs://a/b/c,40)
  val hgPair = hgfasta.map(_.split (,)).map(a= (a(0),
 a(1).trim().toInt))
  val filtered = hgPair.filter(kv = kv._2 == 1)
  val bedPair = bedFile.map(_.split (,)).map(a=
 (a(0), a(1).trim().toInt))
  val joinRDD = bedPair.join(filtered)
 val ty = joinRDD.map{case(word, (file1Counts,
 file2Counts)) = KmerIntesect(word, file1Counts,xyz)}
 ty.registerTempTable(KmerIntesect)

 ty.saveAsParquetFile(hdfs://x/y/z/kmerIntersect.parquet)
   }
 }







Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
My cluster is still on spark 1.2 and in SBT I am using 1.3.
So probably it is compiling with 1.3 but running with 1.2 ?

On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler deanwamp...@gmail.com
wrote:

 Weird. Are you running using SBT console? It should have the spark-core
 jar on the classpath. Similarly, spark-shell or spark-submit should work,
 but be sure you're using the same version of Spark when running as when
 compiling. Also, you might need to add spark-sql to your SBT dependencies,
 but that shouldn't be this issue.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Wed, Mar 25, 2015 at 12:09 PM, roni roni.epi...@gmail.com wrote:

 Thanks Dean and Nick.
 So, I removed the ADAM and H2o from my SBT as I was not using them.
 I got the code to compile  - only for fail while running with -
 SparkContext: Created broadcast 1 from textFile at
 kmerIntersetion.scala:21
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/spark/rdd/RDD$
 at preDefKmerIntersection$.main(kmerIntersetion.scala:26)

 This line is where I do a JOIN operation.
 val hgPair = hgfasta.map(_.split (,)).map(a= (a(0), a(1).trim().toInt))
  val filtered = hgPair.filter(kv = kv._2 == 1)
  val bedPair = bedFile.map(_.split (,)).map(a= (a(0),
 a(1).trim().toInt))
 * val joinRDD = bedPair.join(filtered)   *
 Any idea whats going on?
 I have data on the EC2 so I am avoiding creating the new cluster , but
 just upgrading and changing the code to use 1.3 and Spark SQL
 Thanks
 Roni



 On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler deanwamp...@gmail.com
 wrote:

 For the Spark SQL parts, 1.3 breaks backwards compatibility, because
 before 1.3, Spark SQL was considered experimental where API changes were
 allowed.

 So, H2O and ADA compatible with 1.2.X might not work with 1.3.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Wed, Mar 25, 2015 at 9:39 AM, roni roni.epi...@gmail.com wrote:

 Even if H2o and ADA are dependent on 1.2.1 , it should be backword
 compatible, right?
 So using 1.3 should not break them.
 And the code is not using the classes from those libs.
 I tried sbt clean compile .. same errror
 Thanks
 _R

 On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath 
 nick.pentre...@gmail.com wrote:

 What version of Spark do the other dependencies rely on (Adam and
 H2O?) - that could be it

 Or try sbt clean compile

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote:

 I have a EC2 cluster created using spark version 1.2.1.
 And I have a SBT project .
 Now I want to upgrade to spark 1.3 and use the new features.
 Below are issues .
 Sorry for the long post.
 Appreciate your help.
 Thanks
 -Roni

 Question - Do I have to create a new cluster using spark 1.3?

 Here is what I did -

 In my SBT file I  changed to -
 libraryDependencies += org.apache.spark %% spark-core % 1.3.0

 But then I started getting compilation error. along with
 Here are some of the libraries that were evicted:
 [warn]  * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0
 [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) -
 2.6.0
 [warn] Run 'evicted' to see detailed eviction warnings

  constructor cannot be instantiated to expected type;
 [error]  found   : (T1, T2)
 [error]  required: org.apache.spark.sql.catalyst.expressions.Row
 [error] val ty =
 joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word,
 file1Counts,xyz)}
 [error]  ^

 Here is my SBT and code --
 SBT -

 version := 1.0

 scalaVersion := 2.10.4

 resolvers += Sonatype OSS Snapshots at 
 https://oss.sonatype.org/content/repositories/snapshots;;
 resolvers += Maven Repo1 at https://repo1.maven.org/maven2;;
 resolvers += Maven Repo at 
 https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/
 ;

 /* Dependencies - %% appends Scala version to artifactId */
 libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0
 libraryDependencies += org.apache.spark %% spark-core % 1.3.0
 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0
 libraryDependencies += ai.h2o % sparkling-water-core_2.10 %
 0.2.10


 CODE --
 import org.apache.spark.{SparkConf, SparkContext}
 case class KmerIntesect(kmer: String, kCount: Int, fileName: String)

 object preDefKmerIntersection {
   def main(args: Array[String]) {

  val sparkConf = new SparkConf().setAppName(preDefKmer-intersect)
  val sc = new SparkContext(sparkConf)
 import

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Is there any way that I can install the new one and remove the previous version?
I installed Spark 1.3 on my EC2 master and set the Spark home to the new
one,
but when I start the spark-shell I get -
 java.lang.UnsatisfiedLinkError:
org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V
at
org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native
Method)

Is there no way to upgrade without creating a new cluster?
Thanks
Roni



On Wed, Mar 25, 2015 at 1:18 PM, Dean Wampler deanwamp...@gmail.com wrote:

 Yes, that's the problem. The RDD class exists in both binary jar files,
 but the signatures probably don't match. The bottom line, as always for
 tools like this, is that you can't mix versions.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com
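
 A common way to avoid mixing compile-time and runtime Spark versions (a sketch,
 assuming the cluster supplies Spark at runtime) is to mark the Spark artifacts
 as provided in SBT, so the application jar compiles against a Spark version but
 never bundles or pins one:

 libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
 libraryDependencies += "org.apache.spark" %% "spark-sql"  % "1.3.0" % "provided"

 The job is then launched through the cluster's own spark-submit, so the
 installed version is the one actually used at runtime.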

 On Wed, Mar 25, 2015 at 3:13 PM, roni roni.epi...@gmail.com wrote:

 My cluster is still on spark 1.2 and in SBT I am using 1.3.
 So probably it is compiling with 1.3 but running with 1.2 ?

 On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler deanwamp...@gmail.com
 wrote:

 Weird. Are you running using SBT console? It should have the spark-core
 jar on the classpath. Similarly, spark-shell or spark-submit should work,
 but be sure you're using the same version of Spark when running as when
 compiling. Also, you might need to add spark-sql to your SBT dependencies,
 but that shouldn't be this issue.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Wed, Mar 25, 2015 at 12:09 PM, roni roni.epi...@gmail.com wrote:

 Thanks Dean and Nick.
 So, I removed the ADAM and H2o from my SBT as I was not using them.
 I got the code to compile  - only for fail while running with -
 SparkContext: Created broadcast 1 from textFile at
 kmerIntersetion.scala:21
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/spark/rdd/RDD$
 at preDefKmerIntersection$.main(kmerIntersetion.scala:26)

 This line is where I do a JOIN operation.
 val hgPair = hgfasta.map(_.split (,)).map(a= (a(0),
 a(1).trim().toInt))
  val filtered = hgPair.filter(kv = kv._2 == 1)
  val bedPair = bedFile.map(_.split (,)).map(a=
 (a(0), a(1).trim().toInt))
 * val joinRDD = bedPair.join(filtered)   *
 Any idea whats going on?
 I have data on the EC2 so I am avoiding creating the new cluster , but
 just upgrading and changing the code to use 1.3 and Spark SQL
 Thanks
 Roni



 On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler deanwamp...@gmail.com
 wrote:

 For the Spark SQL parts, 1.3 breaks backwards compatibility, because
 before 1.3, Spark SQL was considered experimental where API changes were
 allowed.

 So, H2O and ADA compatible with 1.2.X might not work with 1.3.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Wed, Mar 25, 2015 at 9:39 AM, roni roni.epi...@gmail.com wrote:

 Even if H2o and ADA are dependent on 1.2.1 , it should be backword
 compatible, right?
 So using 1.3 should not break them.
 And the code is not using the classes from those libs.
 I tried sbt clean compile .. same errror
 Thanks
 _R

 On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath 
 nick.pentre...@gmail.com wrote:

 What version of Spark do the other dependencies rely on (Adam and
 H2O?) - that could be it

 Or try sbt clean compile

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote:

 I have a EC2 cluster created using spark version 1.2.1.
 And I have a SBT project .
 Now I want to upgrade to spark 1.3 and use the new features.
 Below are issues .
 Sorry for the long post.
 Appreciate your help.
 Thanks
 -Roni

 Question - Do I have to create a new cluster using spark 1.3?

 Here is what I did -

 In my SBT file I  changed to -
 libraryDependencies += org.apache.spark %% spark-core % 1.3.0

 But then I started getting compilation error. along with
 Here are some of the libraries that were evicted:
 [warn]  * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0
 [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0)
 - 2.6.0
 [warn] Run 'evicted' to see detailed eviction warnings

  constructor cannot be instantiated to expected type;
 [error]  found   : (T1, T2)
 [error]  required: org.apache.spark.sql.catalyst.expressions.Row
 [error] val ty =
 joinRDD.map{case(word, (file1Counts, file2Counts)) = 
 KmerIntesect(word,
 file1Counts,xyz)}
 [error

Re: Cannot run spark-shell command not found.

2015-03-30 Thread roni
I think you must have downloaded the Spark source-code .tgz file.
It is a little confusing: you have to select the Hadoop version as well, and
the actual .tgz file will have the Spark version and Hadoop version in its name.
-R

On Mon, Mar 30, 2015 at 10:34 AM, vance46 wang2...@purdue.edu wrote:

 Hi all,

 I'm a newbee try to setup spark for my research project on a RedHat system.
 I've downloaded spark-1.3.0.tgz and untared it. and installed python, java
 and scala. I've set JAVA_HOME and SCALA_HOME and then try to use sudo
 sbt/sbt assembly according to
 https://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_CentOS6
 .
 It pop-up with sbt command not found. I then try directly start
 spark-shell in ./bin using sudo ./bin/spark-shell and still command
 not
 found. I appreciate your help in advance.







joining multiple parquet files

2015-03-31 Thread roni
Hi,
 I have 4 Parquet files and I want to find the data that is common to all of
them,
e.g.

SELECT TableA.*, TableB.*, TableC.*, TableD.* FROM (TableB INNER JOIN TableA
ON TableB.aID = TableA.aID)
INNER JOIN TableC ON (TableB.cID = TableC.cID)
INNER JOIN TableD ON (TableA.dID = TableD.dID)
WHERE (DATE(TableC.date) = DATE(NOW()))


I can do a two-file join like - val joinedVal =
g1.join(g2, g1.col("kmer") === g2.col("kmer"))

But I am trying to find the common kmer strings from all 4 files.
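
A minimal sketch of chaining the joins with the DataFrame API - g1..g4 stand for
the four Parquet files loaded as DataFrames, and "kmer" is the assumed common column:

val common = g1.join(g2, g1.col("kmer") === g2.col("kmer"))
              .join(g3, g1.col("kmer") === g3.col("kmer"))
              .join(g4, g1.col("kmer") === g4.col("kmer"))
              .select(g1.col("kmer"))   // inner joins keep only kmers present in all four files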

Thanks

Roni


spark.sql.Row manipulation

2015-03-31 Thread roni
I have 2 Parquet files with a format of, e.g., name, age, town.
I read them and then join them to get all the names which are in both
towns.
The resultant dataset is

res4: Array[org.apache.spark.sql.Row] = Array([name1, age1,
town1,name2,age2,town2])

name1 and name2 are the same since I am joining on name.
Now I want to get only the format (name, age1, age2),

but I can't seem to manipulate the spark.sql.Row.

Trying something like map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
does not work.

Can you suggest a way ?
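
A hedged sketch of two ways this is usually done, assuming the two inputs are
DataFrames df1 and df2 with columns name, age, town:

// 1) select and alias the wanted columns instead of touching Row objects
val joined = df1.join(df2, df1.col("name") === df2.col("name"))
val result = joined.select(df1.col("name"), df1.col("age").as("age1"), df2.col("age").as("age2"))

// 2) or pull fields out of each Row by position
//    (indices and types depend on the joined schema - adjust to your columns)
val triples = joined.map(r => (r.getString(0), r.getInt(1), r.getInt(4)))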

Thanks
-R


Re: Resource manager UI for Spark applications

2015-03-03 Thread roni
Hi Ted,
 I  used s3://support.elasticmapreduce/spark/install-spark to install spark
on my EMR cluster. It is 1.2.0.
 When I click on the link for history or logs it takes me to
http://ip-172-31-43-116.us-west-2.compute.internal:9035/node/containerlogs/container_1424105590052_0070_01_01/hadoop
 and I get -

The server at *ip-172-31-43-116.us-west-2.compute.internal* can't be found,
because the DNS lookup failed. DNS is the network service that translates a
website's name to its Internet address. This error is most often caused by
having no connection to the Internet or a misconfigured network. It can
also be caused by an unresponsive DNS server or a firewall preventing Google
Chrome from accessing the network.
I tried changing the internal address to the external one, but it still
does not work.
Thanks
_roni


On Tue, Mar 3, 2015 at 9:05 AM, Ted Yu yuzhih...@gmail.com wrote:

 bq. spark UI does not work for Yarn-cluster.

 Can you be a bit more specific on the error(s) you saw ?

 What Spark release are you using ?

 Cheers

 On Tue, Mar 3, 2015 at 8:53 AM, Rohini joshi roni.epi...@gmail.com
 wrote:

 Sorry , for half email - here it is again in full
 Hi ,
 I have 2 questions -

  1. I was trying to use Resource Manager UI for my SPARK application
 using yarn cluster mode as I observed that spark UI does not work for
 Yarn-cluster.
 IS that correct or am I missing some setup?

 2. when I click on Application Monitoring or history , i get re-directed
 to some linked with internal Ip address. Even if I replace that address
 with the public IP , it still does not work.  What kind of setup changes
 are needed for that?

 Thanks
 -roni

 On Tue, Mar 3, 2015 at 8:45 AM, Rohini joshi roni.epi...@gmail.com
 wrote:

 Hi ,
 I have 2 questions -

  1. I was trying to use Resource Manager UI for my SPARK application
 using yarn cluster mode as I observed that spark UI does not work for
 Yarn-cluster.
 IS that correct or am I missing some setup?









Re: Resource manager UI for Spark applications

2015-03-03 Thread roni
Ted,
If the application is running, then the logs are not available. Plus, what I
want to view is the details about the running app, as in the Spark UI.
Do I have to open some ports or make some other settings changes?




On Tue, Mar 3, 2015 at 10:08 AM, Ted Yu yuzhih...@gmail.com wrote:

 bq. changing the address with internal to the external one , but still
 does not work.
 Not sure what happened.
 For the time being, you can use yarn command line to pull container log
 (put in your appId and container Id):
 yarn logs -applicationId application_1386639398517_0007 -containerId
 container_1386639398517_0007_01_19

 Cheers

 On Tue, Mar 3, 2015 at 9:50 AM, roni roni.epi...@gmail.com wrote:

 Hi Ted,
  I  used s3://support.elasticmapreduce/spark/install-spark to install
 spark on my EMR cluster. It is 1.2.0.
  When I click on the link for history or logs it takes me to

 http://ip-172-31-43-116.us-west-2.compute.internal:9035/node/containerlogs/container_1424105590052_0070_01_01/hadoop
  and I get -

 The server at *ip-172-31-43-116.us-west-2.compute.internal* can't be
 found, because the DNS lookup failed. DNS is the network service that
 translates a website's name to its Internet address. This error is most
 often caused by having no connection to the Internet or a misconfigured
 network. It can also be caused by an unresponsive DNS server or a firewall
 preventing Google Chrome from accessing the network.
 I tried  changing the address with internal to the external one , but
 still does not work.
 Thanks
 _roni


 On Tue, Mar 3, 2015 at 9:05 AM, Ted Yu yuzhih...@gmail.com wrote:

 bq. spark UI does not work for Yarn-cluster.

 Can you be a bit more specific on the error(s) you saw ?

 What Spark release are you using ?

 Cheers

 On Tue, Mar 3, 2015 at 8:53 AM, Rohini joshi roni.epi...@gmail.com
 wrote:

 Sorry , for half email - here it is again in full
 Hi ,
 I have 2 questions -

  1. I was trying to use Resource Manager UI for my SPARK application
 using yarn cluster mode as I observed that spark UI does not work for
 Yarn-cluster.
 IS that correct or am I missing some setup?

 2. when I click on Application Monitoring or history , i get
 re-directed to some linked with internal Ip address. Even if I replace that
 address with the public IP , it still does not work.  What kind of setup
 changes are needed for that?

 Thanks
 -roni

 On Tue, Mar 3, 2015 at 8:45 AM, Rohini joshi roni.epi...@gmail.com
 wrote:

 Hi ,
 I have 2 questions -

  1. I was trying to use Resource Manager UI for my SPARK application
 using yarn cluster mode as I observed that spark UI does not work for
 Yarn-cluster.
 IS that correct or am I missing some setup?











Re: Resource manager UI for Spark applications

2015-03-03 Thread roni
Ah, I think I know what you mean. My job was just in the ACCEPTED state for
a long time as it was processing a huge file.
Now that it is in the RUNNING state, I can see it. I can see it at port
9046, though, instead of 4040. But I can see it.
Thanks
-roni

On Tue, Mar 3, 2015 at 1:19 PM, Zhan Zhang zzh...@hortonworks.com wrote:

  In Yarn (Cluster or client), you can access the spark ui when the app is
 running. After app is done, you can still access it, but need some extra
 setup for history server.

  Thanks.

  Zhan Zhang

  On Mar 3, 2015, at 10:08 AM, Ted Yu yuzhih...@gmail.com wrote:

  bq. changing the address with internal to the external one , but still
 does not work.
 Not sure what happened.
 For the time being, you can use yarn command line to pull container log
 (put in your appId and container Id):
  yarn logs -applicationId application_1386639398517_0007 -containerId
 container_1386639398517_0007_01_19

  Cheers

 On Tue, Mar 3, 2015 at 9:50 AM, roni roni.epi...@gmail.com wrote:

Hi Ted,
   I  used s3://support.elasticmapreduce/spark/install-spark to install
 spark on my EMR cluster. It is 1.2.0.
   When I click on the link for history or logs it takes me to

 http://ip-172-31-43-116.us-west-2.compute.internal:9035/node/containerlogs/container_1424105590052_0070_01_01/hadoop
   and I get -

 The server at *ip-172-31-43-116.us
 http://ip-172-31-43-116.us-west-2.compute.internal* can't be found,
 because the DNS lookup failed. DNS is the network service that translates a
 website's name to its Internet address. This error is most often caused by
 having no connection to the Internet or a misconfigured network. It can
 also be caused by an unresponsive DNS server or a firewall preventing Google
 Chrome from accessing the network.
  I tried  changing the address with internal to the external one , but
 still does not work.
  Thanks
  _roni


 On Tue, Mar 3, 2015 at 9:05 AM, Ted Yu yuzhih...@gmail.com wrote:

 bq. spark UI does not work for Yarn-cluster.

  Can you be a bit more specific on the error(s) you saw ?

  What Spark release are you using ?

  Cheers

 On Tue, Mar 3, 2015 at 8:53 AM, Rohini joshi roni.epi...@gmail.com
 wrote:

   Sorry , for half email - here it is again in full
  Hi ,
  I have 2 questions -

   1. I was trying to use Resource Manager UI for my SPARK application
 using yarn cluster mode as I observed that spark UI does not work for
 Yarn-cluster.
  IS that correct or am I missing some setup?

  2. when I click on Application Monitoring or history , i get
 re-directed to some linked with internal Ip address. Even if I replace that
 address with the public IP , it still does not work.  What kind of setup
 changes are needed for that?

  Thanks
  -roni

 On Tue, Mar 3, 2015 at 8:45 AM, Rohini joshi roni.epi...@gmail.com
 wrote:

  Hi ,
  I have 2 questions -

   1. I was trying to use Resource Manager UI for my SPARK application
 using yarn cluster mode as I observed that spark UI does not work for
 Yarn-cluster.
  IS that correct or am I missing some setup?












Re: issue Running Spark Job on Yarn Cluster

2015-03-04 Thread roni
Look at the logs:
yarn logs -applicationId <applicationId>
That should show the error.

On Wed, Mar 4, 2015 at 9:21 AM, sachin Singh sachin.sha...@gmail.com
wrote:

 Not yet,
 Please let. Me know if you found solution,

 Regards
 Sachin
  On 4 Mar 2015 21:45, mael2210 [via Apache Spark User List] [hidden email] wrote:

 Hello,

 I am facing the exact same issue. Could you solve the problem ?

 Kind regards






difference in PCA of MLlib vs H2O in R

2015-03-24 Thread roni
I am trying to compute PCA using computePrincipalComponents.
I also computed PCA using H2O in R and R's prcomp. The answers I get from
H2O and R's prcomp (non-H2O) are the same when I set the option for H2O to
standardized=FALSE and for R's prcomp to center=FALSE.

How do I make sure that the settings for the MLlib PCA are the same as I am using for
H2O or prcomp?

Thanks
Roni


Storing data in MySQL from spark hive tables

2015-05-20 Thread roni
Hi,
I am trying to set up the Hive metastore and the MySQL DB connection.
 I have a Spark cluster, I ran some programs, and I have data stored in
some Hive tables.
Now I want to store this data in MySQL so that it is available for
further processing.

I set up the hive-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>hive.semantic.analyzer.factory.impl</name>
    <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
  </property>

  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>false</value>
  </property>

  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>

  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://*ip address*:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value></value>
  </property>

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/${user.name}/hive-warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>

</configuration>
 --
My MySQL server is on a separate host from where my Spark server is. If
I use MySQL Workbench, I use an SSH connection with a certificate file to
connect.
How do I specify all of that information from Spark to the DB?
I want to store the data generated by my Spark program into MySQL.
Thanks
_R


Re: which database for gene alignment data ?

2015-06-08 Thread roni
Sorry for the delay.
The files (called .bed files) have a format like this -

Chromosome  start   end     feature  score  strand

chr1        713776  714375  peak.1   599    +
chr1        752401  753000  peak.2   599    +

The mandatory fields are


   1. chrom - The name of the chromosome (e.g. chr3, chrY,
chr2_random) or scaffold (e.g. scaffold10671).
   2. chromStart - The starting position of the feature in the
chromosome or scaffold. The first base in a chromosome is numbered 0.
   3. chromEnd - The ending position of the feature in the chromosome
or scaffold. The *chromEnd* base is not included in the display of the
feature. For example, the first 100 bases of a chromosome are defined
as *chromStart=0, chromEnd=100*, and span the bases numbered 0-99.

There can be more data as described -
https://genome.ucsc.edu/FAQ/FAQformat.html#format1
Many times the use cases are like:
1. Find the features between given start and end positions.
2. Find features which have overlapping start and end points with
another feature.
3. Read external (reference) data which will have a similar format
(chr10  48514785  49604641  MAPK8  49514785  +) and find all the data
points which overlap with the other .bed files.

The data is huge: .bed files can range from 0.5 GB to 5 GB (or more).
I was thinking of using Cassandra, but I am not sure whether the overlapping
queries can be supported and will be fast enough.
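
For illustration, a rough sketch of use cases 1 and 2 in plain Spark SQL,
assuming each .bed file has been loaded into a DataFrame with columns chrom,
chromStart, chromEnd (the column names and types are assumptions, not part of
the format):

// 1) features between a given start and end on one chromosome
val hits = bed.filter(bed("chrom") === "chr1" &&
                      bed("chromStart") >= 713000 && bed("chromEnd") <= 760000)

// 2) features in a that overlap features in b
//    (two intervals overlap when each one starts before the other ends)
val overlaps = a.join(b,
  a("chrom") === b("chrom") &&
  a("chromStart") < b("chromEnd") &&
  b("chromStart") < a("chromEnd"))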

Thanks for the help
-Roni


On Sat, Jun 6, 2015 at 7:03 AM, Ted Yu yuzhih...@gmail.com wrote:

 Can you describe your use case in a bit more detail since not all people
 on this mailing list are familiar with gene sequencing alignments data ?

 Thanks

 On Fri, Jun 5, 2015 at 11:42 PM, roni roni.epi...@gmail.com wrote:

 I want to use spark for reading compressed .bed file for reading gene
 sequencing alignments data.
 I want to store bed file data in db and then use external gene expression
 data to find overlaps etc, which database is best for it ?
 Thanks
 -Roni





which database for gene alignment data ?

2015-06-06 Thread roni
I want to use Spark for reading compressed .bed files containing gene
sequencing alignment data.
I want to store the bed file data in a DB and then use external gene expression
data to find overlaps etc. Which database is best for this?
Thanks
-Roni


Re: which database for gene alignment data ?

2015-06-09 Thread roni
Hi Frank,
Thanks for the reply. I downloaded ADAM and built it, but it does not seem
to list this function in its command-line options.
Are these exposed as a public API that I can call from code?

Also, I need to save all my intermediate data. It seems ADAM stores
data in Parquet on HDFS.
I want to save something in an external database, so that we can reuse
the saved data in multiple ways by multiple people.
Any suggestions on the DB selection, or on keeping data centralized for use by
multiple distinct groups?
Thanks
-Roni



On Mon, Jun 8, 2015 at 12:47 PM, Frank Austin Nothaft fnoth...@berkeley.edu
 wrote:

 Hi Roni,

 We have a full suite of genomic feature parsers that can read BED,
 narrowPeak, GATK interval lists, and GTF/GFF into Spark RDDs in ADAM
 https://github.com/bigdatagenomics/adam  Additionally, we have support
 for efficient overlap joins (query 3 in your email below). You can load the
 genomic features with ADAMContext.loadFeatures
 https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala#L438.
 We have two tools for the overlap computation: you can use a
 BroadcastRegionJoin
 https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/BroadcastRegionJoin.scala
  if
 one of the datasets you want to overlap is small or a ShuffleRegionJoin
 https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ShuffleRegionJoin.scala
  if
 both datasets are large.

 Regards,

 Frank Austin Nothaft
 fnoth...@berkeley.edu
 fnoth...@eecs.berkeley.edu
 202-340-0466

 On Jun 8, 2015, at 9:39 PM, roni roni.epi...@gmail.com wrote:

 Sorry for the delay.
 The files (called .bed files) have format like -

 Chromosome start  endfeature score  strand

 chr1   713776  714375  peak.1  599+
 chr1   752401  753000  peak.2  599+

 The mandatory fields are


1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or 
 scaffold (e.g. scaffold10671).
2. chromStart - The starting position of the feature in the chromosome or 
 scaffold. The first base in a chromosome is numbered 0.
3. chromEnd - The ending position of the feature in the chromosome or 
 scaffold. The *chromEnd* base is not included in the display of the feature. 
 For example, the first 100 bases of a chromosome are defined as 
 *chromStart=0, chromEnd=100*, and span the bases numbered 0-99.

 There can be more data as described - 
 https://genome.ucsc.edu/FAQ/FAQformat.html#format1
 Many times the use cases are like
 1. find the features between given start and end positions
 2.Find features which have overlapping start and end points with another 
 feature.
 3. read external (reference) data which will have similar format (chr10   
 4851478549604641MAPK8   49514785+) and find all the 
 data points which are overlapping with the other  .bed files.

 The data is huge. .bed files can range from .5 GB to 5 gb (or more)
 I was thinking of using cassandra, but not sue if the overlapping queries can 
 be supported and will be fast enough.

 Thanks for the help
 -Roni


 On Sat, Jun 6, 2015 at 7:03 AM, Ted Yu yuzhih...@gmail.com wrote:

 Can you describe your use case in a bit more detail since not all people
 on this mailing list are familiar with gene sequencing alignments data ?

 Thanks

 On Fri, Jun 5, 2015 at 11:42 PM, roni roni.epi...@gmail.com wrote:

 I want to use spark for reading compressed .bed file for reading gene
 sequencing alignments data.
 I want to store bed file data in db and then use external gene
 expression data to find overlaps etc, which database is best for it ?
 Thanks
 -Roni







Re: Is anyone using Amazon EC2? (second attempt!)

2015-05-29 Thread roni
Hi,
Any update on this?
I am not sure if the issue I am seeing is related.
I have 8 slaves, and when I created the cluster I specified an EBS volume of
100 GB.
On EC2 I see 8 volumes created, each attached to the corresponding slave,
but when I try to copy data onto them, it complains:

/root/ephemeral-hdfs/bin/hadoop fs -cp /intersection hdfs://
ec2-54-149-112-136.us-west-2.compute.amazonaws.com:9010/

2015-05-28 23:40:35,447 WARN  hdfs.DFSClient
(DFSOutputStream.java:run(562)) - DataStreamer Exception

org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/intersection/kmer150/commonGoodKmers/_temporary/_attempt_201504010056_0004_m_000428_3948/part-00428._COPYING_
could only be replicated to 0 nodes instead of minReplication (=1).  There
are 1 datanode(s) running and no node(s) are excluded in this operation.


It shows only 1 datanode, but ephemeral-hdfs shows 8 datanodes.

Any thoughts?

Thanks

_R

On Sat, May 23, 2015 at 7:24 AM, Joe Wass jw...@crossref.org wrote:

 I used Spark on EC2 a while ago, but recent revisions seem to have broken
 the functionality.

 Is anyone actually using Spark on EC2 at the moment?

 The bug in question is:

 https://issues.apache.org/jira/browse/SPARK-5008

 It makes it impossible to use persistent HDFS without a workround on each
 slave node.

 No-one seems to be interested in the bug, so I wonder if other people
 aren't actually having this problem. If this is the case, any suggestions?

 Joe



Re: Do I really need to build Spark for Hive/Thrift Server support?

2015-08-10 Thread roni
Hi All,
 Any explanation for this?
 As Reece said, I can do operations with Hive, but

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

gives an error.

I have already created a Spark EC2 cluster with the spark-ec2 script. How can
I build it again?

Thanks
_Roni

On Tue, Jul 28, 2015 at 2:46 PM, ReeceRobinson re...@therobinsons.gen.nz
wrote:

 I am building an analytics environment based on Spark and want to use HIVE
 in
 multi-user mode i.e. not use the embedded derby database but use Postgres
 and HDFS instead. I am using the included Spark Thrift Server to process
 queries using Spark SQL.

 The documentation gives me the impression that I need to create a custom
 build of Spark 1.4.1. However I don't think this is either accurate now OR
 it is for some different context I'm not aware of?

 I used the pre-built Spark 1.4.1 distribution today with my hive-site.xml
 for Postgres and HDFS and it worked! I see the warehouse files turn up in
 HDFS and I see the metadata inserted into Postgres when I created a test
 table.

 I can connect to the Thrift Server using beeline and perform queries on my
 data. I also verified using the Spark UI that the SQL is being processed by
 Spark SQL.

 So I guess I'm asking is the document out-of-date or am I missing
 something?

 Cheers,
 Reece







No suitable driver found for jdbc:mysql://

2015-07-22 Thread roni
Hi All,
 I have a cluster with Spark 1.4.
I am trying to save data to MySQL but I am getting this error:

Exception in thread main java.sql.SQLException: No suitable driver found
for jdbc:mysql://.rds.amazonaws.com:3306/DAE_kmer?user=password=


I looked at https://issues.apache.org/jira/browse/SPARK-8463 and added the connector
jar to the same location as on the master using the copy-dir script.

But I am still getting the same error. This used to work with 1.3.

This is my command to run the program -

$SPARK_HOME/bin/spark-submit
--jars /root/spark/lib/mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar
--conf spark.executor.extraClassPath=/root/spark/lib/mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar
--conf spark.executor.memory=55g --driver-memory=55g
--master=spark://ec2-52-25-191-999.us-west-2.compute.amazonaws.com:7077
--class saveBedToDB target/scala-2.10/adam-project_2.10-1.0.jar

What else can I do?
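
For reference, a hedged sketch of how this is usually resolved: the exception is
thrown in the driver JVM, and the command above only puts the connector on the
executor classpath, so adding --driver-class-path <path-to-mysql-connector-jar>
(or spark.driver.extraClassPath) to spark-submit is the usual first step. In the
code, registering the driver explicitly and writing through the DataFrame JDBC
writer looks roughly like this (URL, table, and credentials are placeholders):

import java.util.Properties

Class.forName("com.mysql.jdbc.Driver")   // make sure the driver is registered with DriverManager
val props = new Properties()
props.put("user", "myuser")
props.put("password", "mypassword")
df.write.jdbc("jdbc:mysql://<host>.rds.amazonaws.com:3306/DAE_kmer", "kmer_table", props)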

*Thanks*

*-Roni*


Upgrade spark cluster to latest version

2015-11-03 Thread roni
Hi Spark experts,
  This may be a very naive question, but can you please point me to a proper
way to upgrade the Spark version on an existing cluster?
Thanks
Roni

> Hi,
>  I have a current cluster running spark 1.4 and want to upgrade to latest
> version.
>  How can I do it without creating a new cluster  so that all my other
> setting getting erased.
> Thanks
> _R
>


connecting to remote spark and reading files on HDFS or s3 in sparkR

2015-09-10 Thread roni
I have Spark installed on an EC2 cluster. Can I connect to that from my
local SparkR in RStudio? If yes, how?

Can I read files which I have saved as Parquet files on HDFS or S3 in
SparkR? If yes, how?

Thanks
-Roni


reading files on HDFS /s3 in sparkR -failing

2015-09-10 Thread roni
I am trying this -

 ddf <- parquetFile(sqlContext, "hdfs://ec2-52-26-180-130.us-west-2.compute.amazonaws.com:9000/IPF_14_1.parquet")

and I get
path[1]="hdfs://ec2-52-26-180-130.us-west-2.compute.amazonaws.com:9000/IPF_14_1.parquet":
No such file or directory


When I read a file on S3, I get - java.io.IOException: No FileSystem for
scheme: s3


Thanks in advance.

-Roni


Re: connecting to remote spark and reading files on HDFS or s3 in sparkR

2015-09-14 Thread roni
Thanks Akhil. Very good article.

On Mon, Sep 14, 2015 at 4:15 AM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> You can look into this doc
> <https://www.sigmoid.com/how-to-run-sparkr-with-rstudio/> regarding the
> connection (its for gce though but it should be similar).
>
> Thanks
> Best Regards
>
> On Thu, Sep 10, 2015 at 11:20 PM, roni <roni.epi...@gmail.com> wrote:
>
>> I have spark installed on a EC2 cluster. Can I connect to that from my
>> local sparkR in RStudio? if yes , how ?
>>
>> Can I read files  which I have saved as parquet files on hdfs  or s3 in
>> sparkR ? If yes , How?
>>
>> Thanks
>> -Roni
>>
>>
>


Re: cannot coerce class "data.frame" to a DataFrame - with spark R

2016-02-19 Thread roni
Thanks Felix.
I tried that, but I still get the same error. Am I missing something?
 dds <- DESeqDataSetFromMatrix(countData = collect(countMat), colData =
collect(colData), design = design)
Error in DataFrame(colData, row.names = rownames(colData)) :
  cannot coerce class "data.frame" to a DataFrame

On Thu, Feb 18, 2016 at 9:03 PM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Doesn't DESeqDataSetFromMatrix work with data.frame only? It wouldn't work
> with Spark's DataFrame - try collect(countMat) and others to convert them
> into data.frame?
>
>
> _
> From: roni <roni.epi...@gmail.com>
> Sent: Thursday, February 18, 2016 4:55 PM
> Subject: cannot coerce class "data.frame" to a DataFrame - with spark R
> To: <user@spark.apache.org>
>
>
>
> Hi ,
>  I am trying to convert a bioinformatics R script to use spark R. It uses
> external bioconductor package (DESeq2) so the only conversion really I have
> made is to change the way it reads the input file.
>
> When I call my external R library function in DESeq2 I get error cannot
> coerce class "data.frame" to a DataFrame .
>
> I am listing my old R code and new spark R code below and the line giving
> problem is in RED.
> ORIGINAL R -
> library(plyr)
> library(dplyr)
> library(DESeq2)
> library(pheatmap)
> library(gplots)
> library(RColorBrewer)
> library(matrixStats)
> library(pheatmap)
> library(ggplot2)
> library(hexbin)
> library(corrplot)
>
> sampleDictFile <- "/160208.txt"
> sampleDict <- read.table(sampleDictFile)
>
> peaks <- read.table("/Atlas.txt")
> countMat <- read.table("/cntMatrix.txt", header = TRUE, sep = "\t")
>
> colnames(countMat) <- sampleDict$sample
> rownames(peaks) <- rownames(countMat) <- paste0(peaks$seqnames, ":",
> peaks$start, "-", peaks$end, "  ", peaks$symbol)
> peaks$id <- rownames(peaks)
> 
> #
> SPARK R CODE
> peaks <- (read.csv("/Atlas.txt",header = TRUE, sep = "\t")))
> sampleDict<- (read.csv("/160208.txt",header = TRUE, sep = "\t",
> stringsAsFactors = FALSE))
> countMat<-  (read.csv("/cntMatrix.txt",header = TRUE, sep = "\t"))
> ---
>
> COMMON CODE  for both -
>
>   countMat <- countMat[, sampleDict$sample]
>   colData <- sampleDict[,"group", drop = FALSE]
>   design <- ~ group
>
>   * dds <- DESeqDataSetFromMatrix(countData = countMat, colData =
> colData, design = design)*
>
> This line gives error - dds <- DESeqDataSetFromMatrix(countData =
> countMat, colData =  (colData), design = design)
> Error in DataFrame(colData, row.names = rownames(colData)) :
>   cannot coerce class "data.frame" to a DataFrame
>
> I tried as.data.frame or using DataFrame to wrap the defs , but no luck.
> What Can I do differently?
>
> Thanks
> Roni
>
>
>
>


Re: sparkR issues ?

2016-03-15 Thread roni
Hi,
 Is there a workaround for this?
 Do I need to file a bug for it?
Thanks
-R

On Tue, Mar 15, 2016 at 12:28 AM, Sun, Rui <rui@intel.com> wrote:

> It seems as.data.frame() defined in SparkR covers (masks) the version in the R
> base package.
>
> We can try to see if we can change the implementation of as.data.frame()
> in SparkR to avoid such masking.
>
>
>
> *From:* Alex Kozlov [mailto:ale...@gmail.com]
> *Sent:* Tuesday, March 15, 2016 2:59 PM
> *To:* roni <roni.epi...@gmail.com>
> *Cc:* user@spark.apache.org
> *Subject:* Re: sparkR issues ?
>
>
>
> This seems to be a very unfortunate name collision.  SparkR defines its
> own DataFrame class, which shadows what seems to be your own definition.
>
>
>
> Is DataFrame something you define?  Can you rename it?
>
>
>
> On Mon, Mar 14, 2016 at 10:44 PM, roni <roni.epi...@gmail.com> wrote:
>
> Hi,
>
>  I am working with bioinformatics and trying to convert some scripts to
> sparkR to fit into other spark jobs.
>
>
>
> I tried a simple example from a bioinformatics library, and as soon as I start
> the SparkR environment it does not work.
>
>
>
> code as follows -
>
> countData <- matrix(1:100,ncol=4)
>
> condition <- factor(c("A","A","B","B"))
>
> dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition)
>
>
>
> It works if I don't initialize the SparkR environment.
>
>  If I do library(SparkR) and sqlContext <- sparkRSQL.init(sc), it gives the
> following error
>
>
>
> > dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~
> condition)
>
> Error in DataFrame(colData, row.names = rownames(colData)) :
>
>   cannot coerce class "data.frame" to a DataFrame
>
>
>
> I am really stumped. I am not using any Spark function, so I would expect
> it to work as plain R code.
>
> Why does it not work?
>
>
>
> Appreciate  the help
>
> -R
>
>
>
>
>
>
>
> --
>
> Alex Kozlov
> (408) 507-4987
> (650) 887-2135 efax
> ale...@gmail.com
>


Re: sparkR issues ?

2016-03-15 Thread roni
Alex,
 No, I have not defined the "DataFrame"; it's the default Spark DataFrame. That
line is just casting the factor as a data.frame to send to the function.
Thanks
-R

On Mon, Mar 14, 2016 at 11:58 PM, Alex Kozlov <ale...@gmail.com> wrote:

> This seems to be a very unfortunate name collision.  SparkR defines its
> own DataFrame class, which shadows what seems to be your own definition.
>
> Is DataFrame something you define?  Can you rename it?
>
> On Mon, Mar 14, 2016 at 10:44 PM, roni <roni.epi...@gmail.com> wrote:
>
>> Hi,
>>  I am working with bioinformatics and trying to convert some scripts to
>> sparkR to fit into other spark jobs.
>>
>> I tried a simple example from a bioinformatics library, and as soon as I start
>> the SparkR environment it does not work.
>>
>> code as follows -
>> countData <- matrix(1:100,ncol=4)
>> condition <- factor(c("A","A","B","B"))
>> dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~
>> condition)
>>
>> It works if I don't initialize the SparkR environment.
>>  If I do library(SparkR) and sqlContext <- sparkRSQL.init(sc), it gives the
>> following error
>>
>> > dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~
>> condition)
>> Error in DataFrame(colData, row.names = rownames(colData)) :
>>   cannot coerce class "data.frame" to a DataFrame
>>
>> I am really stumped. I am not using any Spark function, so I would
>> expect it to work as plain R code.
>> Why does it not work?
>>
>> Appreciate  the help
>> -R
>>
>>
>
>
> --
> Alex Kozlov
> (408) 507-4987
> (650) 887-2135 efax
> ale...@gmail.com
>


sparkR issues ?

2016-03-14 Thread roni
Hi,
 I am working with bioinformatics and trying to convert some scripts to
sparkR to fit into other spark jobs.

I tried a simple example from a bioinformatics library, and as soon as I start
the SparkR environment it does not work.

code as follows -
countData <- matrix(1:100,ncol=4)
condition <- factor(c("A","A","B","B"))
dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition)

It works if I don't initialize the SparkR environment.
 If I do library(SparkR) and sqlContext <- sparkRSQL.init(sc), it gives the
following error

> dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~
condition)
Error in DataFrame(colData, row.names = rownames(colData)) :
  cannot coerce class "data.frame" to a DataFrame

I am really stumped. I am not using any Spark function, so I would expect
it to work as plain R code.
Why does it not work?

Appreciate  the help
-R


bisecting kmeans model tree

2016-04-21 Thread roni
Hi,
 I want to get the bisecting k-means tree structure to show a dendrogram on
the heatmap I am generating based on the hierarchical clustering of the data.
 How do I get that using MLlib?
Thanks
-Roni


bisecting kmeans tree

2016-04-20 Thread roni
Hi,
 I want to get the bisecting k-means tree structure to show on the heatmap I
am generating based on the hierarchical clustering of the data.
 How do I get that using MLlib?
Thanks
-R


Re: bisecting kmeans model tree

2016-07-12 Thread roni
Hi Spark/MLlib experts,
Can anyone shed light on this?
Thanks
_R

On Thu, Apr 21, 2016 at 12:46 PM, roni <roni.epi...@gmail.com> wrote:

> Hi ,
>  I want to get the bisecting k-means tree structure to show a dendrogram on
> the heatmap I am generating based on the hierarchical clustering of the data.
>  How do I get that using MLlib?
> Thanks
> -Roni
>
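
As far as I can tell, the internal tree of MLlib's BisectingKMeansModel is not exposed through the public API. One workaround is to perform the bisection yourself with plain k = 2 KMeans and record the splits, which is enough to draw a dendrogram. A rough Scala sketch under that assumption (the "features" Vector column name, depth, and size thresholds are placeholders):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.DataFrame

// Repeatedly split each node with k = 2 KMeans and record (parent, child, size)
// edges; the edges can be exported and turned into a dendrogram for the heatmap.
def bisect(df: DataFrame,
           nodeId: String = "root",
           maxDepth: Int = 4,
           minSize: Long = 10L): Seq[(String, String, Long)] = {
  if (maxDepth <= 0 || df.count() < minSize) {
    Seq.empty
  } else {
    val model = new KMeans().setK(2).setFeaturesCol("features").setSeed(42L).fit(df)
    val assigned = model.transform(df).cache()
    (0 until 2).flatMap { c =>
      val child = assigned.filter(assigned("prediction") === c).drop("prediction")
      val childId = s"$nodeId.$c"
      (nodeId, childId, child.count()) +: bisect(child, childId, maxDepth - 1, minSize)
    }
  }
}

// val edges = bisect(featureDF)   // featureDF is a hypothetical DataFrame of features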


MLIB and R results do not match for SVD

2016-08-16 Thread roni
Hi All,
 Some time back I asked about PCA results not matching between R and MLlib.
It was suggested that I use svd.v instead of PCA to match the uncentered PCA.
But the results of MLlib and R for SVD do not match (I can understand the
numbers not matching exactly); the distribution of the data, when plotted,
does not match either.
Is there something I can do to fix this?
I mostly have tall-skinny matrix data with 60+ rows and up to 100+
columns.
Any help is appreciated.
Thanks
_R
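
Two things commonly make such comparisons look worse than they are: singular vectors are only determined up to sign (each column of U and V can be flipped, as long as both are flipped together), and R's svd() is often run on a matrix that has been centered or scaled differently from the data fed to MLlib. A minimal sketch of the MLlib side, in Scala with toy data (an existing SparkContext sc is assumed):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Build a RowMatrix from feature vectors and compute a truncated SVD; a real
// comparison should feed exactly the same rows, in the same order, as in R.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)
))
val mat = new RowMatrix(rows)
val svd = mat.computeSVD(2, computeU = true)  // top 2 singular values/vectors

println(svd.s)  // singular values, in descending order; compare with R's svd()$d
println(svd.V)  // right singular vectors; columns may differ from R's by a sign flip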


support vector regression in spark

2016-12-01 Thread roni
Hi All,
 I want to know how I can do support vector regression (SVR) in Spark.
 Thanks
R


SVM regression in Spark

2016-11-29 Thread roni
Hi All,
 I am trying to port my R code to Spark. I am using SVM regression in R.
It seems that Spark provides only SVM classification.
How can I get regression results?
In my R code I am using a call to the svm() function in library("e1071") (
ftp://cran.r-project.org/pub/R/web/packages/e1071/vignettes/svmdoc.pdf)
svrObj <- svm(x,
              y,
              scale = TRUE,
              type = "nu-regression",
              kernel = "linear",
              nu = 0.9)

Once I get the svm object back , I get the -
 from the values.

How can I do this in spark?
Thanks in advance
Roni


Re: SVM regression in Spark

2016-11-30 Thread roni
Hi Spark experts,
 Can anyone help with doing SVR (support vector regression) in
Spark?
Thanks
R

On Tue, Nov 29, 2016 at 6:50 PM, roni <roni.epi...@gmail.com> wrote:

> Hi All,
>  I am trying to port my R code to Spark. I am using SVM regression in R.
> It seems that Spark provides only SVM classification.
> How can I get regression results?
> In my R code I am using a call to the svm() function in library("e1071") (
> ftp://cran.r-project.org/pub/R/web/packages/e1071/vignettes/svmdoc.pdf)
> svrObj <- svm(x,
>               y,
>               scale = TRUE,
>               type = "nu-regression",
>               kernel = "linear",
>               nu = 0.9)
>
> Once I get the svm object back , I get the -
>  from the values.
>
> How can I do this in spark?
> Thanks in advance
> Roni
>
>
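
As far as I know, Spark MLlib/ML does not ship an SVR implementation (only linear SVM classification), so there is no drop-in equivalent of e1071's nu-regression. For a linear kernel, the closest built-in stand-in is regularized linear regression, which uses a squared loss rather than an epsilon-insensitive loss; a hedged Scala sketch (the trainingDF DataFrame and its column names are assumptions):

import org.apache.spark.ml.regression.LinearRegression

// Not SVR: squared loss instead of an epsilon-insensitive loss, but linear-kernel
// behaviour with L1/L2 regularization. trainingDF is assumed to have a Vector
// "features" column and a Double "label" column.
val lr = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setRegParam(0.1)         // regularization strength (placeholder value)
  .setElasticNetParam(0.0)  // 0.0 = pure L2, 1.0 = pure L1

val model = lr.fit(trainingDF)
println(model.coefficients)
println(model.intercept)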


Re: calculate diff of value and median in a group

2017-07-14 Thread roni
I was using this function, percentile_approx, on 100 GB of compressed data
and it just hangs. Any pointers?
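
One knob worth checking, sketched in Scala with assumed names (df, group1, group2, value): percentile_approx accepts an optional accuracy argument, and approxQuantile exposes the accuracy/speed trade-off directly, so loosening the accuracy can reduce memory and shuffle pressure on large inputs.

import org.apache.spark.sql.functions.{abs, col, expr}

// Per-group approximate median; the optional third argument to percentile_approx
// is the accuracy parameter (larger = more exact but more memory and CPU).
val medians = df
  .groupBy("group1", "group2")
  .agg(expr("percentile_approx(value, 0.5, 1000)").as("median_value"))

// Join the medians back to compute each row's distance from its group median.
val withDiff = df
  .join(medians, Seq("group1", "group2"))
  .withColumn("diff", abs(col("value") - col("median_value")))

// Single-column alternative with an explicit relative-error bound:
val Array(overallMedian) = df.stat.approxQuantile("value", Array(0.5), 0.01)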

On Wed, Mar 22, 2017 at 6:09 PM, ayan guha  wrote:

> For median, use percentile_approx with 0.5 (50th percentile is the median)
>
> On Thu, Mar 23, 2017 at 11:01 AM, Yong Zhang  wrote:
>
>> He is looking for median, not mean/avg.
>>
>>
>> You have to implement the median logic yourself, as there is no
>> direct implementation in Spark. You can use the RDD API if you are on
>> 1.6.x, or the Dataset API if you are on 2.x.
>>
>>
>> The following example gives you an idea of how to calculate the median using
>> the Dataset API. You can extend the code with additional logic to
>> calculate the diff of every value from the median.
>>
>>
>> scala> spark.version
>> res31: String = 2.1.0
>>
>> scala> val ds = Seq((100,0.43),(100,0.33),(100,0.73),(101,0.29),(101,0.96),(101,0.42),(101,0.01)).toDF("id","value").as[(Int, Double)]
>> ds: org.apache.spark.sql.Dataset[(Int, Double)] = [id: int, value: double]
>>
>> scala> ds.show
>> +---+-----+
>> | id|value|
>> +---+-----+
>> |100| 0.43|
>> |100| 0.33|
>> |100| 0.73|
>> |101| 0.29|
>> |101| 0.96|
>> |101| 0.42|
>> |101| 0.01|
>> +---+-----+
>>
>> scala> def median(seq: Seq[Double]) = {
>>      |   val size = seq.size
>>      |   val sorted = seq.sorted
>>      |   size match {
>>      |     case even if size % 2 == 0 => (sorted((size - 2) / 2) + sorted(size / 2)) / 2
>>      |     case odd => sorted((size - 1) / 2)
>>      |   }
>>      | }
>> median: (seq: Seq[Double])Double
>>
>> scala> ds.groupByKey(_._1).mapGroups((id, iter) => (id, median(iter.map(_._2).toSeq))).show
>> +---+-----+
>> | _1|   _2|
>> +---+-----+
>> |101|0.355|
>> |100| 0.43|
>> +---+-----+
>>
>>
>> Yong
>>
>>
>>
>>
>> --
>> *From:* ayan guha 
>> *Sent:* Wednesday, March 22, 2017 7:23 PM
>> *To:* Craig Ching
>> *Cc:* Yong Zhang; user@spark.apache.org
>> *Subject:* Re: calculate diff of value and median in a group
>>
>> I would suggest using a window function with partitioning.
>>
>> select group1, group2, name, value,
>>        avg(value) over (partition by group1, group2) as m
>> from t
>>
>> On Thu, Mar 23, 2017 at 9:58 AM, Craig Ching 
>> wrote:
>>
>>> Is the element count per group big? If not, you can group them and use
>>> the code to calculate the median and diff.
>>>
>>>
>>> They're not big, no.  Any pointers on how I might do that?  The part I'm
>>> having trouble with is the grouping, I can't seem to see how to do the
>>> median per group.  For mean, we have the agg feature, but not for median
>>> (and I understand the reasons for that).
>>>
>>> Yong
>>>
>>> --
>>> *From:* Craig Ching 
>>> *Sent:* Wednesday, March 22, 2017 3:17 PM
>>> *To:* user@spark.apache.org
>>> *Subject:* calculate diff of value and median in a group
>>>
>>> Hi,
>>>
>>> When using pyspark, I'd like to be able to calculate the difference
>>> between grouped values and their median for the group.  Is this possible?
>>> Here is some code I hacked up that does what I want except that it
>>> calculates the grouped diff from mean.  Also, please feel free to comment
>>> on how I could make this better if you feel like being helpful :)
>>>
>>> from pyspark import SparkContext
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.types import (
>>>     StringType,
>>>     LongType,
>>>     DoubleType,
>>>     StructField,
>>>     StructType
>>> )
>>> from pyspark.sql import functions as F
>>>
>>>
>>> sc = SparkContext(appName='myapp')
>>> spark = SparkSession(sc)
>>>
>>> file_name = 'data.csv'
>>>
>>> fields = [
>>>     StructField('group2', LongType(), True),
>>>     StructField('name', StringType(), True),
>>>     StructField('value', DoubleType(), True),
>>>     StructField('group1', LongType(), True)
>>> ]
>>> schema = StructType(fields)
>>>
>>> df = spark.read.csv(
>>>     file_name, header=False, mode="DROPMALFORMED", schema=schema
>>> )
>>> df.show()
>>> means = df.select([
>>>     'group1',
>>>     'group2',
>>>     'name',
>>>     'value'
>>> ]).groupBy([
>>>     'group1',
>>>     'group2'
>>> ]).agg(
>>>     F.mean('value').alias('mean_value')
>>> ).orderBy('group1', 'group2')
>>>
>>> cond = [df.group1 == means.group1, df.group2 == means.group2]
>>>
>>> means.show()
>>> df = df.select([
>>>     'group1',
>>>     'group2',
>>>     'name',
>>>     'value'
>>> ]).join(
>>>     means,
>>>     cond
>>> ).drop(
>>>     df.group1
>>> ).drop(
>>>     df.group2
>>> ).select('group1',
>>>          'group2',
>>>          'name',
>>>          'value',
>>>          'mean_value')
>>>
>>> final = df.withColumn(
>>>     'diff',
>>>     F.abs(df.value - df.mean_value))