Storing data in MySQL from spark hive tables

2015-05-20 Thread roni
Hi ,
I am trying to set up the Hive metastore and MySQL DB connection.
I have a Spark cluster, I have run some programs, and I have data stored in
some Hive tables.
Now I want to store this data in MySQL so that it is available for
further processing.

I set up the hive-site.xml file as follows:

<configuration>

  <property>
    <name>hive.semantic.analyzer.factory.impl</name>
    <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
  </property>

  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>false</value>
  </property>

  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>

  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://<ip address>:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value></value>
  </property>

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/${user.name}/hive-warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>

</configuration>

 --
My MySQL server is on a different host from my Spark cluster. When I use
MySQL Workbench, I connect over an SSH connection with a certificate file.
How do I pass all of that connection information from Spark to the DB?
I want to store the data generated by my Spark program in MySQL.
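For reference, this is roughly the kind of write I am hoping to do (a minimal
sketch: the table name, host, and credentials are placeholders, and the
HiveContext plus write.jdbc call, which needs Spark 1.4 or later, is my
assumption about the right API):

import java.util.Properties
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hive-to-mysql"))
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Read the data already sitting in a Hive table (table name is a placeholder).
val df = hiveContext.table("my_results_table")

val props = new Properties()
props.setProperty("user", "root")
props.setProperty("password", "<password>")
props.setProperty("driver", "com.mysql.jdbc.Driver")

// Spark's JDBC writer needs a plain TCP route to MySQL; it has no notion of an
// SSH tunnel or certificate. One option is to open the tunnel outside Spark
// (forward a local port to the DB host) and point the JDBC URL at that port.
df.write.jdbc("jdbc:mysql://<db host>:3306/mydb", "results", props)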
Thanks
_R


Re: Is anyone using Amazon EC2? (second attempt!)

2015-05-29 Thread roni
Hi ,
Any update on this?
I am not sure if the issue I am seeing is related.
I have 8 slaves, and when I created the cluster I specified an EBS volume of
100 GB for each.
On EC2 I see 8 volumes created, each attached to the corresponding slave.
But when I try to copy data onto them, it complains:

/root/ephemeral-hdfs/bin/hadoop fs -cp /intersection hdfs://
ec2-54-149-112-136.us-west-2.compute.amazonaws.com:9010/

2015-05-28 23:40:35,447 WARN  hdfs.DFSClient
(DFSOutputStream.java:run(562)) - DataStreamer Exception

org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/intersection/kmer150/commonGoodKmers/_temporary/_attempt_201504010056_0004_m_000428_3948/part-00428._COPYING_
could only be replicated to 0 nodes instead of minReplication (=1).  There
are 1 datanode(s) running and no node(s) are excluded in this operation.


It shows only 1 datanode, but ephemeral-hdfs shows 8 datanodes.

Any thoughts?

Thanks

_R

On Sat, May 23, 2015 at 7:24 AM, Joe Wass  wrote:

> I used Spark on EC2 a while ago, but recent revisions seem to have broken
> the functionality.
>
> Is anyone actually using Spark on EC2 at the moment?
>
> The bug in question is:
>
> https://issues.apache.org/jira/browse/SPARK-5008
>
> It makes it impossible to use persistent HDFS without a workaround on each
> slave node.
>
> No-one seems to be interested in the bug, so I wonder if other people
> aren't actually having this problem. If this is the case, any suggestions?
>
> Joe
>


which database for gene alignment data ?

2015-06-05 Thread roni
I want to use Spark to read compressed .bed files containing gene
sequencing alignment data.
I want to store the bed file data in a DB and then use external gene
expression data to find overlaps etc. Which database is best for this?
Thanks
-Roni


Re: which database for gene alignment data ?

2015-06-08 Thread roni
Sorry for the delay.
The files (called .bed files) have format like -

Chromosome  start   end     feature  score  strand

chr1        713776  714375  peak.1   599    +
chr1        752401  753000  peak.2   599    +

The mandatory fields are


   1. chrom - The name of the chromosome (e.g. chr3, chrY,
chr2_random) or scaffold (e.g. scaffold10671).
   2. chromStart - The starting position of the feature in the
chromosome or scaffold. The first base in a chromosome is numbered 0.
   3. chromEnd - The ending position of the feature in the chromosome
or scaffold. The *chromEnd* base is not included in the display of the
feature. For example, the first 100 bases of a chromosome are defined
as *chromStart=0, chromEnd=100*, and span the bases numbered 0-99.

There can be more data as described -
https://genome.ucsc.edu/FAQ/FAQformat.html#format1
Typical use cases are:
1. Find the features between given start and end positions.
2. Find features whose start and end points overlap with another feature.
3. Read external (reference) data in a similar format (e.g. chr10  48514785
49604641  MAPK8  49514785  +) and find all the data points that overlap
with the other .bed files.

The data is huge: .bed files can range from 0.5 GB to 5 GB (or more).
I was thinking of using Cassandra, but I am not sure whether the overlap
queries can be supported and will be fast enough.
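For what it is worth, this is the rough first-pass overlap check I have in
mind in plain Spark, before any database comes in (a sketch: the file paths
and the whitespace split are assumptions, and for really large inputs a proper
interval/region join would be the better fit):

import org.apache.spark.{SparkConf, SparkContext}

case class BedRecord(chrom: String, start: Long, end: Long, name: String)

val sc = new SparkContext(new SparkConf().setAppName("bed-overlap"))

// sc.textFile reads .gz-compressed .bed files transparently.
def parse(line: String): BedRecord = {
  val f = line.split("\\s+")
  BedRecord(f(0), f(1).toLong, f(2).toLong, if (f.length > 3) f(3) else "")
}

val peaks = sc.textFile("peaks.bed.gz").map(parse)          // our .bed data
val reference = sc.textFile("reference.bed.gz").map(parse)  // external reference data

// Naive overlap join: key both sides by chromosome, then keep pairs whose
// half-open [start, end) intervals intersect. Fine as a first pass, but it
// pairs up every feature per chromosome, so it will not scale to the biggest files.
val overlaps = peaks.map(r => (r.chrom, r))
  .join(reference.map(r => (r.chrom, r)))
  .filter { case (_, (a, b)) => a.start < b.end && b.start < a.end }

overlaps.take(10).foreach(println)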

Thanks for the help
-Roni


On Sat, Jun 6, 2015 at 7:03 AM, Ted Yu  wrote:

> Can you describe your use case in a bit more detail since not all people
> on this mailing list are familiar with gene sequencing alignments data ?
>
> Thanks
>
> On Fri, Jun 5, 2015 at 11:42 PM, roni  wrote:
>
>> I want to use spark for reading compressed .bed file for reading gene
>> sequencing alignments data.
>> I want to store bed file data in db and then use external gene expression
>> data to find overlaps etc, which database is best for it ?
>> Thanks
>> -Roni
>>
>>
>


Re: which database for gene alignment data ?

2015-06-09 Thread roni
Hi Frank,
Thanks for the reply. I downloaded ADAM and built it, but it does not seem
to list these functions among its command-line options.
Are they exposed as a public API that I can call from code?

Also, I need to save all my intermediate data. It seems that ADAM stores
data as Parquet on HDFS.
I want to save the data in an external database, so that we can re-use
the saved data in multiple ways by multiple people.
Any suggestions on the DB selection, or on keeping data centralized for use
by multiple distinct groups?
Thanks
-Roni



On Mon, Jun 8, 2015 at 12:47 PM, Frank Austin Nothaft  wrote:

> Hi Roni,
>
> We have a full suite of genomic feature parsers that can read BED,
> narrowPeak, GATK interval lists, and GTF/GFF into Spark RDDs in ADAM
> <https://github.com/bigdatagenomics/adam>  Additionally, we have support
> for efficient overlap joins (query 3 in your email below). You can load the
> genomic features with ADAMContext.loadFeatures
> <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala#L438>.
> We have two tools for the overlap computation: you can use a
> BroadcastRegionJoin
> <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/BroadcastRegionJoin.scala>
>  if
> one of the datasets you want to overlap is small or a ShuffleRegionJoin
> <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ShuffleRegionJoin.scala>
>  if
> both datasets are large.
>
> Regards,
>
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
>
> On Jun 8, 2015, at 9:39 PM, roni  wrote:
>
> Sorry for the delay.
> The files (called .bed files) have format like -
>
> Chromosome start  endfeature score  strand
>
> chr1   713776  714375  peak.1  599+
> chr1   752401  753000  peak.2  599+
>
> The mandatory fields are
>
>
>1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or 
> scaffold (e.g. scaffold10671).
>2. chromStart - The starting position of the feature in the chromosome or 
> scaffold. The first base in a chromosome is numbered 0.
>3. chromEnd - The ending position of the feature in the chromosome or 
> scaffold. The *chromEnd* base is not included in the display of the feature. 
> For example, the first 100 bases of a chromosome are defined as 
> *chromStart=0, chromEnd=100*, and span the bases numbered 0-99.
>
> There can be more data as described - 
> https://genome.ucsc.edu/FAQ/FAQformat.html#format1
> Many times the use cases are like
> 1. find the features between given start and end positions
> 2.Find features which have overlapping start and end points with another 
> feature.
> 3. read external (reference) data which will have similar format (chr10   
> 4851478549604641MAPK8   49514785+) and find all the 
> data points which are overlapping with the other  .bed files.
>
> The data is huge. .bed files can range from .5 GB to 5 gb (or more)
> I was thinking of using cassandra, but not sue if the overlapping queries can 
> be supported and will be fast enough.
>
> Thanks for the help
> -Roni
>
>
> On Sat, Jun 6, 2015 at 7:03 AM, Ted Yu  wrote:
>
>> Can you describe your use case in a bit more detail since not all people
>> on this mailing list are familiar with gene sequencing alignments data ?
>>
>> Thanks
>>
>> On Fri, Jun 5, 2015 at 11:42 PM, roni  wrote:
>>
>>> I want to use spark for reading compressed .bed file for reading gene
>>> sequencing alignments data.
>>> I want to store bed file data in db and then use external gene
>>> expression data to find overlaps etc, which database is best for it ?
>>> Thanks
>>> -Roni
>>>
>>>
>>
>
>


sparkR issues ?

2016-03-14 Thread roni
Hi,
 I am working with bioinformatics and trying to convert some scripts to
sparkR to fit into other spark jobs.

I tried a simple example from a bioinformatics library, and as soon as I
start the SparkR environment it does not work.

code as follows -
countData <- matrix(1:100,ncol=4)
condition <- factor(c("A","A","B","B"))
dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition)

It works if I don't initialize the SparkR environment.
If I do library(SparkR) and sqlContext <- sparkRSQL.init(sc), it gives the
following error:

> dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~
condition)
Error in DataFrame(colData, row.names = rownames(colData)) :
  cannot coerce class "data.frame" to a DataFrame

I am really stumped. I am not using any Spark function, so I would expect
it to work as plain R code.
Why does it not work?

Appreciate  the help
-R


Re: sparkR issues ?

2016-03-15 Thread roni
Alex,
 No, I have not defined the "DataFrame"; it is the default Spark DataFrame.
That line is just casting the factor as a data frame to send to the function.
Thanks
-R

On Mon, Mar 14, 2016 at 11:58 PM, Alex Kozlov  wrote:

> This seems to be a very unfortunate name collision.  SparkR defines it's
> own DataFrame class which shadows what seems to be your own definition.
>
> Is DataFrame something you define?  Can you rename it?
>
> On Mon, Mar 14, 2016 at 10:44 PM, roni  wrote:
>
>> Hi,
>>  I am working with bioinformatics and trying to convert some scripts to
>> sparkR to fit into other spark jobs.
>>
>> I tries a simple example from a bioinf lib and as soon as I start sparkR
>> environment it does not work.
>>
>> code as follows -
>> countData <- matrix(1:100,ncol=4)
>> condition <- factor(c("A","A","B","B"))
>> dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~
>> condition)
>>
>> Works if i dont initialize the sparkR environment.
>>  if I do library(SparkR) and sqlContext <- sparkRSQL.init(sc)  it gives
>> following error
>>
>> > dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~
>> condition)
>> Error in DataFrame(colData, row.names = rownames(colData)) :
>>   cannot coerce class "data.frame" to a DataFrame
>>
>> I am really stumped. I am not using any spark function , so i would
>> expect it to work as a simple R code.
>> why it does not work?
>>
>> Appreciate  the help
>> -R
>>
>>
>
>
> --
> Alex Kozlov
> (408) 507-4987
> (650) 887-2135 efax
> ale...@gmail.com
>


Re: sparkR issues ?

2016-03-15 Thread roni
Hi ,
 Is there a workaround for this?
 Do I need to file a bug for this?
Thanks
-R

On Tue, Mar 15, 2016 at 12:28 AM, Sun, Rui  wrote:

> It seems as.data.frame() defined in SparkR convers the versions in R base
> package.
>
> We can try to see if we can change the implementation of as.data.frame()
> in SparkR to avoid such covering.
>
>
>
> *From:* Alex Kozlov [mailto:ale...@gmail.com]
> *Sent:* Tuesday, March 15, 2016 2:59 PM
> *To:* roni 
> *Cc:* user@spark.apache.org
> *Subject:* Re: sparkR issues ?
>
>
>
> This seems to be a very unfortunate name collision.  SparkR defines it's
> own DataFrame class which shadows what seems to be your own definition.
>
>
>
> Is DataFrame something you define?  Can you rename it?
>
>
>
> On Mon, Mar 14, 2016 at 10:44 PM, roni  wrote:
>
> Hi,
>
>  I am working with bioinformatics and trying to convert some scripts to
> sparkR to fit into other spark jobs.
>
>
>
> I tries a simple example from a bioinf lib and as soon as I start sparkR
> environment it does not work.
>
>
>
> code as follows -
>
> countData <- matrix(1:100,ncol=4)
>
> condition <- factor(c("A","A","B","B"))
>
> dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition)
>
>
>
> Works if i dont initialize the sparkR environment.
>
>  if I do library(SparkR) and sqlContext <- sparkRSQL.init(sc)  it gives
> following error
>
>
>
> > dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~
> condition)
>
> Error in DataFrame(colData, row.names = rownames(colData)) :
>
>   cannot coerce class "data.frame" to a DataFrame
>
>
>
> I am really stumped. I am not using any spark function , so i would expect
> it to work as a simple R code.
>
> why it does not work?
>
>
>
> Appreciate  the help
>
> -R
>
>
>
>
>
>
>
> --
>
> Alex Kozlov
> (408) 507-4987
> (650) 887-2135 efax
> ale...@gmail.com
>


bisecting kmeans tree

2016-04-20 Thread roni
Hi ,
 I want to get the bisecting k-means tree structure to show on the heatmap I
am generating based on the hierarchical clustering of the data.
 How do I get that using MLlib?
Thanks
-R


bisecting kmeans model tree

2016-04-21 Thread roni
Hi ,
 I want to get the bisecting k-means tree structure to show a dendrogram on
the heatmap I am generating based on the hierarchical clustering of the data.
 How do I get that using MLlib?
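For context, this is as far as I have got (a minimal sketch against
spark.mllib's BisectingKMeans, with toy vectors standing in for my data); the
cluster centers come out fine, but I do not see a public way to get at the
split hierarchy itself:

import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("bisecting-kmeans"))

// Toy data standing in for the expression matrix rows.
val data = sc.parallelize(Seq(
  Vectors.dense(0.1, 0.2), Vectors.dense(0.2, 0.1),
  Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 8.9)))

val model = new BisectingKMeans().setK(2).run(data)

// Cluster centers and cost are exposed; the internal cluster tree does not
// appear to be, so a dendrogram would have to be reconstructed some other way
// (for example by re-running with increasing k and recording the splits).
model.clusterCenters.foreach(println)
println(model.computeCost(data))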
Thanks
-Roni


Re: bisecting kmeans model tree

2016-07-12 Thread roni
Hi Spark,Mlib experts,
Can anyone shed some light on this?
Thanks
_R

On Thu, Apr 21, 2016 at 12:46 PM, roni  wrote:

> Hi ,
>  I want to get the bisecting kmeans tree structure to show a dendogram  on
> the heatmap I am generating based on the hierarchical clustering of data.
>  How do I get that using mlib .
> Thanks
> -Roni
>


MLIB and R results do not match for SVD

2016-08-16 Thread roni
Hi All,
 Some time back I asked about PCA results not matching between R and MLlib.
It was suggested that I use svd.V instead of PCA to match the uncentered PCA.
But the results of MLlib and R for SVD do not match. (I can understand the
numbers not matching exactly,) but the distribution of the data, when plotted,
does not match either.
Is there something I can do to fix this?
I mostly have tall-skinny matrix data with 60+ rows and up to 100+
columns.
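For reference, a minimal sketch of the MLlib side of the comparison (toy rows
in place of the real tall-skinny matrix; svd.V is what I compare against R):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("svd-compare"))

// Toy rows standing in for the real matrix.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)))

val mat = new RowMatrix(rows)
val svd = mat.computeSVD(2, computeU = false)

// svd.V holds the right singular vectors (one per column) and svd.s the
// singular values. Singular vectors are only defined up to sign, and any
// difference in column scaling between the two pipelines will also change
// the plotted distribution, so both need to match before comparing with R.
println(svd.s)
println(svd.V)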
Any help is appreciated .
Thanks
_R


cannot coerce class "data.frame" to a DataFrame - with spark R

2016-02-18 Thread roni
Hi ,
 I am trying to convert a bioinformatics R script to use Spark R. It uses an
external Bioconductor package (DESeq2), so the only real conversion I have
made is to change the way it reads the input file.

When I call the external R library function from DESeq2, I get the error
"cannot coerce class "data.frame" to a DataFrame".

I am listing my old R code and new Spark R code below; the problem line is
marked with asterisks.
ORIGINAL R -
library(plyr)
library(dplyr)
library(DESeq2)
library(pheatmap)
library(gplots)
library(RColorBrewer)
library(matrixStats)
library(pheatmap)
library(ggplot2)
library(hexbin)
library(corrplot)

sampleDictFile <- "/160208.txt"
sampleDict <- read.table(sampleDictFile)

peaks <- read.table("/Atlas.txt")
countMat <- read.table("/cntMatrix.txt", header = TRUE, sep = "\t")

colnames(countMat) <- sampleDict$sample
rownames(peaks) <- rownames(countMat) <- paste0(peaks$seqnames, ":",
peaks$start, "-", peaks$end, "  ", peaks$symbol)
peaks$id <- rownames(peaks)

#
SPARK R CODE
peaks <- read.csv("/Atlas.txt", header = TRUE, sep = "\t")
sampleDict <- read.csv("/160208.txt", header = TRUE, sep = "\t",
stringsAsFactors = FALSE)
countMat <- read.csv("/cntMatrix.txt", header = TRUE, sep = "\t")
---

COMMON CODE  for both -

  countMat <- countMat[, sampleDict$sample]
  colData <- sampleDict[,"group", drop = FALSE]
  design <- ~ group

 * dds <- DESeqDataSetFromMatrix(countData = countMat, colData = colData,
design = design)*

This line gives error - dds <- DESeqDataSetFromMatrix(countData = countMat,
colData =  (colData), design = design)
Error in DataFrame(colData, row.names = rownames(colData)) :
  cannot coerce class "data.frame" to a DataFrame

I tried as.data.frame and wrapping the definitions with DataFrame, but no luck.
What can I do differently?

Thanks
Roni


Re: cannot coerce class "data.frame" to a DataFrame - with spark R

2016-02-19 Thread roni
Thanks Felix.
I tried that, but I still get the same error. Am I missing something?
 dds <- DESeqDataSetFromMatrix(countData = collect(countMat), colData =
collect(colData), design = design)
Error in DataFrame(colData, row.names = rownames(colData)) :
  cannot coerce class "data.frame" to a DataFrame

On Thu, Feb 18, 2016 at 9:03 PM, Felix Cheung 
wrote:

> Doesn't DESeqDataSetFromMatrix work with data.frame only? It wouldn't work
> with Spark's DataFrame - try collect(countMat) and others to convert them
> into data.frame?
>
>
> _
> From: roni 
> Sent: Thursday, February 18, 2016 4:55 PM
> Subject: cannot coerce class "data.frame" to a DataFrame - with spark R
> To: 
>
>
>
> Hi ,
>  I am trying to convert a bioinformatics R script to use spark R. It uses
> external bioconductor package (DESeq2) so the only conversion really I have
> made is to change the way it reads the input file.
>
> When I call my external R library function in DESeq2 I get error cannot
> coerce class "data.frame" to a DataFrame .
>
> I am listing my old R code and new spark R code below and the line giving
> problem is in RED.
> ORIGINAL R -
> library(plyr)
> library(dplyr)
> library(DESeq2)
> library(pheatmap)
> library(gplots)
> library(RColorBrewer)
> library(matrixStats)
> library(pheatmap)
> library(ggplot2)
> library(hexbin)
> library(corrplot)
>
> sampleDictFile <- "/160208.txt"
> sampleDict <- read.table(sampleDictFile)
>
> peaks <- read.table("/Atlas.txt")
> countMat <- read.table("/cntMatrix.txt", header = TRUE, sep = "\t")
>
> colnames(countMat) <- sampleDict$sample
> rownames(peaks) <- rownames(countMat) <- paste0(peaks$seqnames, ":",
> peaks$start, "-", peaks$end, "  ", peaks$symbol)
> peaks$id <- rownames(peaks)
> 
> #
> SPARK R CODE
> peaks <- (read.csv("/Atlas.txt",header = TRUE, sep = "\t")))
> sampleDict<- (read.csv("/160208.txt",header = TRUE, sep = "\t",
> stringsAsFactors = FALSE))
> countMat<-  (read.csv("/cntMatrix.txt",header = TRUE, sep = "\t"))
> ---
>
> COMMON CODE  for both -
>
>   countMat <- countMat[, sampleDict$sample]
>   colData <- sampleDict[,"group", drop = FALSE]
>   design <- ~ group
>
>   * dds <- DESeqDataSetFromMatrix(countData = countMat, colData =
> colData, design = design)*
>
> This line gives error - dds <- DESeqDataSetFromMatrix(countData =
> countMat, colData =  (colData), design = design)
> Error in DataFrame(colData, row.names = rownames(colData)) :
>   cannot coerce class "data.frame" to a DataFrame
>
> I tried as.data.frame or using DataFrame to wrap the defs , but no luck.
> What Can I do differently?
>
> Thanks
> Roni
>
>
>
>


upgrading from spark 1.4 to latest version

2015-11-02 Thread roni
Hi,
 I have a cluster currently running Spark 1.4 and want to upgrade to the
latest version.
 How can I do it without creating a new cluster, so that my other settings
do not get erased?
Thanks
_R


Upgrade spark cluster to latest version

2015-11-03 Thread roni
Hi  Spark experts,
  This may be a very naive question, but can you please point me to the
proper way to upgrade the Spark version on an existing cluster.
Thanks
Roni

> Hi,
>  I have a current cluster running spark 1.4 and want to upgrade to latest
> version.
>  How can I do it without creating a new cluster  so that all my other
> setting getting erased.
> Thanks
> _R
>


No suitable driver found for jdbc:mysql://

2015-07-22 Thread roni
Hi All,
 I have a cluster with spark 1.4.
I am trying to save data to mysql but getting error

Exception in thread "main" java.sql.SQLException: No suitable driver found
for jdbc:mysql://<>.rds.amazonaws.com:3306/DAE_kmer?user=<>&password=<>


I looked at https://issues.apache.org/jira/browse/SPARK-8463 and added the
connector jar to the same location as on the master using the copy-dir script.

But I am still getting the same error. This used to work with 1.3.

This is my command to run the program -

$SPARK_HOME/bin/spark-submit --jars
/root/spark/lib/mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar
--conf
spark.executor.extraClassPath=/root/spark/lib/mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar
--conf spark.executor.memory=55g --driver-memory=55g
--master=spark://ec2-52-25-191-999.us-west-2.compute.amazonaws.com:7077
--class "saveBedToDB" target/scala-2.10/adam-project_2.10-1.0.jar

What else can I do?
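For completeness, a sketch of the kind of write I am doing, plus the one
change I plan to try next: naming the driver class explicitly in the
connection properties and loading it on the driver first. Whether 1.4 needs
or honours that is a guess on my part, and the input path and table name
below are placeholders.

import java.util.Properties
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("saveBedToDB"))
val sqlContext = new SQLContext(sc)

val df = sqlContext.read.parquet("hdfs:///path/to/bed.parquet") // placeholder input

// Register the driver class with DriverManager on the driver JVM, and also
// pass it along in the connection properties.
Class.forName("com.mysql.jdbc.Driver")
val props = new Properties()
props.setProperty("user", "<user>")
props.setProperty("password", "<password>")
props.setProperty("driver", "com.mysql.jdbc.Driver")

// (Adding the connector jar via --driver-class-path on spark-submit is another
// thing on my list to try.)
df.write.jdbc("jdbc:mysql://<host>.rds.amazonaws.com:3306/DAE_kmer", "kmer_table", props)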

Thanks

-Roni


Re: Do I really need to build Spark for Hive/Thrift Server support?

2015-08-10 Thread roni
Hi All,
 Any explanation for this?
 As Reece said, I can do operations with Hive, but

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) -- gives an error.

I have already created a Spark EC2 cluster with the spark-ec2 script. How can
I rebuild it?

Thanks
_Roni

On Tue, Jul 28, 2015 at 2:46 PM, ReeceRobinson 
wrote:

> I am building an analytics environment based on Spark and want to use HIVE
> in
> multi-user mode i.e. not use the embedded derby database but use Postgres
> and HDFS instead. I am using the included Spark Thrift Server to process
> queries using Spark SQL.
>
> The documentation gives me the impression that I need to create a custom
> build of Spark 1.4.1. However I don't think this is either accurate now OR
> it is for some different context I'm not aware of?
>
> I used the pre-built Spark 1.4.1 distribution today with my hive-site.xml
> for Postgres and HDFS and it worked! I see the warehouse files turn up in
> HDFS and I see the metadata inserted into Postgres when I created a test
> table.
>
> I can connect to the Thrift Server using beeline and perform queries on my
> data. I also verified using the Spark UI that the SQL is being processed by
> Spark SQL.
>
> So I guess I'm asking is the document out-of-date or am I missing
> something?
>
> Cheers,
> Reece
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Do-I-really-need-to-build-Spark-for-Hive-Thrift-Server-support-tp24013p24039.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


connecting to remote spark and reading files on HDFS or s3 in sparkR

2015-09-10 Thread roni
I have Spark installed on an EC2 cluster. Can I connect to it from my
local SparkR in RStudio? If yes, how?

Can I read files that I have saved as Parquet files on HDFS or S3 in
SparkR? If yes, how?

Thanks
-Roni


reading files on HDFS /s3 in sparkR -failing

2015-09-10 Thread roni
I am trying this -

 ddf <- parquetFile(sqlContext,  "hdfs://
ec2-52-26-180-130.us-west-2.compute.amazonaws.com:9000/IPF_14_1.parquet")

and I get path[1]="hdfs://
ec2-52-26-180-130.us-west-2.compute.amazonaws.com:9000/IPF_14_1.parquet":
No such file or directory


when I read file on s3 , I get -  java.io.IOException: No FileSystem for
scheme: s3


Thanks in advance.

-Roni


Re: connecting to remote spark and reading files on HDFS or s3 in sparkR

2015-09-14 Thread roni
Thanks Akhil. Very good article.

On Mon, Sep 14, 2015 at 4:15 AM, Akhil Das 
wrote:

> You can look into this doc
> <https://www.sigmoid.com/how-to-run-sparkr-with-rstudio/> regarding the
> connection (its for gce though but it should be similar).
>
> Thanks
> Best Regards
>
> On Thu, Sep 10, 2015 at 11:20 PM, roni  wrote:
>
>> I have spark installed on a EC2 cluster. Can I connect to that from my
>> local sparkR in RStudio? if yes , how ?
>>
>> Can I read files  which I have saved as parquet files on hdfs  or s3 in
>> sparkR ? If yes , How?
>>
>> Thanks
>> -Roni
>>
>>
>


SVM regression in Spark

2016-11-29 Thread roni
Hi All,
 I am trying to port my R code to Spark. I am using SVM regression in R,
but it seems that Spark provides only SVM classification.
How can I get regression results?
In my R code I call the svm() function from library("e1071") (
ftp://cran.r-project.org/pub/R/web/packages/e1071/vignettes/svmdoc.pdf):
svrObj <- svm(x ,
y ,
scale = TRUE,
type = "nu-regression",
kernel = "linear",
nu = .9)

Once I get the svm object back , I get the -
 from the values.

How can I do this in spark?
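As far as I can tell MLlib only has SVM for classification, not SVR, so the
closest I have found is to fall back to a different regression model. Below is
a minimal spark.ml linear regression sketch (plainly not SVR, just a stand-in;
the toy data and parameter values are placeholders):

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("regression-sketch").getOrCreate()

// Toy data in the (label, features) layout spark.ml expects.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (2.0, Vectors.dense(2.0, 1.0)),
  (3.0, Vectors.dense(4.0, 0.9))
)).toDF("label", "features")

val lr = new LinearRegression().setMaxIter(50).setRegParam(0.1)
val model = lr.fit(training)

println(model.coefficients)
println(model.intercept)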
Thanks in advance
Roni


Re: SVM regression in Spark

2016-11-30 Thread roni
Hi Spark expert,
 Can anyone help with doing SVR (support vector machine regression) in
Spark?
Thanks
R

On Tue, Nov 29, 2016 at 6:50 PM, roni  wrote:

> Hi All,
>  I am trying to change my R code to spark. I am using  SVM regression in R
> . It seems like spark is providing SVM classification .
> How can I get the regression results.
> In my R code  I am using  call to SVM () function in library("e1071") (
> ftp://cran.r-project.org/pub/R/web/packages/e1071/vignettes/svmdoc.pdf)
> svrObj <- svm(x ,
> y ,
> scale = TRUE,
> type = "nu-regression",
> kernel = "linear",
> nu = .9)
>
> Once I get the svm object back , I get the -
>  from the values.
>
> How can I do this in spark?
> Thanks in advance
> Roni
>
>


support vector regression in spark

2016-12-01 Thread roni
Hi All,
 I want to know how I can do support vector regression in Spark.
 Thanks
R


Re: calculate diff of value and median in a group

2017-07-14 Thread roni
I was using the percentile_approx function on 100 GB of compressed data,
and it just hangs. Any pointers?
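In case it helps narrow things down, the DataFrame-side equivalent I have been
looking at is below (a sketch; the path and column name are placeholders). It
lets the accuracy be traded for speed explicitly:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("median-check").getOrCreate()
val df = spark.read.parquet("/path/to/data") // placeholder input

// relativeError = 0 means exact (and expensive) quantiles; a small non-zero
// value such as 0.01 is usually far cheaper on large inputs.
val Array(median) = df.stat.approxQuantile("value", Array(0.5), 0.01)
println(median)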

On Wed, Mar 22, 2017 at 6:09 PM, ayan guha  wrote:

> For median, use percentile_approx with 0.5 (50th percentile is the median)
>
> On Thu, Mar 23, 2017 at 11:01 AM, Yong Zhang  wrote:
>
>> He is looking for median, not mean/avg.
>>
>>
>> You have to implement the median logic by yourself, as there is no
>> directly implementation from Spark. You can use RDD API, if you are using
>> 1.6.x, or dataset if 2.x
>>
>>
>> The following example gives you an idea how to calculate the median using
>> dataset API. You can even change the code to add additional logic to
>> calculate the diff of every value with the median.
>>
>>
>> scala> spark.version
>> res31: String = 2.1.0
>>
>> scala> val ds = Seq((100,0.43),(100,0.33),(100,0.73),(101,0.29),(101,0.96),
>> (101,0.42),(101,0.01)).toDF("id","value").as[(Int, Double)]
>> ds: org.apache.spark.sql.Dataset[(Int, Double)] = [id: int, value: double]
>>
>> scala> ds.show
>> +---+-+
>> | id|value|
>> +---+-+
>> |100| 0.43|
>> |100| 0.33|
>> |100| 0.73|
>> |101| 0.29|
>> |101| 0.96|
>> |101| 0.42|
>> |101| 0.01|
>> +---+-+
>>
>> scala> def median(seq: Seq[Double]) = {
>>  |   val size = seq.size
>>  |   val sorted = seq.sorted
>>  |   size match {
>>  | case even if size % 2 == 0 => (sorted((size-2)/2) + 
>> sorted(size/2)) / 2
>>  | case odd => sorted((size-1)/2)
>>  |   }
>>  | }
>> median: (seq: Seq[Double])Double
>>
>> scala> ds.groupByKey(_._1).mapGroups((id, iter) => (id, 
>> median(iter.map(_._2).toSeq))).show
>> +---+-+
>> | _1|   _2|
>> +---+-+
>> |101|0.355|
>> |100| 0.43|
>> +---+-+
>>
>>
>> Yong
>>
>>
>>
>>
>> --
>> *From:* ayan guha 
>> *Sent:* Wednesday, March 22, 2017 7:23 PM
>> *To:* Craig Ching
>> *Cc:* Yong Zhang; user@spark.apache.org
>> *Subject:* Re: calculate diff of value and median in a group
>>
>> I would suggest use window function with partitioning.
>>
>> select group1,group2,name,value, avg(value) over (partition group1,group2
>> order by name) m
>> from t
>>
>> On Thu, Mar 23, 2017 at 9:58 AM, Craig Ching 
>> wrote:
>>
>>> Are the elements count big per group? If not, you can group them and use
>>> the code to calculate the median and diff.
>>>
>>>
>>> They're not big, no.  Any pointers on how I might do that?  The part I'm
>>> having trouble with is the grouping, I can't seem to see how to do the
>>> median per group.  For mean, we have the agg feature, but not for median
>>> (and I understand the reasons for that).
>>>
>>> Yong
>>>
>>> --
>>> *From:* Craig Ching 
>>> *Sent:* Wednesday, March 22, 2017 3:17 PM
>>> *To:* user@spark.apache.org
>>> *Subject:* calculate diff of value and median in a group
>>>
>>> Hi,
>>>
>>> When using pyspark, I'd like to be able to calculate the difference
>>> between grouped values and their median for the group.  Is this possible?
>>> Here is some code I hacked up that does what I want except that it
>>> calculates the grouped diff from mean.  Also, please feel free to comment
>>> on how I could make this better if you feel like being helpful :)
>>>
>>> from pyspark import SparkContext
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.types import (
>>> StringType,
>>> LongType,
>>> DoubleType,
>>> StructField,
>>> StructType
>>> )
>>> from pyspark.sql import functions as F
>>>
>>>
>>> sc = SparkContext(appName='myapp')
>>> spark = SparkSession(sc)
>>>
>>> file_name = 'data.csv'
>>>
>>> fields = [
>>> StructField(
>>> 'group2',
>>> LongType(),
>>> True),
>>> StructField(
>>> 'name',
>>> StringType(),
>>> True),
>>> StructField(
>>> 'value',
>>> DoubleType(),
>>> True),
>>> StructField(
>>> 'group1',
>>> LongType(),
>>> True)
>>> ]
>>> schema = StructType(fields)
>>>
>>> df = spark.read.csv(
>>> file_name, header=False, mode="DROPMALFORMED", schema=schema
>>> )
>>> df.show()
>>> means = df.select([
>>> 'group1',
>>> 'group2',
>>> 'name',
>>> 'value']).groupBy([
>>> 'group1',
>>> 'group2'
>>> ]).agg(
>>> F.mean('value').alias('mean_value')
>>> ).orderBy('group1', 'group2')
>>>
>>> cond = [df.group1 == means.group1, df.group2 == means.group2]
>>>
>>> means.show()
>>> df = df.select([
>>> 'group1',
>>> 'group2',
>>> 'name',
>>> 'value']).join(
>>> means,
>>> cond
>>> ).drop(
>>> df.group1
>>> ).drop(
>>> df.group2
>>> ).select('group1',
>>>  'group2',
>>>  'name',
>>>  'value',
>>>  'mean_value')
>>>
>>> final = df.withColumn(
>>> 'diff',
>>> F.abs(df.value - df.mean_value))
>>> final.show()
>>>
>>> sc.stop()
>>>
>>> And here is an example dataset I'm playing with:
>>>
>>> 100,name1,0

Re: Resource manager UI for Spark applications

2015-03-03 Thread roni
Hi Ted,
 I  used s3://support.elasticmapreduce/spark/install-spark to install spark
on my EMR cluster. It is 1.2.0.
 When I click on the link for history or logs it takes me to
http://ip-172-31-43-116.us-west-2.compute.internal:9035/node/containerlogs/container_1424105590052_0070_01_01/hadoop
 and I get -

The server at *ip-172-31-43-116.us-west-2.compute.internal* can't be found,
because the DNS lookup failed. DNS is the network service that translates a
website's name to its Internet address. This error is most often caused by
having no connection to the Internet or a misconfigured network. It can
also be caused by an unresponsive DNS server or a firewall preventing Google
Chrome from accessing the network.
I tried changing the address from the internal one to the external one, but it
still does not work.
Thanks
_roni


On Tue, Mar 3, 2015 at 9:05 AM, Ted Yu  wrote:

> bq. spark UI does not work for Yarn-cluster.
>
> Can you be a bit more specific on the error(s) you saw ?
>
> What Spark release are you using ?
>
> Cheers
>
> On Tue, Mar 3, 2015 at 8:53 AM, Rohini joshi 
> wrote:
>
>> Sorry , for half email - here it is again in full
>> Hi ,
>> I have 2 questions -
>>
>>  1. I was trying to use Resource Manager UI for my SPARK application
>> using yarn cluster mode as I observed that spark UI does not work for
>> Yarn-cluster.
>> IS that correct or am I missing some setup?
>>
>> 2. when I click on Application Monitoring or history , i get re-directed
>> to some linked with internal Ip address. Even if I replace that address
>> with the public IP , it still does not work.  What kind of setup changes
>> are needed for that?
>>
>> Thanks
>> -roni
>>
>> On Tue, Mar 3, 2015 at 8:45 AM, Rohini joshi 
>> wrote:
>>
>>> Hi ,
>>> I have 2 questions -
>>>
>>>  1. I was trying to use Resource Manager UI for my SPARK application
>>> using yarn cluster mode as I observed that spark UI does not work for
>>> Yarn-cluster.
>>> IS that correct or am I missing some setup?
>>>
>>>
>>>
>>>
>>>
>>
>


Re: Resource manager UI for Spark applications

2015-03-03 Thread roni
Ted,
If the application is running then the logs are not available. Plus what i
want to view is the details about the running app as in spark UI.
Do I have to open some ports or do some other setting changes?




On Tue, Mar 3, 2015 at 10:08 AM, Ted Yu  wrote:

> bq. changing the address with internal to the external one , but still
> does not work.
> Not sure what happened.
> For the time being, you can use yarn command line to pull container log
> (put in your appId and container Id):
> yarn logs -applicationId application_1386639398517_0007 -containerId
> container_1386639398517_0007_01_19
>
> Cheers
>
> On Tue, Mar 3, 2015 at 9:50 AM, roni  wrote:
>
>> Hi Ted,
>>  I  used s3://support.elasticmapreduce/spark/install-spark to install
>> spark on my EMR cluster. It is 1.2.0.
>>  When I click on the link for history or logs it takes me to
>>
>> http://ip-172-31-43-116.us-west-2.compute.internal:9035/node/containerlogs/container_1424105590052_0070_01_01/hadoop
>>  and I get -
>>
>> The server at *ip-172-31-43-116.us-west-2.compute.internal* can't be
>> found, because the DNS lookup failed. DNS is the network service that
>> translates a website's name to its Internet address. This error is most
>> often caused by having no connection to the Internet or a misconfigured
>> network. It can also be caused by an unresponsive DNS server or a firewall
>> preventing Google Chrome from accessing the network.
>> I tried  changing the address with internal to the external one , but
>> still does not work.
>> Thanks
>> _roni
>>
>>
>> On Tue, Mar 3, 2015 at 9:05 AM, Ted Yu  wrote:
>>
>>> bq. spark UI does not work for Yarn-cluster.
>>>
>>> Can you be a bit more specific on the error(s) you saw ?
>>>
>>> What Spark release are you using ?
>>>
>>> Cheers
>>>
>>> On Tue, Mar 3, 2015 at 8:53 AM, Rohini joshi 
>>> wrote:
>>>
>>>> Sorry , for half email - here it is again in full
>>>> Hi ,
>>>> I have 2 questions -
>>>>
>>>>  1. I was trying to use Resource Manager UI for my SPARK application
>>>> using yarn cluster mode as I observed that spark UI does not work for
>>>> Yarn-cluster.
>>>> IS that correct or am I missing some setup?
>>>>
>>>> 2. when I click on Application Monitoring or history , i get
>>>> re-directed to some linked with internal Ip address. Even if I replace that
>>>> address with the public IP , it still does not work.  What kind of setup
>>>> changes are needed for that?
>>>>
>>>> Thanks
>>>> -roni
>>>>
>>>> On Tue, Mar 3, 2015 at 8:45 AM, Rohini joshi 
>>>> wrote:
>>>>
>>>>> Hi ,
>>>>> I have 2 questions -
>>>>>
>>>>>  1. I was trying to use Resource Manager UI for my SPARK application
>>>>> using yarn cluster mode as I observed that spark UI does not work for
>>>>> Yarn-cluster.
>>>>> IS that correct or am I missing some setup?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Resource manager UI for Spark applications

2015-03-03 Thread roni
Ah, I think I know what you mean. My job was in the "accepted" state for
a long time because it was processing a huge file.
Now that it is in the running state, I can see it. I see it at port 9046,
though, instead of 4040. But I can see it.
Thanks
-roni

On Tue, Mar 3, 2015 at 1:19 PM, Zhan Zhang  wrote:

>  In Yarn (Cluster or client), you can access the spark ui when the app is
> running. After app is done, you can still access it, but need some extra
> setup for history server.
>
>  Thanks.
>
>  Zhan Zhang
>
>  On Mar 3, 2015, at 10:08 AM, Ted Yu  wrote:
>
>  bq. changing the address with internal to the external one , but still
> does not work.
> Not sure what happened.
> For the time being, you can use yarn command line to pull container log
> (put in your appId and container Id):
>  yarn logs -applicationId application_1386639398517_0007 -containerId
> container_1386639398517_0007_01_19
>
>  Cheers
>
> On Tue, Mar 3, 2015 at 9:50 AM, roni  wrote:
>
>>Hi Ted,
>>   I  used s3://support.elasticmapreduce/spark/install-spark to install
>> spark on my EMR cluster. It is 1.2.0.
>>   When I click on the link for history or logs it takes me to
>>
>> http://ip-172-31-43-116.us-west-2.compute.internal:9035/node/containerlogs/container_1424105590052_0070_01_01/hadoop
>>   and I get -
>>
>> The server at *ip-172-31-43-116.us
>> <http://ip-172-31-43-116.us>-west-2.compute.internal* can't be found,
>> because the DNS lookup failed. DNS is the network service that translates a
>> website's name to its Internet address. This error is most often caused by
>> having no connection to the Internet or a misconfigured network. It can
>> also be caused by an unresponsive DNS server or a firewall preventing Google
>> Chrome from accessing the network.
>>  I tried  changing the address with internal to the external one , but
>> still does not work.
>>  Thanks
>>  _roni
>>
>>
>> On Tue, Mar 3, 2015 at 9:05 AM, Ted Yu  wrote:
>>
>>> bq. spark UI does not work for Yarn-cluster.
>>>
>>>  Can you be a bit more specific on the error(s) you saw ?
>>>
>>>  What Spark release are you using ?
>>>
>>>  Cheers
>>>
>>> On Tue, Mar 3, 2015 at 8:53 AM, Rohini joshi 
>>> wrote:
>>>
>>>>   Sorry , for half email - here it is again in full
>>>>  Hi ,
>>>>  I have 2 questions -
>>>>
>>>>   1. I was trying to use Resource Manager UI for my SPARK application
>>>> using yarn cluster mode as I observed that spark UI does not work for
>>>> Yarn-cluster.
>>>>  IS that correct or am I missing some setup?
>>>>
>>>>  2. when I click on Application Monitoring or history , i get
>>>> re-directed to some linked with internal Ip address. Even if I replace that
>>>> address with the public IP , it still does not work.  What kind of setup
>>>> changes are needed for that?
>>>>
>>>>  Thanks
>>>>  -roni
>>>>
>>>> On Tue, Mar 3, 2015 at 8:45 AM, Rohini joshi 
>>>> wrote:
>>>>
>>>>>  Hi ,
>>>>>  I have 2 questions -
>>>>>
>>>>>   1. I was trying to use Resource Manager UI for my SPARK application
>>>>> using yarn cluster mode as I observed that spark UI does not work for
>>>>> Yarn-cluster.
>>>>>  IS that correct or am I missing some setup?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>


Re: issue Running Spark Job on Yarn Cluster

2015-03-04 Thread roni
Look at the logs:
yarn logs --applicationId <application id>
That should give the error.

On Wed, Mar 4, 2015 at 9:21 AM, sachin Singh 
wrote:

> Not yet,
> Please let. Me know if you found solution,
>
> Regards
> Sachin
> On 4 Mar 2015 21:45, "mael2210 [via Apache Spark User List]" <[hidden
> email] > wrote:
>
>> Hello,
>>
>> I am facing the exact same issue. Could you solve the problem ?
>>
>> Kind regards
>>
>> --
>>  If you reply to this email, your message will be added to the
>> discussion below:
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/issue-Running-Spark-Job-on-Yarn-Cluster-tp21697p21909.html
>>  To unsubscribe from issue Running Spark Job on Yarn Cluster, click here.
>> NAML
>> 
>>
>
> --
> View this message in context: Re: issue Running Spark Job on Yarn Cluster
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


spark-ec2 script problems

2015-03-05 Thread roni
Hi ,
 I used spark-ec2 script to create ec2 cluster.

 Now I am trying to copy data from S3 into HDFS.
I am running this:

root@ip-172-31-21-160 ephemeral-hdfs]$ bin/hadoop distcp
s3:///home/mydata/small.sam
hdfs://ec2-52-11-148-31.us-west-2.compute.amazonaws.com:9010/data1

and I get the following error -

2015-03-06 01:39:27,299 INFO  tools.DistCp (DistCp.java:run(109)) - Input
Options: DistCpOptions{atomicCommit=false, syncFolder=false,
deleteMissing=false, ignoreFailures=false, maxMaps=20,
sslConfigurationFile='null', copyStrategy='uniformsize',
sourceFileListing=null, sourcePaths=[s3:///home/mydata/small.sam],
targetPath=hdfs://
ec2-52-11-148-31.us-west-2.compute.amazonaws.com:9010/data1}
2015-03-06 01:39:27,585 INFO  mapreduce.Cluster
(Cluster.java:initialize(114)) - Failed to use
org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid
"mapreduce.jobtracker.address" configuration value for LocalJobRunner : "
ec2-52-11-148-31.us-west-2.compute.amazonaws.com:9001"
2015-03-06 01:39:27,585 ERROR tools.DistCp (DistCp.java:run(126)) -
Exception encountered
java.io.IOException: Cannot initialize Cluster. Please check your
configuration for mapreduce.framework.name and the correspond server
addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:83)
at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:76)
at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

I tried running start-all.sh, start-dfs.sh, and start-yarn.sh.

What should I do?
Thanks
-roni


Re: distcp on ec2 standalone spark cluster

2015-03-07 Thread roni
Did you get this to work?
I got past the cluster-not-started issue.
Now I have a problem where distcp with an s3:// URI says the folder path is
incorrect, and s3n:// just hangs.
I have been stuck for 2 days :(
Thanks
-R



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distcp-on-ec2-standalone-spark-cluster-tp13652p21957.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



distcp problems on ec2 standalone spark cluster

2015-03-09 Thread roni
I got past the cluster-not-started problem by setting mapreduce.framework.name
to yarn.
But when I try to distcp, if I use a URI with s3://<path to my bucket> I get
an invalid path error even though the bucket exists.
If I use s3n:// it just hangs.
Did anyone else face anything like that?

I also noticed that this script installs the Cloudera Hadoop image. Does that
matter?
Thanks
-R


Re: Setting up Spark with YARN on EC2 cluster

2015-03-10 Thread roni
Hi Harika,
Did you find a solution for this?
I want to use YARN, but the spark-ec2 script does not support it.
Thanks 
-Roni




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Setting-up-Spark-with-YARN-on-EC2-cluster-tp21818p21991.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



saving or visualizing PCA

2015-03-18 Thread roni
Hi ,
 I am generating a PCA using Spark,
but I don't know how to save it to disk or visualize it.
Can someone give me some pointers, please?
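What I have so far is roughly this (a minimal sketch with toy rows; the output
paths are placeholders). I can get the components, I am just not sure this is
the right way to persist and plot them:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("pca-save"))

// Toy rows standing in for the real data.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(2.0, 1.0, 4.0),
  Vectors.dense(3.0, 0.5, 5.0)))

val mat = new RowMatrix(rows)
val pc = mat.computePrincipalComponents(2) // local Matrix, columns = components

// The loadings are small, so they can be written out directly (column-major
// values here); the projected scores are an RDD and can go to HDFS/S3 as CSV
// and then be plotted in R or elsewhere.
sc.parallelize(Seq(pc.toArray.mkString(","))).saveAsTextFile("hdfs:///tmp/pca_loadings")
mat.multiply(pc).rows.map(_.toArray.mkString(",")).saveAsTextFile("hdfs:///tmp/pca_scores")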
Thanks
-Roni


FetchFailedException: Adjusted frame length exceeds 2147483647: 12716268407 - discarded

2015-03-19 Thread roni
I get 2 types of error -
-org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 0 and
FetchFailedException: Adjusted frame length exceeds 2147483647: 12716268407
- discarded

Spark keeps retrying the stages and keeps getting this error.

The file on which I am computing the sliding-window strings is 500 MB, and I
am using length = 150.
It works fine until the length is 100.

This is my code -
  val hgfasta = sc.textFile(args(0)) // read the fasta file
  val kCount = hgfasta.flatMap(r => r.sliding(args(2).toInt))
  val kmerCount = kCount.map(x => (x, 1)).reduceByKey(_ + _)
    .map { case (x, y) => (y, x) }.sortByKey(false)
    .map { case (i, j) => (j, i) }

  val filtered = kmerCount.filter(kv => kv._2 < 5)
  filtered.map(kv => kv._1 + ", " + kv._2.toLong).saveAsTextFile(args(1))

  }
It gets stuck at the flatMap and saveAsTextFile stages and throws
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 0, and:

org.apache.spark.shuffle.FetchFailedException: Adjusted frame length
exceeds 2147483647: 12716268407 - discarded
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
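For what it is worth, what I plan to try next is dropping the global sort (the
final filter does not need it) and forcing more reduce partitions, so that no
single shuffle block gets anywhere near the 2 GB frame limit. A sketch against
the same inputs as above; the partition count of 400 is only a guess to tune:

val hgfasta = sc.textFile(args(0), 400)                          // more input partitions
val kCount = hgfasta.flatMap(r => r.sliding(args(2).toInt))
val kmerCount = kCount.map(x => (x, 1)).reduceByKey(_ + _, 400)  // more reduce partitions
val filtered = kmerCount.filter(kv => kv._2 < 5)
filtered.map(kv => kv._1 + ", " + kv._2.toLong).saveAsTextFile(args(1))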


difference in PCA of MLlib vs H2O in R

2015-03-23 Thread roni
I am trying to compute PCA using computePrincipalComponents.
I also computed PCA using H2O in R and with R's prcomp. The answers I get from
H2O and R's prcomp (non-H2O) are the same when I set standardized=FALSE for
H2O and center=FALSE for R's prcomp.

How do I make sure that the settings for MLlib PCA are the same as the ones I
am using for H2O or prcomp?

Thanks
Roni


Re: difference in PCA of MLlib vs H2O in R

2015-03-24 Thread roni
Reza,
That SVD.V matches the H2O and R prcomp (non-centered) results.
Thanks
-R

On Tue, Mar 24, 2015 at 11:38 AM, Sean Owen  wrote:

> (Oh sorry, I've only been thinking of TallSkinnySVD)
>
> On Tue, Mar 24, 2015 at 6:36 PM, Reza Zadeh  wrote:
> > If you want to do a nonstandard (or uncentered) PCA, you can call
> > "computeSVD" on RowMatrix, and look at the resulting 'V' Matrix.
> >
> > That should match the output of the other two systems.
> >
> > Reza
> >
> > On Tue, Mar 24, 2015 at 3:53 AM, Sean Owen  wrote:
> >>
> >> Those implementations are computing an SVD of the input matrix
> >> directly, and while you generally need the columns to have mean 0, you
> >> can turn that off with the options you cite.
> >>
> >> I don't think this is possible in the MLlib implementation, since it
> >> is computing the principal components by computing eigenvectors of the
> >> covariance matrix. The means inherently don't matter either way in
> >> this computation.
> >>
> >> On Tue, Mar 24, 2015 at 6:13 AM, roni  wrote:
> >> > I am trying to compute PCA  using  computePrincipalComponents.
> >> > I  also computed PCA using h2o in R and R's prcomp. The answers I get
> >> > from
> >> > H2o and R's prComp (non h2o) is same when I set the options for H2o as
> >> > standardized=FALSE and for r's prcomp as center = false.
> >> >
> >> > How do I make sure that the settings for MLib PCA is same as I am
> using
> >> > for
> >> > H2o or prcomp.
> >> >
> >> > Thanks
> >> > Roni
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>


upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
I have an EC2 cluster created with Spark version 1.2.1, and I have an SBT
project.
Now I want to upgrade to Spark 1.3 and use the new features.
The issues are below.
Sorry for the long post.
Appreciate your help.
Thanks
-Roni

Question - Do I have to create a new cluster using spark 1.3?

Here is what I did -

In my SBT file I  changed to -
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"

But then I started getting compilation error. along with
Here are some of the libraries that were evicted:
[warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
[warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) -> 2.6.0
[warn] Run 'evicted' to see detailed eviction warnings

 constructor cannot be instantiated to expected type;
[error]  found   : (T1, T2)
[error]  required: org.apache.spark.sql.catalyst.expressions.Row
[error] val ty = joinRDD.map{case(word,
(file1Counts, file2Counts)) => KmerIntesect(word, file1Counts,"xyz")}
[error]  ^

Here is my SBT and code --
SBT -

version := "1.0"

scalaVersion := "2.10.4"

resolvers += "Sonatype OSS Snapshots" at "
https://oss.sonatype.org/content/repositories/snapshots";;
resolvers += "Maven Repo1" at "https://repo1.maven.org/maven2";;
resolvers += "Maven Repo" at "
https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/";;

/* Dependencies - %% appends Scala version to artifactId */
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
libraryDependencies += "org.bdgenomics.adam" % "adam-core" % "0.16.0"
libraryDependencies += "ai.h2o" % "sparkling-water-core_2.10" % "0.2.10"


CODE --
import org.apache.spark.{SparkConf, SparkContext}
case class KmerIntesect(kmer: String, kCount: Int, fileName: String)

object preDefKmerIntersection {
  def main(args: Array[String]) {

 val sparkConf = new SparkConf().setAppName("preDefKmer-intersect")
 val sc = new SparkContext(sparkConf)
import sqlContext.createSchemaRDD
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val bedFile = sc.textFile("s3n://a/b/c",40)
 val hgfasta = sc.textFile("hdfs://a/b/c",40)
 val hgPair = hgfasta.map(_.split (",")).map(a=> (a(0),
a(1).trim().toInt))
 val filtered = hgPair.filter(kv => kv._2 == 1)
 val bedPair = bedFile.map(_.split (",")).map(a=> (a(0),
a(1).trim().toInt))
 val joinRDD = bedPair.join(filtered)
val ty = joinRDD.map{case(word, (file1Counts, file2Counts))
=> KmerIntesect(word, file1Counts,"xyz")}
ty.registerTempTable("KmerIntesect")
ty.saveAsParquetFile("hdfs://x/y/z/kmerIntersect.parquet")
  }
}


Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Even if H2O and ADAM depend on 1.2.1, it should be backward compatible, right?
So using 1.3 should not break them.
And the code is not using any classes from those libs.
I tried sbt clean compile; same error.
Thanks
_R

On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath 
wrote:

> What version of Spark do the other dependencies rely on (Adam and H2O?) -
> that could be it
>
> Or try sbt clean compile
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox>
>
>
> On Wed, Mar 25, 2015 at 5:58 PM, roni  wrote:
>
>> I have a EC2 cluster created using spark version 1.2.1.
>> And I have a SBT project .
>> Now I want to upgrade to spark 1.3 and use the new features.
>> Below are issues .
>> Sorry for the long post.
>> Appreciate your help.
>> Thanks
>> -Roni
>>
>> Question - Do I have to create a new cluster using spark 1.3?
>>
>> Here is what I did -
>>
>> In my SBT file I  changed to -
>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
>>
>> But then I started getting compilation error. along with
>> Here are some of the libraries that were evicted:
>> [warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
>> [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) -> 2.6.0
>> [warn] Run 'evicted' to see detailed eviction warnings
>>
>>  constructor cannot be instantiated to expected type;
>> [error]  found   : (T1, T2)
>> [error]  required: org.apache.spark.sql.catalyst.expressions.Row
>> [error] val ty = joinRDD.map{case(word,
>> (file1Counts, file2Counts)) => KmerIntesect(word, file1Counts,"xyz")}
>> [error]  ^
>>
>> Here is my SBT and code --
>> SBT -
>>
>> version := "1.0"
>>
>> scalaVersion := "2.10.4"
>>
>> resolvers += "Sonatype OSS Snapshots" at "
>> https://oss.sonatype.org/content/repositories/snapshots";;
>> resolvers += "Maven Repo1" at "https://repo1.maven.org/maven2";;
>> resolvers += "Maven Repo" at "
>> https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/";;
>>
>> /* Dependencies - %% appends Scala version to artifactId */
>> libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"
>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
>> libraryDependencies += "org.bdgenomics.adam" % "adam-core" % "0.16.0"
>> libraryDependencies += "ai.h2o" % "sparkling-water-core_2.10" % "0.2.10"
>>
>>
>> CODE --
>> import org.apache.spark.{SparkConf, SparkContext}
>> case class KmerIntesect(kmer: String, kCount: Int, fileName: String)
>>
>> object preDefKmerIntersection {
>>   def main(args: Array[String]) {
>>
>>  val sparkConf = new SparkConf().setAppName("preDefKmer-intersect")
>>  val sc = new SparkContext(sparkConf)
>> import sqlContext.createSchemaRDD
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>> val bedFile = sc.textFile("s3n://a/b/c",40)
>>  val hgfasta = sc.textFile("hdfs://a/b/c",40)
>>  val hgPair = hgfasta.map(_.split (",")).map(a=> (a(0),
>> a(1).trim().toInt))
>>  val filtered = hgPair.filter(kv => kv._2 == 1)
>>  val bedPair = bedFile.map(_.split (",")).map(a=> (a(0),
>> a(1).trim().toInt))
>>  val joinRDD = bedPair.join(filtered)
>> val ty = joinRDD.map{case(word, (file1Counts,
>> file2Counts)) => KmerIntesect(word, file1Counts,"xyz")}
>> ty.registerTempTable("KmerIntesect")
>> ty.saveAsParquetFile("hdfs://x/y/z/kmerIntersect.parquet")
>>   }
>> }
>>
>>
>


Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Thanks Dean and Nick.
So, I removed ADAM and H2O from my SBT file as I was not using them.
I got the code to compile, only for it to fail at runtime with -
SparkContext: Created broadcast 1 from textFile at kmerIntersetion.scala:21
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/rdd/RDD$
at preDefKmerIntersection$.main(kmerIntersetion.scala:26)

This line is where I do a "JOIN" operation.
val hgPair = hgfasta.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
val filtered = hgPair.filter(kv => kv._2 == 1)
val bedPair = bedFile.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
val joinRDD = bedPair.join(filtered)   // <-- the JOIN
Any idea what's going on?
I have data on the EC2 cluster, so I am avoiding creating a new cluster and
instead just upgrading and changing the code to use 1.3 and Spark SQL.
Roni



On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler  wrote:

> For the Spark SQL parts, 1.3 breaks backwards compatibility, because
> before 1.3, Spark SQL was considered experimental where API changes were
> allowed.
>
> So, H2O and ADA compatible with 1.2.X might not work with 1.3.
>
> dean
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Wed, Mar 25, 2015 at 9:39 AM, roni  wrote:
>
>> Even if H2o and ADA are dependent on 1.2.1 , it should be backword
>> compatible, right?
>> So using 1.3 should not break them.
>> And the code is not using the classes from those libs.
>> I tried sbt clean compile .. same errror
>> Thanks
>> _R
>>
>> On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath > > wrote:
>>
>>> What version of Spark do the other dependencies rely on (Adam and H2O?)
>>> - that could be it
>>>
>>> Or try sbt clean compile
>>>
>>> —
>>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>>
>>>
>>> On Wed, Mar 25, 2015 at 5:58 PM, roni  wrote:
>>>
>>>> I have a EC2 cluster created using spark version 1.2.1.
>>>> And I have a SBT project .
>>>> Now I want to upgrade to spark 1.3 and use the new features.
>>>> Below are issues .
>>>> Sorry for the long post.
>>>> Appreciate your help.
>>>> Thanks
>>>> -Roni
>>>>
>>>> Question - Do I have to create a new cluster using spark 1.3?
>>>>
>>>> Here is what I did -
>>>>
>>>> In my SBT file I  changed to -
>>>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
>>>>
>>>> But then I started getting compilation error. along with
>>>> Here are some of the libraries that were evicted:
>>>> [warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
>>>> [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) ->
>>>> 2.6.0
>>>> [warn] Run 'evicted' to see detailed eviction warnings
>>>>
>>>>  constructor cannot be instantiated to expected type;
>>>> [error]  found   : (T1, T2)
>>>> [error]  required: org.apache.spark.sql.catalyst.expressions.Row
>>>> [error] val ty = joinRDD.map{case(word,
>>>> (file1Counts, file2Counts)) => KmerIntesect(word, file1Counts,"xyz")}
>>>> [error]  ^
>>>>
>>>> Here is my SBT and code --
>>>> SBT -
>>>>
>>>> version := "1.0"
>>>>
>>>> scalaVersion := "2.10.4"
>>>>
>>>> resolvers += "Sonatype OSS Snapshots" at "
>>>> https://oss.sonatype.org/content/repositories/snapshots";;
>>>> resolvers += "Maven Repo1" at "https://repo1.maven.org/maven2";;
>>>> resolvers += "Maven Repo" at "
>>>> https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/";;
>>>>
>>>> /* Dependencies - %% appends Scala version to artifactId */
>>>> libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"
>>>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
>>>> libraryDependencies += "org.bdgenomics.adam&q

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
My cluster is still on Spark 1.2, and in SBT I am using 1.3.
So it is probably compiling against 1.3 but running against 1.2?
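
A quick sanity check, as a sketch (the /root/spark path is only the usual spark-ec2 location and may differ on other setups):

# on the cluster master; the startup banner also prints the version
/root/spark/bin/spark-shell
scala> sc.version            // e.g. res0: String = 1.2.1

# in the project, what the code is compiled against
grep spark-core build.sbt    # e.g. "org.apache.spark" %% "spark-core" % "1.3.0"

If the two disagree, either the sbt dependency has to come down to the cluster's version or the cluster has to come up to the sbt one; mixing them leads to exactly this kind of NoClassDefFoundError.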

On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler 
wrote:

> Weird. Are you running using SBT console? It should have the spark-core
> jar on the classpath. Similarly, spark-shell or spark-submit should work,
> but be sure you're using the same version of Spark when running as when
> compiling. Also, you might need to add spark-sql to your SBT dependencies,
> but that shouldn't be this issue.
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Wed, Mar 25, 2015 at 12:09 PM, roni  wrote:
>
>> Thanks Dean and Nick.
>> So, I removed the ADAM and H2o from my SBT as I was not using them.
>> I got the code to compile - only to have it fail at runtime with -
>> SparkContext: Created broadcast 1 from textFile at
>> kmerIntersetion.scala:21
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/spark/rdd/RDD$
>> at preDefKmerIntersection$.main(kmerIntersetion.scala:26)
>>
>> This line is where I do a "JOIN" operation.
>> val hgPair = hgfasta.map(_.split (",")).map(a=> (a(0), a(1).trim().toInt))
>>  val filtered = hgPair.filter(kv => kv._2 == 1)
>>  val bedPair = bedFile.map(_.split (",")).map(a=> (a(0),
>> a(1).trim().toInt))
>> val joinRDD = bedPair.join(filtered)
>> Any idea what's going on?
>> I have the data on EC2, so I am avoiding creating a new cluster and instead
>> just upgrading and changing the code to use 1.3 and Spark SQL.
>> Thanks
>> Roni
>>
>>
>>
>> On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler 
>> wrote:
>>
>>> For the Spark SQL parts, 1.3 breaks backwards compatibility, because
>>> before 1.3, Spark SQL was considered experimental where API changes were
>>> allowed.
>>>
>>> So, H2O and ADAM builds compatible with 1.2.X might not work with 1.3.
>>>
>>> dean
>>>
>>> Dean Wampler, Ph.D.
>>> Author: Programming Scala, 2nd Edition
>>> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>>> Typesafe <http://typesafe.com>
>>> @deanwampler <http://twitter.com/deanwampler>
>>> http://polyglotprogramming.com
>>>
>>> On Wed, Mar 25, 2015 at 9:39 AM, roni  wrote:
>>>
>>>> Even if H2O and ADAM are dependent on 1.2.1, it should be backward
>>>> compatible, right?
>>>> So using 1.3 should not break them.
>>>> And the code is not using the classes from those libs.
>>>> I tried sbt clean compile .. same error
>>>> Thanks
>>>> _R
>>>>
>>>> On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> What version of Spark do the other dependencies rely on (Adam and
>>>>> H2O?) - that could be it
>>>>>
>>>>> Or try sbt clean compile
>>>>>
>>>>> —
>>>>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>>>>
>>>>>
>>>>> On Wed, Mar 25, 2015 at 5:58 PM, roni  wrote:
>>>>>
>>>>>> I have a EC2 cluster created using spark version 1.2.1.
>>>>>> And I have a SBT project .
>>>>>> Now I want to upgrade to spark 1.3 and use the new features.
>>>>>> Below are issues .
>>>>>> Sorry for the long post.
>>>>>> Appreciate your help.
>>>>>> Thanks
>>>>>> -Roni
>>>>>>
>>>>>> Question - Do I have to create a new cluster using spark 1.3?
>>>>>>
>>>>>> Here is what I did -
>>>>>>
>>>>>> In my SBT file I  changed to -
>>>>>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
>>>>>>
>>>>>> But then I started getting compilation error. along with
>>>>>> Here are some of the libraries that were evicted:
>>>>>> [warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
>>>>>> [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Is there any way that I can install the new version and remove the previous one?
I installed Spark 1.3 on my EC2 master and set the Spark home to the new
one.
But when I start the spark-shell I get -
 java.lang.UnsatisfiedLinkError:
org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V
at
org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native
Method)

Is there no way to upgrade without creating a new cluster?
Thanks
Roni
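
That anchorNative error usually suggests the new Spark build and the Hadoop native libraries already on the cluster come from different Hadoop versions. A rough sketch of swapping the build on the master, assuming the cluster's Hadoop is 2.4 (the exact package name depends on the Hadoop version the cluster was launched with, and the slaves would need the same build too):

# on the master; paths and versions are illustrative
wget http://archive.apache.org/dist/spark/spark-1.3.0/spark-1.3.0-bin-hadoop2.4.tgz
tar -xzf spark-1.3.0-bin-hadoop2.4.tgz -C /root
export SPARK_HOME=/root/spark-1.3.0-bin-hadoop2.4
$SPARK_HOME/bin/spark-shell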



On Wed, Mar 25, 2015 at 1:18 PM, Dean Wampler  wrote:

> Yes, that's the problem. The RDD class exists in both binary jar files,
> but the signatures probably don't match. The bottom line, as always for
> tools like this, is that you can't mix versions.
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Wed, Mar 25, 2015 at 3:13 PM, roni  wrote:
>
>> My cluster is still on Spark 1.2, and in SBT I am using 1.3.
>> So it is probably compiling against 1.3 but running against 1.2?
>>
>> On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler 
>> wrote:
>>
>>> Weird. Are you running using SBT console? It should have the spark-core
>>> jar on the classpath. Similarly, spark-shell or spark-submit should work,
>>> but be sure you're using the same version of Spark when running as when
>>> compiling. Also, you might need to add spark-sql to your SBT dependencies,
>>> but that shouldn't be this issue.
>>>
>>> Dean Wampler, Ph.D.
>>> Author: Programming Scala, 2nd Edition
>>> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>>> Typesafe <http://typesafe.com>
>>> @deanwampler <http://twitter.com/deanwampler>
>>> http://polyglotprogramming.com
>>>
>>> On Wed, Mar 25, 2015 at 12:09 PM, roni  wrote:
>>>
>>>> Thanks Dean and Nick.
>>>> So, I removed the ADAM and H2o from my SBT as I was not using them.
>>>> I got the code to compile - only to have it fail at runtime with -
>>>> SparkContext: Created broadcast 1 from textFile at
>>>> kmerIntersetion.scala:21
>>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>>> org/apache/spark/rdd/RDD$
>>>> at preDefKmerIntersection$.main(kmerIntersetion.scala:26)
>>>>
>>>> This line is where I do a "JOIN" operation.
>>>> val hgPair = hgfasta.map(_.split (",")).map(a=> (a(0),
>>>> a(1).trim().toInt))
>>>>  val filtered = hgPair.filter(kv => kv._2 == 1)
>>>>  val bedPair = bedFile.map(_.split (",")).map(a=>
>>>> (a(0), a(1).trim().toInt))
>>>> val joinRDD = bedPair.join(filtered)
>>>> Any idea what's going on?
>>>> I have the data on EC2, so I am avoiding creating a new cluster and instead
>>>> just upgrading and changing the code to use 1.3 and Spark SQL.
>>>> Thanks
>>>> Roni
>>>>
>>>>
>>>>
>>>> On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler 
>>>> wrote:
>>>>
>>>>> For the Spark SQL parts, 1.3 breaks backwards compatibility, because
>>>>> before 1.3, Spark SQL was considered experimental where API changes were
>>>>> allowed.
>>>>>
>>>>> So, H2O and ADAM builds compatible with 1.2.X might not work with 1.3.
>>>>>
>>>>> dean
>>>>>
>>>>> Dean Wampler, Ph.D.
>>>>> Author: Programming Scala, 2nd Edition
>>>>> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>>>>> Typesafe <http://typesafe.com>
>>>>> @deanwampler <http://twitter.com/deanwampler>
>>>>> http://polyglotprogramming.com
>>>>>
>>>>> On Wed, Mar 25, 2015 at 9:39 AM, roni  wrote:
>>>>>
>>>>>> Even if H2O and ADAM are dependent on 1.2.1, it should be backward
>>>>>> compatible, right?
>>>>>> So using 1.3 should not break them.
>>>>>> And the code is not using the classes from those libs.
>>>>>> I tried sbt clean compile .. same error
>>>>>> Thanks
>>>>>> _R
>>>>>>
>>>>>> On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath <
>>>&

Re: Cannot run spark-shell "command not found".

2015-03-30 Thread roni
I think you must have downloaded the Spark source code tgz file.
It is a little confusing. You also have to select the Hadoop version, and
the actual tgz file name will have both the Spark version and the Hadoop version in it.
-R
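
For example, a pre-built package needs no sbt assembly step at all; something along these lines should work (the exact file name depends on the Spark and Hadoop versions picked on the download page):

wget http://archive.apache.org/dist/spark/spark-1.3.0/spark-1.3.0-bin-hadoop2.4.tgz
tar -xzf spark-1.3.0-bin-hadoop2.4.tgz
cd spark-1.3.0-bin-hadoop2.4
./bin/spark-shell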

On Mon, Mar 30, 2015 at 10:34 AM, vance46  wrote:

> Hi all,
>
> I'm a newbie trying to set up Spark for my research project on a RedHat system.
> I've downloaded spark-1.3.0.tgz and untarred it, and installed python, java
> and scala. I've set JAVA_HOME and SCALA_HOME and then tried to use "sudo
> sbt/sbt assembly" according to
> https://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_CentOS6
> .
> It popped up with "sbt command not found". I then tried to directly start
> spark-shell in "./bin" using "sudo ./bin/spark-shell" and still got "command
> not found". I appreciate your help in advance.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-run-spark-shell-command-not-found-tp22299.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


joining multiple parquet files

2015-03-31 Thread roni
Hi,
I have 4 parquet files and I want to find the data which is common in all of
them, e.g.

SELECT TableA.*, TableB.*, TableC.*, TableD.* FROM (TableB INNER JOIN TableA
ON TableB.aID= TableA.aID)
INNER JOIN TableC ON(TableB.cID= Tablec.cID)
INNER JOIN TableA ta ON(ta.dID= TableD.dID)
WHERE (DATE(TableC.date)=date(now()))


I can do a 2-file join like - val joinedVal =
g1.join(g2, g1.col("kmer") === g2.col("kmer"))

But I am trying to find common kmer strings  from 4 files.

Thanks

Roni
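
A minimal sketch of one way to chain that two-file join across all four files with the Spark 1.3 DataFrame API (the paths and the "kmer" column name are illustrative; registering temp tables and writing the SQL directly would work just as well):

val g1 = sqlContext.parquetFile("hdfs://.../file1.parquet")
val g2 = sqlContext.parquetFile("hdfs://.../file2.parquet")
val g3 = sqlContext.parquetFile("hdfs://.../file3.parquet")
val g4 = sqlContext.parquetFile("hdfs://.../file4.parquet")

// keep joining on the same key and select g1's copy of the column at the end,
// so only kmers present in all four files survive
val common = g1.join(g2, g1.col("kmer") === g2.col("kmer"))
               .join(g3, g1.col("kmer") === g3.col("kmer"))
               .join(g4, g1.col("kmer") === g4.col("kmer"))
               .select(g1.col("kmer"))

If only the kmer strings themselves are needed, selecting just the kmer column from each DataFrame and chaining intersect would give the same answer.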


spark.sql.Row manipulation

2015-03-31 Thread roni
I have 2 parquet files with the format e.g. name, age, town.
I read them and then join them to get all the names which are in both
towns.
The resultant dataset is

res4: Array[org.apache.spark.sql.Row] = Array([name1, age1,
town1,name2,age2,town2])

Name1 and name2 are the same, since I am joining on name.
Now, I want to get it down to the format (name, age1, age2).

But I can't seem to manipulate the spark.sql.Row.

Trying something like map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
does not work.

Can you suggest a way ?

Thanks
-R