Re: How to merge a RDD of RDDs into one uber RDD

2015-01-06 Thread k.tham
An RDD cannot contain elements of type RDD (i.e. you can't nest RDDs within RDDs; in fact, I don't think it would make any sense). I suggest that, rather than having an RDD of file names, you collect those file name strings back onto the driver as a Scala array of file names, and then from there, make an array
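
The preview above is cut off, so the remaining steps are a guess. A minimal sketch of one way the suggestion might be completed, assuming the files are plain text and that one RDD per file is then merged with SparkContext.union. The names UnionOfFiles, loadAll and fileNameRDD are hypothetical and do not come from the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UnionOfFiles {
  // Builds one RDD per file name on the driver, then unions them into a single RDD.
  // Assumption: each file is plain text, so sc.textFile is the right loader.
  def loadAll(sc: SparkContext, fileNames: Array[String]): RDD[String] = {
    val perFile: Array[RDD[String]] = fileNames.map(name => sc.textFile(name))
    sc.union(perFile.toSeq)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("union-of-files").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical starting point: an RDD whose elements are file names.
      val fileNameRDD: RDD[String] = sc.parallelize(args.toSeq)
      // Bring the names back to the driver; RDDs can only be created there.
      val fileNames: Array[String] = fileNameRDD.collect()
      val merged = loadAll(sc, fileNames)
      println(s"total lines: ${merged.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

For plain text input, sc.textFile also accepts a comma-separated list of paths, which may avoid the explicit union.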

Re: Spark LIBLINEAR

2014-10-24 Thread k.tham
Oh, I've only seen SVMWithSGD, hadn't realized LBFGS was implemented. I'll try it out when I have time. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LIBLINEAR-tp5546p17240.html Sent from the Apache Spark User List mailing list archive at Nab

Re: Spark LIBLINEAR

2014-10-24 Thread k.tham
Just wondering, is there any update on this? Is there a plan to integrate CJ's work with MLlib? I'm asking since the SVM implementation in MLlib did not give us good results, and we have had to resort to training our SVM classifier serially on the driver node with LIBLINEAR. Also, it looks like CJ Lin is coming to

Running Spark on Yarn vs Mesos

2014-07-10 Thread k.tham
What do people usually do for this? It looks like YARN might be the simplest, since the Cloudera distribution already installs it for you when you install Hadoop. Are there any advantages to using Mesos instead? Thanks.

Recommended pipeline automation tool? Oozie?

2014-07-10 Thread k.tham
I'm just wondering what the general recommendation is for data pipeline automation. Say I want to run Spark Job A, then B, then invoke script C, then do D, and if D fails, do E, and if Job A fails, send email F, etc. It looks like Oozie might be the best choice. But I'd like some advice/suggest
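
To make the dependency logic in the question concrete, here is a rough sketch in plain Scala of the control flow described (A, then B, then C, then D; E if D fails; F if A fails). It is not an Oozie workflow and not a recommendation. PipelineSketch and its step function are hypothetical, and stopping the pipeline when B or C fails is an assumption, since the message does not say what should happen then.

```scala
import scala.util.{Failure, Success, Try}

object PipelineSketch {
  // Runs the steps in the order described and returns the names of the
  // steps that were attempted, in order.
  def run(step: String => Try[Unit]): List[String] = {
    val attempted = scala.collection.mutable.ListBuffer.empty[String]

    // Records the step name, runs it, and reports whether it succeeded.
    def attempt(name: String): Boolean = {
      attempted += name
      step(name).isSuccess
    }

    if (attempt("A")) {
      // Assumption: a failure in B or C simply stops the pipeline.
      if (attempt("B") && attempt("C")) {
        if (!attempt("D")) attempt("E") // D failed: run fallback E
      }
    } else {
      attempt("F") // Job A failed: send email F
    }
    attempted.toList
  }

  def main(args: Array[String]): Unit = {
    // Simulate a run in which only step D fails.
    val steps = run(n => if (n == "D") Failure(new RuntimeException("D failed")) else Success(()))
    println(steps.mkString(" -> "))
  }
}
```

A workflow tool would express the same branches declaratively and add scheduling, retries and notifications on top.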

Re: SchemaRDD's saveAsParquetFile() throws java.lang.IncompatibleClassChangeError

2014-06-03 Thread k.tham
I see, thanks

Re: SchemaRDD's saveAsParquetFile() throws java.lang.IncompatibleClassChangeError

2014-06-03 Thread k.tham
I've read through that thread, and it seems that he needed to add a particular hadoop-client dependency. However, I don't think I should be required to do that, as I'm not reading from HDFS. I'm just running a straight-up minimal example, in local mode, out of the box. Here's an example m

Re: SchemaRDD's saveAsParquetFile() throws java.lang.IncompatibleClassChangeError

2014-06-03 Thread k.tham
Oh, I missed that thread. Thanks

SchemaRDD's saveAsParquetFile() throws java.lang.IncompatibleClassChangeError

2014-06-03 Thread k.tham
I'm trying to save an RDD as a Parquet file through the saveAsParquetFile() API, with code that looks something like:

    val sc = ...
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._
    val someRDD: RDD[SomeCaseClass] = ...
    someRDD.saveAsParquetFile("someRDD.parquet")

How