Re: How to merge a RDD of RDDs into one uber RDD

2015-01-06 Thread k.tham
An RDD cannot contain elements of type RDD, i.e. you can't nest RDDs within
RDDs. In fact, I don't think nesting would even make sense, since RDD
operations have to be issued from the driver.

I suggest that, rather than keeping an RDD of file names, you collect those
file name strings back to the driver as a Scala array, build an array of
RDDs from it (one per file), and then fold over that array, merging the RDDs
with union.
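
Concretely, the pattern looks something like this (a sketch assuming a live
SparkContext `sc` and an existing `fileNameRDD: RDD[String]`; not runnable
standalone):

```scala
// Pull the file names back to the driver; they are just strings.
val fileNames: Array[String] = fileNameRDD.collect()

// Build one RDD per file on the driver side...
val rdds: Array[org.apache.spark.rdd.RDD[String]] =
  fileNames.map(name => sc.textFile(name))

// ...and fold them into a single RDD with union.
val merged = rdds.reduce(_ union _)
```

Note that SparkContext also has a `union` method that takes a sequence of
RDDs (`sc.union(rdds)`), which avoids building a deeply nested chain of
pairwise unions.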



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-merge-a-RDD-of-RDDs-into-one-uber-RDD-tp20986p21007.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark LIBLINEAR

2014-10-24 Thread k.tham
Just wondering, any update on this? Is there a plan to integrate CJ's work
with MLlib? I'm asking because the SVM implementation in MLlib did not give
us good results, and we had to resort to training our SVM classifier
serially on the driver node with LIBLINEAR.

Also, it looks like CJ Lin is coming to the Bay Area in the coming weeks
(http://www.meetup.com/sfmachinelearning/events/208078582/); that might be a
good time to connect with him.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LIBLINEAR-tp5546p17236.html



Re: Spark LIBLINEAR

2014-10-24 Thread k.tham
Oh, I've only seen SVMWithSGD and hadn't realized an LBFGS-based trainer was
implemented. I'll try it out when I have time. Thanks!
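
For reference, a minimal MLlib L-BFGS sketch (assumes a live SparkContext
`sc` and LIBSVM-format input at a hypothetical path; not tested here):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

// Load labeled points from a LIBSVM-format file (path is hypothetical).
val training = MLUtils.loadLibSVMFile(sc, "data/training.libsvm")

// Train a binary logistic regression model using the L-BFGS optimizer.
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)
```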



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LIBLINEAR-tp5546p17240.html



Recommended pipeline automation tool? Oozie?

2014-07-10 Thread k.tham
I'm just wondering what's the general recommendation for data pipeline
automation.

Say, I want to run Spark Job A, then B, then invoke script C, then do D, and
if D fails, do E, and if Job A fails, send email F, etc...

It looks like Oozie might be the best choice. But I'd like some
advice/suggestions.

Thanks!
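
For what it's worth, the kind of control flow described above maps fairly
directly onto an Oozie workflow definition. A rough sketch (action bodies
elided, so this won't validate as-is; names like job-a and send-email-f are
made up for illustration):

```xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="example-pipeline">
  <start to="job-a"/>

  <action name="job-a">
    <!-- launch Spark Job A, e.g. via a shell or java action -->
    <ok to="job-b"/>
    <error to="send-email-f"/>
  </action>

  <action name="job-b">
    <!-- launch Spark Job B -->
    <ok to="script-c"/>
    <error to="fail"/>
  </action>

  <action name="script-c">
    <!-- invoke script C -->
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <action name="send-email-f">
    <!-- email action notifying that Job A failed -->
    <ok to="fail"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Pipeline failed.</message>
  </kill>
  <end name="end"/>
</workflow-app>
```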



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-pipeline-automation-tool-Oozie-tp9319.html


Running Spark on Yarn vs Mesos

2014-07-10 Thread k.tham
What do people usually do for this?

It looks like YARN might be the simplest, since the Cloudera distribution
already installs it for you when you install Hadoop.

Any advantages of using Mesos instead?

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-on-Yarn-vs-Mesos-tp9320.html


SchemaRDD's saveAsParquetFile() throws java.lang.IncompatibleClassChangeError

2014-06-03 Thread k.tham
I'm trying to save an RDD as a Parquet file through the saveAsParquetFile()
API,

With code that looks something like:

val sc = ...
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

val someRDD: RDD[SomeCaseClass] = ...
someRDD.saveAsParquetFile("someRDD.parquet")

However, I get the following error:
java.lang.IncompatibleClassChangeError: Found class
org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

I'm trying to figure out what the issue is, help is appreciated, thx!

My sbt configuration has the following:

val sparkV = "1.0.0"
// ...
"org.apache.spark" %% "spark-core"  % sparkV,
"org.apache.spark" %% "spark-mllib" % sparkV,
"org.apache.spark" %% "spark-sql"   % sparkV,

Here's the stack trace:

java.lang.IncompatibleClassChangeError: Found class
org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
at
org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:256)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
at
org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:224)
at
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:242)
at
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:242)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-s-saveAsParquetFile-throws-java-lang-IncompatibleClassChangeError-tp6837.html


Re: SchemaRDD's saveAsParquetFile() throws java.lang.IncompatibleClassChangeError

2014-06-03 Thread k.tham
Oh, I missed that thread. Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-s-saveAsParquetFile-throws-java-lang-IncompatibleClassChangeError-tp6837p6839.html


Re: SchemaRDD's saveAsParquetFile() throws java.lang.IncompatibleClassChangeError

2014-06-03 Thread k.tham
I've read through that thread, and it seems that, in his case, he needed to
add a particular hadoop-client dependency.
However, I don't think that should be required here, since I'm not reading
from HDFS.

I'm just running a straight up minimal example, in local mode, and out of
the box. 

Here's an example minimal project that reproduces this error:

https://github.com/ktham/spark-parquet-example
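
For reference, the workaround from the other thread amounts to pinning a
Hadoop 2.x client on the classpath: the error suggests Parquet was compiled
against Hadoop 2 (where TaskAttemptContext is an interface) while a Hadoop 1
jar (where it is a class) is being loaded. A hedged sbt sketch (the version
number is illustrative only; it should match your cluster):

```scala
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.4.0"
```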




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-s-saveAsParquetFile-throws-java-lang-IncompatibleClassChangeError-tp6837p6846.html