Re: How to merge an RDD of RDDs into one uber RDD
An RDD cannot contain elements of type RDD; you can't nest RDDs within RDDs, and in fact I don't think it would make sense. I suggest that rather than having an RDD of file names, you collect those file name strings back to the driver as a Scala array of file names, then build an array of RDDs from them, and fold over that array, merging the RDDs.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-merge-a-RDD-of-RDDs-into-one-uber-RDD-tp20986p21007.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
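A minimal sketch of that collect-then-fold approach (names here are hypothetical, and sc.parallelize stands in for whatever actually loads each file, e.g. sc.textFile, so the sketch runs without real files):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(
  new SparkConf().setAppName("merge-sketch").setMaster("local[*]"))

// Hypothetical input: an RDD of file names produced by an upstream job.
val fileNamesRDD: RDD[String] = sc.parallelize(Seq("part-0", "part-1", "part-2"))

// 1. Collect the names back to the driver as a plain Scala array.
val fileNames: Array[String] = fileNamesRDD.collect()

// 2. Build one RDD per name on the driver (in a real job this would be
//    something like sc.textFile(name)).
val rdds: Array[RDD[String]] = fileNames.map(n => sc.parallelize(Seq(n)))

// 3. Fold over the array, merging with union into one uber RDD.
//    sc.union(rdds) is equivalent and keeps the lineage shallower.
val merged: RDD[String] = rdds.reduce(_ union _)

val total: Long = merged.count()

sc.stop()
```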
Re: Spark LIBLINEAR
Just wondering, any update on this? Is there a plan to integrate CJ's work with MLlib? I'm asking since the SVM implementation in MLlib did not give us good results, and we had to resort to training our SVM classifier in a serial manner on the driver node with LIBLINEAR. Also, it looks like CJ Lin is coming to the Bay Area in the coming weeks (http://www.meetup.com/sfmachinelearning/events/208078582/); it might be a good time to connect with him.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LIBLINEAR-tp5546p17236.html
Re: Spark LIBLINEAR
Oh, I've only seen SVMWithSGD and hadn't realized L-BFGS was implemented. I'll try it out when I have time. Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LIBLINEAR-tp5546p17240.html
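For reference, MLlib's L-BFGS optimizer is exposed through LogisticRegressionWithLBFGS (a linear classifier, not an SVM). A small sketch on toy, linearly separable data, just to exercise the API (data and values are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val sc = new SparkContext(
  new SparkConf().setAppName("lbfgs-sketch").setMaster("local[*]"))

// Tiny two-class dataset: points near the origin are class 0,
// points near (5, 5) are class 1.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 0.1)),
  LabeledPoint(0.0, Vectors.dense(0.2, 0.0)),
  LabeledPoint(1.0, Vectors.dense(5.0, 4.9)),
  LabeledPoint(1.0, Vectors.dense(4.8, 5.1))
))

// Train a binary logistic regression model with the L-BFGS optimizer.
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(data)

// Predict a point deep in the class-1 region.
val pred: Double = model.predict(Vectors.dense(5.0, 5.0))

sc.stop()
```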
Recommended pipeline automation tool? Oozie?
I'm just wondering what the general recommendation is for data pipeline automation. Say I want to run Spark Job A, then B, then invoke script C, then do D; if D fails, do E; and if Job A fails, send email F; etc. It looks like Oozie might be the best choice, but I'd like some advice/suggestions. Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-pipeline-automation-tool-Oozie-tp9319.html
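To make the Oozie shape concrete, here is a hedged, abbreviated workflow sketch for the "A then B; if A fails, email F" part of that chain (action names, scripts, and addresses are all invented for illustration; steps C/D/E would be further actions following the same ok/error pattern):

```xml
<workflow-app name="pipeline-sketch" xmlns="uri:oozie:workflow:0.4">
  <start to="job-a"/>

  <!-- Spark Job A, launched here via a shell wrapper script. -->
  <action name="job-a">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>run_spark_job_a.sh</exec>
    </shell>
    <ok to="job-b"/>
    <error to="email-f"/>  <!-- if Job A fails, send email F -->
  </action>

  <action name="job-b">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>run_spark_job_b.sh</exec>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <action name="email-f">
    <email xmlns="uri:oozie:email-action:0.1">
      <to>oncall@example.com</to>
      <subject>Job A failed</subject>
      <body>Pipeline stopped at Job A.</body>
    </email>
    <ok to="fail"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Pipeline failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```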
Running Spark on Yarn vs Mesos
What do people usually do for this? It looks like YARN might be the simplest, since the Cloudera distribution already installs it for you when you install Hadoop. Are there any advantages to using Mesos instead? Thanks.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-on-Yarn-vs-Mesos-tp9320.html
SchemaRDD's saveAsParquetFile() throws java.lang.IncompatibleClassChangeError
I'm trying to save an RDD as a Parquet file through the saveAsParquetFile() API, with code that looks something like:

    val sc = ...
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._

    val someRDD: RDD[SomeCaseClass] = ...
    someRDD.saveAsParquetFile("someRDD.parquet")

However, I get the following error:

    java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

I'm trying to figure out what the issue is; help is appreciated, thx! My sbt configuration has the following:

    val sparkV = "1.0.0"
    // ...
    "org.apache.spark" %% "spark-core"  % sparkV,
    "org.apache.spark" %% "spark-mllib" % sparkV,
    "org.apache.spark" %% "spark-sql"   % sparkV,

Here's the stack trace:

    java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
        at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:256)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:224)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:242)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:242)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-s-saveAsParquetFile-throws-java-lang-IncompatibleClassChangeError-tp6837.html
Re: SchemaRDD's saveAsParquetFile() throws java.lang.IncompatibleClassChangeError
Oh, I missed that thread. Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-s-saveAsParquetFile-throws-java-lang-IncompatibleClassChangeError-tp6837p6839.html
Re: SchemaRDD's saveAsParquetFile() throws java.lang.IncompatibleClassChangeError
I've read through that thread, and it seems that for him, he needed to add a particular hadoop-client dependency. However, I don't think I should be required to do that, as I'm not reading from HDFS. I'm just running a straight-up minimal example, in local mode, out of the box. Here's a minimal example project that reproduces this error: https://github.com/ktham/spark-parquet-example

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-s-saveAsParquetFile-throws-java-lang-IncompatibleClassChangeError-tp6837p6846.html
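For anyone hitting this later, a hedged note on what this error usually indicates: IncompatibleClassChangeError on TaskAttemptContext is the classic Hadoop 1 vs Hadoop 2 binary mismatch (TaskAttemptContext is a class in Hadoop 1 but an interface in Hadoop 2), so code compiled against one and run against the other fails regardless of whether HDFS is actually used. One possible fix, along the lines of the hadoop-client dependency mentioned in the other thread, is to align everything on one Hadoop major version in build.sbt (versions below are illustrative, not prescriptive):

```scala
// build.sbt sketch — align the Hadoop version on the classpath with the one
// your Spark/Parquet artifacts expect; 2.2.0 here is an illustrative guess.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "1.0.0",
  "org.apache.spark"  %% "spark-sql"     % "1.0.0",
  "org.apache.hadoop" %  "hadoop-client" % "2.2.0"  // match your cluster / Spark build
)
```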