unsubscribe
Re: Spark streaming checkpoint against s3
So as long as jar is kept on s3 and available across different runs, then the s3 checkpoint is working. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-checkpoint-against-s3-tp25068p25081.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark streaming checkpoint against s3
Hi, I am trying to set spark streaming checkpoint to s3, here is what I did basically val checkpoint = "s3://myBucket/checkpoint" val ssc = StreamingContext.getOrCreate(checkpointDir, () => getStreamingContext(sparkJobName, batchDurationSec), classOf[MyClassKryoRegistrator], checkpointDir), getHadoopConfiguration) def getHadoopConfiguration: Configuration = { val hadoopConf = new Configuration() hadoopConf.set("fs.defaultFS", "s3://"+myBucket+"/") hadoopConf.set("fs.s3.awsAccessKeyId", "myAccessKey") hadoopConf.set("fs.s3.awsSecretAccessKey", "mySecretKey") hadoopConf.set("fs.s3n.awsAccessKeyId", "myAccessKey") hadoopConf.set("fs.s3n.awsSecretAccessKey", "mySecretKey hadoopConf } It is working as I can see that it tries to retrieve checkpoint from s3. However it did more than what I intended. I saw in the log of the following 15/10/14 19:58:47 ERROR spark.SparkContext: Jar not found at file:/media/ephemeral0/oncue/mesos-slave/slaves/20151007-172900-436893194-5050-2984-S9/frameworks/20150825-180042-604730890-5050-4268-0003/executors/tian-act-reg.47368a1a-71f9-11e5-ad61-de5fb3a867da/runs/dfc28a6c-48a0-464b-bdb1-d6dd057acd51/artifacts/rna-spark-streaming.jar Now SparkContext is trying to look the following path instead of local file:/media/ephemeral0/oncue/mesos-slave/slaves/20151007-172900-436893194-5050-2984-S9/frameworks/20150825-180042-604730890-5050-4268-0003/executors/tian-act-reg.47368a1a-71f9-11e5-ad61-de5fb3a867da/runs/dfc28a6c-48a0-464b-bdb1-d6dd057acd51/artifacts/rna-spark-streaming.jar How do I let SparkContext to look just /media/ephemeral0/oncue/mesos-slave/slaves/20151007-172900-436893194-5050-2984-S9/frameworks/20150825-180042-604730890-5050-4268-0003/executors/tian-act-reg.47368a1a-71f9-11e5-ad61-de5fb3a867da/runs/dfc28a6c-48a0-464b-bdb1-d6dd057acd51/artifacts/rna-spark-streaming.jar? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-checkpoint-against-s3-tp25068.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark streaming checkpoint against s3
It looks like that reconstruction of SparkContext from checkpoint data is trying to look for the jar file of previous failed runs. It can not find the jar files as our jar files are on local machines and were cleaned up after each failed run. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-checkpoint-against-s3-tp25068p25070.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: updateStateByKey and stack overflow
It turns out that our hdfs checkpoint failed, but spark streaming is running and building up a long lineage ... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/updateStateByKey-and-stack-overflow-tp25015p25054.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: "Too many open files" exception on reduceByKey
It turns out the mesos can overwrite the OS ulimit -n setting. So we have increased the mesos slave ulimit -n setting. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-exception-on-reduceByKey-tp2462p25019.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
updateStateByKey and stack overflow
Hi, I am following the spark streaming stateful application example and write a simple counting application with updateStateByKey. val keyStateStream = actRegBatchCountStream.updateStateByKey(update, new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initKeyStateRDD) This runs for a few hours and hit the following stack overflow issue. Any idea? 15/10/10 18:30:08 INFO BlockManagerInfo: Added broadcast_5249_piece0 in memory on ip-10-112-11-64.ec2.internal:60489 (size: 16.5 KB, free: 4.1 GB) 15/10/10 18:30:08 WARN TaskSetManager: Lost task 4.0 in stage 129045.0 (TID 175432, ip-10-112-11-64.ec2.internal): java.lang.StackOverflowError at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1982) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/updateStateByKey-and-stack-overflow-tp25015.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: "Too many open files" exception on reduceByKey
You are right, I did find that mesos overwrite this to a smaller number.So we will modify that and try to run again. Thanks! Tian On Thursday, October 8, 2015 4:18 PM, DB Tsai <dbt...@dbtsai.com> wrote: Try to run to see actual ulimit. We found that mesos overrides the ulimit which causes the issue. import sys.process._ val p = 1 to 100 val rdd = sc.parallelize(p, 100) val a = rdd.map(x=> Seq("sh", "-c", "ulimit -n").!!.toDouble.toLong).collect Sincerely, DB Tsai --Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Thu, Oct 8, 2015 at 3:22 PM, Tian Zhang <tzhang...@yahoo.com> wrote: I hit this issue with spark 1.3.0 stateful application (with updateStateByKey) function on mesos. It will fail after running fine for about 24 hours. The error stack trace as below, I checked ulimit -n and we have very large numbers set on the machines. What else can be wrong? 15/09/27 18:45:11 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 113727.0 (TID 833758, ip-10-112-10-221.ec2.internal): java.io.FileNotFoundException: /media/ephemeral0/oncue/mesos-slave/slaves/20150512-215537-2165010442-5050-1730-S5/frameworks/20150825-175705-2165010442-5050-13705-0338/executors/0/runs/19342849-d076-483c-88da-747896e19b93/./spark-6efa2dcd-aea7-478e-9fa9-6e0973578eb4/blockmgr-33b1e093-6dd6-4462-938c-2597516272a9/27/shuffle_535_2_0.index (Too many open files) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.(FileOutputStream.java:221) at java.io.FileOutputStream.(FileOutputStream.java:171) at org.apache.spark.shuffle.IndexShuffleBlockManager.writeIndexFile(IndexShuffleBlockManager.scala:85) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:69) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-exception-on-reduceByKey-tp2462p24985.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: "Too many open files" exception on reduceByKey
I hit this issue with spark 1.3.0 stateful application (with updateStateByKey) function on mesos. It will fail after running fine for about 24 hours. The error stack trace as below, I checked ulimit -n and we have very large numbers set on the machines. What else can be wrong? 15/09/27 18:45:11 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 113727.0 (TID 833758, ip-10-112-10-221.ec2.internal): java.io.FileNotFoundException: /media/ephemeral0/oncue/mesos-slave/slaves/20150512-215537-2165010442-5050-1730-S5/frameworks/20150825-175705-2165010442-5050-13705-0338/executors/0/runs/19342849-d076-483c-88da-747896e19b93/./spark-6efa2dcd-aea7-478e-9fa9-6e0973578eb4/blockmgr-33b1e093-6dd6-4462-938c-2597516272a9/27/shuffle_535_2_0.index (Too many open files) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.(FileOutputStream.java:221) at java.io.FileOutputStream.(FileOutputStream.java:171) at org.apache.spark.shuffle.IndexShuffleBlockManager.writeIndexFile(IndexShuffleBlockManager.scala:85) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:69) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-exception-on-reduceByKey-tp2462p24985.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
how to pass configuration properties from driver to executor?
Hi, We have a scenario as below and would like your suggestion. We have app.conf file with propX=A as default built into the fat jar file that is provided to spark-submit WE have env.conf file with propX=B that would like spark-submit to take as input to overwrite the default and populate to both driver and executors. Note in the executor, we are using some package that is using typesafe config to read configuration properties. How do we do that? Thanks. Tian -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-pass-configuration-properties-from-driver-to-executor-tp22728.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Lifecycle of RDD in spark-streaming
I have found this paper seems to answer most of questions about life duration.https://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf Tian On Tuesday, November 25, 2014 4:02 AM, Mukesh Jha me.mukesh@gmail.com wrote: Hey Experts, I wanted to understand in detail about the lifecycle of rdd(s) in a streaming app. From my current understanding- rdd gets created out of the realtime input stream. - Transform(s) functions are applied in a lazy fashion on the RDD to transform into another rdd(s).- Actions are taken on the final transformed rdds to get the data out of the system. Also rdd(s) are stored in the clusters RAM (disc if configured so) and are cleaned in LRU fashion. So I have the following questions on the same. - How spark (streaming) guarantees that all the actions are taken on each input rdd/batch. - How does spark determines that the life-cycle of a rdd is complete. Is there any chance that a RDD will be cleaned out of ram before all actions are taken on them? Thanks in advance for all your help. Also, I'm relatively new to scala spark so pardon me in case these are naive questions/assumptions. -- Thanks Regards, Mukesh Jha
2 spark streaming questions
Hi, Dear Spark Streaming Developers and Users, We are prototyping using spark streaming and hit the following 2 issues thatI would like to seek your expertise. 1) We have a spark streaming application in scala, that reads data from Kafka intoa DStream, does some processing and output a transformed DStream. If for some reasonthe Kafka connection is not available or timed out, the spark streaming job will startto send empty RDD afterwards. The log is clean w/o any ERROR indicator. I googled around and this seems to be a known issue.We believe that spark streaming infrastructure should either retry or return error/exception.Can you share how you handle this case? 2) We would like implement a spark streaming job that join an 1 minute duration DStream of real time eventswith a metadata RDD that was read from a database. The metadata only changes slightly each day in the database.So what is the best practice of refresh the RDD daily keep the streaming join job running? Is this do-able as of spark 1.1.0? Thanks. Tian
Re: spark streaming and the spark shell
I am hitting the same issue, i.e., after running for some time, if spark streaming job lost or timeout kafka connection, it will just start to return empty RDD's .. Is there a timeline for when this issue will be fixed so that I can plan accordingly? Thanks. Tian -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-and-the-spark-shell-tp3347p19296.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark 1.1.0/yarn hang
We have narrowed this hanging issue down to the calliope package that we used to create RDD from reading cassandra table. The calliope native RDD interface seems hanging and I have decided to switch to the calliope cql3 RDD interface. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-1-0-yarn-hang-tp16396p17087.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
spark 1.1.0 RDD and Calliope 1.1.0-CTP-U2-H2
Hi, I am using the latest calliope library from tuplejump.com to create RDD for cassandra table. I am on a 3 nodes spark 1.1.0 with yarn. My cassandra table is defined as below and I have about 2000 rows of data inserted. CREATE TABLE top_shows ( program_id varchar, view_minute timestamp, view_count counter, PRIMARY KEY (view_minute, program_id) //note that view_minute is the partition key ); Here are the simple steps I ran from spark-shell on master node spark-shell --master yarn-client --jars rna/rna-spark-streaming-assembly-1.0-SNAPSHOT.jar --driver-memory 512m --executor-memory 512m --num-executors 3 --executor-cores 1 // Import the necessary import org.apache.spark.rdd.RDD import org.apache.spark.SparkContext import com.tuplejump.calliope.utils.RichByteBuffer._ import com.tuplejump.calliope.Implicits._ import com.tuplejump.calliope.CasBuilder import com.tuplejump.calliope.Types.{CQLRowKeyMap, CQLRowMap} // Define my class and the implicit cast case class ProgramViewCount(viewMinute:Long, program:String, viewCount:Long) implicit def keyValtoProgramViewCount(key:CQLRowKeyMap, values:CQLRowMap):ProgramViewCount = ProgramViewCount(key.get(view_minute).get.getLong, key.get(program_id).toString, values.get(view_count).get.getLong) // Use the cql3 interface to read from table with WHERE predicate. val cas = CasBuilder.cql3.withColumnFamily(streaming_qa, top_shows).onHost(23.22.120.96) .where(view_minute = 141386178) val allPrograms = sc.cql3Cassandra[ProgramViewCount](cas) // Lazy evaluation till this point val rowCount = allPrograms.count I hit the following exception. It seems that it does not like my where clause. If I do not have the WHERE CLAUSE, it works fine. But with the WHERE CLAUSE, no matter the predicate is on partition key or not, it will fail with the following exception. Anyone else using calliope package can share some lights? Thanks a lot. Tian scala val rowCount = allPrograms.count 14/10/21 23:26:07 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, ip-10-187-51-136.ec2.internal): java.lang.RuntimeException: com.tuplejump.calliope.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:665) com.tuplejump.calliope.hadoop.cql3.CqlPagingRecordReader$RowIterator.init(CqlPagingRecordReader.java:301) com.tuplejump.calliope.hadoop.cql3.CqlPagingRecordReader.initialize(CqlPagingRecordReader.java:167) com.tuplejump.calliope.cql3.Cql3CassandraRDD$$anon$1.init(Cql3CassandraRDD.scala:75) com.tuplejump.calliope.cql3.Cql3CassandraRDD.compute(Cql3CassandraRDD.scala:64) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-1-0-RDD-and-Calliope-1-1-0-CTP-U2-H2-tp16975.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
spark 1.1.0/yarn hang
Hi, I have spark 1.1.0 yarn installation. I am using spark-submit to run a simple application. From the console output, I have 769 partitions and after task 768 in stage 0 (count) finished, it hangs. I used jstack to dump the stacktop and it shows it is waiting ... Any suggestion what might go wrong and how to debug this kind of hanging? Thanks. Tian main prio=10 tid=0x7f6058009000 nid=0x7ecd in Object.wait() [0x7f605e4d9000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xfdd30500 (a org.apache.spark.scheduler.JobWaiter) at java.lang.Object.wait(Object.java:503) at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) - locked 0xfdd30500 (a org.apache.spark.scheduler.JobWaiter) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:511) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1088) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1107) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1121) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1135) at org.apache.spark.rdd.RDD.count(RDD.scala:904) at com.oncue.rna.realtime.streaming.spark.TopShowsToKafkaJob.getTopShows(TopShowsToKafkaJob.scala:29) at com.oncue.rna.realtime.streaming.spark.TopShowsToKafkaJob.getTopShows(TopShowsToKafkaJob.scala:45) at com.oncue.rna.realtime.streaming.spark.TopShowsToKafkaJob$$anonfun$5.apply(TopShowsToKafkaJob.scala:79) at com.oncue.rna.realtime.streaming.spark.TopShowsToKafkaJob$$anonfun$5.apply(TopShowsToKafkaJob.scala:76) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.Range.foreach(Range.scala:141) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at com.oncue.rna.realtime.streaming.spark.TopShowsToKafkaJob.processRecentWindow(TopShowsToKafkaJob.scala:76) at com.oncue.rna.realtime.streaming.spark.TopShowsToKafkaJob$.processRecentWindow(TopShowsToKafkaJob.scala:98) at com.oncue.rna.realtime.streaming.spark.TopShowsToKafkaJob$.main(TopShowsToKafkaJob.scala:112) at com.oncue.rna.realtime.streaming.spark.TopShowsToKafkaJob.main(TopShowsToKafkaJob.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Locked ownable synchronizers: - None
Re: Spark Streaming : Could not compute split, block not found
I have figured out why I am getting this error: We have a lot of data in kafka and the DStream from Kafka used MEMROY_ONLY_SER, so once the memory is low, spark started to discard data that is needed later ... So once I change to MEMORY_AND_DISK_SER, the error is gone. Tian -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Could-not-compute-split-block-not-found-tp11186p16084.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: [ANN] SparkSQL support for Cassandra with Calliope
Rohit, Thank you very much for release the H2 version and now my app compiles file and there is no more runtime error wrt. hadoop 1.x class or interface. Tian On Saturday, October 4, 2014 9:47 AM, Rohit Rai ro...@tuplejump.com wrote: Hi Tian, We have published a build against Hadoop 2.0 with version 1.1.0-CTP-U2-H2 Let us know how your testing goes. Regards, Rohit Founder CEO, Tuplejump, Inc. www.tuplejump.comThe Data Engineering Platform On Sat, Oct 4, 2014 at 3:49 AM, tian zhang tzhang...@yahoo.com wrote: Hi, Rohit, Thank you for sharing this good news. I have some relevant issue that I would like to ask your help. I am using spark 1.1.0 and I have a spark application using com.tuplejump% calliope-core_2.10% 1.1.0-CTP-U2, At runtime there are following errors that seem indicate that calliope package is compiled with hadoop 1.x and spark is running on hadoop 2.x. Can you release a new version of calliope so that it will be compatible with spark 1.1.0? Thanks. here is the error details. java.lang.IncompatibleClassChangeError: Found interface (hadoop 2.x) org.apache.hadoop.mapreduce.TaskAttemptContext, but class (hadoop 1.x) was expected com.tuplejump.calliope.hadoop.cql3.CqlRecordReader.initialize(CqlRecordReader.java:82) Tian On Friday, October 3, 2014 11:15 AM, Rohit Rai ro...@tuplejump.com wrote: Hi All, An year ago we started this journey and laid the path for Spark + Cassandra stack. We established the ground work and direction for Spark Cassandra connectors and we have been happy seeing the results. With Spark 1.1.0 and SparkSQL release, we its time to take Calliope to the logical next level also paving the way for much more advanced functionality to come. Yesterday we released Calliope 1.1.0 Community Tech Preview, which brings Native SparkSQL support for Cassandra. The further details are available here. This release showcases in core spark-sql, hiveql and HiveThriftServer support. I differentiate it as native spark-sql integration as it doesn't rely on Cassandra's hive connectors (like Cash or DSE) and saves a level of indirection through Hive. It also allows us to harness Spark's analyzer and optimizer in future to work out the best execution plan targeting a balance between Cassandra's querying restrictions and Sparks in memory processing. As far as we know this it the first and only third party datastore connector for SparkSQL. This is a CTP release as it relies on Spark internals that still don't have/stabilized a developer API and we will work with the Spark Community in documenting the requirements and working towards a standard and stable API for third party data store integration. On another note, we no longer require you to signup to access the early access code repository. Inviting all of you try it and give us your valuable feedback. Regards, Rohit Founder CEO, Tuplejump, Inc. www.tuplejump.comThe Data Engineering Platform
Spark 1.1.0 (w/ hadoop 2.4) versus aws-java-sdk-1.7.2.jar
Hi, Spark experts, I have the following issue when using aws java sdk in my spark application. Here I narrowed down the following steps to reproduce the problem 1) I have Spark 1.1.0 with hadoop 2.4 installed on 3 nodes cluster 2) from the master node, I did the following steps. spark-shell --jars ws-java-sdk-1.7.2.jar import com.amazonaws.{Protocol, ClientConfiguration} import com.amazonaws.auth.BasicAWSCredentials import com.amazonaws.services.s3.AmazonS3Client val clientConfiguration = new ClientConfiguration() val s3accessKey=X val s3secretKey=Y val credentials = new BasicAWSCredentials(s3accessKey,s3secretKey) println(CLASSPATH=+System.getenv(CLASSPATH)) CLASSPATH=::/home/hadoop/spark/conf:/home/hadoop/spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar:/home/hadoop/conf:/home/hadoop/conf println(java.class.path=+System.getProperty(java.class.path)) java.class.path=::/home/hadoop/spark/conf:/home/hadoop/spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar:/home/hadoop/conf:/home/hadoop/conf So far all look good and normal. But then the following step will fail and it looks like the class loader can't resolve to the right class. Any suggestion for Spark application that requires aws sdk? scala val s3Client = new AmazonS3Client(credentials, clientConfiguration) java.lang.NoClassDefFoundError: org/apache/http/impl/conn/PoolingClientConnectionManager at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:26) at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:96) at com.amazonaws.http.AmazonHttpClient.init(AmazonHttpClient.java:155) at com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:119) at com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:103) at com.amazonaws.services.s3.AmazonS3Client.init(AmazonS3Client.java:334) at $iwC$$iwC$$iwC$$iwC.init(console:21) at $iwC$$iwC$$iwC.init(console:26) at $iwC$$iwC.init(console:28) at $iwC.init(console:30) at init(console:32) at .init(console:36) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: org.apache.http.impl.conn.PoolingClientConnectionManager at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 46 more Thanks. Tian