RE: streaming sequence files?
Running as a standalone cluster. From my monitoring console:

    Spark Master at spark://101.73.54.149:7077
      URL: spark://101.73.54.149:7077
      Workers: 1
      Cores: 2 Total, 0 Used
      Memory: 2.4 GB Total, 0.0 B Used
      Applications: 0 Running, 24 Completed
      Drivers: 0 Running, 0 Completed
      Status: ALIVE

    Workers
      Id: worker-20140723222518-101.73.54.149-37995
      Address: 101.73.54.149:37995
      State: ALIVE
      Cores: 2 (0 Used)
      Memory: 2.4 GB (0.0 B Used)

From: tathagata.das1...@gmail.com
Date: Sat, 26 Jul 2014 20:14:37 -0700
Subject: Re: streaming sequence files?
To: user@spark.apache.org
CC: u...@spark.incubator.apache.org

Which deployment environment are you running the streaming programs in? Standalone? In that case you have to specify the max cores for each application; otherwise all the cluster resources may get consumed by a single application.

http://spark.apache.org/docs/latest/spark-standalone.html

TD

On Thu, Jul 24, 2014 at 4:57 PM, Barnaby <bfa...@outlook.com> wrote:

When running both streaming programs at the same time using spark-submit I noticed that only one app would really run. To get the one app to continue I had to stop the other app. Is there a way to get these running simultaneously?
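For concreteness, TD's suggestion corresponds to Spark's spark.cores.max setting. A minimal sketch of how one of the apps could cap itself (the app name and batch interval are made up; the master URL is the one from the console above):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Cap this application at one of the cluster's two cores so the
    // standalone master still has a core to give the second application.
    val conf = new SparkConf()
      .setMaster("spark://101.73.54.149:7077")
      .setAppName("WordCountsWriter") // hypothetical name
      .set("spark.cores.max", "1")
    val ssc = new StreamingContext(conf, Seconds(10))

The same cap can be passed at submit time with --conf spark.cores.max=1, and the master's spark.deploy.defaultCores gives a cluster-wide default for apps that don't set it.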
Re: streaming sequence files?
Can you just call fileStream or textFileStream in the second app, to consume files that appear in HDFS / Tachyon from the first job?

On Thu, Jul 24, 2014 at 2:43 AM, Barnaby <bfa...@outlook.com> wrote:

If I save an RDD as a sequence file such as:

    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.foreachRDD(d => {
      d.saveAsSequenceFile("tachyon://localhost:19998/files/WordCounts-" +
        (new SimpleDateFormat("yyyyMMdd-HHmmss") format Calendar.getInstance.getTime))
    })

how can I use these results in another Spark app, since there is no StreamingContext.sequenceFileStream()? Or, what is the best way to save RDDs of objects to files in one streaming app so that another app can stream those files in? Basically, I want to reuse partially reduced RDDs for further processing so that the work doesn't have to be done more than once.
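A sketch of what that consuming app could look like, mirroring the fileStream call Barnaby reports working later in the thread (the app name and batch interval here are placeholders, and the input directory is taken from args(0)):

    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SequenceFileReader {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("SequenceFileReader")
        val ssc = new StreamingContext(conf, Seconds(10))
        // Watch the directory and turn each newly appearing sequence
        // file into a batch of (String, Int) pairs.
        val wordCounts =
          ssc.fileStream[String, Int, SequenceFileInputFormat[String, Int]](args(0))
        wordCounts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

Note that fileStream expects the new Hadoop API, hence the mapreduce.lib.input import rather than the old mapred one.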
Re: streaming sequence files?
I have the streaming program writing sequence files. I can find one of the files and load it in the shell using:

    scala> val rdd = sc.sequenceFile[String, Int]("tachyon://localhost:19998/files/WordCounts/20140724-213930")
    14/07/24 21:47:50 INFO storage.MemoryStore: ensureFreeSpace(32856) called with curMem=0, maxMem=309225062
    14/07/24 21:47:50 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 32.1 KB, free 294.9 MB)
    rdd: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[1] at sequenceFile at <console>:12

So I got some type information; seems good.

It took a while to research, but I got the following streaming code to compile and run:

    val wordCounts = ssc.fileStream[String, Int, SequenceFileInputFormat[String, Int]](args(0))

It works now, and I offer this for reference to anybody else who may be curious about saving sequence files and then streaming them back in.

Question: When running both streaming programs at the same time using spark-submit, I noticed that only one app would really run. To get the one app to continue I had to stop the other app. Is there a way to get these running simultaneously?
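One more snippet for reference: the saved output can also be consumed by an ordinary batch job rather than a stream, using a glob over the timestamped directories (the wildcard pattern is an assumption about how the batches are laid out on disk):

    // Read back every batch directory the streaming writer has produced
    // so far, then re-combine the per-batch counts into one total.
    val all = sc.sequenceFile[String, Int]("tachyon://localhost:19998/files/WordCounts/*")
    val totals = all.reduceByKey(_ + _)

sc.sequenceFile accepts Hadoop glob patterns, so a single call can cover all the batch directories at once.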