RE: streaming sequence files?
Running as a Standalone Cluster. From my monitoring console:

Spark Master at spark://101.73.54.149:7077
 * URL: spark://101.73.54.149:7077
 * Workers: 1
 * Cores: 2 Total, 0 Used
 * Memory: 2.4 GB Total, 0.0 B Used
 * Applications: 0 Running, 24 Completed
 * Drivers: 0 Running, 0 Completed
 * Status: ALIVE

Workers
Id                                         Address              State  Cores       Memory
worker-20140723222518-101.73.54.149-37995  101.73.54.149:37995  ALIVE  2 (0 Used)  2.4 GB (0.0 B Used)

> From: tathagata.das1...@gmail.com
> Date: Sat, 26 Jul 2014 20:14:37 -0700
> Subject: Re: streaming sequence files?
> To: user@spark.apache.org
> CC: u...@spark.incubator.apache.org
>
> Which deployment environment are you running the streaming programs in?
> Standalone? In that case you have to specify the max cores for each
> application; otherwise all the cluster resources may get consumed by
> one application.
> http://spark.apache.org/docs/latest/spark-standalone.html
>
> TD
>
> On Thu, Jul 24, 2014 at 4:57 PM, Barnaby wrote:
> > I have the streaming program writing sequence files. I can find one of the
> > files and load it in the shell using:
> >
> > scala> val rdd = sc.sequenceFile[String,
> > Int]("tachyon://localhost:19998/files/WordCounts/20140724-213930")
> > 14/07/24 21:47:50 INFO storage.MemoryStore: ensureFreeSpace(32856) called
> > with curMem=0, maxMem=309225062
> > 14/07/24 21:47:50 INFO storage.MemoryStore: Block broadcast_0 stored as
> > values to memory (estimated size 32.1 KB, free 294.9 MB)
> > rdd: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[1] at sequenceFile
> > at <console>:12
> >
> > So I got some type information, which seems good.
> >
> > It took a while to research, but I got the following streaming code to
> > compile and run:
> >
> > val wordCounts = ssc.fileStream[String, Int, SequenceFileInputFormat[String,
> > Int]](args(0))
> >
> > It works now, and I offer this for reference to anybody else who may be
> > curious about saving sequence files and then streaming them back in.
> >
> > Question:
> > When running both streaming programs at the same time using spark-submit, I
> > noticed that only one app would really run. To get the one app to continue, I
> > had to stop the other app. Is there a way to get these running
> > simultaneously?
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/streaming-sequence-files-tp10557p10620.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
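For the archive, TD's answer and the `fileStream` line above can be combined into one sketch of the reader application: setting `spark.cores.max` caps this app at one of the worker's two cores, so the writer app can be scheduled on the other. This is a minimal sketch, not tested against this cluster; the app name and 10-second batch interval are assumptions, while the master's paths and the `fileStream` type parameters come from the thread.

```scala
// Sketch: reader app for a 2-core standalone cluster.
// spark.cores.max = 1 leaves the other core free for a second app.
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountReader {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountReader")
      .set("spark.cores.max", "1") // cap this app so the writer can run too

    val ssc = new StreamingContext(conf, Seconds(10))

    // Monitor the directory the writer saves into; each new sequence file
    // arrives as a batch of (word, count) pairs.
    val wordCounts = ssc.fileStream[String, Int,
      SequenceFileInputFormat[String, Int]](
      "tachyon://localhost:19998/files/WordCounts")
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With the default behavior (no `spark.cores.max`), the first submitted app grabs all available cores, which matches the symptom described above: the second app sits waiting until the first is stopped.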
Re: saveAsSequenceFile for DStream
Thanks, Sean! I got that working last night, similar to how you solved it. Any
ideas about how to monitor that same folder in another script by creating a
stream? I can use sc.sequenceFile() to read in the RDD, but how do I get the
name of the file that got added, since there is no sequenceFileStream() method?
Thanks again for your help.

> On Jul 22, 2014, at 1:57, "Sean Owen" wrote:
>
> What about simply:
>
> dstream.foreachRDD(_.saveAsSequenceFile(...))
>
> ?
>
>> On Tue, Jul 22, 2014 at 2:06 AM, Barnaby wrote:
>> First of all, I do not know Scala, but I am learning.
>>
>> I'm doing a proof of concept by streaming content from a socket, counting
>> the words, and writing the counts to a Tachyon disk. A different script will
>> read the file stream and print out the results.
>>
>> val lines = ssc.socketTextStream(args(0), args(1).toInt,
>>   StorageLevel.MEMORY_AND_DISK_SER)
>> val words = lines.flatMap(_.split(" "))
>> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
>> wordCounts.saveAs???Files("tachyon://localhost:19998/files/WordCounts")
>> ssc.start()
>> ssc.awaitTermination()
>>
>> I already did a proof of concept to write and read sequence files, but there
>> doesn't seem to be a saveAsSequenceFiles() method on DStream. What is the
>> best way to write out an RDD stream so that the timestamps are in the
>> filenames and there is minimal overhead in reading the data back in as
>> "objects"? See below.
>>
>> My simple successful proof was the following:
>>
>> val rdd = sc.parallelize(Array(("a", 2), ("b", 3), ("c", 1)))
>> rdd.saveAsSequenceFile("tachyon://.../123.sf2")
>> val rdd2 = sc.sequenceFile[String, Int]("tachyon://.../123.sf2")
>>
>> How can I do something similar with streaming?
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsSequenceFile-for-DStream-tp10369.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
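Since `DStream` has no saveAsSequenceFiles(), Sean's `foreachRDD` suggestion can be fleshed out with the two-argument form of `foreachRDD`, which also hands you the batch `Time`, so the timestamp can go into the output path as asked above. A hedged sketch of the writer app, under assumptions: the object name, batch interval, socket host/port, and the milliseconds-based path layout are all mine; the source/count logic and the Tachyon path come from the quoted snippet.

```scala
// Sketch: save each word-count batch as a sequence file whose directory
// name carries the batch timestamp, so a reader can list and load batches.
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

object WordCountWriter {
  // Assumed layout: one directory per batch, named by epoch milliseconds,
  // e.g. tachyon://.../WordCounts/1406260770000
  def batchPath(base: String, time: Time): String =
    s"$base/${time.milliseconds}"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountWriter")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999,
      StorageLevel.MEMORY_AND_DISK_SER)
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // Drop to the RDD level per batch; the two-argument foreachRDD
    // also provides the batch Time for the filename.
    wordCounts.foreachRDD { (rdd, time) =>
      if (rdd.count() > 0) // skip empty batches
        rdd.saveAsSequenceFile(
          batchPath("tachyon://localhost:19998/files/WordCounts", time))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

A reader can then list the directories under `WordCounts` to discover which batches have arrived, or monitor the directory with `fileStream` as in the other thread.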