RE: streaming sequence files?

2014-07-28 Thread Barnaby Falls
Running as Standalone Cluster. From my monitoring console:
Spark Master at spark://101.73.54.149:7077
 * URL: spark://101.73.54.149:7077
 * Workers: 1
 * Cores: 2 Total, 0 Used
 * Memory: 2.4 GB Total, 0.0 B Used
 * Applications: 0 Running, 24 Completed
 * Drivers: 0 Running, 0 Completed
 * Status: ALIVE

Workers
   Id                                         Address              State  Cores       Memory
   worker-20140723222518-101.73.54.149-37995  101.73.54.149:37995  ALIVE  2 (0 Used)  2.4 GB (0.0 B Used)

 From: tathagata.das1...@gmail.com
 Date: Sat, 26 Jul 2014 20:14:37 -0700
 Subject: Re: streaming sequence files?
 To: user@spark.apache.org
 CC: u...@spark.incubator.apache.org
 
 In which deployment environment are you running the streaming programs?
 Standalone? In that case you have to specify the max cores for each
 application; otherwise all of the cluster resources may get consumed by
 a single application.
 http://spark.apache.org/docs/latest/spark-standalone.html
 
 TD
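To make TD's suggestion concrete: in standalone mode an application takes all available cores by default, so the first submitted app can starve every later one. A minimal sketch of capping cores per application (the value 1 here is only an example chosen for the 2-core cluster shown above):

```
# conf/spark-defaults.conf -- cap each application at one core so that
# two apps can run on the 2-core standalone cluster at the same time
spark.cores.max  1
```

The same cap can be passed per job with spark-submit's --total-executor-cores flag, or set programmatically via SparkConf.set("spark.cores.max", "1").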
 
 On Thu, Jul 24, 2014 at 4:57 PM, Barnaby bfa...@outlook.com wrote:
  I have the streaming program writing sequence files. I can find one of the
  files and load it in the shell using:
 
  scala> val rdd = sc.sequenceFile[String,
  Int]("tachyon://localhost:19998/files/WordCounts/20140724-213930")
  14/07/24 21:47:50 INFO storage.MemoryStore: ensureFreeSpace(32856) called
  with curMem=0, maxMem=309225062
  14/07/24 21:47:50 INFO storage.MemoryStore: Block broadcast_0 stored as
  values to memory (estimated size 32.1 KB, free 294.9 MB)
  rdd: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[1] at sequenceFile
  at <console>:12
 
  So I got some type information, seems good.
 
  It took a while to research but I got the following streaming code to
  compile and run:
 
  val wordCounts = ssc.fileStream[String, Int, SequenceFileInputFormat[String,
  Int]](args(0))
 
  It works now and I offer this for reference to anybody else who may be
  curious about saving sequence files and then streaming them back in.
 
  Question:
  When running both streaming programs at the same time using spark-submit I
  noticed that only one app would really run. To get the one app to continue I
  had to stop the other app. Is there a way to get these running
  simultaneously?
 
 
 
  --
  View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/streaming-sequence-files-tp10557p10620.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
  

Re: streaming sequence files?

2014-07-24 Thread Sean Owen
Can you just call fileStream or textFileStream in the second app, to
consume files that appear in HDFS / Tachyon from the first job?

On Thu, Jul 24, 2014 at 2:43 AM, Barnaby bfa...@outlook.com wrote:
 If I save an RDD as a sequence file such as:

 val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
 wordCounts.foreachRDD( d => {
   d.saveAsSequenceFile("tachyon://localhost:19998/files/WordCounts-" +
     (new SimpleDateFormat("yyyyMMdd-HHmmss") format
       Calendar.getInstance.getTime).toString)
 })
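For reference, the directory-naming expression in the snippet above can be exercised in plain Scala without Spark. This is only a sketch of the SimpleDateFormat pattern (yyyyMMdd-HHmmss) that yields names like WordCounts-20140724-213930:

```scala
import java.text.SimpleDateFormat
import java.util.Calendar

// Same pattern the streaming job uses to name per-batch output
// directories, e.g. "WordCounts-20140724-213930".
val fmt = new SimpleDateFormat("yyyyMMdd-HHmmss")
val stamp = fmt.format(Calendar.getInstance.getTime)
val dir = "tachyon://localhost:19998/files/WordCounts-" + stamp
println(dir)
```

Because the name is derived from wall-clock time at the moment foreachRDD fires, each batch lands in a fresh directory, which is exactly what fileStream in the consuming app watches for.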

 How can I use these results in another Spark app since there is no
 StreamingContext.sequenceFileStream()?

 Or,

 What is the best way to save RDDs of objects to files in one streaming app
 so that another app can stream those files in? Basically, reuse partially
 reduced RDDs for further processing so that it doesn't have to be done more
 than once.
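The "partially reduced" data being saved is just (word, count) pairs. As an illustration only (plain Scala collections, no Spark), the map/reduceByKey step computes the equivalent of:

```scala
// Plain-collections analogue of words.map(x => (x, 1)).reduceByKey(_ + _).
// The sample input is made up for illustration.
val words = Seq("spark", "tachyon", "spark", "spark")
val wordCounts = words
  .map(x => (x, 1))                                    // pair each word with 1
  .groupBy(_._1)                                       // group pairs by word
  .map { case (w, pairs) => (w, pairs.map(_._2).sum) } // sum the 1s per word
// wordCounts contains: spark -> 3, tachyon -> 1
```

Since each batch writes a complete (String, Int) map like this, a downstream app only needs to merge per-batch counts rather than re-tokenize the raw input.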




 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/streaming-sequence-files-tp10557.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: streaming sequence files?

2014-07-24 Thread Barnaby
I have the streaming program writing sequence files. I can find one of the
files and load it in the shell using:

scala> val rdd = sc.sequenceFile[String,
Int]("tachyon://localhost:19998/files/WordCounts/20140724-213930")
14/07/24 21:47:50 INFO storage.MemoryStore: ensureFreeSpace(32856) called
with curMem=0, maxMem=309225062
14/07/24 21:47:50 INFO storage.MemoryStore: Block broadcast_0 stored as
values to memory (estimated size 32.1 KB, free 294.9 MB)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[1] at sequenceFile
at <console>:12

So I got some type information, seems good.

It took a while to research but I got the following streaming code to
compile and run:

val wordCounts = ssc.fileStream[String, Int, SequenceFileInputFormat[String,
Int]](args(0))

It works now and I offer this for reference to anybody else who may be
curious about saving sequence files and then streaming them back in.

Question:
When running both streaming programs at the same time using spark-submit I
noticed that only one app would really run. To get the one app to continue I
had to stop the other app. Is there a way to get these running
simultaneously?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/streaming-sequence-files-tp10557p10620.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.