Amit,

I'm certain the files were ingested without any issue -- I can see the intake
file names printed in the logs, as in my last reply to Suraj.

I experimented a bit further by calling RDD.saveAsTextFile() inside
foreachRDD instead of DStream.saveAsTextFiles(). That works on the same
dataset, so I'm pretty sure the problem is with DStream.

Can anyone confirm whether this is a bug, or am I using/understanding it the
wrong way?

// This line does not work
actions.saveAsTextFiles("hdfs://nn1:8020/user/etl/rtp_sink/actions/test",
  "testtest")

// This line works
actions.foreachRDD(rdd =>
  rdd.saveAsTextFile("hdfs://nn1:8020/user/etl/rtp_sink/actions/test",
    classOf[org.apache.hadoop.io.compress.GzipCodec]))
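In case it helps with debugging: from my reading of the DStream source,
saveAsTextFiles doesn't write to the prefix path itself -- it builds a
per-batch directory name from the prefix, the batch time, and the suffix,
via a helper called rddToFileName. A rough sketch of that naming (my
reconstruction of the helper, not the actual Spark API), showing where the
output of the failing call should have appeared:

```scala
// Sketch of how DStream.saveAsTextFiles seems to derive per-batch output
// directories. The helper name rddToFileName matches the Spark source;
// the body here is my reconstruction, so treat it as a guess.
def rddToFileName(prefix: String, suffix: String, batchTimeMs: Long): String =
  if (suffix == null || suffix.isEmpty) s"$prefix-$batchTimeMs"
  else s"$prefix-$batchTimeMs.$suffix"

// With the call above, the 19:35:00 batch (1392752100000 ms) should land under:
println(rddToFileName("hdfs://nn1:8020/user/etl/rtp_sink/actions/test",
                      "testtest", 1392752100000L))
// hdfs://nn1:8020/user/etl/rtp_sink/actions/test-1392752100000.testtest
```

So if that's right, I'd expect a directory like
.../test-1392752100000.testtest after each batch -- if nothing shows up
there either, that points at the DStream job itself rather than the path.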

Logs from using DStream.saveAsTextFiles() -- new files detected OK, and the
jobs finished without any errors or warnings:

14/02/18 19:35:00 INFO FileInputDStream: New files at time 1392752100000 ms:
hdfs://nn1:8020/user/etl/rtp_sink/staging/0134-ae2fc0ed9f824ab1a91f5fd81ad45af3.sdb.gz
hdfs://nn1:8020/user/etl/rtp_sink/staging/0134-f03bd4319bd24f13aa2f7c6b6a0d7631.sdb.gz
14/02/18 19:35:00 INFO MemoryStore: ensureFreeSpace(170493) called with
curMem=6990229, maxMem=9003781324
14/02/18 19:35:00 INFO MemoryStore: Block broadcast_41 stored as values to
memory (estimated size 166.5 KB, free 8.4 GB)
14/02/18 19:35:00 INFO MemoryStore: ensureFreeSpace(170493) called with
curMem=7160722, maxMem=9003781324
14/02/18 19:35:00 INFO MemoryStore: Block broadcast_42 stored as values to
memory (estimated size 166.5 KB, free 8.4 GB)
14/02/18 19:35:00 INFO FileInputFormat: Total input paths to process : 1
14/02/18 19:35:00 INFO FileInputFormat: Total input paths to process : 1
14/02/18 19:35:00 INFO JobScheduler: Added jobs for time 1392752100000 ms
14/02/18 19:35:00 INFO JobScheduler: Starting job streaming job
1392752100000 ms.0 from job set of time 1392752100000 ms
14/02/18 19:35:00 INFO SparkContext: Starting job: saveAsTextFile at
DStream.scala:762
14/02/18 19:35:00 INFO DAGScheduler: Got job 7 (saveAsTextFile at
DStream.scala:762) with 2 output partitions (allowLocal=false)
14/02/18 19:35:00 INFO DAGScheduler: Final stage: Stage 2 (saveAsTextFile at
DStream.scala:762)
14/02/18 19:35:00 INFO DAGScheduler: Parents of final stage: List()
14/02/18 19:35:00 INFO DAGScheduler: Missing parents: List()
14/02/18 19:35:00 INFO DAGScheduler: Submitting Stage 2 (MappedRDD[98] at
saveAsTextFile at DStream.scala:762), which has no missing parents
14/02/18 19:35:00 INFO DAGScheduler: Submitting 2 missing tasks from Stage 2
(MappedRDD[98] at saveAsTextFile at DStream.scala:762)
14/02/18 19:35:00 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
14/02/18 19:35:00 INFO TaskSetManager: Starting task 2.0:1 as TID 49 on
executor 2: hadoop-dal01-dev-dn8.tapjoy.com (NODE_LOCAL)
14/02/18 19:35:00 INFO TaskSetManager: Serialized task 2.0:1 as 12419 bytes
in 0 ms
14/02/18 19:35:00 INFO TaskSetManager: Starting task 2.0:0 as TID 50 on
executor 5: hadoop-dal01-dev-dn7.tapjoy.com (NODE_LOCAL)
14/02/18 19:35:00 INFO TaskSetManager: Serialized task 2.0:0 as 12419 bytes
in 0 ms
14/02/18 19:35:26 INFO TaskSetManager: Finished TID 50 in 26508 ms on
hadoop-dal01-dev-dn7.tapjoy.com (progress: 0/2)
14/02/18 19:35:26 INFO DAGScheduler: Completed ResultTask(2, 0)
14/02/18 19:35:30 INFO FileInputDStream: Finding new files took 6 ms



-----
-- Robin Li
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/DStream-saveAsTextFiles-saves-nothing-tp1666p1691.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
