Hi,

I'm fairly new to Spark. I'm using Spark's saveAsSequenceFile() to write large SequenceFiles to HDFS. The SequenceFiles need to be large to be accessed efficiently in HDFS, preferably larger than Hadoop's block size of 64 MiB. The job works for files smaller than 64 MiB (with a warning for SequenceFiles close to 64 MiB), but for files larger than 64 MiB it fails with a libprotobuf error. Here is the full log:
14/05/05 18:18:00 INFO MesosSchedulerBackend: Registered as framework ID 201404231353-1315739402-5050-26649-0091
14/05/05 18:18:12 INFO SequenceFileRDDFunctions: Saving as sequence file of type (LongWritable,BytesWritable)
14/05/05 18:18:14 INFO SparkContext: Starting job: saveAsSequenceFile at XXXXX.scala:171
14/05/05 18:18:14 INFO DAGScheduler: Got job 0 (saveAsSequenceFile at XXXXX.scala:171) with 1 output partitions (allowLocal=false)
14/05/05 18:18:14 INFO DAGScheduler: Final stage: Stage 0 (saveAsSequenceFile at XXXXX.scala:171)
14/05/05 18:18:14 INFO DAGScheduler: Parents of final stage: List()
14/05/05 18:18:14 INFO DAGScheduler: Missing parents: List()
14/05/05 18:18:14 INFO DAGScheduler: Submitting Stage 0 (ParallelCollectionRDD[0] at makeRDD at XXXXX.scala:170), which has no missing parents
14/05/05 18:18:19 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (ParallelCollectionRDD[0] at makeRDD at XXXXX.scala:170)
14/05/05 18:18:19 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/05/05 18:18:19 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 201404231353-1315739402-5050-26649-3: dn-04 (PROCESS_LOCAL)
14/05/05 18:18:23 INFO TaskSetManager: Serialized task 0.0:0 as 113006452 bytes in 3890 ms
[libprotobuf ERROR google/protobuf/io/coded_stream.cc:171] A protocol message was rejected because it was too big (more than 67108864 bytes). To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
F0505 18:18:24.616025 27889 construct.cpp:48] Check failed: parsed Unexpected failure while parsing protobuf
*** Check failure stack trace: ***
    @ 0x7fc8d49ba96d  google::LogMessage::Fail()
    @ 0x7fc8d49be987  google::LogMessage::SendToLog()
    @ 0x7fc8d49bc809  google::LogMessage::Flush()
    @ 0x7fc8d49bcb0d  google::LogMessageFatal::~LogMessageFatal()

The code is fairly simple:

    val kv = <large Seq of Key Value pairs>
    // set parallelism to 1 to keep the file from being partitioned
    sc.makeRDD(kv, 1)
      .saveAsSequenceFile(path)

Does anyone have any pointers on how to get past this?

Thanks,

--
Allen Lee
Software Engineer
MediaCrossing Inc.
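[Not part of the original question, just a hedged sketch of a possible workaround.] The log shows the single task was serialized to 113,006,452 bytes, which exceeds the 67,108,864-byte (64 MiB) protobuf message limit that the Mesos framework enforces when the task is shipped, so one plausible fix is to avoid packing the whole Seq into a single task: split it across many partitions (so each serialized task stays small) and merge to one partition with a shuffle only at write time. The helper name `writeSingleSequenceFile` and the slice count of 64 are assumptions for illustration, not anything from the original post:

```scala
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.spark.SparkContext

// Sketch: kv is the large in-memory Seq from the question.
// Spreading it over many slices keeps each serialized task well under
// the 64 MiB protobuf limit; coalesce(1, shuffle = true) then merges
// the data into a single partition via a shuffle at write time instead
// of shipping the entire collection inside one task.
def writeSingleSequenceFile(sc: SparkContext,
                            kv: Seq[(LongWritable, BytesWritable)],
                            path: String): Unit = {
  sc.makeRDD(kv, numSlices = 64)   // many small tasks, ~1/64 of the data each
    .coalesce(1, shuffle = true)   // merge to one partition through a shuffle
    .saveAsSequenceFile(path)      // single SequenceFile at `path`
}
```

This keeps the "one large output file" property the comment in the original snippet was after, while the data reaches the writer through the shuffle machinery rather than through task serialization.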