Hi,

I'm fairly new to Spark. I'm using Spark's saveAsSequenceFile() to write large SequenceFiles to HDFS. The SequenceFiles need to be large to be accessed efficiently in HDFS, preferably larger than Hadoop's block size of 64 MiB. The job works for files smaller than 64 MiB (with a warning for SequenceFiles close to 64 MiB), but for files larger than 64 MiB it fails with a libprotobuf error. Here is the full log:
14/05/05 18:18:00 INFO MesosSchedulerBackend: Registered as framework ID 201404231353-1315739402-5050-26649-0091
14/05/05 18:18:12 INFO SequenceFileRDDFunctions: Saving as sequence file of type (LongWritable,BytesWritable)
14/05/05 18:18:14 INFO SparkContext: Starting job: saveAsSequenceFile at XXXXX.scala:171
14/05/05 18:18:14 INFO DAGScheduler: Got job 0 (saveAsSequenceFile at XXXXX.scala:171) with 1 output partitions (allowLocal=false)
14/05/05 18:18:14 INFO DAGScheduler: Final stage: Stage 0 (saveAsSequenceFile at XXXXX.scala:171)
14/05/05 18:18:14 INFO DAGScheduler: Parents of final stage: List()
14/05/05 18:18:14 INFO DAGScheduler: Missing parents: List()
14/05/05 18:18:14 INFO DAGScheduler: Submitting Stage 0 (ParallelCollectionRDD[0] at makeRDD at XXXXX.scala:170), which has no missing parents
14/05/05 18:18:19 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (ParallelCollectionRDD[0] at makeRDD at XXXXX.scala:170)
14/05/05 18:18:19 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/05/05 18:18:19 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 201404231353-1315739402-5050-26649-3: dn-04 (PROCESS_LOCAL)
14/05/05 18:18:23 INFO TaskSetManager: Serialized task 0.0:0 as 113006452 bytes in 3890 ms
[libprotobuf ERROR google/protobuf/io/coded_stream.cc:171] A protocol message was rejected because it was too big (more than 67108864 bytes). To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
F0505 18:18:24.616025 27889 construct.cpp:48] Check failed: parsed Unexpected failure while parsing protobuf
*** Check failure stack trace: ***
    @ 0x7fc8d49ba96d  google::LogMessage::Fail()
    @ 0x7fc8d49be987  google::LogMessage::SendToLog()
    @ 0x7fc8d49bc809  google::LogMessage::Flush()
    @ 0x7fc8d49bcb0d  google::LogMessageFatal::~LogMessageFatal()

The code is fairly simple:

    val kv = <large Seq of Key Value pairs>
    // set parallelism to 1 to keep the file from being partitioned
    sc.makeRDD(kv, 1)
      .saveAsSequenceFile(path)

Does anyone have any pointers on how to get past this?

Thanks,

--
Allen Lee
Software Engineer
MediaCrossing Inc.
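[Not part of the original question, just a hedged sketch of a possible workaround.] The log shows the single task was serialized to 113,006,452 bytes, which exceeds the 67,108,864-byte (64 MiB) protobuf message limit that the Mesos framework enforces when the task is shipped, so one plausible fix is to avoid packing the whole Seq into a single task: split it across many partitions (so each serialized task stays small) and merge to one partition with a shuffle only at write time. The helper name `writeSingleSequenceFile` and the slice count of 64 are assumptions for illustration, not anything from the original post:

```scala
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.spark.SparkContext

// Sketch: kv is the large in-memory Seq from the question.
// Spreading it over many slices keeps each serialized task well under
// the 64 MiB protobuf limit; coalesce(1, shuffle = true) then merges
// the data into a single partition via a shuffle at write time instead
// of shipping the entire collection inside one task.
def writeSingleSequenceFile(sc: SparkContext,
                            kv: Seq[(LongWritable, BytesWritable)],
                            path: String): Unit = {
  sc.makeRDD(kv, numSlices = 64)   // many small tasks, ~1/64 of the data each
    .coalesce(1, shuffle = true)   // merge to one partition through a shuffle
    .saveAsSequenceFile(path)      // single SequenceFile at `path`
}
```

This keeps the "one large output file" property the comment in the original snippet was after, while the data reaches the writer through the shuffle machinery rather than through task serialization.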