Hi,

Arpan Ghosh wrote:
> Hi,
>
> How can I implement a custom MultipleOutputFormat and specify it as
> the output of my Spark job so that I can ensure that there is a unique
> output file per key (instead of a unique output file per reducer)?
>
I use something like this:

import java.io.IOException
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.mapred.{FileAlreadyExistsException, InvalidJobConfException, JobConf}
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class KeyBasedOutput[T >: Null, V <: AnyRef] extends MultipleTextOutputFormat[T, V] {

  // Route each record into a subdirectory named after its key,
  // keeping the task's leaf name (e.g. "part-00000") as the file name.
  override protected def generateFileNameForKeyValue(key: T, value: V, leaf: String) =
    key.toString + "/" + leaf

  // Drop the key from the output records; only values are written.
  override protected def generateActualKey(key: T, value: V) = null

  // This could be dangerous and overwrite files: skipping the output
  // spec check allows writing into an existing output directory.
  @throws(classOf[FileAlreadyExistsException])
  @throws(classOf[InvalidJobConfException])
  @throws(classOf[IOException])
  override def checkOutputSpecs(ignored: FileSystem, job: JobConf) = {}
}

and then just set a jobconf:

val jobConf = new JobConf(self.context.hadoopConfiguration)
jobConf.setOutputKeyClass(classOf[String])
jobConf.setOutputValueClass(classOf[String])
jobConf.setOutputFormat(classOf[KeyBasedOutput[String, String]])

rdd.saveAsHadoopDataset(jobConf)

/Rafal

> Thanks
>
> Arpan

--
Regards
Rafał Kwasny
mailto:/jabberid: m...@entropy.be
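To make the resulting file layout concrete, here is a small self-contained sketch of the path logic in generateFileNameForKeyValue (the key and leaf names are made up for illustration). Note that each task writes its own leaf file, so a key whose records are spread across several partitions still gets several files in its directory; partitioning the RDD by key first (e.g. with a HashPartitioner) is one way to keep each key's output in a single file.

```scala
object KeyPathDemo {
  // Mirrors generateFileNameForKeyValue above: the key becomes a
  // subdirectory and the task's leaf name becomes the file inside it.
  def pathFor(key: String, leaf: String): String = key + "/" + leaf

  def main(args: Array[String]): Unit = {
    // Hypothetical keys and leaf names, for illustration only.
    // Two tasks holding records for the same key target the same
    // key directory but different files:
    println(pathFor("2014-01-01", "part-00000")) // 2014-01-01/part-00000
    println(pathFor("2014-01-01", "part-00001")) // 2014-01-01/part-00001
    println(pathFor("2014-01-02", "part-00000")) // 2014-01-02/part-00000
  }
}
```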