Hi,

Arpan Ghosh wrote:
> Hi,
>
> How can I implement a custom MultipleOutputFormat and specify it as
> the output of my Spark job so that I can ensure that there is a unique
> output file per key (instead of a unique output file per reducer)?
>
I use something like this:

import java.io.IOException
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.mapred.{FileAlreadyExistsException, InvalidJobConfException, JobConf}
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class KeyBasedOutput[T >: Null, V <: AnyRef] extends MultipleTextOutputFormat[T, V] {

  // Route each record into a subdirectory named after its key,
  // keeping the task's leaf name (e.g. "part-00000") as the file name.
  override protected def generateFileNameForKeyValue(key: T, value: V, leaf: String) =
    key.toString + "/" + leaf

  // Drop the key from the output records; only values are written.
  override protected def generateActualKey(key: T, value: V) = null

  // This could be dangerous and overwrite files: skipping the output
  // spec check allows writing into an existing output directory.
  @throws(classOf[FileAlreadyExistsException])
  @throws(classOf[InvalidJobConfException])
  @throws(classOf[IOException])
  override def checkOutputSpecs(ignored: FileSystem, job: JobConf) = {}
}

and then just set a jobconf:

val jobConf = new JobConf(self.context.hadoopConfiguration)
jobConf.setOutputKeyClass(classOf[String])
jobConf.setOutputValueClass(classOf[String])
jobConf.setOutputFormat(classOf[KeyBasedOutput[String, String]])

rdd.saveAsHadoopDataset(jobConf)

/Rafal

> Thanks
>
> Arpan

--
Regards
Rafał Kwasny
mailto:/jabberid: m...@entropy.be
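To make the resulting file layout concrete, here is a small self-contained sketch of the path logic in generateFileNameForKeyValue (the key and leaf names are made up for illustration). Note that each task writes its own leaf file, so a key whose records are spread across several partitions still gets several files in its directory; partitioning the RDD by key first (e.g. with a HashPartitioner) is one way to keep each key's output in a single file.

```scala
object KeyPathDemo {
  // Mirrors generateFileNameForKeyValue above: the key becomes a
  // subdirectory and the task's leaf name becomes the file inside it.
  def pathFor(key: String, leaf: String): String = key + "/" + leaf

  def main(args: Array[String]): Unit = {
    // Hypothetical keys and leaf names, for illustration only.
    // Two tasks holding records for the same key target the same
    // key directory but different files:
    println(pathFor("2014-01-01", "part-00000")) // 2014-01-01/part-00000
    println(pathFor("2014-01-01", "part-00001")) // 2014-01-01/part-00001
    println(pathFor("2014-01-02", "part-00000")) // 2014-01-02/part-00000
  }
}
```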