Hey Surbhi, I think it's just a bug -- Crunch-on-Spark should be handling the partitioner stuff correctly w/o requiring you to write your own. I think the problem is that we set the location of the partition file (the one that the code is mad it can't find in your gist) inside of the GroupingOptions class, but we're not updating the Configuration object that the Spark job is going to use w/the location of that file in the same way we do on MapReduce. I'll file a bug for it and see if I can't come up w/a fix and unit test tomorrow.
Thanks!
Josh

On Wed, Aug 12, 2015 at 10:45 AM, Surbhi Mungre <[email protected]> wrote:

> I am converting a MRPipeline to SparkPipeline with these[1] instructions.
> My SparkPipeline fails with this[2] exception. In my pipeline I am trying
> to write to HBase using HFiles. IIUC the M/R job which creates HFiles uses a
> custom partitioner. I am not sure how Crunch translates this to Spark. From
> the exception stack trace it looks like Spark is using the M/R partitioner. I
> am completely new to Spark but I think I will have to create a custom Spark
> partitioner and use it instead. When I am converting a MRPipeline to
> SparkPipeline, if a M/R job uses a custom partitioner will Crunch handle it?
>
> [1] http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_running_crunch_with_spark.html
>
> [2] https://gist.github.com/anonymous/920c000f20229eaa76d8
>
> Thanks,
> Surbhi

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
