Thanks.

So I take it there's no way in Pig to specify a custom partitioner and the
ordering in one MR step?

I don't think prebuilding HFiles is the best strategy in my case, because my job
is incremental (i.e. I am not replacing 100% of the data). However, it is
big enough that I don't want to create random writes.

But using a custom partitioner in the GROUP statement along with PARALLEL, and
somehow specifying the ordering as well, would probably be ideal.
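
To make the idea concrete, here is a minimal sketch of the range lookup such a
region-aware partitioner would perform. It is deliberately standalone: in a real
job this logic would sit in getPartition() of a subclass of
org.apache.hadoop.mapreduce.Partitioner, and the class would be named in the
PARTITION BY clause of the GROUP. The class name, the String keys, and the
hard-coded split keys are all hypothetical; HBase compares row keys as raw
bytes, so a real version would compare byte arrays and read region boundaries
from the table.

```java
import java.util.Arrays;

/**
 * Sketch only: maps a row key to the index of the region (and hence reducer)
 * whose key range contains it. Not tied to Hadoop or HBase classes.
 */
public class RegionRangePartitioner {
    // Sorted start keys of regions 1..n-1; region 0 starts at the empty key.
    private final String[] splitKeys;

    public RegionRangePartitioner(String[] sortedSplitKeys) {
        this.splitKeys = sortedSplitKeys.clone();
    }

    /** Returns the index of the region whose range contains rowKey. */
    public int getPartition(String rowKey) {
        int pos = Arrays.binarySearch(splitKeys, rowKey);
        // Hit: rowKey is exactly the start key of region pos + 1.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the
        // insertion point is the index of the containing region.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    /** One partition per region; this is the value PARALLEL would need to match. */
    public int numPartitions() {
        return splitKeys.length + 1;
    }
}
```

With split keys "g", "n", "t" this yields four partitions, and every key in a
given region lands in the same reducer. Whether the rows then reach the
StoreFunc in key order is the open question here.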

I wonder if a GROUP followed by an ORDER BY could translate into a
single MR job? I guess not, would it?



-d

On Mon, Jan 24, 2011 at 1:12 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Pushing this logic into the storefunc would force an MR boundary before the
> store (unless the StoreFunc passed, I suppose) which can make things overly
> complex.
>
> I think for the purposes of bulk-loading into HBase, a better approach might
> be to use the native map-reduce functionality and feed results you want to
> store into a map-reduce job created as per
>
> http://hbase.apache.org/docs/r0.20.6/api/org/apache/hadoop/hbase/mapreduce/package-summary.html
> (the bulk loading section).
>
> D
>
> On Mon, Jan 24, 2011 at 11:51 AM, Dmitriy Lyubimov <[email protected]
> >wrote:
>
> > Better yet, it would seem logical if partitioning and advice on the
> > number of partitions were somehow tailored to the StoreFunc. It stands to
> > reason that as long as we are not storing to HDFS, the StoreFunc is in the
> > best position to determine optimal save parameters such as order,
> > partitioning and parallelism.
> >
> > On Mon, Jan 24, 2011 at 11:47 AM, Dmitriy Lyubimov <[email protected]
> > >wrote:
> >
> > > Hi,
> > >
> > > So it seems to be more efficient, when storing to HBase, to partition by
> > > regions and order by HBase keys.
> > >
> > > I see that Pig 0.8 (PIG-282) added a custom partitioner in GROUP, but I
> > > am not sure if order is enforced there.
> > >
> > > Is there a way to run a single MR job that orders and partitions data as
> > > per above and uses an explicitly specified StoreFunc in reducers?
> > >
> > > Thank you.
> > >
> >
>
