Thank you, Alan. Let me consider this for a moment.

-d

On Mon, Jan 24, 2011 at 2:26 PM, Alan Gates <[email protected]> wrote:

> Since Pig uses the partitioner to provide a total order (by which I mean an
> order across part files), we don't allow users to override the partitioner
> in that case.  But I think what you want to do would be achievable if you
> have a UDF that maps the key to the region server you want it in and a
> custom partitioner that partitions based on the region server id generated
> by the udf:
>
> ...
> C = foreach B generate *, key_to_region_mapper(key) as region;
> D = group C by region partition by region_partitioner;
> E = foreach D {
>      E1 = order C by key;
>      generate flatten(E1);
> }
> F = store E into HBaseStorage();
>
> This will group by the region and partition by it (so each reducer can get
> one part file to turn into one hfile for hbase) and order the keys within
> that region's part file.  The ordering will be done as a secondary sort in
> MR.
>
> The only issue I see here is that Pig isn't smart enough to realize that
> you don't need to pull the entire bag into memory in order to flatten it.
> Ideally it would realize this and just stream from the reduce iterator to
> the collect, but it won't.  It will read everything off of the reduce
> iterator into memory (spilling if there is more than can fit) and then
> store it all to hbase.
>
> Alan.
>
>
> On Jan 24, 2011, at 2:06 PM, Dmitriy Lyubimov wrote:
>
>> i guess i want to order the groups. the grouping is actually irrelevant in
>> this case, it is only used for the sake of specifying custom partitioner in
>> the PARTITION BY clause.
>>
>> I guess what would really solve the problem is custom partitioner in the
>> ORDER BY. so using GROUP would just be a hack.
>>
>> On Mon, Jan 24, 2011 at 1:28 PM, Alan Gates <[email protected]> wrote:
>>
>>> Do you want to order the groups or just within the groups?  If you want to
>>> order within the groups you can do that in Pig in a single job.
>>>
>>> Alan.
>>>
>>>
>>> On Jan 24, 2011, at 1:20 PM, Dmitriy Lyubimov wrote:
>>>
>>>> Thanks.
>>>>
>>>> So i take it there's no way in pig to specify custom partitioner and the
>>>> ordering in one MR step?
>>>>
>>>> I don't think prebuilding HFILEs is the best strategy in my case, for my
>>>> job is incremental (i.e. i am not replacing 100% of the data). However, it
>>>> is big enough that i don't want to create random writes.
>>>>
>>>> but using custom partitioner in GROUP statement along with PARALLEL and
>>>> somehow specifying ordering as well would probably be ideal.
>>>>
>>>> i wonder if sequential spec of GROUP and ORDER BY could translate into a
>>>> single MR job? i guess not, would it?
>>>>
>>>>
>>>>
>>>> -d
>>>>
>>>> On Mon, Jan 24, 2011 at 1:12 PM, Dmitriy Ryaboy <[email protected]>
>>>> wrote:
>>>>
>>>>> Pushing this logic into the storefunc would force an MR boundary before
>>>>> the store (unless the StoreFunc passed, I suppose) which can make things
>>>>> overly complex.
>>>>>
>>>>> I think for the purposes of bulk-loading into HBase, a better approach
>>>>> might be to use the native map-reduce functionality and feed results you
>>>>> want to store into a map-reduce job created as per
>>>>>
>>>>> http://hbase.apache.org/docs/r0.20.6/api/org/apache/hadoop/hbase/mapreduce/package-summary.html
>>>>> (the bulk loading section).
>>>>>
>>>>> D
>>>>>
>>>>> On Mon, Jan 24, 2011 at 11:51 AM, Dmitriy Lyubimov <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Better yet, it would seem logical if partitioning and advice on
>>>>>> partition #s were somehow tailored to a storefunc. It would stand to
>>>>>> reason that for as long as we are not storing to hdfs, store func is in
>>>>>> the best position to determine optimal save parameters such as order,
>>>>>> partitioning and parallelism.
>>>>>>
>>>>>> On Mon, Jan 24, 2011 at 11:47 AM, Dmitriy Lyubimov <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> so it seems to be more efficient if storing to hbase partitions by
>>>>>>> regions and orders by hbase keys.
>>>>>>>
>>>>>>> I see that pig 0.8 (pig-282) added custom partitioner in a group but i
>>>>>>> am not sure if order is enforced there.
>>>>>>>
>>>>>>> Is there a way to run single MR that orders and partitions data as per
>>>>>>> above and uses an explicitly specified store func in reducers?
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>
