Thank you, Alan. Let me consider this for a moment.

-d
On Mon, Jan 24, 2011 at 2:26 PM, Alan Gates <[email protected]> wrote:

> Since Pig uses the partitioner to provide a total order (by which I mean an
> order across part files), we don't allow users to override the partitioner
> in that case. But I think what you want to do would be achievable if you
> have a UDF that maps the key to the region server you want it in and a
> custom partitioner that partitions based on the region server id generated
> by the udf:
>
> ...
> C = foreach B generate *, key_to_region_mapper(key) as region;
> D = group C by region partition using region_partitioner;
> E = foreach D {
>     E1 = order C by key;
>     generate flatten(E1);
> }
> F = store E into HBaseStorage();
>
> This will group by the region and partition by it (so each reducer can get
> one part file to turn into one hfile for hbase) and order the keys within
> that region's part file. The ordering will be done as a secondary sort in
> MR.
>
> The only issue I see here is that Pig isn't smart enough to realize that
> you don't need to pull the entire bag into memory in order to flatten it.
> Ideally it would realize this and just stream from the reduce iterator to
> the collect, but it won't. It will read everything off of the reduce
> iterator into memory (spilling if there is more than can fit) and then
> store it all to hbase.
>
> Alan.
>
> On Jan 24, 2011, at 2:06 PM, Dmitriy Lyubimov wrote:
>
>> i guess i want to order the groups. the grouping is actually irrelevant in
>> this case, it is only used for the sake of specifying custom partitioner
>> in the PARTITIONED BY clause.
>>
>> I guess what would really solve the problem is custom partitioner in the
>> ORDER BY. so using GROUP would just be a hack.
>>
>> On Mon, Jan 24, 2011 at 1:28 PM, Alan Gates <[email protected]> wrote:
>>
>>> Do you want to order the groups or just within the groups? If you want
>>> to order within the groups you can do that in Pig in a single job.
>>>
>>> Alan.
>>>
>>> On Jan 24, 2011, at 1:20 PM, Dmitriy Lyubimov wrote:
>>>
>>>> Thanks.
>>>>
>>>> So i take it there's no way in pig to specify a custom partitioner and
>>>> the ordering in one MR step?
>>>>
>>>> I don't think prebuilding HFILEs is the best strategy in my case, for my
>>>> job is incremental (i.e. i am not replacing 100% of the data). However,
>>>> it is big enough that i don't want to create random writes.
>>>>
>>>> but using custom partitioner in GROUP statement along with PARALLEL and
>>>> somehow specifying ordering as well would probably be ideal.
>>>>
>>>> i wonder if sequential spec of GROUP and ORDER BY could translate into a
>>>> single MR job? i guess not, would it?
>>>>
>>>> -d
>>>>
>>>> On Mon, Jan 24, 2011 at 1:12 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>>
>>>>> Pushing this logic into the storefunc would force an MR boundary before
>>>>> the store (unless the StoreFunc passed, I suppose) which can make
>>>>> things overly complex.
>>>>>
>>>>> I think for the purposes of bulk-loading into HBase, a better approach
>>>>> might be to use the native map-reduce functionality and feed results
>>>>> you want to store into a map-reduce job created as per
>>>>>
>>>>> http://hbase.apache.org/docs/r0.20.6/api/org/apache/hadoop/hbase/mapreduce/package-summary.html
>>>>> (the bulk loading section).
>>>>>
>>>>> D
>>>>>
>>>>> On Mon, Jan 24, 2011 at 11:51 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>
>>>>>> Better yet, it would seem logical if partitioning and advice on
>>>>>> partition #s were somehow tailored to a storefunc. It would stand to
>>>>>> reason that for as long as we are not storing to hdfs, the store func
>>>>>> is in the best position to determine optimal save parameters such as
>>>>>> order, partitioning and parallelism.
>>>>>>
>>>>>> On Mon, Jan 24, 2011 at 11:47 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> so it seems to be more efficient if storing to hbase partitions by
>>>>>>> regions and orders by hbase keys.
>>>>>>>
>>>>>>> I see that pig 0.8 (pig-282) added custom partitioner in a group but
>>>>>>> i am not sure if order is enforced there.
>>>>>>>
>>>>>>> Is there a way to run single MR that orders and partitions data as
>>>>>>> per above and uses an explicitly specified store func in reducers?
>>>>>>>
>>>>>>> Thank you.
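
For readers following the thread, the key-to-region mapping that Alan's reply relies on can be sketched outside Pig. The Python below is only an illustration, not a Pig UDF or a Hadoop partitioner: the region start keys and the names `REGION_START_KEYS`, `key_to_region` and `partition_and_sort` are hypothetical, and it assumes regions are described by sorted start keys, with the first region starting at the empty key. It shows a binary search from row key to region index, followed by grouping per region and sorting within each group.

```python
from bisect import bisect_right

# Hypothetical region start keys, sorted ascending (an assumption for this
# sketch). The first region is assumed to start at the empty key.
REGION_START_KEYS = [b"", b"g", b"n", b"t"]


def key_to_region(key, start_keys=REGION_START_KEYS):
    """Return the index of the region whose [start, next_start) range holds key."""
    # bisect_right counts the start keys that are <= key; the last of those
    # is the start key of the region that owns this row key.
    return bisect_right(start_keys, key) - 1


def partition_and_sort(keys, start_keys=REGION_START_KEYS):
    """Group keys by region index, then sort the keys within each region."""
    parts = {}
    for k in keys:
        parts.setdefault(key_to_region(k, start_keys), []).append(k)
    return {region: sorted(ks) for region, ks in parts.items()}
```

In the Pig script quoted above, the first function corresponds roughly to the role of `key_to_region_mapper`, while the grouping and per-region sort stand in for the `group ... by region` and nested `order` steps.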
