There's a very detailed write-up about this here: http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html
2011/12/18 唐亮:
> Prashant Kommireddi,
> Thank you very much!
> And your code seems cool, especially the usage of '*'.
>
> But I'm still not very sure about the details.
>
> My PIG scripts are as below:
>
> -- Load IP Segments
> raw_ip_segment = load ...
> ip_segs = foreach raw_ip_segment generate ipstart, ipend, name;
> group_ip_segs = group ip_segs all;
>
> order_ip_segs = foreach group_ip_segs {
>     order_seg = order ip_segs by ipstart, ipend;
>     generate 't' as tag, order_seg;
> }
> describe order_ip_segs
> order_ip_segs: {tag: chararray,order_seg: {ipstart: long,ipend: long,poid: chararray}}
>
> -- Load IP from LOG
> ip_log = load ...
> ip_tag = foreach ip_log generate 't' as tag, ip;
>
> -- Join by tag
> join_ip_tag = join order_ip_segs by tag, ip_tag by tag;
>
> retain_ip_segs = foreach join_ip_tag generate ip_tag::ip as ip, order_ip_segs::order_seg as order_seg;
> -- ip: the ip I want to look up;
> -- order_seg: ordered ip segments used for BinarySearch
>
> Can you show me the details of the following?
> Such as the code of the UDF, and the PIG script to call the UDF.
>
> On December 18, 2011 at 6:17 PM, Prashant Kommireddi wrote:
>
>> to_tuple = FOREACH order_ip_segs GENERATE tag, FLATTEN(order_seg);
>>
>> result = FOREACH to_tuple GENERATE GetProvinceNameFromIPNum(toSearch, *);
>>
>> 2011/12/18 唐亮
>>
>>> Prashant Kommireddi,
>>> How to call your UDF in PIG script?
>>>
>>> Thanks!
>>>
>>> On December 16, 2011 at 1:12 PM, 唐亮 wrote:
>>>
>>>> Thanks Prashant Kommireddi,
>>>>
>>>> But my question is:
>>>> How to call the UDF in PIG, especially the parameters to put into the UDF.
>>>>
>>>> On December 15, 2011 at 4:05 PM, Prashant Kommireddi wrote:
>>>>
>>>>> Not sure what you mean. Have you tried the code I forwarded? Are you facing any issues there?
>>>>>
>>>>> If your question is regarding binarySearch implementation, here is a pseudo-code'ish implementation. I have not tested this, please treat this as a general idea on how to go about accessing the elements within the Tuple.
>>>>>
>>>>> ALSO, I am assuming you have defined schema for (inner) Tuple contents.
>>>>>
>>>>> public String binarySearch(Tuple tuple, long toSearch, int low, int high) throws ExecException
>>>>> {
>>>>>     if(low > high)
>>>>>         return "NOT FOUND"; //Handle this the way you would like
>>>>>
>>>>>     if(tuple == null)
>>>>>         throw new IllegalArgumentException("Tuple is null"); //Handle this the way you would like
>>>>>
>>>>>     int mid = (low + high)/2;
>>>>>     Tuple midTuple = (Tuple) tuple.get(mid); //get() returns Object, so a cast is needed
>>>>>     String tag = midTuple.get(0).toString();
>>>>>     long ipstart = (Long)midTuple.get(1);
>>>>>     long ipend = (Long)midTuple.get(2);
>>>>>     String loc = midTuple.get(3).toString();
>>>>>
>>>>>     if(toSearch == ipstart) //Or ipend, I am not sure how you want to search
>>>>>     {
>>>>>         return loc;
>>>>>     }
>>>>>     else if(toSearch < ipstart)
>>>>>         return binarySearch(tuple, toSearch, low, mid - 1);
>>>>>
>>>>>     else
>>>>>         return binarySearch(tuple, toSearch, mid+1, high);
>>>>>
>>>>> }
>>>>>
>>>>> 2011/12/14 唐亮
>>>>>
>>>>>> Hi Prashant Kommireddi,
>>>>>>
>>>>>> If so, how should I write the UDF, especially the data types in UDF?
>>>>>>
>>>>>> 2011/12/15 Prashant Kommireddi
>>>>>>
>>>>>>> When you flatten your BAG all your segments are within a single tuple. Something like
>>>>>>>
>>>>>>> ((tag, ipstart, ipend, loc), (tag, ipstart, ipend, loc)...(tagN, ipstartN, ipendN, locN))
>>>>>>>
>>>>>>> You can access the inner tuples positionally.
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>> On Dec 14, 2011, at 6:28 PM, "唐亮" wrote:
>>>>>>>
>>>>>>>> Now the question is:
>>>>>>>> How should I put all the "IP Segments" in one TUPLE?
>>>>>>>>
>>>>>>>> Please help me!
>>>>>>>>
>>>>>>>> 2011/12/15 Prashant Kommireddi
>>>>>>>>
>>>>>>>>> Michael,
>>>>>>>>>
>>>>>>>>> This would have no benefit over using a DistributedCache. For a large cluster this would mean poor performance. If the file is static and needs to be looked-up across the cluster, DistributedCache would be a better approach.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Prashant
>>>>>>>>>
>>>>>>>>> On Wed, Dec 14, 2011 at 11:18 AM, jiang licht wrote:
>>>>>>>>>
>>>>>>>>>> If that list of ip pairs is pretty static most time and will be used frequently, maybe just copy it in hdfs with a high replication factor. Then use it as a look up table or some binary tree or treemap kind of thing by reading it from hdfs instead of using distributed cache if that sounds an easier thing to do.
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Michael
>>>>>>>>>>
>>>>>>>>>> ________________________________
>>>>>>>>>> From: Dmitriy Ryaboy
>>>>>>>>>> Sent: Wednesday, December 14, 2011 10:28 AM
>>>>>>>>>> Subject: Re: Implement Binary Search in PIG
>>>>>>>>>>
>>>>>>>>>> hbase has nothing to do with distributed cache.
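Inline note on the "treemap kind of thing" mentioned above: a minimal sketch, assuming the segments never overlap and the whole list fits in memory. The class SegmentTreeMapLookup and its methods are made up for illustration and are not part of Pig or Hadoop; only java.util.TreeMap is real API. Each segment is keyed by its start address, and floorEntry finds the closest segment starting at or before the IP.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Pig: an in-memory lookup table of IP segments.
class SegmentTreeMapLookup {
    // One segment; field names mirror the schema discussed in the thread.
    private static final class Segment {
        final long ipEnd;
        final String locName;

        Segment(long ipEnd, String locName) {
            this.ipEnd = ipEnd;
            this.locName = locName;
        }
    }

    // Segments keyed by ipStart. Assumption: segments do not overlap.
    private final TreeMap<Long, Segment> byStart = new TreeMap<>();

    void add(long ipStart, long ipEnd, String locName) {
        byStart.put(ipStart, new Segment(ipEnd, locName));
    }

    // Returns the location name of the segment containing ip, or null if none does.
    String lookup(long ip) {
        Map.Entry<Long, Segment> entry = byStart.floorEntry(ip); // greatest ipStart <= ip
        if (entry == null) {
            return null; // ip is below every segment start
        }
        Segment s = entry.getValue();
        return ip <= s.ipEnd ? s.locName : null; // inside the segment, or in a gap
    }
}
```

Both add and lookup cost O(log n), so with about 30000 segments each lookup touches only a handful of tree nodes.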
>>>>>>>>>>
>>>>>>>>>> 2011/12/14 唐亮
>>>>>>>>>>
>>>>>>>>>>> Now, I didn't use HBase,
>>>>>>>>>>> so, maybe I can't use DistributedCache.
>>>>>>>>>>>
>>>>>>>>>>> And if FLATTEN DataBag, the results are Tuples,
>>>>>>>>>>> then in UDF I can process only one Tuple, which can't implement BinarySearch.
>>>>>>>>>>>
>>>>>>>>>>> So, please help and show me the detailed solution.
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> On December 14, 2011 at 5:59 PM, 唐亮 wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Prashant Kommireddi,
>>>>>>>>>>>>
>>>>>>>>>>>> If I do 1. and 2. as you mentioned,
>>>>>>>>>>>> the schema will be {tag, ipStart, ipEnd, locName}.
>>>>>>>>>>>>
>>>>>>>>>>>> BUT, how should I write the UDF, especially how should I set the type of the input parameter?
>>>>>>>>>>>>
>>>>>>>>>>>> Currently, the UDF codes are as below, whose input parameter is DataBag:
>>>>>>>>>>>>
>>>>>>>>>>>> public class GetProvinceNameFromIPNum extends EvalFunc<String> {
>>>>>>>>>>>>
>>>>>>>>>>>>   public String exec(Tuple input) throws IOException {
>>>>>>>>>>>>     if (input == null || input.size() == 0)
>>>>>>>>>>>>       return UnknownIP;
>>>>>>>>>>>>     if (input.size() != 2) {
>>>>>>>>>>>>       throw new IOException("Expected input's size is 2, but is: " + input.size());
>>>>>>>>>>>>     }
>>>>>>>>>>>>
>>>>>>>>>>>>     Object o1 = input.get(0); // This should be the IP you want to look up
>>>>>>>>>>>>     if (!(o1 instanceof Long)) {
>>>>>>>>>>>>       throw new IOException("Expected input 1 to be Long, but got " + o1.getClass().getName());
>>>>>>>>>>>>     }
>>>>>>>>>>>>     Object o2 = input.get(1);
>>>>>>>>>>>>     // This is the Bag of IP segs
>>>>>>>>>>>>     if (!(o2 instanceof DataBag)) { // Should I change it to "(o2 instanceof Tuple)"?
>>>>>>>>>>>>       throw new IOException("Expected input 2 to be DataBag, but got " + o2.getClass().getName());
>>>>>>>>>>>>     }
>>>>>>>>>>>>
>>>>>>>>>>>>     ........... other codes ...........
>>>>>>>>>>>>   }
>>>>>>>>>>>>
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> On December 14, 2011 at 3:16 PM, Prashant Kommireddi wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Seems like at the end of this you have a Single bag with all the elements, and somehow you would like to check whether an element exists in it based on ipstart/end.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Use FLATTEN http://pig.apache.org/docs/r0.9.1/basic.html#flatten- this will convert the Bag to Tuple: to_tuple = FOREACH order_ip_segs GENERATE tag, FLATTEN(order_seg); ---- This is O(n)
>>>>>>>>>>>>> 2. Now write a UDF that can access the elements positionally for the BinarySearch
>>>>>>>>>>>>> 3. Dmitriy and Jonathan's ideas with DistributedCache could perform better than the above approach, so you could go down that route too.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2011/12/13 唐亮
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The detailed PIG codes are as below:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> raw_ip_segment = load ...
>>>>>>>>>>>>>> ip_segs = foreach raw_ip_segment generate ipstart, ipend, name;
>>>>>>>>>>>>>> group_ip_segs = group ip_segs all;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> order_ip_segs = foreach group_ip_segs {
>>>>>>>>>>>>>>     order_seg = order ip_segs by ipstart, ipend;
>>>>>>>>>>>>>>     generate 't' as tag, order_seg;
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> describe order_ip_segs
>>>>>>>>>>>>>> order_ip_segs: {tag: chararray,order_seg: {ipstart: long,ipend: long,poid: chararray}}
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here, the order_ip_segs::order_seg is a BAG,
>>>>>>>>>>>>>> how can I transfer it to a TUPLE?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And can I access the TUPLE randomly in UDF?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On December 14, 2011 at 2:41 PM, 唐亮 wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then how can I transfer all the items in Bag to a Tuple?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2011/12/14 Jonathan Coveney
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's funny, but if you look wayyyy in the past, I actually asked a bunch of questions that circled around, literally, this exact problem.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dmitriy and Prashant are correct: the best way is to make a UDF that can do the lookup really efficiently. This is what the maxmind API does, for example.
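Inline note: a rough sketch of the kind of efficient lookup described above, assuming the segments are sorted by ipstart, do not overlap, and are held in two parallel long arrays. IpRangeSearch and findSegment are hypothetical names, not Pig API. Unlike the equality test in the pseudo-code quoted in this thread, it checks containment (ipstart <= ip <= ipend), which seems to be what the original question needs.

```java
// Hypothetical helper, not part of Pig: binary search over sorted IP segments.
class IpRangeSearch {
    /**
     * Returns the index of the segment containing ip, or -1 if no segment contains it.
     * Assumptions: starts is sorted ascending, starts[i] <= ends[i],
     * and segments do not overlap.
     */
    static int findSegment(long[] starts, long[] ends, long ip) {
        int low = 0;
        int high = starts.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1; // midpoint without int overflow
            if (ip < starts[mid]) {
                high = mid - 1;           // ip lies before this segment
            } else if (ip > ends[mid]) {
                low = mid + 1;            // ip lies after this segment
            } else {
                return mid;               // starts[mid] <= ip <= ends[mid]
            }
        }
        return -1;
    }
}
```

The returned index can then be used to read the location name from a third parallel array.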
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2011/12/13 Prashant Kommireddi
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am lost when you say "If enumerate every IP, it will be more than 100000000 single IPs"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If each bag is a collection of 30000 tuples it might not be too bad on the memory if you used Tuple to store segments instead?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (8 bytes long + 8 bytes long + 20 bytes for chararray) = 36
>>>>>>>>>>>>>>>>> Let's say we incur an additional overhead 4X times this, which is ~160 bytes per tuple.
>>>>>>>>>>>>>>>>> Total per Bag = 30000 X 160 = ~5 MB
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You could probably store the ipsegments as Tuple and test it on your servers.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Do you have many such bags or just one? If one, and you want to look up many IPs in it, might be more efficient to serialize this relation to hdfs, and write a lookup udf that specifies the serialized data set as a file to put in distributed cache.
>>>>>>>>>>>>>>>>>> At init time, load up the file into memory, then for every ip do the binary search in exec()
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Dec 13, 2011, at 7:55 PM, 唐亮 wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you all!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The detail is:
>>>>>>>>>>>>>>>>>>> A bag contains many "IP Segments", whose schema is (ipStart:long, ipEnd:long, locName:chararray) and the number of tuples is about 30000,
>>>>>>>>>>>>>>>>>>> and I want to check whether an IP belongs to one segment in the bag.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I want to order the "IP Segments" by (ipStart, ipEnd) in MR,
>>>>>>>>>>>>>>>>>>> and then binary search whether an IP is in the bag in UDF.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> If enumerate every IP, it will be more than 100000000 single IPs,
>>>>>>>>>>>>>>>>>>> I think it will also be time consuming by JOIN in PIG.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Please help me how can I deal with it efficiently!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2011/12/14 Thejas Nair
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> My assumption is that 唐亮 is trying to do binary search on bags within the tuples in a relation (ie schema of the relation has a bag column).
>>>>>>>>>>>>>>>>>>>> I don't think he is trying to treat the entire relation as one bag and do binary search on that.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -Thejas
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 12/13/11 2:30 PM, Andrew Wells wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I don't think this could be done,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> pig is just a hadoop job, and the idea behind hadoop is to read all the data in a file.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> so by the time you put all the data into an array, you would have been better off just checking each element for the one you were looking for.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> So what you would get is [n + lg (n)], which will just be [n] after putting that into an array.
>>>>>>>>>>>>>>>>>>>>> Second, hadoop is all about large data analysis, usually more than 100GB, so putting this into memory is out of the question.
>>>>>>>>>>>>>>>>>>>>> Third, hadoop is efficient because it processes this large amount of data by splitting it up into multiple processes.
>>>>>>>>>>>>>>>>>>>>> To do an efficient binary search, you would need to do this in one mapper or one reducer.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> My opinion is just don't fight hadoop/pig.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Bags can be very large and might not fit into memory, and in such cases some or all of the bag might have to be stored on disk. In such cases, it is not efficient to do random access on the bag. That is why the DataBag interface does not support it.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> As Prashant suggested, storing it in a tuple would be a good alternative, if you want to have random access to do binary search.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -Thejas
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 12/12/11 7:54 PM, 唐亮 wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>> How can I implement a binary search in pig?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> In one relation, there exists a bag whose items are sorted.
>>>>>>>>>>>>>>>>>>>>>>> And I want to check whether there exists a specific item in the bag.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> In UDF, I can't random access items in DataBag container.
>>>>>>>>>>>>>>>>>>>>>>> So I have to transfer the items in DataBag to an ArrayList, and this is time consuming.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> How can I implement the binary search efficiently in pig?
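Closing note: Dmitriy's suggestion in this thread (load the file into memory once at init time, then binary search for every IP in exec()) might be sketched as below. This is plain Java with no Pig or Hadoop dependency, so it shows only the load-once and lookup steps; how the file reaches each task (distributed cache or HDFS) is left out. CachedSegmentTable, load and lookup are hypothetical names, and the tab-separated "ipStart, ipEnd, locName" line format, already sorted by ipStart and non-overlapping, is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Pig: segment table loaded once, then queried many times.
class CachedSegmentTable {
    private long[] starts;
    private long[] ends;
    private String[] names;

    // Reads lines of the assumed form "ipStart<TAB>ipEnd<TAB>locName", sorted by ipStart.
    void load(Reader source) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty()) {
                    rows.add(line.split("\t"));
                }
            }
        }
        starts = new long[rows.size()];
        ends = new long[rows.size()];
        names = new String[rows.size()];
        for (int i = 0; i < rows.size(); i++) {
            starts[i] = Long.parseLong(rows.get(i)[0]);
            ends[i] = Long.parseLong(rows.get(i)[1]);
            names[i] = rows.get(i)[2];
        }
    }

    // Returns the location name of the segment containing ip, or null if none does.
    String lookup(long ip) {
        if (starts == null) {
            throw new IllegalStateException("load() must be called before lookup()");
        }
        int i = Arrays.binarySearch(starts, ip);
        if (i < 0) {
            i = -i - 2; // index of the greatest start below ip, or -1 if there is none
        }
        return (i >= 0 && ip <= ends[i]) ? names[i] : null;
    }
}
```

In a UDF, load would be called lazily on the first exec() call and lookup on every call after that, so the file is parsed once per task rather than once per record.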
