Now, I didn't use HBase, so, maybe I can't use DistributedCache. And if FLATTEN DataBag, the results are Tuples, then in UDF I can process only one Tuple, which can't implement BinarySearch.
So, please help and show me the detailed solution. Thanks! 在 2011年12月14日 下午5:59,唐亮 <[email protected]>写道: > Hi Prashant Kommireddi, > > If I do 1. and 2. as you mentioned, > the schema will be {tag, ipStart, ipEnd, locName}. > > BUT, how should I write the UDF, especially how should I set the type of > the input parameter? > > Currently, the UDF codes are as below, whose input parameter is DataBag: > > public class GetProvinceNameFromIPNum extends EvalFunc<String> { > > public String exec(Tuple input) throws IOException { > if (input == null || input.size() == 0) > return UnknownIP; > if (input.size() != 2) { > throw new IOException("Expected input's size is 2, but is: " + > input.size()); > } > > Object o1 = input.get(0); * // This should be the IP you want to > look up* > if (!(o1 instanceof Long)) { > throw new IOException("Expected input 1 to be Long, but got " > + o1.getClass().getName()); > } > Object o2 = input.get(1); *// This is the Bag of IP segs* > if (!(o2 instanceof *DataBag*)) { //* Should I change it to "(o2 > instanceof Tuple)"?* > throw new IOException("Expected input 2 to be DataBag, but got > " > + o2.getClass().getName()); > } > > ........... other codes ........... > } > > } > > > > 在 2011年12月14日 下午3:16,Prashant Kommireddi <[email protected]>写道: > > Seems like at the end of this you have a Single bag with all the elements, >> and somehow you would like to check whether an element exists in it based >> on ipstart/end. >> >> >> 1. Use FLATTEN http://pig.apache.org/docs/r0.9.1/basic.html#flatten - >> this will convert the Bag to Tuple: to_tuple = FOREACH order_ip_segs >> GENERATE tag, FLATTEN(order_seq); ---- This is O(n) >> 2. Now write a UDF that can access the elements positionally for the >> BinarySearch >> 3. Dmitriy and Jonathan's ideas with DistributedCache could perform >> better than the above approach, so you could go down that route too. >> >> >> 2011/12/13 唐亮 <[email protected]> >> >> > The detailed PIG codes are as below: >> > >> > raw_ip_segment = load ... >> > ip_segs = foreach raw_ip_segment generate ipstart, ipend, name; >> > group_ip_segs = group ip_segs all; >> > >> > order_ip_segs = foreach group_ip_segs { >> > order_seg = order ip_segs by ipstart, ipend; >> > generate 't' as tag, order_seg; >> > } >> > describe order_ip_segs >> > order_ip_segs: {tag: chararray,order_seg: {ipstart: long,ipend: >> long,poid: >> > chararray}} >> > >> > Here, the order_ip_segs::order_seg is a BAG, >> > how can I transer it to a TUPLE? >> > >> > And can I access the TUPLE randomly in UDF? >> > >> > 在 2011年12月14日 下午2:41,唐亮 <[email protected]>写道: >> > >> > > Then how can I transfer all the items in Bag to a Tuple? >> > > >> > > >> > > 2011/12/14 Jonathan Coveney <[email protected]> >> > > >> > >> It's funny, but if you look wayyyy in the past, I actually asked a >> bunch >> > >> of >> > >> questions that circled around, literally, this exact problem. >> > >> >> > >> Dmitriy and Prahsant are correct: the best way is to make a UDF that >> can >> > >> do >> > >> the lookup really efficiently. This is what the maxmind API does, for >> > >> example. >> > >> >> > >> 2011/12/13 Prashant Kommireddi <[email protected]> >> > >> >> > >> > I am lost when you say "If enumerate every IP, it will be more than >> > >> > 100000000 single IPs" >> > >> > >> > >> > If each bag is a collection of 30000 tuples it might not be too >> bad on >> > >> the >> > >> > memory if you used Tuple to store segments instead? >> > >> > >> > >> > (8 bytes long + 8 bytes long + 20 bytes for chararray ) = 36 >> > >> > Lets say we incur an additional overhead 4X times this, which is >> ~160 >> > >> bytes >> > >> > per tuple. >> > >> > Total per Bag = 30000 X 160 = ~5 MB >> > >> > >> > >> > You could probably store the ipsegments as Tuple and test it on >> your >> > >> > servers. >> > >> > >> > >> > >> > >> > On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy < >> [email protected]> >> > >> > wrote: >> > >> > >> > >> > > Do you have many such bags or just one? If one, and you want to >> look >> > >> up >> > >> > > many ups in it, might be more efficient to serialize this >> relation >> > to >> > >> > hdfs, >> > >> > > and write a lookup udf that specifies the serialized data set as >> a >> > >> file >> > >> > to >> > >> > > put in distributed cache. At init time, load up the file into >> > memory, >> > >> > then >> > >> > > for every ip do the binary search in exec() >> > >> > > >> > >> > > On Dec 13, 2011, at 7:55 PM, 唐亮 <[email protected]> wrote: >> > >> > > >> > >> > > > Thank you all! >> > >> > > > >> > >> > > > The detail is: >> > >> > > > A bag contains many "IP Segments", whose schema is >> (ipStart:long, >> > >> > > > ipEnd:long, locName:chararray) and the number of tuples is >> about >> > >> 30000, >> > >> > > > and I want to check wheather an IP is belong to one segment in >> the >> > >> bag. >> > >> > > > >> > >> > > > I want to order the "IP Segments" by (ipStart, ipEnd) in MR, >> > >> > > > and then binary search wheather an IP is in the bag in UDF. >> > >> > > > >> > >> > > > If enumerate every IP, it will be more than 100000000 single >> IPs, >> > >> > > > I think it will also be time consuming by JOIN in PIG. >> > >> > > > >> > >> > > > Please help me how can I deal with it efficiently! >> > >> > > > >> > >> > > > >> > >> > > > 2011/12/14 Thejas Nair <[email protected]> >> > >> > > > >> > >> > > >> My assumption is that 唐亮 is trying to do binary search on bags >> > >> within >> > >> > > the >> > >> > > >> tuples in a relation (ie schema of the relation has a bag >> > column). >> > >> I >> > >> > > don't >> > >> > > >> think he is trying to treat the entire relation as one bag >> and do >> > >> > binary >> > >> > > >> search on that. >> > >> > > >> >> > >> > > >> >> > >> > > >> -Thejas >> > >> > > >> >> > >> > > >> >> > >> > > >> >> > >> > > >> On 12/13/11 2:30 PM, Andrew Wells wrote: >> > >> > > >> >> > >> > > >>> I don't think this could be done, >> > >> > > >>> >> > >> > > >>> pig is just a hadoop job, and the idea behind hadoop is to >> read >> > >> all >> > >> > the >> > >> > > >>> data in a file. >> > >> > > >>> >> > >> > > >>> so by the time you put all the data into an array, you would >> > have >> > >> > been >> > >> > > >>> better off just checking each element for the one you were >> > looking >> > >> > for. >> > >> > > >>> >> > >> > > >>> So what you would get is [n + lg (n)], which will just be [n] >> > >> after >> > >> > > >>> putting >> > >> > > >>> that into an array. >> > >> > > >>> Second, hadoop is all about large data analysis, usually more >> > than >> > >> > > 100GB, >> > >> > > >>> so putting this into memory is out of the question. >> > >> > > >>> Third, hadoop is efficient because it processes this large >> > amount >> > >> of >> > >> > > data >> > >> > > >>> by splitting it up into multiple processes. To do an >> efficient >> > >> binary >> > >> > > >>> search, you would need do this in one mapper or one reducer. >> > >> > > >>> >> > >> > > >>> My opinion is just don't fight hadoop/pig. >> > >> > > >>> >> > >> > > >>> >> > >> > > >>> >> > >> > > >>> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair< >> > >> [email protected]> >> > >> > > >>> wrote: >> > >> > > >>> >> > >> > > >>> Bags can be very large might not fit into memory, and in such >> > >> cases >> > >> > > some >> > >> > > >>>> or all of the bag might have to be stored on disk. In such >> > >> cases, it >> > >> > > is >> > >> > > >>>> not >> > >> > > >>>> efficient to do random access on the bag. That is why the >> > DataBag >> > >> > > >>>> interface >> > >> > > >>>> does not support it. >> > >> > > >>>> >> > >> > > >>>> As Prashant suggested, storing it in a tuple would be a good >> > >> > > alternative, >> > >> > > >>>> if you want to have random access to do binary search. >> > >> > > >>>> >> > >> > > >>>> -Thejas >> > >> > > >>>> >> > >> > > >>>> >> > >> > > >>>> >> > >> > > >>>> On 12/12/11 7:54 PM, 唐亮 wrote: >> > >> > > >>>> >> > >> > > >>>> Hi all, >> > >> > > >>>>> How can I implement a binary search in pig? >> > >> > > >>>>> >> > >> > > >>>>> In one relation, there exists a bag whose items are sorted. >> > >> > > >>>>> And I want to check there exists a specific item in the >> bag. >> > >> > > >>>>> >> > >> > > >>>>> In UDF, I can't random access items in DataBag container. >> > >> > > >>>>> So I have to transfer the items in DataBag to an ArrayList, >> > and >> > >> > this >> > >> > > is >> > >> > > >>>>> time consuming. >> > >> > > >>>>> >> > >> > > >>>>> How can I implement the binary search efficiently in pig? >> > >> > > >>>>> >> > >> > > >>>>> >> > >> > > >>>>> >> > >> > > >>>> >> > >> > > >>> >> > >> > > >> >> > >> > > >> > >> > >> > >> >> > > >> > > >> > >> > >
