Hi Prashant Kommireddi,

If so, how should I write the UDF, especially the data types in the UDF?
2011/12/15 Prashant Kommireddi <[email protected]>

When you flatten your BAG, all your segments are within a single tuple. Something like

((tag, ipstart, ipend, loc), (tag, ipstart, ipend, loc) ... (tagN, ipstartN, ipendN, locN))

You can access the inner tuples positionally.

Sent from my iPhone

On Dec 14, 2011, at 6:28 PM, 唐亮 <[email protected]> wrote:

Now the question is: how should I put all the "IP Segments" in one TUPLE?

Please help me!

2011/12/15 Prashant Kommireddi <[email protected]>

Michael,

This would have no benefit over using a DistributedCache. For a large cluster this would mean poor performance. If the file is static and needs to be looked up across the cluster, DistributedCache would be a better approach.

Thanks,
Prashant

On Wed, Dec 14, 2011 at 11:18 AM, jiang licht <[email protected]> wrote:

If that list of IP pairs is pretty static most of the time and will be used frequently, maybe just copy it into HDFS with a high replication factor. Then use it as a lookup table, or some binary tree or TreeMap kind of thing, by reading it from HDFS instead of using the distributed cache, if that sounds like an easier thing to do.

Best regards,
Michael

________________________________
From: Dmitriy Ryaboy <[email protected]>
To: [email protected]
Sent: Wednesday, December 14, 2011 10:28 AM
Subject: Re: Implement Binary Search in PIG

HBase has nothing to do with the distributed cache.

2011/12/14 唐亮 <[email protected]>

Now, I don't use HBase, so maybe I can't use DistributedCache.

And if I FLATTEN the DataBag, the results are Tuples; then in the UDF I can process only one Tuple at a time, which can't implement BinarySearch.

So, please help and show me the detailed solution. Thanks!

On Dec 14, 2011, at 5:59 PM, 唐亮 <[email protected]> wrote:

Hi Prashant Kommireddi,

If I do 1. and 2. as you mentioned, the schema will be {tag, ipStart, ipEnd, locName}.

BUT, how should I write the UDF, especially how should I set the type of the input parameter?

Currently, the UDF code is as below, whose input parameter is a DataBag:

public class GetProvinceNameFromIPNum extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return UnknownIP;
        if (input.size() != 2) {
            throw new IOException("Expected input's size is 2, but is: "
                    + input.size());
        }

        Object o1 = input.get(0); // This should be the IP you want to look up
        if (!(o1 instanceof Long)) {
            throw new IOException("Expected input 1 to be Long, but got "
                    + o1.getClass().getName());
        }
        Object o2 = input.get(1); // This is the Bag of IP segs
        if (!(o2 instanceof DataBag)) { // Should I change it to "(o2 instanceof Tuple)"?
            throw new IOException("Expected input 2 to be DataBag, but got "
                    + o2.getClass().getName());
        }

        ........... other codes ...........
    }

}
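A minimal sketch of what the missing binary-search portion of such a UDF might look like, assuming the second argument really does arrive as a single Tuple whose fields are the inner (ipStart, ipEnd, locName) tuples, sorted by ipStart and non-overlapping. The class name IpSegmentLookup and the "unknown" return value are placeholders, not something from this thread:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Sketch only: assumes the UDF is called as IpSegmentLookup(ip, segs), where segs
// is one Tuple whose fields are the inner (ipStart, ipEnd, locName) tuples,
// sorted by ipStart and non-overlapping.
public class IpSegmentLookup extends EvalFunc<String> {

    private static final String UNKNOWN = "unknown"; // placeholder for "UnknownIP"

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() != 2 || input.get(0) == null) {
            return UNKNOWN;
        }
        long ip = (Long) input.get(0);
        Tuple segments = (Tuple) input.get(1);

        int lo = 0;
        int hi = segments.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Tuple seg = (Tuple) segments.get(mid); // one (ipStart, ipEnd, locName) segment
            long ipStart = (Long) seg.get(0);
            long ipEnd = (Long) seg.get(1);
            if (ip < ipStart) {
                hi = mid - 1;                      // look in the lower half
            } else if (ip > ipEnd) {
                lo = mid + 1;                      // look in the upper half
            } else {
                return (String) seg.get(2);        // ip falls inside this segment
            }
        }
        return UNKNOWN;                            // no segment contains this IP
    }
}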
On Dec 14, 2011, at 3:16 PM, Prashant Kommireddi <[email protected]> wrote:

Seems like at the end of this you have a single bag with all the elements, and somehow you would like to check whether an element exists in it based on ipstart/ipend.

1. Use FLATTEN (http://pig.apache.org/docs/r0.9.1/basic.html#flatten) - this will convert the Bag to a Tuple: to_tuple = FOREACH order_ip_segs GENERATE tag, FLATTEN(order_seg); ---- This is O(n)
2. Now write a UDF that can access the elements positionally for the BinarySearch.
3. Dmitriy and Jonathan's ideas with DistributedCache could perform better than the above approach, so you could go down that route too.

2011/12/13 唐亮 <[email protected]>

The detailed Pig code is as below:

raw_ip_segment = load ...
ip_segs = foreach raw_ip_segment generate ipstart, ipend, name;
group_ip_segs = group ip_segs all;

order_ip_segs = foreach group_ip_segs {
    order_seg = order ip_segs by ipstart, ipend;
    generate 't' as tag, order_seg;
}

describe order_ip_segs
order_ip_segs: {tag: chararray, order_seg: {ipstart: long, ipend: long, poid: chararray}}

Here, order_ip_segs::order_seg is a BAG; how can I transfer it to a TUPLE?

And can I access the TUPLE randomly in a UDF?

On Dec 14, 2011, at 2:41 PM, 唐亮 <[email protected]> wrote:

Then how can I transfer all the items in the Bag to a Tuple?

2011/12/14 Jonathan Coveney <[email protected]>

It's funny, but if you look wayyyy in the past, I actually asked a bunch of questions that circled around, literally, this exact problem.

Dmitriy and Prashant are correct: the best way is to make a UDF that can do the lookup really efficiently. This is what the maxmind API does, for example.

2011/12/13 Prashant Kommireddi <[email protected]>

I am lost when you say "If I enumerate every IP, there will be more than 100000000 single IPs".

If each bag is a collection of 30000 tuples, it might not be too bad on the memory if you used a Tuple to store the segments instead?

(8 bytes for a long + 8 bytes for a long + 20 bytes for the chararray) = 36 bytes
Let's say we incur an additional overhead of about 4X this, which is ~160 bytes per tuple.
Total per Bag = 30000 x 160 = ~5 MB

You could probably store the IP segments as a Tuple and test it on your servers.

On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy <[email protected]> wrote:

Do you have many such bags or just one? If one, and you want to look up many IPs in it, it might be more efficient to serialize this relation to HDFS, and write a lookup UDF that specifies the serialized data set as a file to put in the distributed cache. At init time, load up the file into memory, then for every IP do the binary search in exec().
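A rough sketch of the kind of lookup UDF Dmitriy describes, assuming the sorted segment file has already been made available on each task's local filesystem (for example via a distributed cache symlink; how that is wired up depends on your Pig and Hadoop versions). The class name, the constructor argument, and the tab-separated "ipStart, ipEnd, locName" file format are all assumptions for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Sketch only: assumes a local file of tab-separated "ipStart ipEnd locName"
// lines, already sorted by ipStart, reachable at the path given to the constructor.
public class IpLookupFromCache extends EvalFunc<String> {

    private final String localPath;
    private long[] starts;
    private long[] ends;
    private String[] names;

    public IpLookupFromCache(String localPath) {
        this.localPath = localPath;
    }

    // Load the whole file into sorted parallel arrays, once per task JVM.
    private void loadSegments() throws IOException {
        List<String> lines = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new FileReader(localPath));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } finally {
            reader.close();
        }
        starts = new long[lines.size()];
        ends = new long[lines.size()];
        names = new String[lines.size()];
        for (int i = 0; i < lines.size(); i++) {
            String[] parts = lines.get(i).split("\t");
            starts[i] = Long.parseLong(parts[0]);
            ends[i] = Long.parseLong(parts[1]);
            names[i] = parts[2];
        }
    }

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        if (starts == null) {
            loadSegments();               // lazy init on the first call
        }
        long ip = (Long) input.get(0);

        int lo = 0;
        int hi = starts.length - 1;
        while (lo <= hi) {                // binary search over the sorted segments
            int mid = (lo + hi) >>> 1;
            if (ip < starts[mid]) {
                hi = mid - 1;
            } else if (ip > ends[mid]) {
                lo = mid + 1;
            } else {
                return names[mid];
            }
        }
        return null;                      // IP not covered by any segment
    }
}

The local file name would typically be passed in through a DEFINE statement in the Pig script; the details of registering the file with the distributed cache vary by version, so treat the above as a starting point rather than a drop-in solution.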
On Dec 13, 2011, at 7:55 PM, 唐亮 <[email protected]> wrote:

Thank you all!

The details are: a bag contains many "IP Segments", whose schema is (ipStart:long, ipEnd:long, locName:chararray), and the number of tuples is about 30000. I want to check whether an IP belongs to one of the segments in the bag.

I want to order the "IP Segments" by (ipStart, ipEnd) in MR, and then binary search in a UDF for whether an IP is in the bag.

If I enumerate every IP, there will be more than 100000000 single IPs, and I think it will also be time consuming to do this with a JOIN in Pig.

Please help me figure out how I can deal with this efficiently!

2011/12/14 Thejas Nair <[email protected]>

My assumption is that 唐亮 is trying to do binary search on bags within the tuples of a relation (i.e., the schema of the relation has a bag column). I don't think he is trying to treat the entire relation as one bag and do binary search on that.

-Thejas

On 12/13/11 2:30 PM, Andrew Wells wrote:

I don't think this could be done.

Pig is just a Hadoop job, and the idea behind Hadoop is to read all the data in a file. So by the time you put all the data into an array, you would have been better off just checking each element for the one you were looking for. What you would get is [n + lg(n)], which is just [n] after putting the data into an array.

Second, Hadoop is all about large data analysis, usually more than 100GB, so putting this into memory is out of the question.

Third, Hadoop is efficient because it processes this large amount of data by splitting it up into multiple processes. To do an efficient binary search, you would need to do this in one mapper or one reducer.

My opinion is: just don't fight Hadoop/Pig.
On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair <[email protected]> wrote:

Bags can be very large and might not fit into memory, and in such cases some or all of the bag might have to be stored on disk. In such cases, it is not efficient to do random access on the bag. That is why the DataBag interface does not support it.

As Prashant suggested, storing it in a tuple would be a good alternative if you want random access to do a binary search.

-Thejas

On 12/12/11 7:54 PM, 唐亮 wrote:

Hi all,

How can I implement a binary search in Pig?

In one relation, there exists a bag whose items are sorted, and I want to check whether a specific item exists in the bag.

In a UDF, I can't randomly access items in the DataBag container, so I have to transfer the items in the DataBag to an ArrayList, and this is time consuming.

How can I implement the binary search efficiently in Pig?
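For completeness, a small sketch of the "store it in a tuple" alternative that Prashant and Thejas mention: a UDF that copies an already-sorted bag into a single Tuple, which a downstream UDF (such as the lookup sketched earlier) can then index into. The name SegmentBagToTuple is a placeholder, and the whole bag has to fit in memory for this to be practical:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Sketch only: copies the contents of an already-sorted bag (e.g. produced by
// ORDER inside a nested FOREACH) into one Tuple for positional access later.
public class SegmentBagToTuple extends EvalFunc<Tuple> {

    private static final TupleFactory TUPLE_FACTORY = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        DataBag bag = (DataBag) input.get(0);
        Tuple out = TUPLE_FACTORY.newTuple((int) bag.size());

        int i = 0;
        for (Tuple segment : bag) {   // iteration follows the bag's existing order
            out.set(i++, segment);    // each output field is one segment tuple
        }
        return out;
    }
}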
