Now the question is: How should I put all the "IP Segments" in one TUPLE?
Please help me!

2011/12/15 Prashant Kommireddi <[email protected]>:

Michael,

This would have no benefit over using a DistributedCache. For a large cluster this would mean poor performance. If the file is static and needs to be looked up across the cluster, DistributedCache would be a better approach.

Thanks,
Prashant

On Wed, Dec 14, 2011 at 11:18 AM, jiang licht <[email protected]> wrote:

If that list of IP pairs is fairly static and will be used frequently, maybe just copy it into HDFS with a high replication factor. Then use it as a lookup table, or as some binary tree or TreeMap kind of structure, by reading it from HDFS instead of using the distributed cache, if that sounds like an easier thing to do.

Best regards,
Michael

________________________________
From: Dmitriy Ryaboy <[email protected]>
To: [email protected]
Sent: Wednesday, December 14, 2011 10:28 AM
Subject: Re: Implement Binary Search in PIG

HBase has nothing to do with the distributed cache.

2011/12/14 唐亮 <[email protected]>:

For now I am not using HBase, so maybe I can't use DistributedCache.

And if I FLATTEN the DataBag, the results are Tuples; then in the UDF I can process only one Tuple at a time, which can't implement binary search.

So please help and show me a detailed solution. Thanks!

On Dec 14, 2011, at 5:59 PM, 唐亮 <[email protected]> wrote:

Hi Prashant Kommireddi,

If I do 1. and 2. as you mentioned, the schema will be {tag, ipStart, ipEnd, locName}.

BUT how should I write the UDF, and especially, how should I set the type of the input parameter?
Currently, the UDF code is as below; its input parameter is a DataBag:

public class GetProvinceNameFromIPNum extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return UnknownIP;
        if (input.size() != 2) {
            throw new IOException("Expected input's size is 2, but is: "
                    + input.size());
        }

        Object o1 = input.get(0); // This should be the IP you want to look up
        if (!(o1 instanceof Long)) {
            throw new IOException("Expected input 1 to be Long, but got "
                    + o1.getClass().getName());
        }
        Object o2 = input.get(1); // This is the Bag of IP segs
        if (!(o2 instanceof DataBag)) { // Should I change it to "(o2 instanceof Tuple)"?
            throw new IOException("Expected input 2 to be DataBag, but got "
                    + o2.getClass().getName());
        }

        // ........... other code ...........
    }

}

On Dec 14, 2011, at 3:16 PM, Prashant Kommireddi <[email protected]> wrote:

It seems like at the end of this you have a single bag with all the elements, and you would like to check whether an element exists in it based on ipstart/ipend.

1. Use FLATTEN (http://pig.apache.org/docs/r0.9.1/basic.html#flatten); this will convert the Bag to a Tuple: to_tuple = FOREACH order_ip_segs GENERATE tag, FLATTEN(order_seq); this is O(n).
2. Now write a UDF that can access the elements positionally for the binary search.
3. Dmitriy's and Jonathan's ideas with DistributedCache could perform better than the above approach, so you could go down that route too.
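Step 2 above boils down to a range binary search over segments sorted by (ipStart, ipEnd). Here is a minimal plain-Java sketch, assuming the segments are non-overlapping and already sorted; the class and method names are illustrative, and the Pig plumbing (unpacking the Tuple into these arrays) is omitted:

```java
// Illustrative sketch: binary search over IP ranges sorted by ipStart.
// Assumes segments do not overlap, so each IP falls in at most one range.
public class IpRangeLookup {
    private final long[] starts;
    private final long[] ends;
    private final String[] names;

    public IpRangeLookup(long[] starts, long[] ends, String[] names) {
        this.starts = starts; // must already be sorted ascending by start
        this.ends = ends;
        this.names = names;
    }

    /** Returns the location name for ip, or null if no segment contains it. */
    public String lookup(long ip) {
        int lo = 0, hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (ip < starts[mid]) {
                hi = mid - 1;
            } else if (ip > ends[mid]) {
                lo = mid + 1;
            } else {
                return names[mid]; // starts[mid] <= ip <= ends[mid]
            }
        }
        return null;
    }
}
```

In the UDF's exec(), the three arrays would be filled from the flattened tuple's fields before calling lookup() for each incoming IP.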
2011/12/13 唐亮 <[email protected]>:

The detailed Pig code is as below:

raw_ip_segment = load ...
ip_segs = foreach raw_ip_segment generate ipstart, ipend, name;
group_ip_segs = group ip_segs all;

order_ip_segs = foreach group_ip_segs {
    order_seg = order ip_segs by ipstart, ipend;
    generate 't' as tag, order_seg;
}

describe order_ip_segs
order_ip_segs: {tag: chararray, order_seg: {ipstart: long, ipend: long, poid: chararray}}

Here, order_ip_segs::order_seg is a BAG; how can I transfer it to a TUPLE?

And can I access the TUPLE randomly in a UDF?

On Dec 14, 2011, at 2:41 PM, 唐亮 <[email protected]> wrote:

Then how can I transfer all the items in the Bag to a Tuple?

2011/12/14 Jonathan Coveney <[email protected]>:

It's funny, but if you look way back in the past, I actually asked a bunch of questions that circled around literally this exact problem.

Dmitriy and Prashant are correct: the best way is to make a UDF that can do the lookup really efficiently. This is what the MaxMind API does, for example.

2011/12/13 Prashant Kommireddi <[email protected]>:

I am lost when you say "If enumerate every IP, it will be more than 100000000 single IPs".

If each bag is a collection of 30000 tuples, it might not be too bad on memory if you used a Tuple to store the segments instead.
(8 bytes long + 8 bytes long + 20 bytes for chararray) = 36 bytes.
Let's say we incur an additional overhead of roughly 4X this, which is ~160 bytes per tuple.
Total per bag = 30000 x 160 bytes = ~5 MB.

You could probably store the IP segments as a Tuple and test it on your servers.

On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy <[email protected]> wrote:

Do you have many such bags or just one? If one, and you want to look up many IPs in it, it might be more efficient to serialize this relation to HDFS and write a lookup UDF that specifies the serialized data set as a file to put in the distributed cache. At init time, load the file into memory; then, for every IP, do the binary search in exec().

On Dec 13, 2011, at 7:55 PM, 唐亮 <[email protected]> wrote:

Thank you all!

The details are: a bag contains many "IP segments", whose schema is (ipStart:long, ipEnd:long, locName:chararray). The number of tuples is about 30000, and I want to check whether an IP belongs to one of the segments in the bag.

I want to order the "IP segments" by (ipStart, ipEnd) in MR, and then binary-search in a UDF whether an IP is in the bag.
If I enumerate every IP, there will be more than 100000000 single IPs; I think it would also be time-consuming to do this with a JOIN in Pig.

Please help me deal with this efficiently!

2011/12/14 Thejas Nair <[email protected]>:

My assumption is that 唐亮 is trying to do a binary search on bags within the tuples of a relation (i.e., the schema of the relation has a bag column). I don't think he is trying to treat the entire relation as one bag and do a binary search on that.

-Thejas

On 12/13/11 2:30 PM, Andrew Wells wrote:

I don't think this can be done.

Pig is just a Hadoop job, and the idea behind Hadoop is to read all the data in a file.

So by the time you put all the data into an array, you would have been better off just checking each element for the one you were looking for. What you would get is [n + lg(n)], which will just be [n], after putting the data into an array.

Second, Hadoop is all about large data analysis, usually more than 100 GB, so putting this into memory is out of the question.
Third, Hadoop is efficient because it processes this large amount of data by splitting it up into multiple processes. To do an efficient binary search, you would need to do it in one mapper or one reducer.

My opinion is: just don't fight Hadoop/Pig.

On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair <[email protected]> wrote:

Bags can be very large and might not fit into memory; in such cases, some or all of the bag might have to be stored on disk, and it is not efficient to do random access on the bag. That is why the DataBag interface does not support it.

As Prashant suggested, storing the data in a tuple would be a good alternative if you want random access to do the binary search.

-Thejas

On 12/12/11 7:54 PM, 唐亮 wrote:

Hi all,

How can I implement a binary search in Pig?

In one relation there is a bag whose items are sorted, and I want to check whether a specific item exists in the bag.

In a UDF, I can't randomly access the items in a DataBag container.
So I have to transfer the items in the DataBag to an ArrayList, and this is time-consuming.

How can I implement the binary search efficiently in Pig?
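The lookup UDF that Dmitriy describes earlier in this thread has two phases: parse and sort the serialized segment file once at init time, then binary-search per record in exec(). Below is a minimal plain-Java sketch of such a table; the "ipStart,ipEnd,locName" line format and all names here are assumptions, and the DistributedCache/Pig plumbing is omitted:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch: build a sorted in-memory table once from the
// serialized segment lines, then answer each query with a binary search.
public class SegmentTable {
    private final long[] starts;
    private final long[] ends;
    private final String[] names;

    // Each line is assumed to look like "ipStart,ipEnd,locName".
    public SegmentTable(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(line.split(","));
        }
        // Sort once at load time so every later lookup is O(log n).
        rows.sort(Comparator.comparingLong((String[] r) -> Long.parseLong(r[0])));
        int n = rows.size();
        starts = new long[n];
        ends = new long[n];
        names = new String[n];
        for (int i = 0; i < n; i++) {
            starts[i] = Long.parseLong(rows.get(i)[0]);
            ends[i] = Long.parseLong(rows.get(i)[1]);
            names[i] = rows.get(i)[2];
        }
    }

    // Binary search over [ipStart, ipEnd] ranges; assumes non-overlapping segments.
    public String lookup(long ip) {
        int lo = 0, hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (ip < starts[mid]) hi = mid - 1;
            else if (ip > ends[mid]) lo = mid + 1;
            else return names[mid];
        }
        return null;
    }
}
```

A UDF following this pattern would construct one SegmentTable the first time exec() is called, reading the cached file line by line, and reuse it for every subsequent record.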
