How are you storing segments in a Bag? Can you forward the script?

2011/12/13 唐亮 <[email protected]>
> Then how can I transfer all the items in the Bag to a Tuple?
>
> 2011/12/14 Jonathan Coveney <[email protected]>
>
> > It's funny, but if you look wayyyy in the past, I actually asked a bunch
> > of questions that circled around, literally, this exact problem.
> >
> > Dmitriy and Prashant are correct: the best way is to make a UDF that can
> > do the lookup really efficiently. This is what the MaxMind API does, for
> > example.
> >
> > 2011/12/13 Prashant Kommireddi <[email protected]>
> >
> > > I am lost when you say "If enumerate every IP, it will be more than
> > > 100000000 single IPs".
> > >
> > > If each bag is a collection of 30000 tuples, it might not be too bad
> > > on memory if you used a Tuple to store the segments instead.
> > >
> > > (8 bytes long + 8 bytes long + 20 bytes for chararray) = 36 bytes.
> > > Let's say we incur an additional overhead of 4X this, which is ~160
> > > bytes per tuple.
> > > Total per Bag = 30000 x 160 = ~5 MB.
> > >
> > > You could probably store the IP segments as a Tuple and test it on
> > > your servers.
> > >
> > > On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > >
> > > > Do you have many such bags or just one? If one, and you want to
> > > > look up many IPs in it, it might be more efficient to serialize
> > > > this relation to HDFS, and write a lookup UDF that specifies the
> > > > serialized data set as a file to put in the distributed cache. At
> > > > init time, load up the file into memory, then for every IP do the
> > > > binary search in exec().
> > > >
> > > > On Dec 13, 2011, at 7:55 PM, 唐亮 <[email protected]> wrote:
> > > >
> > > > > Thank you all!
> > > > >
> > > > > The details are:
> > > > > A bag contains many "IP segments", whose schema is (ipStart:long,
> > > > > ipEnd:long, locName:chararray), and the number of tuples is about
> > > > > 30000. I want to check whether an IP belongs to one segment in
> > > > > the bag.
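For reference, the exec()-time half of the approach Dmitriy describes above can be sketched in plain Java. This is only an illustrative sketch, not Pig API code: the class and field names are made up, and it assumes the segment arrays are already sorted ascending by ipStart. The search finds the last segment whose ipStart <= ip, then confirms ip <= ipEnd.

```java
// Sketch of the binary-search lookup over sorted IP segments.
// Class/field names are illustrative, not part of any Pig API.
public class IpSegmentIndex {
    private final long[] starts;   // must be sorted ascending
    private final long[] ends;
    private final String[] names;

    public IpSegmentIndex(long[] starts, long[] ends, String[] names) {
        this.starts = starts;
        this.ends = ends;
        this.names = names;
    }

    /** Returns the location name for ip, or null if no segment contains it. */
    public String lookup(long ip) {
        // Find the last index with starts[mid] <= ip.
        int lo = 0, hi = starts.length - 1, hit = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (starts[mid] <= ip) { hit = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        // The candidate segment contains ip only if ip is within its end bound.
        return (hit >= 0 && ip <= ends[hit]) ? names[hit] : null;
    }

    public static void main(String[] args) {
        IpSegmentIndex idx = new IpSegmentIndex(
            new long[]{100, 200, 300},
            new long[]{150, 250, 350},
            new String[]{"A", "B", "C"});
        System.out.println(idx.lookup(120)); // A
        System.out.println(idx.lookup(160)); // null (falls in a gap between segments)
        System.out.println(idx.lookup(300)); // C
    }
}
```

Inside a real EvalFunc, lookup() would be called from exec() once per input IP, after the segment arrays have been loaded once at init time.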
> > > > >
> > > > > I want to order the "IP segments" by (ipStart, ipEnd) in MR,
> > > > > and then binary search whether an IP is in the bag in a UDF.
> > > > >
> > > > > If I enumerate every IP, there will be more than 100000000
> > > > > single IPs, and I think a JOIN in Pig will also be time
> > > > > consuming.
> > > > >
> > > > > Please help me deal with this efficiently!
> > > > >
> > > > > 2011/12/14 Thejas Nair <[email protected]>
> > > > >
> > > > >> My assumption is that 唐亮 is trying to do binary search on bags
> > > > >> within the tuples in a relation (i.e. the schema of the relation
> > > > >> has a bag column). I don't think he is trying to treat the
> > > > >> entire relation as one bag and do binary search on that.
> > > > >>
> > > > >> -Thejas
> > > > >>
> > > > >> On 12/13/11 2:30 PM, Andrew Wells wrote:
> > > > >>
> > > > >>> I don't think this can be done.
> > > > >>>
> > > > >>> Pig is just a Hadoop job, and the idea behind Hadoop is to read
> > > > >>> all the data in a file.
> > > > >>>
> > > > >>> So by the time you put all the data into an array, you would
> > > > >>> have been better off just checking each element for the one you
> > > > >>> were looking for. What you would get is [n + lg(n)], which is
> > > > >>> just [n] after putting the data into an array.
> > > > >>> Second, Hadoop is all about large data analysis, usually more
> > > > >>> than 100GB, so putting this into memory is out of the question.
> > > > >>> Third, Hadoop is efficient because it processes this large
> > > > >>> amount of data by splitting it up into multiple processes. To
> > > > >>> do an efficient binary search, you would need to do this in one
> > > > >>> mapper or one reducer.
> > > > >>>
> > > > >>> My opinion is: just don't fight Hadoop/Pig.
> > > > >>>
> > > > >>> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair
> > > > >>> <[email protected]> wrote:
> > > > >>>
> > > > >>>> Bags can be very large and might not fit into memory, and in
> > > > >>>> such cases some or all of the bag might have to be stored on
> > > > >>>> disk. In such cases it is not efficient to do random access
> > > > >>>> on the bag. That is why the DataBag interface does not
> > > > >>>> support it.
> > > > >>>>
> > > > >>>> As Prashant suggested, storing it in a tuple would be a good
> > > > >>>> alternative if you want random access to do binary search.
> > > > >>>>
> > > > >>>> -Thejas
> > > > >>>>
> > > > >>>> On 12/12/11 7:54 PM, 唐亮 wrote:
> > > > >>>>
> > > > >>>>> Hi all,
> > > > >>>>> How can I implement a binary search in Pig?
> > > > >>>>>
> > > > >>>>> In one relation, there exists a bag whose items are sorted,
> > > > >>>>> and I want to check whether a specific item exists in the
> > > > >>>>> bag.
> > > > >>>>>
> > > > >>>>> In a UDF, I can't randomly access items in the DataBag
> > > > >>>>> container, so I have to transfer the items in the DataBag
> > > > >>>>> to an ArrayList, and this is time consuming.
> > > > >>>>>
> > > > >>>>> How can I implement the binary search efficiently in Pig?
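The init-time half of Dmitriy's suggestion (load the serialized segment file into memory once, before any lookups) could look roughly like the following plain-Java sketch. The class name, the tab-separated (ipStart, ipEnd, locName) file format, and the assumption that the file is already sorted by ipStart are all illustrative choices, not anything the thread specifies.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch: read a pre-sorted, tab-separated segment file (as it might be
// served from the distributed cache) into parallel arrays that a
// binary-search lookup can use. Names and format are assumptions.
public class SegmentLoader {
    public long[] starts;
    public long[] ends;
    public String[] names;

    public void load(String path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (!line.isEmpty()) lines.add(line);
            }
        }
        int n = lines.size();
        starts = new long[n];
        ends = new long[n];
        names = new String[n];
        for (int i = 0; i < n; i++) {
            // Each row: ipStart <TAB> ipEnd <TAB> locName
            String[] f = lines.get(i).split("\t");
            starts[i] = Long.parseLong(f[0]);
            ends[i] = Long.parseLong(f[1]);
            names[i] = f[2];
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo: write a tiny two-row file and load it back.
        java.io.File f = java.io.File.createTempFile("segments", ".tsv");
        try (java.io.PrintWriter w = new java.io.PrintWriter(f)) {
            w.println("100\t150\tA");
            w.println("200\t250\tB");
        }
        SegmentLoader loader = new SegmentLoader();
        loader.load(f.getPath());
        System.out.println(loader.names.length); // prints 2
    }
}
```

In a real Pig UDF this load would happen lazily on the first exec() call (or from a constructor), with the file path pointing at the distributed-cache symlink; ~30000 rows at ~160 bytes each is only a few MB, per Prashant's estimate, so holding it in memory is cheap.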
