Here is a super naive UDF (pseudocode). It assumes that you have the data
in HDFS, per Dmitriy's suggestion.
public MyUdf() {
Get data from distributed cache
load data into a TreeMap
}
public T exec(Tuple input) {
TreeMap.get(input.get(0));
}
and so on. You might want to lazily initialize the TreeMap, because hitting
the distributed cache and making the TreeMap is costly, so you only want to
do it on execution.
There's another slightly crazier way, which may be what people above
alluded to...if you know how many elements you have, you can make a binary
tree with an array of fixed size (where your root is at 0, and then it's
children are at 1 and 2, and their children at 3 through 7, and so on). So
you could actually construct a binary search tree as an array. This is a
bit more pain, but you'd be able to use it in various places, and Pig would
handle serialization.
2011/12/13 Prashant Kommireddi <[email protected]>
> How are you storing segments in a Bag? Can you forward the script.
>
> 2011/12/13 唐亮 <[email protected]>
>
> > Then how can I transfer all the items in Bag to a Tuple?
> >
> >
> > 2011/12/14 Jonathan Coveney <[email protected]>
> >
> > > It's funny, but if you look wayyyy in the past, I actually asked a
> bunch
> > of
> > > questions that circled around, literally, this exact problem.
> > >
> > > Dmitriy and Prahsant are correct: the best way is to make a UDF that
> can
> > do
> > > the lookup really efficiently. This is what the maxmind API does, for
> > > example.
> > >
> > > 2011/12/13 Prashant Kommireddi <[email protected]>
> > >
> > > > I am lost when you say "If enumerate every IP, it will be more than
> > > > 100000000 single IPs"
> > > >
> > > > If each bag is a collection of 30000 tuples it might not be too bad
> on
> > > the
> > > > memory if you used Tuple to store segments instead?
> > > >
> > > > (8 bytes long + 8 bytes long + 20 bytes for chararray ) = 36
> > > > Lets say we incur an additional overhead 4X times this, which is ~160
> > > bytes
> > > > per tuple.
> > > > Total per Bag = 30000 X 160 = ~5 MB
> > > >
> > > > You could probably store the ipsegments as Tuple and test it on your
> > > > servers.
> > > >
> > > >
> > > > On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy <[email protected]>
> > > > wrote:
> > > >
> > > > > Do you have many such bags or just one? If one, and you want to
> look
> > up
> > > > > many ups in it, might be more efficient to serialize this relation
> to
> > > > hdfs,
> > > > > and write a lookup udf that specifies the serialized data set as a
> > file
> > > > to
> > > > > put in distributed cache. At init time, load up the file into
> memory,
> > > > then
> > > > > for every ip do the binary search in exec()
> > > > >
> > > > > On Dec 13, 2011, at 7:55 PM, 唐亮 <[email protected]> wrote:
> > > > >
> > > > > > Thank you all!
> > > > > >
> > > > > > The detail is:
> > > > > > A bag contains many "IP Segments", whose schema is (ipStart:long,
> > > > > > ipEnd:long, locName:chararray) and the number of tuples is about
> > > 30000,
> > > > > > and I want to check wheather an IP is belong to one segment in
> the
> > > bag.
> > > > > >
> > > > > > I want to order the "IP Segments" by (ipStart, ipEnd) in MR,
> > > > > > and then binary search wheather an IP is in the bag in UDF.
> > > > > >
> > > > > > If enumerate every IP, it will be more than 100000000 single IPs,
> > > > > > I think it will also be time consuming by JOIN in PIG.
> > > > > >
> > > > > > Please help me how can I deal with it efficiently!
> > > > > >
> > > > > >
> > > > > > 2011/12/14 Thejas Nair <[email protected]>
> > > > > >
> > > > > >> My assumption is that 唐亮 is trying to do binary search on bags
> > > within
> > > > > the
> > > > > >> tuples in a relation (ie schema of the relation has a bag
> > column). I
> > > > > don't
> > > > > >> think he is trying to treat the entire relation as one bag and
> do
> > > > binary
> > > > > >> search on that.
> > > > > >>
> > > > > >>
> > > > > >> -Thejas
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On 12/13/11 2:30 PM, Andrew Wells wrote:
> > > > > >>
> > > > > >>> I don't think this could be done,
> > > > > >>>
> > > > > >>> pig is just a hadoop job, and the idea behind hadoop is to read
> > all
> > > > the
> > > > > >>> data in a file.
> > > > > >>>
> > > > > >>> so by the time you put all the data into an array, you would
> have
> > > > been
> > > > > >>> better off just checking each element for the one you were
> > looking
> > > > for.
> > > > > >>>
> > > > > >>> So what you would get is [n + lg (n)], which will just be [n]
> > after
> > > > > >>> putting
> > > > > >>> that into an array.
> > > > > >>> Second, hadoop is all about large data analysis, usually more
> > than
> > > > > 100GB,
> > > > > >>> so putting this into memory is out of the question.
> > > > > >>> Third, hadoop is efficient because it processes this large
> amount
> > > of
> > > > > data
> > > > > >>> by splitting it up into multiple processes. To do an efficient
> > > binary
> > > > > >>> search, you would need do this in one mapper or one reducer.
> > > > > >>>
> > > > > >>> My opinion is just don't fight hadoop/pig.
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair<
> > > [email protected]>
> > > > > >>> wrote:
> > > > > >>>
> > > > > >>> Bags can be very large might not fit into memory, and in such
> > cases
> > > > > some
> > > > > >>>> or all of the bag might have to be stored on disk. In such
> > cases,
> > > it
> > > > > is
> > > > > >>>> not
> > > > > >>>> efficient to do random access on the bag. That is why the
> > DataBag
> > > > > >>>> interface
> > > > > >>>> does not support it.
> > > > > >>>>
> > > > > >>>> As Prashant suggested, storing it in a tuple would be a good
> > > > > alternative,
> > > > > >>>> if you want to have random access to do binary search.
> > > > > >>>>
> > > > > >>>> -Thejas
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On 12/12/11 7:54 PM, 唐亮 wrote:
> > > > > >>>>
> > > > > >>>> Hi all,
> > > > > >>>>> How can I implement a binary search in pig?
> > > > > >>>>>
> > > > > >>>>> In one relation, there exists a bag whose items are sorted.
> > > > > >>>>> And I want to check there exists a specific item in the bag.
> > > > > >>>>>
> > > > > >>>>> In UDF, I can't random access items in DataBag container.
> > > > > >>>>> So I have to transfer the items in DataBag to an ArrayList,
> and
> > > > this
> > > > > is
> > > > > >>>>> time consuming.
> > > > > >>>>>
> > > > > >>>>> How can I implement the binary search efficiently in pig?
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
>