It's funny, but if you look wayyyy in the past, I actually asked a bunch of
questions that circled around, literally, this exact problem.

Dmitriy and Prahsant are correct: the best way is to make a UDF that can do
the lookup really efficiently. This is what the maxmind API does, for
example.

2011/12/13 Prashant Kommireddi <[email protected]>

> I am lost when you say "If enumerate every IP, it will be more than
> 100000000 single IPs"
>
> If each bag is a collection of 30000 tuples it might not be too bad on the
> memory if you used Tuple to store segments instead?
>
> (8 bytes long + 8 bytes long + 20 bytes for chararray ) = 36
> Lets say we incur an additional overhead 4X times this, which is ~160 bytes
> per tuple.
> Total per Bag = 30000 X 160 = ~5 MB
>
> You could probably store the ipsegments as Tuple and test it on your
> servers.
>
>
> On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy <[email protected]>
> wrote:
>
> > Do you have many such bags or just one? If one, and you want to look up
> > many ups in it, might be more efficient to serialize this relation to
> hdfs,
> > and write a lookup udf that specifies the serialized data set as a file
> to
> > put in distributed cache. At init time, load up the file into memory,
> then
> > for every ip do the binary search in exec()
> >
> > On Dec 13, 2011, at 7:55 PM, 唐亮 <[email protected]> wrote:
> >
> > > Thank you all!
> > >
> > > The detail is:
> > > A bag contains many "IP Segments", whose schema is (ipStart:long,
> > > ipEnd:long, locName:chararray) and the number of tuples is about 30000,
> > > and I want to check wheather an IP is belong to one segment in the bag.
> > >
> > > I want to order the "IP Segments" by (ipStart, ipEnd) in MR,
> > > and then binary search wheather an IP is in the bag in UDF.
> > >
> > > If enumerate every IP, it will be more than 100000000 single IPs,
> > > I think it will also be time consuming by JOIN in PIG.
> > >
> > > Please help me how can I deal with it efficiently!
> > >
> > >
> > > 2011/12/14 Thejas Nair <[email protected]>
> > >
> > >> My assumption is that 唐亮 is trying to do binary search on bags within
> > the
> > >> tuples in a relation (ie schema of the relation has a bag column). I
> > don't
> > >> think he is trying to treat the entire relation as one bag and do
> binary
> > >> search on that.
> > >>
> > >>
> > >> -Thejas
> > >>
> > >>
> > >>
> > >> On 12/13/11 2:30 PM, Andrew Wells wrote:
> > >>
> > >>> I don't think this could be done,
> > >>>
> > >>> pig is just a hadoop job, and the idea behind hadoop is to read all
> the
> > >>> data in a file.
> > >>>
> > >>> so by the time you put all the data into an array, you would have
> been
> > >>> better off just checking each element for the one you were looking
> for.
> > >>>
> > >>> So what you would get is [n + lg (n)], which will just be [n] after
> > >>> putting
> > >>> that into an array.
> > >>> Second, hadoop is all about large data analysis, usually more than
> > 100GB,
> > >>> so putting this into memory is out of the question.
> > >>> Third, hadoop is efficient because it processes this large amount of
> > data
> > >>> by splitting it up into multiple processes. To do an efficient binary
> > >>> search, you would need do this in one mapper or one reducer.
> > >>>
> > >>> My opinion is just don't fight hadoop/pig.
> > >>>
> > >>>
> > >>>
> > >>> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair<[email protected]>
> > >>> wrote:
> > >>>
> > >>> Bags can be very large might not fit into memory, and in such cases
> > some
> > >>>> or all of the bag might have to be stored on disk. In such cases, it
> > is
> > >>>> not
> > >>>> efficient to do random access on the bag. That is why the DataBag
> > >>>> interface
> > >>>> does not support it.
> > >>>>
> > >>>> As Prashant suggested, storing it in a tuple would be a good
> > alternative,
> > >>>> if you want to have random access to do binary search.
> > >>>>
> > >>>> -Thejas
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 12/12/11 7:54 PM, 唐亮 wrote:
> > >>>>
> > >>>> Hi all,
> > >>>>> How can I implement a binary search in pig?
> > >>>>>
> > >>>>> In one relation, there exists a bag whose items are sorted.
> > >>>>> And I want to check there exists a specific item in the bag.
> > >>>>>
> > >>>>> In UDF, I can't random access items in DataBag container.
> > >>>>> So I have to transfer the items in DataBag to an ArrayList, and
> this
> > is
> > >>>>> time consuming.
> > >>>>>
> > >>>>> How can I implement the binary search efficiently in pig?
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
>

Reply via email to