I am lost when you say "If I enumerate every IP, it will be more than
100,000,000 single IPs".

If each bag is a collection of 30,000 tuples, it might not be too bad on
memory if you used a Tuple to store the segments instead.

(8 bytes long + 8 bytes long + 20 bytes for chararray) = 36 bytes
Let's say we incur an additional overhead of roughly 4x that, which is ~160
bytes per tuple.
Total per bag = 30,000 x 160 bytes = ~5 MB

You could probably store the IP segments as a Tuple and test it on your
servers.
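For instance, here is a rough Java sketch of that layout and the binary
search over it. The class name and sample segments are made up for
illustration; a real UDF would pull the longs out of the Pig Tuple rather
than these flat arrays:

```java
// Sketch only: segments kept as flat sorted arrays, standing in for the
// fields of a Pig Tuple. Intervals are assumed sorted by ipStart and
// non-overlapping.
public class IpSegmentLookup {
    static final long[] STARTS = {10L, 100L, 1000L};
    static final long[] ENDS   = {20L, 200L, 2000L};
    static final String[] NAMES = {"CN", "US", "JP"};

    /** Binary-search for the segment containing ip; null if none. */
    public static String lookup(long ip) {
        int lo = 0, hi = STARTS.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (ip < STARTS[mid]) {
                hi = mid - 1;          // ip is left of this segment
            } else if (ip > ENDS[mid]) {
                lo = mid + 1;          // ip is right of this segment
            } else {
                return NAMES[mid];     // STARTS[mid] <= ip <= ENDS[mid]
            }
        }
        return null;                   // ip falls in no segment
    }
}
```

With 30,000 segments this is ~15 comparisons per IP instead of a linear
scan.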


On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Do you have many such bags or just one? If just one, and you want to look
> up many IPs in it, it might be more efficient to serialize this relation to
> HDFS and write a lookup UDF that specifies the serialized data set as a
> file to put in the distributed cache. At init time, load the file into
> memory, then do the binary search for every IP in exec().
>
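A minimal Java sketch of that load-once, search-per-call pattern, with the
Pig/Hadoop plumbing left out. The class and method names are illustrative;
in a real UDF, init would read the file from the distributed cache and
exec() would take a Tuple argument:

```java
import java.util.List;

// Sketch: parse serialized "ipStart\tipEnd\tlocName" lines once (assumed
// already sorted by ipStart, non-overlapping), then binary-search per call.
public class CachedIpLookup {
    private long[] starts, ends;
    private String[] names;

    /** One-time init: load the serialized segments into memory. */
    public void init(List<String> lines) {
        int n = lines.size();
        starts = new long[n]; ends = new long[n]; names = new String[n];
        for (int i = 0; i < n; i++) {
            String[] f = lines.get(i).split("\t");
            starts[i] = Long.parseLong(f[0]);
            ends[i]   = Long.parseLong(f[1]);
            names[i]  = f[2];
        }
    }

    /** Per-record call, as in exec(): binary search over the segments. */
    public String exec(long ip) {
        int lo = 0, hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (ip < starts[mid]) hi = mid - 1;
            else if (ip > ends[mid]) lo = mid + 1;
            else return names[mid];
        }
        return null;
    }
}
```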
> On Dec 13, 2011, at 7:55 PM, 唐亮 <[email protected]> wrote:
>
> > Thank you all!
> >
> > The detail is:
> > A bag contains many "IP segments", whose schema is (ipStart:long,
> > ipEnd:long, locName:chararray); the number of tuples is about 30,000,
> > and I want to check whether an IP belongs to one of the segments in the
> > bag.
> >
> > I want to order the "IP segments" by (ipStart, ipEnd) in MR,
> > and then binary-search in a UDF whether an IP is in the bag.
> >
> > If I enumerate every IP, there will be more than 100,000,000 single IPs,
> > and I think a JOIN in Pig would also be time-consuming.
> >
> > Please help me deal with this efficiently!
> >
> >
> > 2011/12/14 Thejas Nair <[email protected]>
> >
> >> My assumption is that 唐亮 is trying to do binary search on the bags
> >> within the tuples in a relation (i.e. the schema of the relation has a
> >> bag column). I don't think he is trying to treat the entire relation as
> >> one bag and do binary search on that.
> >>
> >>
> >> -Thejas
> >>
> >>
> >>
> >> On 12/13/11 2:30 PM, Andrew Wells wrote:
> >>
> >>> I don't think this can be done.
> >>>
> >>> Pig is just a Hadoop job, and the idea behind Hadoop is to read all the
> >>> data in a file.
> >>>
> >>> So by the time you have put all the data into an array, you would have
> >>> been better off just checking each element for the one you were looking
> >>> for: what you would get is [n + lg(n)], which is just [n], after
> >>> putting the data into an array.
> >>> Second, Hadoop is all about large data analysis, usually more than
> >>> 100 GB, so putting this into memory is out of the question.
> >>> Third, Hadoop is efficient because it processes this large amount of
> >>> data by splitting it up across multiple processes. To do an efficient
> >>> binary search, you would need to do it in one mapper or one reducer.
> >>>
> >>> My opinion: just don't fight Hadoop/Pig.
> >>>
> >>>
> >>>
> >>> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair<[email protected]>
> >>> wrote:
> >>>
> >>>> Bags can be very large and might not fit into memory, and in such
> >>>> cases some or all of the bag might have to be stored on disk. In such
> >>>> cases it is not efficient to do random access on the bag, which is why
> >>>> the DataBag interface does not support it.
> >>>>
> >>>> As Prashant suggested, storing it in a tuple would be a good
> >>>> alternative if you want random access to do binary search.
> >>>>
> >>>> -Thejas
> >>>>
> >>>>
> >>>>
> >>>> On 12/12/11 7:54 PM, 唐亮 wrote:
> >>>>
> >>>>> Hi all,
> >>>>> How can I implement a binary search in Pig?
> >>>>>
> >>>>> In one relation, there is a bag whose items are sorted,
> >>>>> and I want to check whether a specific item exists in the bag.
> >>>>>
> >>>>> In a UDF, I can't randomly access items in the DataBag container,
> >>>>> so I have to copy the items in the DataBag to an ArrayList, and this
> >>>>> is time-consuming.
> >>>>>
> >>>>> How can I implement the binary search efficiently in Pig?
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
>
