Seems like at the end of this you have a Single bag with all the elements,
and somehow you would like to check whether an element exists in it based
on ipstart/end.


   1. Use FLATTEN http://pig.apache.org/docs/r0.9.1/basic.html#flatten -
   this will convert the Bag to Tuple:  to_tuple = FOREACH order_ip_segs
   GENERATE tag, FLATTEN(order_seq); ---- This is O(n)
   2. Now write a UDF that can access the elements positionally for the
   BinarySearch
   3. Dmitriy and Jonathan's ideas with DistributedCache could perform
   better than the above approach, so you could go down that route too.


2011/12/13 唐亮 <[email protected]>

> The detailed PIG codes are as below:
>
> raw_ip_segment = load ...
> ip_segs = foreach raw_ip_segment generate ipstart, ipend, name;
> group_ip_segs = group ip_segs all;
>
> order_ip_segs = foreach group_ip_segs {
>  order_seg = order ip_segs by ipstart, ipend;
>  generate 't' as tag, order_seg;
> }
> describe order_ip_segs
> order_ip_segs: {tag: chararray,order_seg: {ipstart: long,ipend: long,poid:
> chararray}}
>
> Here, the order_ip_segs::order_seg is a BAG,
> how can I transer it to a TUPLE?
>
> And can I access the TUPLE randomly in UDF?
>
> 在 2011年12月14日 下午2:41,唐亮 <[email protected]>写道:
>
> > Then how can I transfer all the items in Bag to a Tuple?
> >
> >
> > 2011/12/14 Jonathan Coveney <[email protected]>
> >
> >> It's funny, but if you look wayyyy in the past, I actually asked a bunch
> >> of
> >> questions that circled around, literally, this exact problem.
> >>
> >> Dmitriy and Prahsant are correct: the best way is to make a UDF that can
> >> do
> >> the lookup really efficiently. This is what the maxmind API does, for
> >> example.
> >>
> >> 2011/12/13 Prashant Kommireddi <[email protected]>
> >>
> >> > I am lost when you say "If enumerate every IP, it will be more than
> >> > 100000000 single IPs"
> >> >
> >> > If each bag is a collection of 30000 tuples it might not be too bad on
> >> the
> >> > memory if you used Tuple to store segments instead?
> >> >
> >> > (8 bytes long + 8 bytes long + 20 bytes for chararray ) = 36
> >> > Lets say we incur an additional overhead 4X times this, which is ~160
> >> bytes
> >> > per tuple.
> >> > Total per Bag = 30000 X 160 = ~5 MB
> >> >
> >> > You could probably store the ipsegments as Tuple and test it on your
> >> > servers.
> >> >
> >> >
> >> > On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy <[email protected]>
> >> > wrote:
> >> >
> >> > > Do you have many such bags or just one? If one, and you want to look
> >> up
> >> > > many ups in it, might be more efficient to serialize this relation
> to
> >> > hdfs,
> >> > > and write a lookup udf that specifies the serialized data set as a
> >> file
> >> > to
> >> > > put in distributed cache. At init time, load up the file into
> memory,
> >> > then
> >> > > for every ip do the binary search in exec()
> >> > >
> >> > > On Dec 13, 2011, at 7:55 PM, 唐亮 <[email protected]> wrote:
> >> > >
> >> > > > Thank you all!
> >> > > >
> >> > > > The detail is:
> >> > > > A bag contains many "IP Segments", whose schema is (ipStart:long,
> >> > > > ipEnd:long, locName:chararray) and the number of tuples is about
> >> 30000,
> >> > > > and I want to check wheather an IP is belong to one segment in the
> >> bag.
> >> > > >
> >> > > > I want to order the "IP Segments" by (ipStart, ipEnd) in MR,
> >> > > > and then binary search wheather an IP is in the bag in UDF.
> >> > > >
> >> > > > If enumerate every IP, it will be more than 100000000 single IPs,
> >> > > > I think it will also be time consuming by JOIN in PIG.
> >> > > >
> >> > > > Please help me how can I deal with it efficiently!
> >> > > >
> >> > > >
> >> > > > 2011/12/14 Thejas Nair <[email protected]>
> >> > > >
> >> > > >> My assumption is that 唐亮 is trying to do binary search on bags
> >> within
> >> > > the
> >> > > >> tuples in a relation (ie schema of the relation has a bag
> column).
> >> I
> >> > > don't
> >> > > >> think he is trying to treat the entire relation as one bag and do
> >> > binary
> >> > > >> search on that.
> >> > > >>
> >> > > >>
> >> > > >> -Thejas
> >> > > >>
> >> > > >>
> >> > > >>
> >> > > >> On 12/13/11 2:30 PM, Andrew Wells wrote:
> >> > > >>
> >> > > >>> I don't think this could be done,
> >> > > >>>
> >> > > >>> pig is just a hadoop job, and the idea behind hadoop is to read
> >> all
> >> > the
> >> > > >>> data in a file.
> >> > > >>>
> >> > > >>> so by the time you put all the data into an array, you would
> have
> >> > been
> >> > > >>> better off just checking each element for the one you were
> looking
> >> > for.
> >> > > >>>
> >> > > >>> So what you would get is [n + lg (n)], which will just be [n]
> >> after
> >> > > >>> putting
> >> > > >>> that into an array.
> >> > > >>> Second, hadoop is all about large data analysis, usually more
> than
> >> > > 100GB,
> >> > > >>> so putting this into memory is out of the question.
> >> > > >>> Third, hadoop is efficient because it processes this large
> amount
> >> of
> >> > > data
> >> > > >>> by splitting it up into multiple processes. To do an efficient
> >> binary
> >> > > >>> search, you would need do this in one mapper or one reducer.
> >> > > >>>
> >> > > >>> My opinion is just don't fight hadoop/pig.
> >> > > >>>
> >> > > >>>
> >> > > >>>
> >> > > >>> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair<
> >> [email protected]>
> >> > > >>> wrote:
> >> > > >>>
> >> > > >>> Bags can be very large might not fit into memory, and in such
> >> cases
> >> > > some
> >> > > >>>> or all of the bag might have to be stored on disk. In such
> >> cases, it
> >> > > is
> >> > > >>>> not
> >> > > >>>> efficient to do random access on the bag. That is why the
> DataBag
> >> > > >>>> interface
> >> > > >>>> does not support it.
> >> > > >>>>
> >> > > >>>> As Prashant suggested, storing it in a tuple would be a good
> >> > > alternative,
> >> > > >>>> if you want to have random access to do binary search.
> >> > > >>>>
> >> > > >>>> -Thejas
> >> > > >>>>
> >> > > >>>>
> >> > > >>>>
> >> > > >>>> On 12/12/11 7:54 PM, 唐亮 wrote:
> >> > > >>>>
> >> > > >>>> Hi all,
> >> > > >>>>> How can I implement a binary search in pig?
> >> > > >>>>>
> >> > > >>>>> In one relation, there exists a bag whose items are sorted.
> >> > > >>>>> And I want to check there exists a specific item in the bag.
> >> > > >>>>>
> >> > > >>>>> In UDF, I can't random access items in DataBag container.
> >> > > >>>>> So I have to transfer the items in DataBag to an ArrayList,
> and
> >> > this
> >> > > is
> >> > > >>>>> time consuming.
> >> > > >>>>>
> >> > > >>>>> How can I implement the binary search efficiently in pig?
> >> > > >>>>>
> >> > > >>>>>
> >> > > >>>>>
> >> > > >>>>
> >> > > >>>
> >> > > >>
> >> > >
> >> >
> >>
> >
> >
>

Reply via email to