Hi Prashant Kommireddi,

If so, how should I write the UDF, especially the data types in the UDF?
2011/12/15 Prashant Kommireddi <[email protected]>

When you flatten your BAG, all your segments are within a single tuple. Something like

((tag, ipstart, ipend, loc), (tag, ipstart, ipend, loc) ... (tagN, ipstartN, ipendN, locN))

You can access the inner tuples positionally.

Sent from my iPhone

On Dec 14, 2011, at 6:28 PM, 唐亮 <[email protected]> wrote:

Now the question is: how should I put all the "IP Segments" in one TUPLE?

Please help me!

2011/12/15 Prashant Kommireddi <[email protected]>

Michael,

This would have no benefit over using a DistributedCache. For a large cluster this would mean poor performance. If the file is static and needs to be looked up across the cluster, DistributedCache would be a better approach.

Thanks,
Prashant

On Wed, Dec 14, 2011 at 11:18 AM, jiang licht <[email protected]> wrote:

If that list of IP pairs is pretty static most of the time and will be used frequently, maybe just copy it into HDFS with a high replication factor. Then use it as a lookup table, or some binary tree or TreeMap kind of thing, by reading it from HDFS instead of using the distributed cache, if that sounds like an easier thing to do.

Best regards,
Michael

________________________________
From: Dmitriy Ryaboy <[email protected]>
To: [email protected]
Sent: Wednesday, December 14, 2011 10:28 AM
Subject: Re: Implement Binary Search in PIG

HBase has nothing to do with the distributed cache.

2011/12/14 唐亮 <[email protected]>

Now, I don't use HBase, so maybe I can't use DistributedCache.

And if I FLATTEN the DataBag, the results are Tuples; then in the UDF I can process only one Tuple at a time, which can't implement BinarySearch.

So, please help and show me the detailed solution. Thanks!

On Dec 14, 2011, at 5:59 PM, 唐亮 <[email protected]> wrote:

Hi Prashant Kommireddi,

If I do 1. and 2. as you mentioned, the schema will be {tag, ipStart, ipEnd, locName}.

BUT, how should I write the UDF, especially how should I set the type of the input parameter?

Currently, the UDF code is as below, whose input parameter is a DataBag:

public class GetProvinceNameFromIPNum extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return UnknownIP;
        if (input.size() != 2) {
            throw new IOException("Expected input's size is 2, but is: "
                    + input.size());
        }

        Object o1 = input.get(0); // This should be the IP you want to look up
        if (!(o1 instanceof Long)) {
            throw new IOException("Expected input 1 to be Long, but got "
                    + o1.getClass().getName());
        }
        Object o2 = input.get(1); // This is the Bag of IP segs
        if (!(o2 instanceof DataBag)) { // Should I change it to "(o2 instanceof Tuple)"?
            throw new IOException("Expected input 2 to be DataBag, but got "
                    + o2.getClass().getName());
        }

        ........... other codes ...........
    }

}
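A minimal sketch of what the missing binary-search portion of such a UDF might look like, assuming the second argument really does arrive as a single Tuple whose fields are the inner (ipStart, ipEnd, locName) tuples, sorted by ipStart and non-overlapping. The class name IpSegmentLookup and the "unknown" return value are placeholders, not something from this thread:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Sketch only: assumes the UDF is called as IpSegmentLookup(ip, segs), where segs
// is one Tuple whose fields are the inner (ipStart, ipEnd, locName) tuples,
// sorted by ipStart and non-overlapping.
public class IpSegmentLookup extends EvalFunc<String> {

    private static final String UNKNOWN = "unknown"; // placeholder for "UnknownIP"

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() != 2 || input.get(0) == null) {
            return UNKNOWN;
        }
        long ip = (Long) input.get(0);
        Tuple segments = (Tuple) input.get(1);

        int lo = 0;
        int hi = segments.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Tuple seg = (Tuple) segments.get(mid); // one (ipStart, ipEnd, locName) segment
            long ipStart = (Long) seg.get(0);
            long ipEnd = (Long) seg.get(1);
            if (ip < ipStart) {
                hi = mid - 1;                      // look in the lower half
            } else if (ip > ipEnd) {
                lo = mid + 1;                      // look in the upper half
            } else {
                return (String) seg.get(2);        // ip falls inside this segment
            }
        }
        return UNKNOWN;                            // no segment contains this IP
    }
}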
On Dec 14, 2011, at 3:16 PM, Prashant Kommireddi <[email protected]> wrote:

Seems like at the end of this you have a single bag with all the elements, and somehow you would like to check whether an element exists in it based on ipstart/ipend.

1. Use FLATTEN (http://pig.apache.org/docs/r0.9.1/basic.html#flatten) - this will convert the Bag to a Tuple: to_tuple = FOREACH order_ip_segs GENERATE tag, FLATTEN(order_seg); ---- This is O(n)
2. Now write a UDF that can access the elements positionally for the BinarySearch.
3. Dmitriy and Jonathan's ideas with DistributedCache could perform better than the above approach, so you could go down that route too.

2011/12/13 唐亮 <[email protected]>

The detailed Pig code is as below:

raw_ip_segment = load ...
ip_segs = foreach raw_ip_segment generate ipstart, ipend, name;
group_ip_segs = group ip_segs all;

order_ip_segs = foreach group_ip_segs {
    order_seg = order ip_segs by ipstart, ipend;
    generate 't' as tag, order_seg;
}

describe order_ip_segs
order_ip_segs: {tag: chararray, order_seg: {ipstart: long, ipend: long, poid: chararray}}

Here, order_ip_segs::order_seg is a BAG; how can I transfer it to a TUPLE?

And can I access the TUPLE randomly in a UDF?

On Dec 14, 2011, at 2:41 PM, 唐亮 <[email protected]> wrote:

Then how can I transfer all the items in the Bag to a Tuple?

2011/12/14 Jonathan Coveney <[email protected]>

It's funny, but if you look wayyyy in the past, I actually asked a bunch of questions that circled around, literally, this exact problem.

Dmitriy and Prashant are correct: the best way is to make a UDF that can do the lookup really efficiently. This is what the maxmind API does, for example.

2011/12/13 Prashant Kommireddi <[email protected]>

I am lost when you say "If I enumerate every IP, there will be more than 100000000 single IPs".

If each bag is a collection of 30000 tuples, it might not be too bad on the memory if you used a Tuple to store the segments instead?

(8 bytes for a long + 8 bytes for a long + 20 bytes for the chararray) = 36 bytes
Let's say we incur an additional overhead of about 4X this, which is ~160 bytes per tuple.
Total per Bag = 30000 x 160 = ~5 MB

You could probably store the IP segments as a Tuple and test it on your servers.

On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy <[email protected]> wrote:

Do you have many such bags or just one? If one, and you want to look up many IPs in it, it might be more efficient to serialize this relation to HDFS, and write a lookup UDF that specifies the serialized data set as a file to put in the distributed cache. At init time, load up the file into memory, then for every IP do the binary search in exec().
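A rough sketch of the kind of lookup UDF Dmitriy describes, assuming the sorted segment file has already been made available on each task's local filesystem (for example via a distributed cache symlink; how that is wired up depends on your Pig and Hadoop versions). The class name, the constructor argument, and the tab-separated "ipStart, ipEnd, locName" file format are all assumptions for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Sketch only: assumes a local file of tab-separated "ipStart ipEnd locName"
// lines, already sorted by ipStart, reachable at the path given to the constructor.
public class IpLookupFromCache extends EvalFunc<String> {

    private final String localPath;
    private long[] starts;
    private long[] ends;
    private String[] names;

    public IpLookupFromCache(String localPath) {
        this.localPath = localPath;
    }

    // Load the whole file into sorted parallel arrays, once per task JVM.
    private void loadSegments() throws IOException {
        List<String> lines = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new FileReader(localPath));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } finally {
            reader.close();
        }
        starts = new long[lines.size()];
        ends = new long[lines.size()];
        names = new String[lines.size()];
        for (int i = 0; i < lines.size(); i++) {
            String[] parts = lines.get(i).split("\t");
            starts[i] = Long.parseLong(parts[0]);
            ends[i] = Long.parseLong(parts[1]);
            names[i] = parts[2];
        }
    }

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        if (starts == null) {
            loadSegments();               // lazy init on the first call
        }
        long ip = (Long) input.get(0);

        int lo = 0;
        int hi = starts.length - 1;
        while (lo <= hi) {                // binary search over the sorted segments
            int mid = (lo + hi) >>> 1;
            if (ip < starts[mid]) {
                hi = mid - 1;
            } else if (ip > ends[mid]) {
                lo = mid + 1;
            } else {
                return names[mid];
            }
        }
        return null;                      // IP not covered by any segment
    }
}

The local file name would typically be passed in through a DEFINE statement in the Pig script; the details of registering the file with the distributed cache vary by version, so treat the above as a starting point rather than a drop-in solution.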
On Dec 13, 2011, at 7:55 PM, 唐亮 <[email protected]> wrote:

Thank you all!

The details are: a bag contains many "IP Segments", whose schema is (ipStart:long, ipEnd:long, locName:chararray), and the number of tuples is about 30000. I want to check whether an IP belongs to one of the segments in the bag.

I want to order the "IP Segments" by (ipStart, ipEnd) in MR, and then binary search in a UDF for whether an IP is in the bag.

If I enumerate every IP, there will be more than 100000000 single IPs, and I think it will also be time consuming to do this with a JOIN in Pig.

Please help me figure out how I can deal with this efficiently!

2011/12/14 Thejas Nair <[email protected]>

My assumption is that 唐亮 is trying to do binary search on bags within the tuples of a relation (i.e., the schema of the relation has a bag column). I don't think he is trying to treat the entire relation as one bag and do binary search on that.

-Thejas

On 12/13/11 2:30 PM, Andrew Wells wrote:

I don't think this could be done.

Pig is just a Hadoop job, and the idea behind Hadoop is to read all the data in a file. So by the time you put all the data into an array, you would have been better off just checking each element for the one you were looking for. What you would get is [n + lg(n)], which is just [n] after putting the data into an array.

Second, Hadoop is all about large data analysis, usually more than 100GB, so putting this into memory is out of the question.

Third, Hadoop is efficient because it processes this large amount of data by splitting it up into multiple processes. To do an efficient binary search, you would need to do this in one mapper or one reducer.

My opinion is: just don't fight Hadoop/Pig.
On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair <[email protected]> wrote:

Bags can be very large and might not fit into memory, and in such cases some or all of the bag might have to be stored on disk. In such cases, it is not efficient to do random access on the bag. That is why the DataBag interface does not support it.

As Prashant suggested, storing it in a tuple would be a good alternative if you want random access to do a binary search.

-Thejas

On 12/12/11 7:54 PM, 唐亮 wrote:

Hi all,

How can I implement a binary search in Pig?

In one relation, there exists a bag whose items are sorted, and I want to check whether a specific item exists in the bag.

In a UDF, I can't randomly access items in the DataBag container, so I have to transfer the items in the DataBag to an ArrayList, and this is time consuming.

How can I implement the binary search efficiently in Pig?
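For completeness, a small sketch of the "store it in a tuple" alternative that Prashant and Thejas mention: a UDF that copies an already-sorted bag into a single Tuple, which a downstream UDF (such as the lookup sketched earlier) can then index into. The name SegmentBagToTuple is a placeholder, and the whole bag has to fit in memory for this to be practical:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Sketch only: copies the contents of an already-sorted bag (e.g. produced by
// ORDER inside a nested FOREACH) into one Tuple for positional access later.
public class SegmentBagToTuple extends EvalFunc<Tuple> {

    private static final TupleFactory TUPLE_FACTORY = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        DataBag bag = (DataBag) input.get(0);
        Tuple out = TUPLE_FACTORY.newTuple((int) bag.size());

        int i = 0;
        for (Tuple segment : bag) {   // iteration follows the bag's existing order
            out.set(i++, segment);    // each output field is one segment tuple
        }
        return out;
    }
}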
