Re: Implement Binary Search in PIG

Prashant Kommireddi Wed, 14 Dec 2011 18:34:29 -0800

When you flatten your BAG, all your segments end up within a single tuple,
something like:

((tag, ipstart, ipend, loc), (tag, ipstart, ipend, loc) ... (tagN, ipstartN, ipendN, locN))

You can then access the inner tuples positionally.
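
A minimal sketch of the positional binary search itself, in plain Java with no Pig types. It assumes the segments are sorted by ipStart, do not overlap, and sit in a random-access list. IpSegment and IpSegmentSearch are hypothetical names for illustration, not part of Pig.

```java
import java.util.List;

/** Hypothetical holder for one IP segment; field names follow the thread's schema. */
final class IpSegment {
    final long ipStart;
    final long ipEnd;
    final String locName;

    IpSegment(long ipStart, long ipEnd, String locName) {
        this.ipStart = ipStart;
        this.ipEnd = ipEnd;
        this.locName = locName;
    }
}

final class IpSegmentSearch {
    /**
     * Returns the location name of the segment containing ip, or null if none does.
     * Assumes the list is sorted by ipStart, segments do not overlap, and
     * get(i) is O(1) (for example an ArrayList).
     */
    static String find(List<IpSegment> sorted, long ip) {
        int lo = 0;
        int hi = sorted.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;      // overflow-safe midpoint
            IpSegment s = sorted.get(mid);  // positional access
            if (ip < s.ipStart) {
                hi = mid - 1;               // ip lies left of this segment
            } else if (ip > s.ipEnd) {
                lo = mid + 1;               // ip lies right of this segment
            } else {
                return s.locName;           // ipStart <= ip <= ipEnd
            }
        }
        return null;                        // ip falls in a gap or outside all segments
    }
}
```

With about 30000 segments, each lookup touches at most around 15 elements.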


On Dec 14, 2011, at 6:28 PM, "唐亮" <[email protected]> wrote:

> Now the question is:
> How should I put all the "IP Segments" in one TUPLE?
>
> Please help me!
>
>
> 2011/12/15 Prashant Kommireddi <[email protected]>
>
>> Michael,
>>
>> This would have no benefit over using a DistributedCache. For a large
>> cluster this would mean poor performance. If the file is static and needs
>> to be looked-up across the cluster, DistributedCache would be a better
>> approach.
>>
>> Thanks,
>> Prashant
>>
>> On Wed, Dec 14, 2011 at 11:18 AM, jiang licht <[email protected]>
>> wrote:
>>
>>> If that list of IP pairs is pretty static most of the time and will be used
>>> frequently, maybe just copy it into HDFS with a high replication factor.
>>> Then use it as a lookup table, or some binary tree or TreeMap kind of thing,
>>> by reading it from HDFS instead of using the distributed cache, if that
>>> sounds like an easier thing to do.
>>>
>>>
>>> Best regards,
>>> Michael
>>>
>>>
>>> ________________________________
>>> From: Dmitriy Ryaboy <[email protected]>
>>> To: [email protected]
>>> Sent: Wednesday, December 14, 2011 10:28 AM
>>> Subject: Re: Implement Binary Search in PIG
>>>
>>> hbase has nothing to do with distributed cache.
>>>
>>>
>>> 2011/12/14 唐亮 <[email protected]>
>>>
>>>> Right now I am not using HBase,
>>>> so maybe I can't use DistributedCache.
>>>>
>>>> And if I FLATTEN the DataBag, the results are separate Tuples,
>>>> so the UDF can process only one Tuple at a time, which makes it
>>>> impossible to implement a binary search.
>>>>
>>>> So, please help and show me the detailed solution.
>>>> Thanks!
>>>>
>>>> On December 14, 2011 at 5:59 PM, 唐亮 <[email protected]> wrote:
>>>>
>>>>> Hi Prashant Kommireddi,
>>>>>
>>>>> If I do 1. and 2. as you mentioned,
>>>>> the schema will be {tag, ipStart, ipEnd, locName}.
>>>>>
>>>>> BUT, how should I write the UDF, and especially how should I set the
>>>>> type of the input parameter?
>>>>>
>>>>> Currently, the UDF code is as below; its input parameter is a DataBag:
>>>>>
>>>>> import java.io.IOException;
>>>>> import org.apache.pig.EvalFunc;
>>>>> import org.apache.pig.data.DataBag;
>>>>> import org.apache.pig.data.Tuple;
>>>>>
>>>>> public class GetProvinceNameFromIPNum extends EvalFunc<String> {
>>>>>
>>>>>     public String exec(Tuple input) throws IOException {
>>>>>         if (input == null || input.size() == 0)
>>>>>             return UnknownIP;  // constant not shown in this excerpt
>>>>>         if (input.size() != 2) {
>>>>>             throw new IOException("Expected input's size is 2, but is: "
>>>>>                     + input.size());
>>>>>         }
>>>>>
>>>>>         Object o1 = input.get(0);  // This should be the IP you want to look up
>>>>>         if (!(o1 instanceof Long)) {
>>>>>             throw new IOException("Expected input 1 to be Long, but got "
>>>>>                     + o1.getClass().getName());
>>>>>         }
>>>>>         Object o2 = input.get(1);  // This is the Bag of IP segs
>>>>>         // Should I change this to "(o2 instanceof Tuple)"?
>>>>>         if (!(o2 instanceof DataBag)) {
>>>>>             throw new IOException("Expected input 2 to be DataBag, but got "
>>>>>                     + o2.getClass().getName());
>>>>>         }
>>>>>
>>>>>         ........... other codes ...........
>>>>>     }
>>>>>
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>> On December 14, 2011 at 3:16 PM, Prashant Kommireddi <[email protected]> wrote:
>>>>>
>>>>>> Seems like at the end of this you have a single bag with all the elements,
>>>>>> and somehow you would like to check whether an element exists in it based
>>>>>> on ipstart/end.
>>>>>>
>>>>>>
>>>>>>  1. Use FLATTEN http://pig.apache.org/docs/r0.9.1/basic.html#flatten-
>>>>>>  this will convert the Bag to Tuple:
>>>>>>      to_tuple = FOREACH order_ip_segs GENERATE tag, FLATTEN(order_seg); -- This is O(n)
>>>>>>  2. Now write a UDF that can access the elements positionally for the
>>>>>>  BinarySearch
>>>>>>  3. Dmitriy and Jonathan's ideas with DistributedCache could perform
>>>>>>  better than the above approach, so you could go down that route too.
>>>>>>
>>>>>> 2011/12/13 唐亮 <[email protected]>
>>>>>>
>>>>>>> The detailed Pig code is as below:
>>>>>>>
>>>>>>> raw_ip_segment = load ...
>>>>>>> ip_segs = foreach raw_ip_segment generate ipstart, ipend, name;
>>>>>>> group_ip_segs = group ip_segs all;
>>>>>>>
>>>>>>> order_ip_segs = foreach group_ip_segs {
>>>>>>>     order_seg = order ip_segs by ipstart, ipend;
>>>>>>>     generate 't' as tag, order_seg;
>>>>>>> }
>>>>>>> describe order_ip_segs;
>>>>>>> order_ip_segs: {tag: chararray,order_seg: {ipstart: long,ipend: long,poid: chararray}}
>>>>>>>
>>>>>>> Here, order_ip_segs::order_seg is a BAG;
>>>>>>> how can I convert it to a TUPLE?
>>>>>>>
>>>>>>> And can I access the TUPLE randomly in a UDF?
>>>>>>>
>>>>>>> On December 14, 2011 at 2:41 PM, 唐亮 <[email protected]> wrote:
>>>>>>>
>>>>>>>> Then how can I transfer all the items in Bag to a Tuple?
>>>>>>>>
>>>>>>>>
>>>>>>>> 2011/12/14 Jonathan Coveney <[email protected]>
>>>>>>>>
>>>>>>>>> It's funny, but if you look wayyyy in the past, I actually asked a bunch
>>>>>>>>> of questions that circled around, literally, this exact problem.
>>>>>>>>>
>>>>>>>>> Dmitriy and Prashant are correct: the best way is to make a UDF that can
>>>>>>>>> do the lookup really efficiently. This is what the MaxMind API does, for
>>>>>>>>> example.
>>>>>>>>>
>>>>>>>>> 2011/12/13 Prashant Kommireddi <[email protected]>
>>>>>>>>>
>>>>>>>>>> I am lost when you say "If enumerate every IP, it will be more than
>>>>>>>>>> 100000000 single IPs"
>>>>>>>>>>
>>>>>>>>>> If each bag is a collection of 30000 tuples, it might not be too bad on
>>>>>>>>>> memory if you used a Tuple to store the segments instead?
>>>>>>>>>>
>>>>>>>>>> (8 bytes long + 8 bytes long + 20 bytes for chararray) = 36 bytes
>>>>>>>>>> Let's say we incur an additional overhead of 4X this, which is ~160
>>>>>>>>>> bytes per tuple.
>>>>>>>>>> Total per Bag = 30000 X 160 = ~5 MB
>>>>>>>>>>
>>>>>>>>>> You could probably store the IP segments as a Tuple and test it on your
>>>>>>>>>> servers.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Do you have many such bags or just one? If one, and you want to look up
>>>>>>>>>>> many IPs in it, it might be more efficient to serialize this relation to
>>>>>>>>>>> HDFS, and write a lookup UDF that specifies the serialized data set as a
>>>>>>>>>>> file to put in the distributed cache. At init time, load the file into
>>>>>>>>>>> memory, then for every IP do the binary search in exec().
>>>>>>>>>>>
>>>>>>>>>>> On Dec 13, 2011, at 7:55 PM, 唐亮 <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thank you all!
>>>>>>>>>>>>
>>>>>>>>>>>> The detail is:
>>>>>>>>>>>> A bag contains many "IP Segments", whose schema is (ipStart:long,
>>>>>>>>>>>> ipEnd:long, locName:chararray), and the number of tuples is about
>>>>>>>>>>>> 30000. I want to check whether an IP belongs to one segment in the bag.
>>>>>>>>>>>>
>>>>>>>>>>>> I want to order the "IP Segments" by (ipStart, ipEnd) in MR,
>>>>>>>>>>>> and then binary search in a UDF to check whether an IP is in the bag.
>>>>>>>>>>>>
>>>>>>>>>>>> If I enumerate every IP, there will be more than 100000000 single IPs,
>>>>>>>>>>>> and I think a JOIN in Pig would also be time consuming.
>>>>>>>>>>>>
>>>>>>>>>>>> Please help me deal with this efficiently!
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2011/12/14 Thejas Nair <[email protected]>
>>>>>>>>>>>>
>>>>>>>>>>>>> My assumption is that 唐亮 is trying to do binary search on bags within
>>>>>>>>>>>>> the tuples in a relation (i.e. the schema of the relation has a bag
>>>>>>>>>>>>> column). I don't think he is trying to treat the entire relation as one
>>>>>>>>>>>>> bag and do binary search on that.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Thejas
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 12/13/11 2:30 PM, Andrew Wells wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think this could be done.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Pig is just a Hadoop job, and the idea behind Hadoop is to read all the
>>>>>>>>>>>>>> data in a file.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So by the time you put all the data into an array, you would have been
>>>>>>>>>>>>>> better off just checking each element for the one you were looking for.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So what you would get is [n + lg(n)], which will just be [n] after
>>>>>>>>>>>>>> putting that into an array.
>>>>>>>>>>>>>> Second, Hadoop is all about large data analysis, usually more than
>>>>>>>>>>>>>> 100GB, so putting this into memory is out of the question.
>>>>>>>>>>>>>> Third, Hadoop is efficient because it processes this large amount of
>>>>>>>>>>>>>> data by splitting it up into multiple processes. To do an efficient
>>>>>>>>>>>>>> binary search, you would need to do this in one mapper or one reducer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My opinion is just don't fight Hadoop/Pig.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Bags can be very large and might not fit into memory, and in such cases
>>>>>>>>>>>>>>> some or all of the bag might have to be stored on disk. In such cases, it
>>>>>>>>>>>>>>> is not efficient to do random access on the bag. That is why the DataBag
>>>>>>>>>>>>>>> interface does not support it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As Prashant suggested, storing it in a tuple would be a good alternative
>>>>>>>>>>>>>>> if you want to have random access to do binary search.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Thejas
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 12/12/11 7:54 PM, 唐亮 wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>> How can I implement a binary search in Pig?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In one relation, there is a bag whose items are sorted,
>>>>>>>>>>>>>>>> and I want to check whether a specific item exists in the bag.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In a UDF, I can't randomly access items in the DataBag container,
>>>>>>>>>>>>>>>> so I have to copy the items from the DataBag into an ArrayList, and
>>>>>>>>>>>>>>>> this is time consuming.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> How can I implement the binary search efficiently in Pig?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
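
For the load-once approach suggested in the thread (read the segment file into memory at init time, then look up every IP in memory), a rough sketch in plain Java follows. It uses the TreeMap idea that was also mentioned. The class name SegmentTable, the tab-separated file layout "ipStart, ipEnd, locName", and the assumption that segments do not overlap are all mine for illustration; wiring this into a Pig UDF and the distributed cache is not shown.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory lookup table, built once and queried per IP. */
final class SegmentTable {
    private static final class Range {
        final long ipEnd;
        final String locName;

        Range(long ipEnd, String locName) {
            this.ipEnd = ipEnd;
            this.locName = locName;
        }
    }

    // Keyed by ipStart; TreeMap keeps the keys sorted.
    private final TreeMap<Long, Range> byStart = new TreeMap<>();

    /** Reads lines of the assumed form ipStart<TAB>ipEnd<TAB>locName. */
    static SegmentTable load(Reader in) {
        SegmentTable table = new SegmentTable();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;  // skip blank lines
                }
                String[] f = line.split("\t");
                table.byStart.put(Long.parseLong(f[0]),
                        new Range(Long.parseLong(f[1]), f[2]));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return table;
    }

    /** Returns the location for ip, or null if no segment contains it. */
    String lookup(long ip) {
        // Greatest ipStart <= ip; an O(log n) search inside the tree.
        Map.Entry<Long, Range> e = byStart.floorEntry(ip);
        if (e == null || ip > e.getValue().ipEnd) {
            return null;
        }
        return e.getValue().locName;
    }
}
```

The table would be built once per task and reused for every input row, so the per-row cost is only the O(log n) lookup.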
