Not sure what you mean. Have you tried the code I forwarded? Are you facing
any issues there?
If your question is regarding binarySearch implementation, here is
pseudo-code'ish implementation. I have not tested this, please treat this
as a general idea on how to go about accessing the elements within the
Tuple.
ALSO, I am assuming you have defined schema for (inner) Tuple contents.
public String binarySearch(Tuple tuple, long toSearch, int low, int high) {
if(low > high)
return "NOT FOUND"; //Handle this the way you would like
if(tuple == null)
throw new IllegalArgumentException("Tuple is null"); //Handle
this the way you would like
int mid = (low + high)/2;
Tuple midTuple = tuple.get(mid);
String tag = midTuple.get(0).toString();
long ipstart = (Long)midTuple.get(1);
long ipend = (Long)midTuple.get(2);
String loc = midTuple.get(3).toString();
if(toSearch == ipstart) //Or ipend, I am not sure how you want to search
{
return loc;
}
else if(toSearch < ipstart)
return binarySearch(tuple, low, mid - 1);
else
return binarySearch(tuple, mid+1, high);
}
2011/12/14 唐亮 <[email protected]>
> Hi Prashant Kommireddi,
>
> If so, how should I write the UDF, especially the data types in UDF?
>
> 2011/12/15 Prashant Kommireddi <[email protected]>
>
> > When you flatten your BAG all your segments are within a single tuple.
> > Something like
> >
> > ((tag, ipstart, ipend, loc), (tag, ipstart, ipend, loc)...(tagN,
> > ipstartN, ipendN, locN))
> >
> > You can access the inner tuples positionally.
> >
> > Sent from my iPhone
> >
> > On Dec 14, 2011, at 6:28 PM, "唐亮" <[email protected]> wrote:
> >
> > > Now the question is:
> > > How should I put all the "IP Segments" in one TUPLE?
> > >
> > > Please help me!
> > >
> > >
> > > 2011/12/15 Prashant Kommireddi <[email protected]>
> > >
> > >> Michael,
> > >>
> > >> This would have no benefit over using a DistributedCache. For a large
> > >> cluster this would mean poor performance. If the file is static and
> > needs
> > >> to be looked-up across the cluster, DistributedCache would be a better
> > >> approach.
> > >>
> > >> Thanks,
> > >> Prashant
> > >>
> > >> On Wed, Dec 14, 2011 at 11:18 AM, jiang licht <[email protected]>
> > >> wrote:
> > >>
> > >>> If that list of ip pairs is pretty static most time and will be used
> > >>> frequently, maybe just copy it in hdfs with a high replication
> factor.
> > >> Then
> > >>> use it as a look up table or some binary tree or treemap kind of
> thing
> > by
> > >>> reading it from hdfs instead of using distributed cache if that
> sounds
> > an
> > >>> easier thing to do.
> > >>>
> > >>>
> > >>> Best regards,
> > >>> Michael
> > >>>
> > >>>
> > >>> ________________________________
> > >>> From: Dmitriy Ryaboy <[email protected]>
> > >>> To: [email protected]
> > >>> Sent: Wednesday, December 14, 2011 10:28 AM
> > >>> Subject: Re: Implement Binary Search in PIG
> > >>>
> > >>> hbase has nothing to do with distributed cache.
> > >>>
> > >>>
> > >>> 2011/12/14 唐亮 <[email protected]>
> > >>>
> > >>>> Now, I didn't use HBase,
> > >>>> so, maybe I can't use DistributedCache.
> > >>>>
> > >>>> And if FLATTEN DataBag, the results are Tuples,
> > >>>> then in UDF I can process only one Tuple, which can't implement
> > >>>> BinarySearch.
> > >>>>
> > >>>> So, please help and show me the detailed solution.
> > >>>> Thanks!
> > >>>>
> > >>>> 在 2011年12月14日 下午5:59,唐亮 <[email protected]>写道:
> > >>>>
> > >>>>> Hi Prashant Kommireddi,
> > >>>>>
> > >>>>> If I do 1. and 2. as you mentioned,
> > >>>>> the schema will be {tag, ipStart, ipEnd, locName}.
> > >>>>>
> > >>>>> BUT, how should I write the UDF, especially how should I set the
> type
> > >>> of
> > >>>>> the input parameter?
> > >>>>>
> > >>>>> Currently, the UDF codes are as below, whose input parameter is
> > >>> DataBag:
> > >>>>>
> > >>>>> public class GetProvinceNameFromIPNum extends EvalFunc<String> {
> > >>>>>
> > >>>>> public String exec(Tuple input) throws IOException {
> > >>>>> if (input == null || input.size() == 0)
> > >>>>> return UnknownIP;
> > >>>>> if (input.size() != 2) {
> > >>>>> throw new IOException("Expected input's size is 2, but is: " +
> > >>>>> input.size());
> > >>>>> }
> > >>>>>
> > >>>>> Object o1 = input.get(0); * // This should be the IP you
> want
> > >>> to
> > >>>>> look up*
> > >>>>> if (!(o1 instanceof Long)) {
> > >>>>> throw new IOException("Expected input 1 to be Long, but
> > >>> got "
> > >>>>> + o1.getClass().getName());
> > >>>>> }
> > >>>>> Object o2 = input.get(1); *// This is the Bag of IP segs*
> > >>>>> if (!(o2 instanceof *DataBag*)) { //* Should I change it to
> > >>> "(o2
> > >>>>> instanceof Tuple)"?*
> > >>>>> throw new IOException("Expected input 2 to be DataBag,
> > >> but
> > >>>> got
> > >>>>> "
> > >>>>> + o2.getClass().getName());
> > >>>>> }
> > >>>>>
> > >>>>> ........... other codes ...........
> > >>>>> }
> > >>>>>
> > >>>>> }
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> 在 2011年12月14日 下午3:16,Prashant Kommireddi <[email protected]>写道:
> > >>>>>
> > >>>>> Seems like at the end of this you have a Single bag with all the
> > >>>> elements,
> > >>>>>> and somehow you would like to check whether an element exists in
> it
> > >>>> based
> > >>>>>> on ipstart/end.
> > >>>>>>
> > >>>>>>
> > >>>>>> 1. Use FLATTEN
> > >> http://pig.apache.org/docs/r0.9.1/basic.html#flatten-
> > >>>>>> this will convert the Bag to Tuple: to_tuple = FOREACH
> > >>> order_ip_segs
> > >>>>>> GENERATE tag, FLATTEN(order_seq); ---- This is O(n)
> > >>>>>> 2. Now write a UDF that can access the elements positionally for
> > >> the
> > >>>>>> BinarySearch
> > >>>>>> 3. Dmitriy and Jonathan's ideas with DistributedCache could
> > >> perform
> > >>>>>> better than the above approach, so you could go down that route
> > >> too.
> > >>>>>>
> > >>>>>>
> > >>>>>> 2011/12/13 唐亮 <[email protected]>
> > >>>>>>
> > >>>>>>> The detailed PIG codes are as below:
> > >>>>>>>
> > >>>>>>> raw_ip_segment = load ...
> > >>>>>>> ip_segs = foreach raw_ip_segment generate ipstart, ipend, name;
> > >>>>>>> group_ip_segs = group ip_segs all;
> > >>>>>>>
> > >>>>>>> order_ip_segs = foreach group_ip_segs {
> > >>>>>>> order_seg = order ip_segs by ipstart, ipend;
> > >>>>>>> generate 't' as tag, order_seg;
> > >>>>>>> }
> > >>>>>>> describe order_ip_segs
> > >>>>>>> order_ip_segs: {tag: chararray,order_seg: {ipstart: long,ipend:
> > >>>>>> long,poid:
> > >>>>>>> chararray}}
> > >>>>>>>
> > >>>>>>> Here, the order_ip_segs::order_seg is a BAG,
> > >>>>>>> how can I transer it to a TUPLE?
> > >>>>>>>
> > >>>>>>> And can I access the TUPLE randomly in UDF?
> > >>>>>>>
> > >>>>>>> 在 2011年12月14日 下午2:41,唐亮 <[email protected]>写道:
> > >>>>>>>
> > >>>>>>>> Then how can I transfer all the items in Bag to a Tuple?
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> 2011/12/14 Jonathan Coveney <[email protected]>
> > >>>>>>>>
> > >>>>>>>>> It's funny, but if you look wayyyy in the past, I actually
> > >> asked
> > >>> a
> > >>>>>> bunch
> > >>>>>>>>> of
> > >>>>>>>>> questions that circled around, literally, this exact problem.
> > >>>>>>>>>
> > >>>>>>>>> Dmitriy and Prahsant are correct: the best way is to make a UDF
> > >>>> that
> > >>>>>> can
> > >>>>>>>>> do
> > >>>>>>>>> the lookup really efficiently. This is what the maxmind API
> > >> does,
> > >>>> for
> > >>>>>>>>> example.
> > >>>>>>>>>
> > >>>>>>>>> 2011/12/13 Prashant Kommireddi <[email protected]>
> > >>>>>>>>>
> > >>>>>>>>>> I am lost when you say "If enumerate every IP, it will be
> > >> more
> > >>>> than
> > >>>>>>>>>> 100000000 single IPs"
> > >>>>>>>>>>
> > >>>>>>>>>> If each bag is a collection of 30000 tuples it might not be
> > >> too
> > >>>>>> bad on
> > >>>>>>>>> the
> > >>>>>>>>>> memory if you used Tuple to store segments instead?
> > >>>>>>>>>>
> > >>>>>>>>>> (8 bytes long + 8 bytes long + 20 bytes for chararray ) = 36
> > >>>>>>>>>> Lets say we incur an additional overhead 4X times this, which
> > >>> is
> > >>>>>> ~160
> > >>>>>>>>> bytes
> > >>>>>>>>>> per tuple.
> > >>>>>>>>>> Total per Bag = 30000 X 160 = ~5 MB
> > >>>>>>>>>>
> > >>>>>>>>>> You could probably store the ipsegments as Tuple and test it
> > >> on
> > >>>>>> your
> > >>>>>>>>>> servers.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy <
> > >>>>>> [email protected]>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Do you have many such bags or just one? If one, and you
> > >> want
> > >>> to
> > >>>>>> look
> > >>>>>>>>> up
> > >>>>>>>>>>> many ups in it, might be more efficient to serialize this
> > >>>>>> relation
> > >>>>>>> to
> > >>>>>>>>>> hdfs,
> > >>>>>>>>>>> and write a lookup udf that specifies the serialized data
> > >> set
> > >>>> as
> > >>>>>> a
> > >>>>>>>>> file
> > >>>>>>>>>> to
> > >>>>>>>>>>> put in distributed cache. At init time, load up the file
> > >> into
> > >>>>>>> memory,
> > >>>>>>>>>> then
> > >>>>>>>>>>> for every ip do the binary search in exec()
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Dec 13, 2011, at 7:55 PM, 唐亮 <[email protected]>
> > >> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Thank you all!
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The detail is:
> > >>>>>>>>>>>> A bag contains many "IP Segments", whose schema is
> > >>>>>> (ipStart:long,
> > >>>>>>>>>>>> ipEnd:long, locName:chararray) and the number of tuples
> > >> is
> > >>>>>> about
> > >>>>>>>>> 30000,
> > >>>>>>>>>>>> and I want to check wheather an IP is belong to one
> > >> segment
> > >>>> in
> > >>>>>> the
> > >>>>>>>>> bag.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I want to order the "IP Segments" by (ipStart, ipEnd) in
> > >>> MR,
> > >>>>>>>>>>>> and then binary search wheather an IP is in the bag in
> > >> UDF.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> If enumerate every IP, it will be more than 100000000
> > >>> single
> > >>>>>> IPs,
> > >>>>>>>>>>>> I think it will also be time consuming by JOIN in PIG.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Please help me how can I deal with it efficiently!
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 2011/12/14 Thejas Nair <[email protected]>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> My assumption is that 唐亮 is trying to do binary search
> > >> on
> > >>>> bags
> > >>>>>>>>> within
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>> tuples in a relation (ie schema of the relation has a
> > >> bag
> > >>>>>>> column).
> > >>>>>>>>> I
> > >>>>>>>>>>> don't
> > >>>>>>>>>>>>> think he is trying to treat the entire relation as one
> > >> bag
> > >>>>>> and do
> > >>>>>>>>>> binary
> > >>>>>>>>>>>>> search on that.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> -Thejas
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 12/13/11 2:30 PM, Andrew Wells wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I don't think this could be done,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> pig is just a hadoop job, and the idea behind hadoop is
> > >>> to
> > >>>>>> read
> > >>>>>>>>> all
> > >>>>>>>>>> the
> > >>>>>>>>>>>>>> data in a file.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> so by the time you put all the data into an array, you
> > >>>> would
> > >>>>>>> have
> > >>>>>>>>>> been
> > >>>>>>>>>>>>>> better off just checking each element for the one you
> > >>> were
> > >>>>>>> looking
> > >>>>>>>>>> for.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> So what you would get is [n + lg (n)], which will just
> > >> be
> > >>>> [n]
> > >>>>>>>>> after
> > >>>>>>>>>>>>>> putting
> > >>>>>>>>>>>>>> that into an array.
> > >>>>>>>>>>>>>> Second, hadoop is all about large data analysis,
> > >> usually
> > >>>> more
> > >>>>>>> than
> > >>>>>>>>>>> 100GB,
> > >>>>>>>>>>>>>> so putting this into memory is out of the question.
> > >>>>>>>>>>>>>> Third, hadoop is efficient because it processes this
> > >>> large
> > >>>>>>> amount
> > >>>>>>>>> of
> > >>>>>>>>>>> data
> > >>>>>>>>>>>>>> by splitting it up into multiple processes. To do an
> > >>>>>> efficient
> > >>>>>>>>> binary
> > >>>>>>>>>>>>>> search, you would need do this in one mapper or one
> > >>>> reducer.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> My opinion is just don't fight hadoop/pig.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair<
> > >>>>>>>>> [email protected]>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Bags can be very large might not fit into memory, and
> > >> in
> > >>>> such
> > >>>>>>>>> cases
> > >>>>>>>>>>> some
> > >>>>>>>>>>>>>>> or all of the bag might have to be stored on disk. In
> > >>> such
> > >>>>>>>>> cases, it
> > >>>>>>>>>>> is
> > >>>>>>>>>>>>>>> not
> > >>>>>>>>>>>>>>> efficient to do random access on the bag. That is why
> > >>> the
> > >>>>>>> DataBag
> > >>>>>>>>>>>>>>> interface
> > >>>>>>>>>>>>>>> does not support it.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> As Prashant suggested, storing it in a tuple would be
> > >> a
> > >>>> good
> > >>>>>>>>>>> alternative,
> > >>>>>>>>>>>>>>> if you want to have random access to do binary search.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> -Thejas
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On 12/12/11 7:54 PM, 唐亮 wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>>>>> How can I implement a binary search in pig?
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> In one relation, there exists a bag whose items are
> > >>>> sorted.
> > >>>>>>>>>>>>>>>> And I want to check there exists a specific item in
> > >> the
> > >>>>>> bag.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> In UDF, I can't random access items in DataBag
> > >>> container.
> > >>>>>>>>>>>>>>>> So I have to transfer the items in DataBag to an
> > >>>> ArrayList,
> > >>>>>>> and
> > >>>>>>>>>> this
> > >>>>>>>>>>> is
> > >>>>>>>>>>>>>>>> time consuming.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> How can I implement the binary search efficiently in
> > >>> pig?
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
>