Re: GeoIP database lookups

Ross Nordeen Mon, 11 Jul 2011 16:37:39 -0700

Thanks for the quick help, Matt!  I'll try to get that working. If 
anyone else has a UDF that they could send, that would be cool.

-Ross

--
Ross Nordeen
Computer Networking And Systems Administration
Michigan Technological University
http://www.linkedin.com/in/rjnordee

----- Original Message -----
From: "Matt Davies" <m...@mattdavies.net>
To: user@pig.apache.org
Sent: Monday, July 11, 2011 1:34:29 PM GMT -08:00 US/Canada Pacific
Subject: Re: GeoIP database lookups

I can't really share the code, but I can tell you the general way of doing
it that works well.

When the UDF is instantiated, create a HashMap or the like with the data you
need. So, you'll do a DEFINE GEO com.xyz.Geo('$filename') in your pig code.

What goes into the map depends on how you are looking up the data. You could
be looking up based on IP address or the actual ID of the location.  Loading
it is a one-time hit and, in our case, so fast that it is not even noticed.

As each tuple hits the exec method, it then becomes a quick lookup.
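
A minimal sketch of that load-once, look-up-per-tuple pattern follows. It is
not the UDF described here. The class name GeoLookup, the two-column CSV
layout (locId,country with a header line) and the Reader-based constructor are
all assumptions made so the example runs without Pig or Hadoop on the
classpath. A real UDF would presumably extend org.apache.pig.EvalFunc, take
the file name as a String constructor argument, and read a Tuple in exec.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Pig UDF; the Pig dependency is left out on purpose.
class GeoLookup {
    // Location ID -> country code, filled once per instance (once per mapper).
    private final Map<String, String> countryById = new HashMap<>();

    // Assumed CSV layout: locId,country[,...] with one header line to skip.
    GeoLookup(Reader csv) {
        try (BufferedReader in = new BufferedReader(csv)) {
            String line = in.readLine(); // header line, discarded
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",", -1);
                if (cols.length >= 2) {
                    countryById.put(cols[0].trim(), cols[1].trim());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException("could not load locations", e);
        }
    }

    // Per-tuple work is a single hash lookup; returns null for unknown IDs.
    String exec(String locId) {
        return locId == null ? null : countryById.get(locId.trim());
    }

    public static void main(String[] args) {
        String sample = "locId,country\n1,US\n2,CA\n";
        GeoLookup geo = new GeoLookup(new StringReader(sample));
        System.out.println(geo.exec("1"));
        System.out.println(geo.exec("99"));
    }
}
```

Swapping the StringReader for a reader opened on the HDFS path passed through
DEFINE would keep the rest of the class unchanged.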

As for why HDFS: we found that there were too many issues in our shop
keeping things synced, and it was much easier to read the file out of HDFS.
The cost is that if a job has 1000 mappers, for instance, you read the file
1000 times from HDFS. True, you get performance gains reading from the
classpath of the jar but, as with all of programming, there are tradeoffs.
This approach worked best for us in our release structure. One example is a
general UDF like this that could have different input files depending on the
job. Or, to be even faster, we may filter out all non-US data from the
locations, so different files are used.   YMMV.
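
The pre-filtering step could look roughly like the sketch below. It is an
illustration only: LocationFilter and keepCountry are hypothetical names, and
the layout (header on the first line, country code in the second column) is an
assumption, not the actual format of locations.csv.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not from the thread: trims a locations CSV to one country.
class LocationFilter {
    // Assumes the first line is a header and the country code is column two.
    static List<String> keepCountry(List<String> csvLines, String country) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < csvLines.size(); i++) {
            String line = csvLines.get(i);
            String[] cols = line.split(",", -1);
            boolean isHeader = (i == 0);
            // Keep the header plus every row whose country column matches.
            if (isHeader || (cols.length >= 2 && cols[1].trim().equals(country))) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("locId,country", "1,US", "2,CA", "3,US");
        System.out.println(keepCountry(all, "US"));
    }
}
```

The smaller file is then what each mapper reads, so both the HDFS read and the
in-memory map shrink.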

Hope that helps some.

On Mon, Jul 11, 2011 at 2:18 PM, Ross Nordeen <rjnor...@mtu.edu> wrote:

> Matt,
>
> So don't ship the GeoIP database with the jar?  Does your mapper then cache
> the locations.csv?  Would you mind sending me your UDF?  That sounds like an
> interesting solution but I don't really understand how you would do that.  I
> was under the impression the fastest way to do it would be to ship and cache
> the binary database instead of calling from the HDFS.
>
> -Ross
>
> ----- Original Message -----
> From: "Matt Davies" <m...@mattdavies.net>
> To: user@pig.apache.org, "Ross" <rjnord...@semesteratsea.net>
> Sent: Monday, July 11, 2011 12:34:38 PM GMT -08:00 US/Canada Pacific
> Subject: Re: GeoIP database lookups
>
> We wrote a snazzy UDF that does 1 initialization per mapper and does all the
> necessary conversions. Quite efficient and fast.
>
> The trick to maintainability is to have your UDF initialize the
> locations.csv from HDFS and not to include the csv file within your jar.
>  That way you can easily update the locations without recompiling.
>
> -Matt
>
> On Mon, Jul 11, 2011 at 12:57 PM, Ross Nordeen <rjnor...@mtu.edu> wrote:
>
> > Hello all,
> >
> > Is there an accepted way to use the GeoIP database with pig?
> >
> > I've found some people have tried to write UDFs with their Java API.
> > http://www.maxmind.com/java
> >
> > Others say to use the streaming interface within pig and run the queries
> > through a perl script.
> >
> > http://www.cloudera.com/blog/2009/06/analyzing-apache-logs-with-pig/#comments
> >
> > I'm just trying to find the most efficient way to run this.  Any ideas?
> >
> > -Ross
> >
>
