Thanks for the quick help, Matt! I'll try to get that working. If anyone else has a UDF that they could send, that would be cool.

-Ross

--
Ross Nordeen
Computer Networking And Systems Administration
Michigan Technological University
http://www.linkedin.com/in/rjnordee

----- Original Message -----
From: "Matt Davies" <m...@mattdavies.net>
To: user@pig.apache.org
Sent: Monday, July 11, 2011 1:34:29 PM GMT -08:00 US/Canada Pacific
Subject: Re: GeoIP database lookups

I can't really share the code, but I can tell you the general approach that works well.

When the UDF is instantiated, create a HashMap or the like with the data you need. So, you'll do a DEFINE GEO com.xyz.Geo('$filename') in your Pig code. The details vary depending on how you are looking up the data: you could be looking up based on IP address or the actual ID of the location. Loading is a one-time hit, and in our case it is very fast and not even noticed. Then, as each tuple hits the exec method, it becomes a quick lookup.

As for why HDFS: we found that keeping things synced caused too many issues in our shop, and it was much easier to read the file out of HDFS. So, for instance, if a job has 1000 mappers, you read the file 1000 times from HDFS. True, you get performance gains reading from the classpath of the jar, but, as with all of programming, there are tradeoffs. This format worked best for us in our release structure. For example, a general UDF like this can take different input files depending on the job. Or, to be even faster, we may filter out all non-US data from the locations, so different files are used. YMMV.

Hope that helps some.

On Mon, Jul 11, 2011 at 2:18 PM, Ross Nordeen <rjnor...@mtu.edu> wrote:

> Matt,
>
> So don't ship the GeoIP database with the jar? Does your mapper then cache
> the locations.csv? Would you mind sending me your UDF? That sounds like an
> interesting solution but I don't really understand how you would do that. I
> was under the impression the fastest way to do it would be to ship and cache
> the binary database instead of calling from the HDFS.
>
> -Ross
>
> ----- Original Message -----
> From: "Matt Davies" <m...@mattdavies.net>
> To: user@pig.apache.org, "Ross" <rjnord...@semesteratsea.net>
> Sent: Monday, July 11, 2011 12:34:38 PM GMT -08:00 US/Canada Pacific
> Subject: Re: GeoIP database lookups
>
> We wrote a snazzy UDF that does 1 initialization per mapper and does all the
> necessary conversions. Quite efficient and fast.
>
> The trick to maintainability is to have your UDF initialize the
> locations.csv from HDFS and not to include the csv file within your jar.
> That way you can easily update the locations without recompiling.
>
> -Matt
>
> On Mon, Jul 11, 2011 at 12:57 PM, Ross Nordeen <rjnor...@mtu.edu> wrote:
>
> > Hello all,
> >
> > Is there an accepted way to use the GeoIP database with Pig?
> >
> > I've found some people have tried to write UDFs with their Java API.
> > http://www.maxmind.com/java
> >
> > Others say to use the streaming interface within Pig and run the queries
> > through a Perl script.
> >
> > http://www.cloudera.com/blog/2009/06/analyzing-apache-logs-with-pig/#comments
> >
> > I'm just trying to find the most efficient way to run this. Any ideas?
> >
> > -Ross
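
For anyone finding this thread later, here is a minimal sketch of the pattern Matt describes: build a HashMap once when the object is constructed, then do a cheap map lookup for each record. It is plain Java so it runs without Pig or Hadoop on the classpath, and it is not Matt's code. The class name GeoLookupSketch, the Reader-based constructor, and the two-column "locId,country" layout are all assumptions made for illustration. In a real Pig UDF the class would presumably extend org.apache.pig.EvalFunc, take the filename string passed through DEFINE, and obtain its Reader by opening that path on HDFS (for example via Hadoop's FileSystem API).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Pig UDF; a real one would extend EvalFunc.
class GeoLookupSketch {
    // Location ID -> country code, filled once at construction time.
    private final Map<String, String> countryByLocationId = new HashMap<>();

    // One-time initialization: read the whole lookup file into memory.
    // Assumed layout (not from the thread): "locId,country" per line.
    GeoLookupSketch(Reader locationsCsv) {
        try (BufferedReader in = new BufferedReader(locationsCsv)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(",", -1);
                if (fields.length < 2 || fields[0].isEmpty()) {
                    continue; // skip blank or malformed lines
                }
                countryByLocationId.put(fields[0].trim(), fields[1].trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Per-record work: a single HashMap lookup, null if the ID is unknown.
    String exec(String locationId) {
        if (locationId == null) {
            return null;
        }
        return countryByLocationId.get(locationId.trim());
    }

    public static void main(String[] args) {
        // In-memory data stands in for the file that would live on HDFS.
        GeoLookupSketch geo = new GeoLookupSketch(new StringReader("1,US\n2,CA\n"));
        System.out.println(geo.exec("1")); // prints US
    }
}
```

Because the data file is passed in rather than bundled in the jar, the same class can be pointed at different files per job (for instance a US-only subset), which is the tradeoff Matt mentions.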