Hello,

This is my Pig script:

DEFINE iplookup `wrapper.sh GeoIP`
    ship('wrapper.sh')
    cache('/GeoIP/GeoIPcity.dat#GeoIP')
    input(stdin using PigStreaming(','))
    output(stdout using PigStreaming(','));
A = LOAD 'log'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('default:body', '-gt=_f:squid_t:20110920 -loadKey')
    AS (rowkey, data);
B = FILTER A BY rowkey MATCHES '.*_s:204-.*';
C = FOREACH B {
    t = REGEX_EXTRACT(data, '([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}):([0-9]+) ', 1);
    GENERATE rowkey, t;
}
D = STREAM C THROUGH iplookup;
STORE D INTO 'geoip_pig'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('location:ip,location:country_code,location:country_code3,location:country_name,location:region,location:city,location:postal_code,location:latitude,location:longitude,location:area_code,location:metro_code');

There are 11 columns in my final table/column family (the STORE). Some jobs (2 out of 46) end with:

java.lang.IndexOutOfBoundsException: Index: 11, Size: 11
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.pig.backend.hadoop.hbase.HBaseStorage.putNext(HBaseStorage.java:666)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:269)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)

Most of the jobs ended successfully.

In src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java, around line 666 (damn!):

for (int i = 1; i < t.size(); ++i) {
    ColumnInfo columnInfo = columnInfo_.get(i - 1);
    if (LOG.isDebugEnabled()) {
        LOG.debug("putNext - tuple: " + i + ", value=" + t.get(i)
                + ", cf:column=" + columnInfo);
    }

Is it possible that columnInfo_ and t are not the same size? In which case?

Regards,
--
Damien
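PS: one way the two could differ in size, I suspect: that loop indexes columnInfo_ with i-1 while i runs up to t.size()-1, so any streamed record with more fields than the 11 declared columns (plus the rowkey) would overrun columnInfo_. With PigStreaming(','), a single comma inside a field value produced by wrapper.sh (GeoIP city names can contain commas) would add an extra field. A minimal sketch of that suspicion, with hypothetical values, assuming the streamed line is split on the delimiter the way PigStreaming(',') does:

```java
// Sketch only, not the actual HBaseStorage/PigStreaming code: shows how a
// record streamed back through a comma delimiter can carry more fields than
// the 11 STORE columns + rowkey, if a field value itself contains a comma.
public class FieldCountSketch {

    // Hypothetical clean record: rowkey + 11 location fields = 12 fields.
    static final String CLEAN =
        "row1,1.2.3.4,US,USA,United States,VA,Arlington,22201,38.88,-77.10,703,511";

    // Hypothetical dirty record: the city value "Washington, DC" contains a
    // comma, so splitting yields 13 fields instead of 12.
    static final String DIRTY =
        "row1,1.2.3.4,US,USA,United States,DC,Washington, DC,20001,38.90,-77.03,202,511";

    static int fieldCount(String line) {
        // Split on the streaming delimiter; limit -1 keeps trailing empties.
        return line.split(",", -1).length;
    }

    public static void main(String[] args) {
        System.out.println(fieldCount(CLEAN)); // 12: rowkey + 11 columns, as expected
        System.out.println(fieldCount(DIRTY)); // 13: the embedded comma adds a field,
                                               // so t.size() - 1 > columnInfo_.size()
                                               // and columnInfo_.get(11) throws
    }
}
```

If that is the cause, t would be one field larger than columnInfo_ only for the records whose values contain the delimiter, which would explain why just 2 of the 46 jobs fail.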