Hi, I am using Hadoop 1.0.4 and Hive 0.11.0.
I am trying to create my own indexes. Given the problems that I have had in the past I thought it best to try and do things slowly. So I created my own class which derived from TableBasedIndexHandler. I copied all the methods from CompactIndexHandler but I added lots of System.out.printlns so that I could check and see what was going on. So this is, effectively, an instrumented copy of CompactIndexHandler.

When I try to create an index using compact most things seem to be working:

> DROP INDEX champions_attendance ON champions;
OK
Time taken: 0.139 seconds
hive> CREATE INDEX champions_attendance ON TABLE champions(attendance) AS 'compact' WITH DEFERRED REBUILD;
OK
Time taken: 0.173 seconds
hive> SHOW INDEX ON champions;
OK
champions_attendance champions attendance default__champions_champions_attendance__ compact
Time taken: 0.073 seconds, Fetched: 1 row(s)
hive> SHOW FORMATTED INDEX ON champions;
OK
idx_name tab_name col_names idx_tab_name idx_type comment
champions_attendance champions attendance default__champions_champions_attendance__ compact
Time taken: 0.067 seconds, Fetched: 4 row(s)
hive>

However when I try the same thing with my class things start promisingly:

Time taken: 0.149 seconds
hive> CREATE INDEX champions_attendance ON TABLE champions (attendance) AS 'com.trilliumsoftware.profiling.index.ProfilerIndex' WITH DEFERRED REBUILD;
My usesIndexTable - returning true!
My analyzeIndexDefinitionYYY
table ->Table(tableName:champions, dbName:default, owner:pmarron, createTime:1390214100, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:year, type:string, comment:null), FieldSchema(name:home, type:string, comment:null), FieldSchema(name:away, type:string, comment:null), FieldSchema(name:score, type:string, comment:null), FieldSchema(name:venue, type:string, comment:null), FieldSchema(name:attendance, type:string, comment:null)], location:hdfs://hpcluster1/user/pmarron/Ex/data, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1390214100}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)<-
index ->Index(indexName:champions_attendance, indexHandlerClass:com.trilliumsoftware.profiling.index.ProfilerIndex, dbName:default, origTableName:champions, createTime:1390832429, lastAccessTime:1390832429, indexTableName:default__champions_champions_attendance__, sd:StorageDescriptor(cols:[FieldSchema(name:attendance, type:string, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:null, sortCols:[Order(col:attendance, order:1)], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:{}, deferredRebuild:true)<-
My usesIndexTable - returning true!
usesIndexTable ->true<-
indexTable ->Table(tableName:default__champions_champions_attendance__, dbName:default, owner:null, createTime:0, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:null, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{}, viewOriginalText:null, viewExpandedText:null, tableType:INDEX_TABLE)<-
storageDesc ->StorageDescriptor(cols:[FieldSchema(name:attendance, type:string, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:null, sortCols:[Order(col:attendance, order:1)], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false)<-
My usesIndexTable - returning true!
Going into the branch
My analyzeIndexDefinition OUT
My usesIndexTable - returning true!
OK
Time taken: 0.263 seconds
hive>

But then things seem to go wrong.
Time taken: 0.149 seconds
> SHOW INDEX ON champions;
FAILED: Error in metadata: java.lang.NullPointerException
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive>

I have instrumented all of the method calls, so the fact that I don't see any tracing suggests that there isn't any of my code on the path that makes this fail. So I am at a loss to know where to start. Is there some other sort of registration of my index handler class that I have to make somewhere?

If I ignore this error and carry on then the command

ALTER INDEX champions_attendance ON champions REBUILD;

seems to succeed _and_ build an index. However when I issue a query on my indexed table:

> SELECT * FROM champions WHERE attendance=50000;
OK
2000 Real Madrid Valencia 3-0 Paris 50000
1980 Nottingham Forest Hamburg 1-0 Madrid 50000
1975 Bayern Munich Leeds Utd 2-0 Paris 50000
1970 Feyenoord Celtic 2-1 (aet) Milan 50000
1969 AC Milan Ajax 04-Jan Madrid 50000
Time taken: 0.158 seconds, Fetched: 5 row(s)

it doesn't seem to go into my index method generateIndexQuery, which was what I was hoping to achieve. Maybe this is for the same reason that the SHOW INDEX fails?

I guess that I could build Hive and try and debug it, but I haven't built Hive before and I'm worried that that will mean that I will have to move to the latest version and then move to Hadoop 2, and that that in turn will mean that I will spend some time upgrading my cluster.

Is there anyone who can throw any light on my problems? Or suggest any way forward? All feedback welcome.
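
For reference, the instrumentation pattern described in this message (a handler subclass whose methods print a trace line before doing their work) can be sketched as below. This is a minimal, self-contained illustration only: StandInIndexHandler is a hypothetical stand-in and not Hive's TableBasedIndexHandler, the method signatures are simplified assumptions so that the example compiles without Hive on the classpath, and the index table name is built by hand purely to mirror the name seen in the trace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the real base class; NOT Hive's TableBasedIndexHandler.
abstract class StandInIndexHandler {
    public boolean usesIndexTable() { return false; }
    public abstract String analyzeIndexDefinition(String baseTable, String indexName);
}

// Instrumented handler: every overridden method records and prints a trace line.
class InstrumentedHandler extends StandInIndexHandler {
    // Trace lines are kept in memory as well as printed, so they can be inspected.
    final List<String> trace = new ArrayList<>();

    private void log(String msg) {
        trace.add(msg);
        System.out.println(msg);
    }

    @Override
    public boolean usesIndexTable() {
        log("My usesIndexTable - returning true!");
        return true;
    }

    @Override
    public String analyzeIndexDefinition(String baseTable, String indexName) {
        log("My analyzeIndexDefinition table ->" + baseTable + "<- index ->" + indexName + "<-");
        // Assumption for illustration: name assembled to match the one shown in the trace.
        String indexTable = "default__" + baseTable + "_" + indexName + "__";
        log("My analyzeIndexDefinition OUT");
        return indexTable;
    }
}

class InstrumentedHandlerSketch {
    public static void main(String[] args) {
        InstrumentedHandler handler = new InstrumentedHandler();
        handler.usesIndexTable();
        System.out.println(handler.analyzeIndexDefinition("champions", "champions_attendance"));
    }
}
```

With tracing of this kind, a command that fails without printing any trace line never reached the handler, which is the reasoning applied to the SHOW INDEX failure above.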
Peter Marron
Office: +44 (0) 118-940-7609
peter.mar...@trilliumsoftware.com
Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
www.trilliumsoftware.com