Indexes, again

Peter Marron Mon, 27 Jan 2014 06:32:14 -0800

Hi,

I am using Hadoop 1.0.4 and Hive 0.11.0.


I am trying to create my own indexes. Given the problems that I have had in the
past, I thought it best to proceed slowly. So I created my own class, derived
from TableBasedIndexHandler. I copied all the methods from CompactIndexHandler
and added lots of System.out.println calls so that I could see what was going
on. So this is, effectively, an instrumented copy of CompactIndexHandler.
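
To give an idea of what "instrumented" means here, the pattern is roughly the
following. This is only a minimal sketch, not the actual class:
StandInTableBasedIndexHandler is a made-up stand-in for Hive's
TableBasedIndexHandler so that the snippet compiles without Hive on the
classpath, and only usesIndexTable is shown. The class names
InstrumentedHandlerSketch and ProfilerIndexSketch are likewise hypothetical.

```java
class InstrumentedHandlerSketch {

    // Hypothetical stand-in for Hive's TableBasedIndexHandler (assumption:
    // the real class lives in org.apache.hadoop.hive.ql.index and has many
    // more methods; only the one traced below is modelled here).
    static abstract class StandInTableBasedIndexHandler {
        public boolean usesIndexTable() {
            return true;
        }
    }

    // The instrumented handler: each overridden method prints a trace line
    // before returning, so every call into the handler shows up on the console.
    static class ProfilerIndexSketch extends StandInTableBasedIndexHandler {
        @Override
        public boolean usesIndexTable() {
            boolean result = super.usesIndexTable();
            System.out.println("My usesIndexTable - returning " + result + "!");
            return result;
        }
    }

    public static void main(String[] args) {
        // Prints: My usesIndexTable - returning true!
        new ProfilerIndexSketch().usesIndexTable();
    }
}
```

With every method wrapped like this, a code path that produces no trace output
has not passed through the handler at all.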

When I try to create an index using 'compact', most things seem to work:

> DROP INDEX champions_attendance ON champions;
OK
Time taken: 0.139 seconds
hive> CREATE INDEX champions_attendance ON TABLE champions(attendance) AS 'compact' WITH DEFERRED REBUILD;
OK
Time taken: 0.173 seconds
hive> SHOW INDEX ON champions;
OK
champions_attendance    champions               attendance              
default__champions_champions_attendance__       compact
Time taken: 0.073 seconds, Fetched: 1 row(s)
hive> SHOW FORMATTED INDEX ON champions;
OK
idx_name                tab_name                col_names               
idx_tab_name            idx_type                comment


champions_attendance    champions               attendance              
default__champions_champions_attendance__       compact
Time taken: 0.067 seconds, Fetched: 4 row(s)
hive>

However, when I try the same thing with my class, things start out promisingly:

Time taken: 0.149 seconds
hive> CREATE INDEX champions_attendance ON TABLE champions (attendance) AS 'com.trilliumsoftware.profiling.index.ProfilerIndex' WITH DEFERRED REBUILD;
My usesIndexTable - returning true!
My analyzeIndexDefinitionYYY
table ->Table(tableName:champions, dbName:default, owner:pmarron, 
createTime:1390214100, lastAccessTime:0, retention:0, 
sd:StorageDescriptor(cols:[FieldSchema(name:year, type:string, comment:null), 
FieldSchema(name:home, type:string, comment:null), FieldSchema(name:away, 
type:string, comment:null), FieldSchema(name:score, type:string, comment:null), 
FieldSchema(name:venue, type:string, comment:null), 
FieldSchema(name:attendance, type:string, comment:null)], 
location:hdfs://hpcluster1/user/pmarron/Ex/data, 
inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
parameters:{serialization.format=,, field.delim=,}), bucketCols:[], 
sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], 
skewedColValues:[], skewedColValueLocationMaps:{}), 
storedAsSubDirectories:false), partitionKeys:[], parameters:{EXTERNAL=TRUE, 
transient_lastDdlTime=1390214100}, viewOriginalText:null, 
viewExpandedText:null, tableType:EXTERNAL_TABLE)<-
index ->Index(indexName:champions_attendance, 
indexHandlerClass:com.trilliumsoftware.profiling.index.ProfilerIndex, 
dbName:default, origTableName:champions, createTime:1390832429, 
lastAccessTime:1390832429, 
indexTableName:default__champions_champions_attendance__, 
sd:StorageDescriptor(cols:[FieldSchema(name:attendance, type:string, 
comment:null)], location:null, 
inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat, 
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
parameters:{serialization.format=,, field.delim=,}), bucketCols:null, 
sortCols:[Order(col:attendance, order:1)], parameters:{}, 
skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], 
skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:{}, 
deferredRebuild:true)<-
My usesIndexTable - returning true!
usesIndexTable ->true<-
indexTable ->Table(tableName:default__champions_champions_attendance__, 
dbName:default, owner:null, createTime:0, lastAccessTime:0, retention:0, 
sd:StorageDescriptor(cols:[], location:null, 
inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, 
outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, 
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe, 
parameters:{serialization.format=1}), bucketCols:[], sortCols:[], 
parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], 
skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{}, 
viewOriginalText:null, viewExpandedText:null, tableType:INDEX_TABLE)<-
storageDesc ->StorageDescriptor(cols:[FieldSchema(name:attendance, type:string, 
comment:null)], location:null, 
inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat, 
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
parameters:{serialization.format=,, field.delim=,}), bucketCols:null, 
sortCols:[Order(col:attendance, order:1)], parameters:{}, 
skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], 
skewedColValueLocationMaps:{}), storedAsSubDirectories:false)<-
My usesIndexTable - returning true!
Going into the branch
My analyzeIndexDefinition OUT
My usesIndexTable - returning true!
OK
Time taken: 0.263 seconds
hive>
    >
But then things seem to go wrong.
Time taken: 0.149 seconds
    > SHOW INDEX ON champions;
FAILED: Error in metadata: java.lang.NullPointerException
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive>

I have instrumented all of the method calls, so the fact that I don't see any
tracing suggests that none of my code is on the path that makes this fail. So I
am at a loss to know where to start.
Is there some other sort of registration of my index handler class that I have
to make somewhere?

If I ignore this error and carry on, then the command

                ALTER INDEX champions_attendance ON champions REBUILD;

seems to succeed _and_ build an index. However, when I issue a query on my
indexed table:

    > SELECT * FROM champions WHERE attendance=50000;
OK
2000    Real Madrid     Valencia        3-0     Paris   50000
1980    Nottingham Forest       Hamburg 1-0     Madrid  50000
1975    Bayern Munich   Leeds Utd       2-0     Paris   50000
1970    Feyenoord       Celtic  2-1 (aet)       Milan   50000
1969    AC Milan        Ajax    04-Jan  Madrid  50000
Time taken: 0.158 seconds, Fetched: 5 row(s)

it doesn't seem to go into my index method generateIndexQuery,
which was what I was hoping to achieve. Maybe this is for the same
reason that the SHOW INDEX fails?

I guess that I could build Hive and try to debug it, but I haven't built Hive
before, and I'm worried that doing so will mean that I have to move to the
latest version and then to Hadoop 2, which in turn will mean spending
some time upgrading my cluster.

Is there anyone who can throw any light on my problems? Or suggest
any way forward?

All feedback welcome.

Peter Marron

Office: +44 (0) 118-940-7609  
peter.mar...@trilliumsoftware.com
Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK