Oh I had very little data (a few MB) - I am just testing whether features work at all before I try it on bigger data (for example compression does not work for me http://wiki.apache.org/hadoop/Hive/CompressedStorage ; also not sure whether indexes work on compressed files).

The index got used because the second query

SELECT key, col2 FROM table WHERE key=100    -- no GROUP BY used

was almost instant as opposed to when not using the index (then it did one full map scan). Not sure what the total speedup is but there SHOULD be speed up for some queries or otherwise indexes would be useless. Try comparing a query with just a WHERE condition without a GROUP BY.

Martin

On 08/06/2011 18:08, Guillaume WEILL wrote:
Thanks for your reply.

Could you say me how much time have you saved with the index and the time used on a query without query? The amount of data of your table could be helpful too. This is to verify your point of view about the amount of data because I am really not impressed by performance of my index.
Personally I work on 100 GB.

Guillaume



2011/6/8 Martin Konicek <[email protected] <mailto:[email protected]>>

    Hi,

    I was testing indexes today as well and the index definitely got
    used. You should be able to see this when you run two separate
    queries:

    INSERT OVERWRITE DIRECTORY "/tmp/index-result2" ...
    SELECT ...

    The SELECT was faster for me than without the index. In your case
    the time might be spent in the GROUP BY and maybe you have little
    data so the times look the same.

    What is not so good is that index can't be partitioned on
    different columns than the table. E.g. I would like to partition
    the table on date and the index on region (I can't partition the
    table on both date&region bc there are thousands of regions and
    that would create huge directory structure in HDFS, which I read
    is not recommended).

    Martin


    On 08/06/2011 11:28, Guillaume WEILL wrote:

        Hi,

        I want to test the use of indexes in hive. For this I created
        anindex, I launched a first query above, I changed the
        settings on Hive and ran my query on my database table:

        CREATE INDEX index ON TABLE table(key) as 'COMPACT' WITH
        DEFERRED REBUILD;
        ALTER INDEX index ON table REBUILD;
        INSERT OVERWRITE DIRECTORY "/tmp/index-result2" SELECT
        `_bucketname` ,  `_offsets` FROM default__table_index__ x
        WHERE x.key=100;

        SET hive.index.compact.file=/tmp/index_result2;
        SET
        
hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
        SELECT key, col2, sum(col3) FROM tableWHERE key=100  group BY
        col2;



        No error but I am not sure that the index is really used.
        Indeed I get the same performance with and without the index.
        When I look at the logs (tasktracker, datanode, job_config), I
        see no call to the directory / tmp / index_result2.
        How do I know if my index has been really used?

        Thanks for your help,

-- Guillaume WEILL

Reply via email to