Hi,


I have a question about the getNumberOfValues() method of the
ColumnStatistics interface.



In the Hive documentation (for example, here:
https://hive.apache.org/javadocs/r0.12.0/api/org/apache/hadoop/hive/ql/io/orc/ColumnStatistics.html),
the method is described as returning “the number of values in this column”.
Under Method Detail, it says, “it will differ from the number of rows
because of NULL values and repeated values.”



My question concerns “repeated values”.



Being an SQL guy, I leap to the conclusion that getNumberOfValues() returns
the equivalent of “select count(distinct column) from orc_table”, that is,
the number of distinct values for that column in the table. (Well, for ORC
it is for a particular stripe of the table, but I hope my meaning gets
across.)



But when I experiment with this API, it seems to be returning the number of
non-null values instead. For example, using the Trafodion SQL engine to
query an example Hive table using ORC files, I see:



>>select s_rec_end_date from hive.hive.store2_orc order by s_rec_end_date;

S_REC_END_DATE
--------------
    1999-03-13
    1999-03-13
    2000-03-12
    2000-03-12
    2001-03-12
    2001-03-12
?
?
?
?
?
?

--- 12 row(s) selected.



But when I look at what ColumnStatistics.getNumberOfValues() returns for
this column, I get 6. (This particular example table has just one stripe.)
Looking at the values, though, there are just 3 distinct values here.
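
To make the three candidate meanings concrete, here is a small stand-alone Java sketch that computes the row count, the non-null count, and the distinct count for the sample column above. It does not call the ORC API at all; the class and method names are made up for illustration, and null stands in for the "?" rows in the query output.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class ColumnCounts {
    // The 12 rows of s_rec_end_date shown above; null represents a "?" row.
    static final List<String> S_REC_END_DATE = Arrays.asList(
        "1999-03-13", "1999-03-13",
        "2000-03-12", "2000-03-12",
        "2001-03-12", "2001-03-12",
        null, null, null, null, null, null);

    // Equivalent of "select count(*)": every row, including NULLs.
    static long rowCount(List<String> column) {
        return column.size();
    }

    // Equivalent of "select count(column)": NULLs are skipped, repeats are kept.
    static long nonNullCount(List<String> column) {
        return column.stream().filter(Objects::nonNull).count();
    }

    // Equivalent of "select count(distinct column)": NULLs and repeats are skipped.
    static long distinctCount(List<String> column) {
        return column.stream().filter(Objects::nonNull).distinct().count();
    }

    public static void main(String[] args) {
        System.out.println("rows     = " + rowCount(S_REC_END_DATE));
        System.out.println("non-null = " + nonNullCount(S_REC_END_DATE));
        System.out.println("distinct = " + distinctCount(S_REC_END_DATE));
    }
}
```

For this data the sketch gives 12 rows, 6 non-null values, and 3 distinct values, so the 6 I get from getNumberOfValues() lines up with the non-null count rather than the distinct count.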



So, my question is: does ColumnStatistics.getNumberOfValues() return the
number of non-null values in a column (in a given stripe)? And if so, is
the Hive documentation incorrect when it mentions “repeated values”?



Thanks,



Dave
