getNumberOfValues() returns the number of non-null values. The "and repeated values" wording in the Javadoc is incorrect and should be fixed.
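
To make the distinction concrete, here is a minimal Python sketch. It is not ORC code: `summarize` is a hypothetical helper, and the list simply copies the 12 rows from the query output quoted below, with None standing in for each NULL ("?"). It compares the row count, the non-null count, and the distinct count for that column.

```python
# Values copied from the quoted query output; None stands in for SQL NULL ("?").
s_rec_end_date = [
    "1999-03-13", "1999-03-13",
    "2000-03-12", "2000-03-12",
    "2001-03-12", "2001-03-12",
    None, None, None, None, None, None,
]


def summarize(values):
    """Hypothetical helper: count rows, non-null values, and distinct non-null values."""
    non_null = [v for v in values if v is not None]
    return {
        "rows": len(values),              # like SELECT COUNT(*)
        "non_null": len(non_null),        # like SELECT COUNT(column)
        "distinct": len(set(non_null)),   # like SELECT COUNT(DISTINCT column)
    }


counts = summarize(s_rec_end_date)
print(counts)
```

The non-null count is 6, which matches the value reported for getNumberOfValues() in the example; the distinct count is 3 and the row count is 12.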

.. Owen

On Wed, Apr 6, 2016 at 11:28 AM, Dave Birdsall <[email protected]> wrote:

> Hi,
>
> I have a question about the getNumberOfValues() method of the
> ColumnStatistics interface.
>
> In the Hive documentation (for example, here:
> https://hive.apache.org/javadocs/r0.12.0/api/org/apache/hadoop/hive/ql/io/orc/ColumnStatistics.html),
> the method is described as returning “the number of values in this column”.
> Under Method Detail, it says, “it will differ from the number of rows
> because of NULL values and repeated values.”
>
> My question concerns “repeated values”.
>
> Being an SQL guy, I leap to the conclusion that getNumberOfValues()
> returns the equivalent of “select count(distinct column) from orc_table”,
> that is, the number of distinct values for that column in the table. (Well,
> for ORC it is for a particular stripe of the table, but I hope my meaning
> gets across.)
>
> But when I experiment with this API, it seems to be returning the number
> of non-null values instead. For example, using the Trafodion SQL engine to
> query an example Hive table using ORC files, I see:
>
> >>select s_rec_end_date from hive.hive.store2_orc order by s_rec_end_date;
>
> S_REC_END_DATE
> --------------
>
> 1999-03-13
> 1999-03-13
> 2000-03-12
> 2000-03-12
> 2001-03-12
> 2001-03-12
> ?
> ?
> ?
> ?
> ?
> ?
>
> --- 12 row(s) selected.
>
> But when I look at what ColumnStatistics.getNumberOfValues() returns for
> this column, I get 6. (This particular example table has just one stripe.)
> Looking at the values, though, there are just 3 distinct values here.
>
> So, my question is: Is it the case that
> ColumnStatistics.getNumberOfValues() returns the number of non-null values
> in a column (in a given stripe)? And the Hive documentation is incorrect
> when it mentions “repeated values”?
>
> Thanks,
>
> Dave
