Re: Difference in count(*) result for KUDU and parquet

William Berkeley Wed, 09 May 2018 23:00:08 -0700

Hi Geetika. While I don't know anything about TPC-H data, when people load
data and see fewer rows it's usually because of duplicated primary keys.
Kudu, unlike Parquet, enforces a unique primary key constraint. What's the
schema for the Kudu table?
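
One way to check for this is to compare the total row count of the source
with the number of distinct values of the columns used as the Kudu primary
key. Below is a minimal sketch of that check, run against an in-memory SQLite
table rather than Impala so that it is self-contained. The table name
source_lineitem, the sample rows, and the helper function are hypothetical,
and the key columns (l_orderkey, l_linenumber) are only an assumption since I
haven't seen your schema. The same two counts can be run in Impala against
the Parquet table.

```python
import sqlite3


def count_rows_and_distinct_keys(conn, table, key_columns):
    """Return (total rows, distinct key combinations) for a table.

    If the second number is smaller than the first, loading the table into
    a store with a unique key on key_columns cannot keep every row.
    """
    keys = ", ".join(key_columns)
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {keys} FROM {table})"
    ).fetchone()[0]
    return total, distinct


# Hypothetical stand-in for the Parquet source table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source_lineitem ("
    "l_orderkey INTEGER, l_linenumber INTEGER, l_quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO source_lineitem VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 2, 5), (2, 1, 7), (2, 1, 7)],
)

# Assumed key choices; substitute the real Kudu primary key columns.
narrow = count_rows_and_distinct_keys(conn, "source_lineitem", ["l_orderkey"])
full = count_rows_and_distinct_keys(
    conn, "source_lineitem", ["l_orderkey", "l_linenumber"]
)
print("key = l_orderkey:", narrow)
print("key = l_orderkey, l_linenumber:", full)
```

If the distinct key count on the Parquet side matches the count(*) you see on
the Kudu side, duplicated keys are the likely explanation.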


Also, it might be useful to know which Kudu and Impala versions you are
using.

-Will

On Wed, May 9, 2018 at 10:03 PM, Geetika Gupta <[email protected]>
wrote:

> Hi community,
>
> We loaded data into Kudu, but the table we loaded it into has fewer rows
> than the source table. We executed the following command:
>
> insert into LINEITEM select * from PARQUETIMPALA500.LINEITEM
>
> This query was successful, but when we ran count(*) on both tables, the
> row counts were different:
>
> 0: jdbc:hive2://slave2:21050/default> select count(*) from lineitem
> . . . . . . . . . . . . . . . . . . > ;
> 536870912
>
> 0: jdbc:hive2://slave2:21050/default> select count(*) from
> parquetimpala500.lineitem;
> 3000028242
>
> We are loading 500 GB of TPC-H data into Kudu from a Parquet table.
>
> --
> Regards,
> Geetika Gupta
>
