This is possibly because a SerDe property is missing when you create the Hive tables
for TPC-DS:
'serialization.null.format'=''

For example,
CREATE TABLE call_center
(
    cc_call_center_sk         BIGINT,
    cc_call_center_id         STRING,
    cc_rec_start_date         STRING,
    cc_rec_end_date           STRING,
    cc_closed_date_sk         BIGINT,
    cc_open_date_sk           BIGINT,
    cc_name                   STRING,
    cc_class                  STRING,
    cc_employees              BIGINT,
    cc_sq_ft                  BIGINT,
    cc_hours                  STRING,
    cc_manager                STRING,
    cc_mkt_id                 BIGINT,
    cc_mkt_class              STRING,
    cc_mkt_desc               STRING,
    cc_market_manager         STRING,
    cc_division               BIGINT,
    cc_division_name          STRING,
    cc_company                BIGINT,
    cc_company_name           STRING,
    cc_street_number          STRING,
    cc_street_name            STRING,
    cc_street_type            STRING,
    cc_suite_number           STRING,
    cc_city                   STRING,
    cc_county                 STRING,
    cc_state                  STRING,
    cc_zip                    STRING,
    cc_country                STRING,
    cc_gmt_offset             DOUBLE,
    cc_tax_percentage         DOUBLE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim'='|','serialization.null.format'='');
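As a rough illustration of why that property matters (plain Python, not Hive internals; the sample row is made up), here is how the null-format setting changes which fields a LazySimpleSerDe-style reader treats as NULL in pipe-delimited dsdgen output:

```python
# Sketch only: mimics LazySimpleSerDe's null-format substitution,
# not the actual Hive code path.

def parse_row(line, null_format):
    """Split a '|'-delimited line; fields equal to null_format become None."""
    fields = line.rstrip("\n").split("|")
    return [None if f == null_format else f for f in fields]

# Hypothetical dsdgen-style row: NULLs are empty strings between delimiters.
row = "1|AAAAAAAABAAAAAAA|1998-01-01||2450952|NULL CENTER|"

# Hive's default null marker is the literal string '\N', so an empty
# field stays an empty string rather than becoming NULL:
default = parse_row(row, "\\N")

# With 'serialization.null.format'='' the empty field is read as NULL,
# matching how dsdgen encodes missing values:
tpcds = parse_row(row, "")
```

With the default marker, `default[3]` comes back as `""`; with the empty null format, `tpcds[3]` is `None`.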

From: Jesse F Chen [mailto:jfc...@us.ibm.com]
Sent: Thursday, March 3, 2016 10:39 AM
To: spark users <user@spark.apache.org>
Subject: select count(*) return wrong row counts


I am finding a strange issue with Spark SQL where "select count(*) " returns 
wrong row counts for certain tables.

I am using TPC-DS tables, so here are the actual counts:



Table                     Raw generated files   Parquet count(*)   Text count(*)
call_center                                 6                  0               0
catalog_page                            11718              11515           11515
catalog_returns                        144067             138352          138352
catalog_sales                         1441548            1427257         1427257
customer                               100000              93063           93063
customer_address                        50000              48444           48444
customer_demographics                 1920800            1920800         1920800
date_dim                                73049              73049           73049
household_demographics                   7200               7200            7200
income_band                                20                 20              20
inventory                            11745000           11158087        11158087
item                                    18000              17917           17917
promotion                                 300                289             289
reason                                     35                 35              35
ship_mode                                  20                 20              20
store                                      12                  3               3
store_returns                          287514             267471          267471
store_sales                           2880404            2620573         2620573
time_dim                                86400              86400           86400
warehouse                                   5                  4               4
web_page                                   60                 21              21
web_returns                             71763              65384           65384
web_sales                              719384             719025          719025
web_site                                   30                 25              25


call_center returned 0 count :(

The code used to do the count is fairly simple:

df.registerTempTable(tablename)
println("registered tempTable")
val rc = sqlContext.sql("select count(*) from " + tablename)
rc.map(t => "row count table " + tablename + ": " + t(0)).collect().foreach(println)
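For the raw-file side of the comparison, a minimal sketch (plain Python; the real `.dat` files would come from dsdgen, so the demo below substitutes a throwaway file) is simply counting lines:

```python
import os
import tempfile

def raw_row_count(path):
    """Count newline-terminated rows in a dsdgen-style .dat file."""
    with open(path) as f:
        return sum(1 for _ in f)

# Demo with a temporary file standing in for a file like call_center.dat:
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    tmp.write("1|a|\n2|b|\n3||\n")
    path = tmp.name

count = raw_row_count(path)  # compare against count(*) from Spark SQL
os.remove(path)
```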

This made many TPC-DS-derived queries return WRONG results.

I wanted to know if anything key is missing here.

Jesse


JESSE CHEN
Big Data Performance | IBM Analytics

Office: 408 463 2296
Mobile: 408 828 9068
Email: jfc...@us.ibm.com<mailto:jfc...@us.ibm.com>



