It is possibly because missing a SerDe propterty when you creating HIVE tables for TPC-DS: 'serialization.null.format'=''
For example, CREATE TABLE call_center ( cc_call_center_sk BIGINT, cc_call_center_id STRING, cc_rec_start_date STRING, cc_rec_end_date STRING, cc_closed_date_sk BIGINT, cc_open_date_sk BIGINT, cc_name STRING, cc_class STRING, cc_employees BIGINT, cc_sq_ft BIGINT, cc_hours STRING, cc_manager STRING, cc_mkt_id BIGINT, cc_mkt_class STRING, cc_mkt_desc STRING, cc_market_manager STRING, cc_division BIGINT, cc_division_name STRING, cc_company BIGINT, cc_company_name STRING, cc_street_number STRING, cc_street_name STRING, cc_street_type STRING, cc_suite_number STRING, cc_city STRING, cc_county STRING, cc_state STRING, cc_zip STRING, cc_country STRING, cc_gmt_offset DOUBLE, cc_tax_percentage DOUBLE ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('field.delim'='|','serialization.null.format'=''); From: Jesse F Chen [mailto:jfc...@us.ibm.com] Sent: Thursday, March 3, 2016 10:39 AM To: spark users <user@spark.apache.org> Subject: select count(*) return wrong row counts I am finding a strange issue with Spark SQL where "select count(*) " returns wrong row counts for certain tables. I am using TPCDS tables, so here are the actual counts: Row counts in raw generated files Spark SQL tables count(*) Parquet Spark SQL tables count(*) Text call_center 6 0 0 catalog_page 11718 11515 11515 catalog_returns 144067 138352 138352 catalog_sales 1441548 1427257 1427257 customer 100000 93063 93063 customer_address 50000 48444 48444 customer_demographics 1920800 1920800 1920800 date_dim 73049 73049 73049 household_demographics 7200 7200 7200 income_band 20 20 20 inventory 11745000 11158087 11158087 item 18000 17917 17917 promotion 300 289 289 reason 35 35 35 ship_mode 20 20 20 store 12 3 3 store_returns 287514 267471 267471 store_sales 2880404 2620573 2620573 time_dim 86400 86400 86400 warehouse 5 4 4 web_page 60 21 21 web_returns 71763 65384 65384 web_sales 719384 719025 719025 web_site 30 25 25 call_center returned 0 count :( The code used to do the count is fairly simple: df.registerTempTable(tablename) println("registered tempTable") val rc=sqlContext.sql("select count(*) from "+tablename) rc.map(t => "row count table "+tablename+": "+ t(0)).collect().foreach(println) This made many tpcds-derived queries return WRONG results. Wanted to know if anything key is missing here. Jesse [cid:image003.gif@01D175FF.8D6AEC00] JESSE CHEN Big Data Performance | IBM Analytics Office: 408 463 2296 Mobile: 408 828 9068 Email: jfc...@us.ibm.com<mailto:jfc...@us.ibm.com>