Hi,

I verified this on 1.1.1 using TPC-H tables with 1 GB of generated data, and got the result below. I don't have the exact schema you mentioned, so I verified with the original TPC-H schema instead.

0: jdbc:hive2://localhost:10000> select count(c_CustKey),count(o_CustKey) from customer, orders where o_Custkey = c_CustKey;
+-------------------+-------------------+--+
| count(c_CustKey)  | count(o_CustKey)  |
+-------------------+-------------------+--+
| 1500000           | 1500000           |
+-------------------+-------------------+--+

On parquet with same data.

0: jdbc:hive2://localhost:10000> select count(c_CustKey),count(o_CustKey) from customer, orders where o_Custkey = c_CustKey;
+-------------------+-------------------+--+
| count(c_CustKey)  | count(o_CustKey)  |
+-------------------+-------------------+--+
| 1500000           | 1500000           |
+-------------------+-------------------+--+

Regards,
Ravindra.

On 23 August 2017 at 19:40, Swapnil Shinde <[email protected]> wrote:

> Hello All
> We are observing incorrect query results with carbondata 1.1.1. Please
> find details below -
>
> *Datasets used -*
> TPC-H star schema based datasets
> (http://www.cs.umb.edu/~poneil/StarSchemaB.PDF)
>
> *Query -*
>     select cCustKey,loCustKey from customer, lineorder where loCustkey = cCustKey
>
> *How we load data -*
> We validated loading data through dataframe and "INSERT" statements
> and both ways produce incorrect results. I am putting one way here-
>
> -- CREATE CUSTOMER TABLE
>
> carbon.sql("CREATE TABLE IF NOT EXISTS customer(cCustKey Int, cName string, cAddress string, cCity string, cNation string, cRegion string, cPhone string, cMktSegment string, dummy string) STORED BY 'carbondata'")
>
> carbon.sql("LOAD DATA INPATH '/xxxx/yyyy/tmp/ssb_raw/customer' INTO TABLE customer OPTIONS('DELIMITER'='\t','FILEHEADER'='cCustKey,cName,cAddress,cCity,cNation,cRegion,cPhone,cMktsegment,dummy')")
>
> -- CREATE LINEORDER TABLE
>
> carbon.sql("CREATE TABLE IF NOT EXISTS lineorder(loOrderkey bigint,loLinenumber Int,loCustkey Int,loPartkey Int,loSuppkey Int,loOrderdate Int,loOrderpriority String,loShippriority Int,loQuantity Int,loExtendedprice Int,loOrdtotalprice Int,loDiscount Int,loRevenue Int,loSupplycost Int,loTax Int,loCommitdate Int,loShipmode String,dummy String) STORED BY 'carbondata'")
>
> carbon.sql("LOAD DATA INPATH '/xxxx/yyyy/tmp/ssb_raw/lineorder' INTO TABLE lineorder OPTIONS('DELIMITER'='\t','FILEHEADER'='loOrderkey,loLinenumber,loCustkey,loPartkey,loSuppkey,loOrderdate,loOrderpriority,loShippriority,loQuantity,loExtendedprice,loOrdtotalprice,loDiscount,loRevenue,loSupplycost,loTax,loCommitdate,loShipmode,dummy')")
>
> *Results with different version -*
>
> *1.1.0 -* Provides correct results for above query. Validated with
> results from parquet.
>
> *1.1.1 -* Built from this
> <https://github.com/apache/carbondata/tree/apache-carbondata-1.1.1-rc1>.
> Join is missing lots of rows compared to parquet.
>
> *1.1.1 -* Built from source code available for download
> <https://dist.apache.org/repos/dist/release/carbondata/1.1.1/apache-carbondata-1.1.1-source-release.zip>.
> Join is missing lots of rows compared to parquet.
>
> *1.2 -* Built from master branch. Generated correct results similar
> to parquet.
>
> *Debugging further -*
>
> 1. Row counts for both lineOrder and customer tables are same.
>
> 2. If I try to find out key column in carbondata vs parquet then it is
>    matching as well -
>
>     val cd = carbon.sql("select cCustKey from customer")   //.distinct.count -- 30,000,000
>     val sp = spark.sql("select cCustKey from pcustomer")   //.distinct.count -- 30,000,000
>     cd.intersect(sp)   -- 30,000,000 (carbon data has same keys compared to parquet)
>
>     val cd = carbon.sql("select loCustKey from lineorder")   //.distinct.count -- 13,365,986
>     val sp = spark.sql("select loCustKey from plineorder")   //.distinct.count -- 13,365,986
>     cd.intersect(sp)   -- 13,365,986 (carbon data has same keys compared to parquet)
>
> Above query shows that carbondata customer and lineitem has same key
> values compared to parquet.
>
> However, when you run above join query, carbondata generates very small
> subset of expected rows. If we run filter query for any specific key then
> that also returns no results.
>
> Not sure why v1.1.1 is producing incorrect results. My guess is that
> carbondata is skipping rows that it shouldn't in v1.1.1.
>
> Any help and suggestions are very much appreciated!! Thanks in advance..
>
> Thanks
> Swapnil Shinde

--
Thanks & Regards,
Ravi
