[jira] [Commented] (HIVE-13292) Different DOUBLE type precision issue between Spark and MR engine

2016-03-19 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197317#comment-15197317
 ] 

Xuefu Zhang commented on HIVE-13292:


Yeah. Doubles are implemented in different ways across programming languages. 
I'm not sure whether Scala does it differently from Java, but that seems to be 
the case. This issue seems insignificant, if it matters at all.

If users are concerned about minute differences like this, the decimal type is 
strongly recommended.
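
To illustrate the point about minute differences: one plausible source (an 
assumption here, not something confirmed in this thread) is that double 
addition is not associative, so two engines that add the same values in a 
different order can disagree in the last digits. A minimal Python sketch with 
made-up values, which also shows that a decimal type sums the same way in 
either order:

```python
from decimal import Decimal

# Illustrative values only; they are not taken from the TPC-H data.
# Same three doubles, added in two different groupings.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
orders_differ = left != right

# The same sums with exact decimal arithmetic.
dec_left = (Decimal("0.1") + Decimal("0.2")) + Decimal("0.3")
dec_right = Decimal("0.1") + (Decimal("0.2") + Decimal("0.3"))
decimals_agree = dec_left == dec_right

print(left, right, orders_differ)
print(dec_left, dec_right, decimals_agree)
```

The doubles differ only in the last digit, while the decimal sums are 
identical regardless of grouping.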

> Different DOUBLE type precision issue between Spark and MR engine
> -
>
> Key: HIVE-13292
> URL: https://issues.apache.org/jira/browse/HIVE-13292
> Project: Hive
> Issue Type: Bug
> Environment: Apache Hive 2.0.0, Apache Spark 1.6.0
> Reporter: Xin Hao
>
> The Spark and MR engines produce DOUBLE results that differ in precision.
> Found when executing TPC-H query 5 with scale factor 2 (2 GB data size). 
> Details are below.
> (1)The MR engine output:
> MOZAMBIQUE,1.0646195910990009E8
> ETHIOPIA,1.0108856206629996E8
> ALGERIA,9.987582690420012E7
> MOROCCO,9.785484184850013E7
> KENYA,9.412388077690017E7
> (2)The Spark engine output:
> MOZAMBIQUE,1.064619591099E8
> ETHIOPIA,1.0108856206630005E8
> ALGERIA,9.987582690419997E7
> MOROCCO,9.785484184850003E7
> KENYA,9.412388077690002E7
> (3)Detail SQL used:
> drop table if exists ${env:RESULT_TABLE};
> create table ${env:RESULT_TABLE} (
>   pid1 STRING,
>   pid2 DOUBLE
> )
> row format delimited fields terminated by ',' lines terminated by '\n'
> stored as ${env:HIVE_DEFAULT_FILEFORMAT_RESULT_TABLE}
> location '${env:RESULT_DIR}';
> insert into table ${env:RESULT_TABLE}
> select
>   n_name,
>   sum(l_extendedprice * (1 - l_discount)) as revenue
> from
>   customer,
>   orders,
>   lineitem,
>   supplier,
>   nation,
>   region
> where
>   c_custkey = o_custkey
>   and l_orderkey = o_orderkey
>   and l_suppkey = s_suppkey
>   and c_nationkey = s_nationkey
>   and s_nationkey = n_nationkey
>   and n_regionkey = r_regionkey
>   and r_name = 'AFRICA'
>   and o_orderdate >= '1993-01-01'
>   and o_orderdate < '1994-01-01'
> group by
>   n_name
> order by
>   revenue desc;
> (4)A similar issue persists even after we reduced the original query to the 
> simpler one below:
> drop table if exists ${env:RESULT_TABLE};
> create table ${env:RESULT_TABLE} (
>   pid2 DOUBLE
> )
> row format delimited fields terminated by ',' lines terminated by '\n'
> stored as ${env:HIVE_DEFAULT_FILEFORMAT_RESULT_TABLE}
> location '${env:RESULT_DIR}';
> insert into table ${env:RESULT_TABLE}
> select
>   sum(l_extendedprice * (1 - l_discount)) as revenue
> from
>   lineitem
> group by
>   l_orderkey
> order by
>   revenue;
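
For anyone comparing the two result sets quoted above, a small Python sketch 
may help show how close they are. The values are copied from the MR and Spark 
outputs in the description; the relative tolerance of 1e-12 is an arbitrary 
choice made for this illustration, not a value suggested anywhere in the 
issue.

```python
import math

# Values copied from the MR and Spark outputs in the issue description.
mr = {
    "MOZAMBIQUE": 1.0646195910990009E8,
    "ETHIOPIA": 1.0108856206629996E8,
    "ALGERIA": 9.987582690420012E7,
    "MOROCCO": 9.785484184850013E7,
    "KENYA": 9.412388077690017E7,
}
spark = {
    "MOZAMBIQUE": 1.064619591099E8,
    "ETHIOPIA": 1.0108856206630005E8,
    "ALGERIA": 9.987582690419997E7,
    "MOROCCO": 9.785484184850003E7,
    "KENYA": 9.412388077690002E7,
}

REL_TOL = 1e-12  # assumed tolerance, chosen only for this sketch

# Exact equality fails, but a tolerance-based comparison passes.
exactly_equal = all(mr[k] == spark[k] for k in mr)
close_enough = all(math.isclose(mr[k], spark[k], rel_tol=REL_TOL) for k in mr)

for name in mr:
    rel_diff = abs(mr[name] - spark[name]) / abs(mr[name])
    print(f"{name}: relative difference {rel_diff:.2e}")
print("exactly equal:", exactly_equal)
print("equal within tolerance:", close_enough)
```

Comparing with a tolerance rather than with exact equality is the usual way 
to validate DOUBLE results across engines.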



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13292) Different DOUBLE type precision issue between Spark and MR engine

2016-03-16 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196819#comment-15196819
 ] 

Sergey Shelukhin commented on HIVE-13292:
-

With the double type, this is usually by design.



