The table was created in spark-sql as ORC table
use asehadoop;
drop table if exists tt;
create table tt (
owner varchar(30)
,object_name varchar(30)
,subobject_name varchar(30)
,object_id bigint
,data_object_id bigint
,object_type varchar(19)
,created timestamp
,last_ddl_time timestamp
,timestamp varchar(19)
,status varchar(7)
,temporary2 varchar(1)
,generated varchar(1)
,secondary varchar(1)
,namespace bigint
,edition_name varchar(30)
,padding1 varchar(4000)
,padding2 varchar(3500)
,attribute varchar(32)
,op_type int
,op_time timestamp
)
CLUSTERED BY (object_id) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="SNAPPY",
"transactional"="true",
"orc.create.index"="true",
"orc.bloom.filter.columns"="object_id",
"orc.bloom.filter.fpp"="0.05",
"orc.stripe.size"="268435456",
"orc.row.index.stride"="10000" )
;
show create table tt;
INSERT INTO TABLE tt
SELECT
owner
, object_name
, subobject_name
, object_id
, data_object_id
, object_type
, cast(created AS timestamp)
, cast(last_ddl_time AS timestamp)
, timestamp
, status
, temporary2
, generated
, secondary
, namespace
, edition_name
, padding1
, padding2
, attribute
, 1
, cast(from_unixtime(unix_timestamp()) AS timestamp)
FROM t_staging
;
And it was analysed ass below
hive> analyze table tt compute statistics;
Table asehadoop.tt stats: [numFiles=30, numRows=1767886, totalSize=88388380,
rawDataSize=5984968162]
OK
Time taken: 0.241 seconds
HTH
Mich Talebzadeh
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
<http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf>
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one
out shortly
<http://talebzadehmich.wordpress.com/> http://talebzadehmich.wordpress.com
NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this message
shall not be understood as given or endorsed by Peridale Technology Ltd, its
subsidiaries or their employees, unless expressly so stated. It is the
responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any
responsibility.
From: Jörn Franke [mailto:[email protected]]
Sent: 01 December 2015 16:58
To: [email protected]
Subject: Re: Using spark in tandem with Hive
How did you create the tables? Do you have automated statistics activated in
Hive?
Btw mr is outdated as a Hive execution engine. Use TEZ (maybe wait for 0.8 for
sub second queries ) or use Spark as an execution engine in Hive.
On 01 Dec 2015, at 17:40, Mich Talebzadeh <[email protected]
<mailto:[email protected]> > wrote:
What if we decide to use spark with Hive. I look to hear similar views
My test bed comprised
1. Spark version 1.5.2
2. Hive version 1.2.1
3. Hadoop version 2.6
I made Spark to use Hive metastore. So using spark-sql I can pretty do whatever
one can do with HiveQL
I created and populated an ORC table in spark-sql.. It took 90 seconds to
create and populate the table with 1.7 million rows
spark-sql> select count(1) from tt;
1767886
Time taken: 5.169 seconds, Fetched 1 row(s)
Now let me try to do the said operation on the same table with HCL and MR
hive> use asehadoop;
OK
Time taken: 0.639 seconds
hive> select count(1) from tt;
Query ID = hduser_20151201162717_e3102633-f501-413b-b9cb-384ac50880ac
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1448969636093_0001, Tracking URL =
http://rhes564:8088/proxy/application_1448969636093_0001/
Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill
job_1448969636093_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-12-01 16:27:27,154 Stage-1 map = 0%, reduce = 0%
2015-12-01 16:27:35,427 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.65
sec
2015-12-01 16:27:41,611 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.71
sec
MapReduce Total cumulative CPU time: 3 seconds 710 msec
Ended Job = job_1448969636093_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.71 sec HDFS Read: 520151
HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 710 msec
OK
1767886
Time taken: 25.635 seconds, Fetched: 1 row(s)
So 5 seconds in Spark versus 25 seconds in Hive
On a point query Hive does not seem to return the correct timing?
hive> select * from tt where data_object_id = 10;
Time taken: 0.063 seconds, Fetched: 72 row(s)
Whereas in Spark I get
spark-sql> select * from tt where data_object_id = 10;
Time taken: 9.002 seconds, Fetched 72 row(s)
9 seconds looks far more plausible to me than 0.063 seonds. Or in an unlikely
event Spark returns elapsed time, whereas Hive returns execution time?
Thanks
Mich Talebzadeh
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
<http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf>
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one
out shortly
<http://talebzadehmich.wordpress.com/> http://talebzadehmich.wordpress.com
NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this message
shall not be understood as given or endorsed by Peridale Technology Ltd, its
subsidiaries or their employees, unless expressly so stated. It is the
responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any
responsibility.