Yes, that's why I haven't had to compile anything.

On Wed, Dec 30, 2015 at 4:16 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> HDP should have TEZ on board already by default.
>
> On 30 Dec 2015, at 21:42, Marcin Tustin <mtus...@handybook.com> wrote:
>
> I'm afraid I use the HDP distribution so I haven't yet had to compile
> anything. (Incidentally, this isn't a recommendation of HDP over anything
> else).
>
> On Wed, Dec 30, 2015 at 3:33 PM, Mich Talebzadeh <m...@peridale.co.uk>
> wrote:
>
>> Thanks Marcin
>>
>>
>>
>> Trying to build TEZ 0.7 in
>>
>>
>>
>> /usr/lib/apache-tez-0.7.0-src
>>
>>
>>
>> using
>>
>>
>>
>> mvn -X clean package -DskipTests=true -Dmaven.javadoc.skip=true
>>
>>
>>
>> with mvn version 3.2.5 (as opposed to 3.3), as I read that it can be built
>> OK with 3.2.5, but I am still getting the same error as below
>>
>>
>>
>> mvn --version
>>
>> Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1;
>> 2014-12-14T17:29:23+00:00)
>>
>> Maven home: /usr/local/apache-maven/apache-maven-3.2.5
>>
>> Java version: 1.7.0_25, vendor: Oracle Corporation
>>
>> Java home: /usr/java/jdk1.7.0_25/jre
>>
>>
>>
>> I get this error:
>>
>>
>>
>> [INFO] tez-ui ............................................. FAILURE [  0.411 s]
>>
>> [DEBUG] -- end configuration --
>>
>> [INFO] Running 'npm install --color=false' in
>> /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp
>>
>> [INFO]
>> /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp/node/with_new_path.sh:
>> line 3: 23781 Aborted                 "$@"
>>
>>
>>
>>
>>
>> [ERROR] Failed to execute goal
>> com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on
>> project tez-ui: Failed to run task: 'npm install --color=false' failed.
>> (error code 134) -> [Help 1]
>>
>> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
>> goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on
>> project tez-ui: Failed to run task
>>
>>
>>
>>
>>
>> Any ideas? There is little info available on the net.
>>
>>
>>
>>
>>
>> Thanks
>>
>>
>>
>> Mich Talebzadeh
>>
>>
>>
>> *Sybase ASE 15 Gold Medal Award 2008*
>>
>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>
>>
>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>
>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>> 15", ISBN 978-0-9563693-0-7*.
>>
>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>> 978-0-9759693-0-4*
>>
>> *Publications due shortly:*
>>
>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>> 978-0-9563693-3-8
>>
>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
>> one out shortly
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> NOTE: The information in this email is proprietary and confidential. This
>> message is for the designated recipient only, if you are not the intended
>> recipient, you should destroy it immediately. Any information in this
>> message shall not be understood as given or endorsed by Peridale Technology
>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
>> the responsibility of the recipient to ensure that this email is virus
>> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
>> accept any responsibility.
>>
>>
>>
>> *From:* Marcin Tustin [mailto:mtus...@handybook.com]
>> *Sent:* 30 December 2015 19:27
>>
>> *To:* user@hive.apache.org
>> *Subject:* Re: Running the same query on 1 billion rows fact table in
>> Hive on Spark compared to Sybase IQ columnar database
>>
>>
>>
>> I'm using TEZ 0.7.0.2.3 with Hive 1.2.1.2.3. I can confirm that TEZ is
>> much faster than MR in pretty much all cases. Also, with Hive, you'll want to
>> make sure you've performed optimizations like aligning ORC stripe sizes with
>> HDFS block sizes, and concatenating your tables (not so much an optimization
>> as a must for avoiding the small-files problem).
>>
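>> As a rough sketch of what I mean (the table name sales and the 256 MB figure
>> are only illustrative - match the value to your own HDFS block size):
>>
>> -- write new ORC data with stripes aligned to a 256 MB block
>> ALTER TABLE sales SET TBLPROPERTIES ('orc.stripe.size'='268435456');
>>
>> -- merge small ORC files (add a PARTITION spec for partitioned tables)
>> ALTER TABLE sales CONCATENATE;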
>>
>>
>> On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <m...@peridale.co.uk>
>> wrote:
>>
>> Thanks again Jorn.
>>
>>
>>
>>
>>
>> Both Hive and Sybase IQ are running on the same host. Yes for Sybase IQ I
>> have compression enabled. The FACT table in IQ (sales) has LF (read bitmap)
>> indexes on the time_id column. For the dimension table (times) I have
>> time_id defined as primary key. Also Sybase IQ creates FP (fast projection)
>> indexes on every column by default.
>>
>>
>>
>> Anyway I am trying to download and build TEZ. Do we know which version of
>> TEZ works with Hive 1.2.1, please? 0.8 seems to be in alpha.
>>
>>
>>
>> Thanks
>>
>>
>>
>> Mich Talebzadeh
>>
>>
>>
>>
>>
>>
>> *From:* Jörn Franke [mailto:jornfra...@gmail.com]
>> *Sent:* 30 December 2015 16:29
>>
>>
>> *To:* user@hive.apache.org
>> *Subject:* Re: Running the same query on 1 billion rows fact table in
>> Hive on Spark compared to Sybase IQ columnar database
>>
>>
>>
>>
>> Hmm, I think the execution engine TEZ has (currently) the most
>> optimizations on Hive. What about your hardware - is it the same? Do you
>> also have compression on Sybase?
>>
>> Alternatively, you need to wait for Hive interactive analytics (TEZ 0.8 + LLAP).
>>
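>> For example, once TEZ is installed, switching the engine is a per-session
>> setting (a sketch - assuming the TEZ libraries are already deployed and
>> configured for Hive):
>>
>> -- choose the execution engine for the current session (mr, tez or spark)
>> set hive.execution.engine=tez;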
>>
>> On 30 Dec 2015, at 13:47, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>
>> Hi Jorn,
>>
>>
>>
>> Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have
>> not tried it on TEZ. I tried the query on the MR engine and it did not fare
>> better. I also ran it without the STDDEV function and found that the
>> function did not slow it down.
>>
>>
>>
>> I tried a simple query as follows, built on the sales FACT table (1e9 rows)
>> and the dimension table times (1,826 rows):
>>
>>
>>
>> --
>>
>> -- Get the total amount sold for each calendar month
>>
>> --
>>
>> SELECT t.calendar_month_desc, SUM(s.amount_sold)
>>
>> FROM sales s, times t WHERE s.time_id = t.time_id
>>
>> GROUP BY t.calendar_month_desc;
>>
>>
>>
>> Now Sybase IQ comes back in around 30 seconds.
>>
>>
>>
>> Started query at Dec 30 2015 08:14:33:399AM
>>
>> (48 rows affected)
>>
>> Finished query at Dec 30 2015 08:15:04:640AM
>>
>>
>>
>> Whereas Hive, with the following settings and running the same query,
>>
>>
>>
>> set
>> hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>>
>> set hive.optimize.bucketmapjoin=true;
>>
>> set hive.optimize.bucketmapjoin.sortedmerge=true;
>>
>>
>>
>> comes back in
>>
>>
>>
>> 48 rows selected (1514.687 seconds)
>>
>>
>>
>> I don't know what else can be done. Obviously this is all schema on read,
>> so I am not sure I can change bucketing on the FACT table based on one query
>> alone!
>>
>>
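>> For reference, if re-bucketing the FACT table by cust_id were worth trying,
>> a sketch would be something like the following (the table name sales_by_cust
>> is made up; the column list is copied from the existing DDL):
>>
>> CREATE TABLE sales_by_cust (
>>   prod_id bigint, cust_id bigint, time_id timestamp,
>>   channel_id bigint, promo_id bigint,
>>   quantity_sold decimal(10,0), amount_sold decimal(10,0))
>> CLUSTERED BY (cust_id) SORTED BY (cust_id) INTO 256 BUCKETS
>> STORED AS ORC;
>>
>> set hive.enforce.bucketing=true;
>> set hive.enforce.sorting=true;
>> INSERT OVERWRITE TABLE sales_by_cust SELECT * FROM sales;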
>>
>>
>>
>>
>>
>> +--------------------------------------------------------------------+--+
>>
>> |                           createtab_stmt                           |
>>
>> +--------------------------------------------------------------------+--+
>>
>> | CREATE TABLE `times`(                                              |
>>
>> |   `time_id` timestamp,                                             |
>>
>> |   `day_name` varchar(9),                                           |
>>
>> |   `day_number_in_week` int,                                        |
>>
>> |   `day_number_in_month` int,                                       |
>>
>> |   `calendar_week_number` int,                                      |
>>
>> |   `fiscal_week_number` int,                                        |
>>
>> |   `week_ending_day` timestamp,                                     |
>>
>> |   `week_ending_day_id` bigint,                                     |
>>
>> |   `calendar_month_number` int,                                     |
>>
>> |   `fiscal_month_number` int,                                       |
>>
>> |   `calendar_month_desc` varchar(8),                                |
>>
>> ----------
>>
>> |   `days_in_fis_year` bigint,                                       |
>>
>> |   `end_of_cal_year` timestamp,                                     |
>>
>> |   `end_of_fis_year` timestamp)                                     |
>>
>> | CLUSTERED BY (                                                     |
>>
>> |   time_id)                                                         |
>>
>> | INTO 256 BUCKETS                                                   |
>>
>> | ROW FORMAT SERDE                                                   |
>>
>> |   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                      |
>>
>> | STORED AS INPUTFORMAT                                              |
>>
>> |   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'                |
>>
>> | OUTPUTFORMAT                                                       |
>>
>> |   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'               |
>>
>> | LOCATION                                                           |
>>
>> |   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times'  |
>>
>> | TBLPROPERTIES (                                                    |
>>
>> |   'COLUMN_STATS_ACCURATE'='true',                                  |
>>
>> |   'numFiles'='1',                                                  |
>>
>> |   'numRows'='1826',                                                |
>>
>> |   'orc.bloom.filter.columns'='TIME_ID',                            |
>>
>> |   'orc.bloom.filter.fpp'='0.05',                                   |
>>
>> |   'orc.compress'='SNAPPY',                                         |
>>
>> |   'orc.create.index'='true',                                       |
>>
>> |   'orc.row.index.stride'='10000',                                  |
>>
>> |   'orc.stripe.size'='268435456',                                   |
>>
>> |   'rawDataSize'='0',                                               |
>>
>> |   'totalSize'='11155',                                             |
>>
>> |   'transient_lastDdlTime'='1451429900')                            |
>>
>>
>>
>> ;
>>
>>
>>
>>
>>
>>
>>
>>
>> *From:* Jörn Franke [mailto:jornfra...@gmail.com <jornfra...@gmail.com>]
>> *Sent:* 30 December 2015 08:28
>> *To:* user@hive.apache.org
>> *Subject:* Re: Running the same query on 1 billion rows fact table in
>> Hive on Spark compared to Sybase IQ columnar database
>>
>>
>>
>> Have you tried it with Hive on TEZ? It currently contains more
>> optimizations than Hive on Spark.
>>
>> I assume you use the latest Hive version.
>>
>> Additionally, you may want to think about calculating statistics
>> (depending on your configuration you may need to trigger it) - I am not sure
>> whether Spark can use them.
>>
>> I am not sure if bloom filters on the columns you mention make sense. You
>> may also want to increase stride size (depending on your data).
>>
>> Currently you bucket by a lot of fields, which may not make sense. You
>> also may want to sort the data by customer Id in the table.
>>
>> You also seem to have a lot of reducers, which you may want to decrease.
>>
>>
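>> For the statistics point, a minimal sketch (table name taken from your DDL;
>> the reducer figure is just an example) would be:
>>
>> ANALYZE TABLE sales COMPUTE STATISTICS;
>> ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;
>> -- and the reducer count can be capped per session, e.g.
>> set hive.exec.reducers.max=64;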
>>
>> Have you tried without "having stddev_samp" ? Is the query exactly the
>> same as in Sybase?
>>
>>
>> On 29 Dec 2015, at 11:53, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>
>> Hi,
>>
>>
>>
>> I have a fact table in Hive imported from Sybase IQ via SQOOP with 1
>> billion rows as follows:
>>
>>
>>
>> show create table sales;
>>
>> CREATE TABLE `sales`(
>>   `prod_id` bigint,
>>   `cust_id` bigint,
>>   `time_id` timestamp,
>>   `channel_id` bigint,
>>   `promo_id` bigint,
>>   `quantity_sold` decimal(10,0),
>>   `amount_sold` decimal(10,0))
>> CLUSTERED BY (
>>   prod_id,
>>   cust_id,
>>   time_id,
>>   channel_id,
>>   promo_id)
>> INTO 256 BUCKETS
>> ROW FORMAT SERDE
>>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>> STORED AS INPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>> OUTPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>> LOCATION
>>   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales'
>> TBLPROPERTIES (
>>   'COLUMN_STATS_ACCURATE'='true',
>>   'last_modified_by'='hduser',
>>   'last_modified_time'='1451305626',
>>   'numFiles'='11',
>>   'numRows'='1000000000',
>>   'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID',
>>   'orc.bloom.filter.fpp'='0.05',
>>   'orc.compress'='SNAPPY',
>>   'orc.create.index'='true',
>>   'orc.row.index.stride'='10000',
>>   'orc.stripe.size'='268435456',
>>   'rawDataSize'='296000000000',
>>   'totalSize'='2678882153',
>>   'transient_lastDdlTime'='1451305626')
>>
>>
>> I use the following query, run against the sales table only, in Hive:
>>
>>
>>
>> SELECT
>>
>>           rs.Customer_ID
>>
>>         , rs.Number_of_orders
>>
>>         , rs.Total_customer_amount
>>
>>         , rs.Average_order
>>
>>         , rs.Standard_deviation
>>
>> FROM
>>
>> (
>>
>>         SELECT cust_id AS Customer_ID,
>>
>>         COUNT(amount_sold) AS Number_of_orders,
>>
>>         SUM(amount_sold) AS Total_customer_amount,
>>
>>         AVG(amount_sold) AS Average_order,
>>
>>         stddev_samp(amount_sold) AS Standard_deviation
>>
>>         FROM sales
>>
>>         GROUP BY cust_id
>>
>>         HAVING SUM(amount_sold) > 94000
>>
>>         AND AVG(amount_sold) < stddev_samp(amount_sold)
>>
>> ) rs
>>
>> ORDER BY
>>
>>           -- Total_customer_amount DESC
>>
>>           3 DESC
>>
>>
>>
>> Hive comes back in 17 minutes with 5,948 rows
>>
>>
>>
>> bl -f sales.hql > sales.log
>>
>> Connecting to jdbc:hive2://rhes564:10010/default
>>
>> Connected to: Apache Hive (version 1.2.1)
>>
>> Driver: Hive JDBC (version 1.2.1)
>>
>> Transaction isolation: TRANSACTION_REPEATABLE_READ
>>
>> Running init script /home/hduser/dba/bin/hive_on_spark_init.hql
>>
>> No rows affected (0.097 seconds)
>>
>> No rows affected (0.001 seconds)
>>
>> No rows affected (0.001 seconds)
>>
>> No rows affected (0.038 seconds)
>>
>> INFO  : Warning: Using constant number 3 in order by. If you try to use
>> position alias when hive.groupby.orderby.position.alias is false, the
>> position alias will be ignored.
>>
>> INFO  :
>>
>> Query Hive on Spark job[0] stages:
>>
>> INFO  : 0
>>
>> INFO  : 1
>>
>> INFO  : 2
>>
>> INFO  :
>>
>> Status: Running (Hive on Spark job[0])
>>
>> INFO  : Job Progress Format
>>
>> CurrentTime StageId_StageAttemptId:
>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>> [StageCost]
>>
>> INFO  : 2015-12-29 09:33:25,815 Stage-0_0: 0/11 Stage-1_0: 0/1009
>> Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:33:28,829 Stage-0_0: 0/11 Stage-1_0: 0/1009
>> Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11     Stage-1_0:
>> 0/1009       Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11     Stage-1_0:
>> 0/1009       Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11     Stage-1_0:
>> 0/1009       Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11     Stage-1_0:
>> 0/1009       Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11     Stage-1_0:
>> 0/1009       Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11     Stage-1_0:
>> 0/1009       Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11     Stage-1_0:
>> 0/1009       Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11     Stage-1_0:
>> 0/1009       Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11     Stage-1_0:
>> 0/1009       Stage-2_0: 0/1
>>
>>
>>
>> INFO  : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11    Stage-1_0:
>> 0/1009       Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11    Stage-1_0:
>> 0/1009       Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11    Stage-1_0:
>> 0/1009       Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished
>> Stage-1_0: 0(+2)/1009   Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished
>> Stage-1_0: 6(+2)/1009   Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished
>> Stage-1_0: 30(+2)/1009  Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished
>> Stage-1_0: 59(+2)/1009  Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished
>> Stage-1_0: 90(+2)/1009  Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished
>> Stage-1_0: 124(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished
>> Stage-1_0: 160(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished
>> Stage-1_0: 193(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished
>> Stage-1_0: 228(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished
>> Stage-1_0: 262(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished
>> Stage-1_0: 298(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished
>> Stage-1_0: 338(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished
>> Stage-1_0: 376(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished
>> Stage-1_0: 417(+3)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished
>> Stage-1_0: 460(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished
>> Stage-1_0: 502(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished
>> Stage-1_0: 542(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished
>> Stage-1_0: 584(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished
>> Stage-1_0: 624(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished
>> Stage-1_0: 667(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished
>> Stage-1_0: 709(+3)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished
>> Stage-1_0: 754(+3)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished
>> Stage-1_0: 797(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished
>> Stage-1_0: 844(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished
>> Stage-1_0: 888(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished
>> Stage-1_0: 934(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished
>> Stage-1_0: 981(+2)/1009 Stage-2_0: 0/1
>>
>> INFO  : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished
>> Stage-1_0: 1009/1009 Finished   Stage-2_0: 0(+1)/1
>>
>> INFO  : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished
>> Stage-1_0: 1009/1009 Finished   Stage-2_0: 1/1 Finished
>>
>> INFO  : Status: Finished successfully in 1036.00 seconds
>>
>> 5,948 rows selected (1074.817 seconds)
>>
>>
>>
>> So Hive returns 5,948 rows in 17 minutes. In contrast, IQ returns 5,947 rows
>> in 23 seconds.
>>
>>
>>
>> Sybase IQ is a columnar database so each column is created as a fast
>> projection index by default. In addition I have created LF (bitmap) indexes
>> on dimension columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID). Now
>> the query only touches CUST_ID.
>>
>>
>>
>> My suspicion is that the standard deviation function stddev_samp() could be
>> the bottleneck.
>>
>>
>>
>> Thanks
>>
>>
>>
>> Mich Talebzadeh
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
>

-- 
Want to work at Handy? Check out our culture deck and open roles 
<http://www.handy.com/careers>
Latest news <http://www.handy.com/press> at Handy
Handy just raised $50m 
<http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
 led 
by Fidelity
