i refuse to take anybody seriously who has a sig file longer than one line, and that one is just plain repugnant.

On Wed, Feb 3, 2016 at 1:47 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

> I just did some further tests joining a 5 million row FACT table with 2
> DIMENSION tables.
>
> SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
> FROM sales s, times t, channels c
> WHERE s.time_id = t.time_id
> AND s.channel_id = c.channel_id
> GROUP BY t.calendar_month_desc, c.channel_desc
> ;
>
> Hive on Spark crashes, Hive with MR finishes in 85 sec, and Spark on Hive
> finishes in 267 sec. I am trying to understand this behaviour.
>
> OK, I changed the three parameters below as suggested by Jeff:
>
> export SPARK_EXECUTOR_CORES=12 ## Number of cores for the workers (Default: 1)
> export SPARK_EXECUTOR_MEMORY=5G ## Memory per worker (e.g. 1000M, 2G) (Default: 1G)
> export SPARK_DRIVER_MEMORY=2G ## Memory for master (e.g. 1000M, 2G) (Default: 512 MB)
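
[An aside, not part of the thread: when Hive's execution engine is Spark, the
Hive on Spark documentation indicates the same executor settings can also be
made per session from the beeline/hive prompt, provided they are set before
the session's first Spark job. A minimal, untested sketch:

  set spark.executor.cores=12;   -- cores per executor
  set spark.executor.memory=5g;  -- memory per executor
  set spark.driver.memory=2g;    -- memory for the driver
]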
> 1) Hive 1.2.1 on Spark 1.3.1
>
> It fails. Never completes.
>
> ERROR : Status: Failed
> Error: Error while processing statement: FAILED: Execution Error, return
> code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask (state=08S01,code=3)
>
> 2) Hive 1.2.1 on the MR engine: looks good and completes in 85 sec
>
> 0: jdbc:hive2://rhes564:10010/default> SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
> 0: jdbc:hive2://rhes564:10010/default> FROM sales s, times t, channels c
> 0: jdbc:hive2://rhes564:10010/default> WHERE s.time_id = t.time_id
> 0: jdbc:hive2://rhes564:10010/default> AND s.channel_id = c.channel_id
> 0: jdbc:hive2://rhes564:10010/default> GROUP BY t.calendar_month_desc, c.channel_desc
> 0: jdbc:hive2://rhes564:10010/default> ;
> INFO : Execution completed successfully
> INFO : MapredLocal task succeeded
> INFO : Number of reduce tasks not specified. Estimated from input data size: 1
> INFO : In order to change the average load for a reducer (in bytes):
> INFO :   set hive.exec.reducers.bytes.per.reducer=<number>
> INFO : In order to limit the maximum number of reducers:
> INFO :   set hive.exec.reducers.max=<number>
> INFO : In order to set a constant number of reducers:
> INFO :   set mapreduce.job.reduces=<number>
> WARN : Hadoop command-line option parsing not performed. Implement the Tool
> interface and execute your application with ToolRunner to remedy this.
> INFO : number of splits:1
> INFO : Submitting tokens for job: job_1454534517374_0002
> INFO : The url to track the job: http://rhes564:8088/proxy/application_1454534517374_0002/
> INFO : Starting Job = job_1454534517374_0002, Tracking URL = http://rhes564:8088/proxy/application_1454534517374_0002/
> INFO : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1454534517374_0002
> INFO : Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
> INFO : 2016-02-03 21:25:17,769 Stage-3 map = 0%, reduce = 0%
> INFO : 2016-02-03 21:25:29,103 Stage-3 map = 2%, reduce = 0%, Cumulative CPU 7.52 sec
> INFO : 2016-02-03 21:25:32,205 Stage-3 map = 5%, reduce = 0%, Cumulative CPU 10.19 sec
> INFO : 2016-02-03 21:25:35,295 Stage-3 map = 7%, reduce = 0%, Cumulative CPU 12.69 sec
> INFO : 2016-02-03 21:25:38,392 Stage-3 map = 10%, reduce = 0%, Cumulative CPU 15.2 sec
> INFO : 2016-02-03 21:25:41,502 Stage-3 map = 13%, reduce = 0%, Cumulative CPU 17.31 sec
> INFO : 2016-02-03 21:25:44,600 Stage-3 map = 16%, reduce = 0%, Cumulative CPU 21.55 sec
> INFO : 2016-02-03 21:25:47,691 Stage-3 map = 20%, reduce = 0%, Cumulative CPU 24.32 sec
> INFO : 2016-02-03 21:25:50,786 Stage-3 map = 23%, reduce = 0%, Cumulative CPU 26.3 sec
> INFO : 2016-02-03 21:25:52,858 Stage-3 map = 27%, reduce = 0%, Cumulative CPU 28.52 sec
> INFO : 2016-02-03 21:25:55,948 Stage-3 map = 31%, reduce = 0%, Cumulative CPU 30.65 sec
> INFO : 2016-02-03 21:25:59,032 Stage-3 map = 35%, reduce = 0%, Cumulative CPU 32.7 sec
> INFO : 2016-02-03 21:26:02,120 Stage-3 map = 40%, reduce = 0%, Cumulative CPU 34.69 sec
> INFO : 2016-02-03 21:26:05,217 Stage-3 map = 43%, reduce = 0%, Cumulative CPU 36.67 sec
> INFO : 2016-02-03 21:26:08,310 Stage-3 map = 47%, reduce = 0%, Cumulative CPU 38.78 sec
> INFO : 2016-02-03 21:26:11,408 Stage-3 map = 52%, reduce = 0%, Cumulative CPU 40.7 sec
> INFO : 2016-02-03 21:26:14,512 Stage-3 map = 56%, reduce = 0%, Cumulative CPU 42.69 sec
> INFO : 2016-02-03 21:26:17,607 Stage-3 map = 60%, reduce = 0%, Cumulative CPU 44.69 sec
> INFO : 2016-02-03 21:26:20,722 Stage-3 map = 64%, reduce = 0%, Cumulative CPU 46.83 sec
> INFO : 2016-02-03 21:26:22,787 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 48.46 sec
> INFO : 2016-02-03 21:26:29,030 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 50.01 sec
> INFO : MapReduce Total cumulative CPU time: 50 seconds 10 msec
> INFO : Ended Job = job_1454534517374_0002
> +------------------------+-----------------+-------------+--+
> | t.calendar_month_desc  | c.channel_desc  | totalsales  |
> +------------------------+-----------------+-------------+--+
> +------------------------+-----------------+-------------+--+
> 150 rows selected (85.67 seconds)
>
> 3) Spark on Hive engine completes in 267 sec
>
> spark-sql> SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
>          > FROM sales s, times t, channels c
>          > WHERE s.time_id = t.time_id
>          > AND s.channel_id = c.channel_id
>          > GROUP BY t.calendar_month_desc, c.channel_desc
>          > ;
> Time taken: 267.138 seconds, Fetched 150 row(s)
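
[An aside, not part of the thread: one way to start unpicking the 85 sec vs.
267 sec gap is to compare the plans the two engines produce for the same
statement. Both HiveQL and Spark SQL accept EXPLAIN:

  EXPLAIN
  SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
  FROM sales s, times t, channels c
  WHERE s.time_id = t.time_id
  AND s.channel_id = c.channel_id
  GROUP BY t.calendar_month_desc, c.channel_desc;
]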
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> Sybase ASE 15 Gold Medal Award 2008
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
> Author of the book "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7
> Co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
> Publications due shortly:
> Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
> Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly
>
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only; if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free; therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
> From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
> Sent: 03 February 2016 16:21
> To: user@hive.apache.org
> Subject: RE: Hive on Spark Engine versus Spark using Hive metastore
>
> OK thanks. These are my new ENV settings based upon the availability of
> resources:
>
> export SPARK_EXECUTOR_CORES=12 ## Number of cores for the workers (Default: 1)
> export SPARK_EXECUTOR_MEMORY=5G ## Memory per worker (e.g. 1000M, 2G) (Default: 1G)
> export SPARK_DRIVER_MEMORY=2G ## Memory for master (e.g. 1000M, 2G) (Default: 512 MB)
>
> These are the new runs after these settings:
>
> Spark on Hive (3 consecutive runs)
>
> spark-sql> select * from dummy where id in (1, 5, 100000);
> 1       0   0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  1       xxxxxxxxxx
> 5       0   4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  5       xxxxxxxxxx
> 100000  99  999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
> Time taken: 47.987 seconds, Fetched 3 row(s)
>
> Around 48 seconds
>
> Hive on Spark 1.3.1
>
> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
> INFO :
> Query Hive on Spark job[2] stages:
> INFO : 2
> INFO :
> Status: Running (Hive on Spark job[2])
> INFO : Job Progress Format
> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
> INFO : 2016-02-03 16:20:50,315 Stage-2_0: 0(+18)/18
> INFO : 2016-02-03 16:20:53,369 Stage-2_0: 0(+18)/18
> INFO : 2016-02-03 16:20:56,478 Stage-2_0: 0(+18)/18
> INFO : 2016-02-03 16:20:58,530 Stage-2_0: 1(+17)/18
> INFO : 2016-02-03 16:21:01,570 Stage-2_0: 1(+17)/18
> INFO : 2016-02-03 16:21:04,680 Stage-2_0: 1(+17)/18
> INFO : 2016-02-03 16:21:07,767 Stage-2_0: 1(+17)/18
> INFO : 2016-02-03 16:21:10,877 Stage-2_0: 1(+17)/18
> INFO : 2016-02-03 16:21:13,941 Stage-2_0: 1(+17)/18
> INFO : 2016-02-03 16:21:17,019 Stage-2_0: 1(+17)/18
> INFO : 2016-02-03 16:21:20,090 Stage-2_0: 3(+15)/18
> INFO : 2016-02-03 16:21:21,138 Stage-2_0: 6(+12)/18
> INFO : 2016-02-03 16:21:22,145 Stage-2_0: 10(+8)/18
> INFO : 2016-02-03 16:21:23,150 Stage-2_0: 14(+4)/18
> INFO : 2016-02-03 16:21:24,154 Stage-2_0: 17(+1)/18
> INFO : 2016-02-03 16:21:26,161 Stage-2_0: 18/18 Finished
> INFO : Status: Finished successfully in 36.88 seconds
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | dummy.random_string                                 | dummy.small_vc  | dummy.padding  |
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  | 1               | xxxxxxxxxx     |
> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  | 5               | xxxxxxxxxx     |
> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 100000          | xxxxxxxxxx     |
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> 3 rows selected (37.161 seconds)
>
> Around 37 seconds
>
> Interesting results
>
> Dr Mich Talebzadeh
>
> From: Xuefu Zhang [mailto:xzh...@cloudera.com]
> Sent: 03 February 2016 12:47
> To: user@hive.apache.org
> Subject: Re: Hive on Spark Engine versus Spark using Hive metastore
>
> In YARN or standalone mode, you can set spark.executor.cores to utilize
> all cores on the node. You can also set spark.executor.memory to allocate
> memory for Spark to use. Once you do this, you may only have two executors
> to run your map tasks, but each core in each executor can take up one task,
> increasing parallelism. With this, the eventual limit may come down to the
> bandwidth of your disks in the cluster.
>
> Having said that, a two-node cluster isn't really big enough to do a
> performance benchmark. Nevertheless, you still need to configure it
> properly to make full use of the cluster.
>
> --Xuefu
>
> On Wed, Feb 3, 2016 at 1:25 AM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>
> Hi Jeff,
>
> I only have a two-node cluster.
> Is there any way one can simulate additional parallel runs in such an
> environment, thus having more than two maps?
>
> Thanks
>
> Dr Mich Talebzadeh
>
> From: Xuefu Zhang [mailto:xzh...@cloudera.com]
> Sent: 03 February 2016 02:39
> To: user@hive.apache.org
> Subject: Re: Hive on Spark Engine versus Spark using Hive metastore
>
> Yes, regardless of what Spark mode you're running in, from the Spark AM
> web UI you should be able to see how many tasks are concurrently running.
> I'm a little surprised to see that your Hive configuration only allows 2
> map tasks to run in parallel. If your cluster has the capacity, you should
> parallelize all the tasks to achieve optimal performance. Since I don't
> know your Spark SQL configuration, I cannot tell how much parallelism you
> have over there. Thus, I'm not sure if your comparison is valid.
>
> --Xuefu
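
[An aside, not part of the thread: on the Spark SQL side, the reduce-side
parallelism of a join or aggregation is governed by
spark.sql.shuffle.partitions, which defaults to 200 and can be changed at
runtime inside spark-sql. The value 24 below is an arbitrary choice for a
small cluster, not a recommendation from the thread:

  SET spark.sql.shuffle.partitions=24;
  -- subsequent joins/aggregations in this session use 24 shuffle partitions
]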
> On Tue, Feb 2, 2016 at 5:08 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>
> Hi Jeff,
>
> In the below:
>
> "... You should be able to see the resource usage in the YARN resource
> manager URL."
>
> Just to be clear, we are talking about port 8088/cluster?
>
> Dr Mich Talebzadeh
>
> From: Koert Kuipers [mailto:ko...@tresata.com]
> Sent: 03 February 2016 00:09
> To: user@hive.apache.org
> Subject: Re: Hive on Spark Engine versus Spark using Hive metastore
>
> uuuhm, with spark using the Hive metastore you actually have a real
> programming environment and you can write real functions, versus just
> being boxed into some version of sql and limited udfs?
>
> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>
> When comparing the performance, you need to do it apples to apples. In
> another thread, you mentioned that Hive on Spark is much slower than Spark
> SQL. However, you configured Hive such that only two tasks can run in
> parallel, and you didn't provide information on how much Spark SQL is
> utilizing. Thus, it's hard to tell whether it's just a configuration
> problem in your Hive or whether Spark SQL is indeed faster. You should be
> able to see the resource usage in the YARN resource manager URL.
>
> --Xuefu
>
> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>
> Thanks Jeff.
>
> Obviously Hive is much more feature-rich compared to Spark. Having said
> that, in certain areas, for example where the SQL feature is available in
> Spark, Spark seems to deliver faster. This may be because:
>
> 1. Spark does both the optimisation and execution seamlessly
> 2. Hive on Spark has to invoke YARN, which adds another layer to the process
>
> Now I did some simple tests on a 100 million row ORC table available
> through Hive to both.
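
[An aside, not part of the thread: point lookups like the id IN (...) probe
below are sensitive to ORC predicate pushdown, which in Hive 1.2 is
controlled by the settings sketched here. Whether it would change these
particular timings is untested:

  set hive.optimize.ppd=true;           -- predicate pushdown (on by default)
  set hive.optimize.index.filter=true;  -- use ORC row-group indexes to skip data (off by default)
]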
>
> Spark 1.5.2 on Hive 1.2.1 Metastore
>
> spark-sql> select * from dummy where id in (1, 5, 100000);
> 1       0   0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  1       xxxxxxxxxx
> 5       0   4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  5       xxxxxxxxxx
> 100000  99  999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
> Time taken: 50.805 seconds, Fetched 3 row(s)
> spark-sql> select * from dummy where id in (1, 5, 100000);
> 1       0   0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  1       xxxxxxxxxx
> 5       0   4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  5       xxxxxxxxxx
> 100000  99  999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
> Time taken: 50.358 seconds, Fetched 3 row(s)
> spark-sql> select * from dummy where id in (1, 5, 100000);
> 1       0   0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  1       xxxxxxxxxx
> 5       0   4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  5       xxxxxxxxxx
> 100000  99  999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
> Time taken: 50.563 seconds, Fetched 3 row(s)
>
> So three runs returning three rows in just over 50 seconds.
>
> Hive 1.2.1 on Spark 1.3.1 execution engine
>
> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
> INFO :
> Query Hive on Spark job[4] stages:
> INFO : 4
> INFO :
> Status: Running (Hive on Spark job[4])
> INFO : Status: Finished successfully in 82.49 seconds
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | dummy.random_string                                 | dummy.small_vc  | dummy.padding  |
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  | 1               | xxxxxxxxxx     |
> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  | 5               | xxxxxxxxxx     |
> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 100000          | xxxxxxxxxx     |
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> 3 rows selected (82.66 seconds)
> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
> INFO : Status: Finished successfully in 76.67 seconds
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | dummy.random_string                                 | dummy.small_vc  | dummy.padding  |
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  | 1               | xxxxxxxxxx     |
> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  | 5               | xxxxxxxxxx     |
> | 100000    | 99               | 999               | 188              | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 100000          | xxxxxxxxxx     |
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> 3 rows selected (76.835 seconds)
> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
> INFO : Status: Finished successfully in 80.54 seconds
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | dummy.random_string                                 | dummy.small_vc  | dummy.padding  |
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  | 1               | xxxxxxxxxx     |
> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  | 5               | xxxxxxxxxx     |
> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 100000          | xxxxxxxxxx     |
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> 3 rows selected (80.718 seconds)
>
> Three runs returning the same rows in around 80 seconds.
>
> It is possible that my Spark engine with Hive, being 1.3.1, is out of date
> and that causes this lag.
>
> There are certain queries that one cannot do with Spark. Besides, it does
> not recognize CHAR fields, which is a pain.
>
> spark-sql> CREATE TEMPORARY TABLE tmp AS
>          > SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
>          > FROM sales s, times t, channels c
>          > WHERE s.time_id = t.time_id
>          > AND s.channel_id = c.channel_id
>          > GROUP BY t.calendar_month_desc, c.channel_desc
>          > ;
> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7.
> You are likely trying to use an unsupported Hive feature.
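
[A possible workaround, not verified in the thread: Spark SQL 1.5 with a Hive
metastore generally does accept a plain CREATE TABLE ... AS SELECT, so
dropping the TEMPORARY keyword (at the cost of materialising a real metastore
table that must be dropped afterwards) may sidestep the unsupported clause:

  CREATE TABLE tmp AS
  SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
  FROM sales s, times t, channels c
  WHERE s.time_id = t.time_id
  AND s.channel_id = c.channel_id
  GROUP BY t.calendar_month_desc, c.channel_desc;

  -- DROP TABLE tmp;  -- clean up, since the table is not temporary
]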
>
> Dr Mich Talebzadeh
>
> From: Xuefu Zhang [mailto:xzh...@cloudera.com]
> Sent: 02 February 2016 23:12
> To: user@hive.apache.org
> Subject: Re: Hive on Spark Engine versus Spark using Hive metastore
>
> I think the difference is not only about which side does the optimization
> but more about feature parity. Hive on Spark offers all the functional
> features that Hive offers, and these features play out faster. However,
> Spark SQL is far from offering this parity as far as I know.
>
> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>
> Hi,
>
> My understanding is that with Hive on the Spark engine, one gets the Hive
> optimizer and the Spark query engine.
>
> With Spark using the Hive metastore, Spark does both the optimization and
> the query execution. The only value-add is that one can access the
> underlying Hive tables from spark-sql etc.
>
> Is this assessment correct?
>
> Thanks
>
> Dr Mich Talebzadeh