Hello,

For more recent benchmark results, please see [1] where we compare Trino
418, Spark 3.4.0, and Hive 3.1.3 (on MR3 1.7) using TPC-DS 10TB. Spark
takes about 19,600 seconds to complete all the queries, whereas Trino and
Hive each take only about 7,400 seconds. The experiment does not use Hive-LLAP,
but you may think of Hive on MR3 as a substitute for Hive-LLAP because
both systems are comparable in performance. In the experiment, we tried our
best to obtain the best possible performance from Trino and Spark, and did
not intentionally penalize them in order to favor Hive.

Spark is a great project with many cool features and a huge, lively
community, but speed is no longer a key feature that differentiates Spark
from competing technologies. As far as speed is concerned, it seems that
Spark folks are living inside their own world, still believing in their
so-called 'in-memory' computing technology.

Recently someone read the article [1] and summarily dismissed the results,
saying "it is simply impossible for Hive to run faster than Spark". I guess
many people still think that Hive is slow and only good for ETL.

Regards,

--- Sungwoo
[1]
https://www.datamonad.com/post/2023-05-31-trino-spark-hive-performance-1.7/

On Sat, Aug 19, 2023 at 7:18 PM Aaron Grubb <aa...@kaden.ai> wrote:

> Hi Mich,
>
> It's not a question of cannot, but rather (a) whether it's worth converting
> our pipelines from Hive to Spark and (b) whether Spark is more performant
> than LLAP, and in both cases the answer seems to be no. 2016 is a lifetime
> ago in technological time, and since then there's been a major release of
> Hive as well as many minor releases. When we started looking for our "big
> data processor" 2 years ago, we evaluated Spark, Presto, AWS Athena, and
> Hive on LLAP, and all the literature pointed to Hive on LLAP being the most
> performant, in particular when you're able to take advantage of ORC footer
> caching. If you'd like to review some benchmarks, you can take a look at
> this [1], though note that the direct comparison between Spark and LLAP is
> done with a fork of Hive.
>
> Regards,
> Aaron
>
> [1] https://www.datamonad.com/post/2022-04-01-spark-hive-performance-1.4/
>
> On Fri, 2023-08-18 at 16:06 +0100, Mich Talebzadeh wrote:
>
> interesting!
>
> In 2016 I gave a presentation in London at Future of Data, organised by
> Hortonworks on July 20, 2016:
>
> Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
> <https://talebzadehmich.files.wordpress.com/2016/08/hive_on_spark_only.pdf>
>
>
> Back then I thought Spark did the best job as an underlying engine for
> Hive. However, I am not sure there have been many new developments to make
> Spark the underlying engine for Hive. Is there any particular reason you
> cannot use Spark as the ETL tool, with Hive providing the underlying
> storage? Spark has excellent APIs for working with Hive, including the
> Spark Thrift Server (which is, under the bonnet, the Hive Thrift Server).
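>
> As a minimal sketch of that route (assuming a YARN deployment and default
> ports; the commands below are illustrative, not taken from this thread):
>
>   # Start the Spark Thrift Server (ships with Spark) on YARN:
>   $SPARK_HOME/sbin/start-thriftserver.sh --master yarn
>
>   # Connect with beeline via the HiveServer2 JDBC protocol (default port 10000):
>   beeline -u jdbc:hive2://localhost:10000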
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 18 Aug 2023 at 15:45, Aaron Grubb <aa...@kaden.ai> wrote:
>
> Hi Mich,
>
> Yes, that's correct
>
> On Fri, 2023-08-18 at 15:24 +0100, Mich Talebzadeh wrote:
>
> Hi,
>
> Are you using LLAP (Live Long and Process) as a Hive engine?
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 18 Aug 2023 at 15:09, Aaron Grubb <aa...@kaden.ai> wrote:
>
> For those interested, I managed to work out a way to launch the LLAP
> application master and daemons on separate, targeted machines. It was
> inspired by an article I found [1] and implemented using YARN Node Labels
> [2] and Placement Constraints [3], with a modification to the file
> scripts/llap/yarn/templates.py. Here are the basic instructions:
>
> 1. Configure YARN to enable placement constraints and node labels. You
> have the option of using two node labels, or one node label plus the
> default partition. The machines that are intended to run the daemons must
> have a label associated with them. If you choose to use two node labels,
> you must set the default label of the queue that you're submitting LLAP
> to, to the node label associated with the machine that will run the
> application master; note that this affects other applications submitted
> to the same queue. If you use only one label, the machine that will run
> the AM must be accessible by the DEFAULT_PARTITION, and that machine will
> not be specifically targeted if more than one machine is accessible by
> the DEFAULT_PARTITION, so this scenario is recommended only if you have a
> single machine intended for application masters, as is my case. A sketch
> of the relevant configuration follows.
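>
> As a rough sketch (assuming the Capacity Scheduler; the label name
> "llap_daemons", the queue name "llap", and the hostnames are placeholders,
> not taken from my actual setup):
>
>   # yarn-site.xml (ResourceManager):
>   #   yarn.node-labels.enabled = true
>   #   yarn.node-labels.fs-store.root-dir = hdfs:///yarn/node-labels
>   #   yarn.resourcemanager.placement-constraints.handler = placement-processor
>
>   # Create the label and attach it to the daemon machines:
>   yarn rmadmin -addToClusterNodeLabels "llap_daemons(exclusive=true)"
>   yarn rmadmin -replaceLabelsOnNode "worker1=llap_daemons worker2=llap_daemons"
>
>   # capacity-scheduler.xml: give the LLAP queue access to the label:
>   #   yarn.scheduler.capacity.root.llap.accessible-node-labels = llap_daemons
>   #   yarn.scheduler.capacity.root.llap.accessible-node-labels.llap_daemons.capacity = 100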
>
> 2. Modify scripts/llap/yarn/templates.py like so:
>
> #SNIP
>           "APP_ROOT": "<WORK_DIR>/app/install/",
>           "APP_TMP_DIR": "<WORK_DIR>/tmp/"
>         }
>       },
>       "placement_policy": {
>         "constraints": [
>           {
>             "type": "ANTI_AFFINITY",
>             "scope": "NODE",
>             "target_tags": [
>               "llap"
>             ],
>             "node_partitions": [
>               "<INSERT LLAP DAEMON NODE LABEL HERE>"
>             ]
>           }
>         ]
>       }
>     }
>   ],
>   "kerberos_principal" : {
> #SNIP
>
> Note that ANTI_AFFINITY means that only one daemon will be spawned per
> machine, but that should be the desired behaviour anyway. Read more about
> it in [3].
>
> 3. Launch LLAP using the "hive --service llap" command, for example:
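>
> A minimal sketch (assuming the standard options of the LLAP service
> driver; the instance count, memory sizes, and queue name are placeholders
> for illustration, not my actual values):
>
>   # Build and submit the LLAP YARN service:
>   #   --instances  number of LLAP daemons
>   #   --size       YARN container size per daemon
>   #   --xmx        daemon heap; --cache: in-memory data cache size
>   hive --service llap --instances 6 --size 32g --executors 8 \
>        --xmx 20g --cache 8g --queue llap --name llap0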
>
> Hope this helps someone!
> Aaron
>
> [1]
> https://www.gresearch.com/blog/article/hive-llap-in-practice-sizing-setup-and-troubleshooting/
> [2]
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html
> [3]
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/PlacementConstraints.html
>
> On 2023/03/22 10:19:57 Aaron Grubb wrote:
> > Hi all,
> >
> > I have a Hadoop cluster (3.3.4) with 6 nodes of equal resource size that
> > run HDFS and YARN, and 1 node with lower resources which only runs YARN,
> > which I use for Hive AMs, the LLAP AM, Spark AMs, and Hive file merge
> > containers. The HDFS nodes are set up such that the queue for LLAP on the
> > YARN NodeManager is allocated resources exactly equal to what the LLAP
> > daemons consume. However, when I need to re-launch LLAP, I currently have
> > to stop the NodeManager processes on each HDFS node, then launch LLAP to
> > guarantee that the application master ends up on the YARN-only machine,
> > then start the NodeManager processes again to let the daemons start
> > spawning on the nodes.
> >
> > This wasn't a problem before, because only Hive/LLAP was using YARN, but
> > now we've started using Spark at my company, and I'm in a position where,
> > if LLAP happens to crash, I would need to wait for Spark jobs to finish
> > before I could re-launch LLAP, which would put our ETL processes behind,
> > potentially causing unacceptable delays. I could allocate 1 extra vcore
> > and 1024 MB of extra memory for the LLAP queue on each machine, but that
> > would mean 5 vcores and 5 GB of RAM reserved and unused at all times. So
> > I was wondering: is there a way to specify which node the LLAP AM is
> > launched on, perhaps through YARN node labels, similar to Spark's
> > "spark.yarn.am.nodeLabelExpression" configuration (sketched below)? Or
> > even a way to specify the node through a different mechanism? My Hive
> > version is 3.1.3.
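> >
> > For reference, a minimal sketch of the Spark setting mentioned above
> > (the label name "am_nodes" and the application file are placeholders):
> >
> >   # Pin the Spark application master to nodes carrying a YARN node label:
> >   spark-submit --master yarn \
> >     --conf spark.yarn.am.nodeLabelExpression=am_nodes \
> >     my_job.py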
> >
> > Thanks,
> > Aaron
> >
>
>
>
>
