> I guess I see different things. Having used all the tech. In particular for
> large hive queries I see OOM simply SCANNING THE INPUT of a data directory,
> after 20 seconds!
If you've got an LLAP deployment you're not happy with - this list is the
right place to air your grievances. I usually try to help anybody who hasn't
quite figured their way around it & I'm always happy to debug performance
issues across the board.

But if your definition of "all the tech" is text files and gzip on MR, then
Datanami has an article out today.

https://www.datanami.com/2017/06/22/hadoop-engines-compete-comcast-query-smackdown/

And my favourite quote is the side-bar image - MapReduce SQL query
performance is “a dumpster fire”.

> "Magically" jk. Impala allow me to query those TEXT files in milliseconds, so
> logical deduction says the format of the data ORC/TEXT can't be the most
> important factor here.

You're right - the most important factor is the execution engine for Hive.
The engine drives the logical plans and the engine-specific optimizations,
which matter far more than the file format. If you're doing ETL or BI,
picking the best Hive engine is the most important factor - Comcast got
Parquet on LLAP to work fast as well.

https://www.slideshare.net/Hadoop_Summit/hadoop-query-performance-smackdown/20

Stinger wasn't just about ORC or Vectorization or ACID or Tez, but about
being greater than the sum of the parts for an EDW.

Your milliseconds comment reminded me of an off-hand comment made by the
Yahoo Japan folks about the time they tried Impala out for their 700-node
use-case.

https://hortonworks.com/blog/impala-vs-hive-performance-benchmark/

Impala did very well at low workloads. But low load is a very artificial
condition in a well-utilized cluster. Production clusters usually run at
near-full utilization or they're leaving money on the table. This is also a
scale problem: running at 50% utilization on 2 nodes is very different from
running at 50% on 50, in real dollars (half of 50 nodes idle is 25 machines'
worth of waste; half of 2 nodes is one).

Following their utilization tests, Yahoo Japan detailed their experience
running LLAP with 300+ concurrent queries last year (with *NO* query
failures, instead of "fast or die").

https://www.slideshare.net/HadoopSummit/achieving-100k-queries-per-hour-on-hive-on-tez/38

> 2 impala server (each node)

The problem with making up your mind based on a 2-node cluster is that scale
problems are somewhat like the tides - huge and massive, ignored at your own
risk, but impossible to measure in a swimming pool.

If you want, I can go into details about what is different about 3-replica
HDFS data on 700 nodes vs 2 nodes - also imagine that your last hour is
"hot" in the BI dashboard, which means 4 HDFS blocks get 85% of the IO
requests, and an MPP database model runs out of capacity at 0.5% disk IO
utilization (see the P.S. below for the arithmetic).

If you're happy with Impala at the single-digit node scale, that's useful to
know - but it does not extrapolate naturally.

Cheers,
Gopal
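
P.S. Here's the hot-block arithmetic as a back-of-envelope sketch. The
disks-per-node count is my assumption to make the numbers concrete, not a
measurement from any particular cluster:

# Why 4 "hot" HDFS blocks cap an MPP-style cluster at ~0.5% disk utilization.
# nodes matches the Yahoo Japan example; disks_per_node is an assumption.
nodes = 700
disks_per_node = 4
replication = 3        # default HDFS replication factor
hot_blocks = 4         # the "last hour" of data on the BI dashboard
hot_io_share = 0.85    # fraction of IO requests hitting those blocks

total_disks = nodes * disks_per_node       # 2800 spindles
hot_disks = hot_blocks * replication       # every replica fits on <= 12 spindles

# The cluster tops out once those 12 spindles saturate: they serve 85% of
# all requests, so total throughput caps at hot_disks / hot_io_share
# "disk-equivalents" of IO, spread across 2800 disks.
max_throughput = hot_disks / hot_io_share  # ~14.1 disk-equivalents
aggregate_util = max_throughput / total_disks

print("aggregate disk IO utilization at capacity: %.2f%%"
      % (100 * aggregate_util))            # -> 0.50%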