> I guess I see different things. Having used all the tech. In particular for
> large hive queries I see OOM simply SCANNING THE INPUT of a data directory,
> after 20 seconds!
If you've got an LLAP deployment you're not happy with - this list is the
right place to air your grievances. I usually try to help anybody who hasn't
quite figured their way around it & I'm always happy to debug performance
issues across the board.

But if your definition of "all the tech" is text files and gzip on MR, then
Datanami has an article out today.

https://www.datanami.com/2017/06/22/hadoop-engines-compete-comcast-query-smackdown/

And my favourite quote is the side-bar image - MapReduce SQL query
performance is “a dumpster fire”.

> "Magically" jk. Impala allow me to query those TEXT files in milliseconds, so
> logical deduction says the format of the data ORC/TEXT can't be the most
> important factor here.

You're right - the most important factor is the execution engine for Hive.
The engine drives the logical plans and the engine-specific optimizations,
which matter far more than the file format. If you're doing ETL or BI,
picking the best Hive engine is the most important factor - Comcast got
Parquet on LLAP to work fast as well.

https://www.slideshare.net/Hadoop_Summit/hadoop-query-performance-smackdown/20

Stinger wasn't just about ORC or Vectorization or ACID or Tez, but about
being greater than the sum of the parts for an EDW.

Your milliseconds comment reminded me of an off-hand comment made by the
Yahoo Japan folks about the time they tried Impala out for their 700-node
use-case.

https://hortonworks.com/blog/impala-vs-hive-performance-benchmark/

Impala did very well at low workloads. But low load is a very artificial
condition in a well-utilized cluster. Production clusters usually run at
near-full utilization or they're leaving money on the table. This is also a
scale problem: running at 50% utilization on 2 nodes is very different from
running at 50% on 50, in real dollars (half of 50 nodes idle is 25 machines'
worth of waste; half of 2 nodes is one).

Following their utilization tests, Yahoo Japan detailed their experience
running LLAP with 300+ concurrent queries last year (with *NO* query
failures, instead of "fast or die").

https://www.slideshare.net/HadoopSummit/achieving-100k-queries-per-hour-on-hive-on-tez/38

> 2 impala server (each node)

The problem with making up your mind based on a 2-node cluster is that scale
problems are somewhat like the tides - huge and massive, ignored at your own
risk, but impossible to measure in a swimming pool.

If you want, I can go into details about what is different about 3-replica
HDFS data on 700 nodes vs 2 nodes - also imagine that your last hour is
"hot" in the BI dashboard, which means 4 HDFS blocks get 85% of the IO
requests, and an MPP database model runs out of capacity at 0.5% disk IO
utilization (see the P.S. below for the arithmetic).

If you're happy with Impala at the single-digit node scale, that's useful to
know - but it does not extrapolate naturally.

Cheers,
Gopal
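
P.S. Here's the hot-block arithmetic as a back-of-envelope sketch. The
disks-per-node count is my assumption to make the numbers concrete, not a
measurement from any particular cluster:

# Why 4 "hot" HDFS blocks cap an MPP-style cluster at ~0.5% disk utilization.
# nodes matches the Yahoo Japan example; disks_per_node is an assumption.
nodes = 700
disks_per_node = 4
replication = 3        # default HDFS replication factor
hot_blocks = 4         # the "last hour" of data on the BI dashboard
hot_io_share = 0.85    # fraction of IO requests hitting those blocks

total_disks = nodes * disks_per_node       # 2800 spindles
hot_disks = hot_blocks * replication       # every replica fits on <= 12 spindles

# The cluster tops out once those 12 spindles saturate: they serve 85% of
# all requests, so total throughput caps at hot_disks / hot_io_share
# "disk-equivalents" of IO, spread across 2800 disks.
max_throughput = hot_disks / hot_io_share  # ~14.1 disk-equivalents
aggregate_util = max_throughput / total_disks

print("aggregate disk IO utilization at capacity: %.2f%%"
      % (100 * aggregate_util))            # -> 0.50%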