I saw similar issues when I was working on Oracle with Tableau as the
Currently I have a batch layer that gets streaming data from
source -> Kafka -> Flume -> HDFS
It is stored on HDFS as text files, and a cron process syncs a Hive table with
the external table built on the directory. I tried both ORC and Parquet,
but I don't think the query itself is the issue.
Meaning it does not matter how clever your execution engine is: the fact that
you still have to do a considerable amount of Physical IO (PIO), as opposed
to Logical IO (LIO), to get the data to Zeppelin is on the critical path.
One option is to limit the amount of data in Zeppelin to a certain number of
rows or something similar. However, you cannot tell users that they cannot
see the full data.
We resolved this with Oracle by using Oracle TimesTen
to cache certain tables in memory and get them refreshed (depending on
refresh frequency) from the underlying table in Oracle when data is
updated. That is done through cache fusion.
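To make the caching idea concrete, here is a minimal Python sketch of a table cache that serves reads from memory and reloads a table once its copy is older than a per-table refresh interval. It is not the TimesTen API. The class name `RefreshingTableCache`, the `loader` callback and the counters are all hypothetical names made up for illustration, and the refresh here is purely time-based rather than triggered by updates to the underlying table.

```python
import time
from typing import Callable, Dict, List, Tuple

Row = Tuple  # a table row, kept opaque for this sketch


class RefreshingTableCache:
    """Hold whole tables in memory and reload each one when it gets stale.

    Illustrative only: `loader` stands in for whatever reads the table
    from the slow underlying store (an assumption, not a real API).
    """

    def __init__(self, loader: Callable[[str], List[Row]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._clock = clock
        self._tables: Dict[str, Tuple[float, List[Row]]] = {}
        self._intervals: Dict[str, float] = {}
        self.physical_reads = 0  # reads that went to the underlying store
        self.logical_reads = 0   # reads served from memory

    def register(self, table: str, refresh_seconds: float) -> None:
        """Declare a table as cacheable with its own refresh frequency."""
        self._intervals[table] = refresh_seconds

    def read(self, table: str) -> List[Row]:
        """Return the rows, reloading first if the cached copy is stale."""
        interval = self._intervals[table]  # KeyError if never registered
        now = self._clock()
        entry = self._tables.get(table)
        if entry is None or now - entry[0] >= interval:
            rows = self._loader(table)
            self.physical_reads += 1
            self._tables[table] = (now, rows)
            return rows
        self.logical_reads += 1
        return entry[1]
```

A caller would register each hot table once, for example `cache.register("sales", 60.0)`, and then call `cache.read("sales")` from the reporting layer. Only the first read in each interval touches the underlying store.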
I was looking around and came across Alluxio <http://www.alluxio.org/>.
Ideally I would like to use a concept like TimesTen. Can one distribute
Hive table data (or any table data) across the nodes, cached in memory? In
that case we would be doing Logical IO, which is about 20 times or more
lighter than Physical IO.
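As a rough back-of-the-envelope illustration of why this matters, the Python sketch below estimates the average cost of a read for a given cache hit ratio. The 20x ratio is the ballpark figure quoted above. The simple weighted-average model and the function name `effective_read_cost` are assumptions for illustration only, not measurements.

```python
def effective_read_cost(hit_ratio: float,
                        logical_cost: float = 1.0,
                        physical_cost: float = 20.0) -> float:
    """Average cost per read, in units where one Logical IO costs 1.

    hit_ratio is the fraction of reads served from memory. The default
    physical_cost of 20 reflects the rough 20x figure; it is an assumed
    value, not a benchmark result.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * logical_cost + (1.0 - hit_ratio) * physical_cost
```

Under these assumptions, a 90% hit ratio brings the average cost down from 20 units to about 2.9, so most of the gain depends on keeping the hot tables resident in memory.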
Anyway this is the concept.
Dr Mich Talebzadeh
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.