Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

Jörn Franke Sat, 17 Sep 2016 23:38:21 -0700

In Tableau you can use the in-memory facilities of the Tableau server.

As said, Apache Ignite could be one way. You can also use it to keep Hive
tables in memory. While reducing IO can make sense, I do not think you will
see that much difference in production systems (certainly not 20x). If the
data is processed in parallel, then the IO is also done in parallel thanks to
the architecture of HDFS. Oracle Exadata exploits similar concepts. The
advantage of Ignite compared to e.g. Exadata would be that you also have the
indexes of ORC and Parquet in memory, which avoids reading data into memory
that is not needed for the query.
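
To illustrate the index point, here is a minimal sketch of skipping blocks by
min/max statistics. It is a toy model only, not the ORC or Parquet reader API:
the names Stripe, build_stripes and scan_equal are made up for this example,
and the data is assumed to be sorted so that the block ranges do not overlap.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stripe:
    """One block of rows plus the min/max statistics kept in its index.

    Hypothetical structure for illustration; real ORC/Parquet metadata is richer.
    """
    min_value: int
    max_value: int
    rows: List[int]


def build_stripes(values: List[int], stripe_size: int) -> List[Stripe]:
    """Split the values into fixed-size blocks and record min/max per block."""
    stripes = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stripes.append(Stripe(min(chunk), max(chunk), chunk))
    return stripes


def scan_equal(stripes: List[Stripe], target: int) -> Tuple[List[int], int]:
    """Return the matching rows and the number of blocks actually read."""
    matches: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        # The statistics alone prove this block cannot contain the target,
        # so its rows are never touched.
        if target < stripe.min_value or target > stripe.max_value:
            continue
        stripes_read += 1
        matches.extend(v for v in stripe.rows if v == target)
    return matches, stripes_read


values = list(range(1000))           # assumed sorted input
stripes = build_stripes(values, 100)
matches, stripes_read = scan_equal(stripes, 250)
print(matches, stripes_read, len(stripes))
```

With sorted input only one of the ten blocks is read for this lookup. On
unsorted data the ranges overlap and far fewer blocks can be skipped.
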
That being said, even if you use in-memory technology, it still makes sense to
pre-aggregate/pre-calculate some data for the users based on their needs.
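
A rough sketch of the pre-aggregation idea, assuming a dashboard that only
needs totals per day and currency. The sample rows and the function names
(pre_aggregate, daily_total) are hypothetical and not taken from any tool
mentioned in this thread.

```python
from collections import defaultdict

# Hypothetical raw rows: (day, currency, amount)
raw_events = [
    ("2016-09-16", "EUR", 10.0),
    ("2016-09-16", "EUR", 5.0),
    ("2016-09-16", "USD", 2.0),
    ("2016-09-17", "EUR", 7.0),
]


def pre_aggregate(events):
    """Scan the raw rows once and build a small summary keyed by (day, currency)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, currency, amount in events:
        key = (day, currency)
        totals[key] += amount
        counts[key] += 1
    return {key: {"total": totals[key], "count": counts[key]} for key in totals}


def daily_total(summary, day, currency):
    """Answer a dashboard query from the summary without touching the raw rows."""
    entry = summary.get((day, currency))
    return entry["total"] if entry else 0.0


# Built by a batch job, e.g. on the same schedule as the data load.
summary = pre_aggregate(raw_events)
print(daily_total(summary, "2016-09-16", "EUR"))
```

The summary is much smaller than the raw data, so it is cheap to keep in
memory and to refresh, whichever caching layer sits underneath.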


> On 17 Sep 2016, at 18:53, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Hi,
> 
> I saw similar issues when I was working on Oracle with Tableau as the
> dashboard.
> 
> Currently I have a batch layer that gets streaming data from
> 
> source -> Kafka -> Flume -> HDFS
> 
> It is stored on HDFS as text files, and a cron process syncs a Hive table
> with the external table built on the directory. I tried both ORC and Parquet
> but I don't think the query itself is the issue.
> 
> Meaning it does not matter how clever your execution engine is: you still
> have to do a considerable amount of Physical IO (PIO), as opposed to
> Logical IO (LIO), to get the data to Zeppelin, and that is on the critical
> path.
> 
> One option is to limit the amount of data in Zeppelin to a certain number of
> rows or something similar. However, you cannot tell a user he/she cannot see
> the full data.
> 
> We resolved this with Oracle by using Oracle TimesTen IMDB to cache certain
> tables in memory and get them refreshed (depending on refresh frequency) from
> the underlying table in Oracle when data is updated. That is done through
> cache fusion.
> 
> I was looking around and came across Alluxio. Ideally I would like to use a
> concept similar to TimesTen. Can one distribute Hive table data (or any table
> data) cached across the nodes? In that case we would be doing Logical IO,
> which is about 20 times or more lighter than Physical IO.
> 
> Anyway this is the concept.
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>
