Re: In-memory cache in Drill

Kunal Khatua Wed, 10 May 2017 09:56:59 -0700

Not really :)

You then have to deal with cache management. Once you use memory as a cache 
to hold a table, that memory is no longer available for the actual 
computation. Also, Drill works with direct memory rather than heap. To reclaim 
the memory held by the cache, you would then have to introduce a swapping 
policy.
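
To make the swapping policy concrete, here is a minimal sketch of a 
least-recently-used cache with a fixed byte budget. This is not Drill code: 
the class name ByteBudgetCache and its methods are hypothetical, and LRU is 
just one eviction policy picked for illustration.

```python
from collections import OrderedDict


class ByteBudgetCache:
    """Hypothetical LRU cache that evicts tables once a byte budget is exceeded."""

    def __init__(self, budget_bytes):
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self._tables = OrderedDict()  # table name -> raw bytes, oldest first

    def put(self, name, data):
        """Cache a table and return the names evicted to stay within budget."""
        if name in self._tables:
            self.used_bytes -= len(self._tables.pop(name))
        self._tables[name] = data
        self.used_bytes += len(data)
        evicted = []
        # Reclaim memory from the least recently used tables first. A table
        # larger than the whole budget ends up evicting itself.
        while self.used_bytes > self.budget_bytes and self._tables:
            old_name, old_data = self._tables.popitem(last=False)
            self.used_bytes -= len(old_data)
            evicted.append(old_name)
        return evicted

    def get(self, name):
        """Return the cached bytes, or None on a miss."""
        data = self._tables.get(name)
        if data is not None:
            self._tables.move_to_end(name)  # mark as recently used
        return data
```

Every byte counted in used_bytes is a byte the query engine cannot use for 
computation, which is the trade-off described above.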

If you were to store the table on the heap, Drill would still need to copy 
the data into direct memory to do useful work. So now the data takes up about 
2x the memory!
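
As a rough analogy only (this is Python, not the JVM, and not how Drill 
allocates buffers), the doubling can be sketched by copying a managed bytes 
object into an anonymous mmap buffer, which stands in here for direct memory. 
The function name copy_to_off_heap is hypothetical.

```python
import mmap


def copy_to_off_heap(heap_table):
    """Copy managed bytes into an anonymous mmap buffer (stand-in for direct memory)."""
    direct = mmap.mmap(-1, len(heap_table))  # anonymous mapping, not file-backed
    direct.write(heap_table)
    direct.seek(0)
    return direct


heap_table = bytes(1024 * 1024)  # 1 MiB table cached in managed memory
direct = copy_to_off_heap(heap_table)

# Both copies are alive at once, so the data occupies twice its own size.
resident_bytes = len(heap_table) + len(direct)
print(resident_bytes / len(heap_table))
```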

If you are using HDFS (or MapR-FS), these filesystems implement their own 
cache management, so we are already leveraging (to a limited extent) the 
benefits of an in-memory cache.

________________________________
From: Michael Shtelma <[email protected]>
Sent: Wednesday, May 10, 2017 9:44:50 AM
To: [email protected]
Subject: Re: In-memory cache in Drill

Yes, this is certainly a viable approach, but it would be far
better to also be able to keep the data in memory.
Would it make sense to have something like an in-memory storage plugin?
It could then also be used as storage for temporary
tables.
Sincerely,
Michael Shtelma

On Wed, May 10, 2017 at 6:30 PM, Kunal Khatua <[email protected]> wrote:
> Drill does not cache data in memory because doing so introduces the risk of 
> serving stale data when working with data at a large scale.
>
>
> If you want to avoid hitting the actual storage repeatedly, one option is to 
> use the 'create temp table' feature 
> (https://drill.apache.org/docs/create-temporary-table-as-cttas/). This allows 
> you to land the data on a local (or distributed) FS, and query that copy 
> instead. These tables live only for the lifetime of the session (the 
> connection your client/SQLLine makes to the Drill cluster).
>
>
> There is a second benefit to this approach. You can translate the original 
> data source into a format that is well suited to what you are doing with the 
> data. For example, you could pull in data from an RDBMS or a JSON store and 
> write the temp table in Parquet for performing analytics on.
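
The two steps described above (choose an output format, then create the temp 
table) could be sketched as a small helper that builds the Drill SQL. The 
helper name, the table name orders_tmp, and the JSON path are hypothetical; 
sending the statements to Drill over a session is left out.

```python
def cttas_statements(temp_table, select_sql, store_format="parquet"):
    """Build the Drill SQL for landing a query result in a temporary table."""
    return [
        # store.format sets the file format Drill writes for CTAS/CTTAS output.
        "ALTER SESSION SET `store.format` = '{}'".format(store_format),
        "CREATE TEMPORARY TABLE {} AS {}".format(temp_table, select_sql),
    ]


# Hypothetical source: a JSON file exposed through the dfs storage plugin.
for stmt in cttas_statements("orders_tmp", "SELECT * FROM dfs.`/data/orders.json`"):
    print(stmt + ";")
```

Both statements have to run in the same session, since the temporary table 
disappears when that session ends.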
>
>
> ~ Kunal
>
> ________________________________
> From: Michael Shtelma <[email protected]>
> Sent: Wednesday, May 10, 2017 9:16:30 AM
> To: [email protected]
> Subject: In-memory cache in Drill
>
> Hi all,
>
> Is there any way to cache the data that was loaded from the actual
> storage plugin in Drill?
> As far as I understand, when the query is executed, the data is first
> loaded from the storage plugin and handled by the format plugin. After
> that, the data is stored using an internal vectorized representation and
> the query is executed. Is that correct? I am wondering if there is a
> way to store these data vectors somewhere, so that they do not have to
> be loaded from the actual storage for each query. Spark does something
> like that, by storing data frames in off-heap storage.
>
> Regards,
> Michael
