I would like to have your opinion about an idea I had...

I am thinking of addressing interactive queries on small/medium
datasets (max 500 GB or 1 TB) with a solution based on the Thrift server and
Spark cache management. Currently, the problem with caching a dataset in
Spark is that you cannot get high data freshness and the cache isn't
resilient.
If DataFrames are thread safe, would it be possible to implement a cache
management strategy that periodically refreshes the cached dataset from the
backends?
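
To make the idea concrete, here is a minimal sketch of the refresh pattern
in plain Python, with no Spark dependency. The class name RefreshingCache
and the load/release callbacks are hypothetical names for illustration.
With Spark, I assume load would build a DataFrame, call cache() and then an
action such as count() to materialize it, and release would call
unpersist() on the previous one. I have not verified how Spark's cache
manager treats two DataFrames with identical plans, so that part would need
checking.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class RefreshingCache(Generic[T]):
    """Holds one cached snapshot and swaps in a fresh one on a fixed interval.

    Hypothetical helper: `load` builds a new snapshot, `release` (optional)
    frees the previous one after the swap.
    """

    def __init__(
        self,
        load: Callable[[], T],
        interval_s: float,
        release: Optional[Callable[[T], None]] = None,
    ) -> None:
        self._load = load
        self._release = release
        self._interval_s = interval_s
        self._lock = threading.Lock()
        self._stop = threading.Event()
        # Build the first snapshot before any reader can ask for it.
        self._current: T = load()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        """Start the background refresh thread."""
        self._thread.start()

    def get(self) -> T:
        """Return the current snapshot; safe to call from any thread."""
        with self._lock:
            return self._current

    def refresh(self) -> None:
        """Load a fresh snapshot, swap it in, then release the old one."""
        fresh = self._load()  # slow part runs outside the lock
        with self._lock:
            old, self._current = self._current, fresh
        if self._release is not None:
            self._release(old)

    def _run(self) -> None:
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self.refresh()

    def stop(self) -> None:
        """Signal the refresh thread to exit and wait for it."""
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()


# Assumed PySpark wiring (not tested here, path is hypothetical):
#   def load():
#       df = spark.read.parquet("/hypothetical/path").cache()
#       df.count()  # force materialization before readers see it
#       return df
#   cache = RefreshingCache(load, interval_s=300,
#                           release=lambda df: df.unpersist())
```

Readers always get a complete snapshot because the new one is fully built
before the swap. A reader that obtained the old snapshot just before the
swap may still be using it when release runs, so a real implementation
might delay the release.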

Another question, regarding persist with MEMORY_AND_DISK: what
promotion/eviction strategy is implemented? Is it FIFO, LIFO, heat-based?

Note: I already know Alluxio, and it could potentially also solve this
issue. My question is about Spark only; I would like to benefit from the
Tungsten project and the no-serialization options...

2017-02-15 9:05 GMT+01:00 萝卜丝炒饭 <1427357...@qq.com>:

> Does updating a dataframe return a NEW dataframe, like with an RDD?
>
> ---Original---
> *From:* "vincent gromakowski"<vincent.gromakow...@gmail.com>
> *Date:* 2017/2/14 01:15:35
> *To:* "Reynold Xin"<r...@databricks.com>;
> *Cc:* "user"<user@spark.apache.org>;"Mendelson, Assaf"<
> assaf.mendel...@rsa.com>;
> *Subject:* Re: is dataframe thread safe?
>
> How about having a thread that updates and caches a dataframe in memory next
> to other threads requesting this dataframe? Is that thread safe?
>
> 2017-02-13 9:02 GMT+01:00 Reynold Xin <r...@databricks.com>:
>
>> Yes your use case should be fine. Multiple threads can transform the same
>> data frame in parallel since they create different data frames.
>>
>>
>> On Sun, Feb 12, 2017 at 9:07 AM Mendelson, Assaf <assaf.mendel...@rsa.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I was wondering if a dataframe is considered thread safe. I know the spark
>>> session and spark context are thread safe (and actually have tools to
>>> manage jobs from different threads), but the question is: can I use the same
>>> dataframe in both threads?
>>>
>>> The idea would be to create a dataframe in the main thread and then in
>>> two sub threads do different transformations and actions on it.
>>>
>>> I understand that some things might not be thread safe (e.g. if I
>>> unpersist in one thread it would affect the other. Checkpointing would
>>> cause similar issues), however, I can’t find any documentation as to what
>>> operations (if any) are thread safe.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>                 Assaf.
>>>
>>
>
