I have looked at Livy in the very recent past and it will not do the trick 
for me. It seemed pretty greedy in terms of resources (or at least that was 
our experience). I will investigate whether job-server could do the trick.

(On a side note, I tried to find a paper on the memory lifecycle within Spark 
but was not very successful; maybe someone has a link to spare.)

My need is to keep one or several DataFrames in memory (well, within Spark) 
so they can be reused at a later time, without persisting them to disk 
(unless Spark decides to spill, of course).
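
Concretely, here is roughly what I have in mind (just a sketch, assuming one 
long-lived SparkSession; the input path and view name are made up):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    // One long-lived session: as long as this driver and its executors stay
    // up, anything cached through it remains available.
    val spark = SparkSession.builder()
      .appName("long-lived-session")
      .getOrCreate()

    // Illustrative input only.
    val df = spark.read.parquet("/data/input-a")

    // Keep the data in executor memory; MEMORY_AND_DISK lets Spark spill to
    // disk only if it decides it has to.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()  // force materialization of the cache

    // Register a name so later code running in the same session can find it.
    df.createOrReplaceTempView("dataset_a")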



> On Jun 21, 2017, at 10:47 AM, Michael Mior <mm...@uwaterloo.ca> wrote:
> 
> This is a puzzling suggestion to me. It's unclear what features the OP needs, 
> so it's really hard to say whether Livy or job-server are sufficient. It's 
> true that neither is particularly mature, but they're much more mature than 
> a homemade project that hasn't even been started yet.
> 
> That said, I'm not very familiar with either project, so perhaps there are 
> some big concerns I'm not aware of.
> 
> --
> Michael Mior
> mm...@apache.org
> 
> 2017-06-21 3:19 GMT-04:00 Rick Moritz <rah...@gmail.com>:
> Keeping it inside the same program/SparkContext is the most performant 
> solution, since you can avoid serialization and deserialization. In-memory 
> persistence between jobs involves a memory copy, uses a lot of RAM, and 
> invokes serialization and deserialization. Technologies that can help you do 
> that easily are Ignite (as mentioned), but also Alluxio, Cassandra with 
> in-memory tables, and a memory-backed HDFS directory (see tiered storage).
> Although Livy and job-server can expose a single SparkContext to multiple 
> programs, I would recommend you build your own framework for integrating 
> different jobs, since many features you may need aren't present yet, while 
> others may cause issues due to lack of maturity. Artificially splitting jobs 
> is in general a bad idea, since it breaks the DAG and thus prevents some 
> potential push-down optimizations.
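
To make the trade-off Rick describes concrete, the two options look roughly 
like this (a sketch only, assuming a SparkSession called spark; column names, 
paths and the Alluxio address are made up, and alluxio:// paths require the 
Alluxio client on the classpath):

    // (a) Same SparkContext: the cached blocks are reused directly, with no
    //     serialization/deserialization between "job A" and "job B".
    val shared = spark.read.parquet("/data/events")
    shared.cache()
    shared.count()                          // job A materializes the cache
    shared.groupBy("day").count().show()    // job B reuses the same blocks

    // (b) Separate programs: hand the data over through an external
    //     in-memory store such as Alluxio. This costs a copy plus
    //     serialization on write and deserialization on read.
    shared.write.parquet("alluxio://alluxio-master:19998/tmp/events")
    // ...and in the other program:
    val reread = spark.read.parquet("alluxio://alluxio-master:19998/tmp/events")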
> 
> On Tue, Jun 20, 2017 at 10:17 PM, Jean Georges Perrin <j...@jgp.net> wrote:
> Thanks Vadim & Jörn... I will look into those.
> 
> jg
> 
>> On Jun 20, 2017, at 2:12 PM, Vadim Semenov <vadim.seme...@datadoghq.com> wrote:
>> 
>> You can launch one permanent Spark context and then execute your jobs within 
>> that context. Since they'll all be running in the same context, they can 
>> share data easily.
>> 
>> These two projects provide the functionality that you need:
>> https://github.com/spark-jobserver/spark-jobserver#persistent-context-mode---faster--required-for-related-jobs
>> https://github.com/cloudera/livy#post-sessions
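
For reference, driving such a permanent context through Livy's REST API would 
look roughly like this (a sketch only; host, port and session id are 
placeholders, and whether the session predefines spark or only sc depends on 
the Livy version):

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    // Placeholder Livy endpoint.
    val livy = "http://livy-host:8998"
    val client = HttpClient.newHttpClient()

    def post(path: String, json: String): String = {
      val req = HttpRequest.newBuilder(URI.create(livy + path))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(json))
        .build()
      client.send(req, HttpResponse.BodyHandlers.ofString()).body()
    }

    // 1. Create one long-lived interactive session (one shared SparkContext).
    post("/sessions", """{"kind": "spark"}""")

    // 2. Program A and program B each submit code to the *same* session, so
    //    whatever A caches or registers stays visible to B. Session id 0 is
    //    assumed here; in practice, parse it from the response in step 1.
    //    (Newer Livy sessions predefine `spark`; older ones only `sc`.)
    post("/sessions/0/statements",
      """{"code": "val a = spark.read.parquet(\"/data/a\"); a.cache(); a.createOrReplaceTempView(\"a\")"}""")
    post("/sessions/0/statements",
      """{"code": "spark.table(\"a\").count()"}""")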
>> 
>> On Tue, Jun 20, 2017 at 1:46 PM, Jean Georges Perrin <j...@jgp.net> wrote:
>> Hey,
>> 
>> Here is my need: program A does something on a set of data and produces 
>> results, program B does the same on another set, and finally, program C 
>> combines the data of A and B. Of course, the easy way is to dump everything 
>> to disk after A and B are done, but I want to avoid that.
>> 
>> I was thinking of creating a temp view, but I do not really like the temp 
>> aspect of it ;). Any ideas? (They are all worth sharing.)
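
For what it's worth, the temp-view route I was picturing looks roughly like 
this (a sketch, assuming A, B and C all run inside the same Spark application 
and session; paths and column names are made up):

    // Program A (runs inside the shared application):
    val a = spark.read.parquet("/data/a")   // whatever A actually computes
    a.cache()
    a.createGlobalTempView("results_a")

    // Program B:
    val b = spark.read.parquet("/data/b")
    b.cache()
    b.createGlobalTempView("results_b")

    // Program C: combine A's and B's results without ever writing them out.
    // Global temp views live in the reserved `global_temp` database and stay
    // around for the lifetime of the application.
    val combined = spark.sql(
      "SELECT * FROM global_temp.results_a a " +
      "JOIN global_temp.results_b b ON a.id = b.id")
    combined.show()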
>> 
>> jg
>> 
>> 
>> 
