Hi:

        I’m sorry that I’m still confused about Livy's shared object mechanism. 
As for our product design. the shared object is a DataFrame or  table which is 
created from some parquet files in AWS S3. When I use Python programmatic API 
to submit a job like this:
(1) I don’t cache df and tmpTable into spark memory.
def job1(context):
    spark = context.spark_session
    df = 
spark.read.parquet('s3://stg-elasticlog-backup-ex4/type=facet_bandwidth_total/ts_hour=41784*/*/*.parquet')
 \
         .select('in_byte', 'out_byte', 
'user_name').registerTempTable('tmpTable')


(2) Then, I will reuse the “tmpTable” to do some SQL queries, like:
def job4(context):
    spark = context.spark_session
    tmpRes3 = spark.sql("select sum(in_byte) as sum_in, user_name from tmpTable 
\
                           group by user_name \
                           having sum_in > 1000000000 \
                           order by sum_in").toJSON().collect()
    return tmpRes3

(3) Then, I do another query:
def job4(context):
    session = context.spark_session
    tmpRes3 = session.sql("select sum(out_byte) as sum_out, user_name from 
tmpTable \
                           group by user_name \
                           having sum_out > 1000000000 \
                           order by sum_out").toJSON().collect()
    return tmpRes3
My understanding is : For (1), It will find all the parquet files path and 
create a DataFrame, then create a table. But those will not be cached into 
Spark memory. For (2) and (3), every time we use “tmpTable” it knows what 
“tmpTable” is, but it will re-find all the files to re-build this table? Is 
that right?


As for Livy's shared object mechanism. You said "user could store this object 
Foo into JobContext with a provided name”. So, it will store the data into 
memory, or just store the reference to the data?  Maybe my question confuses 
you……

In our previous design, we use AWS s3 and Athena. For a report, about 20 SQL 
queries use the same files in AWS s3, but every query must re-load the files. 
So, it will be very slow. We want to speed up these queries. 

Because these queries use the same files, we actually only need load the files 
just once.  Can Livy's shared object mechanism help us? or we need use Spark 
cache mechanism? 


Thank you so much!


Yours
Wandong


> 在 2018年7月11日,20:17,Saisai Shao <[email protected]> 写道:
> 
> Hi Wandong,
> 
> Livy's shared object mechanism mainly used to share objects between different 
> Livy jobs, this is mainly used for Job API. For example job A create a object 
> Foo which wants to be accessed by Job B, then user could store this object 
> Foo into JobContext with a provided name, after that Job B could get this 
> object by the name.
> 
> This is different from Spark's cache mechanism. What you mentioned above (tmp 
> table) is a Spark provided table cache mechanism, which is unrelated to Livy.
> 
> 
> 
> Wandong Wu <[email protected] <mailto:[email protected]>> 于2018年7月11日周三 
> 下午5:46写道:
> Dear Sir or Madam:
>  
>       I am a Livy beginner. I use Livy, because within an interactive 
> session, different spark jobs could share cached RDDs or DataFrames.
>  
>       When I read some parquet files and create a table called “TmpTable”. 
> The following queries will use this table. Does it mean this table has been 
> cached?
>       If cached, where is the table cached? The table is cached in Livy or 
> Spark cluster?
>  
>       Spark also supports cache function.  When I read some parquet files and 
> create a table called “TmpTable2”. I add such code: 
> sql_ctx.cacheTable('tmpTable2').
>       In the next query using this table. It will be cached in Spark cluster. 
> Then the following queries could use this cached table.
>  
>       What is the difference between cached in Livy and cached in Spark 
> cluster?
>  
> Thanks!
>  
> Yours
> Wandong
>  

Reply via email to