Hi:
I’m sorry that I’m still confused about Livy's shared object mechanism.
As for our product design. the shared object is a DataFrame or table which is
created from some parquet files in AWS S3. When I use Python programmatic API
to submit a job like this:
(1) I don’t cache df and tmpTable into spark memory.
def job1(context):
spark = context.spark_session
df =
spark.read.parquet('s3://stg-elasticlog-backup-ex4/type=facet_bandwidth_total/ts_hour=41784*/*/*.parquet')
\
.select('in_byte', 'out_byte',
'user_name').registerTempTable('tmpTable')
(2) Then, I will reuse the “tmpTable” to do some SQL queries, like:
def job4(context):
spark = context.spark_session
tmpRes3 = spark.sql("select sum(in_byte) as sum_in, user_name from tmpTable
\
group by user_name \
having sum_in > 1000000000 \
order by sum_in").toJSON().collect()
return tmpRes3
(3) Then, I do another query:
def job4(context):
session = context.spark_session
tmpRes3 = session.sql("select sum(out_byte) as sum_out, user_name from
tmpTable \
group by user_name \
having sum_out > 1000000000 \
order by sum_out").toJSON().collect()
return tmpRes3
My understanding is : For (1), It will find all the parquet files path and
create a DataFrame, then create a table. But those will not be cached into
Spark memory. For (2) and (3), every time we use “tmpTable” it knows what
“tmpTable” is, but it will re-find all the files to re-build this table? Is
that right?
As for Livy's shared object mechanism. You said "user could store this object
Foo into JobContext with a provided name”. So, it will store the data into
memory, or just store the reference to the data? Maybe my question confuses
you……
In our previous design, we use AWS s3 and Athena. For a report, about 20 SQL
queries use the same files in AWS s3, but every query must re-load the files.
So, it will be very slow. We want to speed up these queries.
Because these queries use the same files, we actually only need load the files
just once. Can Livy's shared object mechanism help us? or we need use Spark
cache mechanism?
Thank you so much!
Yours
Wandong
> 在 2018年7月11日,20:17,Saisai Shao <[email protected]> 写道:
>
> Hi Wandong,
>
> Livy's shared object mechanism mainly used to share objects between different
> Livy jobs, this is mainly used for Job API. For example job A create a object
> Foo which wants to be accessed by Job B, then user could store this object
> Foo into JobContext with a provided name, after that Job B could get this
> object by the name.
>
> This is different from Spark's cache mechanism. What you mentioned above (tmp
> table) is a Spark provided table cache mechanism, which is unrelated to Livy.
>
>
>
> Wandong Wu <[email protected] <mailto:[email protected]>> 于2018年7月11日周三
> 下午5:46写道:
> Dear Sir or Madam:
>
> I am a Livy beginner. I use Livy, because within an interactive
> session, different spark jobs could share cached RDDs or DataFrames.
>
> When I read some parquet files and create a table called “TmpTable”.
> The following queries will use this table. Does it mean this table has been
> cached?
> If cached, where is the table cached? The table is cached in Livy or
> Spark cluster?
>
> Spark also supports cache function. When I read some parquet files and
> create a table called “TmpTable2”. I add such code:
> sql_ctx.cacheTable('tmpTable2').
> In the next query using this table. It will be cached in Spark cluster.
> Then the following queries could use this cached table.
>
> What is the difference between cached in Livy and cached in Spark
> cluster?
>
> Thanks!
>
> Yours
> Wandong
>