[ 
https://issues.apache.org/jira/browse/SPARK-32274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated SPARK-32274:
----------------------------------------
    Description: 
Caching a Dataset or DataFrame can be a very expensive operation, but it has a 
huge benefit for later queries that reuse it.  There are many use cases that 
would benefit from caching the data, but not enough to justify the memory and 
CPU cost of the current caching scheme.  I would like to propose that we make 
the serialization used for caching pluggable, so that users can explore other 
formats and compression codecs.
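
As a very rough sketch of the kind of plugin point I have in mind (all of the 
names below are hypothetical, this is not an existing Spark API), the 
serializer could be a trait that converts rows to and from an opaque 
cached-batch representation, with the implementation picked by a config:
{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch only: an opaque, serialized/compressed batch of cached rows.
trait CachedBatch {
  def numRows: Int
  def sizeInBytes: Long
}

// Hypothetical plugin interface for cache serialization.  A concrete
// implementation could serialize to Parquet, Arrow, or anything else, and
// pick whatever compression codec it wants.
trait CachedBatchSerializer extends Serializable {
  // Encode incoming rows into cached batches when the cache is built.
  def convertForCache(
      input: RDD[InternalRow],
      schema: StructType,
      storageLevel: StorageLevel): RDD[CachedBatch]

  // Decode cached batches back into rows when the cache is read.
  def convertFromCache(
      input: RDD[CachedBatch],
      schema: StructType): RDD[InternalRow]
}
{code}
With something like this in place the existing in-memory columnar format would 
simply become the default implementation, and a SQL config could select an 
alternative.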

 

As an example, I took the lineitem table from TPC-H at a scale factor of 10 and 
converted it to Parquet.  This resulted in 2.1 GB of data on disk.  With the 
current caching it can take nearly 8 GB to store that same data in memory, and 
about 5 GB to store it on disk.
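
For anyone who wants to check numbers like these themselves, one way is to 
persist with different storage levels and look at the reported RDD storage 
sizes, roughly like this (exact sizes will vary by environment and Spark 
version):
{code:scala}
import org.apache.spark.storage.StorageLevel

val a = spark.read.parquet("../data/tpch/SF10_parquet/lineitem.tbl/")

// Footprint of the current cache format in memory.
val inMem = a.persist(StorageLevel.MEMORY_ONLY)
inMem.count()  // force the cache to be materialized
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: mem=${info.memSize} B, disk=${info.diskSize} B")
}
inMem.unpersist(blocking = true)

// Footprint of the current cache format on disk.
val onDisk = a.persist(StorageLevel.DISK_ONLY)
onDisk.count()
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: mem=${info.memSize} B, disk=${info.diskSize} B")
}
onDisk.unpersist(blocking = true)
{code}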

 

If I want to read all of that data and write it out again, it takes about 26 
seconds:
{code:java}
scala> val a = spark.read.parquet("../data/tpch/SF10_parquet/lineitem.tbl/")
a: org.apache.spark.sql.DataFrame = [l_orderkey: bigint, l_partkey: bigint ... 14 more fields]

scala> spark.time(a.write.mode("overwrite").parquet("./target/tmp"))
Time taken: 25832 ms
{code}
But a query that reads that same data directly from the cache, after the cache 
has been built, takes only 21531 ms.  In other words a Parquet-based cache 
would be somewhat slower to read than the current format, but for some queries 
being able to fit much more data in the cache might be worth that extra query 
time.
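
A sketch of how that comparison can be made, reusing the DataFrame {{a}} from 
the session above (the write is just one example of a query that consumes the 
cached data; timings will of course vary):
{code:scala}
// Build the cache first; the count forces it to be materialized, which is
// where the cache compression cost is paid.
val cached = a.cache()
spark.time(cached.count())

// The same write as before, but now reading from the cache instead of Parquet.
spark.time(cached.write.mode("overwrite").parquet("./target/tmp"))
{code}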

 

It also takes a lot less time to do the Parquet compression than it does to do 
the cache compression.



> Add in the ability for a user to replace the serialization format of the cache
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-32274
>                 URL: https://issues.apache.org/jira/browse/SPARK-32274
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Robert Joseph Evans
>            Priority: Major
>


