[jira] [Commented] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

2018-09-11 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611633#comment-16611633
 ] 

Kazuaki Ishizaki commented on SPARK-16196:
--

[~cloud_fan] This PR in the Jira entry proposes two fixes
 # Read data in a table cache directry from a columnar storage
 # Generate code to build a table cache

We already implemented 1. But, we have not implmented 2. yet. Let us address 2. 
in the next release.

> Optimize in-memory scan performance using ColumnarBatches
> -
>
> Key: SPARK-16196
> URL: https://issues.apache.org/jira/browse/SPARK-16196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Major
>
> A simple benchmark such as the following reveals inefficiencies in the 
> existing in-memory scan implementation:
> {code}
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 1) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()
> ds.collect()
> {code}
> There are many reasons why caching is slow. The biggest is that compression 
> takes a long time. The second is that there are a lot of virtual function 
> calls in this hot code path since the rows are processed using iterators. 
> Further, the rows are converted to and from ByteBuffers, which are slow to 
> read in general.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

2018-09-11 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611494#comment-16611494
 ] 

Kazuaki Ishizaki commented on SPARK-16196:
--

I see. I will check this.

> Optimize in-memory scan performance using ColumnarBatches
> -
>
> Key: SPARK-16196
> URL: https://issues.apache.org/jira/browse/SPARK-16196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Major
>
> A simple benchmark such as the following reveals inefficiencies in the 
> existing in-memory scan implementation:
> {code}
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 1) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()
> ds.collect()
> {code}
> There are many reasons why caching is slow. The biggest is that compression 
> takes a long time. The second is that there are a lot of virtual function 
> calls in this hot code path since the rows are processed using iterators. 
> Further, the rows are converted to and from ByteBuffers, which are slow to 
> read in general.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

2018-09-11 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610702#comment-16610702
 ] 

Wenchen Fan commented on SPARK-16196:
-

I believe it has been fixed, [~kiszk] can you confirm?

> Optimize in-memory scan performance using ColumnarBatches
> -
>
> Key: SPARK-16196
> URL: https://issues.apache.org/jira/browse/SPARK-16196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Major
>
> A simple benchmark such as the following reveals inefficiencies in the 
> existing in-memory scan implementation:
> {code}
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 1) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()
> ds.collect()
> {code}
> There are many reasons why caching is slow. The biggest is that compression 
> takes a long time. The second is that there are a lot of virtual function 
> calls in this hot code path since the rows are processed using iterators. 
> Further, the rows are converted to and from ByteBuffers, which are slow to 
> read in general.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

2016-06-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348703#comment-15348703
 ] 

Apache Spark commented on SPARK-16196:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/13899

> Optimize in-memory scan performance using ColumnarBatches
> -
>
> Key: SPARK-16196
> URL: https://issues.apache.org/jira/browse/SPARK-16196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> A simple benchmark such as the following reveals inefficiencies in the 
> existing in-memory scan implementation:
> {code}
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 1) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()
> ds.collect()
> {code}
> There are many reasons why caching is slow. The biggest is that compression 
> takes a long time. The second is that there are a lot of virtual function 
> calls in this hot code path since the rows are processed using iterators. 
> Further, the rows are converted to and from ByteBuffers, which are slow to 
> read in general.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org