Andrew Or created SPARK-16196:
---------------------------------

             Summary: Optimize in-memory scan performance using ColumnarBatches
                 Key: SPARK-16196
                 URL: https://issues.apache.org/jira/browse/SPARK-16196
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Andrew Or
            Assignee: Andrew Or


A simple benchmark such as the following reveals inefficiencies in the existing 
in-memory scan implementation:
{code}
// N is a placeholder for the desired row count.
spark.range(N)
  .selectExpr("id", "floor(rand() * 10000) as k")
  .createOrReplaceTempView("test")
// Cache the table itself so the second query actually scans in-memory data.
spark.table("test").cache()
val ds = spark.sql("select count(k), count(id) from test")
ds.collect() // first run builds the in-memory cache
ds.collect() // second run exercises the in-memory scan
{code}

Caching is slow for several reasons. The biggest is compression, which takes a 
long time while the cache is built. The second is the large number of virtual 
function calls on this hot code path, since the rows are processed through 
iterators. Further, the rows are converted to and from ByteBuffers, which are 
generally slow to read.
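
For illustration, here is a minimal, Spark-independent sketch of the access-pattern gap described above. The {{Row}}, {{LongRow}}, and {{LongColumn}} names are hypothetical, not Spark APIs; they only model the two patterns: summing a column through a row iterator pays virtual calls on every row, while a columnar layout exposes the values as a flat primitive array that a tight loop (and the JIT) can handle directly.

{code}
// Hypothetical sketch (not Spark APIs): contrasts row-at-a-time iterator
// access with flat columnar access.
object ColumnarSketch {

  trait Row {
    def getLong(ordinal: Int): Long // virtual call per field access
  }

  final class LongRow(var value: Long) extends Row {
    override def getLong(ordinal: Int): Long = value
  }

  // Row-at-a-time: several virtual calls per row in the hot loop.
  def sumRows(rows: Iterator[Row]): Long = {
    var sum = 0L
    while (rows.hasNext) {          // virtual call
      sum += rows.next().getLong(0) // two more virtual calls
    }
    sum
  }

  // Columnar: values sit in a primitive array, so the hot loop has no
  // per-row virtual dispatch and no per-row ByteBuffer decoding.
  final class LongColumn(val values: Array[Long])

  def sumColumn(col: LongColumn): Long = {
    var sum = 0L
    var i = 0
    while (i < col.values.length) {
      sum += col.values(i) // direct primitive array access
      i += 1
    }
    sum
  }
}
{code}

A ColumnarBatch-backed scan aims to put the cached data in the second shape, so generated code can loop over column vectors directly instead of pulling each row through an iterator.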


