Hello,

I'm trying to use Spark SQL as a log analytics solution. As you might
guess, for most use cases the data is ordered by timestamp and the volume
of data is large.

If I want to show the first 100 entries (ordered by timestamp) for a given
condition, the Spark executors have to scan all matching entries to select
the top 100 by timestamp.
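
For example, with a JDBC-backed table (the table, columns, and connection
details below are just illustrative), Spark sorts all matching rows itself
rather than letting the source return the top 100 early:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("log-analytics").getOrCreate()

    // made-up table and connection details for illustration
    val logs = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db/logs")
      .option("dbtable", "logs")
      .load()

    // The filter gets pushed to the source, but the ORDER BY + LIMIT
    // is executed by Spark over every matching row
    logs.filter(col("level") === "ERROR")
      .orderBy(col("timestamp").desc)
      .limit(100)
      .show()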

I understand this behavior. However, some data sources, such as JDBC or
Lucene, natively support ordering, and in my case the target data is large
(a couple million entries). I believe it should be possible to push the
ordering down to the data source and let the executors return early.
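
To make it concrete, here is a rough sketch of the kind of mix-in I have
in mind, loosely modeled on the existing PrunedFilteredScan trait. The
trait and its methods are hypothetical; nothing like this exists in Spark
today:

    import org.apache.spark.sql.sources.BaseRelation

    // Hypothetical mix-in: a relation implementing it promises to emit
    // rows already sorted by the requested columns, so Spark could skip
    // its own sort and stop pulling rows once the LIMIT is satisfied.
    trait SupportsPushDownOrdering { self: BaseRelation =>
      // The planner would offer the required sort order as
      // (column, ascending) pairs; returning true means the source
      // accepts it (e.g. JDBC appends ORDER BY, Lucene runs a
      // sorted search)
      def pushOrdering(ordering: Seq[(String, Boolean)]): Boolean

      // With a limit pushed down as well, the source can stop after
      // n rows, which is what would let the executors return early
      def pushLimit(n: Int): Unit
    }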

Here's my ask: I know Spark doesn't do such a thing today, but I'm looking
for any pointers or references that might be relevant to this, and any
ideas would be appreciated. So far I've found that some folks are working
on aggregation pushdown (SPARK-22390), but I don't see any current
activity on ordering pushdown.

Thanks


-- 
Kohki Nishio
