Hello, I'm trying to use Spark SQL as a log-analytics solution. As you might guess, for most use cases the data is ordered by timestamp and the data volume is large.
If I want to show the first 100 entries (ordered by timestamp) for a given condition, the Spark executor has to scan all entries to select the top 100 by timestamp. I understand this behavior; however, some data sources such as JDBC or Lucene can support ordering themselves, and in this case the target data is large (a couple of million rows). I believe it is possible to push the ordering down to the data source and let the executors return early.

Here's my ask: I know Spark doesn't do this today, but I'm looking for any pointers or references that might be relevant, or any random idea would be appreciated. So far I've found that some folks are working on aggregation pushdown (SPARK-22390), but I don't see any current activity on ordering pushdown.

Thanks
-- Kohki Nishio
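To make the cost difference concrete, here is a minimal sketch in plain Python (not Spark APIs): when the source returns rows unordered, the reader must touch every row and keep a top-K heap; when the source can return rows already ordered by timestamp (as a JDBC database with ORDER BY, or a Lucene index sorted on a field, could), the reader can stop after the first 100 rows. The data and names here are made up for illustration.

```python
import heapq
from itertools import islice

# Simulated log entries: (timestamp, message), arriving in no useful order.
rows = [(ts, f"event-{ts}") for ts in range(1_000_000, 0, -7)]

# Without ordering pushdown: full scan, keep the 100 smallest timestamps
# in a bounded heap -- every row must be examined.
top_full_scan = heapq.nsmallest(100, rows)

# With ordering pushdown: the source hands back rows already sorted by
# timestamp, so the reader returns early after consuming 100 rows.
ordered_source = iter(sorted(rows))  # stands in for a sorted data source
top_early_return = list(islice(ordered_source, 100))

assert top_full_scan == top_early_return
```

(A practical workaround today, if I understand the JDBC source correctly, is to embed the ORDER BY/LIMIT in a subquery passed through the JDBC `dbtable` option, though that bypasses Spark's planner entirely rather than being a real pushdown.)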