Hi, folks!

For testing purposes, I need to read a small chunk of rows from a big table
(~12 billion rows) on my dev machine. So I started the driver with master
"local[4]" and wrote code like this:

import sparkSession.implicits._ // needed for the $"col" column syntax

// SplitSize512Mb is a String constant, e.g. (512L * 1024 * 1024).toString
sparkSession.sqlContext.read.options(Map(
  "kudu.master" -> "master",
  "kudu.table" -> "thebigtable",
  "kudu.splitSizeBytes" -> SplitSize512Mb
)).format("kudu").load
  .limit(1000)
  .select($"col1", $"col2", $"col3")
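
One way to see where the limit ends up is to bind the same read to a val and
print the plan (df is just an arbitrary name of mine, not part of the snippet
above):

// Same read as above, bound to a val so the plan can be printed.
// explain(true) shows the logical and physical plans, i.e. whether the limit
// reaches the Kudu relation or stays as a separate limit step that Spark
// runs after the scan.
val df = sparkSession.sqlContext.read.options(Map(
  "kudu.master" -> "master",
  "kudu.table" -> "thebigtable",
  "kudu.splitSizeBytes" -> SplitSize512Mb
)).format("kudu").load
  .limit(1000)
  .select($"col1", $"col2", $"col3")

df.explain(true)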

My expectation: only 1000 rows should actually be read from Kudu, so the
query should return very quickly.

Actually observed: Spark started 4 parallel scanners against one of the
tablets, and it looks like they are scanning the whole tablet (which is
~2.4 billion rows), so the scan takes a very long time.

Is this expected behavior?

I found this closed ticket, https://issues.apache.org/jira/browse/KUDU-16,
with a comment about Spark: "No support on the Spark side, but AFAICT, support
for limits given our current Scala bindings is somewhat unnatural."
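
In the meantime, the workaround I'm thinking about is to skip Spark for this
test and read the 1000 rows directly through the Kudu Java client, where the
scanner builder takes a limit. A rough, untested sketch (the string column
types are just an assumption from my example):

import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer
import org.apache.kudu.client.KuduClient

// Read at most 1000 rows of three columns straight from Kudu;
// the limit is enforced by the scanner itself rather than by Spark.
val client = new KuduClient.KuduClientBuilder("master").build()
try {
  val table = client.openTable("thebigtable")
  val scanner = client.newScannerBuilder(table)
    .setProjectedColumnNames(Seq("col1", "col2", "col3").asJava)
    .limit(1000) // upper bound on rows returned by this scanner
    .build()

  val rows = ArrayBuffer.empty[(String, String, String)] // assuming string columns
  while (scanner.hasMoreRows && rows.size < 1000) {
    val batch = scanner.nextRows()
    while (batch.hasNext && rows.size < 1000) {
      val r = batch.next()
      rows += ((r.getString("col1"), r.getString("col2"), r.getString("col3")))
    }
  }
  scanner.close()
  println(s"read ${rows.size} rows")
} finally {
  client.close()
}

But of course I'd prefer to stay within the Spark API if there is a way to
make the limit reach Kudu.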

Kudu ver 1.11.1.

-- 
with best regards, Pavel Martynov
