Yes, your observations match what's in the code: the Kudu Spark bindings don't support scanner row limits, while the Kudu Java, C++, and Python clients do. And indeed, https://issues.apache.org/jira/browse/KUDU-16 contains relevant information on the status of this missing feature.
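For reference, if going through the Java client directly is an option for your test, a minimal sketch of a limited scan from Scala might look like the following (untested; the master address, table, and column names are just placeholders taken from your example below):

import java.util.Arrays
import org.apache.kudu.client.KuduClient

// Sketch: scan at most 1000 rows using the Java client's scanner limit.
val client = new KuduClient.KuduClientBuilder("master").build()
try {
  val table = client.openTable("thebigtable")
  val scanner = client.newScannerBuilder(table)
    .setProjectedColumnNames(Arrays.asList("col1", "col2", "col3"))
    .limit(1000) // the scanner stops after returning 1000 rows
    .build()
  while (scanner.hasMoreRows) {
    val rows = scanner.nextRows()
    while (rows.hasNext) {
      val row = rows.next()
      // process row ...
    }
  }
} finally {
  client.close()
}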
To my knowledge, nobody is currently working on implementing scanner limits for kudu-spark. However, patches are always welcome!

Kind regards,

Alexey

On Mon, Jan 20, 2020 at 10:38 PM Pavel Martynov <mr.xk...@gmail.com> wrote:
> Hi, folks!
>
> For testing purposes, I need to read a small chunk of rows from a big table
> (~12 billion rows) on my dev machine. So I started the driver with "local[4]"
> executors and wrote code like:
>
> sparkSession.sqlContext.read.options(Map(
>   "kudu.master" -> "master",
>   "kudu.table" -> "thebigtable",
>   "kudu.splitSizeBytes" -> SplitSize512Mb
> )).format("kudu").load
>   .limit(1000)
>   .select($"col1", $"col2", $"col3")
>
> My expectation: only 1000 rows should actually be read from Kudu, very
> quickly.
>
> Actually observed: Spark started 4 parallel scanners for one of the
> tablets, and it looks like this process scans the whole tablet
> (~2.4 billion rows); the scan takes a really long time.
>
> Is this expected behavior?
>
> I found the closed ticket https://issues.apache.org/jira/browse/KUDU-16 with
> a comment on Spark: "No support on the Spark side, but AFAICT, support for
> limits given our current Scala bindings is somewhat unnatural.".
>
> Kudu ver 1.11.1.
>
> --
> with best regards, Pavel Martynov