Yes, your observations match what's in the code: the Kudu Spark bindings don't
support scanner row limits, while the Kudu Java, C++, and Python clients do
support them.  And indeed,
https://issues.apache.org/jira/browse/KUDU-16 contains
relevant information on the status of this missing feature.

As far as I know, nobody is currently working on implementing scanner
limits for kudu-spark.  However, patches are always welcome!
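
In the meantime, since the Java client does support limits, one workaround
for a quick test is to read the rows through the Kudu Java client directly
rather than through kudu-spark.  Here is a minimal sketch in Scala, reusing
the master address, table, and column names from your snippet (adjust the
row getters to your actual column types):

  import scala.collection.JavaConverters._
  import org.apache.kudu.client.KuduClient

  val client = new KuduClient.KuduClientBuilder("master").build()
  try {
    val table = client.openTable("thebigtable")
    val scanner = client.newScannerBuilder(table)
      .setProjectedColumnNames(Seq("col1", "col2", "col3").asJava)
      .limit(1000)  // scanner returns at most 1000 rows in total
      .build()
    var read = 0L
    while (scanner.hasMoreRows) {
      val rows = scanner.nextRows()
      while (rows.hasNext) {
        val row = rows.next()
        // e.g. row.getString("col1") -- use the getter matching the column type
        read += 1
      }
    }
    scanner.close()
    println(s"read $read rows")
  } finally {
    client.close()
  }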


Kind regards,

Alexey

On Mon, Jan 20, 2020 at 10:38 PM Pavel Martynov <mr.xk...@gmail.com> wrote:

> Hi, folks!
>
> For testing purposes, I need to read a small chunk of rows from a big table
> (~12 billion rows) on my dev machine. So I started the driver with "local[4]"
> executors and wrote code like:
>
> sparkSession.sqlContext.read.options(Map(
>   "kudu.master" -> "master",
>   "kudu.table" -> "thebigtable",
>   "kudu.splitSizeBytes" -> SplitSize512Mb
> )).format("kudu").load
>   .limit(1000)
>   .select($"col1", $"col2", $"col3")
>
> My expectation: only 1000 rows should actually be read from Kudu, and very
> quickly.
>
> Actually observed: Spark started 4 parallel scanners for one of the
> tablets, and it looks like the scanning process scans the whole tablet
> (which is ~2.4 billion rows), so the scan takes a really long time.
>
> Is this expected behavior?
>
> I found this closed ticket https://issues.apache.org/jira/browse/KUDU-16 with
> comments on Spark: "No support on the Spark side, but AFAICT, support for
> limits given our current Scala bindings is somewhat unnatural.".
>
> Kudu ver 1.11.1.
>
> --
> with best regards, Pavel Martynov
>
