Hi, folks! For testing purposes, I need to read a small chunk of rows from a big table (~12 billion rows) on my dev machine. So I started the driver with "local[4]" executors and wrote code like this:
    sparkSession.sqlContext.read
      .options(Map(
        "kudu.master"         -> "master",
        "kudu.table"          -> "thebigtable",
        "kudu.splitSizeBytes" -> SplitSize512Mb))
      .format("kudu")
      .load
      .limit(1000)
      .select($"col1", $"col2", $"col3")

My expectation: only 1000 rows would actually be read from Kudu, so the read should be very fast.

What I actually observe: Spark starts 4 parallel scanners against one of the tablets, and the scan appears to go through the whole tablet (~2.4 billion rows), so the scan time is really long.

Is this expected behavior? I found this closed ticket, https://issues.apache.org/jira/browse/KUDU-16, with a comment about Spark: "No support on the Spark side, but AFAICT, support for limits given our current Scala bindings is somewhat unnatural."

Kudu version is 1.11.1.

--
with best regards, Pavel Martynov
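
P.S. To frame the question a bit more: the fallback I am considering is to bypass the DataFrame API and grab the sample through the plain kudu-client scanner, relying on KuduScannerBuilder.limit(). This is only a minimal sketch under that assumption (same master address, table, and column names as above), not something I have benchmarked:

    import java.util.Arrays
    import org.apache.kudu.client.KuduClient

    // Sketch: scan the table directly with the Java client and ask the
    // scanner itself to stop after 1000 rows (assumes limit() is honored
    // by the scanner rather than by Spark).
    val client = new KuduClient.KuduClientBuilder("master").build()
    try {
      val table = client.openTable("thebigtable")
      val scanner = client.newScannerBuilder(table)
        .setProjectedColumnNames(Arrays.asList("col1", "col2", "col3"))
        .limit(1000)
        .build()
      var rows = 0
      while (scanner.hasMoreRows) {
        val it = scanner.nextRows()
        while (it.hasNext) {
          val row = it.next() // row.getString("col1") etc. would go here
          rows += 1
        }
      }
      println(s"fetched $rows rows")
    } finally {
      client.shutdown()
    }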