Hi all, I am using Spark to pull data from my single-node test Kudu setup and publish it to Kafka. However, my query times are not consistent.
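
For context, here is a minimal sketch of the kind of job I am running (Spark 2.3 reading through the kudu-spark connector and writing with Spark's batch Kafka sink; the master address, table name, topic, and filter column below are placeholders, not my actual values):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, to_json}

object KuduToKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kudu-to-kafka").getOrCreate()

    // Read the source table through the kudu-spark connector.
    // Master address, table name, and filter column are placeholders.
    val packets = spark.read
      .format("org.apache.kudu.spark.kudu")
      .option("kudu.master", "kudu-master:7051")
      .option("kudu.table", "packets")
      .load()
      .filter("capture_time >= '2018-06-01' AND capture_time < '2018-07-01'")

    // Serialize each row as JSON and publish it to a Kafka topic using
    // Spark's built-in batch Kafka sink (spark-sql-kafka-0-10 package).
    packets
      .select(to_json(struct(packets.columns.map(packets.col): _*)).alias("value"))
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-broker:9092")
      .option("topic", "packets")
      .save()

    spark.stop()
  }
}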
I am querying a table with around *1.1 million* packets. Initially, the query took *537 seconds* to read *51042 records* from Kudu and write them to Kafka, a much lower rate than I had expected. I had around 45 tables holding small amounts of data that were no longer needed. I deleted all of those tables, restarted the Spark session, and attempted the same query. This time it completed in *5.3 seconds*. I then increased the number of rows to be fetched and tried the same query. The row count was *118741*, but the query took *1861 seconds* to complete, and resource utilization on my servers was very low the whole time. When I attempted the same query again a couple of hours later, it took only *16 seconds*. After this I kept increasing the number of rows to be fetched, and the time kept increasing in a linear fashion.

What I want to ask is:
- How can I debug why the time for these queries varies so much? I am not able to get anything out of the Kudu logs.
- I am running Kudu with default configurations. Are there any tweaks I should make to boost the performance of my setup?
- Does having a lot of tables cause performance issues?
- Will having more masters and tservers improve my query times?

*Environment details:*
- Single-node Kudu 1.7 master and tserver. The server has 4 vCPUs and 16 GB RAM.
- The table I am querying is hash partitioned on an ID column into 3 buckets, and range partitioned on a datetime column with a new partition for each month.
- Kafka version 1.1.
- Standalone Spark 2.3.0 deployed on a server with 2 vCPUs and 4 GB RAM.

--
Faraz Mateen
