Hi all, I am using Spark to pull data from my single-node test Kudu setup and publish it to Kafka. However, my query times are not consistent.
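
For context, here is a minimal sketch of the kind of job I am running (Spark 2.3 reading through the kudu-spark connector and writing with Spark's batch Kafka sink; the master address, table name, topic, and filter column below are placeholders, not my actual values):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, to_json}

object KuduToKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kudu-to-kafka").getOrCreate()

    // Read the source table through the kudu-spark connector.
    // Master address, table name, and filter column are placeholders.
    val packets = spark.read
      .format("org.apache.kudu.spark.kudu")
      .option("kudu.master", "kudu-master:7051")
      .option("kudu.table", "packets")
      .load()
      .filter("capture_time >= '2018-06-01' AND capture_time < '2018-07-01'")

    // Serialize each row as JSON and publish it to a Kafka topic using
    // Spark's built-in batch Kafka sink (spark-sql-kafka-0-10 package).
    packets
      .select(to_json(struct(packets.columns.map(packets.col): _*)).alias("value"))
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-broker:9092")
      .option("topic", "packets")
      .save()

    spark.stop()
  }
}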
I am querying a table with around *1.1 million* packets. Initially, the query took *537 seconds* to read *51042 records* from Kudu and write them to Kafka, a much lower rate than I had expected. I had around 45 tables holding small amounts of data that were no longer needed. I deleted all of those tables, restarted the Spark session, and attempted the same query. This time it completed in *5.3 seconds*. I then increased the number of rows to be fetched and tried the same query. The row count was *118741*, but the query took *1861 seconds* to complete, and resource utilization on my servers was very low the whole time. When I attempted the same query again a couple of hours later, it took only *16 seconds*. After this I kept increasing the number of rows to be fetched, and the time kept increasing in a linear fashion.

What I want to ask is:
- How can I debug why the time for these queries varies so much? I am not able to get anything out of the Kudu logs.
- I am running Kudu with default configurations. Are there any tweaks I should make to boost the performance of my setup?
- Does having a lot of tables cause performance issues?
- Will having more masters and tservers improve my query times?

*Environment details:*
- Single-node Kudu 1.7 master and tserver. The server has 4 vCPUs and 16 GB RAM.
- The table I am querying is hash partitioned on an ID column into 3 buckets, and range partitioned on a datetime column with a new partition for each month.
- Kafka version 1.1.
- Standalone Spark 2.3.0 deployed on a server with 2 vCPUs and 4 GB RAM.

--
Faraz Mateen
