Hi Faraz,

Answered inline below.
Best,
Hao

On Tue, Feb 12, 2019 at 6:59 AM Faraz Mateen <[email protected]> wrote:

> Hi all,
>
> I am using spark to pull data from my single node testing kudu setup and
> publish it to kafka. However, my query time is not consistent.
>
> I am querying a table with around *1.1 million* packets. Initially my
> query took *537 seconds to read 51042 records* from kudu and write
> them to kafka. This rate was much lower than I had expected. I had
> around 45 tables with little data in them that was no longer needed. I
> deleted all those tables, restarted the spark session and attempted the
> same query. This time the query completed in *5.3 seconds*.
>
> I increased the number of rows to be fetched and tried the same query.
> The row count was *118741*, but it took *1861 seconds* to complete. During
> the query, resource utilization of my servers was very low. When I
> attempted the same query again after a couple of hours, it took only
> *16 seconds*.
>
> After this I kept increasing the number of rows to be fetched, and the
> time kept increasing in a linear fashion.
>
> What I want to ask is:
>
> - How can I debug why the time for these queries varies so much? I am
> not able to get anything out of the Kudu logs.

You can use the tablet server web UI scans dashboard (/scans) to get a better
understanding of ongoing and past queries. The flag 'scan_history_count' is
used to configure the size of the history buffer. From there, you can get
information such as the applied predicates and column stats for the selected
columns.

> - I am running kudu with default configurations. Are there any tweaks I
> should perform to boost the performance of my setup?

Did you notice any compactions in Kudu between issuing the two queries? What
is your ingest pattern? Are you inserting data in random primary key order?

> - Does having a lot of tables cause performance issues?
If no resource limitation is being hit due to writes/scans on the other
tables, they shouldn't affect the performance of your queries. Just FYI, this
is the scaling guide <https://kudu.apache.org/docs/scaling_guide.html> with
respect to various system resources.

> - Will having more masters and tservers improve my query time?

The master is not likely to be the bottleneck, as clients communicate
directly with the tserver for queries once they know which tserver to talk
to. But separating the master and tserver onto different nodes might help.
This is the scale limitation
<https://kudu.apache.org/docs/known_issues.html#_scale> guide for a rough
estimate of the number of tservers required for a given quantity of data.

> *Environment Details:*
>
> - Single node Kudu 1.7 master and tserver. The server has 4 vCPUs and
> 16 GB RAM.
> - The table I am querying is hash partitioned on the basis of an ID
> with 3 buckets. It is also range partitioned on the basis of datetime,
> with a new partition for each month.
> - Kafka version 1.1.
> - Standalone Spark 2.3.0 deployed on a server with 2 vCPUs and 4 GB RAM.
>
> --
> Faraz Mateen
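As a concrete illustration of the /scans dashboard and 'scan_history_count'
flag mentioned above, something like the following could be used. This is a
sketch, not a verified recipe: the host name "tserver-host" is a placeholder,
and the ports shown are the Kudu defaults (8050 for the tserver web UI, 7050
for the tserver RPC endpoint), which may differ in your deployment.

```shell
# Fetch the scans dashboard from the tablet server web UI
# (shows ongoing/recent scans, applied predicates, column stats).
curl http://tserver-host:8050/scans

# Increase the number of completed scans kept in the history buffer.
# Whether this particular flag is runtime-settable may depend on your
# Kudu version; if it is not, the CLI will require --force, or the
# flag can instead be set in the tserver's startup flags.
kudu tserver set_flag tserver-host:7050 scan_history_count 100
```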
