Hao,

The order of my primary key is (ID, datetime). My query had a 'WHERE' clause on both of these keys. How exactly does the order affect scan performance?
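To make the question concrete, here is a minimal sketch of the kind of scan I mean, using the Kudu Java client from Scala. The table name, column names, master address, and predicate values below are placeholders, not my real ones. My understanding is that with a (ID, datetime) key, the equality predicate on the leading column ID plus the datetime range can be turned into tight primary key bounds, whereas with (datetime, ID) only the datetime range bounds the scan and the ID predicate is applied as a filter inside it. Is that right?

    import org.apache.kudu.client.{KuduClient, KuduPredicate}
    import org.apache.kudu.client.KuduPredicate.ComparisonOp

    object ScanSketch extends App {
      // Placeholder master address and table name.
      val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
      val table  = client.openTable("packets")
      val schema = table.getSchema

      // Equality on ID plus a range on datetime (both key columns).
      val scanner = client.newScannerBuilder(table)
        .addPredicate(KuduPredicate.newComparisonPredicate(
          schema.getColumn("id"), ComparisonOp.EQUAL, 7L))
        .addPredicate(KuduPredicate.newComparisonPredicate(
          schema.getColumn("datetime"), ComparisonOp.GREATER_EQUAL,
          1548979200000000L))  // 2019-02-01 00:00:00 UTC in micros (example)
        .addPredicate(KuduPredicate.newComparisonPredicate(
          schema.getColumn("datetime"), ComparisonOp.LESS,
          1551398400000000L))  // 2019-03-01 00:00:00 UTC in micros (example)
        .build()

      var rows = 0L
      while (scanner.hasMoreRows) {
        val batch = scanner.nextRows()
        while (batch.hasNext) { batch.next(); rows += 1 }
      }
      println(s"matched $rows rows")
      client.shutdown()
    }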
I think restarting the tablet server removed all previous records from the scans dashboard. I can't find any query that took too long to complete.

On Thu, Feb 14, 2019 at 4:31 AM Hao Hao <[email protected]> wrote:

> Hi Faraz,
>
> What is the order of your primary key? Is it (datetime, ID) or (ID, datetime)?
>
> On the contrary, I suspect your scan performance got better for the same query because a compaction happened in between, and thus there were fewer blocks to scan. Also, would you mind sharing a screenshot of the tablet server web UI page from when your scans took place (to compare the 'good' and 'bad' scans)?
>
> Best,
> Hao
>
> On Wed, Feb 13, 2019 at 9:37 AM Faraz Mateen <[email protected]> wrote:
>
>> By "not noticing any compaction" I meant I did not see any visible change in disk space. However, the logs show that there were some compaction-related operations happening during this whole time period. These statements appeared multiple times in the tserver logs:
>>
>> W0211 13:44:10.991221 15822 tablet.cc:1679] T 00b8818d0713485b83982ac56d9e342a P 7b44fc5229fe43e190d4d6c1e8022988: Can't schedule compaction. Clean time has not been advanced past its initial value.
>> ...
>> ...
>> I0211 14:36:33.883819 15822 maintenance_manager.cc:302] P 7b44fc5229fe43e190d4d6c1e8022988: Scheduling MajorDeltaCompactionOp(30c9aaadcb13460fab832bdea1104349): perf score=0.106957
>> I0211 14:36:33.884233 13179 diskrowset.cc:560] T 30c9aaadcb13460fab832bdea1104349 P 7b44fc5229fe43e190d4d6c1e8022988: RowSet(3080): Major compacting REDO delta stores (cols: 2 3 4 5 6 7 9 10 11 13 14 15 16 20 22 29 31 33 36 38 39 41 42 47 49 51 52 56 57 58 64 67 68 71 75 77 78 79 80 81 109 128 137)
>>
>> Does compaction affect scan performance? And if it does, what can I do to limit this degradation?
>>
>> On Wed, Feb 13, 2019 at 7:24 PM Faraz Mateen <[email protected]> wrote:
>>
>>> Thanks a lot for the help, Hao.
>>>
>>> Responses inline:
>>>
>>>> You can use the tablet server web UI scans dashboard (/scans) to get a better understanding of ongoing/past queries. The flag 'scan_history_count' is used to configure the size of the buffer. From there, you can get information such as the applied predicates and column stats for the selected columns.
>>>
>>> Thanks. I did not know about this.
>>>
>>>> Did you notice any compactions in Kudu between when you issued the two queries? What is your ingest pattern, are you inserting data in random primary key order?
>>>
>>> The table has hash partitioning on an ID column that can have 15 different values and range partitioning on datetime, split monthly. Both ID and datetime are my primary keys. The data we ingest is in increasing order of time (usually), but the order of IDs is random.
>>>
>>> However, ingestion into Kudu was stopped while I was performing these queries. I did not notice any compaction either.
>>>
>>> On Wed, Feb 13, 2019 at 2:15 AM Hao Hao <[email protected]> wrote:
>>>
>>>> Hi Faraz,
>>>>
>>>> Answered inline below.
>>>>
>>>> Best,
>>>> Hao
>>>>
>>>> On Tue, Feb 12, 2019 at 6:59 AM Faraz Mateen <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am using Spark to pull data from my single-node testing Kudu setup and publish it to Kafka. However, my query time is not consistent.
>>>>>
>>>>> I am querying a table with around *1.1 million* packets.
>>>>> Initially my query was taking *537 seconds to read 51042 records* from Kudu and write them to Kafka. This rate was much lower than what I had expected. I had around 45 tables with little data in them that was not needed anymore. I deleted all those tables, restarted the Spark session, and attempted the same query. Now the query completed in *5.3 seconds*.
>>>>>
>>>>> I increased the number of rows to be fetched and tried the same query. The row count was *118741* but it took *1861 seconds* to complete. During the query, resource utilization of my servers was very low. When I attempted the same query again after a couple of hours, it took only *16 secs*.
>>>>>
>>>>> After this I kept increasing the number of rows to be fetched, and the time kept increasing in a linear fashion.
>>>>>
>>>>> What I want to ask is:
>>>>>
>>>>>    - How can I debug why the time for these queries is varying so much? I am not able to get anything out of the Kudu logs.
>>>>>
>>>> You can use the tablet server web UI scans dashboard (/scans) to get a better understanding of ongoing/past queries. The flag 'scan_history_count' is used to configure the size of the buffer. From there, you can get information such as the applied predicates and column stats for the selected columns.
>>>>
>>>>>    - I am running Kudu with default configurations. Are there any tweaks I should perform to boost the performance of my setup?
>>>>>
>>>> Did you notice any compactions in Kudu between when you issued the two queries? What is your ingest pattern, are you inserting data in random primary key order?
>>>>
>>>>>    - Does having a lot of tables cause performance issues?
>>>>>
>>>> If you are not hitting any resource limitation due to writes/scans to the other tables, they shouldn't affect the performance of your queries. Just FYI, this is the scaling guide <https://kudu.apache.org/docs/scaling_guide.html> with respect to various system resources.
>>>>
>>>>>    - Will having more masters and tservers improve my query time?
>>>>>
>>>> The master is not likely to be the bottleneck, as clients communicate directly with the tservers for queries once they know which tserver to talk to. But separating the master and tserver so they are not on the same node might help. This is the scale limitations <https://kudu.apache.org/docs/known_issues.html#_scale> guide for a rough estimate of the number of tservers required for a given quantity of data.
>>>>
>>>>> *Environment Details:*
>>>>>
>>>>>    - Single-node Kudu 1.7 master and tserver. The server has 4 vCPUs and 16 GB RAM.
>>>>>    - The table I am querying is hash partitioned on the basis of an ID with 3 buckets. It is also range partitioned on the basis of datetime, with a new partition for each month.
>>>>>    - Kafka version 1.1.
>>>>>    - Standalone Spark 2.3.0 deployed on a server with 2 vCPUs and 4 GB RAM.
>>>>>
>>>>> --
>>>>> Faraz Mateen
>>>>>
>>>
>>> --
>>> Faraz Mateen
>>>
>>
>> --
>> Faraz Mateen
>>

--
Faraz Mateen
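For reference, a table with the layout described in this thread (primary key (ID, datetime), hash partitioning on ID with 3 buckets, monthly range partitions on datetime) could be created through the Kudu Java client roughly as in the sketch below. This is only an illustration: the column names, types, the extra value column, and the example month bounds are assumptions, not taken from the actual schema.

    import java.util.Arrays
    import org.apache.kudu.{ColumnSchema, Schema, Type}
    import org.apache.kudu.client.{CreateTableOptions, KuduClient}

    object CreateTableSketch extends App {
      val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()

      // Key order is (id, datetime): id is the leading key column.
      val schema = new Schema(Arrays.asList(
        new ColumnSchema.ColumnSchemaBuilder("id", Type.INT32).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("datetime", Type.UNIXTIME_MICROS).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("payload", Type.STRING).build()))

      val opts = new CreateTableOptions()
        .addHashPartitions(Arrays.asList("id"), 3)            // 3 hash buckets on id
        .setRangePartitionColumns(Arrays.asList("datetime"))  // range partitions on datetime

      // One range partition per month; January 2019 shown as an example.
      val lower = schema.newPartialRow()
      lower.addLong("datetime", 1546300800000000L)  // 2019-01-01 00:00:00 UTC in micros
      val upper = schema.newPartialRow()
      upper.addLong("datetime", 1548979200000000L)  // 2019-02-01 00:00:00 UTC in micros
      opts.addRangePartition(lower, upper)

      client.createTable("packets", schema, opts)
      client.shutdown()
    }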

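And the Spark job described in the original message (read a slice of the table from Kudu, publish it to Kafka) corresponds roughly to the sketch below. Again, this is only an assumed shape: the table name, topic, broker address, and filter values are made up, and it presumes the kudu-spark and spark-sql-kafka-0-10 packages are on the classpath.

    import java.sql.Timestamp
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, struct, to_json}

    object KuduToKafkaSketch extends App {
      val spark = SparkSession.builder().appName("kudu-to-kafka").getOrCreate()

      // Read the Kudu table; comparison filters are pushed down to Kudu as scan predicates where possible.
      val df = spark.read
        .format("org.apache.kudu.spark.kudu")
        .option("kudu.master", "kudu-master:7051")
        .option("kudu.table", "packets")
        .load()
        .filter(col("id") === 7 &&
                col("datetime") >= Timestamp.valueOf("2019-01-01 00:00:00") &&
                col("datetime") <  Timestamp.valueOf("2019-02-01 00:00:00"))

      // Publish each row as a JSON message to Kafka (batch write, Spark 2.3).
      df.select(to_json(struct(df.columns.map(col): _*)).alias("value"))
        .write
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", "packets-out")
        .save()

      spark.stop()
    }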