Thanks J-D! We are going to try that and see how it impacts the runtime. Is there any way to load this metadata upfront? A lot of our queries are ad hoc in nature, but they will be hitting the same tables with different predicates and join patterns.

I am curious why this metadata does not survive restarts though. We are going to run our benchmarks again and this time restart Kudu and Impala. I just ran another query for the first time. It hits 2 large tables that were already scanned by the previous query, and this time I do not see any difference in query time between the first and second run. I guess this confirms your statement about "first time ever scanning the table since a Kudu restart" and collecting metadata.

On Wed, Dec 13, 2017 at 11:18 AM, Jean-Daniel Cryans <[email protected]> wrote:

> Hi Boris,
>
> Given that we don't have much data we can use here, I'll have to
> extrapolate. As an aside though, this is yet another example where we need
> more Kudu-side metrics in the query profile.
>
> So, Kudu lazily loads a bunch of metadata and that can really affect scan
> times. If this was your first time ever scanning the table since a Kudu
> restart, it's very possible that that's where that time was spent. There's
> also the page cache in the OS that might now be populated. You could do
> something like "sync; echo 3 > /proc/sys/vm/drop_caches" on all the
> machines and run the query 2 times again, without restarting Kudu, to
> understand the effect of the page cache itself. There's currently no way
> to purge the cached metadata in Kudu though.
>
> Hope this helps a bit,
>
> J-D
>
> On Wed, Dec 13, 2017 at 8:07 AM, Boris Tyukin <[email protected]>
> wrote:
>
>> Hi guys,
>>
>> I am doing some benchmarks with Kudu and Impala/Parquet and hope to share
>> it soon but there is one thing that bugs me. This is perhaps Impala
>> question but since I am using Kudu with Impala I am going to try and ask
>> anyway.
>>
>> One of my queries takes 120 seconds to run the very first time. It joins
>> one large 5B row table with a bunch of smaller tables and then stores
>> result in Impala/parquet (not Kudu).
>>
>> Now if I run it second and third time, it only takes 60 seconds. Can
>> someone explain why?
>> Is there any settings to decrease this gap?
>>
>> I've compared query profiles in CM and the only thing that was very
>> different is scan against Kudu table (the large one):
>>
>> ***************************
>> first time:
>> ***************************
>> KUDU_SCAN_NODE (id=0) (47.68s)
>> <https://lkmaorabd103.multihosp.net:7183/cmf/impala/queryDetails?queryId=5143f7165be82819%3Ae00a103500000000&serviceName=impala#>
>>
>> - BytesRead: *0 B*
>> - InactiveTotalTime: *0ns*
>> - KuduRemoteScanTokens: *0*
>> - NumScannerThreadsStarted: *20*
>> - PeakMemoryUsage: *35.8 MiB*
>> - RowsRead: *693,502,241*
>> - RowsReturned: *693,502,241*
>> - RowsReturnedRate: *14643448 per second*
>> - ScanRangesComplete: *20*
>> - ScannerThreadsInvoluntaryContextSwitches: *1,341*
>> - ScannerThreadsTotalWallClockTime: *36.2m*
>> - MaterializeTupleTime(*): *47.57s*
>> - ScannerThreadsSysTime: *31.42s*
>> - ScannerThreadsUserTime: *1.7m*
>> - ScannerThreadsVoluntaryContextSwitches: *96,855*
>> - TotalKuduScanRoundTrips: *52,308*
>> - TotalReadThroughput: *0 B/s*
>> - TotalTime: *47.68s*
>>
>> ***************************
>> second time:
>> ***************************
>> KUDU_SCAN_NODE (id=0) (4.28s)
>> <https://lkmaorabd103.multihosp.net:7183/cmf/impala/queryDetails?queryId=53497a308f860837%3A243772e000000000&serviceName=impala#>
>>
>> - BytesRead: *0 B*
>> - InactiveTotalTime: *0ns*
>> - KuduRemoteScanTokens: *0*
>> - NumScannerThreadsStarted: *20*
>> - PeakMemoryUsage: *37.9 MiB*
>> - RowsRead: *693,502,241*
>> - RowsReturned: *693,502,241*
>> - RowsReturnedRate: *173481534 per second*
>> - ScanRangesComplete: *20*
>> - ScannerThreadsInvoluntaryContextSwitches: *1,451*
>> - ScannerThreadsTotalWallClockTime: *19.5m*
>> - MaterializeTupleTime(*): *4.20s*
>> - ScannerThreadsSysTime: *38.22s*
>> - ScannerThreadsUserTime: *1.7m*
>> - ScannerThreadsVoluntaryContextSwitches: *480,870*
>> - TotalKuduScanRoundTrips: *52,142*
>> - TotalReadThroughput: *0 B/s*
>> - TotalTime: *4.28s*
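
P.S. For anyone who wants to compare the two KUDU_SCAN_NODE profiles quoted above side by side, here is a minimal Python sketch. The counter values are copied from the profiles in this thread. The helper names (`summarize`, `speedup`) and the "average wall-clock time per round trip" figure are illustrative assumptions of mine, not metrics that Impala or Kudu report, and dividing total scanner thread wall-clock time by the round-trip count is only a rough estimate.

```python
# Counters copied from the two KUDU_SCAN_NODE profiles quoted in this thread.
ROWS_RETURNED = 693_502_241

first_run = {
    "total_time_s": 47.68,            # TotalTime
    "materialize_tuple_s": 47.57,     # MaterializeTupleTime(*)
    "scanner_wall_clock_min": 36.2,   # ScannerThreadsTotalWallClockTime
    "round_trips": 52_308,            # TotalKuduScanRoundTrips
}
second_run = {
    "total_time_s": 4.28,
    "materialize_tuple_s": 4.20,
    "scanner_wall_clock_min": 19.5,
    "round_trips": 52_142,
}


def summarize(run, rows=ROWS_RETURNED):
    """Derive a few rough ratios from one profile (hypothetical helper)."""
    wall_clock_s = run["scanner_wall_clock_min"] * 60
    return {
        # Rows per second of scan-node time.
        "rows_per_s": rows / run["total_time_s"],
        # Fraction of the node's time attributed to MaterializeTupleTime.
        "materialize_share": run["materialize_tuple_s"] / run["total_time_s"],
        # Rough average scanner-thread wall-clock time per Kudu round trip.
        "ms_per_round_trip": wall_clock_s * 1000 / run["round_trips"],
    }


def speedup(slow, fast):
    """How many times faster the scan node finished in the second profile."""
    return slow["total_time_s"] / fast["total_time_s"]


if __name__ == "__main__":
    for name, run in (("first", first_run), ("second", second_run)):
        stats = summarize(run)
        print(
            f"{name:>6}: {stats['rows_per_s']:,.0f} rows/s, "
            f"materialize share {stats['materialize_share']:.1%}, "
            f"~{stats['ms_per_round_trip']:.1f} ms per round trip"
        )
    print(f"scan node speedup: {speedup(first_run, second_run):.1f}x")
```

The number of round trips is almost identical in both runs, so the difference shows up as time per round trip rather than as extra work, which fits the lazy metadata loading and page cache explanation given earlier in the thread.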
