Hi Boris,

How exactly did HDFS and ZK go down? A Kudu restart is fairly IO-intensive, but I don't know how that could cause things like DataNodes to fail.
J-D

On Sat, Dec 16, 2017 at 11:45 AM, Boris Tyukin <[email protected]> wrote:

> Well, our admin had a fun two days - it was the first time we restarted
> Kudu on our DEV cluster and it did not go well. He is still
> troubleshooting what happened, but after the Kudu restart, ZooKeeper and
> HDFS went down after 3-4 minutes. If we disable Kudu, all is well. No
> errors in the Kudu logs... I will have more details next week, so I'm not
> asking for help yet as I do not know all the details. What is obvious,
> though, is that it has something to do with Kudu :)
>
> On Thu, Dec 14, 2017 at 9:40 AM, Boris Tyukin <[email protected]> wrote:
>
>> Thanks for your suggestions, J-D, I am sure you are right more often
>> than that! :))
>>
>> I will report back with our results. So far I am really impressed with
>> Kudu - we have been benchmarking ingest and egress throughput and the
>> runtime of our typical queries. The biggest pain so far is the lack of
>> support for decimals.
>>
>> On Wed, Dec 13, 2017 at 5:07 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>
>>> On Wed, Dec 13, 2017 at 11:30 AM, Boris Tyukin <[email protected]> wrote:
>>>
>>>> Thanks, J-D! We are going to try that and see how it impacts the
>>>> runtime.
>>>>
>>>> Is there any way to load this metadata upfront? A lot of our queries
>>>> are ad hoc in nature, but they will be hitting the same tables with
>>>> different predicates and join patterns.
>>>
>>> You could use Impala to compute the stats of all the tables after each
>>> Kudu restart. Actually, do try that: restart Kudu, then compute stats
>>> and see how fast it scans.
>>>
>>>> I am curious why this metadata does not survive restarts, though. We
>>>> are going to run our benchmarks again and this time restart Kudu and
>>>> Impala.
>>>
>>> It's in the tserver memory; it can't survive a restart.
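For anyone wanting to script J-D's compute-stats suggestion, a minimal sketch is below. The database and table names are placeholders (not from this thread); a real version would discover the tables with SHOW TABLES instead of hard-coding them:

```shell
#!/bin/sh
# Sketch: regenerate Impala stats after a Kudu restart.
# The database and table names here are made-up placeholders.
DB="benchmark_db"
TABLES="big_fact_table small_dim_1 small_dim_2"

# Emit one COMPUTE STATS statement per table.
for t in $TABLES; do
  printf 'COMPUTE STATS %s.%s;\n' "$DB" "$t"
done
```

Feeding the generated statements to impala-shell (for example via its -q option) right after each restart would rebuild the stats before the ad hoc queries arrive.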
>>>
>>>> I just ran another query for the first time, one which hits 2 large
>>>> tables. These tables had already been scanned by the previous query,
>>>> and this time I do not see any difference in query time between the
>>>> first and second run - I guess this confirms your statement about
>>>> "first time ever scanning the table since a Kudu restart" and
>>>> collecting metadata.
>>>
>>> Maybe, I've been known to be right once or twice a year :)
>>>
>>>> On Wed, Dec 13, 2017 at 11:18 AM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>
>>>>> Hi Boris,
>>>>>
>>>>> Given that we don't have much data we can use here, I'll have to
>>>>> extrapolate. As an aside, though, this is yet another example where
>>>>> we need more Kudu-side metrics in the query profile.
>>>>>
>>>>> So, Kudu lazily loads a bunch of metadata, and that can really affect
>>>>> scan times. If this was your first time ever scanning the table since
>>>>> a Kudu restart, it's very possible that that's where that time was
>>>>> spent. There's also the page cache in the OS that might now be
>>>>> populated. You could do something like "sync; echo 3 >
>>>>> /proc/sys/vm/drop_caches" on all the machines and run the query 2
>>>>> times again, without restarting Kudu, to understand the effect of the
>>>>> page cache itself. There's currently no way to purge the cached
>>>>> metadata in Kudu, though.
>>>>>
>>>>> Hope this helps a bit,
>>>>>
>>>>> J-D
>>>>>
>>>>> On Wed, Dec 13, 2017 at 8:07 AM, Boris Tyukin <[email protected]> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I am doing some benchmarks with Kudu and Impala/Parquet and hope to
>>>>>> share them soon, but there is one thing that bugs me. This is
>>>>>> perhaps an Impala question, but since I am using Kudu with Impala I
>>>>>> am going to try and ask anyway.
>>>>>>
>>>>>> One of my queries takes 120 seconds to run the very first time.
>>>>>> It joins one large 5B-row table with a bunch of smaller tables and
>>>>>> then stores the result in Impala/Parquet (not Kudu).
>>>>>>
>>>>>> Now if I run it a second and third time, it only takes 60 seconds.
>>>>>> Can someone explain why? Are there any settings to decrease this
>>>>>> gap?
>>>>>>
>>>>>> I've compared the query profiles in CM and the only thing that was
>>>>>> very different is the scan against the Kudu table (the large one):
>>>>>>
>>>>>> ***************************
>>>>>> first time:
>>>>>> ***************************
>>>>>> KUDU_SCAN_NODE (id=0) (47.68s)
>>>>>>
>>>>>> - BytesRead: 0 B
>>>>>> - InactiveTotalTime: 0ns
>>>>>> - KuduRemoteScanTokens: 0
>>>>>> - NumScannerThreadsStarted: 20
>>>>>> - PeakMemoryUsage: 35.8 MiB
>>>>>> - RowsRead: 693,502,241
>>>>>> - RowsReturned: 693,502,241
>>>>>> - RowsReturnedRate: 14,643,448 per second
>>>>>> - ScanRangesComplete: 20
>>>>>> - ScannerThreadsInvoluntaryContextSwitches: 1,341
>>>>>> - ScannerThreadsTotalWallClockTime: 36.2m
>>>>>> - MaterializeTupleTime(*): 47.57s
>>>>>> - ScannerThreadsSysTime: 31.42s
>>>>>> - ScannerThreadsUserTime: 1.7m
>>>>>> - ScannerThreadsVoluntaryContextSwitches: 96,855
>>>>>> - TotalKuduScanRoundTrips: 52,308
>>>>>> - TotalReadThroughput: 0 B/s
>>>>>> - TotalTime: 47.68s
>>>>>>
>>>>>> ***************************
>>>>>> second time:
>>>>>> ***************************
>>>>>> KUDU_SCAN_NODE (id=0) (4.28s)
>>>>>>
>>>>>> - BytesRead: 0 B
>>>>>> - InactiveTotalTime: 0ns
>>>>>> - KuduRemoteScanTokens: 0
>>>>>> - NumScannerThreadsStarted: 20
>>>>>> - PeakMemoryUsage: 37.9 MiB
>>>>>> - RowsRead: 693,502,241
>>>>>> - RowsReturned: 693,502,241
>>>>>> - RowsReturnedRate: 173,481,534 per second
>>>>>> - ScanRangesComplete: 20
>>>>>> - ScannerThreadsInvoluntaryContextSwitches: 1,451
>>>>>> - ScannerThreadsTotalWallClockTime: 19.5m
>>>>>> - MaterializeTupleTime(*): 4.20s
>>>>>> - ScannerThreadsSysTime: 38.22s
>>>>>> - ScannerThreadsUserTime: 1.7m
>>>>>> - ScannerThreadsVoluntaryContextSwitches: 480,870
>>>>>> - TotalKuduScanRoundTrips: 52,142
>>>>>> - TotalReadThroughput: 0 B/s
>>>>>> - TotalTime: 4.28s
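The gap in the two profiles above sits almost entirely in MaterializeTupleTime (47.57s cold vs 4.20s warm). J-D's page-cache experiment could be scripted along these lines; this is a dry-run sketch only - the hostnames are placeholders, the drop_caches write needs root, so the script just prints the commands it would run:

```shell
#!/bin/sh
# Dry-run sketch of the page-cache experiment from the thread:
# drop the OS page cache on every node, then re-run the query twice
# WITHOUT restarting Kudu. Hostnames below are placeholders.
HOSTS="node1 node2 node3"
DROP='sync; echo 3 > /proc/sys/vm/drop_caches'

# Print (rather than execute) the per-node command; run the printed
# lines as root to actually drop the caches.
for h in $HOSTS; do
  printf 'ssh root@%s "%s"\n' "$h" "$DROP"
done

# Rough gap implied by the two scan-node profiles (47.68s vs 4.28s):
awk 'BEGIN { printf "scan node speedup: %.1fx\n", 47.68/4.28 }'
```

If the cold run is still slow after dropping caches (with Kudu left running), the remaining time would point at the page cache rather than Kudu's lazily loaded metadata.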
