One other idea: if you can share a *redacted* WAL dump, that might also
help us understand the issue. For example:
$ kudu wal dump /data/1/kudu/wals/<tablet-id>/wal-000000001 --redact=all \
    | gzip -c > /tmp/wal.txt.gz
The "--redact" flag will ensure that no cell data is present in the WAL.
For example, from one of our test clusters:
op 33: INSERT (int64 ps_partkey=<redacted>, int64 ps_suppkey=<redacted>,
       int64 ps_availqty=<redacted>, double ps_supplycost=<redacted>,
       string ps_comment=<redacted>)
You can view the resulting wal.txt.gz file before sending it to confirm
that nothing sensitive is included.
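
For instance, a quick way to eyeball the dump before sending it (this
assumes standard gzip tools are available on the box):

$ zcat /tmp/wal.txt.gz | less                   # page through the dump
$ zcat /tmp/wal.txt.gz | grep -c '<redacted>'   # count lines with redacted cells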
Thanks
-Todd
On Mon, Apr 24, 2017 at 10:39 AM, David Alves <[email protected]> wrote:
> Hi Jason
>
> No problem. Sorry if I misunderstood your previous email.
> If you could share the log files themselves, that would be great; if
> not, that's OK too.
> You could use the kudu tool to delete the local replica for that tablet
> (with the tserver daemon stopped), but it's likely that the replica has
> been gone a while and has been kicked out of most, if not all, consensus
> configs. At that point, if all your data is available elsewhere, you
> could just delete the data and re-add the server to the cluster.
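>
> For example, with the tserver stopped, something along these lines (a
> sketch only; the exact flags depend on your Kudu version, and the
> directory values here are placeholders for your actual settings):
>
> $ kudu local_replica delete 30aaccdf7c8c496a8ad73255856a1724 \
>     --fs_wal_dir=/data/1/kudu --fs_data_dirs=/data/1/kudu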
>
> Best
> David
>
>
> On Mon, Apr 24, 2017 at 4:33 AM, Jason Heo <[email protected]>
> wrote:
>
>> Hi David.
>>
>> Thank you for your kind reply.
>>
>> I understand, but I'm afraid I can't provide my WAL, even via your
>> private email, because it contains sensitive data.
>>
>> Regards,
>>
>> Jason
>>
>> 2017-04-24 15:12 GMT+09:00 David Alves <[email protected]>:
>>
>>> Hi Jason
>>>
>>> I meant the last WAL segment for tablet 30aaccdf7c8c496a8ad73255856a1724
>>> on the dead server (if you don't have sensitive data in there).
>>> I'm not sure whether you specified the "--fs_wal_dir" flag. If so, the
>>> WALs should be in that directory; if not, they are in the same
>>> directory as the value set for "--fs_data_dirs".
>>> A WAL segment file has a name like "wal-000000001".
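>>>
>>> For example, assuming a WAL directory of /data/1/kudu (a placeholder;
>>> substitute your actual value), you could list that tablet's segments
>>> with:
>>>
>>> $ ls /data/1/kudu/wals/30aaccdf7c8c496a8ad73255856a1724/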
>>>
>>> Best
>>> David
>>>
>>>
>>> On Sat, Apr 22, 2017 at 7:46 PM, Jason Heo <[email protected]>
>>> wrote:
>>>
>>>> Hi David.
>>>>
>>>> Sorry for the insufficient information.
>>>>
>>>> There are 14 nodes in my test Kudu cluster. Only one tserver died,
>>>> and it has only the two log entries shown above.
>>>>
>>>> The other 13 nodes hit the "Error trying to read ahead of the log
>>>> while preparing peer request: Incomplete: Op with" error 7-10 times
>>>> each.
>>>>
>>>> >> *Would it be possible to also get the WAL with the corrupted entry?*
>>>>
>>>> Could you explain in more detail how to get it?
>>>>
>>>> I tried repeating what I did, hoping to reproduce the same error, but
>>>> it didn't happen again.
>>>>
>>>> Please feel free to ask me for anything you need to resolve this.
>>>>
>>>> Regards,
>>>>
>>>> Jason
>>>>
>>>> 2017-04-23 1:56 GMT+09:00 <[email protected]>:
>>>>
>>>>> Hi Jason
>>>>>
>>>>> Anything else of interest in those logs? Can you share them (with
>>>>> just me, if you prefer)? Would it be possible to also get the WAL with
>>>>> the corrupted entry?
>>>>> Did this happen on a single server?
>>>>>
>>>>> Best
>>>>> David
>>>>>
>>>>
>>>>
>>>
>>
>
--
Todd Lipcon
Software Engineer, Cloudera