Re: INFO: rcu detected stall in ndisc_alloc_skb

2019-01-20 Thread Tetsuo Handa
On 2019/01/20 22:30, Dmitry Vyukov wrote:
>> The first messages I want to look at is kernel output. Then, I look at
>> syz-program lines as needed. But current "a self-contained file" is
>> hard to find kernel output.
> 
> I think everybody looks at kernel crash first, that's why we provide
> kernel crash inline in the email so it's super easy to find. One does
> not need to look at console output at all to read the crash message.

I don't think so. Sometimes a backtrace from memory allocation fault injection
prior to the crash tells everything. But since such lines are not immediately
findable in a file containing console output, people fail to understand what
has happened.

And one of my two suggestions is about helping people easily find kernel
messages in the console output, by moving the syzbot-program lines into a
dedicated location.

> Console output is meant for more complex cases when a developer needs
> to extract some long tail of custom information.

This "INFO: rcu detected stall in ndisc_alloc_skb" is exactly a case where only
syzbot-program lines can provide some clue. And the other (of my two 
suggestions)
is about preserving all syzbot-program lines in a file containing console 
output.

>  We don't know exactly what information a developer is looking for, and it is
> different in each case, so it's not possible to optimize for this.

I'm not asking to optimize. I'm asking to preserve all syzbot-program lines.

>   We preserve
> console output intact to not destroy some potentially important
> information. Say, if we start reordering messages, we lose timing
> information and timing/interleaving information is important in some
> cases.

My suggestion is not a reordering of messages; it is cross-referencing.
The [$(uptime)] part acts as the timing information. Since inlining the
syzbot-program lines there makes it difficult to find the previous/next kernel
messages, I'm suggesting moving the syzbot-program lines into a dedicated block
and cross-referencing them using identifiers such as a hash. There is no loss of
timing information, and we can reconstruct the interleaved output (if needed) as
long as the identifiers are unique within that report.
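
As a rough illustration of that cross-referencing (a sketch only, not actual
syzkaller code; emitMarker/dumpCodeSection are made-up helper names), the fuzzer
side would only emit a one-line identifier into the kernel log stream and keep
the program bodies for a dedicated section:

  // Sketch only: emit a one-line marker per program and keep the program text
  // in a table so it can be dumped later as the dedicated "syzbot code section".
  package main

  import (
          "crypto/sha256"
          "encoding/hex"
          "fmt"
          "io"
          "os"
  )

  // programIndex records program text keyed by its identifier.
  var programIndex = map[string]string{}

  // emitMarker writes the one-line cross-reference marker (in the fuzzer this
  // writer would be /dev/kmsg) and remembers the full program body.
  func emitMarker(kmsg io.Writer, progText string) string {
          sum := sha256.Sum256([]byte(progText))
          id := hex.EncodeToString(sum[:16]) // 32 hex chars, as in the example above
          programIndex[id] = progText
          fmt.Fprintf(kmsg, "executing program #%s\n", id)
          return id
  }

  // dumpCodeSection prints the dedicated block that readers cross-reference
  // by identifier instead of scrolling past inlined program lines.
  func dumpCodeSection(w io.Writer) {
          fmt.Fprintln(w, "# syzbot code section start")
          for id, prog := range programIndex {
                  fmt.Fprintf(w, "Program for #%s\n%s\n", id, prog)
          }
          fmt.Fprintln(w, "# syzbot code section end")
  }

  func main() {
          // stdout keeps the demo runnable without privileges.
          emitMarker(os.Stdout, "r0 = syz_open_dev$sg(&(0x7f40)='/dev/sg#\\x00', 0x0, 0x2)")
          dumpCodeSection(os.Stdout)
  }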

> 
>> Even if we keep both kernel output and
>> syz-program lines in a single file, we can improve readability by
>> splitting into kernel output section and syz-program section.
>>
>>   # Kernel output section start
>>   [$(uptime)][$(caller_info)] executing program #0123456789abcdef0123456789abcdef
>>   [$(uptime)][$(caller_info)] $(kernel_messages_caused_by_0123456789abcdef0123456789abcdef_are_here)
>>   [$(uptime)][$(caller_info)] executing program #456789abcdef0123456789abcdef0123
>>   [$(uptime)][$(caller_info)] $(kernel_messages_caused_by_456789abcdef0123456789abcdef0123_and_0123456789abcdef0123456789abcdef_are_here)
>>   [$(uptime)][$(caller_info)] executing program #89abcdef0123456789abcdef01234567
>>   [$(uptime)][$(caller_info)] $(kernel_messages_caused_by_89abcdef0123456789abcdef01234567_456789abcdef0123456789abcdef0123_and_0123456789abcdef0123456789abcdef_are_here)
>>   [$(uptime)][$(caller_info)] BUG: unable to handle kernel paging request at $(address)
>>   [$(uptime)][$(caller_info)] CPU: $(cpu) PID: $(pid) Comm: syz#89abcdef0123 Not tainted $(version) #$(build)
>>   [$(uptime)][$(caller_info)] $(backtrace_of_caller_info_is_here)
>>   [$(uptime)][$(caller_info)] Kernel panic - not syncing: Fatal exception
>>   # Kernel output section end
>>   # syzbot code section start
>>   Program for #0123456789abcdef0123456789abcdef
>>   $(program_lines_for_0123456789abcdef0123456789abcdef_is_here)
>>   Program for #456789abcdef0123456789abcdef0123
>>   $(program_lines_for_456789abcdef0123456789abcdef0123_is_here)
>>   Program for #89abcdef0123456789abcdef01234567
>>   $(program_lines_for_89abcdef0123456789abcdef01234567_is_here)
>>   # syzbot code section end
>>
> 

Current output:
[  938.184721][T10912] F2FS-fs (loop0): Magic Mismatch, valid(0xf2f52010) - read(0x0)
[  938.193080][T10912] F2FS-fs (loop0): Can't find valid F2FS filesystem in 2th superblock
[  938.202030][T10912] F2FS-fs (loop0): Magic Mismatch, valid(0xf2f52010) - read(0x0)
[  938.210375][T10912] F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
22:37:55 executing program 4:
r0 = syz_open_dev$sg(&(0x7f40)='/dev/sg#\x00', 0x0, 0x2)
write$binfmt_elf64(r0, &(0x7f000340)=ANY=[@ANYBLOB="7f454c460400d40048001cca00e4"], 0x2e)

Re: INFO: rcu detected stall in ndisc_alloc_skb

2019-01-20 Thread Dmitry Vyukov
On Sat, Jan 19, 2019 at 2:10 PM Tetsuo Handa wrote:
>
> On 2019/01/19 21:16, Dmitry Vyukov wrote:
> >> The question for me is, whether sysbot can detect hash collision with 
> >> different
> >> syz-program lines before writing the hash value to /dev/kmsg, and retry by 
> >> modifying
> >> syz-program lines in order to get a new hash value until collision is 
> >> avoided.
> >> If it is difficult, simpler choice like current Unix time and PID could be 
> >> used
> >> instead...
> >
> > Hummm, say, if you run syz-manager locally and report a bug, where
> > will the webserver and database that allows to download all satellite
> > info work? How long you need to keep this info and provide the web
> > service? You will also need to pay and maintain the server for... how
> > long? I don't see how this can work and how we can ask people to do
> > this. This frankly looks like overly complex solution to a problem
> > were simpler solutions will work. Keeping all info in a self-contained
> > file looks like the only option to make it work reliably.
> > It's also not possible to attribute kernel output to individual programs.
>
> The first messages I want to look at is kernel output. Then, I look at
> syz-program lines as needed. But current "a self-contained file" is
> hard to find kernel output.

I think everybody looks at the kernel crash first; that's why we provide the
kernel crash inline in the email, so it's super easy to find. One does not
need to look at the console output at all to read the crash message.
Console output is meant for more complex cases where a developer needs
to extract some long tail of custom information. We don't know exactly what
information a developer is looking for, and it is different in each case, so
it's not possible to optimize for this. We preserve the console output intact
so as not to destroy potentially important information. Say, if we start
reordering messages, we lose timing information, and timing/interleaving
information is important in some cases.

> Even if we keep both kernel output and
> syz-program lines in a single file, we can improve readability by
> splitting into kernel output section and syz-program section.
>
>   # Kernel output section start
>   [$(uptime)][$(caller_info)] executing program 
> #0123456789abcdef0123456789abcdef
>   [$(uptime)][$(caller_info)] 
> $(kernel_messages_caused_by_0123456789abcdef0123456789abcdef_are_here)
>   [$(uptime)][$(caller_info)] executing program 
> #456789abcdef0123456789abcdef0123
>   [$(uptime)][$(caller_info)] 
> $(kernel_messages_caused_by_456789abcdef0123456789abcdef0123_and_0123456789abcdef0123456789abcdef_are_here)
>   [$(uptime)][$(caller_info)] executing program 
> #89abcdef0123456789abcdef01234567
>   [$(uptime)][$(caller_info)] 
> $(kernel_messages_caused_by_89abcdef0123456789abcdef01234567_456789abcdef0123456789abcdef0123_and_0123456789abcdef0123456789abcdef_are_here)
>   [$(uptime)][$(caller_info)] BUG: unable to handle kernel paging request at 
> $(address)
>   [$(uptime)][$(caller_info)] CPU: $(cpu) PID: $(pid) Comm: syz#89abcdef0123 
> Not tainted $(version) #$(build)
>   [$(uptime)][$(caller_info)] $(backtrace_of_caller_info_is_here)
>   [$(uptime)][$(caller_info)] Kernel panic - not syncing: Fatal exception
>   # Kernel output section end
>   # syzbot code section start
>   Program for #0123456789abcdef0123456789abcdef
>   $(program_lines_for_0123456789abcdef0123456789abcdef_is_here)
>   Program for #456789abcdef0123456789abcdef0123
>   $(program_lines_for_456789abcdef0123456789abcdef0123_is_here)
>   Program for #89abcdef0123456789abcdef01234567
>   $(program_lines_for_89abcdef0123456789abcdef01234567_is_here)
>   # syzbot code section end
>


Re: INFO: rcu detected stall in ndisc_alloc_skb

2019-01-19 Thread Tetsuo Handa
On 2019/01/19 21:16, Dmitry Vyukov wrote:
>> The question for me is, whether sysbot can detect hash collision with 
>> different
>> syz-program lines before writing the hash value to /dev/kmsg, and retry by 
>> modifying
>> syz-program lines in order to get a new hash value until collision is 
>> avoided.
>> If it is difficult, simpler choice like current Unix time and PID could be 
>> used
>> instead...
> 
> Hummm, say, if you run syz-manager locally and report a bug, where
> will the webserver and database that allows to download all satellite
> info work? How long you need to keep this info and provide the web
> service? You will also need to pay and maintain the server for... how
> long? I don't see how this can work and how we can ask people to do
> this. This frankly looks like overly complex solution to a problem
> were simpler solutions will work. Keeping all info in a self-contained
> file looks like the only option to make it work reliably.
> It's also not possible to attribute kernel output to individual programs.

The first thing I want to look at is the kernel output. Then, I look at the
syz-program lines as needed. But in the current "self-contained file", the
kernel output is hard to find. Even if we keep both the kernel output and the
syz-program lines in a single file, we can improve readability by splitting it
into a kernel output section and a syz-program section.

  # Kernel output section start
  [$(uptime)][$(caller_info)] executing program #0123456789abcdef0123456789abcdef
  [$(uptime)][$(caller_info)] $(kernel_messages_caused_by_0123456789abcdef0123456789abcdef_are_here)
  [$(uptime)][$(caller_info)] executing program #456789abcdef0123456789abcdef0123
  [$(uptime)][$(caller_info)] $(kernel_messages_caused_by_456789abcdef0123456789abcdef0123_and_0123456789abcdef0123456789abcdef_are_here)
  [$(uptime)][$(caller_info)] executing program #89abcdef0123456789abcdef01234567
  [$(uptime)][$(caller_info)] $(kernel_messages_caused_by_89abcdef0123456789abcdef01234567_456789abcdef0123456789abcdef0123_and_0123456789abcdef0123456789abcdef_are_here)
  [$(uptime)][$(caller_info)] BUG: unable to handle kernel paging request at $(address)
  [$(uptime)][$(caller_info)] CPU: $(cpu) PID: $(pid) Comm: syz#89abcdef0123 Not tainted $(version) #$(build)
  [$(uptime)][$(caller_info)] $(backtrace_of_caller_info_is_here)
  [$(uptime)][$(caller_info)] Kernel panic - not syncing: Fatal exception
  # Kernel output section end
  # syzbot code section start
  Program for #0123456789abcdef0123456789abcdef
  $(program_lines_for_0123456789abcdef0123456789abcdef_is_here)
  Program for #456789abcdef0123456789abcdef0123
  $(program_lines_for_456789abcdef0123456789abcdef0123_is_here)
  Program for #89abcdef0123456789abcdef01234567
  $(program_lines_for_89abcdef0123456789abcdef01234567_is_here)
  # syzbot code section end



Re: INFO: rcu detected stall in ndisc_alloc_skb

2019-01-19 Thread Dmitry Vyukov
On Fri, Jan 18, 2019 at 6:20 AM Tetsuo Handa wrote:
>
> Dmitry Vyukov wrote:
> > On Sun, Jan 6, 2019 at 2:47 PM Tetsuo Handa
> >  wrote:
> > >
> > > On 2019/01/06 22:24, Dmitry Vyukov wrote:
> > > >> A report at 2019/01/05 10:08 from "no output from test machine (2)"
> > > >> ( https://syzkaller.appspot.com/text?tag=CrashLog=1700726f40 )
> > > >> says that there are flood of memory allocation failure messages.
> > > >> Since continuous memory allocation failure messages itself is not
> > > >> recognized as a crash, we might be misunderstanding that this problem
> > > >> is not occurring recently. It will be nice if we can run testcases
> > > >> which are executed on bpf-next tree.
> > > >
> > > > What exactly do you mean by running test cases on bpf-next tree?
> > > > syzbot tests bpf-next, so it executes lots of test cases on that tree.
> > > > One can also ask for patch testing on bpf-next tree to test a specific
> > > > test case.
> > >
> > > syzbot ran "some tests" before getting this report, but we can't find from
> > > this report what the "some tests" are. If we could record all tests 
> > > executed
> > > in syzbot environments before getting this report, we could rerun the 
> > > tests
> > > (with manually examining where the source of memory consumption is) in 
> > > local
> > > environments.
> >
> > Filed https://github.com/google/syzkaller/issues/917 for this.
>
> Thanks. Here is what I would suggest.
>
> Let syz-fuzzer write to /dev/kmsg . But don't directly write syz-program 
> lines.
> Instead, just write the hash value of syz-program lines, and allow downloading
> syz-program lines from external URL. Also, use the first 12 characters of the
> hash value as comm name executing that syz-program lines. An example of 
> console
> output would look something like below.
>
>
>   [$(uptime)][$(caller_info)] executing program 
> #0123456789abcdef0123456789abcdef
>   [$(uptime)][$(caller_info)] 
> $(kernel_messages_caused_by_0123456789abcdef0123456789abcdef_are_here)
>   [$(uptime)][$(caller_info)] executing program 
> #456789abcdef0123456789abcdef0123
>   [$(uptime)][$(caller_info)] 
> $(kernel_messages_caused_by_456789abcdef0123456789abcdef0123_and_0123456789abcdef0123456789abcdef_are_here)
>   [$(uptime)][$(caller_info)] executing program 
> #89abcdef0123456789abcdef01234567
>   [$(uptime)][$(caller_info)] 
> $(kernel_messages_caused_by_89abcdef0123456789abcdef01234567_456789abcdef0123456789abcdef0123_and_0123456789abcdef0123456789abcdef_are_here)
>   [$(uptime)][$(caller_info)] BUG: unable to handle kernel paging request at 
> $(address)
>   [$(uptime)][$(caller_info)] CPU: $(cpu) PID: $(pid) Comm: syz#89abcdef0123 
> Not tainted $(version) #$(build)
>   [$(uptime)][$(caller_info)] $(backtrace_of_caller_info_is_here)
>   [$(uptime)][$(caller_info)] Kernel panic - not syncing: Fatal exception
>
> Then, we can build CrashLog by picking up all "executing program #" lines and
> "latest lines up to available space" from console output like below.
>
>   [$(uptime)][$(caller_info)] executing program 
> #0123456789abcdef0123456789abcdef
>   [$(uptime)][$(caller_info)] executing program 
> #456789abcdef0123456789abcdef0123
>   [$(uptime)][$(caller_info)] executing program 
> #89abcdef0123456789abcdef01234567
>   [$(uptime)][$(caller_info)] 
> $(kernel_messages_caused_by_89abcdef0123456789abcdef01234567_456789abcdef0123456789abcdef0123_and_0123456789abcdef0123456789abcdef_are_here)
>   [$(uptime)][$(caller_info)] BUG: unable to handle kernel paging request at 
> $(address)
>   [$(uptime)][$(caller_info)] CPU: $(cpu) PID: $(pid) Comm: syz89abcdef0123 
> Not tainted $(version) #$(build)
>   [$(uptime)][$(caller_info)] $(backtrace_of_caller_info_is_here)
>   [$(uptime)][$(caller_info)] Kernel panic - not syncing: Fatal exception
>
> Then, we can understand that a crash happened when executing 89abcdef0123 and
> download 89abcdef0123456789abcdef01234567 for analysis. Also, we can download
> 0123456789abcdef0123456789abcdef and 456789abcdef0123456789abcdef0123 as 
> needed.
>
> Honestly, since lines which follows "$(date) executing program $(num):" line 
> can
> become so long, it is difficult to find where previous/next kernel messages 
> are.
> If only one-liner "executing program #" output is used, it is easy to find
> previous/next kernel messages.
>
> The program referenced by "executing program #" would be made downloadable via
> Web server or git repository. Maybe "executing program https://$server/$hash;
> for the former case. But repeating "https://$server/; part would be redundant.
>
> The question for me is, whether sysbot can detect hash collision with 
> different
> syz-program lines before writing the hash value to /dev/kmsg, and retry by 
> modifying
> syz-program lines in order to get a new hash value until collision is avoided.
> If it is difficult, simpler choice like current Unix time and PID could be 
> used
> instead...

Hummm, say, if you run syz-manager locally and report a bug, 

Re: INFO: rcu detected stall in ndisc_alloc_skb

2019-01-17 Thread Tetsuo Handa
Dmitry Vyukov wrote:
> On Sun, Jan 6, 2019 at 2:47 PM Tetsuo Handa
>  wrote:
> >
> > On 2019/01/06 22:24, Dmitry Vyukov wrote:
> > >> A report at 2019/01/05 10:08 from "no output from test machine (2)"
> > >> ( https://syzkaller.appspot.com/text?tag=CrashLog=1700726f40 )
> > >> says that there are flood of memory allocation failure messages.
> > >> Since continuous memory allocation failure messages itself is not
> > >> recognized as a crash, we might be misunderstanding that this problem
> > >> is not occurring recently. It will be nice if we can run testcases
> > >> which are executed on bpf-next tree.
> > >
> > > What exactly do you mean by running test cases on bpf-next tree?
> > > syzbot tests bpf-next, so it executes lots of test cases on that tree.
> > > One can also ask for patch testing on bpf-next tree to test a specific
> > > test case.
> >
> > syzbot ran "some tests" before getting this report, but we can't find from
> > this report what the "some tests" are. If we could record all tests executed
> > in syzbot environments before getting this report, we could rerun the tests
> > (with manually examining where the source of memory consumption is) in local
> > environments.
> 
> Filed https://github.com/google/syzkaller/issues/917 for this.

Thanks. Here is what I would suggest.

Let syz-fuzzer write to /dev/kmsg, but don't write the syz-program lines
directly. Instead, write only the hash value of the syz-program lines, and allow
the syz-program lines to be downloaded from an external URL. Also, use the first
12 characters of the hash value as the comm name of the process executing those
syz-program lines. An example of console output would look something like below.


  [$(uptime)][$(caller_info)] executing program #0123456789abcdef0123456789abcdef
  [$(uptime)][$(caller_info)] $(kernel_messages_caused_by_0123456789abcdef0123456789abcdef_are_here)
  [$(uptime)][$(caller_info)] executing program #456789abcdef0123456789abcdef0123
  [$(uptime)][$(caller_info)] $(kernel_messages_caused_by_456789abcdef0123456789abcdef0123_and_0123456789abcdef0123456789abcdef_are_here)
  [$(uptime)][$(caller_info)] executing program #89abcdef0123456789abcdef01234567
  [$(uptime)][$(caller_info)] $(kernel_messages_caused_by_89abcdef0123456789abcdef01234567_456789abcdef0123456789abcdef0123_and_0123456789abcdef0123456789abcdef_are_here)
  [$(uptime)][$(caller_info)] BUG: unable to handle kernel paging request at $(address)
  [$(uptime)][$(caller_info)] CPU: $(cpu) PID: $(pid) Comm: syz#89abcdef0123 Not tainted $(version) #$(build)
  [$(uptime)][$(caller_info)] $(backtrace_of_caller_info_is_here)
  [$(uptime)][$(caller_info)] Kernel panic - not syncing: Fatal exception
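
As a rough sketch of those two pieces (illustrative only, not actual syz-fuzzer
code; announceProgram is a made-up helper), the marker write to /dev/kmsg and
the comm rename could look like this:

  // Sketch: announce a program by its hash in /dev/kmsg and rename the
  // executing process so its comm shows up as syz + first 12 hash chars.
  package main

  import (
          "fmt"
          "os"
          "syscall"
          "unsafe"
  )

  // announceProgram writes "executing program #<hash>" to /dev/kmsg (needs
  // root) and sets the caller's comm via prctl(PR_SET_NAME).
  func announceProgram(hash string) error {
          kmsg, err := os.OpenFile("/dev/kmsg", os.O_WRONLY, 0)
          if err != nil {
                  return err
          }
          defer kmsg.Close()
          if _, err := fmt.Fprintf(kmsg, "executing program #%s\n", hash); err != nil {
                  return err
          }

          // comm is limited to 15 visible characters, so "syz" + 12 hash chars fits.
          name := append([]byte("syz"+hash[:12]), 0)
          _, _, errno := syscall.Syscall(syscall.SYS_PRCTL, syscall.PR_SET_NAME,
                  uintptr(unsafe.Pointer(&name[0])), 0)
          if errno != 0 {
                  return errno
          }
          return nil
  }

  func main() {
          if err := announceProgram("89abcdef0123456789abcdef01234567"); err != nil {
                  fmt.Fprintln(os.Stderr, err)
          }
  }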

Then, we can build the CrashLog by picking up all "executing program #" lines
and the latest lines up to the available space from the console output, like
below.

  [$(uptime)][$(caller_info)] executing program #0123456789abcdef0123456789abcdef
  [$(uptime)][$(caller_info)] executing program #456789abcdef0123456789abcdef0123
  [$(uptime)][$(caller_info)] executing program #89abcdef0123456789abcdef01234567
  [$(uptime)][$(caller_info)] $(kernel_messages_caused_by_89abcdef0123456789abcdef01234567_456789abcdef0123456789abcdef0123_and_0123456789abcdef0123456789abcdef_are_here)
  [$(uptime)][$(caller_info)] BUG: unable to handle kernel paging request at $(address)
  [$(uptime)][$(caller_info)] CPU: $(cpu) PID: $(pid) Comm: syz89abcdef0123 Not tainted $(version) #$(build)
  [$(uptime)][$(caller_info)] $(backtrace_of_caller_info_is_here)
  [$(uptime)][$(caller_info)] Kernel panic - not syncing: Fatal exception

Then, we can understand that a crash happened when executing 89abcdef0123 and
download 89abcdef0123456789abcdef01234567 for analysis. Also, we can download
0123456789abcdef0123456789abcdef and 456789abcdef0123456789abcdef0123 as needed.

Honestly, since the lines which follow a "$(date) executing program $(num):" line
can become so long, it is difficult to find where the previous/next kernel
messages are. If only a one-liner "executing program #" output is used, it is
easy to find the previous/next kernel messages.
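
To make the extraction step concrete, here is a small sketch of how such a
CrashLog could be assembled (an assumption about the report builder, not
syzbot's actual code): keep every marker line plus the tail of the console
output up to the available space.

  // Sketch: keep all "executing program #" markers plus the latest console
  // lines that fit into roughly maxBytes.
  package main

  import (
          "fmt"
          "strings"
  )

  func buildCrashLog(console string, maxBytes int) string {
          lines := strings.Split(console, "\n")

          // Walk backwards until the tail budget is used up.
          tailStart := len(lines)
          size := 0
          for tailStart > 0 {
                  size += len(lines[tailStart-1]) + 1
                  if size > maxBytes {
                          break
                  }
                  tailStart--
          }

          // Keep every earlier marker so each hash possibly involved in the
          // crash stays referenced in the report.
          var out []string
          for _, l := range lines[:tailStart] {
                  if strings.Contains(l, "executing program #") {
                          out = append(out, l)
                  }
          }
          return strings.Join(append(out, lines[tailStart:]...), "\n")
  }

  func main() {
          console := "[10.0][T1] executing program #0123456789abcdef0123456789abcdef\n" +
                  "[10.1][T1] older kernel messages\n" +
                  "[11.0][T2] executing program #89abcdef0123456789abcdef01234567\n" +
                  "[12.0][T2] BUG: unable to handle kernel paging request at ...\n" +
                  "[12.1][T2] Kernel panic - not syncing: Fatal exception"
          fmt.Println(buildCrashLog(console, 160))
  }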

The program referenced by "executing program #" would be made downloadable via a
Web server or a git repository. Maybe "executing program https://$server/$hash"
for the former case. But repeating the "https://$server/" part would be redundant.

The question for me is whether syzbot can detect a hash collision with different
syz-program lines before writing the hash value to /dev/kmsg, and retry by
modifying the syz-program lines in order to get a new hash value until the
collision is avoided. If that is difficult, a simpler choice like the current
Unix time and PID could be used instead...
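
One possible answer, sketched under the assumption that the fuzzer keeps an
in-memory table of hashes it has already emitted (the perturbation trick and
the time+PID fallback below are illustrations, not an agreed design):

  // Sketch: derive a 32-hex-char identifier, retry with a harmless
  // perturbation on collision, and fall back to Unix time + PID.
  package main

  import (
          "crypto/sha256"
          "encoding/hex"
          "fmt"
          "os"
          "time"
  )

  // seen maps an already-emitted identifier to the (possibly perturbed)
  // program text it stands for.
  var seen = map[string]string{}

  func programID(prog string) string {
          candidate := prog
          for attempt := 0; attempt < 8; attempt++ {
                  sum := sha256.Sum256([]byte(candidate))
                  id := hex.EncodeToString(sum[:16])
                  if prev, ok := seen[id]; !ok || prev == candidate {
                          seen[id] = candidate
                          return id
                  }
                  // Collision with a different program: perturb and rehash.
                  candidate = fmt.Sprintf("%s\n# retry %d", prog, attempt)
          }
          // Simpler fallback mentioned above: current Unix time and PID.
          return fmt.Sprintf("%x%08x", time.Now().Unix(), os.Getpid())
  }

  func main() {
          id := programID("r0 = syz_open_dev$sg(&(0x7f40)='/dev/sg#\\x00', 0x0, 0x2)")
          fmt.Println("executing program #" + id)
          // The first 12 characters could double as the comm suffix.
          fmt.Println("comm would be: syz" + id[:12])
  }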


Re: INFO: rcu detected stall in ndisc_alloc_skb

2019-01-07 Thread Dmitry Vyukov
On Sun, Jan 6, 2019 at 2:47 PM Tetsuo Handa wrote:
>
> On 2019/01/06 22:24, Dmitry Vyukov wrote:
> >> A report at 2019/01/05 10:08 from "no output from test machine (2)"
> >> ( https://syzkaller.appspot.com/text?tag=CrashLog=1700726f40 )
> >> says that there are flood of memory allocation failure messages.
> >> Since continuous memory allocation failure messages itself is not
> >> recognized as a crash, we might be misunderstanding that this problem
> >> is not occurring recently. It will be nice if we can run testcases
> >> which are executed on bpf-next tree.
> >
> > What exactly do you mean by running test cases on bpf-next tree?
> > syzbot tests bpf-next, so it executes lots of test cases on that tree.
> > One can also ask for patch testing on bpf-next tree to test a specific
> > test case.
>
> syzbot ran "some tests" before getting this report, but we can't find from
> this report what the "some tests" are. If we could record all tests executed
> in syzbot environments before getting this report, we could rerun the tests
> (with manually examining where the source of memory consumption is) in local
> environments.

Filed https://github.com/google/syzkaller/issues/917 for this.

> Since syzbot is now using memcg, maybe we can test with sysctl_panic_on_oom 
> == 1.
> Any memory consumption that triggers global OOM killer could be considered as
> a problem (e.g. memory leak or uncontrolled memory allocation).

Interesting idea. This will also alleviate the previous problem as I
think only a stream of OOMs currently produces 1+MB of output.

+Shakeel who was interested in catching more memcg-escaping allocations.

To do this we need buy-in from the kernel community to consider this a
bug/something to fix in the kernel. Systematic testing can't work with gray
checks that require humans to look at each case, with some cases left as
working-as-intended.

There are also 2 interesting points:
 - testing of kernels without memcg enabled (some kernel users obviously do
this); it's doable, but currently syzkaller has no precedents/infrastructure
to consider some output patterns as bugs or not depending on kernel features
 - false positives for minimized C reproducers that have the memcg code
stripped off (people complain that reproducers are too large/complex otherwise)


Re: INFO: rcu detected stall in ndisc_alloc_skb

2019-01-06 Thread Tetsuo Handa
On 2019/01/06 22:24, Dmitry Vyukov wrote:
>> A report at 2019/01/05 10:08 from "no output from test machine (2)"
>> ( https://syzkaller.appspot.com/text?tag=CrashLog=1700726f40 )
>> says that there are flood of memory allocation failure messages.
>> Since continuous memory allocation failure messages itself is not
>> recognized as a crash, we might be misunderstanding that this problem
>> is not occurring recently. It will be nice if we can run testcases
>> which are executed on bpf-next tree.
> 
> What exactly do you mean by running test cases on bpf-next tree?
> syzbot tests bpf-next, so it executes lots of test cases on that tree.
> One can also ask for patch testing on bpf-next tree to test a specific
> test case.

syzbot ran "some tests" before getting this report, but we can't find from
this report what the "some tests" are. If we could record all tests executed
in syzbot environments before getting this report, we could rerun the tests
(with manually examining where the source of memory consumption is) in local
environments.

Since syzbot is now using memcg, maybe we can test with sysctl_panic_on_oom == 1.
Any memory consumption that triggers the global OOM killer could be considered a
problem (e.g. a memory leak or uncontrolled memory allocation).
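
As a small illustration of that setting (only a sketch of how a test harness
might enable it, not syzbot code): with vm.panic_on_oom=1 the kernel panics on
a global OOM but not on a memcg-internal OOM, which matches the distinction
above.

  // Sketch: turn on panic_on_oom before a test run (needs root).
  package main

  import (
          "fmt"
          "os"
  )

  func main() {
          if err := os.WriteFile("/proc/sys/vm/panic_on_oom", []byte("1\n"), 0644); err != nil {
                  fmt.Fprintln(os.Stderr, "enabling panic_on_oom:", err)
                  os.Exit(1)
          }
          fmt.Println("vm.panic_on_oom enabled for this boot")
  }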



Re: INFO: rcu detected stall in ndisc_alloc_skb

2019-01-06 Thread Dmitry Vyukov
On Sat, Jan 5, 2019 at 11:49 AM Tetsuo Handa wrote:
>
> On 2019/01/03 2:06, Tetsuo Handa wrote:
> > On 2018/12/31 17:24, Dmitry Vyukov wrote:
>  Since this involves OOMs and looks like a one-off induced memory 
>  corruption:
> 
>  #syz dup: kernel panic: corrupted stack end in wb_workfn
> 
> >>>
> >>> Why?
> >>>
> >>> RCU stall in this case is likely to be latency caused by flooding of 
> >>> printk().
> >>
> >> Just a hypothesis. OOMs lead to arbitrary memory corruptions, so can
> >> cause stalls as well. But can be what you said too. I just thought
> >> that cleaner dashboard is more useful than a large assorted pile of
> >> crashes. If you think it's actionable in some way, feel free to undup.
> >>
> >
> > We don't know why bpf tree is hitting this problem.
> > Let's continue monitoring this problem.
> >
> > #syz undup
> >
>
> A report at 2019/01/05 10:08 from "no output from test machine (2)"
> ( https://syzkaller.appspot.com/text?tag=CrashLog=1700726f40 )
> says that there are flood of memory allocation failure messages.
> Since continuous memory allocation failure messages itself is not
> recognized as a crash, we might be misunderstanding that this problem
> is not occurring recently. It will be nice if we can run testcases
> which are executed on bpf-next tree.

What exactly do you mean by running test cases on bpf-next tree?
syzbot tests bpf-next, so it executes lots of test cases on that tree.
One can also ask for patch testing on bpf-next tree to test a specific
test case.


Re: INFO: rcu detected stall in ndisc_alloc_skb

2019-01-05 Thread Tetsuo Handa
On 2019/01/03 2:06, Tetsuo Handa wrote:
> On 2018/12/31 17:24, Dmitry Vyukov wrote:
 Since this involves OOMs and looks like a one-off induced memory 
 corruption:

 #syz dup: kernel panic: corrupted stack end in wb_workfn

>>>
>>> Why?
>>>
>>> RCU stall in this case is likely to be latency caused by flooding of 
>>> printk().
>>
>> Just a hypothesis. OOMs lead to arbitrary memory corruptions, so can
>> cause stalls as well. But can be what you said too. I just thought
>> that cleaner dashboard is more useful than a large assorted pile of
>> crashes. If you think it's actionable in some way, feel free to undup.
>>
> 
> We don't know why bpf tree is hitting this problem.
> Let's continue monitoring this problem.
> 
> #syz undup
> 

A report at 2019/01/05 10:08 from "no output from test machine (2)"
( https://syzkaller.appspot.com/text?tag=CrashLog=1700726f40 )
says that there is a flood of memory allocation failure messages.
Since a continuous stream of memory allocation failure messages is not itself
recognized as a crash, we might mistakenly conclude that this problem is not
occurring recently. It would be nice if we could run the testcases which are
executed on the bpf-next tree.


Re: INFO: rcu detected stall in ndisc_alloc_skb

2019-01-02 Thread Tetsuo Handa
On 2018/12/31 17:24, Dmitry Vyukov wrote:
>>> Since this involves OOMs and looks like a one-off induced memory corruption:
>>>
>>> #syz dup: kernel panic: corrupted stack end in wb_workfn
>>>
>>
>> Why?
>>
>> RCU stall in this case is likely to be latency caused by flooding of 
>> printk().
> 
> Just a hypothesis. OOMs lead to arbitrary memory corruptions, so can
> cause stalls as well. But can be what you said too. I just thought
> that cleaner dashboard is more useful than a large assorted pile of
> crashes. If you think it's actionable in some way, feel free to undup.
> 

We don't know why bpf tree is hitting this problem.
Let's continue monitoring this problem.

#syz undup


Re: INFO: rcu detected stall in ndisc_alloc_skb

2018-12-31 Thread Dmitry Vyukov
On Mon, Dec 31, 2018 at 9:17 AM Tetsuo Handa wrote:
>
> On 2018/12/31 16:49, Dmitry Vyukov wrote:
> > On Mon, Dec 31, 2018 at 8:42 AM syzbot
> >  wrote:
> >>
> >> Hello,
> >>
> >> syzbot found the following crash on:
> >>
> >> HEAD commit:ef4ab8447aa2 selftests: bpf: install script with_addr.sh
> >> git tree:   bpf-next
> >> console output: https://syzkaller.appspot.com/x/log.txt?x=14a28b6e40
> >> kernel config:  https://syzkaller.appspot.com/x/.config?x=7e7e2279c0020d5f
> >> dashboard link: 
> >> https://syzkaller.appspot.com/bug?extid=ea7d9cb314b4ab49a18a
> >> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> >>
> >> Unfortunately, I don't have any reproducer for this crash yet.
> >>
> >> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> >> Reported-by: syzbot+ea7d9cb314b4ab49a...@syzkaller.appspotmail.com
> >
> > Since this involves OOMs and looks like a one-off induced memory corruption:
> >
> > #syz dup: kernel panic: corrupted stack end in wb_workfn
> >
>
> Why?
>
> RCU stall in this case is likely to be latency caused by flooding of printk().

Just a hypothesis. OOMs lead to arbitrary memory corruptions, so they can
cause stalls as well. But it can be what you said too. I just thought
that a cleaner dashboard is more useful than a large assorted pile of
crashes. If you think it's actionable in some way, feel free to undup.


Re: INFO: rcu detected stall in ndisc_alloc_skb

2018-12-31 Thread Tetsuo Handa
On 2018/12/31 16:49, Dmitry Vyukov wrote:
> On Mon, Dec 31, 2018 at 8:42 AM syzbot
>  wrote:
>>
>> Hello,
>>
>> syzbot found the following crash on:
>>
>> HEAD commit:ef4ab8447aa2 selftests: bpf: install script with_addr.sh
>> git tree:   bpf-next
>> console output: https://syzkaller.appspot.com/x/log.txt?x=14a28b6e40
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=7e7e2279c0020d5f
>> dashboard link: https://syzkaller.appspot.com/bug?extid=ea7d9cb314b4ab49a18a
>> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
>>
>> Unfortunately, I don't have any reproducer for this crash yet.
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+ea7d9cb314b4ab49a...@syzkaller.appspotmail.com
> 
> Since this involves OOMs and looks like a one-off induced memory corruption:
> 
> #syz dup: kernel panic: corrupted stack end in wb_workfn
> 

Why?

RCU stall in this case is likely to be latency caused by flooding of printk().


Re: INFO: rcu detected stall in ndisc_alloc_skb

2018-12-30 Thread Dmitry Vyukov
On Mon, Dec 31, 2018 at 8:42 AM syzbot wrote:
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:ef4ab8447aa2 selftests: bpf: install script with_addr.sh
> git tree:   bpf-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=14a28b6e40
> kernel config:  https://syzkaller.appspot.com/x/.config?x=7e7e2279c0020d5f
> dashboard link: https://syzkaller.appspot.com/bug?extid=ea7d9cb314b4ab49a18a
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+ea7d9cb314b4ab49a...@syzkaller.appspotmail.com

Since this involves OOMs and looks like a one-off induced memory corruption:

#syz dup: kernel panic: corrupted stack end in wb_workfn

> CPU: 1 PID: 5702 Comm: rsyslogd Not tainted 4.19.0-rc6+ #118
> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
> rcu:(detected by 0, t=10712 jiffies, g=90369, q=135)
>   
>   __dump_stack lib/dump_stack.c:77 [inline]
>   dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
> rcu: All QSes seen, last rcu_preempt kthread activity 10548
> (4295003843-4294993295), jiffies_till_next_fqs=1, root ->qsmask 0x0
> syz-executor0   R
>running task
>   warn_alloc.cold.119+0xb7/0x1bd mm/page_alloc.c:3426
> 22896  7592   5826 0x801c
> Call Trace:
>   
>   sched_show_task.cold.83+0x2b6/0x30a kernel/sched/core.c:5296
>   __alloc_pages_slowpath+0x2667/0x2d80 mm/page_alloc.c:4297
>   print_other_cpu_stall.cold.79+0xa83/0xba5 kernel/rcu/tree.c:1430
>   check_cpu_stall kernel/rcu/tree.c:1557 [inline]
>   __rcu_pending kernel/rcu/tree.c:3276 [inline]
>   rcu_pending kernel/rcu/tree.c:3319 [inline]
>   rcu_check_callbacks+0xafc/0x1990 kernel/rcu/tree.c:2665
>   __alloc_pages_nodemask+0xa80/0xde0 mm/page_alloc.c:4390
>   __alloc_pages include/linux/gfp.h:473 [inline]
>   __alloc_pages_node include/linux/gfp.h:486 [inline]
>   kmem_getpages mm/slab.c:1409 [inline]
>   cache_grow_begin+0x91/0x8c0 mm/slab.c:2677
>   fallback_alloc+0x203/0x2e0 mm/slab.c:3219
>   cache_alloc_node+0x1c7/0x1e0 mm/slab.c:3287
>   slab_alloc_node mm/slab.c:3327 [inline]
>   kmem_cache_alloc_node+0xe3/0x730 mm/slab.c:3642
>   __alloc_skb+0x119/0x770 net/core/skbuff.c:193
>   alloc_skb include/linux/skbuff.h:997 [inline]
>   ndisc_alloc_skb+0x144/0x340 net/ipv6/ndisc.c:403
>   ndisc_send_rs+0x331/0x6e0 net/ipv6/ndisc.c:669
>   update_process_times+0x2d/0x70 kernel/time/timer.c:1636
>   addrconf_rs_timer+0x314/0x690 net/ipv6/addrconf.c:3836
>   tick_sched_handle+0x9f/0x180 kernel/time/tick-sched.c:164
>   tick_sched_timer+0x45/0x130 kernel/time/tick-sched.c:1274
>   __run_hrtimer kernel/time/hrtimer.c:1398 [inline]
>   __hrtimer_run_queues+0x41c/0x10d0 kernel/time/hrtimer.c:1460
>   call_timer_fn+0x272/0x920 kernel/time/timer.c:1326
>   hrtimer_interrupt+0x313/0x780 kernel/time/hrtimer.c:1518
>   local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1029 [inline]
>   smp_apic_timer_interrupt+0x1a1/0x760 arch/x86/kernel/apic/apic.c:1054
>   apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:864
> RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:788
> [inline]
> RIP: 0010:slab_alloc_node mm/slab.c:3329 [inline]
> RIP: 0010:kmem_cache_alloc_node+0x247/0x730 mm/slab.c:3642
> Code: 3f 7e 0f 85 32 ff ff ff e8 a5 7f 3e ff e9 28 ff ff ff e8 0c e3 c2 ff
> 48 83 3d 5c f4 6f 07 00 0f 84 33 01 00 00 4c 89 ff 57 9d <0f> 1f 44 00 00
> e9 bf fe ff ff 31 d2 be a5 01 00 00 48 c7 c7 62 23
> RSP: :8801dae07450 EFLAGS: 0286 ORIG_RAX: ff13
> RAX:  RBX: 00480020 RCX: 8184e1ca
> RDX: 0004 RSI: 8184e1e4 RDI: 0286
> RBP: 8801dae074c0 R08: 880193c38700 R09: fbfff12812c4
> R10: 8801dae06098 R11: 89409623 R12: 8801d9a04040
> R13: 8801d9a04040 R14:  R15: 0286
>   __alloc_skb+0x119/0x770 net/core/skbuff.c:193
>   alloc_skb include/linux/skbuff.h:997 [inline]
>   ndisc_alloc_skb+0x144/0x340 net/ipv6/ndisc.c:403
>   expire_timers kernel/time/timer.c:1363 [inline]
>   __run_timers+0x7e5/0xc70 kernel/time/timer.c:1682
>   ndisc_send_rs+0x331/0x6e0 net/ipv6/ndisc.c:669
>   addrconf_rs_timer+0x314/0x690 net/ipv6/addrconf.c:3836
>   call_timer_fn+0x272/0x920 kernel/time/timer.c:1326
>   run_timer_softirq+0x52/0xb0 kernel/time/timer.c:1695
>   __do_softirq+0x30b/0xad8 kernel/softirq.c:292
>   expire_timers kernel/time/timer.c:1363 [inline]
>   __run_timers+0x7e5/0xc70 kernel/time/timer.c:1682
>   invoke_softirq kernel/softirq.c:372 [inline]
>   irq_exit+0x17f/0x1c0 kernel/softirq.c:412
>   exiting_irq arch/x86/include/asm/apic.h:536 [inline]
>   smp_apic_timer_interrupt+0x1cb/0x760 arch/x86/kernel/apic/apic.c:1056
>   run_timer_softirq+0x52/0xb0 kernel/time/timer.c:1695
>   

INFO: rcu detected stall in ndisc_alloc_skb

2018-12-30 Thread syzbot

Hello,

syzbot found the following crash on:

HEAD commit:ef4ab8447aa2 selftests: bpf: install script with_addr.sh
git tree:   bpf-next
console output: https://syzkaller.appspot.com/x/log.txt?x=14a28b6e40
kernel config:  https://syzkaller.appspot.com/x/.config?x=7e7e2279c0020d5f
dashboard link: https://syzkaller.appspot.com/bug?extid=ea7d9cb314b4ab49a18a
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+ea7d9cb314b4ab49a...@syzkaller.appspotmail.com

CPU: 1 PID: 5702 Comm: rsyslogd Not tainted 4.19.0-rc6+ #118
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

Call Trace:
rcu:(detected by 0, t=10712 jiffies, g=90369, q=135)
 
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
rcu: All QSes seen, last rcu_preempt kthread activity 10548  
(4295003843-4294993295), jiffies_till_next_fqs=1, root ->qsmask 0x0

syz-executor0   R
  running task
 warn_alloc.cold.119+0xb7/0x1bd mm/page_alloc.c:3426
22896  7592   5826 0x801c
Call Trace:
 
 sched_show_task.cold.83+0x2b6/0x30a kernel/sched/core.c:5296
 __alloc_pages_slowpath+0x2667/0x2d80 mm/page_alloc.c:4297
 print_other_cpu_stall.cold.79+0xa83/0xba5 kernel/rcu/tree.c:1430
 check_cpu_stall kernel/rcu/tree.c:1557 [inline]
 __rcu_pending kernel/rcu/tree.c:3276 [inline]
 rcu_pending kernel/rcu/tree.c:3319 [inline]
 rcu_check_callbacks+0xafc/0x1990 kernel/rcu/tree.c:2665
 __alloc_pages_nodemask+0xa80/0xde0 mm/page_alloc.c:4390
 __alloc_pages include/linux/gfp.h:473 [inline]
 __alloc_pages_node include/linux/gfp.h:486 [inline]
 kmem_getpages mm/slab.c:1409 [inline]
 cache_grow_begin+0x91/0x8c0 mm/slab.c:2677
 fallback_alloc+0x203/0x2e0 mm/slab.c:3219
 cache_alloc_node+0x1c7/0x1e0 mm/slab.c:3287
 slab_alloc_node mm/slab.c:3327 [inline]
 kmem_cache_alloc_node+0xe3/0x730 mm/slab.c:3642
 __alloc_skb+0x119/0x770 net/core/skbuff.c:193
 alloc_skb include/linux/skbuff.h:997 [inline]
 ndisc_alloc_skb+0x144/0x340 net/ipv6/ndisc.c:403
 ndisc_send_rs+0x331/0x6e0 net/ipv6/ndisc.c:669
 update_process_times+0x2d/0x70 kernel/time/timer.c:1636
 addrconf_rs_timer+0x314/0x690 net/ipv6/addrconf.c:3836
 tick_sched_handle+0x9f/0x180 kernel/time/tick-sched.c:164
 tick_sched_timer+0x45/0x130 kernel/time/tick-sched.c:1274
 __run_hrtimer kernel/time/hrtimer.c:1398 [inline]
 __hrtimer_run_queues+0x41c/0x10d0 kernel/time/hrtimer.c:1460
 call_timer_fn+0x272/0x920 kernel/time/timer.c:1326
 hrtimer_interrupt+0x313/0x780 kernel/time/hrtimer.c:1518
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1029 [inline]
 smp_apic_timer_interrupt+0x1a1/0x760 arch/x86/kernel/apic/apic.c:1054
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:864
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:788  
[inline]

RIP: 0010:slab_alloc_node mm/slab.c:3329 [inline]
RIP: 0010:kmem_cache_alloc_node+0x247/0x730 mm/slab.c:3642
Code: 3f 7e 0f 85 32 ff ff ff e8 a5 7f 3e ff e9 28 ff ff ff e8 0c e3 c2 ff  
48 83 3d 5c f4 6f 07 00 0f 84 33 01 00 00 4c 89 ff 57 9d <0f> 1f 44 00 00  
e9 bf fe ff ff 31 d2 be a5 01 00 00 48 c7 c7 62 23

RSP: :8801dae07450 EFLAGS: 0286 ORIG_RAX: ff13
RAX:  RBX: 00480020 RCX: 8184e1ca
RDX: 0004 RSI: 8184e1e4 RDI: 0286
RBP: 8801dae074c0 R08: 880193c38700 R09: fbfff12812c4
R10: 8801dae06098 R11: 89409623 R12: 8801d9a04040
R13: 8801d9a04040 R14:  R15: 0286
 __alloc_skb+0x119/0x770 net/core/skbuff.c:193
 alloc_skb include/linux/skbuff.h:997 [inline]
 ndisc_alloc_skb+0x144/0x340 net/ipv6/ndisc.c:403
 expire_timers kernel/time/timer.c:1363 [inline]
 __run_timers+0x7e5/0xc70 kernel/time/timer.c:1682
 ndisc_send_rs+0x331/0x6e0 net/ipv6/ndisc.c:669
 addrconf_rs_timer+0x314/0x690 net/ipv6/addrconf.c:3836
 call_timer_fn+0x272/0x920 kernel/time/timer.c:1326
 run_timer_softirq+0x52/0xb0 kernel/time/timer.c:1695
 __do_softirq+0x30b/0xad8 kernel/softirq.c:292
 expire_timers kernel/time/timer.c:1363 [inline]
 __run_timers+0x7e5/0xc70 kernel/time/timer.c:1682
 invoke_softirq kernel/softirq.c:372 [inline]
 irq_exit+0x17f/0x1c0 kernel/softirq.c:412
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 smp_apic_timer_interrupt+0x1cb/0x760 arch/x86/kernel/apic/apic.c:1056
 run_timer_softirq+0x52/0xb0 kernel/time/timer.c:1695
 __do_softirq+0x30b/0xad8 kernel/softirq.c:292
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:864
 
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:788  
[inline]

RIP: 0010:console_unlock+0xfdf/0x1160 kernel/printk/printk.c:2409
Code: c1 e8 03 42 80 3c 20 00 0f 85 d1 00 00 00 48 83 3d cd 54 cd 07 00 0f  
84 bc 00 00 00 e8 ca 37 1a 00 48 8b bd b0 fe ff ff 57 9d <0f> 1f 44 00 00  
e9 cc