Re: amd64, v3.6.0: Kernel panic + BUG at net/netfilter/nf_conntrack_core.c:220!

2012-10-16 Thread Ian Applegate
I will be doing much wider testing on 3.6 this week, will keep you
updated if I see this issue again. I don't see a logical way into that
trap myself and looking at the revision history we've been running
this code path for a very long time without issue. The machine in
question has a fairly high MTBF which is why I thought this odd, but
it is most likely random memory corruption.

Thanks!
Ian Applegate

On Tue, Oct 16, 2012 at 9:27 AM, Pablo Neira Ayuso  wrote:
> On Fri, Oct 12, 2012 at 01:32:06AM +0200, Pablo Neira Ayuso wrote:
>> On Thu, Oct 11, 2012 at 11:27:33PM +0200, Borislav Petkov wrote:
>> > On Thu, Oct 11, 2012 at 12:13:33PM -0700, Ian Applegate wrote:
>> > > On machines serving mainly http traffic we are seeing the following
>> > > panic, which is not yet reproducible.
>> >
>> > Must be this BUG_ON:
>> >
>> > if (!nf_ct_is_confirmed(ct)) {
>> > 
>> > BUG_ON(hlist_nulls_unhashed(>tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
>> > hlist_nulls_del_rcu(>tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
>> > }
>>
>> At quick glance, I think we're hitting a memory corruption, I don't
>> see by now any sane code path to reach that bugtrap.
>
> Did you manage to obtain more information regarding this bug?
>
> I don't find any recent change in the conntrack source that may lead
> us to hit that.
>
> Thanks.
>
>> More comments below:
>>
>> > Spamming some more lists and leaving the rest for reference.
>> >
>> > >
>> > >
>> > > [180926.566743] [ cut here ]
>> > > [180926.572034] kernel BUG at net/netfilter/nf_conntrack_core.c:220!
>> > > [180926.578873] invalid opcode:  [#1] SMP
>> > > [180926.583594] Modules linked in: xfs exportfs ipmi_devintf ipmi_si
>> > > ipmi_msghandler dm_mod md_mod nf_conntr
>> > > ack_ipv6 nf_defrag_ipv6 ip6table_filter ip6table_raw ip6_tables
>> > > nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp x
>> > > t_conntrack xt_multiport iptable_filter xt_NOTRACK nf_conntrack
>> > > iptable_raw ip_tables x_tables nfsv4 auth_rp
>> > > cgss fuse nfsv3 nfs_acl nfs fscache lockd sunrpc sfc mtd i2c_algo_bit
>> > > i2c_core mdio igb dca uhci_hcd coretem
>> > > p acpi_cpufreq kvm_intel kvm crc32c_intel aesni_intel ablk_helper
>> > > cryptd aes_x86_64 aes_generic evdev sd_mod
>> > >  crc_t10dif mperf snd_pcm ahci snd_timer tpm_tis microcode snd tpm
>> > > libahci tpm_bios soundcore libata snd_pag
>> > > e_alloc pcspkr ehci_hcd lpc_ich usbcore mfd_core hpsa scsi_mod
>> > > usb_common button processor thermal_sys
>> > > [180926.657762] CPU 12
>> > > [180926.660008] Pid: 5948, comm: nginx-fl Not tainted 3.6.0-cloudflare
>> > > #1 HP ProLiant DL180 G6
>> > > [180926.669820] RIP: 0010:[]  []
>> > > destroy_conntrack+0x55/0xa9 [nf_conntrack]
>> > > [180926.680871] RSP: 0018:8805bd73fbb8  EFLAGS: 00010246
>> > > [180926.686930] RAX:  RBX: 8806b6f56c30 RCX:
>> > > 8805bd73fc48
>> > > [180926.695055] RDX:  RSI: 0006 RDI:
>> > > 8806b6f56c30
>> > > [180926.703179] RBP: 81651780 R08: 000172e0 R09:
>> > > 812cef91
>> > > [180926.711304] R10: dead00200200 R11: dead00100100 R12:
>> > > 8806b6f56c30
>> > > [180926.727451] R13:  R14: a02d6030 R15:
>> > > 
>> > > [180926.735575] FS:  7f382cdb2710() GS:880627cc()
>> > > knlGS:
>> > > [180926.744766] CS:  0010 DS:  ES:  CR0: 80050033
>> > > [180926.751312] CR2: ff600400 CR3: 0005bd8d3000 CR4:
>> > > 07e0
>> > > [180926.759436] DR0:  DR1:  DR2:
>> > > 
>> > > [180926.767560] DR3:  DR6: 0ff0 DR7:
>> > > 0400
>> > > [180926.775686] Process nginx-fl (pid: 5948, threadinfo
>> > > 8805bd73e000, task 8805c9755960)
>> > > [180926.785265] Stack:
>> > > [180926.787634]   81651780 8802720c2900
>> > > a02cde78
>> > > [180926.796087]  81651ec0 a02d6030 bff0efab
>> > > 8805
>> > > [180926.804532]  000288050002 00030014 00140003
>> > > 06ff88060002
>> > > [180926.812985] Call Trace:
>> > > [180926.815845]  [] ? nf_conntrack_in+0x4ed/0x5bc
>> > > [nf_conntrack]
>>
>> Here below the trace shows the output path to close a tcp socket. But
>> the line above refers to a conntrack function that is called in the
>> input path.
>>
>> If this process is just acting as plain http server, this backtrace
>> doesn't seem consistent to me.
>>
>> > > [180926.824069]  [] ? nf_iterate+0x41/0x77
>> > > [180926.830131]  [] ? ip_options_echo+0x2ed/0x2ed
>> > > [180926.836873]  [] ? nf_hook_slow+0x68/0xfd
>> > > [180926.843127]  [] ? ip_options_echo+0x2ed/0x2ed
>> > > [180926.849866]  [] ? __ip_local_out+0x98/0x9d
>> > > [180926.856315]  [] ? ip_local_out+0x9/0x19
>> > > [180926.862465]  [] ? tcp_transmit_skb+0x7ae/0x7f1
>> > > [180926.869305]  [] ? 

Re: amd64, v3.6.0: Kernel panic + BUG at net/netfilter/nf_conntrack_core.c:220!

2012-10-16 Thread Pablo Neira Ayuso
On Fri, Oct 12, 2012 at 01:32:06AM +0200, Pablo Neira Ayuso wrote:
> On Thu, Oct 11, 2012 at 11:27:33PM +0200, Borislav Petkov wrote:
> > On Thu, Oct 11, 2012 at 12:13:33PM -0700, Ian Applegate wrote:
> > > On machines serving mainly http traffic we are seeing the following
> > > panic, which is not yet reproducible.
> > 
> > Must be this BUG_ON:
> > 
> > if (!nf_ct_is_confirmed(ct)) {
> > 
> > BUG_ON(hlist_nulls_unhashed(>tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
> > hlist_nulls_del_rcu(>tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
> > }
> 
> At quick glance, I think we're hitting a memory corruption, I don't
> see by now any sane code path to reach that bugtrap.

Did you manage to obtain more information regarding this bug?

I don't find any recent change in the conntrack source that may lead
us to hit that.

Thanks.

> More comments below:
> 
> > Spamming some more lists and leaving the rest for reference.
> > 
> > > 
> > > 
> > > [180926.566743] [ cut here ]
> > > [180926.572034] kernel BUG at net/netfilter/nf_conntrack_core.c:220!
> > > [180926.578873] invalid opcode:  [#1] SMP
> > > [180926.583594] Modules linked in: xfs exportfs ipmi_devintf ipmi_si
> > > ipmi_msghandler dm_mod md_mod nf_conntr
> > > ack_ipv6 nf_defrag_ipv6 ip6table_filter ip6table_raw ip6_tables
> > > nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp x
> > > t_conntrack xt_multiport iptable_filter xt_NOTRACK nf_conntrack
> > > iptable_raw ip_tables x_tables nfsv4 auth_rp
> > > cgss fuse nfsv3 nfs_acl nfs fscache lockd sunrpc sfc mtd i2c_algo_bit
> > > i2c_core mdio igb dca uhci_hcd coretem
> > > p acpi_cpufreq kvm_intel kvm crc32c_intel aesni_intel ablk_helper
> > > cryptd aes_x86_64 aes_generic evdev sd_mod
> > >  crc_t10dif mperf snd_pcm ahci snd_timer tpm_tis microcode snd tpm
> > > libahci tpm_bios soundcore libata snd_pag
> > > e_alloc pcspkr ehci_hcd lpc_ich usbcore mfd_core hpsa scsi_mod
> > > usb_common button processor thermal_sys
> > > [180926.657762] CPU 12
> > > [180926.660008] Pid: 5948, comm: nginx-fl Not tainted 3.6.0-cloudflare
> > > #1 HP ProLiant DL180 G6
> > > [180926.669820] RIP: 0010:[]  []
> > > destroy_conntrack+0x55/0xa9 [nf_conntrack]
> > > [180926.680871] RSP: 0018:8805bd73fbb8  EFLAGS: 00010246
> > > [180926.686930] RAX:  RBX: 8806b6f56c30 RCX:
> > > 8805bd73fc48
> > > [180926.695055] RDX:  RSI: 0006 RDI:
> > > 8806b6f56c30
> > > [180926.703179] RBP: 81651780 R08: 000172e0 R09:
> > > 812cef91
> > > [180926.711304] R10: dead00200200 R11: dead00100100 R12:
> > > 8806b6f56c30
> > > [180926.727451] R13:  R14: a02d6030 R15:
> > > 
> > > [180926.735575] FS:  7f382cdb2710() GS:880627cc()
> > > knlGS:
> > > [180926.744766] CS:  0010 DS:  ES:  CR0: 80050033
> > > [180926.751312] CR2: ff600400 CR3: 0005bd8d3000 CR4:
> > > 07e0
> > > [180926.759436] DR0:  DR1:  DR2:
> > > 
> > > [180926.767560] DR3:  DR6: 0ff0 DR7:
> > > 0400
> > > [180926.775686] Process nginx-fl (pid: 5948, threadinfo
> > > 8805bd73e000, task 8805c9755960)
> > > [180926.785265] Stack:
> > > [180926.787634]   81651780 8802720c2900
> > > a02cde78
> > > [180926.796087]  81651ec0 a02d6030 bff0efab
> > > 8805
> > > [180926.804532]  000288050002 00030014 00140003
> > > 06ff88060002
> > > [180926.812985] Call Trace:
> > > [180926.815845]  [] ? nf_conntrack_in+0x4ed/0x5bc
> > > [nf_conntrack]
> 
> Here below the trace shows the output path to close a tcp socket. But
> the line above refers to a conntrack function that is called in the
> input path.
> 
> If this process is just acting as plain http server, this backtrace
> doesn't seem consistent to me.
> 
> > > [180926.824069]  [] ? nf_iterate+0x41/0x77
> > > [180926.830131]  [] ? ip_options_echo+0x2ed/0x2ed
> > > [180926.836873]  [] ? nf_hook_slow+0x68/0xfd
> > > [180926.843127]  [] ? ip_options_echo+0x2ed/0x2ed
> > > [180926.849866]  [] ? __ip_local_out+0x98/0x9d
> > > [180926.856315]  [] ? ip_local_out+0x9/0x19
> > > [180926.862465]  [] ? tcp_transmit_skb+0x7ae/0x7f1
> > > [180926.869305]  [] ? virt_to_head_page+0x9/0x2c
> > > [180926.875949]  [] ? tcp_send_active_reset+0xd5/0x101
> > > [180926.883175]  [] ? tcp_close+0x118/0x354
> > > [180926.889334]  [] ? inet_release+0x75/0x7b
> > > [180926.895591]  [] ? sock_release+0x19/0x73
> > > [180926.901845]  [] ? sock_close+0x22/0x27
> > > [180926.907906]  [] ? __fput+0xe9/0x1ae
> > > [180926.913677]  [] ? task_work_run+0x53/0x67
> > > [180926.920031]  [] ? do_notify_resume+0x79/0x8d
> > > [180926.926673]  [] ? int_signal+0x12/0x17
> > > [180926.932732] Code: 05 48 89 df ff d0 48 c7 c7 30 66 2d a0 e8 11 b0
> > > 07 

Re: amd64, v3.6.0: Kernel panic + BUG at net/netfilter/nf_conntrack_core.c:220!

2012-10-16 Thread Pablo Neira Ayuso
On Fri, Oct 12, 2012 at 01:32:06AM +0200, Pablo Neira Ayuso wrote:
 On Thu, Oct 11, 2012 at 11:27:33PM +0200, Borislav Petkov wrote:
  On Thu, Oct 11, 2012 at 12:13:33PM -0700, Ian Applegate wrote:
   On machines serving mainly http traffic we are seeing the following
   panic, which is not yet reproducible.
  
  Must be this BUG_ON:
  
  if (!nf_ct_is_confirmed(ct)) {
  
  BUG_ON(hlist_nulls_unhashed(ct-tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
  hlist_nulls_del_rcu(ct-tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
  }
 
 At quick glance, I think we're hitting a memory corruption, I don't
 see by now any sane code path to reach that bugtrap.

Did you manage to obtain more information regarding this bug?

I don't find any recent change in the conntrack source that may lead
us to hit that.

Thanks.

 More comments below:
 
  Spamming some more lists and leaving the rest for reference.
  
   
   
   [180926.566743] [ cut here ]
   [180926.572034] kernel BUG at net/netfilter/nf_conntrack_core.c:220!
   [180926.578873] invalid opcode:  [#1] SMP
   [180926.583594] Modules linked in: xfs exportfs ipmi_devintf ipmi_si
   ipmi_msghandler dm_mod md_mod nf_conntr
   ack_ipv6 nf_defrag_ipv6 ip6table_filter ip6table_raw ip6_tables
   nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp x
   t_conntrack xt_multiport iptable_filter xt_NOTRACK nf_conntrack
   iptable_raw ip_tables x_tables nfsv4 auth_rp
   cgss fuse nfsv3 nfs_acl nfs fscache lockd sunrpc sfc mtd i2c_algo_bit
   i2c_core mdio igb dca uhci_hcd coretem
   p acpi_cpufreq kvm_intel kvm crc32c_intel aesni_intel ablk_helper
   cryptd aes_x86_64 aes_generic evdev sd_mod
crc_t10dif mperf snd_pcm ahci snd_timer tpm_tis microcode snd tpm
   libahci tpm_bios soundcore libata snd_pag
   e_alloc pcspkr ehci_hcd lpc_ich usbcore mfd_core hpsa scsi_mod
   usb_common button processor thermal_sys
   [180926.657762] CPU 12
   [180926.660008] Pid: 5948, comm: nginx-fl Not tainted 3.6.0-cloudflare
   #1 HP ProLiant DL180 G6
   [180926.669820] RIP: 0010:[a02ccdd1]  [a02ccdd1]
   destroy_conntrack+0x55/0xa9 [nf_conntrack]
   [180926.680871] RSP: 0018:8805bd73fbb8  EFLAGS: 00010246
   [180926.686930] RAX:  RBX: 8806b6f56c30 RCX:
   8805bd73fc48
   [180926.695055] RDX:  RSI: 0006 RDI:
   8806b6f56c30
   [180926.703179] RBP: 81651780 R08: 000172e0 R09:
   812cef91
   [180926.711304] R10: dead00200200 R11: dead00100100 R12:
   8806b6f56c30
   [180926.727451] R13:  R14: a02d6030 R15:
   
   [180926.735575] FS:  7f382cdb2710() GS:880627cc()
   knlGS:
   [180926.744766] CS:  0010 DS:  ES:  CR0: 80050033
   [180926.751312] CR2: ff600400 CR3: 0005bd8d3000 CR4:
   07e0
   [180926.759436] DR0:  DR1:  DR2:
   
   [180926.767560] DR3:  DR6: 0ff0 DR7:
   0400
   [180926.775686] Process nginx-fl (pid: 5948, threadinfo
   8805bd73e000, task 8805c9755960)
   [180926.785265] Stack:
   [180926.787634]   81651780 8802720c2900
   a02cde78
   [180926.796087]  81651ec0 a02d6030 bff0efab
   8805
   [180926.804532]  000288050002 00030014 00140003
   06ff88060002
   [180926.812985] Call Trace:
   [180926.815845]  [a02cde78] ? nf_conntrack_in+0x4ed/0x5bc
   [nf_conntrack]
 
 Here below the trace shows the output path to close a tcp socket. But
 the line above refers to a conntrack function that is called in the
 input path.
 
 If this process is just acting as plain http server, this backtrace
 doesn't seem consistent to me.
 
   [180926.824069]  [812ac195] ? nf_iterate+0x41/0x77
   [180926.830131]  [812b3820] ? ip_options_echo+0x2ed/0x2ed
   [180926.836873]  [812ac2ee] ? nf_hook_slow+0x68/0xfd
   [180926.843127]  [812b3820] ? ip_options_echo+0x2ed/0x2ed
   [180926.849866]  [812b55a8] ? __ip_local_out+0x98/0x9d
   [180926.856315]  [812b55b6] ? ip_local_out+0x9/0x19
   [180926.862465]  [812c7582] ? tcp_transmit_skb+0x7ae/0x7f1
   [180926.869305]  [810fff4c] ? virt_to_head_page+0x9/0x2c
   [180926.875949]  [812c78a6] ? tcp_send_active_reset+0xd5/0x101
   [180926.883175]  [812bc38f] ? tcp_close+0x118/0x354
   [180926.889334]  [812dbdc2] ? inet_release+0x75/0x7b
   [180926.895591]  [81278bff] ? sock_release+0x19/0x73
   [180926.901845]  [81278c7b] ? sock_close+0x22/0x27
   [180926.907906]  [81109de8] ? __fput+0xe9/0x1ae
   [180926.913677]  [81057157] ? task_work_run+0x53/0x67
   [180926.920031]  [8100e50c] ? do_notify_resume+0x79/0x8d
   [180926.926673]  [8134e5b2] ? int_signal+0x12/0x17
   [180926.932732] 

Re: amd64, v3.6.0: Kernel panic + BUG at net/netfilter/nf_conntrack_core.c:220!

2012-10-16 Thread Ian Applegate
I will be doing much wider testing on 3.6 this week, will keep you
updated if I see this issue again. I don't see a logical way into that
trap myself and looking at the revision history we've been running
this code path for a very long time without issue. The machine in
question has a fairly high MTBF which is why I thought this odd, but
it is most likely random memory corruption.

Thanks!
Ian Applegate

On Tue, Oct 16, 2012 at 9:27 AM, Pablo Neira Ayuso pa...@netfilter.org wrote:
 On Fri, Oct 12, 2012 at 01:32:06AM +0200, Pablo Neira Ayuso wrote:
 On Thu, Oct 11, 2012 at 11:27:33PM +0200, Borislav Petkov wrote:
  On Thu, Oct 11, 2012 at 12:13:33PM -0700, Ian Applegate wrote:
   On machines serving mainly http traffic we are seeing the following
   panic, which is not yet reproducible.
 
  Must be this BUG_ON:
 
  if (!nf_ct_is_confirmed(ct)) {
  
  BUG_ON(hlist_nulls_unhashed(ct-tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
  hlist_nulls_del_rcu(ct-tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
  }

 At quick glance, I think we're hitting a memory corruption, I don't
 see by now any sane code path to reach that bugtrap.

 Did you manage to obtain more information regarding this bug?

 I don't find any recent change in the conntrack source that may lead
 us to hit that.

 Thanks.

 More comments below:

  Spamming some more lists and leaving the rest for reference.
 
  
  
   [180926.566743] [ cut here ]
   [180926.572034] kernel BUG at net/netfilter/nf_conntrack_core.c:220!
   [180926.578873] invalid opcode:  [#1] SMP
   [180926.583594] Modules linked in: xfs exportfs ipmi_devintf ipmi_si
   ipmi_msghandler dm_mod md_mod nf_conntr
   ack_ipv6 nf_defrag_ipv6 ip6table_filter ip6table_raw ip6_tables
   nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp x
   t_conntrack xt_multiport iptable_filter xt_NOTRACK nf_conntrack
   iptable_raw ip_tables x_tables nfsv4 auth_rp
   cgss fuse nfsv3 nfs_acl nfs fscache lockd sunrpc sfc mtd i2c_algo_bit
   i2c_core mdio igb dca uhci_hcd coretem
   p acpi_cpufreq kvm_intel kvm crc32c_intel aesni_intel ablk_helper
   cryptd aes_x86_64 aes_generic evdev sd_mod
crc_t10dif mperf snd_pcm ahci snd_timer tpm_tis microcode snd tpm
   libahci tpm_bios soundcore libata snd_pag
   e_alloc pcspkr ehci_hcd lpc_ich usbcore mfd_core hpsa scsi_mod
   usb_common button processor thermal_sys
   [180926.657762] CPU 12
   [180926.660008] Pid: 5948, comm: nginx-fl Not tainted 3.6.0-cloudflare
   #1 HP ProLiant DL180 G6
   [180926.669820] RIP: 0010:[a02ccdd1]  [a02ccdd1]
   destroy_conntrack+0x55/0xa9 [nf_conntrack]
   [180926.680871] RSP: 0018:8805bd73fbb8  EFLAGS: 00010246
   [180926.686930] RAX:  RBX: 8806b6f56c30 RCX:
   8805bd73fc48
   [180926.695055] RDX:  RSI: 0006 RDI:
   8806b6f56c30
   [180926.703179] RBP: 81651780 R08: 000172e0 R09:
   812cef91
   [180926.711304] R10: dead00200200 R11: dead00100100 R12:
   8806b6f56c30
   [180926.727451] R13:  R14: a02d6030 R15:
   
   [180926.735575] FS:  7f382cdb2710() GS:880627cc()
   knlGS:
   [180926.744766] CS:  0010 DS:  ES:  CR0: 80050033
   [180926.751312] CR2: ff600400 CR3: 0005bd8d3000 CR4:
   07e0
   [180926.759436] DR0:  DR1:  DR2:
   
   [180926.767560] DR3:  DR6: 0ff0 DR7:
   0400
   [180926.775686] Process nginx-fl (pid: 5948, threadinfo
   8805bd73e000, task 8805c9755960)
   [180926.785265] Stack:
   [180926.787634]   81651780 8802720c2900
   a02cde78
   [180926.796087]  81651ec0 a02d6030 bff0efab
   8805
   [180926.804532]  000288050002 00030014 00140003
   06ff88060002
   [180926.812985] Call Trace:
   [180926.815845]  [a02cde78] ? nf_conntrack_in+0x4ed/0x5bc
   [nf_conntrack]

 Here below the trace shows the output path to close a tcp socket. But
 the line above refers to a conntrack function that is called in the
 input path.

 If this process is just acting as plain http server, this backtrace
 doesn't seem consistent to me.

   [180926.824069]  [812ac195] ? nf_iterate+0x41/0x77
   [180926.830131]  [812b3820] ? ip_options_echo+0x2ed/0x2ed
   [180926.836873]  [812ac2ee] ? nf_hook_slow+0x68/0xfd
   [180926.843127]  [812b3820] ? ip_options_echo+0x2ed/0x2ed
   [180926.849866]  [812b55a8] ? __ip_local_out+0x98/0x9d
   [180926.856315]  [812b55b6] ? ip_local_out+0x9/0x19
   [180926.862465]  [812c7582] ? tcp_transmit_skb+0x7ae/0x7f1
   [180926.869305]  [810fff4c] ? virt_to_head_page+0x9/0x2c
   [180926.875949]  [812c78a6] ? tcp_send_active_reset+0xd5/0x101
   [180926.883175]  [812bc38f] ? 

Re: amd64, v3.6.0: Kernel panic + BUG at net/netfilter/nf_conntrack_core.c:220!

2012-10-11 Thread Pablo Neira Ayuso
On Thu, Oct 11, 2012 at 11:27:33PM +0200, Borislav Petkov wrote:
> On Thu, Oct 11, 2012 at 12:13:33PM -0700, Ian Applegate wrote:
> > On machines serving mainly http traffic we are seeing the following
> > panic, which is not yet reproducible.
> 
> Must be this BUG_ON:
> 
>   if (!nf_ct_is_confirmed(ct)) {
>   
> BUG_ON(hlist_nulls_unhashed(>tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
>   hlist_nulls_del_rcu(>tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
>   }

At quick glance, I think we're hitting a memory corruption, I don't
see by now any sane code path to reach that bugtrap.

More comments below:

> Spamming some more lists and leaving the rest for reference.
> 
> > 
> > 
> > [180926.566743] [ cut here ]
> > [180926.572034] kernel BUG at net/netfilter/nf_conntrack_core.c:220!
> > [180926.578873] invalid opcode:  [#1] SMP
> > [180926.583594] Modules linked in: xfs exportfs ipmi_devintf ipmi_si
> > ipmi_msghandler dm_mod md_mod nf_conntr
> > ack_ipv6 nf_defrag_ipv6 ip6table_filter ip6table_raw ip6_tables
> > nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp x
> > t_conntrack xt_multiport iptable_filter xt_NOTRACK nf_conntrack
> > iptable_raw ip_tables x_tables nfsv4 auth_rp
> > cgss fuse nfsv3 nfs_acl nfs fscache lockd sunrpc sfc mtd i2c_algo_bit
> > i2c_core mdio igb dca uhci_hcd coretem
> > p acpi_cpufreq kvm_intel kvm crc32c_intel aesni_intel ablk_helper
> > cryptd aes_x86_64 aes_generic evdev sd_mod
> >  crc_t10dif mperf snd_pcm ahci snd_timer tpm_tis microcode snd tpm
> > libahci tpm_bios soundcore libata snd_pag
> > e_alloc pcspkr ehci_hcd lpc_ich usbcore mfd_core hpsa scsi_mod
> > usb_common button processor thermal_sys
> > [180926.657762] CPU 12
> > [180926.660008] Pid: 5948, comm: nginx-fl Not tainted 3.6.0-cloudflare
> > #1 HP ProLiant DL180 G6
> > [180926.669820] RIP: 0010:[]  []
> > destroy_conntrack+0x55/0xa9 [nf_conntrack]
> > [180926.680871] RSP: 0018:8805bd73fbb8  EFLAGS: 00010246
> > [180926.686930] RAX:  RBX: 8806b6f56c30 RCX:
> > 8805bd73fc48
> > [180926.695055] RDX:  RSI: 0006 RDI:
> > 8806b6f56c30
> > [180926.703179] RBP: 81651780 R08: 000172e0 R09:
> > 812cef91
> > [180926.711304] R10: dead00200200 R11: dead00100100 R12:
> > 8806b6f56c30
> > [180926.727451] R13:  R14: a02d6030 R15:
> > 
> > [180926.735575] FS:  7f382cdb2710() GS:880627cc()
> > knlGS:
> > [180926.744766] CS:  0010 DS:  ES:  CR0: 80050033
> > [180926.751312] CR2: ff600400 CR3: 0005bd8d3000 CR4:
> > 07e0
> > [180926.759436] DR0:  DR1:  DR2:
> > 
> > [180926.767560] DR3:  DR6: 0ff0 DR7:
> > 0400
> > [180926.775686] Process nginx-fl (pid: 5948, threadinfo
> > 8805bd73e000, task 8805c9755960)
> > [180926.785265] Stack:
> > [180926.787634]   81651780 8802720c2900
> > a02cde78
> > [180926.796087]  81651ec0 a02d6030 bff0efab
> > 8805
> > [180926.804532]  000288050002 00030014 00140003
> > 06ff88060002
> > [180926.812985] Call Trace:
> > [180926.815845]  [] ? nf_conntrack_in+0x4ed/0x5bc
> > [nf_conntrack]

Here below the trace shows the output path to close a tcp socket. But
the line above refers to a conntrack function that is called in the
input path.

If this process is just acting as plain http server, this backtrace
doesn't seem consistent to me.

> > [180926.824069]  [] ? nf_iterate+0x41/0x77
> > [180926.830131]  [] ? ip_options_echo+0x2ed/0x2ed
> > [180926.836873]  [] ? nf_hook_slow+0x68/0xfd
> > [180926.843127]  [] ? ip_options_echo+0x2ed/0x2ed
> > [180926.849866]  [] ? __ip_local_out+0x98/0x9d
> > [180926.856315]  [] ? ip_local_out+0x9/0x19
> > [180926.862465]  [] ? tcp_transmit_skb+0x7ae/0x7f1
> > [180926.869305]  [] ? virt_to_head_page+0x9/0x2c
> > [180926.875949]  [] ? tcp_send_active_reset+0xd5/0x101
> > [180926.883175]  [] ? tcp_close+0x118/0x354
> > [180926.889334]  [] ? inet_release+0x75/0x7b
> > [180926.895591]  [] ? sock_release+0x19/0x73
> > [180926.901845]  [] ? sock_close+0x22/0x27
> > [180926.907906]  [] ? __fput+0xe9/0x1ae
> > [180926.913677]  [] ? task_work_run+0x53/0x67
> > [180926.920031]  [] ? do_notify_resume+0x79/0x8d
> > [180926.926673]  [] ? int_signal+0x12/0x17
> > [180926.932732] Code: 05 48 89 df ff d0 48 c7 c7 30 66 2d a0 e8 11 b0
> > 07 e1 48 89 df e8 72 25 00 00 48 8b 43
> >  78 a8 08 75 2a 48 8b 53 10 48 85 d2 75 04 <0f> 0b eb fe 48 8b 43 08
> > 48 89 02 a8 01 75 04 48 89 50 08 48 be
> > [180926.954788] RIP  [] destroy_conntrack+0x55/0xa9
> > [nf_conntrack]
> > [180926.963217]  RSP 
> > [180926.967700] ---[ end trace 54a660a52afd5820 ]---
> > [180926.973038] Kernel panic - not syncing: Fatal exception in interrupt
> > 
> > ---
> > Ian Applegate
> > 

Re: amd64, v3.6.0: Kernel panic + BUG at net/netfilter/nf_conntrack_core.c:220!

2012-10-11 Thread Borislav Petkov
On Thu, Oct 11, 2012 at 12:13:33PM -0700, Ian Applegate wrote:
> On machines serving mainly http traffic we are seeing the following
> panic, which is not yet reproducible.

Must be this BUG_ON:

if (!nf_ct_is_confirmed(ct)) {

BUG_ON(hlist_nulls_unhashed(>tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
hlist_nulls_del_rcu(>tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
}

Spamming some more lists and leaving the rest for reference.

> 
> 
> [180926.566743] [ cut here ]
> [180926.572034] kernel BUG at net/netfilter/nf_conntrack_core.c:220!
> [180926.578873] invalid opcode:  [#1] SMP
> [180926.583594] Modules linked in: xfs exportfs ipmi_devintf ipmi_si
> ipmi_msghandler dm_mod md_mod nf_conntr
> ack_ipv6 nf_defrag_ipv6 ip6table_filter ip6table_raw ip6_tables
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp x
> t_conntrack xt_multiport iptable_filter xt_NOTRACK nf_conntrack
> iptable_raw ip_tables x_tables nfsv4 auth_rp
> cgss fuse nfsv3 nfs_acl nfs fscache lockd sunrpc sfc mtd i2c_algo_bit
> i2c_core mdio igb dca uhci_hcd coretem
> p acpi_cpufreq kvm_intel kvm crc32c_intel aesni_intel ablk_helper
> cryptd aes_x86_64 aes_generic evdev sd_mod
>  crc_t10dif mperf snd_pcm ahci snd_timer tpm_tis microcode snd tpm
> libahci tpm_bios soundcore libata snd_pag
> e_alloc pcspkr ehci_hcd lpc_ich usbcore mfd_core hpsa scsi_mod
> usb_common button processor thermal_sys
> [180926.657762] CPU 12
> [180926.660008] Pid: 5948, comm: nginx-fl Not tainted 3.6.0-cloudflare
> #1 HP ProLiant DL180 G6
> [180926.669820] RIP: 0010:[]  []
> destroy_conntrack+0x55/0xa9 [nf_conntrack]
> [180926.680871] RSP: 0018:8805bd73fbb8  EFLAGS: 00010246
> [180926.686930] RAX:  RBX: 8806b6f56c30 RCX:
> 8805bd73fc48
> [180926.695055] RDX:  RSI: 0006 RDI:
> 8806b6f56c30
> [180926.703179] RBP: 81651780 R08: 000172e0 R09:
> 812cef91
> [180926.711304] R10: dead00200200 R11: dead00100100 R12:
> 8806b6f56c30
> [180926.727451] R13:  R14: a02d6030 R15:
> 
> [180926.735575] FS:  7f382cdb2710() GS:880627cc()
> knlGS:
> [180926.744766] CS:  0010 DS:  ES:  CR0: 80050033
> [180926.751312] CR2: ff600400 CR3: 0005bd8d3000 CR4:
> 07e0
> [180926.759436] DR0:  DR1:  DR2:
> 
> [180926.767560] DR3:  DR6: 0ff0 DR7:
> 0400
> [180926.775686] Process nginx-fl (pid: 5948, threadinfo
> 8805bd73e000, task 8805c9755960)
> [180926.785265] Stack:
> [180926.787634]   81651780 8802720c2900
> a02cde78
> [180926.796087]  81651ec0 a02d6030 bff0efab
> 8805
> [180926.804532]  000288050002 00030014 00140003
> 06ff88060002
> [180926.812985] Call Trace:
> [180926.815845]  [] ? nf_conntrack_in+0x4ed/0x5bc
> [nf_conntrack]
> [180926.824069]  [] ? nf_iterate+0x41/0x77
> [180926.830131]  [] ? ip_options_echo+0x2ed/0x2ed
> [180926.836873]  [] ? nf_hook_slow+0x68/0xfd
> [180926.843127]  [] ? ip_options_echo+0x2ed/0x2ed
> [180926.849866]  [] ? __ip_local_out+0x98/0x9d
> [180926.856315]  [] ? ip_local_out+0x9/0x19
> [180926.862465]  [] ? tcp_transmit_skb+0x7ae/0x7f1
> [180926.869305]  [] ? virt_to_head_page+0x9/0x2c
> [180926.875949]  [] ? tcp_send_active_reset+0xd5/0x101
> [180926.883175]  [] ? tcp_close+0x118/0x354
> [180926.889334]  [] ? inet_release+0x75/0x7b
> [180926.895591]  [] ? sock_release+0x19/0x73
> [180926.901845]  [] ? sock_close+0x22/0x27
> [180926.907906]  [] ? __fput+0xe9/0x1ae
> [180926.913677]  [] ? task_work_run+0x53/0x67
> [180926.920031]  [] ? do_notify_resume+0x79/0x8d
> [180926.926673]  [] ? int_signal+0x12/0x17
> [180926.932732] Code: 05 48 89 df ff d0 48 c7 c7 30 66 2d a0 e8 11 b0
> 07 e1 48 89 df e8 72 25 00 00 48 8b 43
>  78 a8 08 75 2a 48 8b 53 10 48 85 d2 75 04 <0f> 0b eb fe 48 8b 43 08
> 48 89 02 a8 01 75 04 48 89 50 08 48 be
> [180926.954788] RIP  [] destroy_conntrack+0x55/0xa9
> [nf_conntrack]
> [180926.963217]  RSP 
> [180926.967700] ---[ end trace 54a660a52afd5820 ]---
> [180926.973038] Kernel panic - not syncing: Fatal exception in interrupt
> 
> ---
> Ian Applegate
> CloudFlare
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Regards/Gruss,
Boris.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


amd64, v3.6.0: Kernel panic + BUG at net/netfilter/nf_conntrack_core.c:220!

2012-10-11 Thread Ian Applegate
On machines serving mainly http traffic we are seeing the following
panic, which is not yet reproducible.


[180926.566743] [ cut here ]
[180926.572034] kernel BUG at net/netfilter/nf_conntrack_core.c:220!
[180926.578873] invalid opcode:  [#1] SMP
[180926.583594] Modules linked in: xfs exportfs ipmi_devintf ipmi_si
ipmi_msghandler dm_mod md_mod nf_conntr
ack_ipv6 nf_defrag_ipv6 ip6table_filter ip6table_raw ip6_tables
nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp x
t_conntrack xt_multiport iptable_filter xt_NOTRACK nf_conntrack
iptable_raw ip_tables x_tables nfsv4 auth_rp
cgss fuse nfsv3 nfs_acl nfs fscache lockd sunrpc sfc mtd i2c_algo_bit
i2c_core mdio igb dca uhci_hcd coretem
p acpi_cpufreq kvm_intel kvm crc32c_intel aesni_intel ablk_helper
cryptd aes_x86_64 aes_generic evdev sd_mod
 crc_t10dif mperf snd_pcm ahci snd_timer tpm_tis microcode snd tpm
libahci tpm_bios soundcore libata snd_pag
e_alloc pcspkr ehci_hcd lpc_ich usbcore mfd_core hpsa scsi_mod
usb_common button processor thermal_sys
[180926.657762] CPU 12
[180926.660008] Pid: 5948, comm: nginx-fl Not tainted 3.6.0-cloudflare
#1 HP ProLiant DL180 G6
[180926.669820] RIP: 0010:[]  []
destroy_conntrack+0x55/0xa9 [nf_conntrack]
[180926.680871] RSP: 0018:8805bd73fbb8  EFLAGS: 00010246
[180926.686930] RAX:  RBX: 8806b6f56c30 RCX:
8805bd73fc48
[180926.695055] RDX:  RSI: 0006 RDI:
8806b6f56c30
[180926.703179] RBP: 81651780 R08: 000172e0 R09:
812cef91
[180926.711304] R10: dead00200200 R11: dead00100100 R12:
8806b6f56c30
[180926.727451] R13:  R14: a02d6030 R15:

[180926.735575] FS:  7f382cdb2710() GS:880627cc()
knlGS:
[180926.744766] CS:  0010 DS:  ES:  CR0: 80050033
[180926.751312] CR2: ff600400 CR3: 0005bd8d3000 CR4:
07e0
[180926.759436] DR0:  DR1:  DR2:

[180926.767560] DR3:  DR6: 0ff0 DR7:
0400
[180926.775686] Process nginx-fl (pid: 5948, threadinfo
8805bd73e000, task 8805c9755960)
[180926.785265] Stack:
[180926.787634]   81651780 8802720c2900
a02cde78
[180926.796087]  81651ec0 a02d6030 bff0efab
8805
[180926.804532]  000288050002 00030014 00140003
06ff88060002
[180926.812985] Call Trace:
[180926.815845]  [] ? nf_conntrack_in+0x4ed/0x5bc
[nf_conntrack]
[180926.824069]  [] ? nf_iterate+0x41/0x77
[180926.830131]  [] ? ip_options_echo+0x2ed/0x2ed
[180926.836873]  [] ? nf_hook_slow+0x68/0xfd
[180926.843127]  [] ? ip_options_echo+0x2ed/0x2ed
[180926.849866]  [] ? __ip_local_out+0x98/0x9d
[180926.856315]  [] ? ip_local_out+0x9/0x19
[180926.862465]  [] ? tcp_transmit_skb+0x7ae/0x7f1
[180926.869305]  [] ? virt_to_head_page+0x9/0x2c
[180926.875949]  [] ? tcp_send_active_reset+0xd5/0x101
[180926.883175]  [] ? tcp_close+0x118/0x354
[180926.889334]  [] ? inet_release+0x75/0x7b
[180926.895591]  [] ? sock_release+0x19/0x73
[180926.901845]  [] ? sock_close+0x22/0x27
[180926.907906]  [] ? __fput+0xe9/0x1ae
[180926.913677]  [] ? task_work_run+0x53/0x67
[180926.920031]  [] ? do_notify_resume+0x79/0x8d
[180926.926673]  [] ? int_signal+0x12/0x17
[180926.932732] Code: 05 48 89 df ff d0 48 c7 c7 30 66 2d a0 e8 11 b0
07 e1 48 89 df e8 72 25 00 00 48 8b 43
 78 a8 08 75 2a 48 8b 53 10 48 85 d2 75 04 <0f> 0b eb fe 48 8b 43 08
48 89 02 a8 01 75 04 48 89 50 08 48 be
[180926.954788] RIP  [] destroy_conntrack+0x55/0xa9
[nf_conntrack]
[180926.963217]  RSP 
[180926.967700] ---[ end trace 54a660a52afd5820 ]---
[180926.973038] Kernel panic - not syncing: Fatal exception in interrupt

---
Ian Applegate
CloudFlare
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


amd64, v3.6.0: Kernel panic + BUG at net/netfilter/nf_conntrack_core.c:220!

2012-10-11 Thread Ian Applegate
On machines serving mainly http traffic we are seeing the following
panic, which is not yet reproducible.


[180926.566743] [ cut here ]
[180926.572034] kernel BUG at net/netfilter/nf_conntrack_core.c:220!
[180926.578873] invalid opcode:  [#1] SMP
[180926.583594] Modules linked in: xfs exportfs ipmi_devintf ipmi_si
ipmi_msghandler dm_mod md_mod nf_conntr
ack_ipv6 nf_defrag_ipv6 ip6table_filter ip6table_raw ip6_tables
nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp x
t_conntrack xt_multiport iptable_filter xt_NOTRACK nf_conntrack
iptable_raw ip_tables x_tables nfsv4 auth_rp
cgss fuse nfsv3 nfs_acl nfs fscache lockd sunrpc sfc mtd i2c_algo_bit
i2c_core mdio igb dca uhci_hcd coretem
p acpi_cpufreq kvm_intel kvm crc32c_intel aesni_intel ablk_helper
cryptd aes_x86_64 aes_generic evdev sd_mod
 crc_t10dif mperf snd_pcm ahci snd_timer tpm_tis microcode snd tpm
libahci tpm_bios soundcore libata snd_pag
e_alloc pcspkr ehci_hcd lpc_ich usbcore mfd_core hpsa scsi_mod
usb_common button processor thermal_sys
[180926.657762] CPU 12
[180926.660008] Pid: 5948, comm: nginx-fl Not tainted 3.6.0-cloudflare
#1 HP ProLiant DL180 G6
[180926.669820] RIP: 0010:[a02ccdd1]  [a02ccdd1]
destroy_conntrack+0x55/0xa9 [nf_conntrack]
[180926.680871] RSP: 0018:8805bd73fbb8  EFLAGS: 00010246
[180926.686930] RAX:  RBX: 8806b6f56c30 RCX:
8805bd73fc48
[180926.695055] RDX:  RSI: 0006 RDI:
8806b6f56c30
[180926.703179] RBP: 81651780 R08: 000172e0 R09:
812cef91
[180926.711304] R10: dead00200200 R11: dead00100100 R12:
8806b6f56c30
[180926.727451] R13:  R14: a02d6030 R15:

[180926.735575] FS:  7f382cdb2710() GS:880627cc()
knlGS:
[180926.744766] CS:  0010 DS:  ES:  CR0: 80050033
[180926.751312] CR2: ff600400 CR3: 0005bd8d3000 CR4:
07e0
[180926.759436] DR0:  DR1:  DR2:

[180926.767560] DR3:  DR6: 0ff0 DR7:
0400
[180926.775686] Process nginx-fl (pid: 5948, threadinfo
8805bd73e000, task 8805c9755960)
[180926.785265] Stack:
[180926.787634]   81651780 8802720c2900
a02cde78
[180926.796087]  81651ec0 a02d6030 bff0efab
8805
[180926.804532]  000288050002 00030014 00140003
06ff88060002
[180926.812985] Call Trace:
[180926.815845]  [a02cde78] ? nf_conntrack_in+0x4ed/0x5bc
[nf_conntrack]
[180926.824069]  [812ac195] ? nf_iterate+0x41/0x77
[180926.830131]  [812b3820] ? ip_options_echo+0x2ed/0x2ed
[180926.836873]  [812ac2ee] ? nf_hook_slow+0x68/0xfd
[180926.843127]  [812b3820] ? ip_options_echo+0x2ed/0x2ed
[180926.849866]  [812b55a8] ? __ip_local_out+0x98/0x9d
[180926.856315]  [812b55b6] ? ip_local_out+0x9/0x19
[180926.862465]  [812c7582] ? tcp_transmit_skb+0x7ae/0x7f1
[180926.869305]  [810fff4c] ? virt_to_head_page+0x9/0x2c
[180926.875949]  [812c78a6] ? tcp_send_active_reset+0xd5/0x101
[180926.883175]  [812bc38f] ? tcp_close+0x118/0x354
[180926.889334]  [812dbdc2] ? inet_release+0x75/0x7b
[180926.895591]  [81278bff] ? sock_release+0x19/0x73
[180926.901845]  [81278c7b] ? sock_close+0x22/0x27
[180926.907906]  [81109de8] ? __fput+0xe9/0x1ae
[180926.913677]  [81057157] ? task_work_run+0x53/0x67
[180926.920031]  [8100e50c] ? do_notify_resume+0x79/0x8d
[180926.926673]  [8134e5b2] ? int_signal+0x12/0x17
[180926.932732] Code: 05 48 89 df ff d0 48 c7 c7 30 66 2d a0 e8 11 b0
07 e1 48 89 df e8 72 25 00 00 48 8b 43
 78 a8 08 75 2a 48 8b 53 10 48 85 d2 75 04 0f 0b eb fe 48 8b 43 08
48 89 02 a8 01 75 04 48 89 50 08 48 be
[180926.954788] RIP  [a02ccdd1] destroy_conntrack+0x55/0xa9
[nf_conntrack]
[180926.963217]  RSP 8805bd73fbb8
[180926.967700] ---[ end trace 54a660a52afd5820 ]---
[180926.973038] Kernel panic - not syncing: Fatal exception in interrupt

---
Ian Applegate
CloudFlare
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: amd64, v3.6.0: Kernel panic + BUG at net/netfilter/nf_conntrack_core.c:220!

2012-10-11 Thread Borislav Petkov
On Thu, Oct 11, 2012 at 12:13:33PM -0700, Ian Applegate wrote:
 On machines serving mainly http traffic we are seeing the following
 panic, which is not yet reproducible.

Must be this BUG_ON:

if (!nf_ct_is_confirmed(ct)) {

BUG_ON(hlist_nulls_unhashed(ct-tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
hlist_nulls_del_rcu(ct-tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
}

Spamming some more lists and leaving the rest for reference.

 
 
 [180926.566743] [ cut here ]
 [180926.572034] kernel BUG at net/netfilter/nf_conntrack_core.c:220!
 [180926.578873] invalid opcode:  [#1] SMP
 [180926.583594] Modules linked in: xfs exportfs ipmi_devintf ipmi_si
 ipmi_msghandler dm_mod md_mod nf_conntr
 ack_ipv6 nf_defrag_ipv6 ip6table_filter ip6table_raw ip6_tables
 nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp x
 t_conntrack xt_multiport iptable_filter xt_NOTRACK nf_conntrack
 iptable_raw ip_tables x_tables nfsv4 auth_rp
 cgss fuse nfsv3 nfs_acl nfs fscache lockd sunrpc sfc mtd i2c_algo_bit
 i2c_core mdio igb dca uhci_hcd coretem
 p acpi_cpufreq kvm_intel kvm crc32c_intel aesni_intel ablk_helper
 cryptd aes_x86_64 aes_generic evdev sd_mod
  crc_t10dif mperf snd_pcm ahci snd_timer tpm_tis microcode snd tpm
 libahci tpm_bios soundcore libata snd_pag
 e_alloc pcspkr ehci_hcd lpc_ich usbcore mfd_core hpsa scsi_mod
 usb_common button processor thermal_sys
 [180926.657762] CPU 12
 [180926.660008] Pid: 5948, comm: nginx-fl Not tainted 3.6.0-cloudflare
 #1 HP ProLiant DL180 G6
 [180926.669820] RIP: 0010:[a02ccdd1]  [a02ccdd1]
 destroy_conntrack+0x55/0xa9 [nf_conntrack]
 [180926.680871] RSP: 0018:8805bd73fbb8  EFLAGS: 00010246
 [180926.686930] RAX:  RBX: 8806b6f56c30 RCX:
 8805bd73fc48
 [180926.695055] RDX:  RSI: 0006 RDI:
 8806b6f56c30
 [180926.703179] RBP: 81651780 R08: 000172e0 R09:
 812cef91
 [180926.711304] R10: dead00200200 R11: dead00100100 R12:
 8806b6f56c30
 [180926.727451] R13:  R14: a02d6030 R15:
 
 [180926.735575] FS:  7f382cdb2710() GS:880627cc()
 knlGS:
 [180926.744766] CS:  0010 DS:  ES:  CR0: 80050033
 [180926.751312] CR2: ff600400 CR3: 0005bd8d3000 CR4:
 07e0
 [180926.759436] DR0:  DR1:  DR2:
 
 [180926.767560] DR3:  DR6: 0ff0 DR7:
 0400
 [180926.775686] Process nginx-fl (pid: 5948, threadinfo
 8805bd73e000, task 8805c9755960)
 [180926.785265] Stack:
 [180926.787634]   81651780 8802720c2900
 a02cde78
 [180926.796087]  81651ec0 a02d6030 bff0efab
 8805
 [180926.804532]  000288050002 00030014 00140003
 06ff88060002
 [180926.812985] Call Trace:
 [180926.815845]  [a02cde78] ? nf_conntrack_in+0x4ed/0x5bc
 [nf_conntrack]
 [180926.824069]  [812ac195] ? nf_iterate+0x41/0x77
 [180926.830131]  [812b3820] ? ip_options_echo+0x2ed/0x2ed
 [180926.836873]  [812ac2ee] ? nf_hook_slow+0x68/0xfd
 [180926.843127]  [812b3820] ? ip_options_echo+0x2ed/0x2ed
 [180926.849866]  [812b55a8] ? __ip_local_out+0x98/0x9d
 [180926.856315]  [812b55b6] ? ip_local_out+0x9/0x19
 [180926.862465]  [812c7582] ? tcp_transmit_skb+0x7ae/0x7f1
 [180926.869305]  [810fff4c] ? virt_to_head_page+0x9/0x2c
 [180926.875949]  [812c78a6] ? tcp_send_active_reset+0xd5/0x101
 [180926.883175]  [812bc38f] ? tcp_close+0x118/0x354
 [180926.889334]  [812dbdc2] ? inet_release+0x75/0x7b
 [180926.895591]  [81278bff] ? sock_release+0x19/0x73
 [180926.901845]  [81278c7b] ? sock_close+0x22/0x27
 [180926.907906]  [81109de8] ? __fput+0xe9/0x1ae
 [180926.913677]  [81057157] ? task_work_run+0x53/0x67
 [180926.920031]  [8100e50c] ? do_notify_resume+0x79/0x8d
 [180926.926673]  [8134e5b2] ? int_signal+0x12/0x17
 [180926.932732] Code: 05 48 89 df ff d0 48 c7 c7 30 66 2d a0 e8 11 b0
 07 e1 48 89 df e8 72 25 00 00 48 8b 43
  78 a8 08 75 2a 48 8b 53 10 48 85 d2 75 04 0f 0b eb fe 48 8b 43 08
 48 89 02 a8 01 75 04 48 89 50 08 48 be
 [180926.954788] RIP  [a02ccdd1] destroy_conntrack+0x55/0xa9
 [nf_conntrack]
 [180926.963217]  RSP 8805bd73fbb8
 [180926.967700] ---[ end trace 54a660a52afd5820 ]---
 [180926.973038] Kernel panic - not syncing: Fatal exception in interrupt
 
 ---
 Ian Applegate
 CloudFlare
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/

-- 
Regards/Gruss,
Boris.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a 

Re: amd64, v3.6.0: Kernel panic + BUG at net/netfilter/nf_conntrack_core.c:220!

2012-10-11 Thread Pablo Neira Ayuso
On Thu, Oct 11, 2012 at 11:27:33PM +0200, Borislav Petkov wrote:
 On Thu, Oct 11, 2012 at 12:13:33PM -0700, Ian Applegate wrote:
  On machines serving mainly http traffic we are seeing the following
  panic, which is not yet reproducible.
 
 Must be this BUG_ON:
 
   if (!nf_ct_is_confirmed(ct)) {
   
 BUG_ON(hlist_nulls_unhashed(ct-tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
   hlist_nulls_del_rcu(ct-tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
   }

At quick glance, I think we're hitting a memory corruption, I don't
see by now any sane code path to reach that bugtrap.

More comments below:

 Spamming some more lists and leaving the rest for reference.
 
  
  
  [180926.566743] [ cut here ]
  [180926.572034] kernel BUG at net/netfilter/nf_conntrack_core.c:220!
  [180926.578873] invalid opcode:  [#1] SMP
  [180926.583594] Modules linked in: xfs exportfs ipmi_devintf ipmi_si
  ipmi_msghandler dm_mod md_mod nf_conntr
  ack_ipv6 nf_defrag_ipv6 ip6table_filter ip6table_raw ip6_tables
  nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp x
  t_conntrack xt_multiport iptable_filter xt_NOTRACK nf_conntrack
  iptable_raw ip_tables x_tables nfsv4 auth_rp
  cgss fuse nfsv3 nfs_acl nfs fscache lockd sunrpc sfc mtd i2c_algo_bit
  i2c_core mdio igb dca uhci_hcd coretem
  p acpi_cpufreq kvm_intel kvm crc32c_intel aesni_intel ablk_helper
  cryptd aes_x86_64 aes_generic evdev sd_mod
   crc_t10dif mperf snd_pcm ahci snd_timer tpm_tis microcode snd tpm
  libahci tpm_bios soundcore libata snd_pag
  e_alloc pcspkr ehci_hcd lpc_ich usbcore mfd_core hpsa scsi_mod
  usb_common button processor thermal_sys
  [180926.657762] CPU 12
  [180926.660008] Pid: 5948, comm: nginx-fl Not tainted 3.6.0-cloudflare
  #1 HP ProLiant DL180 G6
  [180926.669820] RIP: 0010:[a02ccdd1]  [a02ccdd1]
  destroy_conntrack+0x55/0xa9 [nf_conntrack]
  [180926.680871] RSP: 0018:8805bd73fbb8  EFLAGS: 00010246
  [180926.686930] RAX:  RBX: 8806b6f56c30 RCX:
  8805bd73fc48
  [180926.695055] RDX:  RSI: 0006 RDI:
  8806b6f56c30
  [180926.703179] RBP: 81651780 R08: 000172e0 R09:
  812cef91
  [180926.711304] R10: dead00200200 R11: dead00100100 R12:
  8806b6f56c30
  [180926.727451] R13:  R14: a02d6030 R15:
  
  [180926.735575] FS:  7f382cdb2710() GS:880627cc()
  knlGS:
  [180926.744766] CS:  0010 DS:  ES:  CR0: 80050033
  [180926.751312] CR2: ff600400 CR3: 0005bd8d3000 CR4:
  07e0
  [180926.759436] DR0:  DR1:  DR2:
  
  [180926.767560] DR3:  DR6: 0ff0 DR7:
  0400
  [180926.775686] Process nginx-fl (pid: 5948, threadinfo
  8805bd73e000, task 8805c9755960)
  [180926.785265] Stack:
  [180926.787634]   81651780 8802720c2900
  a02cde78
  [180926.796087]  81651ec0 a02d6030 bff0efab
  8805
  [180926.804532]  000288050002 00030014 00140003
  06ff88060002
  [180926.812985] Call Trace:
  [180926.815845]  [a02cde78] ? nf_conntrack_in+0x4ed/0x5bc
  [nf_conntrack]

Here below the trace shows the output path to close a tcp socket. But
the line above refers to a conntrack function that is called in the
input path.

If this process is just acting as plain http server, this backtrace
doesn't seem consistent to me.

  [180926.824069]  [812ac195] ? nf_iterate+0x41/0x77
  [180926.830131]  [812b3820] ? ip_options_echo+0x2ed/0x2ed
  [180926.836873]  [812ac2ee] ? nf_hook_slow+0x68/0xfd
  [180926.843127]  [812b3820] ? ip_options_echo+0x2ed/0x2ed
  [180926.849866]  [812b55a8] ? __ip_local_out+0x98/0x9d
  [180926.856315]  [812b55b6] ? ip_local_out+0x9/0x19
  [180926.862465]  [812c7582] ? tcp_transmit_skb+0x7ae/0x7f1
  [180926.869305]  [810fff4c] ? virt_to_head_page+0x9/0x2c
  [180926.875949]  [812c78a6] ? tcp_send_active_reset+0xd5/0x101
  [180926.883175]  [812bc38f] ? tcp_close+0x118/0x354
  [180926.889334]  [812dbdc2] ? inet_release+0x75/0x7b
  [180926.895591]  [81278bff] ? sock_release+0x19/0x73
  [180926.901845]  [81278c7b] ? sock_close+0x22/0x27
  [180926.907906]  [81109de8] ? __fput+0xe9/0x1ae
  [180926.913677]  [81057157] ? task_work_run+0x53/0x67
  [180926.920031]  [8100e50c] ? do_notify_resume+0x79/0x8d
  [180926.926673]  [8134e5b2] ? int_signal+0x12/0x17
  [180926.932732] Code: 05 48 89 df ff d0 48 c7 c7 30 66 2d a0 e8 11 b0
  07 e1 48 89 df e8 72 25 00 00 48 8b 43
   78 a8 08 75 2a 48 8b 53 10 48 85 d2 75 04 0f 0b eb fe 48 8b 43 08
  48 89 02 a8 01 75 04 48 89 50 08 48 be
  [180926.954788] RIP  [a02ccdd1] destroy_conntrack+0x55/0xa9
  [nf_conntrack]
  [180926.963217]  RSP