Re: [pfSense] DNS resolution issues under heavy load
Well, it looks like it's the cable modem after all. Under load I'm unable to connect to its admin panel, even when I'm directly connected to it. I called Comcast's technical support and had them run their diagnostics on it while everything was running, and it failed miserably. The tech agreed with the conclusion that the modem was incapable of handling the load.

So it looks like I'm in the market for a new cable modem. I'm not sure how to find one that will meet my needs though. Any DOCSIS 3 compatible modem will work on Comcast's network. Does anyone know of any models that are designed for heavy load? I'd probably need something that was built for networks of ~10,000 users. I'm not sure what sort of load 10,000 users generates, but I suspect it would peak around the 10-100 requests per second that my crawlers are putting out.

If not, can anyone recommend a place where I might be able to find an answer to this question? Mailing list? Web forum? IRC channel, even? I'd really rather not have to pull specs on every DOCSIS 3 compatible modem and make a best guess based on microcontrollers/CPUs.

Many thanks,
-David

On 3/18/14, David Noel david.i.n...@gmail.com wrote:
> Well, I bumped Maximum State Table from the default of 23,000 to 75,000, and now it's throwing fewer UnknownHostExceptions. But they're still being thrown. My resource utilization is getting pretty high though. I don't think these ALIX boards can handle much more of a load, and I still have 2 more servers I need to scale these crawlers out to.
>
> I do see there's a Firewall Adaptive Timeouts setting in the web configurator... this seems like it might be useful. Can anyone recommend any settings I should try to free up some system resources? I'm not clear on the consequences of purging pf state entries and whether that's something I'd want to do though.
>
> The state table on my primary router (alix1) is at roughly 50% utilization, or 40,000 states. The state table on my secondary router (alix2) is at 0%, roughly 250 states. This seems odd. Is this to be expected under CARP? Why is the load not distributed evenly?
>
> Memory usage on my primary router (alix1) is hovering around 55% (of 235MB). On my backup (alix2) it's pushing 85-90%. Does this make sense to anyone? Top output looks roughly the same...
>
> and now alix2 has gone down. 95% packet loss. Web Configurator unresponsive.
>
> ...
>
> It's back up but throwing 500 - Internal Server Errors periodically. I've ssh'd in to alix2 and am looking at top output... tcpdump seems to be running for pflog purposes... and it's hogging quite a bit of CPU. Is this necessary? Can I disable it somehow?
>
> -David
>
> On 3/18/14, David Noel david.i.n...@gmail.com wrote:
>> I've encountered a strange issue while scaling a Java project that I'm not quite sure how to resolve. Any thoughts would be appreciated.
>>
>> The code is a crawler that uses HTMLUnit to crawl a bunch of pages concurrently. It uses HTMLUnit's getPage method to do the crawling. I'm running 100 threads per instance. When I have 1 instance up and running on 1 machine everything is fine. When I scale it to a second machine though I start having trouble. Calls to getPage keep throwing UnknownHostExceptions (DNS resolution error). With 2 servers running, roughly 1 out of every 20 calls to getPage throws this exception.
>>
>> For some reason it's unable to resolve domain names... and it's not just the crawlers, my entire network starts having trouble with DNS queries. On different systems on the same network I get 'unable to resolve host' errors in my web browser periodically when loading URLs. Usually when I retry it goes through, but it keeps happening sporadically as long as the crawlers are running.
>>
>> So many things could be going wrong here. Thinking maybe it was my provider throttling DNS queries, I've tried changing DNS servers, but that's done nothing. Thinking it might be a bandwidth issue, I checked systat, but the cumulative load is well under what my line can handle. What else could be causing this?
>>
>> My network is pretty simple: Provider -- modem -- 2 ALIX boards running pfSense -- Servers and workstations. The servers are running FreeBSD, and the workstations run FreeBSD, Windows, and OSX.
>>
>> Has anyone encountered this before? Does anyone have any thoughts on what might be causing it? My only other thought is that maybe pfSense is doing something strange, so if I can't come up with any better ideas I'll try plugging the servers directly into the modem. I'd rather have them behind the routers though, so this would be a less-than-ideal solution.
>>
>> UPDATE: Ok, so it seems to be a pfSense issue. I launched the crawlers on 2 servers as before and waited for UnknownHostExceptions to be thrown. I then took a spare laptop and connected it directly into my modem, bypassing my 2 pfSense routers. All DNS queries have gone through without a hitch, so something strange is going on with pfSense. Can anyone think of what might be causing
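As a rough aid for the state-table numbers quoted in this message, here is a back-of-envelope sketch in Python based on Little's law (entries in the table = arrival rate x mean time an entry stays in the table). The function names are made up for illustration, and the one-state-per-request assumption is a simplification that does not come from the thread: a single page fetch may well create several states (a DNS lookup plus one or more TCP connections), so treat the result as an order-of-magnitude estimate only.

```python
def steady_state_entries(new_states_per_sec, mean_lifetime_sec):
    """Little's law: entries in the table = arrival rate x mean time in table."""
    return new_states_per_sec * mean_lifetime_sec


def implied_lifetime_sec(observed_states, new_states_per_sec):
    """Mean state lifetime implied by an observed table size and arrival rate."""
    return observed_states / new_states_per_sec


# Figures from the thread: ~40,000 states on alix1, crawlers peaking near
# 100 requests per second.
# Assumption (not from the thread): each request creates exactly one state.
print(implied_lifetime_sec(40_000, 100))  # 400.0
print(steady_state_entries(100, 400))     # 40000
```

Under these assumptions, 40,000 states at 100 new states per second would mean an average entry lingers for several minutes, which is the kind of thing shorter timeouts would change. The real ratio depends on how many states each crawl request actually opens.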
Re: [pfSense] DNS resolution issues under heavy load
Please share your load across two pfSense servers; a heavy user load is too much for one. I have 1,200 users, which is why I installed two pfSense boxes in my network. Since then I have never faced this type of problem.

On Mar 25, 2014 7:15 PM, David Noel david.i.n...@gmail.com wrote:
> Well, it looks like it's the cable modem after all. [...]
Re: [pfSense] DNS resolution issues under heavy load
I’m perfectly content renting a DOCSIS3 from Comcast and have been doing so for two years. Cost be damned - it’s worth it to not have to own it.

What model do you have? SMC? Nortel? Motorola?

On Mar 25, 2014, at 8:45 AM, David Noel david.i.n...@gmail.com wrote:
> Well, it looks like it's the cable modem after all. [...]
Re: [pfSense] DNS resolution issues under heavy load
On Mar 25, 2014, at 8:45 AM, David Noel david.i.n...@gmail.com wrote:
> Well, it looks like it's the cable modem after all. Under load I'm unable to connect to its admin panel, even when I'm directly connected to it. I called Comcast's technical support and had them run their diagnostics on it while everything was running, and it failed miserably. The tech agreed with the conclusion that the modem was incapable of handling the load.
>
> So it looks like I'm in the market for a new cable modem. I'm not sure how to find one that will meet my needs though. Any DOCSIS 3 compatible modem will work on Comcast's network. Does anyone know of any models that are designed for heavy load? I'd probably need something that was built for networks of ~10,000 users. I'm not sure what sort of load 10,000 users generates, but I suspect it would peak around the 10-100 requests per second that my crawlers are putting out.
>
> If not, can anyone recommend a place where I might be able to find an answer to this question? Mailing list? Web forum? IRC channel, even? I'd really rather not have to pull specs on every DOCSIS 3 compatible modem and make a best guess based on microcontrollers/CPUs.

Short answer: no DOCSIS cable modems are designed for that kind of throughput! Juniper sells MX480 routers to 10,000-customer ISPs for ~$250k! (Granted, that *is* overkill, but even 10k-user corporations will have fairly high-end routers connected via fiber to handle that much traffic.)

Your best bet, I think, would be to find a DOCSIS 3 cable modem that can be put into bridging mode. At that point, the CPU/RAM limitations of the cable modem are no longer relevant.

Some confirmation:
- http://jkoblovsky.wordpress.com/2012/11/21/how-to-use-your-own-router-with-rogers-docsis-3-0-upgrade/
- http://communityforums.rogers.com/t5/forums/forumtopicpage/board-id/Getting_connected/thread-id/12199 (implies Hitron and Moto/ARRIS modems can also do bridge-mode)
- http://digitalhome.ca/forum/showthread.php?t=145997page=6 (implies SMC modem can do bridge mode)
- http://www.dslreports.com/faq/comcast/2.1_Modems#17174 (Comcast-specific)

Once your modem is in bridge mode, the bottleneck should be your router. As you've mentioned, your ALIX boxes are pretty much at their limit, too, so you're just moving the bottleneck around.

Apologies if I've missed something fundamental - I haven't followed this thread from the beginning...

--
-Adam Thompson
athom...@athompso.net

___
List mailing list
List@lists.pfsense.org
https://lists.pfsense.org/mailman/listinfo/list
Re: [pfSense] DNS resolution issues under heavy load
SMCD3G

On 3/25/14, Ryan Coleman ryanjc...@me.com wrote:
> I'm perfectly content renting a DOCSIS3 from Comcast and have been doing so for two years. Cost be damned - it's worth it to not have to own it.
>
> What model do you have? SMC? Nortel? Motorola?
>
> [...]
Re: [pfSense] DNS resolution issues under heavy load
> Short answer: no DOCSIS cable modems are designed for that kind of throughput!

Ugh... I've been suspecting that.

> Juniper sells MX480 routers to 10,000-customer ISPs for ~$250k! (Granted, that *is* overkill, but even 10k-user corporations will have fairly high-end routers connected via fiber to handle that much traffic.)

Yikes. That's way outside of my budget. I suspect co-locating or leasing a T3 are really my only options.

> Your best bet, I think, would be to find a DOCSIS 3 cable modem that can be put into bridging mode. At that point, the CPU/RAM limitations of the cable modem are no longer relevant.
>
> Some confirmation:
> - http://jkoblovsky.wordpress.com/2012/11/21/how-to-use-your-own-router-with-rogers-docsis-3-0-upgrade/
> - http://communityforums.rogers.com/t5/forums/forumtopicpage/board-id/Getting_connected/thread-id/12199 (implies Hitron and Moto/ARRIS modems can also do bridge-mode)
> - http://digitalhome.ca/forum/showthread.php?t=145997page=6 (implies SMC modem can do bridge mode)
> - http://www.dslreports.com/faq/comcast/2.1_Modems#17174 (Comcast-specific)
>
> Once your modem is in bridge mode, the bottleneck should be your router. As you've mentioned, your ALIX boxes are pretty much at their limit, too, so you're just moving the bottleneck around.

I've enabled bridging for the statics and it's still giving me trouble. I think I'm going to wind up having to dig through the specs of the highest-end cable modems I can find and buy the one with the most CPU/RAM.

Thanks for the links -- I didn't know they made 24x8s. If any cable modem can handle the load I'm generating, I bet it'd be one of those.

-David
Re: [pfSense] DNS resolution issues under heavy load
Unsubscribe is here: http://lists.pfsense.org/mailman/listinfo/list

On 3/19/14, Edouard De Keyser edou...@ipfix.be wrote:
> Please stop your mail. Thank you
>
> Sent from my SkyTel
>
> On 19 March 2014 at 20:29, Chris Buechler c...@pfsense.com wrote:
>> It sounds like you don't have state sync enabled on the secondary, it won't accept the primary's states without that. [...]
Re: [pfSense] DNS resolution issues under heavy load
It sounds like you don't have state sync enabled on the secondary; it won't accept the primary's states without that.

Depending on how much load you're generating with the crawlers, you could be hitting the limits of the ALIX in new connections per second. I've seen this with one customer who was blasting out 10K+ emails (and 10K+ SMTP connections) in less than a second, which put enough load on their ALIX pair that CARP failed over because the primary was under too much load to send its advertisements.

Though the modem theory is just as plausible, especially if the modem is doing any kind of NAT or filtering. If you're not hitting it so hard that you're failing over CARP, that points to it being something other than the firewall. A packet capture on WAN filtered on port 53 would be more telling. If you see DNS queries leaving there that get no reply back, it's not the firewall.

On Wed, Mar 19, 2014 at 9:50 AM, David Noel david.i.n...@gmail.com wrote:
> Well, it may not be the ALIX boards after all. I connected the servers directly to the modem, ran the crawlers, and I'm still getting UnknownHostExceptions. I'm guessing my modem's to blame... I'll have to upgrade it and find out.
>
> On 3/18/14, David Noel david.i.n...@gmail.com wrote:
>> Well, I bumped Maximum State Table from the default of 23,000 to 75,000, and now it's throwing fewer UnknownHostExceptions. [...]
Re: [pfSense] DNS resolution issues under heavy load
Thanks. I do recall seeing some notifications in the Web Configurator about sync failing when I had everything up and running. I'm pretty sure there's still an issue with either the modem or the line itself, though. I've plugged the servers directly into the modem, run the crawlers, and DNS queries still fail. If it were purely an ALIX or pfSense issue, bypassing them should have fixed it. It's strange that only DNS queries fail; once the addresses resolve, the throughput is fine. At any rate, I contacted my provider and they agreed to send out a newer, heavier-duty modem to try. Hopefully that fixes it.

-David

On 3/19/14, Chris Buechler c...@pfsense.com wrote: It sounds like you don't have state sync enabled on the secondary; it won't accept the primary's states without that. Depending on how much load you're generating with the crawlers, you could be hitting the ALIX's limit on new connections per second. I've seen this with one customer who was blasting out 10K+ emails (and 10K+ SMTP connections) in less than a second, which put enough load on their ALIX pair that CARP failed over because the primary was too busy to send its advertisements. That said, the modem theory is just as plausible, especially if the modem is doing any kind of NAT or filtering. If you're not hitting the firewall so hard that CARP fails over, that points to something other than the firewall. A packet capture on WAN filtered on port 53 would be more telling: if you see DNS queries leaving there that get no reply back, it's not the firewall.

On Wed, Mar 19, 2014 at 9:50 AM, David Noel david.i.n...@gmail.com wrote: Well, it may not be the ALIX boards after all. I connected the servers directly to the modem, ran the crawlers, and I'm still getting UnknownHostException's. I'm guessing my modem's to blame... I'll have to upgrade it and find out.
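For anyone following the packet-capture suggestion quoted above, here is a minimal sketch of the check it describes: pair each outgoing DNS query with its reply and list the queries that never got one. It assumes the capture (for example from something like `tcpdump -n -i <wan-interface> port 53`) has already been parsed into simple records. The `DnsPacket` record, the `unanswered_queries` helper, the 5-second timeout, and the sample data are all illustrative assumptions, not anything from pfSense or from this thread.

```python
from collections import namedtuple

# Hypothetical record for one DNS packet seen on the WAN interface.
# timestamp: seconds; txid: DNS transaction ID; name: queried hostname;
# is_response: False for an outgoing query, True for a reply.
DnsPacket = namedtuple("DnsPacket", "timestamp txid name is_response")


def unanswered_queries(packets, timeout=5.0):
    """Return queries with no matching reply within `timeout` seconds.

    Queries and replies are paired on (transaction ID, hostname). The
    timeout value is an assumption; pick one that matches your resolver.
    """
    pending = {}      # (txid, name) -> queries still waiting for a reply
    unanswered = []
    for pkt in sorted(packets, key=lambda p: p.timestamp):
        key = (pkt.txid, pkt.name.lower())
        if not pkt.is_response:
            pending.setdefault(key, []).append(pkt)
            continue
        # A reply settles every waiting query with the same key
        # (retransmissions included), unless it arrived too late.
        for query in pending.pop(key, []):
            if pkt.timestamp - query.timestamp > timeout:
                unanswered.append(query)
    # Anything still pending at the end of the capture never got a reply.
    for waiting in pending.values():
        unanswered.extend(waiting)
    return sorted(unanswered, key=lambda p: p.timestamp)


# Made-up sample capture: one answered query, one with no reply at all,
# and one whose reply arrived after the timeout.
capture = [
    DnsPacket(0.00, 0x1A2B, "example.com", False),
    DnsPacket(0.03, 0x1A2B, "example.com", True),
    DnsPacket(0.10, 0x3C4D, "example.org", False),
    DnsPacket(0.20, 0x5E6F, "example.net", False),
    DnsPacket(9.00, 0x5E6F, "example.net", True),
]

lost = unanswered_queries(capture)
total = sum(1 for p in capture if not p.is_response)
print(f"{len(lost)} of {total} queries went unanswered")
for q in lost:
    print(f"  {q.name} (txid {q.txid:#06x}) sent at t={q.timestamp:.2f}s")
```

If queries captured leaving WAN show up in this list, the replies are being lost somewhere upstream of the firewall (modem, line, or resolver), which is the distinction the capture is meant to draw.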
On 3/18/14, David Noel david.i.n...@gmail.com wrote: Well, I bumped Maximum State Table from the default of 23,000 to 75,000, and now it's throwing fewer UnknownHostException's. But they're still being thrown. My resource utilization is getting pretty high though. I don't think these ALIX boards can handle much more of a load, and I still have 2 more servers I need to scale these crawlers out to. I do see there's a Firewall Adaptive Timeouts setting in the web configurator.. this seems like it might be useful. Can anyone recommend any settings I should try to free up some system resources? I'm not clear on the consequences of purging pf state entries and whether that's something I'd want to do though. The state table on my primary router (alix1) is at roughly 50% utilization, or 40,000 states. The state table on my secondary router (alix2) is at 0%, roughly 250 states. This seems odd. Is this to be expected under CARP? Why is the load not distributed evenly? Memory usage on my primary router (alix1) is hovering around 55% (of 235MB). On my backup (alix2) it's pushing 85-90%. Does this make sense to anyone? Top output looks roughly the same... and now alix2 has gone down. 95% packet loss. Web Configurator unresponsive. ... It's back up but throwing 500 - Internal Server Errors periodically. I've ssh'd in to alix2 and am looking at top output.. tcpdump seems to be running for pflog purposes.. and it's hogging quite a bit of CPU. Is this necessary? Can I disable it somehow? -David On 3/18/14, David Noel david.i.n...@gmail.com wrote: I've encountered a strange issue while scaling a Java project that I'm not quite sure how to resolve. Any thoughts would be appreciated. The code is a crawler that uses HTMLUnit to crawl a bunch of pages concurrently. It uses HTMLUnits getPage method to do the crawling. I'm running 100 threads per instance. When I have 1 instance up and running on 1 machine everything is fine. 
When I scale it to a second machine though I start having trouble. Calls to getPage keep throwing UnknownHostException's (DNS resolution error). With 2 servers running, roughly 1 out of every 20 calls to getPage throw this exception. For some reason it's unable to resolve domain names.. and it's not just the crawlers, my entire network starts to bug on DNS queries. On different systems on the same network I get 'unable to resolve host' errors in my web browser periodically when loading URL's. Usually when I retry it goes through, but it keeps happening sporadically as long as the crawlers are running. So many things could be going wrong here. Thinking maybe it was my provider throttling DNS queries I've tried changing DNS servers, but that's done nothing. Thinking it might be a bandwidth issue I checked systat, but the cumulative load is well under what my line can handle. What else could be causing this? My network is pretty simple: Provider -- modem -- 2 ALIX boards running pfSense -- Servers and workstations. The servers are running FreeBSD, and the workstations run FreeBSD, Windows, and OSX. Has