Re: Dynamic Googlebot identification via lua?
Reinhard,

On 12.09.20 at 16:43, Reinhard Vicinus wrote:
>>> thanks, for your reply and the information. Sorry for my late reply, but
>>> I had only today time to test. I did try to get the spoa server working
>>> on a ubuntu bionic (18.04.4) with haproxy 2.2.3-2ppa1~bionic from the
>>> vbernat ppa. I could compile the spoa server with python 3.6 support
>>> from the latest github sources without obvious problems and it also
>>> started without problems with the example python script (./spoa -d -f
>>> ps_python.py).
>>>
>>> If I start haproxy with the following command:
>>>
>>> haproxy -f spoa-server.conf -d
>>>
>>> haproxy seg faults on the first request to port 10001
>> This is a bug in HAProxy then. Do you happen to have a core dump / stack
>> trace?
> Yes I have a core dump. But I am somewhat rusty in analyzing them. So
> any pointers what to do with the core dump is appreciated. Also the
> segmentation fault only occurs if the spoa server is running so the
> problem is probably somewhere in the code regarding the connection to
> the spoa server.

Install the haproxy-dbgsym package to get the debug symbols, which make the stack trace readable. Then:

gdb haproxy

In gdb use 'bt' to get the backtrace of the thread that killed the process. Possibly 't a a bt' to see what the other threads were doing.

Ideally file an issue within the tracker: https://github.com/haproxy/haproxy/issues. It's easier to track it there and there's a template to fill in.

>>> If I start haproxy with the additional parameter -Ws then it does not
>>> seg fault, but only the first and every 4th request get (correctly?)
>>> forwarded to the spoa server, the 3 requests in between get answered
>>> with an empty %[var(sess.iprep.ip_score)].
>>>
>>> [...]
>>>
>>> I am unsure if I am making some stupid mistakes, or if I should test it
>>> with an older haproxy version or how to debug the issue further. So any
>>> pointers are very much appreciated.
>> Can you share the configuration you attempted to use?
> Sorry, I forgot to mention that the configuration used is the example
> configuration from haproxy repository. But here it is, to ensure that
> there were no changes in the meantime:

Unfortunately I can't comment on the SPOA functionality, because I have never used it. Hopefully the information is sufficient for someone more proficient to tell what's going wrong there.

Best regards
Tim Düsterhus
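The gdb steps described in this message could also be run non-interactively. The following is only a sketch: it assumes gdb is installed, that `--batch` and `-ex` are used to run the two commands named above, and that the paths to the haproxy binary and the core file (both placeholders chosen by the caller) are known.

```python
import subprocess


def gdb_backtrace_command(binary, core_file):
    """Build a non-interactive gdb invocation that prints backtraces."""
    return [
        "gdb", "--batch",
        "-ex", "bt",                    # backtrace of the crashing thread
        "-ex", "thread apply all bt",   # long form of 't a a bt'
        binary, core_file,              # caller-supplied paths (placeholders)
    ]


def collect_backtrace(binary, core_file):
    """Run gdb and return its combined stdout/stderr as text."""
    result = subprocess.run(
        gdb_backtrace_command(binary, core_file),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=False,
    )
    return result.stdout
```

The text returned by `collect_backtrace()` could then be pasted into the issue template on the tracker.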
Re: Dynamic Googlebot identification via lua?
Tim,

On 9/12/20 4:25 PM, Tim Düsterhus wrote:
> Reinhard,
>
> On 12.09.20 at 12:45, Reinhard Vicinus wrote:
>> thanks, for your reply and the information. Sorry for my late reply, but
>> I had only today time to test. I did try to get the spoa server working
>> on a ubuntu bionic (18.04.4) with haproxy 2.2.3-2ppa1~bionic from the
>> vbernat ppa. I could compile the spoa server with python 3.6 support
>> from the latest github sources without obvious problems and it also
>> started without problems with the example python script (./spoa -d -f
>> ps_python.py).
>>
>> If I start haproxy with the following command:
>>
>> haproxy -f spoa-server.conf -d
>>
>> haproxy seg faults on the first request to port 10001
> This is a bug in HAProxy then. Do you happen to have a core dump / stack
> trace?

Yes, I have a core dump, but I am somewhat rusty at analyzing them, so any pointers on what to do with it are appreciated. Also, the segmentation fault only occurs if the spoa server is running, so the problem is probably somewhere in the code that handles the connection to the spoa server.

>
>> If I start haproxy with the additional parameter -Ws then it does not
>> seg fault, but only the first and every 4th request get (correctly?)
>> forwarded to the spoa server, the 3 requests in between get answered
>> with an empty %[var(sess.iprep.ip_score)].
>>
>> [...]
>>
>> I am unsure if I am making some stupid mistakes, or if I should test it
>> with an older haproxy version or how to debug the issue further. So any
>> pointers are very much appreciated.
> Can you share the configuration you attempted to use?

Sorry, I forgot to mention that the configuration used is the example configuration from the haproxy repository.
But here it is, to ensure that there were no changes in the meantime:

spoa-server.conf:

global
    debug

defaults
    mode http
    option httplog
    option dontlognull
    timeout connect 5000
    timeout client 5000
    timeout server 5000

listen test
    mode http
    bind :10001
    filter spoe engine spoa-server config spoa-server.spoe.conf
    http-request set-var(req.a) var(txn.iprep.null),debug
    http-request set-var(req.a) var(txn.iprep.boolean),debug
    http-request set-var(req.a) var(txn.iprep.int32),debug
    http-request set-var(req.a) var(txn.iprep.uint32),debug
    http-request set-var(req.a) var(txn.iprep.int64),debug
    http-request set-var(req.a) var(txn.iprep.uint64),debug
    http-request set-var(req.a) var(txn.iprep.ipv4),debug
    http-request set-var(req.a) var(txn.iprep.ipv6),debug
    http-request set-var(req.a) var(txn.iprep.str),debug
    http-request set-var(req.a) var(txn.iprep.bin),debug
    http-request redirect location /%[var(sess.iprep.ip_score)]

backend spoe-server
    mode tcp
    balance roundrobin
    timeout connect 5s
    timeout server 3m
    server spoe-server 127.0.0.1:12345

spoa-server.spoe.conf:

[spoa-server]

spoe-agent spoa-server
    messages check-client-ip
    option var-prefix iprep
    timeout hello 100ms
    timeout idle 30s
    timeout processing 15ms
    use-backend spoe-server

spoe-message check-client-ip
    args always_true int(1234) src ipv6(::55) req.fhdr(host)
    event on-frontend-http-request

Thanks in advance
Reinhard Vicinus
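To make the "first and every 4th request" pattern easier to reproduce, a small probe could send a series of requests and tally what comes back in the redirect. This is a hedged sketch, not part of the example configuration: it assumes haproxy listens on 127.0.0.1:10001 as in the config above, that the score appears after the "/" in the Location header, and the function names are made up for illustration.

```python
import http.client
from collections import Counter


def fetch_location(host="127.0.0.1", port=10001):
    """Send one GET / on a fresh connection and return the Location header."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/")
        response = conn.getresponse()
        response.read()  # drain the body before closing
        return response.getheader("Location", "")
    finally:
        conn.close()


def tally_scores(count, fetch=fetch_location):
    """Call fetch() count times and count each score found after the '/'."""
    seen = Counter()
    for _ in range(count):
        location = fetch() or ""
        score = location.lstrip("/")
        # An empty score means the variable was not set for that request.
        seen[score if score else "<empty>"] += 1
    return seen
```

Calling `tally_scores(20)` against the running setup would show how many of the 20 redirects carried an empty score; the `fetch` parameter exists only so the tally logic can be exercised without a live proxy.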
Re: Dynamic Googlebot identification via lua?
Reinhard,

On 12.09.20 at 12:45, Reinhard Vicinus wrote:
> thanks, for your reply and the information. Sorry for my late reply, but
> I had only today time to test. I did try to get the spoa server working
> on a ubuntu bionic (18.04.4) with haproxy 2.2.3-2ppa1~bionic from the
> vbernat ppa. I could compile the spoa server with python 3.6 support
> from the latest github sources without obvious problems and it also
> started without problems with the example python script (./spoa -d -f
> ps_python.py).
>
> If I start haproxy with the following command:
>
> haproxy -f spoa-server.conf -d
>
> haproxy seg faults on the first request to port 10001

This is a bug in HAProxy then. Do you happen to have a core dump / stack trace?

> If I start haproxy with the additional parameter -Ws then it does not
> seg fault, but only the first and every 4th request get (correctly?)
> forwarded to the spoa server, the 3 requests in between get answered
> with an empty %[var(sess.iprep.ip_score)].
>
> [...]
>
> I am unsure if I am making some stupid mistakes, or if I should test it
> with an older haproxy version or how to debug the issue further. So any
> pointers are very much appreciated.

Can you share the configuration you attempted to use?

Best regards
Tim Düsterhus
Re: Dynamic Googlebot identification via lua?
Tim, Aleksandar,

On 9/8/20 11:18 PM, Aleksandar Lazic wrote:
> On 08.09.20 22:54, Tim Düsterhus wrote:
>> Reinhard,
>> Björn,
>>
>> On 08.09.20 at 21:39, Björn Jacke wrote:
>>>> the only official supported way to identify a google bot is to run a
>>>> reverse DNS lookup on the accessing IP address and run a forward DNS
>>>> lookup on the result to verify that it points to accessing IP address
>>>> and the resulting domain name is in either googlebot.com or google.com
>>>> domain.
>>>> ...
>>>
>>> thanks for asking this again, I brought this up earlier this year and I
>>> got no answer:
>>>
>>> https://www.mail-archive.com/haproxy@formilux.org/msg37301.html
>>>
>>> I would expect that this is something that most sites would actually
>>> want to check and I'm surprised that there is no solution for this or at
>>> least none that is obvious to find.
>>
>> The usually recommended solution for this kind of checks is either Lua
>> or the SPOA, running the actual logic out of process.
>>
>> For Lua my haproxy-auth-request script is a batteries included solution
>> to query an arbitrary HTTP service:
>> https://github.com/TimWolla/haproxy-auth-request. It comes with the
>> drawback that Lua runs single-threaded within HAProxy, so you might not
>> want to use this if the checks need to run in the hot path, handling
>> thousands of requests per second.
>>
>> It should be possible to cache the results of the script using a stick
>> table or a map.
>>
>> Back in nginx times I used nginx' auth_request to query a local service
>> that checked whether the client IP address was a Tor exit node. It
>> worked well.
>>
>> For SPOA there's this random IP reputation service within the HAProxy
>> repository:
>> https://github.com/haproxy/haproxy/tree/master/contrib/spoa_example. I
>> never used the SPOA feature, so I can't comment on whether that example
>> generally works and how hard it would be to extend it. It certainly
>> comes with the restriction that you are limited to C or Python (or a
>> manual implementation of the SPOA protocol) vs anything that speaks
>> HTTP.
>
> In addition to Tim's answer you can also try to use spoa_server which
> supports `-n `.
> https://github.com/haproxy/haproxy/tree/master/contrib/spoa_server
>

Thanks for your reply and the information. Sorry for my late reply, but I only had time to test today. I tried to get the spoa server working on Ubuntu bionic (18.04.4) with haproxy 2.2.3-2ppa1~bionic from the vbernat PPA. I could compile the spoa server with Python 3.6 support from the latest GitHub sources without obvious problems, and it also started without problems with the example Python script (./spoa -d -f ps_python.py).

If I start haproxy with the following command:

haproxy -f spoa-server.conf -d

haproxy seg faults on the first request to port 10001.

If I start haproxy with the additional parameter -Ws, then it does not seg fault, but only the first and every 4th request get (correctly?) forwarded to the spoa server. The 3 requests in between get answered with an empty %[var(sess.iprep.ip_score)].
Here are the log files of a working request:

from haproxy:

:test.accept(0008)=0014 from [127.0.0.1:57570] ALPN=
:test.clireq[0014:]: GET / HTTP/1.1
:test.clihdr[0014:]: host: localhost:10001
:test.clihdr[0014:]: user-agent: curl/7.58.0
:test.clihdr[0014:]: accept: */*
:test.clicls[0014:]
:test.closed[0014:]

from spoa server:

1599906552.714422 [01] New connection from HAProxy accepted
1599906552.714593 [01] Hello handshake done: version=2.0 - max-frame-size=16380 - healthcheck=false
1599906552.714780 [01] Notify frame received: stream-id=0 - frame-id=1
1599906552.714800 [01] Message 'check-client-ip' received
[{'name': '', 'value': True}, {'name': '', 'value': 1234}, {'name': '', 'value': IPv4Address('127.0.0.1')}, {'name': '', 'value': IPv6Address('::55')}, {'name': '', 'value': 'localhost:10001'}]
1599906552.716741 [01] Ack frame sent: stream-id=0 - frame-id=1

And here are the logs of a failing request:

from haproxy:

001f:test.accept(0008)=0015 from [127.0.0.1:57634] ALPN=
001f:test.clireq[0015:]: GET / HTTP/1.1
001f:test.clihdr[0015:]: host: localhost:10001
001f:test.clihdr[0015:]: user-agent: curl/7.58.0
001f:test.clihdr[0015:]: accept: */*
001f:test.clicls[0015:]
001f:test.closed[0015:]
0020:spoe-server.srvcls[:adfd]
0020:spoe-server.clicls[:adfd]
0020:spoe-server.closed[:adfd]

The spoa server does not log anything during the request, but after a while the following lines are logged:

1599906689.387816 [01] New connection from HAProxy accepted
1599906689.387848 [01] Failed to write Agent frame
1599906689.387853 [01] Close the client socket because of I/O errors

Every request works if there are at least 30 seconds between requests, because after 30 seconds the
Re: Dynamic Googlebot identification via lua?
On 08.09.20 22:54, Tim Düsterhus wrote:
> Reinhard,
> Björn,
>
> On 08.09.20 at 21:39, Björn Jacke wrote:
>>> the only official supported way to identify a google bot is to run a
>>> reverse DNS lookup on the accessing IP address and run a forward DNS
>>> lookup on the result to verify that it points to accessing IP address
>>> and the resulting domain name is in either googlebot.com or google.com
>>> domain.
>>> ...
>>
>> thanks for asking this again, I brought this up earlier this year and I
>> got no answer:
>>
>> https://www.mail-archive.com/haproxy@formilux.org/msg37301.html
>>
>> I would expect that this is something that most sites would actually
>> want to check and I'm surprised that there is no solution for this or at
>> least none that is obvious to find.
>
> The usually recommended solution for this kind of checks is either Lua
> or the SPOA, running the actual logic out of process.
>
> For Lua my haproxy-auth-request script is a batteries included solution
> to query an arbitrary HTTP service:
> https://github.com/TimWolla/haproxy-auth-request. It comes with the
> drawback that Lua runs single-threaded within HAProxy, so you might not
> want to use this if the checks need to run in the hot path, handling
> thousands of requests per second.
>
> It should be possible to cache the results of the script using a stick
> table or a map.
>
> Back in nginx times I used nginx' auth_request to query a local service
> that checked whether the client IP address was a Tor exit node. It
> worked well.
>
> For SPOA there's this random IP reputation service within the HAProxy
> repository:
> https://github.com/haproxy/haproxy/tree/master/contrib/spoa_example. I
> never used the SPOA feature, so I can't comment on whether that example
> generally works and how hard it would be to extend it. It certainly
> comes with the restriction that you are limited to C or Python (or a
> manual implementation of the SPOA protocol) vs anything that speaks
> HTTP.

In addition to Tim's answer, you can also try spoa_server, which supports `-n `.

https://github.com/haproxy/haproxy/tree/master/contrib/spoa_server

> Best regards
> Tim Düsterhus

Regards
Aleks
Re: Dynamic Googlebot identification via lua?
Reinhard,
Björn,

On 08.09.20 at 21:39, Björn Jacke wrote:
>> the only official supported way to identify a google bot is to run a
>> reverse DNS lookup on the accessing IP address and run a forward DNS
>> lookup on the result to verify that it points to accessing IP address
>> and the resulting domain name is in either googlebot.com or google.com
>> domain.
>> ...
>
> thanks for asking this again, I brought this up earlier this year and I
> got no answer:
>
> https://www.mail-archive.com/haproxy@formilux.org/msg37301.html
>
> I would expect that this is something that most sites would actually
> want to check and I'm surprised that there is no solution for this or at
> least none that is obvious to find.

The usually recommended solution for this kind of check is either Lua or the SPOA, running the actual logic out of process.

For Lua, my haproxy-auth-request script is a batteries-included solution to query an arbitrary HTTP service: https://github.com/TimWolla/haproxy-auth-request. It comes with the drawback that Lua runs single-threaded within HAProxy, so you might not want to use this if the checks need to run in the hot path, handling thousands of requests per second.

It should be possible to cache the results of the script using a stick table or a map.

Back in nginx times I used nginx' auth_request to query a local service that checked whether the client IP address was a Tor exit node. It worked well.

For SPOA there's this random IP reputation service within the HAProxy repository: https://github.com/haproxy/haproxy/tree/master/contrib/spoa_example. I have never used the SPOA feature, so I can't comment on whether that example generally works and how hard it would be to extend it. It certainly comes with the restriction that you are limited to C or Python (or a manual implementation of the SPOA protocol) vs anything that speaks HTTP.

Best regards
Tim Düsterhus
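As an illustration of the "local HTTP service plus caching" idea in this message, here is a minimal sketch of such a service. Everything specific in it is an assumption: the `X-Client-IP` header name is hypothetical and would have to be set by the proxy, the 204/403 convention assumes the caller treats any 2xx status as success, the one-hour TTL and port 8080 are arbitrary, and `check` stands for whatever slow lookup decides the verdict.

```python
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

CLIENT_IP_HEADER = "X-Client-IP"  # hypothetical header name, set by the proxy
CACHE_TTL_SECONDS = 3600          # arbitrary choice for this sketch


class VerdictCache:
    """Tiny in-memory cache mapping an IP to (verdict, expiry time)."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._entries = {}

    def lookup(self, ip, check):
        """Return the cached verdict for ip, calling check(ip) on a miss."""
        now = self._clock()
        entry = self._entries.get(ip)
        if entry is not None and entry[1] > now:
            return entry[0]
        verdict = bool(check(ip))
        self._entries[ip] = (verdict, now + self._ttl)
        return verdict


def make_handler(check, cache):
    """Build a request handler that answers 204 (allowed) or 403 (denied)."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            ip = self.headers.get(CLIENT_IP_HEADER)
            allowed = bool(ip) and cache.lookup(ip, check)
            self.send_response(204 if allowed else 403)
            self.end_headers()

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet

    return Handler


def serve(check, host="127.0.0.1", port=8080):
    """Run the service until interrupted; check(ip) is the slow lookup."""
    handler = make_handler(check, VerdictCache())
    ThreadingHTTPServer((host, port), handler).serve_forever()
```

Caching inside the service keeps repeated lookups for the same address cheap even before any stick table or map is added on the HAProxy side. The cache is a plain dict without locking or eviction, which is acceptable for a sketch but would need attention under real load.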
Re: Dynamic Googlebot identification via lua?
Hi Reinhard,

On 08.09.20 21:20, Reinhard Vicinus wrote:
> the only official supported way to identify a google bot is to run a
> reverse DNS lookup on the accessing IP address and run a forward DNS
> lookup on the result to verify that it points to accessing IP address
> and the resulting domain name is in either googlebot.com or google.com
> domain.
> ...

thanks for asking this again, I brought this up earlier this year and I got no answer:

https://www.mail-archive.com/haproxy@formilux.org/msg37301.html

I would expect that this is something that most sites would actually want to check and I'm surprised that there is no solution for this or at least none that is obvious to find.

Björn
Dynamic Googlebot identification via lua?
Hi,

the only officially supported way to identify a Google bot is to run a reverse DNS lookup on the accessing IP address, then run a forward DNS lookup on the result to verify that it points back to the accessing IP address and that the resulting domain name is in either the googlebot.com or the google.com domain.

As far as I understand the Lua API documentation, it is not possible to perform DNS requests from Lua in runtime mode. The only solution would therefore be to have an external service do the actual checking of an accessing IP address, use Lua to query that service, and cache the result per IP to increase performance.

As I am not that experienced in Lua programming, my question is whether this is feasible or whether I am missing something. Also, if there are other solutions I am not aware of, I would be thankful for pointers.

Thanks in advance
Reinhard Vicinus
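For reference, the reverse-then-forward check described above could look roughly like this inside such an external service. It is a sketch using only Python's standard library; the function names are invented for illustration, and the two resolver functions are passed in as parameters so the logic can be tried without real DNS.

```python
import ipaddress
import socket

# Domains named above as valid for the reverse DNS name.
ALLOWED_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse_lookup(ip):
    """Return the PTR host name for ip, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def _forward_lookup(host):
    """Return the set of addresses that host resolves to."""
    try:
        infos = socket.getaddrinfo(host, None)
    except OSError:
        return set()
    return {info[4][0] for info in infos}


def is_verified_googlebot(ip, reverse_lookup=_reverse_lookup,
                          forward_lookup=_forward_lookup):
    """Reverse lookup, domain check, then forward-confirm the name."""
    host = reverse_lookup(ip)
    if not host:
        return False
    host = host.rstrip(".").lower()
    if not host.endswith(ALLOWED_SUFFIXES):
        return False
    wanted = ipaddress.ip_address(ip)
    for candidate in forward_lookup(host):
        try:
            if ipaddress.ip_address(candidate) == wanted:
                return True
        except ValueError:
            continue  # skip address strings that cannot be parsed
    return False
```

The suffix check uses a leading dot so that a name such as `googlebot.com.evil.example` is rejected. The forward lookup is what prevents someone who controls their own reverse zone from simply claiming a googlebot.com name.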