Re: Dynamic Googlebot identification via lua?

2020-09-12 Thread Tim Düsterhus
Reinhard,

Am 12.09.20 um 16:43 schrieb Reinhard Vicinus:
>>> thanks, for your reply and the information. Sorry for my late reply, but
>>> I had only today time to test. I did try to get the spoa server working
>>> on a ubuntu bionic (18.04.4) with haproxy 2.2.3-2ppa1~bionic from the
>>> vbernat ppa. I could compile the spoa server with python 3.6 support
>>> from the latest github sources without obvious problems and it also
>>> started without problems with the example python script (./spoa -d -f
>>> ps_python.py).
>>>
>>> If I start haproxy with the following command:
>>>
>>> haproxy -f spoa-server.conf -d
>>>
>>> haproxy seg faults on the first request to port 10001
>> This is a bug in HAProxy then. Do you happen to have a core dump / stack
>> trace?
> Yes I have a core dump. But I am somewhat rusty in analyzing them. So
> any pointers what to do with the core dump is appreciated. Also the
> segmentation fault only occurs if the spoa server is running so the
> problem is probably somewhere in the code regarding the connection to
> the spoa server.

Install the haproxy-dbgsym package to get the debug symbols, which make
the stack trace readable. Then:

gdb haproxy 

In gdb use 'bt' to get the backtrace of the thread that killed the
process. Possibly 't a a bt' to see what the other threads were doing.

Ideally file an issue within the tracker:
https://github.com/haproxy/haproxy/issues. It's easier to track it
within there and there's a template to fill in.

>>> If I start haproxy with the additional parameter -Ws then it does not
>>> seg fault, but only the first and every 4th request get (correctly?)
>>> forwarded to the spoa server, the 3 requests in between get answered
>>> with an empty %[var(sess.iprep.ip_score)].
>>>
>>> [...]
>>>
>>> I am unsure if I am making some stupid mistakes, or if I should test it
>>> with an older haproxy version or how to debug the issue further. So any
>>> pointers are very much appreciated.
>> Can you share the configuration you attempted to use?
> Sorry, I forgot to mention that the configuration used is the example
> configuration from haproxy repository. But here it is, to ensure that
> there were no changes in the meantime:

Unfortunately I can't comment on the SPOA functionality, because I never
used it. Hopefully the information is sufficient for someone more
proficient to tell what's going wrong there.

Best regards
Tim Düsterhus



Re: Dynamic Googlebot identification via lua?

2020-09-12 Thread Reinhard Vicinus
Tim,

On 9/12/20 4:25 PM, Tim Düsterhus wrote:
> Reinhard,
>
> Am 12.09.20 um 12:45 schrieb Reinhard Vicinus:
>> thanks, for your reply and the information. Sorry for my late reply, but
>> I had only today time to test. I did try to get the spoa server working
>> on a ubuntu bionic (18.04.4) with haproxy 2.2.3-2ppa1~bionic from the
>> vbernat ppa. I could compile the spoa server with python 3.6 support
>> from the latest github sources without obvious problems and it also
>> started without problems with the example python script (./spoa -d -f
>> ps_python.py).
>>
>> If I start haproxy with the following command:
>>
>> haproxy -f spoa-server.conf -d
>>
>> haproxy seg faults on the first request to port 10001
> This is a bug in HAProxy then. Do you happen to have a core dump / stack
> trace?
Yes I have a core dump. But I am somewhat rusty in analyzing them. So
any pointers what to do with the core dump is appreciated. Also the
segmentation fault only occurs if the spoa server is running so the
problem is probably somewhere in the code regarding the connection to
the spoa server.
>
>> If I start haproxy with the additional parameter -Ws then it does not
>> seg fault, but only the first and every 4th request get (correctly?)
>> forwarded to the spoa server, the 3 requests in between get answered
>> with an empty %[var(sess.iprep.ip_score)].
>>
>> [...]
>>
>> I am unsure if I am making some stupid mistakes, or if I should test it
>> with an older haproxy version or how to debug the issue further. So any
>> pointers are very much appreciated.
> Can you share the configuration you attempted to use?
Sorry, I forgot to mention that the configuration used is the example
configuration from haproxy repository. But here it is, to ensure that
there were no changes in the meantime:

spoa-server.conf:
global
    debug

defaults
    mode http
    option httplog
    option dontlognull
    timeout connect 5000
    timeout client 5000
    timeout server 5000

listen test
    mode http
    bind :10001
    filter spoe engine spoa-server config spoa-server.spoe.conf
    http-request set-var(req.a) var(txn.iprep.null),debug
    http-request set-var(req.a) var(txn.iprep.boolean),debug
    http-request set-var(req.a) var(txn.iprep.int32),debug
    http-request set-var(req.a) var(txn.iprep.uint32),debug
    http-request set-var(req.a) var(txn.iprep.int64),debug
    http-request set-var(req.a) var(txn.iprep.uint64),debug
    http-request set-var(req.a) var(txn.iprep.ipv4),debug
    http-request set-var(req.a) var(txn.iprep.ipv6),debug
    http-request set-var(req.a) var(txn.iprep.str),debug
    http-request set-var(req.a) var(txn.iprep.bin),debug
    http-request redirect location /%[var(sess.iprep.ip_score)]

backend spoe-server
    mode tcp
    balance roundrobin
    timeout connect 5s
    timeout server  3m
    server spoe-server 127.0.0.1:12345


spoa-server.spoe.conf:
[spoa-server]

spoe-agent spoa-server
    messages check-client-ip
    option var-prefix  iprep
    timeout hello  100ms
    timeout idle   30s
    timeout processing 15ms
    use-backend spoe-server

spoe-message check-client-ip
    args always_true int(1234) src ipv6(::55) req.fhdr(host)
    event on-frontend-http-request


Thanks in advance
Reinhard Vicinus


Re: Dynamic Googlebot identification via lua?

2020-09-12 Thread Tim Düsterhus
Reinhard,

Am 12.09.20 um 12:45 schrieb Reinhard Vicinus:
> thanks, for your reply and the information. Sorry for my late reply, but
> I had only today time to test. I did try to get the spoa server working
> on a ubuntu bionic (18.04.4) with haproxy 2.2.3-2ppa1~bionic from the
> vbernat ppa. I could compile the spoa server with python 3.6 support
> from the latest github sources without obvious problems and it also
> started without problems with the example python script (./spoa -d -f
> ps_python.py).
> 
> If I start haproxy with the following command:
> 
> haproxy -f spoa-server.conf -d
> 
> haproxy seg faults on the first request to port 10001

This is a bug in HAProxy then. Do you happen to have a core dump / stack
trace?

> If I start haproxy with the additional parameter -Ws then it does not
> seg fault, but only the first and every 4th request get (correctly?)
> forwarded to the spoa server, the 3 requests in between get answered
> with an empty %[var(sess.iprep.ip_score)].
> 
> [...]
> 
> I am unsure if I am making some stupid mistakes, or if I should test it
> with an older haproxy version or how to debug the issue further. So any
> pointers are very much appreciated.

Can you share the configuration you attempted to use?

Best regards
Tim Düsterhus



Re: Dynamic Googlebot identification via lua?

2020-09-12 Thread Reinhard Vicinus
Tim,
Aleksandar,

On 9/8/20 11:18 PM, Aleksandar Lazic wrote:
> On 08.09.20 22:54, Tim Düsterhus wrote:
>> Reinhard,
>> Björn,
>>
>> Am 08.09.20 um 21:39 schrieb Björn Jacke:
>>>> the only official supported way to identify a google bot is to run a
>>>> reverse DNS lookup on the accessing IP address and run a forward DNS
>>>> lookup on the result to verify that it points to accessing IP address
>>>> and the resulting domain name is in either googlebot.com or google.com
>>>> domain.
>>>> ...
>>>
>>> thanks for asking this again, I brought this up earlier this year and I
>>> got no answer:
>>>
>>> https://www.mail-archive.com/haproxy@formilux.org/msg37301.html
>>>
>>> I would expect that this is something that most sites would actually
>>> want to check and I'm surprised that there is no solution for this or at
>>> least none that is obvious to find.
>>
>> The usually recommended solution for this kind of checks is either Lua
>> or the SPOA, running the actual logic out of process.
>>
>> For Lua my haproxy-auth-request script is a batteries included solution
>> to query an arbitrary HTTP service:
>> https://github.com/TimWolla/haproxy-auth-request. It comes with the
>> drawback that Lua runs single-threaded within HAProxy, so you might not
>> want to use this if the checks need to run in the hot path, handling
>> thousands of requests per second.
>>
>> It should be possible to cache the results of the script using a stick
>> table or a map.
>>
>> Back in nginx times I used nginx' auth_request to query a local service
>> that checked whether the client IP address was a Tor exit node. It
>> worked well.
>>
>> For SPOA there's this random IP reputation service within the HAProxy
>> repository:
>> https://github.com/haproxy/haproxy/tree/master/contrib/spoa_example. I
>> never used the SPOA feature, so I can't comment on whether that example
>> generally works and how hard it would be to extend it. It certainly
>> comes with the restriction that you are limited to C or Python (or a
>> manual implementation of the SPOA protocol) vs anything that speaks HTTP.
>
> In addition to Tim's answer you can also try to use spoa_server which
> supports `-n `.
> https://github.com/haproxy/haproxy/tree/master/contrib/spoa_server
>
thanks, for your reply and the information. Sorry for my late reply, but
I had only today time to test. I did try to get the spoa server working
on a ubuntu bionic (18.04.4) with haproxy 2.2.3-2ppa1~bionic from the
vbernat ppa. I could compile the spoa server with python 3.6 support
from the latest github sources without obvious problems and it also
started without problems with the example python script (./spoa -d -f
ps_python.py).

If I start haproxy with the following command:

haproxy -f spoa-server.conf -d

haproxy seg faults on the first request to port 10001

If I start haproxy with the additional parameter -Ws then it does not
seg fault, but only the first and every 4th request get (correctly?)
forwarded to the spoa server, the 3 requests in between get answered
with an empty %[var(sess.iprep.ip_score)].

Here are the log files of a working request:

from haproxy:
:test.accept(0008)=0014 from [127.0.0.1:57570] ALPN=
:test.clireq[0014:]: GET / HTTP/1.1
:test.clihdr[0014:]: host: localhost:10001
:test.clihdr[0014:]: user-agent: curl/7.58.0
:test.clihdr[0014:]: accept: */*
:test.clicls[0014:]
:test.closed[0014:]

from spoa server:
1599906552.714422 [01] New connection from HAProxy accepted
1599906552.714593 [01] Hello handshake done: version=2.0 - max-frame-size=16380 - healthcheck=false
1599906552.714780 [01] Notify frame received: stream-id=0 - frame-id=1
1599906552.714800 [01]   Message 'check-client-ip' received
[{'name': '', 'value': True},
 {'name': '', 'value': 1234},
 {'name': '', 'value': IPv4Address('127.0.0.1')},
 {'name': '', 'value': IPv6Address('::55')},
 {'name': '', 'value': 'localhost:10001'}]
1599906552.716741 [01] Ack frame sent: stream-id=0 - frame-id=1

And here from a failing request:

from haproxy:
001f:test.accept(0008)=0015 from [127.0.0.1:57634] ALPN=
001f:test.clireq[0015:]: GET / HTTP/1.1
001f:test.clihdr[0015:]: host: localhost:10001
001f:test.clihdr[0015:]: user-agent: curl/7.58.0
001f:test.clihdr[0015:]: accept: */*
001f:test.clicls[0015:]
001f:test.closed[0015:]
0020:spoe-server.srvcls[:adfd]
0020:spoe-server.clicls[:adfd]
0020:spoe-server.closed[:adfd]

The spoa server does not log anything during the request, but after a
while the following lines are logged:

1599906689.387816 [01] New connection from HAProxy accepted
1599906689.387848 [01] Failed to write Agent frame
1599906689.387853 [01] Close the client socket because of I/O errors

Every request works if there are at least 30 seconds between requests,
because after 30 seconds the

Re: Dynamic Googlebot identification via lua?

2020-09-08 Thread Aleksandar Lazic

On 08.09.20 22:54, Tim Düsterhus wrote:

> Reinhard,
> Björn,
>
> Am 08.09.20 um 21:39 schrieb Björn Jacke:
>>> the only official supported way to identify a google bot is to run a
>>> reverse DNS lookup on the accessing IP address and run a forward DNS
>>> lookup on the result to verify that it points to accessing IP address
>>> and the resulting domain name is in either googlebot.com or google.com
>>> domain.
>>> ...
>>
>> thanks for asking this again, I brought this up earlier this year and I
>> got no answer:
>>
>> https://www.mail-archive.com/haproxy@formilux.org/msg37301.html
>>
>> I would expect that this is something that most sites would actually
>> want to check and I'm surprised that there is no solution for this or at
>> least none that is obvious to find.
>
> The usually recommended solution for this kind of checks is either Lua
> or the SPOA, running the actual logic out of process.
>
> For Lua my haproxy-auth-request script is a batteries included solution
> to query an arbitrary HTTP service:
> https://github.com/TimWolla/haproxy-auth-request. It comes with the
> drawback that Lua runs single-threaded within HAProxy, so you might not
> want to use this if the checks need to run in the hot path, handling
> thousands of requests per second.
>
> It should be possible to cache the results of the script using a stick
> table or a map.
>
> Back in nginx times I used nginx' auth_request to query a local service
> that checked whether the client IP address was a Tor exit node. It
> worked well.
>
> For SPOA there's this random IP reputation service within the HAProxy
> repository:
> https://github.com/haproxy/haproxy/tree/master/contrib/spoa_example. I
> never used the SPOA feature, so I can't comment on whether that example
> generally works and how hard it would be to extend it. It certainly
> comes with the restriction that you are limited to C or Python (or a
> manual implementation of the SPOA protocol) vs anything that speaks HTTP.

In addition to Tim's answer you can also try to use spoa_server which
supports `-n `.
https://github.com/haproxy/haproxy/tree/master/contrib/spoa_server

> Best regards
> Tim Düsterhus


Regards
Aleks



Re: Dynamic Googlebot identification via lua?

2020-09-08 Thread Tim Düsterhus
Reinhard,
Björn,

Am 08.09.20 um 21:39 schrieb Björn Jacke:
>> the only official supported way to identify a google bot is to run a
>> reverse DNS lookup on the accessing IP address and run a forward DNS
>> lookup on the result to verify that it points to accessing IP address
>> and the resulting domain name is in either googlebot.com or google.com
>> domain.
>> ...
> 
> thanks for asking this again, I brought this up earlier this year and I
> got no answer:
> 
> https://www.mail-archive.com/haproxy@formilux.org/msg37301.html
> 
> I would expect that this is something that most sites would actually
> want to check and I'm surprised that there is no solution for this or at
> least none that is obvious to find.

The usually recommended solution for this kind of checks is either Lua
or the SPOA, running the actual logic out of process.

For Lua my haproxy-auth-request script is a batteries included solution
to query an arbitrary HTTP service:
https://github.com/TimWolla/haproxy-auth-request. It comes with the
drawback that Lua runs single-threaded within HAProxy, so you might not
want to use this if the checks need to run in the hot path, handling
thousands of requests per second.

It should be possible to cache the results of the script using a stick
table or a map.

Back in nginx times I used nginx' auth_request to query a local service
that checked whether the client IP address was a Tor exit node. It
worked well.

For SPOA there's this random IP reputation service within the HAProxy
repository:
https://github.com/haproxy/haproxy/tree/master/contrib/spoa_example. I
never used the SPOA feature, so I can't comment on whether that example
generally works and how hard it would be to extend it. It certainly
comes with the restriction that you are limited to C or Python (or a
manual implementation of the SPOA protocol) vs anything that speaks HTTP.

Best regards
Tim Düsterhus



Re: Dynamic Googlebot identification via lua?

2020-09-08 Thread Björn Jacke
Hi Reinhard,

On 08.09.20 21:20, Reinhard Vicinus wrote:
> the only official supported way to identify a google bot is to run a
> reverse DNS lookup on the accessing IP address and run a forward DNS
> lookup on the result to verify that it points to accessing IP address
> and the resulting domain name is in either googlebot.com or google.com
> domain.
> ...

thanks for asking this again, I brought this up earlier this year and I
got no answer:

https://www.mail-archive.com/haproxy@formilux.org/msg37301.html

I would expect that this is something that most sites would actually
want to check and I'm surprised that there is no solution for this or at
least none that is obvious to find.

Björn





Dynamic Googlebot identification via lua?

2020-09-08 Thread Reinhard Vicinus
Hi,

the only official supported way to identify a google bot is to run a
reverse DNS lookup on the accessing IP address and run a forward DNS
lookup on the result to verify that it points to accessing IP address
and the resulting domain name is in either googlebot.com or google.com
domain.
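
The check described above can be sketched in Python roughly as follows.
This is only an illustration using the standard library resolver; the
function name is_verified_googlebot and the injectable reverse/forward
parameters are hypothetical and exist only so the logic can be tried
without network access. It is not part of HAProxy or its Lua API.

```python
import socket

# Domains the reverse lookup result has to fall into.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip):
    """Return the PTR name for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def _forward(name):
    """Return the set of addresses name resolves to (empty on failure)."""
    try:
        return {info[4][0] for info in socket.getaddrinfo(name, None)}
    except OSError:
        return set()


def is_verified_googlebot(ip, reverse=_reverse, forward=_forward):
    """Reverse lookup, domain check, then forward lookup back to ip."""
    name = reverse(ip)
    if not name:
        return False
    name = name.rstrip(".").lower()
    # The PTR name must sit inside one of the two Google domains.
    if not name.endswith(GOOGLE_SUFFIXES):
        return False
    # The forward lookup must point back at the accessing address.
    # Addresses are compared as strings, so ip is assumed to be in the
    # same textual form the resolver returns.
    return ip in forward(name)
```

Both lookups block, so a check like this would have to run outside the
request path, for example in the external service discussed below.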

As far as I understand the Lua API documentation, it is not possible in
Lua to perform DNS requests in runtime mode, so the only solution would
be to use an external service to do the actual checking of an accessing
IP address and use Lua to query the external service and cache the
result for the IP to increase performance.
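
The caching part of such an external service could look roughly like
this Python sketch. The class name TTLCache, its lookup method and the
one hour default lifetime are assumptions made for illustration only;
the check callable stands for whatever slow per-IP verification the
service performs.

```python
import time


class TTLCache:
    """Remember per-IP verdicts for ttl seconds."""

    def __init__(self, check, ttl=3600.0, clock=time.monotonic):
        self._check = check    # callable: ip -> bool (the slow lookup)
        self._ttl = ttl        # lifetime of an entry in seconds
        self._clock = clock    # injectable so expiry can be tested
        self._entries = {}     # ip -> (expiry time, verdict)

    def lookup(self, ip):
        now = self._clock()
        hit = self._entries.get(ip)
        # Serve the stored verdict while it has not expired.
        if hit is not None and hit[0] > now:
            return hit[1]
        # Otherwise run the slow check and store the fresh verdict.
        verdict = self._check(ip)
        self._entries[ip] = (now + self._ttl, verdict)
        return verdict
```

Expired entries are only replaced when the same IP is seen again, so the
dictionary grows without bound; a real service would need some eviction.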

As I am not that experienced in Lua programming, my question is whether
this is feasible or whether I am missing something. Also, if there are
other solutions I am not aware of, I would be thankful for pointers.

Thanks in advance
Reinhard Vicinus