Hi, I'm a colleague of Olaf.

Last week we exposed our XWiki to the Googlebot to test our
configuration. Yesterday we ran into the same problems with the crawler,
and we are as lost as before.

I will describe the problem more precisely and quote some of our logs.

The results of my analysis:

- Some critical actions (e.g. edit) redirect the Googlebot to the login page
with a 302. The login page returns 401, so the Googlebot's path stops here.
Fine!

Log example: 

example.com - - 66.249.73.10 [27/Apr/2012:15:44:12 +0200] "GET
/wiki/example.com/edit/XWiki/GadgetClass HTTP/1.1" 302 20
www.example.com - - 66.249.73.10 [27/Apr/2012:15:44:12 +0200] "GET
/wiki/example.com/login/XWiki/XWikiLogin;jsessionid=1F380560FA9E3582D6DDB9B1D286B151?srid=yWSymcYq&xredirect=%2Fwiki%2Fexample.com%2Fedit%2FXWiki%2FGadgetClass%3Fsrid%3DyWSymcYq
HTTP/1.1" 401 3004

============================================================================

- Some critical actions result in an OK (200). These include, for example,
deletespace, but also some edits:

Log example:

example.com - - 66.249.73.10 [27/Apr/2012:15:46:30 +0200] "GET
/wiki/example.com/get/Hilfe/WebPreferences HTTP/1.1" 200 985
example.com - - 66.249.73.10 [27/Apr/2012:15:46:33 +0200] "GET
/wiki/example.com/edit/Blog/WebPreferences HTTP/1.1" 200 5271
example.com - - 66.249.73.10 [27/Apr/2012:15:46:37 +0200] "GET
/wiki/example.com/save/Blog/WebPreferences HTTP/1.1" 302 20
example.com - - 66.249.73.10 [27/Apr/2012:15:46:37 +0200] "GET
/wiki/example.com/view/Blog/WebPreferences?resubmit=%2Fwiki%2Fexample.com%2Fsave%2FBlog%2FWebPreferences%3Fsrid%3Dn3Ake7tL&xback=%2Fwiki%2Fexample.com%2Fview%2FBlog%2FWebPreferences&xpage=resubmit
HTTP/1.1" 200 3689

Here I see one part of the Googlebot's path: it triggers actions guests are
not allowed to perform. Following the example above: when I request
/edit/Blog/WebPreferences myself, I get a 302 redirect and a 401 login page:


jloos@live:~$ curl -IL
http://example.com/wiki/example.com/edit/Blog/WebPreferences
HTTP/1.1 302 Moved Temporarily
Date: Sat, 28 Apr 2012 00:41:33 GMT
Server: Apache/2.2.16
Set-Cookie: JSESSIONID=6819FB0D96E0695388E1AA2A1A92AF49; Path=/
Location:
http://example.com/wiki/example.com/login/XWiki/XWikiLogin;jsessionid=6819FB0D96E0695388E1AA2A1A92AF49?srid=FnsRh0bE&xredirect=%2Fwiki%2Fexample.com%2Fedit%2FBlog%2FWebPreferences%3Fsrid%3DFnsRh0bE
Content-Language: de
Vary: Accept-Encoding
Content-Type: text/html

HTTP/1.1 401 Unauthorized
Date: Sat, 28 Apr 2012 00:41:33 GMT
Server: Apache/2.2.16
Pragma: no-cache
Cache-Control: no-cache
Expires: Wed, 31 Dec 1969 23:59:59 GMT
Content-Language: de
Content-Length: 13590
Vary: Accept-Encoding
Content-Type: text/html;charset=utf-8

============================================================================

- Some actions with confirmation forms are delivered to the Googlebot, too.

Log example:

example.com - - 66.249.71.33 [27/Apr/2012:16:03:05 +0200] "GET
/wiki/example.com/deletespace/Start/WebHome HTTP/1.1" 200 3702
[...]
example.com - - 66.249.71.33 [27/Apr/2012:16:43:03 +0200] "GET
/wiki/example.com/deletespace/Start/WebHome?confirm=1&form_token=saMxN4MidDarWDBvxciU2w
HTTP/1.1" 200 3001

So the Googlebot gets a form including the CSRF token, and then it follows
the "yes" link in the delete confirmation. With that, our disaster is
complete.
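
For reference, this is the kind of replay I tried as a guest (the
form_token value is copied from the log above; with a fresh guest session
I would expect a 302 to the login page, not a 200):

jloos@live:~$ curl -IL \
  "http://example.com/wiki/example.com/deletespace/Start/WebHome?confirm=1&form_token=saMxN4MidDarWDBvxciU2w"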

============================================================================


I can trace the Googlebot's actions very well in our logs, but I cannot
reproduce these actions as a guest in any way. I tried with and without
cookies, in several browsers, and with curl from the command line.
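
The cookie-based curl variant looked roughly like this (the cookie jar
path is just an example):

# reuse one cookie jar across requests so the JSESSIONID is kept
jloos@live:~$ curl -b /tmp/xwiki.cookies -c /tmp/xwiki.cookies -IL \
  "http://example.com/wiki/example.com/edit/Blog/WebPreferences"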

A wild guess: there seems to be some connection with other users' logins.
The last Googlebot disaster actions occurred while an admin was logged in
and a crawl was in progress. My guess: under some strange circumstances, a
user's session flips over to, or gets copied to, the crawler. But I admit
that sounds really far-fetched.
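
One conceivable mechanism, though (just a hypothesis on my part): the
login URLs in the logs above carry the session ID in the path
(;jsessionid=...), and servlet containers accept such URL-embedded session
IDs. Any client that fetches a leaked URL of that form would adopt that
session. Purely as an illustration (the session ID below is made up, not
from our logs):

# a crawler fetching a link that still contains a live jsessionid
# would continue that user's session
curl -I "http://example.com/wiki/example.com/edit/Blog/WebPreferences;jsessionid=SOME-LEAKED-SESSION-ID"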
 
The IP does seem to belong to the Googlebot:
jloos@test:~$ host 66.249.71.33
33.71.249.66.in-addr.arpa domain name pointer
crawl-66-249-71-33.googlebot.com.
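
To be thorough, the PTR record can be cross-checked with a forward lookup;
if the reverse entry is genuine, the name should resolve back to
66.249.71.33:

jloos@test:~$ host crawl-66-249-71-33.googlebot.com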

We are using XEM. The master wiki is behind htaccess; only the wiki in
question is freely accessible.
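
For context, the htaccess protection on the master wiki is plain HTTP
Basic auth, roughly like this (paths and names are simplified examples,
not our exact configuration):

# .htaccess on the master wiki's document root
AuthType Basic
AuthName "XWiki Master"
AuthUserFile /etc/apache2/htpasswd
Require valid-user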

I hope this detailed analysis isn't too detailed. And I can only quote Olaf:
> any hints greatly appreciated! 


Greetings

Jan
