Hi, I'm a colleague of Olaf. Last week we exposed our XWiki to the Googlebot to test our configuration. Yesterday we ran into the same problems with the crawler, and we are as lost as before.
Let me describe the problem more precisely and quote some of our logs. The results of my analysis:

- Some critical actions (e.g. edit) redirect the Googlebot to the login page with a 302. The login page returns 401, so the Googlebot's path stops there. Fine! Log example:

example.com - - 66.249.73.10 [27/Apr/2012:15:44:12 +0200] "GET /wiki/example.com/edit/XWiki/GadgetClass HTTP/1.1" 302 20
www.example.com - - 66.249.73.10 [27/Apr/2012:15:44:12 +0200] "GET /wiki/example.com/login/XWiki/XWikiLogin;jsessionid=1F380560FA9E3582D6DDB9B1D286B151?srid=yWSymcYq&xredirect=%2Fwiki%2Fexample.com%2Fedit%2FXWiki%2FGadgetClass%3Fsrid%3DyWSymcYq HTTP/1.1" 401 3004

============================================================================

- Some critical actions result in an OK (200). These include, for example, deletespace but also some edits. Log example:

example.com - - 66.249.73.10 [27/Apr/2012:15:46:30 +0200] "GET /wiki/example.com/get/Hilfe/WebPreferences HTTP/1.1" 200 985
example.com - - 66.249.73.10 [27/Apr/2012:15:46:33 +0200] "GET /wiki/example.com/edit/Blog/WebPreferences HTTP/1.1" 200 5271
example.com - - 66.249.73.10 [27/Apr/2012:15:46:37 +0200] "GET /wiki/example.com/save/Blog/WebPreferences HTTP/1.1" 302 20
example.com - - 66.249.73.10 [27/Apr/2012:15:46:37 +0200] "GET /wiki/example.com/view/Blog/WebPreferences?resubmit=%2Fwiki%2Fexample.com%2Fsave%2FBlog%2FWebPreferences%3Fsrid%3Dn3Ake7tL&xback=%2Fwiki%2Fexample.com%2Fview%2FBlog%2FWebPreferences&xpage=resubmit HTTP/1.1" 200 3689

Here I see one part of the Googlebot's path: it triggers actions that guests are not allowed to perform.
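To make this audit repeatable, here is a small sketch (my own, not part of XWiki) that scans an access log for state-changing action URLs that returned 200. The action list and the log-line layout are assumptions taken from the excerpts above; adjust both to your setup:

```python
import re

# Actions that should never return 200 to an anonymous crawler
# (assumed list, based on the log excerpts above).
UNSAFE_ACTIONS = {"edit", "save", "delete", "deletespace", "admin"}

# Matches lines like:
# example.com - - 66.249.73.10 [27/Apr/2012:15:44:12 +0200] \
#   "GET /wiki/example.com/edit/XWiki/GadgetClass HTTP/1.1" 302 20
LOG_RE = re.compile(
    r'(?P<ip>\d+\.\d+\.\d+\.\d+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) /wiki/[^/]+/(?P<action>[^/?"]+)/(?P<path>[^ "]*) \S+" '
    r'(?P<status>\d{3})'
)

def unsafe_hits(lines):
    """Yield (ip, action, path) for unsafe actions that got a 200 response."""
    for line in lines:
        m = LOG_RE.search(line)
        if m and m.group("action") in UNSAFE_ACTIONS and m.group("status") == "200":
            yield m.group("ip"), m.group("action"), m.group("path")
```

Feeding it the lines above flags the edit on Blog/WebPreferences but not the 302-redirected edit, which matches what I see by hand.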
According to the example: when I request /edit/Blog/WebPreferences myself, I get a 302 redirect and a 401 login page:

jloos@live:~$ curl -IL http://example.com/wiki/example.com/edit/Blog/WebPreferences
HTTP/1.1 302 Moved Temporarily
Date: Sat, 28 Apr 2012 00:41:33 GMT
Server: Apache/2.2.16
Set-Cookie: JSESSIONID=6819FB0D96E0695388E1AA2A1A92AF49; Path=/
Location: http://example.com/wiki/example.com/login/XWiki/XWikiLogin;jsessionid=6819FB0D96E0695388E1AA2A1A92AF49?srid=FnsRh0bE&xredirect=%2Fwiki%2Fexample.com%2Fedit%2FBlog%2FWebPreferences%3Fsrid%3DFnsRh0bE
Content-Language: de
Vary: Accept-Encoding
Content-Type: text/html

HTTP/1.1 401 Unauthorized
Date: Sat, 28 Apr 2012 00:41:33 GMT
Server: Apache/2.2.16
Pragma: no-cache
Cache-Control: no-cache
Expires: Wed, 31 Dec 1969 23:59:59 GMT
Content-Language: de
Content-Length: 13590
Vary: Accept-Encoding
Content-Type: text/html;charset=utf-8

============================================================================

- Some actions with confirmation forms are delivered to the Googlebot too. Log example:

example.com - - 66.249.71.33 [27/Apr/2012:16:03:05 +0200] "GET /wiki/example.com/deletespace/Start/WebHome HTTP/1.1" 200 3702
[...]
example.com - - 66.249.71.33 [27/Apr/2012:16:43:03 +0200] "GET /wiki/example.com/deletespace/Start/WebHome?confirm=1&form_token=saMxN4MidDarWDBvxciU2w HTTP/1.1" 200 3001

So the Googlebot gets a form including the CSRF token, then follows the "yes" link in the delete confirmation. That completes our disaster.

============================================================================

I can trace the Googlebot's actions very well in our logs, but I cannot reproduce them as a guest in any way. I tried with and without cookies, in several browsers, and with curl from the command line. A wild guess: there seems to be some connection with other users' logins. The last Googlebot disaster actions occurred while an admin was logged in and a crawl was in progress.
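As a stop-gap while we debug the rights problem, I am considering asking well-behaved crawlers to stay away from all action URLs via robots.txt. A sketch, assuming the short-URL scheme visible in our logs (this does not fix the underlying session/rights issue, it only keeps compliant bots off the dangerous paths):

```
User-agent: *
Disallow: /wiki/example.com/edit/
Disallow: /wiki/example.com/save/
Disallow: /wiki/example.com/delete/
Disallow: /wiki/example.com/deletespace/
Disallow: /wiki/example.com/login/
Disallow: /wiki/example.com/get/
```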
My guess: under some strange circumstances, a user's session flips over to, or is copied to, the crawler. But I admit that is really far-fetched. The IP does seem to belong to the Googlebot:

jloos@test:~$ host 66.249.71.33
33.71.249.66.in-addr.arpa domain name pointer crawl-66-249-71-33.googlebot.com.

We are using XEM. The master wiki is behind htaccess, and only the relevant wiki is freely accessible.

I hope this detailed analysis isn't too detailed. And I can quote Olaf:

> any hints greatly appreciated!

Greetings,
Jan

--
View this message in context: http://xwiki.475771.n2.nabble.com/severe-trouble-with-web-crawlers-tp7442162p7507847.html
Sent from the XWiki- Users mailing list archive at Nabble.com.
_______________________________________________
users mailing list
users@xwiki.org
http://lists.xwiki.org/mailman/listinfo/users