On 09/02/2010 15:46, Christopher Schultz wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Marian,

On 2/9/2010 9:31 AM, Marian Simpetru wrote:
Google acts as a cookieless browser and is therefore served non-unique
URLs (because the session ID is appended to the URL).

I heard at one point that Google's crawler *did* support cookies. I
never verified that, but it sounds like they currently do not support them.

Question is: is there a way to configure Tomcat to use only cookies
(that is, not append jsessionid to the URL for cookieless browsers)?

It's not a Tomcat configuration, but you can always write a filter like
this:

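import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.*;

public class NoURLRewriteFilter
    implements Filter
{
   public void init(FilterConfig config) { }
   public void destroy() { }
   public void doFilter(ServletRequest request, ServletResponse response,
                        FilterChain chain)
       throws IOException, ServletException
   {
     // Return every URL unchanged so ";jsessionid=..." is never appended.
     chain.doFilter(request,
       new HttpServletResponseWrapper((HttpServletResponse) response) {
         public String encodeURL(String url) { return url; }
         public String encodeUrl(String url) { return url; }
         public String encodeRedirectURL(String url) { return url; }
         public String encodeRedirectUrl(String url) { return url; }
       });
   }
}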

Now, this will likely cause an explosion in the number of sessions
generated by Google's crawler. You might want to couple this with a
separate filter (or just create a GoogleCrawlerFilter that does all
this) that identifies Google's (and others') user agent and intercepts
calls to getSession() and either refuses to create a session (probably
not a good idea) or returns a fake session that gets discarded after
every request. Another option would be to set the session timeout to
something like 10 seconds so the session dies relatively quickly instead
of sticking around for a long time, wasting memory.
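
As a rough sketch of the short-timeout idea, the helper below guesses
whether a User-Agent header belongs to a crawler and picks a session
timeout accordingly. The class name, the token list and both timeout
values are assumptions made up for illustration; it uses only the JDK so
it can be tried outside a container.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: guesses whether a User-Agent is a crawler. */
final class CrawlerDetector {
    // Illustrative substrings only; a real list would need maintenance.
    private static final List<String> BOT_TOKENS =
        Arrays.asList("googlebot", "bingbot", "slurp");

    // Seconds, the unit HttpSession.setMaxInactiveInterval(int) expects.
    static final int CRAWLER_TIMEOUT_SECONDS = 10;       // assumed value
    static final int DEFAULT_TIMEOUT_SECONDS = 30 * 60;  // assumed value

    private CrawlerDetector() { }

    /** Case-insensitive substring match against the token list. */
    static boolean isCrawler(String userAgent) {
        if (userAgent == null) {
            return false;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String token : BOT_TOKENS) {
            if (ua.contains(token)) {
                return true;
            }
        }
        return false;
    }

    /** Short timeout for crawlers, the default for everyone else. */
    static int sessionTimeoutFor(String userAgent) {
        return isCrawler(userAgent)
            ? CRAWLER_TIMEOUT_SECONDS
            : DEFAULT_TIMEOUT_SECONDS;
    }
}
```

In a filter, one could call request.getHeader("User-Agent") and, if
request.getSession(false) returns a new session after chain.doFilter(),
pass the result of sessionTimeoutFor() to setMaxInactiveInterval().
User-Agent strings are easy to spoof, so treat this as a memory-saving
heuristic only.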

Maybe a better idea would be for someone from Apache Tomcat to push
Google on the standards Tomcat implements in this respect, so that
Google changes its algorithm and does not punish Tomcat-powered
websites with low rankings.

This is not a "Tomcat problem": it's a problem with any site that
requires sessions to maintain state on the server.

I agree with Chuck: fix your webapp to tolerate Google's crawler, or
suffer the consequences.

Something else you can do is use a robots.txt file to prevent the
crawler from hitting certain URLs. That might help.
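
A minimal robots.txt along those lines might look like the following.
The two paths are hypothetical placeholders for whatever parts of a site
depend on a session; they are not taken from the original poster's
application.

```text
# Hypothetical example: keep all crawlers out of session-dependent areas
User-agent: *
Disallow: /cart/
Disallow: /account/
```

The file has to be served from the root of the host (/robots.txt), and
compliance is voluntary on the crawler's part.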

I'm not doing anything special, I don't think.
Google bots hit our site, the session count goes up a bit.
Google does not include jsessionid in the URLs it indexes.

It may be that the site has been around for long enough that the Google algorithms know that our session ID should be removed from a URL.

It would be surprising to me if Google (et al.) were not trying to remove PHPSESSID and JSESSIONID data from URLs.
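
For what it's worth, stripping the parameter is simple enough that a crawler (or a site generating canonical links) could plausibly do something like the sketch below. This is only a guess at the general idea, not a description of what Google actually does; the class and method names are made up.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: removes the ";jsessionid=..." path parameter. */
final class SessionIdStripper {
    // Matches ";jsessionid=" plus its value, stopping before the next
    // ';', '?', '#' or '/' so the rest of the URL is left untouched.
    private static final Pattern JSESSIONID =
        Pattern.compile(";jsessionid=[^;?#/]*", Pattern.CASE_INSENSITIVE);

    private SessionIdStripper() { }

    static String strip(String url) {
        return JSESSIONID.matcher(url).replaceAll("");
    }
}
```

With this, strip("http://example.com/app/page.jsp;jsessionid=ABC123?x=1") yields "http://example.com/app/page.jsp?x=1", and a URL without the parameter comes back unchanged.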


p


- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAktxg08ACgkQ9CaO5/Lv0PBxDACgweTaZAglz476s7TvYo63//2a
IgcAoIp0u2ZxOes8fFPuUAoP2FrHk/VN
=FjsP
-----END PGP SIGNATURE-----

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org
