url normalization

2010-01-27 Thread Claudio Martella
Hello, I'm crawling our intranet site, and I see that the default configuration normalizes URLs by removing '?', which means no queries. This basically means that you crawl only static data. Most of our table-based sites handle paging with '?' and '=' queries, like 99% of sites out there. What is

Re: url normalization

2010-01-27 Thread Ken Krugler
On Jan 27, 2010, at 9:47am, Claudio Martella wrote: Hello, I'm crawling our intranet site, and I see that the default configuration normalizes URLs by removing '?', which means no queries. This basically means that you crawl only static data. Most of our table-based sites are handled with

Re: url normalization

2010-01-27 Thread Claudio Martella
Ken Krugler wrote: On Jan 27, 2010, at 9:47am, Claudio Martella wrote: Hello, I'm crawling our intranet site, and I see that the default configuration normalizes URLs by removing '?', which means no queries. This basically means that you crawl only static data. Most of our table-based sites

Re: url normalization

2010-01-27 Thread Jesse Hires
This also prevents things like over-indexing generated calendars, where the next day/month/year link will always produce output no matter how far it goes. Jesse

Re: url normalization

2010-01-27 Thread Claudio Martella
I do understand this problem. But then, how do you handle it? Avoiding queries completely is suicide. Google indexes queries. How do you think it can do it? Jesse Hires wrote: This also prevents things like over-indexing generated calendars, where the next day/month/year link will always

URL normalization ...

2009-03-22 Thread David M. Cole
-insensitive. I have Googled "Nutch URL normalization", but those postings seem to deal with issues such as http://my.domain.com:80/ vs. http://my.domain.com/ ... Any thoughts about how to resolve this (admittedly minor) problem would be appreciated. Thanks. \dmc
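
For readers unfamiliar with the port-style normalization mentioned in this message (http://my.domain.com:80/ vs. http://my.domain.com/), here is a minimal sketch in Python. It is not Nutch code; the function name normalize_url and the DEFAULT_PORTS table are hypothetical, chosen only for illustration. It lowercases the scheme and host, drops a port that matches the scheme's default, and leaves the path and query untouched.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical table of default ports; extend as needed.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, and drop a redundant default port.

    Illustrative only: userinfo (user:password@) is discarded, and the
    path and query are kept exactly as given, including their case.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    netloc = host
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    path = parts.path or "/"  # treat an empty path as the root
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(normalize_url("http://my.domain.com:80/"))
    print(normalize_url("HTTP://My.Domain.COM/"))
```

Both calls in the usage section yield the same string, so the two spellings would be treated as one URL. Whether the path should also be compared case-insensitively depends on the server and is deliberately left out of this sketch.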

url normalization

2007-12-05 Thread Lyndon Maydwell
Is there a way to apply regex normalization to the URLs currently in the database? For example, I would like to make www.asdf.com equivalent to asdf.com.
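
The kind of rule being asked about can be sketched outside Nutch in a few lines of Python. This is only an illustration under stated assumptions: the names strip_www and renormalize are hypothetical, the stored URLs are assumed to be available as a plain list of strings, and nothing here touches a real crawl database.

```python
import re

# Matches a leading "www." directly after an http:// or https:// scheme.
_WWW_PREFIX = re.compile(r"^(https?://)www\.", re.IGNORECASE)


def strip_www(url: str) -> str:
    """Rewrite http://www.example.com/... to http://example.com/..."""
    return _WWW_PREFIX.sub(r"\1", url, count=1)


def renormalize(urls):
    """Group already-stored URLs by their normalized form.

    Returns a dict mapping each normalized URL to the list of original
    URLs that collapse onto it, so duplicates can be merged afterwards.
    """
    merged = {}
    for url in urls:
        merged.setdefault(strip_www(url), []).append(url)
    return merged


if __name__ == "__main__":
    stored = ["http://www.asdf.com/", "http://asdf.com/", "http://asdf.com/page"]
    for normalized, originals in renormalize(stored).items():
        print(normalized, "<-", originals)
```

Grouping rather than overwriting keeps the original entries visible, which matters when the www and non-www hosts turn out not to serve the same content.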