Hello,
I'm crawling our intranet site, and I see that the default configuration
normalizes URLs by removing '?', which means no queries. This basically
means that you crawl only static data. Most of our table-based sites
handle paging with '?' and '=' queries, like 99% of the sites out there.
What is
On Jan 27, 2010, at 9:47am, Claudio Martella wrote:
Hello,
i'm crawling our intranet site, i see that the default configuration
normalizes urls removing '?', which means no queries. This is basically
saying that you crawl just static data. most of our table-based sites
are handled with
Ken Krugler wrote:
On Jan 27, 2010, at 9:47am, Claudio Martella wrote:
Hello,
i'm crawling our intranet site, i see that the default configuration
normalizes urls removing '?', which means no queries. This is basically
saying that you crawl just static data. most of our table-based sites
This also prevents things like over-indexing generated calendars, where the
next day/month/year link will always produce output no matter how far it
goes.
Jesse
int GetRandomNumber()
{
return 4; // Chosen by fair roll of dice
// Guaranteed to be random
} // xkcd.com
On Wed,
I do understand this problem. But then, how do you handle this? Avoiding
queries completely is suicide. Google indexes queries. How do you
think it does it?
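
One possible middle ground, sketched below in Python rather than in Nutch's
own configuration: keep an ordered list of accept/reject patterns where the
first match decides, drop the blanket "anything with '?' or '='" rule, and
reject only the links known to loop forever. The first rule set mirrors the
"[?*!@=]" skip pattern shipped in Nutch's conf/regex-urlfilter.txt. The
second rule set is hypothetical, and the parameter names year/month/day are
assumptions about how a generated calendar might build its links.

```python
import re

# Each rule is (accept, regex). The first rule whose regex is found anywhere
# in the URL decides; a URL that matches no rule is rejected.
DEFAULT_RULES = [
    (False, r"[?*!@=]"),  # skip probable queries (the shipped default)
    (True, r"."),         # accept everything else
]

# Hypothetical replacement: allow queries in general, but reject
# calendar-style links. The parameter names are assumptions.
QUERY_FRIENDLY_RULES = [
    (False, r"[?&](year|month|day)="),  # endless next day/month/year links
    (False, r"[*!@]"),                  # still skip the other odd characters
    (True, r"."),                       # accept everything else
]


def accepts(url, rules):
    """Return True if the first matching rule accepts the URL."""
    for accept, pattern in rules:
        if re.search(pattern, url):
            return accept
    return False
```

With these rules a paging link such as http://intranet/list?page=2 is
rejected by the default set and accepted by the second one, while a link
carrying month= or year= is still rejected.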
Jesse Hires wrote:
This also prevents things like over indexing generated calendars where the
next day/month/year link will always
-insensitive. I have Google'd Nutch URL normalization, but
those postings seem to deal with issues such as
http://my.domain.com:80/ vs. http://my.domain.com/ ...
Any thoughts about how to resolve this (admittedly minor) problem
would be appreciated.
Thanks.
\dmc
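
The start of the question above is cut off, so the following is only a
sketch under the assumption that the problem is URLs differing in letter
case. It uses Python's urllib.parse, not Nutch's normalizer plugins. It
lowercases the scheme and host, drops a default port (the :80 case mentioned
above), and lowercases the path only on request, since paths are
case-sensitive on many servers. The function name and the lowercase_path
flag are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(url, lowercase_path=False):
    """Lowercase scheme and host, drop a default port, optionally lowercase the path.

    User-info (user:password@) is ignored in this sketch.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    if lowercase_path:
        # Only safe when the server is known to treat paths case-insensitively.
        path = path.lower()
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))
```

For example, normalize("http://my.domain.com:80/") and
normalize("http://my.domain.com/") give the same string.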
Is there a way to apply regex normalization to the URLs currently in
the database?
E.g., I would like to make www.asdf.com equivalent to asdf.com.
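
As an illustration of the rule itself, here is a small Python sketch of a
regex substitution that maps the www. form onto the bare host. It shows only
the pattern; it does not show how to re-apply a rule to URLs already stored
in the crawl database. The function name strip_www is hypothetical.

```python
import re

# Match "www." only directly after the scheme, so "wwwasdf.com" is untouched.
WWW_PREFIX = re.compile(r"^(https?://)www\.", re.IGNORECASE)


def strip_www(url):
    """Rewrite http://www.host/... to http://host/..., keeping the scheme."""
    return WWW_PREFIX.sub(r"\1", url, count=1)


# Collapsing a list of stored URLs to their normalized forms:
urls = ["http://www.asdf.com/page", "http://asdf.com/page"]
unique = {strip_www(u) for u in urls}
```

After this runs, unique holds a single entry, because both spellings
normalize to the same URL.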