Hi Folks,

I have a question about crawling URLs with query strings. I am crawling about
10,000 sites. Some of the sites use query strings to serve their content while
others use simple URLs. For example, I have the following cases:

Case 1:

site1.com/article1
site1.com/article2

Case 2:
site2.com/?pid=123
site2.com/?pid=124

The only way to crawl and fetch webpages/articles in case 2 is to fetch URLs
containing a query string ("?"), while for case 1 I can tell the crawler NOT
to fetch URLs with "?". So currently in my regex-urlfilter.txt I commented out
the following lines so that my crawler fetches URLs with query strings:

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=] 
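Instead of commenting the rule out entirely, one alternative is to add accept rules for the specific query patterns you need before the general exclusion, relying on regex-urlfilter's first-match-wins evaluation. A sketch (the site name and `pid` parameter are just the examples from this mail, stand-ins for your real sites):

```
# Accept the known content pattern(s) first; the first matching rule wins.
+^https?://site2\.com/\?pid=[0-9]+$

# Then keep the default rule that skips other probable query URLs.
-[?*!@=]
```

This keeps the default protection against download/login/search URLs for every site that does not need query strings.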

The above setting causes the crawler to fetch all URLs, including those with
query strings, so pages such as downloads, logins, comments, search results,
printer-friendly pages, zoomed views, and other low-value pages are also being
fetched. In practice, the crawler is going into the deep web. The undesirable
consequences are as follows:

1. Duplicate pages are being fetched, bloating the crawl DB
- Printer friendly view, zoom in view
e.g. site1.com/article1
e.g. site1.com/article1/?view=printerfriendly
e.g. site1.com/article1/?zoom=large
e.g. site1.com/article1/?zoom=extralarge

2. Download pages are being fetched, making the segments too large
e.g. site1.com/getcontentID?id=1&format=pdf
e.g. site1.com/getcontentID?id=1&format=doc

3. Crawling takes a very long time (10 days for depth 5) since it is going
into the deep web.
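For the duplicate views in point 1, another option besides filtering is URL normalization: Nutch's urlnormalizer-regex plugin rewrites URLs via conf/regex-normalize.xml, so the printer-friendly and zoom variants can be collapsed to the base URL before they ever reach the crawl DB. A sketch (the `view`/`zoom` parameter names are taken from the examples above; adjust to your sites):

```xml
<!-- conf/regex-normalize.xml: strip presentation-only query parameters -->
<regex-normalize>
  <regex>
    <!-- remove view=... and zoom=... parameters, e.g.
         site1.com/article1/?zoom=large -> site1.com/article1/? -->
    <pattern>[?&amp;](view|zoom)=[^&amp;]*</pattern>
    <substitution></substitution>
  </regex>
</regex-normalize>
```

Normalization deduplicates the variants instead of just refusing to fetch them, which also keeps link scores pointing at one canonical URL.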

My current solution is to add additional regexes to regex-urlfilter.txt to
prevent the crawler from fetching the undesired pages. Now I have other
problems:
1. The regexes excluding undesired URL patterns are not exhaustive, since
there are many sites and many patterns, so the crawler is still going into
the deep web.
2. The exclusion filter list is getting too long: so far 50 regexes just to
exclude unwanted URL patterns.
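One way to keep the list short is to flip it around: a few accept rules for the known content patterns, then a blanket rejection of query URLs, so new junk patterns need no new rules. A quick way to sanity-check such a list offline is a small script that mimics regex-urlfilter's first-match-wins, +/- semantics (a sketch; the rules and URLs are the examples from this mail, not a real configuration):

```python
import re

# Rules in regex-urlfilter.txt order: the first matching rule decides.
# '+' means fetch the URL, '-' means skip it. Illustrative rules only.
RULES = [
    ("+", re.compile(r"^https?://site2\.com/\?pid=\d+$")),  # allow known content pattern
    ("-", re.compile(r"[?*!@=]")),                          # then skip other query URLs
    ("+", re.compile(r".")),                                # accept everything else
]

def passes(url):
    """Return True if the URL would be fetched under RULES."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # a URL matching no rule is rejected

if __name__ == "__main__":
    for url in [
        "http://site2.com/?pid=123",                        # wanted query URL
        "http://site1.com/article1",                        # plain article
        "http://site1.com/article1/?view=printerfriendly",  # unwanted variant
    ]:
        print(url, "->", "fetch" if passes(url) else "skip")
```

Running the candidate rule list against a sample of URLs from a previous crawl DB dump shows quickly whether an allow-first ordering covers the wanted pages without reopening the deep web.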

I hope I am not the only one with this problem and that someone knows a
smarter way to solve it. Does anybody have a solution or suggestion? Any tips
or direction would be very much appreciated.

Btw, I am using Nutch 1.2, but I believe the crawling principle is pretty much
the same across versions.

Warm Regards,

Ye

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Crawling-URLs-with-query-string-while-limiting-only-web-pages-tp4042381.html
Sent from the Nutch - User mailing list archive at Nabble.com.
