Hi Sergey,

The most profound problems or most interesting things we've encountered are:

- dealing with dynamic URLs such as calendars, also known as spider traps (a heuristic sketch follows below this list);
- detecting duplicates across sub-domains: many sites respond on www, ww, wwww or anything else, and you have to deal with it;
- normalization of URLs, highly important as it already prevents a lot of duplicates (a sketch follows below as well);
- various kinds of link analysis;
- detecting spam (link spam, content spam, various techniques);
- general crawler ethics;
- dynamic politeness: large sites can be crawled more intensively than small sites;
- deep and/or shallow crawling: is coverage or freshness more important?
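To make the spider-trap point concrete, here is a small standalone Java sketch of the kind of heuristics one can apply before fetching. The thresholds and the calendar pattern are invented for illustration; in Nutch you would put logic like this in a URLFilter plugin rather than a standalone class.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

/** Heuristic spider-trap filter sketch; all thresholds are illustrative. */
public class TrapHeuristics {

    private static final int MAX_PATH_SEGMENTS = 12; // assumed limit
    private static final Pattern CALENDAR_QUERY =
            Pattern.compile("(?i)[?&](year|month|date|day)=\\d+");

    public static boolean looksLikeTrap(String url) {
        // 1. Very deep paths rarely carry new content.
        String[] segments = url.replaceFirst("(?i)^[a-z]+://[^/]+", "")
                               .split("/");
        if (segments.length > MAX_PATH_SEGMENTS) {
            return true;
        }
        // 2. Path segments repeating themselves (/a/b/a/b/a/...) are a
        //    classic symptom of relative-link loops.
        Set<String> unique = new HashSet<>(Arrays.asList(segments));
        if (segments.length - unique.size() > 3) {
            return true;
        }
        // 3. Date-style query parameters can generate an infinite calendar.
        return CALENDAR_QUERY.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeTrap("http://example.com/cal?month=13&year=2012")); // true
        System.out.println(looksLikeTrap("http://example.com/about"));                  // false
    }
}

And here is a minimal sketch of URL normalization. Nutch ships its own pluggable URL normalizers, so this is only an illustration of the common rules: lowercase the scheme and host, drop default ports, resolve dot segments, and strip the fragment, which never reaches the server anyway.

import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Minimal URL normalizer sketch; Nutch's own normalizers do much more. */
public class SimpleUrlNormalizer {

    public static String normalize(String raw) throws URISyntaxException {
        URI uri = new URI(raw.trim()).normalize(); // resolves "." and ".." segments

        String scheme = uri.getScheme() == null
                ? "http" : uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost() == null
                ? "" : uri.getHost().toLowerCase(Locale.ROOT);

        // Drop default ports so http://example.com:80/ and
        // http://example.com/ collapse to the same key.
        int port = uri.getPort();
        if ((port == 80 && "http".equals(scheme))
                || (port == 443 && "https".equals(scheme))) {
            port = -1;
        }

        // An empty path becomes "/"; the fragment is dropped entirely.
        String path = (uri.getPath() == null || uri.getPath().isEmpty())
                ? "/" : uri.getPath();

        return new URI(scheme, null, host, port, path, uri.getQuery(), null)
                .toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(normalize("HTTP://Example.COM:80/a/./b/../c#frag"));
        // -> http://example.com/a/c
    }
}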
For me, Bing Liu's excellent book on web data mining [1] gives a lot of insights. The best thing is that the author provides a generous list of references to highly interesting papers that you can then find online. In my opinion this book is mandatory reading if one is serious about web crawling.

[1]: http://www.cs.uic.edu/~liub/WebMiningBook.html

Good luck!
Markus

On Wednesday 16 November 2011 01:51:20 Sergey A Volkov wrote:
> Hi!
>
> I am a postgraduate student at Saint Petersburg State University. I have
> been working with Nutch for about 3 years, wrote my graduate thesis
> based on it, and now I don't know what to do for my Ph.D. work. (Nobody
> in my department (System Programming) deals with web crawling.)
>
> I hope someone knows problems in web crawling whose solutions could help
> both the Nutch project and my future Ph.D. thesis. Any ideas?
>
> Thanks,
> Sergey.

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

