Thank you for your reply!

It looks like I should read this book first. I'll come back with my thoughts after that =)

Sergey.

On Wed 16 Nov 2011 04:11:54 PM MSK, Markus Jelsma wrote:
Hi Sergey,

The most profound problems and most interesting topics we've encountered are:
- dealing with dynamic URLs such as calendars, also known as spider traps;
- detecting duplicates across sub-domains — many sites respond on www, ww, wwww, or
anything else, and you have to deal with that;
- normalizing URLs, which is highly important since it already prevents a lot of
duplicates;
- various kinds of link analysis;
- detecting spam (link spam, content spam, various techniques);
- general crawler ethics;
- dynamic politeness: large sites can be crawled more intensively than small sites;
- deep versus shallow crawling: is coverage or freshness more important?
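To give a flavor of the URL normalization point above, here is a minimal, illustrative sketch in Python (this is not Nutch's actual urlnormalizer plugin chain; the function name and the particular set of rewrites are just assumptions for the example). It collapses the most common duplicate-producing variations: host case, default ports, query-parameter order, empty paths, and fragments.

```python
# Minimal URL normalizer sketch (illustrative only, not Nutch's
# urlnormalizer plugins). Each rewrite collapses a common source
# of duplicate URLs before they reach the fetch queue.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url):
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    netloc = netloc.lower()
    # Strip default ports (http:80, https:443).
    host, _, port = netloc.rpartition(":")
    if (scheme, port) in (("http", "80"), ("https", "443")):
        netloc = host
    # Sort query parameters so parameter order doesn't create "new" URLs.
    query = urlencode(sorted(parse_qsl(query, keep_blank_values=True)))
    # An empty path and "/" refer to the same resource.
    if path == "":
        path = "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, netloc, path, query, ""))
```

With rules like these, `HTTP://Example.COM:80/?b=2&a=1` and `http://example.com/?a=1&b=2#top` normalize to the same string, so the crawler fetches the page only once.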

For me, Bing Liu's excellent book on web data mining [1] gives a lot of
insights. The best thing is that the author provides a generous list of
references to highly interesting papers that you can then find online.

In my opinion this book is mandatory reading for anyone serious about web crawling.

[1]: http://www.cs.uic.edu/~liub/WebMiningBook.html

Good luck!
Markus

On Wednesday 16 November 2011 01:51:20 Sergey A Volkov wrote:
Hi!

I am a postgraduate student at Saint Petersburg State University. I have
been working with Nutch for about 3 years and wrote my graduate thesis
based on it, but now I don't know what to do for my Ph.D. work. (Nobody in
my department (System Programming) deals with web crawling.)

I hope someone knows of open problems in web crawling whose solutions could
help both the Nutch project and my future Ph.D. thesis. Any ideas?

Thanks,
   Sergey.
