Hi Sergey,

The most profound problems or most interesting things we've encountered are:

- dealing with dynamic URLs such as calendars, also known as spider traps (a heuristic sketch follows below this list);
- detecting duplicates across sub-domains: many sites respond on www, ww, wwww or anything else, and you have to deal with it;
- normalization of URLs, highly important as it already prevents a lot of duplicates (a sketch follows below as well);
- various kinds of link analysis;
- detecting spam (link spam, content spam, various techniques);
- general crawler ethics;
- dynamic politeness: large sites can be crawled more intensively than small sites;
- deep and/or shallow crawling: is coverage or freshness more important?
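To make the spider-trap point concrete, here is a small standalone Java sketch of the kind of heuristics one can apply before fetching. The thresholds and the calendar pattern are invented for illustration; in Nutch you would put logic like this in a URLFilter plugin rather than a standalone class.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

/** Heuristic spider-trap filter sketch; all thresholds are illustrative. */
public class TrapHeuristics {

    private static final int MAX_PATH_SEGMENTS = 12; // assumed limit
    private static final Pattern CALENDAR_QUERY =
            Pattern.compile("(?i)[?&](year|month|date|day)=\\d+");

    public static boolean looksLikeTrap(String url) {
        // 1. Very deep paths rarely carry new content.
        String[] segments = url.replaceFirst("(?i)^[a-z]+://[^/]+", "")
                               .split("/");
        if (segments.length > MAX_PATH_SEGMENTS) {
            return true;
        }
        // 2. Path segments repeating themselves (/a/b/a/b/a/...) are a
        //    classic symptom of relative-link loops.
        Set<String> unique = new HashSet<>(Arrays.asList(segments));
        if (segments.length - unique.size() > 3) {
            return true;
        }
        // 3. Date-style query parameters can generate an infinite calendar.
        return CALENDAR_QUERY.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeTrap("http://example.com/cal?month=13&year=2012")); // true
        System.out.println(looksLikeTrap("http://example.com/about"));                  // false
    }
}

And here is a minimal sketch of URL normalization. Nutch ships its own pluggable URL normalizers, so this is only an illustration of the common rules: lowercase the scheme and host, drop default ports, resolve dot segments, and strip the fragment, which never reaches the server anyway.

import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Minimal URL normalizer sketch; Nutch's own normalizers do much more. */
public class SimpleUrlNormalizer {

    public static String normalize(String raw) throws URISyntaxException {
        URI uri = new URI(raw.trim()).normalize(); // resolves "." and ".." segments

        String scheme = uri.getScheme() == null
                ? "http" : uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost() == null
                ? "" : uri.getHost().toLowerCase(Locale.ROOT);

        // Drop default ports so http://example.com:80/ and
        // http://example.com/ collapse to the same key.
        int port = uri.getPort();
        if ((port == 80 && "http".equals(scheme))
                || (port == 443 && "https".equals(scheme))) {
            port = -1;
        }

        // An empty path becomes "/"; the fragment is dropped entirely.
        String path = (uri.getPath() == null || uri.getPath().isEmpty())
                ? "/" : uri.getPath();

        return new URI(scheme, null, host, port, path, uri.getQuery(), null)
                .toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(normalize("HTTP://Example.COM:80/a/./b/../c#frag"));
        // -> http://example.com/a/c
    }
}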
For me, Bing Liu's excellent book on web data mining [1] gives a lot of insights. The best thing is that the author provides a generous list of references to highly interesting papers that you can then find online. In my opinion this book is mandatory reading if one is serious about web crawling.

[1]: http://www.cs.uic.edu/~liub/WebMiningBook.html

Good luck!
Markus

On Wednesday 16 November 2011 01:51:20 Sergey A Volkov wrote:
> Hi!
>
> I am a postgraduate student at Saint Petersburg State University. I have
> been working with Nutch for about 3 years, wrote my graduate thesis
> based on it, and now I don't know what to do for my Ph.D. work. (Nobody
> in my department (System Programming) deals with web crawling.)
>
> I hope someone knows problems in web crawling whose solutions could help
> both the Nutch project and my future Ph.D. thesis. Any ideas?
>
> Thanks,
> Sergey.

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

