Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by KenKrugler:
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp

The comment on the change is:
List of potential discussion topics for ApacheCon US 2009 MeetUp

New page:
We're planning to have a "Web Crawler Developer" !MeetUp at this year's ApacheCon US in Oakland. The tentative plan is for Thursday evening, November 5th. The actual schedule for !MeetUps is [http://wiki.apache.org/apachecon/ApacheMeetupsUs09 here].

Below are some potential topics for discussion - feel free to add/comment.

 * Potential synergies between crawler projects - e.g. sharing robots.txt processing code.
 * How to avoid end-user abuse - webmasters sometimes block crawlers because users configure them to be impolite.
 * Politeness vs. efficiency - various options for how to be considered polite, while still crawling quickly.
 * robots.txt processing - current problems with existing implementations.
 * Avoiding crawler traps - link farms, honeypots, etc.
 * Parsing content - home-grown, Neko/TagSoup, Tika, screen scraping.
 * Search infrastructure - options for serving up crawl results (Nutch, Solr, Katta, others?).
 * Testing challenges - is it possible to unit test a crawler?
 * Fuzzy classification - MIME type, charset, language.
 * The future of Nutch, Droids, Heritrix, Bixo, etc.
 * Optimizing for types of crawling - intranet, focused, whole web.