Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by KenKrugler:
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp

The comment on the change is:
List of potential discussion topics for ApacheCon US 2009 MeetUp

New page:
We're planning to have a "Web Crawler Developer" !MeetUp at this year's 
ApacheCon US in Oakland.

The tentative plan is Thursday evening, November 5th. The actual schedule for 
!MeetUps is [http://wiki.apache.org/apachecon/ApacheMeetupsUs09 here].

Below are some potential topics for discussion - feel free to add/comment.

* Potential synergies between crawler projects - e.g. sharing robots.txt 
processing code.
* How to avoid end-user abuse - webmasters sometimes block crawlers because 
users configure them to be impolite.
* Politeness vs. efficiency - various options for how to be considered polite, 
while still crawling quickly.
* robots.txt processing - current problems with existing implementations.
* Avoiding crawler traps - link farms, honeypots, etc.
* Parsing content - home-grown, Neko/TagSoup, Tika, screen scraping.
* Search infrastructure - options for serving up crawl results (Nutch, Solr, 
Katta, others?)
* Testing challenges - is it possible to unit test a crawler?
* Fuzzy classification - MIME type, charset, language.
* The future of Nutch, Droids, Heritrix, Bixo, etc.
* Optimizing for types of crawling - intranet, focused, whole web.
