Free live video streaming of ApacheCon US 2009
Team,

For those Lucene fanatics not in Oakland this week for ApacheCon US, don't miss the FREE live video streaming, starting today:

http://streaming.linux-magazin.de/en/program-apachecon-us-2009.htm

Note that there are many talks available, covering Apache Hadoop, Apache HTTPD, Lucene, as well as the Apache Pioneer's Panel and keynote presentations.

Lucene's track is this Friday (NOTE these times are UTC -- use http://www.timeanddate.com to map to your time zone):

17:00 Implementing an Information Retrieval Framework for an Organizational Repository, Sithu D Sudarsan
18:00 Apache Mahout - Going from raw data to information, Isabel Drost
19:15 MIME Magic with Apache Tika, Jukka Zitting
20:15 Keynote: How Open Source Developers Can (Still!) Save The World, Brian Behlendorf
22:00 Building Intelligent Search Applications with the Lucene Ecosystem, Ted Dunning
23:00 Realtime Search, Jason Rutherglen

Happy viewing,
Mike
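For anyone who would rather script the UTC conversion than look it up, here is a minimal Python sketch. It assumes "this Friday" means 2009-11-06 (inferred from the conference week, not stated in the announcement) and uses a fixed UTC offset instead of a named time zone, so daylight-saving rules are not handled. The names `TRACK_DATE`, `to_local`, and `local_schedule` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the Lucene track day is Friday, 2009-11-06.
TRACK_DATE = (2009, 11, 6)

# Start times from the announcement, all in UTC.
SCHEDULE_UTC = ["17:00", "18:00", "19:15", "20:15", "22:00", "23:00"]


def to_local(hhmm, utc_offset_hours):
    """Convert one 'HH:MM' UTC start time to a fixed-offset local datetime."""
    hour, minute = map(int, hhmm.split(":"))
    utc_time = datetime(*TRACK_DATE, hour, minute, tzinfo=timezone.utc)
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return utc_time.astimezone(local_zone)


def local_schedule(utc_offset_hours):
    """Return every start time as a 'YYYY-MM-DD HH:MM' string in local time."""
    return [
        to_local(t, utc_offset_hours).strftime("%Y-%m-%d %H:%M")
        for t in SCHEDULE_UTC
    ]


if __name__ == "__main__":
    # -8 is Pacific Standard Time, the offset in Oakland in early November.
    for start in local_schedule(-8):
        print(start)
```

With an offset of -8 the first talk starts at 09:00 local time. Viewers east of UTC should note that the last talks can roll over into Saturday.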
[Nutch Wiki] Update of ApacheConUs2009MeetUp by KenKrugler
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The ApacheConUs2009MeetUp page has been changed by KenKrugler.
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=5&rev2=6

--
- We were planning to have a Web Crawler Developer !MeetUp at this year's [[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland.
+ We had a Web Crawler Developer !MeetUp at this year's [[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland.
- Unfortunately the only time slot where people would be around was Thursday night, which wound up conflicting with the Hadoop !MeetUp.
+ It wound up being an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm.
- So we're going to have an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm. Location is TBD, hopefully we can get some space at the event but might be a lunch meeting :)
+ == Attendees ==
+
+ * Andrzej Bialeki - Apache Nutch
+ * Thorsten xxx - Apache Droids
+ * Michael Stack - Formerly with Heritrix, now HBase
+ * Ken Krugler - Bixo
+
+ == Topics ==
+
+ === Roadmaps ===
+
+ Nutch - become more component based.
+ Droids - get more people involved.
+
+ === Sharable Components ===
+
+ * robots.txt parsing
+ * URL normalization
+ * URL filtering
+ * Page cleansing
+ * General purpose
+ * Specialized
+ * Sub-page parsing (portlets)
+ * AJAX-ish page interactions
+ * Document parsing (via Tika)
+ * HttpClient (configuration)
+ * Text similarity
+ * Mime/charset/language detection
+
+ === Tika ===
+
+ * Needs help to become really usable
+ * Would benefit from large test corpus
+ * Could do comparison with Nutch parser
+ * Needs option for direct DOM querying (screen scraping tasks)
+ * Handles mime charset detection now (some issues)
+ * Could be extended to include language detection (wrap other impl)
+
+ === URL Normalization ===
+
+ * Includes both domain (www.x.com == x.com), path, and query portions of URL
+ * Often site-specific rules
+ * Option to derive rules using URLs to similar documents.
+
+ === AJAX-ish Page Interaction ===
+
+ * Not applicable for broad/general crawling
+ * Can be very important for specific web sites
+ * Use Selenium or headless Mozilla
+
+ === Component API Issues ===
+
+ * Want to avoid using an API that's tied too closely to any implementation.
+ * One option is to have simple (e.g. URL param) API that takes meta-data.
+ * Similar to Tika passing in of meta-data.
+
+ === Hosting Options ===
+
+ * As part of Nutch - but easy to get lost in Nutch codebase, and can be associated too closely with Nutch.
+ * As part of Droids - but Droids is both a framework (queue-based) and set of components.
+ * New sub-project under Lucene TLP - but overhead to set up/maintain, and then confusion between it and Droids.
+ * Google code - seems like a good short-term solution, to judge level of interest and help shake out issues.
+
+ == Next Steps ==
+
+ * Get input from Gordon re Heritrix. Stack to follow up with him. Ideally he'd add his comments to this page.
+ * Get input from Thorsten on Google code option. If OK as starting point, then Andrzej to set up.
+ * Make decision about build system (and then move on to code formatting debate :))
+ * I'm going to propose ant + maven ant tasks for dependency management. I'm using this with Bixo, and so far it's been pretty good.
+ * Start contributing code
+ * Ken will put in robots.txt parser.
+
+ == Original Discussion Topic List ==
  Below are some potential topics for discussion - feel free to add/comment.
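The URL normalization notes above (domain, path, and query portions of the URL) could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not code from Nutch, Droids, Heritrix, or Bixo: the function name `normalize_url` and the `strip_www` flag are hypothetical, treating www.x.com as x.com is not safe for every site (the notes say rules are often site-specific), and the sketch drops the fragment and any user-info in the authority.

```python
import posixpath
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Ports that can be omitted because they are the default for the scheme.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url, strip_www=True):
    """Return a canonical form of url (hypothetical helper, see assumptions)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()

    # Domain portion: lower-case the host and optionally fold www.x.com to x.com.
    host = (parts.hostname or "").lower()
    if strip_www and host.startswith("www."):
        host = host[len("www."):]
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"

    # Path portion: resolve "." and ".." segments, keeping a trailing slash.
    if parts.path:
        path = posixpath.normpath(parts.path)
        if parts.path.endswith("/") and path != "/":
            path += "/"
    else:
        path = "/"

    # Query portion: sort parameters so their order does not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))

    # The fragment is never sent to the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, query, ""))
```

For example, `normalize_url("HTTP://WWW.X.com:80/a/./b/../c?b=2&a=1#frag")` returns `http://x.com/a/c?a=1&b=2`. A shared component would probably take per-site rule sets rather than a single flag.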
[Nutch Wiki] Update of ApacheConUs2009MeetUp by KenKrugler
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The ApacheConUs2009MeetUp page has been changed by KenKrugler.
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=6&rev2=7

--
  == Attendees ==
  * Andrzej Bialeki - Apache Nutch
- * Thorsten xxx - Apache Droids
+ * Thorsten Sherler - Apache Droids
  * Michael Stack - Formerly with Heritrix, now HBase
  * Ken Krugler - Bixo
[Nutch Wiki] Update of ApacheConUs2009MeetUp by KenKrugler
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The ApacheConUs2009MeetUp page has been changed by KenKrugler.
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=7&rev2=8

--
  We had a Web Crawler Developer !MeetUp at this year's [[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland.
  It wound up being an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm.
+
+
-
  == Attendees ==
@@ -15, +17 @@
  === Roadmaps ===
- Nutch - become more component based.
+ * Nutch - become more component based.
- Droids - get more people involved.
+ * Droids - get more people involved.
  === Sharable Components ===
@@ -76, +78 @@
  * Start contributing code
  * Ken will put in robots.txt parser.
+
-
+
  == Original Discussion Topic List ==
  Below are some potential topics for discussion - feel free to add/comment.
[Nutch Wiki] Update of ApacheConUs2009MeetUp by AndrzejBialecki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The ApacheConUs2009MeetUp page has been changed by AndrzejBialecki.
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=8&rev2=9

--
  == Attendees ==
- * Andrzej Bialeki - Apache Nutch
+ * Andrzej Bialecki - Apache Nutch
  * Thorsten Sherler - Apache Droids
  * Michael Stack - Formerly with Heritrix, now HBase
  * Ken Krugler - Bixo
Re: Free live video streaming of ApacheCon US 2009
Thanks a lot. This will be very helpful to me, as I am not able to attend.

On Wed, Nov 4, 2009 at 8:25 AM, Michael McCandless luc...@mikemccandless.com wrote:

Team,

For those Lucene fanatics not in Oakland this week for ApacheCon US, don't miss the FREE live video streaming, starting today:

http://streaming.linux-magazin.de/en/program-apachecon-us-2009.htm

Note that there are many talks available, covering Apache Hadoop, Apache HTTPD, Lucene, as well as the Apache Pioneer's Panel and keynote presentations.

Lucene's track is this Friday (NOTE these times are UTC -- use http://www.timeanddate.com to map to your time zone):

17:00 Implementing an Information Retrieval Framework for an Organizational Repository, Sithu D Sudarsan
18:00 Apache Mahout - Going from raw data to information, Isabel Drost
19:15 MIME Magic with Apache Tika, Jukka Zitting
20:15 Keynote: How Open Source Developers Can (Still!) Save The World, Brian Behlendorf
22:00 Building Intelligent Search Applications with the Lucene Ecosystem, Ted Dunning
23:00 Realtime Search, Jason Rutherglen

Happy viewing,
Mike

--
Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once.