Free live video streaming of ApacheCon US 2009

2009-11-04 Thread Michael McCandless
Team,

For those Lucene fanatics not in Oakland this week for ApacheCon US,
don't miss the FREE live video streaming, starting today:

  http://streaming.linux-magazin.de/en/program-apachecon-us-2009.htm

Note that there are many talks available, covering Apache Hadoop,
Apache HTTPD, Lucene, as well as the Apache Pioneer's Panel and
keynote presentations.

Lucene's track is this Friday (NOTE these times are UTC -- use
http://www.timeanddate.com to map to your time zone):

 17:00 Implementing an Information Retrieval Framework for an
   Organizational Repository, Sithu D Sudarsan

 18:00 Apache Mahout - Going from raw data to information
   Isabel Drost

 19:15 MIME Magic with Apache Tika
   Jukka Zitting

 20:15 Keynote: How Open Source Developers Can (Still!) Save The World
   Brian Behlendorf

 22:00 Building Intelligent Search Applications with the Lucene
   Ecosystem, Ted Dunning

 23:00 Realtime Search
   Jason Rutherglen

Happy viewing,

Mike
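
For anyone who would rather script the time zone mapping than look it up, here is a minimal Python sketch. It assumes "this Friday" means 2009-11-06 (inferred from the 2009-11-04 posting date) and uses fixed UTC offsets rather than a time zone database; the `to_local` helper and the two example offsets are illustrative only, not part of the announcement.

```python
from datetime import datetime, timedelta, timezone

# Start times of the Lucene track as listed above, in UTC.
TALKS_UTC = ["17:00", "18:00", "19:15", "20:15", "22:00", "23:00"]


def to_local(hhmm, utc_offset_hours):
    """Convert an HH:MM UTC start time to a fixed-offset local time.

    Assumption: the track runs on Friday 2009-11-06.
    """
    hour, minute = map(int, hhmm.split(":"))
    utc_time = datetime(2009, 11, 6, hour, minute, tzinfo=timezone.utc)
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return utc_time.astimezone(local_zone)


if __name__ == "__main__":
    # Assumed example offsets: -8 (US Pacific in November), +1 (Central Europe).
    for start in TALKS_UTC:
        pacific = to_local(start, -8).strftime("%a %H:%M")
        central_europe = to_local(start, 1).strftime("%a %H:%M")
        print(f"{start} UTC | {pacific} (UTC-8) | {central_europe} (UTC+1)")
```

Fixed offsets ignore daylight saving rules, so check the offset that applies to your location on that date.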


[Nutch Wiki] Update of ApacheConUs2009MeetUp by KenKrugler

2009-11-04 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The ApacheConUs2009MeetUp page has been changed by KenKrugler.
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=5&rev2=6

--

- We were planning to have a Web Crawler Developer !MeetUp at this year's 
[[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland.
+ We had a Web Crawler Developer !MeetUp at this year's 
[[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland.
  
- Unfortunately the only time slot where people would be around was Thursday 
night, which wound up conflicting with the Hadoop !MeetUp.
+ It wound up being an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 
11am - 1pm. 
  
- So we're going to have an !UnMeetUp (!MeetDown?) on Wednesday, November 4th 
from 11am - 1pm. Location is TBD, hopefully we can get some space at the event 
but might be a lunch meeting :)
+ == Attendees ==
+ 
+  * Andrzej Bialeki - Apache Nutch
+  * Thorsten xxx - Apache Droids
+  * Michael Stack - Formerly with Heritrix, now HBase
+  * Ken Krugler - Bixo
+ 
+ == Topics ==
+ 
+ === Roadmaps ===
+ 
+ Nutch - become more component based.
+ Droids - get more people involved.
+ 
+ === Sharable Components ===
+ 
+  * robots.txt parsing
+  * URL normalization
+  * URL filtering
+  * Page cleansing
+   * General purpose
+   * Specialized
+  * Sub-page parsing (portlets)
+  * AJAX-ish page interactions
+  * Document parsing (via Tika)
+  * HttpClient (configuration)
+  * Text similarity
+  * Mime/charset/language detection
+ 
+ === Tika ===
+ 
+  * Needs help to become really usable
+  * Would benefit from large test corpus
+  * Could do comparison with Nutch parser
+  * Needs option for direct DOM querying (screen scraping tasks)
+  * Handles mime & charset detection now (some issues)
+  * Could be extended to include language detection (wrap other impl)
+ 
+ === URL Normalization ===
+ 
+  * Includes both domain (www.x.com == x.com), path, and query portions of URL
+  * Often site-specific rules
+   * Option to derive rules using URLs to similar documents.
+ 
+ === AJAX-ish Page Interaction ===
+ 
+  * Not applicable for broad/general crawling
+  * Can be very important for specific web sites
+  * Use Selenium or headless Mozilla
+ 
+ === Component API Issues ===
+ 
+  * Want to avoid using an API that's tied too closely to any implementation.
+  * One option is to have simple (e.g. URL param) API that takes meta-data.
+   * Similar to Tika passing in of meta-data.
+ 
+ === Hosting Options ===
+ 
+  * As part of Nutch - but easy to get lost in Nutch codebase, and can be 
associated too closely with Nutch.
+  * As part of Droids - but Droids is both a framework (queue-based) and set 
of components.
+  * New sub-project under Lucene TLP - but overhead to set up/maintain, and 
then confusion between it and Droids.
+  * Google code - seems like a good short-term solution, to judge level of 
interest and help shake out issues.
+ 
+ == Next Steps ==
+ 
+  * Get input from Gordon re Heritrix. Stack to follow up with him. Ideally 
he'd add his comments to this page.
+  * Get input from Thorsten on Google code option. If OK as starting point, 
then Andrzej to set up.
+  * Make decision about build system (and then move on to code formatting 
debate :))
+   * I'm going to propose ant + maven ant tasks for dependency management. I'm 
using this with Bixo, and so far it's been pretty good.
+  * Start contributing code
+   * Ken will put in robots.txt parser.
+ 
+ == Original Discussion Topic List ==
  
  Below are some potential topics for discussion - feel free to add/comment.
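
As an illustration of the URL normalization notes in the diff above (domain, path, and query portions), here is a rough Python sketch using only the standard library. It is not code from Nutch, Droids, or Bixo. The `normalize_url` function, the `strip_www` switch, and the `sessionid` parameter name are hypothetical, and, as the notes say, real rules are often site-specific.

```python
import posixpath
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url, strip_www=True, drop_params=("sessionid",)):
    """Return a canonical form of url covering domain, path, and query.

    Hypothetical generic rules; a crawler would layer site-specific
    rules on top of something like this.
    """
    parts = urlsplit(url)

    # Domain: lower-case the host and optionally treat www.x.com as x.com.
    host = (parts.hostname or "").lower()
    if strip_www and host.startswith("www."):
        host = host[len("www."):]
    netloc = host
    if parts.port and parts.port != DEFAULT_PORTS.get(parts.scheme):
        netloc = f"{host}:{parts.port}"

    # Path: resolve "." and ".." segments and collapse repeated slashes,
    # keeping a trailing slash if the original had one.
    path = posixpath.normpath("/" + parts.path.lstrip("/"))
    if parts.path.endswith("/") and path != "/":
        path += "/"

    # Query: drop unwanted parameters and sort the rest for a stable order.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in drop_params
    ]
    query = urlencode(sorted(pairs))

    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme, netloc, path, query, ""))


if __name__ == "__main__":
    print(normalize_url("http://WWW.X.com:80/a/./b/../c?b=2&a=1&sessionid=xyz#top"))
```

Stripping `www.` is only safe for sites where both hosts serve the same content, which is one reason a keyword switch is used here instead of a hard-coded rule.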
  


[Nutch Wiki] Update of ApacheConUs2009MeetUp by KenKrugler

2009-11-04 Thread Apache Wiki
The ApacheConUs2009MeetUp page has been changed by KenKrugler.
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=6&rev2=7

--

  == Attendees ==
  
   * Andrzej Bialeki - Apache Nutch
-  * Thorsten xxx - Apache Droids
+  * Thorsten Sherler - Apache Droids
   * Michael Stack - Formerly with Heritrix, now HBase
   * Ken Krugler - Bixo
  


[Nutch Wiki] Update of ApacheConUs2009MeetUp by KenKrugler

2009-11-04 Thread Apache Wiki
The ApacheConUs2009MeetUp page has been changed by KenKrugler.
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=7&rev2=8

--

  We had a Web Crawler Developer !MeetUp at this year's 
[[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland.
  
  It wound up being an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 
11am - 1pm. 
+ 
+ -
  
  == Attendees ==
  
@@ -15, +17 @@

  
  === Roadmaps ===
  
- Nutch - become more component based.
+  * Nutch - become more component based.
- Droids - get more people involved.
+  * Droids - get more people involved.
  
  === Sharable Components ===
  
@@ -76, +78 @@

   * Start contributing code
    * Ken will put in robots.txt parser.
  
+ -
+ 
  == Original Discussion Topic List ==
  
  Below are some potential topics for discussion - feel free to add/comment.


[Nutch Wiki] Update of ApacheConUs2009MeetUp by AndrzejBialecki

2009-11-04 Thread Apache Wiki
The ApacheConUs2009MeetUp page has been changed by AndrzejBialecki.
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=8&rev2=9

--

  
  == Attendees ==
  
-  * Andrzej Bialeki - Apache Nutch
+  * Andrzej Bialecki - Apache Nutch
   * Thorsten Sherler - Apache Droids
   * Michael Stack - Formerly with Heritrix, now HBase
   * Ken Krugler - Bixo


Re: Free live video streaming of ApacheCon US 2009

2009-11-04 Thread Israel Ekpo
Thanks a lot.

This will be very helpful to me, as I am not able to attend.

On Wed, Nov 4, 2009 at 8:25 AM, Michael McCandless 
luc...@mikemccandless.com wrote:

> Team,
>
> For those Lucene fanatics not in Oakland this week for ApacheCon US,
> don't miss the FREE live video streaming, starting today:
>
>  http://streaming.linux-magazin.de/en/program-apachecon-us-2009.htm
>
> Note that there are many talks available, covering Apache Hadoop,
> Apache HTTPD, Lucene, as well as the Apache Pioneer's Panel and
> keynote presentations.
>
> Lucene's track is this Friday (NOTE these times are UTC -- use
> http://www.timeanddate.com to map to your time zone):
>
>  17:00 Implementing an Information Retrieval Framework for an
>   Organizational Repository, Sithu D Sudarsan
>
>  18:00 Apache Mahout - Going from raw data to information
>   Isabel Drost
>
>  19:15 MIME Magic with Apache Tika
>   Jukka Zitting
>
>  20:15 Keynote: How Open Source Developers Can (Still!) Save The World
>   Brian Behlendorf
>
>  22:00 Building Intelligent Search Applications with the Lucene
>   Ecosystem, Ted Dunning
>
>  23:00 Realtime Search
>   Jason Rutherglen
>
> Happy viewing,
>
> Mike



