Hi Guys,

Here is my draft of the report. Let me know if you guys concur, and I'll add it to the wiki:

<report>
Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Tika entered incubation on March 22nd, 2007.

Community

There have been a number of positive developments within Tika during the last few months. Traffic on the Tika mailing list has increased significantly (typically 2 or 3 questions and 1 or 2 commits every day or every other day), and there have been many recent inquiries from external projects wanting to collaborate with Tika (including Aperture, PDFBox and a developer working on a JSON library currently hosted at Google Code). In addition, Tika's architecture has recently become a topic of discussion (as we'll see below).

We recently elected Keith Bennett as a new committer to Tika. Keith has been spearheading many of the new patches committed to Tika, as well as participating in discussions about the architecture and future direction of the project.

Tika will be represented at the "Fast Feather" track at ApacheCon US by Jukka Zitting. The rest of the community is helping to create the content for the presentation. The abstract is listed below:

-----
Tika is a new content analysis framework born from the desire to factor out commonality from the Apache Nutch search engine framework. Tika provides a MIME detection framework, an extensible parsing framework and a metadata environment for content analysis. Though Tika is in its nascent stages, progress has recently taken shape and the project is nearing a stable 0.1 release. In this talk, we'll describe the core APIs of Tika and discuss its use in several distinct domains including search engines, scientific data dissemination and an industrial setting.
-----

Development

There has been a flurry of JIRA issues and code activity [1], including 47 issues currently in JIRA, with 32 resolved issues, 14 closed issues, and 2 open major/minor issues in progress.
Tika's Parser interface (one of its key components) has just undergone a major overhaul led by Jukka Zitting, and Chris Mattmann has recently contributed a MimeType system to Tika (with help from fellow Apache Nutch committer Jerome Charron). We also cleaned up and refactored large parts of the rest of the code (removing references to LuisLite and branding the project with the Tika name wherever possible) in preparation for an upcoming 0.1 release.

Chris Mattmann has led an effort to carve out the existing MimeType detection system in Apache Nutch [2] and replace it with Tika's improved MimeType detection system. There is a patch sitting in JIRA right now [3], and barring objections, Nutch will rely on Tika for its MimeType detection.

Also active recently were committers Bertrand Delacretaz, Sami Siren and Rida Benjelloun, who committed patches and improvements wherever needed.

Issues before graduation

No changes since our last report: the Tika project is still at an early stage of incubation. We need to continue bringing in the initial codebases and are targeting an initial incubating release (0.1), probably within the next month. We also need to work on growing the community and figuring out how best to interact with external parser projects.

[1] http://issues.apache.org/jira/browse/TIKA
[2] http://lucene.apache.org/nutch/
[3] http://issues.apache.org/jira/browse/NUTCH-562
</report>

Let me know what you guys think. Thanks to Bertrand for his original report, which inspired mine ;)

Cheers,
Chris

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory
Pasadena, CA
Office: 171-266B
Mailstop: 171-246
_______________________________________________________

Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
