Julien Nioche wrote:

And I have several questions as well:
1. Should I define formats such as pdf and msword when tika is set? Or is
the tika setting sufficient?


Tika handles the pdf and msword formats, so you should not need to specify
them in plugin.includes.

OK, thanks for clarifying.

2. Do I have to enable all document formats in parse-plugins.xml? A lot of
document formats are disabled. Even though I tried to enable them, e.g.
<mimeType name="application/msword">, it still does not parse MS Word.


Since you've added parse-tika to plugin.includes, it will be used by default
on all mime-types, which is why we don't have an explicit association for all
mimeTypes in parse-plugins.xml.

OK, then I will leave the parse-plugins.xml file untouched.

Here are some lines from the hadoop.log file regarding MS Word:

2010-10-08 10:29:03,925 INFO  fetcher.Fetcher - fetching http://ridder.uio.no/wtest2.doc
2010-10-08 10:29:04,012 WARN  parse.ParseUtil - Unable to successfully parse content http://ridder.uio.no/wtest2.doc of type application/x-tika-msoffice
2010-10-08 10:29:04,013 WARN  fetcher.Fetcher - Error parsing: http://ridder.uio.no/wtest2.doc: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
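
To pull warnings like these out of a longer hadoop.log, a small shell helper may be enough. This is only a sketch: the function name `parse_warnings` is made up for illustration, and it assumes that every entry occupies a single line in the log file and follows the `WARN  <logger> - <message>` layout seen in the excerpt.

```shell
# Hypothetical helper (not part of Nutch): print the WARN entries
# written by parse.ParseUtil or fetcher.Fetcher.
# Assumes one log entry per line, formatted as in the excerpt.
parse_warnings() {
    # $1: path to the log file
    grep -E 'WARN +(parse\.ParseUtil|fetcher\.Fetcher) - ' "$1"
}
```

It would be called as `parse_warnings logs/hadoop.log`, where the `logs/` directory is an assumption about where the log file lives.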


Strange, I can't reproduce the issue with nutch-1.2. Can you try running the
command below?

bin/nutch org.apache.nutch.parse.ParserChecker http://ridder.uio.no/wtest2.doc
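
To check several documents in one go, the same command can be wrapped in a loop. A minimal sketch, not something shipped with Nutch: `check_urls` and the `NUTCH` variable are hypothetical names, and the filter assumes that ParserChecker prints a line starting with `Status:` for each URL.

```shell
# Hypothetical wrapper: run ParserChecker on each URL and print the
# URL followed by its "Status:" line.
# NUTCH is an assumed variable naming the launcher; defaults to bin/nutch.
check_urls() {
    for url in "$@"; do
        printf '%s\t' "$url"
        # Keep only the Status line; print a placeholder if none is found.
        "${NUTCH:-bin/nutch}" org.apache.nutch.parse.ParserChecker "$url" 2>/dev/null \
            | grep '^Status:' || echo 'Status: (no status line found)'
    done
}
```

For example, `check_urls http://ridder.uio.no/wtest2.doc` run from the Nutch directory would print one tab-separated line for that document.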

I guess I have found the source of some of my problems. I followed the Eclipse tutorial which is probably outdated:
http://wiki.apache.org/nutch/RunNutchInEclipse1.0

Probably because I did the following:
"Change the property "plugin.folders" to "./src/plugin" on $NUTCH_HOME/conf/nutch-default.xml"

After I changed the nutch-default.xml file back to its original form, I'm able to parse html, pdf, xls and xlsx (by using the command line instead of running Nutch inside Eclipse). But do you know how to parse MS Office, RTF and ODT?
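
A quick way to confirm what "plugin.folders" is currently set to is to print the property straight from the config file. A minimal sketch: `show_property` is a hypothetical helper, and it assumes the `<value>` element sits on the line directly after the `<name>` element, one element per line.

```shell
# Hypothetical helper: print a property's <name> line and the line after it.
# Assumes <value> directly follows <name> in the file.
show_property() {
    # $1: config file, $2: property name (matched literally, not as a regex)
    grep -F -A 1 "<name>$2</name>" "$1"
}
```

Called as `show_property conf/nutch-default.xml plugin.folders`, it should no longer show "./src/plugin" once the file has been restored.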

BTW, here's the output:
hoppalong:apache-nutch-1.2 erlendfg$ bin/nutch org.apache.nutch.parse.ParserChecker http://ridder.uio.no/wtest2.doc
---------
Url
---------------
http://ridder.uio.no/wtest2.doc---------
ParseData
---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata: ETag="20064f-5600-47d0d1f0f8400" Date=Fri, 08 Oct 2010 11:26:38 GMT Content-Length=22016 Last-Modified=Wed, 13 Jan 2010 15:06:56 GMT Content-Type=application/msword Connection=close Accept-Ranges=bytes Server=Apache/2.2.16 (Unix) mod_ssl/2.2.16 OpenSSL/0.9.8j DAV/2 PHP/5.2.14 mod_perl/2.0.4 Perl/v5.8.8
Parse Metadata: Revision-Number=2 Last-Author=Erlend Gar?sen Template=Normal.dotm subject= Page-Count=1 Application-Name=Microsoft Macintosh Word Author=Erlend Gar?sen Edit-Time=600000000 Creation-Date=Wed Jan 13 15:58:00 CET 2010 Company=Universitetet i Oslo Content-Type=application/msword Keywords= Last-Save-Date=Wed Jan 13 15:58:00 CET 2010

Erlend


--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
