Julien Nioche wrote:
And I have several questions as well:
1. Should I define formats such as pdf and msword when tika is set? Or is
the tika setting sufficient?
Tika handles the pdf and msword formats, so you should not need to specify
them in plugin.includes.
OK, thanks for clarifying.
2. Do I have to enable all document formats in parse-plugins.xml? A lot of
document formats are disabled. Even though I tried to enable them, e.g.
<mimeType name="application/msword">, it still does not parse MS Word.
Since you've added parse-tika to plugin.includes, it will be used by default
on all mime-types, which is why we don't have an explicit association for
every mimeType in parse-plugins.xml.
OK, then I will leave the parse-plugins.xml file untouched.
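
To illustrate the two answers above: as far as I understand it, plugin.includes is a regular expression that is matched against plugin ids, so listing parse-tika is enough and no separate pdf or msword entry is required. The sketch below mimics that matching in Python. The expression in PLUGIN_INCLUDES is an illustrative assumption, not the value from any particular Nutch release, and is_included is a hypothetical helper, not part of Nutch.

```python
import re

# Illustrative plugin.includes value (an assumption, not copied from a
# Nutch release). parse-tika is listed; no format-specific parser is.
PLUGIN_INCLUDES = r"protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"


def is_included(plugin_id: str, includes: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin id matches the includes expression."""
    return re.fullmatch(includes, plugin_id) is not None


# Show which parser plugins the expression would activate.
for plugin_id in ("parse-tika", "parse-html", "parse-pdf", "parse-msword"):
    print(plugin_id, is_included(plugin_id))
```

With this expression only parse-tika and parse-html are activated, and Tika is then expected to pick up the PDF and Word documents on its own.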
Here are some lines from the hadoop.log file regarding MS Word:
2010-10-08 10:29:03,925 INFO fetcher.Fetcher - fetching
http://ridder.uio.no/wtest2.doc
2010-10-08 10:29:04,012 WARN parse.ParseUtil - Unable to successfully
parse content http://ridder.uio.no/wtest2.doc of type
application/x-tika-msoffice
2010-10-08 10:29:04,013 WARN fetcher.Fetcher - Error parsing:
http://ridder.uio.no/wtest2.doc: failed(2,200):
org.apache.nutch.parse.ParseException: Unable to successfully parse content
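
When many documents are crawled, it can help to pull every such warning out of hadoop.log in one go. The following is a minimal sketch, assuming each log entry sits on a single line in the real file (the entries above are wrapped by the mail client). PARSE_FAILURE and find_parse_failures are hypothetical names introduced here for illustration.

```python
import re

# Pattern for the ParseUtil warning, assuming one log entry per line.
PARSE_FAILURE = re.compile(
    r"WARN\s+parse\.ParseUtil - Unable to successfully parse content "
    r"(?P<url>\S+) of type (?P<mime>\S+)"
)


def find_parse_failures(lines):
    """Return (url, mime_type) pairs for every ParseUtil parse failure."""
    failures = []
    for line in lines:
        match = PARSE_FAILURE.search(line)
        if match:
            failures.append((match.group("url"), match.group("mime")))
    return failures


# Two entries taken from the log excerpt, joined back onto single lines.
sample = [
    "2010-10-08 10:29:03,925 INFO fetcher.Fetcher - fetching "
    "http://ridder.uio.no/wtest2.doc",
    "2010-10-08 10:29:04,012 WARN parse.ParseUtil - Unable to successfully "
    "parse content http://ridder.uio.no/wtest2.doc of type "
    "application/x-tika-msoffice",
]
print(find_parse_failures(sample))
```

To run it against a real log, pass an open file object instead of the sample list, for example `find_parse_failures(open("logs/hadoop.log"))`, where the path is an assumption about the local setup.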
Strange, I can't reproduce the issue with nutch-1.2. Can you try running the
command below?
bin/nutch org.apache.nutch.parse.ParserChecker http://ridder.uio.no/wtest2.doc
I guess I have found the source of some of my problems. I followed the
Eclipse tutorial, which is probably outdated:
http://wiki.apache.org/nutch/RunNutchInEclipse1.0
The likely cause is that I did the following:
"Change the property "plugin.folders" to "./src/plugin" on
$NUTCH_HOME/conf/nutch-default.xml"
After I changed the nutch-default.xml file back to its original form,
I was able to parse html, pdf, xls and xlsx (from the command line
rather than by running Nutch inside Eclipse). But do you know how to parse
MS Office, RTF and ODT?
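
A quick way to confirm which plugin.folders value a configuration file actually carries is to read the property directly. This is a minimal sketch: SAMPLE_CONF is a trimmed-down stand-in for conf/nutch-default.xml that contains only this one property with the value from the tutorial, and get_property is a hypothetical helper. As far as I know, the stock file ships with the value "plugins", but treat that as an assumption and check your own copy.

```python
import xml.etree.ElementTree as ET

# Stand-in for conf/nutch-default.xml with only the property discussed
# above, set to the value the Eclipse tutorial suggests.
SAMPLE_CONF = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.folders</name>
    <value>./src/plugin</value>
  </property>
</configuration>
"""


def get_property(conf_xml: str, name: str):
    """Return the <value> text of the named <property>, or None if absent."""
    root = ET.fromstring(conf_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


print(get_property(SAMPLE_CONF, "plugin.folders"))
```

To inspect a real file, read its contents into a string first and pass that string in place of SAMPLE_CONF.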
BTW, here's the output:
hoppalong:apache-nutch-1.2 erlendfg$ bin/nutch
org.apache.nutch.parse.ParserChecker http://ridder.uio.no/wtest2.doc
---------
Url
---------------
http://ridder.uio.no/wtest2.doc---------
ParseData
---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata: ETag="20064f-5600-47d0d1f0f8400" Date=Fri, 08 Oct 2010
11:26:38 GMT Content-Length=22016 Last-Modified=Wed, 13 Jan 2010
15:06:56 GMT Content-Type=application/msword Connection=close
Accept-Ranges=bytes Server=Apache/2.2.16 (Unix) mod_ssl/2.2.16
OpenSSL/0.9.8j DAV/2 PHP/5.2.14 mod_perl/2.0.4 Perl/v5.8.8
Parse Metadata: Revision-Number=2 Last-Author=Erlend Gar?sen
Template=Normal.dotm subject= Page-Count=1 Application-Name=Microsoft
Macintosh Word Author=Erlend Gar?sen Edit-Time=600000000
Creation-Date=Wed Jan 13 15:58:00 CET 2010 Company=Universitetet i Oslo
Content-Type=application/msword Keywords= Last-Save-Date=Wed Jan 13
15:58:00 CET 2010
Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050