Hello,

Thanks for the replies.

I have started trying to use Nutch 1.3 after your suggestions, especially since I am using Tika 0.9, but I am not getting anywhere with it. I am able to build fine but whenever I try to run any command it gives the error stating that it cannot find C:\Program. For example, if I try to run the following command to crawl:

    runtime/local/bin/nutch crawl urls -dir crawl -depth 3 -topN 50

It then gives me the following error right away before any other output:

    runtime/local/bin/nutch: line 251: exec: C:\Program: not found

I am running on Cygwin on Windows 7, if that helps.

As for Tika, I did modify the CompositeDetector.java file in tika-core since I added a Detector to detect the AFM files and had to make a slight change to the CompositeDetector. I did rebuild Nutch after I changed the jars and it built fine but that is when I started getting the fetch failed error.

Thanks,
Fernando

On Tue, Jul 12, 2011 at 2:13 AM, Julien Nioche <[email protected]> wrote:

> Hi Fernando
>
> > I have made some additions (a new parser) to the Apache Tika application
> > and I am trying to see if I can run my new changes using the crawl
> > mechanism in Nutch, but I am having some trouble updating Nutch with my
> > modified Tika application.
> >
> > The Tika updates I made run fine if I run Tika as a standalone using
> > either the command line or the Tika GUI.
>
> OK
>
> > I am using Nutch 1.2, 1.3 seems to not be able to run for me (I get an
> > error saying C:/Program not found whenever I try to do anything but 1.2
> > should be fine for what I am trying to do which is just to see the parse
> > results from the new parser I added to Tika).
> >
> > I have replaced the tika-core.jar, tika-parsers.jar and
> > tika-mimetypes.xml files with my versions of those files as described in
> > the following link: http://issues.apache.org/jira/browse/NUTCH-766. I
> > also updated the nutch-site.xml to enable the parse-tika plugin. I also
> > updated the parse-plugins.xml file with the following (afm files are
> > what I am trying to parse):
> >
> > <mimeType name="application/x-font-afm">
> >   <plugin id="parse-tika" />
> > </mimeType>
>
> This is not necessary as by default parse-tika is used for any mime-type
> unless the mapping mime-type / parser is specified in parse-plugins.xml.
> This should not have an impact though
>
> > I am crawling a personal site in which I have links to .afm files. If I
> > crawl before making any updates to Nutch, it fetches the files fine.
> > After making the updates detailed above, I get the following error:
> > "fetch of http://scf.usc.edu/~jfarreol/woor2___.AFM failed with:
> > java.lang.NoClassDefFoundError: org/apache/james/mime4j/MimeException".
> >
> > Not really sure, what the issue is but my guess is that I have not
> > updated all the necessary files. Any help would be greatly appreciated.
>
> yep, sounds like you have a few jars missing. Nutch-1.2 came with tika-0.7,
> which version of tika are you trying to use?
> if you just added a new parser then it would be easier to ship it as a
> separate jar file. I assume that you did not have to modify anything in
> tika-core, so you could use the standard tika libs and simply add yours
> using Ivy.
>
> Nutch-1.3 (and 1.4 in SVN) contain a lot of improvements over 1.2 so it
> would be worth getting to the bottom of the issue you're encountering and
> get 1.3 to work. Moreover I am not sure that you can use a version of
> Tika > 0.7 on Nutch 1.2 without changing parts of the code (to be checked
> though)
>
> Julien
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
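
A note on the `exec: C:\Program: not found` error reported at the top of this message: the text is consistent with a shell variable that holds a path containing a space (for example something under "Program Files") being expanded without quotes, so the shell splits it at the space and tries to run only the first word. The thread does not show what line 251 of bin/nutch contains, so this is only a guess from the error text. A minimal sketch of the splitting behaviour, where the `JAVA` value and the `count_args` helper are hypothetical and written only for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical path with a space in it; the real value on the poster's
# machine is not shown in the thread.
JAVA='C:/Program Files/Java/bin/java'

# Helper for illustration: print the argument count and the first argument.
count_args() {
  printf '%s|%s\n' "$#" "$1"
}

# Unquoted expansion: the shell splits the value at the space,
# so the first word becomes just C:/Program.
unquoted=$(count_args $JAVA -version)

# Quoted expansion: the whole path stays a single word.
quoted=$(count_args "$JAVA" -version)

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

In the unquoted case the helper sees three arguments and the first is `C:/Program`; in the quoted case it sees two and the first is the full path. If the script does something similar, quoting the expansion or pointing the Java location at a path without spaces might avoid the error, though that is an assumption and not something confirmed in this thread.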

