Hello,

Thanks for the replies.

I have started trying to use Nutch 1.3 after your suggestions, especially
since I am using Tika 0.9, but I am not getting anywhere with it. It builds
fine, but whenever I try to run any command it fails with an error saying
that it cannot find C:\Program. For example, if I try to run the following
command to crawl:

runtime/local/bin/nutch crawl urls -dir crawl -depth 3 -topN 50

It then gives me the following error right away before any other output:

runtime/local/bin/nutch: line 251: exec: C:\Program: not found

I am running on Cygwin on Windows 7, if that helps.
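
That message is consistent with a path containing a space being expanded
unquoted somewhere in the script, for example a JAVA_HOME under
C:\Program Files. A minimal sketch of that effect in plain shell, assuming a
hypothetical directory and a hypothetical script named tool (this is not the
actual line 251 of bin/nutch, which may do something different):

```shell
# Hypothetical directory whose name contains a space, standing in for
# something like C:\Program Files (an assumption, not taken from bin/nutch).
dir="$(mktemp -d)/Program Files"
mkdir -p "$dir"
printf '#!/bin/sh\necho ran\n' > "$dir/tool"
chmod +x "$dir/tool"

# Unquoted expansion: the shell splits the path at the space, so it tries to
# run a command whose name ends in "Program" and fails.
unquoted=$( ($dir/tool) 2>&1 ) || true

# Quoted expansion: the whole path stays one word and the script runs.
quoted=$("$dir/tool")

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

If that is the cause here, pointing JAVA_HOME at a path without spaces would
be one thing to try.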

As for Tika, I did modify the CompositeDetector.java file in tika-core since
I added a Detector to detect the AFM files and had to make a slight change
to the CompositeDetector. I did rebuild Nutch after I changed the jars and
it built fine but that is when I started getting the fetch failed error.

Thanks,
Fernando

On Tue, Jul 12, 2011 at 2:13 AM, Julien Nioche <
[email protected]> wrote:

> Hi Fernando
>
>
> > I have made some additions (a new parser) to the Apache Tika application
> > and I am trying to see if I can run my new changes using the crawl
> > mechanism in Nutch, but I am having some trouble updating Nutch with my
> > modified Tika application.
> >
> > The Tika updates I made run fine if I run Tika as a standalone using
> > either the command line or the Tika GUI.
> >
>
> OK
>
>
> >
> > I am using Nutch 1.2; 1.3 does not seem to run for me (I get an error
> > saying C:/Program not found whenever I try to do anything), but 1.2
> > should be fine for what I am trying to do, which is just to see the parse
> > results from the new parser I added to Tika.
> >
> > I have replaced the tika-core.jar, tika-parsers.jar and
> > tika-mimetypes.xml files with my versions of those files as described in
> > the following link: http://issues.apache.org/jira/browse/NUTCH-766. I also
> > updated the nutch-site.xml to enable the parse-tika plugin. I also updated
> > the parse-plugins.xml file with the following (afm files are what I am
> > trying to parse):
> >
> >        <mimeType name="application/x-font-afm">
> >                <plugin id="parse-tika" />
> >        </mimeType>
> >
>
> This is not necessary as by default parse-tika is used for any mime-type
> unless the mapping mime-type / parser is specified in parse-plugins.xml.
> This should not have an impact though
>
>
> >
> > I am crawling a personal site in which I have links to .afm files. If I
> > crawl before making any updates to Nutch, it fetches the files fine.
> > After making the updates detailed above, I get the following error:
> > "fetch of http://scf.usc.edu/~jfarreol/woor2___.AFM failed with:
> > java.lang.NoClassDefFoundError: org/apache/james/mime4j/MimeException".
> >
> > Not really sure what the issue is, but my guess is that I have not
> > updated all the necessary files. Any help would be greatly appreciated.
> >
>
> Yep, sounds like you have a few jars missing. Nutch-1.2 came with tika-0.7;
> which version of tika are you trying to use?
> If you just added a new parser then it would be easier to ship it as a
> separate jar file. I assume that you did not have to modify anything in
> tika-core, so you could use the standard tika libs and simply add yours
> using Ivy.
>
> Nutch-1.3 (and 1.4 in SVN) contain a lot of improvements over 1.2 so it
> would be worth getting to the bottom of the issue you're encountering and
> get 1.3 to work. Moreover I am not sure that you can use a version of
> Tika > 0.7 on Nutch 1.2 without changing parts of the code (to be checked
> though)
>
> Julien
>
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
