Thanks for the help. I seem to be getting close to what I need to do, but
not quite there.

I downloaded Nutch 1.3 and built it on a Unix machine. It built and ran fine
(before changing any jar files) when I tested it on the site with the .afm
files that I want to get parsed.

I then replaced the tika-core.jar, tika-parsers.jar, nutch-site.xml (to
enable the parse-tika plugin) and tika-mimetypes.xml files with my updated
versions. I rebuilt (no errors) and then ran the crawl command on the same
site. The fetch seemed to work; I did not see any errors while it was running
or in the log file. There is a parse error, but it is related to a PDF I have
linked in the site I crawled, and since I am not interested in the PDF I
don't think it matters.

Now here is my completely newb question: how can I tell if the afm files
were parsed correctly in the absence of errors?

I looked at the files in the segments/*/parse_data directory (since that is
where the tutorial says the metadata goes and the parser I created mostly
extracts metadata) but the files aren't really readable. I also figured
maybe I could search for some terms I expect the parser to extract, but I
couldn't perform a search. When I typed the following command in the
runtime/local directory:

bin/nutch org.apache.nutch.searcher.NutchBean *search_term*

I get the following error:

Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/nutch/searcher/NutchBean

I looked in the src directory and did not find the searcher (it was there
in the 1.2 version). I tried downloading both the binary and the src
distributions for 1.3 and it was in neither. Is there a different way to
perform a search in 1.3, or is there a different way I can see readable
results of the parsed information?

Thanks,
Fernando

On Tue, Jul 12, 2011 at 11:00 AM, lewis john mcgibbney <
[email protected]> wrote:

> OK so at least we seem to have sorted out the first of your problems...
> but now face the dreaded Windows Cygwin partnership.
>
> We do not currently have an up-to-date tutorial for this. We do however have
> a tutorial for older versions of Nutch which you can find here [1] [2]
>
> I'm going to be brutally honest with you here: working with Cygwin was
> horrible from my own experience. There seems to be so much overhead and
> working with almost any other OS was a significantly easier option. I
> understand that this may mean a fundamental shift in your computing style
> but the benefit is well worth it.
>
> [1] http://wiki.apache.org/nutch/GettingNutchRunningOnCygwin
> [2] http://wiki.apache.org/nutch/GettingNutchRunningWithWindows?highlight=%28cygwin%29
>
> On Tue, Jul 12, 2011 at 6:23 PM, Fernando Arreola <[email protected]
> >wrote:
>
> > Hello,
> >
> > Thanks for the replies.
> >
> > I have started trying to use Nutch 1.3 after your suggestions, especially
> > since I am using Tika 0.9, but I am not getting anywhere with it. I am able
> > to build fine but whenever I try to run any command it gives an error
> > stating that it cannot find C:\Program. For example, if I try to run the
> > following command to crawl:
> >
> > runtime/local/bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> >
> > It then gives me the following error right away before any other output:
> >
> > runtime/local/bin/nutch: line 251: exec: C:\Program: not found
> >
> > I am running on Cygwin on Windows 7, if that helps.
> >
> > As for Tika, I did modify the CompositeDetector.java file in tika-core since
> > I added a Detector to detect the AFM files and had to make a slight change
> > to the CompositeDetector. I did rebuild Nutch after I changed the jars and
> > it built fine but that is when I started getting the fetch failed error.
> >
> > Thanks,
> > Fernando
> >
> > On Tue, Jul 12, 2011 at 2:13 AM, Julien Nioche <
> > [email protected]> wrote:
> >
> > > Hi Fernando
> > >
> > >
> > > > I have made some additions (a new parser) to the Apache Tika application
> > > > and I am trying to see if I can run my new changes using the crawl
> > > > mechanism in Nutch, but I am having some trouble updating Nutch with my
> > > > modified Tika application.
> > > >
> > > > The Tika updates I made run fine if I run Tika as a standalone using either
> > > > the command line or the Tika GUI.
> > > >
> > >
> > > OK
> > >
> > >
> > > >
> > > > I am using Nutch 1.2; 1.3 does not seem to run for me (I get an error
> > > > saying C:/Program not found whenever I try to do anything), but 1.2
> > > > should be fine for what I am trying to do, which is just to see the parse
> > > > results from the new parser I added to Tika.
> > > >
> > > > I have replaced the tika-core.jar, tika-parsers.jar and tika-mimetypes.xml
> > > > files with my versions of those files as described in the following link:
> > > > http://issues.apache.org/jira/browse/NUTCH-766. I also updated the
> > > > nutch-site.xml to enable the parse-tika plugin. I also updated the
> > > > parse-plugins.xml file with the following (afm files are what I am trying
> > > > to parse):
> > > >
> > > >        <mimeType name="application/x-font-afm">
> > > >                <plugin id="parse-tika" />
> > > >        </mimeType>
> > > >
> > >
> > > This is not necessary as by default parse-tika is used for any mime-type
> > > unless the mapping mime-type / parser is specified in parse-plugins.xml.
> > > This should not have an impact though
> > >
> > >
> > > >
> > > > I am crawling a personal site in which I have links to .afm files. If I
> > > > crawl before making any updates to Nutch, it fetches the files fine. After
> > > > making the updates detailed above, I get the following error: "fetch of
> > > > http://scf.usc.edu/~jfarreol/woor2___.AFM failed with:
> > > > java.lang.NoClassDefFoundError: org/apache/james/mime4j/MimeException".
> > > >
> > > > Not really sure what the issue is, but my guess is that I have not updated
> > > > all the necessary files. Any help would be greatly appreciated.
> > > >
> > >
> > > yep, sounds like you have a few jars missing. Nutch-1.2 came with tika-0.7;
> > > which version of Tika are you trying to use?
> > > If you just added a new parser then it would be easier to ship it as a
> > > separate jar file. I assume that you did not have to modify anything in
> > > tika-core, so you could use the standard Tika libs and simply add yours
> > > using Ivy.
> > >
> > > Nutch-1.3 (and 1.4 in SVN) contain a lot of improvements over 1.2 so it
> > > would be worth getting to the bottom of the issue you're encountering and
> > > getting 1.3 to work. Moreover, I am not sure that you can use a version of
> > > Tika newer than 0.7 on Nutch 1.2 without changing parts of the code (to be
> > > checked though)
> > >
> > > Julien
> > >
> > >
> > >
> > >
> > > --
> > > Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> > >
> >
>
>
>
> --
> *Lewis*
>
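
A note on the "exec: C:\Program: not found" error quoted above. One common
cause of this kind of message is a path containing a space (for example a JDK
installed under "C:\Program Files") being expanded without quotes, so the
shell splits it at the space and tries to run only the first half. Whether
that is what happens at line 251 of bin/nutch is only a guess, since the
script itself is not shown in this thread. A minimal sketch of the splitting
behaviour, using a made-up path and a made-up variable name:

```shell
#!/bin/sh
# Hypothetical value for illustration; the real path on the affected
# machine is not given in the thread.
JAVA='C:/Program Files/Java/bin/java'

# Unquoted expansion: the shell splits the value at the space,
# so the first word (what exec would try to run) is just "C:/Program".
set -- $JAVA
unquoted_count=$#
unquoted_first=$1

# Quoted expansion: the whole value stays a single word.
set -- "$JAVA"
quoted_count=$#

printf '%s\n' "unquoted: $unquoted_count words, first word is $unquoted_first"
printf '%s\n' "quoted: $quoted_count word"
```

If this is the cause, quoting the expansion in the script or pointing
JAVA_HOME at a path without spaces would be the things to try.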
