Hi,

You were right: it is enough to provide the right clues in the
tika-mimetypes.xml file. Once the correct clues were in there, thanks to a
Tika developer, all I had to do was replace the jar files with mine. It is
working just as I want it to now.
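
For reference, an entry of this kind in tika-mimetypes.xml might look
roughly like the sketch below. It is an illustration, not the exact entry
that went into Tika: the type name is the one used in the parse-plugins.xml
snippet quoted below, and the glob pattern, the priority and the
StartFontMetrics magic string (the keyword AFM files normally open with) are
assumptions.

```xml
<!-- Hypothetical sketch of detection clues for AFM files. -->
<mime-type type="application/x-font-afm">
  <!-- Assumed file name clue: match on the .afm extension. -->
  <glob pattern="*.afm"/>
  <!-- Assumed content clue: the file starts with this keyword. -->
  <magic priority="50">
    <match value="StartFontMetrics" type="string" offset="0"/>
  </magic>
</mime-type>
```

With clues like these in place, detection no longer has to fall back to
text/plain for such files.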

Thanks everyone for the help.

Fernando

On Wed, Jul 13, 2011 at 1:48 AM, Julien Nioche <
[email protected]> wrote:

> You probably need to make sure that conf/tika-mimetypes.xml is the version
> you've modified and contains the clues for detecting afm files.
> BTW out of curiosity why did you have to modify tika-core.jar? Isn't it
> enough to provide the clues in tika-mimetypes.xml?
>
> Jul
>
> On 13 July 2011 01:16, Fernando Arreola <[email protected]> wrote:
>
> > Thanks, I really appreciate all the help. I used the ParserChecker and I
> > could see the metadata my parser extracted!
> >
> > I have one more question though: I could only see the metadata my parser
> > extracted if I used the -forceAs mimetype option. Otherwise the file is
> > detected as text/plain and my parser is not called. I ran into a similar
> > problem in Tika and added some functionality there so that Tika's
> > detection mechanism would not think afm files are text/plain. Does this
> > mean not all of my Tika changes made it in (I updated both the
> > tika-core.jar and tika-parsers.jar files), or does Nutch have its own
> > file type detection mechanism?
> >
> > Thanks,
> > Fernando
> >
> > On Tue, Jul 12, 2011 at 4:54 PM, Markus Jelsma
> > <[email protected]> wrote:
> >
> > >
> > > > Thanks for the help. I seem to be getting close to what I need to do,
> > > > but not quite there.
> > > >
> > > > I downloaded Nutch 1.3 and built it on a unix machine. It built and ran
> > > > fine (before changing any jar files) when I tested it on the site with
> > > > the .afm files that I want to get parsed.
> > > >
> > > > I then replaced the tika-core.jar, tika-parsers.jar, nutch-site.xml (to
> > > > enable the parse-tika plugin) and tika-mimetypes.xml files with my
> > > > updated versions. I rebuilt (no errors) and then ran the crawl command
> > > > on the same site. The fetch seemed to work; I did not see any errors
> > > > when running or in the log file. There is a parse error, but it is
> > > > related to a pdf I have linked in the site I crawled, and since I am
> > > > not interested in the pdf I don't think it matters.
> > > >
> > > > Now here is my completely newb question: how can I tell if the afm
> > > > files were parsed correctly in the absence of errors?
> > >
> > > The ParserChecker is what you're looking for. It's a handy tool you can
> > > locally use to find out if all goes well.
> > >
> > > bin/nutch org.apache.nutch.parse.ParserChecker
> > >
> > > >
> > > > I looked at the files in the segments/*/parse_data directory (since
> > > > that is where the tutorial says the metadata goes and the parser I
> > > > created mostly extracts metadata) but the files aren't really readable.
> > > > I also figured maybe I could search for some terms I expect the parser
> > > > to extract, but couldn't perform a search. When I typed the following
> > > > command in the runtime/local directory:
> > > >
> > > > bin/nutch org.apache.nutch.searcher.NutchBean *search_term*
> > > >
> > > > I get the following error:
> > > >
> > > > Exception in thread "main" java.lang.NoClassDefFoundError:
> > > > org/apache/nutch/searcher/NutchBean
> > > >
> > > > I looked in the src directory and did not find the searcher (it was in
> > > > there in the 1.2 version). I tried downloading both the binary and the
> > > > src distributions for 1.3 and it was in neither. Is there a different
> > > > way to perform a search in 1.3, or is there a different way I can see
> > > > readable results of the parsed information?
> > >
> > > There is no searcher in 1.3. It is deprecated and removed. Use Solr for
> > > indexing to confirm or use ParserChecker or the new 1.4-dev
> > > o.a.n.indexer.IndexingFiltersChecker.
> > >
> > > >
> > > > Thanks,
> > > > Fernando
> > > >
> > > > On Tue, Jul 12, 2011 at 11:00 AM, lewis john mcgibbney <
> > > >
> > > > [email protected]> wrote:
> > > > > OK so at least we seem to have sorted out the first of your
> > > > > problems... but now face the dreaded Windows Cygwin partnership.
> > > > >
> > > > > We do not currently have an up-to-date tutorial for this. We do,
> > > > > however, have a tutorial for older versions of Nutch which you can
> > > > > find here [1] [2]
> > > > >
> > > > > I'm going to be brutally honest with you here: working with Cygwin
> > > > > was horrible in my own experience. There seems to be so much
> > > > > overhead, and working with almost any other OS was a significantly
> > > > > easier option. I understand that this may mean a fundamental shift
> > > > > in your computing style but the benefit is well worth it.
> > > > >
> > > > > [1] http://wiki.apache.org/nutch/GettingNutchRunningOnCygwin
> > > > > [2] http://wiki.apache.org/nutch/GettingNutchRunningWithWindows?highlight=%28cygwin%29
> > > > >
> > > > > On Tue, Jul 12, 2011 at 6:23 PM, Fernando Arreola <
> > [email protected]
> > > > >
> > > > > >wrote:
> > > > > > Hello,
> > > > > >
> > > > > > Thanks for the replies.
> > > > > >
> > > > > > I have started trying to use Nutch 1.3 after your suggestions,
> > > > > > especially since I am using Tika 0.9, but I am not getting
> > > > > > anywhere with it. I am able to build fine, but whenever I try to
> > > > > > run any command it gives an error stating that it cannot find
> > > > > > C:\Program. For example, if I try to run the following command to
> > > > > > crawl:
> > > > > >
> > > > > > runtime/local/bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > > > > >
> > > > > > It then gives me the following error right away before any other
> > > > > > output:
> > > > > >
> > > > > > runtime/local/bin/nutch: line 251: exec: C:\Program: not found
> > > > > >
> > > > > > I am running on Cygwin on Windows 7, if that helps.
> > > > > >
> > > > > > As for Tika, I did modify the CompositeDetector.java file in
> > > > > > tika-core, since I added a Detector to detect the AFM files and
> > > > > > had to make a slight change to the CompositeDetector. I did
> > > > > > rebuild Nutch after I changed the jars and it built fine, but that
> > > > > > is when I started getting the fetch failed error.
> > > > > >
> > > > > > Thanks,
> > > > > > Fernando
> > > > > >
> > > > > > On Tue, Jul 12, 2011 at 2:13 AM, Julien Nioche <
> > > > > >
> > > > > > [email protected]> wrote:
> > > > > > > Hi Fernando
> > > > > > >
> > > > > > > > I have made some additions (a new parser) to the Apache Tika
> > > > > > > > application and I am trying to see if I can run my new changes
> > > > > > > > using the crawl mechanism in Nutch, but I am having some
> > > > > > > > trouble updating Nutch with my modified Tika application.
> > > > > > > >
> > > > > > > > The Tika updates I made run fine if I run Tika as a standalone
> > > > > > > > using either the command line or the Tika GUI.
> > > > > > >
> > > > > > > OK
> > > > > > >
> > > > > > > > I am using Nutch 1.2; 1.3 seems to not be able to run for me
> > > > > > > > (I get an error saying C:/Program not found whenever I try to
> > > > > > > > do anything, but 1.2 should be fine for what I am trying to
> > > > > > > > do, which is just to see the parse results from the new parser
> > > > > > > > I added to Tika).
> > > > > > > >
> > > > > > > > I have replaced the tika-core.jar, tika-parsers.jar and
> > > > > > > > tika-mimetypes.xml files with my versions of those files as
> > > > > > > > described in the following link:
> > > > > > > > http://issues.apache.org/jira/browse/NUTCH-766. I also updated
> > > > > > > > the nutch-site.xml to enable the parse-tika plugin. I also
> > > > > > > > updated the parse-plugins.xml file with the following (afm
> > > > > > > > files are what I am trying to parse):
> > > > > > > >
> > > > > > > >        <mimeType name="application/x-font-afm">
> > > > > > > >                <plugin id="parse-tika" />
> > > > > > > >        </mimeType>
> > > > > > >
> > > > > > > This is not necessary, as by default parse-tika is used for any
> > > > > > > mime-type unless the mapping mime-type / parser is specified in
> > > > > > > parse-plugins.xml. This should not have an impact though.
> > > > > > >
> > > > > > > > I am crawling a personal site in which I have links to .afm
> > > > > > > > files. If I crawl before making any updates to Nutch, it
> > > > > > > > fetches the files fine. After making the updates detailed
> > > > > > > > above, I get the following error: "fetch of
> > > > > > > > http://scf.usc.edu/~jfarreol/woor2___.AFM failed with:
> > > > > > > > java.lang.NoClassDefFoundError:
> > > > > > > > org/apache/james/mime4j/MimeException".
> > > > > > > >
> > > > > > > > Not really sure what the issue is, but my guess is that I have
> > > > > > > > not updated all the necessary files. Any help would be greatly
> > > > > > > > appreciated.
> > > > > > >
> > > > > > > Yep, sounds like you have a few jars missing. Nutch-1.2 came
> > > > > > > with tika-0.7; which version of Tika are you trying to use?
> > > > > > > If you just added a new parser then it would be easier to ship
> > > > > > > it as a separate jar file. I assume that you did not have to
> > > > > > > modify anything in tika-core, so you could use the standard tika
> > > > > > > libs and simply add yours using Ivy.
> > > > > > >
> > > > > > > Nutch-1.3 (and 1.4 in SVN) contain a lot of improvements over
> > > > > > > 1.2, so it would be worth getting to the bottom of the issue
> > > > > > > you're encountering and getting 1.3 to work. Moreover I am not
> > > > > > > sure that you can use a version of Tika 0.7 on Nutch 1.2 without
> > > > > > > changing parts of the code (to be checked though).
> > > > > > >
> > > > > > > Julien
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > *
> > > > > > > *Open Source Solutions for Text Engineering
> > > > > > >
> > > > > > > http://digitalpebble.blogspot.com/
> > > > > > > http://www.digitalpebble.com
> > > > >
> > > > > --
> > > > > *Lewis*
> > >
> >
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
