Hi,

Nutch 1.3 currently has Tika 0.9 which is the latest official version. I was
trying to replace the Tika in Nutch 1.3 with a Tika project which I had
modified (Tika 0.9 with a new parser I had created). Is it still recommended
that I create a JIRA issue if it currently has the latest official version?

Thanks,
Fernando

On Tue, Jul 19, 2011 at 9:41 PM, Mattmann, Chris A (388J) <
[email protected]> wrote:

> Hey Fernando,
>
> Would be great to get a JIRA issue and patch to bring
> Nutch 1.4-branch up to date with the latest Tika
> based on your experience.
>
> Thanks for your help!
>
> Cheers,
> Chris
>
> On Jul 19, 2011, at 4:48 PM, Fernando Arreola wrote:
>
> > Hi,
> >
> > You were right, it is enough to provide the right clues in the
> > tika-mimetypes.xml file. Once the correct clues got in there, thanks to a
> > Tika developer, all I had to do was replace the jar files with mine. It is
> > working just as I want it now.
> >
> > Thanks everyone for the help.
> >
> > Fernando
> >
> > On Wed, Jul 13, 2011 at 1:48 AM, Julien Nioche <
> > [email protected]> wrote:
> >
> >> You probably need to make sure that conf/tika-mimetypes.xml is the version
> >> you've modified and contains the clues for detecting afm files.
> >> BTW out of curiosity why did you have to modify tika-core.jar? Isn't it
> >> enough to provide the clues in tika-mimetypes.xml?
> >>
> >> Jul
> >>
> >> On 13 July 2011 01:16, Fernando Arreola <[email protected]> wrote:
> >>
> >>> Thanks, I really appreciate all the help. I used the ParserChecker and I
> >>> could see the metadata my parser extracted!
> >>>
> >>> I have one more question though, I could only see the metadata my parser
> >>> extracted if I used the -forceAs mimetype option. Otherwise it is detected
> >>> as a text/plain file and my parser is then not called. I ran into a similar
> >>> problem in tika and added some functionality there so that Tika's detection
> >>> mechanism would not think afm files are text/plain. Does this mean not all
> >>> of my tika changes made it in (I updated both the tika-core.jar and
> >>> tika-parsers.jar files) or does Nutch have its own file type detection
> >>> mechanism?
> >>>
> >>> Thanks,
> >>> Fernando
> >>>
> >>> On Tue, Jul 12, 2011 at 4:54 PM, Markus Jelsma
> >>> <[email protected]> wrote:
> >>>
> >>>>
> >>>>> Thanks for the help. I seem to be getting close to what I need to do, but
> >>>>> not quite there.
> >>>>>
> >>>>> I downloaded Nutch 1.3 and built it on a unix machine. It built and ran
> >>>>> fine (before changing any jar files) when I tested it on the site with the
> >>>>> .afm files that I want to get parsed.
> >>>>>
> >>>>> I then changed the tika-core.jar, tika-parsers.jar, nutch-site.xml (to
> >>>>> enable the parse-tika plugin) and tika-mimetypes.xml files with my updated
> >>>>> versions. I rebuilt (no errors) and then ran the crawl command on the same
> >>>>> site. The fetch seemed to work, I did not see any errors when running or in
> >>>>> the log file. There is a parse error but it is related to a pdf I have
> >>>>> linked in the site I crawled and since I am not interested in the pdf I
> >>>>> don't think it matters.
> >>>>>
> >>>>> Now here is my completely newb question: how can I tell if the afm files
> >>>>> were parsed correctly in the absence of errors?
> >>>>
> >>>> The ParserChecker is what you're looking for. It's a handy tool you can
> >>>> locally use to find out if all goes well.
> >>>>
> >>>> bin/nutch org.apache.nutch.parse.ParserChecker
> >>>>
> >>>>>
> >>>>> I looked at the files in the segments/*/parse_data directory (since that is
> >>>>> where the tutorial says the metadata goes and the parser I created mostly
> >>>>> extracts metadata) but the files aren't really readable. I also figured
> >>>>> maybe I could search for some terms I expect the parser to extract but
> >>>>> couldn't perform a search. When I typed the following command in the
> >>>>> runtime/local directory:
> >>>>>
> >>>>> bin/nutch org.apache.nutch.searcher.NutchBean *search_term*
> >>>>>
> >>>>> I get the following error:
> >>>>>
> >>>>> Exception in thread "main" java.lang.NoClassDefFoundError:
> >>>>> org/apache/nutch/searcher/NutchBean
> >>>>>
> >>>>> I looked in the src directory and did not find the searcher (it was in
> >>>>> there in the 1.2 version). I tried downloading both the binary and the src
> >>>>> distributions for 1.3 and it was in neither. Is there a different way to
> >>>>> perform a search in 1.3 or is there a different way I can see readable
> >>>>> results of the parsed information?
> >>>>
> >>>> There is no searcher in 1.3. It is deprecated and removed. Use Solr for
> >>>> indexing to confirm or use ParserChecker or the new 1.4-dev
> >>>> o.a.n.indexer.IndexingFiltersChecker.
> >>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Fernando
> >>>>>
> >>>>> On Tue, Jul 12, 2011 at 11:00 AM, lewis john mcgibbney <
> >>>>> [email protected]> wrote:
> >>>>>>
> >>>>>> OK so at least we seem to have sorted out the first of your problems...
> >>>>>> but now face the dreaded Windows Cygwin partnership.
> >>>>>>
> >>>>>> We do not currently have an up-to-date tutorial for this. We do however
> >>>>>> have a tutorial for older versions of Nutch which you can find here [1] [2]
> >>>>>>
> >>>>>> I'm going to be brutally honest with you here, working with Cygwin was
> >>>>>> horrible from my own experience. There seems to be so much overhead and
> >>>>>> working with almost any other OS was a significantly easier option. I
> >>>>>> understand that this may mean a fundamental shift in your computing
> >>>>>> style but the benefit is well worth it.
> >>>>>>
> >>>>>> [1] http://wiki.apache.org/nutch/GettingNutchRunningOnCygwin
> >>>>>> [2] http://wiki.apache.org/nutch/GettingNutchRunningWithWindows?highlight=%28cygwin%29
> >>>>>>
> >>>>>> On Tue, Jul 12, 2011 at 6:23 PM, Fernando Arreola <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> Thanks for the replies.
> >>>>>>>
> >>>>>>> I have started trying to use Nutch 1.3 after your suggestions,
> >>>>>>> especially since I am using Tika 0.9, but I am not getting anywhere
> >>>>>>> with it. I am able to build fine but whenever I try to run any command
> >>>>>>> it gives the error stating that it cannot find C:\Program. For example,
> >>>>>>> if I try to run the following command to crawl:
> >>>>>>>
> >>>>>>> runtime/local/bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> >>>>>>>
> >>>>>>> It then gives me the following error right away before any other
> >>>>>>> output:
> >>>>>>>
> >>>>>>> runtime/local/bin/nutch: line 251: exec: C:\Program: not found
> >>>>>>>
> >>>>>>> I am running on Cygwin on Windows 7, if that helps.
> >>>>>>>
> >>>>>>> As for Tika, I did modify the CompositeDetector.java file in tika-core
> >>>>>>> since I added a Detector to detect the AFM files and had to make a slight
> >>>>>>> change to the CompositeDetector. I did rebuild Nutch after I changed the
> >>>>>>> jars and it built fine but that is when I started getting the fetch failed
> >>>>>>> error.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Fernando
> >>>>>>>
> >>>>>>> On Tue, Jul 12, 2011 at 2:13 AM, Julien Nioche <
> >>>>>>> [email protected]> wrote:
> >>>>>>>>
> >>>>>>>> Hi Fernando
> >>>>>>>>
> >>>>>>>>> I have made some additions (a new parser) to the Apache Tika application
> >>>>>>>>> and I am trying to see if I can run my new changes using the crawl
> >>>>>>>>> mechanism in Nutch, but I am having some trouble updating Nutch with my
> >>>>>>>>> modified Tika application.
> >>>>>>>>>
> >>>>>>>>> The Tika updates I made run fine if I run Tika as a standalone using
> >>>>>>>>> either the command line or the Tika GUI.
> >>>>>>>>
> >>>>>>>> OK
> >>>>>>>>
> >>>>>>>>> I am using Nutch 1.2, 1.3 seems to not be able to run for me (I get an
> >>>>>>>>> error saying C:/Program not found whenever I try to do anything but 1.2
> >>>>>>>>> should be fine for what I am trying to do which is just to see the parse
> >>>>>>>>> results from the new parser I added to Tika).
> >>>>>>>>>
> >>>>>>>>> I have replaced the tika-core.jar, tika-parsers.jar and
> >>>>>>>>> tika-mimetypes.xml files with my versions of those files as described in
> >>>>>>>>> the following link: http://issues.apache.org/jira/browse/NUTCH-766. I also
> >>>>>>>>> updated the nutch-site.xml to enable the parse-tika plugin. I also updated
> >>>>>>>>> the parse-plugins.xml file with the following (afm files are what I am
> >>>>>>>>> trying to parse):
> >>>>>>>>>
> >>>>>>>>>       <mimeType name="application/x-font-afm">
> >>>>>>>>>               <plugin id="parse-tika" />
> >>>>>>>>>       </mimeType>
> >>>>>>>>
> >>>>>>>> This is not necessary as by default parse-tika is used for any mime-type
> >>>>>>>> unless the mapping mime-type / parser is specified in parse-plugins.xml.
> >>>>>>>> This should not have an impact though
> >>>>>>>>
> >>>>>>>>> I am crawling a personal site in which I have links to .afm files. If I
> >>>>>>>>> crawl before making any updates to Nutch, it fetches the files fine. After
> >>>>>>>>> making the updates detailed above, I get the following error: "fetch of
> >>>>>>>>> http://scf.usc.edu/~jfarreol/woor2___.AFM failed with:
> >>>>>>>>> java.lang.NoClassDefFoundError: org/apache/james/mime4j/MimeException".
> >>>>>>>>>
> >>>>>>>>> Not really sure what the issue is but my guess is that I have not updated
> >>>>>>>>> all the necessary files. Any help would be greatly appreciated.
> >>>>>>>>
> >>>>>>>> yep, sounds like you have a few jars missing. Nutch-1.2 came with tika-0.7,
> >>>>>>>> which version of tika are you trying to use?
> >>>>>>>> if you just added a new parser then it would be easier to ship it as
> >>>>>>>> a separate jar file. I assume that you did not have to modify
> >>>>>>>> anything in tika-core, so you could use the standard tika libs and
> >>>>>>>> simply add yours using Ivy.
> >>>>>>>>
> >>>>>>>> Nutch-1.3 (and 1.4 in SVN) contain a lot of improvements over 1.2 so
> >>>>>>>> it would be worth getting to the bottom of the issue you're
> >>>>>>>> encountering and get 1.3 to work. Moreover I am not sure that you can use a
> >>>>>>>> version of Tika 0.7 on Nutch 1.2 without changing parts of the code (to be
> >>>>>>>> checked though)
> >>>>>>>
> >>>>>>>> Julien
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> *
> >>>>>>>> *Open Source Solutions for Text Engineering
> >>>>>>>>
> >>>>>>>> http://digitalpebble.blogspot.com/
> >>>>>>>> http://www.digitalpebble.com
> >>>>>>
> >>>>>> --
> >>>>>> *Lewis*
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> *
> >> *Open Source Solutions for Text Engineering
> >>
> >> http://digitalpebble.blogspot.com/
> >> http://www.digitalpebble.com
> >>
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
