Glad you managed to get it to work. I don't know what Chris meant by that;
I can't see why we'd open a JIRA issue when we are already using the latest
version.

Julien
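
PS: for anyone finding this thread in the archives, the "clues" in
conf/tika-mimetypes.xml discussed below are a glob and/or magic entry for
the type. The entry Fernando actually used was not posted, so the following
is only a sketch of what such an entry might look like. The type name
application/x-font-afm comes from his parse-plugins.xml snippet; the *.afm
glob, the priority value and the StartFontMetrics magic string (the keyword
AFM files open with) are assumptions on my side.

```xml
<!-- Sketch only: glob, priority and magic value are assumptions,
     not the entry that was actually used in this thread. -->
<mime-type type="application/x-font-afm">
  <!-- clue 1: match on the file extension -->
  <glob pattern="*.afm"/>
  <!-- clue 2: match on the leading bytes of the file content -->
  <magic priority="50">
    <match value="StartFontMetrics" type="string" offset="0"/>
  </magic>
</mime-type>
```

The magic match does not depend on the file name, which is probably what
you want when the server gives no useful content type. If detection works
as intended, ParserChecker should then no longer need -forceAs for these
files.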

On 20 July 2011 08:19, Fernando Arreola <[email protected]> wrote:

> Hi,
>
> Nutch 1.3 currently has Tika 0.9, which is the latest official version. I
> was trying to replace the Tika in Nutch 1.3 with a Tika project which I had
> modified (Tika 0.9 with a new parser I had created). Is it still recommended
> that I create a JIRA issue if it currently has the latest official version?
>
> Thanks,
> Fernando
>
> On Tue, Jul 19, 2011 at 9:41 PM, Mattmann, Chris A (388J) <
> [email protected]> wrote:
>
> > Hey Fernando,
> >
> > Would be great to get a JIRA issue and patch to bring
> > Nutch 1.4-branch up to date with the latest Tika
> > based on your experience.
> >
> > Thanks for your help!
> >
> > Cheers,
> > Chris
> >
> > On Jul 19, 2011, at 4:48 PM, Fernando Arreola wrote:
> >
> > > Hi,
> > >
> > > > You were right, it is enough to provide the right clues in the
> > > > tika-mimetypes.xml file. Once the correct clues got in there, thanks
> > > > to a Tika developer, all I had to do was replace the jar files with
> > > > mine. It is working just as I want it now.
> > >
> > > Thanks everyone for the help.
> > >
> > > Fernando
> > >
> > > On Wed, Jul 13, 2011 at 1:48 AM, Julien Nioche <
> > > [email protected]> wrote:
> > >
> > > >> You probably need to make sure that conf/tika-mimetypes.xml is the
> > > >> version you've modified and contains the clues for detecting afm
> > > >> files.
> > > >> BTW out of curiosity why did you have to modify tika-core.jar? Isn't
> > > >> it enough to provide the clues in tika-mimetypes.xml?
> > >>
> > >> Jul
> > >>
> > >> On 13 July 2011 01:16, Fernando Arreola <[email protected]> wrote:
> > >>
> > > >>> Thanks, I really appreciate all the help. I used the ParserChecker
> > > >>> and I could see the metadata my parser extracted!
> > > >>>
> > > >>> I have one more question though: I could only see the metadata my
> > > >>> parser extracted if I used the -forceAs mimetype option. Otherwise
> > > >>> the file is detected as text/plain and my parser is then not
> > > >>> called. I ran into a similar problem in Tika and added some
> > > >>> functionality there so that Tika's detection mechanism would not
> > > >>> think afm files are text/plain. Does this mean not all of my Tika
> > > >>> changes made it in (I updated both the tika-core.jar and
> > > >>> tika-parsers.jar files) or does Nutch have its own file type
> > > >>> detection mechanism?
> > >>>
> > >>> Thanks,
> > >>> Fernando
> > >>>
> > >>> On Tue, Jul 12, 2011 at 4:54 PM, Markus Jelsma
> > >>> <[email protected]>wrote:
> > >>>
> > >>>>
> > > >>>>> Thanks for the help. I seem to be getting close to what I need
> > > >>>>> to do, but not quite there.
> > > >>>>>
> > > >>>>> I downloaded Nutch 1.3 and built it on a unix machine. It built
> > > >>>>> and ran fine (before changing any jar files) when I tested it on
> > > >>>>> the site with the .afm files that I want to get parsed.
> > > >>>>>
> > > >>>>> I then replaced the tika-core.jar, tika-parsers.jar,
> > > >>>>> nutch-site.xml (to enable the parse-tika plugin) and
> > > >>>>> tika-mimetypes.xml files with my updated versions. I rebuilt (no
> > > >>>>> errors) and then ran the crawl command on the same site. The
> > > >>>>> fetch seemed to work; I did not see any errors when running or
> > > >>>>> in the log file. There is a parse error but it is related to a
> > > >>>>> pdf I have linked in the site I crawled and since I am not
> > > >>>>> interested in the pdf I don't think it matters.
> > > >>>>>
> > > >>>>> Now here is my completely newb question: how can I tell if the
> > > >>>>> afm files were parsed correctly in the absence of errors?
> > >>>>
> > > >>>> The ParserChecker is what you're looking for. It's a handy tool
> > > >>>> you can use locally to find out if all goes well.
> > >>>>
> > >>>> bin/nutch org.apache.nutch.parse.ParserChecker
> > >>>>
> > >>>>>
> > > >>>>> I looked at the files in the segments/*/parse_data directory
> > > >>>>> (since that is where the tutorial says the metadata goes and the
> > > >>>>> parser I created mostly extracts metadata) but the files aren't
> > > >>>>> really readable. I also figured maybe I could search for some
> > > >>>>> terms I expect the parser to extract but couldn't perform a
> > > >>>>> search. When I typed the following command in the runtime/local
> > > >>>>> directory:
> > >>>>>
> > >>>>> bin/nutch org.apache.nutch.searcher.NutchBean *search_term*
> > >>>>>
> > >>>>> I get the following error:
> > >>>>>
> > >>>>> Exception in thread "main" java.lang.NoClassDefFoundError:
> > >>>>> org/apache/nutch/searcher/NutchBean
> > >>>>>
> > > >>>>> I looked in the src directory and did not find the searcher (it
> > > >>>>> was in there in the 1.2 version). I tried downloading both the
> > > >>>>> binary and the src distributions for 1.3 and it was in neither.
> > > >>>>> Is there a different way to perform a search in 1.3 or is there
> > > >>>>> a different way I can see readable results of the parsed
> > > >>>>> information?
> > >>>>
> > > >>>> There is no searcher in 1.3. It was deprecated and has been
> > > >>>> removed. Use Solr for indexing to confirm, or use ParserChecker or
> > > >>>> the new 1.4-dev o.a.n.indexer.IndexingFiltersChecker.
> > >>>>
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Fernando
> > >>>>>
> > >>>>> On Tue, Jul 12, 2011 at 11:00 AM, lewis john mcgibbney <
> > >>>>>
> > >>>>> [email protected]> wrote:
> > > >>>>>> OK so at least we seem to have sorted out the first of your
> > > >>>>>> problems... but now face the dreaded Windows Cygwin partnership.
> > > >>>>>>
> > > >>>>>> We do not currently have an up-to-date tutorial for this. We do
> > > >>>>>> however have a tutorial for older versions of Nutch which you
> > > >>>>>> can find here [1] [2]
> > > >>>>>>
> > > >>>>>> I'm going to be brutally honest with you here: working with
> > > >>>>>> Cygwin was horrible in my own experience. There seems to be so
> > > >>>>>> much overhead, and working with almost any other OS was a
> > > >>>>>> significantly easier option. I understand that this may mean a
> > > >>>>>> fundamental shift in your computing style but the benefit is
> > > >>>>>> well worth it.
> > >>>>>>
> > >>>>>> [1] http://wiki.apache.org/nutch/GettingNutchRunningOnCygwin
> > >>>>>> [2]
> > >>>>>>
> > >>>>>>
> > >>>>
> > >>>
> > >>
> >
> http://wiki.apache.org/nutch/GettingNutchRunningWithWindows?highlight=%28
> > >>>>>> cygwin%29
> > >>>>>>
> > >>>>>> On Tue, Jul 12, 2011 at 6:23 PM, Fernando Arreola <
> > >>> [email protected]
> > >>>>>>
> > >>>>>>> wrote:
> > >>>>>>> Hello,
> > >>>>>>>
> > >>>>>>> Thanks for the replies.
> > >>>>>>>
> > > >>>>>>> I have started trying to use Nutch 1.3 after your suggestions,
> > > >>>>>>> especially since I am using Tika 0.9, but I am not getting
> > > >>>>>>> anywhere with it. I am able to build fine but whenever I try
> > > >>>>>>> to run any command it gives the error stating that it cannot
> > > >>>>>>> find C:\Program. For example, if I try to run the following
> > > >>>>>>> command to crawl:
> > >>>>>>>
> > >>>>>>> runtime/local/bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > >>>>>>>
> > >>>>>>> It then gives me the following error right away before any other
> > >>>>>>> output:
> > >>>>>>>
> > >>>>>>> runtime/local/bin/nutch: line 251: exec: C:\Program: not found
> > >>>>>>>
> > >>>>>>> I am running on Cygwin on Windows 7, if that helps.
> > >>>>>>>
> > > >>>>>>> As for Tika, I did modify the CompositeDetector.java file in
> > > >>>>>>> tika-core since I added a Detector to detect the AFM files and
> > > >>>>>>> had to make a slight change to the CompositeDetector. I did
> > > >>>>>>> rebuild Nutch after I changed the jars and it built fine but
> > > >>>>>>> that is when I started getting the fetch failed error.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Fernando
> > >>>>>>>
> > >>>>>>> On Tue, Jul 12, 2011 at 2:13 AM, Julien Nioche <
> > >>>>>>>
> > >>>>>>> [email protected]> wrote:
> > >>>>>>>> Hi Fernando
> > >>>>>>>>
> > > >>>>>>>>> I have made some additions (a new parser) to the Apache Tika
> > > >>>>>>>>> application and I am trying to see if I can run my new
> > > >>>>>>>>> changes using the crawl mechanism in Nutch, but I am having
> > > >>>>>>>>> some trouble updating Nutch with my modified Tika
> > > >>>>>>>>> application.
> > > >>>>>>>>>
> > > >>>>>>>>> The Tika updates I made run fine if I run Tika as a
> > > >>>>>>>>> standalone using either the command line or the Tika GUI.
> > >>>>>>>>
> > >>>>>>>> OK
> > >>>>>>>>
> > > >>>>>>>>> I am using Nutch 1.2; 1.3 seems to not be able to run for me
> > > >>>>>>>>> (I get an error saying C:/Program not found whenever I try
> > > >>>>>>>>> to do anything), but 1.2 should be fine for what I am trying
> > > >>>>>>>>> to do, which is just to see the parse results from the new
> > > >>>>>>>>> parser I added to Tika.
> > >>>>>>>>>
> > > >>>>>>>>> I have replaced the tika-core.jar, tika-parsers.jar and
> > > >>>>>>>>> tika-mimetypes.xml files with my versions of those files as
> > > >>>>>>>>> described in the following link:
> > > >>>>>>>>> http://issues.apache.org/jira/browse/NUTCH-766. I also
> > > >>>>>>>>> updated the nutch-site.xml to enable the parse-tika plugin.
> > > >>>>>>>>> I also updated the parse-plugins.xml file with the following
> > > >>>>>>>>> (afm files are what I am trying to parse):
> > > >>>>>>>>>
> > > >>>>>>>>>       <mimeType name="application/x-font-afm">
> > > >>>>>>>>>               <plugin id="parse-tika" />
> > > >>>>>>>>>       </mimeType>
> > >>>>>>>>
> > > >>>>>>>> This is not necessary as by default parse-tika is used for any
> > > >>>>>>>> mime-type unless the mapping mime-type / parser is specified
> > > >>>>>>>> in parse-plugins.xml. This should not have an impact though.
> > >>>>>>>>
> > > >>>>>>>>> I am crawling a personal site in which I have links to .afm
> > > >>>>>>>>> files. If I crawl before making any updates to Nutch, it
> > > >>>>>>>>> fetches the files fine. After making the updates detailed
> > > >>>>>>>>> above, I get the following error: "fetch of
> > > >>>>>>>>> http://scf.usc.edu/~jfarreol/woor2___.AFM failed with:
> > > >>>>>>>>> java.lang.NoClassDefFoundError:
> > > >>>>>>>>> org/apache/james/mime4j/MimeException".
> > > >>>>>>>>>
> > > >>>>>>>>> Not really sure what the issue is but my guess is that I
> > > >>>>>>>>> have not updated all the necessary files. Any help would be
> > > >>>>>>>>> greatly appreciated.
> > >>>>>>>>
> > > >>>>>>>> yep, sounds like you have a few jars missing. Nutch-1.2 came
> > > >>>>>>>> with tika-0.7, which version of tika are you trying to use?
> > > >>>>>>>> if you just added a new parser then it would be easier to ship
> > > >>>>>>>> it as a separate jar file. I assume that you did not have to
> > > >>>>>>>> modify anything in tika-core, so you could use the standard
> > > >>>>>>>> tika libs and simply add yours using Ivy.
> > > >>>>>>>>
> > > >>>>>>>> Nutch-1.3 (and 1.4 in SVN) contain a lot of improvements over
> > > >>>>>>>> 1.2 so it would be worth getting to the bottom of the issue
> > > >>>>>>>> you're encountering and get 1.3 to work. Moreover I am not
> > > >>>>>>>> sure that you can use a version of Tika 0.7 on Nutch 1.2
> > > >>>>>>>> without changing parts of the code (to be checked though)
> > >>>>>>>
> > >>>>>>>> Julien
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> *
> > >>>>>>>> *Open Source Solutions for Text Engineering
> > >>>>>>>>
> > >>>>>>>> http://digitalpebble.blogspot.com/
> > >>>>>>>> http://www.digitalpebble.com
> > >>>>>>
> > >>>>>> --
> > >>>>>> *Lewis*
> > >>>>
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> *
> > >> *Open Source Solutions for Text Engineering
> > >>
> > >> http://digitalpebble.blogspot.com/
> > >> http://www.digitalpebble.com
> > >>
> >
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Senior Computer Scientist
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 171-266B, Mailstop: 171-246
> > Email: [email protected]
> > WWW:   http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Assistant Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> >
>



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
