Sorry guys I'm nutters! :)

Cheers,
Chris

On Jul 20, 2011, at 1:39 AM, Julien Nioche wrote:

> Glad you managed to get it to work. I don't know what Chris meant by that;
> I can't see why we'd open a JIRA when we are already using the latest version.
>
> Julien
>
> On 20 July 2011 08:19, Fernando Arreola <[email protected]> wrote:
>
>> Hi,
>>
>> Nutch 1.3 currently has Tika 0.9, which is the latest official version. I was
>> trying to replace the Tika in Nutch 1.3 with a Tika project which I had
>> modified (Tika 0.9 with a new parser I had created). Is it still recommended
>> that I create a JIRA issue if it currently has the latest official version?
>>
>> Thanks,
>> Fernando
>>
>> On Tue, Jul 19, 2011 at 9:41 PM, Mattmann, Chris A (388J) <
>> [email protected]> wrote:
>>
>>> Hey Fernando,
>>>
>>> Would be great to get a JIRA issue and patch to bring
>>> Nutch 1.4-branch up to date with the latest Tika
>>> based on your experience.
>>>
>>> Thanks for your help!
>>>
>>> Cheers,
>>> Chris
>>>
>>> On Jul 19, 2011, at 4:48 PM, Fernando Arreola wrote:
>>>
>>>> Hi,
>>>>
>>>> You were right, it is enough to provide the right clues in the
>>>> tika-mimetypes.xml file. Once the correct clues got in there, thanks to a
>>>> Tika developer, all I had to do was replace the jar files with mine. It is
>>>> working just as I want it now.
>>>>
>>>> Thanks everyone for the help.
>>>>
>>>> Fernando
>>>>
>>>> On Wed, Jul 13, 2011 at 1:48 AM, Julien Nioche <
>>>> [email protected]> wrote:
>>>>
>>>>> You probably need to make sure that conf/tika-mimetypes.xml is the version
>>>>> you've modified and contains the clues for detecting afm files.
>>>>> BTW out of curiosity why did you have to modify tika-core.jar? Isn't it
>>>>> enough to provide the clues in tika-mimetypes.xml?
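>>>>>
>>>>> A minimal sketch of what such clues might look like, assuming the
>>>>> application/x-font-afm type name used elsewhere in this thread and the
>>>>> "StartFontMetrics" keyword that AFM files begin with. The priority value
>>>>> is an arbitrary illustration, and in a real tika-mimetypes.xml the
>>>>> mime-type entry would go inside the existing mime-info root:
>>>>>
>>>>> ```xml
>>>>> <mime-info>
>>>>>   <!-- Assumed entry: the type name follows this thread, not a Tika release -->
>>>>>   <mime-type type="application/x-font-afm">
>>>>>     <!-- Filename clue -->
>>>>>     <glob pattern="*.afm"/>
>>>>>     <!-- Content clue: AFM files start with the StartFontMetrics keyword -->
>>>>>     <magic priority="50">
>>>>>       <match value="StartFontMetrics" type="string" offset="0"/>
>>>>>     </magic>
>>>>>   </mime-type>
>>>>> </mime-info>
>>>>> ```
>>>>>
>>>>> With a content clue like this, detection has something more specific to
>>>>> match than the text/plain fallback reported in this thread.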
>>>>>
>>>>> Jul
>>>>>
>>>>> On 13 July 2011 01:16, Fernando Arreola <[email protected]> wrote:
>>>>>
>>>>>> Thanks, I really appreciate all the help. I used the ParserChecker and I
>>>>>> could see the metadata my parser extracted!
>>>>>>
>>>>>> I have one more question though: I could only see the metadata my parser
>>>>>> extracted if I used the -forceAs mimetype option. Otherwise it is detected
>>>>>> as a text/plain file and my parser is then not called. I ran into a similar
>>>>>> problem in Tika and added some functionality there so that Tika's detection
>>>>>> mechanism would not think afm files are text/plain. Does this mean not all
>>>>>> of my Tika changes made it in (I updated both the tika-core.jar and
>>>>>> tika-parsers.jar files) or does Nutch have its own file type detection
>>>>>> mechanism?
>>>>>>
>>>>>> Thanks,
>>>>>> Fernando
>>>>>>
>>>>>> On Tue, Jul 12, 2011 at 4:54 PM, Markus Jelsma
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>>> Thanks for the help. I seem to be getting close to what I need to do,
>>>>>>>> but not quite there.
>>>>>>>>
>>>>>>>> I downloaded Nutch 1.3 and built it on a unix machine. It built and ran
>>>>>>>> fine (before changing any jar files) when I tested it on the site with
>>>>>>>> the .afm files that I want to get parsed.
>>>>>>>>
>>>>>>>> I then replaced the tika-core.jar, tika-parsers.jar, nutch-site.xml (to
>>>>>>>> enable the parse-tika plugin) and tika-mimetypes.xml files with my
>>>>>>>> updated versions. I rebuilt (no errors) and then ran the crawl command
>>>>>>>> on the same site. The fetch seemed to work; I did not see any errors
>>>>>>>> when running or in the log file. There is a parse error but it is
>>>>>>>> related to a pdf I have linked in the site I crawled and since I am not
>>>>>>>> interested in the pdf I don't think it matters.
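>>>>>>>>
>>>>>>>> Something like the following sketch shows the kind of nutch-site.xml
>>>>>>>> override meant here. plugin.includes is Nutch's regular expression
>>>>>>>> of enabled plugin ids; the value shown is illustrative only and
>>>>>>>> should be adapted from the plugin.includes entry in your own
>>>>>>>> conf/nutch-default.xml:
>>>>>>>>
>>>>>>>> ```xml
>>>>>>>> <configuration>
>>>>>>>>   <property>
>>>>>>>>     <name>plugin.includes</name>
>>>>>>>>     <!-- Illustrative value: the point is that parse-tika is matched -->
>>>>>>>>     <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>>>>   </property>
>>>>>>>> </configuration>
>>>>>>>> ```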
>>>>>>>>
>>>>>>>> Now here is my completely newb question: how can I tell if the afm
>>>>>>>> files were parsed correctly in the absence of errors?
>>>>>>>
>>>>>>> The ParserChecker is what you're looking for. It's a handy tool you can
>>>>>>> use locally to find out if all goes well.
>>>>>>>
>>>>>>> bin/nutch org.apache.nutch.parse.ParserChecker
>>>>>>>
>>>>>>>>
>>>>>>>> I looked at the files in the segments/*/parse_data directory (since
>>>>>>>> that is where the tutorial says the metadata goes and the parser I
>>>>>>>> created mostly extracts metadata) but the files aren't really readable.
>>>>>>>> I also figured maybe I could search for some terms I expect the parser
>>>>>>>> to extract but couldn't perform a search. When I typed the following
>>>>>>>> command in the runtime/local directory:
>>>>>>>>
>>>>>>>> bin/nutch org.apache.nutch.searcher.NutchBean *search_term*
>>>>>>>>
>>>>>>>> I get the following error:
>>>>>>>>
>>>>>>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>>>>>>> org/apache/nutch/searcher/NutchBean
>>>>>>>>
>>>>>>>> I looked in the src directory and did not find the searcher (it was in
>>>>>>>> there in the 1.2 version). I tried downloading both the binary and the
>>>>>>>> src distributions for 1.3 and it was in neither. Is there a different
>>>>>>>> way to perform a search in 1.3 or is there a different way I can see
>>>>>>>> readable results of the parsed information?
>>>>>>>
>>>>>>> There is no searcher in 1.3. It is deprecated and removed. Use Solr for
>>>>>>> indexing to confirm or use ParserChecker or the new 1.4-dev
>>>>>>> o.a.n.indexer.IndexingFiltersChecker.
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Fernando
>>>>>>>>
>>>>>>>> On Tue, Jul 12, 2011 at 11:00 AM, lewis john mcgibbney <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> OK so at least we seem to have sorted out the first of your
>>>>>>>>> problems... but now face the dreaded Windows Cygwin partnership.
>>>>>>>>>
>>>>>>>>> We do not currently have an up-to-date tutorial for this. We do
>>>>>>>>> however have a tutorial for older versions of Nutch which you can
>>>>>>>>> find here [1] [2]
>>>>>>>>>
>>>>>>>>> I'm going to be brutally honest with you here, working with Cygwin
>>>>>>>>> was horrible from my own experience. There seems to be so much
>>>>>>>>> overhead and working with almost any other OS was a significantly
>>>>>>>>> easier option. I understand that this may mean a fundamental shift
>>>>>>>>> in your computing style but the benefit is well worth it.
>>>>>>>>>
>>>>>>>>> [1] http://wiki.apache.org/nutch/GettingNutchRunningOnCygwin
>>>>>>>>> [2] http://wiki.apache.org/nutch/GettingNutchRunningWithWindows?highlight=%28cygwin%29
>>>>>>>>>
>>>>>>>>> On Tue, Jul 12, 2011 at 6:23 PM, Fernando Arreola <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> Thanks for the replies.
>>>>>>>>>>
>>>>>>>>>> I have started trying to use Nutch 1.3 after your suggestions,
>>>>>>>>>> especially since I am using Tika 0.9, but I am not getting anywhere
>>>>>>>>>> with it. I am able to build fine but whenever I try to run any
>>>>>>>>>> command it gives an error stating that it cannot find C:\Program.
>>>>>>>>>> For example, if I try to run the following command to crawl:
>>>>>>>>>>
>>>>>>>>>> runtime/local/bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>>>>>>>>>
>>>>>>>>>> It then gives me the following error right away before any other
>>>>>>>>>> output:
>>>>>>>>>>
>>>>>>>>>> runtime/local/bin/nutch: line 251: exec: C:\Program: not found
>>>>>>>>>>
>>>>>>>>>> I am running on Cygwin on Windows 7, if that helps.
>>>>>>>>>>
>>>>>>>>>> As for Tika, I did modify the CompositeDetector.java file in
>>>>>>>>>> tika-core since I added a Detector to detect the AFM files and had
>>>>>>>>>> to make a slight change to the CompositeDetector. I did rebuild
>>>>>>>>>> Nutch after I changed the jars and it built fine but that is when I
>>>>>>>>>> started getting the fetch failed error.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Fernando
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 12, 2011 at 2:13 AM, Julien Nioche <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Fernando
>>>>>>>>>>>
>>>>>>>>>>>> I have made some additions (a new parser) to the Apache Tika
>>>>>>>>>>>> application and I am trying to see if I can run my new changes
>>>>>>>>>>>> using the crawl mechanism in Nutch, but I am having some trouble
>>>>>>>>>>>> updating Nutch with my modified Tika application.
>>>>>>>>>>>>
>>>>>>>>>>>> The Tika updates I made run fine if I run Tika as a standalone
>>>>>>>>>>>> using either the command line or the Tika GUI.
>>>>>>>>>>>
>>>>>>>>>>> OK
>>>>>>>>>>>
>>>>>>>>>>>> I am using Nutch 1.2; 1.3 seems to not be able to run for me (I
>>>>>>>>>>>> get an error saying C:/Program not found whenever I try to do
>>>>>>>>>>>> anything, but 1.2 should be fine for what I am trying to do, which
>>>>>>>>>>>> is just to see the parse results from the new parser I added to
>>>>>>>>>>>> Tika).
>>>>>>>>>>>>
>>>>>>>>>>>> I have replaced the tika-core.jar, tika-parsers.jar and
>>>>>>>>>>>> tika-mimetypes.xml files with my versions of those files as
>>>>>>>>>>>> described in the following link:
>>>>>>>>>>>> http://issues.apache.org/jira/browse/NUTCH-766. I also updated the
>>>>>>>>>>>> nutch-site.xml to enable the parse-tika plugin. I also updated the
>>>>>>>>>>>> parse-plugins.xml file with the following (afm files are what I am
>>>>>>>>>>>> trying to parse):
>>>>>>>>>>>>
>>>>>>>>>>>>      <mimeType name="application/x-font-afm">
>>>>>>>>>>>>              <plugin id="parse-tika" />
>>>>>>>>>>>>      </mimeType>
>>>>>>>>>>>
>>>>>>>>>>> This is not necessary as by default parse-tika is used for any
>>>>>>>>>>> mime-type unless the mapping mime-type / parser is specified in
>>>>>>>>>>> parse-plugins.xml. This should not have an impact though.
>>>>>>>>>>>
>>>>>>>>>>>> I am crawling a personal site in which I have links to .afm
>>>>>>>>>>>> files. If I crawl before making any updates to Nutch, it fetches
>>>>>>>>>>>> the files fine. After making the updates detailed above, I get the
>>>>>>>>>>>> following error: "fetch of
>>>>>>>>>>>> http://scf.usc.edu/~jfarreol/woor2___.AFM failed with:
>>>>>>>>>>>> java.lang.NoClassDefFoundError:
>>>>>>>>>>>> org/apache/james/mime4j/MimeException".
>>>>>>>>>
>>>>>>>>>>>> Not really sure what the issue is but my guess is that I have not
>>>>>>>>>>>> updated all the necessary files. Any help would be greatly
>>>>>>>>>>>> appreciated.
>>>>>>>>>>>
>>>>>>>>>>> yep, sounds like you have a few jars missing. Nutch-1.2 came with
>>>>>>>>>>> tika-0.7, which version of tika are you trying to use?
>>>>>>>>>>> if you just added a new parser then it would be easier to ship it
>>>>>>>>>>> as a separate jar file. I assume that you did not have to modify
>>>>>>>>>>> anything in tika-core, so you could use the standard tika libs and
>>>>>>>>>>> simply add yours using Ivy.
>>>>>>>>>>>
>>>>>>>>>>> Nutch-1.3 (and 1.4 in SVN) contain a lot of improvements over 1.2
>>>>>>>>>>> so it would be worth getting to the bottom of the issue you're
>>>>>>>>>>> encountering and get 1.3 to work. Moreover I am not sure that you
>>>>>>>>>>> can use a version of Tika > 0.7 on Nutch 1.2 without changing
>>>>>>>>>>> parts of the code (to be checked though)
>>>>>>>>>>>
>>>>>>>>>>> Julien
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> *
>>>>>>>>>>> *Open Source Solutions for Text Engineering
>>>>>>>>>>>
>>>>>>>>>>> http://digitalpebble.blogspot.com/
>>>>>>>>>>> http://www.digitalpebble.com
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Lewis*
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Senior Computer Scientist
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 171-266B, Mailstop: 171-246
>>> Email: [email protected]
>>> WWW:   http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Assistant Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>>>
>>
>
>
>


