Hey Fernando,

Would be great to get a JIRA issue and patch to bring 
Nutch 1.4-branch up to date with the latest Tika
based on your experience.

Thanks for your help!

Cheers,
Chris

On Jul 19, 2011, at 4:48 PM, Fernando Arreola wrote:

> Hi,
> 
> You were right, it is enough to provide the right clues in the
> tika-mimetypes.xml file. Once the correct clues got in there, thanks to a
> Tika developer, all I had to do was replace the jar files with mine. It is
> working just as I want it now.
> 
> Thanks everyone for the help.
> 
> Fernando
> 
> On Wed, Jul 13, 2011 at 1:48 AM, Julien Nioche <
> [email protected]> wrote:
> 
>> You probably need to make sure that conf/tika-mimetypes.xml is the version
>> you've modified and contains the clues for detecting afm files.
>> BTW out of curiosity why did you have to modify tika-core.jar? Isn't it
>> enough to provide the clues in tika-mimetypes.xml?
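
One quick way to confirm that the deployed conf/tika-mimetypes.xml is the modified copy is to look for the glob entry that maps the .afm extension. A minimal sketch, assuming the modified file declares the extension with a Tika-style <glob pattern="*.afm"/> element. The helper name has_glob and the MIMETYPES variable are made up for illustration, and the path depends on the directory Nutch is run from.

```shell
# has_glob FILE EXT
# Hypothetical helper (not part of Nutch or Tika): succeeds if FILE contains
# a <glob pattern="*.EXT"> entry, ignoring case.
has_glob() {
  grep -qi "<glob pattern=\"\*\.$2\"" "$1"
}

# Assumed location; adjust to the conf directory Nutch actually reads.
MIMETYPES=${MIMETYPES:-conf/tika-mimetypes.xml}

if [ -f "$MIMETYPES" ] && has_glob "$MIMETYPES" afm; then
  echo "afm glob found in $MIMETYPES"
else
  echo "no afm glob in $MIMETYPES (stock file, or clue declared another way)"
fi
```

A missing glob does not prove the file is the stock one, since a clue can also be declared as a magic-byte match, so this is only a first check.
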
>> 
>> Jul
>> 
>> On 13 July 2011 01:16, Fernando Arreola <[email protected]> wrote:
>> 
>>> Thanks, I really appreciate all the help. I used the ParserChecker and I
>>> could see the metadata my parser extracted!
>>> 
>>> I have one more question though: I could only see the metadata my parser
>>> extracted if I used the -forceAs mimetype option. Otherwise it is detected
>>> as a text/plain file and my parser is then not called. I ran into a similar
>>> problem in Tika and added some functionality there so that Tika's detection
>>> mechanism would not think afm files are text/plain. Does this mean not all
>>> of my Tika changes made it in (I updated both the tika-core.jar and
>>> tika-parsers.jar files) or does Nutch have its own file type detection
>>> mechanism?
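
To compare the two cases side by side, ParserChecker can be run once with detection left to Nutch and once with the type forced. A minimal sketch: the wrapper name check_parse and the NUTCH variable are made up for illustration, the class name and the -forceAs option are the ones mentioned in this thread, and the argument order (option before URL) is an assumption to verify against the tool's own usage message.

```shell
# Path to the nutch launcher; assumed to be run from runtime/local.
NUTCH=${NUTCH:-bin/nutch}

# check_parse URL [MIMETYPE]
# Hypothetical wrapper: runs ParserChecker on URL, forcing MIMETYPE if given.
check_parse() {
  if [ -n "${2:-}" ]; then
    "$NUTCH" org.apache.nutch.parse.ParserChecker -forceAs "$2" "$1"
  else
    "$NUTCH" org.apache.nutch.parse.ParserChecker "$1"
  fi
}
```

Possible usage, with the URL and MIME type taken from this thread:

    check_parse http://scf.usc.edu/~jfarreol/woor2___.AFM
    check_parse http://scf.usc.edu/~jfarreol/woor2___.AFM application/x-font-afm

If only the second call shows the parser's metadata, detection rather than the parser is the remaining problem.
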
>>> 
>>> Thanks,
>>> Fernando
>>> 
>>> On Tue, Jul 12, 2011 at 4:54 PM, Markus Jelsma
>>> <[email protected]> wrote:
>>> 
>>>> 
>>>>> Thanks for the help. I seem to be getting close to what I need to do, but
>>>>> not quite there.
>>>>> 
>>>>> I downloaded Nutch 1.3 and built it on a Unix machine. It built and ran
>>>>> fine (before changing any jar files) when I tested it on the site with
>>>>> the .afm files that I want to get parsed.
>>>>> 
>>>>> I then replaced the tika-core.jar, tika-parsers.jar, nutch-site.xml (to
>>>>> enable the parse-tika plugin) and tika-mimetypes.xml files with my
>>>>> updated versions. I rebuilt (no errors) and then ran the crawl command on
>>>>> the same site. The fetch seemed to work; I did not see any errors when
>>>>> running or in the log file. There is a parse error but it is related to a
>>>>> PDF I have linked in the site I crawled and since I am not interested in
>>>>> the PDF I don't think it matters.
>>>>> 
>>>>> Now here is my completely newb question: how can I tell if the afm files
>>>>> were parsed correctly in the absence of errors?
>>>> 
>>>> The ParserChecker is what you're looking for. It's a handy tool you can
>>>> locally use to find out if all goes well.
>>>> 
>>>> bin/nutch org.apache.nutch.parse.ParserChecker
>>>> 
>>>>> 
>>>>> I looked at the files in the segments/*/parse_data directory (since that
>>>>> is where the tutorial says the metadata goes and the parser I created
>>>>> mostly extracts metadata) but the files aren't really readable. I also
>>>>> figured maybe I could search for some terms I expect the parser to
>>>>> extract but couldn't perform a search. When I typed the following command
>>>>> in the runtime/local directory:
>>>>> 
>>>>> bin/nutch org.apache.nutch.searcher.NutchBean *search_term*
>>>>> 
>>>>> I got the following error:
>>>>> 
>>>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>>>> org/apache/nutch/searcher/NutchBean
>>>>> 
>>>>> I looked in the src directory and did not find the searcher (it was in
>>>>> there in the 1.2 version). I tried downloading both the binary and the
>>>>> src distributions for 1.3 and it was in neither. Is there a different way
>>>>> to perform a search in 1.3 or is there a different way I can see readable
>>>>> results of the parsed information?
>>>> 
>>>> There is no searcher in 1.3. It is deprecated and removed. Use Solr for
>>>> indexing to confirm or use ParserChecker or the new 1.4-dev
>>>> o.a.n.indexer.IndexingFiltersChecker.
>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Fernando
>>>>> 
>>>>> On Tue, Jul 12, 2011 at 11:00 AM, lewis john mcgibbney <
>>>>> 
>>>>> [email protected]> wrote:
>>>>>> OK so at least we seem to have sorted out the first of your problems...
>>>>>> but now face the dreaded Windows Cygwin partnership.
>>>>>> 
>>>>>> We do not currently have an up-to-date tutorial for this. We do however
>>>>>> have a tutorial for older versions of Nutch which you can find here [1]
>>>>>> [2]
>>>>>> 
>>>>>> I'm going to be brutally honest with you here: working with Cygwin was
>>>>>> horrible from my own experience. There seems to be so much overhead and
>>>>>> working with almost any other OS was a significantly easier option. I
>>>>>> understand that this may mean a fundamental shift in your computing
>>>>>> style but the benefit is well worth it.
>>>>>> 
>>>>>> [1] http://wiki.apache.org/nutch/GettingNutchRunningOnCygwin
>>>>>> [2] http://wiki.apache.org/nutch/GettingNutchRunningWithWindows?highlight=%28cygwin%29
>>>>>> 
>>>>>> On Tue, Jul 12, 2011 at 6:23 PM, Fernando Arreola <[email protected]>
>>>>>> wrote:
>>>>>>> Hello,
>>>>>>> 
>>>>>>> Thanks for the replies.
>>>>>>> 
>>>>>>> I have started trying to use Nutch 1.3 after your suggestions,
>>>>>>> especially since I am using Tika 0.9, but I am not getting anywhere
>>>>>>> with it. I am able to build fine but whenever I try to run any command
>>>>>>> it gives an error stating that it cannot find C:\Program. For example,
>>>>>>> if I try to run the following command to crawl:
>>>>>>> 
>>>>>>> runtime/local/bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>>>>>> 
>>>>>>> It then gives me the following error right away before any other
>>>>>>> output:
>>>>>>> 
>>>>>>> runtime/local/bin/nutch: line 251: exec: C:\Program: not found
>>>>>>> 
>>>>>>> I am running on Cygwin on Windows 7, if that helps.
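
The message suggests that the launcher script expands the Java path without quotes, so the shell splits "C:\Program Files" at the space and tries to exec only the first half. That is an inference from the error text, not something checked against line 251 of bin/nutch. A minimal sketch of the effect in a POSIX-style shell, with a made-up JAVA_HOME value and helper functions that exist only for the demonstration:

```shell
# Hypothetical value; any path containing a space behaves the same way.
JAVA_HOME='C:\Program Files\Java\jdk'
JAVA="$JAVA_HOME/bin/java"

# Helpers for the demonstration only.
count_words() { echo "$#"; }
first_word() { printf '%s\n' "$1"; }

# Unquoted expansion: the shell splits the value at the space.
unquoted_count=$(count_words $JAVA)
unquoted_first=$(first_word $JAVA)

# Quoted expansion: the value stays a single word.
quoted_count=$(count_words "$JAVA")

printf 'unquoted: %s words, command would be %s\n' "$unquoted_count" "$unquoted_first"
printf 'quoted:   %s word\n' "$quoted_count"
```

If this is indeed the cause, pointing JAVA_HOME at a path without spaces may be enough to sidestep it.
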
>>>>>>> 
>>>>>>> As for Tika, I did modify the CompositeDetector.java file in tika-core
>>>>>>> since I added a Detector to detect the AFM files and had to make a
>>>>>>> slight change to the CompositeDetector. I did rebuild Nutch after I
>>>>>>> changed the jars and it built fine but that is when I started getting
>>>>>>> the fetch failed error.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Fernando
>>>>>>> 
>>>>>>> On Tue, Jul 12, 2011 at 2:13 AM, Julien Nioche <
>>>>>>> 
>>>>>>> [email protected]> wrote:
>>>>>>>> Hi Fernando
>>>>>>>> 
>>>>>>>>> I have made some additions (a new parser) to the Apache Tika
>>>>>>>>> application and I am trying to see if I can run my new changes using
>>>>>>>>> the crawl mechanism in Nutch, but I am having some trouble updating
>>>>>>>>> Nutch with my modified Tika application.
>>>>>>>>> 
>>>>>>>>> The Tika updates I made run fine if I run Tika as a standalone using
>>>>>>>>> either the command line or the Tika GUI.
>>>>>>>> 
>>>>>>>> OK
>>>>>>>> 
>>>>>>>>> I am using Nutch 1.2; 1.3 does not seem to run for me (I get an error
>>>>>>>>> saying C:/Program not found whenever I try to do anything), but 1.2
>>>>>>>>> should be fine for what I am trying to do, which is just to see the
>>>>>>>>> parse results from the new parser I added to Tika.
>>>>>>>>> 
>>>>>>>>> I have replaced the tika-core.jar, tika-parsers.jar and
>>>>>>>>> tika-mimetypes.xml files with my versions of those files as described
>>>>>>>>> in the following link:
>>>>>>>>> http://issues.apache.org/jira/browse/NUTCH-766. I also updated the
>>>>>>>>> nutch-site.xml to enable the parse-tika plugin. I also updated the
>>>>>>>>> parse-plugins.xml file with the following (afm files are what I am
>>>>>>>>> trying to parse):
>>>>>>>>> 
>>>>>>>>>       <mimeType name="application/x-font-afm">
>>>>>>>>>               <plugin id="parse-tika" />
>>>>>>>>>       </mimeType>
>>>>>>>> 
>>>>>>>> This is not necessary as by default parse-tika is used for any
>>>>>>>> mime-type unless the mapping mime-type / parser is specified in
>>>>>>>> parse-plugins.xml. This should not have an impact though.
>>>>>>>> 
>>>>>>>>> I am crawling a personal site in which I have links to .afm files. If
>>>>>>>>> I crawl before making any updates to Nutch, it fetches the files
>>>>>>>>> fine. After making the updates detailed above, I get the following
>>>>>>>>> error: "fetch of http://scf.usc.edu/~jfarreol/woor2___.AFM failed
>>>>>>>>> with: java.lang.NoClassDefFoundError:
>>>>>>>>> org/apache/james/mime4j/MimeException".
>>>>>>>>> 
>>>>>>>>> Not really sure what the issue is, but my guess is that I have not
>>>>>>>>> updated all the necessary files. Any help would be greatly
>>>>>>>>> appreciated.
>>>>>>>> 
>>>>>>>> Yep, sounds like you have a few jars missing. Nutch-1.2 came with
>>>>>>>> tika-0.7; which version of Tika are you trying to use?
>>>>>>>> If you just added a new parser then it would be easier to ship it as
>>>>>>>> a separate jar file. I assume that you did not have to modify
>>>>>>>> anything in tika-core, so you could use the standard Tika libs and
>>>>>>>> simply add yours using Ivy.
>>>>>>>> 
>>>>>>>> Nutch-1.3 (and 1.4 in SVN) contains a lot of improvements over 1.2 so
>>>>>>>> it would be worth getting to the bottom of the issue you're
>>>>>>>> encountering and getting 1.3 to work. Moreover I am not sure that you
>>>>>>>> can use a version of Tika newer than 0.7 on Nutch 1.2 without changing
>>>>>>>> parts of the code (to be checked though).
>>>>>>> 
>>>>>>>> Julien
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> *
>>>>>>>> *Open Source Solutions for Text Engineering
>>>>>>>> 
>>>>>>>> http://digitalpebble.blogspot.com/
>>>>>>>> http://www.digitalpebble.com
>>>>>> 
>>>>>> --
>>>>>> *Lewis*
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> *
>> *Open Source Solutions for Text Engineering
>> 
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> 


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
