Sorry guys I'm nutters! :) Cheers, Chris
On Jul 20, 2011, at 1:39 AM, Julien Nioche wrote:

> Glad you managed to get it to work. I don't know what Chris meant by that,
> can't see why we'd open a JIRA when we are already using the latest version
>
> Julien
>
> On 20 July 2011 08:19, Fernando Arreola <[email protected]> wrote:
>
>> Hi,
>>
>> Nutch 1.3 currently has Tika 0.9 which is the latest official version. I
>> was trying to replace the Tika in Nutch 1.3 with a Tika project which I had
>> modified (Tika 0.9 with a new parser I had created). Is it still recommended
>> that I create a JIRA issue if it currently has the latest official version?
>>
>> Thanks,
>> Fernando
>>
>> On Tue, Jul 19, 2011 at 9:41 PM, Mattmann, Chris A (388J)
>> <[email protected]> wrote:
>>
>>> Hey Fernando,
>>>
>>> Would be great to get a JIRA issue and patch to bring
>>> Nutch 1.4-branch up to date with the latest Tika
>>> based on your experience.
>>>
>>> Thanks for your help!
>>>
>>> Cheers,
>>> Chris
>>>
>>> On Jul 19, 2011, at 4:48 PM, Fernando Arreola wrote:
>>>
>>>> Hi,
>>>>
>>>> You were right, it is enough to provide the right clues in the
>>>> tika-mimetypes.xml file. Once the correct clues got in there, thanks to a
>>>> Tika developer, all I had to do was replace the jar files with mine. It is
>>>> working just as I want it now.
>>>>
>>>> Thanks everyone for the help.
>>>>
>>>> Fernando
>>>>
>>>> On Wed, Jul 13, 2011 at 1:48 AM, Julien Nioche
>>>> <[email protected]> wrote:
>>>>
>>>>> You probably need to make sure that conf/tika-mimetypes.xml is the version
>>>>> you've modified and contains the clues for detecting afm files.
>>>>> BTW out of curiosity why did you have to modify tika-core.jar? Isn't it
>>>>> enough to provide the clues in tika-mimetypes.xml?
>>>>>
>>>>> Jul
>>>>>
>>>>> On 13 July 2011 01:16, Fernando Arreola <[email protected]> wrote:
>>>>>
>>>>>> Thanks, I really appreciate all the help. I used the ParserChecker and I
>>>>>> could see the metadata my parser extracted!
>>>>>>
>>>>>> I have one more question though, I could only see the metadata my parser
>>>>>> extracted if I used the -forceAs mimetype option. Otherwise it is detected
>>>>>> as a text/plain file and my parser is then not called. I ran into a similar
>>>>>> problem in tika and added some functionality there so that Tika's detection
>>>>>> mechanism would not think afm files are text/plain. Does this mean not all
>>>>>> of my tika changes made it in (I updated both the tika-core.jar and
>>>>>> tika-parsers.jar files) or does Nutch have its own file type detection
>>>>>> mechanism?
>>>>>>
>>>>>> Thanks,
>>>>>> Fernando
>>>>>>
>>>>>> On Tue, Jul 12, 2011 at 4:54 PM, Markus Jelsma
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>>> Thanks for the help. I seem to be getting close to what I need to do, but
>>>>>>>> not quite there.
>>>>>>>>
>>>>>>>> I downloaded Nutch 1.3 and built it on a unix machine. It built and ran
>>>>>>>> fine (before changing any jar files) when I tested it on the site with the
>>>>>>>> .afm files that I want to get parsed.
>>>>>>>>
>>>>>>>> I then changed the tika-core.jar, tika-parsers.jar, nutch-site.xml (to
>>>>>>>> enable the parse-tika plugin) and tika-mimetypes.xml files with my updated
>>>>>>>> versions. I rebuilt (no errors) and then ran the crawl command on the same
>>>>>>>> site. The fetch seemed to work, I did not see any errors when running or in
>>>>>>>> the log file. There is a parse error but it is related to a pdf I have
>>>>>>>> linked in the site I crawled and since I am not interested in the pdf I
>>>>>>>> don't think it matters.
>>>>>>>>
>>>>>>>> Now here is my completely newb question: how can I tell if the afm files
>>>>>>>> were parsed correctly in the absence of errors?
>>>>>>>
>>>>>>> The ParserChecker is what you're looking for.
>>>>>>> It's a handy tool you can locally use to find out if all goes well.
>>>>>>>
>>>>>>> bin/nutch org.apache.nutch.parse.ParserChecker
>>>>>>>
>>>>>>>> I looked at the files in the segments/*/parse_data directory (since that is
>>>>>>>> where the tutorial says the metadata goes and the parser I created mostly
>>>>>>>> extracts metadata) but the files aren't really readable. I also figured
>>>>>>>> maybe I could search for some terms I expect the parser to extract but
>>>>>>>> couldn't perform a search. When I typed the following command in the
>>>>>>>> runtime/local directory:
>>>>>>>>
>>>>>>>> bin/nutch org.apache.nutch.searcher.NutchBean *search_term*
>>>>>>>>
>>>>>>>> I get the following error:
>>>>>>>>
>>>>>>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>>>>>>> org/apache/nutch/searcher/NutchBean
>>>>>>>>
>>>>>>>> I looked in the src directory and did not find the searcher (it was in
>>>>>>>> there in the 1.2 version). I tried downloading both the binary and the src
>>>>>>>> distributions for 1.3 and it was in neither. Is there a different way to
>>>>>>>> perform a search in 1.3 or is there a different way I can see readable
>>>>>>>> results of the parsed information?
>>>>>>>
>>>>>>> There is no searcher in 1.3. It is deprecated and removed. Use Solr for
>>>>>>> indexing to confirm or use ParserChecker or the new 1.4-dev
>>>>>>> o.a.n.indexer.IndexingFiltersChecker.
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Fernando
>>>>>>>>
>>>>>>>> On Tue, Jul 12, 2011 at 11:00 AM, lewis john mcgibbney
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> OK so at least we seem to have sorted out the first of your problems...
>>>>>>>>> but now face the dreaded Windows Cygwin partnership.
>>>>>>>>>
>>>>>>>>> We do not currently have an up-to-date tutorial for this.
>>>>>>>>> We do however have a tutorial for older versions of Nutch which you can
>>>>>>>>> find here [1] [2]
>>>>>>>>>
>>>>>>>>> I'm going to be brutally honest with you here, working with Cygwin was
>>>>>>>>> horrible from my own experience. There seems to be so much overhead and
>>>>>>>>> working with almost any other OS was a significantly easier option. I
>>>>>>>>> understand that this may mean a fundamental shift in your computing
>>>>>>>>> style but the benefit is well worth it.
>>>>>>>>>
>>>>>>>>> [1] http://wiki.apache.org/nutch/GettingNutchRunningOnCygwin
>>>>>>>>> [2] http://wiki.apache.org/nutch/GettingNutchRunningWithWindows?highlight=%28cygwin%29
>>>>>>>>>
>>>>>>>>> On Tue, Jul 12, 2011 at 6:23 PM, Fernando Arreola <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> Thanks for the replies.
>>>>>>>>>>
>>>>>>>>>> I have started trying to use Nutch 1.3 after your suggestions,
>>>>>>>>>> especially since I am using Tika 0.9, but I am not getting anywhere
>>>>>>>>>> with it. I am able to build fine but whenever I try to run any command
>>>>>>>>>> it gives the error stating that it cannot find C:\Program. For example,
>>>>>>>>>> if I try to run the following command to crawl:
>>>>>>>>>>
>>>>>>>>>> runtime/local/bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>>>>>>>>>
>>>>>>>>>> It then gives me the following error right away before any other
>>>>>>>>>> output:
>>>>>>>>>>
>>>>>>>>>> runtime/local/bin/nutch: line 251: exec: C:\Program: not found
>>>>>>>>>>
>>>>>>>>>> I am running on Cygwin on Windows 7, if that helps.
>>>>>>>>>>
>>>>>>>>>> As for Tika, I did modify the CompositeDetector.java file in tika-core
>>>>>>>>>> since I added a Detector to detect the AFM files and had to make a slight
>>>>>>>>>> change to the CompositeDetector. I did rebuild Nutch after I changed the
>>>>>>>>>> jars and it built fine but that is when I started getting the fetch
>>>>>>>>>> failed error.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Fernando
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 12, 2011 at 2:13 AM, Julien Nioche <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Fernando
>>>>>>>>>>>
>>>>>>>>>>>> I have made some additions (a new parser) to the Apache Tika application
>>>>>>>>>>>> and I am trying to see if I can run my new changes using the crawl
>>>>>>>>>>>> mechanism in Nutch, but I am having some trouble updating Nutch with my
>>>>>>>>>>>> modified Tika application.
>>>>>>>>>>>>
>>>>>>>>>>>> The Tika updates I made run fine if I run Tika as a standalone using
>>>>>>>>>>>> either the command line or the Tika GUI.
>>>>>>>>>>>
>>>>>>>>>>> OK
>>>>>>>>>>>
>>>>>>>>>>>> I am using Nutch 1.2, 1.3 seems to not be able to run for me (I get an
>>>>>>>>>>>> error saying C:/Program not found whenever I try to do anything but 1.2
>>>>>>>>>>>> should be fine for what I am trying to do which is just to see the parse
>>>>>>>>>>>> results from the new parser I added to Tika).
>>>>>>>>>>>>
>>>>>>>>>>>> I have replaced the tika-core.jar, tika-parsers.jar and tika-mimetypes.xml
>>>>>>>>>>>> files with my versions of those files as described in the following link:
>>>>>>>>>>>> http://issues.apache.org/jira/browse/NUTCH-766. I also updated the
>>>>>>>>>>>> nutch-site.xml to enable the parse-tika plugin. I also updated the
>>>>>>>>>>>> parse-plugins.xml file with the following (afm files are what I am trying
>>>>>>>>>>>> to parse):
>>>>>>>>>>>>
>>>>>>>>>>>> <mimeType name="application/x-font-afm">
>>>>>>>>>>>>   <plugin id="parse-tika" />
>>>>>>>>>>>> </mimeType>
>>>>>>>>>>>
>>>>>>>>>>> This is not necessary as by default parse-tika is used for any mime-type
>>>>>>>>>>> unless the mapping mime-type / parser is specified in parse-plugins.xml.
>>>>>>>>>>> This should not have an impact though
>>>>>>>>>>>
>>>>>>>>>>>> I am crawling a personal site in which I have links to .afm files. If I
>>>>>>>>>>>> crawl before making any updates to Nutch, it fetches the files fine. After
>>>>>>>>>>>> making the updates detailed above, I get the following error: "fetch of
>>>>>>>>>>>> http://scf.usc.edu/~jfarreol/woor2___.AFM failed with:
>>>>>>>>>>>> java.lang.NoClassDefFoundError: org/apache/james/mime4j/MimeException".
>>>>>>>>>>>>
>>>>>>>>>>>> Not really sure what the issue is but my guess is that I have not updated
>>>>>>>>>>>> all the necessary files. Any help would be greatly appreciated.
>>>>>>>>>>>
>>>>>>>>>>> yep, sounds like you have a few jars missing.
>>>>>>>>>>> Nutch-1.2 came with tika-0.7, which version of tika are you trying to use?
>>>>>>>>>>> if you just added a new parser then it would be easier to ship it as
>>>>>>>>>>> a separate jar file. I assume that you did not have to modify
>>>>>>>>>>> anything in tika-core, so you could use the standard tika libs and
>>>>>>>>>>> simply add yours using Ivy.
>>>>>>>>>>>
>>>>>>>>>>> Nutch-1.3 (and 1.4 in SVN) contain a lot of improvements over 1.2 so
>>>>>>>>>>> it would be worth getting to the bottom of the issue you're
>>>>>>>>>>> encountering and get 1.3 to work. Moreover I am not sure that you can use
>>>>>>>>>>> a version of Tika 0.7 on Nutch 1.2 without changing parts of the code
>>>>>>>>>>> (to be checked though)
>>>>>>>>>>>
>>>>>>>>>>> Julien
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> *
>>>>>>>>>>> *Open Source Solutions for Text Engineering
>>>>>>>>>>>
>>>>>>>>>>> http://digitalpebble.blogspot.com/
>>>>>>>>>>> http://www.digitalpebble.com
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Lewis*
>>>
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Senior Computer Scientist
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 171-266B, Mailstop: 171-246
>>> Email: [email protected]
>>> WWW: http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Assistant Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
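
For anyone following the detection question in this thread (afm files coming back as text/plain until the right clues were in tika-mimetypes.xml), here is a rough Python sketch of how clue-based detection tends to behave. It is not Tika's or Nutch's code: the clue table, `detect`, and `looks_like_text` are made-up names for illustration only. The one format fact assumed is that AFM files open with the keyword StartFontMetrics.

```python
import fnmatch

# Hypothetical clue table (illustration only, not Tika's format or API).
# Each entry pairs a MIME type with a filename glob and a magic prefix.
CLUES = [
    {
        "type": "application/x-font-afm",  # type name used in this thread
        "glob": "*.afm",
        "magic": b"StartFontMetrics",      # keyword that opens an AFM file
    },
]


def looks_like_text(head: bytes) -> bool:
    """True if every byte is printable ASCII or common whitespace."""
    return all(b in b"\t\n\r" or 32 <= b < 127 for b in head)


def detect(name: str, head: bytes, clues=CLUES) -> str:
    """Guess a MIME type from the leading bytes, then from the file name."""
    for clue in clues:
        if head.startswith(clue["magic"]):
            return clue["type"]
    for clue in clues:
        if fnmatch.fnmatch(name.lower(), clue["glob"]):
            return clue["type"]
    # No clue matched: plain-looking content falls back to text/plain.
    return "text/plain" if looks_like_text(head) else "application/octet-stream"


afm_head = b"StartFontMetrics 2.0\nFontName Example\n"
print(detect("woor2___.AFM", afm_head))            # clue table has the afm entry
print(detect("woor2___.AFM", afm_head, clues=[]))  # clue table is missing it
```

With the clue in place the file is recognised by its leading bytes. With an empty clue table the same bytes are all printable, so the sketch falls back to text/plain, which matches the symptom reported above when the modified mimetypes file was not the one being read.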

