Hey Fernando,

Would be great to get a JIRA issue and patch to bring Nutch 1.4-branch up to date with the latest Tika based on your experience.
Thanks for your help!

Cheers,
Chris

On Jul 19, 2011, at 4:48 PM, Fernando Arreola wrote:

> Hi,
>
> You were right, it is enough to provide the right clues in the
> tika-mimetypes.xml file. Once the correct clues got in there, thanks to a
> Tika developer, all I had to do was replace the jar files with mine. It is
> working just as I want it now.
>
> Thanks everyone for the help.
>
> Fernando
>
> On Wed, Jul 13, 2011 at 1:48 AM, Julien Nioche <[email protected]> wrote:
>
>> You probably need to make sure that conf/tika-mimetypes.xml is the version
>> you've modified and contains the clues for detecting afm files.
>> BTW out of curiosity why did you have to modify tika-core.jar? Isn't it
>> enough to provide the clues in tika-mimetypes.xml?
>>
>> Jul
>>
>> On 13 July 2011 01:16, Fernando Arreola <[email protected]> wrote:
>>
>>> Thanks, I really appreciate all the help. I used the ParserChecker and I
>>> could see the metadata my parser extracted!
>>>
>>> I have one more question though, I could only see the metadata my parser
>>> extracted if I used the -forceAs mimetype option. Otherwise it is detected
>>> as a text/plain file and my parser is then not called. I ran into a similar
>>> problem in tika and added some functionality there so that Tika's detection
>>> mechanism would not think afm files are text/plain. Does this mean not all
>>> of my tika changes made it in (I updated both the tika-core.jar and
>>> tika-parsers.jar files) or does Nutch have its own file type detection
>>> mechanism?
>>>
>>> Thanks,
>>> Fernando
>>>
>>> On Tue, Jul 12, 2011 at 4:54 PM, Markus Jelsma <[email protected]> wrote:
>>>
>>>>> Thanks for the help. I seem to be getting close to what I need to do, but
>>>>> not quite there.
>>>>>
>>>>> I downloaded Nutch 1.3 and built it on a unix machine. It built and ran
>>>>> fine (before changing any jar files) when I tested it on the site with the
>>>>> .afm files that I want to get parsed.
>>>>>
>>>>> I then changed the tika-core.jar, tika-parsers.jar, nutch-site.xml (to
>>>>> enable the parse-tika plugin) and tika-mimetypes.xml files with my updated
>>>>> versions. I rebuilt (no errors) and then ran the crawl command on the same
>>>>> site. The fetch seemed to work, I did not see any errors when running or in
>>>>> the log file. There is a parse error but it is related to a pdf I have
>>>>> linked in the site I crawled and since I am not interested in the pdf I
>>>>> don't think it matters.
>>>>>
>>>>> Now here is my completely newb question: how can I tell if the afm files
>>>>> were parsed correctly in the absence of errors?
>>>>
>>>> The ParserChecker is what you're looking for. It's a handy tool you can
>>>> locally use to find out if all goes well.
>>>>
>>>> bin/nutch org.apache.nutch.parse.ParserChecker
>>>>
>>>>> I looked at the files in the segments/*/parse_data directory (since that is
>>>>> where the tutorial says the metadata goes and the parser I created mostly
>>>>> extracts metadata) but the files aren't really readable. I also figured
>>>>> maybe I could search for some terms I expect parser to extract but couldn't
>>>>> perform a search. When I typed the following command in the runtime/local
>>>>> directory:
>>>>>
>>>>> bin/nutch org.apache.nutch.searcher.NutchBean *search_term*
>>>>>
>>>>> I get the following error:
>>>>>
>>>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>>>> org/apache/nutch/searcher/NutchBean
>>>>>
>>>>> I looked in the src directory and did not find the searcher (it was in
>>>>> there in the 1.2 version). I tried downloading both the binary and the src
>>>>> distributions for 1.3 and it was in neither. Is there a different way to
>>>>> perform a search in 1.3 or is there a different way I can see readable
>>>>> results of the parsed information?
>>>>
>>>> There is no searcher in 1.3. It is deprecated and removed.
>>>> Use Solr for indexing to confirm or use ParserChecker or the new 1.4-dev
>>>> o.a.n.indexer.IndexingFiltersChecker.
>>>>
>>>>> Thanks,
>>>>> Fernando
>>>>>
>>>>> On Tue, Jul 12, 2011 at 11:00 AM, lewis john mcgibbney <[email protected]> wrote:
>>>>>
>>>>>> OK so at least we seem to have sorted out the first of your problems...
>>>>>> but now face the dreaded Windows Cygwin partnership.
>>>>>>
>>>>>> We do not currently have an up-to-date tutorial for this. We do however have
>>>>>> a tutorial for older versions of Nutch which you can find here [1] [2]
>>>>>>
>>>>>> I'm going to be brutally honest with you here, working with Cygwin was
>>>>>> horrible from my own experience. There seems to be so much overhead and
>>>>>> working with almost any other OS was a significantly easier option. I
>>>>>> understand that this may mean a fundamental shift in your computing
>>>>>> style but the benefit is well worth it.
>>>>>>
>>>>>> [1] http://wiki.apache.org/nutch/GettingNutchRunningOnCygwin
>>>>>> [2] http://wiki.apache.org/nutch/GettingNutchRunningWithWindows?highlight=%28cygwin%29
>>>>>>
>>>>>> On Tue, Jul 12, 2011 at 6:23 PM, Fernando Arreola <[email protected]> wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> Thanks for the replies.
>>>>>>>
>>>>>>> I have started trying to use Nutch 1.3 after your suggestions,
>>>>>>> especially since I am using Tika 0.9, but I am not getting anywhere
>>>>>>> with it. I am able
>>>>>>> to build fine but whenever I try to run any command it gives the error
>>>>>>> stating that it cannot find C:\Program.
>>>>>>> For example, if I try to run
>>>>>>> the following command to crawl:
>>>>>>>
>>>>>>> runtime/local/bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>>>>>>
>>>>>>> It then gives me the following error right away before any other
>>>>>>> output:
>>>>>>>
>>>>>>> runtime/local/bin/nutch: line 251: exec: C:\Program: not found
>>>>>>>
>>>>>>> I am running on Cygwin on Windows 7, if that helps.
>>>>>>>
>>>>>>> As for Tika, I did modify the CompositeDetector.java file in tika-core since
>>>>>>> I added a Detector to detect the AFM files and had to make a slight change
>>>>>>> to the CompositeDetector. I did rebuild Nutch after I changed the jars and
>>>>>>> it built fine but that is when I started getting the fetch failed
>>>>>>> error.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Fernando
>>>>>>>
>>>>>>> On Tue, Jul 12, 2011 at 2:13 AM, Julien Nioche <[email protected]> wrote:
>>>>>>>> Hi Fernando
>>>>>>>>
>>>>>>>>> I have made some additions (a new parser) to the Apache Tika application and
>>>>>>>>> I am trying to see if I can run my new changes using the crawl mechanism in
>>>>>>>>> Nutch, but I am having some trouble updating Nutch with my modified Tika
>>>>>>>>> application.
>>>>>>>>>
>>>>>>>>> The Tika updates I made run fine if I run Tika as a standalone using either
>>>>>>>>> the command line or the Tika GUI.
>>>>>>>>
>>>>>>>> OK
>>>>>>>>
>>>>>>>>> I am using Nutch 1.2, 1.3 seems to not be able to run for me (I get an error
>>>>>>>>> saying C:/Program not found whenever I try to do anything but 1.2 should be
>>>>>>>>> fine for what I am trying to do which is just to see the parse results from
>>>>>>>>> the new parser I added to Tika).
>>>>>>>>>
>>>>>>>>> I have replaced the tika-core.jar, tika-parsers.jar and tika-mimetypes.xml
>>>>>>>>> files with my versions of those files as described in the following link:
>>>>>>>>> http://issues.apache.org/jira/browse/NUTCH-766. I also updated the
>>>>>>>>> nutch-site.xml to enable the parse-tika plugin. I also updated the
>>>>>>>>> parse-plugins.xml file with the following (afm files are what I am trying to
>>>>>>>>> parse):
>>>>>>>>>
>>>>>>>>> <mimeType name="application/x-font-afm">
>>>>>>>>>   <plugin id="parse-tika" />
>>>>>>>>> </mimeType>
>>>>>>>>
>>>>>>>> This is not necessary as by default parse-tika is used for any mime-type
>>>>>>>> unless the mapping mime-type / parser is specified in parse-plugins.xml.
>>>>>>>> This should not have an impact though
>>>>>>>>
>>>>>>>>> I am crawling a personal site in which I have links to .afm files. If I
>>>>>>>>> crawl before making any updates to Nutch, it fetches the files fine. After
>>>>>>>>> making the updates detailed above, I get the following error: "fetch of
>>>>>>>>> http://scf.usc.edu/~jfarreol/woor2___.AFM failed with:
>>>>>>>>> java.lang.NoClassDefFoundError: org/apache/james/mime4j/MimeException".
>>>>>>>>>
>>>>>>>>> Not really sure, what the issue is but my guess is that I have not updated
>>>>>>>>> all the necessary files. Any help would be greatly appreciated.
>>>>>>>>
>>>>>>>> yep, sounds like you have a few jars missing. Nutch-1.2 came with tika-0.7,
>>>>>>>> which version of tika are you trying to use?
>>>>>>>> if you just added a new parser then it would be easier to ship it as
>>>>>>>> a separate jar file. I assume that you did not have to modify
>>>>>>>> anything in tika-core, so you could use the standard tika libs and
>>>>>>>> simply add yours using Ivy.
>>>>>>>>
>>>>>>>> Nutch-1.3 (and 1.4 in SVN) contain a lot of improvements over 1.2 so
>>>>>>>> it would be worth getting to the bottom of the issue you're
>>>>>>>> encountering and
>>>>>>>> get 1.3 to work. Moreover I am not sure that you can use a version of Tika
>>>>>>>> 0.7 on Nutch 1.2 without changing parts of the code (to be checked though)
>>>>>>>>
>>>>>>>> Julien
>>>>>>>>
>>>>>>>> --
>>>>>>>> *
>>>>>>>> *Open Source Solutions for Text Engineering
>>>>>>>>
>>>>>>>> http://digitalpebble.blogspot.com/
>>>>>>>> http://www.digitalpebble.com
>>>>>>
>>>>>> --
>>>>>> *Lewis*
>>
>> --
>> *
>> *Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
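
For anyone reading this thread later: the "clues" discussed above are, roughly, file-name patterns and leading-byte signatures that let a detector pick a specific type instead of falling back to text/plain. The sketch below illustrates that idea in Python only. It is not Tika's or Nutch's implementation, and the clue table, the `detect` function and the text/plain fallback are hypothetical names chosen for illustration. The type name application/x-font-afm is the one used earlier in the thread, and "StartFontMetrics" is the keyword that opens an AFM file.

```python
from fnmatch import fnmatch

# Hypothetical clue table, for illustration only (not Tika's data model).
# Each entry pairs a MIME type with file-name globs and byte signatures.
CLUES = [
    {
        "type": "application/x-font-afm",      # type name used in the thread
        "globs": ["*.afm"],                     # file-name clue
        "magic": [(0, b"StartFontMetrics")],    # (offset, expected bytes)
    },
]

# Assumed fallback, mirroring the text/plain result described in the thread.
FALLBACK = "text/plain"


def detect(name: str, head: bytes) -> str:
    """Return a MIME type from content clues first, then file-name clues."""
    # Content clues: compare the leading bytes against each signature.
    for clue in CLUES:
        for offset, expected in clue["magic"]:
            if head[offset:offset + len(expected)] == expected:
                return clue["type"]
    # File-name clues: case-insensitive glob match on the name.
    for clue in CLUES:
        if any(fnmatch(name.lower(), glob) for glob in clue["globs"]):
            return clue["type"]
    return FALLBACK
```

With a table like this, a file named woor2___.AFM whose first bytes are "StartFontMetrics" resolves to the AFM type rather than text/plain, which is the behaviour Fernando needed to avoid passing -forceAs to ParserChecker.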

