Glad you managed to get it to work. I don't know what Chris meant by that, can't see why we'd open a JIRA when we are already using the latest version.
Julien

On 20 July 2011 08:19, Fernando Arreola <[email protected]> wrote:

> Hi,
>
> Nutch 1.3 currently has Tika 0.9 which is the latest official version. I was
> trying to replace the Tika in Nutch 1.3 with a Tika project which I had
> modified (Tika 0.9 with a new parser I had created). Is it still recommended
> that I create a JIRA issue if it currently has the latest official version?
>
> Thanks,
> Fernando
>
> On Tue, Jul 19, 2011 at 9:41 PM, Mattmann, Chris A (388J) <
> [email protected]> wrote:
>
> > Hey Fernando,
> >
> > Would be great to get a JIRA issue and patch to bring
> > Nutch 1.4-branch up to date with the latest Tika
> > based on your experience.
> >
> > Thanks for your help!
> >
> > Cheers,
> > Chris
> >
> > On Jul 19, 2011, at 4:48 PM, Fernando Arreola wrote:
> >
> > > Hi,
> > >
> > > You were right, it is enough to provide the right clues in the
> > > tika-mimetypes.xml file. Once the correct clues got in there, thanks to a
> > > Tika developer, all I had to do was replace the jar files with mine. It is
> > > working just as I want it now.
> > >
> > > Thanks everyone for the help.
> > >
> > > Fernando
> > >
> > > On Wed, Jul 13, 2011 at 1:48 AM, Julien Nioche <
> > > [email protected]> wrote:
> > >
> > >> You probably need to make sure that conf/tika-mimetypes.xml is the version
> > >> you've modified and contains the clues for detecting afm files.
> > >> BTW out of curiosity why did you have to modify tika-core.jar? Isn't it
> > >> enough to provide the clues in tika-mimetypes.xml?
> > >>
> > >> Jul
> > >>
> > >> On 13 July 2011 01:16, Fernando Arreola <[email protected]> wrote:
> > >>
> > >>> Thanks, I really appreciate all the help. I used the ParserChecker and I
> > >>> could see the metadata my parser extracted!
> > >>>
> > >>> I have one more question though, I could only see the metadata my parser
> > >>> extracted if I used the -forceAs mimetype option. Otherwise it is detected
> > >>> as a text/plain file and my parser is then not called. I ran into a similar
> > >>> problem in tika and added some functionality there so that Tika's detection
> > >>> mechanism would not think afm files are text/plain. Does this mean not all
> > >>> of my tika changes made it in (I updated both the tika-core.jar and
> > >>> tika-parsers.jar files) or does Nutch have its own file type detection
> > >>> mechanism?
> > >>>
> > >>> Thanks,
> > >>> Fernando
> > >>>
> > >>> On Tue, Jul 12, 2011 at 4:54 PM, Markus Jelsma
> > >>> <[email protected]> wrote:
> > >>>
> > >>>>
> > >>>>> Thanks for the help. I seem to be getting close to what I need to do, but
> > >>>>> not quite there.
> > >>>>>
> > >>>>> I downloaded Nutch 1.3 and built it on a unix machine. It built and ran
> > >>>>> fine (before changing any jar files) when I tested it on the site with the
> > >>>>> .afm files that I want to get parsed.
> > >>>>>
> > >>>>> I then changed the tika-core.jar, tika-parsers.jar, nutch-site.xml (to
> > >>>>> enable the parse-tika plugin) and tika-mimetypes.xml files with my updated
> > >>>>> versions. I rebuilt (no errors) and then ran the crawl command on the same
> > >>>>> site. The fetch seemed to work, I did not see any errors when running or in
> > >>>>> the log file. There is a parse error but it is related to a pdf I have
> > >>>>> linked in the site I crawled and since I am not interested in the pdf I
> > >>>>> don't think it matters.
> > >>>>>
> > >>>>> Now here is my completely newb question: how can I tell if the afm files
> > >>>>> were parsed correctly in the absence of errors?
> > >>>>
> > >>>> The ParserChecker is what you're looking for. It's a handy tool you can
> > >>>> locally use to find out if all goes well.
> > >>>>
> > >>>> bin/nutch org.apache.nutch.parse.ParserChecker
> > >>>>
> > >>>>>
> > >>>>> I looked at the files in the segments/*/parse_data directory (since that is
> > >>>>> where the tutorial says the metadata goes and the parser I created mostly
> > >>>>> extracts metadata) but the files aren't really readable. I also figured
> > >>>>> maybe I could search for some terms I expect parser to extract but couldn't
> > >>>>> perform a search. When I typed the following command in the runtime/local
> > >>>>> directory:
> > >>>>>
> > >>>>> bin/nutch org.apache.nutch.searcher.NutchBean *search_term*
> > >>>>>
> > >>>>> I get the following error:
> > >>>>>
> > >>>>> Exception in thread "main" java.lang.NoClassDefFoundError:
> > >>>>> org/apache/nutch/searcher/NutchBean
> > >>>>>
> > >>>>> I looked in the src directory and did not find the searcher (it was in
> > >>>>> there in the 1.2 version). I tried downloading both the binary and the src
> > >>>>> distributions for 1.3 and it was in neither. Is there a different way to
> > >>>>> perform a search in 1.3 or is there a different way I can see readable
> > >>>>> results of the parsed information?
> > >>>>
> > >>>> There is no searcher in 1.3. It is deprecated and removed. Use Solr for
> > >>>> indexing to confirm or use ParserChecker or the new 1.4-dev
> > >>>> o.a.n.indexer.IndexingFiltersChecker.
> > >>>>
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Fernando
> > >>>>>
> > >>>>> On Tue, Jul 12, 2011 at 11:00 AM, lewis john mcgibbney <
> > >>>>> [email protected]> wrote:
> > >>>>>> OK so at least we seem to have sorted out the first of your problems...
> > >>>>>> but now face the dreaded Windows Cygwin partnership.
> > >>>>>>
> > >>>>>> We do not currently have an up-to-date tutorial for this.
> > >>>>>> We do however have
> > >>>>>> a tutorial for older versions of Nutch which you can find here [1] [2]
> > >>>>>>
> > >>>>>> I'm going to be brutally honest with you here, working with Cygwin was
> > >>>>>> horrible from my own experience. There seems to be so much overhead and
> > >>>>>> working with almost any other OS was a significantly easier option. I
> > >>>>>> understand that this may mean a fundamental shift in your computing
> > >>>>>> style but the benefit is well worth it.
> > >>>>>>
> > >>>>>> [1] http://wiki.apache.org/nutch/GettingNutchRunningOnCygwin
> > >>>>>> [2] http://wiki.apache.org/nutch/GettingNutchRunningWithWindows?highlight=%28cygwin%29
> > >>>>>>
> > >>>>>> On Tue, Jul 12, 2011 at 6:23 PM, Fernando Arreola <[email protected]>
> > >>>>>> wrote:
> > >>>>>>> Hello,
> > >>>>>>>
> > >>>>>>> Thanks for the replies.
> > >>>>>>>
> > >>>>>>> I have started trying to use Nutch 1.3 after your suggestions,
> > >>>>>>> especially since I am using Tika 0.9, but I am not getting anywhere
> > >>>>>>> with it. I am able to build fine but whenever I try to run any command
> > >>>>>>> it gives the error stating that it cannot find C:\Program. For example,
> > >>>>>>> if I try to run the following command to crawl:
> > >>>>>>>
> > >>>>>>> runtime/local/bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > >>>>>>>
> > >>>>>>> It then gives me the following error right away before any other
> > >>>>>>> output:
> > >>>>>>>
> > >>>>>>> runtime/local/bin/nutch: line 251: exec: C:\Program: not found
> > >>>>>>>
> > >>>>>>> I am running on Cygwin on Windows 7, if that helps.
> > >>>>>>>
> > >>>>>>> As for Tika, I did modify the CompositeDetector.java file in tika-core
> > >>>>>>> since I added a Detector to detect the AFM files and had to make a slight
> > >>>>>>> change to the CompositeDetector. I did rebuild Nutch after I changed the
> > >>>>>>> jars and it built fine but that is when I started getting the fetch
> > >>>>>>> failed error.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Fernando
> > >>>>>>>
> > >>>>>>> On Tue, Jul 12, 2011 at 2:13 AM, Julien Nioche <
> > >>>>>>> [email protected]> wrote:
> > >>>>>>>> Hi Fernando
> > >>>>>>>>
> > >>>>>>>>> I have made some additions (a new parser) to the Apache Tika
> > >>>>>>>>> application and I am trying to see if I can run my new changes using
> > >>>>>>>>> the crawl mechanism in Nutch, but I am having some trouble updating
> > >>>>>>>>> Nutch with my modified Tika application.
> > >>>>>>>>>
> > >>>>>>>>> The Tika updates I made run fine if I run Tika as a standalone
> > >>>>>>>>> using either the command line or the Tika GUI.
> > >>>>>>>>
> > >>>>>>>> OK
> > >>>>>>>>
> > >>>>>>>>> I am using Nutch 1.2, 1.3 seems to not be able to run for me (I get an
> > >>>>>>>>> error saying C:/Program not found whenever I try to do anything but 1.2
> > >>>>>>>>> should be fine for what I am trying to do which is just to see the
> > >>>>>>>>> parse results from the new parser I added to Tika).
> > >>>>>>>>>
> > >>>>>>>>> I have replaced the tika-core.jar, tika-parsers.jar and
> > >>>>>>>>> tika-mimetypes.xml files with my versions of those files as described
> > >>>>>>>>> in the following link:
> > >>>>>>>>> http://issues.apache.org/jira/browse/NUTCH-766. I also updated the
> > >>>>>>>>> nutch-site.xml to enable the parse-tika plugin. I also updated the
> > >>>>>>>>> parse-plugins.xml file with the following (afm files are what I am
> > >>>>>>>>> trying to parse):
> > >>>>>>>>>
> > >>>>>>>>> <mimeType name="application/x-font-afm">
> > >>>>>>>>>   <plugin id="parse-tika" />
> > >>>>>>>>> </mimeType>
> > >>>>>>>>
> > >>>>>>>> This is not necessary as by default parse-tika is used for any mime-type
> > >>>>>>>> unless the mapping mime-type / parser is specified in parse-plugins.xml.
> > >>>>>>>> This should not have an impact though
> > >>>>>>>>
> > >>>>>>>>> I am crawling a personal site in which I have links to .afm files.
> > >>>>>>>>> If I crawl before making any updates to Nutch, it fetches the files
> > >>>>>>>>> fine. After making the updates detailed above, I get the following
> > >>>>>>>>> error: "fetch of
> > >>>>>>>>> http://scf.usc.edu/~jfarreol/woor2___.AFM failed with:
> > >>>>>>>>> java.lang.NoClassDefFoundError: org/apache/james/mime4j/MimeException".
> > >>>>>>>>> Not really sure, what the issue is but my guess is that I have not
> > >>>>>>>>> updated all the necessary files. Any help would be greatly appreciated.
> > >>>>>>>>
> > >>>>>>>> yep, sounds like you have a few jars missing. Nutch-1.2 came with
> > >>>>>>>> tika-0.7, which version of tika are you trying to use?
> > >>>>>>>> if you just added a new parser then it would be easier to ship it as
> > >>>>>>>> a separate jar file. I assume that you did not have to modify
> > >>>>>>>> anything in tika-core, so you could use the standard tika libs and
> > >>>>>>>> simply add yours using Ivy.
> > >>>>>>>>
> > >>>>>>>> Nutch-1.3 (and 1.4 in SVN) contain a lot of improvements over 1.2 so
> > >>>>>>>> it would be worth getting to the bottom of the issue you're
> > >>>>>>>> encountering and get 1.3 to work. Moreover I am not sure that you can
> > >>>>>>>> use a version of Tika 0.7 on Nutch 1.2 without changing parts of the
> > >>>>>>>> code (to be checked though)
> > >>>>>>>>
> > >>>>>>>> Julien
> > >>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> *
> > >>>>>>>> *Open Source Solutions for Text Engineering
> > >>>>>>>>
> > >>>>>>>> http://digitalpebble.blogspot.com/
> > >>>>>>>> http://www.digitalpebble.com
> > >>>>>>
> > >>>>>> --
> > >>>>>> *Lewis*
> > >>
> > >> --
> > >> *
> > >> *Open Source Solutions for Text Engineering
> > >>
> > >> http://digitalpebble.blogspot.com/
> > >> http://www.digitalpebble.com
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Senior Computer Scientist
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 171-266B, Mailstop: 171-246
> > Email: [email protected]
> > WWW: http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Assistant Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

--
*
*Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

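For anyone finding this thread later: the "clues" in conf/tika-mimetypes.xml discussed above are a glob pattern and/or a magic-byte match for the new type. A minimal sketch of such an entry follows, assuming the application/x-font-afm type name used earlier in the thread and assuming detection keys on the StartFontMetrics keyword that opens an AFM file. The priority value is an illustrative guess, and this is not necessarily the entry the Tika developer actually contributed.

```xml
<!-- Hypothetical entry for conf/tika-mimetypes.xml; the values are assumptions, not the thread's actual fix -->
<mime-type type="application/x-font-afm">
  <!-- Match by file extension -->
  <glob pattern="*.afm"/>
  <!-- Match by content: AFM files begin with the StartFontMetrics keyword -->
  <magic priority="50">
    <match value="StartFontMetrics" type="string" offset="0"/>
  </magic>
</mime-type>
```

With an entry along these lines in place, running ParserChecker without the -forceAs option is one way to check whether the file is still being detected as text/plain.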
