You probably need to make sure that conf/tika-mimetypes.xml is the version you've modified and contains the clues for detecting afm files. BTW out of curiosity why did you have to modify tika-core.jar? Isn't it enough to provide the clues in tika-mimetypes.xml?
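
For reference, the kind of clue I mean is a glob plus a magic match inside the mime-type definition. This is only a rough sketch, not taken from your file: the type name application/x-font-afm is the one from your parse-plugins.xml snippet quoted below, the priority value is an arbitrary assumption, and the magic string relies on AFM files beginning with the StartFontMetrics keyword.

```xml
<!-- Hypothetical entry for tika-mimetypes.xml; goes inside the <mime-info> root -->
<mime-type type="application/x-font-afm">
  <!-- clue 1: match on the file name extension -->
  <glob pattern="*.afm"/>
  <!-- clue 2: match on content; priority 50 is an assumed value -->
  <magic priority="50">
    <match value="StartFontMetrics" type="string" offset="0"/>
  </magic>
</mime-type>
```

If the copy of conf/tika-mimetypes.xml that Nutch actually loads has no entry along these lines, that would be consistent with the files being detected as text/plain unless you use -forceAs.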
Jul

On 13 July 2011 01:16, Fernando Arreola <[email protected]> wrote:

> Thanks, I really appreciate all the help. I used the ParserChecker and I
> could see the metadata my parser extracted!
>
> I have one more question though, I could only see the metadata my parser
> extracted if I used the -forceAs mimetype option. Otherwise it is detected
> as a text/plain file and my parser is then not called. I ran into a similar
> problem in tika and added some functionality there so that Tika's detection
> mechanism would not think afm files are text/plain. Does this mean not all
> of my tika changes made it in (I updated both the tika-core.jar and
> tika-parsers.jar files) or does Nutch have its own file type detection
> mechanism?
>
> Thanks,
> Fernando
>
> On Tue, Jul 12, 2011 at 4:54 PM, Markus Jelsma <[email protected]> wrote:
>
> > > Thanks for the help. I seem to be getting close to what I need to do,
> > > but not quite there.
> > >
> > > I downloaded Nutch 1.3 and built it on a unix machine. It built and ran
> > > fine (before changing any jar files) when I tested it on the site with
> > > the .afm files that I want to get parsed.
> > >
> > > I then changed the tika-core.jar, tika-parsers.jar, nutch-site.xml (to
> > > enable the parse-tika plugin) and tika-mimetypes.xml files with my
> > > updated versions. I rebuilt (no errors) and then ran the crawl command
> > > on the same site. The fetch seemed to work, I did not see any errors
> > > when running or in the log file. There is a parse error but it is
> > > related to a pdf I have linked in the site I crawled and since I am not
> > > interested in the pdf I don't think it matters.
> > >
> > > Now here is my completely newb question: how can I tell if the afm
> > > files were parsed correctly in the absence of errors?
> >
> > The ParserChecker is what you're looking for. It's a handy tool you can
> > locally use to find out if all goes well.
> >
> > bin/nutch org.apache.nutch.parse.ParserChecker
> >
> > > I looked at the files in the segments/*/parse_data directory (since
> > > that is where the tutorial says the metadata goes and the parser I
> > > created mostly extracts metadata) but the files aren't really readable.
> > > I also figured maybe I could search for some terms I expect the parser
> > > to extract but couldn't perform a search. When I typed the following
> > > command in the runtime/local directory:
> > >
> > > bin/nutch org.apache.nutch.searcher.NutchBean *search_term*
> > >
> > > I get the following error:
> > >
> > > Exception in thread "main" java.lang.NoClassDefFoundError:
> > > org/apache/nutch/searcher/NutchBean
> > >
> > > I looked in the src directory and did not find the searcher (it was in
> > > there in the 1.2 version). I tried downloading both the binary and the
> > > src distributions for 1.3 and it was in neither. Is there a different
> > > way to perform a search in 1.3 or is there a different way I can see
> > > readable results of the parsed information?
> >
> > There is no searcher in 1.3. It is deprecated and removed. Use Solr for
> > indexing to confirm or use ParserChecker or the new 1.4-dev
> > o.a.n.indexer.IndexingFiltersChecker.
> >
> > > Thanks,
> > > Fernando
> > >
> > > On Tue, Jul 12, 2011 at 11:00 AM, lewis john mcgibbney
> > > <[email protected]> wrote:
> > >
> > > > OK so at least we seem to have sorted out the first of your
> > > > problems... but now face the dreaded Windows Cygwin partnership.
> > > >
> > > > We do not currently have an up-to-date tutorial for this. We do
> > > > however have a tutorial for older versions of Nutch which you can
> > > > find here [1] [2]
> > > >
> > > > I'm going to be brutally honest with you here, working with Cygwin
> > > > was horrible from my own experience. There seems to be so much
> > > > overhead and working with almost any other OS was a significantly
> > > > easier option.
> > > > I understand that this may mean a fundamental shift in your
> > > > computing style but the benefit is well worth it.
> > > >
> > > > [1] http://wiki.apache.org/nutch/GettingNutchRunningOnCygwin
> > > > [2] http://wiki.apache.org/nutch/GettingNutchRunningWithWindows?highlight=%28cygwin%29
> > > >
> > > > On Tue, Jul 12, 2011 at 6:23 PM, Fernando Arreola
> > > > <[email protected]> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > Thanks for the replies.
> > > > >
> > > > > I have started trying to use Nutch 1.3 after your suggestions,
> > > > > especially since I am using Tika 0.9, but I am not getting anywhere
> > > > > with it. I am able to build fine but whenever I try to run any
> > > > > command it gives the error stating that it cannot find C:\Program.
> > > > > For example, if I try to run the following command to crawl:
> > > > >
> > > > > runtime/local/bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > > > >
> > > > > It then gives me the following error right away before any other
> > > > > output:
> > > > >
> > > > > runtime/local/bin/nutch: line 251: exec: C:\Program: not found
> > > > >
> > > > > I am running on Cygwin on Windows 7, if that helps.
> > > > >
> > > > > As for Tika, I did modify the CompositeDetector.java file in
> > > > > tika-core since I added a Detector to detect the AFM files and had
> > > > > to make a slight change to the CompositeDetector. I did rebuild
> > > > > Nutch after I changed the jars and it built fine but that is when I
> > > > > started getting the fetch failed error.
> > > > >
> > > > > Thanks,
> > > > > Fernando
> > > > >
> > > > > On Tue, Jul 12, 2011 at 2:13 AM, Julien Nioche
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi Fernando
> > > > > >
> > > > > > > I have made some additions (a new parser) to the Apache Tika
> > > > > > > application and I am trying to see if I can run my new changes
> > > > > > > using the crawl mechanism in Nutch, but I am having some
> > > > > > > trouble updating Nutch with my modified Tika application.
> > > > > > >
> > > > > > > The Tika updates I made run fine if I run Tika as a standalone
> > > > > > > using either the command line or the Tika GUI.
> > > > > >
> > > > > > OK
> > > > > >
> > > > > > > I am using Nutch 1.2, 1.3 seems to not be able to run for me (I
> > > > > > > get an error saying C:/Program not found whenever I try to do
> > > > > > > anything but 1.2 should be fine for what I am trying to do
> > > > > > > which is just to see the parse results from the new parser I
> > > > > > > added to Tika).
> > > > > > >
> > > > > > > I have replaced the tika-core.jar, tika-parsers.jar and
> > > > > > > tika-mimetypes.xml files with my versions of those files as
> > > > > > > described in the following link:
> > > > > > > http://issues.apache.org/jira/browse/NUTCH-766. I also updated
> > > > > > > the nutch-site.xml to enable the parse-tika plugin.
> > > > > > > I also updated the parse-plugins.xml file with the following
> > > > > > > (afm files are what I am trying to parse):
> > > > > > >
> > > > > > > <mimeType name="application/x-font-afm">
> > > > > > >   <plugin id="parse-tika" />
> > > > > > > </mimeType>
> > > > > >
> > > > > > This is not necessary as by default parse-tika is used for any
> > > > > > mime-type unless the mapping mime-type / parser is specified in
> > > > > > parse-plugins.xml. This should not have an impact though
> > > > > >
> > > > > > > I am crawling a personal site in which I have links to .afm
> > > > > > > files. If I crawl before making any updates to Nutch, it
> > > > > > > fetches the files fine. After making the updates detailed
> > > > > > > above, I get the following error: "fetch of
> > > > > > > http://scf.usc.edu/~jfarreol/woor2___.AFM failed with:
> > > > > > > java.lang.NoClassDefFoundError:
> > > > > > > org/apache/james/mime4j/MimeException".
> > > > > > >
> > > > > > > Not really sure, what the issue is but my guess is that I have
> > > > > > > not updated all the necessary files. Any help would be greatly
> > > > > > > appreciated.
> > > > > >
> > > > > > yep, sounds like you have a few jars missing. Nutch-1.2 came with
> > > > > > tika-0.7, which version of tika are you trying to use?
> > > > > > if you just added a new parser then it would be easier to ship it
> > > > > > as a separate jar file. I assume that you did not have to modify
> > > > > > anything in tika-core, so you could use the standard tika libs
> > > > > > and simply add yours using Ivy.
> > > > > >
> > > > > > Nutch-1.3 (and 1.4 in SVN) contain a lot of improvements over 1.2
> > > > > > so it would be worth getting to the bottom of the issue you're
> > > > > > encountering and get 1.3 to work. Moreover I am not sure that you
> > > > > > can use a version of Tika 0.7 on Nutch 1.2 without changing parts
> > > > > > of the code (to be checked though)
> > > > > >
> > > > > > Julien
> > > > > >
> > > > > > --
> > > > > > Open Source Solutions for Text Engineering
> > > > > >
> > > > > > http://digitalpebble.blogspot.com/
> > > > > > http://www.digitalpebble.com
> > > >
> > > > --
> > > > *Lewis*

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

