Thanks for the help. I seem to be getting close to what I need to do, but not quite there.
I downloaded Nutch 1.3 and built it on a Unix machine. It built and ran fine
(before changing any jar files) when I tested it on the site with the .afm
files that I want to get parsed. I then replaced the tika-core.jar,
tika-parsers.jar, nutch-site.xml (to enable the parse-tika plugin) and
tika-mimetypes.xml files with my updated versions. I rebuilt (no errors) and
then ran the crawl command on the same site. The fetch seemed to work; I did
not see any errors when running or in the log file. There is a parse error,
but it is related to a PDF I have linked in the site I crawled, and since I am
not interested in the PDF I don't think it matters.

Now here is my completely newb question: how can I tell if the afm files were
parsed correctly in the absence of errors? I looked at the files in the
segments/*/parse_data directory (since that is where the tutorial says the
metadata goes and the parser I created mostly extracts metadata) but the files
aren't really readable. I also figured maybe I could search for some terms I
expect the parser to extract, but I couldn't perform a search. When I typed the
following command in the runtime/local directory:

bin/nutch org.apache.nutch.searcher.NutchBean *search_term*

I get the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/searcher/NutchBean

I looked in the src directory and did not find the searcher (it was in there
in the 1.2 version). I tried downloading both the binary and the src
distributions for 1.3 and it was in neither.

Is there a different way to perform a search in 1.3, or is there a different
way I can see readable results of the parsed information?

Thanks,
Fernando

On Tue, Jul 12, 2011 at 11:00 AM, lewis john mcgibbney <[email protected]> wrote:

> OK so at least we seem to have sorted out the first of your problems...
> but now face the dreaded Windows Cygwin partnership.
>
> We do not currently have an up-to-date tutorial for this.
> We do however have a tutorial for older versions of Nutch which you can
> find here [1] [2]
>
> I'm going to be brutally honest with you here, working with Cygwin was
> horrible from my own experience. There seems to be so much overhead and
> working with almost any other OS was a significantly easier option. I
> understand that this may mean a fundamental shift in your computing style
> but the benefit is well worth it.
>
> [1] http://wiki.apache.org/nutch/GettingNutchRunningOnCygwin
> [2] http://wiki.apache.org/nutch/GettingNutchRunningWithWindows?highlight=%28cygwin%29
>
> On Tue, Jul 12, 2011 at 6:23 PM, Fernando Arreola <[email protected]> wrote:
>
> > Hello,
> >
> > Thanks for the replies.
> >
> > I have started trying to use Nutch 1.3 after your suggestions, especially
> > since I am using Tika 0.9, but I am not getting anywhere with it. I am
> > able to build fine but whenever I try to run any command it gives the
> > error stating that it cannot find C:\Program. For example, if I try to
> > run the following command to crawl:
> >
> > runtime/local/bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> >
> > It then gives me the following error right away before any other output:
> >
> > runtime/local/bin/nutch: line 251: exec: C:\Program: not found
> >
> > I am running on Cygwin on Windows 7, if that helps.
> >
> > As for Tika, I did modify the CompositeDetector.java file in tika-core
> > since I added a Detector to detect the AFM files and had to make a slight
> > change to the CompositeDetector. I did rebuild Nutch after I changed the
> > jars and it built fine but that is when I started getting the fetch
> > failed error.
> >
> > Thanks,
> > Fernando
> >
> > On Tue, Jul 12, 2011 at 2:13 AM, Julien Nioche <[email protected]> wrote:
> >
> > > Hi Fernando
> > >
> > > > I have made some additions (a new parser) to the Apache Tika
> > > > application and I am trying to see if I can run my new changes using
> > > > the crawl mechanism in Nutch, but I am having some trouble updating
> > > > Nutch with my modified Tika application.
> > > >
> > > > The Tika updates I made run fine if I run Tika as a standalone using
> > > > either the command line or the Tika GUI.
> > >
> > > OK
> > >
> > > > I am using Nutch 1.2, 1.3 seems to not be able to run for me (I get
> > > > an error saying C:/Program not found whenever I try to do anything
> > > > but 1.2 should be fine for what I am trying to do which is just to
> > > > see the parse results from the new parser I added to Tika).
> > > >
> > > > I have replaced the tika-core.jar, tika-parsers.jar and
> > > > tika-mimetypes.xml files with my versions of those files as described
> > > > in the following link:
> > > > http://issues.apache.org/jira/browse/NUTCH-766. I also updated the
> > > > nutch-site.xml to enable the parse-tika plugin. I also updated the
> > > > parse-plugins.xml file with the following (afm files are what I am
> > > > trying to parse):
> > > >
> > > > <mimeType name="application/x-font-afm">
> > > >   <plugin id="parse-tika" />
> > > > </mimeType>
> > >
> > > This is not necessary as by default parse-tika is used for any
> > > mime-type unless the mapping mime-type / parser is specified in
> > > parse-plugins.xml. This should not have an impact though
> > >
> > > > I am crawling a personal site in which I have links to .afm files.
> > > > If I crawl before making any updates to Nutch, it fetches the files
> > > > fine.
> > > > After making the updates detailed above, I get the following error:
> > > > "fetch of http://scf.usc.edu/~jfarreol/woor2___.AFM failed with:
> > > > java.lang.NoClassDefFoundError: org/apache/james/mime4j/MimeException".
> > > >
> > > > Not really sure, what the issue is but my guess is that I have not
> > > > updated all the necessary files. Any help would be greatly
> > > > appreciated.
> > >
> > > yep, sounds like you have a few jars missing. Nutch-1.2 came with
> > > tika-0.7, which version of tika are you trying to use?
> > > if you just added a new parser then it would be easier to ship it as a
> > > separate jar file. I assume that you did not have to modify anything in
> > > tika-core, so you could use the standard tika libs and simply add yours
> > > using Ivy.
> > >
> > > Nutch-1.3 (and 1.4 in SVN) contain a lot of improvements over 1.2 so it
> > > would be worth getting to the bottom of the issue you're encountering
> > > and get 1.3 to work. Moreover I am not sure that you can use a version
> > > of Tika > 0.7 on Nutch 1.2 without changing parts of the code (to be
> > > checked though)
> > >
> > > Julien
> > >
> > > --
> > > Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
>
> --
> *Lewis*

