I did update runtime/local/conf/tika-mimetypes.xml, and my changes are there. I looked at the code for the ParserChecker, and it seems to do its own content type detection using a Protocol call, so I am trying to set up Solr in the hope that detection works there (I am having some Unix memory issues, so I have not been able to install it to test yet).
As for tika-core.jar, I modified CompositeDetector.java. I wanted to add a detector for AFM files to go along with the parser. That is probably not necessary if I leave the correct clues in tika-mimetypes.xml, but I am new to Tika and to file detection in general, so I was not sure what the correct clues are, and since this is for a school project I am just trying to do as much work as I can.

My detector returns the appropriate MIME type, but as the CompositeDetector goes through the rest of the detectors the MIME type gets changed to text/plain, so I modified it to return the AFM MIME type if it is detected at any point. I am not sure whether the ParserChecker skips this detection part of Tika, since it detects the content type on its own (when I run Tika with my updates on its own, it detects the AFM files fine through both the GUI and the CLI).

Thanks,
Fernando

On Wed, Jul 13, 2011 at 1:48 AM, Julien Nioche <[email protected]> wrote:

> You probably need to make sure that conf/tika-mimetypes.xml is the version you've modified and contains the clues for detecting afm files.
> BTW out of curiosity why did you have to modify tika-core.jar? Isn't it enough to provide the clues in tika-mimetypes.xml?
>
> Jul
>
> On 13 July 2011 01:16, Fernando Arreola <[email protected]> wrote:
>
> > Thanks, I really appreciate all the help. I used the ParserChecker and I could see the metadata my parser extracted!
> >
> > I have one more question though, I could only see the metadata my parser extracted if I used the -forceAs mimetype option. Otherwise it is detected as a text/plain file and my parser is then not called. I ran into a similar problem in tika and added some functionality there so that Tika's detection mechanism would not think afm files are text/plain. Does this mean not all of my tika changes made it in (I updated both the tika-core.jar and tika-parsers.jar files) or does Nutch have its own file type detection mechanism?
> >
> > Thanks,
> > Fernando
> >
> > On Tue, Jul 12, 2011 at 4:54 PM, Markus Jelsma <[email protected]> wrote:
> >
> > > > Thanks for the help. I seem to be getting close to what I need to do, but not quite there.
> > > >
> > > > I downloaded Nutch 1.3 and built it on a unix machine. It built and ran fine (before changing any jar files) when I tested it on the site with the .afm files that I want to get parsed.
> > > >
> > > > I then changed the tika-core.jar, tika-parsers.jar, nutch-site.xml (to enable the parse-tika plugin) and tika-mimetypes.xml files with my updated versions. I rebuilt (no errors) and then ran the crawl command on the same site. The fetch seemed to work, I did not see any errors when running or in the log file. There is a parse error but it is related to a pdf I have linked in the site I crawled and since I am not interested in the pdf I don't think it matters.
> > > >
> > > > Now here is my completely newb question: how can I tell if the afm files were parsed correctly in the absence of errors?
> > >
> > > The ParserChecker is what you're looking for. It's a handy tool you can locally use to find out if all goes well.
> > >
> > > bin/nutch org.apache.nutch.parse.ParserChecker
> > >
> > > > I looked at the files in the segments/*/parse_data directory (since that is where the tutorial says the metadata goes and the parser I created mostly extracts metadata) but the files aren't really readable. I also figured maybe I could search for some terms I expect parser to extract but couldn't perform a search.
> > > > When I typed the following command in the runtime/local directory:
> > > >
> > > > bin/nutch org.apache.nutch.searcher.NutchBean *search_term*
> > > >
> > > > I get the following error:
> > > >
> > > > Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/searcher/NutchBean
> > > >
> > > > I looked in the src directory and did not find the searcher (it was in there in the 1.2 version). I tried downloading both the binary and the src distributions for 1.3 and it was in neither. Is there a different way to perform a search in 1.3 or is there a different way I can see readable results of the parsed information?
> > >
> > > There is no searcher in 1.3. It is deprecated and removed. Use Solr for indexing to confirm or use ParserChecker or the new 1.4-dev o.a.n.indexer.IndexingFiltersChecker.
> > >
> > > > Thanks,
> > > > Fernando
> > > >
> > > > On Tue, Jul 12, 2011 at 11:00 AM, lewis john mcgibbney <[email protected]> wrote:
> > > >
> > > > > OK so at least we seem to have sorted out the first of your problems... but now face the dreaded Windows Cygwin partnership.
> > > > >
> > > > > We do not currently have an up-to-date tutorial for this. We do however have a tutorial for older versions of Nutch which you can find here [1] [2]
> > > > >
> > > > > I'm going to be brutally honest with you here, working with Cygwin was horrible from my own experience. There seems to be so much overhead and working with almost any other OS was a significantly easier option. I understand that this may mean a fundamental shift in your computing style but the benefit is well worth it.
> > > > >
> > > > > [1] http://wiki.apache.org/nutch/GettingNutchRunningOnCygwin
> > > > > [2] http://wiki.apache.org/nutch/GettingNutchRunningWithWindows?highlight=%28cygwin%29
> > > > >
> > > > > On Tue, Jul 12, 2011 at 6:23 PM, Fernando Arreola <[email protected]> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Thanks for the replies.
> > > > > >
> > > > > > I have started trying to use Nutch 1.3 after your suggestions, especially since I am using Tika 0.9, but I am not getting anywhere with it. I am able to build fine but whenever I try to run any command it gives the error stating that it cannot find C:\Program. For example, if I try to run the following command to crawl:
> > > > > >
> > > > > > runtime/local/bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > > > > >
> > > > > > It then gives me the following error right away before any other output:
> > > > > >
> > > > > > runtime/local/bin/nutch: line 251: exec: C:\Program: not found
> > > > > >
> > > > > > I am running on Cygwin on Windows 7, if that helps.
> > > > > >
> > > > > > As for Tika, I did modify the CompositeDetector.java file in tika-core since I added a Detector to detect the AFM files and had to make a slight change to the CompositeDetector. I did rebuild Nutch after I changed the jars and it built fine but that is when I started getting the fetch failed error.
> > > > > >
> > > > > > Thanks,
> > > > > > Fernando
> > > > > >
> > > > > > On Tue, Jul 12, 2011 at 2:13 AM, Julien Nioche <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Fernando
> > > > > > >
> > > > > > > > I have made some additions (a new parser) to the Apache Tika application and I am trying to see if I can run my new changes using the crawl mechanism in Nutch, but I am having some trouble updating Nutch with my modified Tika application.
> > > > > > > >
> > > > > > > > The Tika updates I made run fine if I run Tika as a standalone using either the command line or the Tika GUI.
> > > > > > >
> > > > > > > OK
> > > > > > >
> > > > > > > > I am using Nutch 1.2, 1.3 seems to not be able to run for me (I get an error saying C:/Program not found whenever I try to do anything but 1.2 should be fine for what I am trying to do which is just to see the parse results from the new parser I added to Tika).
> > > > > > > >
> > > > > > > > I have replaced the tika-core.jar, tika-parsers.jar and tika-mimetypes.xml files with my versions of those files as described in the following link: http://issues.apache.org/jira/browse/NUTCH-766. I also updated the nutch-site.xml to enable the parse-tika plugin.
> > > > > > > > I also updated the parse-plugins.xml file with the following (afm files are what I am trying to parse):
> > > > > > > >
> > > > > > > > <mimeType name="application/x-font-afm">
> > > > > > > >   <plugin id="parse-tika" />
> > > > > > > > </mimeType>
> > > > > > >
> > > > > > > This is not necessary as by default parse-tika is used for any mime-type unless the mapping mime-type / parser is specified in parse-plugins.xml. This should not have an impact though
> > > > > > >
> > > > > > > > I am crawling a personal site in which I have links to .afm files. If I crawl before making any updates to Nutch, it fetches the files fine. After making the updates detailed above, I get the following error: "fetch of http://scf.usc.edu/~jfarreol/woor2___.AFM failed with: java.lang.NoClassDefFoundError: org/apache/james/mime4j/MimeException".
> > > > > > > >
> > > > > > > > Not really sure, what the issue is but my guess is that I have not updated all the necessary files. Any help would be greatly appreciated.
> > > > > > >
> > > > > > > yep, sounds like you have a few jars missing. Nutch-1.2 came with tika-0.7, which version of tika are you trying to use?
> > > > > > > if you just added a new parser then it would be easier to ship it as a separate jar file.
> > > > > > > I assume that you did not have to modify anything in tika-core, so you could use the standard tika libs and simply add yours using Ivy.
> > > > > > >
> > > > > > > Nutch-1.3 (and 1.4 in SVN) contain a lot of improvements over 1.2 so it would be worth getting to the bottom of the issue you're encountering and get 1.3 to work. Moreover I am not sure that you can use a version of Tika > 0.7 on Nutch 1.2 without changing parts of the code (to be checked though)
> > > > > > >
> > > > > > > Julien
> > > > > > >
> > > > > > > --
> > > > > > > *
> > > > > > > *Open Source Solutions for Text Engineering
> > > > > > >
> > > > > > > http://digitalpebble.blogspot.com/
> > > > > > > http://www.digitalpebble.com
> > > > >
> > > > > --
> > > > > *Lewis*
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
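
For anyone following the thread, here is a minimal sketch of the behaviour described at the top of this message: a detector that recognises AFM files, and a composite that returns the AFM type as soon as any detector reports it, so that a later generic text detector cannot overwrite it with text/plain. This is plain Java and deliberately does not use Tika's real Detector or CompositeDetector classes. The class name, method names and the simple text heuristic are hypothetical, and checking for the leading "StartFontMetrics" keyword is an assumption about how such a detector might work, not a description of the detector actually written for this project. The MIME type name is the one from the parse-plugins.xml snippet quoted above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class AfmDetectionSketch {
    // MIME type name taken from the parse-plugins.xml snippet in the thread.
    static final String AFM_TYPE = "application/x-font-afm";
    static final String TEXT_TYPE = "text/plain";
    static final String UNKNOWN_TYPE = "application/octet-stream";

    // AFM files open with the keyword "StartFontMetrics".
    static final byte[] AFM_MAGIC =
            "StartFontMetrics".getBytes(StandardCharsets.US_ASCII);

    // Hypothetical AFM detector: matches the magic keyword at offset 0.
    static String detectAfm(byte[] header) {
        if (header.length < AFM_MAGIC.length) {
            return UNKNOWN_TYPE;
        }
        for (int i = 0; i < AFM_MAGIC.length; i++) {
            if (header[i] != AFM_MAGIC[i]) {
                return UNKNOWN_TYPE;
            }
        }
        return AFM_TYPE;
    }

    // Hypothetical generic text detector: printable ASCII plus common
    // whitespace is reported as text/plain. An AFM file also passes this.
    static String detectText(byte[] header) {
        if (header.length == 0) {
            return UNKNOWN_TYPE;
        }
        for (byte b : header) {
            boolean printable = b >= 0x20 && b < 0x7f;
            boolean whitespace = b == '\n' || b == '\r' || b == '\t';
            if (!printable && !whitespace) {
                return UNKNOWN_TYPE;
            }
        }
        return TEXT_TYPE;
    }

    // Composite: ask each detector in turn, but return immediately once the
    // AFM type is reported so that later detectors cannot replace it.
    static String detectComposite(List<Function<byte[], String>> detectors,
                                  byte[] header) {
        String type = UNKNOWN_TYPE;
        for (Function<byte[], String> detector : detectors) {
            String detected = detector.apply(header);
            if (AFM_TYPE.equals(detected)) {
                return AFM_TYPE;
            }
            if (!UNKNOWN_TYPE.equals(detected)) {
                type = detected;
            }
        }
        return type;
    }

    // Detector order used in the example: AFM first, generic text second.
    static List<Function<byte[], String>> defaultDetectors() {
        List<Function<byte[], String>> detectors = new ArrayList<>();
        detectors.add(AfmDetectionSketch::detectAfm);
        detectors.add(AfmDetectionSketch::detectText);
        return detectors;
    }

    public static void main(String[] args) {
        byte[] afm = "StartFontMetrics 2.0\nFontName Sample\n"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] notes = "just some notes\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectComposite(defaultDetectors(), afm));
        System.out.println(detectComposite(defaultDetectors(), notes));
    }
}
```

Running main prints application/x-font-afm for the AFM header and text/plain for the plain notes. As Julien suggests in the thread, supplying the right clues in tika-mimetypes.xml may make a change to tika-core unnecessary, so this sketch only illustrates the short-circuit idea and is not a recommendation to patch CompositeDetector.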

