Re: nutch crawl issue
Hi Chris,

The 'mvn install package' produced this for each target/maven-shared-archive-resources/... file:

...
[INFO] [bundle:bundle {execution: default-bundle}]
[ERROR] Error building bundle org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not exist: target/maven-shared-archive-resources/META-INF/NOTICE~
[ERROR] Error building bundle org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not exist: target/maven-shared-archive-resources/META-INF/DEPENDENCIES~
[ERROR] Error building bundle org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not exist: target/maven-shared-archive-resources/META-INF/LICENSE~
[ERROR] Error(s) found in bundle configuration
[ERROR] BUILD ERROR
[INFO] Error(s) found in bundle configuration
[INFO] For more information, run Maven with the -e switch
[INFO] Total time: 1 minute 24 seconds
[INFO] Finished at: Wed May 05 23:38:56 EDT 2010
[INFO] Final Memory: 40M/271M

Assuming this was the right thing to do, I renamed each file to match the missing filename, e.g. DEPENDENCIES to DEPENDENCIES~ (and likewise NOTICE and LICENSE) in each 'target', and re-ran the build to generate the new jars, which produced:

[INFO] Reactor Summary:
[INFO] Apache Tika parent .......... SUCCESS [2.261s]
[INFO] Apache Tika core ............ SUCCESS [14.429s]
[INFO] Apache Tika parsers ......... SUCCESS [32.370s]
[INFO] Apache Tika application ..... SUCCESS [34.179s]
[INFO] Apache Tika OSGi bundle ..... SUCCESS [16.081s]
[INFO] Apache Tika ................. SUCCESS [0.237s]
[INFO] BUILD SUCCESSFUL
[INFO] Total time: 1 minute 41 seconds
[INFO] Finished at: Wed May 05 23:43:56 EDT 2010
[INFO] Final Memory: 37M/278M

In plugin/parse-tika I replaced parse-tika.jar and tika-parsers-0.7.jar with tika-core-0.8-SNAPSHOT.jar and tika-parsers-0.8-SNAPSHOT.jar. In lib/ I replaced tika-core-0.7.jar with tika-core-0.8-SNAPSHOT.jar. I then ran bin/nutch crawl and it completed w/o error.
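The rename step described above can be sketched as a small shell helper. This is a hypothetical sketch, not the exact commands used: the function name `copy_meta_resources` is invented here, and it copies rather than renames so the original files survive a re-run of the build.

```shell
#!/bin/sh
# Hypothetical helper for the workaround above: give the bundle plugin the
# '~'-suffixed META-INF files it complains about by copying each existing
# file to the expected name. The directory argument is illustrative,
# e.g. tika-app/target/maven-shared-archive-resources/META-INF.
copy_meta_resources() {
    dir="$1"
    for f in NOTICE DEPENDENCIES LICENSE; do
        # Only copy when the source exists and the target does not.
        if [ -f "$dir/$f" ] && [ ! -f "$dir/$f~" ]; then
            cp "$dir/$f" "$dir/$f~"
        fi
    done
}
```

After running it against each module's META-INF directory, re-run 'mvn install package' as before.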
All of the javascript was fetched and appeared to be parsed w/o error. However (and I'm not sure of the correct terminology to use), no more URLs were generated to fetch than before. So the patch appears to be a step in the right direction. The problem is with the FRAMESET/FRAME and how javascript is used to generate content in the FRAMEs. As Julien suggested, I will read the archive more deeply and look at the legacy parse-js. I suppose that, as a really ugly brute-force workaround, I could walk the directory tree w/ a perl script and generate a 'seed list' of html URLs to fetch. Ugh. If you have any more ideas please let me know how I can help. Thanks. -m.

On Tue, 2010-05-04 at 21:50 -0700, Mattmann, Chris A (388J) wrote:

Hi Matthew, I think Julien may have a fix for this in TIKA-379 [1]. I'll take a look at Julien's patch and see if there is a way to get it committed sooner rather than later. One way to help me do that, since you already have an environment and a set of use cases where this is reproducible: can you apply TIKA-379 to a local checkout of tika trunk (I'll show you how) and then let me know if that fixes parse-tika for you? Here are the steps:

svn co http://svn.apache.org/repos/asf/lucene/tika/trunk ./tika
cd tika
wget http://bit.ly/bXeLkf (if you don't have SSL support, then manually download the linked file)
patch -p0 < TIKA-379-3.patch
mvn install package

Then grab tika-parsers and tika-core out of the respective tika-core/target and tika-parsers/target directories and drop those jars in your parse-tika/lib folder, replacing the originals. Then try your nutch crawl again and see if that works. In the meanwhile, I'll inspect Julien's patch. Thanks!

Cheers, Chris

On 5/4/10 9:02 PM, matthew a. grisius mgris...@comcast.net wrote: Hi Chris, It appears to me that parse-tika has trouble with HTML FRAMESETs/FRAMEs and/or javascript. Using the parse-html suggested workaround I am able to process my simple test cases such as javadoc which does
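For what it's worth, the brute-force seed-list idea mentioned above could be sketched in shell instead of perl. Everything here is a hypothetical illustration: the function name, the docroot-to-URL mapping, and the example paths are all assumptions about the server setup.

```shell
#!/bin/sh
# Hypothetical sketch of the brute-force workaround: walk a local document
# tree and print one URL per HTML file, suitable for a Nutch seed list.
# Assumes files under $docroot are served directly under $baseurl.
find_html_seeds() {
    docroot="$1"   # local directory the web server exposes
    baseurl="$2"   # URL prefix it is served under
    find "$docroot" -type f \( -name '*.html' -o -name '*.htm' \) \
        | sed "s|^$docroot|$baseurl|"
}

# Example (paths illustrative):
# find_html_seeds /srv/www/technical http://192.168.1.101:8080/technical > urls/seed.txt
```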
Re: nutch crawl issue
Hi Chris,

It appears to me that parse-tika has trouble with HTML FRAMESETs/FRAMEs and/or javascript. Using the parse-html suggested workaround I am able to process my simple test cases such as javadoc, which does include simple embedded javascript (of course I can't verify that it is actually parsing it, though). I expanded my testing to include two more complex examples that heavily use HTML FRAMESET/FRAME and more complex javascript:

134 mb, 11,269 files
1.9 gb, 133,978 files

They both fail at the top level with similar errors such as:

fetching http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js
fetching http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocQuickRefs/DSDocBanner.htm
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
Error parsing: http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js: UNKNOWN!(-56,0): Can't retrieve Tika parser for mime-type text/javascript
Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@1532fc
fetch of http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js failed with: java.lang.ArrayIndexOutOfBoundsException: -56
-finishing thread FetcherThread, activeThreads=2

I tried several property settings to mimic the previous workaround and could not solve it. Any suggestions? So, I'm not sure how to categorize the issues more accurately. I have many javadoc sets and lots of simple HTML that will now parse, but I have other examples such as the two mentioned above that won't parse and therefore can't be crawled. It seems to me to be systematic rather than exceptional.
I cannot believe that I'm the only one who will experience these issues with common HTML such as FRAMESET/FRAME/javascript. Thanks for asking. -m.

On Mon, 2010-05-03 at 09:24 -0700, Mattmann, Chris A (388J) wrote:

Hi Matthew, Awesome! Glad it worked. Now my next question: how often are you seeing that parse-tika doesn't work on HTML files? Is it all HTML that you are trying to process? Or just some of them? Or particular ones (categories of them)? The reason I ask is that I'm trying to determine whether I should commit the update below to 1.1 so it goes out with the 1.1 RC, and whether it's a systematic thing versus an exception. Let me know and thanks! Cheers, Chris

On 5/3/10 9:04 AM, matthew a. grisius mgris...@comcast.net wrote: Hi Chris, Yes, that worked. I caught up on email and noticed that Arpit also mentioned the same thing. Sorry I missed it, thanks to both of you! -m.

On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:

Hi Matthew, There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA-379) that could explain the differences between parse-html and parse-tika. Note that you can specify parse-(html|pdf) in order to get both HTML and PDF files.

The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0 rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my PDFs, but has problems with some html. Nutch 1.1 includes more current PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.

Interesting: well, one solution comes to mind. Can you test this out?

* uncomment the lines:

<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>

in conf/parse-plugins.xml.

* try your crawl again.

I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817 with the attached file.

Thanks! Let me know what happens after you uncomment the lines above.

Cheers, Chris
++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: nutch crawl issue
Hi Matthew,

I think Julien may have a fix for this in TIKA-379 [1]. I'll take a look at Julien's patch and see if there is a way to get it committed sooner rather than later. One way to help me do that, since you already have an environment and a set of use cases where this is reproducible: can you apply TIKA-379 to a local checkout of tika trunk (I'll show you how) and then let me know if that fixes parse-tika for you? Here are the steps:

svn co http://svn.apache.org/repos/asf/lucene/tika/trunk ./tika
cd tika
wget http://bit.ly/bXeLkf (if you don't have SSL support, then manually download the linked file)
patch -p0 < TIKA-379-3.patch
mvn install package

Then grab tika-parsers and tika-core out of the respective tika-core/target and tika-parsers/target directories and drop those jars in your parse-tika/lib folder, replacing the originals. Then try your nutch crawl again and see if that works. In the meanwhile, I'll inspect Julien's patch. Thanks!

Cheers, Chris

On 5/4/10 9:02 PM, matthew a. grisius mgris...@comcast.net wrote:

Hi Chris, It appears to me that parse-tika has trouble with HTML FRAMESETs/FRAMEs and/or javascript. Using the parse-html suggested workaround I am able to process my simple test cases such as javadoc, which does include simple embedded javascript (of course I can't verify that it is actually parsing it, though).
I expanded my testing to include two more complex examples that heavily use HTML FRAMESET/FRAME and more complex javascript:

134 mb, 11,269 files
1.9 gb, 133,978 files

They both fail at the top level with similar errors such as:

fetching http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js
fetching http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocQuickRefs/DSDocBanner.htm
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
Error parsing: http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js: UNKNOWN!(-56,0): Can't retrieve Tika parser for mime-type text/javascript
Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@1532fc
fetch of http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js failed with: java.lang.ArrayIndexOutOfBoundsException: -56
-finishing thread FetcherThread, activeThreads=2

I tried several property settings to mimic the previous workaround and could not solve it. Any suggestions? So, I'm not sure how to categorize the issues more accurately. I have many javadoc sets and lots of simple HTML that will now parse, but I have other examples such as the two mentioned above that won't parse and therefore can't be crawled. It seems to me to be systematic rather than exceptional. I cannot believe that I'm the only one who will experience these issues with common HTML such as FRAMESET/FRAME/javascript. Thanks for asking. -m.

On Mon, 2010-05-03 at 09:24 -0700, Mattmann, Chris A (388J) wrote:

Hi Matthew, Awesome! Glad it worked.
Now my next question: how often are you seeing that parse-tika doesn't work on HTML files? Is it all HTML that you are trying to process? Or just some of them? Or particular ones (categories of them)? The reason I ask is that I'm trying to determine whether I should commit the update below to 1.1 so it goes out with the 1.1 RC, and whether it's a systematic thing versus an exception. Let me know and thanks! Cheers, Chris

On 5/3/10 9:04 AM, matthew a. grisius mgris...@comcast.net wrote: Hi Chris, Yes, that worked. I caught up on email and noticed that Arpit also mentioned the same thing. Sorry I missed it, thanks to both of you! -m.

On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:

Hi Matthew, There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA-379) that could explain the differences between parse-html and parse-tika. Note that you can specify parse-(html|pdf) in order to get both HTML and PDF files.

The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0 rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my PDFs, but has problems with some html. Nutch 1.1 includes more current PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.

Interesting: well, one solution comes to mind. Can you test this out?

* uncomment the lines:

<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>

in conf/parse-plugins.xml.

* try your crawl again.

I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817 with the attached file.

Thanks!
Re: nutch crawl issue
Hi Chris,

Yes, that worked. I caught up on email and noticed that Arpit also mentioned the same thing. Sorry I missed it, thanks to both of you! -m.

On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:

Hi Matthew,

There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA-379) that could explain the differences between parse-html and parse-tika. Note that you can specify parse-(html|pdf) in order to get both HTML and PDF files.

The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0 rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my PDFs, but has problems with some html. Nutch 1.1 includes more current PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.

Interesting: well, one solution comes to mind. Can you test this out?

* uncomment the lines:

<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>

in conf/parse-plugins.xml.

* try your crawl again.

I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817 with the attached file.

Thanks! Let me know what happens after you uncomment the lines above.

Cheers, Chris
++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: nutch crawl issue
Hi Matthew,

Awesome! Glad it worked. Now my next question: how often are you seeing that parse-tika doesn't work on HTML files? Is it all HTML that you are trying to process? Or just some of them? Or particular ones (categories of them)? The reason I ask is that I'm trying to determine whether I should commit the update below to 1.1 so it goes out with the 1.1 RC, and whether it's a systematic thing versus an exception. Let me know and thanks! Cheers, Chris

On 5/3/10 9:04 AM, matthew a. grisius mgris...@comcast.net wrote: Hi Chris, Yes, that worked. I caught up on email and noticed that Arpit also mentioned the same thing. Sorry I missed it, thanks to both of you! -m.

On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:

Hi Matthew, There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA-379) that could explain the differences between parse-html and parse-tika. Note that you can specify parse-(html|pdf) in order to get both HTML and PDF files.

The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0 rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my PDFs, but has problems with some html. Nutch 1.1 includes more current PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.

Interesting: well, one solution comes to mind. Can you test this out?

* uncomment the lines:

<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>

in conf/parse-plugins.xml.

* try your crawl again.

I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817 with the attached file.

Thanks! Let me know what happens after you uncomment the lines above.

Cheers, Chris
++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: nutch crawl issue
This sounds exactly like what I have been experiencing.

On Wed, Apr 28, 2010 at 12:39 AM, matthew a. grisius mgris...@comcast.net wrote:

using Nutch nightly build nutch-2010-04-27_04-00-28: I am trying to bin/nutch crawl a single html file generated by javadoc and no links are followed. I verified this with bin/nutch readdb and bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base seed doc specified is processed. I searched and reviewed the nutch-user archive and tried several different settings, but none of them appear to have any effect.

I then downloaded maven-2.2.1 so that I could mvn install tika and produce tika-app-0.7.jar to extract information about the html javadoc file from the command line. I am not familiar w/ tika but the command line version doesn't return any metadata, e.g. no 'src=' links from the html 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how nutch uses tika and maybe it's not related . . . Has anyone crawled javadoc files or have any suggestions? Thanks. -m.
Re: nutch crawl issue
Hi Matthew,

There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA-379) that could explain the differences between parse-html and parse-tika. Note that you can specify parse-(html|pdf) in order to get both HTML and PDF files.

The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0 rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my PDFs, but has problems with some html. Nutch 1.1 includes more current PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.

Interesting: well, one solution comes to mind. Can you test this out?

* uncomment the lines:

<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>

in conf/parse-plugins.xml.

* try your crawl again.

I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817 with the attached file.

Thanks! Let me know what happens after you uncomment the lines above.

Cheers, Chris
++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: nutch crawl issue
In nutch-site.xml I modified plugin.includes:

parse-(html) works
parse-(tika) does not

I need to also parse pdfs, so I need both features. I tried parse-(html|tika) to see if html would be selected before tika, and that did not work.

On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:

using Nutch nightly build nutch-2010-04-27_04-00-28: I am trying to bin/nutch crawl a single html file generated by javadoc and no links are followed. I verified this with bin/nutch readdb and bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base seed doc specified is processed. I searched and reviewed the nutch-user archive and tried several different settings, but none of them appear to have any effect.

I then downloaded maven-2.2.1 so that I could mvn install tika and produce tika-app-0.7.jar to extract information about the html javadoc file from the command line. I am not familiar w/ tika but the command line version doesn't return any metadata, e.g. no 'src=' links from the html 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how nutch uses tika and maybe it's not related . . . Has anyone crawled javadoc files or have any suggestions? Thanks. -m.
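For reference, the plugin.includes override described above lives in conf/nutch-site.xml. The fragment below is a sketch under assumptions: the surrounding plugin list is illustrative (copy the real default value from nutch-default.xml), and only the parse-(...) portion reflects what was tried here.

```xml
<!-- Sketch of the plugin.includes override discussed above, for
     conf/nutch-site.xml. Plugin list is illustrative; adjust only the
     parse-(...) portion against your copy of nutch-default.xml. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-basic|query-(basic|site|url)</value>
</property>
```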
Re: nutch crawl issue
If you are using the nightly build, did you change the same thing in parse-plugins.xml? Uncomment this:

<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>

Hopefully this helps.

On Thu, Apr 29, 2010 at 9:32 PM, matthew a. grisius mgris...@comcast.net wrote:

In nutch-site.xml I modified plugin.includes: parse-(html) works, parse-(tika) does not. I need to also parse pdfs, so I need both features. I tried parse-(html|tika) to see if html would be selected before tika, and that did not work.

On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:

using Nutch nightly build nutch-2010-04-27_04-00-28: I am trying to bin/nutch crawl a single html file generated by javadoc and no links are followed. I verified this with bin/nutch readdb and bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base seed doc specified is processed. I searched and reviewed the nutch-user archive and tried several different settings, but none of them appear to have any effect.

I then downloaded maven-2.2.1 so that I could mvn install tika and produce tika-app-0.7.jar to extract information about the html javadoc file from the command line. I am not familiar w/ tika but the command line version doesn't return any metadata, e.g. no 'src=' links from the html 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how nutch uses tika and maybe it's not related . . . Has anyone crawled javadoc files or have any suggestions? Thanks. -m.

-- Regards, Arpit Khurdiya
Re: nutch crawl issue
Hi Matthew,

There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA-379) that could explain the differences between parse-html and parse-tika. Note that you can specify parse-(html|pdf) in order to get both HTML and PDF files.

Could you please open an issue in JIRA (https://issues.apache.org/jira/browse/NUTCH) and attach the file you are trying to process? I'll have a look and see if it is related to TIKA-379.

Thanks
Julien

-- DigitalPebble Ltd http://www.digitalpebble.com

On 29 April 2010 17:02, matthew a. grisius mgris...@comcast.net wrote:

In nutch-site.xml I modified plugin.includes: parse-(html) works, parse-(tika) does not. I need to also parse pdfs, so I need both features. I tried parse-(html|tika) to see if html would be selected before tika, and that did not work.

On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:

using Nutch nightly build nutch-2010-04-27_04-00-28: I am trying to bin/nutch crawl a single html file generated by javadoc and no links are followed. I verified this with bin/nutch readdb and bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base seed doc specified is processed. I searched and reviewed the nutch-user archive and tried several different settings, but none of them appear to have any effect.

I then downloaded maven-2.2.1 so that I could mvn install tika and produce tika-app-0.7.jar to extract information about the html javadoc file from the command line. I am not familiar w/ tika but the command line version doesn't return any metadata, e.g. no 'src=' links from the html 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how nutch uses tika and maybe it's not related . . . Has anyone crawled javadoc files or have any suggestions? Thanks. -m.
Re: nutch crawl issue
My subject should've been clearer, e.g. it should've read "Nutch 1.1 nightly build crawl issue". Also, I did verify that Nutch 1.0 successfully completes crawling the javadoc html file; the result can be checked with luke-1.0.1 and searched from the command line using bin/nutch org.apache.nutch.searcher.NutchBean java.

On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:

using Nutch nightly build nutch-2010-04-27_04-00-28: I am trying to bin/nutch crawl a single html file generated by javadoc and no links are followed. I verified this with bin/nutch readdb and bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base seed doc specified is processed. I searched and reviewed the nutch-user archive and tried several different settings, but none of them appear to have any effect.

I then downloaded maven-2.2.1 so that I could mvn install tika and produce tika-app-0.7.jar to extract information about the html javadoc file from the command line. I am not familiar w/ tika but the command line version doesn't return any metadata, e.g. no 'src=' links from the html 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how nutch uses tika and maybe it's not related . . . Has anyone crawled javadoc files or have any suggestions? Thanks. -m.