Re: SOLVED: injector in nutch-1.4
The error was caused by an incorrect entry in domain-urlfilter: I had ".cz" in there, and it should be only "cz".
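For reference, the file in question is conf/domain-urlfilter.txt (read by the urlfilter-domain plugin, if I recall the default file name correctly). It expects one domain suffix, domain name or host name per line, without a leading dot. A minimal example, where the example.com/example.org entries are only placeholders:

# accept anything in the .cz top-level domain
cz
# accept a whole domain
example.com
# accept a single host
www.example.org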
Re: How does nutch handles javaScript in href
So, I figured out, that they are not discarded. Let's take this URL for example: http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/ef This page is not found. I used the linkdb to determine why this deadlink is in the crawldb. The result: ./nutch readlinkdb linkdb -url http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/ef; 11/10/19 01:29:52 INFO util.NativeCodeLoader: Loaded the native-hadoop library 11/10/19 01:29:52 INFO zlib.ZlibFactory: Successfully loaded initialized native-zlib library 11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor 11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor 11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor 11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor 11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor 11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor 11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor 11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor 11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor 11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor fromUrl: http://www.uni-kassel.de/intranet/footernavi/redaktion.html anchor: fromUrl: http://www.uni-kassel.de/intranet/footernavi/bildnachweis.html anchor: fromUrl: http://www.uni-kassel.de/intranet/footernavi/sitemap.html anchor: I took the first page http://www.uni-kassel.de/intranet/footernavi/redaktion.html and run ParserChecker on it. This is the result: ./nutch org.apache.nutch.parse.ParserChecker http://www.uni-kassel.de/intranet/footernavi/redaktion.html; 11/10/19 13:58:02 INFO parse.ParserChecker: fetching: http://www.uni-kassel.de/intranet/footernavi/redaktion.html 11/10/19 13:58:02 WARN plugin.PluginRepository: Plugins: directory not found: ${job.local.dir}/../jars/plugins 11/10/19 13:58:02 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-nutch/hadoop-unjar8228180125857982003/plugins (...) 
11/10/19 13:58:02 INFO http.Http: http.proxy.host = null 11/10/19 13:58:02 INFO http.Http: http.proxy.port = 8080 11/10/19 13:58:02 INFO http.Http: http.timeout = 1 11/10/19 13:58:02 INFO http.Http: http.content.limit = 10485760 11/10/19 13:58:02 INFO http.Http: http.agent = Uni Kassel Spider/Nutch-1.3 (Test Crawler des ITS der Uni Kassel) 11/10/19 13:58:02 INFO http.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3 11/10/19 13:58:05 INFO conf.Configuration: found resource tika-mimetypes.xml at file:/tmp/hadoop-nutch/hadoop-unjar8228180125857982003/tika-mimetypes.xml 11/10/19 13:58:05 INFO parse.ParserChecker: parsing: http://www.uni-kassel.de/intranet/footernavi/redaktion.html 11/10/19 13:58:05 INFO parse.ParserChecker: contentType: application/xhtml+xml 11/10/19 13:58:05 INFO conf.Configuration: found resource parse-plugins.xml at file:/tmp/hadoop-nutch/hadoop-unjar8228180125857982003/parse-plugins.xml 11/10/19 13:58:05 WARN parse.ParserFactory: ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml - Url --- http://www.uni-kassel.de/intranet/footernavi/redaktion.html- ParseData - Version: 5 Status: success(1,0) Title: Intranet: Redaktion Outlinks: 23 outlink: toUrl: http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//autocompletion/completer.php anchor: outlink: toUrl: http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef anchor: outlink: toUrl: http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/ef anchor: outlink: toUrl: http://www.uni-kassel.de/intranet/footernavi/redaktion.html#nav anchor: Skip to navigation (Press Enter). outlink: toUrl: http://www.uni-kassel.de/intranet/footernavi/redaktion.html#col3 anchor: Skip to main content (Press Enter). outlink: toUrl: http://www.uni-kassel.de/intranet/metanavi/zur-uni-startseite.html anchor: zur Uni-Startseite outlink: toUrl: http://www.uni-kassel.de/intranet/aktuelles/aktuelles-aus.html anchor: Intranet outlink: toUrl: http://www.uni-kassel.de/intranet/footernavi/redaktion.html anchor: Redaktion outlink: toUrl: http://www.uni-kassel.de/ anchor: Logo der Universität Kassel outlink: toUrl: http://www.uni-kassel.de/intranet/aktuelles/aktuelles-aus.html anchor: Aktuelles outlink: toUrl: http://www.uni-kassel.de/intranet/themen/ueberblick.html anchor: Themen outlink: toUrl: http://www.uni-kassel.de/intranet/abteilungen/ueberblick.html anchor: Abteilungen outlink: toUrl: http://www.uni-kassel.de/intranet/organisation/ueberblick.html anchor: Organisation outlink: toUrl: http://www.uni-kassel.de/intranet/schnelleinstieg/ueberblick.html anchor: Schnelleinstieg outlink: toUrl:
build nutch-1.3 from src/plugin
After trying unsuccessfully to build nutch-1.3 from source on a Mac, I tried it on a Linux x86 machine. Running the ant build from the top level works fine and produces the classes and runtime folders. After that, when I go to src/plugin and try to fire ant from there, I see issues like:

Problem: failed to create task or type antlib:org.apache.ivy.ant:settings
This appears to be an antlib declaration.
Action: Check that the implementing library exists in one of:
-/home/ashish/utils/apache-ant-1.8.2/lib
-/home/ashish/.ant/lib

If I copy ivy.jar into that location, I start getting issues like "ivy:resolve doesn't support the log attribute".

Question: should Nutch always be built from NUTCH_HOME, and should plugins not be built separately from the src/plugin folder?
Re: How does nutch handles javaScript in href
Hi Marek, This is v. interesting and I am looking forward to hearing from anyone with similar problems. Unfortunately I've not experienced this behaviour, however it is clearly a significant problem as you point out. Ultimately it should be ironed out. What a great tool the ParserChecker is.

11/10/19 13:58:05 INFO parse.ParserChecker: parsing: http://www.uni-kassel.de/intranet/footernavi/redaktion.html
11/10/19 13:58:05 INFO parse.ParserChecker: contentType: application/xhtml+xml
11/10/19 13:58:05 INFO conf.Configuration: found resource parse-plugins.xml at file:/tmp/hadoop-nutch/hadoop-unjar8228180125857982003/parse-plugins.xml
11/10/19 13:58:05 WARN parse.ParserFactory: ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml

This indicates that parse-html was not used and the default for wildcard contentType defaults to parse-tika... am I correct here? If this is the case then it means that parse-tika is not dealing with the problem as you describe it. However I must also comment that we recently committed Ferdy's NUTCH-1097 for trunk-1.4, which meant that parse-html dealt with application/xhtml+xml material. It would be interesting to see if parse-html in trunk-1.4 deals with this now. If not then I think this needs to be filed as a JIRA issue and dealt with appropriately. Can you please check and get back to us... Thanks Lewis
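For reference, the mapping the WARN message refers to lives in conf/parse-plugins.xml. A rough sketch of what explicitly routing application/xhtml+xml to parse-html would look like (note the plugin also has to claim that content type in its own plugin.xml, which is exactly what the warning says is missing):

<mimeType name="application/xhtml+xml">
  <plugin id="parse-html" />
</mimeType>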
Re: build nutch-1.3 from src/plugin
Hi, In my experience I have never 'needed' to build from anywhere other than NUTCH_HOME. However, I would imagine that this is not always the case in some production environments. I think the method you describe for building plugins works slightly against the way we currently do this, which is independent plugin management from NUTCH_HOME/src/plugin vs. centralised plugin management via build.xml from NUTCH_HOME. I would suggest one possible workaround... and I apologise if this is slightly off topic. You can comment out the 'build deploy', 'test' and 'clean' targets for the plugins you do not wish to build within NUTCH_HOME/src/plugin/build.xml. This will enable you to build (and control) only the plugins you desire from NUTCH_HOME/build.xml. As I said, sorry if my comments are off topic in any way. Lewis

On Wed, Oct 19, 2011 at 1:27 PM, Ashish Mehrotra ashme...@yahoo.com wrote: After trying unsuccessfully to build nutch-1.3 from source on a Mac, I tried it on a Linux x86 machine. (...) Question: should Nutch always be built from NUTCH_HOME, and should plugins not be built separately from the src/plugin folder?

-- Lewis
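Roughly what that looks like in practice: NUTCH_HOME/src/plugin/build.xml just delegates to every plugin directory, so commenting out the corresponding <ant dir=.../> lines in its deploy, test and clean targets leaves those plugins out of the build. A trimmed sketch, with plugin names chosen only as examples (check your own build.xml for the real list):

<target name="deploy">
  <ant dir="parse-html" target="deploy"/>
  <ant dir="index-basic" target="deploy"/>
  <!-- <ant dir="parse-js" target="deploy"/>  commented out, so this plugin is skipped -->
  ...
</target>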
a plugin to select the re-crawl date of a page
Hi, I am looking into nutch to try to crawl a couple of forum-based websites, and I would like to avoid writing scripts to generate lists of urls to perform daily incremental crawls. Instead, I suspect that I should be able to write a plugin for nutch which is able to associate with each url the date of the next crawl, so that nutch generate does the right thing and picks the urls which need to be refreshed, hence picking up new messages in live/recent discussions as well as whole new discussions. I have started to dive into the code to figure out how I might be able to pull this off, but I suspect that someone more knowledgeable about the structure of nutch itself could give me hints as to where to look, hence saving me quite a bit of time. Mathieu -- Mathieu Lacage mathieu.lac...@alcmeon.com
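Roughly the kind of hook I am imagining, sketched against the FetchSchedule API as I understand it from browsing the 1.x source (untested and written from memory; the package name and the per-URL policy are made up, so please correct me if the signatures are off):

package com.example.nutch;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.AbstractFetchSchedule;
import org.apache.nutch.crawl.CrawlDatum;

public class ForumFetchSchedule extends AbstractFetchSchedule {

  @Override
  public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime,
      long fetchTime, long modifiedTime, int state) {
    datum = super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
        fetchTime, modifiedTime, state);
    // Made-up policy: re-fetch discussion pages daily, everything else monthly.
    int intervalSecs = url.toString().contains("viewtopic") ? 24 * 3600 : 30 * 24 * 3600;
    datum.setFetchInterval(intervalSecs);
    datum.setFetchTime(fetchTime + intervalSecs * 1000L);
    return datum;
  }
}

The idea would then be to point db.fetch.schedule.class at this class in nutch-site.xml so that generate picks the URLs whose fetch time has come.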
Re: How does nutch handles javaScript in href
On 19.10.2011 14:34, lewis john mcgibbney wrote: (...) This indicates that parse-html was not used and the default for wildcard contentType defaults to parse-tika... am I correct here?

According to my parse-plugins.xml, yes:

<!-- by default, if the mimeType is set to *, or if it can't be determined, use parse-tika -->
<mimeType name="*">
  <plugin id="parse-tika" />
</mimeType>

BUT: I added LOG.info("This is HtmlParser"); as the first line in getParse in HtmlParser.java and compiled it. After that I got:

(...)
11/10/19 15:20:08 WARN parse.ParserFactory: ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml
11/10/19 15:20:08 INFO parse.html: This is HtmlParser
- Url --- http://www.uni-kassel.de/intranet/footernavi/redaktion.html- ParseData - Version: 5 Status: success(1,0) Title: Intranet: Redaktion Outlinks: 23
outlink: toUrl: http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//autocompletion/completer.php anchor:
outlink: toUrl: http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef anchor:
(...)

As I understand this, the HtmlParser IS used and NOT Tika?

If this is the case then it means that parse-tika is not dealing with the problem as you describe it. (...)
Re: Re: How does nutch handles javaScript in href
Then in my own opinion there is no existing code within parse-html which prevents it from parsing the anchor snippets you've posted. This would make a great addition to parse-html as it seems to be an unforeseen boundary case that we should not ignore. If you don't get feedback on this, can I ask you to open a JIRA ticket based upon your understanding of the situation? Thank you

On , Marek Bachmann m.bachm...@uni-kassel.de wrote: On 19.10.2011 14:34, lewis john mcgibbney wrote: (...)
not able to parse adobe 9.0 pdfs using 1.3 tika parser
These PDFs were not getting parsed with the parse-pdf plugin of Nutch 1.2, so I tried with 1.3. I saw that even simple and old PDFs are not working.

my code (TestParse.java):

bash-2.00$ cat TestParse.java

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class TestParse {

  private static Configuration conf = NutchConfiguration.create();

  public TestParse() {
  }

  public static void main(String[] args) {
    String filename = args[0];
    convert(filename);
  }

  public static String convert(String fileName) {
    String newName = "abc.html";
    try {
      System.out.println("Converting " + fileName + " to html.");
      if (convertToHtml(fileName, newName))
        return newName;
    } catch (Exception e) {
      (new File(newName)).delete();
      System.out.println("General exception " + e.getMessage());
    }
    return null;
  }

  private static boolean convertToHtml(String fileName, String newName) throws Exception {
    // Read the file
    FileInputStream in = new FileInputStream(fileName);
    byte[] buf = new byte[in.available()];
    in.read(buf);
    in.close();

    // Parse the file
    Content content = new Content("file:" + fileName, "file:" + fileName, buf, "", new Metadata(), conf);
    ParseResult parseResult = new ParseUtil(conf).parse(content);
    parseResult.filter();
    if (parseResult.isEmpty()) {
      System.out.println("All parsing attempts failed");
      return false;
    }

    Iterator<Map.Entry<Text, Parse>> iterator = parseResult.iterator();
    if (iterator == null) {
      System.out.println("Cannot iterate over successful parse results");
      return false;
    }

    Parse parse = null;
    ParseData parseData = null;
    while (iterator.hasNext()) {
      parse = parseResult.get((Text) iterator.next().getKey());
      parseData = parse.getData();
      ParseStatus status = parseData.getStatus();
      // If Parse failed then bail
      if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
        System.out.println("Could not parse " + fileName + ". " + status.getMessage());
        return false;
      }
    }

    // Start writing to newName
    FileOutputStream fout = new FileOutputStream(newName);
    PrintStream out = new PrintStream(fout, true, "UTF-8");

    out.println("<html>");   // Start Document
    out.println("<head>");   // Start Header

    // Write Title
    String title = parseData.getTitle();
    if (title != null && title.trim().length() > 0) {
      out.println("<title>" + parseData.getTitle() + "</title>");
    }

    // Write out Meta tags
    Metadata metaData = parseData.getContentMeta();
    String[] names = metaData.names();
    for (String name : names) {
      String[] subvalues = metaData.getValues(name);
      String values = "";
      for (String subvalue : subvalues) {
        values += subvalue;
      }
      if (values.length() > 0)
        out.printf("<meta name=\"%s\" content=\"%s\"/>\n", name, values);
    }
    out.println("<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\"/>");
    // End Meta tags

    out.println("</head>");  // End Header
    out.println("<body>");   // Start Body
    out.print(parse.getText());
    out.println("</body>");  // End Body
    out.println("</html>");  // End Document

    out.close();             // Close the file
    return true;
  }
}

command:
==
bash-2.00$ java -classpath conf:runtime/local/lib/nutch-1.3.jar:runtime/local/lib/hadoop-core-0.20.2.jar:runtime/local/lib/commons-logging-api-1.0.4.jar:runtime/local/lib/tika-core-0.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/oro-2.0.8.jar:. TestParse direct.pdf
==

output:
_
Converting direct.pdf to html.
Oct 19, 2011 5:05:19 PM org.apache.hadoop.conf.Configuration getConfResourceAsInputStream
INFO: found resource tika-mimetypes.xml at file:/path/to/nutch/1.3/conf/tika-mimetypes.xml
Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginManifestParser parsePluginFolder
INFO: Plugins: looking in:
Re: a plugin to select the re-crawl date of a page
Hi, If you use a low default fetch interval for newly discovered pages, they will be fetched very frequently. If you combine that with an adaptive fetch scheduler and text profiling, pages that no longer change will be fetched less and less often, giving room for new pages. Check the Nutch configuration for settings and descriptions. AdaptiveFetchSchedule and TextProfile are keywords. Cheers

Hi, I am looking into nutch to try to crawl a couple of forum-based websites, and I would like to avoid writing scripts to generate lists of urls to perform daily incremental crawls. (...)
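A hedged sketch of the relevant nutch-site.xml settings; the property names are as I remember them from nutch-default.xml in 1.3/1.4 and the values are only illustrative, so verify both against your copy:

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <value>86400</value> <!-- newly discovered pages start on a daily interval -->
</property>
<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>3600</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <value>2592000</value>
</property>
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value> <!-- the text profiling mentioned above -->
</property>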
Re: not able to parse adobe 9.0 pdfs using 1.3 tika parser
There's always trouble with PDF parsing. Try trunk; it has an upgraded Tika including PDF parsing improvements. Ultimately, problems with parsing should be addressed on the Tika mailing list or even the PDFBox list.

These PDFs were not getting parsed with the parse-pdf plugin of Nutch 1.2, so I tried with 1.3. I saw that even simple and old PDFs are not working. my code (TestParse.java): (...)
Re: How does nutch handles javaScript in href
On 19.10.2011 16:00, lewis.mcgibb...@gmail.com wrote: Then in my own opinion there is no existing code within parse-html which prevents it from parsing the anchor snippets you've posted.

But something is happening with the content of the href attribute, since in the source file its value is:

<a href="javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');" class="mail">

and after the parse it is just nbjmup+jousbofuAvoj.lbttfm/ef. That means the href value is handled somehow?! I guess if nothing were done with the href value, then the outlink value should be:

http://www.uni-kassel.de/intranet/footernavi/javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');

Perhaps the JavaScript gets evaluated somewhere but fails because the referenced function isn't found... I'll look in the HTML parser to find more details.

This would make a great addition to the parse-html as it seems to be an unforeseen boundary case that we should not ignore. If you don't get feedback on this, can I ask for you to open a JIRA ticket based upon your understanding of the situation? Thank you
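An aside that may explain what these strings are: they look like TYPO3-style spam-protected mailto links, where every character code is shifted by a fixed offset inside a few ASCII ranges. Shifting back by one turns nbjmup+jousbofuAvoj.lbttfm/ef into mailto:intranet@uni-kassel.de. A rough, untested sketch of that decoding; the range boundaries are an assumption derived from the strings above, not taken from the site's JavaScript:

public class UnCryptMailto {

  // Shift one character code back by 'offset', wrapping inside [start, end].
  private static int decrypt(int n, int start, int end, int offset) {
    n -= offset;
    if (n < start) {
      n += end - start + 1;
    }
    return n;
  }

  public static String decode(String enc, int offset) {
    StringBuilder out = new StringBuilder();
    for (char c : enc.toCharArray()) {
      int n = c;
      if (n >= 0x2B && n <= 0x3A) {        // + , - . / 0-9 :
        n = decrypt(n, 0x2B, 0x3A, offset);
      } else if (n >= 0x40 && n <= 0x5A) { // @ A-Z
        n = decrypt(n, 0x40, 0x5A, offset);
      } else if (n >= 0x61 && n <= 0x7A) { // a-z
        n = decrypt(n, 0x61, 0x7A, offset);
      }
      out.append((char) n);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // prints: mailto:intranet@uni-kassel.de
    System.out.println(decode("nbjmup+jousbofuAvoj.lbttfm/ef", 1));
  }
}

If that reading is right, the outlinks the crawler collects here are really obfuscated mailto addresses, which is why they end up as dead links in the crawldb.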
Good workaround for timeout?
I'm getting a fairly persistent timeout on a particular page. Other, smaller pages in this folder do fine, but this one times out most of the time. When it fails, my ParserChecker results look like:

# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)

I've stuck with the default value of 10 in my nutch-default.xml's fetcher.threads.fetch value, and I've added the following to nutch-site.xml:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  Caution: classical ftp RFCs never defines partial transfer and, in fact,
  some ftp servers out there do not handle client side forced close-down very
  well. Our implementation tries its best to handle such situations smoothly.
  </description>
</property>

<property>
  <name>http.timeout</name>
  <value>999</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

What else can I do? Thanks.

Chip
Fetcher NPE's
Hi, We sometimes see a fetcher task failing with 0 pages. Inspecting the logs it's clear URLs are actually fetched until, due to some reason, an NPE occurs. The thread then dies and seems to output 0 records. The URLs themselves are fetchable using index- or parser checker, no problem there. Any ideas how we can pinpoint the source of the issue? Thanks,

A sample exception:

2011-10-19 14:30:50,145 INFO org.apache.nutch.fetcher.Fetcher: fetch of http://SOME_URL/ failed with: java.lang.NullPointerException
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: java.lang.NullPointerException
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at java.lang.System.arraycopy(Native Method)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1276)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1193)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at java.io.DataOutputStream.writeByte(DataOutputStream.java:136)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:264)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:244)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.io.Text.write(Text.java:281)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1060)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:591)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:936)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:805)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: fetcher caught:java.lang.NullPointerException

The code catching the error:

801 } catch (Throwable t) { // unexpected exception
802   // unblock
803   fetchQueues.finishFetchItem(fit);
804   logError(fit.url, t.toString());
805   output(fit.url, fit.datum, null, ProtocolStatus.STATUS_FAILED, CrawlDatum.STATUS_FETCH_RETRY);
806 }
Re: How does nutch handles javaScript in href
One interesting thing I found out: the HtmlParser class tells me in debug mode (I had to replace the LOG.trace statements with LOG.debug, since I don't know how to enable trace logging) that it had found 20 outlinks:

2011-10-19 16:59:38,061 DEBUG parse.html - found 20 outlinks in http://www.uni-kassel.de/intranet/footernavi/redaktion.html

BUT the result of ParserChecker tells me there were 23 outlinks:

(...) Status: success(1,0) Title: Intranet: Redaktion Outlinks: 23
outlink: toUrl: http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//autocompletion/completer.php anchor:
outlink: toUrl: http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef anchor:
outlink: toUrl: http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/ef anchor:
(...)

The first three links are the ones which shouldn't be there, and their count is exactly the difference between the ParserChecker output and the debug log. So it seems these links don't come into the list through HtmlParser?

On 19.10.2011 16:24, Marek Bachmann wrote: On 19.10.2011 16:00, lewis.mcgibb...@gmail.com wrote: (...)
Re: Good workaround for timeout?
What is timing out, the fetch or the parse?

I'm getting a fairly persistent timeout on a particular page. Other, smaller pages in this folder do fine, but this one times out most of the time. (...)
Re: How does nutch handles javaScript in href
Tika can do things a bit differently. At least it did in the past, and it seems this is the case here as well: I get 20 outlinks with Tika.

One interesting thing I found out: the HtmlParser class tells me in debug mode (I had to replace the LOG.trace statements with LOG.debug, since I don't know how to enable trace logging) that it had found 20 outlinks: (...)
RE: Good workaround for timeout?
If I'm reading the log correctly, it's the fetch:

2011-10-19 11:18:11,405 INFO fetcher.Fetcher - fetch of http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml failed with: java.net.SocketTimeoutException: Read timed out

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, October 19, 2011 11:08 AM
To: user@nutch.apache.org
Subject: Re: Good workaround for timeout?

What is timing out, the fetch or the parse? (...)
Re: Fetcher NPE's
I should add that these URLs not only pass index- and parser checker but also manual local test crawls. There's also nothing significant in the syslog. Dmesg shows messages about too little memory but that's normal.

Hi, We sometimes see a fetcher task failing with 0 pages. Inspecting the logs it's clear URLs are actually fetched until, due to some reason, an NPE occurs. The thread then dies and seems to output 0 records. (...)
Re: Good workaround for timeout?
It is indeed. Tricky. Are you going through some proxy? Are you using protocol-http or protocol-httpclient? Are you sure the http.timeout value is actually used in lib-http?

If I'm reading the log correctly, it's the fetch:

2011-10-19 11:18:11,405 INFO fetcher.Fetcher - fetch of http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml failed with: java.net.SocketTimeoutException: Read timed out
(...)
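One more way to narrow it down, outside of Nutch entirely: fetch the problem URL from the same machine with an explicit read timeout and see how long the server really takes to deliver that (apparently large) page. A small hypothetical probe using only the plain JDK, not part of Nutch:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutProbe {
  public static void main(String[] args) throws Exception {
    // args[0] = URL, optional args[1] = read timeout in ms (same unit as http.timeout)
    int timeoutMs = args.length > 1 ? Integer.parseInt(args[1]) : 10000;
    HttpURLConnection conn = (HttpURLConnection) new URL(args[0]).openConnection();
    conn.setConnectTimeout(10000);
    conn.setReadTimeout(timeoutMs);
    long start = System.currentTimeMillis();
    InputStream in = conn.getInputStream();
    byte[] buf = new byte[8192];
    long total = 0;
    int n;
    while ((n = in.read(buf)) != -1) {
      total += n;
    }
    in.close();
    System.out.println(total + " bytes in " + (System.currentTimeMillis() - start) + " ms");
  }
}

If this also stalls or takes minutes, the server is simply slow for that document, and only a much larger timeout on the fetcher side will help.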
Re: FOUND IT - How does nutch handles javaScript in href
Ok, I went through the source, step by step. It is the HtmlParseFilter called JSParseFilter. So it seems I have to exclude it from the plugin list.

2011-10-19 17:33:46,031 DEBUG js.JSParseFilter - - outlink from JS: 'http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//autocompletion/completer.php'
2011-10-19 17:33:46,041 DEBUG js.JSParseFilter - - outlink from JS: 'http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef'
2011-10-19 17:33:46,042 DEBUG js.JSParseFilter - - outlink from JS: 'http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/ef'

But its behaviour isn't right anyway, is it? It shouldn't take this crypto string as an outlink?

On 19.10.2011 17:13, Markus Jelsma wrote: Tika can do things a bit differently. At least it did in the past, and it seems this is the case here as well: I get 20 outlinks with Tika. (...)
Re: FOUND IT - How does nutch handles javaScript in href
Not sure what JsParse is supposed to do in this situation but you should not use it anyway. It's not regarded as stable, just the protocolhttp.

Ok, I went through the source, step by step. It is the HtmlParseFilter called JSParseFilter. So it seems I have to exclude it from the plugin list. (...)
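For completeness, excluding it should just be a matter of not listing parse-js in plugin.includes in nutch-site.xml. The value below is only an illustration of the shape of that property, not your actual setting; the point is simply that the parse-(...) group no longer contains js, e.g. parse-(html|js|tika) becomes parse-(html|tika):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>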
RE: Good workaround for timeout?
I'm using protocol-http, but I removed protocol-httpclient after you pointed out in another thread that it's broken. Unfortunately I'm not sure which properties are used by what, and I'm not sure how to find out. I added some more stuff to nutch-site.xml (I'll paste it at the end), and it seems to be working so far; but since this has been an intermittent problem, I can't be sure whether I've really fixed it or whether I'm getting lucky.

<property>
  <name>http.timeout</name>
  <value>999</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>ftp.timeout</name>
  <value>99</value>
  <description>Default timeout for ftp client socket, in millisec.
  Please also see ftp.keep.connection below.</description>
</property>

<property>
  <name>ftp.server.timeout</name>
  <value>9</value>
  <description>An estimation of ftp server idle time, in millisec.
  Typically it is 12 millisec for many ftp servers out there. Better be
  conservative here. Together with ftp.timeout, it is used to decide if
  we need to delete (annihilate) current ftp.client instance and force to
  start another ftp.client instance anew. This is necessary because a
  fetcher thread may not be able to obtain next request from queue in time
  (due to idleness) before our ftp client times out or remote server
  disconnects. Used only when ftp.keep.connection is true (please see below).
  </description>
</property>

<property>
  <name>parser.timeout</name>
  <value>300</value>
  <description>Timeout in seconds for the parsing of a document, otherwise
  treats it as an exception and moves on to the following documents. This
  parameter is applied to any Parser implementation. Set to -1 to deactivate,
  bearing in mind that this could cause the parsing to crash because of a
  very long or corrupted document.
  </description>
</property>

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, October 19, 2011 11:28 AM
To: user@nutch.apache.org
Subject: Re: Good workaround for timeout?

It is indeed. Tricky. Are you going through some proxy? Are you using protocol-http or httpclient? Are you sure the http.timeout value is actually used in lib-http? (...)
Is there a workaround for https?
I've noticed the recent posts about trouble with protocol-httpclient, which to my understanding is needed for https URLs. Is there another way to handle these? ParserChecker gives me the following when I try one of these URLs. Thanks. # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText https://libwebspace.library.cmu.edu:4430/Research/Archives/ead/generated/shull.xml Exception in thread main org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:80) at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:78)