[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1284: --- Assignee: Tejas Patil Add site fetcher.max.crawl.delay as log output by default. -- Key: NUTCH-1284 URL: https://issues.apache.org/jira/browse/NUTCH-1284 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Assignee: Tejas Patil Priority: Trivial Fix For: 1.7 Attachments: NUTCH-1284.patch Currently, when manually scanning our log output we cannot infer which pages are governed by a crawl delay between successive fetch attempts of any given page within the site. The value should be made available as something like: {code} 2012-02-19 12:33:33,031 INFO fetcher.Fetcher - fetching http://nutch.apache.org/ (crawl.delay=XXXms) {code} This way we can easily and quickly determine whether the fetcher is having to use this functionality or not. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551857#comment-13551857 ]

Tejas Patil commented on NUTCH-1284:

Can anyone kindly review the patch?

Add site fetcher.max.crawl.delay as log output by default.

Key: NUTCH-1284
URL: https://issues.apache.org/jira/browse/NUTCH-1284
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
Priority: Trivial
Fix For: 1.7
Attachments: NUTCH-1284.patch

Currently, when manually scanning our log output we cannot infer which pages are governed by a crawl delay between successive fetch attempts of any given page within the site. The value should be made available as something like:

{code}
2012-02-19 12:33:33,031 INFO fetcher.Fetcher - fetching http://nutch.apache.org/ (crawl.delay=XXXms)
{code}

This way we can easily and quickly determine whether the fetcher is having to use this functionality or not.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1274) Fix [cast] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1274:
Fix Version/s: 2.2

Fix [cast] javac warnings

Key: NUTCH-1274
URL: https://issues.apache.org/jira/browse/NUTCH-1274
Project: Nutch
Issue Type: Sub-task
Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
Priority: Minor
Fix For: 1.7, 2.2
Attachments: NUTCH-1274-2.x.patch, NUTCH-1274-2.x.v2.patch, NUTCH-1274-trunk.patch, NUTCH-1274-trunk.v2.patch

A typical example of this is

{code}
trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java:460: warning: [cast] redundant cast to int
[javac] res ^= (int)(signature[i] << 24 + signature[i+1] << 16 +
{code}

these should all be fixed by replacing with the correct implementations.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1274) Fix [cast] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-1274.
Resolution: Fixed

Committed @revision 1432469 in trunk
Committed @revision 1432471 in 2.x
Thank you Tejas for your contribution.

Fix [cast] javac warnings

Key: NUTCH-1274
URL: https://issues.apache.org/jira/browse/NUTCH-1274
Project: Nutch
Issue Type: Sub-task
Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
Priority: Minor
Fix For: 1.7, 2.2
Attachments: NUTCH-1274-2.x.patch, NUTCH-1274-2.x.v2.patch, NUTCH-1274-trunk.patch, NUTCH-1274-trunk.v2.patch

A typical example of this is

{code}
trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java:460: warning: [cast] redundant cast to int
[javac] res ^= (int)(signature[i] << 24 + signature[i+1] << 16 +
{code}

these should all be fixed by replacing with the correct implementations.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
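For readers unfamiliar with the [cast] warning category being fixed here, a minimal illustration follows. This is not the actual CrawlDatum code; the shift-based hash is an assumption loosely modeled on the snippet quoted in the issue. The point is that javac's cast lint flags a cast to a type the expression already has, and the fix is simply to drop it.

```java
// Illustrative example of a redundant [cast] warning and its fix
// (hypothetical code, not the Nutch implementation).
public class CastWarningExample {

    // Before: byte operands are already promoted to int by << and +,
    // so "(int)" is flagged as a redundant cast by javac -Xlint:cast.
    static int hashRedundant(byte[] signature) {
        int res = 0;
        for (int i = 0; i + 1 < signature.length; i += 2) {
            res ^= (int) ((signature[i] << 8) + signature[i + 1]); // warning: redundant cast to int
        }
        return res;
    }

    // After: the cast is dropped; the result is bit-for-bit identical.
    static int hashClean(byte[] signature) {
        int res = 0;
        for (int i = 0; i + 1 < signature.length; i += 2) {
            res ^= (signature[i] << 8) + signature[i + 1];
        }
        return res;
    }
}
```

Because arithmetic on `byte` values is performed in `int`, removing the cast cannot change behavior, which is why these warnings can be fixed mechanically across many files.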
[jira] [Updated] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1042:
Fix Version/s: 2.2, 1.7

Fetcher.max.crawl.delay property not taken into account correctly when set to -1

Key: NUTCH-1042
URL: https://issues.apache.org/jira/browse/NUTCH-1042
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.3
Reporter: Nutch User - 1
Fix For: 1.7, 2.2

[Originally: (http://lucene.472066.n3.nabble.com/A-possible-bug-or-misleading-documentation-td3162397.html).]

From nutch-default.xml:

<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>If the Crawl-Delay in robots.txt is set to greater than this value (in seconds) then the fetcher will skip this page, generating an error report. If set to -1 the fetcher will never skip such pages and will wait the amount of time retrieved from robots.txt Crawl-Delay, however long that might be.</description>
</property>

Fetcher.java: (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup)

Line 554 in Fetcher.java:

this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000;

Lines 615-616 in Fetcher.java:

if (rules.getCrawlDelay() > 0) {
  if (rules.getCrawlDelay() > maxCrawlDelay) {

Now, the documentation states that, if fetcher.max.crawl.delay is set to -1, the crawler will always wait the amount of time the Crawl-Delay parameter specifies. However, as you can see, if it really is negative, the condition on line 616 is always true, which leads to skipping the page whose Crawl-Delay is set.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
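The condition described above can be reduced to a small, self-contained sketch. The class and method names below are illustrative only, not the actual Fetcher.java code, but the boolean logic mirrors the report: with fetcher.max.crawl.delay = -1, maxCrawlDelay becomes negative, so any positive Crawl-Delay satisfies the skip comparison and the page is wrongly skipped. Guarding on a non-negative maxCrawlDelay restores the documented "never skip" behavior.

```java
// Hypothetical sketch of the NUTCH-1042 condition (names are illustrative,
// not the actual Fetcher.java code). Delays are in milliseconds.
public class CrawlDelayCheck {

    // Buggy version: when maxCrawlDelay is negative (-1 * 1000), every
    // positive crawlDelay satisfies crawlDelay > maxCrawlDelay, so the
    // page is skipped despite the documentation.
    static boolean shouldSkipBuggy(long crawlDelay, long maxCrawlDelay) {
        return crawlDelay > 0 && crawlDelay > maxCrawlDelay;
    }

    // Fixed version: a negative maxCrawlDelay disables skipping entirely,
    // matching the nutch-default.xml description of -1.
    static boolean shouldSkipFixed(long crawlDelay, long maxCrawlDelay) {
        return maxCrawlDelay >= 0 && crawlDelay > 0 && crawlDelay > maxCrawlDelay;
    }

    public static void main(String[] args) {
        long maxCrawlDelay = -1 * 1000;  // fetcher.max.crawl.delay = -1
        long crawlDelay = 60 * 1000;     // robots.txt Crawl-Delay: 60
        System.out.println(shouldSkipBuggy(crawlDelay, maxCrawlDelay)); // true: wrongly skipped
        System.out.println(shouldSkipFixed(crawlDelay, maxCrawlDelay)); // false: waits as documented
    }
}
```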
[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1284: Fix Version/s: 2.2 Add site fetcher.max.crawl.delay as log output by default. -- Key: NUTCH-1284 URL: https://issues.apache.org/jira/browse/NUTCH-1284 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Assignee: Tejas Patil Priority: Trivial Fix For: 1.7, 2.2 Attachments: NUTCH-1284.patch Currently, when manually scanning our log output we cannot infer which pages are governed by a crawl delay between successive fetch attempts of any given page within the site. The value should be made available as something like: {code} 2012-02-19 12:33:33,031 INFO fetcher.Fetcher - fetching http://nutch.apache.org/ (crawl.delay=XXXms) {code} This way we can easily and quickly determine whether the fetcher is having to use this functionality or not. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551959#comment-13551959 ]

Lewis John McGibbney commented on NUTCH-1284:

Hi Tejas. Nice catch btw, as it looks like you've integrated NUTCH-1042 into this patch as well. With regard to the original issue here, i.e. NUTCH-1284, it would be excellent if this issue could also provide logging for the fetcher as originally stated in the issue description, i.e. the log output records crawl.delay on a per-URL basis. I like the debug logging you've added for the queue. Although it is not marked, IIRC this issue affects both 1.x and 2.x...

Add site fetcher.max.crawl.delay as log output by default.

Key: NUTCH-1284
URL: https://issues.apache.org/jira/browse/NUTCH-1284
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
Priority: Trivial
Fix For: 1.7, 2.2
Attachments: NUTCH-1284.patch

Currently, when manually scanning our log output we cannot infer which pages are governed by a crawl delay between successive fetch attempts of any given page within the site. The value should be made available as something like:

{code}
2012-02-19 12:33:33,031 INFO fetcher.Fetcher - fetching http://nutch.apache.org/ (crawl.delay=XXXms)
{code}

This way we can easily and quickly determine whether the fetcher is having to use this functionality or not.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1274) Fix [cast] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551964#comment-13551964 ]

Hudson commented on NUTCH-1274:

Integrated in Nutch-trunk #2081 (See [https://builds.apache.org/job/Nutch-trunk/2081/])
NUTCH-1274 Fix [cast] javac warnings (Revision 1432469)

Result = SUCCESS
lewismc : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1432469
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrMappingReader.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/LinkRank.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/Loops.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
* /nutch/trunk/src/java/org/apache/nutch/segment/ContentAsTextInputFormat.java
* /nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java
* /nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java
* /nutch/trunk/src/java/org/apache/nutch/tools/FreeGenerator.java
* /nutch/trunk/src/java/org/apache/nutch/tools/arc/ArcRecordReader.java
* /nutch/trunk/src/test/org/apache/nutch/parse/TestParserFactory.java

Fix [cast] javac warnings

Key: NUTCH-1274
URL: https://issues.apache.org/jira/browse/NUTCH-1274
Project: Nutch
Issue Type: Sub-task
Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
Priority: Minor
Fix For: 1.7, 2.2
Attachments: NUTCH-1274-2.x.patch, NUTCH-1274-2.x.v2.patch, NUTCH-1274-trunk.patch, NUTCH-1274-trunk.v2.patch

A typical example of this is

{code}
trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java:460: warning: [cast] redundant cast to int
[javac] res ^= (int)(signature[i] << 24 + signature[i+1] << 16 +
{code}

these should all be fixed by replacing with the correct implementations.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1472) InvalidRequestException(why:(String didn't validate.) [webpage][f][ts] failed validation)
[ https://issues.apache.org/jira/browse/NUTCH-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1472: Fix Version/s: 2.2 InvalidRequestException(why:(String didn't validate.) [webpage][f][ts] failed validation) -- Key: NUTCH-1472 URL: https://issues.apache.org/jira/browse/NUTCH-1472 Project: Nutch Issue Type: Bug Affects Versions: 2.1 Reporter: zhaixuepan Fix For: 2.2 me.prettyprint.hector.api.exceptions.HInvalidRequestException: InvalidRequestException(why:(String didn't validate.) [webpage][f][ts] failed validation) at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:45) at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:264) at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97) at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243) at me.prettyprint.cassandra.model.MutatorImpl.insert(MutatorImpl.java:69) at org.apache.gora.cassandra.store.HectorUtils.insertColumn(HectorUtils.java:47) at org.apache.gora.cassandra.store.CassandraClient.addColumn(CassandraClient.java:169) at org.apache.gora.cassandra.store.CassandraStore.addOrUpdateField(CassandraStore.java:341) at org.apache.gora.cassandra.store.CassandraStore.flush(CassandraStore.java:228) at org.apache.gora.cassandra.store.CassandraStore.close(CassandraStore.java:95) at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55) at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:651) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) Caused by: InvalidRequestException(why:(String didn't validate.) 
[webpage][f][ts] failed validation) at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20253) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:922) at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:908) at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246) at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243) at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103) at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258) ... 13 more -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1436) bin/nutch absent in zip package
[ https://issues.apache.org/jira/browse/NUTCH-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1436. - Resolution: Won't Fix As we have released 1.6, which includes the bin/nutch script in the zip package I'm marking this as won't fix. bin/nutch absent in zip package --- Key: NUTCH-1436 URL: https://issues.apache.org/jira/browse/NUTCH-1436 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.5.1 Reporter: Sebastian Nagel Attachments: NUTCH-1436.patch The script bin/nutch is absent in the package apache-nutch-1.5.1-bin.zip, the tar-bin package is not affected. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1472) InvalidRequestException(why:(String didn't validate.) [webpage][f][ts] failed validation)
[ https://issues.apache.org/jira/browse/NUTCH-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1472: Component/s: injector This issue occurs when injecting URLs into Cassandra using gora-cassandra 0.2.1. InvalidRequestException(why:(String didn't validate.) [webpage][f][ts] failed validation) -- Key: NUTCH-1472 URL: https://issues.apache.org/jira/browse/NUTCH-1472 Project: Nutch Issue Type: Bug Components: injector Affects Versions: 2.1 Reporter: zhaixuepan Fix For: 2.2 me.prettyprint.hector.api.exceptions.HInvalidRequestException: InvalidRequestException(why:(String didn't validate.) [webpage][f][ts] failed validation) at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:45) at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:264) at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97) at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243) at me.prettyprint.cassandra.model.MutatorImpl.insert(MutatorImpl.java:69) at org.apache.gora.cassandra.store.HectorUtils.insertColumn(HectorUtils.java:47) at org.apache.gora.cassandra.store.CassandraClient.addColumn(CassandraClient.java:169) at org.apache.gora.cassandra.store.CassandraStore.addOrUpdateField(CassandraStore.java:341) at org.apache.gora.cassandra.store.CassandraStore.flush(CassandraStore.java:228) at org.apache.gora.cassandra.store.CassandraStore.close(CassandraStore.java:95) at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55) at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:651) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) Caused by: InvalidRequestException(why:(String didn't validate.) 
[webpage][f][ts] failed validation) at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20253) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:922) at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:908) at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246) at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243) at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103) at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258) ... 13 more -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1495) -normalize and -filter for updatedb command in nutch 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1495:
Fix Version/s: 2.2

-normalize and -filter for updatedb command in nutch 2.x

Key: NUTCH-1495
URL: https://issues.apache.org/jira/browse/NUTCH-1495
Project: Nutch
Issue Type: Improvement
Affects Versions: 2.2
Reporter: Nathan Gass
Fix For: 2.2
Attachments: patch-updatedb-normalize-filter-2012-11-09.txt, patch-updatedb-normalize-filter-2012-11-13.txt

AFAIS in Nutch 1.x you could change your URL filters and normalizers during the crawl, and update the db using crawldb -normalize -filter. There does not seem to be a way to achieve the same in Nutch 2.x? Anyway, I went ahead and tried to implement -normalize and -filter for the Nutch 2.x updatedb command. I have no experience with any of the technologies used, including Java, so please check the attached code carefully before using it. I'm very interested to hear whether this is the right approach, and any other comments.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse lastModified to a config file.
[ https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1190:
Fix Version/s: 2.2, 1.7

MoreIndexingFilter refactor: move data formats used to parse lastModified to a config file.

Key: NUTCH-1190
URL: https://issues.apache.org/jira/browse/NUTCH-1190
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 1.4
Environment: jdk6
Reporter: Zhang JinYan
Fix For: 1.7, 2.2
Attachments: date-styles.txt, MoreIndexingFilter.patch

There are many issues about missing date formats:
[NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
[NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
[NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]

The date formats can be diverse, so why not move them to an extra config file? I moved all the date formats from MoreIndexingFilter.java to a file named date-styles.txt (placed in conf), which is loaded on startup.

{code}
public void setConf(Configuration conf) {
  this.conf = conf;
  MIME = new MimeUtil(conf);
  URL res = conf.getResource("date-styles.txt");
  if (res == null) {
    LOG.error("Can't find resource: date-styles.txt");
  } else {
    try {
      List lines = FileUtils.readLines(new File(res.getFile()));
      for (int i = 0; i < lines.size(); i++) {
        String dateStyle = (String) lines.get(i);
        if (StringUtils.isBlank(dateStyle)) {
          lines.remove(i);
          i--;
          continue;
        }
        dateStyle = StringUtils.trim(dateStyle);
        if (dateStyle.startsWith("#")) {
          lines.remove(i);
          i--;
          continue;
        }
        lines.set(i, dateStyle);
      }
      dateStyles = new String[lines.size()];
      lines.toArray(dateStyles);
    } catch (IOException e) {
      LOG.error("Failed to load resource: date-styles.txt");
    }
  }
}
{code}

Then parse lastModified like this (sample):

{code}
private long getTime(String date, String url) {
  ..
  Date parsedDate = DateUtils.parseDate(date, dateStyles);
  time = parsedDate.getTime();
  ..
  return time;
}
{code}

This patch also contains the patch of [NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140]. Find more details in the patch file.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1015) MoreIndexingFilter: can't parse erroneous date: 2006-05-24T20:03:42
[ https://issues.apache.org/jira/browse/NUTCH-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1015: Fix Version/s: 2.2 1.7 MoreIndexingFilter: can't parse erroneous date: 2006-05-24T20:03:42 --- Key: NUTCH-1015 URL: https://issues.apache.org/jira/browse/NUTCH-1015 Project: Nutch Issue Type: Bug Components: indexer Reporter: Markus Jelsma Fix For: 1.7, 2.2 MoreIndexingFilter must handle the following url's gracefully: {code} can't parse erroneous date: Sun, 27 Jun 2010 06:51:35 GMT+1 can't parse erroneous date: ma, 27 jun 2011 05:15:32 GMT can't parse erroneous date: Mon, 23 May 2011 22:05:58 GMT can't parse erroneous date: GMT {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Jenkins build is back to normal : Nutch-nutchgora #463
See https://builds.apache.org/job/Nutch-nutchgora/463/
[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1483:
Patch Info: Patch Available

Can't crawl filesystem with protocol-file plugin

Key: NUTCH-1483
URL: https://issues.apache.org/jira/browse/NUTCH-1483
Project: Nutch
Issue Type: Bug
Components: protocol
Affects Versions: 1.6, 2.1
Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
Reporter: Rogério Pereira Araújo
Attachments: NUTCH-1483.patch

I tried to follow the same steps described in this wiki page: http://wiki.apache.org/nutch/IntranetDocumentSearch. I made all required changes to regex-urlfilter.txt and added the following entry in my seed file:

file:///home/rogerio/Documents/

The permissions are OK; I'm running Nutch as the same user that owns the folder, so Nutch has all the required permissions. Unfortunately I'm getting the following error:

org.apache.nutch.protocol.file.FileError: File Error: 404
at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
fetch of file://home/rogerio/Documents/ failed with: org.apache.nutch.protocol.file.FileError: File Error: 404

Why are the logs showing file://home/rogerio/Documents/ instead of file:///home/rogerio/Documents/?

Note: The regex-urlfilter entry only works as expected if I add +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/ as the wiki says.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
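The double-slash vs. triple-slash confusion above can be reproduced with plain java.net.URI, independent of Nutch: in file://home/... the first segment ("home") is parsed as the URL authority (host), not as part of the path. A minimal sketch:

```java
// Why file://home/... and file:///home/... parse differently
// (standard java.net.URI behavior, not Nutch-specific code).
import java.net.URI;

public class FileUrlSlashes {
    public static void main(String[] args) {
        URI two = URI.create("file://home/rogerio/Documents/");
        URI three = URI.create("file:///home/rogerio/Documents/");

        // Two slashes: "home" becomes the authority, and the path
        // starts at /rogerio/... - almost certainly not what was meant.
        System.out.println(two.getAuthority()); // home
        System.out.println(two.getPath());      // /rogerio/Documents/

        // Three slashes: the authority is empty and the full local
        // filesystem path survives intact.
        System.out.println(three.getAuthority()); // empty (no host)
        System.out.println(three.getPath());      // /home/rogerio/Documents/
    }
}
```

This would also be consistent with the reporter's observation that only the +^file://home/... filter pattern matches: by the time the filter sees the URL, it has apparently been collapsed to the two-slash authority form.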
[jira] [Updated] (NUTCH-1461) Problem with TableUtil
[ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1461:
Patch Info: Patch Available
Fix Version/s: 2.2

Problem with TableUtil

Key: NUTCH-1461
URL: https://issues.apache.org/jira/browse/NUTCH-1461
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: nutchgora
Environment: Debian / CDH3 / Nutch 2.0 Release
Reporter: Christian Johnsson
Fix For: 2.2
Attachments: regex-urlfilter.txt, TabelUtil_Fix.patch

Affects parse and updatedb. I think I got some malformed URLs into HBase but I can't find them. It generates this error, though. If I empty HBase and restart, it runs for a couple of million indexed pages, then the error comes up again. Any tips on how to locate which row in the table generates this error?

2012-08-28 01:48:10,871 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:102)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:76)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.mapred.Child.main(Child.java:260)
2012-08-28 01:48:10,875 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
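For context: Nutch 2.x row keys store URLs in reversed-host form. The sketch below is not the actual org.apache.nutch.util.TableUtil code (the key layout and names are simplified assumptions), but it shows the kind of round trip involved and the defensive length check that would turn the ArrayIndexOutOfBoundsException above into an error naming the offending row.

```java
// Simplified sketch of a reversed-key round trip (hypothetical code,
// NOT the actual TableUtil implementation).
public class ReversedKeyExample {

    // "nutch.apache.org" -> "org.apache.nutch" (symmetric, so it also unreverses)
    static String reverseDots(String host) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) sb.append('.');
        }
        return sb.toString();
    }

    // e.g. "http://nutch.apache.org/about" -> "org.apache.nutch:http/about"
    static String reverse(String url) {
        int schemeEnd = url.indexOf("://");
        String scheme = url.substring(0, schemeEnd);
        String rest = url.substring(schemeEnd + 3);
        int slash = rest.indexOf('/');
        String host = slash < 0 ? rest : rest.substring(0, slash);
        String path = slash < 0 ? "" : rest.substring(slash);
        return reverseDots(host) + ":" + scheme + path;
    }

    // A key lacking the ":scheme" part is where an unchecked split would
    // throw ArrayIndexOutOfBoundsException; the length check reports the
    // raw key instead, which makes the bad row findable in the logs.
    static String unreverse(String key) {
        String[] split = key.split(":", 2);
        if (split.length < 2) {
            throw new IllegalArgumentException("malformed row key: " + key);
        }
        int slash = split[1].indexOf('/');
        String scheme = slash < 0 ? split[1] : split[1].substring(0, slash);
        String path = slash < 0 ? "" : split[1].substring(slash);
        return scheme + "://" + reverseDots(split[0]) + path;
    }
}
```

One way to locate the offending rows, then, would be to wrap the unreverse call in such a check (or a try/catch) and log the raw key before rethrowing.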
[Nutch Wiki] Trivial Update of FrontPage by LewisJohnMcgibbney
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The FrontPage page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=254&rev2=255

  * ApacheConUs2009MeetUp - List of topics for !MeetUp at !ApacheCon US 2009 in Oakland (Nov 2-6)
  * [[NutchMavenSupport|Using Nutch as a Maven dependency]]
- == Nutch 2.0 ==
+ == Nutch 2.x ==
  * Nutch2Crawling - A description of the crawling jobs
  * Nutch2Architecture - A high level overview of the new architecture and design
  * Nutch2Roadmap -- Discussions on the architecture and features of Nutch 2.0
[jira] [Resolved] (NUTCH-1094) create comprehensive documentation for Nutchgora branch
[ https://issues.apache.org/jira/browse/NUTCH-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1094. - Resolution: Fixed I would argue that this has been significantly addressed in recent months. We now have the following - A description of the crawling jobs - entry on 2.x architecture - Roadmap for 2.x - Building Nutch 2.x in Eclipse - Error messages, and - a guide to understanding the Webpage webdb columns and fields. create comprehensive documentation for Nutchgora branch --- Key: NUTCH-1094 URL: https://issues.apache.org/jira/browse/NUTCH-1094 Project: Nutch Issue Type: Sub-task Components: documentation Affects Versions: nutchgora Reporter: Lewis John McGibbney Fix For: 2.2 This should shadow the core documentation for Nutch 1.4 (branch) and mainstream users, however it should include fundamentals specific to Nutch trunk. Until we release Nutch 2.0 this documentation should be stored in svn under a /docs directory. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1447) Nutch 2.x with Cloudera CDH 4 get Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
[ https://issues.apache.org/jira/browse/NUTCH-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1447:
Fix Version/s: 2.2

Nutch 2.x with Cloudera CDH 4 get Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

Key: NUTCH-1447
URL: https://issues.apache.org/jira/browse/NUTCH-1447
Project: Nutch
Issue Type: Bug
Affects Versions: 2.1
Environment: Cloudera CDH4
Reporter: Trần Anh Tuấn
Fix For: 2.2

I'm trying to crawl using Nutch 2. I checked out the source from http://svn.apache.org/repos/asf/nutch/branches/2.x/ and configured it with MySQL. I get an error, but when running Nutch 1.5 everything works fine :(

mkdir urls
echo nutch.apache.org > urls/seed.txt
runtime/deploy/bin/nutch inject urls

12/08/07 11:25:38 INFO crawl.InjectorJob: InjectorJob: starting
12/08/07 11:25:38 INFO crawl.InjectorJob: InjectorJob: urlDir: urls
12/08/07 11:25:41 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/08/07 11:25:44 INFO input.FileInputFormat: Total input paths to process : 1
12/08/07 11:25:45 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/08/07 11:25:45 WARN snappy.LoadSnappy: Snappy native library is available
12/08/07 11:25:45 INFO snappy.LoadSnappy: Snappy native
12/08/07 11:25:47 INFO mapred.JobClient: map 0% reduce 0%
12/08/07 11:26:01 INFO mapred.JobClient: Task Id : attempt_201208071123_0001_m_00_0, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
attempt_201208071123_0001_m_00_0: SLF4J: Class path contains multiple SLF4J bindings.
attempt_201208071123_0001_m_00_0: SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201208071123_0001_m_00_0: SLF4J: Found binding in [jar:file:/var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/root/jobcache/job_201208071123_0001/jars/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201208071123_0001_m_00_0: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
12/08/07 11:26:05 INFO mapred.JobClient: Task Id : attempt_201208071123_0001_m_00_1, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
attempt_201208071123_0001_m_00_1: SLF4J: Class path contains multiple SLF4J bindings.
attempt_201208071123_0001_m_00_1: SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201208071123_0001_m_00_1: SLF4J: Found binding in [jar:file:/var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/root/jobcache/job_201208071123_0001/jars/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201208071123_0001_m_00_1: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
12/08/07 11:26:10 INFO mapred.JobClient: Task Id : attempt_201208071123_0001_m_00_2, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
attempt_201208071123_0001_m_00_2: SLF4J: Class path contains multiple SLF4J bindings.
attempt_201208071123_0001_m_00_2: SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201208071123_0001_m_00_2: SLF4J: Found binding in [jar:file:/var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/root/jobcache/job_201208071123_0001/jars/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201208071123_0001_m_00_2: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
12/08/07 11:26:19 INFO mapred.JobClient: Job complete: job_201208071123_0001
12/08/07 11:26:19 INFO mapred.JobClient: Counters: 7
12/08/07 11:26:19 INFO mapred.JobClient: Job Counters
12/08/07 11:26:19 INFO mapred.JobClient: Failed map tasks=1
12/08/07 11:26:19 INFO mapred.JobClient: Launched map tasks=4
12/08/07 11:26:19 INFO mapred.JobClient: Data-local map tasks=4
12/08/07 11:26:19 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=18003
12/08/07 11:26:19 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
12/08/07 11:26:19 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/08/07 11:26:19 INFO
[jira] [Updated] (NUTCH-1418) error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
[ https://issues.apache.org/jira/browse/NUTCH-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1418: Fix Version/s: 1.7 error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/ Key: NUTCH-1418 URL: https://issues.apache.org/jira/browse/NUTCH-1418 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Arijit Mukherjee Fix For: 1.7 Since learning that Nutch is unable to crawl JavaScript function calls in href attributes, I started looking for other alternatives. I decided to crawl http://en.wikipedia.org/wiki/Districts_of_India. I first tried injecting this URL and following the step-by-step approach up to the fetcher, when I realized Nutch did not fetch anything from this website. Looking into logs/hadoop.log, I found the following three lines, which I believe indicate that Nutch is unable to parse the site's robots.txt and the fetcher therefore stopped: 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/ 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/ 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/ I tried checking the URL using parsechecker and found no issues there! I think the robots.txt for this website is malformed, which is preventing the fetcher from fetching anything. Is there a way to get around this problem, given that parsechecker seems to go on its merry way parsing? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
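The warnings above are what percent-decoding failures look like. A minimal sketch of the failure mode, assuming the robots rules parser percent-decodes each path with java.net.URLDecoder (the class RobotsPathDecode and the helper decodeOrNull are hypothetical, not Nutch's actual RobotRulesParser code): "%3A" is a valid escape for ':', but in the logged paths the escape appears as "%3M", and 'M' is not a hex digit, so decoding fails.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class RobotsPathDecode {

    // Decode a robots.txt path the way a rules parser might; return null
    // when the percent-escape sequence is malformed (the "can't decode
    // path" case from the log above).
    static String decodeOrNull(String path) {
        try {
            return URLDecoder.decode(path, "UTF-8");
        } catch (IllegalArgumentException | UnsupportedEncodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // "%3A" decodes cleanly to ':' ...
        System.out.println(decodeOrNull("/wiki/Wikipedia%3AMediation_Committee/"));
        // ... but "%3M" is not a valid hex pair, so decoding fails.
        System.out.println(decodeOrNull("/wiki/Wikipedia%3Mediation_Committee/"));
    }
}
```

Whether the "%3M" was really in the served robots.txt or got mangled somewhere in transit is not clear from the report, but either way a parser that aborts the whole ruleset on one bad path would explain the fetcher stopping.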
[jira] [Updated] (NUTCH-1458) Support for raw HTML field added to Solr
[ https://issues.apache.org/jira/browse/NUTCH-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1458: Fix Version/s: 1.7 Support for raw HTML field added to Solr Key: NUTCH-1458 URL: https://issues.apache.org/jira/browse/NUTCH-1458 Project: Nutch Issue Type: New Feature Components: indexer, parser Affects Versions: 1.5.1 Reporter: Max Dzyuba Labels: html, nutch, raw, solr Fix For: 1.7 At the moment, the “content” field holds only the parsed text from the page. It would be nice to have a separate field in Solr document that would hold raw HTML from the crawled page. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1457: Fix Version/s: 2.2 Nutch2 Refactor the update process so that fetched items are only processed once Key: NUTCH-1457 URL: https://issues.apache.org/jira/browse/NUTCH-1457 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Fix For: 2.2 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1452) hadoop.job.history.user.location in nutch-default making job history useless
[ https://issues.apache.org/jira/browse/NUTCH-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1452: Fix Version/s: 2.2 1.7 hadoop.job.history.user.location in nutch-default making job history useless Key: NUTCH-1452 URL: https://issues.apache.org/jira/browse/NUTCH-1452 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Fix For: 1.7, 2.2 There is still a property in nutch-default 'hadoop.job.history.user.location' that redirects the creation of history files from job output locations to a custom location. I noticed that the current value does not work well with cloudera (I have tested cdh3u4), because ${hadoop.log.dir} is not defined. This actually causes the job in the jobtracker to show empty info. (With 'incomplete' job status). This is only when the job moves to retired. When it is still in 'completed', all is looking well. This property can be set to 'none', because the job history is ALSO stored in the central jobtracker location anyway. The 'hadoop.job.history.user.location' property specifies an extra location. But if it is set to an invalid value, it causes the central history location to NOT store it, so it seems. Please see for more details: http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html Besides setting it to 'none', another option is to set it to 'history' which does work with cdh. (This writes all logs to 'history' in the user directory in the configured filesystem, usually dfs). The final option is to simply remove this value and not meddle with hadoop properties at all. But that actually requires all jobs to correctly ignore these files. I am not up to date how well this currently works with Nutch jobs. This question is most relevant for trunk, since trunk heavily relies on the filesystem for jobs. What do you think? A) Set property to 'none' B) Set property to 'history' C) Remove property, see what happens, possibly fix jobs D) ? For now, I opt for A. 
But I think we need some more input with other distributions (for example official Hadoop 1.x) and also Nutch trunk. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
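Option A above would amount to a one-line override, sketched here as a nutch-site.xml snippet (assuming the property is overridden there rather than edited in nutch-default.xml):

```xml
<!-- nutch-site.xml (sketch): stop redirecting per-job history files to the
     broken ${hadoop.log.dir}-based location; the JobTracker keeps its own
     central copy of the job history regardless. -->
<property>
  <name>hadoop.job.history.user.location</name>
  <value>none</value>
</property>
```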
[jira] [Updated] (NUTCH-806) Merge CrawlDBScanner with CrawlDBReader
[ https://issues.apache.org/jira/browse/NUTCH-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-806: --- Fix Version/s: 1.7 Merge CrawlDBScanner with CrawlDBReader --- Key: NUTCH-806 URL: https://issues.apache.org/jira/browse/NUTCH-806 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.7 The CrawlDBScanner [NUTCH-784] should be merged with the CrawlDBReader. Will do that after the 1.1 release -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1410) impact of a map-reduce problem
[ https://issues.apache.org/jira/browse/NUTCH-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1410: Fix Version/s: 2.2 1.7 impact of a map-reduce problem -- Key: NUTCH-1410 URL: https://issues.apache.org/jira/browse/NUTCH-1410 Project: Nutch Issue Type: Bug Components: fetcher, generator Reporter: behnam nikbakht Fix For: 1.7, 2.2 A simple test shows that each mapper or reducer has only a local view of variables. There are multiple places in Nutch that try to share a variable between mappers or reducers: for example, the generator uses the shared variable hostCounts, and in the fetcher the last request time for each mapper (fetcherThread) differs from the others. This causes critical problems, such as sending multiple simultaneous requests to the same host, which can get the crawler blocked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1502) Test for CrawlDatum state transitions
[ https://issues.apache.org/jira/browse/NUTCH-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1502: Fix Version/s: 2.2 1.7 Test for CrawlDatum state transitions - Key: NUTCH-1502 URL: https://issues.apache.org/jira/browse/NUTCH-1502 Project: Nutch Issue Type: Improvement Components: crawldb Affects Versions: 1.7, 2.2 Reporter: Sebastian Nagel Fix For: 1.7, 2.2 An exhaustive test to check the matrix of CrawlDatum state transitions (CrawlStatus in 2.x) would be useful to detect errors esp. for continuous crawls where the number of possible transitions is quite large. Additional factors with impact on state transitions (retry counters, static and dynamic intervals) are also tested. The tests will help to address the NUTCH-578 and NUTCH-1245. See the latter for a first sketchy patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1481) When using MySQL as storage unicode characters within URLS cause nutch to fail
[ https://issues.apache.org/jira/browse/NUTCH-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1481: Fix Version/s: 2.2 When using MySQL as storage unicode characters within URLS cause nutch to fail -- Key: NUTCH-1481 URL: https://issues.apache.org/jira/browse/NUTCH-1481 Project: Nutch Issue Type: Bug Components: crawldb Affects Versions: 2.1 Environment: mysql 5.5.28 on centos Reporter: Arni Sumarlidason Labels: database, sql, unicode, utf8 Fix For: 2.2 MySQL's (innodb) primary key / unique key is restricted to 767 bytes.. currently the url of a web page is used as a primary key in nutch storage. when using latin1 character set on the 'id' column @ length 767 bytes/characters; unicode characters in urls cause jdbc to throw an exception, java.io.IOException: java.sql.BatchUpdateException: Incorrect string value: '\xE2\x80\x8' for column 'id' at row 1 when using utf8mb4 character set on the 'id' column @ length 190 characters / 760 bytes to fully support unicode characters; the field length becomes insufficient It may be better to use a hash of the url as the primary key instead of the url itself. This would allow urls of any length and full utf8 support. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
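The hashed-key idea from the last paragraph can be sketched as follows (the class UrlKey and method keyFor are illustrative, not Gora or Nutch API): a SHA-256 digest rendered as hex is always 64 ASCII characters, so it fits comfortably under InnoDB's 767-byte index limit regardless of the URL's length or character set.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class UrlKey {

    // Fixed-length, ASCII-safe primary key for a URL: SHA-256 as
    // lowercase hex. Always 64 characters, whatever the URL contains.
    static String keyFor(String url) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is guaranteed to be present in every JRE
            throw new IllegalStateException(e);
        }
    }
}
```

The trade-off is that a hashed key loses the URL's natural ordering and readability, so prefix scans by host or domain would need a separate indexed column.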
[jira] [Updated] (NUTCH-1490) Data Truncation exceptions when using mysql
[ https://issues.apache.org/jira/browse/NUTCH-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1490: Fix Version/s: 2.2 Data Truncation exceptions when using mysql --- Key: NUTCH-1490 URL: https://issues.apache.org/jira/browse/NUTCH-1490 Project: Nutch Issue Type: Bug Affects Versions: 2.1 Reporter: Nathan Gass Fix For: 2.2 Attachments: patch Nutch does not enforce the declared (or implicit) maximum length for the following columns: title, urls (id, baseUrl, reprUrl), typ (contentType), inlinks, outlinks. Trying to store too much data in one of these columns results in an exception similar to this (copied from GORA-24; I will be able to add a newer stack trace later today): {code} java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too long for column 'inlinks' at row 1 at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340) at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185) at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216) Caused by: java.sql.BatchUpdateException: Data truncation: Data too long for column 'inlinks' at row 1 at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2018) at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1449) at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328) ... 5 more {code} I'll add my current fixes in later comments. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1490) Data Truncation exceptions when using mysql
[ https://issues.apache.org/jira/browse/NUTCH-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1490: Patch Info: Patch Available Data Truncation exceptions when using mysql --- Key: NUTCH-1490 URL: https://issues.apache.org/jira/browse/NUTCH-1490 Project: Nutch Issue Type: Bug Affects Versions: 2.1 Reporter: Nathan Gass Fix For: 2.2 Attachments: patch Nutch does not enforce the declared (or implicit) maximum length for the following columns: title, urls (id, baseUrl, reprUrl), typ (contentType), inlinks, outlinks. Trying to store too much data in one of these columns results in an exception similar to this (copied from GORA-24; I will be able to add a newer stack trace later today): {code} java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too long for column 'inlinks' at row 1 at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340) at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185) at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216) Caused by: java.sql.BatchUpdateException: Data truncation: Data too long for column 'inlinks' at row 1 at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2018) at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1449) at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328) ... 5 more {code} I'll add my current fixes in later comments. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1487) Nutch parse fails first time for PDF files and works on reparse
[ https://issues.apache.org/jira/browse/NUTCH-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1487: Fix Version/s: 2.2 Nutch parse fails first time for PDF files and works on reparse --- Key: NUTCH-1487 URL: https://issues.apache.org/jira/browse/NUTCH-1487 Project: Nutch Issue Type: Bug Components: parser, storage Affects Versions: 2.1 Reporter: kiran Labels: mysql Fix For: 2.2 The parser fails to parse PDF files in one go; it only succeeds after re-running the parse command repeatedly, roughly as many times as there are PDF files in total, as discussed in the mailing list here (http://www.mail-archive.com/user%40nutch.apache.org/msg07952.html) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first
[ https://issues.apache.org/jira/browse/NUTCH-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1297: Fix Version/s: 1.7 it is better for fetchItemQueues to select items from greater queues first -- Key: NUTCH-1297 URL: https://issues.apache.org/jira/browse/NUTCH-1297 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.4 Reporter: behnam nikbakht Fix For: 1.7 Attachments: NUTCH-1297.patch When a fetch involves multiple hosts of different sizes, large hosts can wait a long time before getFetchItem() in the FetchItemQueues class selects a URL from them, so we can give them higher priority. For example, if we have 10 URLs from host1, 1000 URLs from host2, and 5 threads: if all the threads select from host1 first, the fetch is delayed more than if the threads select from host2 first and then, while host2 is busy, select from host1. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first
[ https://issues.apache.org/jira/browse/NUTCH-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1297: Patch Info: Patch Available it is better for fetchItemQueues to select items from greater queues first -- Key: NUTCH-1297 URL: https://issues.apache.org/jira/browse/NUTCH-1297 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.4 Reporter: behnam nikbakht Fix For: 1.7 Attachments: NUTCH-1297.patch When a fetch involves multiple hosts of different sizes, large hosts can wait a long time before getFetchItem() in the FetchItemQueues class selects a URL from them, so we can give them higher priority. For example, if we have 10 URLs from host1, 1000 URLs from host2, and 5 threads: if all the threads select from host1 first, the fetch is delayed more than if the threads select from host2 first and then, while host2 is busy, select from host1. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
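The host1/host2 example above can be sketched with a toy queue structure (hypothetical class, not Nutch's actual FetchItemQueues): always picking from the largest non-empty queue lets the 1000-URL host start draining immediately instead of waiting behind the small one.

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

public class LargestQueueFirst {

    // One FIFO queue of pending URLs per host.
    final Map<String, Queue<String>> queues = new LinkedHashMap<>();

    void add(String host, String url) {
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    // Pick the next URL from the largest non-empty queue, so big hosts
    // get fetch threads early and their long tail overlaps the small hosts.
    String nextUrl() {
        return queues.values().stream()
                .filter(q -> !q.isEmpty())
                .max(Comparator.comparingInt(Queue::size))
                .map(Queue::poll)
                .orElse(null);
    }
}
```

A real fetcher would combine this with the per-host politeness delay, so "largest first" only reorders hosts that are currently eligible to be fetched.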
[jira] [Updated] (NUTCH-1286) Refactoring/reimplementing crawling API (NutchApp)
[ https://issues.apache.org/jira/browse/NUTCH-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1286: Fix Version/s: 2.2 Refactoring/reimplementing crawling API (NutchApp) -- Key: NUTCH-1286 URL: https://issues.apache.org/jira/browse/NUTCH-1286 Project: Nutch Issue Type: Improvement Components: administration gui, REST_api, web gui Reporter: Ferdy Galema Fix For: 2.2 This issue is to track changes we (Mathijs and I) have planned for the API and webapp in Nutchgora. We have a pretty good idea of how we want to be using the crawl API. It may involve some major refactoring, or perhaps a side implementation next to the current NutchApp functionality; it depends on how much we can reuse of the existing components. The bottom line is that there will be a strictly defined Java API that provides everything from crawling/indexing to job control (listing jobs, tracking progress and aborting jobs being part of it). There will be no server or service for tracking crawling state; everything will be persisted one way or the other and queryable from the API. The REST server shall be a very thin layer on top of the Java implementation. A rich web interface will be an easy layer too, once we have a cleanly (but extensively) defined API. But we will start by making the API usable from a simple command-line interface. More details will be provided later on. Feel free to comment if you have suggestions/questions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1267) urlmeta to delegate indexing to index-metadata
[ https://issues.apache.org/jira/browse/NUTCH-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1267: Fix Version/s: 1.7 urlmeta to delegate indexing to index-metadata -- Key: NUTCH-1267 URL: https://issues.apache.org/jira/browse/NUTCH-1267 Project: Nutch Issue Type: Sub-task Components: indexer Affects Versions: 1.6 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.7 Ideally we should get rid of urlmeta altogether and add the transmission of the meta to the outlinks in the core classes - not as a plugin. URLMeta is also a terrible name :-( -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1268) parse-meta to delegate indexing to index-metadata
[ https://issues.apache.org/jira/browse/NUTCH-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1268: Fix Version/s: 1.7 parse-meta to delegate indexing to index-metadata - Key: NUTCH-1268 URL: https://issues.apache.org/jira/browse/NUTCH-1268 Project: Nutch Issue Type: Sub-task Components: indexer Reporter: Julien Nioche Fix For: 1.7 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1303) Fetcher to skip queues for URLS getting repeated exceptions, based on percentage
[ https://issues.apache.org/jira/browse/NUTCH-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1303: Fix Version/s: 1.7 Fetcher to skip queues for URLS getting repeated exceptions, based on percentage Key: NUTCH-1303 URL: https://issues.apache.org/jira/browse/NUTCH-1303 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.4 Reporter: behnam nikbakht Labels: fetch Fix For: 1.7 Attachments: NUTCH-1303.patch As described in https://issues.apache.org/jira/browse/NUTCH-769, skipping queues with a high exception count is a good solution, but it is not easy to set the value of fetcher.max.exceptions.per.queue when queue sizes differ. I suggest defining a ratio instead of an absolute value, so that a queue is cleared when its ratio of exceptions to requests exceeds the threshold. Also, this alone is not sufficient to protect the fetcher from excessive exceptions: fetcher.throughput.threshold.pages ensures that a valuable fetch throughput can be maintained against slow hosts, but it clears all queues, not just the slow ones. I suggest that this factor, like fetcher.max.exceptions.per.queue, be enforced per queue rather than across all of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1303) Fetcher to skip queues for URLS getting repeated exceptions, based on percentage
[ https://issues.apache.org/jira/browse/NUTCH-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1303: Patch Info: Patch Available Fetcher to skip queues for URLS getting repeated exceptions, based on percentage Key: NUTCH-1303 URL: https://issues.apache.org/jira/browse/NUTCH-1303 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.4 Reporter: behnam nikbakht Labels: fetch Fix For: 1.7 Attachments: NUTCH-1303.patch As described in https://issues.apache.org/jira/browse/NUTCH-769, skipping queues with a high exception count is a good solution, but it is not easy to set the value of fetcher.max.exceptions.per.queue when queue sizes differ. I suggest defining a ratio instead of an absolute value, so that a queue is cleared when its ratio of exceptions to requests exceeds the threshold. Also, this alone is not sufficient to protect the fetcher from excessive exceptions: fetcher.throughput.threshold.pages ensures that a valuable fetch throughput can be maintained against slow hosts, but it clears all queues, not just the slow ones. I suggest that this factor, like fetcher.max.exceptions.per.queue, be enforced per queue rather than across all of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
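The per-queue ratio idea can be sketched like this (hypothetical class and field names, not fetcher internals): each queue is judged by its own exceptions-to-requests ratio, so a 10-URL host and a 1000-URL host get proportional treatment instead of sharing one absolute fetcher.max.exceptions.per.queue count.

```java
import java.util.HashMap;
import java.util.Map;

public class ExceptionRatioPolicy {

    // Per-host request/exception counters.
    static class Stats { int requests; int exceptions; }

    final Map<String, Stats> byHost = new HashMap<>();
    final double maxExceptionRatio;

    ExceptionRatioPolicy(double maxExceptionRatio) {
        this.maxExceptionRatio = maxExceptionRatio;
    }

    // Record the outcome of one fetch attempt against a host's queue.
    void record(String host, boolean failed) {
        Stats s = byHost.computeIfAbsent(host, h -> new Stats());
        s.requests++;
        if (failed) s.exceptions++;
    }

    // Clear (skip) a queue only when its own failure ratio is too high,
    // rather than applying one absolute count to queues of very
    // different sizes.
    boolean shouldSkipQueue(String host) {
        Stats s = byHost.get(host);
        if (s == null || s.requests == 0) return false;
        return (double) s.exceptions / s.requests > maxExceptionRatio;
    }
}
```

In practice a minimum request count would be sensible too, so one failure out of one request does not immediately clear a fresh queue.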
[jira] [Updated] (NUTCH-1270) some of Deflate encoded pages not fetched
[ https://issues.apache.org/jira/browse/NUTCH-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1270: Patch Info: Patch Available Fix Version/s: 1.7 some of Deflate encoded pages not fetched - Key: NUTCH-1270 URL: https://issues.apache.org/jira/browse/NUTCH-1270 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.4 Environment: software Reporter: behnam nikbakht Labels: fetch, processDeflateEncoded Fix For: 1.7 Attachments: NUTCH-1270.patch Some web pages are fetched but their content cannot be retrieved. After the following change to lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java, this error is fixed:
{code}
public byte[] processDeflateEncoded(byte[] compressed, URL url) throws IOException {
  if (LOGGER.isTraceEnabled()) {
    LOGGER.trace("inflating");
  }
  byte[] content = DeflateUtils.inflateBestEffort(compressed, getMaxContent());
+ if (content == null)
+   content = DeflateUtils.inflateBestEffort(compressed, 20);
  if (content == null)
    throw new IOException("inflateBestEffort returned null");
  if (LOGGER.isTraceEnabled()) {
    LOGGER.trace("fetched " + compressed.length
        + " bytes of compressed content (expanded to "
        + content.length + " bytes) from " + url);
  }
  return content;
}
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1282) linkdb scalability
[ https://issues.apache.org/jira/browse/NUTCH-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1282: Fix Version/s: 1.7 linkdb scalability -- Key: NUTCH-1282 URL: https://issues.apache.org/jira/browse/NUTCH-1282 Project: Nutch Issue Type: Improvement Components: linkdb Affects Versions: 1.4 Reporter: behnam nikbakht Fix For: 1.7 As described in NUTCH-1054, the linkdb is optional in solrindex; it is used only for anchors and has no impact on scoring. The linkdb of an incremental crawl grows very fast, which makes it unscalable for huge websites. So there are two choices: first, drop invertlinks and the linkdb from the crawl entirely; second, make invertlinks scalable. Invertlinks consists of two jobs: the first constructs a new linkdb from the newly parsed segments, and the second merges the new linkdb with the old linkdb. The second job is the unscalable one, and we can skip it with these changes in solrindex: in the reduce method of the IndexerMapReduce class, if fetchDatum == null or dbDatum == null or parseText == null or parseData == null, then add the anchor to the document and update Solr (no insert). Some changes to NutchDocument are also required here. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1269) Generate main problems
[ https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1269: Fix Version/s: 1.7 Generate main problems -- Key: NUTCH-1269 URL: https://issues.apache.org/jira/browse/NUTCH-1269 Project: Nutch Issue Type: Improvement Components: generator Affects Versions: 1.4 Environment: software Reporter: behnam nikbakht Labels: Generate, MaxHostCount, MaxNumSegments Fix For: 1.7 Attachments: NUTCH-1269.patch, NUTCH-1269-v.2.patch There are some problems with the current generate method when the maxNumSegments and maxHostCount options are used: 1. the generated segments differ in size; 2. with the maxHostCount option, it is unclear whether the limit was actually applied; 3. URLs from one host are distributed non-uniformly between segments. We changed Generator.java as described below. In the Selector class:
{code}
private int maxNumSegments;
private int segmentSize;
private int maxHostCount;

public void config ...
  maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
  segmentSize = (int) job.getInt(GENERATOR_TOP_N, 1000) / maxNumSegments;
  maxHostCount = job.getInt(GENERATE_MAX_PER_HOST, 100);
  ...
{code}
and the reduce method:
{code}
public void reduce(FloatWritable key, Iterator<SelectorEntry> values,
    OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
    throws IOException {
  int limit2 = (int) ((limit * 3) / 2);
  while (values.hasNext()) {
    if (count == limit)
      break;
    if (count % segmentSize == 0) {
      if (currentsegmentnum < maxNumSegments - 1) {
        currentsegmentnum++;
      } else
        currentsegmentnum = 0;
    }
    boolean full = true;
    for (int jk = 0; jk < maxNumSegments; jk++) {
      if (segCounts[jk] < segmentSize) {
        full = false;
      }
    }
    if (full) {
      break;
    }
    SelectorEntry entry = values.next();
    Text url = entry.url;
    // logWrite("Generated3:" + limit + "-" + count + "-" + url.toString());
    String urlString = url.toString();
    URL u = null;
    String hostordomain = null;
    try {
      if (normalise && normalizers != null) {
        urlString = normalizers.normalize(urlString,
            URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
      }
      u = new URL(urlString);
      if (byDomain) {
        hostordomain = URLUtil.getDomainName(u);
      } else {
        hostordomain = new URL(urlString).getHost();
      }
      hostordomain = hostordomain.toLowerCase();
      boolean countLimit = true;
      // only filter if we are counting hosts or domains
      // host count {a,b,c,d} means that from this host there are a urls in
      // segment 0, b urls in segment 1, and so on
      int[] hostCount = hostCounts.get(hostordomain);
      if (hostCount == null) {
        hostCount = new int[maxNumSegments];
        for (int kl = 0; kl < hostCount.length; kl++)
          hostCount[kl] = 0;
        hostCounts.put(hostordomain, hostCount);
      }
      int selectedSeg = currentsegmentnum;
      int minCount = hostCount[selectedSeg];
      for (int jk = 0; jk < maxNumSegments; jk++) {
        if (hostCount[jk] < minCount) {
          minCount = hostCount[jk];
          selectedSeg = jk;
        }
      }
      if (hostCount[selectedSeg] <= maxHostCount) {
        count++;
        entry.segnum = new IntWritable(selectedSeg);
        hostCount[selectedSeg]++;
        output.collect(key, entry);
      }
    } catch (Exception e) {
      LOG.warn("Malformed URL: '" + urlString + "', skipping ("
          + StringUtils.stringifyException(e) + ")");
      logWrite("Generate-malform:" + hostordomain + "-" + url.toString());
      // continue;
    }
  }
}
{code}
-- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch
[ https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1281: Fix Version/s: 2.2 1.7 tika parser not work properly with unwanted file types that passed from filters in nutch Key: NUTCH-1281 URL: https://issues.apache.org/jira/browse/NUTCH-1281 Project: Nutch Issue Type: Improvement Components: parser Reporter: behnam nikbakht Fix For: 1.7, 2.2 When parse-plugins.xml contains this mapping:
{code}
<mimeType name="*">
  <plugin id="parse-tika" />
</mimeType>
{code}
every unwanted file that passes all the filters is referred to Tika, but for some file types, such as .flv, the Tika parser hangs, causing the parse job to fail. If such file types pass regex-urlfilter and the other filters, the parse job fails. For this problem I suggest adding some properties for valid file types and using code like this in TikaParser.java:
{code}
public ParseResult getParse(Content content) {
  String mimeType = content.getContentType();
+ String[] validTypes = new String[] { "application/pdf",
+     "application/x-tika-msoffice", "application/x-tika-ooxml",
+     "application/vnd.oasis.opendocument.text", "text/plain",
+     "application/rtf", "application/rss+xml", "application/x-bzip2",
+     "application/x-gzip", "application/x-javascript",
+     "application/javascript", "text/javascript",
+     "application/x-shockwave-flash", "application/zip", "text/xml",
+     "application/xml" };
+ boolean valid = false;
+ for (int k = 0; k < validTypes.length; k++) {
+   if (validTypes[k].compareTo(mimeType.toLowerCase()) == 0)
+     valid = true;
+ }
+ if (!valid)
+   return new ParseStatus(ParseStatus.NOTPARSED,
+       "Can't parse for unwanted filetype " + mimeType)
+       .getEmptyParseResult(content.getUrl(), getConf());
  URL base;
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1278) Fetch Improvement in threads per host
[ https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1278: Patch Info: Patch Available Fetch Improvement in threads per host - Key: NUTCH-1278 URL: https://issues.apache.org/jira/browse/NUTCH-1278 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 1.4 Reporter: behnam nikbakht Fix For: 1.7 Attachments: NUTCH-1278-v.2.zip, NUTCH-1278.zip The value of maxThreads is equal to fetcher.threads.per.host and is constant for every host. It would be possible to use a dynamic value for each host, influenced by its number of blocked requests: if the number of blocked requests for a host increases, we should decrease this value and increase http.timeout. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
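The dynamic per-host cap suggested above could be sketched as follows. PerHostThrottle and its halving policy are hypothetical illustrations of the idea, not the attached patch:

```java
// Hypothetical sketch: shrink the per-host thread cap as blocked requests
// for that host accumulate, instead of using a single constant maxThreads.
public class PerHostThrottle {
    private final int baseMaxThreads; // e.g. the fetcher.threads.per.host value

    public PerHostThrottle(int baseMaxThreads) {
        this.baseMaxThreads = baseMaxThreads;
    }

    // Halve the cap for every 10 blocked requests observed, never below 1,
    // so a struggling host is hit by fewer concurrent threads.
    public int maxThreadsFor(int blockedRequests) {
        int cap = baseMaxThreads >> (blockedRequests / 10);
        return Math.max(1, cap);
    }
}
```

The halving schedule and the threshold of 10 are arbitrary choices for the sketch; a real policy would likely also raise http.timeout as the cap drops, as the report suggests.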
[jira] [Updated] (NUTCH-926) Nutch follows wrong url in META http-equiv=refresh tag
[ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-926: --- Fix Version/s: 1.7 Nutch follows wrong url in META http-equiv=refresh tag - Key: NUTCH-926 URL: https://issues.apache.org/jira/browse/NUTCH-926 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.2 Environment: gnu/linux centOs Reporter: Marco Novo Priority: Critical Fix For: 1.7 Attachments: ParseOutputFormat.java.patch We have Nutch set to crawl a list of domain URLs and we want to fetch only the listed domains (hosts), not subdomains: WWW.DOMAIN1.COM ... WWW.RIGHTDOMAIN.COM ... WWW.DOMAIN.COM We set Nutch to NOT FOLLOW EXTERNAL LINKS. During crawling of WWW.RIGHTDOMAIN.COM, if a page contains

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html> <head> <title></title> <META http-equiv="refresh" content="0; url=http://WRONG.RIGHTDOMAIN.COM"> </head> <body> </body> </html>

Nutch continues to crawl the WRONG subdomain, but it should not do this! Likewise, if a page contains

<META http-equiv="refresh" content="0; url=http://WWW.WRONGDOMAIN.COM">

Nutch continues to crawl the WRONG domain, but it should not do this! Otherwise we will spider the whole web. We think the problem is in org.apache.nutch.parse.ParseOutputFormat. We have written a patch and will attach it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
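The intended behaviour, following a meta-refresh target only when its host matches the page that declared it, can be sketched like this. RefreshTargetFilter is a hypothetical name for illustration, not the attached ParseOutputFormat patch:

```java
import java.net.URL;

// Hypothetical sketch: accept a meta-refresh target only when its host
// exactly matches the host of the page that declared the refresh.
public class RefreshTargetFilter {
    public static boolean sameHost(String pageUrl, String refreshUrl) {
        try {
            String a = new URL(pageUrl).getHost().toLowerCase();
            String b = new URL(refreshUrl).getHost().toLowerCase();
            return a.equals(b);
        } catch (Exception e) {
            // Malformed URLs are rejected rather than followed.
            return false;
        }
    }
}
```

An exact host comparison rejects both the WRONG.RIGHTDOMAIN.COM subdomain case and the WWW.WRONGDOMAIN.COM external-domain case from the report; a real fix would plug a check like this into the outlink handling for refresh targets.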
[jira] [Updated] (NUTCH-881) Good quality documentation for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-881: --- Fix Version/s: 1.7 Good quality documentation for Nutch Key: NUTCH-881 URL: https://issues.apache.org/jira/browse/NUTCH-881 Project: Nutch Issue Type: Improvement Components: documentation Affects Versions: nutchgora Reporter: Andrzej Bialecki Assignee: Lewis John McGibbney Fix For: 1.7 This is, and has been, a long standing request from Nutch users. This becomes an acute need as we redesign Nutch 2.0, because the collective knowledge and the Wiki will no longer be useful without massive amount of editing. IMHO the reference documentation should be in SVN, and not on the Wiki - the Wiki is good for casual information and recipes but I think it's too messy and not reliable enough as a reference. I propose to start with the following: 1. let's decide on the format of the docs. Each format has its own pros and cons: * HTML: easy to work with, but formatting may be messy unless we edit it by hand, at which point it's no longer so easy... Good toolchains to convert to other formats, but limited expressiveness of larger structures (e.g. book, chapters, TOC, multi-column layouts, etc). * Docbook: learning curve is higher, but not insurmountable... Naturally yields very good structure. Figures/diagrams may be problematic - different renderers (html, pdf) like to treat the scaling and placing somewhat differently. * Wiki-style (Confluence or TWiki): easy to use, but limited control over larger structures. Maven Doxia can format cwiki, twiki, and a host of other formats to e.g. html and pdf. * other? 2. start documenting the main tools and the main APIs (e.g. the plugins and all the extension points). We can of course reuse material from the Wiki and from various presentations (e.g. the ApacheCon slides). -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1253: Fix Version/s: 2.2 1.7 Incompatible neko and xerces versions - Key: NUTCH-1253 URL: https://issues.apache.org/jira/browse/NUTCH-1253 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Environment: Ubuntu 10.04 Reporter: Dennis Spathis Fix For: 1.7, 2.2 Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch The Nutch 1.4 distribution includes - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml) - xercesImpl-2.9.1.jar (under .../runtime/local/lib) These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stacktrace.) I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem. Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following:

<plugin id="lib-nekohtml" name="CyberNeko HTML Parser" version="1.9.11" provider-name="org.cyberneko">
  <runtime>
    <library name="nekohtml-0.9.5.jar">
      <export name="*"/>
    </library>
  </runtime>
</plugin>

Note the conflicting version numbers (the version attribute is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1257) Support for the x-robots-tag HTTP Header
[ https://issues.apache.org/jira/browse/NUTCH-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1257: Fix Version/s: 2.2 1.7 Support for the x-robots-tag HTTP Header Key: NUTCH-1257 URL: https://issues.apache.org/jira/browse/NUTCH-1257 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Mike Labels: http, privacy, robots Fix For: 1.7, 2.2 Google and Bing both currently support the x-robots-tag HTTP header. This is important, because they have a policy of not *crawling* links that are in a robots.txt file, and not *indexing* links that are set to noindex. In the case that a page is indexed but not crawled, Google and Bing will show the page in their results, but it will lack a snippet (since they didn't crawl it, they can't generate one). As a result, the only way to block Google and Bing from having a page in their index is to use the robots meta tag in HTML pages and the x-robots-tag in other mimetypes. As a site owner that needs to keep specific pages private, I *cannot* trust robots.txt to keep my pages out of Google and Bing, and I have to use the two robots standards. Since Nutch doesn't support the HTTP header, I have to block it from crawling ALL non-HTML pages on my site. This is not an ideal state of affairs, and it would be great if Nutch supported the x-robots-tag HTTP header. I've done more research on this topic on my blog: - http://michaeljaylissner.com/blog/support-for-x-robots-tag-http-header-and-robots-HTML-meta-tag - http://michaeljaylissner.com/blog/respecting-privacy-while-providing-hundreds-of-thousands-of-public-documents -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
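A minimal sketch of interpreting the header value might look like this. XRobotsTag is a hypothetical helper for illustration; it only considers the noindex and none directives, ignoring user-agent-scoped forms of the header:

```java
import java.util.Locale;

// Hypothetical sketch: decide from an X-Robots-Tag header value whether
// the fetched document should be excluded from the index.
public class XRobotsTag {
    public static boolean isNoIndex(String headerValue) {
        if (headerValue == null) return false;
        // Directives are comma-separated and case-insensitive,
        // e.g. "noindex, nofollow"; "none" implies noindex.
        for (String directive : headerValue.toLowerCase(Locale.ROOT).split(",")) {
            String d = directive.trim();
            if (d.equals("noindex") || d.equals("none")) return true;
        }
        return false;
    }
}
```

A fetcher-side check like this would let non-HTML documents opt out of indexing the same way the robots meta tag does for HTML pages.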
[jira] [Updated] (NUTCH-1250) parse-html does not parse links with empty anchor
[ https://issues.apache.org/jira/browse/NUTCH-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1250: Fix Version/s: 2.2 1.7 parse-html does not parse links with empty anchor - Key: NUTCH-1250 URL: https://issues.apache.org/jira/browse/NUTCH-1250 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Andreas Janning Fix For: 1.7, 2.2 The parse-html plugin does not generate an outlink if the link has no anchor text. For example, the following HTML code does not create an Outlink:

{code:html}
<a href="example.com"></a>
{code}

The JUnit test TestDOMContentUtils tries to test this case but does not actually exercise it, because there is a comment inside the a-tag.

{code:title=TestDOMContentUtils.java|borderStyle=solid}
new String("<html><head><title> title </title>"
    + "</head><body>"
    + "<a href=\"g\"><!--no anchor--></a>"
    + "<a href=\"g1\"> <!--whitespace--> </a>"
    + "<a href=\"g2\"> <img src=\"test.gif\" alt='bla bla'> </a>"
    + "</body></html>"),
{code}

When you remove the comment, the test fails.

{code:title=TestDOMContentUtils.java Test fails|borderStyle=solid}
new String("<html><head><title> title </title>"
    + "</head><body>"
    + "<a href=\"g\"></a>" // no anchor
    + "<a href=\"g1\"> <!--whitespace--> </a>"
    + "<a href=\"g2\"> <img src=\"test.gif\" alt='bla bla'> </a>"
    + "</body></html>"),
{code}

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1080) Type safe members , arguments for better readability
[ https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1080: Fix Version/s: 2.2 Type safe members , arguments for better readability - Key: NUTCH-1080 URL: https://issues.apache.org/jira/browse/NUTCH-1080 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Karthik K Fix For: 1.7, 2.2 Attachments: NUTCH-1080.patch, NUTCH-rel_14-1080.patch Enable generics for some of the API, for better type safety and readability, in the process. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1080) Type safe members , arguments for better readability
[ https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1080: Fix Version/s: 1.7 Type safe members , arguments for better readability - Key: NUTCH-1080 URL: https://issues.apache.org/jira/browse/NUTCH-1080 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Karthik K Fix For: 1.7 Attachments: NUTCH-1080.patch, NUTCH-rel_14-1080.patch Enable generics for some of the API, for better type safety and readability, in the process. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1076) Solrindex has no documents following bin/nutch solrindex when using protocol-file
[ https://issues.apache.org/jira/browse/NUTCH-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1076: Fix Version/s: 1.7 Solrindex has no documents following bin/nutch solrindex when using protocol-file - Key: NUTCH-1076 URL: https://issues.apache.org/jira/browse/NUTCH-1076 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.3 Environment: Ubuntu Linux 10.04 server JDK 1.6 Nutch 1.3 Solr 3.1.0 Reporter: Seth Griffin Assignee: Markus Jelsma Labels: nutch, protocol-file, solrindex Fix For: 1.7 Note: When using protocol-http I am able to update solr effortlessly. To test this I have a single pdf file that I am trying to index in my urls directory. I execute: bin/nutch crawl urls Output: solrUrl is not set, indexing will be skipped... crawl started in: crawl-20110805151045 rootUrlDir = urls threads = 10 depth = 5 solrUrl=null Injector: starting at 2011-08-05 15:10:45 Injector: crawlDb: crawl-20110805151045/crawldb Injector: urlDir: urls Injector: Converting injected urls to crawl db entries. Injector: Merging injected urls into crawl db. Injector: finished at 2011-08-05 15:10:48, elapsed: 00:00:02 Generator: starting at 2011-08-05 15:10:48 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: jobtracker is 'local', generating exactly one partition. Generator: Partitioning selected urls for politeness. Generator: segment: crawl-20110805151045/segments/20110805151050 Generator: finished at 2011-08-05 15:10:51, elapsed: 00:00:03 Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property. 
Fetcher: starting at 2011-08-05 15:10:51 Fetcher: segment: crawl-20110805151045/segments/20110805151050 Fetcher: threads: 10 QueueFeeder finished: total 1 records + hit by time limit :0 fetching file:///home/nutch/nutch-1.3/runtime/local/indexdir/Altec.pdf -finishing thread FetcherThread, activeThreads=9 -finishing thread FetcherThread, activeThreads=8 -finishing thread FetcherThread, activeThreads=7 -finishing thread FetcherThread, activeThreads=6 -finishing thread FetcherThread, activeThreads=5 -finishing thread FetcherThread, activeThreads=4 -finishing thread FetcherThread, activeThreads=3 -finishing thread FetcherThread, activeThreads=2 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=0 -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=0 Fetcher: finished at 2011-08-05 15:10:53, elapsed: 00:00:02 ParseSegment: starting at 2011-08-05 15:10:53 ParseSegment: segment: crawl-20110805151045/segments/20110805151050 ParseSegment: finished at 2011-08-05 15:10:56, elapsed: 00:00:03 CrawlDb update: starting at 2011-08-05 15:10:56 CrawlDb update: db: crawl-20110805151045/crawldb CrawlDb update: segments: [crawl-20110805151045/segments/20110805151050] CrawlDb update: additions allowed: true CrawlDb update: URL normalizing: true CrawlDb update: URL filtering: true CrawlDb update: Merging segment data into db. CrawlDb update: finished at 2011-08-05 15:10:57, elapsed: 00:00:01 Generator: starting at 2011-08-05 15:10:57 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: jobtracker is 'local', generating exactly one partition. Generator: 0 records selected for fetching, exiting ... Stopping at depth=1 - no more URLs to fetch. 
LinkDb: starting at 2011-08-05 15:10:58 LinkDb: linkdb: crawl-20110805151045/linkdb LinkDb: URL normalize: true LinkDb: URL filter: true LinkDb: adding segment: file:/home/nutch/nutch-1.3/runtime/local/crawl-20110805151045/segments/20110805151050 LinkDb: finished at 2011-08-05 15:10:59, elapsed: 00:00:01 crawl finished: crawl-20110805151045 Then with a clean solr index (stats output from stats.jsp below): searcherName : Searcher@14dd758 main caching : true numDocs : 0 maxDoc : 0 reader : SolrIndexReader{this=1ee148b,r=ReadOnlyDirectoryReader@1ee148b,refCnt=1,segments=0} readerDir : org.apache.lucene.store.NIOFSDirectory@/home/solr/apache-solr-3.1.0/example/solr/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@987197 indexVersion : 1312575204101 openedAt : Fri Aug 05 15:13:24 CDT 2011 registeredAt : Fri Aug 05 15:13:24 CDT 2011 warmupTime : 0 I then execute: bin/nutch solrindex http://localhost:8983/solr/ crawl-20110805151045/crawldb/ crawl-20110805151045/linkdb/ crawl-20110805151045/segments/* bin/nutch output: SolrIndexer: starting at 2011-08-05 15:15:48 SolrIndexer: finished at 2011-08-05 15:15:50, elapsed: 00:00:01
[jira] [Updated] (NUTCH-1371) Replace Ivy with Maven Ant tasks
[ https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1371: Fix Version/s: 2.2 1.7 Replace Ivy with Maven Ant tasks Key: NUTCH-1371 URL: https://issues.apache.org/jira/browse/NUTCH-1371 Project: Nutch Issue Type: Improvement Components: build Reporter: Julien Nioche Fix For: 1.7, 2.2 Attachments: NUTCH-1371.patch We might move to Maven altogether but a good intermediate step could be to rely on the maven ant tasks for managing the dependencies. Ivy does a good job but we need to have a pom file anyway for publishing the artefacts which means keeping the pom.xml and ivy.xml contents in sync. Most devs are also more familiar with Maven, and it is well integrated in IDEs. Going the ANT+MVN way also means that we don't have to rewrite the whole building process and can rely on our existing script -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1382) Adding support for EmbeddedSolrServer to SolrIndexer
[ https://issues.apache.org/jira/browse/NUTCH-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1382: Fix Version/s: 1.7 Adding support for EmbeddedSolrServer to SolrIndexer Key: NUTCH-1382 URL: https://issues.apache.org/jira/browse/NUTCH-1382 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.5 Reporter: Emre Çelikten Labels: patch Fix For: 1.7 Attachments: embeddedsolrserver.patch Here is a hack to allow somebody to plug their own SolrServer into SolrIndexer. It allows people to use EmbeddedSolrServer in Nutch. It works by: adding a constructor in SolrIndexer with parameter SolrServer, adding an ugly method of getSolrServer into SolrUtils which returns SolrServer if there is one provided by the programmer or returns default getCommonsHttpSolrServer(...) replacing every occurrence of getCommonsHttpSolrServer by getSolrServer. Hope this helps. This is my first patch ever to FOSS community so I hope I am doing it correctly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1387) All parsers should respond to cancellation / interrupts.
[ https://issues.apache.org/jira/browse/NUTCH-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1387: Fix Version/s: 2.2 1.7 All parsers should respond to cancellation / interrupts. Key: NUTCH-1387 URL: https://issues.apache.org/jira/browse/NUTCH-1387 Project: Nutch Issue Type: Bug Components: parser Reporter: Ferdy Galema Fix For: 1.7, 2.2 During parsing a TimeoutException can occur. This is caused whenever FutureTask.get() cannot complete within the specified timeout. The tricky part is that single urls might be perfectly able to complete within the timeout, but under heavy concurrent load (a lot of semi-expensive parses) the parser load can stack up and cause many parses to time out. This can be the case with parsing during fetch. But when using a separate parser job this can also happen, because Parser implementations do not necessarily respond to a thread interrupt (which is fired with the task.cancel(true) call). If a parser does not check the Thread.interrupted state at regular intervals, it will just continue to run and eat up resources. I find it very helpful to debug stalling fetchers/parsers with the lazy man's profiler: kill -QUIT process_id. This will dump stacktraces, sometimes exposing the fact that hundreds of parser threads are still active in the background. (Of course many of them already timed out a long time ago.) To fix this, every parser should check its interrupted state at regular intervals. (For example an html parse might be stuck walking the DOM tree, so checking after every Nth element would be an appropriate moment.) This issue is for reference first. Fixing it all at once would be a huge task. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
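The "check every Nth element" idea can be sketched as follows. InterruptAwareWalk is a hypothetical stand-in for a DOM traversal, with a plain counter in place of node visits:

```java
// Hypothetical sketch: a parser loop that checks its interrupt status
// every N iterations, so a cancelled parse task actually stops instead
// of running on and eating resources after its timeout.
public class InterruptAwareWalk {
    static final int CHECK_EVERY = 64;

    // Returns the number of elements processed; stops early if the
    // thread was interrupted (e.g. by FutureTask.cancel(true)).
    public static int walk(int totalElements) {
        for (int i = 0; i < totalElements; i++) {
            // Periodic, cheap interrupt check between "node visits".
            if (i % CHECK_EVERY == 0 && Thread.currentThread().isInterrupted()) {
                return i; // abort promptly
            }
            // ... stand-in for visiting one DOM node ...
        }
        return totalElements;
    }
}
```

Checking only every Nth iteration keeps the overhead negligible while still bounding how long a cancelled parse can linger.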
[jira] [Updated] (NUTCH-1375) extract main content of a html file
[ https://issues.apache.org/jira/browse/NUTCH-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1375: Patch Info: Patch Available Fix Version/s: 1.7 extract main content of a html file --- Key: NUTCH-1375 URL: https://issues.apache.org/jira/browse/NUTCH-1375 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: behnam nikbakht Fix For: 1.7 Attachments: NUTCH-1375.patch I wrote code that can extract the main content of an HTML page (usually weblogs). This content usually appears under a body > p tag, but there is no guarantee; there may be multiple blocks of that form, and only one of them is the main content. The code first finds the body node, then computes a weight for each child node based on text volume and height, and finds the lowest node with the maximum text volume. I hope that improvements to this code lead to solutions for finding fake or duplicated pages. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
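The text-volume weighting can be sketched in a simplified form: among sibling text blocks, pick the one carrying the most text. MainContentPicker is a hypothetical illustration and ignores the height component described in the report:

```java
// Hypothetical sketch of the weighting idea from the report: given the
// text of each candidate block under <body>, choose the block with the
// largest text volume as the main content.
public class MainContentPicker {
    // Returns the index of the block with the most (trimmed) text,
    // or -1 if there are no blocks at all.
    public static int pickMainBlock(String[] blockTexts) {
        int best = -1;
        int bestLen = -1;
        for (int i = 0; i < blockTexts.length; i++) {
            int len = blockTexts[i].trim().length();
            if (len > bestLen) {
                bestLen = len;
                best = i;
            }
        }
        return best;
    }
}
```

A real implementation would walk the DOM recursively to find the lowest node that still holds the maximum text volume, as the report describes, rather than comparing flat sibling strings.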
[jira] [Updated] (NUTCH-1334) NPE in FetcherOutputFormat
[ https://issues.apache.org/jira/browse/NUTCH-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1334: Fix Version/s: 1.7 NPE in FetcherOutputFormat --- Key: NUTCH-1334 URL: https://issues.apache.org/jira/browse/NUTCH-1334 Project: Nutch Issue Type: Bug Reporter: Julien Nioche Fix For: 1.7 Attachments: NUTCH-1334.patch If fetcher.parse or fetcher.store.content are set to false AND the write method receives an instance of Parse or Content, a NPE will be thrown. This usually does not happen as the Fetcher does not output a Parse or Content based on the configuration, however this class is also used by the ArcSegmentCreator which is unaware of these parameters and will output a Parse or Content instance regardless of the configuration. One option would be to make the ArcSegmentCreator aware of the fetcher.* parameters and output things accordingly but it also makes sense to modify the FetcherOutputFormat so that it checks whether a subWriter has been created before trying to use it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
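The proposed guard, checking that a sub-writer exists before delegating to it, can be sketched like this. GuardedWriter uses a StringBuilder as a hypothetical stand-in for the real MapFile writers inside FetcherOutputFormat:

```java
// Hypothetical sketch of the FetcherOutputFormat fix: a record writer
// whose sub-writer may be null (e.g. fetcher.store.content=false), so
// write() checks for its existence instead of throwing an NPE.
public class GuardedWriter {
    private final StringBuilder sink; // stand-in for a MapFile writer; may be null

    public GuardedWriter(boolean storeContent) {
        this.sink = storeContent ? new StringBuilder() : null;
    }

    public void write(String record) {
        if (sink == null) {
            return; // sub-writer was never created: skip instead of NPE
        }
        sink.append(record);
    }

    public String written() {
        return sink == null ? "" : sink.toString();
    }
}
```

With this guard, a caller like ArcSegmentCreator that emits Content or Parse records regardless of the fetcher.* settings degrades to a no-op rather than a crash.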
[jira] [Updated] (NUTCH-1334) NPE in FetcherOutputFormat
[ https://issues.apache.org/jira/browse/NUTCH-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1334: Patch Info: Patch Available NPE in FetcherOutputFormat --- Key: NUTCH-1334 URL: https://issues.apache.org/jira/browse/NUTCH-1334 Project: Nutch Issue Type: Bug Reporter: Julien Nioche Fix For: 1.7 Attachments: NUTCH-1334.patch If fetcher.parse or fetcher.store.content are set to false AND the write method receives an instance of Parse or Content, a NPE will be thrown. This usually does not happen as the Fetcher does not output a Parse or Content based on the configuration, however this class is also used by the ArcSegmentCreator which is unaware of these parameters and will output a Parse or Content instance regardless of the configuration. One option would be to make the ArcSegmentCreator aware of the fetcher.* parameters and output things accordingly but it also makes sense to modify the FetcherOutputFormat so that it checks whether a subWriter has been created before trying to use it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1329) parser not extract outlinks to external web sites
[ https://issues.apache.org/jira/browse/NUTCH-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1329: Fix Version/s: 2.2 1.7 parser not extract outlinks to external web sites - Key: NUTCH-1329 URL: https://issues.apache.org/jira/browse/NUTCH-1329 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: behnam nikbakht Labels: parse Fix For: 1.7, 2.2 I found a bug in /src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java: outlinks like www.example2.com found on www.example1.com are inserted as www.example1.com/www.example2.com. I corrected this bug by testing whether the outlink (www.example2.com) is itself a valid URL; if not, it is resolved against its base URL. So I replaced these lines:

URL url = URLUtil.resolveURL(base, target);
outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));

with:

String host_temp = null;
try {
  host_temp = URLUtil.getDomainName(new URL(target));
} catch (Exception eiuy) {
  host_temp = null;
}
URL url = null;
if (host_temp == null) // it is an internal outlink
  url = URLUtil.resolveURL(base, target);
else // it is an external link
  url = new URL(target);
outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1321) IDNNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1321: Fix Version/s: 1.7 IDNNormalizer - Key: NUTCH-1321 URL: https://issues.apache.org/jira/browse/NUTCH-1321 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.7 Right now, IDNs are indexed as ASCII. An IDNNormalizer is to be used with an indexer so it will convert ASCII URLs to their proper Unicode equivalent. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1315) reduce speculation on but ParseOutputFormat doesn't name output files correctly?
[ https://issues.apache.org/jira/browse/NUTCH-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1315: Fix Version/s: 1.7 reduce speculation on but ParseOutputFormat doesn't name output files correctly? Key: NUTCH-1315 URL: https://issues.apache.org/jira/browse/NUTCH-1315 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4 Environment: ubuntu 64bit, hadoop 1.0.1, 3 Node Cluster, segment size 1.5M urls Reporter: Rafael Labels: hadoop, hdfs Fix For: 1.7 From time to time the Reducer log contains the following and one tasktracker gets blacklisted. org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/test/crawl/segments/20120316065507/parse_text/part-1/data for DFSClient_attempt_201203151054_0028_r_01_1 on client xx.x.xx.xx.10, because this file is already being created by DFSClient_attempt_201203151054_0028_r_01_0 on xx.xx.xx.9 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1404) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1244) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1186) at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628) at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093) at 
org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) at org.apache.hadoop.ipc.Client.call(Client.java:1066) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at $Proxy2.create(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy2.create(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.init(DFSClient.java:3245) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:182) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555) at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.init(SequenceFile.java:1132) at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397) at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:354) at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:476) at org.apache.hadoop.io.MapFile$Writer.init(MapFile.java:157) at org.apache.hadoop.io.MapFile$Writer.init(MapFile.java:134) at org.apache.hadoop.io.MapFile$Writer.init(MapFile.java:92) at org.apache.nutch.parse.ParseOutputFormat.getRecordWriter(ParseOutputFormat.java:110) at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.init(ReduceTask.java:448) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:490) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093) at org.apache.hadoop.mapred.Child.main(Child.java:249) I asked the hdfs-user mailing list and i got the following answer: Looks like you have reduce speculation turned on, but the ParseOutputFormat you're using doesn't properly name its output files distinctly based on the task attempt ID. As a workaround you can probably turn off
[jira] [Updated] (NUTCH-1309) fetch queue management
[ https://issues.apache.org/jira/browse/NUTCH-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1309: Fix Version/s: 1.7 fetch queue management -- Key: NUTCH-1309 URL: https://issues.apache.org/jira/browse/NUTCH-1309 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.4 Reporter: behnam nikbakht Labels: fetch Fix For: 1.7 When fetch runs in Hadoop with multiple concurrent mappers, there are multiple independent fetch queues, which makes them hard to manage. I suggest constructing the fetch queues before the run begins, with this line: feeder = new QueueFeeder(input, fetchQueues, threadCount * 50); -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-933) Fetcher does not save a pages Last-Modified value in CrawlDatum
[ https://issues.apache.org/jira/browse/NUTCH-933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-933: --- Fix Version/s: 1.7 Fetcher does not save a pages Last-Modified value in CrawlDatum --- Key: NUTCH-933 URL: https://issues.apache.org/jira/browse/NUTCH-933 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.2 Reporter: Joe Kemp Fix For: 1.7 I added the following code in the output method, just after the if (content != null) statement. String lastModified = metadata.get("Last-Modified"); if (lastModified != null && !lastModified.equals("")) { try { Date lastModifiedDate = DateUtil.parseDate(lastModified); datum.setModifiedTime(lastModifiedDate.getTime()); } catch (DateParseException e) { } } I now get 304 for pages that haven't changed when I recrawl. Need to do further testing. Might also need a configuration parameter to turn off this behavior, allowing pages to be forced to be refreshed.
[jira] [Updated] (NUTCH-929) Create a REST-based admin UI for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-929: --- Fix Version/s: 2.2 Create a REST-based admin UI for Nutch -- Key: NUTCH-929 URL: https://issues.apache.org/jira/browse/NUTCH-929 Project: Nutch Issue Type: New Feature Components: administration gui Affects Versions: nutchgora Reporter: Andrzej Bialecki Fix For: 2.2 This is a follow up to NUTCH-880 - we need to expose the functionality of REST API in a user-friendly admin UI. Thanks to the nature of the API the UI can be implemented in any UI framework that speaks REST/JSON, so it could be a simple webapp (we already have jetty) or a Swing / Pivot / etc standalone application. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-891) Nutch build should not depend on unversioned local deps
[ https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-891: --- Fix Version/s: 2.2 Nutch build should not depend on unversioned local deps --- Key: NUTCH-891 URL: https://issues.apache.org/jira/browse/NUTCH-891 Project: Nutch Issue Type: Bug Affects Versions: 2.1 Reporter: Andrzej Bialecki Fix For: 2.2 Attachments: gora-49_v1.patch, gora.build.patch The fix in NUTCH-873 introduces an unknown variable to the build process. Since local ivy artifacts are unversioned, different people that install Gora jars at different points in time will use the same artifact id but in fact the artifacts (jars) will differ because they will come from different revisions of Gora sources. Therefore Nutch builds based on the same svn rev. won't be repeatable across different environments. As much as it pains the ivy purists ;) until Gora publishes versioned artifacts I'd like to revert the fix in NUTCH-873 and add again Gora jars built from a known external rev. We can add a README that contains commit id from Gora. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-891) Nutch build should not depend on unversioned local deps
[ https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-891: --- Patch Info: Patch Available Nutch build should not depend on unversioned local deps --- Key: NUTCH-891 URL: https://issues.apache.org/jira/browse/NUTCH-891 Project: Nutch Issue Type: Bug Affects Versions: 2.1 Reporter: Andrzej Bialecki Fix For: 2.2 Attachments: gora-49_v1.patch, gora.build.patch The fix in NUTCH-873 introduces an unknown variable to the build process. Since local ivy artifacts are unversioned, different people that install Gora jars at different points in time will use the same artifact id but in fact the artifacts (jars) will differ because they will come from different revisions of Gora sources. Therefore Nutch builds based on the same svn rev. won't be repeatable across different environments. As much as it pains the ivy purists ;) until Gora publishes versioned artifacts I'd like to revert the fix in NUTCH-873 and add again Gora jars built from a known external rev. We can add a README that contains commit id from Gora. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-952) fix outlink which started with '?' in html parser
[ https://issues.apache.org/jira/browse/NUTCH-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-952: --- Fix Version/s: 1.7 fix outlink which started with '?' in html parser - Key: NUTCH-952 URL: https://issues.apache.org/jira/browse/NUTCH-952 Project: Nutch Issue Type: Bug Components: parser Affects Versions: nutchgora Reporter: Stondet Fix For: 1.7 Attachments: NUTCH-952-v2.patch <a href="?w=ruby%20on%20rails&ty=c&sd=0">ruby on rails</a> (a snippet from http://bbs.soso.com/search?ty=c&sd=0&w=rails) outlink parsed from the above link: http://bbs.soso.com/?w=ruby%20on%20rails&ty=c&sd=0 but the expected outlink is http://bbs.soso.com/search?w=ruby%20on%20rails&ty=c&sd=0
[jira] [Updated] (NUTCH-649) Log list of files found but not crawled.
[ https://issues.apache.org/jira/browse/NUTCH-649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-649: --- Fix Version/s: 1.7 Log list of files found but not crawled. Key: NUTCH-649 URL: https://issues.apache.org/jira/browse/NUTCH-649 Project: Nutch Issue Type: Improvement Components: fetcher Environment: any Reporter: Jim Fix For: 1.7 I use Nutch to find the location of executables on the web, but we do not download the executables with Nutch. In order to get Nutch to give the location of files without downloading them, I had to make a very small patch to the code, but I think this change might be useful to others also. The patch just logs files that are being filtered at the info level, although perhaps it should be at the debug level. I have included an svn diff with this change. Use cases would be both as a diagnostic tool (let's see what we are skipping) and as a way to find content and links pointed to by a page or site without having to actually download that content.
Index: ParseOutputFormat.java
===
--- ParseOutputFormat.java (revision 593619)
+++ ParseOutputFormat.java (working copy)
@@ -193,17 +193,20 @@
       toHost = null;
     }
     if (toHost == null || !toHost.equals(fromHost)) { // external links
+      LOG.info("filtering externalLink " + toUrl + " linked to by " + fromUrl);
+      continue; // skip it
     }
   }
   try {
     toUrl = normalizers.normalize(toUrl, URLNormalizers.SCOPE_OUTLINK); // normalize the url
-    toUrl = filters.filter(toUrl); // filter the url
-    if (toUrl == null) {
-      continue;
-    }
-  } catch (Exception e) {
+
+    if (filters.filter(toUrl) == null) { // filter the url
+      LOG.info("filtering content " + toUrl + " linked to by " + fromUrl);
+      continue;
+    }
+  } catch (Exception e) {
     continue;
   }
   CrawlDatum target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
[jira] [Resolved] (NUTCH-960) Language ID - confidence factor
[ https://issues.apache.org/jira/browse/NUTCH-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-960. Resolution: Won't Fix This is way too old and, as Ken pointed out, this should be dealt with upstream in Tika. Language ID - confidence factor --- Key: NUTCH-960 URL: https://issues.apache.org/jira/browse/NUTCH-960 Project: Nutch Issue Type: Wish Affects Versions: 1.2 Reporter: M Alexander Hi. In a Java implementation, what is the best way to calculate the confidence of the outcome of the language id for a given text? For example: matching n-grams / total n-grams * 100 when a text is passed. The outcome would be en with 89% confidence. What is the best way to implement this in the existing Nutch language id code? Thanks
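The confidence measure proposed in NUTCH-960 (matching n-grams divided by total n-grams, times 100) can be sketched as follows. This is a minimal illustration under stated assumptions, not part of the Nutch language-identifier API; the class and method names are hypothetical:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the proposed confidence measure:
// confidence = (matching n-grams / total n-grams in the text) * 100.
class LangConfidence {

    // Collect the distinct character n-grams of the given size from the text.
    static Set<String> ngrams(String text, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    // Percentage of the text's n-grams that also occur in the language profile.
    static double confidence(Set<String> profile, String text, int n) {
        Set<String> textGrams = ngrams(text, n);
        if (textGrams.isEmpty()) return 0.0;
        int matches = 0;
        for (String g : textGrams) {
            if (profile.contains(g)) matches++;
        }
        return 100.0 * matches / textGrams.size();
    }

    public static void main(String[] args) {
        // Toy "English profile" built from a sample sentence.
        Set<String> enProfile = ngrams("the quick brown fox jumps over the lazy dog", 3);
        System.out.println("confidence = " + confidence(enProfile, "the lazy dog", 3));
    }
}
```

A real implementation would score against the ranked NGramProfile files Nutch already ships, but the ratio itself is the part the reporter asked about.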
[jira] [Updated] (NUTCH-945) Indexing to multiple SOLR Servers
[ https://issues.apache.org/jira/browse/NUTCH-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-945: --- Fix Version/s: 2.2 Indexing to multiple SOLR Servers - Key: NUTCH-945 URL: https://issues.apache.org/jira/browse/NUTCH-945 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.2 Reporter: Charan Malemarpuram Fix For: 2.2 Attachments: MurmurHashPartitioner.java, NonPartitioningPartitioner.java, patch-NUTCH-945.txt It would be nice to have a default Indexer in Nutch, which can submit docs to multiple SOLR Servers. Partitioning is always the question, when writing to multiple SOLR Servers. Default partitioning can be a simple hashcode based distribution with addition hooks to customization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-945) Indexing to multiple SOLR Servers
[ https://issues.apache.org/jira/browse/NUTCH-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-945: --- Patch Info: Patch Available Indexing to multiple SOLR Servers - Key: NUTCH-945 URL: https://issues.apache.org/jira/browse/NUTCH-945 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.2 Reporter: Charan Malemarpuram Fix For: 2.2 Attachments: MurmurHashPartitioner.java, NonPartitioningPartitioner.java, patch-NUTCH-945.txt It would be nice to have a default Indexer in Nutch, which can submit docs to multiple SOLR Servers. Partitioning is always the question, when writing to multiple SOLR Servers. Default partitioning can be a simple hashcode based distribution with addition hooks to customization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-734) option to filter a tag text
[ https://issues.apache.org/jira/browse/NUTCH-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-734. Resolution: Won't Fix This is simply not required and dated. Plus I assume that by referring to 'a' tag text, we mean stop words. These are filtered during the IR process in (all?) modern indexing servers. option to filter a tag text - Key: NUTCH-734 URL: https://issues.apache.org/jira/browse/NUTCH-734 Project: Nutch Issue Type: New Feature Affects Versions: 1.0.0 Reporter: ron Motivation: When fetching pages with menu links, the menus (for example search) appear on all pages of the site. Searching for the word search then returns all pages of the site, instead of just returning the search page. Change request: Add options to filter the text of 'a' tags, or more generally add filters to avoid text within specific tags. I have worked around this by changing DOMContentUtils.getTextHelper: if (nodeType == Node.TEXT_NODE && !(currentNode.getParentNode() != null && "a".equalsIgnoreCase(currentNode.getParentNode().getNodeName()))) - Ron
[jira] [Resolved] (NUTCH-745) MyHtmlParser getParse return not null,so all Analyzer-(zh|fr) cannot run
[ https://issues.apache.org/jira/browse/NUTCH-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-745. Resolution: Invalid close of legacy issue MyHtmlParser getParse return not null,so all Analyzer-(zh|fr) cannot run Key: NUTCH-745 URL: https://issues.apache.org/jira/browse/NUTCH-745 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Environment: JDK1.6 + tomcat 6 + Eclipse3.3 + nutch 1.0 Reporter: jcore_XiaTian MyHtmlParser getParse returns not null, so all Analyzer-(zh|fr) cannot run: public ParseResult getParse(Content content) { return ParseResult.createParseResult(content.getUrl(), new ParseStatus(ParseStatus.FAILED, ParseStatus.FAILED_MISSING_CONTENT, "No textual content available").getEmptyParse(conf)); // return null; }
=== nutch-site.xml ===
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(myHtml|html|text|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier|analysis-(zh)</value>
  <description><![CDATA[ ]]></description>
</property>
=== parse-plugins.xml ===
<mimeType name="text/html">
  <plugin id="parse-myHtml" />
  <plugin id="parse-html" />
</mimeType>
<alias name="parse-myHtml" extension-id="org.apache.nutch.parse.html.MyHtmlParser" />
=== src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java ===
public ParseResult getParse(Content content) { ... // cannot run the code: ParseResult filteredParse = this.htmlParseFilters.filter(content, parseResult, metaTags, root); ...
[jira] [Updated] (NUTCH-685) Content-level redirect status lost in ParseSegment
[ https://issues.apache.org/jira/browse/NUTCH-685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-685: --- Fix Version/s: 2.2 1.7 Content-level redirect status lost in ParseSegment -- Key: NUTCH-685 URL: https://issues.apache.org/jira/browse/NUTCH-685 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.7, 2.2 When Fetcher runs in parsing mode, content-level redirects (HTML meta tag Refresh) are properly discovered and recorded in crawl_fetch under source URL and target URL. If Fetcher runs in non-parsing mode, and ParseSegment is run as a separate step, the content-level redirection data is used only to add the new (target) URL, but the status of the original URL is not reset to indicate a redirect. Consequently, status of the original URL will be different depending on the way you run Fetcher, whereas it should be the same. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-583) FeedParser empty links for items
[ https://issues.apache.org/jira/browse/NUTCH-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-583: --- Fix Version/s: 2.2 1.7 FeedParser empty links for items Key: NUTCH-583 URL: https://issues.apache.org/jira/browse/NUTCH-583 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 1.7, 2.2 FeedParser in feed plugin just discards the item if it does not have link element. However Rss 2.0 does not necessitate the link element for each item. Moreover sometimes the link is given in the guid element which is a globally unique identifier for the item. I think we can search the url for an item first, then if it is still not found, we can use the feed's url, but with merging all the parse texts into one Parse object. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-356) Plugin repository cache can lead to memory leak
[ https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-356: --- Patch Info: Patch Available Fix Version/s: 2.2 1.7 Plugin repository cache can lead to memory leak --- Key: NUTCH-356 URL: https://issues.apache.org/jira/browse/NUTCH-356 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Enrico Triolo Fix For: 1.7, 2.2 Attachments: ASF.LICENSE.NOT.GRANTED--NutchTest.java, ASF.LICENSE.NOT.GRANTED--patch.txt, cache_classes.patch While I was trying to solve a problem I reported a while ago (see Nutch-314), I found out that actually the problem was related to the plugin cache used in class PluginRepository.java. As I said in Nutch-314, I think I somehow 'force' the way nutch is meant to work, since I need to frequently submit new urls and append their contents to the index; I don't (and I can't) have an urls.txt file with all urls I'm going to fetch, but I recreate it each time a new url is submitted. Thus, I think in the majority of cases you won't have problems using nutch as-is, since the problem I found occurs only if nutch is used in a way similar to the one I use. To simplify your test I'm attaching a class that performs something similar to what I need. It fetches and indexes some sample urls; to avoid webmasters' complaints I left the sample urls list empty, so you should modify the source code and add some urls. Then you only have to run it and watch your memory consumption with top. In my experience I get an OutOfMemoryException after a couple of minutes, but it clearly depends on your heap settings and on the plugins you are using (I'm using 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic'). The problem is bound to the PluginRepository 'singleton' instance, since it never gets released.
It seems that some class maintains a reference to it, and this class is never released since it is cached somewhere in the configuration. So I modified the PluginRepository's 'get' method so that it never uses the cache and always returns a new instance (you can find the patch in the attachment). This way the memory consumption is always stable and I get no OOM anymore. Clearly this is not the solution, since I guess there are many performance issues involved, but for the moment it works.
[jira] [Updated] (NUTCH-366) Merge URLFilters and URLNormalizers
[ https://issues.apache.org/jira/browse/NUTCH-366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-366: --- Fix Version/s: 2.2 1.7 Merge URLFilters and URLNormalizers --- Key: NUTCH-366 URL: https://issues.apache.org/jira/browse/NUTCH-366 Project: Nutch Issue Type: Improvement Reporter: Andrzej Bialecki Labels: gsoc2012 Fix For: 1.7, 2.2 Currently Nutch uses two subsystems related to url validation and normalization: * URLFilter: this interface checks if URLs are valid for further processing. Input URL is not changed in any way. The output is a boolean value. * URLNormalizer: this interface brings URLs to their base (normal) form, or removes unneeded URL components, or performs any other URL mangling as necessary. Input URLs are changed, and are returned as result. However, various Nutch tools run filters and normalizers in pre-determined order, i.e. normalizers first, and then filters. In some cases, where normalizers are complex and running them is costly (e.g. numerous regex rules, DNS lookups) it would make sense to run some of the filters first (e.g. prefix-based filters that select only certain protocols, or suffix-based filters that select only known extensions). This is currently not possible - we always have to run normalizers, only to later throw away urls because they failed to pass through filters. I would like to solicit comments on the following two solutions, and work on implementation of one of them: 1) we could make URLFilters and URLNormalizers implement the same interface, and basically make them interchangeable. This way users could configure their order arbitrarily, even mixing filters and normalizers out of order. This is more complicated, but gives much more flexibility - and NUTCH-365 already provides sufficient framework to implement this, including the ability to define different sequences for different steps in the workflow. 
2) we could use a property url.mangling.order ;) to define whether normalizers or filters should run first. This is simple to implement, but provides only limited improvement - because either all filters or all normalizers would run, they couldn't be mixed in arbitrary order. Any comments? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
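Option 1 in NUTCH-366 above (a shared interface so filters and normalizers become interchangeable and can be chained in any configured order) could look roughly like this sketch; UrlMangler and UrlManglerChain are hypothetical names, not actual Nutch classes:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of option 1: one interface for both URL filters and URL
// normalizers, chained in whatever order the configuration specifies.
class UrlManglerChain {

    interface UrlMangler {
        // Returns the (possibly rewritten) URL, or null to reject it.
        String mangle(String url);
    }

    private final List<UrlMangler> steps = new ArrayList<>();

    UrlManglerChain add(UrlMangler m) { steps.add(m); return this; }

    // Run steps in order; a null from any step discards the URL early,
    // so a cheap filter placed first can skip a costly normalizer.
    String run(String url) {
        for (UrlMangler m : steps) {
            url = m.mangle(url);
            if (url == null) return null;
        }
        return url;
    }

    public static void main(String[] args) {
        UrlManglerChain chain = new UrlManglerChain()
            .add(u -> u.startsWith("http") ? u : null) // cheap prefix filter first
            .add(u -> u.toLowerCase());                // costlier normalizer second
        System.out.println(chain.run("http://Nutch.Apache.ORG/"));
    }
}
```

The point of the design is visible in the chain order: a URL rejected by the prefix filter never reaches the normalizer, which is exactly the cost saving the issue describes.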
[jira] [Updated] (NUTCH-475) Adaptive crawl delay
[ https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-475: --- Fix Version/s: 1.7 Adaptive crawl delay Key: NUTCH-475 URL: https://issues.apache.org/jira/browse/NUTCH-475 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Doğacan Güney Fix For: 1.7 Attachments: adaptive-delay_draft.patch, NUTCH-475.patch Current fetcher implementation waits a default interval before making another request to the same server (if crawl-delay is not specified in robots.txt). IMHO, an adaptive implementation will be better. If the server is under little load and can server requests fast, then fetcher can ask for more pages in a given interval. Similarly, if the server is suffering from heavy load, fetcher can slow down(w.r.t that host), easing the load on the server. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
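The adaptive behaviour proposed in NUTCH-475 can be illustrated with a toy controller that shortens the per-host delay when responses come back fast and lengthens it when they slow down. The thresholds, factors, and bounds below are made-up values for illustration, not Nutch configuration:

```java
// Illustrative sketch of an adaptive per-host crawl delay: speed up on a
// responsive server, back off on a loaded one. All constants are assumptions.
class AdaptiveDelay {
    static final long MIN_DELAY_MS = 500, MAX_DELAY_MS = 30_000;

    // Next delay as a function of the current delay and the last response time.
    static long nextDelay(long currentMs, long responseMs) {
        long next = responseMs < currentMs / 2
            ? currentMs * 3 / 4   // fast response: shrink the delay
            : currentMs * 3 / 2;  // slow response: back off
        return Math.max(MIN_DELAY_MS, Math.min(MAX_DELAY_MS, next));
    }

    public static void main(String[] args) {
        long d = 5000;
        d = nextDelay(d, 200);   // fast response, delay drops
        System.out.println(d);
        d = nextDelay(d, 6000);  // slow response, delay grows
        System.out.println(d);
    }
}
```

The clamping bounds matter: without a floor the fetcher could hammer a fast server, which is the politeness property the fixed delay was protecting.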
[jira] [Updated] (NUTCH-207) Bandwidth target for fetcher rather than a thread count
[ https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-207: --- Patch Info: Patch Available Fix Version/s: 1.7 Bandwidth target for fetcher rather than a thread count --- Key: NUTCH-207 URL: https://issues.apache.org/jira/browse/NUTCH-207 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.8 Reporter: Rod Taylor Fix For: 1.7 Attachments: ratelimit.patch Increases or decreases the number of threads from the starting value (fetcher.threads.fetch) up to a maximum (fetcher.threads.maximum) to achieve a target bandwidth (fetcher.threads.bandwidth). It seems to be able to keep within 10% of the target bandwidth even when large numbers of errors are found or when a number of large pages is run across. To achieve more accurate tracking Nutch should keep track of protocol overhead as well as the volume of pages downloaded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
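The control loop described in NUTCH-207 might be sketched like this: periodically compare measured bandwidth against the target and nudge the thread count. Only the property names fetcher.threads.fetch, fetcher.threads.maximum, and fetcher.threads.bandwidth come from the issue; the ±10% dead band and the one-thread step are assumptions for illustration:

```java
// Hypothetical sketch of a bandwidth-target thread controller: grow the
// thread count while measured bandwidth is below target (up to a maximum),
// shrink it when over target (down to the starting count).
class BandwidthController {
    final int startThreads, maxThreads;   // fetcher.threads.fetch / .maximum
    final long targetBytesPerSec;         // fetcher.threads.bandwidth
    int threads;

    BandwidthController(int start, int max, long targetBps) {
        this.startThreads = start; this.maxThreads = max;
        this.targetBytesPerSec = targetBps; this.threads = start;
    }

    // Called periodically with the bandwidth measured over the last window.
    int adjust(long measuredBps) {
        if (measuredBps < targetBytesPerSec * 9 / 10 && threads < maxThreads) {
            threads++;   // more than 10% under target: add a fetcher thread
        } else if (measuredBps > targetBytesPerSec * 11 / 10 && threads > startThreads) {
            threads--;   // more than 10% over target: drop one
        }
        return threads;  // within the dead band: hold steady
    }

    public static void main(String[] args) {
        BandwidthController c = new BandwidthController(10, 50, 1_000_000);
        System.out.println(c.adjust(400_000));   // below target, threads grow
        System.out.println(c.adjust(1_200_000)); // above target, threads shrink
    }
}
```

A dead band around the target keeps the thread count from oscillating on every measurement, which is one way to stay "within 10% of the target" as the issue reports.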
[jira] [Updated] (NUTCH-1508) Port limit crawler to defined depth to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1508: Fix Version/s: 2.2 Port limit crawler to defined depth to 2.x -- Key: NUTCH-1508 URL: https://issues.apache.org/jira/browse/NUTCH-1508 Project: Nutch Issue Type: Improvement Affects Versions: 2.2 Reporter: Julien Nioche Fix For: 2.2 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-314) Multiple language identifier instances
[ https://issues.apache.org/jira/browse/NUTCH-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-314. Resolution: Won't Fix close of legacy issue Multiple language identifier instances -- Key: NUTCH-314 URL: https://issues.apache.org/jira/browse/NUTCH-314 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Environment: OS: Linux RHEL 4 JDK: 1.5_07 Reporter: Enrico Triolo In my application I often need to perform the inject - generate - .. - index loop multiple times, since users can 'suggest' new web pages to be crawled and indexed. I also need to enable the language identifier plugin. Everything seems to work correctly, but after some time I get an OutOfMemoryException. Actually the time isn't important, since I noticed that the problem arises when the user submits many urls (~100). As I said, for each submitted url a new loop is performed (similar to the one in the Crawl.main method). Using a profiler (specifically, netbeans profiler) I found out that for each submitted url a new LanguageIdentifier instance is created, and never released. With the memory inspector tool I can see as many instances of LanguageIdentifier and NGramProfile$NGramEntry as the number of fetched pages, each of them occupying about 180kb. Forcing garbage collection doesn't release much memory. Maybe we should cache its instance in the conf as we do for many others objects in Nutch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1483: Fix Version/s: 2.2 1.7 Can't crawl filesystem with protocol-file plugin Key: NUTCH-1483 URL: https://issues.apache.org/jira/browse/NUTCH-1483 Project: Nutch Issue Type: Bug Components: protocol Affects Versions: 1.6, 2.1 Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4 Reporter: Rogério Pereira Araújo Fix For: 1.7, 2.2 Attachments: NUTCH-1483.patch I tried to follow the same steps described in this wiki page: http://wiki.apache.org/nutch/IntranetDocumentSearch I made all required changes on regex-urlfilter.txt and added the following entry in my seed file: file:///home/rogerio/Documents/ The permissions are ok; I'm running nutch as the same user as the folder owner, so nutch has all the required permissions. Unfortunately I'm getting the following error: org.apache.nutch.protocol.file.FileError: File Error: 404 at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105) at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514) fetch of file://home/rogerio/Documents/ failed with: org.apache.nutch.protocol.file.FileError: File Error: 404 Why do the logs show file://home/rogerio/Documents/ instead of file:///home/rogerio/Documents/? Note: The regex-urlfilter entry only works as expected if I add the entry +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/ as the wiki says.
[jira] [Updated] (NUTCH-802) Problems managing outlinks with large url length
[ https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-802: --- Fix Version/s: 1.7 Problems managing outlinks with large url length Key: NUTCH-802 URL: https://issues.apache.org/jira/browse/NUTCH-802 Project: Nutch Issue Type: Bug Components: parser Reporter: Pablo Aragón Assignee: Andrzej Bialecki Labels: nutch, outlink, parse, parseoutputformat Fix For: 1.7 Attachments: ParseOutputFormat.patch Nutch can go idle during the collection of outlinks if the URL of an outlink is too long. The maximum URL sizes for the main web servers are: * Apache: 4,000 bytes * Microsoft Internet Information Server (IIS): 16,384 bytes * Perl HTTP::Daemon: 8,000 bytes URL sizes bigger than 4,000 bytes are problematic, so the limit should be set in the nutch-default.xml configuration file. I attached a patch.
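A minimal sketch of the guard NUTCH-802 describes: drop outlinks whose URL exceeds a configured byte limit before they are collected. The 4,000-byte default mirrors the Apache limit quoted in the issue; the class name is hypothetical and this is not the attached patch:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative outlink length guard: skip URLs over a byte limit so
// pathological outlinks cannot stall outlink collection.
class OutlinkLengthFilter {
    static final int MAX_URL_BYTES = 4000; // Apache's limit, per the issue

    static List<String> filterOutlinks(List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            // Measure encoded bytes, not chars: non-ASCII URLs are longer in UTF-8.
            if (url.getBytes(StandardCharsets.UTF_8).length <= MAX_URL_BYTES) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        StringBuilder huge = new StringBuilder("http://example.com/?q=");
        for (int i = 0; i < 5000; i++) huge.append('x');
        List<String> result = filterOutlinks(
            List.of("http://nutch.apache.org/", huge.toString()));
        System.out.println(result.size()); // the oversized outlink is dropped
    }
}
```

In Nutch itself the cap would be read from configuration (the issue proposes putting it in nutch-default.xml) rather than hard-coded as here.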
[jira] [Updated] (NUTCH-795) Add ability to maintain nofollow attribute in linkdb
[ https://issues.apache.org/jira/browse/NUTCH-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-795: --- Fix Version/s: 1.7 Add ability to maintain nofollow attribute in linkdb Key: NUTCH-795 URL: https://issues.apache.org/jira/browse/NUTCH-795 Project: Nutch Issue Type: New Feature Components: linkdb Affects Versions: 1.1 Reporter: Sammy Yu Fix For: 1.7 Attachments: 0001-Updated-with-nofollow-support-for-Outlinks.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series
[ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1478: Fix Version/s: 2.2 Parse-metatags and index-metadata plugin for Nutch 2.x series -- Key: NUTCH-1478 URL: https://issues.apache.org/jira/browse/NUTCH-1478 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 2.1 Reporter: kiran Fix For: 2.2 Attachments: metadata_parseChecker_sites.png, Nutch1478.patch, Nutch1478.zip I have ported parse-metatags and index-metadata plugin to Nutch 2.x series. This will take multiple values of same tag and index in Solr as i patched before (https://issues.apache.org/jira/browse/NUTCH-1467). The usage is same as described here (http://wiki.apache.org/nutch/IndexMetatags) but one change is that there is no need to give 'metatag' keyword before metatag names. For example my configuration looks like this (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml) This is only the first version and does not include the junit test. I will update the new version soon. This will parse the tags and index the tags in Solr. Make sure you create the fields in 'index.parse.md' in nutch-site.xml in schema.xml in Solr. Please let me know if you have any suggestions This is supported by DLA (Digital Library and Archives) of Virginia Tech. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1511) Metadata in MYSQL updated with 'garbage'
[ https://issues.apache.org/jira/browse/NUTCH-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1511: Fix Version/s: 2.2 Metadata in MYSQL updated with 'garbage' Key: NUTCH-1511 URL: https://issues.apache.org/jira/browse/NUTCH-1511 Project: Nutch Issue Type: Bug Components: fetcher, injector, storage Affects Versions: 2.1 Environment: Ubuntu 12.04 Reporter: J. Gobel Labels: metadata, mysql, nutch, scoring-opic Fix For: 2.2 After applying the patch for the metadata parser (NUTCH-1478), I notice that the metadata field is populated with the correct information just before the crawl ends. However, when the crawl is completely finished, the metadata field is populated with 'garbage' _csh_� I notice in my SQL log file that the scoring plugin is overwriting the metadata field in a final data insertion with '_csh_ \0\0\0\0\'. When I remove 'scoring-opic' from the 'plugin.includes' property in nutch-site.xml, the metadata field is crisp and clear. MYSQL LOG FILE: (I did a crawl on http://nutch.apache.org. Below you will see fragments of my MYSQL log file, only the moments when data is written to the METADATA field in the MYSQL table.) First insertion - here I suppose scoring-opic writes its information, _csh_ ?€\0\0\0: 58 QueryINSERT INTO webpage (fetchInterval,fetchTime,id,markers,metadata,score )VALUES (2592000,1357122976493,'org.apache.nutch:http/',' dist 0 _injmrk_ y\0',' _csh_ ?€\0\0\0',1.0) ON DUPLICATE KEY UPDATE fetchInterval=2592000,fetchTime=1357122976493,markers=' dist 0 _injmrk_ y\0',metadata=' _csh_ ?€\0\0\0',score=1.0 Second insertion - here the scraped metadata is inserted into the metadata field. 
81 QueryINSERT INTO webpage (id,markers,metadata,outlinks,parseStatus,signature,text,title )VALUES ('org.apache.nutch:http/', The final insertion - please note that here the metadata field is overwritten with _CSH_\0\0\0\0 90 QueryINSERT INTO webpage (fetchTime,id,inlinks,markers,metadata )VALUES (1359714995075,'org.apache.nutch:http/',' 0http://nutch.apache.org/ Nutch\0',' dist 0 _injmrk_ y _updmrk_*1357122982-1745626508 __prsmrk__*1357122982-1745626508 _gnmrk_*1357122982-1745626508 _ftcmrk_*1357122982-1745626508\0',' _csh_ \0\0\0\0\0') ON DUPLICATE KEY UPDATE fetchTime=1359714995075,inlinks=' 0http://nutch.apache.org/
[jira] [Updated] (NUTCH-1505) java.lang.IllegalArgumentException during updatedb
[ https://issues.apache.org/jira/browse/NUTCH-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1505: Fix Version/s: 2.2 java.lang.IllegalArgumentException during updatedb -- Key: NUTCH-1505 URL: https://issues.apache.org/jira/browse/NUTCH-1505 Project: Nutch Issue Type: Bug Affects Versions: 2.1 Environment: cassandra 0.8.10 Reporter: Stanley Orlenko Fix For: 2.2 the command bin/nutch updatedb raises the exception. Here is a part of hadoop.log: 2012-12-21 11:27:58,557 WARN mapred.LocalJobRunner - job_local_0001 java.lang.IllegalArgumentException: offset (0) + length (4) exceed the capacity of the array: 2 at org.apache.nutch.util.Bytes.explainWrongLengthOrOffset(Bytes.java:559) at org.apache.nutch.util.Bytes.toInt(Bytes.java:740) at org.apache.nutch.util.Bytes.toFloat(Bytes.java:611) at org.apache.nutch.util.Bytes.toFloat(Bytes.java:598) at org.apache.nutch.scoring.opic.OPICScoringFilter.distributeScoreToOutlinks(OPICScoringFilter.java:128) at org.apache.nutch.scoring.ScoringFilters.distributeScoreToOutlinks(ScoringFilters.java:117) at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:70) at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:37) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
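The stack trace shows Bytes.toFloat being applied to a 2-byte value where a 4-byte float is expected. A defensive workaround can be sketched as below — this is a hypothetical guard for illustration, not the fix committed to Nutch, and the class and method names are made up:

```java
public class SafeBytes {
    // Only convert when the stored value really holds the 4 bytes of a
    // float; otherwise fall back to a caller-supplied default score.
    public static float toFloatOrDefault(byte[] raw, float dflt) {
        if (raw == null || raw.length < 4) {
            return dflt;
        }
        int bits = ((raw[0] & 0xff) << 24) | ((raw[1] & 0xff) << 16)
                 | ((raw[2] & 0xff) << 8) | (raw[3] & 0xff);
        return Float.intBitsToFloat(bits);
    }
}
```

Such a guard would let updatedb skip malformed score cells instead of failing the whole map task, though it does not explain why only 2 bytes were stored in the first place.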
[jira] [Updated] (NUTCH-804) CrawlDatum.statNames can be modified
[ https://issues.apache.org/jira/browse/NUTCH-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-804: --- Fix Version/s: 1.7 CrawlDatum.statNames can be modified Key: NUTCH-804 URL: https://issues.apache.org/jira/browse/NUTCH-804 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Mike Baranczak Priority: Minor Fix For: 1.7 public static final HashMap<Byte, String> statNames It's possible to modify the contents of this hash map from anywhere in the application, which could cause problems in unrelated places. Unless I'm missing something, there's no good reason to modify this map after it's initialized. So, it should either not be declared public, or be made read-only.
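The read-only option the report suggests can be sketched with Collections.unmodifiableMap — build the map privately, then expose only an unmodifiable view. The status codes below are placeholders for illustration, not the real CrawlDatum constants:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class CrawlStatus {
    // Exposed as an unmodifiable view: any put()/remove() from elsewhere
    // in the application now fails fast with UnsupportedOperationException.
    public static final Map<Byte, String> statNames;
    static {
        Map<Byte, String> names = new HashMap<>();
        names.put((byte) 0x01, "db_unfetched"); // placeholder codes
        names.put((byte) 0x02, "db_fetched");
        statNames = Collections.unmodifiableMap(names);
    }
}
```

Declaring the field as Map rather than HashMap is what makes the wrapper possible without changing call sites that only read from it.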
[jira] [Updated] (NUTCH-789) Improvements to Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-789: --- Fix Version/s: 2.2 1.7 Improvements to Tika parser --- Key: NUTCH-789 URL: https://issues.apache.org/jira/browse/NUTCH-789 Project: Nutch Issue Type: Improvement Components: parser Environment: reported by Sami, in NUTCH-766 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.7, 2.2 Attachments: NutchTikaConfig.java, TikaParser.java As reported by Sami in NUTCH-766, Sami has a few improvements he made to the Tika parser. We'll track that progress here. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-813) Repetitive crawl 403 status page
[ https://issues.apache.org/jira/browse/NUTCH-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-813: --- Fix Version/s: 1.7 Repetitive crawl 403 status page Key: NUTCH-813 URL: https://issues.apache.org/jira/browse/NUTCH-813 Project: Nutch Issue Type: Bug Affects Versions: 1.1 Reporter: Nguyen Manh Tien Priority: Minor Fix For: 1.7 Attachments: ASF.LICENSE.NOT.GRANTED--Patch When we crawl a page that returns a 403 status, it will be crawled repetitively each day with the default schedule, even when we restrict it with the parameter db.fetch.retry.max.
[jira] [Updated] (NUTCH-1464) index-static plugin doesn't allow the colon within the field value
[ https://issues.apache.org/jira/browse/NUTCH-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1464: Patch Info: Patch Available Fix Version/s: 1.7 index-static plugin doesn't allow the colon within the field value -- Key: NUTCH-1464 URL: https://issues.apache.org/jira/browse/NUTCH-1464 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.5 Reporter: Luca Cavanna Priority: Minor Fix For: 1.7 Attachments: NUTCH-1464.patch If I want to configure a static field with a value containing a colon, the index-static plugin does nothing. There's a string split based on the colon character; if the result is an array of length 2 everything is fine, otherwise nothing happens and the static field is not set.
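The behavior the report asks for — splitting on the first colon only, so the value may itself contain colons — can be sketched as follows. The class and method names are hypothetical, not the plugin's actual code:

```java
public class StaticFieldParser {
    // Split "field:value" on the FIRST colon only, so values such as
    // "source:http://example.com:8080/" keep their embedded colons.
    public static String[] parseFieldValue(String fieldSpec) {
        int sep = fieldSpec.indexOf(':');
        if (sep < 0) {
            return null; // no separator at all: nothing to set
        }
        return new String[] { fieldSpec.substring(0, sep),
                              fieldSpec.substring(sep + 1) };
    }
}
```

Using indexOf rather than String.split avoids both the length-2 check and the silent failure: everything after the first colon is the value, verbatim.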
[jira] [Updated] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL
[ https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1497: Fix Version/s: 2.2 Better default gora-sql-mapping.xml with larger field sizes for MySQL - Key: NUTCH-1497 URL: https://issues.apache.org/jira/browse/NUTCH-1497 Project: Nutch Issue Type: Improvement Components: storage Affects Versions: 2.2 Environment: MySQL Backend Reporter: James Sullivan Priority: Minor Labels: MySQL Fix For: 2.2 Attachments: gora-mysql-mapping-patch, gora-mysql-mapping.xml, gora-mysql-mapping.xml The current generic default gora-sql-mapping.xml has field sizes that are too small in almost all situations when used with MySQL. I have included a mapping which will work better for MySQL (takes slightly more space but will be able to handle larger fields necessary for real world use). Includes patch from Nutch-1490 and resolves the non-Unicode part of Nutch-1473. I believe it is not possible to use the same gora-sql-mapping for both hsqldb and MySQL without a significantly degraded lowest common denominator resulting. Should the user manually rename the attached file to gora-sql-mapping.xml or is there a way to have Nutch automatically use it when MySQL is selected in other configurations (Ivy.xml or gora.properties)? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines
[ https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1499: Fix Version/s: 1.7 Usage of multiple ipv4 addresses and network cards on fetcher machines -- Key: NUTCH-1499 URL: https://issues.apache.org/jira/browse/NUTCH-1499 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 1.5.1 Reporter: Walter Tietze Priority: Minor Fix For: 1.7 Attachments: apache-nutch-1.5.1.NUTCH-1499.patch Adds for the fetcher threads the ability to use multiple configured ipv4 addresses. On some cluster machines there are several ipv4 addresses configured where each ip address is associated with its own network interface. This patch enables to configure the protocol-http and the protocol-httpclient to use these network interfaces in a round robin style. If the feature is enabled, a helper class reads at *startup* the network configuration. In each http network connection the next ip address is taken. This method is synchronized, but this should be no bottleneck for the overall performance of the fetcher threads. This feature is tested on our cluster for the protocol-http and the protocol-httpclient protocol. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines
[ https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1499: Patch Info: Patch Available Usage of multiple ipv4 addresses and network cards on fetcher machines -- Key: NUTCH-1499 URL: https://issues.apache.org/jira/browse/NUTCH-1499 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 1.5.1 Reporter: Walter Tietze Priority: Minor Fix For: 1.7 Attachments: apache-nutch-1.5.1.NUTCH-1499.patch Adds for the fetcher threads the ability to use multiple configured ipv4 addresses. On some cluster machines there are several ipv4 addresses configured where each ip address is associated with its own network interface. This patch enables to configure the protocol-http and the protocol-httpclient to use these network interfaces in a round robin style. If the feature is enabled, a helper class reads at *startup* the network configuration. In each http network connection the next ip address is taken. This method is synchronized, but this should be no bottleneck for the overall performance of the fetcher threads. This feature is tested on our cluster for the protocol-http and the protocol-httpclient protocol. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
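The round-robin selection the patch describes can be sketched as a small synchronized helper. This is illustrative only — the class and method names are not the patch's actual API:

```java
import java.net.InetAddress;
import java.util.List;

public class RoundRobinAddressPicker {
    private final List<InetAddress> localAddresses;
    private int next = 0;

    public RoundRobinAddressPicker(List<InetAddress> localAddresses) {
        this.localAddresses = localAddresses;
    }

    // Synchronized, as in the patch description: one short critical
    // section per connection, so it should not bottleneck the fetchers.
    public synchronized InetAddress nextAddress() {
        InetAddress addr = localAddresses.get(next);
        next = (next + 1) % localAddresses.size();
        return addr;
    }
}
```

Each fetcher thread would bind its outgoing socket to the address returned here, cycling through the configured interfaces.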
[jira] [Updated] (NUTCH-1485) TableUtil reverseURL to keep userinfo part
[ https://issues.apache.org/jira/browse/NUTCH-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1485: Fix Version/s: 2.2 TableUtil reverseURL to keep userinfo part -- Key: NUTCH-1485 URL: https://issues.apache.org/jira/browse/NUTCH-1485 Project: Nutch Issue Type: Improvement Affects Versions: 2.1 Reporter: Sebastian Nagel Priority: Minor Fix For: 2.2 The reversed URL key does not contain the userinfo part of a URL (user name and password: {{ftp://user:passw...@ftp.xyz/file.txt}}, cf. [RFC 3986|http://tools.ietf.org/html/rfc3986] and [http://en.wikipedia.org/wiki/URI_scheme]). Keeping the userinfo would make it easy to crawl a fixed list of protected content. However, URLs with userinfo can be tricky, e.g. [http://cnn.comstory=breaking_news@199.239.136.200/mostpopular], so it's ok when the default is to remove the userinfo. But this should be done in the default URL normalizers.
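For reference, java.net.URI already exposes the userinfo component that both sides of this issue (keeping it in the key, or stripping it in a normalizer) would build on. A sketch with a made-up example URL — not TableUtil's actual code:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UserInfoTool {
    // Return the userinfo ("user:password") part of a URL, or null if absent.
    public static String userInfo(String url) {
        return URI.create(url).getUserInfo();
    }

    // Rebuild the URI without its userinfo part -- roughly what a default
    // URL normalizer that removes userinfo would do.
    public static String stripUserInfo(String url) throws URISyntaxException {
        URI u = URI.create(url);
        return new URI(u.getScheme(), null, u.getHost(), u.getPort(),
                       u.getPath(), u.getQuery(), u.getFragment()).toString();
    }
}
```

The tricky cases mentioned in the report are exactly where getHost()/getUserInfo() disagree with a naive split on '@', which is an argument for doing the stripping in one well-tested normalizer.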
[jira] [Updated] (NUTCH-1182) fetcher should track and shut down hung threads
[ https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1182: Fix Version/s: 2.2 1.7 fetcher should track and shut down hung threads --- Key: NUTCH-1182 URL: https://issues.apache.org/jira/browse/NUTCH-1182 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.3, 1.4 Environment: Linux, local job runner Reporter: Sebastian Nagel Priority: Minor Fix For: 1.7, 2.2 While crawling a slow server with a couple of very large PDF documents (30 MB) on it after some time and a bulk of successfully fetched documents the fetcher stops with the message: ??Aborting with 10 hung threads.?? From now on every cycle ends with hung threads, almost no documents are fetched successfully. In addition, strange hadoop errors are logged: {noformat} fetch of http://.../xyz.pdf failed with: java.lang.NullPointerException at java.lang.System.arraycopy(Native Method) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108) ... {noformat} or {noformat} Exception in thread QueueFeeder java.lang.NullPointerException at org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:48) at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:41) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:214) {noformat} I've run the debugger and found: # after the hung threads are reported the fetcher stops but the threads are still alive and continue fetching a document. In consequence, this will #* limit the small bandwidth of network/server even more #* after the document is fetched the thread tries to write the content via {{output.collect()}} which must fail because the fetcher map job is already finished and the associated temporary mapred directory is deleted. The error message may get mixed with the progress output of the next fetch cycle causing additional confusion. 
# documents/URLs causing the hung thread are never reported nor stored. That is, it's hard to track them down, and they will cause a hung thread again and again. The problem is reproducible when fetching bigger documents and setting {{mapred.task.timeout}} to a low value (this will definitely cause hung threads). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
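The first half of the fix — tracking fetcher threads so any still-alive ones can be interrupted instead of silently continuing after the job aborts — can be sketched like this. Illustrative only, not Nutch's eventual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class HungThreadReaper {
    private final List<Thread> fetchers = new ArrayList<>();

    public void register(Thread t) {
        fetchers.add(t);
    }

    // After the fetch loop reports hung threads, interrupt each one that
    // is still alive so it cannot keep fetching and then fail writing to
    // a map task whose temporary directory is already deleted.
    public int interruptHung() {
        int count = 0;
        for (Thread t : fetchers) {
            if (t.isAlive()) {
                t.interrupt();
                count++;
            }
        }
        return count;
    }
}
```

The second half of the fix — logging which URL each hung thread was fetching — would go in the same loop, reading per-thread state recorded at fetch start.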
[jira] [Resolved] (NUTCH-1018) Solr Document Size Limit
[ https://issues.apache.org/jira/browse/NUTCH-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1018. - Resolution: Won't Fix Looks like a plugin is the solution here. Closing as won't fix. Solr Document Size Limit Key: NUTCH-1018 URL: https://issues.apache.org/jira/browse/NUTCH-1018 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Mark Achee Priority: Minor Labels: solr There should be an option, perhaps named solr.content.limit, that defines the max size of documents added to Solr. I've had issues with large documents in Solr, so I set the file.content.limit to 2MB. However, this causes many files to not be parsed (mostly PDFs) because of only retrieving parts of the document. With this new option, I could still correctly parse them, but only index the first 2MB (or however large it is set) in Solr. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
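The proposal separates parsing (which sees the full content) from indexing (which gets a truncated copy). The indexing-side truncation is trivial to sketch — assuming the hypothetical solr.content.limit is a character count, since the actual unit was never specified:

```java
public class SolrContentLimiter {
    // Truncate parsed text before it is added to the Solr document.
    // The full content was already parsed; only the indexed copy shrinks.
    public static String limitContent(String text, int limit) {
        if (text == null || limit < 0 || text.length() <= limit) {
            return text;
        }
        return text.substring(0, limit);
    }
}
```

This is the behavior a plugin (per the resolution above) would implement at indexing time, leaving file.content.limit free to stay large enough for complete PDF parses.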
[jira] [Resolved] (NUTCH-1007) Add readdb -host output
[ https://issues.apache.org/jira/browse/NUTCH-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1007. - Resolution: Won't Fix This is not a problem and, as Markus mentioned, the DomainStatistics tool does a pretty good job of this already. Add readdb -host output --- Key: NUTCH-1007 URL: https://issues.apache.org/jira/browse/NUTCH-1007 Project: Nutch Issue Type: Improvement Components: generator Affects Versions: 1.4 Reporter: MilleBii Priority: Minor I have created an enhancement for the readdb feature, which computes a list of hosts with the number of urls for each host. I think it could be valuable for many people. This is to know what is in the crawldb. Like -dump or -topN, the proposed syntax would be: readdb -host output
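What the proposed output boils down to is a per-host URL count. A sketch of that tally — hypothetical code, not the DomainStatistics tool and not actual readdb internals:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class HostCounter {
    // Tally crawldb URLs per host -- the summary a "readdb -host output"
    // style command would print.
    public static Map<String, Integer> countByHost(Iterable<String> urls) {
        Map<String, Integer> counts = new HashMap<>();
        for (String u : urls) {
            try {
                String host = URI.create(u).getHost();
                if (host != null) {
                    counts.merge(host, 1, Integer::sum);
                }
            } catch (IllegalArgumentException e) {
                // skip URLs that do not parse
            }
        }
        return counts;
    }
}
```

In Nutch itself this would naturally be a map-reduce job over the crawldb keyed by host, which is essentially what DomainStatistics already does.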
[jira] [Commented] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines
[ https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552028#comment-13552028 ] Sebastian Nagel commented on NUTCH-1499: So, a vote for won't fix. Comments? Usage of multiple ipv4 addresses and network cards on fetcher machines -- Key: NUTCH-1499 URL: https://issues.apache.org/jira/browse/NUTCH-1499 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 1.5.1 Reporter: Walter Tietze Priority: Minor Fix For: 1.7 Attachments: apache-nutch-1.5.1.NUTCH-1499.patch Adds for the fetcher threads the ability to use multiple configured ipv4 addresses. On some cluster machines there are several ipv4 addresses configured where each ip address is associated with its own network interface. This patch enables to configure the protocol-http and the protocol-httpclient to use these network interfaces in a round robin style. If the feature is enabled, a helper class reads at *startup* the network configuration. In each http network connection the next ip address is taken. This method is synchronized, but this should be no bottleneck for the overall performance of the fetcher threads. This feature is tested on our cluster for the protocol-http and the protocol-httpclient protocol. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1316) create EmbeddedNutchInstance testing utility class
[ https://issues.apache.org/jira/browse/NUTCH-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1316. - Resolution: Won't Fix We already have a testing class relating to fetching (which is what we care about here): http://svn.apache.org/repos/asf/nutch/trunk/src/test/org/apache/nutch/fetcher/TestFetcher.java Closing as won't fix. create EmbeddedNutchInstance testing utility class -- Key: NUTCH-1316 URL: https://issues.apache.org/jira/browse/NUTCH-1316 Project: Nutch Issue Type: New Feature Reporter: Lewis John McGibbney Priority: Minor Labels: test I propose to create a new testing utility class called EmbeddedNutchInstance which provides two main methods: setup and teardown. This will take the pain out of firing up Nutch test instances in distributed environments and will enable us to test Nutch over the BigTop environment.
[jira] [Updated] (NUTCH-1313) Nutch trunk add response headers to datastore for the protocol-httpclient plugin
[ https://issues.apache.org/jira/browse/NUTCH-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1313: Fix Version/s: 1.7 Nutch trunk add response headers to datastore for the protocol-httpclient plugin Key: NUTCH-1313 URL: https://issues.apache.org/jira/browse/NUTCH-1313 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Ferdy Galema Priority: Minor Fix For: 1.7 For tracking progress the port of NUTCH-1311 to Nutch trunk. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira