[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-12 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1284:
---

Assignee: Tejas Patil

 Add site fetcher.max.crawl.delay as log output by default.
 --

 Key: NUTCH-1284
 URL: https://issues.apache.org/jira/browse/NUTCH-1284
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
Priority: Trivial
 Fix For: 1.7

 Attachments: NUTCH-1284.patch


 Currently, when manually scanning our log output we cannot infer which pages 
 are governed by a crawl delay between successive fetch attempts of any given 
 page within the site. The value should be made available as something like:
 {code}
 2012-02-19 12:33:33,031 INFO  fetcher.Fetcher - fetching 
 http://nutch.apache.org/ (crawl.delay=XXXms)
 {code}
 This way we can easily and quickly determine whether the fetcher is having to 
 use this functionality or not. 
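The proposed line can be sketched as a small formatting helper (a hypothetical sketch for illustration; the method name and message layout are assumptions, not the attached NUTCH-1284.patch):

```java
public class FetchLogDemo {
    // Hypothetical helper: formats the proposed per-URL fetcher log message.
    static String fetchLogLine(String url, long crawlDelayMs) {
        return "fetching " + url + " (crawl.delay=" + crawlDelayMs + "ms)";
    }

    public static void main(String[] args) {
        // e.g. "fetching http://nutch.apache.org/ (crawl.delay=5000ms)"
        System.out.println(fetchLogLine("http://nutch.apache.org/", 5000));
    }
}
```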

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-12 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551857#comment-13551857
 ] 

Tejas Patil commented on NUTCH-1284:


Can anyone kindly review the patch?

 Add site fetcher.max.crawl.delay as log output by default.
 --

 Key: NUTCH-1284
 URL: https://issues.apache.org/jira/browse/NUTCH-1284
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
Priority: Trivial
 Fix For: 1.7

 Attachments: NUTCH-1284.patch


 Currently, when manually scanning our log output we cannot infer which pages 
 are governed by a crawl delay between successive fetch attempts of any given 
 page within the site. The value should be made available as something like:
 {code}
 2012-02-19 12:33:33,031 INFO  fetcher.Fetcher - fetching 
 http://nutch.apache.org/ (crawl.delay=XXXms)
 {code}
 This way we can easily and quickly determine whether the fetcher is having to 
 use this functionality or not. 



[jira] [Updated] (NUTCH-1274) Fix [cast] javac warnings

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1274:


Fix Version/s: 2.2

 Fix [cast] javac warnings
 -

 Key: NUTCH-1274
 URL: https://issues.apache.org/jira/browse/NUTCH-1274
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1274-2.x.patch, NUTCH-1274-2.x.v2.patch, 
 NUTCH-1274-trunk.patch, NUTCH-1274-trunk.v2.patch


 A typical example of this is
 {code}
 trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java:460: warning: [cast] 
 redundant cast to int
 [javac] res ^= (int)(signature[i] << 24 + signature[i+1] << 16 + 
 {code}
 these should all be fixed by replacing with the correct implementations.
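To illustrate the class of warning (a hedged sketch, not the actual CrawlDatum code): a byte operand is already promoted to int by the shift operator, so the explicit (int) cast is redundant and can simply be dropped:

```java
public class CastWarningDemo {
    // Before (javac -Xlint:cast flags this): res ^= (int) (sig[i] << 24 | ...);
    // After: the cast is dropped; shifting a byte already yields an int.
    static int hashBytes(byte[] sig) {
        int res = 0;
        for (int i = 0; i + 3 < sig.length; i += 4) {
            res ^= sig[i] << 24 | (sig[i + 1] & 0xff) << 16
                    | (sig[i + 2] & 0xff) << 8 | (sig[i + 3] & 0xff);
        }
        return res;
    }

    public static void main(String[] args) {
        // prints "1020304" (0x01020304 with the leading zero dropped)
        System.out.println(Integer.toHexString(hashBytes(new byte[] {1, 2, 3, 4})));
    }
}
```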



[jira] [Resolved] (NUTCH-1274) Fix [cast] javac warnings

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1274.
-

Resolution: Fixed

Committed @revision 1432469 in trunk
Committed @revision 1432471 in 2.x
Thank you Tejas for your contribution.

 Fix [cast] javac warnings
 -

 Key: NUTCH-1274
 URL: https://issues.apache.org/jira/browse/NUTCH-1274
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1274-2.x.patch, NUTCH-1274-2.x.v2.patch, 
 NUTCH-1274-trunk.patch, NUTCH-1274-trunk.v2.patch


 A typical example of this is
 {code}
 trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java:460: warning: [cast] 
 redundant cast to int
 [javac] res ^= (int)(signature[i] << 24 + signature[i+1] << 16 + 
 {code}
 these should all be fixed by replacing with the correct implementations.



[jira] [Updated] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1042:


Fix Version/s: 2.2
   1.7

 Fetcher.max.crawl.delay property not taken into account correctly when set to 
 -1
 

 Key: NUTCH-1042
 URL: https://issues.apache.org/jira/browse/NUTCH-1042
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.3
Reporter: Nutch User - 1
 Fix For: 1.7, 2.2


 [Originally: 
 (http://lucene.472066.n3.nabble.com/A-possible-bug-or-misleading-documentation-td3162397.html).]
 From nutch-default.xml:
 
 <property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>
  If the Crawl-Delay in robots.txt is set to greater than this value (in
  seconds) then the fetcher will skip this page, generating an error report.
  If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.
  </description>
 </property>
 
 Fetcher.java:
 (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup).
 The line 554 in Fetcher.java: this.maxCrawlDelay =
 conf.getInt("fetcher.max.crawl.delay", 30) * 1000;
 The lines 615-616 in Fetcher.java:
 
 if (rules.getCrawlDelay() > 0) {
   if (rules.getCrawlDelay() > maxCrawlDelay) {
 
 Now, the documentation states that, if fetcher.max.crawl.delay is set to
 -1, the crawler will always wait the amount of time the Crawl-Delay
 parameter specifies. However, as you can see, if it really is negative
 the condition on the line 616 is always true, which leads to skipping
 the page whose Crawl-Delay is set.
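The fix can be sketched as guarding the comparison so a negative maxCrawlDelay disables skipping entirely (a hypothetical sketch of the intended behavior, not the committed Fetcher.java change):

```java
public class MaxCrawlDelayCheck {
    // Returns true when the fetcher should skip the page.
    // maxCrawlDelayMs < 0 means "never skip; always honor robots.txt Crawl-Delay".
    static boolean shouldSkip(long crawlDelayMs, long maxCrawlDelayMs) {
        if (maxCrawlDelayMs < 0) {
            return false;
        }
        return crawlDelayMs > 0 && crawlDelayMs > maxCrawlDelayMs;
    }

    public static void main(String[] args) {
        System.out.println(shouldSkip(60_000, -1 * 1000)); // false: -1 never skips
        System.out.println(shouldSkip(60_000, 30 * 1000)); // true: 60s > 30s cap
    }
}
```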



[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1284:


Fix Version/s: 2.2

 Add site fetcher.max.crawl.delay as log output by default.
 --

 Key: NUTCH-1284
 URL: https://issues.apache.org/jira/browse/NUTCH-1284
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
Priority: Trivial
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1284.patch


 Currently, when manually scanning our log output we cannot infer which pages 
 are governed by a crawl delay between successive fetch attempts of any given 
 page within the site. The value should be made available as something like:
 {code}
 2012-02-19 12:33:33,031 INFO  fetcher.Fetcher - fetching 
 http://nutch.apache.org/ (crawl.delay=XXXms)
 {code}
 This way we can easily and quickly determine whether the fetcher is having to 
 use this functionality or not. 



[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-12 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551959#comment-13551959
 ] 

Lewis John McGibbney commented on NUTCH-1284:
-

Hi Tejas. Nice catch btw, as it looks like you've integrated NUTCH-1042 into 
this patch as well. 
With regards to the original issue here, i.e. NUTCH-1284, it would be excellent 
if this issue could also provide logging for the fetcher as originally stated 
in the issue description, e.g. the log output records crawl.delay on a per-url 
basis. I like the debug logging you've added for the queue. Although it is not 
marked, IIRC this issue affects both 1.x and 2.x... 

 Add site fetcher.max.crawl.delay as log output by default.
 --

 Key: NUTCH-1284
 URL: https://issues.apache.org/jira/browse/NUTCH-1284
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
Priority: Trivial
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1284.patch


 Currently, when manually scanning our log output we cannot infer which pages 
 are governed by a crawl delay between successive fetch attempts of any given 
 page within the site. The value should be made available as something like:
 {code}
 2012-02-19 12:33:33,031 INFO  fetcher.Fetcher - fetching 
 http://nutch.apache.org/ (crawl.delay=XXXms)
 {code}
 This way we can easily and quickly determine whether the fetcher is having to 
 use this functionality or not. 



[jira] [Commented] (NUTCH-1274) Fix [cast] javac warnings

2013-01-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551964#comment-13551964
 ] 

Hudson commented on NUTCH-1274:
---

Integrated in Nutch-trunk #2081 (See 
[https://builds.apache.org/job/Nutch-trunk/2081/])
NUTCH-1274 Fix [cast] javac warnings (Revision 1432469)

 Result = SUCCESS
lewismc : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1432469
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrMappingReader.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/LinkRank.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/Loops.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
* /nutch/trunk/src/java/org/apache/nutch/segment/ContentAsTextInputFormat.java
* /nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java
* /nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java
* /nutch/trunk/src/java/org/apache/nutch/tools/FreeGenerator.java
* /nutch/trunk/src/java/org/apache/nutch/tools/arc/ArcRecordReader.java
* /nutch/trunk/src/test/org/apache/nutch/parse/TestParserFactory.java


 Fix [cast] javac warnings
 -

 Key: NUTCH-1274
 URL: https://issues.apache.org/jira/browse/NUTCH-1274
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1274-2.x.patch, NUTCH-1274-2.x.v2.patch, 
 NUTCH-1274-trunk.patch, NUTCH-1274-trunk.v2.patch


 A typical example of this is
 {code}
 trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java:460: warning: [cast] 
 redundant cast to int
 [javac] res ^= (int)(signature[i] << 24 + signature[i+1] << 16 + 
 {code}
 these should all be fixed by replacing with the correct implementations.



[jira] [Updated] (NUTCH-1472) InvalidRequestException(why:(String didn't validate.) [webpage][f][ts] failed validation)

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1472:


Fix Version/s: 2.2

  InvalidRequestException(why:(String didn't validate.) [webpage][f][ts] 
 failed validation)
 --

 Key: NUTCH-1472
 URL: https://issues.apache.org/jira/browse/NUTCH-1472
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1
Reporter: zhaixuepan
 Fix For: 2.2


 me.prettyprint.hector.api.exceptions.HInvalidRequestException: 
 InvalidRequestException(why:(String didn't validate.) [webpage][f][ts] failed 
 validation)
   at 
 me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:45)
   at 
 me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:264)
   at 
 me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
   at 
 me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
   at 
 me.prettyprint.cassandra.model.MutatorImpl.insert(MutatorImpl.java:69)
   at 
 org.apache.gora.cassandra.store.HectorUtils.insertColumn(HectorUtils.java:47)
   at 
 org.apache.gora.cassandra.store.CassandraClient.addColumn(CassandraClient.java:169)
   at 
 org.apache.gora.cassandra.store.CassandraStore.addOrUpdateField(CassandraStore.java:341)
   at 
 org.apache.gora.cassandra.store.CassandraStore.flush(CassandraStore.java:228)
   at 
 org.apache.gora.cassandra.store.CassandraStore.close(CassandraStore.java:95)
   at 
 org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
   at 
 org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:651)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
 Caused by: InvalidRequestException(why:(String didn't validate.) 
 [webpage][f][ts] failed validation)
   at 
 org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20253)
   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
   at 
 org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:922)
   at 
 org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:908)
   at 
 me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246)
   at 
 me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243)
   at 
 me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
   at 
 me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258)
   ... 13 more



[jira] [Resolved] (NUTCH-1436) bin/nutch absent in zip package

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1436.
-

Resolution: Won't Fix

As we have released 1.6, which includes the bin/nutch script in the zip package, 
I'm marking this as Won't Fix.

 bin/nutch absent in zip package
 ---

 Key: NUTCH-1436
 URL: https://issues.apache.org/jira/browse/NUTCH-1436
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.5.1
Reporter: Sebastian Nagel
 Attachments: NUTCH-1436.patch


 The script bin/nutch is absent in the package apache-nutch-1.5.1-bin.zip,
 the tar-bin package is not affected.



[jira] [Updated] (NUTCH-1472) InvalidRequestException(why:(String didn't validate.) [webpage][f][ts] failed validation)

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1472:


Component/s: injector

This issue occurs when injecting URLs into Cassandra using gora-cassandra 0.2.1.

  InvalidRequestException(why:(String didn't validate.) [webpage][f][ts] 
 failed validation)
 --

 Key: NUTCH-1472
 URL: https://issues.apache.org/jira/browse/NUTCH-1472
 Project: Nutch
  Issue Type: Bug
  Components: injector
Affects Versions: 2.1
Reporter: zhaixuepan
 Fix For: 2.2


 me.prettyprint.hector.api.exceptions.HInvalidRequestException: 
 InvalidRequestException(why:(String didn't validate.) [webpage][f][ts] failed 
 validation)
   at 
 me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:45)
   at 
 me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:264)
   at 
 me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
   at 
 me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
   at 
 me.prettyprint.cassandra.model.MutatorImpl.insert(MutatorImpl.java:69)
   at 
 org.apache.gora.cassandra.store.HectorUtils.insertColumn(HectorUtils.java:47)
   at 
 org.apache.gora.cassandra.store.CassandraClient.addColumn(CassandraClient.java:169)
   at 
 org.apache.gora.cassandra.store.CassandraStore.addOrUpdateField(CassandraStore.java:341)
   at 
 org.apache.gora.cassandra.store.CassandraStore.flush(CassandraStore.java:228)
   at 
 org.apache.gora.cassandra.store.CassandraStore.close(CassandraStore.java:95)
   at 
 org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
   at 
 org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:651)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
 Caused by: InvalidRequestException(why:(String didn't validate.) 
 [webpage][f][ts] failed validation)
   at 
 org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20253)
   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
   at 
 org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:922)
   at 
 org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:908)
   at 
 me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246)
   at 
 me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243)
   at 
 me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
   at 
 me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258)
   ... 13 more



[jira] [Updated] (NUTCH-1495) -normalize and -filter for updatedb command in nutch 2.x

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1495:


Fix Version/s: 2.2

 -normalize and -filter for updatedb command in nutch 2.x
 

 Key: NUTCH-1495
 URL: https://issues.apache.org/jira/browse/NUTCH-1495
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.2
Reporter: Nathan Gass
 Fix For: 2.2

 Attachments: patch-updatedb-normalize-filter-2012-11-09.txt, 
 patch-updatedb-normalize-filter-2012-11-13.txt


 AFAIS in nutch 1.x you could change your url filters and normalizers during 
 the crawl, and update the db using crawldb -normalize -filter. There does not 
 seem to be a way to achieve the same in nutch 2.x?
 Anyway, I went ahead and tried to implement -normalize and -filter for the 
 nutch 2.x updatedb command. I have no experience with any of the used 
 technologies including java, so please check the attached code carefully 
 before using it. I'm very interested to hear if this is the right approach or 
 any other comments.



[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse lastModified to a config file.

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1190:


Fix Version/s: 2.2
   1.7

 MoreIndexingFilter refactor: move data formats used to parse lastModified 
 to a config file.
 -

 Key: NUTCH-1190
 URL: https://issues.apache.org/jira/browse/NUTCH-1190
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.4
 Environment: jdk6
Reporter: Zhang JinYan
 Fix For: 1.7, 2.2

 Attachments: date-styles.txt, MoreIndexingFilter.patch


 There are many issues about missing date formats:
 [NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
 [NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
 [NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]
 The date formats can be diverse, so why not move them to an extra config 
 file?
 I moved all the date formats from MoreIndexingFilter.java to a file named 
 date-styles.txt (placed in conf), which is loaded on startup.
 {code}
 public void setConf(Configuration conf) {
   this.conf = conf;
   MIME = new MimeUtil(conf);
 
   URL res = conf.getResource("date-styles.txt");
   if (res == null) {
     LOG.error("Can't find resource: date-styles.txt");
   } else {
     try {
       List<String> lines = FileUtils.readLines(new File(res.getFile()));
       for (int i = 0; i < lines.size(); i++) {
         String dateStyle = lines.get(i);
         if (StringUtils.isBlank(dateStyle)) {
           lines.remove(i);
           i--;
           continue;
         }
         dateStyle = StringUtils.trim(dateStyle);
         if (dateStyle.startsWith("#")) {
           lines.remove(i);
           i--;
           continue;
         }
         lines.set(i, dateStyle);
       }
       dateStyles = new String[lines.size()];
       lines.toArray(dateStyles);
     } catch (IOException e) {
       LOG.error("Failed to load resource: date-styles.txt");
     }
   }
 }
 {code}
 Then parse lastModified like this (sample):
 {code}
   private long getTime(String date, String url) {
 ..
 Date parsedDate = DateUtils.parseDate(date, dateStyles);
 time = parsedDate.getTime();
 ..
 return time;
   }
 {code}
 This patch also contains the patch for 
 [NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
 Find more details in the patch file.
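For reference, a conf/date-styles.txt along the lines proposed might look like this (the file name comes from the patch description; the concrete patterns below are assumed examples, not the attached file):

```
# date-styles.txt: one SimpleDateFormat pattern per line.
# Blank lines and lines starting with '#' are skipped at load time.
EEE, dd MMM yyyy HH:mm:ss zzz
EEE MMM dd HH:mm:ss yyyy
yyyy-MM-dd'T'HH:mm:ss
```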



[jira] [Updated] (NUTCH-1015) MoreIndexingFilter: can't parse erroneous date: 2006-05-24T20:03:42

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1015:


Fix Version/s: 2.2
   1.7

 MoreIndexingFilter: can't parse erroneous date: 2006-05-24T20:03:42
 ---

 Key: NUTCH-1015
 URL: https://issues.apache.org/jira/browse/NUTCH-1015
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Reporter: Markus Jelsma
 Fix For: 1.7, 2.2


 MoreIndexingFilter must handle the following url's gracefully:
 {code}
 can't parse erroneous date: Sun, 27 Jun 2010 06:51:35 GMT+1
 can't parse erroneous date: ma, 27 jun 2011 05:15:32 GMT
 can't parse erroneous date: Mon, 23 May 2011 22:05:58 GMT
 can't parse erroneous date: GMT
 {code}
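One way to handle such values gracefully is to try each known pattern and fall back instead of failing (a hedged sketch under assumed patterns, not the MoreIndexingFilter implementation):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

public class LenientLastModified {
    // Illustrative patterns only; a real filter would know many more formats.
    static final String[] PATTERNS = {
            "EEE, dd MMM yyyy HH:mm:ss zzz", // RFC 1123, e.g. "Mon, 23 May 2011 22:05:58 GMT"
            "yyyy-MM-dd'T'HH:mm:ss",         // e.g. "2006-05-24T20:03:42"
    };

    static long parseOrZero(String value) {
        for (String pattern : PATTERNS) {
            try {
                return new SimpleDateFormat(pattern, Locale.US).parse(value).getTime();
            } catch (ParseException ignored) {
                // fall through and try the next pattern
            }
        }
        return 0L; // erroneous date: index without lastModified instead of failing
    }

    public static void main(String[] args) {
        System.out.println(parseOrZero("Mon, 23 May 2011 22:05:58 GMT") > 0); // true
        System.out.println(parseOrZero("GMT")); // 0
    }
}
```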



Jenkins build is back to normal : Nutch-nutchgora #463

2013-01-12 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-nutchgora/463/



[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1483:


Patch Info: Patch Available

 Can't crawl filesystem with protocol-file plugin
 

 Key: NUTCH-1483
 URL: https://issues.apache.org/jira/browse/NUTCH-1483
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.6, 2.1
 Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
Reporter: Rogério Pereira Araújo
 Attachments: NUTCH-1483.patch


 I tried to follow the same steps described in this wiki page:
 http://wiki.apache.org/nutch/IntranetDocumentSearch
 I made all required changes on regex-urlfilter.txt and added the following 
 entry in my seed file:
 file:///home/rogerio/Documents/
 The permissions are ok, I'm running nutch with the same user as folder owner, 
 so nutch has all the required permissions, unfortunately I'm getting the 
 following error:
 org.apache.nutch.protocol.file.FileError: File Error: 404
 at 
 org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
 at 
 org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
 fetch of file://home/rogerio/Documents/ failed with: 
 org.apache.nutch.protocol.file.FileError: File Error: 404
 Why are the logs showing file://home/rogerio/Documents/ instead of 
 file:///home/rogerio/Documents/ ???
 Note: The regex-urlfilter entry only works as expected if I add the entry 
 +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/ 
 as the wiki says.
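The logged form follows from standard URL syntax rather than a permissions problem: in file://home/..., the first path segment is parsed as the authority (host). This can be checked with java.net.URI, which would explain why only the two-slash regex-urlfilter entry matches:

```java
import java.net.URI;

public class FileUrlAuthority {
    public static void main(String[] args) {
        // Two slashes: "home" becomes the host, and the path loses its first segment.
        URI twoSlashes = URI.create("file://home/rogerio/Documents/");
        System.out.println(twoSlashes.getHost()); // home
        System.out.println(twoSlashes.getPath()); // /rogerio/Documents/

        // Three slashes: empty authority, so the whole local path is preserved.
        URI threeSlashes = URI.create("file:///home/rogerio/Documents/");
        System.out.println(threeSlashes.getHost()); // null
        System.out.println(threeSlashes.getPath()); // /home/rogerio/Documents/
    }
}
```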



[jira] [Updated] (NUTCH-1461) Problem with TableUtil

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1461:


   Patch Info: Patch Available
Fix Version/s: 2.2

 Problem with TableUtil
 --

 Key: NUTCH-1461
 URL: https://issues.apache.org/jira/browse/NUTCH-1461
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: nutchgora
 Environment: Debian / CDH3 / Nutch 2.0 Release
Reporter: Christian Johnsson
 Fix For: 2.2

 Attachments: regex-urlfilter.txt, TabelUtil_Fix.patch


 Affects parse and updatedb.
 I think I got some misformatted urls into hbase but I can't find them.
 It generates this error though. If I empty hbase and restart, it goes for a 
 couple of million pages indexed, then it comes up again. Any tips on how to 
 locate which row in the table generates this error?
 2012-08-28 01:48:10,871 WARN org.apache.hadoop.mapred.Child: Error running 
 child
 java.lang.ArrayIndexOutOfBoundsException: 1
   at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
   at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:102)
   at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:76)
   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
   at org.apache.hadoop.mapred.Child.main(Child.java:260)
 2012-08-28 01:48:10,875 INFO org.apache.hadoop.mapred.Task: Runnning cleanup 
 for the task
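Nutch 2.x stores rows under reversed-URL keys of the form org.apache.nutch:http/path, and unreverseUrl splits on the ':' separator, so a key missing it throws ArrayIndexOutOfBoundsException in the mapper. A hedged sketch of a pre-check that could log and skip malformed rows (hypothetical helper, not the attached TabelUtil_Fix.patch):

```java
public class ReversedUrlCheck {
    // Returns true only when the key has a non-empty reversed host part and
    // a non-empty remainder after the ':' separator, e.g. "org.apache.nutch:http/".
    static boolean looksLikeReversedUrl(String key) {
        int colon = key.indexOf(':');
        return colon > 0 && colon < key.length() - 1;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeReversedUrl("org.apache.nutch:http/")); // true
        System.out.println(looksLikeReversedUrl("org.apache.nutch"));       // false
    }
}
```

A mapper could call this before TableUtil.unreverseUrl and log the offending row key instead of crashing, which would also answer the "how do I locate the row" question above.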



[Nutch Wiki] Trivial Update of FrontPage by LewisJohnMcgibbney

2013-01-12 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The FrontPage page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=254&rev2=255

   * ApacheConUs2009MeetUp - List of topics for !MeetUp at !ApacheCon US 2009 
in Oakland (Nov 2-6)
   * [[NutchMavenSupport|Using Nutch as a Maven dependency]]
  
- == Nutch 2.0 ==
+ == Nutch 2.x ==
   * Nutch2Crawling - A description of the crawling jobs
   * Nutch2Architecture - A high level overview of the new architecture and 
design
   * Nutch2Roadmap -- Discussions on the architecture and features of Nutch 2.0


[jira] [Resolved] (NUTCH-1094) create comprehensive documentation for Nutchgora branch

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1094.
-

Resolution: Fixed

I would argue that this has been significantly addressed in recent months. We 
now have the following
- A description of the crawling jobs
- entry on 2.x architecture
- Roadmap for 2.x
- Building Nutch 2.x in Eclipse
- Error messages, and
- a guide to understanding the Webpage webdb columns and fields. 

 create comprehensive documentation for Nutchgora branch
 ---

 Key: NUTCH-1094
 URL: https://issues.apache.org/jira/browse/NUTCH-1094
 Project: Nutch
  Issue Type: Sub-task
  Components: documentation
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
 Fix For: 2.2


 This should shadow the core documentation for Nutch 1.4 (branch) and 
 mainstream users, however it should include fundamentals specific to Nutch 
 trunk. Until we release Nutch 2.0 this documentation should be stored in svn 
 under a /docs directory. 



[jira] [Updated] (NUTCH-1447) Nutch 2.x with Cloudera CDH 4 get Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1447:


Fix Version/s: 2.2

 Nutch 2.x with Cloudera CDH 4 get Error: Found interface 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected 
 

 Key: NUTCH-1447
 URL: https://issues.apache.org/jira/browse/NUTCH-1447
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1
 Environment: Cloudera CDH4
Reporter: Trần Anh Tuấn
 Fix For: 2.2


 I'm trying to crawl using Nutch 2.x.
 I checked out the source from
 http://svn.apache.org/repos/asf/nutch/branches/2.x/ and configured it with
 MySQL.
 I get the error below, but when I run Nutch 1.5 everything works fine:
 mkdir urls
 echo nutch.apache.org > urls/seed.txt
 runtime/deploy/bin/nutch inject urls
 12/08/07 11:25:38 INFO crawl.InjectorJob: InjectorJob: starting 
 12/08/07 11:25:38 INFO crawl.InjectorJob: InjectorJob: urlDir: urls 
 12/08/07 11:25:41 WARN mapred.JobClient: Use GenericOptionsParser for 
 parsing the arguments. Applications should implement Tool for the 
 same. 
 12/08/07 11:25:44 INFO input.FileInputFormat: Total input paths to process : 
 1 
 12/08/07 11:25:45 INFO util.NativeCodeLoader: Loaded the native-hadoop 
 library 
 12/08/07 11:25:45 WARN snappy.LoadSnappy: Snappy native library is available 
 12/08/07 11:25:45 INFO snappy.LoadSnappy: Snappy native 
 12/08/07 11:25:47 INFO mapred.JobClient:  map 0% reduce 0% 
 12/08/07 11:26:01 INFO mapred.JobClient: Task Id : 
 attempt_201208071123_0001_m_00_0, Status : FAILED 
 Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, 
 but class was expected 
 attempt_201208071123_0001_m_00_0: SLF4J: Class path contains 
 multiple SLF4J bindings. 
 attempt_201208071123_0001_m_00_0: SLF4J: Found binding in 
 [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  
 attempt_201208071123_0001_m_00_0: SLF4J: Found binding in 
 [jar:file:/var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/root/jobcache/job_201208071123_0001/jars/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  
 attempt_201208071123_0001_m_00_0: SLF4J: See 
 http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 
 12/08/07 11:26:05 INFO mapred.JobClient: Task Id : 
 attempt_201208071123_0001_m_00_1, Status : FAILED 
 Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, 
 but class was expected 
 attempt_201208071123_0001_m_00_1: SLF4J: Class path contains 
 multiple SLF4J bindings. 
 attempt_201208071123_0001_m_00_1: SLF4J: Found binding in 
 [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  
 attempt_201208071123_0001_m_00_1: SLF4J: Found binding in 
 [jar:file:/var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/root/jobcache/job_201208071123_0001/jars/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  
 attempt_201208071123_0001_m_00_1: SLF4J: See 
 http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 
 12/08/07 11:26:10 INFO mapred.JobClient: Task Id : 
 attempt_201208071123_0001_m_00_2, Status : FAILED 
 Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, 
 but class was expected 
 attempt_201208071123_0001_m_00_2: SLF4J: Class path contains 
 multiple SLF4J bindings. 
 attempt_201208071123_0001_m_00_2: SLF4J: Found binding in 
 [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  
 attempt_201208071123_0001_m_00_2: SLF4J: Found binding in 
 [jar:file:/var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/root/jobcache/job_201208071123_0001/jars/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  
 attempt_201208071123_0001_m_00_2: SLF4J: See 
 http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 
 12/08/07 11:26:19 INFO mapred.JobClient: Job complete: job_201208071123_0001 
 12/08/07 11:26:19 INFO mapred.JobClient: Counters: 7 
 12/08/07 11:26:19 INFO mapred.JobClient:   Job Counters 
 12/08/07 11:26:19 INFO mapred.JobClient: Failed map tasks=1 
 12/08/07 11:26:19 INFO mapred.JobClient: Launched map tasks=4 
 12/08/07 11:26:19 INFO mapred.JobClient: Data-local map tasks=4 
 12/08/07 11:26:19 INFO mapred.JobClient: Total time spent by all 
 maps in occupied slots (ms)=18003 
 12/08/07 11:26:19 INFO mapred.JobClient: Total time spent by all 
 reduces in occupied slots (ms)=0 
 12/08/07 11:26:19 INFO mapred.JobClient: Total time spent by all 
 maps waiting after reserving slots (ms)=0 
 12/08/07 11:26:19 INFO 
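The "Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected" error above is the classic symptom of running code compiled against Hadoop 0.20/1.x (where TaskAttemptContext was a class) on a Hadoop 2.x-API runtime such as CDH4 (where it is an interface). A common remedy is rebuilding Nutch against the cluster's own MR1 artifacts. A hedged sketch of an ivy/ivy.xml override follows; the revision string and exclude are illustrative assumptions, not tested values:

```xml
<!-- ivy/ivy.xml: replace the stock hadoop-core dependency with the
     cluster's MR1 build so compile-time and runtime classes match.
     The revision below is an example; use the version your cluster runs. -->
<dependency org="org.apache.hadoop" name="hadoop-core"
            rev="2.0.0-mr1-cdh4.1.1" conf="*->default">
  <exclude org="hsqldb" name="hsqldb"/>
</dependency>
```

After editing, rebuild the deploy job jar (e.g. `ant clean runtime`) and rerun the inject step.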

[jira] [Updated] (NUTCH-1418) error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1418:


Fix Version/s: 1.7

 error parsing robots rules- can't decode path: 
 /wiki/Wikipedia%3Mediation_Committee/
 

 Key: NUTCH-1418
 URL: https://issues.apache.org/jira/browse/NUTCH-1418
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Arijit Mukherjee
 Fix For: 1.7


 Since learning that Nutch cannot follow JavaScript function calls in href
 attributes, I started looking for alternatives. I decided to crawl
 http://en.wikipedia.org/wiki/Districts_of_India.
 I first injected this URL and followed the step-by-step approach up to the
 fetcher, when I realized Nutch had not fetched anything from this website. I
 looked into logs/hadoop.log and found the following three lines, which I
 believe indicate that Nutch is unable to parse the site's robots.txt and
 that the fetcher therefore stopped:

 2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots 
 rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
 2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots 
 rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
 2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots 
 rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
 I checked the URL with parsechecker and found no issues there. I think this
 means the robots.txt for this website is malformed, which prevents the
 fetcher from fetching anything. Is there a way to work around this problem,
 given that parsechecker parses the page without trouble?
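The failing paths contain an invalid percent-escape: in `%3Mediation`, the two characters after `%` are `3M`, which is not a valid hex pair (whether the robots.txt or the logging dropped the `A` of `%3A`, i.e. `:`, a strict decoder must reject the string as logged). A Python sketch of the idea (not Nutch's actual Java RobotRulesParser code): a strict decoder that raises on bad escapes, plus a lenient fallback that keeps the raw path instead of aborting the whole rule set.

```python
def decode_path_strict(path: str) -> str:
    """Decode %XX escapes; raise on malformed ones, like a strict parser."""
    out, i = [], 0
    while i < len(path):
        if path[i] == "%":
            pair = path[i + 1:i + 3]
            if len(pair) != 2 or not all(c in "0123456789abcdefABCDEF" for c in pair):
                raise ValueError("can't decode path: bad escape %" + pair)
            out.append(chr(int(pair, 16)))
            i += 3
        else:
            out.append(path[i])
            i += 1
    return "".join(out)

def decode_path_lenient(path: str) -> str:
    """Fall back to the raw path instead of refusing the whole rule."""
    try:
        return decode_path_strict(path)
    except ValueError:
        return path
```

With a fallback like this, the malformed Disallow line would still be honored textually rather than derailing robots.txt handling for the entire site.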

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1458) Support for raw HTML field added to Solr

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1458:


Fix Version/s: 1.7

 Support for raw HTML field added to Solr
 

 Key: NUTCH-1458
 URL: https://issues.apache.org/jira/browse/NUTCH-1458
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, parser
Affects Versions: 1.5.1
Reporter: Max Dzyuba
  Labels: html, nutch, raw, solr
 Fix For: 1.7


 At the moment, the “content” field holds only the parsed text from the page. 
 It would be nice to have a separate field in Solr document that would hold 
 raw HTML from the crawled page.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1457:


Fix Version/s: 2.2

 Nutch2 Refactor the update process so that fetched items are only processed 
 once
 

 Key: NUTCH-1457
 URL: https://issues.apache.org/jira/browse/NUTCH-1457
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Fix For: 2.2




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1452) hadoop.job.history.user.location in nutch-default making job history useless

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1452:


Fix Version/s: 2.2
   1.7

 hadoop.job.history.user.location in nutch-default making job history useless
 

 Key: NUTCH-1452
 URL: https://issues.apache.org/jira/browse/NUTCH-1452
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
 Fix For: 1.7, 2.2


 There is still a property in nutch-default 'hadoop.job.history.user.location' 
 that redirects the creation of history files from job output locations to a 
 custom location. I noticed that the current value does not work well with 
 cloudera (I have tested cdh3u4), because ${hadoop.log.dir} is not defined. 
 This actually causes the job in the jobtracker to show empty info. (With 
 'incomplete' job status). This is only when the job moves to retired. When it 
 is still in 'completed', all is looking well.
 This property can be set to 'none', because the job history is ALSO stored in 
 the central jobtracker location anyway. The 
 'hadoop.job.history.user.location' property specifies an extra location. But 
 if it is set to an invalid value, it causes the central history location to 
 NOT store it, so it seems. Please see for more details:
 http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html
 Besides setting it to 'none', another option is to set it to 'history' which 
 does work with cdh. (This writes all logs to 'history' in the user directory 
 in the configured filesystem, usually dfs). The final option is to simply 
 remove this value and not meddle with hadoop properties at all. But that 
 actually requires all jobs to correctly ignore these files. I am not up to 
 date how well this currently works with Nutch jobs. This question is most 
 relevant for trunk, since trunk heavily relies on the filesystem for jobs.
 What do you think?
 A) Set property to 'none'
 B) Set property to 'history'
 C) Remove property, see what happens, possibly fix jobs
 D) ?
 For now, I opt for A. But I think we need some more input with other 
 distributions (for example official Hadoop 1.x) and also Nutch trunk.
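For reference, option A would look like this in conf/nutch-site.xml. This is a sketch: the property name comes from the issue, and the value 'none' is the behavior proposed above.

```xml
<!-- conf/nutch-site.xml: disable the extra per-user job history copy so an
     undefined ${hadoop.log.dir} cannot break the central history location -->
<property>
  <name>hadoop.job.history.user.location</name>
  <value>none</value>
  <description>Option A from NUTCH-1452: rely solely on the central
  jobtracker history instead of an extra per-job location.</description>
</property>
```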

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-806) Merge CrawlDBScanner with CrawlDBReader

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-806:
---

Fix Version/s: 1.7

 Merge CrawlDBScanner with CrawlDBReader
 ---

 Key: NUTCH-806
 URL: https://issues.apache.org/jira/browse/NUTCH-806
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.7


 The CrawlDBScanner [NUTCH-784] should be merged with the CrawlDBReader. Will 
 do that after the 1.1 release 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1410) impact of a map-reduce problem

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1410:


Fix Version/s: 2.2
   1.7

 impact of a map-reduce problem
 --

 Key: NUTCH-1410
 URL: https://issues.apache.org/jira/browse/NUTCH-1410
 Project: Nutch
  Issue Type: Bug
  Components: fetcher, generator
Reporter: behnam nikbakht
 Fix For: 1.7, 2.2


 In a simple test I found that each mapper or reducer has only a local view
 of variables. In Nutch there are multiple places that effectively share a
 variable between mappers or reducers; for example, the generator has a
 shared variable, hostCounts, and in the fetcher the last request time
 tracked by each FetcherThread differs from the others.
 This causes critical problems, such as sending multiple simultaneous
 requests to the same host, which can get the crawler blocked.
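The failure mode can be shown without Hadoop at all. In the Python sketch below (illustrative only; Nutch's real code is Java and the per-host cap here is invented), two "mapper" tasks each keep their own host counter, so a politeness cap enforced per task is exceeded globally:

```python
def run_mapper(urls, per_host_cap=2):
    # each mapper task starts with its OWN empty counter: a local view only
    host_counts = {}
    fetched = []
    for url in urls:
        host = url.split("/")[2]
        if host_counts.get(host, 0) < per_host_cap:
            host_counts[host] = host_counts.get(host, 0) + 1
            fetched.append(url)
    return fetched

# two mapper tasks hit the same host; each honors the cap locally,
# so the host receives double the intended number of requests
m1 = run_mapper(["http://h1/a", "http://h1/b"])
m2 = run_mapper(["http://h1/c", "http://h1/d"])
total_to_h1 = len(m1) + len(m2)  # 4, although the intended cap was 2
```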

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1502) Test for CrawlDatum state transitions

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1502:


Fix Version/s: 2.2
   1.7

 Test for CrawlDatum state transitions
 -

 Key: NUTCH-1502
 URL: https://issues.apache.org/jira/browse/NUTCH-1502
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Affects Versions: 1.7, 2.2
Reporter: Sebastian Nagel
 Fix For: 1.7, 2.2


 An exhaustive test to check the matrix of CrawlDatum state transitions
 (CrawlStatus in 2.x) would be useful to detect errors, especially for
 continuous crawls where the number of possible transitions is quite large.
 Additional factors with an impact on state transitions (retry counters,
 static and dynamic intervals) should also be tested.
 The tests will help to address NUTCH-578 and NUTCH-1245. See the latter
 for a first sketchy patch.
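One way to make such a matrix test exhaustive is to drive every (state, event) pair through the transition function and require each pair to either yield a known state or be rejected explicitly. A Python sketch with a hypothetical subset of states and events; the real CrawlDatum/CrawlStatus constants and rules in Nutch are more numerous:

```python
# hypothetical subset of states and events, for illustration only
TRANSITIONS = {
    ("db_unfetched", "fetch_success"): "db_fetched",
    ("db_unfetched", "fetch_retry"):   "db_unfetched",  # retry counter grows
    ("db_unfetched", "fetch_gone"):    "db_gone",
    ("db_fetched",   "fetch_success"): "db_fetched",
    ("db_fetched",   "fetch_gone"):    "db_gone",
}

def update_state(state: str, event: str) -> str:
    if (state, event) not in TRANSITIONS:
        raise ValueError("unexpected transition: %s on %s" % (state, event))
    return TRANSITIONS[(state, event)]

def check_matrix():
    # every (state, event) pair must either yield a known state or be
    # rejected explicitly -- silent surprises are what such a test hunts
    states = {s for s, _ in TRANSITIONS} | set(TRANSITIONS.values())
    events = {e for _, e in TRANSITIONS}
    for s in states:
        for e in events:
            try:
                assert update_state(s, e) in states
            except ValueError:
                pass  # explicit rejection is a legal outcome
    return True
```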

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1481) When using MySQL as storage unicode characters within URLS cause nutch to fail

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1481:


Fix Version/s: 2.2

 When using MySQL as storage unicode characters within URLS cause nutch to fail
 --

 Key: NUTCH-1481
 URL: https://issues.apache.org/jira/browse/NUTCH-1481
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Affects Versions: 2.1
 Environment: mysql 5.5.28 on centos
Reporter: Arni Sumarlidason
  Labels: database, sql, unicode, utf8
 Fix For: 2.2


 MySQL's InnoDB primary/unique key is restricted to 767 bytes. Currently the
 URL of a web page is used as the primary key in Nutch storage.
 When using the latin1 character set on the 'id' column at a length of 767
 bytes/characters, Unicode characters in URLs cause JDBC to throw an
 exception:
 java.io.IOException: java.sql.BatchUpdateException: Incorrect string value: 
 '\xE2\x80\x8' for column 'id' at row 1
 When using the utf8mb4 character set on the 'id' column at a length of 190
 characters / 760 bytes to fully support Unicode characters, the field
 length becomes insufficient for long URLs.
 It may be better to use a hash of the URL as the primary key instead of the
 URL itself. This would allow URLs of any length and full UTF-8 support.
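The hashed-key idea fits in a few lines (Python for illustration; Nutch/Gora are Java). A SHA-256 hex digest is always 64 ASCII characters, comfortably under the 767-byte key limit, regardless of URL length or charset:

```python
import hashlib

def url_key(url: str) -> str:
    # 64 hex chars, fixed length, ASCII-only -- safe as an InnoDB key
    # for any URL length and any characters the URL contains
    return hashlib.sha256(url.encode("utf-8")).hexdigest()
```

The full URL would still live in an ordinary non-key column for display and filtering; SHA-256 collisions are negligible in practice.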

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1490) Data Truncation exceptions when using mysql

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1490:


Fix Version/s: 2.2

 Data Truncation exceptions when using mysql
 ---

 Key: NUTCH-1490
 URL: https://issues.apache.org/jira/browse/NUTCH-1490
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1
Reporter: Nathan Gass
 Fix For: 2.2

 Attachments: patch


 Nutch does not enforce the configured (or implicit) maximum length for the
 following columns:
 title
 urls (id, baseUrl, reprUrl)
 typ (contentType)
 inlinks
 outlinks
 Trying to store too much data in one of these columns results in an
 exception similar to this (copied from GORA-24; I will add a newer stack
 trace later today):
 java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too 
 long for column 'inlinks' at row 1 
 at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340) 
 at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185) 
 at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55) 
 at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567) 
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408) 
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216) 
 Caused by: java.sql.BatchUpdateException: Data truncation: Data too long for 
 column 'inlinks' at row 1 
 at 
 com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2018)
  
 at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1449) 
 at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328) 
 ... 5 more
 I'll add my current fixes in later comments.
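A typical fix for this class of error is to truncate values to the column limit before the batch insert, being careful to cut on a UTF-8 character boundary so the truncated value is still valid text. A Python sketch of the idea (Nutch/Gora themselves are Java, and the byte limits are column-specific assumptions):

```python
def truncate_utf8(value: str, max_bytes: int) -> str:
    # cut at the byte limit, then drop any trailing partial multi-byte
    # sequence so the result is still valid UTF-8
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```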

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1490) Data Truncation exceptions when using mysql

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1490:


Patch Info: Patch Available

 Data Truncation exceptions when using mysql
 ---

 Key: NUTCH-1490
 URL: https://issues.apache.org/jira/browse/NUTCH-1490
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1
Reporter: Nathan Gass
 Fix For: 2.2

 Attachments: patch


 Nutch does not enforce the configured (or implicit) maximum length for the
 following columns:
 title
 urls (id, baseUrl, reprUrl)
 typ (contentType)
 inlinks
 outlinks
 Trying to store too much data in one of these columns results in an
 exception similar to this (copied from GORA-24; I will add a newer stack
 trace later today):
 java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too 
 long for column 'inlinks' at row 1 
 at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340) 
 at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185) 
 at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55) 
 at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567) 
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408) 
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216) 
 Caused by: java.sql.BatchUpdateException: Data truncation: Data too long for 
 column 'inlinks' at row 1 
 at 
 com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2018)
  
 at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1449) 
 at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328) 
 ... 5 more
 I'll add my current fixes in later comments.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1487) Nutch parse fails first time for PDF files and works on reparse

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1487:


Fix Version/s: 2.2

 Nutch parse fails first time for PDF files and works on reparse
 ---

 Key: NUTCH-1487
 URL: https://issues.apache.org/jira/browse/NUTCH-1487
 Project: Nutch
  Issue Type: Bug
  Components: parser, storage
Affects Versions: 2.1
Reporter: kiran
  Labels: mysql
 Fix For: 2.2


 The parser fails to parse PDF files on the first pass and only succeeds when
 the parse command is re-run, roughly once per PDF file in the crawl, as
 discussed on the mailing list here
 (http://www.mail-archive.com/user%40nutch.apache.org/msg07952.html) 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1297:


Fix Version/s: 1.7

 it is better for fetchItemQueues to select items from greater queues first
 --

 Key: NUTCH-1297
 URL: https://issues.apache.org/jira/browse/NUTCH-1297
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht
 Fix For: 1.7

 Attachments: NUTCH-1297.patch


 There is a situation where, if a fetch covers multiple hosts of different
 sizes, large hosts wait a long time before getFetchItem() in the
 FetchItemQueues class selects a URL from them, so we could give them more
 priority.
 For example, with 10 URLs from host1, 1000 URLs from host2, and 5 threads:
 if all threads first select from host1, the fetch takes longer than if the
 threads first select from host2 and only fall back to host1 while host2 is
 busy.
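The proposed policy (largest ready queue first) can be sketched as follows. This is illustrative Python, not the attached Java patch; the queue representation and politeness bookkeeping are simplified assumptions:

```python
def next_fetch_item(queues, now):
    # among queues whose politeness delay has elapsed, prefer the one with
    # the largest backlog so big hosts start draining as early as possible
    ready = [q for q in queues if q["next_fetch_time"] <= now and q["items"]]
    if not ready:
        return None
    biggest = max(ready, key=lambda q: len(q["items"]))
    return biggest["items"].pop(0)

host1 = {"next_fetch_time": 0, "items": ["http://host1/%d" % i for i in range(10)]}
host2 = {"next_fetch_time": 0, "items": ["http://host2/%d" % i for i in range(1000)]}
first = next_fetch_item([host1, host2], now=0)  # host2, the bigger queue, wins
```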

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1297:


Patch Info: Patch Available

 it is better for fetchItemQueues to select items from greater queues first
 --

 Key: NUTCH-1297
 URL: https://issues.apache.org/jira/browse/NUTCH-1297
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht
 Fix For: 1.7

 Attachments: NUTCH-1297.patch


 There is a situation where, if a fetch covers multiple hosts of different
 sizes, large hosts wait a long time before getFetchItem() in the
 FetchItemQueues class selects a URL from them, so we could give them more
 priority.
 For example, with 10 URLs from host1, 1000 URLs from host2, and 5 threads:
 if all threads first select from host1, the fetch takes longer than if the
 threads first select from host2 and only fall back to host1 while host2 is
 busy.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1286) Refactoring/reimplementing crawling API (NutchApp)

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1286:


Fix Version/s: 2.2

 Refactoring/reimplementing crawling API (NutchApp)
 --

 Key: NUTCH-1286
 URL: https://issues.apache.org/jira/browse/NUTCH-1286
 Project: Nutch
  Issue Type: Improvement
  Components: administration gui, REST_api, web gui
Reporter: Ferdy Galema
 Fix For: 2.2


 This issue is to track changes we (Mathijs and I) have planned for the API
 and webapp in Nutchgora. We have a pretty good idea of how we want to be
 using the crawl API. It may involve some major refactoring, or perhaps a
 side implementation next to the current NutchApp functionality; it depends
 on how much we can reuse of the existing components. The bottom line is
 that there will be a strictly defined Java API that provides everything
 from crawling/indexing to job control (listing jobs, tracking progress and
 aborting jobs being part of it). There will be no server or service for
 tracking crawling state; everything will be persisted one way or another
 and queryable from the API. The REST server shall be a very thin layer on
 top of the Java implementation. A rich web interface will be an easy layer
 too, once we have a cleanly (but extensively) defined API. But we will
 start by making the API usable from a simple command-line interface.
 More details will be provided later on.. feel free to comment if you have 
 suggestions/questions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1267) urlmeta to delegate indexing to index-metadata

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1267:


Fix Version/s: 1.7

 urlmeta to delegate indexing to index-metadata
 --

 Key: NUTCH-1267
 URL: https://issues.apache.org/jira/browse/NUTCH-1267
 Project: Nutch
  Issue Type: Sub-task
  Components: indexer
Affects Versions: 1.6
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.7


 Ideally we should get rid of urlmeta altogether and add the transmission of 
 the meta to the outlinks in the core classes - not as a plugin. URLMeta is 
 also a terrible name :-(

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1268) parse-meta to delegate indexing to index-metadata

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1268:


Fix Version/s: 1.7

 parse-meta to delegate indexing to index-metadata
 -

 Key: NUTCH-1268
 URL: https://issues.apache.org/jira/browse/NUTCH-1268
 Project: Nutch
  Issue Type: Sub-task
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.7




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1303) Fetcher to skip queues for URLS getting repeated exceptions, based on percentage

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1303:


Fix Version/s: 1.7

 Fetcher to skip queues for URLS getting repeated exceptions, based on 
 percentage
 

 Key: NUTCH-1303
 URL: https://issues.apache.org/jira/browse/NUTCH-1303
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht
  Labels: fetch
 Fix For: 1.7

 Attachments: NUTCH-1303.patch


 As described in https://issues.apache.org/jira/browse/NUTCH-769, skipping
 queues with a high exception count is a good solution, but it is not easy
 to set fetcher.max.exceptions.per.queue when queue sizes differ.
 I suggest defining a ratio instead of an absolute value: if the ratio of
 exceptions to requests for a queue exceeds the threshold, the queue is
 cleared.
 Also, this alone is not sufficient to protect the fetcher from high
 exception rates. fetcher.throughput.threshold.pages ensures a reasonable
 fetch throughput against slow hosts, but it clears all queues, not just the
 slow ones. I suggest that this factor, like
 fetcher.max.exceptions.per.queue, be enforced per queue rather than across
 all of them.
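The ratio test itself is tiny. A Python sketch for illustration; the 0.5 threshold and counter names are assumptions, not values from the attached patch:

```python
def should_skip_queue(exceptions: int, requests: int, max_ratio: float = 0.5) -> bool:
    # ratio-based variant of an absolute fetcher.max.exceptions.per.queue:
    # clear the queue once exceptions/requests climbs past max_ratio,
    # so small and large queues are judged by the same standard
    if requests == 0:
        return False
    return exceptions / requests > max_ratio
```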

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1303) Fetcher to skip queues for URLS getting repeated exceptions, based on percentage

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1303:


Patch Info: Patch Available

 Fetcher to skip queues for URLS getting repeated exceptions, based on 
 percentage
 

 Key: NUTCH-1303
 URL: https://issues.apache.org/jira/browse/NUTCH-1303
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht
  Labels: fetch
 Fix For: 1.7

 Attachments: NUTCH-1303.patch


 As described in https://issues.apache.org/jira/browse/NUTCH-769, skipping
 queues with a high exception count is a good solution, but it is not easy
 to set fetcher.max.exceptions.per.queue when queue sizes differ.
 I suggest defining a ratio instead of an absolute value: if the ratio of
 exceptions to requests for a queue exceeds the threshold, the queue is
 cleared.
 Also, this alone is not sufficient to protect the fetcher from high
 exception rates. fetcher.throughput.threshold.pages ensures a reasonable
 fetch throughput against slow hosts, but it clears all queues, not just the
 slow ones. I suggest that this factor, like
 fetcher.max.exceptions.per.queue, be enforced per queue rather than across
 all of them.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1270) some of Deflate encoded pages not fetched

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1270:


   Patch Info: Patch Available
Fix Version/s: 1.7

 some of Deflate encoded pages not fetched
 -

 Key: NUTCH-1270
 URL: https://issues.apache.org/jira/browse/NUTCH-1270
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.4
 Environment: software
Reporter: behnam nikbakht
  Labels: fetch, processDeflateEncoded
 Fix For: 1.7

 Attachments: NUTCH-1270.patch


 There is a problem with some web pages that are fetched but whose content 
 cannot be retrieved; after the change below the error is fixed.
 We changed lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java:
   public byte[] processDeflateEncoded(byte[] compressed, URL url) throws 
 IOException {
     if (LOGGER.isTraceEnabled()) { LOGGER.trace("inflating...."); }
     byte[] content = DeflateUtils.inflateBestEffort(compressed, getMaxContent());
 +   if (content == null)
 +     content = DeflateUtils.inflateBestEffort(compressed, 20);
     if (content == null)
       throw new IOException("inflateBestEffort returned null");
     if (LOGGER.isTraceEnabled()) {
       LOGGER.trace("fetched " + compressed.length
           + " bytes of compressed content (expanded to "
           + content.length + " bytes) from " + url);
     }
     return content;
   }
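
For context, the best-effort inflation the patch retries can be sketched with plain java.util.zip. The real DeflateUtils.inflateBestEffort may differ in details; this is an illustrative approximation of the contract (inflate up to a size cap, return null rather than throw):

```java
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class BestEffortInflate {
  /** Inflates up to sizeLimit bytes; returns null on bad data instead of throwing. */
  public static byte[] inflateBestEffort(byte[] compressed, int sizeLimit) {
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    byte[] out = new byte[sizeLimit];
    try {
      int n = inflater.inflate(out); // fills at most sizeLimit bytes
      byte[] result = new byte[n];
      System.arraycopy(out, 0, result, 0, n);
      return result;
    } catch (DataFormatException e) {
      return null; // caller may retry with different settings, as the patch does
    } finally {
      inflater.end();
    }
  }
}
```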

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1282) linkdb scalability

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1282:


Fix Version/s: 1.7

 linkdb scalability
 --

 Key: NUTCH-1282
 URL: https://issues.apache.org/jira/browse/NUTCH-1282
 Project: Nutch
  Issue Type: Improvement
  Components: linkdb
Affects Versions: 1.4
Reporter: behnam nikbakht
 Fix For: 1.7


 As described in NUTCH-1054, the linkdb is optional in solrindex; it is used 
 only for anchors and has no impact on scoring.
 It seems the size of the linkdb in an incremental crawl grows very fast, which 
 makes it unscalable for very large web sites.
 So there are two choices: first, drop invertlinks and the linkdb from the 
 crawl; second, make it scalable.
 invertlinks runs two jobs: the first constructs a new linkdb from the newly 
 parsed segments, and the second merges the new linkdb with the old one. The 
 second job is the unscalable one, and we can skip it with this change in 
 solrindex:
 in the IndexerMapReduce reduce method, if fetchDatum == null or dbDatum == null 
 or parseText == null or parseData == null, then add the anchor to the doc and 
 update Solr (no insert).
 Some changes to NutchDocument are also required here.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1269) Generate main problems

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1269:


Fix Version/s: 1.7

 Generate main problems
 --

 Key: NUTCH-1269
 URL: https://issues.apache.org/jira/browse/NUTCH-1269
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Affects Versions: 1.4
 Environment: software
Reporter: behnam nikbakht
  Labels: Generate, MaxHostCount, MaxNumSegments
 Fix For: 1.7

 Attachments: NUTCH-1269.patch, NUTCH-1269-v.2.patch


 There are some problems with the current Generate method and the maxNumSegments 
 and maxHostCount options:
 1. the sizes of the generated segments differ
 2. with the maxHostCount option, it is unclear whether it was applied or not
 3. urls from one host are distributed non-uniformly between segments
 We changed Generator.java as described below.
 In the Selector class:
 private int maxNumSegments;
 private int segmentSize;
 private int maxHostCount;
 public void configure(JobConf job) {
 ...
   maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
   segmentSize = (int) job.getInt(GENERATOR_TOP_N, 1000) / maxNumSegments;
   maxHostCount = job.getInt(GENERATE_MAX_PER_HOST, 100);
 ...
 public void reduce(FloatWritable key, Iterator<SelectorEntry> values,
     OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
     throws IOException {
   int limit2 = (int) ((limit * 3) / 2);
   while (values.hasNext()) {
     if (count == limit)
       break;
     if (count % segmentSize == 0) {
       if (currentsegmentnum < maxNumSegments - 1) {
         currentsegmentnum++;
       } else {
         currentsegmentnum = 0;
       }
     }
     boolean full = true;
     for (int jk = 0; jk < maxNumSegments; jk++) {
       if (segCounts[jk] < segmentSize) {
         full = false;
       }
     }
     if (full) {
       break;
     }
     SelectorEntry entry = values.next();
     Text url = entry.url;
     // logWrite("Generated3:" + limit + "-" + count + "-" + url.toString());
     String urlString = url.toString();
     URL u = null;
     String hostordomain = null;
     try {
       if (normalise && normalizers != null) {
         urlString = normalizers.normalize(urlString,
             URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
       }
       u = new URL(urlString);
       if (byDomain) {
         hostordomain = URLUtil.getDomainName(u);
       } else {
         hostordomain = new URL(urlString).getHost();
       }
       hostordomain = hostordomain.toLowerCase();
       // only filter if we are counting hosts or domains
       int[] hostCount = hostCounts.get(hostordomain);
       // hostCount {a,b,c,d} means that from this host there are a urls in
       // segment 0, b urls in segment 1, and so on
       if (hostCount == null) {
         hostCount = new int[maxNumSegments];
         for (int kl = 0; kl < hostCount.length; kl++)
           hostCount[kl] = 0;
         hostCounts.put(hostordomain, hostCount);
       }
       // pick the segment holding the fewest urls from this host
       int selectedSeg = currentsegmentnum;
       int minCount = hostCount[selectedSeg];
       for (int jk = 0; jk < maxNumSegments; jk++) {
         if (hostCount[jk] < minCount) {
           minCount = hostCount[jk];
           selectedSeg = jk;
         }
       }
       if (hostCount[selectedSeg] <= maxHostCount) {
         count++;
         entry.segnum = new IntWritable(selectedSeg);
         hostCount[selectedSeg]++;
         output.collect(key, entry);
       }
     } catch (Exception e) {
       LOG.warn("Malformed URL: '" + urlString + "', skipping ("
           + StringUtils.stringifyException(e) + ")");
       logWrite("Generate-malform:" + hostordomain + "-" + url.toString());
     }
   }
 }
 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1281:


Fix Version/s: 2.2
   1.7

 tika parser not work properly with unwanted file types that passed from 
 filters in nutch
 

 Key: NUTCH-1281
 URL: https://issues.apache.org/jira/browse/NUTCH-1281
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: behnam nikbakht
 Fix For: 1.7, 2.2


 When the following mapping is set in parse-plugins.xml:
 <mimeType name="*">
   <plugin id="parse-tika" />
 </mimeType>
 all unwanted files that pass the filters are referred to tika.
 But for some file types, like .flv, the tika parser has problems and hangs, 
 causing the parse job to fail. If such file types pass regex-urlfilter and the 
 other filters, the parse job fails.
 For this problem I suggest adding a property for valid file types, and using 
 code like this in TikaParser.java:
 public ParseResult getParse(Content content) {
   String mimeType = content.getContentType();
 + String[] validTypes = new String[] { "application/pdf",
 +     "application/x-tika-msoffice", "application/x-tika-ooxml",
 +     "application/vnd.oasis.opendocument.text", "text/plain", "application/rtf",
 +     "application/rss+xml", "application/x-bzip2", "application/x-gzip",
 +     "application/x-javascript", "application/javascript", "text/javascript",
 +     "application/x-shockwave-flash", "application/zip", "text/xml",
 +     "application/xml" };
 + boolean valid = false;
 + for (int k = 0; k < validTypes.length; k++) {
 +   if (validTypes[k].compareTo(mimeType.toLowerCase()) == 0)
 +     valid = true;
 + }
 + if (!valid)
 +   return new ParseStatus(ParseStatus.NOTPARSED, "Can't parse for unwanted filetype "
 +       + mimeType).getEmptyParseResult(content.getUrl(), getConf());
   
   URL base;

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1278) Fetch Improvement in threads per host

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1278:


Patch Info: Patch Available

 Fetch Improvement in threads per host
 -

 Key: NUTCH-1278
 URL: https://issues.apache.org/jira/browse/NUTCH-1278
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht
 Fix For: 1.7

 Attachments: NUTCH-1278-v.2.zip, NUTCH-1278.zip


 the value of maxThreads equals fetcher.threads.per.host and is constant for 
 every host.
 it would be possible to use a dynamic value for each host, influenced by the 
 number of blocked requests:
 if the number of blocked requests for a host increases, we should decrease this 
 value and increase http.timeout
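
The adaptive limit being suggested could be sketched as below. The class, thresholds, and shrink rate are illustrative assumptions, not Nutch's actual fetcher API:

```java
// Hypothetical sketch: adapt the per-host thread limit from the number of
// recently blocked requests, instead of a constant fetcher.threads.per.host.
public class AdaptiveHostThreads {
  private final int defaultMaxThreads;

  public AdaptiveHostThreads(int defaultMaxThreads) {
    this.defaultMaxThreads = defaultMaxThreads;
  }

  /** Fewer concurrent threads for hosts that block us often; never below 1. */
  public int maxThreadsFor(int blockedRequests) {
    int threads = defaultMaxThreads - blockedRequests / 10; // shrink by 1 per 10 blocks
    return Math.max(1, threads);
  }
}
```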

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-926) Nutch follows wrong url in META http-equiv=refresh tag

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-926:
---

Fix Version/s: 1.7

 Nutch follows wrong url in META http-equiv=refresh tag
 -

 Key: NUTCH-926
 URL: https://issues.apache.org/jira/browse/NUTCH-926
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
 Environment: gnu/linux centOs
Reporter: Marco Novo
Priority: Critical
 Fix For: 1.7

 Attachments: ParseOutputFormat.java.patch


 We have nutch set to crawl a domain url list and we want to fetch only the 
 listed domains (hosts), not subdomains.
 So:
 WWW.DOMAIN1.COM
 ..
 WWW.RIGHTDOMAIN.COM
 ..
 WWW.DOMAIN.COM
 We set nutch to NOT FOLLOW EXTERNAL LINKS.
 During crawling of WWW.RIGHTDOMAIN.COM, if a page contains
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 <html>
 <head>
 <title></title>
 <META http-equiv="refresh" content="0;url=http://WRONG.RIGHTDOMAIN.COM">
 </head>
 <body>
 </body>
 </html>
 Nutch continues to crawl the WRONG subdomain! But it should not do this!
 Likewise, during crawling of WWW.RIGHTDOMAIN.COM, if a page contains
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 <html>
 <head>
 <title></title>
 <META http-equiv="refresh" content="0;url=http://WWW.WRONGDOMAIN.COM">
 </head>
 <body>
 </body>
 </html>
 Nutch continues to crawl the WRONG domain! But it should not do this! If it 
 does, we will spider the whole web.
 We think the problem is in org.apache.nutch.parse ParseOutputFormat. We have 
 made a patch and will attach it.
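
A minimal sketch of the host check the reporter wants: when external links are ignored, a meta-refresh target should only be followed if its host matches the source page's host. This is illustrative, not the attached ParseOutputFormat patch; plain java.net.URL is enough to compare hosts:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RefreshHostFilter {
  /** True when the refresh target stays on the same host as the source page. */
  public static boolean sameHost(String fromUrl, String toUrl) {
    try {
      String fromHost = new URL(fromUrl).getHost().toLowerCase();
      String toHost = new URL(toUrl).getHost().toLowerCase();
      return fromHost.equals(toHost);
    } catch (MalformedURLException e) {
      return false; // unparsable urls are never followed
    }
  }
}
```

Note that an exact host comparison also rejects subdomains (WRONG.RIGHTDOMAIN.COM vs WWW.RIGHTDOMAIN.COM), which is the behavior the reporter asks for.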

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-881) Good quality documentation for Nutch

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-881:
---

Fix Version/s: 1.7

 Good quality documentation for Nutch
 

 Key: NUTCH-881
 URL: https://issues.apache.org/jira/browse/NUTCH-881
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Affects Versions: nutchgora
Reporter: Andrzej Bialecki 
Assignee: Lewis John McGibbney
 Fix For: 1.7


 This is, and has been, a long standing request from Nutch users. This becomes 
 an acute need as we redesign Nutch 2.0, because the collective knowledge and 
 the Wiki will no longer be useful without massive amount of editing.
 IMHO the reference documentation should be in SVN, and not on the Wiki - the 
 Wiki is good for casual information and recipes but I think it's too messy 
 and not reliable enough as a reference.
 I propose to start with the following:
  1. let's decide on the format of the docs. Each format has its own pros and 
 cons:
   * HTML: easy to work with, but formatting may be messy unless we edit it by 
 hand, at which point it's no longer so easy... Good toolchains to convert to 
 other formats, but limited expressiveness of larger structures (e.g. book, 
 chapters, TOC, multi-column layouts, etc).
   * Docbook: learning curve is higher, but not insurmountable... Naturally 
 yields very good structure. Figures/diagrams may be problematic - different 
 renderers (html, pdf) like to treat the scaling and placing somewhat 
 differently.
   * Wiki-style (Confluence or TWiki): easy to use, but limited control over 
 larger structures. Maven Doxia can format cwiki, twiki, and a host of other 
 formats to e.g. html and pdf.
   * other?
  2. start documenting the main tools and the main APIs (e.g. the plugins and 
 all the extension points). We can of course reuse material from the Wiki and 
 from various presentations (e.g. the ApacheCon slides).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1253:


Fix Version/s: 2.2
   1.7

 Incompatible neko and xerces versions
 -

 Key: NUTCH-1253
 URL: https://issues.apache.org/jira/browse/NUTCH-1253
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
 Environment: Ubuntu 10.04
Reporter: Dennis Spathis
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch


 The Nutch 1.4 distribution includes
  - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
  - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
 These two JARs appear to be incompatible versions. When the HtmlParser 
 (configured to use neko) is invoked during a local-mode crawl, the parse 
 fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, 
 rebuild the HtmlParser plugin and add a catch(Throwable) clause in the 
 getParse method to log the stacktrace.)
 I found that substituting a later, compatible version of nekohtml (1.9.11)
 fixes the problem.
 Curiously, and in support of the above, the nekohtml plugin.xml file in
 Nutch 1.4 contains the following:
 <plugin
    id="lib-nekohtml"
    name="CyberNeko HTML Parser"
    version="1.9.11"
    provider-name="org.cyberneko">
    <runtime>
       <library name="nekohtml-0.9.5.jar">
          <export name="*"/>
       </library>
    </runtime>
 </plugin>
 Note the conflicting version numbers (the version attribute is 1.9.11 but the
 specified library is nekohtml-0.9.5.jar).
 Was the 0.9.5 version included by mistake? Was the intention rather to
 include 1.9.11?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1257) Support for the x-robots-tag HTTP Header

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1257:


Fix Version/s: 2.2
   1.7

 Support for the x-robots-tag HTTP Header
 

 Key: NUTCH-1257
 URL: https://issues.apache.org/jira/browse/NUTCH-1257
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Mike
  Labels: http, privacy, robots
 Fix For: 1.7, 2.2


 Google and Bing both currently support the x-robots-tag HTTP header. This is 
 important, because they have a policy of not *crawling* links that are in a 
 robots.txt file, and not *indexing* links that are set to noindex. In the 
 case that a page is indexed but not crawled, Google and Bing will show the 
 page in their results, but it will lack a snippet (since they didn't crawl 
 it, they can't generate one). 
 As a result, the only way to block Google and Bing from having a page in 
 their index is to use the robots meta tag in HTML pages and the x-robots-tag 
 in other mimetypes.
 As a site owner that needs to keep specific pages private, I *cannot* trust 
 robots.txt to keep my pages out of Google and Bing, and I have to use the two 
 robots standards. Since Nutch doesn't support the HTTP header, I have to 
 block it from crawling ALL non-HTML pages on my site.
 This is not an ideal state of affairs, and it would be great if Nutch 
 supported the x-robots-tag HTTP header.
 I've done more research on this topic on my blog:
  - 
 http://michaeljaylissner.com/blog/support-for-x-robots-tag-http-header-and-robots-HTML-meta-tag
  - 
 http://michaeljaylissner.com/blog/respecting-privacy-while-providing-hundreds-of-thousands-of-public-documents
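
A sketch of what honoring the header could involve. The class below is hypothetical, not a proposed Nutch API; it only extracts the noindex/nofollow directives and ignores user-agent-scoped values (e.g. "googlebot: noindex"), which a real implementation would also need to handle:

```java
// Hypothetical sketch: parse an X-Robots-Tag header value into flags a
// fetcher/indexer could consult before indexing or following outlinks.
public class XRobotsTag {
  public final boolean noindex;
  public final boolean nofollow;

  public XRobotsTag(String headerValue) {
    String v = headerValue == null ? "" : headerValue.toLowerCase();
    boolean none = v.contains("none"); // "none" implies both directives
    this.noindex = none || v.contains("noindex");
    this.nofollow = none || v.contains("nofollow");
  }
}
```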

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1250) parse-html does not parse links with empty anchor

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1250:


Fix Version/s: 2.2
   1.7

 parse-html does not parse links with empty anchor
 -

 Key: NUTCH-1250
 URL: https://issues.apache.org/jira/browse/NUTCH-1250
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Andreas Janning
 Fix For: 1.7, 2.2


 The parse-html plugin does not generate an outlink if the link has no anchor 
 text. For example, the following HTML code does not create an Outlink:
 {code:html} 
   <a href="example.com"></a>
 {code}
 The JUnit test TestDOMContentUtils tries to cover this case, but the bug slips 
 through because there is a comment inside the a-tag:
 {code:title=TestDOMContentUtils.java|borderStyle=solid}
 new String("<html><head><title> title </title>"
     + "</head><body>"
     + "<a href=\"g\"><!--no anchor--></a>"
     + "<a href=\"g1\"> <!--whitespace--> </a>"
     + "<a href=\"g2\"> <img src=test.gif alt='bla bla'> </a>"
     + "</body></html>"), 
 {code}
 When you remove the comment, the test fails.
 {code:title=TestDOMContentUtils.java Test fails|borderStyle=solid}
 new String("<html><head><title> title </title>"
     + "</head><body>"
     + "<a href=\"g\"></a>" // no anchor
     + "<a href=\"g1\"> <!--whitespace--> </a>"
     + "<a href=\"g2\"> <img src=test.gif alt='bla bla'> </a>"
     + "</body></html>"), 
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1080) Type safe members , arguments for better readability

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1080:


Fix Version/s: 2.2

 Type safe members , arguments for better readability 
 -

 Key: NUTCH-1080
 URL: https://issues.apache.org/jira/browse/NUTCH-1080
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Karthik K
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1080.patch, NUTCH-rel_14-1080.patch


 Enable generics for some of the API, for better type safety and readability, 
 in the process. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1080) Type safe members , arguments for better readability

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1080:


Fix Version/s: 1.7

 Type safe members , arguments for better readability 
 -

 Key: NUTCH-1080
 URL: https://issues.apache.org/jira/browse/NUTCH-1080
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Karthik K
 Fix For: 1.7

 Attachments: NUTCH-1080.patch, NUTCH-rel_14-1080.patch


 Enable generics for some of the API, for better type safety and readability, 
 in the process. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1076) Solrindex has no documents following bin/nutch solrindex when using protocol-file

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1076:


Fix Version/s: 1.7

 Solrindex has no documents following bin/nutch solrindex when using 
 protocol-file
 -

 Key: NUTCH-1076
 URL: https://issues.apache.org/jira/browse/NUTCH-1076
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.3
 Environment: Ubuntu Linux 10.04 server
 JDK 1.6
 Nutch 1.3
 Solr 3.1.0
Reporter: Seth Griffin
Assignee: Markus Jelsma
  Labels: nutch, protocol-file, solrindex
 Fix For: 1.7


 Note: When using protocol-http I am able to update solr effortlessly.
 To test this I have a single pdf file that I am trying to index in my urls 
 directory.
 I execute:
 bin/nutch crawl urls
 Output:
 solrUrl is not set, indexing will be skipped...
 crawl started in: crawl-20110805151045
 rootUrlDir = urls
 threads = 10
 depth = 5
 solrUrl=null
 Injector: starting at 2011-08-05 15:10:45
 Injector: crawlDb: crawl-20110805151045/crawldb
 Injector: urlDir: urls
 Injector: Converting injected urls to crawl db entries.
 Injector: Merging injected urls into crawl db.
 Injector: finished at 2011-08-05 15:10:48, elapsed: 00:00:02
 Generator: starting at 2011-08-05 15:10:48
 Generator: Selecting best-scoring urls due for fetch.
 Generator: filtering: true
 Generator: normalizing: true
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: Partitioning selected urls for politeness.
 Generator: segment: crawl-20110805151045/segments/20110805151050
 Generator: finished at 2011-08-05 15:10:51, elapsed: 00:00:03
 Fetcher: Your 'http.agent.name' value should be listed first in 
 'http.robots.agents' property.
 Fetcher: starting at 2011-08-05 15:10:51
 Fetcher: segment: crawl-20110805151045/segments/20110805151050
 Fetcher: threads: 10
 QueueFeeder finished: total 1 records + hit by time limit :0
 fetching file:///home/nutch/nutch-1.3/runtime/local/indexdir/Altec.pdf
 -finishing thread FetcherThread, activeThreads=9
 -finishing thread FetcherThread, activeThreads=8
 -finishing thread FetcherThread, activeThreads=7
 -finishing thread FetcherThread, activeThreads=6
 -finishing thread FetcherThread, activeThreads=5
 -finishing thread FetcherThread, activeThreads=4
 -finishing thread FetcherThread, activeThreads=3
 -finishing thread FetcherThread, activeThreads=2
 -finishing thread FetcherThread, activeThreads=1
 -finishing thread FetcherThread, activeThreads=0
 -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
 -activeThreads=0
 Fetcher: finished at 2011-08-05 15:10:53, elapsed: 00:00:02
 ParseSegment: starting at 2011-08-05 15:10:53
 ParseSegment: segment: crawl-20110805151045/segments/20110805151050
 ParseSegment: finished at 2011-08-05 15:10:56, elapsed: 00:00:03
 CrawlDb update: starting at 2011-08-05 15:10:56
 CrawlDb update: db: crawl-20110805151045/crawldb
 CrawlDb update: segments: [crawl-20110805151045/segments/20110805151050]
 CrawlDb update: additions allowed: true
 CrawlDb update: URL normalizing: true
 CrawlDb update: URL filtering: true
 CrawlDb update: Merging segment data into db.
 CrawlDb update: finished at 2011-08-05 15:10:57, elapsed: 00:00:01
 Generator: starting at 2011-08-05 15:10:57
 Generator: Selecting best-scoring urls due for fetch.
 Generator: filtering: true
 Generator: normalizing: true
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: 0 records selected for fetching, exiting ...
 Stopping at depth=1 - no more URLs to fetch.
 LinkDb: starting at 2011-08-05 15:10:58
 LinkDb: linkdb: crawl-20110805151045/linkdb
 LinkDb: URL normalize: true
 LinkDb: URL filter: true
 LinkDb: adding segment: 
 file:/home/nutch/nutch-1.3/runtime/local/crawl-20110805151045/segments/20110805151050
 LinkDb: finished at 2011-08-05 15:10:59, elapsed: 00:00:01
 crawl finished: crawl-20110805151045
 Then with a clean solr index (stats output from stats.jsp below):
 searcherName : Searcher@14dd758 main
 caching : true
 numDocs : 0
 maxDoc : 0
 reader : 
 SolrIndexReader{this=1ee148b,r=ReadOnlyDirectoryReader@1ee148b,refCnt=1,segments=0}
 readerDir : 
 org.apache.lucene.store.NIOFSDirectory@/home/solr/apache-solr-3.1.0/example/solr/data/index
  lockFactory=org.apache.lucene.store.NativeFSLockFactory@987197
 indexVersion : 1312575204101
 openedAt : Fri Aug 05 15:13:24 CDT 2011
 registeredAt : Fri Aug 05 15:13:24 CDT 2011
 warmupTime : 0 
 I then execute:
 bin/nutch solrindex http://localhost:8983/solr/ crawl-20110805151045/crawldb/ 
 crawl-20110805151045/linkdb/ crawl-20110805151045/segments/*
 bin/nutch output:
 SolrIndexer: starting at 2011-08-05 15:15:48
 SolrIndexer: finished at 2011-08-05 15:15:50, elapsed: 00:00:01

[jira] [Updated] (NUTCH-1371) Replace Ivy with Maven Ant tasks

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1371:


Fix Version/s: 2.2
   1.7

 Replace Ivy with Maven Ant tasks
 

 Key: NUTCH-1371
 URL: https://issues.apache.org/jira/browse/NUTCH-1371
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Julien Nioche
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1371.patch


 We might move to Maven altogether, but a good intermediate step could be to 
 rely on the Maven Ant tasks for managing the dependencies. Ivy does a good 
 job, but we need to have a pom file anyway for publishing the artefacts, which 
 means keeping the pom.xml and ivy.xml contents in sync. Most devs are also 
 more familiar with Maven, and it is well integrated in IDEs. Going the 
 ANT+MVN way also means that we don't have to rewrite the whole build 
 process and can rely on our existing script.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1382) Adding support for EmbeddedSolrServer to SolrIndexer

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1382:


Fix Version/s: 1.7

 Adding support for EmbeddedSolrServer to SolrIndexer
 

 Key: NUTCH-1382
 URL: https://issues.apache.org/jira/browse/NUTCH-1382
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.5
Reporter: Emre Çelikten
  Labels: patch
 Fix For: 1.7

 Attachments: embeddedsolrserver.patch


 Here is a hack to allow somebody to plug their own SolrServer into 
 SolrIndexer. It allows people to use EmbeddedSolrServer in Nutch.
 It works by:
 adding a constructor to SolrIndexer with a SolrServer parameter, 
 adding an (admittedly ugly) getSolrServer method to SolrUtils, which returns 
 the SolrServer provided by the programmer, or else the default 
 getCommonsHttpSolrServer(...), and
 replacing every occurrence of getCommonsHttpSolrServer with getSolrServer.
 Hope this helps. This is my first patch ever to a FOSS community, so I hope I 
 am doing it correctly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1387) All parsers should respond to cancellation / interrupts.

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1387:


Fix Version/s: 2.2
   1.7

 All parsers should respond to cancellation / interrupts.
 

 Key: NUTCH-1387
 URL: https://issues.apache.org/jira/browse/NUTCH-1387
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Ferdy Galema
 Fix For: 1.7, 2.2


 During parsing a TimeoutException can occur. This is caused whenever 
 FutureTask.get() cannot complete within the specified timeout. The tricky 
 part is that single urls might be perfectly able to complete within the 
 timeout, but when there is a heavy concurrent load (a lot of semi-expensive 
 parses) the parser load might stack up and cause many parses to time out. This 
 can be the case with parsing during fetch, but when using a separate parse 
 job it can also happen, because Parser implementations do not necessarily 
 respond to a thread interrupt (which is fired with the task.cancel(true) 
 call). If a parser does not check the Thread.interrupted state at regular 
 intervals, it will just continue to run and eat up resources. I find it very 
 helpful to debug stalling fetchers/parsers with the lazy man's profiler: 
 kill -QUIT process_id. This will dump stacktraces, sometimes exposing the 
 fact that hundreds of parser threads are still active in the background (of 
 course, many of them timed out a long time ago).
 To fix this, every parser should check its interrupted state at regular 
 intervals. (For example, an html parse might be stuck walking the DOM tree, so 
 checking after every Nth element would be an appropriate moment.)
 This issue is for reference first. Fixing it all at once would be a huge task.
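
The fix being proposed can be sketched as below. The node-walking loop is a stand-in for real parser work, not an actual Nutch parser:

```java
// Sketch: a long-running parse loop that checks its interrupt flag every N
// items, so task.cancel(true) can actually stop it instead of leaving the
// thread running in the background after a timeout.
public class InterruptibleWalk {
  /** Processes items, bailing out promptly if the thread is interrupted. */
  public static int walk(int totalNodes, int checkEvery) {
    int visited = 0;
    for (int i = 0; i < totalNodes; i++) {
      if (i % checkEvery == 0 && Thread.currentThread().isInterrupted()) {
        break; // respond to cancellation instead of eating resources
      }
      visited++; // stand-in for per-node parse work
    }
    return visited;
  }
}
```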

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1375) extract main content of a html file

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1375:


   Patch Info: Patch Available
Fix Version/s: 1.7

 extract main content of a html file
 ---

 Key: NUTCH-1375
 URL: https://issues.apache.org/jira/browse/NUTCH-1375
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: behnam nikbakht
 Fix For: 1.7

 Attachments: NUTCH-1375.patch


 I wrote code that can extract the main content of an html page (usually 
 weblogs). This content usually appears under a body > p tag, but there is no 
 guarantee, and there may be multiple tags of the form body > p while only one 
 of them is the main content. The code first finds the body node, then computes 
 a weight for the child nodes based on text volume and height, and so finds the 
 lowest node that has the maximum text volume.
 I hope that improving this code leads to solutions for finding fake or 
 duplicated pages.
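
The scoring idea described above can be sketched on a toy tree. The Node class is a stand-in for a real DOM node, and the weighting (plain text length, with a simple majority rule for descending) is an illustrative assumption, not the attached patch:

```java
import java.util.ArrayList;
import java.util.List;

public class MainContentFinder {
  public static class Node {
    String text = "";
    List<Node> children = new ArrayList<>();
  }

  /** Total text carried by a node's subtree. */
  static int textVolume(Node n) {
    int sum = n.text.length();
    for (Node c : n.children) sum += textVolume(c);
    return sum;
  }

  /** Walks down as long as one child holds more than half the subtree's text,
   *  yielding the lowest node with the maximum text volume. */
  public static Node findMain(Node n) {
    for (Node c : n.children) {
      if (textVolume(c) * 2 > textVolume(n)) return findMain(c);
    }
    return n;
  }
}
```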

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1334) NPE in FetcherOutputFormat

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1334:


Fix Version/s: 1.7

 NPE in FetcherOutputFormat 
 ---

 Key: NUTCH-1334
 URL: https://issues.apache.org/jira/browse/NUTCH-1334
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
 Fix For: 1.7

 Attachments: NUTCH-1334.patch


 If fetcher.parse or fetcher.store.content are set to false AND the write 
 method receives an instance of Parse or Content, a NPE will be thrown.
 This usually does not happen as the Fetcher does not output a Parse or 
 Content based on the configuration, however this class is also used by the 
 ArcSegmentCreator which is unaware of these parameters and will output a 
 Parse or Content instance regardless of the configuration. One option would 
 be to make the ArcSegmentCreator aware of the fetcher.* parameters and output 
 things accordingly but it also makes sense to modify the FetcherOutputFormat 
 so that it checks whether a subWriter has been created before trying to use 
 it.
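The guard described in the issue can be sketched as follows. This is an illustration of the idea only, not the actual FetcherOutputFormat internals: `GuardedWriter`, its nested `RecordWriter` interface, and the stand-in value types are invented for the sketch.

```java
// Sketch: only delegate to a sub-writer if it was actually created
// (i.e. the corresponding fetcher.* feature is enabled).
public class GuardedWriter {
    public interface RecordWriter {
        void write(Object value);
    }

    private final RecordWriter contentOut; // null when fetcher.store.content=false
    private final RecordWriter parseOut;   // null when fetcher.parse=false

    public GuardedWriter(RecordWriter contentOut, RecordWriter parseOut) {
        this.contentOut = contentOut;
        this.parseOut = parseOut;
    }

    public void write(Object value) {
        // Before the fix, these branches dereferenced the sub-writer
        // unconditionally and threw an NPE when it was never created.
        if (value instanceof String && contentOut != null) {       // stand-in for Content
            contentOut.write(value);
        } else if (value instanceof Integer && parseOut != null) { // stand-in for Parse
            parseOut.write(value);
        } // otherwise: silently drop, matching the configuration
    }

    public static void main(String[] args) {
        GuardedWriter w = new GuardedWriter(
                v -> System.out.println("content: " + v), null);
        w.write("page bytes"); // prints "content: page bytes"
        w.write(42);           // parse writer disabled: dropped, no NPE
    }
}
```

This keeps callers like ArcSegmentCreator unchanged: they may emit both record kinds, and the output format simply ignores whatever it was not configured to store.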

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1334) NPE in FetcherOutputFormat

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1334:


Patch Info: Patch Available

 NPE in FetcherOutputFormat 
 ---

 Key: NUTCH-1334
 URL: https://issues.apache.org/jira/browse/NUTCH-1334
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
 Fix For: 1.7

 Attachments: NUTCH-1334.patch


 If fetcher.parse or fetcher.store.content are set to false AND the write 
 method receives an instance of Parse or Content, an NPE will be thrown.
 This usually does not happen as the Fetcher does not output a Parse or 
 Content based on the configuration, however this class is also used by the 
 ArcSegmentCreator which is unaware of these parameters and will output a 
 Parse or Content instance regardless of the configuration. One option would 
 be to make the ArcSegmentCreator aware of the fetcher.* parameters and output 
 things accordingly but it also makes sense to modify the FetcherOutputFormat 
 so that it checks whether a subWriter has been created before trying to use 
 it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1329) parser not extract outlinks to external web sites

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1329:


Fix Version/s: 2.2
   1.7

 parser not extract outlinks to external web sites
 -

 Key: NUTCH-1329
 URL: https://issues.apache.org/jira/browse/NUTCH-1329
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: behnam nikbakht
  Labels: parse
 Fix For: 1.7, 2.2


 found a bug in 
 /src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java:
 outlinks like www.example2.com found on www.example1.com are inserted as 
 www.example1.com/www.example2.com
 i corrected this bug by testing whether the outlink (www.example2.com) is a 
 valid url on its own; otherwise it is resolved against its base url
 so i replace these lines:
  URL url = URLUtil.resolveURL(base, target);
  outlinks.add(new Outlink(url.toString(),
      linkText.toString().trim()));
 with:
  String hostTemp = null;
  try {
      hostTemp = URLUtil.getDomainName(new URL(target));
  } catch (Exception e) {
      hostTemp = null;
  }
  URL url = null;
  if (hostTemp == null) {  // it is an internal outlink
      url = URLUtil.resolveURL(base, target);
  } else {  // it is an external link
      url = new URL(target);
  }
  outlinks.add(new Outlink(url.toString(),
      linkText.toString().trim()));

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1321) IDNNormalizer

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1321:


Fix Version/s: 1.7

 IDNNormalizer
 -

 Key: NUTCH-1321
 URL: https://issues.apache.org/jira/browse/NUTCH-1321
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.7


 Right now, IDNs are indexed as ASCII. An IDNNormalizer is to be used with an 
 indexer so it will convert ASCII URLs to their proper unicode equivalent.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1315) reduce speculation on but ParseOutputFormat doesn't name output files correctly?

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1315:


Fix Version/s: 1.7

 reduce speculation on but ParseOutputFormat doesn't name output files 
 correctly?
 

 Key: NUTCH-1315
 URL: https://issues.apache.org/jira/browse/NUTCH-1315
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: ubuntu 64bit, hadoop 1.0.1, 3 Node Cluster, segment size 
 1.5M urls
Reporter: Rafael
  Labels: hadoop, hdfs
 Fix For: 1.7


 From time to time the Reducer log contains the following and one tasktracker 
 gets blacklisted.
 org.apache.hadoop.ipc.RemoteException: 
 org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to 
 create file 
 /user/test/crawl/segments/20120316065507/parse_text/part-1/data for 
 DFSClient_attempt_201203151054_0028_r_01_1 on client xx.x.xx.xx.10, 
 because this file is already being created by 
 DFSClient_attempt_201203151054_0028_r_01_0 on xx.xx.xx.9
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1404)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1244)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1186)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628)
   at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
   at org.apache.hadoop.ipc.Client.call(Client.java:1066)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
   at $Proxy2.create(Unknown Source)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
   at $Proxy2.create(Unknown Source)
   at 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.init(DFSClient.java:3245)
   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:182)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555)
   at 
 org.apache.hadoop.io.SequenceFile$RecordCompressWriter.init(SequenceFile.java:1132)
   at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
   at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:354)
   at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:476)
   at org.apache.hadoop.io.MapFile$Writer.init(MapFile.java:157)
   at org.apache.hadoop.io.MapFile$Writer.init(MapFile.java:134)
   at org.apache.hadoop.io.MapFile$Writer.init(MapFile.java:92)
   at 
 org.apache.nutch.parse.ParseOutputFormat.getRecordWriter(ParseOutputFormat.java:110)
   at 
 org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.init(ReduceTask.java:448)
   at 
 org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:490)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
   at org.apache.hadoop.mapred.Child.main(Child.java:249)
 I asked the hdfs-user mailing list and i got the following answer:
 Looks like you have reduce speculation turned on, but the
 ParseOutputFormat you're using doesn't properly name its output files
 distinctly based on the task attempt ID. As a workaround you can
 probably turn off 
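For reference, the workaround hinted at above (turning off reduce-side speculative execution) corresponds to the following Hadoop 1.x property, which could go in mapred-site.xml. This is a sketch of the mailing-list suggestion only; it avoids the race between duplicate task attempts but does not fix the underlying output-file naming issue.

```xml
<!-- Disable reduce-side speculative execution so two task attempts
     never race to create the same part file in parse_text. -->
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```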

[jira] [Updated] (NUTCH-1309) fetch queue management

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1309:


Fix Version/s: 1.7

 fetch queue management
 --

 Key: NUTCH-1309
 URL: https://issues.apache.org/jira/browse/NUTCH-1309
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht
  Labels: fetch
 Fix For: 1.7


 when fetch runs in hadoop with multiple concurrent mappers, there are multiple 
 independent fetchQueues, which makes them hard to manage. i suggest 
 constructing the fetchQueues before the run begins, with this line:
 feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-933) Fetcher does not save a pages Last-Modified value in CrawlDatum

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-933:
---

Fix Version/s: 1.7

 Fetcher does not save a pages Last-Modified value in CrawlDatum
 ---

 Key: NUTCH-933
 URL: https://issues.apache.org/jira/browse/NUTCH-933
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.2
Reporter: Joe Kemp
 Fix For: 1.7


 I added the following code in the output method just after the if (content 
 != null) statement.
  String lastModified = metadata.get("Last-Modified");
  if (lastModified != null && !lastModified.equals("")) {
      try {
          Date lastModifiedDate = DateUtil.parseDate(lastModified);
          datum.setModifiedTime(lastModifiedDate.getTime());
      } catch (DateParseException e) {
          // ignore unparsable Last-Modified values
      }
  }
 I now get 304 for pages that haven't changed when I recrawl.  Need to do 
 further testing.  Might also need a configuration parameter to turn off this 
 behavior, allowing pages to be forced to be refreshed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-929) Create a REST-based admin UI for Nutch

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-929:
---

Fix Version/s: 2.2

 Create a REST-based admin UI for Nutch
 --

 Key: NUTCH-929
 URL: https://issues.apache.org/jira/browse/NUTCH-929
 Project: Nutch
  Issue Type: New Feature
  Components: administration gui
Affects Versions: nutchgora
Reporter: Andrzej Bialecki 
 Fix For: 2.2


 This is a follow up to NUTCH-880 - we need to expose the functionality of 
 REST API in a user-friendly admin UI. Thanks to the nature of the API the UI 
 can be implemented in any UI framework that speaks REST/JSON, so it could be 
 a simple webapp (we already have jetty) or a Swing / Pivot / etc standalone 
 application.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-891) Nutch build should not depend on unversioned local deps

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-891:
---

Fix Version/s: 2.2

 Nutch build should not depend on unversioned local deps
 ---

 Key: NUTCH-891
 URL: https://issues.apache.org/jira/browse/NUTCH-891
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1
Reporter: Andrzej Bialecki 
 Fix For: 2.2

 Attachments: gora-49_v1.patch, gora.build.patch


 The fix in NUTCH-873 introduces an unknown variable to the build process. 
 Since local ivy artifacts are unversioned, different people that install Gora 
 jars at different points in time will use the same artifact id but in fact 
 the artifacts (jars) will differ because they will come from different 
 revisions of Gora sources. Therefore Nutch builds based on the same svn rev. 
 won't be repeatable across different environments.
 As much as it pains the ivy purists ;) until Gora publishes versioned 
 artifacts I'd like to revert the fix in NUTCH-873 and add again Gora jars 
 built from a known external rev. We can add a README that contains commit id 
 from Gora.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-891) Nutch build should not depend on unversioned local deps

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-891:
---

Patch Info: Patch Available

 Nutch build should not depend on unversioned local deps
 ---

 Key: NUTCH-891
 URL: https://issues.apache.org/jira/browse/NUTCH-891
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1
Reporter: Andrzej Bialecki 
 Fix For: 2.2

 Attachments: gora-49_v1.patch, gora.build.patch


 The fix in NUTCH-873 introduces an unknown variable to the build process. 
 Since local ivy artifacts are unversioned, different people that install Gora 
 jars at different points in time will use the same artifact id but in fact 
 the artifacts (jars) will differ because they will come from different 
 revisions of Gora sources. Therefore Nutch builds based on the same svn rev. 
 won't be repeatable across different environments.
 As much as it pains the ivy purists ;) until Gora publishes versioned 
 artifacts I'd like to revert the fix in NUTCH-873 and add again Gora jars 
 built from a known external rev. We can add a README that contains commit id 
 from Gora.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-952) fix outlink which started with '?' in html parser

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-952:
---

Fix Version/s: 1.7

 fix outlink which started with '?' in html parser
 -

 Key: NUTCH-952
 URL: https://issues.apache.org/jira/browse/NUTCH-952
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: nutchgora
Reporter: Stondet
 Fix For: 1.7

 Attachments: NUTCH-952-v2.patch


 <a href="?w=ruby%20on%20rails&ty=c&sd=0">ruby on rails</a> (a snippet from 
 http://bbs.soso.com/search?ty=c&sd=0&w=rails)
 outlink parsed from the above link: 
 http://bbs.soso.com/?w=ruby%20on%20rails&ty=c&sd=0
 but expected is http://bbs.soso.com/search?w=ruby%20on%20rails&ty=c&sd=0

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-649) Log list of files found but not crawled.

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-649:
---

Fix Version/s: 1.7

 Log list of files found but not crawled.
 

 Key: NUTCH-649
 URL: https://issues.apache.org/jira/browse/NUTCH-649
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: any
Reporter: Jim
 Fix For: 1.7


 I use Nutch to find the location of executables on the web, but we do 
 not download the executables with Nutch.  In order to get Nutch to report the 
 location of files without downloading them, I had to make a very small 
 patch to the code, and I think this change might be useful to others too.  
 The patch just logs files that are being filtered at the info level, although 
 perhaps it should be at the debug level.
 I have included an svn diff with this change.  Use cases would be both to 
 use it as a diagnostic tool (let's see what we are skipping) and as a way to 
 find content and links pointed to by a page or site without having to 
 actually download that content.
 Index: ParseOutputFormat.java
 ===================================================================
 --- ParseOutputFormat.java  (revision 593619)
 +++ ParseOutputFormat.java  (working copy)
 @@ -193,17 +193,20 @@
                toHost = null;
              }
              if (toHost == null || !toHost.equals(fromHost)) { // external links
 +              LOG.info("filtering externalLink " + toUrl + " linked to by "
 +                  + fromUrl);
                continue; // skip it
              }
            }
            try {
              toUrl = normalizers.normalize(toUrl,
                  URLNormalizers.SCOPE_OUTLINK); // normalize the url
 -            toUrl = filters.filter(toUrl);   // filter the url
 -            if (toUrl == null) {
 -              continue;
 -            }
 -          } catch (Exception e) {
 +            if (filters.filter(toUrl) == null) {   // filter the url
 +              LOG.info("filtering content " + toUrl + " linked to by "
 +                  + fromUrl);
 +              continue;
 +            }
 +          } catch (Exception e) {
              continue;
            }
            CrawlDatum target = new CrawlDatum(CrawlDatum.STATUS_LINKED,
                interval);

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (NUTCH-960) Language ID - confidence factor

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-960.


Resolution: Won't Fix

This is way too old and as Ken pointed out this should be dealt with upstream 
in Tika.

 Language ID - confidence factor
 ---

 Key: NUTCH-960
 URL: https://issues.apache.org/jira/browse/NUTCH-960
 Project: Nutch
  Issue Type: Wish
Affects Versions: 1.2
Reporter: M Alexander

 Hi
 In a Java implementation, what is the best way to calculate the confidence of 
 the outcome of the language id for a given text?
 For example:
 matching n-grams / total n-grams * 100.
 When a text is passed, the outcome would be "en" with 89% confidence. What is 
 the best way to add this to the existing nutch language id code?
 Thanks

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-945) Indexing to multiple SOLR Servers

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-945:
---

Fix Version/s: 2.2

 Indexing to multiple SOLR Servers
 -

 Key: NUTCH-945
 URL: https://issues.apache.org/jira/browse/NUTCH-945
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.2
Reporter: Charan Malemarpuram
 Fix For: 2.2

 Attachments: MurmurHashPartitioner.java, 
 NonPartitioningPartitioner.java, patch-NUTCH-945.txt


 It would be nice to have a default Indexer in Nutch which can submit docs to 
 multiple SOLR servers.
  Partitioning is always the question when writing to multiple SOLR servers.
  Default partitioning can be a simple hashcode-based distribution with 
  additional hooks for customization.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-945) Indexing to multiple SOLR Servers

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-945:
---

Patch Info: Patch Available

 Indexing to multiple SOLR Servers
 -

 Key: NUTCH-945
 URL: https://issues.apache.org/jira/browse/NUTCH-945
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.2
Reporter: Charan Malemarpuram
 Fix For: 2.2

 Attachments: MurmurHashPartitioner.java, 
 NonPartitioningPartitioner.java, patch-NUTCH-945.txt


 It would be nice to have a default Indexer in Nutch which can submit docs to 
 multiple SOLR servers.
  Partitioning is always the question when writing to multiple SOLR servers.
  Default partitioning can be a simple hashcode-based distribution with 
  additional hooks for customization.
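The hashcode-based distribution mentioned above can be sketched like this. `SolrShardPartitioner` and `shardFor` are invented names for illustration, not the API of the attached patch files.

```java
// Minimal sketch of a default hashcode-based document-to-server mapping.
public class SolrShardPartitioner {
    private final int numServers;

    public SolrShardPartitioner(int numServers) {
        this.numServers = numServers;
    }

    // Map a document key (e.g. its URL) to one of the SOLR servers.
    // Masking with Integer.MAX_VALUE keeps the index non-negative even
    // when hashCode() is negative.
    public int shardFor(String key) {
        return (key.hashCode() & Integer.MAX_VALUE) % numServers;
    }

    public static void main(String[] args) {
        SolrShardPartitioner p = new SolrShardPartitioner(3);
        System.out.println(p.shardFor("http://nutch.apache.org/")); // 0, 1 or 2
    }
}
```

A customization hook could swap `hashCode()` for another hash function (the attached MurmurHashPartitioner.java presumably does something along these lines), or bypass partitioning entirely as NonPartitioningPartitioner.java suggests.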
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (NUTCH-734) option to filter a tag text

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-734.


Resolution: Won't Fix

This is simply not required and dated. Plus I assume that by referring to "a" we 
mean stop words. These are filtered during the IR process in (all?) modern 
indexing servers. 

 option to filter a tag text
 -

 Key: NUTCH-734
 URL: https://issues.apache.org/jira/browse/NUTCH-734
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.0.0
Reporter: ron

 Motivation:
 When fetching pages with menu links, the menus (for example "search") appear 
 on all pages of the site. Searching for the word "search" then returns all 
 pages of the site, instead of just the search page.
 Change request:
 Add options to filter the text of <a> tags, or more generally add filters to 
 exclude text within specific tags.
 I have worked around this by changing DOMContentUtils.getTextHelper:
  if (nodeType == Node.TEXT_NODE && !(currentNode.getParentNode() != null
      && "a".equalsIgnoreCase(currentNode.getParentNode().getNodeName())))
 - Ron

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (NUTCH-745) MyHtmlParser getParse return not null,so all Analyzer-(zh|fr) cannot run

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-745.


Resolution: Invalid

close of legacy issue

 MyHtmlParser getParse return not null,so all Analyzer-(zh|fr) cannot run
 

 Key: NUTCH-745
 URL: https://issues.apache.org/jira/browse/NUTCH-745
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
 Environment: JDK1.6 + tomcat 6 + Eclipse3.3 + nutch 1.0
Reporter: jcore_XiaTian

 MyHtmlParser getParse returns not null, so all Analyzer-(zh|fr) plugins cannot run
   public ParseResult getParse(Content content) {
     return ParseResult.createParseResult(content.getUrl(), new
         ParseStatus(ParseStatus.FAILED,
             ParseStatus.FAILED_MISSING_CONTENT,
             "No textual content available").getEmptyParse(conf));
     // return null;
   }
 ====== nutch-site.xml ======
 <property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(myHtml|html|text|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier|analysis-(zh)</value>
   <description><![CDATA[
   ]]></description>
 </property>
 ====== parse-plugins.xml ======
 <mimeType name="text/html">
   <plugin id="parse-myHtml" />
   <plugin id="parse-html" />
 </mimeType>
 <alias name="parse-myHtml"
     extension-id="org.apache.nutch.parse.html.MyHtmlParser" />
 ====== src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java ======
  public ParseResult getParse(Content content) {
  .....
  // cannot run the code:
    ParseResult filteredParse = this.htmlParseFilters.filter(content,
        parseResult, metaTags, root);
  ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-685) Content-level redirect status lost in ParseSegment

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-685:
---

Fix Version/s: 2.2
   1.7

 Content-level redirect status lost in ParseSegment
 --

 Key: NUTCH-685
 URL: https://issues.apache.org/jira/browse/NUTCH-685
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.7, 2.2


 When Fetcher runs in parsing mode, content-level redirects (HTML meta tag 
 Refresh) are properly discovered and recorded in crawl_fetch under source 
 URL and target URL. If Fetcher runs in non-parsing mode, and ParseSegment is 
 run as a separate step, the content-level redirection data is used only to 
 add the new (target) URL, but the status of the original URL is not reset to 
 indicate a redirect. Consequently, status of the original URL will be 
 different depending on the way you run Fetcher, whereas it should be the same.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-583) FeedParser empty links for items

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-583:
---

Fix Version/s: 2.2
   1.7

 FeedParser empty links for items
 

 Key: NUTCH-583
 URL: https://issues.apache.org/jira/browse/NUTCH-583
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 1.7, 2.2


 FeedParser in the feed plugin just discards an item if it does not have a link 
 element. However RSS 2.0 does not require the link element for each 
 item. 
 Moreover, sometimes the link is given in the guid element, which is a 
 globally unique identifier for the item. I think we can search for the url of an 
 item first, and if it is still not found, use the feed's url, while 
 merging all the parse texts into one Parse object. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-356) Plugin repository cache can lead to memory leak

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-356:
---

   Patch Info: Patch Available
Fix Version/s: 2.2
   1.7

 Plugin repository cache can lead to memory leak
 ---

 Key: NUTCH-356
 URL: https://issues.apache.org/jira/browse/NUTCH-356
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Enrico Triolo
 Fix For: 1.7, 2.2

 Attachments: ASF.LICENSE.NOT.GRANTED--NutchTest.java, 
 ASF.LICENSE.NOT.GRANTED--patch.txt, cache_classes.patch


 While I was trying to solve a problem I reported a while ago (see Nutch-314), 
 I found out that actually the problem was related to the plugin cache used in 
 class PluginRepository.java.
 As I said in Nutch-314, I think I somehow 'force' the way nutch is meant to 
 work, since I need to frequently submit new urls and append their contents to 
 the index; I don't (and can't) have an urls.txt file with all the urls I'm 
 going to fetch, but recreate it each time a new url is submitted.
 Thus, I think in the majority of cases you won't have problems using nutch 
 as-is, since the problem I found occurs only if nutch is used in a way 
 similar to mine.
 To simplify your test I'm attaching a class that performs something similar 
 to what I need. It fetches and indexes some sample urls; to avoid webmasters' 
 complaints I left the sample urls list empty, so you should modify the source 
 code and add some urls.
 Then you only have to run it and watch your memory consumption with top. In 
 my experience I get an OutOfMemoryException after a couple of minutes, but it 
 clearly depends on your heap settings and on the plugins you are using (I'm 
 using 
 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').
 The problem is bound to the PluginRepository 'singleton' instance, since it 
 never gets released. It seems that some class maintains a reference to it and 
 this class is never released since it is cached somewhere in the 
 configuration.
 So I modified the PluginRepository's 'get' method so that it never uses the 
 cache and always returns a new instance (you can find the patch in 
 attachment). This way the memory consumption is always stable and I get no 
 OOM anymore.
 Clearly this is not the solution, since I guess there are many performance 
 issues involved, but for the moment it works.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-366) Merge URLFilters and URLNormalizers

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-366:
---

Fix Version/s: 2.2
   1.7

 Merge URLFilters and URLNormalizers
 ---

 Key: NUTCH-366
 URL: https://issues.apache.org/jira/browse/NUTCH-366
 Project: Nutch
  Issue Type: Improvement
Reporter: Andrzej Bialecki 
  Labels: gsoc2012
 Fix For: 1.7, 2.2


 Currently Nutch uses two subsystems related to url validation and 
 normalization:
 * URLFilter: this interface checks if URLs are valid for further processing. 
 Input URL is not changed in any way. The output is a boolean value.
 * URLNormalizer: this interface brings URLs to their base (normal) form, or 
 removes unneeded URL components, or performs any other URL mangling as 
 necessary. Input URLs are changed, and are returned as result.
 However, various Nutch tools run filters and normalizers in pre-determined 
 order, i.e. normalizers first, and then filters. In some cases, where 
 normalizers are complex and running them is costly (e.g. numerous regex 
 rules, DNS lookups) it would make sense to run some of the filters first 
 (e.g. prefix-based filters that select only certain protocols, or 
 suffix-based filters that select only known extensions). This is currently 
 not possible - we always have to run normalizers, only to later throw away 
 URLs because they failed to pass through filters.
 I would like to solicit comments on the following two solutions, and work on 
 implementation of one of them:
 1) we could make URLFilters and URLNormalizers implement the same interface, 
 and basically make them interchangeable. This way users could configure their 
 order arbitrarily, even mixing filters and normalizers out of order. This is 
 more complicated, but gives much more flexibility - and NUTCH-365 already 
 provides sufficient framework to implement this, including the ability to 
 define different sequences for different steps in the workflow.
 2) we could use a property url.mangling.order ;) to define whether 
 normalizers or filters should run first. This is simple to implement, but 
 provides only limited improvement - because either all filters or all 
 normalizers would run, they couldn't be mixed in arbitrary order.
 Any comments?
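Option 1 above can be sketched as a single interface that both filters and normalizers implement, chained in a user-defined order so cheap filters run before costly normalizers. Names here (UrlProcessor, UrlChain) are illustrative, not the NUTCH-365 API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of option 1: filters and normalizers share one interface
// and can be chained in any order. A filter returns null to reject a URL; a
// normalizer returns a possibly rewritten URL.
class UrlChain {
    private final List<UrlProcessor> steps = new ArrayList<>();

    UrlChain add(UrlProcessor p) { steps.add(p); return this; }

    String run(String url) {
        for (UrlProcessor p : steps) {
            if (url == null) return null;   // rejected: skip remaining steps
            url = p.process(url);
        }
        return url;
    }

    public static void main(String[] args) {
        UrlChain chain = new UrlChain()
            .add(u -> u.startsWith("http") ? u : null)   // cheap filter first
            .add(String::toLowerCase);                   // costlier normalizer later
        assert "http://nutch.apache.org/".equals(chain.run("http://NUTCH.APACHE.ORG/"));
        assert chain.run("ftp://x/") == null;            // dropped before normalizing
    }
}

interface UrlProcessor {
    String process(String url);   // null means "filtered out"
}
```

Because rejection short-circuits the chain, expensive normalizers are never run on URLs that a cheap prefix filter would have discarded anyway.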



[jira] [Updated] (NUTCH-475) Adaptive crawl delay

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-475:
---

Fix Version/s: 1.7

 Adaptive crawl delay
 

 Key: NUTCH-475
 URL: https://issues.apache.org/jira/browse/NUTCH-475
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Doğacan Güney
 Fix For: 1.7

 Attachments: adaptive-delay_draft.patch, NUTCH-475.patch


 Current fetcher implementation waits a default interval before making another 
 request to the same server (if crawl-delay is not specified in robots.txt). 
 IMHO, an adaptive implementation would be better: if the server is under 
 light load and can serve requests quickly, the fetcher can ask for more pages 
 in a given interval; similarly, if the server is suffering from heavy load, 
 the fetcher can slow down (w.r.t. that host), easing the load on the server.
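The adaptive idea can be sketched as a small controller that moves the per-host delay toward a multiple of the last observed response time. The bounds and the smoothing below are made-up illustrations, not values from the attached patch:

```java
// Illustrative sketch of an adaptive per-host delay: shrink the delay when the
// server answers quickly, grow it when responses slow down. The bounds and the
// smoothing are made-up values, not numbers from the attached patch.
class AdaptiveDelay {
    static final long MIN_DELAY_MS = 500, MAX_DELAY_MS = 10_000;

    static long nextDelay(long currentDelayMs, long lastResponseMs) {
        long target = lastResponseMs * 4;            // aim at a multiple of response time
        long next = (currentDelayMs + target) / 2;   // simple smoothing
        return Math.max(MIN_DELAY_MS, Math.min(MAX_DELAY_MS, next));
    }

    public static void main(String[] args) {
        long d = nextDelay(5_000, 100);   // fast responses: delay shrinks
        assert d < 5_000 && d >= MIN_DELAY_MS;
        d = nextDelay(d, 5_000);          // slow responses: delay grows again
        assert d > 2_700 && d <= MAX_DELAY_MS;
    }
}
```

Clamping to a minimum keeps the fetcher polite even toward very fast servers, while the maximum stops one slow response from stalling a queue indefinitely.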



[jira] [Updated] (NUTCH-207) Bandwidth target for fetcher rather than a thread count

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-207:
---

   Patch Info: Patch Available
Fix Version/s: 1.7

 Bandwidth target for fetcher rather than a thread count
 ---

 Key: NUTCH-207
 URL: https://issues.apache.org/jira/browse/NUTCH-207
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8
Reporter: Rod Taylor
 Fix For: 1.7

 Attachments: ratelimit.patch


 Increases or decreases the number of threads from the starting value 
 (fetcher.threads.fetch) up to a maximum (fetcher.threads.maximum) to achieve 
 a target bandwidth (fetcher.threads.bandwidth).
 It seems to be able to keep within 10% of the target bandwidth even when 
 large numbers of errors are found or when a run of large pages is 
 encountered.
 To achieve more accurate tracking, Nutch should keep track of protocol 
 overhead as well as the volume of pages downloaded.
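The control loop described above amounts to nudging the thread count whenever measured bandwidth drifts more than about 10% from the target. Property names follow the description; the logic itself is a simplified sketch, not the attached patch:

```java
// Simplified sketch of the bandwidth-target idea: grow the thread pool while
// under target, shrink it while over, hold steady inside a ~10% dead band.
class BandwidthController {
    static int adjustThreads(int threads, int maxThreads,
                             double measuredKbps, double targetKbps) {
        if (measuredKbps < targetKbps * 0.9 && threads < maxThreads) {
            return threads + 1;   // under target: add a fetcher thread
        }
        if (measuredKbps > targetKbps * 1.1 && threads > 1) {
            return threads - 1;   // over target: drop one
        }
        return threads;           // within ~10% of target: hold steady
    }

    public static void main(String[] args) {
        assert adjustThreads(10, 50, 200, 1000) == 11;   // under target
        assert adjustThreads(10, 50, 1500, 1000) == 9;   // over target
        assert adjustThreads(10, 50, 1000, 1000) == 10;  // on target
    }
}
```

The dead band is what keeps the controller from oscillating when errors or large pages make the measured rate noisy.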



[jira] [Updated] (NUTCH-1508) Port limit crawler to defined depth to 2.x

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1508:


Fix Version/s: 2.2

 Port limit crawler to defined depth to 2.x
 --

 Key: NUTCH-1508
 URL: https://issues.apache.org/jira/browse/NUTCH-1508
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.2
Reporter: Julien Nioche
 Fix For: 2.2






[jira] [Resolved] (NUTCH-314) Multiple language identifier instances

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-314.


Resolution: Won't Fix

Closing legacy issue.

 Multiple language identifier instances
 --

 Key: NUTCH-314
 URL: https://issues.apache.org/jira/browse/NUTCH-314
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
 Environment: OS: Linux RHEL 4
 JDK: 1.5_07
Reporter: Enrico Triolo

 In my application I often need to perform the inject - generate - .. - 
 index loop multiple times, since users can 'suggest' new web pages to be 
 crawled and indexed.
 I also need to enable the language identifier plugin.
 Everything seems to work correctly, but after some time I get an 
 OutOfMemoryException. Actually the time isn't important, since I noticed that 
 the problem arises when the user submits many urls (~100). As I said, for 
 each submitted url a new loop is performed (similar to the one in the 
 Crawl.main method).
 Using a profiler (specifically, netbeans profiler) I found out that for each 
 submitted url a new LanguageIdentifier instance is created, and never 
 released. With the memory inspector tool I can see as many instances of 
 LanguageIdentifier and NGramProfile$NGramEntry as the number of fetched 
 pages, each of them occupying about 180kb. Forcing garbage collection doesn't 
 release much memory.
 Maybe we should cache its instance in the conf as we do for many other 
 objects in Nutch.
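The suggested fix, caching the instance in the configuration like other Nutch objects, could look roughly like this. ObjectCache here is a stand-in for Nutch's real per-configuration cache, not its actual API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Sketch of the suggested fix: keep one LanguageIdentifier per configuration
// in an object cache, as Nutch does for other heavyweight objects.
// ObjectCache is a stand-in for the real per-configuration cache.
class ObjectCache {
    private final Map<String, Object> cache = new HashMap<>();

    @SuppressWarnings("unchecked")
    synchronized <T> T get(String key, Supplier<T> maker) {
        return (T) cache.computeIfAbsent(key, k -> maker.get());
    }

    public static void main(String[] args) {
        ObjectCache conf = new ObjectCache();
        Object a = conf.get("langid", Object::new);   // built once...
        Object b = conf.get("langid", Object::new);   // ...then reused
        assert a == b;   // one ~180 KB identifier instead of one per URL
    }
}
```

With one cached identifier per configuration, the per-URL loop no longer allocates a fresh 180 KB n-gram profile each time, which is exactly the growth the profiler showed.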



[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1483:


Fix Version/s: 2.2
   1.7

 Can't crawl filesystem with protocol-file plugin
 

 Key: NUTCH-1483
 URL: https://issues.apache.org/jira/browse/NUTCH-1483
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.6, 2.1
 Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
Reporter: Rogério Pereira Araújo
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1483.patch


 I tried to follow the same steps described in this wiki page:
 http://wiki.apache.org/nutch/IntranetDocumentSearch
 I made all required changes on regex-urlfilter.txt and added the following 
 entry in my seed file:
 file:///home/rogerio/Documents/
 The permissions are ok, I'm running nutch with the same user as folder owner, 
 so nutch has all the required permissions, unfortunately I'm getting the 
 following error:
 org.apache.nutch.protocol.file.FileError: File Error: 404
 at 
 org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
 at 
 org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
 fetch of file://home/rogerio/Documents/ failed with: 
 org.apache.nutch.protocol.file.FileError: File Error: 404
 Why are the logs showing file://home/rogerio/Documents/ instead of 
 file:///home/rogerio/Documents/?
 Note: the regex-urlfilter entry only works as expected if I add the entry 
 +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/ 
 as the wiki says.
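The discrepancy is explained by generic URI parsing: in file://home/rogerio/Documents/ the segment after the double slash is the authority (host), not part of the path, which is why the empty-authority form with three slashes is needed. A quick check with java.net.URI shows the same split the logs do:

```java
import java.net.URI;

// Demonstrates why file://home/... and file:///home/... parse differently:
// after the double slash a URI parser reads an authority (host) first.
class FileUrlCheck {
    public static void main(String[] args) {
        URI twoSlashes   = URI.create("file://home/rogerio/Documents/");
        URI threeSlashes = URI.create("file:///home/rogerio/Documents/");

        assert "home".equals(twoSlashes.getHost());         // "home" parsed as host
        assert "/rogerio/Documents/".equals(twoSlashes.getPath());

        assert threeSlashes.getHost() == null;              // empty authority
        assert "/home/rogerio/Documents/".equals(threeSlashes.getPath());
    }
}
```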



[jira] [Updated] (NUTCH-802) Problems managing outlinks with large url length

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-802:
---

Fix Version/s: 1.7

 Problems managing outlinks with large url length
 

 Key: NUTCH-802
 URL: https://issues.apache.org/jira/browse/NUTCH-802
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Pablo Aragón
Assignee: Andrzej Bialecki 
  Labels: nutch, outlink, parse, parseoutputformat
 Fix For: 1.7

 Attachments: ParseOutputFormat.patch


 Nutch can hang during the collection of outlinks if the URL of the 
 outlink is too long.
 The maximum URL sizes for the main web servers are:
 * Apache: 4,000 bytes
 * Microsoft Internet Information Server (IIS): 16,384 bytes
 * Perl HTTP::Daemon: 8,000 bytes
 URL addresses larger than 4,000 bytes are problematic, so the limit should 
 be set in the nutch-default.xml configuration file.
 I attached a patch.
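A sketch of the proposed behavior: drop outlinks whose URLs exceed a configurable maximum before they are collected. The 4,000-character default mirrors the Apache limit quoted above; the constant name is illustrative, not the patch's actual property:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the proposed fix: discard outlinks whose URL exceeds a
// configurable maximum length before further processing. The constant
// name is illustrative; the value would come from nutch-default.xml.
class OutlinkFilter {
    static final int MAX_URL_LENGTH = 4000;

    static List<String> keepReasonable(List<String> outlinks) {
        return outlinks.stream()
                       .filter(u -> u.length() <= MAX_URL_LENGTH)
                       .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String huge = "http://example.com/?q=" + "a".repeat(5000);
        assert keepReasonable(List.of("http://example.com/", huge)).size() == 1;
    }
}
```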



[jira] [Updated] (NUTCH-795) Add ability to maintain nofollow attribute in linkdb

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-795:
---

Fix Version/s: 1.7

 Add ability to maintain nofollow attribute in linkdb
 

 Key: NUTCH-795
 URL: https://issues.apache.org/jira/browse/NUTCH-795
 Project: Nutch
  Issue Type: New Feature
  Components: linkdb
Affects Versions: 1.1
Reporter: Sammy Yu
 Fix For: 1.7

 Attachments: 0001-Updated-with-nofollow-support-for-Outlinks.patch






[jira] [Updated] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1478:


Fix Version/s: 2.2

 Parse-metatags and index-metadata plugin for Nutch 2.x series 
 --

 Key: NUTCH-1478
 URL: https://issues.apache.org/jira/browse/NUTCH-1478
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 2.1
Reporter: kiran
 Fix For: 2.2

 Attachments: metadata_parseChecker_sites.png, Nutch1478.patch, 
 Nutch1478.zip


 I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x 
 series. This will take multiple values of the same tag and index them in Solr, 
 as I patched before (https://issues.apache.org/jira/browse/NUTCH-1467).
 The usage is the same as described here 
 (http://wiki.apache.org/nutch/IndexMetatags), but one change is that there is 
 no need to give the 'metatag' keyword before metatag names. For example, my 
 configuration looks like this: 
 (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml)
  
 This is only the first version and does not include the junit test. I will 
 update the new version soon.
 This will parse the tags and index them in Solr. Make sure the fields listed 
 under 'index.parse.md' in nutch-site.xml are also created in Solr's 
 schema.xml.
 Please let me know if you have any suggestions.
 This is supported by DLA (Digital Library and Archives) of Virginia Tech.



[jira] [Updated] (NUTCH-1511) Metadata in MYSQL updated with 'garbage'

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1511:


Fix Version/s: 2.2

 Metadata in MYSQL updated with 'garbage'
 

 Key: NUTCH-1511
 URL: https://issues.apache.org/jira/browse/NUTCH-1511
 Project: Nutch
  Issue Type: Bug
  Components: fetcher, injector, storage
Affects Versions: 2.1
 Environment: Ubuntu 12.04
Reporter: J. Gobel
  Labels: metadata, mysql, nutch, scoring-opic
 Fix For: 2.2


 After applying the patch for the metadata parser (NUTCH-1478), I notice that 
 just before the crawl ends the metadata field is populated with the correct 
 information. However, when the crawl is completely finished, the metadata 
 field is populated with 'garbage': _csh_� 
 I notice in my SQL log file that the scoring plugin is overwriting the 
 metadata field in a final data insertion with '_csh_ \0\0\0\0\'. When I 
 remove 'scoring-opic' from the 'plugin.includes' property in nutch-site.xml, 
 the metadata field is crisp and clear.
 MYSQL LOG FILE: (I did a crawl on http://nutch.apache.org. Below you will see 
 fragments of my MySQL log file, only the moments when data is written to 
 the METADATA field in the MySQL table.)
 First insertion: here, I suppose, scoring-opic writes its information, _csh_ 
 ?€\0\0\0 
 58 QueryINSERT INTO webpage 
 (fetchInterval,fetchTime,id,markers,metadata,score )VALUES 
 (2592000,1357122976493,'org.apache.nutch:http/',' dist 0 _injmrk_ y\0','
 _csh_ ?€\0\0\0',1.0) ON DUPLICATE KEY UPDATE 
 fetchInterval=2592000,fetchTime=1357122976493,markers=' dist 0 _injmrk_ 
 y\0',metadata='
 _csh_ ?€\0\0\0',score=1.0
 Second insertion: here the scraped metadata is inserted into the metadata field. 
  81 QueryINSERT INTO webpage 
 (id,markers,metadata,outlinks,parseStatus,signature,text,title )VALUES 
 ('org.apache.nutch:http/',
 The final insertion: please note that here the metadata field is 
 overwritten with _CSH_\0\0\0\0
 90 QueryINSERT INTO webpage (fetchTime,id,inlinks,markers,metadata 
 )VALUES (1359714995075,'org.apache.nutch:http/',' 0http://nutch.apache.org/
 Nutch\0',' dist 0 _injmrk_ y _updmrk_*1357122982-1745626508 
 __prsmrk__*1357122982-1745626508 _gnmrk_*1357122982-1745626508 
 _ftcmrk_*1357122982-1745626508\0','
 _csh_ \0\0\0\0\0') ON DUPLICATE KEY UPDATE fetchTime=1359714995075,inlinks=' 
 0http://nutch.apache.org/



[jira] [Updated] (NUTCH-1505) java.lang.IllegalArgumentException during updatedb

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1505:


Fix Version/s: 2.2

 java.lang.IllegalArgumentException during updatedb
 --

 Key: NUTCH-1505
 URL: https://issues.apache.org/jira/browse/NUTCH-1505
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1
 Environment: cassandra 0.8.10
Reporter: Stanley Orlenko
 Fix For: 2.2


 the command 
 bin/nutch updatedb
 raises the exception. Here is a part of hadoop.log:
 2012-12-21 11:27:58,557 WARN  mapred.LocalJobRunner - job_local_0001
 java.lang.IllegalArgumentException: offset (0) + length (4) exceed the 
 capacity of the array: 2
 at 
 org.apache.nutch.util.Bytes.explainWrongLengthOrOffset(Bytes.java:559)
 at org.apache.nutch.util.Bytes.toInt(Bytes.java:740)
 at org.apache.nutch.util.Bytes.toFloat(Bytes.java:611)
 at org.apache.nutch.util.Bytes.toFloat(Bytes.java:598)
 at 
 org.apache.nutch.scoring.opic.OPICScoringFilter.distributeScoreToOutlinks(OPICScoringFilter.java:128)
 at 
 org.apache.nutch.scoring.ScoringFilters.distributeScoreToOutlinks(ScoringFilters.java:117)
 at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:70)
 at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:37)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)



[jira] [Updated] (NUTCH-804) CrawlDatum.statNames can be modified

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-804:
---

Fix Version/s: 1.7

 CrawlDatum.statNames can be modified
 

 Key: NUTCH-804
 URL: https://issues.apache.org/jira/browse/NUTCH-804
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Mike Baranczak
Priority: Minor
 Fix For: 1.7


 public static final HashMap&lt;Byte, String&gt; statNames
 It's possible to modify the contents of this hash map from anywhere in the 
 application, which could cause problems in unrelated places. Unless I'm 
 missing something, there's no good reason to modify this map after it's 
 initialized. So, it should either not be declared public, or be made 
 read-only.
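The read-only option can be sketched with Collections.unmodifiableMap: the map is populated once, then only an unmodifiable view is published. The status values below are illustrative, not the real CrawlDatum constants:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed fix: build the status-name map once, publish only
// an unmodifiable view, so no other class can alter it. The byte values
// are illustrative, not the real CrawlDatum constants.
class CrawlStatus {
    private static final Map<Byte, String> NAMES = new HashMap<>();
    static {
        NAMES.put((byte) 1, "db_unfetched");
        NAMES.put((byte) 2, "db_fetched");
    }

    public static final Map<Byte, String> statNames =
        Collections.unmodifiableMap(NAMES);

    public static void main(String[] args) {
        boolean rejected = false;
        try {
            statNames.put((byte) 3, "bogus");
        } catch (UnsupportedOperationException e) {
            rejected = true;   // writes from elsewhere now fail fast
        }
        assert rejected;
    }
}
```

Keeping the field public preserves source compatibility while turning silent corruption into an immediate UnsupportedOperationException.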



[jira] [Updated] (NUTCH-789) Improvements to Tika parser

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-789:
---

Fix Version/s: 2.2
   1.7

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: parser
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.



[jira] [Updated] (NUTCH-813) Repetitive crawl 403 status page

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-813:
---

Fix Version/s: 1.7

 Repetitive crawl 403 status page
 

 Key: NUTCH-813
 URL: https://issues.apache.org/jira/browse/NUTCH-813
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Nguyen Manh Tien
Priority: Minor
 Fix For: 1.7

 Attachments: ASF.LICENSE.NOT.GRANTED--Patch


 When we crawl a page that returns a 403 status, it will be crawled 
 repeatedly each day with the default schedule, even when we restrict 
 retries with the parameter db.fetch.retry.max.



[jira] [Updated] (NUTCH-1464) index-static plugin doesn't allow the colon within the field value

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1464:


   Patch Info: Patch Available
Fix Version/s: 1.7

 index-static plugin doesn't allow the colon within the field value
 --

 Key: NUTCH-1464
 URL: https://issues.apache.org/jira/browse/NUTCH-1464
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.5
Reporter: Luca Cavanna
Priority: Minor
 Fix For: 1.7

 Attachments: NUTCH-1464.patch


 If I want to configure a static field with a value containing a colon, the 
 index-static plugin does nothing. There's a string split based on the colon 
 character: if the result is an array of length 2, everything is fine; 
 otherwise nothing happens and the static field is not set.
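A minimal fix for this kind of bug is to split on the first colon only, so the value may itself contain colons (e.g. a URL); String.split with a limit of 2 does exactly that:

```java
// Sketch of the fix: split a "field:value" pair on the FIRST colon only,
// so the value part may itself contain colons (e.g. a URL).
class StaticFieldParser {
    static String[] parse(String pair) {
        return pair.split(":", 2);   // at most two parts: field, rest of string
    }

    public static void main(String[] args) {
        String[] kv = parse("source:http://nutch.apache.org/");
        assert kv.length == 2;
        assert "source".equals(kv[0]);
        assert "http://nutch.apache.org/".equals(kv[1]);
    }
}
```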



[jira] [Updated] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1497:


Fix Version/s: 2.2

 Better default gora-sql-mapping.xml with larger field sizes for MySQL
 -

 Key: NUTCH-1497
 URL: https://issues.apache.org/jira/browse/NUTCH-1497
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.2
 Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor
  Labels: MySQL
 Fix For: 2.2

 Attachments: gora-mysql-mapping-patch, gora-mysql-mapping.xml, 
 gora-mysql-mapping.xml


 The current generic default gora-sql-mapping.xml has field sizes that are too 
 small in almost all situations when used with MySQL. I have included a 
 mapping which will work better for MySQL (takes slightly more space but will 
 be able to handle larger fields necessary for real world use). Includes patch 
 from Nutch-1490 and resolves the non-Unicode part of Nutch-1473. I believe it 
 is not possible to use the same gora-sql-mapping for both hsqldb and MySQL 
 without a significantly degraded lowest common denominator resulting. Should 
 the user manually rename the attached file to gora-sql-mapping.xml or is 
 there a way to have Nutch automatically use it when MySQL is selected in 
 other configurations (Ivy.xml or gora.properties)?



[jira] [Updated] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1499:


Fix Version/s: 1.7

 Usage of multiple ipv4 addresses and network cards on fetcher machines
 --

 Key: NUTCH-1499
 URL: https://issues.apache.org/jira/browse/NUTCH-1499
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 1.5.1
Reporter: Walter Tietze
Priority: Minor
 Fix For: 1.7

 Attachments: apache-nutch-1.5.1.NUTCH-1499.patch


 Adds the ability for the fetcher threads to use multiple configured IPv4 
 addresses.
 On some cluster machines several IPv4 addresses are configured, where 
 each IP address is associated with its own network interface.
 This patch makes it possible to configure protocol-http and protocol-httpclient 
 to use these network interfaces in round-robin style.
 If the feature is enabled, a helper class reads the network configuration at 
 *startup*. For each HTTP network connection the next IP address is taken. 
 This method is synchronized, but this should be no bottleneck for the overall 
 performance of the fetcher threads.
 This feature was tested on our cluster with the protocol-http and 
 protocol-httpclient protocols.
  



[jira] [Updated] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1499:


Patch Info: Patch Available

 Usage of multiple ipv4 addresses and network cards on fetcher machines
 --

 Key: NUTCH-1499
 URL: https://issues.apache.org/jira/browse/NUTCH-1499
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 1.5.1
Reporter: Walter Tietze
Priority: Minor
 Fix For: 1.7

 Attachments: apache-nutch-1.5.1.NUTCH-1499.patch


 Adds the ability for the fetcher threads to use multiple configured IPv4 
 addresses.
 On some cluster machines several IPv4 addresses are configured, where 
 each IP address is associated with its own network interface.
 This patch makes it possible to configure protocol-http and protocol-httpclient 
 to use these network interfaces in round-robin style.
 If the feature is enabled, a helper class reads the network configuration at 
 *startup*. For each HTTP network connection the next IP address is taken. 
 This method is synchronized, but this should be no bottleneck for the overall 
 performance of the fetcher threads.
 This feature was tested on our cluster with the protocol-http and 
 protocol-httpclient protocols.
  



[jira] [Updated] (NUTCH-1485) TableUtil reverseURL to keep userinfo part

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1485:


Fix Version/s: 2.2

 TableUtil reverseURL to keep userinfo part
 --

 Key: NUTCH-1485
 URL: https://issues.apache.org/jira/browse/NUTCH-1485
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.1
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 2.2


 The reversed URL key does not contain the userinfo part of a URL (user name 
 and password: {{ftp://user:passw...@ftp.xyz/file.txt}}; cf. [RFC 
 3986|http://tools.ietf.org/html/rfc3986] and 
 [http://en.wikipedia.org/wiki/URI_scheme]). Keeping the userinfo would make it 
 easy to crawl a fixed list of protected content. However, URLs with userinfo 
 can be tricky, e.g. 
 [http://cnn.comstory=breaking_news@199.239.136.200/mostpopular], so it's OK 
 if the default is to remove the userinfo. But this should be done in the 
 default URL normalizers.
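The normalizer rule suggested here, stripping the userinfo by default, can be sketched with java.net.URI. The real Nutch normalizers implement a plugin interface; only the core transformation is shown:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Sketch of a default-normalizer rule that strips the userinfo part of a URL,
// as the issue suggests. Only the core transformation is shown, not the
// Nutch URLNormalizer plugin wiring.
class UserInfoStripper {
    static String strip(String url) throws URISyntaxException {
        URI u = new URI(url);
        if (u.getUserInfo() == null) return url;   // nothing to remove
        return new URI(u.getScheme(), null, u.getHost(), u.getPort(),
                       u.getPath(), u.getQuery(), u.getFragment()).toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        assert "ftp://ftp.xyz/file.txt"
                .equals(strip("ftp://user:secret@ftp.xyz/file.txt"));
        assert "http://nutch.apache.org/"
                .equals(strip("http://nutch.apache.org/"));   // unchanged
    }
}
```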



[jira] [Updated] (NUTCH-1182) fetcher should track and shut down hung threads

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1182:


Fix Version/s: 2.2
   1.7

 fetcher should track and shut down hung threads
 ---

 Key: NUTCH-1182
 URL: https://issues.apache.org/jira/browse/NUTCH-1182
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.3, 1.4
 Environment: Linux, local job runner
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 1.7, 2.2


 While crawling a slow server hosting a couple of very large PDF documents (30 
 MB), after some time and a bulk of successfully fetched documents the fetcher 
 stops with the message: ??Aborting with 10 hung threads.??
 From then on every cycle ends with hung threads and almost no documents are 
 fetched successfully. In addition, strange Hadoop errors are logged:
 {noformat}
fetch of http://.../xyz.pdf failed with: java.lang.NullPointerException
 at java.lang.System.arraycopy(Native Method)
 at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108)
 ...
 {noformat}
 or
 {noformat}
Exception in thread QueueFeeder java.lang.NullPointerException
  at 
 org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:48)
  at 
 org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:41)
  at 
 org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:214)
 {noformat}
 I've run the debugger and found:
 # after the hung threads are reported the fetcher stops but the threads are 
 still alive and continue fetching a document. In consequence, this will
 #* limit the small bandwidth of network/server even more
 #* after the document is fetched the thread tries to write the content via 
 {{output.collect()}} which must fail because the fetcher map job is already 
 finished and the associated temporary mapred directory is deleted. The error 
 message may get mixed with the progress output of the next fetch cycle 
 causing additional confusion.
 # documents/URLs causing the hung thread are never reported nor stored. That 
 is, it's hard to track them down, and they will cause a hung thread again and 
 again.
 The problem is reproducible when fetching bigger documents and setting 
 {{mapred.task.timeout}} to a low value (this will definitely cause hung 
 threads).
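The tracking asked for in point 2 could be sketched as a per-thread registry of the URL being fetched and its start time, so hung threads can be reported with the offending URL and interrupted rather than left running. Class and method names are illustrative, not the Fetcher's actual API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: record which URL each fetcher thread is working on and
// since when, so hung threads can be reported (with the offending URL) and
// interrupted instead of silently left fetching after the job aborts.
class HungThreadTracker {
    static final Map<Thread, String> ACTIVE = new ConcurrentHashMap<>();
    static final Map<Thread, Long> STARTED = new ConcurrentHashMap<>();

    static void begin(String url) {
        ACTIVE.put(Thread.currentThread(), url);
        STARTED.put(Thread.currentThread(), System.currentTimeMillis());
    }

    static void end() {
        ACTIVE.remove(Thread.currentThread());
        STARTED.remove(Thread.currentThread());
    }

    // report and interrupt every registered thread that exceeded the timeout
    static void reapHung(long timeoutMs) {
        long now = System.currentTimeMillis();
        for (Map.Entry<Thread, Long> e : STARTED.entrySet()) {
            if (now - e.getValue() > timeoutMs) {
                System.err.println("hung thread fetching " + ACTIVE.get(e.getKey()));
                e.getKey().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        begin("http://slow.example/xyz.pdf");
        // pretend the fetch started 10 s ago to trigger the reaper
        STARTED.put(Thread.currentThread(), System.currentTimeMillis() - 10_000);
        reapHung(5_000);
        assert Thread.currentThread().isInterrupted();
        Thread.interrupted();   // clear the flag again
        end();
    }
}
```

Logging the URL at reap time is what makes the repeat offenders trackable; without it, the same document causes a hung thread again on every cycle.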



[jira] [Resolved] (NUTCH-1018) Solr Document Size Limit

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1018.
-

Resolution: Won't Fix

Looks like a plugin is the solution here. Closing as won't fix. 

 Solr Document Size Limit
 

 Key: NUTCH-1018
 URL: https://issues.apache.org/jira/browse/NUTCH-1018
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Mark Achee
Priority: Minor
  Labels: solr

 There should be an option, perhaps named solr.content.limit, that defines the 
 max size of documents added to Solr.  I've had issues with large documents in 
 Solr, so I set the file.content.limit to 2MB. However, this causes many 
 files (mostly PDFs) to not be parsed, because only part of the document is 
 retrieved. With this new option, I could still correctly parse them, but only 
 index the first 2MB (or however large the limit is set) in Solr.
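The requested option amounts to truncating only the text sent to Solr while the parser still sees the whole document. A minimal sketch, with solr.content.limit as the hypothetical property:

```java
// Sketch of a hypothetical solr.content.limit: the parser still sees the
// whole document, but the text field sent to Solr is truncated separately,
// so large PDFs parse correctly while the index stays bounded.
class SolrContentLimit {
    static String truncate(String text, int maxChars) {
        return text.length() <= maxChars ? text : text.substring(0, maxChars);
    }

    public static void main(String[] args) {
        String page = "x".repeat(3_000_000);           // ~3 MB of parsed text
        assert truncate(page, 2_000_000).length() == 2_000_000;
        assert truncate("short", 2_000_000).equals("short");
    }
}
```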



[jira] [Resolved] (NUTCH-1007) Add readdb -host output

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1007.
-

Resolution: Won't Fix

This is not a problem and, as Markus mentioned, the DomainStatistics tool 
already does a pretty good job of this. 

 Add readdb -host output
 ---

 Key: NUTCH-1007
 URL: https://issues.apache.org/jira/browse/NUTCH-1007
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Affects Versions: 1.4
Reporter: MilleBii
Priority: Minor

 I have created an enhancement for the readdb feature which computes, for each 
 host in the crawldb, the number of URLs belonging to that host.
 I think it could be valuable for many people, as a way to see what is in the 
 crawldb.
 Like -dump or -topN, the proposed syntax would be: readdb -host output
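
As a rough illustration of the aggregation such a -host report would perform. The real tool would run as a MapReduce job over the CrawlDb; the class and method names below are hypothetical, in-memory stand-ins:

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HostCounts {
    // Count how many URLs each host contributes, preserving first-seen order.
    public static Map<String, Integer> countByHost(Iterable<String> urls) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String url : urls) {
            counts.merge(URI.create(url).getHost(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countByHost(List.of(
                "http://nutch.apache.org/",
                "http://nutch.apache.org/downloads.html",
                "http://lucene.apache.org/"));
        System.out.println(counts); // {nutch.apache.org=2, lucene.apache.org=1}
    }
}
```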

--


[jira] [Commented] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines

2013-01-12 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552028#comment-13552028
 ] 

Sebastian Nagel commented on NUTCH-1499:


So, a vote for won't fix. Comments?

 Usage of multiple ipv4 addresses and network cards on fetcher machines
 --

 Key: NUTCH-1499
 URL: https://issues.apache.org/jira/browse/NUTCH-1499
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 1.5.1
Reporter: Walter Tietze
Priority: Minor
 Fix For: 1.7

 Attachments: apache-nutch-1.5.1.NUTCH-1499.patch


 Adds the ability for fetcher threads to use multiple configured IPv4 
 addresses.
 On some cluster machines several IPv4 addresses are configured, each 
 associated with its own network interface.
 This patch enables protocol-http and protocol-httpclient to use these 
 network interfaces in a round-robin fashion.
 If the feature is enabled, a helper class reads the network configuration at 
 *startup*. For each HTTP connection the next IP address is taken. This method 
 is synchronized, but that should not be a bottleneck for the overall 
 performance of the fetcher threads.
 This feature has been tested on our cluster with the protocol-http and 
 protocol-httpclient plugins.
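
The round-robin selection described above can be sketched like this. This is a simplified stand-in: the class name is hypothetical, and the address list, which the patch reads from the network configuration at startup, is passed in directly here:

```java
public class RoundRobinAddresses {
    private final String[] addresses; // local IPv4 addresses, read at startup
    private int index = 0;

    public RoundRobinAddresses(String... addresses) {
        this.addresses = addresses.clone();
    }

    // Synchronized so concurrent fetcher threads each receive the next
    // address in turn; the lock is held only briefly, so it is negligible
    // next to the network I/O each connection performs.
    public synchronized String next() {
        String addr = addresses[index];
        index = (index + 1) % addresses.length;
        return addr;
    }

    public static void main(String[] args) {
        RoundRobinAddresses rr =
                new RoundRobinAddresses("192.168.0.10", "192.168.0.11");
        System.out.println(rr.next()); // prints "192.168.0.10"
        System.out.println(rr.next()); // prints "192.168.0.11"
        System.out.println(rr.next()); // wraps around: "192.168.0.10"
    }
}
```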
  

--


[jira] [Resolved] (NUTCH-1316) create EmbeddedNutchInstance testing utility class

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1316.
-

Resolution: Won't Fix

We already have a testing class relating to Fetching (which is what we care 
about here).
http://svn.apache.org/repos/asf/nutch/trunk/src/test/org/apache/nutch/fetcher/TestFetcher.java
Closing as won't fix.

 create EmbeddedNutchInstance testing utility class
 --

 Key: NUTCH-1316
 URL: https://issues.apache.org/jira/browse/NUTCH-1316
 Project: Nutch
  Issue Type: New Feature
Reporter: Lewis John McGibbney
Priority: Minor
  Labels: test

 I propose to create a new testing utility class called EmbeddedNutchInstance 
 which provides two main methods: setup and teardown. This will take the pain 
 out of firing up Nutch test instances in distributed environments and will 
 enable us to test Nutch on the BigTop environment.

--


[jira] [Updated] (NUTCH-1313) Nutch trunk add response headers to datastore for the protocol-httpclient plugin

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1313:


Fix Version/s: 1.7

 Nutch trunk add response headers to datastore for the protocol-httpclient 
 plugin
 

 Key: NUTCH-1313
 URL: https://issues.apache.org/jira/browse/NUTCH-1313
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Ferdy Galema
Priority: Minor
 Fix For: 1.7


 For tracking progress the port of NUTCH-1311 to Nutch trunk.

--

