[jira] Created: (NUTCH-835) document deduplication (exact duplicates) failed using MD5Signature

2010-06-25 Thread Sebastian Nagel (JIRA)
document deduplication (exact duplicates) failed using MD5Signature
---

 Key: NUTCH-835
 URL: https://issues.apache.org/jira/browse/NUTCH-835
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1, 1.0.0
 Environment: Linux, Ubuntu 10.04, Java 1.6.0_20
Reporter: Sebastian Nagel


The MD5Signature class calculates different signatures for identical documents.

The reason is that
  byte[] data = content.getContent();
  ... StringBuilder().append(data) ...
uses java.lang.Object.toString() to get a string representation of the (binary)
content, which results in unique, identity-based strings (e.g., [B@30dc9065)
even for two byte arrays with identical content.

A solution would be to take the MD5 sum of the binary content as the first
part of the final signature calculation (the parsed content is the second
part):
  ... .append(StringUtil.toHexString(MD5Hash.digest(data).getDigest())).append(parse.getText());
Of course, there are many other solutions...
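
For illustration, a minimal sketch of a fixed calculate() method (assuming the
usual Signature interface; this is just one of the possible solutions):
{code}
import org.apache.hadoop.io.MD5Hash;
import org.apache.nutch.crawl.Signature;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.StringUtil;

public class MD5Signature extends Signature {
  public byte[] calculate(Content content, Parse parse) {
    byte[] data = content.getContent();
    StringBuilder buf = new StringBuilder();
    if (data != null) {
      // hash the binary content itself; appending the raw byte[] would only
      // append Object.toString(), e.g. "[B@30dc9065"
      buf.append(StringUtil.toHexString(MD5Hash.digest(data).getDigest()));
    }
    buf.append(parse.getText());
    return MD5Hash.digest(buf.toString()).getDigest();
  }
}
{code}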

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-862) HttpClient null pointer exception

2010-07-27 Thread Sebastian Nagel (JIRA)
HttpClient null pointer exception
-

 Key: NUTCH-862
 URL: https://issues.apache.org/jira/browse/NUTCH-862
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: linux, java 6
Reporter: Sebastian Nagel
Priority: Minor


When re-fetching a document (a continued crawl) HttpClient throws a null 
pointer exception, causing the document to be emptied:

2010-07-27 12:45:09,199 INFO  fetcher.Fetcher - fetching 
http://localhost/doc/selfhtml/html/index.htm
2010-07-27 12:45:09,203 ERROR httpclient.Http - java.lang.NullPointerException
2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:138)
2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:220)
2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:537)
2010-07-27 12:45:09,204 INFO  fetcher.Fetcher - fetch of 
http://localhost/doc/selfhtml/html/index.htm failed with: 
java.lang.NullPointerException

Because the document is re-fetched the server answers "304" (not modified):

127.0.0.1 - - [27/Jul/2010:12:45:09 +0200] "GET /doc/selfhtml/html/index.htm 
HTTP/1.0" 304 174 "-" "Nutch-1.0"

No content is sent in this case (empty http body).

Index: 
trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
===
--- 
trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
(revision 979647)
+++ 
trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
(working copy)
@@ -134,7 +134,8 @@
       if (code == 200) throw new IOException(e.toString());
       // for codes other than 200 OK, we are fine with empty content
     } finally {
-      in.close();
+      if (in != null)
+        in.close();
       get.abort();
     }


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-862) HttpClient null pointer exception

2010-07-27 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-862:
--

Attachment: NUTCH-862.patch

patch

> HttpClient null pointer exception
> -
>
> Key: NUTCH-862
> URL: https://issues.apache.org/jira/browse/NUTCH-862
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.0.0
> Environment: linux, java 6
>Reporter: Sebastian Nagel
>Priority: Minor
> Attachments: NUTCH-862.patch
>
>
> When re-fetching a document (a continued crawl) HttpClient throws a null 
> pointer exception, causing the document to be emptied:
> 2010-07-27 12:45:09,199 INFO  fetcher.Fetcher - fetching 
> http://localhost/doc/selfhtml/html/index.htm
> 2010-07-27 12:45:09,203 ERROR httpclient.Http - java.lang.NullPointerException
> 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
> org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:138)
> 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
> org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
> 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:220)
> 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:537)
> 2010-07-27 12:45:09,204 INFO  fetcher.Fetcher - fetch of 
> http://localhost/doc/selfhtml/html/index.htm failed with: 
> java.lang.NullPointerException
> Because the document is re-fetched the server answers "304" (not modified):
> 127.0.0.1 - - [27/Jul/2010:12:45:09 +0200] "GET /doc/selfhtml/html/index.htm 
> HTTP/1.0" 304 174 "-" "Nutch-1.0"
> No content is sent in this case (empty http body).
> Index: 
> trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
> ===
> --- 
> trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
> (revision 979647)
> +++ 
> trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
> (working copy)
> @@ -134,7 +134,8 @@
>        if (code == 200) throw new IOException(e.toString());
>        // for codes other than 200 OK, we are fine with empty content
>      } finally {
> -      in.close();
> +      if (in != null)
> +        in.close();
>        get.abort();
>      }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-933) Fetcher does not save a page's Last-Modified value in CrawlDatum

2010-11-10 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930588#action_12930588
 ] 

Sebastian Nagel commented on NUTCH-933:
---

The modifiedTime stored in a CrawlDatum record is not the "Last-Modified" time 
sent by the responding server (or the time stamp of a file, in case 
protocol-file is used) but the time a document was fetched.

Is there any reason? 

Determining the "Last-Modified" time is somewhat difficult since it may be 
specified in the HTTP header or in HTML as a meta tag 
(<meta http-equiv="Last-Modified" ...>). But it would be nice-to-have 
information. In addition, the index-more indexing filter, which provides a 
field "lastModified", does not do the job very well: it should take the value 
from the content metadata (which seems to be mostly correct) and not from the 
parse metadata.

Besides, re-crawling with If-Modified-Since is not affected: it makes no 
difference that the time of the last fetch is sent, because a document must 
be re-fetched only if it has been modified since the last fetch.
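
A sketch of the content-metadata variant for the indexing filter (field name
and surrounding variables are illustrative, not the actual index-more code):
{code}
// prefer the HTTP "Last-Modified" header stored in the content metadata
String lastModified = content.getMetadata().get(Metadata.LAST_MODIFIED);
if (lastModified != null) {
  try {
    long time = HttpDateFormat.toLong(lastModified); // RFC 1123 date
    doc.add("lastModified", String.valueOf(time));
  } catch (ParseException e) {
    // leave the field out if the date cannot be parsed
  }
}
{code}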

> Fetcher does not save a page's Last-Modified value in CrawlDatum
> ---
>
> Key: NUTCH-933
> URL: https://issues.apache.org/jira/browse/NUTCH-933
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.2
>Reporter: Joe Kemp
>
> I added the following code in the output method just after the 
> if (content != null) statement:
> String lastModified = metadata.get("Last-Modified");
> if (lastModified != null && !lastModified.equals("")) {
>   try {
>     Date lastModifiedDate = DateUtil.parseDate(lastModified);
>     datum.setModifiedTime(lastModifiedDate.getTime());
>   } catch (DateParseException e) {
>   }
> }
> I now get 304 for pages that haven't changed when I recrawl.  Need to do 
> further testing.  Might also need a configuration parameter to turn off this 
> behavior, allowing pages to be forced to be refreshed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-962) max. redirects not handled correctly: fetcher stops at max-1 redirects

2011-01-26 Thread Sebastian Nagel (JIRA)
max. redirects not handled correctly: fetcher stops at max-1 redirects
--

 Key: NUTCH-962
 URL: https://issues.apache.org/jira/browse/NUTCH-962
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.2, 1.3, 2.0
Reporter: Sebastian Nagel


The fetcher stops following redirects one redirect before the max. redirects is 
reached.

The description of http.redirect.max
> The maximum number of redirects the fetcher will follow when
> trying to fetch a page. If set to negative or 0, fetcher won't immediately
> follow redirected URLs, instead it will record them for later fetching.
suggests that if set to 1, one redirect will be followed.

I tried to crawl two documents, the first redirecting to the second by a meta 
refresh tag (<meta http-equiv="refresh" ...>), with http.redirect.max = 1.
The second document is not fetched and the URL has state GONE in CrawlDb.

fetching file:/test/redirects/meta_refresh.html
redirectCount=0
-finishing thread FetcherThread, activeThreads=1
 - content redirect to file:/test/redirects/to/meta_refresh_target.html 
(fetching now)
 - redirect count exceeded file:/test/redirects/to/meta_refresh_target.html

The attached patch would fix this: if http.redirect.max is 1, one redirect is 
followed. Of course, this would mean there is no possibility to skip redirects 
entirely, since 0 (as well as negative values) means "treat redirects as 
ordinary links".
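
A sketch of the off-by-one in the redirect check (assuming the check in
Fetcher's FetcherThread; the exact change is in the attached patch):
{code}
// before: the last allowed redirect is already rejected
//   if (redirecting && redirectCount >= maxRedirect) { ... }
// sketch of the fix: reject only when the count exceeds the maximum,
// so that http.redirect.max = 1 really follows one redirect
if (redirecting && redirectCount > maxRedirect) {
  LOG.info(" - redirect count exceeded " + url);
  output(url, datum, null, ProtocolStatus.STATUS_REDIR_EXCEEDED,
      CrawlDatum.STATUS_FETCH_GONE);
}
{code}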



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-962) max. redirects not handled correctly: fetcher stops at max-1 redirects

2011-01-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-962:
--

Attachment: Fetcher_redir.patch

patch for 1.3 to interpret the count of redirects literally:
 http.redirect.max = 0 (or negative) :: treat redirects as ordinary links
 http.redirect.max = 1 :: follow max. 1 redirect
 http.redirect.max = 2 :: follow max. 2 redirects, etc.

> max. redirects not handled correctly: fetcher stops at max-1 redirects
> --
>
> Key: NUTCH-962
> URL: https://issues.apache.org/jira/browse/NUTCH-962
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.2, 1.3, 2.0
>Reporter: Sebastian Nagel
> Attachments: Fetcher_redir.patch
>
>
> The fetcher stops following redirects one redirect before the max. redirects 
> is reached.
> The description of http.redirect.max
> > The maximum number of redirects the fetcher will follow when
> > trying to fetch a page. If set to negative or 0, fetcher won't immediately
> > follow redirected URLs, instead it will record them for later fetching.
> suggests that if set to 1, one redirect will be followed.
> I tried to crawl two documents, the first redirecting to the second by a meta 
> refresh tag (<meta http-equiv="refresh" ...>), with http.redirect.max = 1.
> The second document is not fetched and the URL has state GONE in CrawlDb.
> fetching file:/test/redirects/meta_refresh.html
> redirectCount=0
> -finishing thread FetcherThread, activeThreads=1
>  - content redirect to file:/test/redirects/to/meta_refresh_target.html 
> (fetching now)
>  - redirect count exceeded file:/test/redirects/to/meta_refresh_target.html
> The attached patch would fix this: if http.redirect.max is 1, one redirect 
> is followed.
> Of course, this would mean there is no possibility to skip redirects 
> entirely, since 0 (as well as negative values) means "treat redirects as 
> ordinary links".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] [Created] (NUTCH-1344) BasicURLNormalizer to normalize https same as http

2012-04-21 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1344:
--

 Summary: BasicURLNormalizer to normalize https same as http 
 Key: NUTCH-1344
 URL: https://issues.apache.org/jira/browse/NUTCH-1344
 Project: Nutch
  Issue Type: Bug
Affects Versions: nutchgora, 1.6
Reporter: Sebastian Nagel


Most of the normalization done by BasicURLNormalizer (lowercasing the host, 
removing the default port, removal of page anchors, cleaning "." and ".." in 
the path) is not done for URLs with protocol https.
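
A sketch of the underlying protocol guard (an assumption about how
BasicURLNormalizer decides whether to normalize; the actual change is in the
attached patch):
{code}
String protocol = url.getProtocol();
// treat https (and ftp) the same as http instead of skipping them
boolean normalize = "http".equals(protocol) || "https".equals(protocol)
    || "ftp".equals(protocol);
if (normalize) {
  // lowercase host, strip default port, remove anchor, resolve "." / ".."
}
{code}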

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1344) BasicURLNormalizer to normalize https same as http

2012-04-21 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1344:
---

Attachment: NUTCH-1344.patch

> BasicURLNormalizer to normalize https same as http 
> ---
>
> Key: NUTCH-1344
> URL: https://issues.apache.org/jira/browse/NUTCH-1344
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.6
>Reporter: Sebastian Nagel
> Attachments: NUTCH-1344.patch
>
>
> Most of the normalization done by BasicURLNormalizer (lowercasing the host, 
> removing the default port, removal of page anchors, cleaning "." and ".." in 
> the path) is not done for URLs with protocol https.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1339) Default URL normalization rules to remove page anchors completely

2012-04-21 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258827#comment-13258827
 ] 

Sebastian Nagel commented on NUTCH-1339:


BasicURLNormalizer does not remove the anchor for https URLs (NUTCH-1344).
At least, in my case this was the real reason for the large number of bad URLs.

The only motivation not to remove the anchor completely is the rare case where 
anchor and query parameters are accidentally swapped.

> Default URL normalization rules to remove page anchors completely
> -
>
> Key: NUTCH-1339
> URL: https://issues.apache.org/jira/browse/NUTCH-1339
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.6
>Reporter: Sebastian Nagel
> Attachments: NUTCH-1339-2.patch, NUTCH-1339.patch
>
>
> The default rules of URLNormalizerRegex remove the anchor up to the first
> occurrence of ? or &. The remaining part of the anchor is kept,
> which may cause a large, possibly infinite number of outlinks when the same 
> document is fetched again and again under different URLs,
> see http://www.mail-archive.com/user%40nutch.apache.org/msg05940.html
> Parameters in inner-page anchors are a common practice in AJAX web sites.
> Currently, crawling AJAX content is not supported (NUTCH-1323).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1293) IndexingFiltersChecker to store detected content type in crawldatum metadata

2012-04-26 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263124#comment-13263124
 ] 

Sebastian Nagel commented on NUTCH-1293:


The content type should be added to the metadata only after the check for 
content == null (see the sketch below).

{noformat}
% nutch indexchecker file:/
fetching: file:/
org.apache.nutch.protocol.file.FileError: File Error: 404
   ...
Exception in thread "main" java.lang.NullPointerException at 
org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
{noformat}
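
A sketch of the reordering (variable names follow the usual checker code and
are assumptions):
{code}
Content content = protocol.getProtocolOutput(new Text(url), datum).getContent();
if (content == null) {
  System.err.println("No content for " + url);
  return -1;
}
// only after the null check it is safe to read the detected content type
// and to store it in the crawl datum metadata (NUTCH-1259)
String contentType = content.getContentType();
{code}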

> IndexingFiltersChecker to store detected content type in crawldatum metadata
> 
>
> Key: NUTCH-1293
> URL: https://issues.apache.org/jira/browse/NUTCH-1293
> Project: Nutch
>  Issue Type: Bug
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-1293-1.5-1.patch
>
>
> NUTCH-1259 is not implemented in the checker.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1323) AjaxNormalizer

2012-05-12 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273954#comment-13273954
 ] 

Sebastian Nagel commented on NUTCH-1323:


After a small test crawl on http://si.draagle.com:
# usage is cumbersome because you have to think carefully about the steps in 
which to normalize URLs. This is because AjaxNormalizer acts as a flip-flop: 
hashbang URLs are escaped, escaped ones are unescaped. If URLs are normalized 
during parsing and then again during the CrawlDb update, you get the hashbang 
URL back.
# relative hashbang links are not resolved correctly. The outlink of
{noformat}
 base: http://si.draagle.com/?_escaped_fragment_=browse/group/root/
 
{noformat}
should be
{noformat}
http://si.draagle.com/?_escaped_fragment_=static/draagle_pogoji_uporabe.html
{noformat}
and certainly not
{noformat}
http://si.draagle.com/?_escaped_fragment_=browse/group/root/&_escaped_fragment_=static/draagle_pogoji_uporabe.html
{noformat}
# the outlink set of one page with an escaped base URL may contain escaped and 
unescaped URLs simultaneously, as results of
** a relative link without hashbang
** a global link with hashbang

If I understood it right:
* URLs with escaped fragments are used
** in crawlDb, segments, linkDb (URL acts as key)
** for fetching
* unescaped hashbang URLs
** are used in the index (and shown to the user)
** may appear in outlinks, redirects, and seeds

Couldn't we bind the decision whether to (un)escape to the current normalizer 
scope (see the sketch below):
* if URL contains #!
  and scope is one of { inject, fetcher/redirect, outlink, ?crawldb/update? }
  => escape
* if URL contains _escaped_fragment_=
  and scope is index
  => unescape
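
A minimal sketch of such a scope-bound normalizer (scope names are
assumptions; per the Google scheme, escaping would additionally URL-encode the
fragment value, which is omitted here):
{code}
public String normalize(String urlString, String scope) {
  if (urlString.contains("#!")
      && (SCOPE_INJECT.equals(scope) || SCOPE_OUTLINK.equals(scope)
          || SCOPE_FETCHER.equals(scope) || SCOPE_CRAWLDB.equals(scope))) {
    // escape: crawl-internal representation (CrawlDb, segments) and fetching
    return urlString.replaceFirst("#!", "?_escaped_fragment_=");
  }
  if (urlString.contains("_escaped_fragment_=") && SCOPE_INDEXER.equals(scope)) {
    // unescape: the index should carry the user-facing AJAX URL
    return urlString.replaceFirst("\\?_escaped_fragment_=", "#!");
  }
  return urlString;
}
{code}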


> AjaxNormalizer
> --
>
> Key: NUTCH-1323
> URL: https://issues.apache.org/jira/browse/NUTCH-1323
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1323-1.6-1.patch
>
>
> A two-way normalizer for Nutch able to deal with AJAX URL's, converting them 
> to _escaped_fragment_ URL's and back to an AJAX URL.
> https://developers.google.com/webmasters/ajax-crawling/

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (NUTCH-1383) IndexingFiltersChecker to show error message instead of null pointer exception

2012-06-09 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1383:
--

 Summary: IndexingFiltersChecker to show error message instead of 
null pointer exception
 Key: NUTCH-1383
 URL: https://issues.apache.org/jira/browse/NUTCH-1383
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.5, 1.6
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 1.6


IndexingFiltersChecker may throw null pointer exceptions if
# the content returned by the protocol implementation is null (artifact of 
NUTCH-1293)
# one of the indexing filters sets doc to null (the IndexingFilter interface 
allows excluding documents by returning null, cf. the IndexingFilter of 
NUTCH-966); a sketch of this guard follows below
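
Assuming the usual checker flow (names and the message text are illustrative):
{code}
NutchDocument doc = new NutchDocument();
doc = indexers.filter(doc, parse, new Text(url), datum, inlinks);
if (doc == null) {
  System.out.println("Document excluded by an indexing filter, not indexed.");
  return 0;
}
{code}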


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1383) IndexingFiltersChecker to show error message instead of null pointer exception

2012-06-09 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1383:
---

Attachment: NUTCH-1383.patch

patch for both null pointer exceptions

> IndexingFiltersChecker to show error message instead of null pointer exception
> --
>
> Key: NUTCH-1383
> URL: https://issues.apache.org/jira/browse/NUTCH-1383
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.5, 1.6
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1383.patch
>
>
> IndexingFiltersChecker may throw null pointer exceptions if
> # the content returned by the protocol implementation is null (artifact of 
> NUTCH-1293)
> # one of the indexing filters sets doc to null (the IndexingFilter interface 
> allows excluding documents by returning null, cf. the IndexingFilter of 
> NUTCH-966)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (NUTCH-1389) parsechecker and indexchecker to report truncated content

2012-06-12 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1389:
--

 Summary: parsechecker and indexchecker to report truncated content
 Key: NUTCH-1389
 URL: https://issues.apache.org/jira/browse/NUTCH-1389
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Affects Versions: 1.5, nutchgora
Reporter: Sebastian Nagel
Priority: Minor


ParserChecker and IndexingFiltersChecker should report when a document is 
truncated due to {http,file,ftp}.content.limit.
Truncated content may cause text and metadata extraction to fail for PDF and 
other binary document formats.
A hint that truncation (and not a broken plugin) is the possible reason would 
be useful.
See NUTCH-965 and {{ParseSegment.isTruncated(content)}}.
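
A sketch of such a check in the checkers ({{ParseSegment.isTruncated}} as
referenced above; the warning text is illustrative):
{code}
if (ParseSegment.isTruncated(content)) {
  // hint that a parse failure may stem from truncation, not a broken plugin
  System.out.println("WARNING: content was truncated at "
      + content.getContent().length + " bytes, see *.content.limit");
}
{code}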

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-06-30 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1415:
--

 Summary: release packages to contain top level folder 
apache-nutch-x.x
 Key: NUTCH-1415
 URL: https://issues.apache.org/jira/browse/NUTCH-1415
 Project: Nutch
  Issue Type: Bug
Affects Versions: nutchgora, 1.6, 1.5.1
Reporter: Sebastian Nagel
Priority: Minor


The release packages should contain a top level folder named apache-nutch-x.x 
(x replaced by major and minor version) as in previous releases. Unpacking the 
packages from the command line via tar xvfz package.tar.gz or unzip package.zip 
should place all files in that folder. Cf. discussions on mailing lists:
* 
http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E
* 
http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-06-30 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1415:
---

Attachment: NUTCH-1415.patch

Fix ant targets tar-src, tar-bin, zip-src, zip-bin
Also set appropriate permissions for bin/nutch

> release packages to contain top level folder apache-nutch-x.x
> -
>
> Key: NUTCH-1415
> URL: https://issues.apache.org/jira/browse/NUTCH-1415
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.6, 1.5.1
>Reporter: Sebastian Nagel
>Priority: Minor
> Attachments: NUTCH-1415.patch
>
>
> The release packages should contain a top level folder named apache-nutch-x.x 
> (x replaced by major and minor version) as in previous releases. Unpacking 
> the packages from the command line via tar xvfz package.tar.gz or unzip 
> package.zip should place all files in that folder. Cf. discussions on mailing 
> lists:
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-07-03 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1415:
---

Attachment: NUTCH-1415-2.patch

Hi Lewis, you are completely right: the tarfileset / zipfileset of the *-bin 
targets are missing the parameter
  prefix="${final.name}"
Here is a corrected patch; alternatively, add the prefix parameter manually 
in the four places.

> release packages to contain top level folder apache-nutch-x.x
> -
>
> Key: NUTCH-1415
> URL: https://issues.apache.org/jira/browse/NUTCH-1415
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.6, 1.5.1
>Reporter: Sebastian Nagel
>Priority: Minor
> Attachments: NUTCH-1415-2.patch, NUTCH-1415.patch
>
>
> The release packages should contain a top level folder named apache-nutch-x.x 
> (x replaced by major and minor version) as in previous releases. Unpacking 
> the packages from the command line via tar xvfz package.tar.gz or unzip 
> package.zip should place all files in that folder. Cf. discussions on mailing 
> lists:
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (NUTCH-1419) parsechecker and indexchecker to report protocol status

2012-07-03 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1419:
--

 Summary: parsechecker and indexchecker to report protocol status
 Key: NUTCH-1419
 URL: https://issues.apache.org/jira/browse/NUTCH-1419
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Affects Versions: nutchgora, 1.6
Reporter: Sebastian Nagel
Priority: Minor


Parsechecker and indexchecker should report the protocol status when the fetch 
was not successful (status other than 200/ok).

In case of a redirect, the protocol status contains the URL the redirect 
points to. Usually, this URL should be checked instead of the original one, 
which is not indexed. The content of a redirect response is less useful (and 
often empty):
{code}
% nutch indexchecker http://lucene.apache.org/nutch/
fetching: http://lucene.apache.org/nutch/
parsing: http://lucene.apache.org/nutch/
contentType: text/html
content :   301 Moved Permanently Moved Permanently The document has moved 
here . Apache/2.4.1 (Unix) OpenSSL/1.
title : 301 Moved Permanently
host :  lucene.apache.org
tstamp :Tue Jul 03 13:27:32 CEST 2012
url :   http://lucene.apache.org/nutch/
{code}


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1419) parsechecker and indexchecker to report protocol status

2012-07-03 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1419:
---

Attachment: NUTCH-1419-1.patch

Simple patch: in case of a protocol status other than 200 (success):
# report the protocol status
# exit (since those documents are not parsed and indexed when crawling, 
parsechecker and indexchecker should behave similarly to an "ordinary" 
crawl); see the sketch below
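
A sketch of the check (assuming the checker's main flow, before parsing):
{code}
ProtocolOutput output = protocol.getProtocolOutput(new Text(url), datum);
ProtocolStatus pstatus = output.getStatus();
if (!pstatus.isSuccess()) {
  // for redirects the status also contains the target URL
  System.out.println("Fetch failed with protocol status: " + pstatus);
  return -1;
}
Content content = output.getContent();
{code}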


> parsechecker and indexchecker to report protocol status
> ---
>
> Key: NUTCH-1419
> URL: https://issues.apache.org/jira/browse/NUTCH-1419
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, parser
>Affects Versions: nutchgora, 1.6
>Reporter: Sebastian Nagel
>Priority: Minor
> Attachments: NUTCH-1419-1.patch
>
>
> Parsechecker and indexchecker should report the protocol status when the 
> fetch was not successful (status other than 200/ok).
> In case of a redirect, the protocol status contains the URL a redirect points 
> to. Usually, this URL should be checked instead of the original one which is 
> not indexed. The content of a redirect response is less useful (and often 
> empty):
> {code}
> % nutch indexchecker http://lucene.apache.org/nutch/
> fetching: http://lucene.apache.org/nutch/
> parsing: http://lucene.apache.org/nutch/
> contentType: text/html
> content :   301 Moved Permanently Moved Permanently The document has 
> moved here . Apache/2.4.1 (Unix) OpenSSL/1.
> title : 301 Moved Permanently
> host :  lucene.apache.org
> tstamp :Tue Jul 03 13:27:32 CEST 2012
> url :   http://lucene.apache.org/nutch/
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (NUTCH-1421) RegexURLNormalizer to only skip rules with invalid patterns

2012-07-05 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1421:
--

 Summary: RegexURLNormalizer to only skip rules with invalid 
patterns
 Key: NUTCH-1421
 URL: https://issues.apache.org/jira/browse/NUTCH-1421
 Project: Nutch
  Issue Type: Improvement
Affects Versions: nutchgora, 1.6
Reporter: Sebastian Nagel
Priority: Minor


If a regex-normalize.xml file contains one rule with a syntactically invalid 
regular expression pattern, all rules are discarded and no normalization is 
done. 

RegexURLNormalizer should skip only the invalid rule (with a detailed error 
message) and use all other (valid) rules.
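
A sketch of per-rule error handling when reading regex-normalize.xml
(assumption: rules are compiled one by one; Rule stands for the normalizer's
internal rule holder):
{code}
try {
  Pattern pattern = Pattern.compile(patternString);
  rules.add(new Rule(pattern, substitution));
} catch (PatternSyntaxException e) {
  LOG.error("Skipping rule with invalid pattern '" + patternString + "': "
      + e.getMessage());
  // keep all other (valid) rules instead of discarding the whole set
}
{code}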

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1421) RegexURLNormalizer to only skip rules with invalid patterns

2012-07-05 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1421:
---

Attachment: NUTCH-1421-1.patch

> RegexURLNormalizer to only skip rules with invalid patterns
> ---
>
> Key: NUTCH-1421
> URL: https://issues.apache.org/jira/browse/NUTCH-1421
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: nutchgora, 1.6
>Reporter: Sebastian Nagel
>Priority: Minor
> Attachments: NUTCH-1421-1.patch
>
>
> If a regex-normalize.xml file contains one rule with a syntactically invalid 
> regular expression pattern, all rules are discarded and no normalization is 
> done. 
> RegexURLNormalizer should skip only the invalid rule (with a detailed error 
> message) and use all other (valid) rules.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (NUTCH-1422) reset signature for redirects

2012-07-06 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1422:
--

 Summary: reset signature for redirects
 Key: NUTCH-1422
 URL: https://issues.apache.org/jira/browse/NUTCH-1422
 Project: Nutch
  Issue Type: Bug
  Components: crawldb, fetcher
Affects Versions: 1.4
Reporter: Sebastian Nagel
 Fix For: 1.6


In a long running continuous crawl with Nutch 1.4, URLs with an HTTP redirect 
(http.redirect.max = 0) are kept as not-modified in the CrawlDb. Short 
timeline (cf. attached dumped segment / CrawlDb data):
 2012-02-23 :  injected
 2012-02-24 :  fetched
 2012-03-30 :  re-fetched, signature changed
 2012-04-20 :  re-fetched, redirected
 2012-04-24 :  in CrawlDb as db_notmodified, still indexed with old content!

The signature of a previously fetched document is not reset when the URL/doc is 
changed to a redirect at a later time. CrawlDbReducer.reduce then sets the 
status to db_notmodified because the new signature coming in with the fetch 
status is identical to the old one.

Possible fixes (the first one is sketched below):
* reset the signature in Fetcher
* handle this case in CrawlDbReducer.reduce
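
A sketch of the first option (assuming Fetcher.output(); whether a cleared
signature is handled gracefully downstream would need to be verified):
{code}
if (status == CrawlDatum.STATUS_FETCH_REDIR_PERM
    || status == CrawlDatum.STATUS_FETCH_REDIR_TEMP) {
  // a redirect response carries no usable content: clear the old signature
  // so that CrawlDbReducer cannot match it and set db_notmodified
  datum.setSignature(null);
}
{code}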


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1422) reset signature for redirects

2012-07-06 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1422:
---

Attachment: NUTCH-1422_redir_notmodified_log.txt

> reset signature for redirects
> -
>
> Key: NUTCH-1422
> URL: https://issues.apache.org/jira/browse/NUTCH-1422
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb, fetcher
>Affects Versions: 1.4
>Reporter: Sebastian Nagel
> Fix For: 1.6
>
> Attachments: NUTCH-1422_redir_notmodified_log.txt
>
>
> In a long running continuous crawl with Nutch 1.4, URLs with an HTTP redirect 
> (http.redirect.max = 0) are kept as not-modified in the CrawlDb. Short 
> timeline (cf. attached dumped segment / CrawlDb data):
>  2012-02-23 :  injected
>  2012-02-24 :  fetched
>  2012-03-30 :  re-fetched, signature changed
>  2012-04-20 :  re-fetched, redirected
>  2012-04-24 :  in CrawlDb as db_notmodified, still indexed with old content!
> The signature of a previously fetched document is not reset when the URL/doc 
> is changed to a redirect at a later time. CrawlDbReducer.reduce then sets the 
> status to db_notmodified because the new signature coming in with the fetch 
> status is identical to the old one.
> Possible fixes (??):
> * reset the signature in Fetcher
> * handle this case in CrawlDbReducer.reduce

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1328) a problem with regex-normalize.xml

2012-07-10 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410905#comment-13410905
 ] 

Sebastian Nagel commented on NUTCH-1328:


Duplicate of NUTCH-706

> a problem with regex-normalize.xml
> --
>
> Key: NUTCH-1328
> URL: https://issues.apache.org/jira/browse/NUTCH-1328
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: behnam nikbakht
>  Labels: parse
>
> there is a regex pattern in regex-normalize.xml:
> ([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$)
> that removes session ids from urls, but there are some sites, like:
> http://www.mehrnews.com/fa
> that have urls like:
> http://www.mehrnews.com/fa/newsdetail.aspx?NewsID=1567539
> and with this pattern this url is converted to an invalid url:
> http://www.mehrnews.com/fa/newsdetail.aspx?New

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-706) Url regex normalizer

2012-07-10 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-706:
--

Attachment: NUTCH-706.patch

- fix the pattern by adding an anchor prohibiting inner-word matches such as in 
New{color:red}sId{color}
- add test (see the illustration below)
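
A hypothetical regression test illustrating the intended behaviour (URLs and
the scope variable are made up):
{code}
// "sId" inside "newsId" must no longer match:
assertEquals("http://example.com/page.aspx?newsId=1567539&newsLang=en",
    normalizer.normalize(
        "http://example.com/page.aspx?newsId=1567539&newsLang=en", scope));
// a real session id is still stripped:
assertEquals("http://example.com/page",
    normalizer.normalize(
        "http://example.com/page;jsessionid=1E6FEC0D14D044541DD84D2D013D29ED",
        scope));
{code}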

> Url regex normalizer
> 
>
> Key: NUTCH-706
> URL: https://issues.apache.org/jira/browse/NUTCH-706
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Meghna Kukreja
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-706.patch
>
>
> Hey,
> I encountered the following problem while trying to crawl a site using
> nutch-trunk. In the file regex-normalize.xml, the following regex is
> used to remove session ids:
> ([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$).
> This pattern also transforms a url such as
> "&newsId=2000484784794&newsLang=en" into "&new&newsLang=en" (since it
> matches 'sId' in 'newsId'), which is incorrect and hence does not
> get fetched. This expression needs to be changed to prevent this.
> Thanks,
> Meghna

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (NUTCH-1436) bin/nutch absent in zip package

2012-07-23 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1436:
--

 Summary: bin/nutch absent in zip package
 Key: NUTCH-1436
 URL: https://issues.apache.org/jira/browse/NUTCH-1436
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.5.1
Reporter: Sebastian Nagel


The script bin/nutch is absent from the package apache-nutch-1.5.1-bin.zip; 
the tar-bin package is not affected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1436) bin/nutch absent in zip package

2012-07-23 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1436:
---

Attachment: NUTCH-1436.patch

Patch for branch-1.5.1 (if a new bin package is desired). For trunk the last 
patch of NUTCH-1415 is ok.

> bin/nutch absent in zip package
> ---
>
> Key: NUTCH-1436
> URL: https://issues.apache.org/jira/browse/NUTCH-1436
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.5.1
>Reporter: Sebastian Nagel
> Attachments: NUTCH-1436.patch
>
>
> The script bin/nutch is absent in the package apache-nutch-1.5.1-bin.zip,
> the tar-bin package is not affected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-706) Url regex normalizer

2012-08-08 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-706:
--

Attachment: NUTCH-706-2.patch

Second attempt at a patch. The first one does not remove:
{code}
?_sessionID=...
{code}
Added more tests to cover more types of real session ids and a further 
counterexample:
{code}
?addressid=...
{code}

> Url regex normalizer
> 
>
> Key: NUTCH-706
> URL: https://issues.apache.org/jira/browse/NUTCH-706
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Meghna Kukreja
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-706-2.patch, NUTCH-706.patch
>
>
> Hey,
> I encountered the following problem while trying to crawl a site using
> nutch-trunk. In the file regex-normalize.xml, the following regex is
> used to remove session ids:
> ([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$).
> This pattern also transforms a url such as
> "&newsId=2000484784794&newsLang=en" into "&new&newsLang=en" (since it
> matches 'sId' in 'newsId'), which is incorrect and hence does not
> get fetched. This expression needs to be changed to prevent this.
> Thanks,
> Meghna

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (NUTCH-1454) parsing chm failed

2012-08-14 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1454:
--

 Summary: parsing chm failed
 Key: NUTCH-1454
 URL: https://issues.apache.org/jira/browse/NUTCH-1454
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5.1
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 1.6, 2.1


(reported by Jan Riewe, see 
http://lucene.472066.n3.nabble.com/CHM-Files-and-Tika-td3999735.html)

Nutch fails to parse chm files with
{quote}
 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type 
application/vnd.ms-htmlhelp
{quote}
Tested with chm test files from Tika:
{code}
 % bin/nutch parsechecker 
file:/.../tika/trunk/tika-parsers/src/test/resources/test-documents/testChm.chm
{code}
Tika parses this document (but does not extract any content).


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (NUTCH-1455) RobotRulesParser to match multi-word user-agent names

2012-08-14 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1455:
--

 Summary: RobotRulesParser to match multi-word user-agent names
 Key: NUTCH-1455
 URL: https://issues.apache.org/jira/browse/NUTCH-1455
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.5.1
Reporter: Sebastian Nagel


If a user-agent name configured in http.robots.agents contains spaces, it is 
not matched even if it is contained verbatim in the robots.txt.

http.robots.agents = "Download Ninja,*"

If the robots.txt (http://en.wikipedia.org/robots.txt) contains
{code}
User-agent: Download Ninja
Disallow: /
{code}
all content should be forbidden. But it isn't:
{code}
% curl 'http://en.wikipedia.org/robots.txt' > robots.txt
% grep -A1 -i ninja robots.txt 
User-agent: Download Ninja
Disallow: /
% cat test.urls
http://en.wikipedia.org/
% bin/nutch plugin lib-http org.apache.nutch.protocol.http.api.RobotRulesParser 
robots.txt test.urls 'Download Ninja'
...
allowed:http://en.wikipedia.org/
{code}

The RFC (http://www.robotstxt.org/norobots-rfc.txt) states that
bq. The robot must obey the first record in /robots.txt that contains a 
User-Agent line whose value contains the name token of the robot as a
substring.
Given that "Download Ninja" is a substring of itself, it should match, and 
http://en.wikipedia.org/ should be forbidden.

The point is that the agent name from the User-Agent line is split at spaces, 
while the names from the http.robots.agents property are not (they are only 
split at ","); see the illustration below.
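
An illustration of the mismatch (the tokenization shown is an assumption
based on the description above, not the exact RobotRulesParser code):
{code}
String[] robotsTokens = "Download Ninja".toLowerCase().split("\\s+");
// -> { "download", "ninja" }
String[] configuredNames = "Download Ninja,*".toLowerCase().split(",");
// -> { "download ninja", "*" }
// "download ninja" equals neither "download" nor "ninja",
// so the record is never applied to the configured agent name
{code}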


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse multiValued metatags

2012-09-12 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454282#comment-13454282
 ] 

Sebastian Nagel commented on NUTCH-1467:


Since nutch.metadata.Metadata, NutchField, and SolrInputField are multi-valued, 
wouldn't it be preferable to keep the multiple values instead of concatenating 
them in advance? This would require changing HTMLMetaTags.generalTags so that 
it can store multiple values (see the sketch below).
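
A sketch of what that would look like (assumption: generalTags switched from
Properties to the multi-valued nutch.metadata.Metadata):
{code}
Metadata generalTags = new Metadata();
generalTags.add("keywords", "nutch");
generalTags.add("keywords", "crawler"); // second value kept, not overwritten
String[] values = generalTags.getValues("keywords"); // { "nutch", "crawler" }
{code}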

> nutch 1.5.1 not able to parse multiValued metatags
> --
>
> Key: NUTCH-1467
> URL: https://issues.apache.org/jira/browse/NUTCH-1467
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: kiran
>Priority: Minor
> Fix For: 1.6
>
> Attachments: patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using 
> http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when 
> there are two metatags with the same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it 
> work ?
> When there are two tags with the same name and different content, it takes 
> the value of the later tag and saves it rather than creating a multi-valued 
> field.
> Edit: I have attached the patch for the file and it is provided by DLA 
> (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-09-18 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-1415:
--

Assignee: Sebastian Nagel

> release packages to contain top level folder apache-nutch-x.x
> -
>
> Key: NUTCH-1415
> URL: https://issues.apache.org/jira/browse/NUTCH-1415
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.6, 1.5.1
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Attachments: NUTCH-1415-2.patch, NUTCH-1415.patch
>
>
> The release packages should contain a top level folder named apache-nutch-x.x 
> (x replaced by major and minor version) as in previous releases. Unpacking 
> the packages from the command line via tar xvfz package.tar.gz or unzip 
> package.zip should place all files in that folder. Cf. discussions on mailing 
> lists:
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-09-18 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457753#comment-13457753
 ] 

Sebastian Nagel commented on NUTCH-1415:


This has been fixed only for the 1.5.1 and 2.0 branches.
It should be fixed for trunk and 2.x before branching 2.1 and 1.6.
Are there any objections?
Otherwise I would apply the patches tonight and check the resulting
packages (cf. NUTCH-1436).

> release packages to contain top level folder apache-nutch-x.x
> -
>
> Key: NUTCH-1415
> URL: https://issues.apache.org/jira/browse/NUTCH-1415
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.6, 1.5.1
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Attachments: NUTCH-1415-2.patch, NUTCH-1415.patch
>
>
> The release packages should contain a top level folder named apache-nutch-x.x 
> (x replaced by major and minor version) as in previous releases. Unpacking 
> the packages from the command line via tar xvfz package.tar.gz or unzip 
> package.zip should place all files in that folder. Cf. discussions on mailing 
> lists:
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-09-18 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1415.


   Resolution: Fixed
Fix Version/s: 2.1
   1.6

committed to trunk (revision 1387357) and 2.x (revision 1387356)

> release packages to contain top level folder apache-nutch-x.x
> -
>
> Key: NUTCH-1415
> URL: https://issues.apache.org/jira/browse/NUTCH-1415
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.6, 1.5.1
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.6, 2.1
>
> Attachments: NUTCH-1415-2.patch, NUTCH-1415.patch
>
>
> The release packages should contain a top level folder named apache-nutch-x.x 
> (x replaced by major and minor version) as in previous releases. Unpacking 
> the packages from the command line via tar xvfz package.tar.gz or unzip 
> package.zip should place all files in that folder. Cf. discussions on mailing 
> lists:
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-706) Url regex normalizer

2012-10-02 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13467990#comment-13467990
 ] 

Sebastian Nagel commented on NUTCH-706:
---

Are there objections to applying and committing the patch? Tests pass for 
both trunk and 2.x.
The problem has been reported twice. Until there is a more sophisticated URL 
normalizer (see Ken Krugler's comment) there is no real alternative to 
improving the regex pattern.


> Url regex normalizer
> 
>
> Key: NUTCH-706
> URL: https://issues.apache.org/jira/browse/NUTCH-706
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Meghna Kukreja
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-706-2.patch, NUTCH-706.patch
>
>
> Hey,
> I encountered the following problem while trying to crawl a site using
> nutch-trunk. In the file regex-normalize.xml, the following regex is
> used to remove session ids:
> ([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$).
> This pattern also transforms a url such as
> "&newsId=2000484784794&newsLang=en" into "&new&newsLang=en" (since it
> matches 'sId' in 'newsId'), which is incorrect and hence does not
> get fetched. This expression needs to be changed to prevent this.
> Thanks,
> Meghna

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (NUTCH-1476) SegmentReader getStats should set parsed = -1 if no parsing took place

2012-10-08 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1476:
--

 Summary: SegmentReader getStats should set parsed = -1 if no 
parsing took place
 Key: NUTCH-1476
 URL: https://issues.apache.org/jira/browse/NUTCH-1476
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.6
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.6
 Attachments: NUTCH-1476.patch

The method getStats in SegmentReader sets the number of parsed documents (and 
also the number of parseErrors) to 0 if no parsing took place for a segment. 
The values should be set to -1, analogously to the number of fetched docs and 
fetchErrors.
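
A sketch of the intended behaviour (assuming getStats checks for the presence
of parse data, mirroring the handling of the fetched counts):
{code}
if (!fs.exists(new Path(segment, ParseData.DIR_NAME))) {
  stats.parsed = -1;      // no parse data present: report -1, not 0
  stats.parseErrors = -1;
}
{code}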

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1476) SegmentReader getStats should set parsed = -1 if no parsing took place

2012-10-08 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1476:
---

Attachment: NUTCH-1476.patch

> SegmentReader getStats should set parsed = -1 if no parsing took place
> --
>
> Key: NUTCH-1476
> URL: https://issues.apache.org/jira/browse/NUTCH-1476
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.6
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.6
>
> Attachments: NUTCH-1476.patch
>
>
> The method getStats in SegmentReader sets the number of parsed documents (and 
> also the number of parseErrors) to 0 if no parsing took place for a segment. 
> The values should be set to -1, analogously to the number of fetched docs 
> and fetchErrors.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (NUTCH-1252) SegmentReader -get shows wrong data

2012-10-08 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-1252:
--

Assignee: Sebastian Nagel

> SegmentReader -get shows wrong data
> ---
>
> Key: NUTCH-1252
> URL: https://issues.apache.org/jira/browse/NUTCH-1252
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.4, 1.5
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
> Fix For: 1.6
>
> Attachments: NUTCH-1252.patch, NUTCH-1252-v2.patch
>
>
> The command/option -get of the SegmentReader may show wrong data associated 
> with the given URL. 
> To reproduce:
> {code}
> % mkdir -p test_readseg/urls
> % echo -e 
> "http://nutch.apache.org/\ttest=ApacheNutch\nhttp://abc.test/\ttest=AbcTest\tnutch.score=10.0";
>  > test_readseg/urls/seeds
> % nutch inject test_readseg/crawldb test_readseg/urls
> Injector: starting at 2012-01-18 09:32:25
> Injector: crawlDb: test_readseg/crawldb
> Injector: urlDir: test_readseg/urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2012-01-18 09:32:28, elapsed: 00:00:03
> % nutch generate test_readseg/crawldb test_readseg/segments/
> Generator: starting at 2012-01-18 09:32:30
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: test_readseg/segments/20120118093232
> Generator: finished at 2012-01-18 09:32:34, elapsed: 00:00:03
> % nutch readseg -get test_readseg/segments/* 'http://nutch.apache.org/' 
> -nocontent -noparse -nofetch -noparsedata -noparsetext
> SegmentReader: get 'http://nutch.apache.org/'
> Crawl Generate::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Jan 18 09:32:26 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 10.0
> Signature: null
> Metadata: _ngt_: 1326875550401test: AbcTest
> {code}
> The metadata and the score indicate that the CrawlDatum shown is the wrong 
> one (the one associated with http://abc.test/, not with http://nutch.apache.org/).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1344) BasicURLNormalizer to normalize https same as http

2012-10-08 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471915#comment-13471915
 ] 

Sebastian Nagel commented on NUTCH-1344:


Is there any reason why https should be treated differently from http (and ftp)?
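
For illustration, a scheme-agnostic normalization could look like this sketch (plain java.net classes; illustrative, not the attached patch):
{code}
import java.net.URI;
import java.net.URL;
import java.util.Set;

/** Sketch: apply the same normalization steps to http, https and ftp. */
public class SchemeAgnosticNormalizer {
  static final Set<String> SCHEMES = Set.of("http", "https", "ftp");

  static String normalize(String url) throws Exception {
    URL u = new URL(url);
    String proto = u.getProtocol().toLowerCase();
    if (!SCHEMES.contains(proto)) return url;
    String host = u.getHost().toLowerCase();                         // lowercase host
    int port = u.getPort() == u.getDefaultPort() ? -1 : u.getPort(); // drop default port
    String path = URI.create(u.getPath()).normalize().toString();    // clean ./ and ../
    return proto + "://" + host
        + (port == -1 ? "" : ":" + port) + path;                     // anchor (#...) removed
  }

  public static void main(String[] args) throws Exception {
    // https gets the same treatment as http:
    System.out.println(normalize("https://WWW.Example.COM:443/a/./b/../c#anchor"));
    // -> https://www.example.com/a/c
  }
}
{code}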

> BasicURLNormalizer to normalize https same as http 
> ---
>
> Key: NUTCH-1344
> URL: https://issues.apache.org/jira/browse/NUTCH-1344
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.6
>Reporter: Sebastian Nagel
> Attachments: NUTCH-1344.patch
>
>
> Most of the normalization done by BasicURLNormalizer (lowercasing host, 
> removing default port, removal of page anchors, cleaning . and . in the path) 
> is not done for URLs with protocol https.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-706) Url regex normalizer: default pattern for session id removal not to match "newsId"

2012-10-10 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-706:
--

Fix Version/s: 2.2
  Summary: Url regex normalizer: default pattern for session id removal 
not to match "newsId"  (was: Url regex normalizer)

> Url regex normalizer: default pattern for session id removal not to match 
> "newsId"
> --
>
> Key: NUTCH-706
> URL: https://issues.apache.org/jira/browse/NUTCH-706
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Meghna Kukreja
>Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-706-2.patch, NUTCH-706.patch
>
>
> Hey,
> I encountered the following problem while trying to crawl a site using
> nutch-trunk. In the file regex-normalize.xml, the following regex is
> used to remove session ids:
> ([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$).
> This pattern also transforms a url, such as,
> "&newsId=2000484784794&newsLang=en" into "&new&newsLang=en" (since it
> matches 'sId' in the 'newsId'), which is incorrect and hence does not
> get fetched. This expression needs to be changed to prevent this.
> Thanks,
> Meghna

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (NUTCH-706) Url regex normalizer: default pattern for session id removal not to match "newsId"

2012-10-10 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-706.
---

Resolution: Fixed

committed to trunk (revision 1396796) and 2.x (revision 1396795)

> Url regex normalizer: default pattern for session id removal not to match 
> "newsId"
> --
>
> Key: NUTCH-706
> URL: https://issues.apache.org/jira/browse/NUTCH-706
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Meghna Kukreja
>Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-706-2.patch, NUTCH-706.patch
>
>
> Hey,
> I encountered the following problem while trying to crawl a site using
> nutch-trunk. In the file regex-normalize.xml, the following regex is
> used to remove session ids:
> ([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$).
> This pattern also transforms a url, such as,
> "&newsId=2000484784794&newsLang=en" into "&new&newsLang=en" (since it
> matches 'sId' in the 'newsId'), which is incorrect and hence does not
> get fetched. This expression needs to be changed to prevent this.
> Thanks,
> Meghna

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (NUTCH-1344) BasicURLNormalizer to normalize https same as http

2012-10-10 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1344.


   Resolution: Fixed
Fix Version/s: 2.2
   1.6

committed to trunk (revision 1396801) and 2.x (revision 1396800)



> BasicURLNormalizer to normalize https same as http 
> ---
>
> Key: NUTCH-1344
> URL: https://issues.apache.org/jira/browse/NUTCH-1344
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.6
>Reporter: Sebastian Nagel
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1344.patch
>
>
> Most of the normalization done by BasicURLNormalizer (lowercasing host, 
> removing default port, removal of page anchors, cleaning . and . in the path) 
> is not done for URLs with protocol https.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-706) Url regex normalizer: default pattern for session id removal not to match "newsId"

2012-10-10 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473599#comment-13473599
 ] 

Sebastian Nagel commented on NUTCH-706:
---

The first commit erroneously contained the wrong patch.
The correct patch (NUTCH-706-2.patch) is now committed to trunk (revision 
1396817) and 2.x (revision 1396822).
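
The mangling is easy to reproduce in isolation; the tightened pattern below is only an illustration, not the committed NUTCH-706-2.patch:
{code}
/** Reproduces the "newsId" mangling and one possible delimiter-anchored fix. */
public class SessionIdRuleDemo {
  public static void main(String[] args) {
    String url = "http://host/news?&newsId=2000484784794&newsLang=en";
    // old rule from regex-normalize.xml: matches the "sId" inside "newsId"
    String old = "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)";
    // illustrative fix: the parameter name must start right after a delimiter
    String fixed = "(?<=[?&;])([_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)";
    System.out.println(url.replaceAll(old, "$4"));   // http://host/news?&new&newsLang=en
    System.out.println(url.replaceAll(fixed, "$4")); // unchanged
  }
}
{code}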

> Url regex normalizer: default pattern for session id removal not to match 
> "newsId"
> --
>
> Key: NUTCH-706
> URL: https://issues.apache.org/jira/browse/NUTCH-706
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Meghna Kukreja
>Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-706-2.patch, NUTCH-706.patch
>
>
> Hey,
> I encountered the following problem while trying to crawl a site using
> nutch-trunk. In the file regex-normalize.xml, the following regex is
> used to remove session ids:
> ([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$).
> This pattern also transforms a url, such as,
> "&newsId=2000484784794&newsLang=en" into "&new&newsLang=en" (since it
> matches 'sId' in the 'newsId'), which is incorrect and hence does not
> get fetched. This expression needs to be changed to prevent this.
> Thanks,
> Meghna

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

2012-10-11 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474460#comment-13474460
 ] 

Sebastian Nagel commented on NUTCH-1475:


Indeed, a modified time in the future is a bad choice.
But CrawlDatum and WebPage both have a modifiedTime field. It should contain 
the time of the last fetch or (ideally) even the time of the previous fetch if 
the document has not been modified.
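
A sketch of that fallback order (the helper below is hypothetical; the actual CrawlDatum/WebPage accessors differ):
{code}
/** Suggested order: header Last-Modified, then stored modifiedTime, then "now". */
public class DateFieldSketch {
  static long dateForIndex(Long headerLastModified, long modifiedTime) {
    if (headerLastModified != null) return headerLastModified; // from HTTP headers
    if (modifiedTime > 0) return modifiedTime;   // time of the last (unmodified) fetch
    return System.currentTimeMillis();           // but never the *next* fetch time
  }
  public static void main(String[] args) {
    System.out.println(new java.util.Date(dateForIndex(null, 0L))); // falls back to now
  }
}
{code}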

> Nutch 2.1 Index-More Plugin -- A better fall back value for date field
> --
>
> Key: NUTCH-1475
> URL: https://issues.apache.org/jira/browse/NUTCH-1475
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.1, 1.5.1
> Environment: All
>Reporter: James Sullivan
>Priority: Minor
>  Labels: index-more, plugins
> Fix For: 1.6, 2.2
>
> Attachments: index-more-1xand2x.patch, index-more-2x.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Among other fields, the more plugin for Nutch 2.x provides a "last modified" 
> and "date" field for the Solr index. The "last modified" field is the last 
> modified date from the http headers if available, if not available it is left 
> empty. Currently, the "date" field is the same as the "last modified" field 
> unless that field is empty in which case getFetchTime is used as a fall back. 
> I think getFetchTime is not a good fall back as it is the next fetch time and 
> often a month or more in the future which doesn't make sense for the date 
> field. Users do not expect webpages/documents with future dates. A more 
> sensible fallback would be current date at the time it is indexed. 
> This is possible by simply changing line 97 of 
> https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
>  from
> time = page.getFetchTime(); // use fetch time
> to
> time = new Date().getTime();
> Users interested in the getFetchTime value can still get it from the "tstamp" 
> field.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (NUTCH-1252) SegmentReader -get shows wrong data

2012-10-11 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1252.


Resolution: Fixed

committed to trunk (revision 1397281)

> SegmentReader -get shows wrong data
> ---
>
> Key: NUTCH-1252
> URL: https://issues.apache.org/jira/browse/NUTCH-1252
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.4, 1.5
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
> Fix For: 1.6
>
> Attachments: NUTCH-1252.patch, NUTCH-1252-v2.patch
>
>
> The command/option -get of the SegmentReader may show wrong data associated 
> with the given URL. 
> To reproduce:
> {code}
> % mkdir -p test_readseg/urls
> % echo -e 
> "http://nutch.apache.org/\ttest=ApacheNutch\nhttp://abc.test/\ttest=AbcTest\tnutch.score=10.0";
>  > test_readseg/urls/seeds
> % nutch inject test_readseg/crawldb test_readseg/urls
> Injector: starting at 2012-01-18 09:32:25
> Injector: crawlDb: test_readseg/crawldb
> Injector: urlDir: test_readseg/urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2012-01-18 09:32:28, elapsed: 00:00:03
> % nutch generate test_readseg/crawldb test_readseg/segments/
> Generator: starting at 2012-01-18 09:32:30
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: test_readseg/segments/20120118093232
> Generator: finished at 2012-01-18 09:32:34, elapsed: 00:00:03
> % nutch readseg -get test_readseg/segments/* 'http://nutch.apache.org/' 
> -nocontent -noparse -nofetch -noparsedata -noparsetext
> SegmentReader: get 'http://nutch.apache.org/'
> Crawl Generate::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Jan 18 09:32:26 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 10.0
> Signature: null
> Metadata: _ngt_: 1326875550401test: AbcTest
> {code}
> The metadata and the score indicate that the CrawlDatum shown is the wrong 
> one (the one associated with http://abc.test/, not with http://nutch.apache.org/).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (NUTCH-1476) SegmentReader getStats should set parsed = -1 if no parsing took place

2012-10-11 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1476.


Resolution: Fixed

committed to trunk (revision 1397298)

> SegmentReader getStats should set parsed = -1 if no parsing took place
> --
>
> Key: NUTCH-1476
> URL: https://issues.apache.org/jira/browse/NUTCH-1476
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.6
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.6
>
> Attachments: NUTCH-1476.patch
>
>
> The method getStats in SegmentReader sets the number of parsed documents (and 
> also the number of parseErrors) to 0 if no parsing took place for a segment. 
> The values should be set to -1 analogous to the number of fetched docs and 
> fetchErrors.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (NUTCH-1383) IndexingFiltersChecker to show error message instead of null pointer exception

2012-10-11 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1383.


Resolution: Fixed

committed to trunk (revision 1397308)

> IndexingFiltersChecker to show error message instead of null pointer exception
> --
>
> Key: NUTCH-1383
> URL: https://issues.apache.org/jira/browse/NUTCH-1383
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.5, 1.6
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1383.patch
>
>
> IndexingFiltersChecker may throw null pointer exceptions if
> # content returned by protocol implementation is null (artifact of NUTCH-1293)
> # if one of the indexing filters sets doc to null (the interface 
> IndexingFilter allows to exclude documents by returning null, cf. the 
> IndexingFilter of NUTCH-966)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

2012-10-23 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482644#comment-13482644
 ] 

Sebastian Nagel commented on NUTCH-1467:


Hi Kiran,
thanks for the patch. After a look at it:
* instead of replacing {{Properties generalTags}} in HTMLMetaTags.java with a 
{{HashMap}}, it seems preferable to use the class 
{{metadata.Metadata}}:
** it provides the required methods
*** adding one more value to an array of values
*** {{toString()}} etc.
** it would shorten the code significantly
** it is sufficiently tested (own JUnit test)
* in addition to {{parse.html.HTMLMetaProcessor.java}}, 
{{parse.tika.HTMLMetaProcessor.java}} also needs to be modified

Also, as Julien mentioned, a test would be useful. I added 
NUTCH-1467-TEST-1.patch as a first draft. Can you have a look at the test? Are 
all situations covered? Promising: the test passes with the current patch applied :)
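
The multi-value behavior {{metadata.Metadata}} provides, sketched with plain collections for illustration:
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Repeated meta tags must accumulate values instead of overwriting them. */
public class MultiValueMetaSketch {
  private final Map<String, List<String>> meta = new HashMap<>();

  void add(String name, String value) {   // Properties.setProperty would overwrite
    meta.computeIfAbsent(name.toLowerCase(), k -> new ArrayList<>()).add(value);
  }
  String[] getValues(String name) {
    return meta.getOrDefault(name.toLowerCase(), List.of()).toArray(new String[0]);
  }
  public static void main(String[] args) {
    MultiValueMetaSketch m = new MultiValueMetaSketch();
    m.add("keywords", "nutch");
    m.add("keywords", "crawler"); // same name, second content kept as extra value
    System.out.println(Arrays.toString(m.getValues("keywords"))); // [nutch, crawler]
  }
}
{code}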


> nutch 1.5.1 not able to parse mutliValued metatags
> --
>
> Key: NUTCH-1467
> URL: https://issues.apache.org/jira/browse/NUTCH-1467
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: kiran
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1467-TEST-1.patch, NUTCH-1467-trunk.patch, 
> Patch_HTMLMetaProcessor.patch, Patch_HTMLMetaTags.patch, 
> Patch_MetadataIndexer.patch, Patch_MetaTagsParser.patch, patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using 
> http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when 
> there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it 
> work ?
> When there are two tags with same name and different content, it takes the 
> value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA 
> (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

2012-10-23 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1467:
---

Attachment: NUTCH-1467-TEST-1.patch

> nutch 1.5.1 not able to parse mutliValued metatags
> --
>
> Key: NUTCH-1467
> URL: https://issues.apache.org/jira/browse/NUTCH-1467
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: kiran
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1467-TEST-1.patch, NUTCH-1467-trunk.patch, 
> Patch_HTMLMetaProcessor.patch, Patch_HTMLMetaTags.patch, 
> Patch_MetadataIndexer.patch, Patch_MetaTagsParser.patch, patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using 
> http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when 
> there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it 
> work ?
> When there are two tags with same name and different content, it takes the 
> value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA 
> (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (NUTCH-1421) RegexURLNormalizer to only skip rules with invalid patterns

2012-10-23 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1421.


   Resolution: Fixed
Fix Version/s: 2.2
   1.6

committed to trunk (rev. 1401459) and 2.x (rev. 1401460)

> RegexURLNormalizer to only skip rules with invalid patterns
> ---
>
> Key: NUTCH-1421
> URL: https://issues.apache.org/jira/browse/NUTCH-1421
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: nutchgora, 1.6
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1421-1.patch
>
>
> If a regex-normalize.xml file contains one rule with a syntactically invalid 
> regular expression patterns, all rules are discarded and no normalization is 
> done. 
> In combination with a detailed error message, RegexURLNormalizer should only 
> skip the invalid rule but use all other (valid) rules.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1245:
---

Attachment: NUTCH-1245-578-TEST-1.patch

JUnit test to catch this problem and NUTCH-578: a large patch for a test, but 
the idea is to extend it to also test other transitions of CrawlDatum states.

> URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb 
> and is generated over and over again
> 
>
> Key: NUTCH-1245
> URL: https://issues.apache.org/jira/browse/NUTCH-1245
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.4, 1.5
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.6
>
> Attachments: NUTCH-1245-578-TEST-1.patch
>
>
> A document gone with 404 after db.fetch.interval.max (90 days) has passed
> is fetched over and over again but although fetch status is fetch_gone
> its status in CrawlDb keeps db_unfetched. Consequently, this document will
> be generated and fetched from now on in every cycle.
> To reproduce:
> # create a CrawlDatum in CrawlDb which retry interval hits 
> db.fetch.interval.max (I manipulated the shouldFetch() in 
> AbstractFetchSchedule to achieve this)
> # now this URL is fetched again
> # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to 
> db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 
> days)
> # this does not change with every generate-fetch-update cycle, here for two 
> segments:
> {noformat}
> /tmp/testcrawl/segments/20120105161430
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:14:21 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:14:48 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> /tmp/testcrawl/segments/20120105161631
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:16:23 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:20:05 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> {noformat}
> As far as I can see it's caused by setPageGoneSchedule() in 
> AbstractFetchSchedule. Some pseudo-code:
> {code}
> setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
> datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * 
> maxInterval
> datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
> if (maxInterval < datum.fetchInterval) // necessarily true
>forceRefetch()
> forceRefetch:
> if (datum.fetchInterval > maxInterval) // true because it's 1.35 * 
> maxInterval
>datum.fetchInterval = 0.9 * maxInterval
> datum.status = db_unfetched // 
> shouldFetch (called from generate / Generator.map):
> if ((datum.fetchTime - curTime) > maxInterval)
>// always true if the crawler is launched in short intervals
>// (lower than 0.35 * maxInterval)
>datum.fetchTime = curTime // forces a refetch
> {code}
> After setPageGoneSchedule is called via update the state is db_unfetched and 
> the retry interval 0.9 * db.fetch.interval.max (81 days). 
> Although the fetch time in the CrawlDb is far in the future
> {noformat}
> % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
> URL: http://localhost/page_gone
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Sun May 06 05:20:05 CEST 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Score: 1.0
> Signature: null
> Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> {noformat}
> the URL is generated again because (fetch time - current time) is larger than 
> db.fetch.interval.max.
> The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35, and 
> the fetch time is always close to current time + 1.35 * db.fetch.interval.max.
> It's possibly a side effect of NUTCH-516.

[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter

2012-10-29 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486144#comment-13486144
 ] 

Sebastian Nagel commented on NUTCH-1482:


+1

> Rename HTMLParseFilter
> --
>
> Key: NUTCH-1482
> URL: https://issues.apache.org/jira/browse/NUTCH-1482
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.5.1
>Reporter: Julien Nioche
>
> See NUTCH-861 for a background discussion. We have changed the name in 2.x to 
> better reflect what it does and I think we should do the same for 1.x.
> any objections?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1245:
---

Attachment: NUTCH-1245-1.patch

FetchSchedule.setPageGoneSchedule is called exclusively for a fetch_gone in 
CrawlDbReducer.reduce. Is there a need to call forceRefetch just after a fetch 
leads to a fetch_gone (assuming there is little delay between fetch and 
updatedb)?

Attached patch sets the fetchInterval to db.fetch.interval.max and does not 
call forceRefetch.
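
The effect of capping the interval instead of forcing a refetch, in isolation (a sketch of the idea, not the attached patch):
{code}
/** With a cap, the gone-page interval converges to db.fetch.interval.max
 *  instead of oscillating between 0.9x and 1.35x of it. */
public class GoneScheduleSketch {
  static final float MAX = 90f * 24 * 3600; // db.fetch.interval.max in seconds

  static float nextInterval(float fetchInterval) {
    return Math.min(1.5f * fetchInterval, MAX); // cap, no forceRefetch()
  }
  public static void main(String[] args) {
    float interval = 0.9f * MAX; // the value forceRefetch() used to reset to
    for (int cycle = 1; cycle <= 3; cycle++) {
      interval = nextInterval(interval);
      System.out.println("cycle " + cycle + ": " + interval / 86400 + " days");
    }
    // prints 90 days from the first cycle on
  }
}
{code}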

> URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb 
> and is generated over and over again
> 
>
> Key: NUTCH-1245
> URL: https://issues.apache.org/jira/browse/NUTCH-1245
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.4, 1.5
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.6
>
> Attachments: NUTCH-1245-1.patch, NUTCH-1245-578-TEST-1.patch
>
>
> A document gone with 404 after db.fetch.interval.max (90 days) has passed
> is fetched over and over again but although fetch status is fetch_gone
> its status in CrawlDb keeps db_unfetched. Consequently, this document will
> be generated and fetched from now on in every cycle.
> To reproduce:
> # create a CrawlDatum in CrawlDb which retry interval hits 
> db.fetch.interval.max (I manipulated the shouldFetch() in 
> AbstractFetchSchedule to achieve this)
> # now this URL is fetched again
> # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to 
> db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 
> days)
> # this does not change with every generate-fetch-update cycle, here for two 
> segments:
> {noformat}
> /tmp/testcrawl/segments/20120105161430
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:14:21 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:14:48 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> /tmp/testcrawl/segments/20120105161631
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:16:23 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:20:05 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> {noformat}
> As far as I can see it's caused by setPageGoneSchedule() in 
> AbstractFetchSchedule. Some pseudo-code:
> {code}
> setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
> datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * 
> maxInterval
> datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
> if (maxInterval < datum.fetchInterval) // necessarily true
>forceRefetch()
> forceRefetch:
> if (datum.fetchInterval > maxInterval) // true because it's 1.35 * 
> maxInterval
>datum.fetchInterval = 0.9 * maxInterval
> datum.status = db_unfetched // 
> shouldFetch (called from generate / Generator.map):
> if ((datum.fetchTime - curTime) > maxInterval)
>// always true if the crawler is launched in short intervals
>// (lower than 0.35 * maxInterval)
>datum.fetchTime = curTime // forces a refetch
> {code}
> After setPageGoneSchedule is called via update the state is db_unfetched and 
> the retry interval 0.9 * db.fetch.interval.max (81 days). 
> Although the fetch time in the CrawlDb is far in the future
> {noformat}
> % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
> URL: http://localhost/page_gone
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Sun May 06 05:20:05 CEST 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Score: 1.0
> Signature: null
> Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> {noformat}
> the URL is generated again because (fetch time - current time) is larger than 
> db.fetch.interval.max.

[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter

2012-10-29 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486290#comment-13486290
 ] 

Sebastian Nagel commented on NUTCH-1482:


Markus, you are right: I remember the API change of HTMLParseFilter in 1.0; it 
took me some hours to get the custom plugins compiled.
- is it possible to deprecate the extension point and keep it for some time?
- at least, place a warning in CHANGES.txt with a link to update instructions 
in the wiki

> Rename HTMLParseFilter
> --
>
> Key: NUTCH-1482
> URL: https://issues.apache.org/jira/browse/NUTCH-1482
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.5.1
>Reporter: Julien Nioche
>
> See NUTCH-861 for a background discussion. We have changed the name in 2.x to 
> better reflect what it does and I think we should do the same for 1.x.
> any objections?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1245:
---

Attachment: NUTCH-1245-2.patch
NUTCH-1245-578-TEST-2.patch

Improved patches

> URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb 
> and is generated over and over again
> 
>
> Key: NUTCH-1245
> URL: https://issues.apache.org/jira/browse/NUTCH-1245
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.4, 1.5
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.6
>
> Attachments: NUTCH-1245-1.patch, NUTCH-1245-2.patch, 
> NUTCH-1245-578-TEST-1.patch, NUTCH-1245-578-TEST-2.patch
>
>
> A document gone with 404 after db.fetch.interval.max (90 days) has passed
> is fetched over and over again but although fetch status is fetch_gone
> its status in CrawlDb keeps db_unfetched. Consequently, this document will
> be generated and fetched from now on in every cycle.
> To reproduce:
> # create a CrawlDatum in CrawlDb which retry interval hits 
> db.fetch.interval.max (I manipulated the shouldFetch() in 
> AbstractFetchSchedule to achieve this)
> # now this URL is fetched again
> # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to 
> db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 
> days)
> # this does not change with every generate-fetch-update cycle, here for two 
> segments:
> {noformat}
> /tmp/testcrawl/segments/20120105161430
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:14:21 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:14:48 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> /tmp/testcrawl/segments/20120105161631
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:16:23 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:20:05 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> {noformat}
> As far as I can see it's caused by setPageGoneSchedule() in 
> AbstractFetchSchedule. Some pseudo-code:
> {code}
> setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
> datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * 
> maxInterval
> datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
> if (maxInterval < datum.fetchInterval) // necessarily true
>forceRefetch()
> forceRefetch:
> if (datum.fetchInterval > maxInterval) // true because it's 1.35 * 
> maxInterval
>datum.fetchInterval = 0.9 * maxInterval
> datum.status = db_unfetched // 
> shouldFetch (called from generate / Generator.map):
> if ((datum.fetchTime - curTime) > maxInterval)
>// always true if the crawler is launched in short intervals
>// (lower than 0.35 * maxInterval)
>datum.fetchTime = curTime // forces a refetch
> {code}
> After setPageGoneSchedule is called via update the state is db_unfetched and 
> the retry interval 0.9 * db.fetch.interval.max (81 days). 
> Although the fetch time in the CrawlDb is far in the future
> {noformat}
> % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
> URL: http://localhost/page_gone
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Sun May 06 05:20:05 CEST 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Score: 1.0
> Signature: null
> Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> {noformat}
> the URL is generated again because (fetch time - current time) is larger than 
> db.fetch.interval.max.
> The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35, and 
> the fetch time is always close to current time + 1.35 * db.fetch.interval.max.
> It's possibly a side effect of NUTCH-516.

[jira] [Commented] (NUTCH-578) URL fetched with 403 is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486484#comment-13486484
 ] 

Sebastian Nagel commented on NUTCH-578:
---

NUTCH-1245 provides a test to catch this problem.

Attached v5 patch:
* call setPageGoneSchedule in CrawlDbReducer.reduce when the retry counter is hit 
and the status is set to db_gone. All attached patches do this: it sets the 
fetchInterval to a value larger than one day, so that from now on the URL is 
not fetched again and again.
* reset the retry counter in setPageGoneSchedule so that it cannot overflow (see 
the sketch below) and to get again 3 trials after db.fetch.interval.max is reached.
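
A minimal illustration of the overflow concern, assuming the counter lives in a byte-sized field as in CrawlDatum:
{code}
/** Why the retry counter needs a reset once db_gone is reached. */
public class RetryOverflowDemo {
  public static void main(String[] args) {
    byte retries = Byte.MAX_VALUE; // 127, reached if the counter is never reset
    retries++;                     // wraps around to -128
    System.out.println(retries);   // a negative retry count defeats the max check
  }
}
{code}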

> URL fetched with 403 is generated over and over again
> -
>
> Key: NUTCH-578
> URL: https://issues.apache.org/jira/browse/NUTCH-578
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.0.0
> Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I 
> have checked out the most recent version of the trunk as of Nov 20, 2007
>Reporter: Nathaniel Powell
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: crawl-urlfilter.txt, NUTCH-578.patch, 
> NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, 
> NUTCH-578_v5.patch, nutch-site.xml, regex-normalize.xml, urls.txt
>
>
> I have not changed the following parameter in the nutch-default.xml:
> <property>
>   <name>db.fetch.retry.max</name>
>   <value>3</value>
>   <description>The maximum number of times a url that has encountered
>   recoverable errors is generated for fetch.</description>
> </property>
> However, there is a URL which is on the site that I'm crawling, 
> www.teachertube.com, which keeps being generated over and over again for 
> almost every segment (many more times than 3):
> fetch of http://www.teachertube.com/images/ failed with: Http code=403, 
> url=http://www.teachertube.com/images/
> This is a bug, right?
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-578) URL fetched with 403 is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-578:
--

Attachment: NUTCH-578_v5.patch

> URL fetched with 403 is generated over and over again
> -
>
> Key: NUTCH-578
> URL: https://issues.apache.org/jira/browse/NUTCH-578
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.0.0
> Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I 
> have checked out the most recent version of the trunk as of Nov 20, 2007
>Reporter: Nathaniel Powell
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: crawl-urlfilter.txt, NUTCH-578.patch, 
> NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, 
> NUTCH-578_v5.patch, nutch-site.xml, regex-normalize.xml, urls.txt
>
>
> I have not changed the following parameter in the nutch-default.xml:
> <property>
>   <name>db.fetch.retry.max</name>
>   <value>3</value>
>   <description>The maximum number of times a url that has encountered
>   recoverable errors is generated for fetch.</description>
> </property>
> However, there is a URL which is on the site that I'm crawling, 
> www.teachertube.com, which keeps being generated over and over again for 
> almost every segment (many more times than 3):
> fetch of http://www.teachertube.com/images/ failed with: Http code=403, 
> url=http://www.teachertube.com/images/
> This is a bug, right?
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-10-30 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487316#comment-13487316
 ] 

Sebastian Nagel commented on NUTCH-1370:


+1
It would also be nice to see the number of injected URLs rejected by URL filters.

> Expose exact number of urls injected @runtime 
> --
>
> Key: NUTCH-1370
> URL: https://issues.apache.org/jira/browse/NUTCH-1370
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.6, 2.2
>
>
> Example: When using trunk, currently we see 
> {code}
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 
> 2012-05-22 09:04:00
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: 
> crawl/crawldb
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected 
> urls to crawl db entries.
> 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
> {code}
> I would like to see
> {code}
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 
> 2012-05-22 09:04:00
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: 
> crawl/crawldb
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Injected N urls to 
> crawl/crawldb
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected 
> urls to crawl db entries.
> 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
> {code}
> This would make debugging easier and would help those who end up getting 
> {code}
> 2012-05-22 09:04:04,850 WARN  crawl.Generator - Generator: 0 records selected 
> for fetching, exiting ...
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-578) URL fetched with 403 is generated over and over again

2012-10-30 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487318#comment-13487318
 ] 

Sebastian Nagel commented on NUTCH-578:
---

Resetting the retry counter in setPageGoneSchedule has some disadvantages:
* the information that the db_gone results from a number of unsuccessful 
fetches due to transient errors is lost
* maybe you do not want to "get again 3 trials after db.fetch.interval.max is 
reached". If a page has been fetched 3 times in a row with a 403 and we try 
again after one month and get a 403 again, we do not need 3 more trials.


> URL fetched with 403 is generated over and over again
> -
>
> Key: NUTCH-578
> URL: https://issues.apache.org/jira/browse/NUTCH-578
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.0.0
> Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I 
> have checked out the most recent version of the trunk as of Nov 20, 2007
>Reporter: Nathaniel Powell
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: crawl-urlfilter.txt, NUTCH-578.patch, 
> NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, 
> NUTCH-578_v5.patch, nutch-site.xml, regex-normalize.xml, urls.txt
>
>
> I have not changed the following parameter in the nutch-default.xml:
> <property>
>   <name>db.fetch.retry.max</name>
>   <value>3</value>
>   <description>The maximum number of times a url that has encountered
>   recoverable errors is generated for fetch.</description>
> </property>
> However, there is a URL which is on the site that I'm crawling, 
> www.teachertube.com, which keeps being generated over and over again for 
> almost every segment (many more times than 3):
> fetch of http://www.teachertube.com/images/ failed with: Http code=403, 
> url=http://www.teachertube.com/images/
> This is a bug, right?
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-10-31 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488146#comment-13488146
 ] 

Sebastian Nagel commented on NUTCH-1483:


Confirmed.
The problem is caused by the rule
{code}
<!-- removes duplicate slashes -->
<regex>
  <pattern>(?<!(:))/{2,}</pattern>
  <substitution>/</substitution>
</regex>
{code}
in regex-normalize.xml, which collapses the duplicate slashes: 
file:///home/rogerio/Documents/ is normalized to file://home/rogerio/Documents/.

> Can't crawl filesystem with protocol-file plugin
> 
>
> Key: NUTCH-1483
> URL: https://issues.apache.org/jira/browse/NUTCH-1483
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.6, 2.1
> Environment: OpenSUSE 12.1, OpenJDK 1.6.0
>Reporter: Rogério Pereira Araújo
>
> I tried to follow the same steps described in this wiki page:
> http://wiki.apache.org/nutch/IntranetDocumentSearch
> I made all required changes on regex-urlfilter.txt and added the following 
> entry in my seed file:
> file:///home/rogerio/Documents/
> The permissions are ok, I'm running nutch with the same user as folder owner, 
> so nutch has all the required permissions, unfortunately I'm getting the 
> following error:
> org.apache.nutch.protocol.file.FileError: File Error: 404
> at 
> org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
> at 
> org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
> fetch of file://home/rogerio/Documents/ failed with: 
> org.apache.nutch.protocol.file.FileError: File Error: 404
> Why the logs are showing file://home/rogerio/Documents/ instead of 
> file:///home/rogerio/Documents/ ???
> Note: The regex-urlfilter entry only works as expected if I add the entry 
> +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/ 
> as wiki says.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-10-31 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1483:
---

Affects Version/s: 1.6

> Can't crawl filesystem with protocol-file plugin
> 
>
> Key: NUTCH-1483
> URL: https://issues.apache.org/jira/browse/NUTCH-1483
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.6, 2.1
> Environment: OpenSUSE 12.1, OpenJDK 1.6.0
>Reporter: Rogério Pereira Araújo
>
> I tried to follow the same steps described in this wiki page:
> http://wiki.apache.org/nutch/IntranetDocumentSearch
> I made all required changes on regex-urlfilter.txt and added the following 
> entry in my seed file:
> file:///home/rogerio/Documents/
> The permissions are ok, I'm running nutch with the same user as folder owner, 
> so nutch has all the required permissions, unfortunately I'm getting the 
> following error:
> org.apache.nutch.protocol.file.FileError: File Error: 404
> at 
> org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
> at 
> org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
> fetch of file://home/rogerio/Documents/ failed with: 
> org.apache.nutch.protocol.file.FileError: File Error: 404
> Why the logs are showing file://home/rogerio/Documents/ instead of 
> file:///home/rogerio/Documents/ ???
> Note: The regex-urlfilter entry only works as expected if I add the entry 
> +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/ 
> as wiki says.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-10-31 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488200#comment-13488200
 ] 

Sebastian Nagel commented on NUTCH-1483:


I tried with 1.x/trunk.
For 2.x, URLs with only one slash break the handling of reversed URLs.
Have you tried removing the regex normalizer rule?
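
A quick check of the rule's effect (a sketch, assuming the duplicate-slash rule quoted in the previous comment):
{code}
/** Reproduces the mangling reported in the logs. */
public class SlashRuleDemo {
  public static void main(String[] args) {
    String pattern = "(?<!(:))/{2,}"; // duplicate-slash rule from regex-normalize.xml
    System.out.println("file:///home/rogerio/Documents/".replaceAll(pattern, "/"));
    // -> file://home/rogerio/Documents/ (the lookbehind protects the first slash;
    //    the remaining two collapse into one)
  }
}
{code}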

> Can't crawl filesystem with protocol-file plugin
> 
>
> Key: NUTCH-1483
> URL: https://issues.apache.org/jira/browse/NUTCH-1483
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.6, 2.1
> Environment: OpenSUSE 12.1, OpenJDK 1.6.0
>Reporter: Rogério Pereira Araújo
>
> I tried to follow the same steps described in this wiki page:
> http://wiki.apache.org/nutch/IntranetDocumentSearch
> I made all required changes on regex-urlfilter.txt and added the following 
> entry in my seed file:
> file:///home/rogerio/Documents/
> The permissions are ok, I'm running nutch with the same user as folder owner, 
> so nutch has all the required permissions, unfortunately I'm getting the 
> following error:
> org.apache.nutch.protocol.file.FileError: File Error: 404
> at 
> org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
> at 
> org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
> fetch of file://home/rogerio/Documents/ failed with: 
> org.apache.nutch.protocol.file.FileError: File Error: 404
> Why the logs are showing file://home/rogerio/Documents/ instead of 
> file:///home/rogerio/Documents/ ???
> Note: The regex-urlfilter entry only works as expected if I add the entry 
> +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/ 
> as wiki says.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-10-31 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1483:
---

Attachment: NUTCH-1483.patch

StringUtils.split(String, char) does not preserve empty parts: the host part is 
empty for file: URLs.

Patch includes a test case.

> Can't crawl filesystem with protocol-file plugin
> 
>
> Key: NUTCH-1483
> URL: https://issues.apache.org/jira/browse/NUTCH-1483
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.6, 2.1
> Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
>Reporter: Rogério Pereira Araújo
> Attachments: NUTCH-1483.patch
>
>
> I tried to follow the same steps described in this wiki page:
> http://wiki.apache.org/nutch/IntranetDocumentSearch
> I made all required changes on regex-urlfilter.txt and added the following 
> entry in my seed file:
> file:///home/rogerio/Documents/
> The permissions are ok, I'm running nutch with the same user as folder owner, 
> so nutch has all the required permissions, unfortunately I'm getting the 
> following error:
> org.apache.nutch.protocol.file.FileError: File Error: 404
> at 
> org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
> at 
> org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
> fetch of file://home/rogerio/Documents/ failed with: 
> org.apache.nutch.protocol.file.FileError: File Error: 404
> Why the logs are showing file://home/rogerio/Documents/ instead of 
> file:///home/rogerio/Documents/ ???
> Note: The regex-urlfilter entry only works as expected if I add the entry 
> +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/ 
> as wiki says.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-10-31 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488254#comment-13488254
 ] 

Sebastian Nagel commented on NUTCH-1483:


Rogério, can you apply the patch, re-compile and try again?

> Can't crawl filesystem with protocol-file plugin
> 
>
> Key: NUTCH-1483
> URL: https://issues.apache.org/jira/browse/NUTCH-1483
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.6, 2.1
> Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
>Reporter: Rogério Pereira Araújo
> Attachments: NUTCH-1483.patch
>
>
> I tried to follow the same steps described in this wiki page:
> http://wiki.apache.org/nutch/IntranetDocumentSearch
> I made all required changes on regex-urlfilter.txt and added the following 
> entry in my seed file:
> file:///home/rogerio/Documents/
> The permissions are ok, I'm running nutch with the same user as folder owner, 
> so nutch has all the required permissions, unfortunately I'm getting the 
> following error:
> org.apache.nutch.protocol.file.FileError: File Error: 404
> at 
> org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
> at 
> org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
> fetch of file://home/rogerio/Documents/ failed with: 
> org.apache.nutch.protocol.file.FileError: File Error: 404
> Why the logs are showing file://home/rogerio/Documents/ instead of 
> file:///home/rogerio/Documents/ ???
> Note: The regex-urlfilter entry only works as expected if I add the entry 
> +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/ 
> as wiki says.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs

2012-11-01 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1484:
--

 Summary: TableUtil unreverseURL fails on file:// URLs
 Key: NUTCH-1484
 URL: https://issues.apache.org/jira/browse/NUTCH-1484
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1
Reporter: Sebastian Nagel
Priority: Critical
 Fix For: 2.2


(reported by Rogério Pereira Araújo, see NUTCH-1483)
When crawling the local filesystem TableUtil.unreverseURL fails for URLs with 
empty host part (file:///Documents/). StringUtils.split(String, char) does not 
preserve empty parts which causes:
{code}
java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
{code}
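
The failure mode in isolation (a sketch; the reversed form below is a simplified assumption, not TableUtil's exact layout):
{code}
import java.util.Arrays;

/** Splitting that drops empty parts loses the empty host of file: URLs. */
public class SplitEmptyPartsDemo {
  public static void main(String[] args) {
    String reversed = ":file/Documents/"; // empty host part before the ':'
    // A split that ignores empty tokens (the StringUtils.split behavior described
    // above) returns a single element, so parts[1] then throws
    // ArrayIndexOutOfBoundsException: 1 -- as in the stack trace above.
    String[] parts = reversed.split(":", -1); // limit -1 preserves empty parts
    System.out.println(Arrays.toString(parts)); // [, file/Documents/]
  }
}
{code}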

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-11-01 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488558#comment-13488558
 ] 

Sebastian Nagel commented on NUTCH-1483:


Thanks!
Issue with un-reversing URLs pulled out to NUTCH-1484 since it's more critical 
(no work-around).
Fixing the URL normalizers (and filters, see last comment) will take more time. 
Btw., {{file://localhost/Documents/}} is the only legal form according to 
[RFC 1738|http://tools.ietf.org/html/rfc1738] (1994) while {{file:///Documents/}} 
is allowed by [RFC 3986|http://tools.ietf.org/html/rfc3986]:
{quote}
the "file" URI scheme is defined so that no authority, an empty host, and 
"localhost" all mean the end-user's machine
{quote}
Maybe we could also make protocol-file more lenient.

> Can't crawl filesystem with protocol-file plugin
> 
>
> Key: NUTCH-1483
> URL: https://issues.apache.org/jira/browse/NUTCH-1483
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.6, 2.1
> Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
>Reporter: Rogério Pereira Araújo
> Attachments: NUTCH-1483.patch
>
>
> I tried to follow the same steps described in this wiki page:
> http://wiki.apache.org/nutch/IntranetDocumentSearch
> I made all required changes on regex-urlfilter.txt and added the following 
> entry in my seed file:
> file:///home/rogerio/Documents/
> The permissions are OK, I'm running nutch with the same user as the folder 
> owner, so nutch has all the required permissions. Unfortunately I'm getting 
> the following error:
> org.apache.nutch.protocol.file.FileError: File Error: 404
> at 
> org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
> at 
> org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
> fetch of file://home/rogerio/Documents/ failed with: 
> org.apache.nutch.protocol.file.FileError: File Error: 404
> Why are the logs showing file://home/rogerio/Documents/ instead of 
> file:///home/rogerio/Documents/?
> Note: The regex-urlfilter entry only works as expected if I add the entry 
> +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/ 
> as the wiki says.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-11-01 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488558#comment-13488558
 ] 

Sebastian Nagel edited comment on NUTCH-1483 at 11/1/12 8:55 AM:
-

Thanks!
Issue with un-reversing URLs pulled out to NUTCH-1484 since it's more critical 
(no work-around).
Fixing the URL normalizers (and filters, see last comment) will take more time. 
Btw., {{file://localhost/Documents/}} is the only legal form according to 
[RFC 1738|http://tools.ietf.org/html/rfc1738] (1994) while {{file:///Documents/}} 
is allowed by [RFC 3986|http://tools.ietf.org/html/rfc3986] (2005):
{quote}
the "file" URI scheme is defined so that no authority, an empty host, and 
"localhost" all mean the end-user's machine
{quote}
Maybe we could also make protocol-file more lenient.

  was (Author: wastl-nagel):
Thanks!
Issue with un-reversing URLs pulled out to NUTCH-1484 since it's more critical 
(no work-around).
Fixing the URL normalizers (and filters, see last comment) will take more time. 
Btw., {{file://localhost/Documents/}} is the only legal form according to 
[RFC 1738|http://tools.ietf.org/html/rfc1738] (1994) while {{file:///Documents/}} 
is allowed by [RFC 3986|http://tools.ietf.org/html/rfc3986]:
{quote}
the "file" URI scheme is defined so that no authority, an empty host, and 
"localhost" all mean the end-user's machine
{quote}
Maybe we could also make protocol-file more lenient.
  
> Can't crawl filesystem with protocol-file plugin
> 
>
> Key: NUTCH-1483
> URL: https://issues.apache.org/jira/browse/NUTCH-1483
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.6, 2.1
> Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
>Reporter: Rogério Pereira Araújo
> Attachments: NUTCH-1483.patch
>
>
> I tried to follow the same steps described in this wiki page:
> http://wiki.apache.org/nutch/IntranetDocumentSearch
> I made all required changes on regex-urlfilter.txt and added the following 
> entry in my seed file:
> file:///home/rogerio/Documents/
> The permissions are OK, I'm running nutch with the same user as the folder 
> owner, so nutch has all the required permissions. Unfortunately I'm getting 
> the following error:
> org.apache.nutch.protocol.file.FileError: File Error: 404
> at 
> org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
> at 
> org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
> fetch of file://home/rogerio/Documents/ failed with: 
> org.apache.nutch.protocol.file.FileError: File Error: 404
> Why are the logs showing file://home/rogerio/Documents/ instead of 
> file:///home/rogerio/Documents/?
> Note: The regex-urlfilter entry only works as expected if I add the entry 
> +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/ 
> as the wiki says.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (NUTCH-1485) TableUtil reverseURL to keep userinfo part

2012-11-01 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1485:
--

 Summary: TableUtil reverseURL to keep userinfo part
 Key: NUTCH-1485
 URL: https://issues.apache.org/jira/browse/NUTCH-1485
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.1
Reporter: Sebastian Nagel
Priority: Minor


The reversed URL key does not contain the userinfo part of a URL (user name 
and password: {{ftp://user:passw...@ftp.xyz/file.txt}}, cf. [RFC 
3986|http://tools.ietf.org/html/rfc3986] and 
[http://en.wikipedia.org/wiki/URI_scheme]). Keeping the userinfo would make it 
easy to crawl a fixed list of protected content. However, URLs with userinfo 
can be tricky, e.g. 
[http://cnn.com&story=breaking_news@199.239.136.200/mostpopular], so it's ok 
if the default is to remove the userinfo. But this should be done in the 
default URL normalizers.
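A small, self-contained JDK sketch (with hypothetical URLs) of what the 
userinfo part is and why it can be deceptive:
{code}
// Sketch (plain JDK, hypothetical URLs): extracting the userinfo part.
import java.net.URI;

public class UserInfoDemo {
  public static void main(String[] args) throws Exception {
    URI ftp = new URI("ftp://user:password@ftp.example.org/file.txt");
    System.out.println(ftp.getUserInfo()); // user:password
    System.out.println(ftp.getHost());     // ftp.example.org

    // deceptive case: everything before '@' is userinfo, the real host is the IP
    URI odd = new URI("http://cnn.com&story=breaking_news@199.239.136.200/mostpopular");
    System.out.println(odd.getUserInfo()); // cnn.com&story=breaking_news
    System.out.println(odd.getHost());     // 199.239.136.200
  }
}
{code}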


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1461) Problem with TableUtil

2012-11-01 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488585#comment-13488585
 ] 

Sebastian Nagel commented on NUTCH-1461:


Cf. NUTCH-1484: same error with file:// URLs which do not contain a host.

> Problem with TableUtil
> --
>
> Key: NUTCH-1461
> URL: https://issues.apache.org/jira/browse/NUTCH-1461
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: nutchgora
> Environment: Debian / CDH3 / Nutch 2.0 Release
>Reporter: Christian Johnsson
> Attachments: regex-urlfilter.txt, TabelUtil_Fix.patch
>
>
> Affects parse and updatedb.
> I think I got some misformatted URLs into HBase but I can't find them.
> It generates this error though. If I empty HBase and restart, it goes for a 
> couple of million indexed pages, then it comes up again. Any tips on how to 
> locate which row in the table generates this error?
> 2012-08-28 01:48:10,871 WARN org.apache.hadoop.mapred.Child: Error running 
> child
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
>   at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:102)
>   at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:76)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
>   at org.apache.hadoop.mapred.Child.main(Child.java:260)
> 2012-08-28 01:48:10,875 INFO org.apache.hadoop.mapred.Task: Runnning cleanup 
> for the task

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-11-01 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488935#comment-13488935
 ] 

Sebastian Nagel commented on NUTCH-1245:


They are not duplicates but the effects are similar:

NUTCH-1245
- caused by calling forceRefetch just after a fetch led to a fetch_gone. If 
the fetchInterval is close to db.fetch.interval.max, setPageGoneSchedule calls 
forceRefetch. That's useless since we just got a 404 (or within the last 
day(s) for large crawls).
- proposed fix: setPageGoneSchedule should not call forceRefetch but keep the 
fetchInterval within/below db.fetch.interval.max

NUTCH-578
- although the status of a page fetched 3 times (db.fetch.retry.max) with a 
transient error (fetch_retry) is set to db_gone, the fetchInterval is still 
only incremented by one day. So the next day this page is fetched again.
- every fetch_retry still increments the retry counter so that it may overflow 
(NUTCH-1247)
- fix:
-* call setPageGoneSchedule in CrawlDbReducer.reduce when the retry counter 
is hit and the status is set to db_gone. All patches (by various 
users/committers) agree on this: it will set the fetchInterval to a value 
larger than one day, so that from now on the URL is not fetched again and 
again.
-* reset the retry counter to 0 or prohibit an overflow. I'm not sure what the 
best solution is, see the comments on NUTCH-578.

Markus, it would be great if you start with a look at the JUnit patch. It has 
two aims: catch the error and make analysis easier (it logs a lot).
I would like to extend the test to other CrawlDatum state transitions: these 
are complex for continuous crawls in combination with retry counters, 
intervals, signatures, etc. An exhaustive test could ensure that we do not 
break other state transitions.
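A minimal sketch of the proposed fix (illustrative names and a bare value 
holder, not the actual Nutch API; the 1.5 and 0.9 factors follow the 
pseudo-code in the issue description):
{code}
// Sketch of the proposed NUTCH-1245 fix (illustrative, not the Nutch API):
// on a gone page, back off the fetch interval but cap it below
// db.fetch.interval.max instead of calling forceRefetch().
class Datum {
  float fetchInterval; // seconds
  long fetchTime;      // epoch milliseconds
}

class GoneSchedule {
  static void setPageGoneSchedule(Datum datum, long fetchTime, float maxInterval) {
    float interval = datum.fetchInterval * 1.5f; // back off by 50%
    if (interval > maxInterval) {
      interval = 0.9f * maxInterval;             // stay below the maximum
    }
    datum.fetchInterval = interval;
    datum.fetchTime = fetchTime + (long) interval * 1000L; // seconds -> ms
    // no forceRefetch(): the status stays db_gone and the URL is no longer
    // generated over and over again
  }
}
{code}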

> URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb 
> and is generated over and over again
> 
>
> Key: NUTCH-1245
> URL: https://issues.apache.org/jira/browse/NUTCH-1245
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.4, 1.5
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.6
>
> Attachments: NUTCH-1245-1.patch, NUTCH-1245-2.patch, 
> NUTCH-1245-578-TEST-1.patch, NUTCH-1245-578-TEST-2.patch
>
>
> A document gone with 404 after db.fetch.interval.max (90 days) has passed
> is fetched over and over again but although fetch status is fetch_gone
> its status in CrawlDb keeps db_unfetched. Consequently, this document will
> be generated and fetched from now on in every cycle.
> To reproduce:
> # create a CrawlDatum in CrawlDb which retry interval hits 
> db.fetch.interval.max (I manipulated the shouldFetch() in 
> AbstractFetchSchedule to achieve this)
> # now this URL is fetched again
> # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to 
> db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 
> days)
> # this does not change with every generate-fetch-update cycle, here for two 
> segments:
> {noformat}
> /tmp/testcrawl/segments/20120105161430
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:14:21 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:14:48 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> /tmp/testcrawl/segments/20120105161631
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:16:23 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:20:05 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> {noformat}
> As far as I can see it's caused by setPageGoneSchedule() in 
> AbstractFetchSchedule. Some pseudo-code:
> {code}
> setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
> datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * 
> maxInterval
> datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
> {code}

[jira] [Created] (NUTCH-1488) bin/nutch to run junit from any directory

2012-11-01 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1488:
--

 Summary: bin/nutch to run junit from any directory
 Key: NUTCH-1488
 URL: https://issues.apache.org/jira/browse/NUTCH-1488
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.5.1, 2.1
Reporter: Sebastian Nagel
Priority: Trivial


It should be possible to run a JUnit test via {{bin/nutch junit}} (see 
[http://wiki.apache.org/nutch/bin/nutch%20junit] and NUTCH-672) from anywhere, 
not only from {{runtime/local/}}. All parts of the class path are absolute but 
{{test/classes/}} is relative. Is there any reason for this?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1488) bin/nutch to run junit from any directory

2012-11-01 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1488:
---

Attachment: NUTCH-1488.patch

> bin/nutch to run junit from any directory
> -
>
> Key: NUTCH-1488
> URL: https://issues.apache.org/jira/browse/NUTCH-1488
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.1, 1.5.1
>Reporter: Sebastian Nagel
>Priority: Trivial
> Attachments: NUTCH-1488.patch
>
>
> It should be possible to run a JUnit test via {{bin/nutch junit}} (see 
> [http://wiki.apache.org/nutch/bin/nutch%20junit] and NUTCH-672) from 
> anywhere, not only from {{runtime/local/}}. All parts of the class path are 
> absolute but {{test/classes/}} is relative. Is there any reason for this?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1496) ParserJob logs skipped urls with level info

2012-11-11 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494950#comment-13494950
 ] 

Sebastian Nagel commented on NUTCH-1496:


+1

> ParserJob logs skipped urls with level info
> ---
>
> Key: NUTCH-1496
> URL: https://issues.apache.org/jira/browse/NUTCH-1496
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.1
>Reporter: Nathan Gass
>Priority: Trivial
> Attachments: patch-parserjob-log-level-2012.txt
>
>
> ParserJob is the only one which logs *all* skipped urls with level info. 
> Attached patch changes this to level debug, the same level already used by 
> FetcherJob, IndexerJob, and GeneratorJob.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs

2012-11-11 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1484:
---

Attachment: NUTCH-1484.patch

Revised patch: replaced
StringUtils.splitByWholeSeparatorPreserveAllTokens(String str, String 
separator) by splitPreserveAllTokens(String str, char separator) which is 
significantly faster
(as fast as StringUtils.split(String, char)).

> TableUtil unreverseURL fails on file:// URLs
> 
>
> Key: NUTCH-1484
> URL: https://issues.apache.org/jira/browse/NUTCH-1484
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.1
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 2.2
>
> Attachments: NUTCH-1484.patch
>
>
> (reported by Rogério Pereira Araújo, see NUTCH-1483)
> When crawling the local filesystem TableUtil.unreverseURL fails for URLs with 
> empty host part (file:///Documents/). StringUtils.split(String, char) does 
> not preserve empty parts which causes:
> {code}
> java.lang.ArrayIndexOutOfBoundsException: 1
> at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs

2012-11-11 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494952#comment-13494952
 ] 

Sebastian Nagel edited comment on NUTCH-1484 at 11/11/12 7:56 PM:
--

Revised patch: replaced 
StringUtils.splitByWholeSeparatorPreserveAllTokens(String str, String 
separator) by splitPreserveAllTokens(String str, char separator) which is 
significantly faster (as fast as StringUtils.split(String, char)).

  was (Author: wastl-nagel):
Revised patch: replaced
StringUtils.splitByWholeSeparatorPreserveAllTokens(String str, String 
separator) by splitPreserveAllTokens(String str, char separator) which is 
significantly faster
(as fast as StringUtils.split(String, char)).
  
> TableUtil unreverseURL fails on file:// URLs
> 
>
> Key: NUTCH-1484
> URL: https://issues.apache.org/jira/browse/NUTCH-1484
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.1
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 2.2
>
> Attachments: NUTCH-1484.patch
>
>
> (reported by Rogério Pereira Araújo, see NUTCH-1483)
> When crawling the local filesystem TableUtil.unreverseURL fails for URLs with 
> empty host part (file:///Documents/). StringUtils.split(String, char) does 
> not preserve empty parts which causes:
> {code}
> java.lang.ArrayIndexOutOfBoundsException: 1
> at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs

2012-11-12 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1484.


Resolution: Fixed

Committed to 2.x (rev. 1408465)

> TableUtil unreverseURL fails on file:// URLs
> 
>
> Key: NUTCH-1484
> URL: https://issues.apache.org/jira/browse/NUTCH-1484
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.1
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 2.2
>
> Attachments: NUTCH-1484.patch
>
>
> (reported by Rogério Pereira Araújo, see NUTCH-1483)
> When crawling the local filesystem TableUtil.unreverseURL fails for URLs with 
> empty host part (file:///Documents/). StringUtils.split(String, char) does 
> not preserve empty parts which causes:
> {code}
> java.lang.ArrayIndexOutOfBoundsException: 1
> at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-11-12 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1370:
---

Attachment: NUTCH-1370-1.x.patch

Ferdy is right: custom counters are more transparent.
Patch for 1.x
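A minimal sketch of the counter approach (hypothetical group and counter 
names, new MapReduce API):
{code}
// Sketch only: count each injected URL with a custom Hadoop counter and
// read the total in the driver after the job has finished.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InjectMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text url, Context context)
      throws IOException, InterruptedException {
    // ... normalize/filter the url and emit the CrawlDb entry ...
    context.getCounter("injector", "urls_injected").increment(1);
  }
}

// in the driver, after job.waitForCompletion(true):
//   long n = job.getCounters().findCounter("injector", "urls_injected").getValue();
//   LOG.info("Injector: Injected " + n + " urls to " + crawlDb);
{code}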


> Expose exact number of urls injected @runtime 
> --
>
> Key: NUTCH-1370
> URL: https://issues.apache.org/jira/browse/NUTCH-1370
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1370-1.x.patch, NUTCH-1370-2.x.patch
>
>
> Example: When using trunk, currently we see 
> {code}
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 
> 2012-05-22 09:04:00
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: 
> crawl/crawldb
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected 
> urls to crawl db entries.
> 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
> {code}
> I would like to see
> {code}
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 
> 2012-05-22 09:04:00
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: 
> crawl/crawldb
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Injected N urls to 
> crawl/crawldb
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected 
> urls to crawl db entries.
> 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
> {code}
> This would make debugging easier and would help those who end up getting 
> {code}
> 2012-05-22 09:04:04,850 WARN  crawl.Generator - Generator: 0 records selected 
> for fetching, exiting ...
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-11-13 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1370:
---

Attachment: NUTCH-1370-2.x-v3.patch

Hi Lewis, yes, the 1.x patch is not easily transferred to 2.x because of the 
different (old vs. new) MapReduce APIs. Here is a trial...
One question: the logged line "number of urls attempting to inject" suggests 
that there is a third count, "urls successfully injected" or similar. What's 
the intention behind "attempting"?


> Expose exact number of urls injected @runtime 
> --
>
> Key: NUTCH-1370
> URL: https://issues.apache.org/jira/browse/NUTCH-1370
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1370-1.x.patch, NUTCH-1370-2.x.patch, 
> NUTCH-1370-2.x-v2.patch, NUTCH-1370-2.x-v3.patch
>
>
> Example: When using trunk, currently we see 
> {code}
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 
> 2012-05-22 09:04:00
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: 
> crawl/crawldb
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected 
> urls to crawl db entries.
> 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
> {code}
> I would like to see
> {code}
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 
> 2012-05-22 09:04:00
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: 
> crawl/crawldb
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Injected N urls to 
> crawl/crawldb
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected 
> urls to crawl db entries.
> 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
> {code}
> This would make debugging easier and would help those who end up getting 
> {code}
> 2012-05-22 09:04:04,850 WARN  crawl.Generator - Generator: 0 records selected 
> for fetching, exiting ...
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines

2012-11-26 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504136#comment-13504136
 ] 

Sebastian Nagel commented on NUTCH-1499:


Short and precise patch. However, is there a reason why the problem is not 
solved on the hardware or system level, cf. 
[bonding|http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding]?

> Usage of multiple ipv4 addresses and network cards on fetcher machines
> --
>
> Key: NUTCH-1499
> URL: https://issues.apache.org/jira/browse/NUTCH-1499
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 1.5.1
>Reporter: Walter Tietze
>Priority: Minor
> Attachments: apache-nutch-1.5.1.NUTCH-1499.patch
>
>
> Adds the ability for the fetcher threads to use multiple configured IPv4 
> addresses.
> On some cluster machines several IPv4 addresses are configured, where each 
> IP address is associated with its own network interface.
> This patch makes it possible to configure protocol-http and 
> protocol-httpclient to use these network interfaces in round-robin style.
> If the feature is enabled, a helper class reads the network configuration at 
> *startup*. For each http network connection the next IP address is taken. 
> This method is synchronized, but this should be no bottleneck for the 
> overall performance of the fetcher threads.
> This feature is tested on our cluster for the protocol-http and 
> protocol-httpclient protocols.
> 
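The approach could look roughly like this (a minimal sketch, not the actual 
patch; interface filtering and configuration switches are simplified):
{code}
// Minimal sketch: a synchronized round-robin selector over the locally
// configured IPv4 addresses, read once at startup.
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

public class RoundRobinLocalAddress {
  private final List<InetAddress> addresses = new ArrayList<InetAddress>();
  private int next = 0;

  public RoundRobinLocalAddress() throws SocketException {
    Enumeration<NetworkInterface> ifaces = NetworkInterface.getNetworkInterfaces();
    while (ifaces.hasMoreElements()) {
      Enumeration<InetAddress> addrs = ifaces.nextElement().getInetAddresses();
      while (addrs.hasMoreElements()) {
        InetAddress addr = addrs.nextElement();
        if (addr instanceof Inet4Address && !addr.isLoopbackAddress()) {
          addresses.add(addr);
        }
      }
    }
  }

  // called once per connection; synchronized, but cheap compared to the
  // network I/O of a fetch
  public synchronized InetAddress nextAddress() {
    if (addresses.isEmpty()) return null;
    InetAddress addr = addresses.get(next);
    next = (next + 1) % addresses.size();
    return addr;
  }
}
{code}
The selected address would then be passed as the local bind address when the 
HTTP connection is opened.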

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (NUTCH-1500) bin/crawl fails on step solrindex with wrong path to segment

2012-11-28 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1500:
--

 Summary: bin/crawl fails on step solrindex with wrong path to 
segment
 Key: NUTCH-1500
 URL: https://issues.apache.org/jira/browse/NUTCH-1500
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.6
Reporter: Sebastian Nagel
Priority: Trivial


The bin/crawl script calls the command (bin/nutch) solrindex with the wrong 
path to the segment which causes solrindex to fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1500) bin/crawl fails on step solrindex with wrong path to segment

2012-11-28 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1500:
---

Attachment: NUTCH-1500.patch

> bin/crawl fails on step solrindex with wrong path to segment
> 
>
> Key: NUTCH-1500
> URL: https://issues.apache.org/jira/browse/NUTCH-1500
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.6
>Reporter: Sebastian Nagel
>Priority: Trivial
> Attachments: NUTCH-1500.patch
>
>
> The bin/crawl script calls the command (bin/nutch) solrindex with the wrong 
> path to the segment which causes solrindex to fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines

2012-12-01 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507944#comment-13507944
 ] 

Sebastian Nagel commented on NUTCH-1499:


Thanks! That's a plausible reason: (let's call it) "administrative constraints".
+1 (lean patch, looks good, I'll try to test it on a machine with suitable 
network settings)

> Usage of multiple ipv4 addresses and network cards on fetcher machines
> --
>
> Key: NUTCH-1499
> URL: https://issues.apache.org/jira/browse/NUTCH-1499
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 1.5.1
>Reporter: Walter Tietze
>Priority: Minor
> Attachments: apache-nutch-1.5.1.NUTCH-1499.patch
>
>
> Adds the ability for the fetcher threads to use multiple configured IPv4 
> addresses.
> On some cluster machines several IPv4 addresses are configured, where each 
> IP address is associated with its own network interface.
> This patch makes it possible to configure protocol-http and 
> protocol-httpclient to use these network interfaces in round-robin style.
> If the feature is enabled, a helper class reads the network configuration at 
> *startup*. For each http network connection the next IP address is taken. 
> This method is synchronized, but this should be no bottleneck for the 
> overall performance of the fetcher threads.
> This feature is tested on our cluster for the protocol-http and 
> protocol-httpclient protocols.
> 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1038) Port IndexingFiltersChecker to 2.0

2012-12-05 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1038:
---

Patch Info: Patch Available

> Port IndexingFiltersChecker to 2.0
> --
>
> Key: NUTCH-1038
> URL: https://issues.apache.org/jira/browse/NUTCH-1038
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
> Fix For: 2.2
>
> Attachments: NUTCH-1038.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1038) Port IndexingFiltersChecker to 2.0

2012-12-05 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1038:
---

Attachment: NUTCH-1038.patch

> Port IndexingFiltersChecker to 2.0
> --
>
> Key: NUTCH-1038
> URL: https://issues.apache.org/jira/browse/NUTCH-1038
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
> Fix For: 2.2
>
> Attachments: NUTCH-1038.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (NUTCH-1501) Harmonize behavior of parsechecker and indexchecker

2012-12-05 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1501:
--

 Summary: Harmonize behavior of parsechecker and indexchecker
 Key: NUTCH-1501
 URL: https://issues.apache.org/jira/browse/NUTCH-1501
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 2.2


Behaviour of ParserChecker and IndexingFiltersChecker has diverged between 
trunk and 2.x:
- missing in 2.x: NUTCH-1320, NUTCH-1207
- open issues to be also applied to 2.x: NUTCH-1419, NUTCH-1389

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (NUTCH-1502) Test for CrawlDatum state transitions

2012-12-06 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1502:
--

 Summary: Test for CrawlDatum state transitions
 Key: NUTCH-1502
 URL: https://issues.apache.org/jira/browse/NUTCH-1502
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Affects Versions: 1.7, 2.2
Reporter: Sebastian Nagel


An exhaustive test to check the matrix of CrawlDatum state transitions 
(CrawlStatus in 2.x) would be useful to detect errors, esp. for continuous 
crawls where the number of possible transitions is quite large. Additional 
factors with an impact on state transitions (retry counters, static and 
dynamic intervals) should also be tested.
The tests will help to address NUTCH-578 and NUTCH-1245. See the latter for a 
first sketchy patch.
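One cell of such a matrix could look like this (a self-contained sketch; the 
status constants mirror CrawlDatum's values, and the transition rule is the 
intended one, not existing test code):
{code}
// Sketch of a single state-transition check; a real test would drive
// CrawlDbReducer.reduce over all combinations of states, retry counters
// and intervals.
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class TestStateTransition {
  static final int RETRY_MAX = 3;               // db.fetch.retry.max
  static final byte STATUS_DB_UNFETCHED = 0x01; // mirrors CrawlDatum
  static final byte STATUS_DB_GONE = 0x03;

  // intended rule: after RETRY_MAX transient errors the page is gone
  static byte afterRetries(byte status, int retries) {
    return retries >= RETRY_MAX ? STATUS_DB_GONE : status;
  }

  @Test
  public void testGoneAfterMaxRetries() {
    assertEquals(STATUS_DB_GONE, afterRetries(STATUS_DB_UNFETCHED, RETRY_MAX));
  }
}
{code}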

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-12-06 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13525439#comment-13525439
 ] 

Sebastian Nagel commented on NUTCH-1245:


@kiran: yes, 2.x is affected since the fetch schedulers do not differ (much) 
between 1.x and 2.x. However, with default settings you need a couple of 
months of continuous crawling to run into this problem.
@Markus: good news! Pulled the test out to NUTCH-1502 (broader coverage, need 
more time).
Are there objections regarding the proposed patch?


> URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb 
> and is generated over and over again
> 
>
> Key: NUTCH-1245
> URL: https://issues.apache.org/jira/browse/NUTCH-1245
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.4, 1.5
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.7
>
> Attachments: NUTCH-1245-1.patch, NUTCH-1245-2.patch, 
> NUTCH-1245-578-TEST-1.patch, NUTCH-1245-578-TEST-2.patch
>
>
> A document gone with 404 after db.fetch.interval.max (90 days) has passed
> is fetched over and over again but although fetch status is fetch_gone
> its status in CrawlDb keeps db_unfetched. Consequently, this document will
> be generated and fetched from now on in every cycle.
> To reproduce:
> # create a CrawlDatum in CrawlDb which retry interval hits 
> db.fetch.interval.max (I manipulated the shouldFetch() in 
> AbstractFetchSchedule to achieve this)
> # now this URL is fetched again
> # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to 
> db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 
> days)
> # this does not change with every generate-fetch-update cycle, here for two 
> segments:
> {noformat}
> /tmp/testcrawl/segments/20120105161430
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:14:21 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:14:48 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> /tmp/testcrawl/segments/20120105161631
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:16:23 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:20:05 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> {noformat}
> As far as I can see it's caused by setPageGoneSchedule() in 
> AbstractFetchSchedule. Some pseudo-code:
> {code}
> setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
> datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * 
> maxInterval
> datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
> if (maxInterval < datum.fetchInterval) // necessarily true
>forceRefetch()
> forceRefetch:
> if (datum.fetchInterval > maxInterval) // true because it's 1.35 * 
> maxInterval
>datum.fetchInterval = 0.9 * maxInterval
> datum.status = db_unfetched // 
> shouldFetch (called from generate / Generator.map):
> if ((datum.fetchTime - curTime) > maxInterval)
>// always true if the crawler is launched in short intervals
>// (lower than 0.35 * maxInterval)
>datum.fetchTime = curTime // forces a refetch
> {code}
> After setPageGoneSchedule is called via update the state is db_unfetched and 
> the retry interval 0.9 * db.fetch.interval.max (81 days). 
> Although the fetch time in the CrawlDb is far in the future
> {noformat}
> % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
> URL: http://localhost/page_gone
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Sun May 06 05:20:05 CEST 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Score: 1.0
> Signature: null
> Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> {noformat}
> the URL is generated and fetched over and over again.

[jira] [Commented] (NUTCH-1503) Configuration properties not in sync between FetcherReducer and nutch-default.xml

2012-12-11 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13529497#comment-13529497
 ] 

Sebastian Nagel commented on NUTCH-1503:


Hi Lewis,
both time limit properties are necessary:
* fetcher.timelimit.mins for the user to configure the limit (max. duration in 
minutes)
* fetcher.timelimit (internal use only) to set the time by which the fetcher 
has to finish (absolute system time in milliseconds, the same for all 
distributed tasks)

Regarding fetcher.threads.per.host.by.ip: maybe we should not add already 
deprecated properties which will be removed later anyway (cf. NUTCH-1409).
+1 for adding fetcher.queue.use.host.settings to nutch-default.xml
Btw., your efforts to clean up properties reminded me that some time ago I 
promised on 
[user@nutch|http://lucene.472066.n3.nabble.com/Javadoc-incorrect-or-missing-code-in-1-5-1-Generator-td3997883.html]
 to prepare a list of all Nutch properties with flags whether they are 
"defined" and documented in nutch-default.xml: [it's in the wiki 
now|http://wiki.apache.org/nutch/NutchPropertiesCompleteList].
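A minimal sketch of that conversion (assuming a Hadoop {{Configuration}} 
object {{conf}}; the property names follow the description above):
{code}
// user-facing: maximum fetch duration in minutes
long timelimitMins = conf.getLong("fetcher.timelimit.mins", -1);
if (timelimitMins != -1) {
  // internal: absolute deadline in milliseconds, identical for all
  // distributed fetcher tasks started from this configuration
  conf.setLong("fetcher.timelimit",
      System.currentTimeMillis() + timelimitMins * 60 * 1000);
}
{code}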


> Configuration properties not in sync between FetcherReducer and 
> nutch-default.xml
> -
>
> Key: NUTCH-1503
> URL: https://issues.apache.org/jira/browse/NUTCH-1503
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 2.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1503.patch
>
>
> FetcherReducer.java
> Bug: Following properties appear in FetcherReducer but not in 
> nutch-default.xml
> {code}
> 290   useHostSettings = 
> conf.getBoolean("fetcher.queue.use.host.settings", false);
> 300   this.timelimit = conf.getLong("fetcher.timelimit", -1);
> 450   this.byIP = conf.getBoolean("fetcher.threads.per.host.by.ip", true);
> 698   timelimit = context.getConfiguration().getLong("fetcher.timelimit", 
> -1); 
> {code}
> Therefore they cannot be used properly in code execution and must be updated, 
> removed and/or added to nutch-default.xml.
> Patch coming up just now.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1038) Port IndexingFiltersChecker to 2.0

2012-12-12 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1038:
---

Attachment: NUTCH-1038v2.patch

Hi Lewis, it's a problem of the patch: the fetch time of a WebPage (unlike 
CrawlDatum) must be set explicitly. Good catch! An improved patch is attached.

> Port IndexingFiltersChecker to 2.0
> --
>
> Key: NUTCH-1038
> URL: https://issues.apache.org/jira/browse/NUTCH-1038
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
> Fix For: 2.2
>
> Attachments: NUTCH-1038.patch, NUTCH-1038v2.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1514) Phase out the deprecated configuration properties (if possible)

2013-01-06 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545480#comment-13545480
 ] 

Sebastian Nagel commented on NUTCH-1514:


+1
But do we need a reference to the removed property in nutch-default.xml?
{quote}
Replaces the deprecated parameter db.default.fetch.interval.
{quote}
It has been deprecated for a long time now, so it could be removed without a 
trace.

> Phase out the deprecated configuration properties (if possible)
> ---
>
> Key: NUTCH-1514
> URL: https://issues.apache.org/jira/browse/NUTCH-1514
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher, generator
>Affects Versions: 1.6, 2.1
>Reporter: Tejas Patil
>Priority: Trivial
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1514.patch
>
>
> In reference to [0], the deprecated configuration properties can be removed 
> (only if possible without affecting the functionality).
> [0] : 
> http://mail-archives.apache.org/mod_mbox/nutch-user/201301.mbox/%3ccafkhtfwvm7w-cvusgzwkegdcwrvshptbdftdcn1nnpm1z2-...@mail.gmail.com%3E

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines

2013-01-12 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552028#comment-13552028
 ] 

Sebastian Nagel commented on NUTCH-1499:


So, a vote for "won't fix". Comments?

> Usage of multiple ipv4 addresses and network cards on fetcher machines
> --
>
> Key: NUTCH-1499
> URL: https://issues.apache.org/jira/browse/NUTCH-1499
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 1.5.1
>Reporter: Walter Tietze
>Priority: Minor
> Fix For: 1.7
>
> Attachments: apache-nutch-1.5.1.NUTCH-1499.patch
>
>
> Adds the ability for the fetcher threads to use multiple configured IPv4 
> addresses.
> On some cluster machines several IPv4 addresses are configured, where each 
> IP address is associated with its own network interface.
> This patch makes it possible to configure protocol-http and 
> protocol-httpclient to use these network interfaces in round-robin style.
> If the feature is enabled, a helper class reads the network configuration at 
> *startup*. For each http network connection the next IP address is taken. 
> This method is synchronized, but this should be no bottleneck for the 
> overall performance of the fetcher threads.
> This feature is tested on our cluster for the protocol-http and 
> protocol-httpclient protocols.
> 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (NUTCH-813) Repetitive crawl 403 status page

2013-01-12 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-813.
---

Resolution: Duplicate

The described problem is identical to that of NUTCH-578. The provided patch 
(call setPageGoneSchedule when retry counter hits db.fetch.retry.max) is 
included in all patches of NUTCH-578.

> Repetitive crawl 403 status page
> 
>
> Key: NUTCH-813
> URL: https://issues.apache.org/jira/browse/NUTCH-813
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.1
>Reporter: Nguyen Manh Tien
>Priority: Minor
> Fix For: 1.7
>
> Attachments: ASF.LICENSE.NOT.GRANTED--Patch
>
>
> When we crawl a page that returns a 403 status, it will be crawled 
> repeatedly each day with the default schedule,
> even when we restrict it with the parameter db.fetch.retry.max.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1345) JAVA_HOME should not be required

2013-01-12 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552082#comment-13552082
 ] 

Sebastian Nagel commented on NUTCH-1345:


JAVA_HOME (or NUTCH_JAVA_HOME) is currently used for two things:
# using $JAVA_HOME/bin/java as the Java executable
# determining the location of lib/tools.jar which is part of the JDK (not the 
JRE). It's probably an unneeded artifact, cf. MAPREDUCE-3624 and HADOOP-7374.

If JAVA_HOME is not set, bin/nutch definitely refuses to work. I agree that 
setting an environment variable may be a little hurdle, however there are 
arguments in favour of using JAVA_HOME:
- I had to install Nutch on many customers' machines where the default java 
executable on the PATH was not the correct one (>= 1.6): setting JAVA_HOME is 
more transparent than manipulating the PATH. NUTCH_JAVA_HOME is even more 
explicit.
- backward compatibility: Nutch should be run by the same JVM as before, not 
accidentally by another one.
- staying parallel to Hadoop, which still uses JAVA_HOME

Btw., let JAVA_HOME point to /usr/lib/jvm/default-java for Ubuntu's 
update-alternatives.

> JAVA_HOME should not be required
> 
>
> Key: NUTCH-1345
> URL: https://issues.apache.org/jira/browse/NUTCH-1345
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Ben McCann
>Priority: Minor
> Attachments: nutch, nutch.patch
>
>
> Trying to run Nutch spits out the message "Error: JAVA_HOME is not set."  I 
> already have java on my path, so I really wish I didn't need to set 
> JAVA_HOME.  It's an extra step to get up and running and is not updated by 
> Ubuntu's update-alternatives, so it makes it a lot harder to switch between 
> versions of Java.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script

2013-01-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554353#comment-13554353
 ] 

Sebastian Nagel commented on NUTCH-1087:


Hi Tristan,
thanks for the patch! The solrindex segment path was already reported in 
NUTCH-1500.
Can you open a new issue for the Mac OS problem? It's more transparent to 
separate the problems than to reopen resolved issues. Thanks. Btw., maybe a 
simple solution
{code}
SEGMENT=`ls $CRAWL_PATH/segments/ | sort -n | tail -n 1`
{code}
without sed or awk is preferable. Does it work on Mac OS?

> Deprecate crawl command and replace with example script
> ---
>
> Key: NUTCH-1087
> URL: https://issues.apache.org/jira/browse/NUTCH-1087
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>Assignee: Julien Nioche
>Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1087-1.6-2.patch, NUTCH-1087-1.6-3.patch, 
> NUTCH-1087-2.1-2.patch, NUTCH-1087-2.1.patch, NUTCH-1087.patch
>
>
> * remove the crawl command
> * add basic crawl shell script
> See thread:
> http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (NUTCH-1500) bin/crawl fails on step solrindex with wrong path to segment

2013-01-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1500.


Resolution: Fixed

committed to trunk (rev. 1433658)

> bin/crawl fails on step solrindex with wrong path to segment
> 
>
> Key: NUTCH-1500
> URL: https://issues.apache.org/jira/browse/NUTCH-1500
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.6
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.7
>
> Attachments: NUTCH-1500.patch
>
>
> The bin/crawl script calls the command (bin/nutch) solrindex with the wrong 
> path to the segment which causes solrindex to fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script

2013-01-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554381#comment-13554381
 ] 

Sebastian Nagel commented on NUTCH-1087:


yes, of course, but currently there is already an if-else to separate local 
from distributed mode. But let's move the discussion to a new issue.

> Deprecate crawl command and replace with example script
> ---
>
> Key: NUTCH-1087
> URL: https://issues.apache.org/jira/browse/NUTCH-1087
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>Assignee: Julien Nioche
>Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1087-1.6-2.patch, NUTCH-1087-1.6-3.patch, 
> NUTCH-1087-2.1-2.patch, NUTCH-1087-2.1.patch, NUTCH-1087.patch
>
>
> * remove the crawl command
> * add basic crawl shell script
> See thread:
> http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1520) SegmentMerger looses records

2013-01-17 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556093#comment-13556093
 ] 

Sebastian Nagel commented on NUTCH-1520:


Hi Markus,
have a look at NUTCH-1113. An alternative solution is to take, in certain 
cases, more than one CrawlDatum into the merged segment.

> SegmentMerger looses records
> 
>
> Key: NUTCH-1520
> URL: https://issues.apache.org/jira/browse/NUTCH-1520
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.6
>Reporter: Markus Jelsma
>Priority: Critical
> Fix For: 1.7
>
> Attachments: NUTCH-1520-1.7-1.patch
>
>
> It seems the SegmentMerger tool loses documents. You're likely to see fewer 
> documents in an index if you index one or more already merged segments than 
> if you index all unmerged segments.
> This is really nasty!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-01-28 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564274#comment-13564274
 ] 

Sebastian Nagel commented on NUTCH-1465:


Hi Tejas,
thanks and a few comments on the patch:

??for a given host, sitemaps are processed just once?? But they are not 
cached over cycles because the cache is bound to the protocol object. Is this 
correct? So a sitemap is fetched and processed every cycle for every host? If 
yes, and sitemaps are large (they can be!), this would cause a lot of extra 
traffic.

Shouldn't sitemap URLs be handled the same way as any other URL: add them to 
CrawlDb, fetch and parse once, add found links to CrawlDb, cf. [Ken's post at 
CC|https://groups.google.com/forum/?fromgroups#!topic/crawler-commons/DrAX4Th1A4I].
There are some complications:
- due to their size, sitemaps may require larger values for size and time 
limits
- sitemaps may require more frequent re-fetching (e.g., by 
MimeAdaptiveFetchSchedule)
- the current Outlink class cannot hold the extra information contained in 
sitemaps (lastmod, changefreq, etc.); a sketch of such an extended record 
follows below
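A hypothetical sketch of an extended outlink record (names and fields are 
illustrative, not existing Nutch code):
{code}
// Hypothetical sketch: the extra per-URL information a sitemap can provide,
// which a plain Outlink (URL + anchor) cannot carry.
public class SitemapOutlink {
  private final String url;
  private final long lastModified;      // <lastmod>, as epoch milliseconds
  private final String changeFrequency; // <changefreq>: "always" ... "never"
  private final float priority;         // <priority>: 0.0-1.0, default 0.5

  public SitemapOutlink(String url, long lastModified,
                        String changeFrequency, float priority) {
    this.url = url;
    this.lastModified = lastModified;
    this.changeFrequency = changeFrequency;
    this.priority = priority;
  }

  public String getUrl() { return url; }
  public long getLastModified() { return lastModified; }
  public String getChangeFrequency() { return changeFrequency; }
  public float getPriority() { return priority; }
}
{code}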

There is another way, which we use for several customers: a SitemapInjector 
fetches the sitemaps, extracts URLs and injects them with all extra 
information. It's a simple use case for a customized site-search: there is a 
sitemap and it shall be used as the seed list or even the exclusive list of 
documents to be crawled. Is there any interest in this solution? It's not a 
general solution and not adaptable to a large web crawl.


> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.7
>
> Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-01-28 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564768#comment-13564768
 ] 

Sebastian Nagel commented on NUTCH-1465:




Yes, SitemapInjector is a map-reduce job. The scenario for its use is the 
following:
- a small set of sites to be crawled (e.g., to feed a site-search index)
- you can think of sitemaps as "remote seed lists": because many content 
management systems can generate sitemaps, it is convenient for site owners to 
publish seeds this way. The URLs contained in the sitemap can also be the 
complete and exclusive set of URLs to be crawled (you can use the plugin 
scoring-depth to limit the crawl to the seed URLs; a configuration sketch 
follows below)
- because you can trust the sitemap's content:
-* checks for "cross submissions" are not necessary
-* extra information (lastmod, changefreq, priority) can be used
That's why we use sitemaps: remote seed lists, maintained by customers, quite 
convenient if you run a crawler as a service.
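
A minimal sketch of that restriction, for a programmatically driven crawl (the 
property name scoring.depth.max is assumed from the scoring-depth plugin; in a 
normal setup the same two properties would go into nutch-site.xml):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class SeedOnlyCrawlConfig {
  public static void main(String[] args) {
    // Sketch: activate the scoring-depth plugin and keep the crawl
    // limited to the injected (sitemap) seed URLs.
    Configuration conf = NutchConfiguration.create();
    conf.set("plugin.includes",
        conf.get("plugin.includes") + "|scoring-depth");
    conf.setInt("scoring.depth.max", 1); // do not follow outlinks of seeds
    System.out.println("scoring.depth.max = "
        + conf.getInt("scoring.depth.max", -1));
  }
}
{code}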

For large web crawls there is also another aspect: the detection of sitemaps, 
which is bound to the processing of robots.txt. Processing of sitemaps can 
(and should?) be done the usual Nutch way:
- detection is done in the protocol plugin (see Tejas' patch)
- recording in the CrawlDb is done by the Fetcher (cross-submission 
information can be added)
- fetch (if not yet done), parse (a plugin parse-sitemap based on 
crawler-commons?) and extract outlinks: sitemaps may require special treatment 
here because they can be large and usually contain many outlinks. Also, the 
Outlink class needs to be extended to hold the extra information relevant for 
scheduling.
Using an extra tool (such as the SitemapInjector) to process the sitemaps has 
the disadvantage that we first must get all sitemap URLs out of the CrawlDb. 
On the other hand, special treatment can easily be realized in a separate 
map-reduce job.

Comments?!

Thanks, Tejas: the feature is moving forward thanks to your initiative!

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.7
>
> Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564827#comment-13564827
 ] 

Sebastian Nagel commented on NUTCH-1047:


As a test of the interface I started to implement a CSV indexer, useful for 
exporting crawled data or for quick analysis. The first working version (a 
draft, still a lot to do) took just over 100 lines of code: +1 for the 
interface / extension point.
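
For illustration, a stripped-down version of such a CSV indexer might look 
like the sketch below; the IndexWriter method signatures are assumed from the 
current patch and may differ, the property name and output handling are 
invented, and there is no CSV quoting or escaping:

{code:java}
import java.io.IOException;
import java.io.PrintWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.indexer.IndexWriter;
import org.apache.nutch.indexer.NutchDocument;

// Sketch of a minimal CSV indexing backend.
public class CSVIndexWriter implements IndexWriter {

  private Configuration conf;
  private PrintWriter out;
  private String[] fields; // CSV columns, configurable via a property

  public void open(JobConf job, String name) throws IOException {
    // property name "csvindexer.fields" is made up for this sketch
    fields = job.get("csvindexer.fields", "id,title,url").split(",");
    out = new PrintWriter("nutch.csv"); // simplified: a local file
  }

  public void write(NutchDocument doc) throws IOException {
    StringBuilder line = new StringBuilder();
    for (String f : fields) {
      if (line.length() > 0) line.append(',');
      Object value = doc.getFieldValue(f);
      if (value != null) line.append(value.toString());
    }
    out.println(line);
  }

  public void delete(String key) throws IOException {} // no-op for CSV
  public void update(NutchDocument doc) throws IOException { write(doc); }
  public void commit() throws IOException { out.flush(); }
  public void close() throws IOException { out.close(); }

  public String describe() { return "CSVIndexWriter - writes docs as CSV"; }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}
{code}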

Some concerns about the usability of IndexingJob as a "daily" tool:
- it's not really transparent which indexer is run (Solr, ElasticSearch, 
etc.): you have to look into the property plugin.includes
- options must be passed to indexer plugins as properties: complicated, and 
there is no help for getting a list of available properties



> Pluggable indexing backends
> ---
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Julien Nioche
>Assignee: Julien Nioche
>  Labels: indexing
> Fix For: 1.7
>
> Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
> NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch
>
>
> One possible feature would be to add a new endpoint for indexing-backends and 
> make the indexing plugable. at the moment we are hardwired to SOLR - which is 
> OK - but as other resources like ElasticSearch are becoming more popular it 
> would be better to handle this as plugins. Not sure about the name of the 
> endpoint though : we already have indexing-plugins (which are about 
> generating fields sent to the backends) and moreover the backends are not 
> necessarily for indexing / searching but could be just an external storage 
> e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
> could be pertaining to the storage in GORA. 'indexing-backend' is the best 
> name that came to my mind so far - please suggest better ones.
> We should come up with generic map/reduce jobs for indexing, deduplicating 
> and cleaning and maybe add a Nutch extension point there so we can easily 
> hook up indexing, cleaning and deduplicating for various backends.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

