[jira] Updated: (NUTCH-814) SegmentMerger bug
[ https://issues.apache.org/jira/browse/NUTCH-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-814: Attachment: merger.patch Patch fixing the issue, and a unit test. I will commit this shortly. SegmentMerger bug - Key: NUTCH-814 URL: https://issues.apache.org/jira/browse/NUTCH-814 Project: Nutch Issue Type: Bug Affects Versions: 1.1 Reporter: Dennis Kubes Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: merger.patch Dennis reported: {quote} In the SegmentMerger.java file about line 150 we have this: final SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(job), fSplit.getPath(), job); Then about line 166 in the record reader we have this: boolean res = reader.next(key, w); If I am reading that right, that would mean that the map task would loop over all records for a given file and not just a given split. {quote} Right, this should instead use SequenceFileRecordReader, which already has the logic to handle splits. Patch coming shortly - thanks for spotting this! This could be the reason for the out-of-disk-space errors that many users reported. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
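For illustration, here is a minimal sketch (not the actual merger.patch) of the kind of record reader the fix implies: delegating to Hadoop's SequenceFileRecordReader, which honours split boundaries, instead of opening a raw SequenceFile.Reader on the split's path. The class and field names below are assumptions.

```java
// Hedged sketch only -- not the committed SegmentMerger code.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.SequenceFileRecordReader;

public class SplitBoundRecordReader {
  private final SequenceFileRecordReader<Text, Writable> reader;

  public SplitBoundRecordReader(Configuration conf, FileSplit split) throws IOException {
    // SequenceFileRecordReader seeks to the split's start offset and stops at its end,
    // unlike new SequenceFile.Reader(fs, fSplit.getPath(), conf), which reads the whole file.
    reader = new SequenceFileRecordReader<Text, Writable>(conf, split);
  }

  public boolean next(Text key, Writable value) throws IOException {
    return reader.next(key, value); // false once this split (not the whole file) is exhausted
  }

  public void close() throws IOException {
    reader.close();
  }
}
```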
[jira] Work stopped: (NUTCH-466) Flexible segment format
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-466 stopped by Andrzej Bialecki . Flexible segment format --- Key: NUTCH-466 URL: https://issues.apache.org/jira/browse/NUTCH-466 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: ParseFilters.java, segmentparts.patch In many situations it is necessary to store more data associated with pages than it's possible now with the current segment format. Quite often it's a binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData, the other is to use an external independent database using page ID-s as foreign keys. Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment parts, with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios. Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts. Example applications: * storing HTML previews of non-HTML pages, such as PDF, PS and Office documents * storing pre-tokenized version of plain text for faster snippet generation * storing linguistically tagged text for sophisticated data mining * storing image thumbnails etc, etc ... I'm going to prepare a patchset shortly. Any comments and suggestions are welcome. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
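To make the proposal concrete, a hedged sketch of what an arbitrarily named segment part could look like under this scheme: just another MapFile directory inside the segment, keyed by URL and holding any Writable value. The part name "html_preview" and the use of Text values are made up for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class CustomSegmentPart {
  // Write one entry into a hypothetical "html_preview" part of a segment.
  public static void write(Configuration conf, Path segment) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path part = new Path(segment, "html_preview");
    MapFile.Writer writer = new MapFile.Writer(conf, fs, part.toString(), Text.class, Text.class);
    writer.append(new Text("http://example.com/doc.pdf"), new Text("<html>preview</html>"));
    writer.close();
  }

  // Look up the value stored for a URL in the same part.
  public static Text read(Configuration conf, Path segment, String url) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path part = new Path(segment, "html_preview");
    MapFile.Reader reader = new MapFile.Reader(fs, part.toString(), conf);
    Text value = new Text();
    reader.get(new Text(url), value); // positions on the key and fills value if present
    reader.close();
    return value;
  }
}
```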
[jira] Updated: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE
[ https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-812: Affects Version/s: 1.1 Priority: Critical (was: Major) Crawl.java incorrectly uses the Generator API resulting in NPE -- Key: NUTCH-812 URL: https://issues.apache.org/jira/browse/NUTCH-812 Project: Nutch Issue Type: Bug Affects Versions: 1.1 Reporter: Andrzej Bialecki Priority: Critical As reported by Phil Barnett on nutch-user: {quote} The Fix. In line 131 of Crawl.java Generate no longer returns segments like it used to. Now it returns segs. line 131 needs to read If (segs == null) Instead of the current If (segments == null) After that change and a recompile, crawl is working just fine. {quote} -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[VOTE] Board resolution for Nutch as TLP
Hi, Following the discussion, below is the text of the proposed Board Resolution to vote upon. [] +1. Request the Board make Nutch a TLP [] +0. I don't feel strongly about it, but I'm okay with this. [] -1. No, don't request the Board make Nutch a TLP, and here are my reasons... This is a majority count vote (i.e. no vetoes). The vote is open for 72 hours. Here's my +1. === X. Establish the Apache Nutch Project WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web search platform for distribution at no charge to the public. NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the Apache Nutch Project, be and hereby is established pursuant to Bylaws of the Foundation; and be it further RESOLVED, that the Apache Nutch Project be and hereby is responsible for the creation and maintenance of software related to a large-scale web crawling platform; and be it further RESOLVED, that the office of Vice President, Apache Nutch be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache Nutch Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache Nutch Project; and be it further RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Nutch Project: • Andrzej Bialecki a...@... • Otis Gospodnetic o...@... • Dogacan Guney doga...@... • Dennis Kubes ku...@... • Chris Mattmann mattm...@... • Julien Nioche jnio...@... • Sami Siren si...@... RESOLVED, that the Apache Nutch Project be and hereby is tasked with the migration and rationalization of the Apache Lucene Nutch sub-project; and be it further RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Lucene Project are hereafter discharged. NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki be appointed to the office of Vice President, Apache Nutch, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed. === -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Hold on... (Re: [VOTE] Board resolution for Nutch as TLP)
On 2010-04-12 12:57, Andrzej Bialecki wrote: Hi, Following the discussion, below is the text of the proposed Board Resolution to vote upon. Ehh, scrap that ... I missed one occurrence of the crawling platform. Resending... -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[VOTE 2] Board resolution for Nutch as TLP
Hi, Take two, after s/crawling/search/ ... Following the discussion, below is the text of the proposed Board Resolution to vote upon. [] +1. Request the Board make Nutch a TLP [] +0. I don't feel strongly about it, but I'm okay with this. [] -1. No, don't request the Board make Nutch a TLP, and here are my reasons... This is a majority count vote (i.e. no vetoes). The vote is open for 72 hours. Here's my +1. === X. Establish the Apache Nutch Project WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web search platform for distribution at no charge to the public. NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the Apache Nutch Project, be and hereby is established pursuant to Bylaws of the Foundation; and be it further RESOLVED, that the Apache Nutch Project be and hereby is responsible for the creation and maintenance of software related to a large-scale web search platform; and be it further RESOLVED, that the office of Vice President, Apache Nutch be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache Nutch Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache Nutch Project; and be it further RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Nutch Project: • Andrzej Bialecki a...@... • Otis Gospodnetic o...@... • Dogacan Guney doga...@... • Dennis Kubes ku...@... • Chris Mattmann mattm...@... • Julien Nioche jnio...@... • Sami Siren si...@... RESOLVED, that the Apache Nutch Project be and hereby is tasked with the migration and rationalization of the Apache Lucene Nutch sub-project; and be it further RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Lucene Project are hereafter discharged. NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki be appointed to the office of Vice President, Apache Nutch, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed. === -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: [DISCUSS] Board resolution for Nutch as TLP
On 2010-04-10 04:13, Mattmann, Chris A (388J) wrote: Hi Andrzej, +1, with the following amendment: RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Nutch Project are hereafter discharged. This should read: RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Lucene Project are hereafter discharged. Good catch, thanks. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: [DISCUSS] Board resolution for Nutch as TLP
On 2010-04-10 15:32, Jukka Zitting wrote: Hi, On Fri, Apr 9, 2010 at 6:52 PM, Andrzej Bialecki a...@getopt.org wrote: WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web crawling platform for distribution at no charge to the public. Would it make sense to simplify the scope to ... open-source software related to large-scale web crawling for distribution at no charge to the public? Yes, that's a good change too. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[DISCUSS] Board resolution for Nutch as TLP
Hi, I was told that the next step is to come up with the proposed Board resolution and vote it among committers. Here's the proposed text (shameless copypaste from Tika and Mahout proposals). IMPORTANT NOTE: I removed from the members of the PMC those existing Nutch committers that haven't been active for more than 1 year, with the intention of moving them to Emeritus status. If any one of these people feels left out and would like to become an active committer in the project, please let us know and we will gladly welcome you back :) The text of the resolution follows. Committers, please read it and optionally comment on the salient points of the text, the rest is boilerplate. If there's an overall consensus I will call for a formal vote to submit this proposal to the Board. == X. Establish the Apache Nutch Project WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web crawling platform for distribution at no charge to the public. NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the Apache Nutch Project, be and hereby is established pursuant to Bylaws of the Foundation; and be it further RESOLVED, that the Apache Nutch Project be and hereby is responsible for the creation and maintenance of software related to a large-scale web crawling platform; and be it further RESOLVED, that the office of Vice President, Apache Nutch be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache Nutch Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache Nutch Project; and be it further RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Nutch Project: • Andrzej Bialecki a...@... • Otis Gospodnetic o...@... • Dogacan Guney doga...@... • Dennis Kubes ku...@... • Chris Mattmann mattm...@... • Julien Nioche jnio...@... • Sami Siren si...@... RESOLVED, that the Apache Nutch Project be and hereby is tasked with the migration and rationalization of the Apache Lucene Nutch sub-project; and be it further RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Nutch Project are hereafter discharged. NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki be appointed to the office of Vice President, Apache Nutch, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed. = -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch 2.0 roadmap
On 2010-04-07 18:54, Doğacan Güney wrote: Hey everyone, On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch? I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ... I know... But I still intend to finish it, I just need to schedule some time for it. My vote would be to go with nutchbase. Hmm .. this puzzles me, do you think we should port changes from 1.1 to nutchbase? I thought we should do it the other way around, i.e. merge nutchbase bits to trunk. * support for HBase : via ORM or not (see NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808 ) This IMHO is promising, this could open the doors to small-to-medium installations that are currently too cumbersome to handle. Yeah, there is already a simple ORM within nutchbase that is avro-based and should be generic enough to also support MySQL, cassandra and berkeleydb. But any good ORM will be a very good addition. Again, the advantage of DataNucleus is that we don't have to handcraft all the mid- to low-level mappings, just the mid-level ones (JOQL or whatever) - the cost of maintenance is lower, and the number of backends that are supported out of the box is larger. Of course, this is just IMHO - we won't know for sure until we try to use both your custom ORM and DataNucleus... -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch 2.0 roadmap
On 2010-04-07 19:24, Enis Söztutar wrote: Also, the goal of the crawler-commons project is to provide APIs and implementations of stuff that is needed for every open source crawler project, like: robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats. So, it seems that at some point, we need to bite the bullet, and refactor plugins, dropping backwards compatibility. Right, that was my point - now is the time to break it, with the cut-over to 2.0, and leaving 1.1 branch in a good shape, to serve well enough in the interim period. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch 2.0 roadmap
On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch? I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ... Talking about features, what else would we add apart from : * support for HBase : via ORM or not (see NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808 ) This IMHO is promising, this could open the doors to small-to-medium installations that are currently too cumbersome to handle. * plugin cleanup : Tika only for parsing - get rid of everything else? Basically, yes - keep only stuff like HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika from whatever original format. Also, the goal of the crawler-commons project is to provide APIs and implementations of stuff that is needed for every open source crawler project, like: robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats. * remove index / search and delegate to SOLR +1 - we may still keep a thin abstract layer to allow other indexing/search backends, but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there. Regarding search - currently the search API is too low-level, with the custom text and query analysis chains. This needlessly introduces the (in)famous Nutch Query classes and Nutch query syntax limitations, We should get rid of it and simply leave this part of the processing to the search backend. Probably we will use the SolrCloud branch that supports sharding and global IDF. * new functionalities e.g. sitemap support, canonical tag etc... Plus a better handling of redirects, detecting duplicated sites, detection of spam cliques, tools to manage the webgraph, etc. I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an update? Definitely. :) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Question: Nutch 0.8.2 and Nutch 0.7.3?
On 2010-04-04 02:59, Mattmann, Chris A (388J) wrote: Hey Guys, Question. I see 2 releases that haven't been cut in JIRA: 0.8.2: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10680&fixfor=12312064 0.7.3: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10680&fixfor=12312176 I'm happy to cut 0.8.2 as part of the 1.1 effort, to get it out the door. However, I have a question: is this Nutch 0.8.2 in SVN? http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.8/ That's the code that was intended to become 0.8.2 ... However, I'm not sure whether there's any benefit in releasing either of these. Those who really had the need to track this branch (or 0.7) likely used the code from this branch even though it wasn't released. And I believe we are not interested in maintaining a new release based on this code...? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: [VOTE] Apache Tika 0.7 Release Candidate #1
On 2010-04-02 16:14, Mattmann, Chris A (388J) wrote: * Once Tika 0.7 is out the door, I will move forward on pushing out a Nutch 1.1 RC (after we upgrade Nutch to use Tika 0.7 -- Julien, help? :) ). That OK, Nutchers? Yes - thanks! -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] Commented: (NUTCH-789) Improvements to Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851331#action_12851331 ] Andrzej Bialecki commented on NUTCH-789: - There are no diffs, so it's difficult to figure out what's changed ... I think that Tika will soon release v. 0.7 which may also impact this patch if we decide to upgrade before our release. I asked the Tika guys about their release, let's wait a couple days more. Improvements to Tika parser --- Key: NUTCH-789 URL: https://issues.apache.org/jira/browse/NUTCH-789 Project: Nutch Issue Type: Improvement Components: fetcher Environment: reported by Sami, in NUTCH-766 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.1 Attachments: NutchTikaConfig.java, TikaParser.java As reported by Sami in NUTCH-766, Sami has a few improvements he made to the Tika parser. We'll track that progress here. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-784) CrawlDBScanner
[ https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850896#action_12850896 ] Andrzej Bialecki commented on NUTCH-784: - This should have been reviewed first - I don't question the usefulness of this class, but I think that this should have been added as an option to CrawlDbReader. As it is now we get a new tool with a cryptic name that performs a function that is a variant of another existing tool... CrawlDBScanner --- Key: NUTCH-784 URL: https://issues.apache.org/jira/browse/NUTCH-784 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-784.patch The patch file contains a utility which dumps all the entries matching a regular expression on their URL. The dump mechanism of the crawldb reader is not very useful on large crawldbs as the output can be extremely large and the -url function can't help if we don't know what url we want to have a look at. The CrawlDBScanner can either generate a text representation of the CrawlDatum-s or binary objects which can then be used as a new CrawlDB. Usage: CrawlDBScanner <crawldb> <output> <regex> [-s status] -text regex: regular expression on the crawldb key -s status : constraint on the status of the crawldb entries e.g. db_fetched, db_unfetched -text : if this parameter is used, the output will be of TextOutputFormat; otherwise it generates a 'normal' crawldb with the MapFileOutputFormat for instance the command below : ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* -s db_fetched -text will generate a text file /tmp/amazon-dump containing all the entries of the crawldb matching the regexp .+amazon.com.* and having a status of db_fetched -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL
[ https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850931#action_12850931 ] Andrzej Bialecki commented on NUTCH-785: - +1. The scoring api should allow us to set this metadata in one call, but changing the API now would be problematic. Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL --- Key: NUTCH-785 URL: https://issues.apache.org/jira/browse/NUTCH-785 Project: Nutch Issue Type: Bug Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-785.patch When following the redirections, the Fetcher does not copy the metadata from the original URL to the new one or calls the method scfilters.initialScore -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
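A hedged sketch of what the fix amounts to conceptually (names assumed from the issue description, not copied from NUTCH-785.patch): when the fetcher creates the datum for a redirect target, carry over the origin's metadata and let the scoring filters assign an initial score.

```java
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.ScoringFilterException;
import org.apache.nutch.scoring.ScoringFilters;

public class RedirectDatumFactory {
  public static CrawlDatum datumForRedirect(Text redirUrl, CrawlDatum origin,
                                            ScoringFilters scfilters, int fetchInterval) {
    CrawlDatum newDatum = new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED, fetchInterval);
    newDatum.getMetaData().putAll(origin.getMetaData()); // copy metadata from the origin URL
    try {
      scfilters.initialScore(redirUrl, newDatum); // let scoring plugins initialize the score
    } catch (ScoringFilterException e) {
      newDatum.setScore(0.0f); // fall back to a neutral score
    }
    return newDatum;
  }
}
```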
[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850939#action_12850939 ] Andrzej Bialecki commented on NUTCH-779: - CrawlDbReducer, the cramped line {{if (metaFromParse!=null){}} needs some whitespace fixing. Other than that, +1. Mechanism for passing metadata from parse to crawldb Key: NUTCH-779 URL: https://issues.apache.org/jira/browse/NUTCH-779 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-779, NUTCH-779-v2.patch The patch attached allows to pass parse metadata to the corresponding entry of the crawldb. Comments are welcome -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12848090#action_12848090 ] Andrzej Bialecki commented on NUTCH-762: - I just noticed that the new Generator uses different config property names (generator. vs. generate.), and the older versions are now marked with (Deprecated). However, this doesn't reflect the reality - properties with old names are simply ignored now, whereas deprecated implies that they should still work. For back-compat reason I think they should still work - the current (admittedly awkward) prefix is good enough, and I think that changing it in a minor release would create confusion. I suggest reverting to the old names where appropriate, and add new properties with the same prefix, i.e. generate.. Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-762-v2.patch, NUTCH-762-v3.patch When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB then update the Db only once on several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well as we need to read the whole crawlDB as many time as we generate a segment. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: * can filter the URLs by score * normalisation is optional * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can max the number of URLs per host or domain (but not by IP) * can choose to partition by host, domain or IP Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers. The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ... with the following options : MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator - apart from : -noNorm (explicit) -topN : max number of URLs per segment -maxNumSegments : the actual number of segments generated could be less than the max value select e.g. not enough URLs are available for fetching and fit in less segments Please give it a try and less me know what you think of it Julien Nioche http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12848110#action_12848110 ] Andrzej Bialecki commented on NUTCH-762: - bq. If we want to replace the old generator altogether - which I think would be a good option I think this makes sense now, since the new Generator in your latest patch is a strict superset of the old one. bq. I don't have strong feelings on whether or not to modify the prefix in a minor release. I do :) , see also here: http://en.wikipedia.org/wiki/Principle_of_least_astonishment IMHO it's all about breaking or not breaking existing installs after a minor upgrade. I suspect most users won't be aware of a subtle change between generate. and generator., especially since the command-line of the new Generator is compatible with the old one. So they will try to use the new Generator while keeping their existing configs. Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-762-v2.patch, NUTCH-762-v3.patch When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB then update the Db only once on several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well as we need to read the whole crawlDB as many time as we generate a segment. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: * can filter the URLs by score * normalisation is optional * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can max the number of URLs per host or domain (but not by IP) * can choose to partition by host, domain or IP Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers. The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ... with the following options : MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator - apart from : -noNorm (explicit) -topN : max number of URLs per segment -maxNumSegments : the actual number of segments generated could be less than the max value select e.g. not enough URLs are available for fetching and fit in less segments Please give it a try and less me know what you think of it Julien Nioche http://www.digitalpebble.com -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
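One hedged way to honour the compatibility concern above is to consult both prefixes when reading the configuration, so an existing nutch-site.xml keeps working after a minor upgrade. The property names below are illustrative only, not the actual keys used by the patch.

```java
import org.apache.hadoop.conf.Configuration;

public class GeneratorBackCompat {
  public static long maxUrlsPerHost(Configuration conf) {
    long oldStyle = conf.getLong("generate.max.per.host", -1L); // pre-1.1 style key still honoured
    return conf.getLong("generator.max.count", oldStyle);       // new-style key wins if both are set
  }
}
```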
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848173#action_12848173 ] Andrzej Bialecki commented on NUTCH-762: - bq. The change of prefix also reflected that we now use 2 different parameters to specify how to count the URLs (host or domain) and the max number of URLs. We can of course maintain the old parameters as well for the sake of compatibility, except that generate.max.per.host.by.ip won't be of much use anymore as we don't count per IP. Ok. bq. Have just noticed that 'crawl.gen.delay' is not documented in nutch-default.xml, and does not seem to be used outside the Generator. What is it supposed to be used for? Ah, a bit of ancient magic .. ;) This value, expressed in days, defines how long we should keep the lock on records in CrawlDb that were just selected for fetching. If these records are not updated in the meantime, the lock is canceled, i.e. they become eligible for selection again. The default value is 7 days. Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-762-v2.patch, NUTCH-762-v3.patch When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB then update the Db only once on several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well as we need to read the whole crawlDB as many times as we generate a segment. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: * can filter the URLs by score * normalisation is optional * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can max the number of URLs per host or domain (but not by IP) * can choose to partition by host, domain or IP Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers. The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ... with the following options : MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator - apart from : -noNorm (explicit) -topN : max number of URLs per segment -maxNumSegments : the actual number of segments generated could be less than the max value specified, e.g. if not enough URLs are available for fetching and they fit in fewer segments Please give it a try and let me know what you think of it Julien Nioche http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
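A hedged sketch of the day-based lock arithmetic described in the comment above (the surrounding Generator code is omitted; the property name and the 7-day default come from the comment, everything else is assumed):

```java
import org.apache.hadoop.conf.Configuration;

public class GenerateLock {
  // A record stamped with a generate timestamp stays locked until crawl.gen.delay
  // (expressed in days) has elapsed; after that it becomes eligible for selection again.
  public static boolean stillLocked(Configuration conf, long generateTimeMs, long nowMs) {
    long delayDays = conf.getLong("crawl.gen.delay", 7L); // default: 7 days
    long delayMs = delayDays * 24L * 60L * 60L * 1000L;
    return (nowMs - generateTimeMs) < delayMs;
  }
}
```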
[jira] Commented: (NUTCH-693) Add configurable option for treating nofollow behaviour.
[ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847291#action_12847291 ] Andrzej Bialecki commented on NUTCH-693: - Thanks for the pointer to the article. Indeed, the issue is muddy at best. So far Nutch adhered to a strict interpretation, where the links with this attribute are deleted from page outlinks immediately (so they are not only not followed but also don't affect out-degree metrics). If there is a general agreement in Nutch community towards relaxing this behavior we can further develop this patch - at the moment I don't see such support. Consequently, I propose to discuss it and in the meantime to move this issue to a later release. Add configurable option for treating nofollow behaviour. Key: NUTCH-693 URL: https://issues.apache.org/jira/browse/NUTCH-693 Project: Nutch Issue Type: New Feature Reporter: Andrew McCall Assignee: Otis Gospodnetic Priority: Minor Attachments: nutch.nofollow.patch For my purposes I'd like to follow links even if they're marked nofollow- Ideally I'd like to follow them, but not pass the link juice between them. I've attached a patch that adds a configuration element parser.html.outlinks.ignore_nofollow which allows the parser to ignore the nofollow elements on a page. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-693) Add configurable option for treating nofollow behaviour.
[ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-693: Assignee: (was: Otis Gospodnetic) Add configurable option for treating nofollow behaviour. Key: NUTCH-693 URL: https://issues.apache.org/jira/browse/NUTCH-693 Project: Nutch Issue Type: New Feature Reporter: Andrew McCall Priority: Minor Attachments: nutch.nofollow.patch For my purposes I'd like to follow links even if they're marked nofollow- Ideally I'd like to follow them, but not pass the link juice between them. I've attached a patch that adds a configuration element parser.html.outlinks.ignore_nofollow which allows the parser to ignore the nofollow elements on a page. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned NUTCH-797: --- Assignee: Andrzej Bialecki parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Assignee: Andrzej Bialecki Priority: Minor Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are links in the page that look like this: <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a> in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constructs a new url with a base URL class built from "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a target of ?co=0&sk=0&p=2&pi=1 The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1 because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining. While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0&sk=0&p=2&pi=1 The URL class then properly constructs the new url as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1 If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. Much thanks Here is the patch info: Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java === --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362) +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy) @@ -299,6 +299,50 @@ return false; } + private URL fixURL(URL base, String target) throws MalformedURLException + { + // handle params that are embedded into the base url - move them to target + // so URL class constructs the new url class properly + if (base.toString().indexOf(';') > 0) + return fixEmbeddedParams(base, target); + + // handle the case that there is a target that is a pure query.
+ // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble + // URLs but I've seen this in numerous places, for example at + // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 + // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default + // URL constructs the base+target combo as + // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly + // dropping the Search.aspx target + // + // Browsers handle these just fine, they must have an exception similar to this + if (target.startsWith("?")) + { + return fixPureQueryTargets(base, target); + } + + return new URL(base, target); + } + + private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException + { + if (!target.startsWith("?")) + return new URL(base, target); + + String basePath = base.getPath(); + String baseRightMost = ""; + int baseRightMostIdx = basePath.lastIndexOf("/"); + if (baseRightMostIdx != -1) + { + baseRightMost = basePath.substring(baseRightMostIdx+1); + } + + if (target.startsWith("?")) + target
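The patch text above is truncated by the mail archive. Based only on the prose description (pull the right-most path component out of the base URL and prepend it to a pure-query target), a hedged reconstruction of the remaining logic might look like the following; this is not the attached pureQueryUrl patch:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class PureQueryTargetFixer {
  static URL fixPureQueryTarget(URL base, String target) throws MalformedURLException {
    if (!target.startsWith("?")) {
      return new URL(base, target); // nothing special to do
    }
    String basePath = base.getPath();
    String baseRightMost = "";
    int idx = basePath.lastIndexOf('/');
    if (idx != -1) {
      baseRightMost = basePath.substring(idx + 1); // e.g. "Search.aspx"
    }
    // "?co=0&sk=0&p=2&pi=1" becomes "Search.aspx?co=0&sk=0&p=2&pi=1" before merging
    return new URL(base, baseRightMost + target);
  }
}
```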
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847300#action_12847300 ] Andrzej Bialecki commented on NUTCH-797: - If there are no further comments I'm going to commit the current patch with a TODO to revisit this code if/when it's refactored to an external dependency. parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Priority: Minor Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are links in the page that look like this: <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a> in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constructs a new url with a base URL class built from "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a target of ?co=0&sk=0&p=2&pi=1 The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1 because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining. While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0&sk=0&p=2&pi=1 The URL class then properly constructs the new url as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1 If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. Much thanks Here is the patch info: Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java === --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362) +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy) @@ -299,6 +299,50 @@ return false; } + private URL fixURL(URL base, String target) throws MalformedURLException + { + // handle params that are embedded into the base url - move them to target + // so URL class constructs the new url class properly + if (base.toString().indexOf(';') > 0) + return fixEmbeddedParams(base, target); + + // handle the case that there is a target that is a pure query.
+ // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble + // URLs but I've seen this in numerous places, for example at + // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 + // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default + // URL constructs the base+target combo as + // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly + // dropping the Search.aspx target + // + // Browsers handle these just fine, they must have an exception similar to this + if (target.startsWith("?")) + { + return fixPureQueryTargets(base, target); + } + + return new URL(base, target); + } + + private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException + { + if (!target.startsWith("?")) + return new URL(base, target); + + String basePath = base.getPath(); + String baseRightMost = ""; + int baseRightMostIdx = basePath.lastIndexOf("/"); + if (baseRightMostIdx != -1
[jira] Updated: (NUTCH-787) Upgrade Lucene to 3.0.1.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-787: Assignee: Andrzej Bialecki Summary: Upgrade Lucene to 3.0.1. (was: Upgrade Lucene to 3.0.0.) We're shooting at 3.0.1 now. Upgrade Lucene to 3.0.1. Key: NUTCH-787 URL: https://issues.apache.org/jira/browse/NUTCH-787 Project: Nutch Issue Type: Task Components: build Reporter: Dawid Weiss Assignee: Andrzej Bialecki Priority: Trivial Fix For: 1.1 Attachments: NUTCH-787.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847315#action_12847315 ] Andrzej Bialecki commented on NUTCH-787: - Using Lucene 3.0.1 artifacts I verified that your patch passes all tests and produces correct searchable indexes. I'll commit this shortly. Upgrade Lucene to 3.0.0. Key: NUTCH-787 URL: https://issues.apache.org/jira/browse/NUTCH-787 Project: Nutch Issue Type: Task Components: build Reporter: Dawid Weiss Priority: Trivial Fix For: 1.1 Attachments: NUTCH-787.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-787) Upgrade Lucene to 3.0.1.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-787. --- Resolution: Fixed Committed. Thanks Dawid! Upgrade Lucene to 3.0.1. Key: NUTCH-787 URL: https://issues.apache.org/jira/browse/NUTCH-787 Project: Nutch Issue Type: Task Components: build Reporter: Dawid Weiss Assignee: Andrzej Bialecki Priority: Trivial Fix For: 1.1 Attachments: NUTCH-787.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-803) Upgrade Hadoop to 0.20.2
Upgrade Hadoop to 0.20.2 Key: NUTCH-803 URL: https://issues.apache.org/jira/browse/NUTCH-803 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.1 Per subject. We are currently using 0.20.1, so there are no API changes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-803) Upgrade Hadoop to 0.20.2
[ https://issues.apache.org/jira/browse/NUTCH-803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-803. --- Resolution: Fixed All tests pass - committed. Upgrade Hadoop to 0.20.2 Key: NUTCH-803 URL: https://issues.apache.org/jira/browse/NUTCH-803 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.1 Per subject. We are currently using 0.20.1, so there are no API changes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[DISCUSS] Nutch as a top level project (TLP)?
Hi devs, The ASF Board indicated recently that so called umbrella projects, i.e. projects that host many significant sub-projects, should examine their structure towards simplification, such as merging or splitting out sub-projects. Lucene TLP is such a project. Recently the Lucene PMC accepted the merge of Solr and Lucene core projects. Mahout project will most likely split to its own TLP soon. Which leaves Nutch as a sort of odd duck ;) Moving Nutch to its own TLP has some advantages, mostly an easier decision process - voting on new committers and new releases involves then only those who participate directly in Nutch dev., i.e. the Nutch community. Also, from the coding point of view, Nutch is not intrinsically tied to the Lucene development as if both would require some careful coordination - we just use Lucene as one of many dependencies, and in fact we aim to cleanly separate Nutch search API from Lucene-based API. I can easily imagine Nutch dropping completely the low-level Lucene-based components and moving to a more general search fabric (e.g. SolrCloud). Being its own TLP could also give Nutch more exposure and help to crystallize our mission. There are some disadvantages to such a split, too: we would need to spend some more effort on various administrative tasks, and maintain a separate web site (under Apache, but not under Lucene), and probably some other tasks that I'm not yet aware of. This would also mean that Nutch would have to stand on its own merit, which considering the small number of active committers may be challenging. Let's discuss this, and after we collect some pros and cons I'm going to call for a vote. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846923#action_12846923 ] Andrzej Bialecki commented on NUTCH-797: - That's one option, at least until the crawler-commons produces any artifacts ... Eventually I think that this code and other related code (e.g. deciding which URL is canonical in the presence of redirects, URL normalization and filtering) should end up in the crawler-commons. parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Priority: Minor Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are links in the page that look like this: <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a> In org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link and constructs a new URL with a base URL built from "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0" and a target of "?co=0&sk=0&p=2&pi=1". The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1 because the RFC explicitly states that the rightmost URL segment (the Search.aspx in this case) should be ripped off before combining. While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed URL on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0&sk=0&p=2&pi=1 The URL class then properly constructs the new URL as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1 If it is agreed that this solution works, I believe the other HTML parsers in Nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. Much thanks Here is the patch info:
Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
===
--- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362)
+++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy)
@@ -299,6 +299,50 @@
     return false;
   }
+  private URL fixURL(URL base, String target) throws MalformedURLException
+  {
+    // handle params that are embedded into the base url - move them to target
+    // so URL class constructs the new url class properly
+    if (base.toString().indexOf(';') > 0)
+      return fixEmbeddedParams(base, target);
+
+    // handle the case that there is a target that is a pure query.
+    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
+    // URLs but I've seen this in numerous places, for example at
+    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
+    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
+    // URL constructs the base+target combo as
+    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
+    // dropping the Search.aspx target
+    //
+    // Browsers handle these just fine, they must have an exception similar to this
+    if (target.startsWith("?"))
+    {
+      return fixPureQueryTargets(base, target);
+    }
+
+    return new URL(base, target);
+  }
+
+  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
+  {
+    if (!target.startsWith("?"))
+      return new URL(base, target);
+
+    String basePath = base.getPath();
+    String baseRightMost
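To see the resolution behaviour the report describes, here is a minimal sketch; the accenture URLs are taken from the report, and the printed results assume the JDKs cited there (newer JDKs may resolve query-only references per RFC 3986 and keep the Search.aspx segment):
{code:java}
import java.net.MalformedURLException;
import java.net.URL;

public class QueryOnlyResolutionDemo {
  public static void main(String[] args) throws MalformedURLException {
    // Base page and one of the query-only hrefs found in it, as in the report.
    URL base = new URL("http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0");
    String target = "?co=0&sk=0&p=2&pi=1";

    // On the JDKs cited in the report this drops the Search.aspx segment:
    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
    System.out.println(new URL(base, target));

    // The proposed fix re-attaches the rightmost base segment before resolving:
    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
    System.out.println(new URL(base, "Search.aspx" + target));
  }
}
{code}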
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846927#action_12846927 ] Andrzej Bialecki commented on NUTCH-762: - In my experience the IP-based fetching was only (rarely) needed when there was a large number of urls from virtual hosts hosted at the same ISP. In other words, not a common case - others may have different experience depending on their typical crawl targets... IMHO I think we don't have to reimplement this. Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-762-v2.patch When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB then update the Db only once on several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well as we need to read the whole crawlDB as many time as we generate a segment. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: * can filter the URLs by score * normalisation is optional * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can max the number of URLs per host or domain (but not by IP) * can choose to partition by host, domain or IP Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers. The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ... with the following options : MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator - apart from : -noNorm (explicit) -topN : max number of URLs per segment -maxNumSegments : the actual number of segments generated could be less than the max value select e.g. not enough URLs are available for fetching and fit in less segments Please give it a try and less me know what you think of it Julien Nioche http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Reopened: (NUTCH-802) Problems managing outlinks with large url length
[ https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reopened NUTCH-802: - Assignee: Andrzej Bialecki Submitting a patch is not fixing it; the issue is fixed when the patch is accepted and applied. Problems managing outlinks with large url length Key: NUTCH-802 URL: https://issues.apache.org/jira/browse/NUTCH-802 Project: Nutch Issue Type: Bug Components: parser Reporter: Pablo Aragón Assignee: Andrzej Bialecki Attachments: ParseOutputFormat.patch Nutch can get idle during the collection of outlinks if the URL address of the outlink is too long. The maximum URL sizes for the main web servers are: * Apache: 4,000 bytes * Microsoft Internet Information Server (IIS): 16,384 bytes * Perl HTTP::Daemon: 8,000 bytes URL address sizes bigger than 4,000 bytes are problematic, so the limit should be set in the nutch-default.xml configuration file. I attached a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-802) Problems managing outlinks with large url length
[ https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846932#action_12846932 ] Andrzej Bialecki commented on NUTCH-802: - We already have a general way to control this and other aspects of URL-s as such, namely with URLFilters. I agree that this functionality could be useful, but in the form of a URLFilter (or by adding this control to e.g. urlfilter-basic or urlfilter-validator). Problems managing outlinks with large url length Key: NUTCH-802 URL: https://issues.apache.org/jira/browse/NUTCH-802 Project: Nutch Issue Type: Bug Components: parser Reporter: Pablo Aragón Assignee: Andrzej Bialecki Attachments: ParseOutputFormat.patch Nutch can get idle during the collection of outlinks if the URL address of the outlink is too long. The maximum URL sizes for the main web servers are: * Apache: 4,000 bytes * Microsoft Internet Information Server (IIS): 16,384 bytes * Perl HTTP::Daemon: 8,000 bytes URL address sizes bigger than 4,000 bytes are problematic, so the limit should be set in the nutch-default.xml configuration file. I attached a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
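As a rough illustration of the URLFilter route suggested above, a sketch of a length-limiting filter; the class name, property name and 4096-byte default are made up for the example, and the plugin wiring (plugin.xml, build files) is omitted:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class MaxLengthURLFilter implements URLFilter {

  private Configuration conf;
  private int maxLength = 4096; // illustrative default, not an official Nutch property

  public String filter(String urlString) {
    // Returning null tells the filter chain to drop the URL.
    return (urlString == null || urlString.length() > maxLength) ? null : urlString;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // "urlfilter.maxlength.limit" is a made-up property name for this sketch.
    this.maxLength = conf.getInt("urlfilter.maxlength.limit", 4096);
  }

  public Configuration getConf() {
    return conf;
  }
}
{code}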
[jira] Closed: (NUTCH-796) Zero results problems difficult to troubleshoot due to lack of logging
[ https://issues.apache.org/jira/browse/NUTCH-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-796. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki Patch applied in rev. 924945. Thanks for reporting it. Zero results problems difficult to troubleshoot due to lack of logging -- Key: NUTCH-796 URL: https://issues.apache.org/jira/browse/NUTCH-796 Project: Nutch Issue Type: Improvement Components: searcher, web gui Affects Versions: 1.0.0, 1.1 Environment: Linux, x86, nutch searcher and nutch webapps. v1.0, v1.1 Reporter: Jesse Hires Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: logging.patch There are a few places where search can fail in a distributed environment, but when the configuration is not quite right, there are no indications of errors and no logging. Increased logging of failures would help troubleshoot such problems, as well as reduce the "I get 0 results, why?" questions that come across the mailing lists. Areas where logging would be helpful: * search app cannot locate search-servers.txt * search app cannot find a searcher node listed in search-servers.txt * search app cannot connect to the port on a searcher specified in search-servers.txt * searcher (bin/nutch server ...) cannot find the index * searcher cannot find segments * access denied in any of the above scenarios There are probably more that would be helpful, but I am not yet familiar enough to know all the points of possible failure between the webpage and a search node. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-800) Generator builds a URL list that is not encoded
[ https://issues.apache.org/jira/browse/NUTCH-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847071#action_12847071 ] Andrzej Bialecki commented on NUTCH-800: - I'm puzzled by your problem description. Is Nutch affected by potentially malicious URL data? URL form encoding is just a transport encoding; it doesn't make a URL inherently safe (or unsafe). Generator builds a URL list that is not encoded --- Key: NUTCH-800 URL: https://issues.apache.org/jira/browse/NUTCH-800 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 0.6, 0.7, 0.7.1, 0.7.2, 0.8, 0.8.1, 0.8.2, 0.7.3, 0.9.0, 1.0.0, 1.1 Reporter: Jesse Campbell The URL string that is grabbed by the generator when creating the fetch list does not get encoded, could potentially allow unsafe execution, and breaks reading improperly encoded URLs from the scraped pages. Since we a) cannot guarantee that any site we scrape is not malicious, and b) likely do not have control over all content providers, we are currently forced to use a regex normalizer to perform the same function as a built-in java class (it would be unsafe to leave alone). A quick solution would be to update Generator.java to use the java.net.URLEncoder class: line 187: old: String urlString = url.toString(); new: String urlString = URLEncoder.encode(url.toString(), "UTF-8"); line 192: old: u = new URL(url.toString()); new: u = new URL(urlString); The use of URLEncoder.encode could also happen at the updatedb stage. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
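A small demo of the point about transport encoding: URLEncoder implements application/x-www-form-urlencoded, so applying it to a whole URL string escapes the scheme and path separators rather than producing a safer equivalent URL (the example.com URL is illustrative):
{code:java}
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormEncodingDemo {
  public static void main(String[] args) throws UnsupportedEncodingException {
    String url = "http://example.com/search?q=a b";
    // Prints http%3A%2F%2Fexample.com%2Fsearch%3Fq%3Da+b - the result is no longer
    // a usable URL, which is why blanket encoding in the Generator is questionable.
    System.out.println(URLEncoder.encode(url, "UTF-8"));
  }
}
{code}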
[jira] Commented: (NUTCH-693) Add configurable option for treating nofollow behaviour.
[ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847074#action_12847074 ] Andrzej Bialecki commented on NUTCH-693: - This patch is controversial in the sense that a) Nutch strives to adhere to Internet standards and netiquette, which says that robots should obey nofollow, and b) most Nutch users want a well-behaved robot. You are free of course to modify the source as you did. Therefore I think that this functionality is not applicable to majority of Nutch users, and I vote -1 on including it in Nutch. Add configurable option for treating nofollow behaviour. Key: NUTCH-693 URL: https://issues.apache.org/jira/browse/NUTCH-693 Project: Nutch Issue Type: New Feature Reporter: Andrew McCall Assignee: Otis Gospodnetic Priority: Minor Attachments: nutch.nofollow.patch For my purposes I'd like to follow links even if they're marked nofollow- Ideally I'd like to follow them, but not pass the link juice between them. I've attached a patch that adds a configuration element parser.html.outlinks.ignore_nofollow which allows the parser to ignore the nofollow elements on a page. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-795) Add ability to maintain nofollow attribute in linkdb
[ https://issues.apache.org/jira/browse/NUTCH-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847075#action_12847075 ] Andrzej Bialecki commented on NUTCH-795: - Please see my comment to that issue. Or is there some other use case that you have in mind? Add ability to maintain nofollow attribute in linkdb Key: NUTCH-795 URL: https://issues.apache.org/jira/browse/NUTCH-795 Project: Nutch Issue Type: New Feature Components: linkdb Affects Versions: 1.1 Reporter: Sammy Yu Attachments: 0001-Updated-with-nofollow-support-for-Outlinks.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-780) Nutch crawler did not read configuration files
[ https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847094#action_12847094 ] Andrzej Bialecki commented on NUTCH-780: - Is the purpose of this issue to make Crawl.java usable via a strongly-typed API instead of the generic main, e.g. something like this: {code} public class Crawl extends Configured { public int crawl(Path output, Path seedDir, int threads, int numCycles, int topN, ...) { ... } } {code} Nutch crawler did not read configuration files -- Key: NUTCH-780 URL: https://issues.apache.org/jira/browse/NUTCH-780 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Vu Hoang Attachments: NUTCH-780.patch Nutch searcher can read properties at the constructor ... {code:java|title=NutchSearcher.java|borderStyle=solid} NutchBean bean = new NutchBean(getFilesystem().getConf(), fs); ... // put search engine code here {code} ... but the Nutch crawler does not; it only reads data from command-line arguments. {code:java|title=NutchCrawler.java|borderStyle=solid} StringBuilder builder = new StringBuilder(); builder.append(domainlist + SPACE); builder.append(ARGUMENT_CRAWL_DIR); builder.append(domainlist + SUBFIX_CRAWLED + SPACE); builder.append(ARGUMENT_CRAWL_THREADS); builder.append(threads + SPACE); builder.append(ARGUMENT_CRAWL_DEPTH); builder.append(depth + SPACE); builder.append(ARGUMENT_CRAWL_TOPN); builder.append(topN + SPACE); Crawl.main(builder.toString().split(SPACE)); {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
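A sketch of how a caller might use such a strongly-typed entry point, assuming the crawl(...) method from the comment above existed (it does not today; paths and parameter values are placeholders):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlCaller {
  public static void main(String[] args) throws Exception {
    // Reads nutch-default.xml / nutch-site.xml instead of relying on CLI arguments.
    Configuration conf = NutchConfiguration.create();

    Crawl crawl = new Crawl();          // hypothetical Configured subclass from the comment above
    crawl.setConf(conf);
    int rc = crawl.crawl(new Path("crawl"), new Path("urls"),
        10 /* threads */, 3 /* cycles */, 1000 /* topN */);
    System.exit(rc);
  }
}
{code}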
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846402#action_12846402 ] Andrzej Bialecki commented on NUTCH-797: - Thanks for reporting this, and providing a patch. An updated revision of the standard, RFC3986 section 5.4.1 example 7 follows the same reasoning. I'll fix this shortly. parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Priority: Minor Attachments: pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 there are links in the page that look like this: a href=?co=0sk=0p=2pi=12/a/tdtda href=?co=0sk=0p=3pi=13/a in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constucts a new url with a base URL class built from http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0;, and a target of ?co=0sk=0p=2pi=1 The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0sk=0p=2pi=1 because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining. While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0sk=0p=2pi=1 The URL class then properly constructs the new url as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0p=2pi=1 If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. Much thanks Here is the patch info: Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java === --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362) +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy) @@ -299,6 +299,50 @@ return false; } + private URL fixURL(URL base, String target) throws MalformedURLException + { + // handle params that are embedded into the base url - move them to target + // so URL class constructs the new url class properly + if (base.toString().indexOf(';') 0) + return fixEmbeddedParams(base, target); + + // handle the case that there is a target that is a pure query. 
+ // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble + // URLs but I've seen this in numerous places, for example at + // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 + // It has urls in the page of the form href=?co=0sk=0pg=1, and by default + // URL constructs the base+target combo as + // http://careers3.accenture.com/Careers/ASPX/?co=0sk=0pg=1, incorrectly + // dropping the Search.aspx target + // + // Browsers handle these just fine, they must have an exception similar to this + if (target.startsWith(?)) + { + return fixPureQueryTargets(base, target); + } + + return new URL(base, target); + } + + private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException + { + if (!target.startsWith(?)) + return new URL(base, target); + + String basePath = base.getPath(); + String baseRightMost=; + int baseRightMostIdx = basePath.lastIndexOf(/); + if (baseRightMostIdx != -1
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846418#action_12846418 ] Andrzej Bialecki commented on NUTCH-797: - Hm, actually the picture is more complicated than I thought - if we apply both methods (fixEmbeddedParams and fixPureQueryTargets) then some of the test cases from RFC fail. However, all tests succeed if we only apply the fixPureQueryTargets ! Looking at the origin of the fixEmbeddedParams method (NUTCH-436) something must been fixed in java.net.URL, because the test case mentioned in that issue now passes if we apply only fixPureQueryTargets. The same case with test cases in a near-duplicate issue NUTCH-566. Consequently I'm going to remove fixEmbeddedParams. I added all tests from RFC3986 section 5.4.1, and they all pass now. I'll attach an updated patch shortly. parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Priority: Minor Attachments: pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 there are links in the page that look like this: a href=?co=0sk=0p=2pi=12/a/tdtda href=?co=0sk=0p=3pi=13/a in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constucts a new url with a base URL class built from http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0;, and a target of ?co=0sk=0p=2pi=1 The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0sk=0p=2pi=1 because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining. While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0sk=0p=2pi=1 The URL class then properly constructs the new url as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0p=2pi=1 If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. 
Much thanks Here is the patch info: Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java === --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362) +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy) @@ -299,6 +299,50 @@ return false; } + private URL fixURL(URL base, String target) throws MalformedURLException + { + // handle params that are embedded into the base url - move them to target + // so URL class constructs the new url class properly + if (base.toString().indexOf(';') 0) + return fixEmbeddedParams(base, target); + + // handle the case that there is a target that is a pure query. + // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble + // URLs but I've seen this in numerous places, for example at + // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 + // It has urls in the page of the form href=?co=0sk=0pg=1, and by default + // URL constructs the base+target combo as + // http://careers3.accenture.com/Careers/ASPX/?co=0sk=0pg=1, incorrectly + // dropping the Search.aspx target + // + // Browsers handle these just fine, they must have an exception similar to this + if (target.startsWith
[jira] Updated: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-797: Attachment: pureQueryUrl-2.patch Updated patch with some refactoring and unit tests. If no objections I'll commit this shortly. parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Priority: Minor Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 there are links in the page that look like this: a href=?co=0sk=0p=2pi=12/a/tdtda href=?co=0sk=0p=3pi=13/a in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constucts a new url with a base URL class built from http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0;, and a target of ?co=0sk=0p=2pi=1 The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0sk=0p=2pi=1 because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining. While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0sk=0p=2pi=1 The URL class then properly constructs the new url as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0p=2pi=1 If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. Much thanks Here is the patch info: Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java === --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362) +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy) @@ -299,6 +299,50 @@ return false; } + private URL fixURL(URL base, String target) throws MalformedURLException + { + // handle params that are embedded into the base url - move them to target + // so URL class constructs the new url class properly + if (base.toString().indexOf(';') 0) + return fixEmbeddedParams(base, target); + + // handle the case that there is a target that is a pure query. 
+ // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble + // URLs but I've seen this in numerous places, for example at + // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 + // It has urls in the page of the form href=?co=0sk=0pg=1, and by default + // URL constructs the base+target combo as + // http://careers3.accenture.com/Careers/ASPX/?co=0sk=0pg=1, incorrectly + // dropping the Search.aspx target + // + // Browsers handle these just fine, they must have an exception similar to this + if (target.startsWith(?)) + { + return fixPureQueryTargets(base, target); + } + + return new URL(base, target); + } + + private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException + { + if (!target.startsWith(?)) + return new URL(base, target); + + String basePath = base.getPath(); + String baseRightMost=; + int baseRightMostIdx = basePath.lastIndexOf(/); + if (baseRightMostIdx != -1) + { + baseRightMost = basePath.substring(baseRightMostIdx+1
[jira] Updated: (NUTCH-796) Zero results problems difficult to troubleshoot due to lack of logging
[ https://issues.apache.org/jira/browse/NUTCH-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-796: Attachment: logging.patch I propose this patch. If there are no objections I'll commit it shortly. Zero results problems difficult to troubleshoot due to lack of logging -- Key: NUTCH-796 URL: https://issues.apache.org/jira/browse/NUTCH-796 Project: Nutch Issue Type: Improvement Components: searcher, web gui Affects Versions: 1.0.0, 1.1 Environment: Linux, x86, nutch searcher and nutch webapps. v1.0, v1.1 Reporter: Jesse Hires Attachments: logging.patch There are a few places where search can fail in a distributed environment, but when the configuration is not quite right, there are no indications of errors and no logging. Increased logging of failures would help troubleshoot such problems, as well as reduce the "I get 0 results, why?" questions that come across the mailing lists. Areas where logging would be helpful: * search app cannot locate search-servers.txt * search app cannot find a searcher node listed in search-servers.txt * search app cannot connect to the port on a searcher specified in search-servers.txt * searcher (bin/nutch server ...) cannot find the index * searcher cannot find segments * access denied in any of the above scenarios There are probably more that would be helpful, but I am not yet familiar enough to know all the points of possible failure between the webpage and a search node. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
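Not the attached patch, but a sketch of the kind of check-and-log it adds, using the commons-logging setup Nutch already relies on (the class, method and message are illustrative):
{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SearchServerListCheck {
  private static final Log LOG = LogFactory.getLog(SearchServerListCheck.class);

  /** Logs a clear warning instead of failing silently when the server list is missing. */
  public static boolean serverListExists(Configuration conf, Path searchServersFile)
      throws Exception {
    FileSystem fs = FileSystem.get(conf);
    if (!fs.exists(searchServersFile)) {
      LOG.warn("Distributed search is configured but " + searchServersFile
          + " was not found; all queries will return zero results.");
      return false;
    }
    return true;
  }
}
{code}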
[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846428#action_12846428 ] Andrzej Bialecki commented on NUTCH-787: - Lucene 3.0.1 is out now .. I'll test this patch with 3.0.1 artifacts and will report. Upgrade Lucene to 3.0.0. Key: NUTCH-787 URL: https://issues.apache.org/jira/browse/NUTCH-787 Project: Nutch Issue Type: Task Components: build Reporter: Dawid Weiss Priority: Trivial Fix For: 1.1 Attachments: NUTCH-787.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846437#action_12846437 ] Andrzej Bialecki commented on NUTCH-797: - Unfortunately the way your fix was applied there is not reusable (private method in HtmlParser... ugh :( ). So for the time being I think we'll go with our utility class ... which we should really move to the crawler-commons anyway! parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Priority: Minor Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 there are links in the page that look like this: a href=?co=0sk=0p=2pi=12/a/tdtda href=?co=0sk=0p=3pi=13/a in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constucts a new url with a base URL class built from http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0;, and a target of ?co=0sk=0p=2pi=1 The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0sk=0p=2pi=1 because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining. While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0sk=0p=2pi=1 The URL class then properly constructs the new url as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0p=2pi=1 If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. Much thanks Here is the patch info: Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java === --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362) +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy) @@ -299,6 +299,50 @@ return false; } + private URL fixURL(URL base, String target) throws MalformedURLException + { + // handle params that are embedded into the base url - move them to target + // so URL class constructs the new url class properly + if (base.toString().indexOf(';') 0) + return fixEmbeddedParams(base, target); + + // handle the case that there is a target that is a pure query. 
+ // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble + // URLs but I've seen this in numerous places, for example at + // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 + // It has urls in the page of the form href=?co=0sk=0pg=1, and by default + // URL constructs the base+target combo as + // http://careers3.accenture.com/Careers/ASPX/?co=0sk=0pg=1, incorrectly + // dropping the Search.aspx target + // + // Browsers handle these just fine, they must have an exception similar to this + if (target.startsWith(?)) + { + return fixPureQueryTargets(base, target); + } + + return new URL(base, target); + } + + private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException + { + if (!target.startsWith(?)) + return new URL(base, target); + + String basePath = base.getPath(); + String baseRightMost=; + int baseRightMostIdx
[jira] Assigned: (NUTCH-774) Retry interval in crawl date is set to 0
[ https://issues.apache.org/jira/browse/NUTCH-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned NUTCH-774: --- Assignee: Andrzej Bialecki Retry interval in crawl date is set to 0 Key: NUTCH-774 URL: https://issues.apache.org/jira/browse/NUTCH-774 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Reinhard Schwab Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-774.patch, NUTCH-774_2.patch When i fetch and parse a feed with the feed plugin, http://www.wachauclimbing.net/home/impressum-disclaimer/feed/ another crawl date is generated http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/ after fetching a second round the dump in the crawl db still shows a retry interval with value 0. http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/ Version: 7 Status: 2 (db_fetched) Fetch time: Wed Dec 02 12:48:22 CET 2009 Modified time: Thu Jan 01 01:00:00 CET 1970 Retries since fetch: 0 Retry interval: 0 seconds (0 days) Score: 1.084 Signature: db9ab2193924cd2d0b53113a500ca604 Metadata: _pst_: success(1), lastModified=0 a check should be done in DefaultFetchSchedule (or AbstractFetchSchedule) in the method setFetchSchedule -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
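A sketch of the kind of guard the reporter suggests for setFetchSchedule; the method signature is recalled from the 1.x FetchSchedule API and the defaultInterval field (backed by db.fetch.interval.default) is assumed to be visible to subclasses, so treat this as an approximation rather than the committed fix:
{code:java}
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.DefaultFetchSchedule;

public class GuardedFetchSchedule extends DefaultFetchSchedule {

  @Override
  public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime,
      long fetchTime, long modifiedTime, int state) {
    // Entries injected with a zero interval (e.g. via the feed plugin) would otherwise
    // keep "Retry interval: 0 seconds" forever; fall back to the configured default.
    if (datum.getFetchInterval() <= 0) {
      datum.setFetchInterval(defaultInterval);
    }
    return super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
        fetchTime, modifiedTime, state);
  }
}
{code}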
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846527#action_12846527 ] Andrzej Bialecki commented on NUTCH-797: - A few issues with this: * does this mean that the fixes would be applied to links found in other content types as well, not just html (the fixup code in TIKA-287 is located in HtmlParser)? * we need this also in other places, e.g. in the redirection handling code (both meta-refresh, javascript location.href and protocol-level redirect) * for a while we still need this in the parse-html plugin that does not use Tika. parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Priority: Minor Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 there are links in the page that look like this: a href=?co=0sk=0p=2pi=12/a/tdtda href=?co=0sk=0p=3pi=13/a in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constucts a new url with a base URL class built from http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0;, and a target of ?co=0sk=0p=2pi=1 The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0sk=0p=2pi=1 because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining. While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0sk=0p=2pi=1 The URL class then properly constructs the new url as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0p=2pi=1 If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. 
Much thanks Here is the patch info: Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java === --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362) +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy) @@ -299,6 +299,50 @@ return false; } + private URL fixURL(URL base, String target) throws MalformedURLException + { + // handle params that are embedded into the base url - move them to target + // so URL class constructs the new url class properly + if (base.toString().indexOf(';') 0) + return fixEmbeddedParams(base, target); + + // handle the case that there is a target that is a pure query. + // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble + // URLs but I've seen this in numerous places, for example at + // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 + // It has urls in the page of the form href=?co=0sk=0pg=1, and by default + // URL constructs the base+target combo as + // http://careers3.accenture.com/Careers/ASPX/?co=0sk=0pg=1, incorrectly + // dropping the Search.aspx target + // + // Browsers handle these just fine, they must have an exception similar to this + if (target.startsWith(?)) + { + return fixPureQueryTargets(base, target); + } + + return new URL(base, target); + } + + private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846133#action_12846133 ] Andrzej Bialecki commented on NUTCH-762: - It appears this class is not a strict superset - the generate.update.crawldb functionality is not there. This is a regression in a useful functionality, so I think it needs to be added back. Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-762-v2.patch When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB then update the Db only once on several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well as we need to read the whole crawlDB as many time as we generate a segment. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: * can filter the URLs by score * normalisation is optional * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can max the number of URLs per host or domain (but not by IP) * can choose to partition by host, domain or IP Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers. The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ... with the following options : MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator - apart from : -noNorm (explicit) -topN : max number of URLs per segment -maxNumSegments : the actual number of segments generated could be less than the max value select e.g. not enough URLs are available for fetching and fit in less segments Please give it a try and less me know what you think of it Julien Nioche http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846174#action_12846174 ] Andrzej Bialecki commented on NUTCH-762: - In case of users generating just 1 segment at a time it's an unexpected loss of flexibility. You can't run this version of Generator twice without first completing _both_ fetching updating of all segments from the previous run - because some of the same urls would be generated in the next round. The point of generate.update.crawldb is to be able to freely interleave generate/update steps. E.g. the following scenario breaks in a non-obvious way: * generate 10 segments * fetch update 8 of them * realize you need more rounds due to e.g. gone pages * generate additional 10 segments ..kaboom! now the new segments partially overlap with the unfetched 2 segments from the previous generation, and you are going to fetch some urls twice. Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-762-v2.patch When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB then update the Db only once on several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well as we need to read the whole crawlDB as many time as we generate a segment. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: * can filter the URLs by score * normalisation is optional * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can max the number of URLs per host or domain (but not by IP) * can choose to partition by host, domain or IP Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers. The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ... with the following options : MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator - apart from : -noNorm (explicit) -topN : max number of URLs per segment -maxNumSegments : the actual number of segments generated could be less than the max value select e.g. not enough URLs are available for fetching and fit in less segments Please give it a try and less me know what you think of it Julien Nioche http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
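For reference, the generate.update.crawldb behaviour under discussion works by writing a generate-time marker back into the crawlDb so that the next generate pass can skip entries already handed to a pending segment. A rough sketch of that check, with key and field names recalled from the 1.x Generator (treat them as approximate):
{code:java}
import org.apache.hadoop.io.LongWritable;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.metadata.Nutch;

public class GenerateMarkerCheck {
  /** True if the entry already belongs to a segment that has not been fetched/updated yet. */
  public static boolean alreadyGenerated(CrawlDatum datum, long curTime, long genDelay) {
    LongWritable oldGenTime =
        (LongWritable) datum.getMetaData().get(Nutch.WRITABLE_GENERATE_TIME_KEY);
    return oldGenTime != null && oldGenTime.get() + genDelay > curTime;
  }
}
{code}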
[jira] Commented: (NUTCH-798) Upgrade to SOLR1.4
[ https://issues.apache.org/jira/browse/NUTCH-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843555#action_12843555 ] Andrzej Bialecki commented on NUTCH-798: - +1, preferably before the 1.1 freeze so that we can test it. Upgrade to SOLR1.4 -- Key: NUTCH-798 URL: https://issues.apache.org/jira/browse/NUTCH-798 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Julien Nioche Fix For: 1.1 in particular SOLR1.4 has a StreamingUpdateSolrServer which would simplify the way we buffer the docs before sending them to the SOLR instance -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
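A sketch of the simplification mentioned above, assuming SolrJ 1.4 on the classpath; the URL, queue size and thread count are placeholder values:
{code:java}
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class StreamingIndexDemo {
  public static void main(String[] args) throws Exception {
    // Buffers up to 1000 docs and streams them on 2 background threads,
    // so the indexer no longer needs its own batching logic.
    SolrServer server = new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 2);

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "http://example.com/");
    doc.addField("title", "Example");

    server.add(doc);
    server.commit();
  }
}
{code}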
[jira] Commented: (NUTCH-801) Remove RTF and MP3 parse plugins
[ https://issues.apache.org/jira/browse/NUTCH-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843587#action_12843587 ] Andrzej Bialecki commented on NUTCH-801: - Definitely +1, the only reason they lingered so long was the lack of a suitable replacement. Remove RTF and MP3 parse plugins Key: NUTCH-801 URL: https://issues.apache.org/jira/browse/NUTCH-801 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.0.0 Reporter: Julien Nioche Fix For: 1.1 *Parse-rtf* and *parse-mp3* are not built by default due to licensing issues. Since we now have *parse-tika* to handle these formats I would be in favour of removing these 2 plugins altogether to keep things nice and simple. The other plugins will probably be phased out only after the release of 1.1 when parse-tika will have been tested a lot more. Any reasons not to? Julien -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: 1.1 release?
On 2010-03-09 18:17, Julien Nioche wrote: Hi Chris, Excellent idea! There have been quite a few changes since 1.0 and it's probably the right time to have a new release. +1. Let's just check JIRA and make sure we didn't forget anything important ... Not really a blocker but https://issues.apache.org/jira/browse/NUTCH-762 would be nice to have in 1.1, just needs a bit of reviewing / testing I suppose. Otherwise this can wait until after 1.1 I'll try to test it before the weekend. -- Best regards, Andrzej Bialecki (Information Retrieval, Semantic Web; Embedded Unix, System Integration) http://www.sigram.com Contact: info at sigram dot com
[jira] Commented: (NUTCH-799) SOLRIndexer to commit once all reducers have finished
[ https://issues.apache.org/jira/browse/NUTCH-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841790#action_12841790 ] Andrzej Bialecki commented on NUTCH-799: - I think it's ok to do it this way - the commit per reducer may be actually harmful if commit succeeds but the task is killed for any reason and re-ran. Note: the patch has some formatting errors. SOLRIndexer to commit once all reducers have finished - Key: NUTCH-799 URL: https://issues.apache.org/jira/browse/NUTCH-799 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Julien Nioche Fix For: 1.1 Attachments: NUTCH-799.patch What about doing only one SOLR commit after the MR job has finished in SOLRIndexer instead of doing that at the end of every Reducer? I ran into timeout exceptions in some of my reducers and I suspect that this was due to the fact that other reducers had already finished and called commit. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
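A sketch of the commit-once flow under discussion, assuming the old mapred API that Nutch 1.x uses and SolrJ's CommonsHttpSolrServer; the helper method is illustrative, not the attached patch:
{code:java}
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class CommitAfterJob {
  /** Run the indexing job, then issue exactly one commit from the submitting client. */
  public static void indexAndCommit(JobConf job, String solrUrl) throws Exception {
    JobClient.runJob(job);                        // reducers only add documents, never commit
    new CommonsHttpSolrServer(solrUrl).commit();  // single commit once all reducers are done
  }
}
{code}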
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832250#action_12832250 ] Andrzej Bialecki commented on NUTCH-766: - +1 to commit this - please remember to update nutch-default.xml to switch to the tika plugin, perhaps add a comment about the deprecated parse-* plugins - most people look here and not in the parse-plugins, where this change is documented... Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the pasring mechanism of Tika but can still coexist with the existing parsing plugins which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress, your feedback is welcome. Tika is already used by Nutch for its MimeType implementations. Tika comes as different jar files (core and parsers), in the work described here we decided to put the libs in 2 different places NUTCH_HOME/lib : tika-core.jar NUTCH_HOME/tika-plugin/lib : tika-parsers.jar Tika being used by the core only for its Mimetype functionalities we only need to put tika-core at the main lib level whereas the tika plugin obviously needs the tika-parsers.jar + all the jars used internally by Tika Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points. Unlike most other parsers, Tika handles more than one Mime-type which is why we are using * as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser i.e. link detection, metatag handling but also means that we can use the HTMLParseFilters in exactly the same way. The main difference though is that HTMLParseFilters are not limited to HTML documents anymore as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and make the tika parser depend on it. 
The following libraries are required in the lib/ directory of the tika-parser:
<library name="asm-3.1.jar"/>
<library name="bcmail-jdk15-144.jar"/>
<library name="commons-compress-1.0.jar"/>
<library name="commons-logging-1.1.1.jar"/>
<library name="dom4j-1.6.1.jar"/>
<library name="fontbox-0.8.0-incubator.jar"/>
<library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
<library name="hamcrest-core-1.1.jar"/>
<library name="jce-jdk13-144.jar"/>
<library name="jempbox-0.8.0-incubator.jar"/>
<library name="metadata-extractor-2.4.0-beta-1.jar"/>
<library name="mockito-core-1.7.jar"/>
<library name="objenesis-1.0.jar"/>
<library name="ooxml-schemas-1.0.jar"/>
<library name="pdfbox-0.8.0-incubating.jar"/>
<library name="poi-3.5-FINAL.jar"/>
<library name="poi-ooxml-3.5-FINAL.jar"/>
<library name="poi-scratchpad-3.5-FINAL.jar"/>
<library name="tagsoup-1.2.jar"/>
<library name="tika-parsers-0.5-SNAPSHOT.jar"/>
<library name="xml-apis-1.0.b2.jar"/>
<library name="xmlbeans-2.3.0.jar"/>
There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika and if so to the same extent; the Wiki is probably the right place for this. The language identifier (which is a HTMLParseFilter) seemed to work fine. Again, your comments are welcome. Please bear in mind that this is just a first step. Julien http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0
[ https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12830065#action_12830065 ] Andrzej Bialecki commented on NUTCH-673: - +1 on both counts. Upgrade to Lucene 3.0 may involve more work than expected because of deprecated 2.x APIs that are no longer available in 3.0. Upgrade the Carrot2 plug-in to release 3.0 -- Key: NUTCH-673 URL: https://issues.apache.org/jira/browse/NUTCH-673 Project: Nutch Issue Type: Improvement Components: web gui Affects Versions: 0.9.0 Environment: All Nutch deployments. Reporter: Sean Dean Priority: Minor Fix For: 1.1 Version 3.0 of the Carrot2 plug-in was released recently. We currently have version 2.1 in the source tree and upgrading it to the latest version before the 1.0 release might make sense. Details on the release can be found here: http://project.carrot2.org/release-3.0-notes.html One major change in requirements is that JDK 1.5 must be used, but this is also now required for Hadoop 0.19 so this wouldn't be the only reason for the switch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-775) Enhance Searcher interface
[ https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806031#action_12806031 ] Andrzej Bialecki commented on NUTCH-775: - IMHO this could go as it is ... one suggestion though: this Query/QueryContext now resembles SolrQuery/SolrParams. Perhaps we could rename QueryContext to QueryParams? Enhance Searcher interface -- Key: NUTCH-775 URL: https://issues.apache.org/jira/browse/NUTCH-775 Project: Nutch Issue Type: Improvement Components: searcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 Attachments: NUTCH-775.patch Current Searcher interface is too limited for many purposes: Hits search(Query query, int numHits, String dedupField, String sortField, boolean reverse) throws IOException; It would be nice that we had an interface that allowed adding different features without changing the interface. I am proposing that we deprecate the current search method and introduce something like: Hits search(Query query, Metadata context) throws IOException; Also at the same time we should enhance the QueryFilter interface to look something like: BooleanQuery filter(Query input, BooleanQuery translation, Metadata context) throws QueryException; I would like to hear your comments before proceeding with a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
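To make the suggested rename concrete, a sketch of what a QueryParams-based signature could look like; the interface and class names are illustrative, not the committed NUTCH-775 API:
{code:java}
import java.io.IOException;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.Query;

/** Illustrative only - a Searcher variant taking a parameter bag instead of positional arguments. */
interface ParamSearcher {
  Hits search(Query query, QueryParams params) throws IOException;
}

/** Per-request options, playing the same role as SolrParams does for SolrQuery. */
class QueryParams {
  private int numHits = 10;
  private String dedupField = "site";
  private String sortField = null;
  private boolean reverse = false;

  QueryParams numHits(int n) { this.numHits = n; return this; }
  QueryParams dedup(String field) { this.dedupField = field; return this; }
  QueryParams sort(String field, boolean rev) { this.sortField = field; this.reverse = rev; return this; }

  int getNumHits() { return numHits; }
  String getDedupField() { return dedupField; }
  String getSortField() { return sortField; }
  boolean isReverse() { return reverse; }
}
{code}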
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804558#action_12804558 ] Andrzej Bialecki commented on NUTCH-766: - I agree with Chris, +1 on keeping the old plugins in 1.1 with a prominent deprecation note, but I feel equally strongly that we should not prolong their life-cycle beyond what we can support, i.e. I'm +1 on removing them in 1.2/1.3. We simply don't have the resources to maintain so many duplicate plugins, and instead we should direct our efforts to improving those in Tika. Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the parsing to Tika but can still coexist with the existing parsing plugins, which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress; your feedback is welcome. Tika is already used by Nutch for its MimeType implementations. Tika comes as different jar files (core and parsers); in the work described here we decided to put the libs in 2 different places: NUTCH_HOME/lib : tika-core.jar NUTCH_HOME/tika-plugin/lib : tika-parsers.jar Since Tika is used by the core only for its MimeType functionality, we only need to put tika-core at the main lib level, whereas the tika plugin obviously needs tika-parsers.jar + all the jars used internally by Tika. Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points. Unlike most other parsers, Tika handles more than one mime-type, which is why we are using * as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser, i.e. link detection and metatag handling; this also means that we can use the HTMLParseFilters in exactly the same way. The main difference, though, is that HTMLParseFilters are no longer limited to HTML documents, as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and making the tika parser depend on it. 
The following libraries are required in the lib/ directory of the tika-parser: <library name="asm-3.1.jar"/> <library name="bcmail-jdk15-144.jar"/> <library name="commons-compress-1.0.jar"/> <library name="commons-logging-1.1.1.jar"/> <library name="dom4j-1.6.1.jar"/> <library name="fontbox-0.8.0-incubator.jar"/> <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/> <library name="hamcrest-core-1.1.jar"/> <library name="jce-jdk13-144.jar"/> <library name="jempbox-0.8.0-incubator.jar"/> <library name="metadata-extractor-2.4.0-beta-1.jar"/> <library name="mockito-core-1.7.jar"/> <library name="objenesis-1.0.jar"/> <library name="ooxml-schemas-1.0.jar"/> <library name="pdfbox-0.8.0-incubating.jar"/> <library name="poi-3.5-FINAL.jar"/> <library name="poi-ooxml-3.5-FINAL.jar"/> <library name="poi-scratchpad-3.5-FINAL.jar"/> <library name="tagsoup-1.2.jar"/> <library name="tika-parsers-0.5-SNAPSHOT.jar"/> <library name="xml-apis-1.0.b2.jar"/> <library name="xmlbeans-2.3.0.jar"/> There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika and, if so, to the same extent; the Wiki is probably the right place for this. The language identifier (which is an HTMLParseFilter) seemed to work fine. Again, your comments are welcome. Please bear in mind that this is just a first step. Julien http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
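For illustration, the catch-all mime type mentioned above would sit in the plugin descriptor roughly as follows. This is a sketch, not the committed plugin.xml; the ids, class names and library entries are assumptions based on the description.
{code}
<plugin id="parse-tika" name="Tika Parser Plug-in" version="1.0.0" provider-name="nutch.org">
  <runtime>
    <library name="parse-tika.jar">
      <export name="*"/>
    </library>
    <!-- plus the Tika jars listed above, e.g. -->
    <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="org.apache.nutch.parse.tika" name="TikaParser" point="org.apache.nutch.parse.Parser">
    <implementation id="org.apache.nutch.parse.tika.TikaParser" class="org.apache.nutch.parse.tika.TikaParser">
      <!-- "*" marks the parser as a candidate for every mime type -->
      <parameter name="contentType" value="*"/>
      <parameter name="pathSuffix" value=""/>
    </implementation>
  </extension>
</plugin>
{code}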
[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802175#action_12802175 ] Andrzej Bialecki commented on NUTCH-779: - Personally I would use ScoringFilters because I'm familiar with the API, but the approach that you propose is certainly more user friendly especially for novice users. Mechanism for passing metadata from parse to crawldb Key: NUTCH-779 URL: https://issues.apache.org/jira/browse/NUTCH-779 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Attachments: NUTCH-779 The patch attached allows to pass parse metadata to the corresponding entry of the crawldb. Comments are welcome -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801875#action_12801875 ] Andrzej Bialecki commented on NUTCH-779: - You can already achieve this with ScoringFilters, although it requires using three methods instead ... I would also rename the status to parse_meta, it's less cryptic this way. The property needs some documentation in nutch-default.xml plus a sensible default. Mechanism for passing metadata from parse to crawldb Key: NUTCH-779 URL: https://issues.apache.org/jira/browse/NUTCH-779 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Attachments: NUTCH-779 The patch attached allows to pass parse metadata to the corresponding entry of the crawldb. Comments are welcome -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-655) Injecting Crawl metadata
[ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797013#action_12797013 ] Andrzej Bialecki commented on NUTCH-655: - I'm not sure about the latest addition (the score option). If we go this route, then I suggest doing the last minor step and recognizing reserved metadata keys to do other useful things as well, like setting the fetch interval. I.e. define and recognize nutch.score and nutch.fetchInterval, and document it properly somewhere ... (wiki? javadoc? cmd-line synopsis?). Injecting Crawl metadata Key: NUTCH-655 URL: https://issues.apache.org/jira/browse/NUTCH-655 Project: Nutch Issue Type: Improvement Components: injector Reporter: Julien Nioche Assignee: Julien Nioche Priority: Minor Fix For: 1.1 Attachments: Injector.patch, NUTCH-655.v2 The patch attached allows injecting metadata into the crawlDB. The input file has to contain fields separated by tabs, with the URL in the first column. The metadata names and values are separated by '='. An input line might look like this: http://www.myurl.com \t categ=value1 \t categ2=value2 This functionality can be useful to store external knowledge and index it with a custom plugin -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
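To make the proposal concrete, a seed file in the tab-separated format described above could combine free-form metadata with the suggested reserved keys. The values below are purely illustrative, and \t stands for a literal tab as in the issue description.
{noformat}
http://www.myurl.com \t nutch.score=2.5 \t nutch.fetchInterval=86400 \t categ=value1
http://www.another.com \t categ2=value2
{noformat}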
[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791979#action_12791979 ] Andrzej Bialecki commented on NUTCH-666: - Do you think it was related to the quality of language models that you built (presumably the ones in the patch?) versus the ones in the Nutch plugin, or due to a different classification algorithm? I'm trying to understand the source of such a big difference, because AFAIK the algorithm in textcat is essentially the same as the one we use. Analysis plugins for multiple language and new Language Identifier Tool --- Key: NUTCH-666 URL: https://issues.apache.org/jira/browse/NUTCH-666 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, russian, and thai. Also includes a new Language Identifier tool that used the new indexing framework in NUTCH-646. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-775) Enhance Searcher interface
[ https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791411#action_12791411 ] Andrzej Bialecki commented on NUTCH-775: - +1. I would suggest creating a subclass of Metadata, where we can guarantee the presence of some required parameters, e.g.: {code} public class SearchContext extends Metadata { protected int numHits; protected String sortField; protected String dedupField; ... // setters and getters for the above } {code} and change the QueryFilter interface to use SearchContext too. Enhance Searcher interface -- Key: NUTCH-775 URL: https://issues.apache.org/jira/browse/NUTCH-775 Project: Nutch Issue Type: Improvement Components: searcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 The current Searcher interface is too limited for many purposes: Hits search(Query query, int numHits, String dedupField, String sortField, boolean reverse) throws IOException; It would be nice if we had an interface that allowed adding different features without changing the interface. I am proposing that we deprecate the current search method and introduce something like: Hits search(Query query, Metadata context) throws IOException; At the same time we should also enhance the QueryFilter interface to look something like: BooleanQuery filter(Query input, BooleanQuery translation, Metadata context) throws QueryException; I would like to hear your comments before proceeding with a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
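A minimal sketch of how a caller might use the proposed interface, assuming the SearchContext subclass outlined in the comment above; the setter names are part of that proposal, not committed code.
{code}
SearchContext ctx = new SearchContext();
ctx.setNumHits(20);
ctx.setDedupField("site");
ctx.setSortField("date");
ctx.set("searcher.max.hits.per.site", "2"); // arbitrary extra parameter via Metadata
Hits hits = searcher.search(query, ctx);
{code}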
[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790225#action_12790225 ] Andrzej Bialecki commented on NUTCH-666: - Dennis, what's the status of this patch (especially the missing part, the new language identifier)? Analysis plugins for multiple language and new Language Identifier Tool --- Key: NUTCH-666 URL: https://issues.apache.org/jira/browse/NUTCH-666 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-666-1-20081126.patch Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, russian, and thai. Also includes a new Language Identifier tool that used the new indexing framework in NUTCH-646. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: State of nutchbase
Alban Mouton wrote: Hello, I have looked a little into nutch code and mailing lists. I think the nutchbase branch (http://issues.apache.org/jira/browse/NUTCH-650) is very interesting, with a good potential to improve code clarity and flexibility (I find data structure quite obscure in current version). The issue is untouched since last august, so my question is : can nutchbase really be part of nutch 1.1 ? Definitely no. Release 1.1 will be an update to 1.0, with no major design changes. However, we intend to integrate the nutchbase branch with trunk at some point - but since this would be a major change it would come under 2.0 branch or so ... Is there still much work to do or is it almost ready ? Is it a worthy issue for an interested developer with a (still !) limited knowledge of the project ? Please contact Dogacan, who is leading the work on this branch. AFAIK he's going to update the design soon. So far I have only tried to run nutchbase in eclipse by applying the tutorial (http://wiki.apache.org/nutch/RunNutchInEclipse1.0) but I run in errors when building, mostly from Parser and tests. I may start by cleaning this up. See above - please coordinate with Dogacan to avoid duplication of effort. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] Updated: (NUTCH-767) Update Tika to v0.5 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-767: Remaining Estimate: 0h Original Estimate: 0h I applied the patch, and I'm closing this issue - we will track the test failures when we upgrade to Tika 0.6, which is imminent. Update Tika to v0.5 for the MimeType detection --- Key: NUTCH-767 URL: https://issues.apache.org/jira/browse/NUTCH-767 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-767-part2.patch, NUTCH-767.patch Original Estimate: 0h Remaining Estimate: 0h Version 0.5 of Tika requires a few changes to the MimeType implementation. Tika is now split in several jars; we need to place the tika-core.jar in the main nutch lib. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Reopened: (NUTCH-767) Update Tika to v0.5 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reopened NUTCH-767: - Update Tika to v0.5 for the MimeType detection --- Key: NUTCH-767 URL: https://issues.apache.org/jira/browse/NUTCH-767 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-767.patch Version 0.5 of Tika requires a few changes to the MimeType implementation. Tika is now split in several jars; we need to place the tika-core.jar in the main nutch lib. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-767) Update Tika to v0.5 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784790#action_12784790 ] Andrzej Bialecki commented on NUTCH-767: - Reopening this issue because TestContent is failing now - after fixing a trivial compilation problem, the problem seems to be that the type for empty content is auto-detected as text/plain and this value overrides the hint from the Content-Type header. Update Tika to v0.5 for the MimeType detection --- Key: NUTCH-767 URL: https://issues.apache.org/jira/browse/NUTCH-767 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-767.patch Version 0.5 of Tika requires a few changes to the MimeType implementation. Tika is now split in several jars; we need to place the tika-core.jar in the main nutch lib. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20
[ https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784206#action_12784206 ] Andrzej Bialecki commented on NUTCH-768: - +1. Minor nit: file lib/hsqldb-1.8.0.10.LICENSE.txt uses Windows EOL style, this should be probably corrected before commit. Upgrade Nutch 1.0 to use Hadoop 0.20 Key: NUTCH-768 URL: https://issues.apache.org/jira/browse/NUTCH-768 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-768-1-20091125.patch Upgrade Nutch 1.0 to use the Hadoop 0.20 release. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-770) Timebomb for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784250#action_12784250 ] Andrzej Bialecki commented on NUTCH-770: - Fixed in rev. 885776. Thank you! Timebomb for Fetcher Key: NUTCH-770 URL: https://issues.apache.org/jira/browse/NUTCH-770 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: log-770, NUTCH-770-v2.patch, NUTCH-770-v3.patch, NUTCH-770.patch This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This allows keeping the Fetch step under control and works well in combination with NUTCH-769 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-770) Timebomb for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-770. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki Timebomb for Fetcher Key: NUTCH-770 URL: https://issues.apache.org/jira/browse/NUTCH-770 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: log-770, NUTCH-770-v2.patch, NUTCH-770-v3.patch, NUTCH-770.patch This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This allows keeping the Fetch step under control and works well in combination with NUTCH-769 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions
[ https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784260#action_12784260 ] Andrzej Bialecki commented on NUTCH-769: - I had to apply this patch by hand, due to NUTCH-770. I also added conf/nutch-default.xml documentation. This was committed in rev. 885785 - thanks! Fetcher to skip queues for URLS getting repeated exceptions - Key: NUTCH-769 URL: https://issues.apache.org/jira/browse/NUTCH-769 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Julien Nioche Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: NUTCH-769-2.patch, NUTCH-769.patch As discussed on the mailing list (see http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg15360.html) this patch allows clearing URL queues in the Fetcher when more than a set number of exceptions have been encountered in a row. This can speed up the fetching substantially in cases where target hosts are not responsive (as a TimeoutException would be thrown) and limits cases where a whole Fetch step is slowed down because of a few queues. By default the parameter fetcher.max.exceptions.per.queue has a value of -1 and is deactivated. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
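The nutch-default.xml entry mentioned in the comment above presumably looks roughly like the following; only the property name and the -1 default come from the issue itself, the description text is a paraphrase.
{code}
<property>
  <name>fetcher.max.exceptions.per.queue</name>
  <value>-1</value>
  <description>The maximum number of protocol-level exceptions (e.g. timeouts)
  tolerated for a given queue before the rest of the queue is purged.
  The default value of -1 deactivates this check.</description>
</property>
{code}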
[jira] Closed: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions
[ https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-769. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki Fetcher to skip queues for URLS getting repeated exceptions - Key: NUTCH-769 URL: https://issues.apache.org/jira/browse/NUTCH-769 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Julien Nioche Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: NUTCH-769-2.patch, NUTCH-769.patch As discussed on the mailing list (see http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg15360.html) this patch allows clearing URL queues in the Fetcher when more than a set number of exceptions have been encountered in a row. This can speed up the fetching substantially in cases where target hosts are not responsive (as a TimeoutException would be thrown) and limits cases where a whole Fetch step is slowed down because of a few queues. By default the parameter fetcher.max.exceptions.per.queue has a value of -1 and is deactivated. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-767) Update Tika to v0.5 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-767. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki (was: Chris A. Mattmann) Update Tika to v0.5 for the MimeType detection --- Key: NUTCH-767 URL: https://issues.apache.org/jira/browse/NUTCH-767 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-767.patch Version 0.5 of Tika requires a few changes to the MimeType implementation. Tika is now split in several jars; we need to place the tika-core.jar in the main nutch lib. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-767) Update Tika to v0.5 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784337#action_12784337 ] Andrzej Bialecki commented on NUTCH-767: - Fixed in rev. 885869. Thank you! Update Tika to v0.5 for the MimeType detection --- Key: NUTCH-767 URL: https://issues.apache.org/jira/browse/NUTCH-767 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-767.patch Version 0.5 of Tika requires a few changes to the MimeType implementation. Tika is now split in several jars; we need to place the tika-core.jar in the main nutch lib. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-770) Timebomb for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783638#action_12783638 ] Andrzej Bialecki commented on NUTCH-770: - bq. time limit is definitely better than timebomb (but not as amusing). :) let's go for informative and less confusing now ... Could you please also add the nutch-default.xml property and its documentation. Re: FetchQueues - ok, you have a point here. Re: code style - yes. Timebomb for Fetcher Key: NUTCH-770 URL: https://issues.apache.org/jira/browse/NUTCH-770 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Attachments: log-770, NUTCH-770.patch This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This allows keeping the Fetch step under control and works well in combination with NUTCH-769 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
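Assuming the rename from timebomb to time limit goes in as discussed above, the documented property could end up looking like this; the key name (fetcher.timelimit.mins) and the -1 default are assumptions based on the proposed rename, not the committed text.
{code}
<property>
  <name>fetcher.timelimit.mins</name>
  <value>-1</value>
  <description>Number of minutes allowed for the fetch job, counted from its start.
  Once the limit is reached the QueueFeeder stops feeding new entries and the
  remaining queues are purged. A value of -1 deactivates the limit.</description>
</property>
{code}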
Re: wrong wiki front page
Alban Mouton wrote: No reaction ? Isn't the Wiki admin on this mailing list ? I don't see any link on the Wiki to contact the admin. The french frontpage is still the generic MoinMoin wiki home page and that can make a bad impression to newcomers ! We have little control over the MoinMoin config (AFAIK it's configured for multiple projects), and what you noticed is probably a fallout of the recent wiki upgrade - please create a JIRA issue here: https://issues.apache.org/jira/browse/INFRA (don't forget to mention the project name). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] Commented: (NUTCH-770) Timebomb for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783283#action_12783283 ] Andrzej Bialecki commented on NUTCH-770: - I propose to change the name of this functionality - timebomb is not self-explanatory, and it suggests that if you misbehave then your cluster may explode ;) Instead I would use time limit, rename all vars and methods to follow this naming, and document it properly in nutch-default.xml. A few comments on the patch: * it has some overlap with NUTCH-769 (the emptyQueue() method), but that's easy to resolve, see also the next point. * why change the code in FetchQueues at all? Time limit is a global condition, we could just break the main loop in run() and ignore the QueueFeeder (or don't start it if the time limit already passed when starting run() ). * the patch does not follow the code style (notably whitespace in for/while loops and assignments). Timebomb for Fetcher Key: NUTCH-770 URL: https://issues.apache.org/jira/browse/NUTCH-770 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Attachments: log-770, NUTCH-770.patch This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This allows keeping the Fetch step under control and works well in combination with NUTCH-769 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-746) NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.
[ https://issues.apache.org/jira/browse/NUTCH-746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-746. --- Resolution: Fixed Assignee: Andrzej Bialecki NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container. Key: NUTCH-746 URL: https://issues.apache.org/jira/browse/NUTCH-746 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Environment: Apache Tomcat 5.5.27 and 6.0.18, Fedora 11, OpenJDK or Sun JDK 1.6 OpenJDK 64-Bit Server VM (build 14.0-b15, mixed mode) Reporter: Kirby Bohling Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-746.patch NutchBeanConstructor is not cleaning up upon application shutdown (contextDestroyed()). It leaves open the SegmentUpdater, and potentially other resources. This prevents the WebApp's classloader from being GC'ed in Tomcat, which after repeated restarts will lead to a PermGen error. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-746) NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.
[ https://issues.apache.org/jira/browse/NUTCH-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783287#action_12783287 ] Andrzej Bialecki commented on NUTCH-746: - Fixed in rev. 885148. Thanks! NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container. Key: NUTCH-746 URL: https://issues.apache.org/jira/browse/NUTCH-746 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Environment: Apache Tomcat 5.5.27 and 6.0.18, Fedora 11, OpenJDK or Sun JDK 1.6 OpenJDK 64-Bit Server VM (build 14.0-b15, mixed mode) Reporter: Kirby Bohling Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-746.patch NutchBeanConstructor is not cleaning up upon application shutdown (contextDestroyed()). It leaves open the SegmentUpdater, and potentially other resources. This prevents the WebApp's classloader from being GC'ed in Tomcat, which after repeated restarts will lead to a PermGen error. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
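The shape of the fix is a standard servlet lifecycle hook; a rough sketch is shown below. The context attribute key and the close() call are assumptions, not the committed patch.
{code}
import java.io.IOException;
import javax.servlet.ServletContext;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import org.apache.nutch.searcher.NutchBean;

public class NutchBeanConstructor implements ServletContextListener {
  public void contextInitialized(ServletContextEvent sce) {
    // existing code: create the NutchBean and store it in the servlet context
  }
  public void contextDestroyed(ServletContextEvent sce) {
    ServletContext ctx = sce.getServletContext();
    NutchBean bean = (NutchBean) ctx.getAttribute("nutchBean"); // attribute key assumed
    if (bean != null) {
      try {
        bean.close(); // stops the SegmentUpdater and releases open searchers
      } catch (IOException e) {
        ctx.log("Failed to close NutchBean", e);
      }
    }
  }
}
{code}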
[jira] Closed: (NUTCH-738) Close SegmentUpdater when FetchedSegments is closed
[ https://issues.apache.org/jira/browse/NUTCH-738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-738. --- Resolution: Fixed Assignee: Andrzej Bialecki Close SegmentUpdater when FetchedSegments is closed --- Key: NUTCH-738 URL: https://issues.apache.org/jira/browse/NUTCH-738 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Reporter: Martina Koch Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: FetchedSegments.patch, NUTCH-738.patch Currently FetchedSegments starts a SegmentUpdater, but never closes it when FetchedSegments is closed. (The problem was described in this mailing: http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg13823.html) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-739. --- Resolution: Fixed Assignee: Andrzej Bialecki SolrDeleteDuplications too slow when using hadoop - Key: NUTCH-739 URL: https://issues.apache.org/jira/browse/NUTCH-739 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Environment: hadoop cluster with 3 nodes Map Task Capacity: 6 Reduce Task Capacity: 6 Indexer: one instance of solr server (on one of the slave nodes) Reporter: Dmitry Lihachev Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-739_remove_optimize_on_solr_dedup.patch in my environment i always have many warnings like this on the dedup step {noformat} Task attempt_200905270022_0212_r_03_0 failed to report status for 600 seconds. Killing! {noformat} solr logs: {noformat} INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173741 May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {optimize=} 0 173599 May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173599 May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close INFO: Closing searc...@2ad9ac58 main May 27, 2009 10:29:27 AM org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo WARNING: Could not getStatistics on info bean org.apache.solr.search.SolrIndexSearcher org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed {noformat} So I think the problem is in the piece of code on line 301 of SolrDeleteDuplications ( solr.optimize() ). Because we have a few job tasks, each of them tries to optimize the solr index before closing. The simplest way to avoid this bug is removing this line and sending an <optimize/> message directly to the solr server after the dedup step -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783290#action_12783290 ] Andrzej Bialecki commented on NUTCH-739: - Fixed in rev. 885152. Thank you! SolrDeleteDuplications too slow when using hadoop - Key: NUTCH-739 URL: https://issues.apache.org/jira/browse/NUTCH-739 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Environment: hadoop cluster with 3 nodes Map Task Capacity: 6 Reduce Task Capacity: 6 Indexer: one instance of solr server (on one of the slave nodes) Reporter: Dmitry Lihachev Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-739_remove_optimize_on_solr_dedup.patch in my environment i always have many warnings like this on the dedup step {noformat} Task attempt_200905270022_0212_r_03_0 failed to report status for 600 seconds. Killing! {noformat} solr logs: {noformat} INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173741 May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {optimize=} 0 173599 May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173599 May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close INFO: Closing searc...@2ad9ac58 main May 27, 2009 10:29:27 AM org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo WARNING: Could not getStatistics on info bean org.apache.solr.search.SolrIndexSearcher org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed {noformat} So I think the problem is in the piece of code on line 301 of SolrDeleteDuplications ( solr.optimize() ). Because we have a few job tasks, each of them tries to optimize the solr index before closing. The simplest way to avoid this bug is removing this line and sending an <optimize/> message directly to the solr server after the dedup step -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
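The suggested fix (drop the per-task optimize and issue a single one after dedup) amounts to a few lines of SolrJ in the driver, along these lines; the URL is a placeholder and this is a sketch of the idea, not the committed change.
{code}
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class OptimizeAfterDedup {
  public static void main(String[] args) throws Exception {
    // Placeholder URL - point this at the Solr instance used for indexing.
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    solr.optimize(); // one optimize call from the driver, instead of one per reduce task
  }
}
{code}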
[jira] Closed: (NUTCH-755) DomainURLFilter crashes on malformed URL
[ https://issues.apache.org/jira/browse/NUTCH-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-755. --- Resolution: Cannot Reproduce Assignee: Andrzej Bialecki DomainURLFilter crashes on malformed URL Key: NUTCH-755 URL: https://issues.apache.org/jira/browse/NUTCH-755 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Tomcat 6.0.14 Java 1.6.0_14 Linux Reporter: Mike Baranczak Assignee: Andrzej Bialecki 2009-09-16 21:54:17,001 ERROR [Thread-156] DomainURLFilter - Could not apply filter on url: http:/comments.php java.lang.NullPointerException at org.apache.nutch.urlfilter.domain.DomainURLFilter.filter(DomainURLFilter.java:173) at org.apache.nutch.net.URLFilters.filter(URLFilters.java:88) at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:200) at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:113) at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:96) at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:70) at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410) at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170) Expected behavior would be to recognize the URL as malformed, and reject it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-755) DomainURLFilter crashes on malformed URL
[ https://issues.apache.org/jira/browse/NUTCH-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783299#action_12783299 ] Andrzej Bialecki commented on NUTCH-755: - I could not verify that the filter indeed crashes - it simply prints the exception and then returns null, as you suggested. DomainURLFilter crashes on malformed URL Key: NUTCH-755 URL: https://issues.apache.org/jira/browse/NUTCH-755 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Tomcat 6.0.14 Java 1.6.0_14 Linux Reporter: Mike Baranczak Assignee: Andrzej Bialecki 2009-09-16 21:54:17,001 ERROR [Thread-156] DomainURLFilter - Could not apply filter on url: http:/comments.php java.lang.NullPointerException at org.apache.nutch.urlfilter.domain.DomainURLFilter.filter(DomainURLFilter.java:173) at org.apache.nutch.net.URLFilters.filter(URLFilters.java:88) at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:200) at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:113) at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:96) at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:70) at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410) at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170) Expected behavior would be to recognize the URL as malformed, and reject it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
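The requested behaviour maps onto the URLFilter contract, where returning null rejects a URL. A hedged sketch of the guard (not the actual DomainURLFilter source):
{code}
// Sketch of the defensive check being asked for; not the plugin's actual code.
public String filter(String urlString) {
  try {
    java.net.URL url = new java.net.URL(urlString);
    String host = url.getHost();
    if (host == null || host.length() == 0) {
      return null; // malformed, e.g. "http:/comments.php" - reject it
    }
    // ... existing lookup of the host/domain against the configured domain set ...
    return urlString;
  } catch (java.net.MalformedURLException e) {
    return null; // reject instead of letting a NullPointerException escape
  }
}
{code}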
[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783302#action_12783302 ] Andrzej Bialecki commented on NUTCH-692: - We should review this issue after the upgrade to Hadoop 0.20 - task output mgmt differs there, and the problem may be nonexistent. AlreadyBeingCreatedException with Hadoop 0.19 - Key: NUTCH-692 URL: https://issues.apache.org/jira/browse/NUTCH-692 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Julien Nioche Attachments: NUTCH-692.patch I have been using the SVN version of Nutch on an EC2 cluster and got some AlreadyBeingCreatedException during the reduce phase of a parse. For some reason one of my tasks crashed and then I ran into this AlreadyBeingCreatedException when other nodes tried to pick it up. There was recently a discussion on the Hadoop user list on similar issues with Hadoop 0.19 (see http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried using 0.18.2 yet but will do if the problems persist with 0.19 I was wondering whether anyone else had experienced the same problem. Do you think 0.19 is stable enough to use it for Nutch 1.0? I will be running a crawl on a super large cluster in the next couple of weeks and I will confirm this issue J. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-741) Job file includes multiple copies of nutch config files.
[ https://issues.apache.org/jira/browse/NUTCH-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783304#action_12783304 ] Andrzej Bialecki commented on NUTCH-741: - Fixed in rev. 885156. Thank you! Job file includes multiple copies of nutch config files. Key: NUTCH-741 URL: https://issues.apache.org/jira/browse/NUTCH-741 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Kirby Bohling Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: removeJobDupConf.diff From a clean checkout, running ant tar will create a .job file. The .job file includes two copies of the nutch-site.xml and nutch-default.xml file. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-741) Job file includes multiple copies of nutch config files.
[ https://issues.apache.org/jira/browse/NUTCH-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-741. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki Job file includes multiple copies of nutch config files. Key: NUTCH-741 URL: https://issues.apache.org/jira/browse/NUTCH-741 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Kirby Bohling Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: removeJobDupConf.diff From a clean checkout, running ant tar will create a .job file. The .job file includes two copies of the nutch-site.xml and nutch-default.xml file. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-712) ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers
[ https://issues.apache.org/jira/browse/NUTCH-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-712. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers - Key: NUTCH-712 URL: https://issues.apache.org/jira/browse/NUTCH-712 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: ParseOutputFormat-NUTCH712v2.patch ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers otherwise the whole parsing step crashes instead of simply ignoring dodgy outlinks -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-712) ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers
[ https://issues.apache.org/jira/browse/NUTCH-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783306#action_12783306 ] Andrzej Bialecki commented on NUTCH-712: - Fixed in rev. 885159. Thank you! ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers - Key: NUTCH-712 URL: https://issues.apache.org/jira/browse/NUTCH-712 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: ParseOutputFormat-NUTCH712v2.patch ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers otherwise the whole parsing step crashes instead of simply ignoring dodgy outlinks -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
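The fix amounts to wrapping the per-outlink normalize/filter calls so that a bad outlink is skipped rather than failing the task. A paraphrased sketch using the URLNormalizers/URLFilters calls named in the issue (not an exact diff of the commit):
{code}
// Paraphrase of the per-outlink handling in ParseOutputFormat: a dodgy outlink is
// dropped instead of crashing the whole parsing step.
private String normalizeAndFilter(String toUrl, URLNormalizers normalizers, URLFilters filters) {
  try {
    toUrl = normalizers.normalize(toUrl, URLNormalizers.SCOPE_OUTLINK);
    toUrl = filters.filter(toUrl);
  } catch (MalformedURLException e) {
    return null; // malformed outlink - ignore it
  } catch (URLFilterException e) {
    return null;
  }
  return toUrl;
}
{code}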
[jira] Created: (NUTCH-772) Upgrade Nutch to use Lucene 2.9.1
Upgrade Nutch to use Lucene 2.9.1 - Key: NUTCH-772 URL: https://issues.apache.org/jira/browse/NUTCH-772 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Upgrade Nutch to the latest Lucene release. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: svn commit: r884075 - /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
david.stu...@progressivealliance.co.uk wrote: While you are doing changes and commits in this area I have been waiting for this patch https://issues.apache.org/jira/browse/NUTCH-760 of mine to be incorporated for a while now. Is it possible it get it in?? It's on my agenda - I'll apply the patch either today or tomorrow, time permitting. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] Closed: (NUTCH-773) some minor bugs in AbstractFetchSchedule.java
[ https://issues.apache.org/jira/browse/NUTCH-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-773. --- Resolution: Fixed Assignee: Andrzej Bialecki some minor bugs in AbstractFetchSchedule.java - Key: NUTCH-773 URL: https://issues.apache.org/jira/browse/NUTCH-773 Project: Nutch Issue Type: Bug Components: fetcher, generator Affects Versions: 1.0.0 Reporter: Reinhard Schwab Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: NUTCH-773.patch fixes some minor trivial bugs in AbstractFetchSchedule.java -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-773) some minor bugs in AbstractFetchSchedule.java
[ https://issues.apache.org/jira/browse/NUTCH-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12782509#action_12782509 ] Andrzej Bialecki commented on NUTCH-773: - That was a nasty bug - fixed in rev. 884198. Thanks! some minor bugs in AbstractFetchSchedule.java - Key: NUTCH-773 URL: https://issues.apache.org/jira/browse/NUTCH-773 Project: Nutch Issue Type: Bug Components: fetcher, generator Affects Versions: 1.0.0 Reporter: Reinhard Schwab Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: NUTCH-773.patch fixes some minor trivial bugs in AbstractFetchSchedule.java -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-753) Prevent new Fetcher to retrieve the robots twice
[ https://issues.apache.org/jira/browse/NUTCH-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12782516#action_12782516 ] Andrzej Bialecki commented on NUTCH-753: - Fixed in rev. 884203 - thanks! Prevent new Fetcher to retrieve the robots twice Key: NUTCH-753 URL: https://issues.apache.org/jira/browse/NUTCH-753 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-753.patch The new Fetcher which is now used by default handles the robots file directly instead of relying on the protocol. The options Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS are set to false to prevent fetching the robots.txt twice (in Fetcher + in protocol), which avoids calling robots.isAllowed. However in practice the robots file is still fetched as there is a call to robots.getCrawlDelay() a bit further which is not covered by the if (Protocol.CHECK_ROBOTS). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
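Paraphrased sketch of the change described above: keep the crawl-delay lookup inside the same guard as the isAllowed() check so robots.txt is only fetched once. The variable names are assumptions; this is not an exact diff of the patch.
{code}
if (checkRobots) {                     // the fetcher's CHECK_ROBOTS guard
  RobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
  if (!rules.isAllowed(fit.u)) {
    // report ROBOTS_DENIED and skip the URL, as before
  } else if (rules.getCrawlDelay() > 0) {
    // only now read the crawl delay, so the robots file is not fetched a second time
    crawlDelay = rules.getCrawlDelay();
  }
}
{code}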
[jira] Closed: (NUTCH-753) Prevent new Fetcher to retrieve the robots twice
[ https://issues.apache.org/jira/browse/NUTCH-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-753. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki Prevent new Fetcher to retrieve the robots twice Key: NUTCH-753 URL: https://issues.apache.org/jira/browse/NUTCH-753 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-753.patch The new Fetcher which is now used by default handles the robots file directly instead of relying on the protocol. The options Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS are set to false to prevent fetching the robots.txt twice (in Fetcher + in protocol), which avoids calling robots.isAllowed. However in practice the robots file is still fetched as there is a call to robots.getCrawlDelay() a bit further which is not covered by the if (Protocol.CHECK_ROBOTS). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12782524#action_12782524 ] Andrzej Bialecki commented on NUTCH-762: - This class offers a strict superset of the current Generator functionality. Maintaining both tools would be cumbersome and error-prone. I propose to replace Generator with MultiGenerator (under the current name Generator). Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Attachments: NUTCH-762-MultiGenerator.patch When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB, then updating the DB only once with several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well as we need to read the whole crawlDB as many times as we generate a segment. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: * can filter the URLs by score * normalisation is optional * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can max the number of URLs per host or domain (but not by IP) * can choose to partition by host, domain or IP Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers. The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ... with the following options : MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator - apart from : -noNorm (explicit) -topN : max number of URLs per segment -maxNumSegments : the actual number of segments generated could be less than the max value if e.g. not enough URLs are available for fetching and they fit in fewer segments Please give it a try and let me know what you think of it Julien Nioche http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
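An example invocation following the synopsis above; the paths and numbers are placeholders.
{noformat}
bin/nutch org.apache.nutch.crawl.MultiGenerator crawl/crawldb crawl/segments -topN 250000 -numFetchers 4 -maxNumSegments 8 -noFilter
{noformat}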
[jira] Closed: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer
[ https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-761. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki Avoid cloning CrawlDatum in CrawlDbReducer -- Key: NUTCH-761 URL: https://issues.apache.org/jira/browse/NUTCH-761 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: optiCrawlReducer.patch In the huge majority of cases the CrawlDbReducer gets a unique CrawlDatum in its reduce phase, and these will be the entries coming from the crawlDB and not present in the segments. The patch attached optimizes the reduce step by avoiding an unnecessary cloning of the CrawlDatum fields when there is only one CrawlDatum in the values. This has more impact as the crawlDB gets larger; we noticed an improvement of around 25-30% in the time spent in the reduce phase. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer
[ https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12782537#action_12782537 ] Andrzej Bialecki commented on NUTCH-761: - I applied the patch with some changes - reverted the logic in the name of the boolean var, and applied the same method to other cases of non-multiple values. Committed in rev. 884224 - thanks! Avoid cloning CrawlDatum in CrawlDbReducer -- Key: NUTCH-761 URL: https://issues.apache.org/jira/browse/NUTCH-761 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: optiCrawlReducer.patch In the huge majority of cases the CrawlDbReducer gets a unique CrawlDatum in its reduce phase, and these will be the entries coming from the crawlDB and not present in the segments. The patch attached optimizes the reduce step by avoiding an unnecessary cloning of the CrawlDatum fields when there is only one CrawlDatum in the values. This has more impact as the crawlDB gets larger; we noticed an improvement of around 25-30% in the time spent in the reduce phase. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
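The optimization boils down to skipping the defensive copy when there is exactly one value. A generic sketch of the reduce-side idiom (illustrative only, not the committed change):
{code}
// Skip the copy in the common single-value case; the old-API iterator reuses its
// object instance, which is why a copy is otherwise required before reading further values.
CrawlDatum datum = values.next();
boolean multiple = values.hasNext();
if (multiple) {
  CrawlDatum copy = new CrawlDatum();
  copy.set(datum);
  datum = copy;
}
// ... the rest of the reduce logic uses 'datum' (and, when 'multiple' is true,
//     the remaining segment entries) exactly as before ...
{code}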
[jira] Closed: (NUTCH-760) Allow field mapping from nutch to solr index
[ https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-760. --- Resolution: Fixed Fix Version/s: 1.1 Allow field mapping from nutch to solr index Key: NUTCH-760 URL: https://issues.apache.org/jira/browse/NUTCH-760 Project: Nutch Issue Type: Improvement Components: indexer Reporter: David Stuart Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch I am using nutch to crawl sites and have combined it with solr, pushing the nutch index using the solrindex command. I have set it up as specified on the wiki using the copyField url to id in the schema. Whilst this works fine, it stuffs up my inputs from other sources in solr (e.g. using the solr data import handler) as they have both ids and urls. I have a patch that implements a nutch xml schema defining what basic nutch fields map to in your solr push. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
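The idea is a small mapping file consulted by solrindex instead of a copyField in the Solr schema. A sketch of what such a mapping could look like; the element and attribute names are illustrative, not necessarily those used in the patch.
{code}
<mapping>
  <fields>
    <!-- map Nutch field names (source) onto Solr schema fields (dest) -->
    <field source="content" dest="content"/>
    <field source="title" dest="title"/>
    <field source="segment" dest="segment"/>
    <field source="boost" dest="boost"/>
    <field source="digest" dest="digest"/>
    <field source="url" dest="url"/>
    <!-- fill the Solr unique key from the Nutch url, replacing the copyField trick -->
    <field source="url" dest="id"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
{code}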