[jira] Commented: (NUTCH-706) Url regex normalizer

2010-03-31 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851923#action_12851923 ] Ken Krugler commented on NUTCH-706: --- Two comments about this: 1. From my experiences with

[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"

2010-03-17 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846459#action_12846459 ] Ken Krugler commented on NUTCH-797: --- Agreed re crawler-commons...feels like there's a beef

[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"

2010-03-17 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846424#action_12846424 ] Ken Krugler commented on NUTCH-797: --- I thought this same issue (relative URL with leading

[jira] Commented: (NUTCH-786) Better list of suffix domains

2010-02-05 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830109#action_12830109 ] Ken Krugler commented on NUTCH-786: --- Is this something that should also be applied to craw

[jira] Commented: (NUTCH-751) Upgrade version of HttpClient

2010-01-11 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798890#action_12798890 ] Ken Krugler commented on NUTCH-751: --- I agree that this should be in crawler-commons. E.g.

[jira] Commented: (NUTCH-751) Upgrade version of HttpClient

2009-09-09 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753069#action_12753069 ] Ken Krugler commented on NUTCH-751: --- I'm using HttpClient 4.0 in Bixo, and I agree that Nu

[jira] Commented: (NUTCH-731) Redirection of robots.txt in RobotRulesParser

2009-06-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722242#action_12722242 ] Ken Krugler commented on NUTCH-731: --- This is definitely an issue - I've been pinging vario

[jira] Commented: (NUTCH-101) RobotRulesParser

2009-06-19 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722014#action_12722014 ] Ken Krugler commented on NUTCH-101: --- 1. Not sure if the reported problem with "Disallow:"

[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-28 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714277#action_12714277 ] Ken Krugler commented on NUTCH-739: --- There's another approach that works well here, and th

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497525 ] Ken Krugler commented on NUTCH-25: -- I use [ICU|http://krugle.com/kse/projects/BYfaaku] for most issues like this. The

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2007-01-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466261 ] Ken Krugler commented on NUTCH-353: --- Wait, looks like maybe change 490607 (fix for NUTCH-273) might fix the issue I

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2007-01-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466260 ] Ken Krugler commented on NUTCH-353: --- Another small note about this (see NUTCH-411 for a related but different probl

[jira] Commented: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2006-10-23 Thread Ken Krugler (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12444162 ] Ken Krugler commented on NUTCH-385: --- There is a middle ground, though we don't know yet how important it is to address. When we crawl partner sites, we sometimes

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-02 Thread Ken Krugler (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304 ] Ken Krugler commented on NUTCH-353: --- +1 that the redirect target is not always the "real" URL that we want to keep. For example, http://www.ibm.com/developerworks

[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-19 Thread Ken Krugler (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412621 ] Ken Krugler commented on NUTCH-272: --- The generate.max.per.host parameter does work, but with the following limitations that we've run into: 1. The current code uses the enti

[jira] Commented: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-14 Thread Ken Krugler (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370424 ] Ken Krugler commented on NUTCH-230: --- So Doug beat me to this comment :) I was going to describe the two cases we'd run into... 1. There's a great page, but most of the links

[jira] Created: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-13 Thread Ken Krugler (JIRA)
OPIC score for outlinks should be based on # of valid links, not total # of links. -- Key: NUTCH-230 URL: http://issues.apache.org/jira/browse/NUTCH-230 Project: Nutch Type: Improvement