[jira] Commented: (NUTCH-419) unavailable robots.txt kills fetch

2009-02-28 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677704#action_12677704 ] Doug Cook commented on NUTCH-419: - I ran into this same problem, and spent some time

[jira] Updated: (NUTCH-419) unavailable robots.txt kills fetch

2009-02-28 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cook updated NUTCH-419: Attachment: diffs Here's a context diff. Hopefully this will work, am rusty at creating patches, and did

[jira] Commented: (NUTCH-566) Sun's URL class has bug in creation of relative query URLs

2007-10-31 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539146 ] Doug Cook commented on NUTCH-566: - Hi Doğacan. Thanks for following up. The issue has gotten a little more

[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-10-17 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535593 ] Doug Cook commented on NUTCH-567: - What a nice birthday present! I will check out the fix and see how it works

[jira] Commented: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

2007-10-16 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535272 ] Doug Cook commented on NUTCH-436: - It looks like Nutch-566, and associated patch, which I recently filed, is a

[jira] Created: (NUTCH-566) Sun's URL class has bug in creation of relative query URLs

2007-10-10 Thread Doug Cook (JIRA)
Sun's URL class has bug in creation of relative query URLs -- Key: NUTCH-566 URL: https://issues.apache.org/jira/browse/NUTCH-566 Project: Nutch Issue Type: Bug Components:

[jira] Updated: (NUTCH-566) Sun's URL class has bug in creation of relative query URLs

2007-10-10 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cook updated NUTCH-566: Attachment: RelativeURL.java Here's a static method to work around the problem. Sun's URL class has bug
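
The attached RelativeURL.java is not reproduced in this listing. As a rough illustration of the kind of workaround described, here is a minimal sketch that special-cases a relative reference consisting only of a query string, keeping the base path and replacing just the query (the RFC 3986 result). The class name, the method name `resolve`, and the example URL are assumptions made for this sketch and are not taken from the attachment.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch; NOT the RelativeURL.java attached to NUTCH-566.
@SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
final class RelativeUrlSketch {

    // Resolves target against base. A target that is only a query ("?a=b")
    // keeps the full base path and replaces the base query.
    static String resolve(String base, String target) {
        try {
            URL baseUrl = new URL(base);
            if (target.startsWith("?")) {
                // getPath() excludes the old query, so appending the target
                // swaps the query while preserving the last path segment.
                return new URL(baseUrl.getProtocol(), baseUrl.getHost(),
                        baseUrl.getPort(), baseUrl.getPath() + target).toString();
            }
            // Everything else is left to the standard resolution.
            return new URL(baseUrl, target).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot resolve " + target, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                resolve("http://www.example.com/path/index.html?x=1", "?a=b"));
        // prints http://www.example.com/path/index.html?a=b
    }
}
```

Only the query-only case is handled explicitly; fragments and other relative forms fall through to `new URL(base, target)` unchanged.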

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-08-01 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517066 ] Doug Cook commented on NUTCH-25: Cool -- will take a look at the new patch (and will try to make stripGarbage more

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-25 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515342 ] Doug Cook commented on NUTCH-25: Doğacan, Thanks for the quick feedback. * EncodingDetector api is way too open.

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-25 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515461 ] Doug Cook commented on NUTCH-25: Can you provide a link on icu4j's language detection?

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-24 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515026 ] Doug Cook commented on NUTCH-25: OK, I've got more data, and a proposed solution. I created a test set with a number

[jira] Updated: (NUTCH-25) needs 'character encoding' detector

2007-07-24 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cook updated NUTCH-25: --- Attachment: EncodingDetector.java patch needs 'character encoding' detector

[jira] Updated: (NUTCH-25) needs 'character encoding' detector

2007-07-24 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cook updated NUTCH-25: --- Attachment: (was: EncodingDetector.java) needs 'character encoding' detector

[jira] Updated: (NUTCH-25) needs 'character encoding' detector

2007-07-24 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cook updated NUTCH-25: --- Attachment: EncodingDetector.java I cleaned up EncodingDetector a little; here's a functionally identical, but

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-21 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514426 ] Doug Cook commented on NUTCH-25: Not sure where this belongs architecturally and aesthetically -- will think about

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-21 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514438 ] Doug Cook commented on NUTCH-25: As far as the problem cases, I'm running a test now on my test DB (the ~60K doc

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514375 ] Doug Cook commented on NUTCH-25: Hi, Doğacan. My sincere apologies for the slow response, especially given the

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514377 ] Doug Cook commented on NUTCH-25: I should also add that a significant number of the URLs seem to have been fixed by

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514382 ] Doug Cook commented on NUTCH-25: Oops, spoke too soon. On running a more extensive test, I saw quite a few

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-22 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498041 ] Doug Cook commented on NUTCH-25: Thanks! I'll take a look at your proposed patch... (that was fast! ask and ye shall

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497507 ] Doug Cook commented on NUTCH-25: We might want to think about raising the priority of this. I've seen encoding

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2007-01-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466284 ] Doug Cook commented on NUTCH-353: - I have a local fix for this problem (partly Paul Gauthier's work, partly mine)

[jira] Commented: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring

2006-12-20 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-416?page=comments#action_12460080 ] Doug Cook commented on NUTCH-416: - You may also want to make the status codes ORed values, so that, for example, all of the various kinds of failure all have a
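
The comment is cut off in this listing, so its exact proposal is unknown. As a minimal sketch of what ORed status codes could look like, assuming every failure kind shares one common bit: the constant names and values below are hypothetical and are not Nutch's actual CrawlDatum status codes.

```java
// Hypothetical sketch of bit-flag status codes; values are illustrative only.
final class StatusFlagsSketch {
    static final int FAILURE = 0x10;                 // bit shared by all failures
    static final int FAIL_TIMEOUT = FAILURE | 0x01;  // a specific failure kind
    static final int FAIL_NOT_FOUND = FAILURE | 0x02;
    static final int FETCH_SUCCESS = 0x20;           // does not carry FAILURE

    // One mask test covers every failure kind, present or future.
    static boolean isFailure(int status) {
        return (status & FAILURE) != 0;
    }

    public static void main(String[] args) {
        System.out.println(isFailure(FAIL_TIMEOUT));   // true
        System.out.println(isFailure(FAIL_NOT_FOUND)); // true
        System.out.println(isFailure(FETCH_SUCCESS));  // false
    }
}
```

With this layout, code that only cares whether a fetch failed needs a single mask check instead of a comparison against each failure code.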

[jira] Created: (NUTCH-410) Faster RegexNormalize with more features

2006-11-29 Thread Doug Cook (JIRA)
Faster RegexNormalize with more features Key: NUTCH-410 URL: http://issues.apache.org/jira/browse/NUTCH-410 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions:

[jira] Updated: (NUTCH-409) Add short circuit notion to filters to speedup mixed site/subsite crawling

2006-11-25 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-409?page=all ] Doug Cook updated NUTCH-409: Attachment: shortcircuit.patch Add short circuit notion to filters to speedup mixed site/subsite crawling

[jira] Commented: (NUTCH-409) Add short circuit notion to filters to speedup mixed site/subsite crawling

2006-11-25 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-409?page=comments#action_12452617 ] Doug Cook commented on NUTCH-409: - I should also note that this approach is still not optimal (though it is faster for my usage pattern). I'm still running the

[jira] Created: (NUTCH-396) mergesegs sorts URLs, making segments useless for subsequent fetch

2006-11-03 Thread Doug Cook (JIRA)
mergesegs sorts URLs, making segments useless for subsequent fetch -- Key: NUTCH-396 URL: http://issues.apache.org/jira/browse/NUTCH-396 Project: Nutch Issue Type: Bug

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-02 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439248 ] Doug Cook commented on NUTCH-353: - This is definitely a complex issue. It is also high priority -- issues with redirects and duplicates, which URL is chosen, and

[jira] Commented: (NUTCH-364) Javascript parser creates some fairly bogus URLs

2006-09-19 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-364?page=comments#action_12435945 ] Doug Cook commented on NUTCH-364: - I've been looking into this a little bit. I see two problems: (1) The current two pass heuristic URL-like string extractor has

[jira] Commented: (NUTCH-365) Flexible URL normalization

2006-09-18 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435449 ] Doug Cook commented on NUTCH-365: - It still seems to me that iterative normalization is useful and not risky. By definition, a normalizer is something which
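
The snippet is truncated, so the argument it makes is not fully visible here. One common reading of iterative normalization is to reapply a normalization step until the URL stops changing, with a cap on the number of passes. The sketch below illustrates that reading only; the class, the method names, and the sample rule are assumptions and do not reflect Nutch's normalizer API.

```java
import java.util.function.UnaryOperator;

// Hypothetical sketch of fixed-point normalization; not Nutch's API.
final class IterativeNormalizeSketch {

    // Applies step repeatedly until the output equals the input or the
    // pass limit is reached, then returns the last value.
    static String normalizeToFixedPoint(String url, UnaryOperator<String> step,
                                        int maxPasses) {
        String current = url;
        for (int i = 0; i < maxPasses; i++) {
            String next = step.apply(current);
            if (next.equals(current)) {
                return current; // stable: another pass would change nothing
            }
            current = next;
        }
        return current;
    }

    // Sample rule: removes a single "/./" segment per call.
    static String removeOneDotSegment(String url) {
        return url.replaceFirst("/\\./", "/");
    }

    public static void main(String[] args) {
        System.out.println(normalizeToFixedPoint(
                "/a/././b", IterativeNormalizeSketch::removeOneDotSegment, 10));
        // prints /a/b
    }
}
```

The pass limit guards against a rule set that never settles, which is the main risk of iterating.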

[jira] Commented: (NUTCH-365) Flexible URL normalization

2006-09-09 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433613 ] Doug Cook commented on NUTCH-365: - Hi, Andrzej. Sounds very cool. Haven't had a chance to check out the patch yet to see if it supports this, but attaching a

[jira] Commented: (NUTCH-365) Flexible URL normalization

2006-09-09 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433617 ] Doug Cook commented on NUTCH-365: - PS. I like your idea of combining URL filters and normalization. In a sense, a filter is just a normalizer that happens to

[jira] Created: (NUTCH-363) Fetcher normalizes everything at least twice

2006-09-08 Thread Doug Cook (JIRA)
Fetcher normalizes everything at least twice Key: NUTCH-363 URL: http://issues.apache.org/jira/browse/NUTCH-363 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions:

[jira] Created: (NUTCH-364) Javascript parser creates some fairly bogus URLs

2006-09-08 Thread Doug Cook (JIRA)
Javascript parser creates some fairly bogus URLs Key: NUTCH-364 URL: http://issues.apache.org/jira/browse/NUTCH-364 Project: Nutch Issue Type: Bug Affects Versions: 0.8