[
https://issues.apache.org/jira/browse/NUTCH-419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12677704#action_12677704
]
Doug Cook commented on NUTCH-419:
-
I ran into this same problem, and spent some time
[
https://issues.apache.org/jira/browse/NUTCH-419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-419:
Attachment: diffs
Here's a context diff. Hopefully this will work, am rusty at creating patches,
and did
[
https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539146
]
Doug Cook commented on NUTCH-566:
-
Hi Doğacan.
Thanks for following up. The issue has gotten a little more
[
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535593
]
Doug Cook commented on NUTCH-567:
-
What a nice birthday present!
I will check out the fix and see how it works
[
https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535272
]
Doug Cook commented on NUTCH-436:
-
It looks like NUTCH-566, and the associated patch, which I recently filed, is a
Sun's URL class has bug in creation of relative query URLs
--
Key: NUTCH-566
URL: https://issues.apache.org/jira/browse/NUTCH-566
Project: Nutch
Issue Type: Bug
Components:
[
https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-566:
Attachment: RelativeURL.java
Here's a static method to work around the problem.
Sun's URL class has bug
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517066
]
Doug Cook commented on NUTCH-25:
Cool -- will take a look at the new patch (and will try to make stripGarbage
more
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515342
]
Doug Cook commented on NUTCH-25:
Doğacan,
Thanks for the quick feedback.
* EncodingDetector api is way too open.
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515461
]
Doug Cook commented on NUTCH-25:
Can you provide a link on icu4j's language detection?
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515026
]
Doug Cook commented on NUTCH-25:
OK, I've got more data, and a proposed solution.
I created a test set with a number
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-25:
---
Attachment: EncodingDetector.java
patch
needs 'character encoding' detector
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-25:
---
Attachment: (was: EncodingDetector.java)
needs 'character encoding' detector
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-25:
---
Attachment: EncodingDetector.java
I cleaned up EncodingDetector a little; here's a functionally identical, but
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514426
]
Doug Cook commented on NUTCH-25:
Not sure where this belongs architecturally and aesthetically -- will think
about
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514438
]
Doug Cook commented on NUTCH-25:
As far as the problem cases, I'm running a test now on my test DB (the ~60K doc
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514375
]
Doug Cook commented on NUTCH-25:
Hi, Doğacan.
My sincere apologies for the slow response, especially given the
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514377
]
Doug Cook commented on NUTCH-25:
I should also add that a significant number of the URLs seem to have been fixed
by
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514382
]
Doug Cook commented on NUTCH-25:
Oops, spoke too soon. On running a more extensive test, I saw quite a few
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498041
]
Doug Cook commented on NUTCH-25:
Thanks! I'll take a look at your proposed patch... (that was fast! ask and ye
shall
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497507
]
Doug Cook commented on NUTCH-25:
We might want to think about raising the priority of this. I've seen encoding
[
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466284
]
Doug Cook commented on NUTCH-353:
-
I have a local fix for this problem (partly Paul Gauthier's work, partly mine)
[
http://issues.apache.org/jira/browse/NUTCH-416?page=comments#action_12460080 ]
Doug Cook commented on NUTCH-416:
-
You may also want to make the status codes ORed values, so that, for example,
all of the various kinds of failure all have a
Faster RegexNormalize with more features
Key: NUTCH-410
URL: http://issues.apache.org/jira/browse/NUTCH-410
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions:
[ http://issues.apache.org/jira/browse/NUTCH-409?page=all ]
Doug Cook updated NUTCH-409:
Attachment: shortcircuit.patch
Add short circuit notion to filters to speedup mixed site/subsite crawling
[
http://issues.apache.org/jira/browse/NUTCH-409?page=comments#action_12452617 ]
Doug Cook commented on NUTCH-409:
-
I should also note that this approach is still not optimal (though it is faster
for my usage pattern). I'm still running the
mergesegs sorts URLs, making segments useless for subsequent fetch
--
Key: NUTCH-396
URL: http://issues.apache.org/jira/browse/NUTCH-396
Project: Nutch
Issue Type: Bug
[
http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439248 ]
Doug Cook commented on NUTCH-353:
-
This is definitely a complex issue. It is also high priority -- issues with
redirects and duplicates, which URL is chosen, and
[
http://issues.apache.org/jira/browse/NUTCH-364?page=comments#action_12435945 ]
Doug Cook commented on NUTCH-364:
-
I've been looking into this a little bit. I see two problems:
(1) The current two pass heuristic URL-like string extractor has
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435449 ]
Doug Cook commented on NUTCH-365:
-
It still seems to me that iterative normalization is useful and not risky. By
definition, a normalizer is something which
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433613 ]
Doug Cook commented on NUTCH-365:
-
Hi, Andrzej.
Sounds very cool. Haven't had a chance to check out the patch yet to see if it
supports this, but attaching a
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433617 ]
Doug Cook commented on NUTCH-365:
-
PS. I like your idea of combining URL filters and normalization. In a sense, a
filter is just a normalizer that happens to
Fetcher normalizes everything at least twice
Key: NUTCH-363
URL: http://issues.apache.org/jira/browse/NUTCH-363
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions:
Javascript parser creates some fairly bogus URLs
Key: NUTCH-364
URL: http://issues.apache.org/jira/browse/NUTCH-364
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8
34 matches