Fetcher normalizes everything at least twice
Key: NUTCH-363
URL: http://issues.apache.org/jira/browse/NUTCH-363
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions:
Javascript parser creates some fairly bogus URLs
Key: NUTCH-364
URL: http://issues.apache.org/jira/browse/NUTCH-364
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433613 ]
Doug Cook commented on NUTCH-365:
-
Hi, Andrzej.
Sounds very cool. Haven't had a chance to check out the patch yet to see if it
supports this, but attaching a
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433617 ]
Doug Cook commented on NUTCH-365:
-
PS. I like your idea of combining URL filters and normalization. In a sense, a
filter is just a normalizer that happens to
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435449 ]
Doug Cook commented on NUTCH-365:
-
It still seems to me that iterative normalization is useful and not risky. By
definition, a normalizer is something which
[
http://issues.apache.org/jira/browse/NUTCH-364?page=comments#action_12435945 ]
Doug Cook commented on NUTCH-364:
-
I've been looking into this a little bit. I see two problems:
(1) The current two-pass heuristic URL-like string extractor has
[
http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439248 ]
Doug Cook commented on NUTCH-353:
-
This is definitely a complex issue. It is also high priority -- issues with
redirects and duplicates, which URL is chosen, and
mergesegs sorts URLs, making segments useless for subsequent fetch
--
Key: NUTCH-396
URL: http://issues.apache.org/jira/browse/NUTCH-396
Project: Nutch
Issue Type: Bug
[ http://issues.apache.org/jira/browse/NUTCH-409?page=all ]
Doug Cook updated NUTCH-409:
Attachment: shortcircuit.patch
Add short circuit notion to filters to speedup mixed site/subsite crawling
[
http://issues.apache.org/jira/browse/NUTCH-409?page=comments#action_12452617 ]
Doug Cook commented on NUTCH-409:
-
I should also note that this approach is still not optimal (though it is faster
for my usage pattern). I'm still running the
Faster RegexNormalize with more features
Key: NUTCH-410
URL: http://issues.apache.org/jira/browse/NUTCH-410
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions:
[
http://issues.apache.org/jira/browse/NUTCH-416?page=comments#action_12460080 ]
Doug Cook commented on NUTCH-416:
-
You may also want to make the status codes ORed values, so that, for example,
all of the various kinds of failure all have a
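As an illustration of the ORed-status-code idea in the comment above, here is a minimal sketch. The class name, constant names, and bit values are all hypothetical and are not taken from Nutch; the only point is that every failure code carries one shared bit, so a single mask test recognizes any of them.

```java
public class StatusFlags {
    // Hypothetical layout: the high bits mark the category, the low bits the specific cause.
    static final int SUCCESS = 0x100;
    static final int FAILURE = 0x200;

    // Every specific failure ORs in the shared FAILURE bit.
    static final int FAILURE_TIMEOUT = FAILURE | 0x01;
    static final int FAILURE_NOT_FOUND = FAILURE | 0x02;
    static final int FAILURE_ROBOTS_DENIED = FAILURE | 0x03;

    /** Returns true for any status that has the shared FAILURE bit set. */
    static boolean isFailure(int status) {
        return (status & FAILURE) != 0;
    }

    public static void main(String[] args) {
        System.out.println(isFailure(FAILURE_TIMEOUT));
        System.out.println(isFailure(SUCCESS));
    }
}
```

With a layout like this, callers that only care whether a fetch failed do not need to enumerate the individual failure codes.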
[
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466284
]
Doug Cook commented on NUTCH-353:
-
I have a local fix for this problem (partly Paul Gauthier's work, partly mine)
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497507
]
Doug Cook commented on NUTCH-25:
We might want to think about raising the priority of this. I've seen encoding
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498041
]
Doug Cook commented on NUTCH-25:
Thanks! I'll take a look at your proposed patch... (that was fast! ask and ye
shall
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514375
]
Doug Cook commented on NUTCH-25:
Hi, Doğacan.
My sincere apologies for the slow response, especially given the
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514377
]
Doug Cook commented on NUTCH-25:
I should also add that a significant number of the URLs seem to have been fixed
by
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514382
]
Doug Cook commented on NUTCH-25:
Oops, spoke too soon. On running a more extensive test, I saw quite a few
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514426
]
Doug Cook commented on NUTCH-25:
Not sure where this belongs architecturally and aesthetically -- will think
about
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514438
]
Doug Cook commented on NUTCH-25:
As far as the problem cases, I'm running a test now on my test DB (the ~60K doc
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515026
]
Doug Cook commented on NUTCH-25:
OK, I've got more data, and a proposed solution.
I created a test set with a number
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-25:
---
Attachment: EncodingDetector.java
patch
needs 'character encoding' detector
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-25:
---
Attachment: (was: EncodingDetector.java)
needs 'character encoding' detector
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-25:
---
Attachment: EncodingDetector.java
I cleaned up EncodingDetector a little; here's a functionally identical, but
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515342
]
Doug Cook commented on NUTCH-25:
Doğacan,
Thanks for the quick feedback.
* EncodingDetector api is way too open.
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515461
]
Doug Cook commented on NUTCH-25:
Can you provide a link on icu4j's language detection?
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517066
]
Doug Cook commented on NUTCH-25:
Cool -- will take a look at the new patch (and will try to make stripGarbage
more
Sun's URL class has bug in creation of relative query URLs
--
Key: NUTCH-566
URL: https://issues.apache.org/jira/browse/NUTCH-566
Project: Nutch
Issue Type: Bug
Components:
[
https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-566:
Attachment: RelativeURL.java
Here's a static method to work around the problem.
Sun's URL class has bug
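The attached RelativeURL.java is not reproduced in this listing. The sketch below is not that file; it only illustrates one way such a static workaround could look. It assumes the problem concerns link targets consisting only of a query string (for example "?b=2"), for which java.net.URL drops the last path segment of the base. The class and method names are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RelativeQueryUrl {
    /**
     * Resolves target against base. A target that starts with '?' keeps the
     * full base path and replaces only the query; every other target is
     * delegated to java.net.URL unchanged.
     */
    static String resolve(String base, String target) {
        try {
            URL baseUrl = new URL(base);
            String t = target.trim();
            if (t.startsWith("?")) {
                // Rebuild the base without its own query or fragment.
                String path = baseUrl.getPath();
                if (path.isEmpty()) {
                    path = "/";
                }
                String prefix = baseUrl.getProtocol() + "://" + baseUrl.getAuthority() + path;
                return new URL(prefix + t).toString();
            }
            return new URL(baseUrl, t).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot resolve " + target + " against " + base, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://www.example.com/dir/page.html?a=1", "?b=2"));
    }
}
```

Handling the query-only case separately leaves all other relative links on the standard resolution path, so the special case stays narrow.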
[
https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535272
]
Doug Cook commented on NUTCH-436:
-
It looks like NUTCH-566, and associated patch, which I recently filed, is a
[
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535593
]
Doug Cook commented on NUTCH-567:
-
What a nice birthday present!
I will check out the fix and see how it works
[
https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539146
]
Doug Cook commented on NUTCH-566:
-
Hi Doğacan.
Thanks for following up. The issue has gotten a little more
[
https://issues.apache.org/jira/browse/NUTCH-419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677704#action_12677704
]
Doug Cook commented on NUTCH-419:
-
I ran into this same problem, and spent some time
[
https://issues.apache.org/jira/browse/NUTCH-419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-419:
Attachment: diffs
Here's a context diff. Hopefully this will work, am rusty at creating patches,
and did
34 matches