URL filters to produce regexes to be used by OutlinkExtractor.
--
Key: NUTCH-1060
URL: https://issues.apache.org/jira/browse/NUTCH-1060
Project: Nutch
Issue Type: New Feature
I didn't manage to get it running either. I'm also having trouble finding the test
case class.
bin/nutch junit.textui.TestRunner org.apache.nutch.parse.TestOutlinkExtractor
won't find the test class. It seems obvious, but I've no idea how to run it from
the /src/.
On Sunday 17 July 2011 15:06:26 lewis
Migrate MoreIndexingFilter from Apache ORO to java.util.regex
-
Key: NUTCH-1061
URL: https://issues.apache.org/jira/browse/NUTCH-1061
Project: Nutch
Issue Type: Improvement
[
https://issues.apache.org/jira/browse/NUTCH-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1061:
-
Attachment: NUTCH-1061-1.4-1.patch
Patch for 1.4.
Migrate MoreIndexingFilter from Apache ORO
Migrate BasicURLNormalizer from Apache ORO to java.util.regex
-
Key: NUTCH-1062
URL: https://issues.apache.org/jira/browse/NUTCH-1062
Project: Nutch
Issue Type: Improvement
[
https://issues.apache.org/jira/browse/NUTCH-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1062:
-
Description:
Issue for migration from ORO to j.u.regex. There is a small problem here. I
began
[
https://issues.apache.org/jira/browse/NUTCH-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067664#comment-13067664
]
Markus Jelsma commented on NUTCH-1050:
--
If there are no objections I'll add this
[
https://issues.apache.org/jira/browse/NUTCH-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067665#comment-13067665
]
Markus Jelsma commented on NUTCH-1057:
--
Any comments or objections? Any better
[
https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067668#comment-13067668
]
Julien Nioche commented on NUTCH-1037:
--
+1. Maybe add to the description something
[
https://issues.apache.org/jira/browse/NUTCH-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067678#comment-13067678
]
Julien Nioche commented on NUTCH-1057:
--
Apart from the part related to NUTCH-1037
[
https://issues.apache.org/jira/browse/NUTCH-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma resolved NUTCH-1050.
--
Resolution: Fixed
Committed for 1.4 in rev. 1148301.
Thanks Julien for reviewing.
Add
[
https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067684#comment-13067684
]
Julien Nioche commented on NUTCH-865:
-
That's not very complex nor huge. All it takes
[
https://issues.apache.org/jira/browse/NUTCH-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067686#comment-13067686
]
Markus Jelsma commented on NUTCH-1057:
--
No. This is a tuning option for users that
[
https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067690#comment-13067690
]
Markus Jelsma commented on NUTCH-1037:
--
Thanks. The comment has been modified.
[
https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1037:
-
Description:
Anchors are not deduplicated before indexing. This can result in a very high
[
https://issues.apache.org/jira/browse/NUTCH-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma closed NUTCH-729.
---
Resolution: Won't Fix
Closed for legacy. FieldIndexer no longer exists.
NPE in FieldIndexer when
[
https://issues.apache.org/jira/browse/NUTCH-1049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1049:
-
Fix Version/s: 2.0
1.4
Add classes to bin/nutch
[
https://issues.apache.org/jira/browse/NUTCH-771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-771:
Fix Version/s: 2.0
1.4
Assignee: Markus Jelsma (was: Dennis Kubes)
[
https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067755#comment-13067755
]
Markus Jelsma commented on NUTCH-1045:
--
Are you looking for something specific? It's
[
https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067773#comment-13067773
]
Julien Nioche commented on NUTCH-1045:
--
You should see a message in the logs at the
o.a.n.util.MimeUtil uses deprecated Tika methods
Key: NUTCH-1064
URL: https://issues.apache.org/jira/browse/NUTCH-1064
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.4
[
https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067779#comment-13067779
]
Markus Jelsma commented on NUTCH-1045:
--
Strange, the only errors I can find in the
[
https://issues.apache.org/jira/browse/NUTCH-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067791#comment-13067791
]
Markus Jelsma commented on NUTCH-1057:
--
Committed for 1.4 in rev. 1148406.
Make
[
https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067794#comment-13067794
]
Julien Nioche commented on NUTCH-1045:
--
{quote}
May be because the empty file is
It's definitely on my road map. I have been away from it for a day or
so, so I will have another pop later.
I think it would be a nice addition; however, writing the patch is becoming a
rather tricky task!
On Tue, Jul 19, 2011 at 12:18 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
I
[
https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067861#comment-13067861
]
Markus Jelsma edited comment on NUTCH-1045 at 7/19/11 5:50 PM:
Dear Wiki user,
You have subscribed to a wiki page or wiki category on Nutch Wiki for change
notification.
The PluginGotchas page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/PluginGotchas?action=diff&rev1=2&rev2=3
Total time: 0 seconds
}}}
+ The above error was
[
https://issues.apache.org/jira/browse/NUTCH-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067883#comment-13067883
]
Lewis John McGibbney commented on NUTCH-1048:
-
Thanks for this Julien
[
https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067893#comment-13067893
]
Lewis John McGibbney commented on NUTCH-865:
I'm happy to have a crack at
[
https://issues.apache.org/jira/browse/NUTCH-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche closed NUTCH-1048.
Resolution: Fixed
You are welcome. Thanks for committing the changes
Busted links on
[
https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067912#comment-13067912
]
Julien Nioche commented on NUTCH-1045:
--
Great, seems to be working fine then. Thanks
[
https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067945#comment-13067945
]
Markus Jelsma commented on NUTCH-865:
-
It would be good to finish 1.4 with clean style
[
https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067950#comment-13067950
]
Lewis John McGibbney commented on NUTCH-865:
agreed :0)
Format source code in
[
https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-865:
Fix Version/s: 1.4
Marked 1.4 to keep it on the radar.
Format source code in unique style
Dear Wiki user,
You have subscribed to a wiki page or wiki category on Nutch Wiki for change
notification.
The bin/nutch_updatedb page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_updatedb?action=diff&rev1=5&rev2=6
Usage:
{{{
- CrawlDb crawldb (-dir
[
https://issues.apache.org/jira/browse/NUTCH-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067972#comment-13067972
]
Andrzej Bialecki commented on NUTCH-1014:
--
java.util.regex has the advantage of
[
https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068149#comment-13068149
]
Hudson commented on NUTCH-1037:
---
Integrated in Nutch-trunk #1551 (See
See https://builds.apache.org/job/Nutch-trunk/1551/changes
Changes:
[markus] NUTCH-1037 Option to deduplicate anchors prior to indexing
--
[...truncated 987 lines...]
A