[jira] [Created] (NUTCH-1060) URL filters to produce regexes to be used by OutlinkExtractor.

2011-07-19 Thread Markus Jelsma (JIRA)
URL filters to produce regexes to be used by OutlinkExtractor. -- Key: NUTCH-1060 URL: https://issues.apache.org/jira/browse/NUTCH-1060 Project: Nutch Issue Type: New Feature

Re: Running individual test classes from nutch script cont'd

2011-07-19 Thread Markus Jelsma
I didn't manage to get it running either. I also have trouble finding the test case class: bin/nutch junit.textui.TestRunner org.apache.nutch.parse.TestOutlinkExtractor won't find the test class. It seems obvious, but I've no idea how to run it from /src/. On Sunday 17 July 2011 15:06:26 lewis

[jira] [Created] (NUTCH-1061) Migrate MoreIndexingFilter from Apache ORO to java.util.regex

2011-07-19 Thread Markus Jelsma (JIRA)
Migrate MoreIndexingFilter from Apache ORO to java.util.regex - Key: NUTCH-1061 URL: https://issues.apache.org/jira/browse/NUTCH-1061 Project: Nutch Issue Type: Improvement

[jira] [Updated] (NUTCH-1061) Migrate MoreIndexingFilter from Apache ORO to java.util.regex

2011-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1061: - Attachment: NUTCH-1061-1.4-1.patch Patch for 1.4. Migrate MoreIndexingFilter from Apache ORO

[jira] [Created] (NUTCH-1062) Migrate BasicURLNormalizer from Apache ORO to java.util.regex

2011-07-19 Thread Markus Jelsma (JIRA)
Migrate BasicURLNormalizer from Apache ORO to java.util.regex - Key: NUTCH-1062 URL: https://issues.apache.org/jira/browse/NUTCH-1062 Project: Nutch Issue Type: Improvement

[jira] [Updated] (NUTCH-1062) Migrate BasicURLNormalizer from Apache ORO to java.util.regex

2011-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1062: - Description: Issue for migration from ORO to j.u.regex. There is a small problem here. I began

[jira] [Commented] (NUTCH-1050) Add segmentDir option to WebGraph

2011-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067664#comment-13067664 ] Markus Jelsma commented on NUTCH-1050: -- If there are no objections I'll add this

[jira] [Commented] (NUTCH-1057) Make fetcher thread time out configurable

2011-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067665#comment-13067665 ] Markus Jelsma commented on NUTCH-1057: -- Any comments or objections? Any better

[jira] [Commented] (NUTCH-1037) Deduplicate anchors before indexing

2011-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067668#comment-13067668 ] Julien Nioche commented on NUTCH-1037: -- +1. Maybe add to the description something

[jira] [Commented] (NUTCH-1057) Make fetcher thread time out configurable

2011-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067678#comment-13067678 ] Julien Nioche commented on NUTCH-1057: -- Apart from the part related to NUTCH-1037

[jira] [Resolved] (NUTCH-1050) Add segmentDir option to WebGraph

2011-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1050. -- Resolution: Fixed Committed for 1.4 in rev. 1148301. Thanks Julien for reviewing. Add

[jira] [Commented] (NUTCH-865) Format source code in unique style

2011-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067684#comment-13067684 ] Julien Nioche commented on NUTCH-865: - That's not very complex nor huge. All it takes

[jira] [Commented] (NUTCH-1057) Make fetcher thread time out configurable

2011-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067686#comment-13067686 ] Markus Jelsma commented on NUTCH-1057: -- No. This is a tuning option for users that

[jira] [Commented] (NUTCH-1037) Deduplicate anchors before indexing

2011-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067690#comment-13067690 ] Markus Jelsma commented on NUTCH-1037: -- Thanks. The comment has been modified.

[jira] [Updated] (NUTCH-1037) Deduplicate anchors before indexing

2011-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1037: - Description: Anchors are not deduplicated before indexing. This can result in a very high

[jira] [Closed] (NUTCH-729) NPE in FieldIndexer when BasicFields url doesn't exist

2011-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-729. --- Resolution: Won't Fix Closed for legacy. FieldIndexer no longer exists. NPE in FieldIndexer when

[jira] [Updated] (NUTCH-1049) Add classes to bin/nutch

2011-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1049: - Fix Version/s: 2.0 1.4 Add classes to bin/nutch

[jira] [Updated] (NUTCH-771) Add WebGraph classes to the bin/nutch script

2011-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-771: Fix Version/s: 2.0 1.4 Assignee: Markus Jelsma (was: Dennis Kubes)

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067755#comment-13067755 ] Markus Jelsma commented on NUTCH-1045: -- Are you looking for something specific? It's

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067773#comment-13067773 ] Julien Nioche commented on NUTCH-1045: -- You should see a message in the logs at the

[jira] [Created] (NUTCH-1064) o.a.n.util.MimeUtil uses deprecated Tika methods

2011-07-19 Thread Julien Nioche (JIRA)
o.a.n.util.MimeUtil uses deprecated Tika methods Key: NUTCH-1064 URL: https://issues.apache.org/jira/browse/NUTCH-1064 Project: Nutch Issue Type: Improvement Affects Versions: 1.4

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067779#comment-13067779 ] Markus Jelsma commented on NUTCH-1045: -- Strange, the only errors I can find in the

[jira] [Commented] (NUTCH-1057) Make fetcher thread time out configurable

2011-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067791#comment-13067791 ] Markus Jelsma commented on NUTCH-1057: -- Committed for 1.4 in rev. 1148406. Make

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067794#comment-13067794 ] Julien Nioche commented on NUTCH-1045: -- {quote} May be because the empty file is

Re: Running individual test classes from nutch script cont'd

2011-07-19 Thread lewis john mcgibbney
It's definitely on my road map. I have been away from it for a day or so, so I will have another pop later. I think it would be a nice addition; however, writing the patch is becoming a rather tricky task! On Tue, Jul 19, 2011 at 12:18 PM, Markus Jelsma markus.jel...@openindex.io wrote: I

[jira] [Issue Comment Edited] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067861#comment-13067861 ] Markus Jelsma edited comment on NUTCH-1045 at 7/19/11 5:50 PM:

[Nutch Wiki] Trivial Update of PluginGotchas by LewisJohnMcgibbney

2011-07-19 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The PluginGotchas page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/PluginGotchas?action=diff&rev1=2&rev2=3 Total time: 0 seconds }}} + The above error was

[jira] [Commented] (NUTCH-1048) Busted links on http://nutch.apache.org/mailing_lists.html

2011-07-19 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067883#comment-13067883 ] Lewis John McGibbney commented on NUTCH-1048: - Thanks for this Julien

[jira] [Commented] (NUTCH-865) Format source code in unique style

2011-07-19 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067893#comment-13067893 ] Lewis John McGibbney commented on NUTCH-865: I'm happy to have a crack at

[jira] [Closed] (NUTCH-1048) Busted links on http://nutch.apache.org/mailing_lists.html

2011-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-1048. Resolution: Fixed You are welcome. Thanks for committing the changes Busted links on

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067912#comment-13067912 ] Julien Nioche commented on NUTCH-1045: -- Great, seems to be working fine then. Thanks

[jira] [Commented] (NUTCH-865) Format source code in unique style

2011-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067945#comment-13067945 ] Markus Jelsma commented on NUTCH-865: - It would be good to finish 1.4 with clean style

[jira] [Commented] (NUTCH-865) Format source code in unique style

2011-07-19 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067950#comment-13067950 ] Lewis John McGibbney commented on NUTCH-865: agreed :0) Format source code in

[jira] [Updated] (NUTCH-865) Format source code in unique style

2011-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-865: Fix Version/s: 1.4 Marked 1.4 to keep it on the radar. Format source code in unique style

[Nutch Wiki] Trivial Update of bin/nutch_updatedb by LewisJohnMcgibbney

2011-07-19 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The bin/nutch_updatedb page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/bin/nutch_updatedb?action=diff&rev1=5&rev2=6 Usage: {{{ - CrawlDb crawldb (-dir

[jira] [Commented] (NUTCH-1014) Migrate from Apache ORO to java.util.regex

2011-07-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067972#comment-13067972 ] Andrzej Bialecki commented on NUTCH-1014: -- java.util.regex has the advantage of

[jira] [Commented] (NUTCH-1037) Deduplicate anchors before indexing

2011-07-19 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068149#comment-13068149 ] Hudson commented on NUTCH-1037: --- Integrated in Nutch-trunk #1551 (See

Build failed in Jenkins: Nutch-trunk #1551

2011-07-19 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/1551/changes Changes: [markus] NUTCH-1037 Option to deduplicate anchors prior to indexing -- [...truncated 987 lines...] A