[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599210#comment-14599210 ] Sebastian Nagel commented on NUTCH-2038: Yes, it's possible to implement it in

[jira] [Created] (NUTCH-2047) Improvements to the relevance scoring plugin

2015-06-24 Thread Sujen Shah (JIRA)
Sujen Shah created NUTCH-2047: - Summary: Improvements to the relevance scoring plugin Key: NUTCH-2047 URL: https://issues.apache.org/jira/browse/NUTCH-2047 Project: Nutch Issue Type: Improvement

RE: Github Spam

2015-06-24 Thread Markus Jelsma
I am sorry, by getting rid I meant moving git requests to a separate list. But because both are accepted, this is probably not going to happen. Due to the flood of mail, I normally ignore git mail completely, but not Jira updates. If Lewis' mail client is friendly, he can filter git mail to a

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600269#comment-14600269 ] Sebastian Nagel commented on NUTCH-2038: Hi [~asitang], the latest pull request

[jira] [Commented] (NUTCH-1692) SegmentReader broken in distributed mode

2015-06-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600290#comment-14600290 ] Sebastian Nagel commented on NUTCH-1692: +1 SegmentReader broken in distributed

[jira] [Commented] (NUTCH-1625) IndexerMapReduce skips FETCH_NOTMODIFIED

2015-06-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600334#comment-14600334 ] Sebastian Nagel commented on NUTCH-1625: Is this really only legacy code and

[jira] [Updated] (NUTCH-2047) Improvements to the relevance scoring plugin

2015-06-24 Thread Sujen Shah (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sujen Shah updated NUTCH-2047: -- Attachment: part-0 This file is a dump of the top 1000 URLs. The model file contained information

[jira] [Comment Edited] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-24 Thread Asitang Mishra (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600419#comment-14600419 ] Asitang Mishra edited comment on NUTCH-2038 at 6/25/15 12:19 AM:

[IMPORTANT] Migration Towards Hadoop 2.X -- 3.X

2015-06-24 Thread Lewis John Mcgibbney
Hi Folks, Before too long Hadoop will be up at 3.X for stable official releases. I wanted to solicit the dev@ community to see what difficulties, if any, people have had running Nutch trunk on Hadoop 2.X. Hadoop 2.X is supported on Nutch 2.X but getting the patches all correct is literally a

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-24 Thread Asitang Mishra (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600419#comment-14600419 ] Asitang Mishra commented on NUTCH-2038: --- maybe rename the plugin to

[jira] [Comment Edited] (NUTCH-2047) Improvements to the relevance scoring plugin

2015-06-24 Thread Sujen Shah (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600327#comment-14600327 ] Sujen Shah edited comment on NUTCH-2047 at 6/24/15 11:09 PM: -

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread asitang
GitHub user asitang opened a pull request: https://github.com/apache/nutch/pull/35 NUTCH-2038 You can merge this pull request into a Git repository by running: $ git pull https://github.com/asitang/nutch NUTCH-2038 Alternatively you can review and apply these changes as the

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread asitang
Github user asitang closed the pull request at: https://github.com/apache/nutch/pull/34 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599625#comment-14599625 ] ASF GitHub Bot commented on NUTCH-2038: --- GitHub user asitang opened a pull request:

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165265 --- Diff: conf/nutch-default.xml --- @@ -1208,6 +1208,28 @@ </property> <property> + <name>htmlparsefilter.naivebayes.trainfile</name> +

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599644#comment-14599644 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user chrismattmann commented on a

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165338 --- Diff: ivy/ivy.xml --- @@ -78,7 +78,11 @@ <dependency org="org.apache.cxf" name="cxf-rt-transports-http-jetty" rev="3.0.4"/>

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599645#comment-14599645 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user chrismattmann commented on a

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165299 --- Diff: conf/nutch-default.xml --- @@ -1258,6 +1280,7 @@ <!-- urlfilter plugin properties --> + --- End diff --

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599646#comment-14599646 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user chrismattmann commented on a

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599647#comment-14599647 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user chrismattmann commented on a

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599655#comment-14599655 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user chrismattmann commented on a

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread Asitang Mishra (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599656#comment-14599656 ] Asitang Mishra commented on NUTCH-2038: --- I still have to transfer the external

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165528 --- Diff: src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java --- @@ -0,0 +1,214

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165623 --- Diff: src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java --- @@ -0,0 +1,214

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165500 --- Diff: src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java --- @@ -0,0 +1,214

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599657#comment-14599657 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user chrismattmann commented on a

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165581 --- Diff: src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java --- @@ -0,0 +1,214

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599659#comment-14599659 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user chrismattmann commented on a

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599660#comment-14599660 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user chrismattmann commented on a

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165638 --- Diff: src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java --- @@ -0,0 +1,214

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599661#comment-14599661 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user chrismattmann commented on a

[jira] [Created] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Luis Lopez (JIRA)
Luis Lopez created NUTCH-2046: - Summary: The crawl script should be able to skip an initial injection. Key: NUTCH-2046 URL: https://issues.apache.org/jira/browse/NUTCH-2046 Project: Nutch Issue

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165388 --- Diff: ivy/ivy.xml --- @@ -78,7 +78,11 @@ <dependency org="org.apache.cxf" name="cxf-rt-transports-http-jetty" rev="3.0.4"/>

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599648#comment-14599648 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user chrismattmann commented on a

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165405 --- Diff: ivy/ivy.xml --- @@ -100,6 +104,8 @@ <exclude module="jmxtools" /> <exclude module="jms" /> <exclude

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599610#comment-14599610 ] Sebastian Nagel commented on NUTCH-2038: Jaccard similarity sounds more like a

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599622#comment-14599622 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user asitang closed the pull request

[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599840#comment-14599840 ] Julien Nioche commented on NUTCH-2046: -- re-script : what about a positive parameter

[jira] [Updated] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Luis Lopez (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis Lopez updated NUTCH-2046: -- Attachment: crawl.patch The crawl script skips the initial injection if we use -skipInject instead of

[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Luis Lopez (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599933#comment-14599933 ] Luis Lopez commented on NUTCH-2046: --- I used just -skipInject instead of the actual path

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599748#comment-14599748 ] Chris A. Mattmann commented on NUTCH-2038: -- yeah you got it Seb, we can do

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread asitang
GitHub user asitang opened a pull request: https://github.com/apache/nutch/pull/36 NUTCH-2038 Made aesthetic changes suggested by Chris Mattmann. Removed dependencies from the main ivy.xml and added it to plugin's ivy.xml. You can merge this pull request into a Git repository by

[jira] [Updated] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-24 Thread Asitang Mishra (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asitang Mishra updated NUTCH-2038: -- Description: A html parse filter that will filter out the outlinks in two stages. One:

[jira] [Updated] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-24 Thread Asitang Mishra (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asitang Mishra updated NUTCH-2038: -- Description: A html parse filter that will filter out the outlinks in two stages. Classify the

[jira] [Updated] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2046: Fix Version/s: 1.11 The crawl script should be able to skip an initial injection.

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread asitang
Github user asitang closed the pull request at: https://github.com/apache/nutch/pull/35 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599795#comment-14599795 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user asitang closed the pull request

[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599750#comment-14599750 ] Lewis John McGibbney commented on NUTCH-2046: - Hi [~betolink], this is a nice

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599798#comment-14599798 ] ASF GitHub Bot commented on NUTCH-2038: --- GitHub user asitang opened a pull request:

Github Spam

2015-06-24 Thread Lewis John Mcgibbney
Hi Folks, The Github spam is killing me. Seems to go to - nu...@noreply.github.com Basically every commit someone pushes (there have been loads recently) is sending me a new email over and above the digest emails I get. I am sure this must be pissing other people off. Is there a better way for us

RE: Github Spam

2015-06-24 Thread Markus Jelsma
Well, either disable it or have people send less requests. On the other hand, adding patches and Jira comments also gets you e-mail. -Original message- From: Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> Sent: Wednesday 24th June 2015 21:47 To: dev@nutch.apache.org Subject: Github Spam

RE: Github Spam

2015-06-24 Thread Mattmann, Chris A (3980)
Hey Lewis, Yeah to be honest, this is no different than ReviewBoard, JIRA, etc. At least it's not as bad as Spark :/ I did a review of Asitang's patch and it took each one of my comments and sent a mail. B/c of Apache's requirement that things happen on the list, we have to have the mails replicated

[jira] [Commented] (NUTCH-1504) Pluggable url partitioner

2015-06-24 Thread Michael Joyce (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599958#comment-14599958 ] Michael Joyce commented on NUTCH-1504: -- This is great stuff [~lewismc], we definitely

RE: Github Spam

2015-06-24 Thread Markus Jelsma
I am fine with getting rid of Github e-mail, not Jira, Jenkins or other ASF infra stuff. The git requests are not in our svn format anyway. If someone is serious about their patch and wants it in the regular releases, then please be so polite to not make it a bit harder for us ;) -Original

RE: Github Spam

2015-06-24 Thread Mattmann, Chris A (3980)
Sorry I wasn't clear. I'm *not* fine with getting rid of Github. I was simply proposing for the mail spam to be moved to a different list. But, to me JIRA/SVN, is no different than Github comments and pull requests and so forth. To each their own :) The ASF fully supports Git and Github integration