[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599210#comment-14599210
]
Sebastian Nagel commented on NUTCH-2038:
Yes, it's possible to implement it in
Sujen Shah created NUTCH-2047:
-
Summary: Improvements to the relevance scoring plugin
Key: NUTCH-2047
URL: https://issues.apache.org/jira/browse/NUTCH-2047
Project: Nutch
Issue Type: Improvement
I am sorry, by getting rid I meant moving git requests to a separate list. But
because both are accepted, this is probably not going to happen. Due to the
flood of mail, I normally ignore git mail completely, but not Jira updates.
If Lewis' mail client is friendly, he can filter git mail to a
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600269#comment-14600269
]
Sebastian Nagel commented on NUTCH-2038:
Hi [~asitang], the latest pull request
[
https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600290#comment-14600290
]
Sebastian Nagel commented on NUTCH-1692:
+1
SegmentReader broken in distributed
[
https://issues.apache.org/jira/browse/NUTCH-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600334#comment-14600334
]
Sebastian Nagel commented on NUTCH-1625:
Is this really only legacy code and
[
https://issues.apache.org/jira/browse/NUTCH-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sujen Shah updated NUTCH-2047:
--
Attachment: part-0
This file is a dump of the top 1000 URLs.
The model file contained information
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600419#comment-14600419
]
Asitang Mishra edited comment on NUTCH-2038 at 6/25/15 12:19 AM:
Hi Folks,
Before too long Hadoop will be at 3.X for stable official releases.
I wanted to solicit the dev@ community to see what difficulties, if any,
people have had running Nutch trunk on Hadoop 2.X.
Hadoop 2.X is supported on Nutch 2.X but getting the patches all correct is
literally a
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600419#comment-14600419
]
Asitang Mishra commented on NUTCH-2038:
---
maybe rename the plugin to
[
https://issues.apache.org/jira/browse/NUTCH-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600327#comment-14600327
]
Sujen Shah edited comment on NUTCH-2047 at 6/24/15 11:09 PM:
-
GitHub user asitang opened a pull request:
https://github.com/apache/nutch/pull/35
NUTCH-2038
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/asitang/nutch NUTCH-2038
Alternatively you can review and apply these changes as the
Github user asitang closed the pull request at:
https://github.com/apache/nutch/pull/34
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599625#comment-14599625
]
ASF GitHub Bot commented on NUTCH-2038:
---
GitHub user asitang opened a pull request:
Github user chrismattmann commented on a diff in the pull request:
https://github.com/apache/nutch/pull/35#discussion_r33165265
--- Diff: conf/nutch-default.xml ---
@@ -1208,6 +1208,28 @@
</property>
<property>
+ <name>htmlparsefilter.naivebayes.trainfile</name>
+
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599644#comment-14599644
]
ASF GitHub Bot commented on NUTCH-2038:
---
Github user chrismattmann commented on a
Github user chrismattmann commented on a diff in the pull request:
https://github.com/apache/nutch/pull/35#discussion_r33165338
--- Diff: ivy/ivy.xml ---
@@ -78,7 +78,11 @@
<dependency org="org.apache.cxf"
name="cxf-rt-transports-http-jetty" rev="3.0.4"/>
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599645#comment-14599645
]
ASF GitHub Bot commented on NUTCH-2038:
---
Github user chrismattmann commented on a
Github user chrismattmann commented on a diff in the pull request:
https://github.com/apache/nutch/pull/35#discussion_r33165299
--- Diff: conf/nutch-default.xml ---
@@ -1258,6 +1280,7 @@
<!-- urlfilter plugin properties -->
+
--- End diff --
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599646#comment-14599646
]
ASF GitHub Bot commented on NUTCH-2038:
---
Github user chrismattmann commented on a
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599647#comment-14599647
]
ASF GitHub Bot commented on NUTCH-2038:
---
Github user chrismattmann commented on a
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599655#comment-14599655
]
ASF GitHub Bot commented on NUTCH-2038:
---
Github user chrismattmann commented on a
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599656#comment-14599656
]
Asitang Mishra commented on NUTCH-2038:
---
I still have to transfer the external
Github user chrismattmann commented on a diff in the pull request:
https://github.com/apache/nutch/pull/35#discussion_r33165528
--- Diff:
src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java
---
@@ -0,0 +1,214
Github user chrismattmann commented on a diff in the pull request:
https://github.com/apache/nutch/pull/35#discussion_r33165623
--- Diff:
src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java
---
@@ -0,0 +1,214
Github user chrismattmann commented on a diff in the pull request:
https://github.com/apache/nutch/pull/35#discussion_r33165500
--- Diff:
src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java
---
@@ -0,0 +1,214
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599657#comment-14599657
]
ASF GitHub Bot commented on NUTCH-2038:
---
Github user chrismattmann commented on a
Github user chrismattmann commented on a diff in the pull request:
https://github.com/apache/nutch/pull/35#discussion_r33165581
--- Diff:
src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java
---
@@ -0,0 +1,214
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599659#comment-14599659
]
ASF GitHub Bot commented on NUTCH-2038:
---
Github user chrismattmann commented on a
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599660#comment-14599660
]
ASF GitHub Bot commented on NUTCH-2038:
---
Github user chrismattmann commented on a
Github user chrismattmann commented on a diff in the pull request:
https://github.com/apache/nutch/pull/35#discussion_r33165638
--- Diff:
src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java
---
@@ -0,0 +1,214
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599661#comment-14599661
]
ASF GitHub Bot commented on NUTCH-2038:
---
Github user chrismattmann commented on a
Luis Lopez created NUTCH-2046:
-
Summary: The crawl script should be able to skip an initial
injection.
Key: NUTCH-2046
URL: https://issues.apache.org/jira/browse/NUTCH-2046
Project: Nutch
Issue
Github user chrismattmann commented on a diff in the pull request:
https://github.com/apache/nutch/pull/35#discussion_r33165388
--- Diff: ivy/ivy.xml ---
@@ -78,7 +78,11 @@
<dependency org="org.apache.cxf"
name="cxf-rt-transports-http-jetty" rev="3.0.4"/>
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599648#comment-14599648
]
ASF GitHub Bot commented on NUTCH-2038:
---
Github user chrismattmann commented on a
Github user chrismattmann commented on a diff in the pull request:
https://github.com/apache/nutch/pull/35#discussion_r33165405
--- Diff: ivy/ivy.xml ---
@@ -100,6 +104,8 @@
<exclude module="jmxtools" />
<exclude module="jms" />
exclude
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599610#comment-14599610
]
Sebastian Nagel commented on NUTCH-2038:
Jaccard similarity sounds more like a
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599622#comment-14599622
]
ASF GitHub Bot commented on NUTCH-2038:
---
Github user asitang closed the pull request
[
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599840#comment-14599840
]
Julien Nioche commented on NUTCH-2046:
--
re-script : what about a positive parameter
[
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Luis Lopez updated NUTCH-2046:
--
Attachment: crawl.patch
The crawl script skips the initial injection if we use -skipInject instead of
[
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599933#comment-14599933
]
Luis Lopez commented on NUTCH-2046:
---
I used just -skipInject instead of the actual path
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599748#comment-14599748
]
Chris A. Mattmann commented on NUTCH-2038:
--
yeah you got it Seb, we can do
GitHub user asitang opened a pull request:
https://github.com/apache/nutch/pull/36
NUTCH-2038
Made aesthetic changes suggested by Chris Mattmann. Removed dependencies
from the main ivy.xml and added them to the plugin's ivy.xml.
You can merge this pull request into a Git repository by
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Asitang Mishra updated NUTCH-2038:
--
Description:
An HTML parse filter that will filter out the outlinks in two stages.
One:
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Asitang Mishra updated NUTCH-2038:
--
Description:
An HTML parse filter that will filter out the outlinks in two stages.
Classify the
[
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney updated NUTCH-2046:
Fix Version/s: 1.11
The crawl script should be able to skip an initial injection.
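As a rough illustration of the NUTCH-2046 idea (passing -skipInject where the seed directory would normally go), the check could look something like the sketch below. This is not taken from crawl.patch: the function names should_inject and plan_first_step and the echoed messages are hypothetical, and the real script may handle its arguments differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch only; names and messages are illustrative and are
# not taken from the attached crawl.patch.

# Returns success (0) when the first argument is a seed path to inject,
# and failure (1) when it is the literal marker -skipInject.
should_inject() {
  local seed_arg="$1"
  if [ "$seed_arg" = "-skipInject" ]; then
    return 1
  fi
  return 0
}

# Prints which first step a crawl run would take for the given argument.
plan_first_step() {
  local seed_arg="$1"
  if should_inject "$seed_arg"; then
    echo "inject $seed_arg"
  else
    echo "skip inject"
  fi
}

plan_first_step "urls/"
plan_first_step "-skipInject"
```

Reusing the seed-directory position for the marker keeps the argument count unchanged, which seems to be the approach described in the comments on the issue; a separate optional flag would be the obvious alternative.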
Github user asitang closed the pull request at:
https://github.com/apache/nutch/pull/35
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599795#comment-14599795
]
ASF GitHub Bot commented on NUTCH-2038:
---
Github user asitang closed the pull request
[
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599750#comment-14599750
]
Lewis John McGibbney commented on NUTCH-2046:
-
Hi [~betolink], this is a nice
[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599798#comment-14599798
]
ASF GitHub Bot commented on NUTCH-2038:
---
GitHub user asitang opened a pull request:
Hi Folks,
The Github spam is killing me.
Seems to go to - nu...@noreply.github.com
Basically every commit someone pushes (there have been loads recently) is
sending me a new email over and above the digest emails I get.
I am sure this must be pissing other people off. Is there a better way for
us
Well, either disable it or have people send fewer requests. On the other hand,
adding patches and Jira comments also gets you e-mail.
-Original message-
From: Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Sent: Wednesday 24th June 2015 21:47
To: dev@nutch.apache.org
Subject: Github Spam
Hey Lewis,
Yeah, to be honest, this is no different than ReviewBoard, JIRA, etc.
At least it's not as bad as Spark :/ I did a review of Asitang's patch
and it took each one of my comments and sent a mail. B/c of Apache's
requirement that things happen on the list, we have to have the mails
replicated
[
https://issues.apache.org/jira/browse/NUTCH-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599958#comment-14599958
]
Michael Joyce commented on NUTCH-1504:
--
This is great stuff [~lewismc], we definitely
I am fine with getting rid of Github e-mail, not Jira, Jenkins or other ASF
infra stuff. The git requests are not in our svn format anyway. If someone is
serious about their patch and wants it in the regular releases, then please be
so polite as not to make it a bit harder for us ;)
-Original
Sorry I wasn't clear. I'm *not* fine with getting rid of Github.
I was simply proposing for the mail spam to be moved to a different
list. But, to me JIRA/SVN, is no different than Github comments and
pull requests and so forth. To each their own :) The ASF fully supports
Git and Github integration