[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599210#comment-14599210 ] Sebastian Nagel commented on NUTCH-2038: Yes, it's possible to implement it in HtmlParseFilter.filter() - content and outlinks are passed to this method. But keep in mind that outlinks are not yet normalized and filtered. In ScoringFilter.distributeScoreToOutlink() they are. Naive Bayes classifier based url filter --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra Assignee: Chris A. Mattmann Labels: memex, nutch Fix For: 1.11 A url filter that will filter out the urls (after the parsing stage, will keep only those urls that contain some hot words provided again in a list.) from that pages that are classified irrelevant by the classifier. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2047) Improvements to the relevance scoring plugin
Sujen Shah created NUTCH-2047:

Summary: Improvements to the relevance scoring plugin
Key: NUTCH-2047
URL: https://issues.apache.org/jira/browse/NUTCH-2047
Project: Nutch
Issue Type: Improvement
Components: scoring
Reporter: Sujen Shah
Fix For: 1.11

To discuss the results and improvements on the scoring-similarity plugin using the cosine similarity model. Currently, each outlink is given the same score as its parent URL. This means an irrelevant URL (with a relevant parent) is fetched for one more round before it gets a lower score and is filtered. So we require one additional fetch/parse to score these irrelevant URLs (from relevant parents) lower. Any suggestions on this are appreciated.
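For readers unfamiliar with the cosine similarity model named in the issue, a minimal sketch follows. It is not the scoring-similarity plugin's code: the class name `CosineSketch`, the whitespace tokenizer, and the raw term-frequency vectors are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: cosine similarity between two texts
// using raw term frequencies and a naive whitespace tokenizer.
final class CosineSketch {

  // Build a term -> count map from lower-cased, whitespace-separated tokens.
  static Map<String, Integer> termFrequencies(String text) {
    Map<String, Integer> tf = new HashMap<>();
    for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
      if (!token.isEmpty()) {
        tf.merge(token, 1, Integer::sum);
      }
    }
    return tf;
  }

  // cos(a, b) = (a . b) / (|a| * |b|); returns 0 when either vector is empty.
  static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0.0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      Integer other = b.get(e.getKey());
      if (other != null) {
        dot += (double) e.getValue() * other;
      }
    }
    double normA = 0.0;
    for (int v : a.values()) {
      normA += (double) v * v;
    }
    double normB = 0.0;
    for (int v : b.values()) {
      normB += (double) v * v;
    }
    if (normA == 0.0 || normB == 0.0) {
      return 0.0;
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}
```

A score of this kind can only be computed once a page's own text is available, which is consistent with the issue's observation that an outlink carries its parent's score until it has been fetched and parsed itself.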
RE: Github Spam
I am sorry, by getting rid I meant moving git requests to a separate list. But because both are accepted, this is probably not going to happen. Due to the flood of mail, I normally ignore git mail completely, but not Jira updates. If Lewis' mail client is friendly, he can filter git mail to a separate inbox. My filters never seem to work :(

-Original message-
From: Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov
Sent: Wednesday 24th June 2015 22:10
To: dev@nutch.apache.org
Subject: RE: Github Spam

Sorry I wasn't clear. I'm *not* fine with getting rid of Github. I was simply proposing for the mail spam to be moved to a different list. But, to me JIRA/SVN is no different than Github comments and pull requests and so forth. To each their own :)

The ASF fully supports Git and Github integration though, and in a very nice way which works with SVN, so just to be clear I am in no way proposing that we move to Git, etc., but I'm also not proposing that we don't accept Git pull requests. I was just trying to help Lewis with his mail issues.

Cheers,
Chris

From: Markus Jelsma [markus.jel...@openindex.io]
Sent: Wednesday, June 24, 2015 1:07 PM
To: dev@nutch.apache.org
Subject: RE: Github Spam

I am fine with getting rid of Github e-mail, not Jira, Jenkins or other ASF infra stuff. The git requests are not in our svn format anyway. If someone is serious about their patch and wants it in the regular releases, then please be so polite as to not make it a bit harder for us ;)

-Original message-
From: Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov
Sent: Wednesday 24th June 2015 21:54
To: dev@nutch.apache.org
Subject: RE: Github Spam

Hey Lewis,

Yeah to be honest, this is no different than ReviewBoard, JIRA, etc. At least it's not as bad as Spark :/ I did a review of Asitang's patch and it took each one of my comments and sent a mail. B/c of Apache's requirement that things happen on the list, we have to have the mails replicated from Github on all interactions.

The thing is though, maybe we should create a nutch-github@ email address, and then send mails there? Would that help? Or nutch-notifications@a.o ? Then JIRA, Github, etc., could go there? Others would have to be in support of this too. I'm +0 on either. You know all my email problems so this is just noise really lol in a sea of other noise.

Cheers,
Chris

From: Markus Jelsma [markus.jel...@openindex.io]
Sent: Wednesday, June 24, 2015 12:49 PM
To: dev@nutch.apache.org
Subject: RE: Github Spam

Well, either disable it or have people send fewer requests. On the other hand, adding patches and Jira comments also gets you e-mail.

-Original message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Sent: Wednesday 24th June 2015 21:47
To: dev@nutch.apache.org
Subject: Github Spam

Hi Folks,

The Github spam is killing me. Seems to go to nu...@noreply.github.com. Basically every commit someone pushes (there have been loads recently) is sending me a new email over and above the digest emails I get. I am sure this must be pissing other people off. Is there a better way for us to work this mail?

Thanks
Lewis

--
Lewis
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600269#comment-14600269 ]

Sebastian Nagel commented on NUTCH-2038:

Hi [~asitang], the latest pull request #36 looks good.
- maybe rename the plugin to parsefilter-naivebayes for simplicity and in advance of NUTCH-1482
- is this statement still true?
bq. CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this classifier.
- afaics, the way the model is generated, stored and loaded needs a review:
-* it should be read/generated once and then cached in memory,
-* writing the model to disk is likely to become painful in distributed mode with concurrent tasks.
- cosmetics:
-* exceptions should be logged via LOG.error(StringUtils.stringifyException(e)) so that they do not get lost somewhere in stdout/stderr, as happens with e.printStackTrace()
-* code formatting, see [[1|http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Step_Three:_Using_the_JIRA_and_Developing]]

Naive Bayes classifier based html Parse filter (for filtering outlinks)

Key: NUTCH-2038
URL: https://issues.apache.org/jira/browse/NUTCH-2038
Project: Nutch
Issue Type: New Feature
Components: fetcher, injector, parser
Reporter: Asitang Mishra
Assignee: Chris A. Mattmann
Labels: memex, nutch
Fix For: 1.11

An HTML parse filter that will filter out the outlinks in two stages. Classify the parse text and decide if the parent page is relevant. If relevant, then don't filter the outlinks. If irrelevant, then go through each outlink and see if the URL contains any of the important words from a list. If it does, then let it pass.
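The "read/generated once and then cached in memory" point from the review can be sketched as follows. This is only an illustration of the caching idea, not code from the pull request: `ModelCache` and its `loader` argument are hypothetical names, and the sketch says nothing about how the model would be shared across tasks in distributed mode.

```java
import java.util.function.Supplier;

// Hypothetical sketch: load (or train) a model at most once per instance and
// hand out the cached object on every later call.
final class ModelCache<M> {
  private final Supplier<M> loader; // assumed to read or train the model
  private volatile M model;         // stays null until the first get()

  ModelCache(Supplier<M> loader) {
    this.loader = loader;
  }

  // Double-checked locking: only the first caller pays the loading cost,
  // concurrent callers wait and then share the same instance.
  M get() {
    M local = model;
    if (local == null) {
      synchronized (this) {
        local = model;
        if (local == null) {
          local = loader.get();
          model = local;
        }
      }
    }
    return local;
  }
}
```

With something like this, a classify() call would ask the cache for the model instead of re-reading the model file each time. A loader that fails would presumably report the exception through the logger, as the cosmetics item suggests, rather than through e.printStackTrace().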
[jira] [Commented] (NUTCH-1692) SegmentReader broken in distributed mode
[ https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600290#comment-14600290 ]

Sebastian Nagel commented on NUTCH-1692:

+1

SegmentReader broken in distributed mode

Key: NUTCH-1692
URL: https://issues.apache.org/jira/browse/NUTCH-1692
Project: Nutch
Issue Type: Bug
Affects Versions: 1.8
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.11
Attachments: 20140126210858.tgz, NUTCH-1692-trunk.patch, NUTCH-1692.patch

SegmentReader -list option ignores the -no* options, causing the following exception in distributed mode:
{code}
Exception in thread "main" java.lang.NullPointerException
    at java.util.ComparableTimSort.sort(ComparableTimSort.java:146)
    at java.util.Arrays.sort(Arrays.java:472)
    at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:85)
    at org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:463)
    at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441)
    at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:587)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
{code}
[jira] [Commented] (NUTCH-1625) IndexerMapReduce skips FETCH_NOTMODIFIED
[ https://issues.apache.org/jira/browse/NUTCH-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600334#comment-14600334 ]

Sebastian Nagel commented on NUTCH-1625:

Is this really only legacy code, and what's the exact problem? If a segment contains a document with a fetch datum of status FETCH_NOTMODIFIED:
- the document should already be indexed from a prior segment where it was actually fetched
- there is definitely no content for this document in this segment because the server responded with 304.
Because documents with empty content are already skipped earlier, a test for FETCH_NOTMODIFIED has no effect at all, afaics. Because the check for DB_NOTMODIFIED (if property indexer.skip.notmodified == false) also comes after, it only affects docs which are fetched (FETCH_SUCCESS) and then recognized as not modified by signature comparison.

IndexerMapReduce skips FETCH_NOTMODIFIED

Key: NUTCH-1625
URL: https://issues.apache.org/jira/browse/NUTCH-1625
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Critical
Fix For: 1.11
Attachments: NUTCH-1625.patch, NUTCH-1625.patch

IndexerMapReduce has the option to skip DB_NOTMODIFIED but legacy code also skips FETCH_NOTMODIFIED and the latter is not optional. We can keep the check but that should also include FETCH_NOTMODIFIED. Relying on FETCH_NOTMODIFIED isn't very useful anyway because since 1.5 or so we can safely rely on DB_NOTMODIFIED as it is properly set in the CrawlDBReducer.
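The ordering argument in the comment can be modelled with a small sketch. It is a deliberately simplified, hypothetical model and not the IndexerMapReduce code: the `FetchStatus` enum, the `decide` method, and the returned strings are invented for illustration, and the DB_NOTMODIFIED check is left out.

```java
// Hypothetical, simplified model of the check order described in the comment.
final class NotModifiedSketch {
  enum FetchStatus { FETCH_SUCCESS, FETCH_NOTMODIFIED }

  // Returns the reason a document is skipped, or "index" if it would be indexed.
  static String decide(FetchStatus status, String content) {
    // Step 1: documents without content are skipped first.
    if (content == null || content.isEmpty()) {
      return "skip: no content";
    }
    // Step 2: a later status test. A 304 response carries no content, so a
    // FETCH_NOTMODIFIED document has already been skipped in step 1.
    if (status == FetchStatus.FETCH_NOTMODIFIED) {
      return "skip: FETCH_NOTMODIFIED";
    }
    return "index";
  }
}
```

Under the assumption that a not-modified fetch never has content in the segment, the second branch is never the one that rejects a document, which is the sense in which the test "has no effect".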
[jira] [Updated] (NUTCH-2047) Improvements to the relevance scoring plugin
[ https://issues.apache.org/jira/browse/NUTCH-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sujen Shah updated NUTCH-2047: -- Attachment: part-0 This file is a dump of the top 1000 URLs. The model file contained information related to robotics from a wikipedia article. And the seed list was CMU's Robotics institute homepage Improvements to the relevance scoring plugin Key: NUTCH-2047 URL: https://issues.apache.org/jira/browse/NUTCH-2047 Project: Nutch Issue Type: Improvement Components: scoring Reporter: Sujen Shah Labels: memex Fix For: 1.11 Attachments: part-0 To discuss the results and improvements on the scoring-similarity plugin using the cosine similarity model. Currently, the outlinks are distributed the same score as the parent URL. Which means an irrelevant URL(with a relevant parent) would be fetched for one more round before it gets a lower score and filtered. So we would require one additional fetch/parse to score these irrelevant urls(from relevant parents) lower. Any suggestions on this are appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600419#comment-14600419 ] Asitang Mishra edited comment on NUTCH-2038 at 6/25/15 12:19 AM: - 1 maybe rename the plugin to parsefilter-naivebayes for simplicity and in advance of NUTCH-1482 Will do that 2 is this statement still true? CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this classifier. The first ever call to parse filter takes a bit more time because the training is done and model is created. So, time out should be a little more. Does not take much time after this. 3 afaics, the way the model is generated, stored and loaded needs a review: it should be read/generated once and then cached in memory, writing the model to disk is likely to become painful in distributed mode with concurrent tasks. The model is created during the parsing of the first fetched page of the very first parse job. After that it checks if the model file already present or not. The model file is being read each time the classify() function is called, will change that and store the model all the way thru for a single parse job. 4 cosmetics: exceptions are properly logged via LOG.error(StringUtils.stringifyException(e)) and do not get lost somewhere in stdout/stderr as of e.printStackTrace() code formatting, see [1] will do that was (Author: asitang): maybe rename the plugin to parsefilter-naivebayes for simplicity and in advance of NUTCH-1482 Will do that is this statement still true? CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this classifier. The first ever call to parse filter takes a bit more time because the training is done and model is created. So, time out should be a little more. Does not take much time after this. 
afaics, the way the model is generated, stored and loaded needs a review: it should be read/generated once and then cached in memory, writing the model to disk is likely to become painful in distributed mode with concurrent tasks. The model is created during the parsing of the first fetched page of the very first parse job. After that it checks if the model file already present or not. The model file is being read each time the classify() function is called, will change that and store the model all the way thru for a single parse job. cosmetics: exceptions are properly logged via LOG.error(StringUtils.stringifyException(e)) and do not get lost somewhere in stdout/stderr as of e.printStackTrace() code formatting, see [1] will do that Naive Bayes classifier based html Parse filter (for filtering outlinks) --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra Assignee: Chris A. Mattmann Labels: memex, nutch Fix For: 1.11 A html parse filter that will filter out the outlinks in two stages. Classify the parse text and decide if the parent page is relevant. If relevant then don't filter the outlinks. If irrelevant then go thru each outlink and see if the url contains any of the important words from a list. If it does then let it pass. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
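The two-stage behaviour in the issue description can be sketched as below. This is an illustration only, not the plugin's implementation: `OutlinkFilterSketch` and `filterOutlinks` are hypothetical names, outlinks are plain strings rather than Nutch Outlink objects, the classifier's verdict is passed in as a boolean, and case-insensitive substring matching is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the two-stage outlink filtering described in the issue.
final class OutlinkFilterSketch {

  // Stage 1: a relevant parent keeps all of its outlinks.
  // Stage 2: for an irrelevant parent, keep only outlinks whose URL contains
  // at least one word from the list.
  static List<String> filterOutlinks(boolean parentRelevant,
                                     List<String> outlinks,
                                     List<String> importantWords) {
    if (parentRelevant) {
      return new ArrayList<>(outlinks);
    }
    List<String> kept = new ArrayList<>();
    for (String url : outlinks) {
      String lowerUrl = url.toLowerCase(Locale.ROOT);
      for (String word : importantWords) {
        // Empty entries are ignored, since every string contains "".
        if (!word.isEmpty() && lowerUrl.contains(word.toLowerCase(Locale.ROOT))) {
          kept.add(url);
          break;
        }
      }
    }
    return kept;
  }
}
```

For example, with the word list containing only "robot", an irrelevant parent linking to a robotics page and a cafeteria page would pass on only the first link, while a relevant parent would pass on both.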
[IMPORTANT] Migration Towards Hadoop 2.X -- 3.X
Hi Folks,

Before too long Hadoop will be up at 3.X for stable official releases. I wanted to solicit the dev@ community to see what difficulties, if any, people have had running Nutch trunk on Hadoop 2.X. Hadoop 2.X is supported on Nutch 2.X but getting the patches all correct is literally a PITA... we are working on that down in the Gora community and need to get a better, more frequent release cycle.

I just wanted to know if there was motivation for us to get some patches committed to trunk, release it as 1.11, then focus the next development drive on a switch to Hadoop 2.X for trunk. We could potentially then release Nutch 1.11 as 3.0.

What do you guys think?

Thanks
Lewis

--
*Lewis*
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600419#comment-14600419 ] Asitang Mishra commented on NUTCH-2038: --- maybe rename the plugin to parsefilter-naivebayes for simplicity and in advance of NUTCH-1482 Will do that is this statement still true? CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this classifier. The first ever call to parse filter takes a bit more time because the training is done and model is created. So, time out should be a little more. Does not take much time after this. afaics, the way the model is generated, stored and loaded needs a review: it should be read/generated once and then cached in memory, writing the model to disk is likely to become painful in distributed mode with concurrent tasks. The model is created during the parsing of the first fetched page of the very first parse job. After that it checks if the model file already present or not. The model file is being read each time the classify() function is called, will change that and store the model all the way thru for a single parse job. cosmetics: exceptions are properly logged via LOG.error(StringUtils.stringifyException(e)) and do not get lost somewhere in stdout/stderr as of e.printStackTrace() code formatting, see [1] will do that Naive Bayes classifier based html Parse filter (for filtering outlinks) --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra Assignee: Chris A. Mattmann Labels: memex, nutch Fix For: 1.11 A html parse filter that will filter out the outlinks in two stages. Classify the parse text and decide if the parent page is relevant. If relevant then don't filter the outlinks. If irrelevant then go thru each outlink and see if the url contains any of the important words from a list. If it does then let it pass. 
[jira] [Comment Edited] (NUTCH-2047) Improvements to the relevance scoring plugin
[ https://issues.apache.org/jira/browse/NUTCH-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600327#comment-14600327 ] Sujen Shah edited comment on NUTCH-2047 at 6/24/15 11:09 PM: - This file is a dump of the top 1000 URLs. The model file contained information related to robotics from a wikipedia article. And the seed list was CMU's Robotics institute homepage. The top few URLs are marked with the same score because most of they are unfetched and have been distributed the score by their parent url. was (Author: sujenshah): This file is a dump of the top 1000 URLs. The model file contained information related to robotics from a wikipedia article. And the seed list was CMU's Robotics institute homepage Improvements to the relevance scoring plugin Key: NUTCH-2047 URL: https://issues.apache.org/jira/browse/NUTCH-2047 Project: Nutch Issue Type: Improvement Components: scoring Reporter: Sujen Shah Labels: memex Fix For: 1.11 Attachments: part-0 To discuss the results and improvements on the scoring-similarity plugin using the cosine similarity model. Currently, the outlinks are distributed the same score as the parent URL. Which means an irrelevant URL(with a relevant parent) would be fetched for one more round before it gets a lower score and filtered. So we would require one additional fetch/parse to score these irrelevant urls(from relevant parents) lower. Any suggestions on this are appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: NUTCH-2038
GitHub user asitang opened a pull request: https://github.com/apache/nutch/pull/35 NUTCH-2038 You can merge this pull request into a Git repository by running: $ git pull https://github.com/asitang/nutch NUTCH-2038 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/nutch/pull/35.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #35 commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:11:42Z patch 1.0 for NUTCH-2038 commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:14:37Z patch 1.0 for NUTCH-2038 commit 711f44d8d4af51538ff1764145ac743445b6f43b Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:35:28Z patch 1.0 for NUTCH-2038 commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a Author: Asitang Mishra asit...@gmail.com Date: 2015-06-18T15:09:30Z final commir for pattch 1.0 commit cca768bc1c790a976594136433485fe899465cb8 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-19T20:13:34Z Patch 2.0 for NUTCH-2038 commit 0e80bf471b7d40965cf3bdad908252f5ce577d85 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:45:50Z commit for 3.0 patch of NUTCH-2038 commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:46:46Z commit for 3.0 patch of NUTCH-2038 commit 3a7bf466c76e8cffef96063101a39a77c328d657 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:55:22Z commit for 3.1 patch of NUTCH-2038 commit ae89456e9f4078111653273fe0ac52c26c568c36 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:58:12Z commit for 3.2 patch of NUTCH-2038 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nutch pull request: NUTCH-2038
Github user asitang closed the pull request at: https://github.com/apache/nutch/pull/34 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599625#comment-14599625 ] ASF GitHub Bot commented on NUTCH-2038: --- GitHub user asitang opened a pull request: https://github.com/apache/nutch/pull/35 NUTCH-2038 You can merge this pull request into a Git repository by running: $ git pull https://github.com/asitang/nutch NUTCH-2038 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/nutch/pull/35.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #35 commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:11:42Z patch 1.0 for NUTCH-2038 commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:14:37Z patch 1.0 for NUTCH-2038 commit 711f44d8d4af51538ff1764145ac743445b6f43b Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:35:28Z patch 1.0 for NUTCH-2038 commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a Author: Asitang Mishra asit...@gmail.com Date: 2015-06-18T15:09:30Z final commir for pattch 1.0 commit cca768bc1c790a976594136433485fe899465cb8 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-19T20:13:34Z Patch 2.0 for NUTCH-2038 commit 0e80bf471b7d40965cf3bdad908252f5ce577d85 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:45:50Z commit for 3.0 patch of NUTCH-2038 commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:46:46Z commit for 3.0 patch of NUTCH-2038 commit 3a7bf466c76e8cffef96063101a39a77c328d657 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:55:22Z commit for 3.1 patch of NUTCH-2038 commit ae89456e9f4078111653273fe0ac52c26c568c36 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:58:12Z commit for 3.2 patch of NUTCH-2038 Naive 
Bayes classifier based url filter --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra Assignee: Chris A. Mattmann Labels: memex, nutch Fix For: 1.11 A url filter that will filter out the urls (after the parsing stage, will keep only those urls that contain some hot words provided again in a list.) from that pages that are classified irrelevant by the classifier. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: NUTCH-2038
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165265

--- Diff: conf/nutch-default.xml ---
@@ -1208,6 +1208,28 @@
 </property>

 <property>
+  <name>htmlparsefilter.naivebayes.trainfile</name>
+  <value></value>
+  <description>Set the name of the file to be used for Naive Bayes training. The format will be:
+Each line contains two tab seperted parts
+There are two columns/parts:
+1. 1 or 0, 1 for relevent and 0 for irrelevent document.
+3. Text (text that will be used for training)
+
+Each row will be considered a new document for the classifier.
+CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this classifier.
+
+  </description>
+</property>
+
+<property>
+  <name>htmlparsefilter.naivebayes.wordlist</name>
+  <value></value>
+  <description>Put the name of the file you want to be used as a list of important words to be matched in the url for the model filter. The format should be one word per line.
--- End diff --

can you insert some line breaks at like 80 chars so it doesn't run off the screen on this? Thanks @asitang
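The training-file format in the property description (a 1 or 0 label, a tab, then the text, one document per line) can be read with something like the sketch below. It is a guess at a reasonable reader, not the plugin's parsing code: `TrainingLineSketch`, `Example`, and the strict error handling are assumptions.

```java
// Hypothetical reader for one line of the training file described in the
// property: "<label>TAB<text>", where label is 1 (relevant) or 0 (irrelevant).
final class TrainingLineSketch {

  // One labelled training document.
  static final class Example {
    final boolean relevant;
    final String text;

    Example(boolean relevant, String text) {
      this.relevant = relevant;
      this.text = text;
    }
  }

  static Example parseLine(String line) {
    int tab = line.indexOf('\t');
    if (tab < 0) {
      throw new IllegalArgumentException("expected <label>TAB<text>: " + line);
    }
    String label = line.substring(0, tab).trim();
    String text = line.substring(tab + 1);
    if (!label.equals("1") && !label.equals("0")) {
      throw new IllegalArgumentException("label must be 1 or 0: " + label);
    }
    return new Example(label.equals("1"), text);
  }
}
```

Only the first tab splits the line, so tabs inside the text column are kept as part of the document text.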
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599644#comment-14599644 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165265 --- Diff: conf/nutch-default.xml --- @@ -1208,6 +1208,28 @@ /property property + namehtmlparsefilter.naivebayes.trainfile/name + value/value + descriptionSet the name of the file to be used for Naive Bayes training. The format will be: +Each line contains two tab seperted parts +There are two columns/parts: +1. 1 or 0, 1 for relevent and 0 for irrelevent document. +3. Text (text that will be used for training) + +Each row will be considered a new document for the classifier. +CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this classifier. + + /description +/property + +property + namehtmlparsefilter.naivebayes.wordlist/name + value/value + descriptionPut the name of the file you want to be used as a list of important words to be matched in the url for the model filter. The format should be one word per line. --- End diff -- can you insert some line breaks at like 80 chars so it doesn't run off the screen on this? Thanks @asitang Naive Bayes classifier based url filter --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra Assignee: Chris A. Mattmann Labels: memex, nutch Fix For: 1.11 A url filter that will filter out the urls (after the parsing stage, will keep only those urls that contain some hot words provided again in a list.) from that pages that are classified irrelevant by the classifier. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: NUTCH-2038
Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165338 --- Diff: ivy/ivy.xml --- @@ -78,7 +78,11 @@ dependency org=org.apache.cxf name=cxf-rt-transports-http-jetty rev=3.0.4/ dependency org=com.fasterxml.jackson.core name=jackson-databind rev=2.5.1 / dependency org=com.fasterxml.jackson.dataformat name=jackson-dataformat-cbor rev=2.5.1 / -dependency org=com.fasterxml.jackson.jaxrs name=jackson-jaxrs-json-provider rev=2.5.1 / +dependency org=com.fasterxml.jackson.jaxrs name=jackson-jaxrs-json-provider rev=2.5.1 / --- End diff -- extraneous. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599645#comment-14599645 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165299 --- Diff: conf/nutch-default.xml --- @@ -1258,6 +1280,7 @@ !-- urlfilter plugin properties -- + --- End diff -- extraneous not needed. Naive Bayes classifier based url filter --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra Assignee: Chris A. Mattmann Labels: memex, nutch Fix For: 1.11 A url filter that will filter out the urls (after the parsing stage, will keep only those urls that contain some hot words provided again in a list.) from that pages that are classified irrelevant by the classifier. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: NUTCH-2038
Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165299 --- Diff: conf/nutch-default.xml --- @@ -1258,6 +1280,7 @@ !-- urlfilter plugin properties -- + --- End diff -- extraneous not needed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599646#comment-14599646 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165338 --- Diff: ivy/ivy.xml --- @@ -78,7 +78,11 @@ dependency org=org.apache.cxf name=cxf-rt-transports-http-jetty rev=3.0.4/ dependency org=com.fasterxml.jackson.core name=jackson-databind rev=2.5.1 / dependency org=com.fasterxml.jackson.dataformat name=jackson-dataformat-cbor rev=2.5.1 / -dependency org=com.fasterxml.jackson.jaxrs name=jackson-jaxrs-json-provider rev=2.5.1 / +dependency org=com.fasterxml.jackson.jaxrs name=jackson-jaxrs-json-provider rev=2.5.1 / --- End diff -- extraneous. Naive Bayes classifier based url filter --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra Assignee: Chris A. Mattmann Labels: memex, nutch Fix For: 1.11 A url filter that will filter out the urls (after the parsing stage, will keep only those urls that contain some hot words provided again in a list.) from that pages that are classified irrelevant by the classifier. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599647#comment-14599647 ]

ASF GitHub Bot commented on NUTCH-2038:
---------------------------------------

Github user chrismattmann commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/35#discussion_r33165388

    --- Diff: ivy/ivy.xml ---
    @@ -78,7 +78,11 @@
     <dependency org="org.apache.cxf" name="cxf-rt-transports-http-jetty" rev="3.0.4"/>
     <dependency org="com.fasterxml.jackson.core" name="jackson-databind" rev="2.5.1" />
     <dependency org="com.fasterxml.jackson.dataformat" name="jackson-dataformat-cbor" rev="2.5.1" />
    -<dependency org="com.fasterxml.jackson.jaxrs" name="jackson-jaxrs-json-provider" rev="2.5.1" />
    +<dependency org="com.fasterxml.jackson.jaxrs" name="jackson-jaxrs-json-provider" rev="2.5.1" />
    +<dependency org="org.apache.mahout" name="mahout-math" rev="0.8" />
    --- End diff --

    these dependencies should go into the htmlparsefilter-naivebayes/ivy.xml, not the main one. I mentioned this last time.
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599655#comment-14599655 ]

ASF GitHub Bot commented on NUTCH-2038:
---------------------------------------

Github user chrismattmann commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/35#discussion_r33165500

    --- Diff: src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java ---
    @@ -0,0 +1,214 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.nutch.htmlparsefilter.naivebayes;
    +
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import org.w3c.dom.DocumentFragment;
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.nutch.parse.HTMLMetaTags;
    +import org.apache.nutch.parse.HtmlParseFilter;
    +import org.apache.nutch.parse.Outlink;
    +import org.apache.nutch.parse.Parse;
    +import org.apache.nutch.parse.ParseData;
    +import org.apache.nutch.parse.ParseResult;
    +import org.apache.nutch.parse.ParseStatus;
    +import org.apache.nutch.parse.ParseText;
    +import org.apache.nutch.protocol.Content;
    +
    +import java.io.Reader;
    +import java.io.BufferedReader;
    +import java.io.IOException;
    +import java.util.ArrayList;
    +
    +/**
    + * Html Parse filter that classifies the outlinks from the parseresult as
    + * relevant or irrelevant based on the parseText's relevancy (using a training
    + * file where you can give positive and negative example texts see the
    + * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevent
    + * it gives the link a second chance if it contains any of the words from the
    + * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
    + * parser.timeout to -1 or a bigger value than 30, when using this classifier.
    + */
    +public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
    +
    +  private static final Logger LOG = LoggerFactory
    +      .getLogger(NaiveBayesHTMLParseFilter.class);
    +
    +  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
    +  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
    +
    +  private Configuration conf;
    +  private String inputFilePath;
    +  private String dictionaryFile;
    +  private ArrayList<String> wordlist = new ArrayList<String>();
    +
    +  public NaiveBayesHTMLParseFilter() {
    +
    +  }
    +
    +  public boolean filterParse(String text) {
    +
    +    try {
    +      return classify(text);
    +    } catch (IOException e) {
    +      // TODO Auto-generated catch block
    --- End diff --

    remove
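The class comment quoted in the diff describes a two-step decision: outlinks of a page judged relevant are kept, and outlinks of a page judged irrelevant get a second chance if the URL contains a word from the configured list. A minimal sketch of that decision follows. It is not the patch's code: `OutlinkFilterSketch` and `keepOutlinks` are hypothetical names, the classifier verdict is passed in as a boolean, and a plain substring test is assumed for the hot-word check.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: names and structure are assumptions, not taken from the patch. */
final class OutlinkFilterSketch {

  /** Returns true if the URL contains any hot word (plain substring test, assumed). */
  static boolean containsWord(String url, List<String> wordlist) {
    for (String word : wordlist) {
      if (url.contains(word)) {
        return true;
      }
    }
    return false;
  }

  /**
   * Keeps every outlink of a relevant page; for an irrelevant page keeps only
   * the outlinks that pass the hot-word check. pageRelevant stands in for the
   * classifier's verdict on the parse text.
   */
  static List<String> keepOutlinks(boolean pageRelevant, List<String> outlinks,
      List<String> wordlist) {
    if (pageRelevant) {
      return new ArrayList<>(outlinks);
    }
    List<String> kept = new ArrayList<>();
    for (String url : outlinks) {
      if (containsWord(url, wordlist)) {
        kept.add(url);
      }
    }
    return kept;
  }
}
```

Passing the verdict in as a parameter keeps the sketch independent of any classifier library; in the plugin the verdict would come from the Naive Bayes model.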
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599656#comment-14599656 ]

Asitang Mishra commented on NUTCH-2038:
---------------------------------------

I still have to move the external Mahout and Lucene jars into the plugin; I will do that. I have also changed the plugin's functionality a bit this time to make it simpler to use. The trainfile will now have just two rows: 1 or 0, and text (see the patch for more detail).
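To make the simplified trainfile concrete, here is a small sketch of reading one labeled example. The comment above only says that an entry consists of "1 or 0" and text; the tab separator, the `LabeledText` holder and the `TrainFileSketch` class are assumptions made for illustration, so check the patch for the actual layout.

```java
/** Hypothetical holder for one labeled training example (not part of the patch). */
final class LabeledText {
  final boolean relevant;
  final String text;

  LabeledText(boolean relevant, String text) {
    this.relevant = relevant;
    this.text = text;
  }
}

/** Illustrative parser; the tab-separated "label, text" layout is an assumption. */
final class TrainFileSketch {

  /** Parses "label TAB text"; returns null when the line does not match. */
  static LabeledText parseLine(String line) {
    if (line == null) {
      return null;
    }
    int tab = line.indexOf('\t');
    if (tab < 0) {
      return null;
    }
    String label = line.substring(0, tab).trim();
    String text = line.substring(tab + 1).trim();
    boolean validLabel = label.equals("1") || label.equals("0");
    if (!validLabel || text.isEmpty()) {
      return null;
    }
    // "1" marks a relevant example, "0" an irrelevant one.
    return new LabeledText(label.equals("1"), text);
  }
}
```

Returning null for malformed lines lets a caller skip and count bad entries instead of aborting training on the first one.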
[GitHub] nutch pull request: NUTCH-2038
Github user chrismattmann commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/35#discussion_r33165528

    --- Diff: src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java ---
    @@ -0,0 +1,214 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.nutch.htmlparsefilter.naivebayes;
    +
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import org.w3c.dom.DocumentFragment;
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.nutch.parse.HTMLMetaTags;
    +import org.apache.nutch.parse.HtmlParseFilter;
    +import org.apache.nutch.parse.Outlink;
    +import org.apache.nutch.parse.Parse;
    +import org.apache.nutch.parse.ParseData;
    +import org.apache.nutch.parse.ParseResult;
    +import org.apache.nutch.parse.ParseStatus;
    +import org.apache.nutch.parse.ParseText;
    +import org.apache.nutch.protocol.Content;
    +
    +import java.io.Reader;
    +import java.io.BufferedReader;
    +import java.io.IOException;
    +import java.util.ArrayList;
    +
    +/**
    + * Html Parse filter that classifies the outlinks from the parseresult as
    + * relevant or irrelevant based on the parseText's relevancy (using a training
    + * file where you can give positive and negative example texts see the
    + * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevent
    + * it gives the link a second chance if it contains any of the words from the
    + * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
    + * parser.timeout to -1 or a bigger value than 30, when using this classifier.
    + */
    +public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
    +
    +  private static final Logger LOG = LoggerFactory
    +      .getLogger(NaiveBayesHTMLParseFilter.class);
    +
    +  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
    +  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
    +
    +  private Configuration conf;
    +  private String inputFilePath;
    +  private String dictionaryFile;
    +  private ArrayList<String> wordlist = new ArrayList<String>();
    +
    +  public NaiveBayesHTMLParseFilter() {
    +
    +  }
    +
    +  public boolean filterParse(String text) {
    +
    +    try {
    +      return classify(text);
    +    } catch (IOException e) {
    +      // TODO Auto-generated catch block
    +      LOG.error("Error occured while classifying:: " + text);
    --- End diff --

    maybe print the e's stack trace too?
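The review suggestion amounts to handing the exception to the logger together with the message, so the stack trace ends up in the log. With SLF4J that is the two-argument call `LOG.error(message, e)`. The sketch below shows the same pattern with `java.util.logging` so that it runs without extra jars; `ClassifyLoggingSketch` and its stand-in `classify` method are hypothetical and not part of the patch.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Illustrative only: shows logging an exception together with its stack trace. */
final class ClassifyLoggingSketch {

  private static final Logger LOG =
      Logger.getLogger(ClassifyLoggingSketch.class.getName());

  /** Stand-in for the real classifier call; fails on empty input to exercise the catch. */
  static boolean classify(String text) throws IOException {
    if (text.isEmpty()) {
      throw new IOException("nothing to classify");
    }
    return text.contains("nutch");
  }

  static boolean filterParse(String text) {
    try {
      return classify(text);
    } catch (IOException e) {
      // Passing e as the last argument makes the logger record the stack trace.
      LOG.log(Level.SEVERE, "Error occurred while classifying: " + text, e);
      return false;
    }
  }
}
```

Returning false inside the catch keeps the existing behaviour (a page that cannot be classified is treated as irrelevant) while making the failure visible in the logs.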
[GitHub] nutch pull request: NUTCH-2038
Github user chrismattmann commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/35#discussion_r33165623

    --- Diff: src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java ---
    @@ -0,0 +1,214 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.nutch.htmlparsefilter.naivebayes;
    +
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import org.w3c.dom.DocumentFragment;
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.nutch.parse.HTMLMetaTags;
    +import org.apache.nutch.parse.HtmlParseFilter;
    +import org.apache.nutch.parse.Outlink;
    +import org.apache.nutch.parse.Parse;
    +import org.apache.nutch.parse.ParseData;
    +import org.apache.nutch.parse.ParseResult;
    +import org.apache.nutch.parse.ParseStatus;
    +import org.apache.nutch.parse.ParseText;
    +import org.apache.nutch.protocol.Content;
    +
    +import java.io.Reader;
    +import java.io.BufferedReader;
    +import java.io.IOException;
    +import java.util.ArrayList;
    +
    +/**
    + * Html Parse filter that classifies the outlinks from the parseresult as
    + * relevant or irrelevant based on the parseText's relevancy (using a training
    + * file where you can give positive and negative example texts see the
    + * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevent
    + * it gives the link a second chance if it contains any of the words from the
    + * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
    + * parser.timeout to -1 or a bigger value than 30, when using this classifier.
    + */
    +public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
    +
    +  private static final Logger LOG = LoggerFactory
    +      .getLogger(NaiveBayesHTMLParseFilter.class);
    +
    +  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
    +  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
    +
    +  private Configuration conf;
    +  private String inputFilePath;
    +  private String dictionaryFile;
    +  private ArrayList<String> wordlist = new ArrayList<String>();
    +
    +  public NaiveBayesHTMLParseFilter() {
    +
    +  }
    +
    +  public boolean filterParse(String text) {
    +
    +    try {
    +      return classify(text);
    +    } catch (IOException e) {
    +      // TODO Auto-generated catch block
    +      LOG.error("Error occured while classifying:: " + text);
    +
    +    }
    +
    +    return false;
    +  }
    +
    +  public boolean filterUrl(String url) {
    +
    +    return containsWord(url, wordlist);
    +
    +  }
    +
    +  public boolean classify(String text) throws IOException {
    +
    +    // if classified as relevent 1 then return true
    +    if (NaiveBayesClassifier.classify(text).equals("1"))
    +      return true;
    +    return false;
    +  }
    +
    +  public void train() throws Exception {
    +    // check if the model file exists, if it does then don't train
    +    if (!FileSystem.get(conf).exists(new Path("model"))) {
    +      LOG.info("Training the Naive Bayes Model");
    +      NaiveBayesClassifier.createModel(inputFilePath);
    +    } else {
    +      LOG.info("Model already exists. Skipping training.");
    +    }
    +  }
    +
    +  public boolean containsWord(String url, ArrayList<String> wordlist) {
    +    for (String word : wordlist) {
    +      if (url.contains(word)) {
    +        return true;
    +      }
    +    }
    +
    +    return false;
    +  }
    +
    +  public void setConf(Configuration conf) {
    +    this.conf = conf;
    +    inputFilePath = conf.get(TRAINFILE_MODELFILTER);
    +    dictionaryFile = conf.get(DICTFILE_MODELFILTER);
    +    if (inputFilePath == null || inputFilePath.trim().length() == 0
    +        || dictionaryFile == null || dictionaryFile.trim().length() == 0) {
    +      String message = "Model URLFilter:
[GitHub] nutch pull request: NUTCH-2038
Github user chrismattmann commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/35#discussion_r33165581

    --- Diff: src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java ---
    @@ -0,0 +1,214 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.nutch.htmlparsefilter.naivebayes;
    +
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import org.w3c.dom.DocumentFragment;
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.nutch.parse.HTMLMetaTags;
    +import org.apache.nutch.parse.HtmlParseFilter;
    +import org.apache.nutch.parse.Outlink;
    +import org.apache.nutch.parse.Parse;
    +import org.apache.nutch.parse.ParseData;
    +import org.apache.nutch.parse.ParseResult;
    +import org.apache.nutch.parse.ParseStatus;
    +import org.apache.nutch.parse.ParseText;
    +import org.apache.nutch.protocol.Content;
    +
    +import java.io.Reader;
    +import java.io.BufferedReader;
    +import java.io.IOException;
    +import java.util.ArrayList;
    +
    +/**
    + * Html Parse filter that classifies the outlinks from the parseresult as
    + * relevant or irrelevant based on the parseText's relevancy (using a training
    + * file where you can give positive and negative example texts see the
    + * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevent
    + * it gives the link a second chance if it contains any of the words from the
    + * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
    + * parser.timeout to -1 or a bigger value than 30, when using this classifier.
    + */
    +public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
    +
    +  private static final Logger LOG = LoggerFactory
    +      .getLogger(NaiveBayesHTMLParseFilter.class);
    +
    +  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
    +  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
    +
    +  private Configuration conf;
    +  private String inputFilePath;
    +  private String dictionaryFile;
    +  private ArrayList<String> wordlist = new ArrayList<String>();
    +
    +  public NaiveBayesHTMLParseFilter() {
    +
    +  }
    +
    +  public boolean filterParse(String text) {
    +
    +    try {
    +      return classify(text);
    +    } catch (IOException e) {
    +      // TODO Auto-generated catch block
    +      LOG.error("Error occured while classifying:: " + text);
    +
    +    }
    +
    +    return false;
    +  }
    +
    +  public boolean filterUrl(String url) {
    +
    +    return containsWord(url, wordlist);
    +
    +  }
    +
    +  public boolean classify(String text) throws IOException {
    +
    +    // if classified as relevent 1 then return true
    +    if (NaiveBayesClassifier.classify(text).equals("1"))
    +      return true;
    +    return false;
    +  }
    +
    +  public void train() throws Exception {
    +    // check if the model file exists, if it does then don't train
    +    if (!FileSystem.get(conf).exists(new Path("model"))) {
    +      LOG.info("Training the Naive Bayes Model");
    +      NaiveBayesClassifier.createModel(inputFilePath);
    +    } else {
    +      LOG.info("Model already exists. Skipping training.");
    +    }
    +  }
    +
    +  public boolean containsWord(String url, ArrayList<String> wordlist) {
    +    for (String word : wordlist) {
    +      if (url.contains(word)) {
    +        return true;
    +      }
    +    }
    +
    +    return false;
    +  }
    +
    +  public void setConf(Configuration conf) {
    +    this.conf = conf;
    +    inputFilePath = conf.get(TRAINFILE_MODELFILTER);
    +    dictionaryFile = conf.get(DICTFILE_MODELFILTER);
    +    if (inputFilePath == null || inputFilePath.trim().length() == 0
    +        || dictionaryFile == null || dictionaryFile.trim().length() == 0) {
    +      String message = "Model URLFilter:
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599660#comment-14599660 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165623
--- Diff: src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java ---
@@ -0,0 +1,214 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.htmlparsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts see the
+ * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevent
+ * it gives the link a second chance if it contains any of the words from the
+ * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
+ * parser.timeout to -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+      .getLogger(NaiveBayesHTMLParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public NaiveBayesHTMLParseFilter() {
+
+  }
+
+  public boolean filterParse(String text) {
+
+    try {
+      return classify(text);
+    } catch (IOException e) {
+      // TODO Auto-generated catch block
+      LOG.error("Error occured while classifying:: " + text);
+
+    }
+
+    return false;
+  }
+
+  public boolean filterUrl(String url) {
+
+    return containsWord(url, wordlist);
+
+  }
+
+  public boolean classify(String text) throws IOException {
+
+    // if classified as relevent 1 then return true
+    if (NaiveBayesClassifier.classify(text).equals("1"))
+      return true;
+    return false;
+  }
+
+  public void train() throws Exception {
+    // check if the model file exists, if it does then don't train
+    if (!FileSystem.get(conf).exists(new Path("model"))) {
+      LOG.info("Training the Naive Bayes Model");
+      NaiveBayesClassifier.createModel(inputFilePath);
+    } else {
+      LOG.info("Model already exists. Skipping training.");
+    }
+  }
+
+  public boolean containsWord(String url, ArrayList<String> wordlist) {
+    for (String word : wordlist) {
+      if (url.contains(word)) {
+        return true;
+      }
+    }
+
+    return false;
+  }
+
+  public void setConf(Configuration conf) {
+    this.conf = conf;
+    inputFilePath = conf.get(TRAINFILE_MODELFILTER);
+
[GitHub] nutch pull request: NUTCH-2038
Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165638
--- Diff: src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java ---
@@ -0,0 +1,214 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.htmlparsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts see the
+ * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevent
+ * it gives the link a second chance if it contains any of the words from the
+ * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
+ * parser.timeout to -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+      .getLogger(NaiveBayesHTMLParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public NaiveBayesHTMLParseFilter() {
+
+  }
+
+  public boolean filterParse(String text) {
+
+    try {
+      return classify(text);
+    } catch (IOException e) {
+      // TODO Auto-generated catch block
+      LOG.error("Error occured while classifying:: " + text);
+
+    }
+
+    return false;
+  }
+
+  public boolean filterUrl(String url) {
+
+    return containsWord(url, wordlist);
+
+  }
+
+  public boolean classify(String text) throws IOException {
+
+    // if classified as relevent 1 then return true
+    if (NaiveBayesClassifier.classify(text).equals("1"))
+      return true;
+    return false;
+  }
+
+  public void train() throws Exception {
+    // check if the model file exists, if it does then don't train
+    if (!FileSystem.get(conf).exists(new Path("model"))) {
+      LOG.info("Training the Naive Bayes Model");
+      NaiveBayesClassifier.createModel(inputFilePath);
+    } else {
+      LOG.info("Model already exists. Skipping training.");
+    }
+  }
+
+  public boolean containsWord(String url, ArrayList<String> wordlist) {
+    for (String word : wordlist) {
+      if (url.contains(word)) {
+        return true;
+      }
+    }
+
+    return false;
+  }
+
+  public void setConf(Configuration conf) {
+    this.conf = conf;
+    inputFilePath = conf.get(TRAINFILE_MODELFILTER);
+    dictionaryFile = conf.get(DICTFILE_MODELFILTER);
+    if (inputFilePath == null || inputFilePath.trim().length() == 0
+        || dictionaryFile == null || dictionaryFile.trim().length() == 0) {
+      String message = "Model URLFilter:
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599661#comment-14599661 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165638
--- Diff: src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java ---
@@ -0,0 +1,214 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.htmlparsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts see the
+ * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevent
+ * it gives the link a second chance if it contains any of the words from the
+ * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
+ * parser.timeout to -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+      .getLogger(NaiveBayesHTMLParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public NaiveBayesHTMLParseFilter() {
+
+  }
+
+  public boolean filterParse(String text) {
+
+    try {
+      return classify(text);
+    } catch (IOException e) {
+      // TODO Auto-generated catch block
+      LOG.error("Error occured while classifying:: " + text);
+
+    }
+
+    return false;
+  }
+
+  public boolean filterUrl(String url) {
+
+    return containsWord(url, wordlist);
+
+  }
+
+  public boolean classify(String text) throws IOException {
+
+    // if classified as relevent 1 then return true
+    if (NaiveBayesClassifier.classify(text).equals("1"))
+      return true;
+    return false;
+  }
+
+  public void train() throws Exception {
+    // check if the model file exists, if it does then don't train
+    if (!FileSystem.get(conf).exists(new Path("model"))) {
+      LOG.info("Training the Naive Bayes Model");
+      NaiveBayesClassifier.createModel(inputFilePath);
+    } else {
+      LOG.info("Model already exists. Skipping training.");
+    }
+  }
+
+  public boolean containsWord(String url, ArrayList<String> wordlist) {
+    for (String word : wordlist) {
+      if (url.contains(word)) {
+        return true;
+      }
+    }
+
+    return false;
+  }
+
+  public void setConf(Configuration conf) {
+    this.conf = conf;
+    inputFilePath = conf.get(TRAINFILE_MODELFILTER);
+
[jira] [Created] (NUTCH-2046) The crawl script should be able to skip an initial injection.
Luis Lopez created NUTCH-2046: - Summary: The crawl script should be able to skip an initial injection. Key: NUTCH-2046 URL: https://issues.apache.org/jira/browse/NUTCH-2046 Project: Nutch Issue Type: Improvement Components: crawldb, injector Affects Versions: 1.10 Reporter: Luis Lopez When our crawl gets really big a new injection takes considerable time as it updates crawldb, the crawl script should be able to skip the injection and go directly to the generate call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: NUTCH-2038
Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165388
--- Diff: ivy/ivy.xml ---
@@ -78,7 +78,11 @@
   <dependency org="org.apache.cxf" name="cxf-rt-transports-http-jetty" rev="3.0.4"/>
   <dependency org="com.fasterxml.jackson.core" name="jackson-databind" rev="2.5.1" />
   <dependency org="com.fasterxml.jackson.dataformat" name="jackson-dataformat-cbor" rev="2.5.1" />
-  <dependency org="com.fasterxml.jackson.jaxrs" name="jackson-jaxrs-json-provider" rev="2.5.1" />
+  <dependency org="com.fasterxml.jackson.jaxrs" name="jackson-jaxrs-json-provider" rev="2.5.1" />
+  <dependency org="org.apache.mahout" name="mahout-math" rev="0.8" />
--- End diff --
these dependencies should go into the htmlparsefilter-naivebayes/ivy.xml, not the main one. I mentioned this last time.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599648#comment-14599648 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165405
--- Diff: ivy/ivy.xml ---
@@ -100,6 +104,8 @@
   <exclude module="jmxtools" />
   <exclude module="jms" />
   <exclude module="jmxri" />
+  <exclude org="com.thoughtworks.xstream"/>
--- End diff --
also should go into the plugins ivy.xml
Naive Bayes classifier based url filter --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra Assignee: Chris A. Mattmann Labels: memex, nutch Fix For: 1.11 A url filter that will filter out the urls (after the parsing stage, will keep only those urls that contain some hot words provided again in a list.) from that pages that are classified irrelevant by the classifier. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: NUTCH-2038
Github user chrismattmann commented on a diff in the pull request: https://github.com/apache/nutch/pull/35#discussion_r33165405
--- Diff: ivy/ivy.xml ---
@@ -100,6 +104,8 @@
   <exclude module="jmxtools" />
   <exclude module="jms" />
   <exclude module="jmxri" />
+  <exclude org="com.thoughtworks.xstream"/>
--- End diff --
also should go into the plugins ivy.xml
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599610#comment-14599610 ] Sebastian Nagel commented on NUTCH-2038: Jaccard similarity sounds more like a scoring metric. Of course, it can be transformed to a boolean accept/reject by a threshold. Btw., a plugin can implement multiple interfaces and, e.g., calculate a score in the parse filter, cache it in the parse meta data, and use it in distributeScoreToOutlinks. Naive Bayes classifier based url filter --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra Assignee: Chris A. Mattmann Labels: memex, nutch Fix For: 1.11 A url filter that will filter out the urls (after the parsing stage, will keep only those urls that contain some hot words provided again in a list.) from that pages that are classified irrelevant by the classifier. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
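To make the point about turning a similarity score into a boolean accept/reject concrete, here is a minimal sketch in plain Java. It is not code from the patch or from Nutch: the class name, the tokenisation rule and the threshold values are all assumptions chosen for illustration, and a real plugin would work on the parse text and metadata that Nutch passes in.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch or of the NUTCH-2038 patch.
class JaccardThresholdSketch {

  /** Jaccard similarity of two token sets: |A intersect B| / |A union B|. */
  static double jaccard(Set<String> a, Set<String> b) {
    if (a.isEmpty() && b.isEmpty()) {
      return 1.0; // convention used here: two empty sets count as identical
    }
    Set<String> intersection = new HashSet<>(a);
    intersection.retainAll(b);
    int unionSize = a.size() + b.size() - intersection.size();
    return (double) intersection.size() / unionSize;
  }

  /** Lower-cases the text and splits it on runs of non-alphanumeric characters. */
  static Set<String> tokens(String text) {
    Set<String> result = new HashSet<>(
        Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")));
    result.remove(""); // split() can produce an empty leading token
    return result;
  }

  /** Accepts the page when its similarity to the reference text reaches the threshold. */
  static boolean accept(String page, String reference, double threshold) {
    return jaccard(tokens(page), tokens(reference)) >= threshold;
  }

  public static void main(String[] args) {
    String reference = "crawl the deep web";
    String page = "Crawl the web";
    System.out.println("score  = " + jaccard(tokens(page), tokens(reference)));
    System.out.println("accept = " + accept(page, reference, 0.5));
  }
}
```

The same `accept` wrapper would work unchanged with another measure, since only the scoring function differs; the threshold itself has to be tuned per crawl.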
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599622#comment-14599622 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user asitang closed the pull request at: https://github.com/apache/nutch/pull/34 Naive Bayes classifier based url filter --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra Assignee: Chris A. Mattmann Labels: memex, nutch Fix For: 1.11 A url filter that will filter out the urls (after the parsing stage, will keep only those urls that contain some hot words provided again in a list.) from that pages that are classified irrelevant by the classifier. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599840#comment-14599840 ] Julien Nioche commented on NUTCH-2046: -- re-script : what about a positive parameter instead of a negative one (like we do for the indexing with -i)? Could have -s followed by the path to the seed. The crawl script should be able to skip an initial injection. - Key: NUTCH-2046 URL: https://issues.apache.org/jira/browse/NUTCH-2046 Project: Nutch Issue Type: Improvement Components: crawldb, injector Affects Versions: 1.10 Reporter: Luis Lopez Labels: crawl, injection Fix For: 1.11 When our crawl gets really big a new injection takes considerable time as it updates crawldb, the crawl script should be able to skip the injection and go directly to the generate call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2046) The crawl script should be able to skip an initial injection.
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis Lopez updated NUTCH-2046: -- Attachment: crawl.patch The crawl script skips the initial injection if we use -skipInject instead of the seeds path. The crawl script should be able to skip an initial injection. - Key: NUTCH-2046 URL: https://issues.apache.org/jira/browse/NUTCH-2046 Project: Nutch Issue Type: Improvement Components: crawldb, injector Affects Versions: 1.10 Reporter: Luis Lopez Labels: crawl, injection Fix For: 1.11 Attachments: crawl.patch When our crawl gets really big a new injection takes considerable time as it updates crawldb, the crawl script should be able to skip the injection and go directly to the generate call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599933#comment-14599933 ] Luis Lopez commented on NUTCH-2046: --- I used just -skipInject instead of the actual path just because it's simpler. Also for these cases usually it's a negative parameter isn't it? like -noFilter -noParsing etc. The crawl script should be able to skip an initial injection. - Key: NUTCH-2046 URL: https://issues.apache.org/jira/browse/NUTCH-2046 Project: Nutch Issue Type: Improvement Components: crawldb, injector Affects Versions: 1.10 Reporter: Luis Lopez Labels: crawl, injection Fix For: 1.11 Attachments: crawl.patch When our crawl gets really big a new injection takes considerable time as it updates crawldb, the crawl script should be able to skip the injection and go directly to the generate call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599748#comment-14599748 ] Chris A. Mattmann commented on NUTCH-2038: -- yeah you got it Seb, we can do accept/reject by threshold. See: http://github.com/chrismattmann/tika-img-similarity/ where I have already been doing this for a while, and my search engines class http://sunset.usc.edu/classes/cs572_2015/ specifically HW1, where I had them develop something similar. The idea would be to use it to find similar objects with features, and to accept those that, e.g., fall within a threshold. There is no difference between Jaccard, Cosine, Edit Distance, whatever. They are all simply distance measurements. They can be used in search engine deduplication, in scoring, in URL filtering, in a number of places. Anyways I'll try and get something up soon. In the meanwhile I am +1 for Asitang's latest PR, modulo the stylistic updates I suggested. Thanks for the great feedback as usual. Naive Bayes classifier based url filter --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra Assignee: Chris A. Mattmann Labels: memex, nutch Fix For: 1.11 A url filter that will filter out the urls (after the parsing stage, will keep only those urls that contain some hot words provided again in a list.) from that pages that are classified irrelevant by the classifier. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: NUTCH-2038
GitHub user asitang opened a pull request:

    https://github.com/apache/nutch/pull/36

    NUTCH-2038

    Made aesthetic changes suggested by Chris Mattmann. Removed dependencies from the main ivy.xml and added it to plugin's ivy.xml.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/36.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #36

commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra <asit...@gmail.com>
Date: 2015-06-17T16:11:42Z
    patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra <asit...@gmail.com>
Date: 2015-06-17T16:14:37Z
    patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra <asit...@gmail.com>
Date: 2015-06-17T16:35:28Z
    patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra <asit...@gmail.com>
Date: 2015-06-18T15:09:30Z
    final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra <asit...@gmail.com>
Date: 2015-06-19T20:13:34Z
    Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra <asit...@gmail.com>
Date: 2015-06-24T15:45:50Z
    commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra <asit...@gmail.com>
Date: 2015-06-24T15:46:46Z
    commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra <asit...@gmail.com>
Date: 2015-06-24T15:55:22Z
    commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra <asit...@gmail.com>
Date: 2015-06-24T15:58:12Z
    commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra <asit...@gmail.com>
Date: 2015-06-24T17:31:09Z
    commit for 3.3 patch of NUTCH-2038
[jira] [Updated] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asitang Mishra updated NUTCH-2038: -- Description: A html parse filter that will filter out the outlinks in two stages. One: Classify the parse text and decide if the parent page is relevant. If relevant then don't filter the outlinks. If irrelevant then go thru each outlink and see if the url contains any of the important words from a list. If it does then let it pass. was:A url filter that will filter out the urls (after the parsing stage, will keep only those urls that contain some hot words provided again in a list.) from that pages that are classified irrelevant by the classifier. Summary: Naive Bayes classifier based html Parse filter (for filtering outlinks) (was: Naive Bayes classifier based url filter) Naive Bayes classifier based html Parse filter (for filtering outlinks) --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra Assignee: Chris A. Mattmann Labels: memex, nutch Fix For: 1.11 A html parse filter that will filter out the outlinks in two stages. One: Classify the parse text and decide if the parent page is relevant. If relevant then don't filter the outlinks. If irrelevant then go thru each outlink and see if the url contains any of the important words from a list. If it does then let it pass. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asitang Mishra updated NUTCH-2038: -- Description: A html parse filter that will filter out the outlinks in two stages. Classify the parse text and decide if the parent page is relevant. If relevant then don't filter the outlinks. If irrelevant then go thru each outlink and see if the url contains any of the important words from a list. If it does then let it pass. was: A html parse filter that will filter out the outlinks in two stages. One: Classify the parse text and decide if the parent page is relevant. If relevant then don't filter the outlinks. If irrelevant then go thru each outlink and see if the url contains any of the important words from a list. If it does then let it pass. Naive Bayes classifier based html Parse filter (for filtering outlinks) --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra Assignee: Chris A. Mattmann Labels: memex, nutch Fix For: 1.11 A html parse filter that will filter out the outlinks in two stages. Classify the parse text and decide if the parent page is relevant. If relevant then don't filter the outlinks. If irrelevant then go thru each outlink and see if the url contains any of the important words from a list. If it does then let it pass. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
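The two-stage behaviour in the updated description can be sketched roughly as follows. This is a hedged illustration rather than the plugin's implementation: the class and method names are made up, the `Predicate` merely stands in for whatever the Naive Bayes classifier decides about the parent page, and outlinks are modelled as plain URL strings instead of Nutch `Outlink` objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the two-stage outlink filter; names are assumptions.
class TwoStageOutlinkFilterSketch {

  /**
   * Stage 1: if the parent page's text is classified as relevant, keep every outlink.
   * Stage 2: otherwise keep only the outlinks whose URL contains a word from the list.
   */
  static List<String> filterOutlinks(String parseText, List<String> outlinks,
      Predicate<String> isRelevant, List<String> wordlist) {
    if (isRelevant.test(parseText)) {
      return new ArrayList<>(outlinks); // relevant parent: nothing is filtered
    }
    List<String> kept = new ArrayList<>();
    for (String url : outlinks) {
      for (String word : wordlist) {
        if (url.contains(word)) {
          kept.add(url); // irrelevant parent, but the URL gets a second chance
          break;
        }
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> outlinks = Arrays.asList(
        "http://example.org/nutch-plugins", "http://example.org/advertising");
    List<String> wordlist = Arrays.asList("nutch", "crawler");
    // Stand-in classifier: treats a page as relevant if it mentions "crawl".
    Predicate<String> classifier = text -> text.contains("crawl");

    System.out.println(filterOutlinks("how to crawl the web", outlinks, classifier, wordlist));
    System.out.println(filterOutlinks("buy cheap watches", outlinks, classifier, wordlist));
  }
}
```

Passing the classifier in as a predicate keeps the sketch independent of any training step; in the plugin, that decision would come from the trained model.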
[jira] [Updated] (NUTCH-2046) The crawl script should be able to skip an initial injection.
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2046: Fix Version/s: 1.11 The crawl script should be able to skip an initial injection. - Key: NUTCH-2046 URL: https://issues.apache.org/jira/browse/NUTCH-2046 Project: Nutch Issue Type: Improvement Components: crawldb, injector Affects Versions: 1.10 Reporter: Luis Lopez Labels: crawl, injection Fix For: 1.11 When our crawl gets really big a new injection takes considerable time as it updates crawldb, the crawl script should be able to skip the injection and go directly to the generate call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: NUTCH-2038
Github user asitang closed the pull request at: https://github.com/apache/nutch/pull/35
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599795#comment-14599795 ] ASF GitHub Bot commented on NUTCH-2038: --- Github user asitang closed the pull request at: https://github.com/apache/nutch/pull/35 Naive Bayes classifier based url filter --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra Assignee: Chris A. Mattmann Labels: memex, nutch Fix For: 1.11 A url filter that will filter out the urls (after the parsing stage, will keep only those urls that contain some hot words provided again in a list.) from that pages that are classified irrelevant by the classifier. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599750#comment-14599750 ] Lewis John McGibbney commented on NUTCH-2046: - Hi [~betolink], this is a nice issue. I think that we could easily have a [-skipInject] flag to the crawl script. Are you able to provide a patch? The crawl script should be able to skip an initial injection. - Key: NUTCH-2046 URL: https://issues.apache.org/jira/browse/NUTCH-2046 Project: Nutch Issue Type: Improvement Components: crawldb, injector Affects Versions: 1.10 Reporter: Luis Lopez Labels: crawl, injection Fix For: 1.11 When our crawl gets really big a new injection takes considerable time as it updates crawldb, the crawl script should be able to skip the injection and go directly to the generate call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599798#comment-14599798 ] ASF GitHub Bot commented on NUTCH-2038: --- GitHub user asitang opened a pull request: https://github.com/apache/nutch/pull/36 NUTCH-2038 Made aesthetic changes suggested by Chris Mattmann. Removed dependencies from the main ivy.xml and added it to plugin's ivy.xml. You can merge this pull request into a Git repository by running: $ git pull https://github.com/asitang/nutch NUTCH-2038 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/nutch/pull/36.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #36 commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:11:42Z patch 1.0 for NUTCH-2038 commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:14:37Z patch 1.0 for NUTCH-2038 commit 711f44d8d4af51538ff1764145ac743445b6f43b Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:35:28Z patch 1.0 for NUTCH-2038 commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a Author: Asitang Mishra asit...@gmail.com Date: 2015-06-18T15:09:30Z final commir for pattch 1.0 commit cca768bc1c790a976594136433485fe899465cb8 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-19T20:13:34Z Patch 2.0 for NUTCH-2038 commit 0e80bf471b7d40965cf3bdad908252f5ce577d85 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:45:50Z commit for 3.0 patch of NUTCH-2038 commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:46:46Z commit for 3.0 patch of NUTCH-2038 commit 3a7bf466c76e8cffef96063101a39a77c328d657 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:55:22Z commit for 3.1 patch of NUTCH-2038 commit 
ae89456e9f4078111653273fe0ac52c26c568c36 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:58:12Z commit for 3.2 patch of NUTCH-2038 commit ae639ec40263fafbd6c0273c619d425ee482f7f0 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T17:31:09Z commit for 3.3 patch of NUTCH-2038 Naive Bayes classifier based url filter --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra Assignee: Chris A. Mattmann Labels: memex, nutch Fix For: 1.11 A URL filter that filters out the URLs found on pages that the classifier marks as irrelevant. It runs after the parsing stage and keeps only those URLs that contain one of the hot words supplied in a list.
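The filtering rule in the issue description can be sketched roughly as follows. This is a minimal illustration in Python, not the plugin's actual Java code: the function name `filter_outlinks`, the `is_relevant` flag (standing in for the Naive Bayes classifier's verdict on the parent page), and the substring match on hot words are all assumptions, as is the choice to let every outlink of a relevant page through.

```python
def filter_outlinks(outlinks, is_relevant, hot_words):
    """Return the outlinks to keep for one parsed page.

    is_relevant: hypothetical stand-in for the classifier's verdict on the page.
    hot_words:   hypothetical list of words that rescue a URL found on an
                 irrelevant page.
    """
    if is_relevant:
        # Assumption: outlinks of a relevant page pass through unchanged.
        return list(outlinks)

    lowered = [word.lower() for word in hot_words]
    # Irrelevant page: keep only URLs containing at least one hot word.
    return [
        url for url in outlinks
        if any(word in url.lower() for word in lowered)
    ]
```

With an empty hot-word list, every outlink of an irrelevant page is dropped, so the list effectively controls how far the crawl may wander from relevant pages.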
Github Spam
Hi Folks, The Github spam is killing me. Seems to go to - nu...@noreply.github.com Basically every commit someone pushes (there have been loads recently) is sending me a new email over and above the digest emails I get. I am sure this must be pissing other people off. Is there a better way for us to work this mail? Thanks Lewis -- *Lewis*
RE: Github Spam
Well, either disable it or have people send fewer requests. On the other hand, adding patches and Jira comments also gets you e-mail.
RE: Github Spam
Hey Lewis, Yeah to be honest, this is no different than ReviewBoard, JIRA, etc. At least it's not as bad as Spark :/ I did a review of Asitang's patch and it took each one of my comments and sent a mail. B/c of Apache's requirement that things happen on the list, we have to have the mails replicated from Github on all interactions. The thing is though, maybe we should create a nutch-github@ email address, and then send mails there? Would that help? Or nutch-notifications@a.o ? Then JIRA, Github, etc., could go there? Others would have to be in support of this too. I'm +0 on either. You know all my email problems so this is just noise really lol in a sea of other noise. Cheers, Chris
[jira] [Commented] (NUTCH-1504) Pluggable url partitioner
[ https://issues.apache.org/jira/browse/NUTCH-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599958#comment-14599958 ] Michael Joyce commented on NUTCH-1504: -- This is great stuff [~lewismc], we definitely need to get this in there. Would help us out a great deal. Pluggable url partitioner - Key: NUTCH-1504 URL: https://issues.apache.org/jira/browse/NUTCH-1504 Project: Nutch Issue Type: Improvement Components: generator Affects Versions: 1.6 Reporter: Sourajit Basak Assignee: Lewis John McGibbney Fix For: 1.11 Attachments: custom.partitioner.patch At present, the URL partitioning logic is hard-wired inside Nutch core. It should be pluggable, like FetchSchedule, and customized via nutch-site.xml. There might be use cases where a single domain needs to be partitioned on some custom logic. The existing UrlPartitioner cannot handle such cases. Hence the requirement.
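The idea of choosing the partitioner through configuration, rather than hard-wiring it, might look something like the sketch below. It is written in Python for brevity and is not Nutch's Java API: the property name `partitioner.impl`, the registry, and both partitioning functions are hypothetical, and a real implementation would presumably plug into Hadoop's partitioner mechanism and read its setting from nutch-site.xml.

```python
import zlib
from urllib.parse import urlsplit


def partition_by_host(url, num_partitions):
    """Default behaviour in this sketch: all URLs of one host share a partition."""
    host = urlsplit(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_partitions


def partition_by_path_prefix(url, num_partitions):
    """Custom logic: split a single domain by its first path segment."""
    parts = urlsplit(url)
    first_segment = parts.path.strip("/").split("/")[0]
    key = f"{parts.hostname or ''}/{first_segment}"
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical registry of available implementations.
PARTITIONERS = {
    "host": partition_by_host,
    "path-prefix": partition_by_path_prefix,
}


def load_partitioner(config):
    """Pick the partitioner named in the configuration (hypothetical key)."""
    name = config.get("partitioner.impl", "host")
    try:
        return PARTITIONERS[name]
    except KeyError:
        raise ValueError(f"unknown partitioner: {name!r}") from None
```

Here `load_partitioner({"partitioner.impl": "path-prefix"})` returns the function that spreads one domain over several partitions, while an empty configuration falls back to per-host partitioning.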
RE: Github Spam
I am fine with getting rid of Github e-mail, not Jira, Jenkins or other ASF infra stuff. The git requests are not in our svn format anyway. If someone is serious about their patch and wants it in the regular releases, then please be so polite as not to make it a bit harder for us ;)
RE: Github Spam
Sorry I wasn't clear. I'm *not* fine with getting rid of Github. I was simply proposing for the mail spam to be moved to a different list. But to me, JIRA/SVN is no different than Github comments and pull requests and so forth. To each their own :) The ASF fully supports Git and Github integration though, and in a very nice way which works with SVN, so just to be clear I am in no way proposing that we move to Git, etc., but I'm also not proposing that we don't accept Git pull requests. I was just trying to help Lewis with his mail issues. Cheers, Chris