[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread Sebastian Nagel (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599210#comment-14599210 ]

Sebastian Nagel commented on NUTCH-2038:


Yes, it's possible to implement it in HtmlParseFilter.filter() - content and 
outlinks are passed to this method. But keep in mind that the outlinks are not 
yet normalized and filtered at that point. In 
ScoringFilter.distributeScoreToOutlinks() they are. 

 Naive Bayes classifier based url filter
 ---

 Key: NUTCH-2038
 URL: https://issues.apache.org/jira/browse/NUTCH-2038
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher, injector, parser
Reporter: Asitang Mishra
Assignee: Chris A. Mattmann
  Labels: memex, nutch
 Fix For: 1.11


 A URL filter that runs after the parsing stage. For pages that the classifier 
 marks as irrelevant, it filters out the outlink URLs and keeps only those that 
 contain one of the hot words provided in a list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2047) Improvements to the relevance scoring plugin

2015-06-24 Thread Sujen Shah (JIRA)
Sujen Shah created NUTCH-2047:
-

 Summary: Improvements to the relevance scoring plugin
 Key: NUTCH-2047
 URL: https://issues.apache.org/jira/browse/NUTCH-2047
 Project: Nutch
  Issue Type: Improvement
  Components: scoring
Reporter: Sujen Shah
 Fix For: 1.11


To discuss the results of, and improvements to, the scoring-similarity plugin, 
which uses the cosine similarity model. 

Currently, each outlink is given the same score as its parent URL. This means 
an irrelevant URL (with a relevant parent) is fetched for one more round before 
it gets a lower score and is filtered. So one additional fetch/parse is needed 
to score these irrelevant URLs (from relevant parents) lower. 

Any suggestions on this are appreciated. 
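
For readers unfamiliar with the cosine similarity model mentioned above, here is 
a minimal, self-contained Java sketch. It is not the scoring-similarity plugin's 
code: the class name CosineSketch, the whitespace tokenization, and the plain 
term-frequency vectors are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not taken from the scoring-similarity plugin.
class CosineSketch {

  // Build a term-frequency vector from lower-cased, whitespace-separated tokens.
  static Map<String, Integer> termFreq(String text) {
    Map<String, Integer> tf = new HashMap<>();
    for (String token : text.toLowerCase().trim().split("\\s+")) {
      if (!token.isEmpty()) {
        tf.merge(token, 1, Integer::sum);
      }
    }
    return tf;
  }

  // Euclidean length of a term-frequency vector.
  static double norm(Map<String, Integer> v) {
    double sum = 0.0;
    for (int count : v.values()) {
      sum += (double) count * count;
    }
    return Math.sqrt(sum);
  }

  // Cosine similarity: dot(a, b) / (|a| * |b|); 0.0 if either vector is empty.
  static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0.0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      Integer other = b.get(e.getKey());
      if (other != null) {
        dot += (double) e.getValue() * other;
      }
    }
    double na = norm(a);
    double nb = norm(b);
    return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
  }
}
```

A score like this needs the page text, which is why an unfetched outlink can 
only inherit its parent's score until it has been fetched and parsed itself.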





RE: Github Spam

2015-06-24 Thread Markus Jelsma
I am sorry, by getting rid I meant moving git requests to a separate list. But 
because both are accepted, this is probably not going to happen. Due to the 
flood of mail, I normally ignore git mail completely, but not Jira updates. 

If Lewis' mail client is friendly, he can filter git mail to a separate inbox. 
My filters never seem to work :(
 
-Original message-
 From:Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov
 Sent: Wednesday 24th June 2015 22:10
 To: dev@nutch.apache.org
 Subject: RE: Github Spam
 
 Sorry I wasn't clear. I'm *not* fine with getting rid of Github.
 I was simply proposing that the mail spam be moved to a different
 list. But to me JIRA/SVN is no different than Github comments and
 pull requests and so forth. To each their own :) The ASF fully supports
 Git and Github integration though, and in a very nice way which works
 with SVN. So just to be clear, I am in no way proposing that we move to
 Git, etc., but I'm also not proposing that we don't accept Git pull requests.
 I was just trying to help Lewis with his mail issues.
 
 Cheers,
 Chris
 
 
 From: Markus Jelsma [markus.jel...@openindex.io]
 Sent: Wednesday, June 24, 2015 1:07 PM
 To: dev@nutch.apache.org
 Subject: RE: Github Spam
 
 I am fine with getting rid of Github e-mail, not Jira, Jenkins or other ASF 
 infra stuff. The git requests are not in our svn format anyway. If someone is 
 serious about their patch and wants it in the regular releases, then please be 
 so polite as to not make it harder for us ;)
 
 -Original message-
  From:Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov
  Sent: Wednesday 24th June 2015 21:54
  To: dev@nutch.apache.org
  Subject: RE: Github Spam
 
  Hey Lewis,
 
  Yeah, to be honest, this is no different than ReviewBoard, JIRA, etc.
  At least it's not as bad as Spark :/ I did a review of Asitang's patch
  and it took each one of my comments and sent a mail. B/c of Apache's
  requirement that things happen on the list, we have to have the mails
  replicated from Github on all interactions. The thing is though, maybe we
  should create a nutch-github@ email address, and then send mails there?
  Would that help? Or nutch-notifications@a.o ? Then JIRA, Github, etc.,
  could go there?
 
  Others would have to be in support of this too.
 
  I'm +0 on either. You know all my email problems so this is just noise 
  really lol in a sea of other noise.
 
  Cheers,
  Chris
 
 
  
  From: Markus Jelsma [markus.jel...@openindex.io]
  Sent: Wednesday, June 24, 2015 12:49 PM
  To: dev@nutch.apache.org
  Subject: RE: Github Spam
 
  Well, either disable it or have people send fewer requests. On the other 
  hand, adding patches and Jira comments also gets you e-mail.
 
  -Original message-
  From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
  Sent: Wednesday 24th June 2015 21:47
  To: dev@nutch.apache.org
  Subject: Github Spam
 
  Hi Folks,
 
  The Github spam is killing me.
 
  Seems to go to - nu...@noreply.github.com
 
  Basically every commit someone pushes (there have been loads recently) is 
  sending me a new email over and above the digest emails I get.
 
  I am sure this must be pissing other people off. Is there a better way for 
  us to work this mail?
 
  Thanks
 
  Lewis
 
  --
 
  Lewis
 
 


[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-24 Thread Sebastian Nagel (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600269#comment-14600269 ]

Sebastian Nagel commented on NUTCH-2038:


Hi [~asitang], the latest pull request #36 looks good.
- maybe rename the plugin to parsefilter-naivebayes for simplicity and in 
  advance of NUTCH-1482
- is this statement still true?
    CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using 
    this classifier.
- afaics, the way the model is generated, stored and loaded needs a review:
  * it should be read/generated once and then cached in memory,
  * writing the model to disk is likely to become painful in distributed mode 
    with concurrent tasks.
- cosmetics:
  * exceptions should be logged via 
    LOG.error(StringUtils.stringifyException(e)) so that they do not get lost 
    somewhere in stdout/stderr, as happens with e.printStackTrace()
  * code formatting, see 
    [1] http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Step_Three:_Using_the_JIRA_and_Developing
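
The review points about loading the model once and logging exceptions could be 
sketched roughly as follows. This is only an illustration under assumptions: 
the class ModelCache and its Supplier-based loader are hypothetical, and 
java.util.logging stands in for the logger and StringUtils helper named in the 
comment so that the example stays self-contained.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: load or train a model once, then reuse it from memory.
class ModelCache<M> {
  private static final Logger LOG = Logger.getLogger(ModelCache.class.getName());

  private final Supplier<M> loader; // stands in for "read or train the model"
  private M model;                  // cached after the first successful load

  ModelCache(Supplier<M> loader) {
    this.loader = loader;
  }

  // Returns the cached model, loading it on first use; null if loading failed.
  synchronized M get() {
    if (model == null) {
      try {
        model = loader.get();
      } catch (RuntimeException e) {
        // Report through the logger rather than e.printStackTrace().
        LOG.log(Level.SEVERE, "Failed to load model", e);
      }
    }
    return model;
  }
}
```

Keeping the model in a field like this avoids re-reading the model file on every 
classify() call. It does not address the separate concern about concurrent tasks 
writing the model to disk in distributed mode.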

 Naive Bayes classifier based html Parse filter (for filtering outlinks)
 ---

 Key: NUTCH-2038
 URL: https://issues.apache.org/jira/browse/NUTCH-2038
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher, injector, parser
Reporter: Asitang Mishra
Assignee: Chris A. Mattmann
  Labels: memex, nutch
 Fix For: 1.11


 An HTML parse filter that filters outlinks in two stages. First, classify 
 the parse text to decide whether the parent page is relevant. If it is 
 relevant, don't filter the outlinks. If it is irrelevant, go through each 
 outlink and check whether the URL contains any of the important words from a 
 list. If it does, let the outlink pass.
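
 The two-stage decision described above can be sketched in plain Java. This is 
 not the plugin's code: the class OutlinkFilterSketch, the Predicate standing in 
 for the Naive Bayes classifier, and the case-insensitive substring match are 
 assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the two-stage decision; not the plugin's actual code.
class OutlinkFilterSketch {

  // Stage 1: if the classifier judges the parent page relevant, keep every outlink.
  // Stage 2: otherwise keep only outlinks whose URL contains an important word.
  static List<String> filter(String parseText, List<String> outlinks,
      Predicate<String> isRelevant, List<String> importantWords) {
    if (isRelevant.test(parseText)) {
      return new ArrayList<>(outlinks);
    }
    List<String> kept = new ArrayList<>();
    for (String url : outlinks) {
      String lower = url.toLowerCase();
      for (String word : importantWords) {
        // Assumption: matching is a case-insensitive substring test.
        if (lower.contains(word.toLowerCase())) {
          kept.add(url);
          break;
        }
      }
    }
    return kept;
  }
}
```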





[jira] [Commented] (NUTCH-1692) SegmentReader broken in distributed mode

2015-06-24 Thread Sebastian Nagel (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600290#comment-14600290 ]

Sebastian Nagel commented on NUTCH-1692:


+1

 SegmentReader broken in distributed mode
 

 Key: NUTCH-1692
 URL: https://issues.apache.org/jira/browse/NUTCH-1692
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.8
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.11

 Attachments: 20140126210858.tgz, NUTCH-1692-trunk.patch, 
 NUTCH-1692.patch


 SegmentReader -list option ignores the -no* options, causing the following 
 exception in distributed mode:
 {code}
 Exception in thread "main" java.lang.NullPointerException
 at java.util.ComparableTimSort.sort(ComparableTimSort.java:146)
 at java.util.Arrays.sort(Arrays.java:472)
 at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:85)
 at org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:463)
 at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441)
 at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:587)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
 {code}





[jira] [Commented] (NUTCH-1625) IndexerMapReduce skips FETCH_NOTMODIFIED

2015-06-24 Thread Sebastian Nagel (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600334#comment-14600334 ]

Sebastian Nagel commented on NUTCH-1625:


Is this really only legacy code and what's the exact problem? If a segment 
contains a document with a fetch datum of status FETCH_NOTMODIFIED
- the document should already be indexed from a prior segment where it was 
actually fetched
- there is definitely no content for this document in this segment because the 
server has responded with 304.

Because documents with empty content are already skipped earlier, a test for 
FETCH_NOTMODIFIED has no effect at all, afaics. Because the check for 
DB_NOTMODIFIED (if property indexer.skip.notmodified == false) also comes 
afterwards, it only affects docs which are fetched (FETCH_SUCCESS) and then 
recognized as not modified by signature comparison.
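
The ordering argument above can be made concrete with a deliberately simplified 
model. This is not the IndexerMapReduce source: the class IndexCheckSketch, its 
enums, and the dbCheckEnabled flag are hypothetical, and the sketch takes no 
position on how indexer.skip.notmodified maps onto that flag.

```java
// Hypothetical, simplified model of the check order discussed above;
// it is not the IndexerMapReduce source.
class IndexCheckSketch {
  enum FetchStatus { FETCH_SUCCESS, FETCH_NOTMODIFIED }
  enum DbStatus { DB_FETCHED, DB_NOTMODIFIED }

  static String decide(boolean hasContent, FetchStatus fetch, DbStatus db,
      boolean dbCheckEnabled) {
    // Documents without content are dropped first; a 304 response ends here.
    if (!hasContent) {
      return "skip: no content";
    }
    // Only reachable if a not-modified fetch somehow carried content.
    if (fetch == FetchStatus.FETCH_NOTMODIFIED) {
      return "skip: fetch not modified";
    }
    // Affects documents that were fetched but matched the stored signature.
    if (dbCheckEnabled && db == DbStatus.DB_NOTMODIFIED) {
      return "skip: db not modified";
    }
    return "index";
  }
}
```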

 IndexerMapReduce skips FETCH_NOTMODIFIED
 

 Key: NUTCH-1625
 URL: https://issues.apache.org/jira/browse/NUTCH-1625
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Critical
 Fix For: 1.11

 Attachments: NUTCH-1625.patch, NUTCH-1625.patch


 IndexerMapReduce has the option to skip DB_NOTMODIFIED but legacy code also 
 skips FETCH_NOTMODIFIED and the latter is not optional. We can keep the check 
 but that should also include FETCH_NOTMODIFIED. Relying on FETCH_NOTMODIFIED 
 isn't very useful anyway because since 1.5 or so we can safely rely on 
 DB_NOTMODIFIED as it is properly set in the CrawlDBReducer.





[jira] [Updated] (NUTCH-2047) Improvements to the relevance scoring plugin

2015-06-24 Thread Sujen Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujen Shah updated NUTCH-2047:
--
Attachment: part-0

This file is a dump of the top 1000 URLs. 
The model file contained information related to robotics from a Wikipedia 
article, and the seed list was the homepage of CMU's Robotics Institute. 

 Improvements to the relevance scoring plugin
 

 Key: NUTCH-2047
 URL: https://issues.apache.org/jira/browse/NUTCH-2047
 Project: Nutch
  Issue Type: Improvement
  Components: scoring
Reporter: Sujen Shah
  Labels: memex
 Fix For: 1.11

 Attachments: part-0


 To discuss the results of, and improvements to, the scoring-similarity plugin, 
 which uses the cosine similarity model. 
 Currently, each outlink is given the same score as its parent URL. This means 
 an irrelevant URL (with a relevant parent) is fetched for one more round 
 before it gets a lower score and is filtered. So one additional fetch/parse is 
 needed to score these irrelevant URLs (from relevant parents) lower. 
 Any suggestions on this are appreciated. 





[jira] [Comment Edited] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-24 Thread Asitang Mishra (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600419#comment-14600419 ]

Asitang Mishra edited comment on NUTCH-2038 at 6/25/15 12:19 AM:
-

1. maybe rename the plugin to parsefilter-naivebayes for simplicity and in 
   advance of NUTCH-1482

Will do that.

2. is this statement still true?
   CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when 
   using this classifier.

The very first call to the parse filter takes a bit more time because the 
training is done and the model is created then. So the timeout should be a 
little higher. Subsequent calls do not take much time.

3. afaics, the way the model is generated, stored and loaded needs a review:
   - it should be read/generated once and then cached in memory,
   - writing the model to disk is likely to become painful in distributed 
     mode with concurrent tasks.

The model is created during the parsing of the first fetched page of the very 
first parse job. After that, the filter checks whether the model file is 
already present. Currently the model file is read each time the classify() 
function is called; I will change that and keep the model in memory for the 
duration of a single parse job.

4. cosmetics:
   - exceptions should be logged via 
     LOG.error(StringUtils.stringifyException(e)) so that they do not get lost 
     somewhere in stdout/stderr, as happens with e.printStackTrace()
   - code formatting, see [1]

Will do that.






[IMPORTANT] Migration Towards Hadoop 2.X -- 3.X

2015-06-24 Thread Lewis John Mcgibbney
Hi Folks,
Before too long Hadoop will be at 3.X for stable official releases.
I wanted to ask the dev@ community what difficulties, if any,
people have had running Nutch trunk on Hadoop 2.X.
Hadoop 2.X is supported on Nutch 2.X but getting the patches all correct is
literally a PITA... we are working on that down in the Gora community and
need to get a better, more frequent release cycle.
I just wanted to know if there was motivation for us to get some patches
committed to trunk, release it as 1.11, then focus the next development
drive on a switch to Hadoop 2.X for trunk.
We could potentially then release Nutch  1.11 as 3.0.
What do you guys think?
Thanks
Lewis

-- 
*Lewis*


[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-24 Thread Asitang Mishra (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600419#comment-14600419 ]

Asitang Mishra commented on NUTCH-2038:
---


  maybe rename the plugin to parsefilter-naivebayes for simplicity and in 
  advance of NUTCH-1482

Will do that.

  is this statement still true?
  CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when 
  using this classifier.

The very first call to the parse filter takes a bit more time because the 
training is done and the model is created then. So the timeout should be a 
little higher. Subsequent calls do not take much time.

  afaics, the way the model is generated, stored and loaded needs a review:
  - it should be read/generated once and then cached in memory,
  - writing the model to disk is likely to become painful in distributed 
    mode with concurrent tasks.

The model is created during the parsing of the first fetched page of the very 
first parse job. After that, the filter checks whether the model file is 
already present. Currently the model file is read each time the classify() 
function is called; I will change that and keep the model in memory for the 
duration of a single parse job.

  cosmetics:
  - exceptions should be logged via 
    LOG.error(StringUtils.stringifyException(e)) so that they do not get lost 
    somewhere in stdout/stderr, as happens with e.printStackTrace()
  - code formatting, see [1]

Will do that.




[jira] [Comment Edited] (NUTCH-2047) Improvements to the relevance scoring plugin

2015-06-24 Thread Sujen Shah (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600327#comment-14600327 ]

Sujen Shah edited comment on NUTCH-2047 at 6/24/15 11:09 PM:
-

This file is a dump of the top 1000 URLs. 
The model file contained information related to robotics from a Wikipedia 
article, and the seed list was the homepage of CMU's Robotics Institute. 

The top few URLs are marked with the same score because most of them are 
unfetched and have inherited their score from their parent URL. 





[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread asitang
GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/35

NUTCH-2038



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/35.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #35


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-18T15:09:30Z

final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-19T20:13:34Z

Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:45:50Z

commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:46:46Z

commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:55:22Z

commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:58:12Z

commit for 3.2 patch of NUTCH-2038




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread asitang
Github user asitang closed the pull request at:

https://github.com/apache/nutch/pull/34




[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165265
  
--- Diff: conf/nutch-default.xml ---
@@ -1208,6 +1208,28 @@
 </property>
 
 <property>
+  <name>htmlparsefilter.naivebayes.trainfile</name>
+  <value></value>
+  <description>Set the name of the file to be used for Naive Bayes training. The format will be: 
+Each line contains two tab seperted parts
+There are two columns/parts:
+1. 1 or 0, 1 for relevent and 0 for irrelevent document.
+3. Text (text that will be used for training)
+
+Each row will be considered a new document for the classifier.
+CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this classifier.
+
+  </description>
+</property>
+
+<property>
+  <name>htmlparsefilter.naivebayes.wordlist</name>
+  <value></value>
+  <description>Put the name of the file you want to be used as a list of important words to be matched in the url for the model filter. The format should be one word per line.
--- End diff --

can you insert some line breaks at like 80 chars so it doesn't run off the 
screen on this? Thanks @asitang 
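
The training-file format described in the diff above (a 1 or 0 label, a tab, 
then the text, one document per line) could be read along these lines. This is 
a sketch under assumptions, not the plugin's parser: the class TrainFileSketch 
and its Example type are hypothetical, and silently skipping malformed lines is 
a choice made here for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for "label<TAB>text" training lines.
class TrainFileSketch {

  // One training document: its relevance label and its text.
  static final class Example {
    final boolean relevant;
    final String text;

    Example(boolean relevant, String text) {
      this.relevant = relevant;
      this.text = text;
    }
  }

  static List<Example> parse(List<String> lines) {
    List<Example> examples = new ArrayList<>();
    for (String line : lines) {
      // Split on the first tab only, so the text itself may contain tabs.
      String[] parts = line.split("\t", 2);
      if (parts.length != 2) {
        continue; // assumption: skip lines without a tab
      }
      String label = parts[0].trim();
      if (!label.equals("1") && !label.equals("0")) {
        continue; // assumption: skip lines with an unknown label
      }
      examples.add(new Example(label.equals("1"), parts[1]));
    }
    return examples;
  }
}
```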




[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165338
  
--- Diff: ivy/ivy.xml ---
@@ -78,7 +78,11 @@
 <dependency org="org.apache.cxf" name="cxf-rt-transports-http-jetty" rev="3.0.4"/>
 <dependency org="com.fasterxml.jackson.core" name="jackson-databind" rev="2.5.1" /> 
 <dependency org="com.fasterxml.jackson.dataformat" name="jackson-dataformat-cbor" rev="2.5.1" />
-<dependency org="com.fasterxml.jackson.jaxrs" name="jackson-jaxrs-json-provider" rev="2.5.1" />
+<dependency org="com.fasterxml.jackson.jaxrs" name="jackson-jaxrs-json-provider" rev="2.5.1" />
--- End diff --

extraneous.




[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165299
  
--- Diff: conf/nutch-default.xml ---
@@ -1258,6 +1280,7 @@
 
 <!-- urlfilter plugin properties -->
 
+
--- End diff --

extraneous not needed.




[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599646#comment-14599646
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165338
  
--- Diff: ivy/ivy.xml ---
@@ -78,7 +78,11 @@
 <dependency org="org.apache.cxf" name="cxf-rt-transports-http-jetty" rev="3.0.4"/>
 <dependency org="com.fasterxml.jackson.core" name="jackson-databind" rev="2.5.1" />
 <dependency org="com.fasterxml.jackson.dataformat" name="jackson-dataformat-cbor" rev="2.5.1" />
-<dependency org="com.fasterxml.jackson.jaxrs" name="jackson-jaxrs-json-provider" rev="2.5.1" />
+<dependency org="com.fasterxml.jackson.jaxrs" name="jackson-jaxrs-json-provider" rev="2.5.1" />
--- End diff --

extraneous.


[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599647#comment-14599647
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165388
  
--- Diff: ivy/ivy.xml ---
@@ -78,7 +78,11 @@
 <dependency org="org.apache.cxf" name="cxf-rt-transports-http-jetty" rev="3.0.4"/>
 <dependency org="com.fasterxml.jackson.core" name="jackson-databind" rev="2.5.1" />
 <dependency org="com.fasterxml.jackson.dataformat" name="jackson-dataformat-cbor" rev="2.5.1" />
-<dependency org="com.fasterxml.jackson.jaxrs" name="jackson-jaxrs-json-provider" rev="2.5.1" />
+<dependency org="com.fasterxml.jackson.jaxrs" name="jackson-jaxrs-json-provider" rev="2.5.1" />
+<dependency org="org.apache.mahout" name="mahout-math" rev="0.8" />
--- End diff --

These dependencies should go into the htmlparsefilter-naivebayes/ivy.xml, 
not the main ivy/ivy.xml. I mentioned this last time.


[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599655#comment-14599655
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165500
  
--- Diff: 
src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java
 ---
@@ -0,0 +1,214 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.htmlparsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts see the
+ * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevent
+ * it gives the link a second chance if it contains any of the words from the
+ * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
+ * parser.timeout to -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+      .getLogger(NaiveBayesHTMLParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public NaiveBayesHTMLParseFilter() {
+
+  }
+
+  public boolean filterParse(String text) {
+
+    try {
+      return classify(text);
+    } catch (IOException e) {
+      // TODO Auto-generated catch block
--- End diff --

remove


[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599656#comment-14599656
 ] 

Asitang Mishra commented on NUTCH-2038:
---

I still have to move the external Mahout and Lucene jars into the plugin; I 
will do that.
I have changed the plugin's functionality a bit this time and made it simpler 
to use. The trainfile now has just two rows: 1 or 0, and text (see the patch 
for more detail).
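
As a rough illustration of a label-plus-text training entry, the sketch below parses one line into a relevance flag and its text. Everything specific here is an assumption, since the comment above defers to the patch for the actual layout: the tab separator, the class name TrainingRow, and the method parse are all hypothetical.

```java
// Hypothetical value class for illustration; not part of the NUTCH-2038 patch.
final class TrainingRow {

  final boolean relevant; // true for label "1", false for label "0"
  final String text;      // the example text that follows the label

  TrainingRow(boolean relevant, String text) {
    this.relevant = relevant;
    this.text = text;
  }

  // Parses "label<TAB>text". The tab separator is an assumption made for
  // this sketch; the real trainfile layout is defined by the patch.
  static TrainingRow parse(String line) {
    int tab = line.indexOf('\t');
    if (tab < 0) {
      throw new IllegalArgumentException("expected label<TAB>text: " + line);
    }
    String label = line.substring(0, tab).trim();
    if (!label.equals("1") && !label.equals("0")) {
      throw new IllegalArgumentException("label must be 1 or 0: " + label);
    }
    return new TrainingRow(label.equals("1"), line.substring(tab + 1).trim());
  }

  public static void main(String[] args) {
    TrainingRow row = TrainingRow.parse("1\tApache Nutch is a web crawler");
    System.out.println(row.relevant + " / " + row.text);
  }
}
```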


[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165528
  
--- Diff: 
src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java
 ---
@@ -0,0 +1,214 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.htmlparsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts see the
+ * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevent
+ * it gives the link a second chance if it contains any of the words from the
+ * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
+ * parser.timeout to -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+      .getLogger(NaiveBayesHTMLParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public NaiveBayesHTMLParseFilter() {
+
+  }
+
+  public boolean filterParse(String text) {
+
+    try {
+      return classify(text);
+    } catch (IOException e) {
+      // TODO Auto-generated catch block
+      LOG.error("Error occured while classifying:: " + text);
--- End diff --

Maybe print e's stack trace too?
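
One way to act on this suggestion is to hand the exception itself to the logger instead of concatenating only the text. With the SLF4J logger used in the diff, that would be the two-argument error(String, Throwable) overload. The sketch below shows the same idea with java.util.logging so that it runs without extra jars; the class name LogWithCause and the always-failing classify stand-in are hypothetical and not taken from the patch.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical class for illustration; not part of the NUTCH-2038 patch.
final class LogWithCause {

  private static final Logger LOG =
      Logger.getLogger(LogWithCause.class.getName());

  // Stand-in for the classifier call in the patch; it always fails here
  // so that the catch block below is exercised.
  static boolean classify(String text) throws IOException {
    throw new IOException("model not readable");
  }

  static boolean filterParse(String text) {
    try {
      return classify(text);
    } catch (IOException e) {
      // Passing e as the last argument attaches the exception to the log
      // record, so handlers can print its stack trace with the message.
      LOG.log(Level.SEVERE, "Error occurred while classifying: " + text, e);
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(filterParse("some page text"));
  }
}
```

Logging the throwable rather than only the input text keeps the cause of the failure visible in the task logs, which matters when the classifier fails for reasons unrelated to the page content.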


[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165623
  
--- Diff: 
src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java
 ---
@@ -0,0 +1,214 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.htmlparsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts see the
+ * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevent
+ * it gives the link a second chance if it contains any of the words from the
+ * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
+ * parser.timeout to -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+      .getLogger(NaiveBayesHTMLParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public NaiveBayesHTMLParseFilter() {
+
+  }
+
+  public boolean filterParse(String text) {
+
+    try {
+      return classify(text);
+    } catch (IOException e) {
+      // TODO Auto-generated catch block
+      LOG.error("Error occured while classifying:: " + text);
+
+    }
+
+    return false;
+  }
+
+  public boolean filterUrl(String url) {
+
+    return containsWord(url, wordlist);
+
+  }
+
+  public boolean classify(String text) throws IOException {
+
+    // if classified as relevent 1 then return true
+    if (NaiveBayesClassifier.classify(text).equals("1"))
+      return true;
+    return false;
+  }
+
+  public void train() throws Exception {
+    // check if the model file exists, if it does then don't train
+    if (!FileSystem.get(conf).exists(new Path("model"))) {
+      LOG.info("Training the Naive Bayes Model");
+      NaiveBayesClassifier.createModel(inputFilePath);
+    } else {
+      LOG.info("Model already exists. Skipping training.");
+    }
+  }
+
+  public boolean containsWord(String url, ArrayList<String> wordlist) {
+    for (String word : wordlist) {
+      if (url.contains(word)) {
+        return true;
+      }
+    }
+
+    return false;
+  }
+
+  public void setConf(Configuration conf) {
+    this.conf = conf;
+    inputFilePath = conf.get(TRAINFILE_MODELFILTER);
+    dictionaryFile = conf.get(DICTFILE_MODELFILTER);
+    if (inputFilePath == null || inputFilePath.trim().length() == 0
+        || dictionaryFile == null || dictionaryFile.trim().length() == 0) {
+  String message = Model URLFilter: 

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165581
  
--- Diff: 
src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java
 ---
@@ -0,0 +1,214 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.htmlparsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts see the
+ * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevent
+ * it gives the link a second chance if it contains any of the words from the
+ * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
+ * parser.timeout to -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+      .getLogger(NaiveBayesHTMLParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public NaiveBayesHTMLParseFilter() {
+
+  }
+
+  public boolean filterParse(String text) {
+
+    try {
+      return classify(text);
+    } catch (IOException e) {
+      // TODO Auto-generated catch block
+      LOG.error("Error occured while classifying:: " + text);
+
+    }
+
+    return false;
+  }
+
+  public boolean filterUrl(String url) {
+
+    return containsWord(url, wordlist);
+
+  }
+
+  public boolean classify(String text) throws IOException {
+
+    // if classified as relevent 1 then return true
+    if (NaiveBayesClassifier.classify(text).equals("1"))
+      return true;
+    return false;
+  }
+
+  public void train() throws Exception {
+    // check if the model file exists, if it does then don't train
+    if (!FileSystem.get(conf).exists(new Path("model"))) {
+      LOG.info("Training the Naive Bayes Model");
+      NaiveBayesClassifier.createModel(inputFilePath);
+    } else {
+      LOG.info("Model already exists. Skipping training.");
+    }
+  }
+
+  public boolean containsWord(String url, ArrayList<String> wordlist) {
+    for (String word : wordlist) {
+      if (url.contains(word)) {
+        return true;
+      }
+    }
+
+    return false;
+  }
+
+  public void setConf(Configuration conf) {
+    this.conf = conf;
+    inputFilePath = conf.get(TRAINFILE_MODELFILTER);
+    dictionaryFile = conf.get(DICTFILE_MODELFILTER);
+    if (inputFilePath == null || inputFilePath.trim().length() == 0
+        || dictionaryFile == null || dictionaryFile.trim().length() == 0) {
+  String message = Model URLFilter: 

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599660#comment-14599660
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165623
  
--- Diff: 
src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java
 ---
@@ -0,0 +1,214 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.htmlparsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts see the
+ * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevent
+ * it gives the link a second chance if it contains any of the words from the
+ * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
+ * parser.timeout to -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+      .getLogger(NaiveBayesHTMLParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public NaiveBayesHTMLParseFilter() {
+
+  }
+
+  public boolean filterParse(String text) {
+
+    try {
+      return classify(text);
+    } catch (IOException e) {
+      // TODO Auto-generated catch block
+      LOG.error("Error occured while classifying:: " + text);
+
+    }
+
+    return false;
+  }
+
+  public boolean filterUrl(String url) {
+
+    return containsWord(url, wordlist);
+
+  }
+
+  public boolean classify(String text) throws IOException {
+
+    // if classified as relevent 1 then return true
+    if (NaiveBayesClassifier.classify(text).equals("1"))
+      return true;
+    return false;
+  }
+
+  public void train() throws Exception {
+    // check if the model file exists, if it does then don't train
+    if (!FileSystem.get(conf).exists(new Path("model"))) {
+      LOG.info("Training the Naive Bayes Model");
+      NaiveBayesClassifier.createModel(inputFilePath);
+    } else {
+      LOG.info("Model already exists. Skipping training.");
+    }
+  }
+
+  public boolean containsWord(String url, ArrayList<String> wordlist) {
+    for (String word : wordlist) {
+      if (url.contains(word)) {
+        return true;
+      }
+    }
+
+    return false;
+  }
+
+  public void setConf(Configuration conf) {
+    this.conf = conf;
+    inputFilePath = conf.get(TRAINFILE_MODELFILTER);
+ 

[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165638
  
--- Diff: 
src/plugin/htmlparsefilter-naivebayes/src/java/org/apache/nutch/htmlparsefilter/naivebayes/NaiveBayesHTMLParseFilter.java
 ---
@@ -0,0 +1,214 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.htmlparsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts see the
+ * description of htmlparsefilter.naivebayes.trainfile) and if found irrelevent
+ * it gives the link a second chance if it contains any of the words from the
+ * list given in htmlparsefilter.naivebayes.wordlist. CAUTION: Set the
+ * parser.timeout to -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesHTMLParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+      .getLogger(NaiveBayesHTMLParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "htmlparsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "htmlparsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public NaiveBayesHTMLParseFilter() {
+
+  }
+
+  public boolean filterParse(String text) {
+
+    try {
+      return classify(text);
+    } catch (IOException e) {
+      // TODO Auto-generated catch block
+      LOG.error("Error occured while classifying:: " + text);
+
+    }
+
+    return false;
+  }
+
+  public boolean filterUrl(String url) {
+
+    return containsWord(url, wordlist);
+
+  }
+
+  public boolean classify(String text) throws IOException {
+
+    // if classified as relevent 1 then return true
+    if (NaiveBayesClassifier.classify(text).equals("1"))
+      return true;
+    return false;
+  }
+
+  public void train() throws Exception {
+    // check if the model file exists, if it does then don't train
+    if (!FileSystem.get(conf).exists(new Path("model"))) {
+      LOG.info("Training the Naive Bayes Model");
+      NaiveBayesClassifier.createModel(inputFilePath);
+    } else {
+      LOG.info("Model already exists. Skipping training.");
+    }
+  }
+
+  public boolean containsWord(String url, ArrayList<String> wordlist) {
+    for (String word : wordlist) {
+      if (url.contains(word)) {
+        return true;
+      }
+    }
+
+    return false;
+  }
+
+  public void setConf(Configuration conf) {
+    this.conf = conf;
+    inputFilePath = conf.get(TRAINFILE_MODELFILTER);
+    dictionaryFile = conf.get(DICTFILE_MODELFILTER);
+    if (inputFilePath == null || inputFilePath.trim().length() == 0
+        || dictionaryFile == null || dictionaryFile.trim().length() == 0) {
+      String message = "Model URLFilter: 

[jira] [Created] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Luis Lopez (JIRA)
Luis Lopez created NUTCH-2046:
-

 Summary: The crawl script should be able to skip an initial injection.
 Key: NUTCH-2046
 URL: https://issues.apache.org/jira/browse/NUTCH-2046
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb, injector
Affects Versions: 1.10
Reporter: Luis Lopez


When our crawl gets really big, a new injection takes considerable time because 
it updates the crawldb. The crawl script should be able to skip the injection 
and go directly to the generate call.





[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165388
  
--- Diff: ivy/ivy.xml ---
@@ -78,7 +78,11 @@
     <dependency org="org.apache.cxf" name="cxf-rt-transports-http-jetty" rev="3.0.4"/>
     <dependency org="com.fasterxml.jackson.core" name="jackson-databind" rev="2.5.1" />
     <dependency org="com.fasterxml.jackson.dataformat" name="jackson-dataformat-cbor" rev="2.5.1" />
-    <dependency org="com.fasterxml.jackson.jaxrs" name="jackson-jaxrs-json-provider" rev="2.5.1" />
+    <dependency org="com.fasterxml.jackson.jaxrs" name="jackson-jaxrs-json-provider" rev="2.5.1" />
+    <dependency org="org.apache.mahout" name="mahout-math" rev="0.8" />
--- End diff --

These dependencies should go into htmlparsefilter-naivebayes/ivy.xml, not the 
main one. I mentioned this last time.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599648#comment-14599648
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/35#discussion_r33165405
  
--- Diff: ivy/ivy.xml ---
@@ -100,6 +104,8 @@
    <exclude module="jmxtools" />
    <exclude module="jms" />
    <exclude module="jmxri" />
+   <exclude org="com.thoughtworks.xstream"/>
--- End diff --

This should also go into the plugin's ivy.xml.




[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599610#comment-14599610
 ] 

Sebastian Nagel commented on NUTCH-2038:


Jaccard similarity sounds more like a scoring metric. Of course, it can be 
transformed to a boolean accept/reject by a threshold.

Btw., a plugin can implement multiple interfaces and, e.g., calculate a score 
in the parse filter, cache it in the parse meta data, and use it in 
distributeScoreToOutlinks.
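
The thresholding idea above can be sketched as follows. This is only a minimal 
illustration: the class name JaccardThresholdFilter, the tokenization rule and 
the threshold values are assumptions made for this example and are not part of 
Nutch or of the patch under review.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not a Nutch class): Jaccard score turned into accept/reject. */
final class JaccardThresholdFilter {

  private final Set<String> referenceTerms;
  private final double threshold;

  JaccardThresholdFilter(String referenceText, double threshold) {
    this.referenceTerms = tokenize(referenceText);
    this.threshold = threshold;
  }

  /** Lower-cases the text and splits it on runs of non-alphanumeric characters. */
  static Set<String> tokenize(String text) {
    Set<String> terms = new HashSet<>();
    for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
      if (!term.isEmpty()) {
        terms.add(term);
      }
    }
    return terms;
  }

  /** Size of the intersection divided by size of the union; 0 if both sets are empty. */
  static double jaccard(Set<String> a, Set<String> b) {
    if (a.isEmpty() && b.isEmpty()) {
      return 0.0;
    }
    Set<String> intersection = new HashSet<>(a);
    intersection.retainAll(b);
    int unionSize = a.size() + b.size() - intersection.size();
    return (double) intersection.size() / unionSize;
  }

  /** The raw similarity score, usable as a scoring metric. */
  double score(String text) {
    return jaccard(referenceTerms, tokenize(text));
  }

  /** The same score reduced to a boolean by comparing it with the threshold. */
  boolean accept(String text) {
    return score(text) >= threshold;
  }
}
```

The score() method is what a scoring plugin would use directly, while accept() 
is the boolean form a filter would need. Which threshold is sensible depends on 
the crawl and would have to be tuned.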



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599622#comment-14599622
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user asitang closed the pull request at:

https://github.com/apache/nutch/pull/34




[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599840#comment-14599840
 ] 

Julien Nioche commented on NUTCH-2046:
--

Re the script: what about a positive parameter instead of a negative one (as we 
do for indexing with -i)? We could have -s followed by the path to the seeds.

 The crawl script should be able to skip an initial injection.
 -

 Key: NUTCH-2046
 URL: https://issues.apache.org/jira/browse/NUTCH-2046
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb, injector
Affects Versions: 1.10
Reporter: Luis Lopez
  Labels: crawl, injection
 Fix For: 1.11


 When our crawl gets really big a new injection takes considerable time as it 
 updates crawldb, the crawl script should be able to skip the injection and go 
 directly to the generate call.





[jira] [Updated] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Luis Lopez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Lopez updated NUTCH-2046:
--
Attachment: crawl.patch

The crawl script skips the initial injection if we use -skipInject instead of 
the seeds path.



[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Luis Lopez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599933#comment-14599933
 ] 

Luis Lopez commented on NUTCH-2046:
---

I used -skipInject in place of the actual seeds path simply because it's 
simpler. Also, for cases like this the parameter is usually a negative one, 
isn't it? For example -noFilter, -noParsing, etc.



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599748#comment-14599748
 ] 

Chris A. Mattmann commented on NUTCH-2038:
--

Yeah, you got it, Seb: we can do accept/reject by threshold. See 
http://github.com/chrismattmann/tika-img-similarity/, where I have been doing 
this for a while, and my search engines class 
http://sunset.usc.edu/classes/cs572_2015/, specifically HW1, where I had the 
students develop something similar. The idea would be to use it to find similar 
objects based on their features, and to accept those that fall within a 
threshold. It makes no difference whether we use Jaccard, Cosine, Edit 
Distance, or whatever; they are all simply distance measurements. They can be 
used in search engine deduplication, in scoring, in URL filtering, and in a 
number of other places. Anyway, I'll try and get something up soon. 

In the meantime I am +1 for Asitang's latest PR, modulo the stylistic updates I 
suggested. 

Thanks for the great feedback as usual.
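
To illustrate the point that the choice of measure is secondary to the 
thresholding, here is a minimal sketch using cosine similarity over term 
counts. The class name CosineSimilarity, the tokenization rule and the 
threshold parameter are assumptions made for this example; none of this is 
taken from the patch or from the projects linked above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not a Nutch class): cosine similarity with a threshold. */
final class CosineSimilarity {

  /** Counts how often each lower-cased alphanumeric term occurs in the text. */
  static Map<String, Integer> termFrequencies(String text) {
    Map<String, Integer> counts = new HashMap<>();
    for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
      if (!term.isEmpty()) {
        counts.merge(term, 1, Integer::sum);
      }
    }
    return counts;
  }

  /** Dot product divided by the product of the vector lengths; 0 for an empty vector. */
  static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0.0;
    for (Map.Entry<String, Integer> entry : a.entrySet()) {
      Integer other = b.get(entry.getKey());
      if (other != null) {
        dot += (double) entry.getValue() * other;
      }
    }
    double normA = norm(a);
    double normB = norm(b);
    if (normA == 0.0 || normB == 0.0) {
      return 0.0;
    }
    return dot / (normA * normB);
  }

  /** Euclidean length of a term-count vector. */
  private static double norm(Map<String, Integer> vector) {
    double sumOfSquares = 0.0;
    for (int count : vector.values()) {
      sumOfSquares += (double) count * count;
    }
    return Math.sqrt(sumOfSquares);
  }

  /** Accepts the candidate if it is at least as similar as the threshold requires. */
  static boolean accept(String reference, String candidate, double threshold) {
    return cosine(termFrequencies(reference), termFrequencies(candidate)) >= threshold;
  }
}
```

Swapping in another measure would only change cosine(); the accept() step stays 
the same, which is the sense in which the measures are interchangeable here.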



[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread asitang
GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/36

NUTCH-2038

Made the aesthetic changes suggested by Chris Mattmann. Removed the 
dependencies from the main ivy.xml and added them to the plugin's ivy.xml. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/36.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #36


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-18T15:09:30Z

final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-19T20:13:34Z

Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:45:50Z

commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:46:46Z

commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:55:22Z

commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:58:12Z

commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T17:31:09Z

commit for 3.3 patch of NUTCH-2038






[jira] [Updated] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-24 Thread Asitang Mishra (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asitang Mishra updated NUTCH-2038:
--
Description: 
An HTML parse filter that filters the outlinks in two stages. 
One: classify the parse text and decide whether the parent page is relevant. If 
it is relevant, don't filter the outlinks. If it is irrelevant, go through each 
outlink and check whether the URL contains any of the important words from a 
list. If it does, let it pass.


  was:A url filter that will filter out the urls (after the parsing stage,  
will keep only those urls that contain some hot words provided again in a 
list.) from that pages that are classified irrelevant by the classifier.

Summary: Naive Bayes classifier based html Parse filter (for filtering 
outlinks)  (was: Naive Bayes classifier based url filter)
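
The two-stage filtering in the updated description can be sketched as follows. 
This is a simplified illustration under stated assumptions, not the code from 
the pull request: the class name TwoStageOutlinkFilter is made up, the 
classifier is passed in as a plain Predicate instead of the Naive Bayes model, 
and outlinks are represented as URL strings rather than Nutch Outlink objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch (not a Nutch class) of the two-stage outlink filter. */
final class TwoStageOutlinkFilter {

  private final Predicate<String> pageIsRelevant; // stand-in for the classifier
  private final List<String> wordlist;            // the "important words"

  TwoStageOutlinkFilter(Predicate<String> pageIsRelevant, List<String> wordlist) {
    this.pageIsRelevant = pageIsRelevant;
    this.wordlist = new ArrayList<>(wordlist);
  }

  /** Returns the outlinks that survive both stages. */
  List<String> filter(String parseText, List<String> outlinks) {
    // Stage one: a relevant parent page keeps all of its outlinks.
    if (pageIsRelevant.test(parseText)) {
      return new ArrayList<>(outlinks);
    }
    // Stage two: for an irrelevant parent, keep only the outlinks whose
    // URL contains at least one word from the list.
    List<String> kept = new ArrayList<>();
    for (String url : outlinks) {
      if (containsWord(url)) {
        kept.add(url);
      }
    }
    return kept;
  }

  private boolean containsWord(String url) {
    for (String word : wordlist) {
      if (url.contains(word)) {
        return true;
      }
    }
    return false;
  }
}
```

Passing the classifier in as a Predicate keeps the sketch independent of any 
particular model; in the plugin itself that role is played by the Naive Bayes 
classifier trained from the configured training file.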



[jira] [Updated] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-24 Thread Asitang Mishra (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asitang Mishra updated NUTCH-2038:
--
Description: 
An HTML parse filter that filters the outlinks in two stages. 
First, classify the parse text and decide whether the parent page is relevant. 
If it is relevant, don't filter the outlinks. If it is irrelevant, go through 
each outlink and check whether the URL contains any of the important words from 
a list. If it does, let it pass.


  was:
An HTML parse filter that filters the outlinks in two stages. 
One: classify the parse text and decide whether the parent page is relevant. If 
it is relevant, don't filter the outlinks. If it is irrelevant, go through each 
outlink and check whether the URL contains any of the important words from a 
list. If it does, let it pass.





[jira] [Updated] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2046:

Fix Version/s: 1.11

 The crawl script should be able to skip an initial injection.
 -

 Key: NUTCH-2046
 URL: https://issues.apache.org/jira/browse/NUTCH-2046
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb, injector
Affects Versions: 1.10
Reporter: Luis Lopez
  Labels: crawl, injection
 Fix For: 1.11


 When our crawl gets really big, a new injection takes considerable time 
 because it updates the crawldb. The crawl script should be able to skip the 
 injection and go directly to the generate call.





[GitHub] nutch pull request: NUTCH-2038

2015-06-24 Thread asitang
Github user asitang closed the pull request at:

https://github.com/apache/nutch/pull/35


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599795#comment-14599795
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user asitang closed the pull request at:

https://github.com/apache/nutch/pull/35


 Naive Bayes classifier based url filter
 ---

 Key: NUTCH-2038
 URL: https://issues.apache.org/jira/browse/NUTCH-2038
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher, injector, parser
Reporter: Asitang Mishra
Assignee: Chris A. Mattmann
  Labels: memex, nutch
 Fix For: 1.11


 A URL filter that, after the parsing stage, filters out the URLs from pages 
 that the classifier marks as irrelevant, keeping only those URLs that 
 contain one of the hot words provided in a list.





[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599750#comment-14599750
 ] 

Lewis John McGibbney commented on NUTCH-2046:
-

Hi [~betolink], this is a nice issue. I think we could easily add a 
[-skipInject] flag to the crawl script. 
Are you able to provide a patch?
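
To make the idea concrete, here is a small sketch of how such a flag could
change the sequence of steps. It is not the crawl script itself (which is a
shell script and does more than this), only a simplified model: the function
name plan_steps and the reduced step list are assumptions for illustration,
and -skipInject is the flag name proposed above, not an existing option.

```python
import argparse
from typing import List


def plan_steps(argv: List[str]) -> List[str]:
    """Return the ordered crawl steps for the given arguments (sketch)."""
    parser = argparse.ArgumentParser(description="crawl step planner (sketch)")
    # Proposed flag: when set, the initial inject is skipped entirely.
    parser.add_argument("-skipInject", action="store_true",
                        help="skip the initial inject and start at generate")
    parser.add_argument("rounds", type=int, help="number of crawl rounds")
    args = parser.parse_args(argv)

    # The inject step runs once, before the first round, unless skipped.
    steps = [] if args.skipInject else ["inject"]
    for _ in range(args.rounds):
        steps += ["generate", "fetch", "parse", "updatedb"]
    return steps


with_inject = plan_steps(["1"])
without_inject = plan_steps(["-skipInject", "1"])
```

With the flag set, the plan starts directly at generate, which is the
behaviour the issue asks for.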






[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599798#comment-14599798
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/36

NUTCH-2038

Made aesthetic changes suggested by Chris Mattmann. Removed the dependencies 
from the main ivy.xml and added them to the plugin's ivy.xml. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/36.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #36


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-18T15:09:30Z

final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-19T20:13:34Z

Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:45:50Z

commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:46:46Z

commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:55:22Z

commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:58:12Z

commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T17:31:09Z

commit for 3.3 patch of NUTCH-2038









Github Spam

2015-06-24 Thread Lewis John Mcgibbney
Hi Folks,
The Github spam is killing me.
Seems to go to - nu...@noreply.github.com
Basically every commit someone pushes (there have been loads recently) is
sending me a new email over and above the digest emails I get.
I am sure this must be pissing other people off. Is there a better way for
us to work this mail?
Thanks
Lewis

-- 
*Lewis*


RE: Github Spam

2015-06-24 Thread Markus Jelsma
Well, either disable it or have people send fewer requests. On the other hand, 
adding patches and Jira comments also gets you e-mail.




RE: Github Spam

2015-06-24 Thread Mattmann, Chris A (3980)
Hey Lewis,

Yeah, to be honest, this is no different than ReviewBoard, JIRA, etc.
At least it's not as bad as Spark :/ I did a review of Asitang's patch
and it took each one of my comments and sent a mail. B/c of Apache's
requirement that things happen on the list, we have to have the mails
replicated from Github on all interactions. The thing is though, maybe we
should create a nutch-github@ email address, and then send mails there?
Would that help? Or nutch-notifications@a.o ? Then JIRA, Github, etc.,
could go there?

Others would have to be in support of this too.

I'm +0 on either. You know all my email problems so this is just noise really
lol in a sea of other noise.

Cheers,
Chris






[jira] [Commented] (NUTCH-1504) Pluggable url partitioner

2015-06-24 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599958#comment-14599958
 ] 

Michael Joyce commented on NUTCH-1504:
--

This is great stuff [~lewismc], we definitely need to get this in there. Would 
help us out a great deal.

 Pluggable url partitioner
 -

 Key: NUTCH-1504
 URL: https://issues.apache.org/jira/browse/NUTCH-1504
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Affects Versions: 1.6
Reporter: Sourajit Basak
Assignee: Lewis John McGibbney
 Fix For: 1.11

 Attachments: custom.partitioner.patch


 At present, the URL partition logic is hard-wired inside Nutch core. It 
 should be pluggable, like FetchSchedule, and customizable via nutch-site.xml.
 There might be use cases where a single domain needs to be partitioned by some 
 custom logic. The existing UrlPartitioner cannot handle such cases, 
 hence the requirement.
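
 As a rough illustration of what "pluggable" could mean here, the sketch
 below picks a partition function by name from a configuration mapping. It
 is not based on the attached patch: the registry, the function names and
 the property key "example.partitioner" are all hypothetical, and a plain
 dict stands in for nutch-site.xml.

```python
import zlib
from typing import Callable, Dict
from urllib.parse import urlsplit

# A partitioner maps (url, number of partitions) to a partition index.
Partitioner = Callable[[str, int], int]


def by_host(url: str, num_partitions: int) -> int:
    # Default behaviour in this sketch: all URLs of a host share a partition.
    host = urlsplit(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_partitions


def by_host_and_first_path_segment(url: str, num_partitions: int) -> int:
    # Custom logic: a single domain can be split across partitions
    # according to the first path segment of the URL.
    parts = urlsplit(url)
    first = parts.path.strip("/").split("/")[0]
    key = f"{parts.hostname or ''}/{first}"
    return zlib.crc32(key.encode("utf-8")) % num_partitions


REGISTRY: Dict[str, Partitioner] = {
    "host": by_host,
    "host+path": by_host_and_first_path_segment,
}


def load_partitioner(config: Dict[str, str]) -> Partitioner:
    # "example.partitioner" is a made-up key used only for this example.
    return REGISTRY[config.get("example.partitioner", "host")]


default_partitioner = load_partitioner({})
custom_partitioner = load_partitioner({"example.partitioner": "host+path"})
```

 The point of the design is that the calling code only sees the Partitioner
 signature, so swapping the strategy requires a configuration change rather
 than a change to core code.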





RE: Github Spam

2015-06-24 Thread Markus Jelsma
I am fine with getting rid of Github e-mail, but not Jira, Jenkins or other ASF 
infra stuff. The git requests are not in our svn format anyway. If someone is 
serious about their patch and wants it in the regular releases, then please be 
so polite as not to make it harder for us ;)
 
 


RE: Github Spam

2015-06-24 Thread Mattmann, Chris A (3980)
Sorry I wasn't clear. I'm *not* fine with getting rid of Github.
I was simply proposing that the mail spam be moved to a different
list. But to me, JIRA/SVN is no different than Github comments and
pull requests and so forth. To each their own :) The ASF fully supports
Git and Github integration though, and in a very nice way which works
with SVN. So just to be clear, I am in no way proposing that we move to
Git, etc., but I'm also not proposing that we don't accept Git pull requests.
I was just trying to help Lewis with his mail issues.

Cheers,
Chris

