How can Nutch crawl for specific words, not specific URLs, then get the structured data and store it in HBase?

2017-09-06 Thread Muhammad UMER
Hi All, I am new to using Apache Nutch to crawl some sites, and I want to filter and get content on the basis of words, not on the basis of URL. E.g. 1. I have to crawl those sites that contain words like 'shop' or 'product' in their contents (text). If these words do not exist, then do not crawl further

[jira] [Created] (NUTCH-2419) Domain blacklist URL filter does not respect command-line override for file

2017-09-06 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2419: Summary: Domain blacklist URL filter does not respect command-line override for file Key: NUTCH-2419 URL: https://issues.apache.org/jira/browse/NUTCH-2419 Project:

[jira] [Commented] (NUTCH-2375) Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce

2017-09-06 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154937#comment-16154937 ] ASF GitHub Bot commented on NUTCH-2375: --- Omkar20895 commented on issue #188: NUTCH-2375 Upgrade the

[jira] [Updated] (NUTCH-2419) Domain blacklist URL filter does not respect command-line override for file

2017-09-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2419: - Attachment: NUTCH-2419.patch Patch for trunk! > Domain blacklist URL filter does not respect

[jira] [Updated] (NUTCH-2417) Support for variable fetch delay via FreeGenerator

2017-09-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2417: - Attachment: NUTCH-2417.patch Patch for trunk! > Support for variable fetch delay via

[jira] [Commented] (NUTCH-2417) Support for variable fetch delay via FreeGenerator

2017-09-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155138#comment-16155138 ] Markus Jelsma commented on NUTCH-2417: -- No patch, wrong ticket! > Support for variable fetch delay

[jira] [Updated] (NUTCH-2417) Support for variable fetch delay via FreeGenerator

2017-09-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2417: - Attachment: (was: NUTCH-2417.patch) > Support for variable fetch delay via FreeGenerator >

[jira] [Commented] (NUTCH-2375) Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce

2017-09-06 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156020#comment-16156020 ] ASF GitHub Bot commented on NUTCH-2375: --- lewismc commented on issue #188: NUTCH-2375 Upgrade the

Request for Review

2017-09-06 Thread lewis john mcgibbney
Hi user@ and dev@, As part of the Nutch Google Summer of Code effort this year, Omkar Reddy and I have been working persistently throughout the summer months on the Hadoop MapReduce API upgrade e.g. NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce [0].