[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2012-10-29 Thread Roberto Gardenier (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486041#comment-13486041 ] Roberto Gardenier commented on NUTCH-585: I have compiled nutch 1.5.1 with the pro

[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2012-10-29 Thread Roberto Gardenier (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roberto Gardenier updated NUTCH-585: Comment: was deleted (was: I have compiled nutch 1.5.1 with the provided plugin and used the

[jira] [Created] (NUTCH-1482) Rename HTMLParseFilter

2012-10-29 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1482: Summary: Rename HTMLParseFilter Key: NUTCH-1482 URL: https://issues.apache.org/jira/browse/NUTCH-1482 Project: Nutch Issue Type: Task Components: p

[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter

2012-10-29 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486106#comment-13486106 ] Lewis John McGibbney commented on NUTCH-1482: Hi Julien. +1 for this

[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1245: Attachment: NUTCH-1245-578-TEST-1.patch JUnit test to catch this problem and NUTCH-578: a lar

[jira] [Assigned] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-10-29 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-1370: Assignee: Lewis John McGibbney > Expose exact number of urls injected @ru

Re: NUTCH-1370

2012-10-29 Thread Lewis John Mcgibbney
In addition to this, can someone please explain why [0] StorageUtils#getDataStoreClass is a private method in this class? The reason I ask is that it would be nice to be able to log which Gora class is being used to persist the injected URLs. Are there any security risks associated with making thi

Re: NUTCH-1370

2012-10-29 Thread Julien Nioche
Hi Lewis see comments below > > So I thought I'd take this one on tonight and see if I can resolve. > Basically, my high level question is as follows... > Is each line of a text file (seed file) which we attempt to inject > into the webdb considered as an individual map task? > no - each file in

[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486144#comment-13486144 ] Sebastian Nagel commented on NUTCH-1482: +1 > Rename HTMLParseFil

Re: NUTCH-1370

2012-10-29 Thread Lewis John Mcgibbney
Hi Julien, Thanks for the comments. Any additional ones regarding the accessibility of the getDataStoreClass? Thanks again Lewis On Mon, Oct 29, 2012 at 4:52 PM, Julien Nioche < lists.digitalpeb...@gmail.com> wrote: > Hi Lewis > > see comments below > >> >> So I thought I'd take this one on to

[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1245: Attachment: NUTCH-1245-1.patch FetchSchedule.setPageGoneSchedule is called exclusively for a

[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter

2012-10-29 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486155#comment-13486155 ] Markus Jelsma commented on NUTCH-1482: +0 I'm fine with such a change but this will

[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486290#comment-13486290 ] Sebastian Nagel commented on NUTCH-1482: Markus, you are right: I remember the API

[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1245: Attachment: NUTCH-1245-2.patch NUTCH-1245-578-TEST-2.patch Improved patches

[jira] [Commented] (NUTCH-578) URL fetched with 403 is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486484#comment-13486484 ] Sebastian Nagel commented on NUTCH-578: NUTCH-1245 provides a test to catch this pro

[jira] [Updated] (NUTCH-578) URL fetched with 403 is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-578: Attachment: NUTCH-578_v5.patch > URL fetched with 403 is generated over and over again