[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-578:

    Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

URL fetched with 403 is generated over and over again

    Key: NUTCH-578
    URL: https://issues.apache.org/jira/browse/NUTCH-578
    Project: Nutch
    Issue Type: Bug
    Components: generator
    Affects Versions: 1.0.0
    Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I have checked out the most recent version of the trunk as of Nov 20, 2007
    Reporter: Nathaniel Powell
    Assignee: Dennis Kubes
    Attachments: crawl-urlfilter.txt, NUTCH-578.patch, NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, nutch-site.xml, regex-normalize.xml, urls.txt

I have not changed the following parameter in the nutch-default.xml:

<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>The maximum number of times a url that has encountered recoverable errors is generated for fetch.</description>
</property>

However, there is a URL which is on the site that I'm crawling, www.teachertube.com, which keeps being generated over and over again for almost every segment (many more times than 3):

fetch of http://www.teachertube.com/images/ failed with: Http code=403, url=http://www.teachertube.com/images/

This is a bug, right?
Thanks.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
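The behaviour the reporter expects from db.fetch.retry.max can be sketched as a small model. This is an illustrative Python sketch, not Nutch's Java code: `UrlRecord`, `record_recoverable_failure`, `should_generate`, and `count_fetch_attempts` are hypothetical names, and the "unfetched"/"gone" states are simplified stand-ins for crawl database statuses. It assumes every fetch of the URL fails with a recoverable error, as in the 403 case reported here.

```python
from dataclasses import dataclass

# Mirrors the default db.fetch.retry.max value quoted in the report.
RETRY_MAX = 3


@dataclass
class UrlRecord:
    """Hypothetical stand-in for one crawl database entry."""
    url: str
    retries: int = 0
    status: str = "unfetched"  # simplified: "unfetched" or "gone"


def record_recoverable_failure(rec: UrlRecord, retry_max: int = RETRY_MAX) -> None:
    """Count one failed fetch and retire the URL once the cap is reached."""
    rec.retries += 1
    if rec.retries >= retry_max:
        rec.status = "gone"


def should_generate(rec: UrlRecord) -> bool:
    """Only URLs that are still 'unfetched' are eligible for a new segment."""
    return rec.status == "unfetched"


def count_fetch_attempts(rec: UrlRecord, cycles: int, retry_max: int = RETRY_MAX) -> int:
    """Run generate/fetch cycles in which every fetch fails; return the attempts made."""
    attempts = 0
    for _ in range(cycles):
        if not should_generate(rec):
            break
        attempts += 1
        record_recoverable_failure(rec, retry_max)
    return attempts


if __name__ == "__main__":
    rec = UrlRecord("http://www.teachertube.com/images/")
    print(count_fetch_attempts(rec, cycles=10), rec.status)
```

Under this model the URL is generated at most `RETRY_MAX` times, however many crawl cycles run. The report describes the opposite: the same URL appearing in almost every segment.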
[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again
Serykh Evgeniy updated NUTCH-578:

    Attachment: NUTCH-578_v4.patch

    Assignee: Dennis Kubes
    Fix For: 1.1
    Attachments: crawl-urlfilter.txt, NUTCH-578.patch, NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, nutch-site.xml, regex-normalize.xml, urls.txt
[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again
Serykh Evgeniy updated NUTCH-578:

    Attachment: (was: NUTCH-578_v4.patch)

    Assignee: Dennis Kubes
    Fix For: 1.1
    Attachments: crawl-urlfilter.txt, NUTCH-578.patch, NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, nutch-site.xml, regex-normalize.xml, urls.txt
[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again
Serykh Evgeniy updated NUTCH-578:

    Attachment: NUTCH-578_v4.patch

    Assignee: Dennis Kubes
    Fix For: 1.1
    Attachments: crawl-urlfilter.txt, NUTCH-578.patch, NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, nutch-site.xml, regex-normalize.xml, urls.txt
[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again
Dmitry Lihachev updated NUTCH-578:

    Attachment: NUTCH-578_v3.patch

The changes in CrawlDbReducer are already applied in trunk, so this patch only touches AbstractFetchSchedule.

    Assignee: Dennis Kubes
    Fix For: 1.1
    Attachments: crawl-urlfilter.txt, NUTCH-578.patch, NUTCH-578_v2.patch, NUTCH-578_v3.patch, nutch-site.xml, regex-normalize.xml, urls.txt
[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again
Emmanuel Joke updated NUTCH-578:

    Attachment: NUTCH-578.patch

I've got the same error for a page with an HTTP status code = 503. I found the issue in the CrawlDbReducer class: the fetch time was not refreshed correctly according to the DB status. My patch fixes this issue.

    Fix For: 1.0.0
    Attachments: crawl-urlfilter.txt, NUTCH-578.patch, nutch-site.xml, regex-normalize.xml, urls.txt
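The comment above attributes the problem to the fetch time not being refreshed after a failed fetch. The sketch below illustrates why that would cause repeated generation. It is a simplified Python model and not the attached patch: it assumes the generator picks every entry whose fetch time is not in the future, and the names `Entry`, `is_due`, `on_failed_fetch`, and `times_selected`, as well as the one-day retry delay, are assumptions made for illustration.

```python
from dataclasses import dataclass

HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS  # assumed retry delay, chosen for illustration only


@dataclass
class Entry:
    """Hypothetical crawl database entry with a scheduled fetch time (ms)."""
    fetch_time: int
    retries: int = 0


def is_due(entry: Entry, now: int) -> bool:
    """Assumed generator rule: select entries whose fetch time has arrived."""
    return entry.fetch_time <= now


def on_failed_fetch(entry: Entry, now: int, refresh_fetch_time: bool) -> None:
    """Record a failed fetch; optionally push the next attempt into the future."""
    entry.retries += 1
    if refresh_fetch_time:
        entry.fetch_time = now + DAY_MS


def times_selected(refresh_fetch_time: bool, cycles: int, cycle_ms: int) -> int:
    """Count how often one always-failing URL is selected over several cycles."""
    entry = Entry(fetch_time=0)
    selected = 0
    for i in range(cycles):
        now = i * cycle_ms
        if is_due(entry, now):
            selected += 1
            on_failed_fetch(entry, now, refresh_fetch_time)
    return selected


if __name__ == "__main__":
    # Six hourly crawl cycles against a URL that always fails.
    print(times_selected(False, cycles=6, cycle_ms=HOUR_MS))
    print(times_selected(True, cycles=6, cycle_ms=HOUR_MS))
```

With a stale fetch time the entry is due in every cycle, which matches the symptom in the report. Once the fetch time is moved forward on failure, the entry is selected only once within the same six hourly cycles.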
[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again
Emmanuel Joke updated NUTCH-578:

    Attachment: NUTCH-578_v2.patch

Actually, I just realised that setPageRetrySchedule in AbstractFetchSchedule was not correctly defined. This patch fixes that issue too.

    Fix For: 1.0.0
    Attachments: crawl-urlfilter.txt, NUTCH-578.patch, NUTCH-578_v2.patch, nutch-site.xml, regex-normalize.xml, urls.txt
[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again
Nathaniel Powell updated NUTCH-578:

    Attachment: nutch-site.xml

For your reference, this is how I have the nutch-site set up.

    Fix For: 1.0.0
    Attachments: nutch-site.xml, urls.txt
[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again
Nathaniel Powell updated NUTCH-578:

    Attachment: regex-normalize.xml

Another file I customized.

    Fix For: 1.0.0
    Attachments: crawl-urlfilter.txt, nutch-site.xml, regex-normalize.xml, urls.txt
[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again
Nathaniel Powell updated NUTCH-578:

    Attachment: crawl-urlfilter.txt

File I customized for this crawl.

    Fix For: 1.0.0
    Attachments: crawl-urlfilter.txt, nutch-site.xml, regex-normalize.xml, urls.txt
[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again
Nathaniel Powell updated NUTCH-578:

    Attachment: (was: nutch-site.xml)

    Fix For: 1.0.0
    Attachments: crawl-urlfilter.txt, regex-normalize.xml, urls.txt
[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again
Nathaniel Powell updated NUTCH-578:

    Attachment: nutch-site.xml

This is the nutch-site that I used to run the crawl.

    Fix For: 1.0.0
    Attachments: crawl-urlfilter.txt, nutch-site.xml, regex-normalize.xml, urls.txt
[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again
Nathaniel Powell updated NUTCH-578:

    Attachment: nutch-site.xml

Use wget to download this file. This is the nutch-site.xml that I used for this crawl.

    Fix For: 1.0.0
    Attachments: crawl-urlfilter.txt, nutch-site.xml, regex-normalize.xml, urls.txt
[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again
Nathaniel Powell updated NUTCH-578:

    Attachment: (was: nutch-site.xml)

    Fix For: 1.0.0
    Attachments: crawl-urlfilter.txt, nutch-site.xml, regex-normalize.xml, urls.txt