[jira] Commented: (NUTCH-824) Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.

2010-10-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925308#action_12925308 ] Markus Jelsma commented on NUTCH-824: You're correct, no patch has been submitted and

[jira] Updated: (NUTCH-824) Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.

2010-10-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-824: Affects Version/s: 2.0 1.3 1.2 Fix Version/s:

If-Modified-Since header with Nutch

2010-10-27 Thread Davide Cavalaglio
Hi, I have a problem with the If-Modified-Since option in Nutch. I want to crawl a web site every day, so I have set the db.fetch.interval.default property accordingly in nutch-site.xml. But I want to limit Nutch to fetching only pages that have changed, using the If-Modified-Since header. I found some
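For readers unfamiliar with the header discussed in this thread, here is a minimal sketch of an HTTP conditional GET. It is not Nutch code: the function names `if_modified_since_headers` and `needs_reprocessing` are hypothetical, and the sketch only illustrates the general protocol (send the last fetch time, treat a 304 response as "unchanged").

```python
from email.utils import formatdate


def if_modified_since_headers(last_fetch_epoch):
    """Return request headers for a conditional GET (hypothetical helper).

    last_fetch_epoch is the time of the previous fetch, in seconds since
    the Unix epoch. HTTP dates are always expressed in GMT.
    """
    stamp = formatdate(last_fetch_epoch, usegmt=True)
    return {"If-Modified-Since": stamp}


def needs_reprocessing(status_code):
    """A 304 Not Modified response means the stored copy is still current."""
    return status_code != 304


headers = if_modified_since_headers(0)
```

A server that honours the header answers 304 with no body when the page has not changed since the given time, so the crawler can skip parsing and indexing that page.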

[jira] Commented: (NUTCH-901) Make index-more plug-in configurable

2010-10-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925318#action_12925318 ] Markus Jelsma commented on NUTCH-901: Applied patch and added Mattmann's test to

[jira] Updated: (NUTCH-900) Confusion in nutch-default between http.content.limit and file.content.limit

2010-10-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-900: Attachment: NUTCH-900-1.3.patch This patch is for branch-1.3 and fixes a typo in http.content.limit

More real-time crawling

2010-10-27 Thread Ken Krugler
Hi Xiao, FWIR there is adaptive refetch interval support in Nutch currently - or are you looking for something different? Regards, -- Ken On Oct 27, 2010, at 1:42am, xiao yang wrote: I want to modify the crawler's schedule to make it more real-time. Some web pages are frequently
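As a rough illustration of what an adaptive refetch interval means in general, the sketch below shortens the interval when a page was found modified and lengthens it when it was not, within fixed bounds. This is an assumption-laden sketch, not Nutch's implementation: the function name, parameter names, and default values are all hypothetical.

```python
def next_interval(interval, modified, inc_rate=0.4, dec_rate=0.2,
                  min_interval=60.0, max_interval=30 * 24 * 3600.0):
    """Return the next refetch interval in seconds (hypothetical sketch).

    interval  -- current interval between fetches of one page
    modified  -- True if the last fetch found the page changed
    inc_rate  -- fractional growth applied when the page was unchanged
    dec_rate  -- fractional shrink applied when the page was changed
    """
    if modified:
        # Page changes often: come back sooner.
        interval *= (1.0 - dec_rate)
    else:
        # Page is stable: wait longer before the next fetch.
        interval *= (1.0 + inc_rate)
    # Keep the result inside the allowed range.
    return max(min_interval, min(max_interval, interval))
```

With these assumed defaults, a page that keeps changing converges toward `min_interval`, while a static page drifts toward `max_interval`.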

[jira] Commented: (NUTCH-926) Nutch follows wrong url in META http-equiv=refresh tag

2010-10-27 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925543#action_12925543 ] Andrzej Bialecki commented on NUTCH-926: bq. Nutch continues to crawl the WRONG

[jira] Created: (NUTCH-927) Sub pages are not getting crawled

2010-10-27 Thread Rameez Raja (JIRA)
Sub pages are not getting crawled - Key: NUTCH-927 URL: https://issues.apache.org/jira/browse/NUTCH-927 Project: Nutch Issue Type: Bug Components: injector Affects Versions: 2.0

Build failed in Hudson: Nutch-trunk #1289

2010-10-27 Thread Apache Hudson Server
See https://hudson.apache.org/hudson/job/Nutch-trunk/1289/ -- [...truncated 925 lines...] A src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/HTMLMetaProcessor.java A

[jira] Created: (NUTCH-928) Segmentation

2010-10-27 Thread Rameez Raja (JIRA)
Segmentation Key: NUTCH-928 URL: https://issues.apache.org/jira/browse/NUTCH-928 Project: Nutch Issue Type: Bug Components: injector Affects Versions: 2.0 Reporter: Rameez Raja I need to create

[jira] Updated: (NUTCH-928) Segmentation

2010-10-27 Thread Rameez Raja (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rameez Raja updated NUTCH-928: Description: Is there any configuration needed to create segments for each URL rather than for each