[jira] Created: (NUTCH-488) Avoid parsing unnecessary links and get a more relevant outlink list

2007-05-22 Thread Emmanuel Joke (JIRA)
Avoid parsing unnecessary links and get a more relevant outlink list Key: NUTCH-488 URL: https://issues.apache.org/jira/browse/NUTCH-488 Project: Nutch Issue Type:

[jira] Updated: (NUTCH-488) Avoid parsing unnecessary links and get a more relevant outlink list

2007-05-22 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-488: Attachment: DOMContentUtils.patch Avoid parsing unnecessary links and get a more relevant outlink

[jira] Updated: (NUTCH-489) URLFilter-suffix management of the url path when the url contains some query parameters

2007-05-22 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-489: Attachment: SuffixURLFilter.java.patch suffix-urlfilter.txt.patch URLFilter-suffix

[jira] Updated: (NUTCH-489) URLFilter-suffix management of the url path when the url contains some query parameters

2007-05-22 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-489: Attachment: SuffixURLFilter_v2.java.patch My mistake... I've added a new patch which is supposed

[jira] Created: (NUTCH-500) Add hadoop masters configuration file into conf folder

2007-06-18 Thread Emmanuel Joke (JIRA)
Add hadoop masters configuration file into conf folder -- Key: NUTCH-500 URL: https://issues.apache.org/jira/browse/NUTCH-500 Project: Nutch Issue Type: Improvement Components:

[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-21 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506922 ] Emmanuel Joke commented on NUTCH-503: - I just tried your patch and I'm afraid I still have the same issue.

[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-22 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507469 ] Emmanuel Joke commented on NUTCH-503: - Sorry, my mistake. My compiled jar was not correctly included in my

[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-29 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509039 ] Emmanuel Joke commented on NUTCH-503: - Results seem to be good. So I'm wondering if it is possible to commit this

[jira] Created: (NUTCH-508) ${hadoop.log.dir} and ${hadoop.log.file} are not propagated to the tasktracker

2007-07-07 Thread Emmanuel Joke (JIRA)
${hadoop.log.dir} and ${hadoop.log.file} are not propagated to the tasktracker -- Key: NUTCH-508 URL: https://issues.apache.org/jira/browse/NUTCH-508 Project: Nutch

[jira] Created: (NUTCH-509) Update Crawldb: avoid to start a job if there is no valid segment

2007-07-08 Thread Emmanuel Joke (JIRA)
Update Crawldb: avoid to start a job if there is no valid segment - Key: NUTCH-509 URL: https://issues.apache.org/jira/browse/NUTCH-509 Project: Nutch Issue Type: Improvement

[jira] Updated: (NUTCH-509) Update Crawldb: avoid to start a job if there is no valid segment

2007-07-08 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-509: Attachment: crawldb.patch In this patch, I've added a simple boolean to start the job only if we have

[jira] Commented: (NUTCH-509) Update Crawldb: avoid to start a job if there is no valid segment

2007-07-09 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511038 ] Emmanuel Joke commented on NUTCH-509: - You're right. In this case, I will close the JIRA Update Crawldb: avoid

[jira] Closed: (NUTCH-509) Update Crawldb: avoid to start a job if there is no valid segment

2007-07-09 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke closed NUTCH-509. --- Resolution: Won't Fix As explained by Doğacan, the Crawldb update already behaves correctly. This patch is

[jira] Updated: (NUTCH-516) Next fetch time is not set when it is a CrawlDatum.STATUS_FETCH_GONE

2007-07-18 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-516: Attachment: NUTCH-516.patch I fixed the issue by changing the FetchTime in

[jira] Created: (NUTCH-522) Use URLValidator in the Injector

2007-07-19 Thread Emmanuel Joke (JIRA)
Use URLValidator in the Injector Key: NUTCH-522 URL: https://issues.apache.org/jira/browse/NUTCH-522 Project: Nutch Issue Type: Improvement Components: injector Reporter: Emmanuel Joke

[jira] Updated: (NUTCH-522) Use URLValidator in the Injector

2007-07-19 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-522: Attachment: NUTCH-522_v2.patch Oops, my mistake. Please find an updated patch. Actually I've a

[jira] Commented: (NUTCH-522) Use URLValidator in the Injector

2007-07-20 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514153 ] Emmanuel Joke commented on NUTCH-522: - Actually I tried to fetch the url

[jira] Updated: (NUTCH-526) Use a combiner in LinkDbMerger to improve the performance as in LinkDb

2007-07-24 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-526: Attachment: NUTCH-526.patch patch provided Use a combiner in LinkDbMerger to improve the

[jira] Created: (NUTCH-526) Use a combiner in LinkDbMerger to improve the performance as in LinkDb

2007-07-24 Thread Emmanuel Joke (JIRA)
Use a combiner in LinkDbMerger to improve the performance as in LinkDb - Key: NUTCH-526 URL: https://issues.apache.org/jira/browse/NUTCH-526 Project: Nutch Issue Type:

[jira] Updated: (NUTCH-528) CrawlDbReader: add some new stats + dump into a csv format

2007-07-26 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-528: Attachment: NUTCH-528.patch patch attached CrawlDbReader: add some new stats + dump into a csv

[jira] Created: (NUTCH-528) CrawlDbReader: add some new stats + dump into a csv format

2007-07-26 Thread Emmanuel Joke (JIRA)
CrawlDbReader: add some new stats + dump into a csv format -- Key: NUTCH-528 URL: https://issues.apache.org/jira/browse/NUTCH-528 Project: Nutch Issue Type: Improvement

[jira] Updated: (NUTCH-529) NodeWalker.skipChildren doesn't work for more than 1 child.

2007-07-26 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-529: Attachment: NUTCH-529.patch patch attached NodeWalker.skipChildren doesn't work for more than 1

[jira] Created: (NUTCH-529) NodeWalker.skipChildren doesn't work for more than 1 child.

2007-07-26 Thread Emmanuel Joke (JIRA)
NodeWalker.skipChildren doesn't work for more than 1 child. - Key: NUTCH-529 URL: https://issues.apache.org/jira/browse/NUTCH-529 Project: Nutch Issue Type: Bug Reporter:

[jira] Updated: (NUTCH-522) Use URLValidator in the Injector

2007-07-27 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-522: Attachment: NUTCH-522_v3.patch Use URLValidator in the Injector

[jira] Updated: (NUTCH-522) Use URLValidator in the Injector

2007-07-27 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-522: Attachment: NUTCH-522_v3.patch commons-validator's UrlValidator does not filter URLS with space.

[jira] Updated: (NUTCH-522) Use URLValidator in the Injector

2007-07-27 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-522: Attachment: (was: NUTCH-522_v3.patch) Use URLValidator in the Injector

[jira] Commented: (NUTCH-522) Use URLValidator in the Injector

2007-07-27 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516138 ] Emmanuel Joke commented on NUTCH-522: - I tried with protocol-http and protocol-httpclient, I got the same error

[jira] Commented: (NUTCH-526) Use a combiner in LinkDbMerger to improve the performance as in LinkDb

2007-07-29 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516209 ] Emmanuel Joke commented on NUTCH-526: - Actually, I made a simple test on 2 small linkdbs, and I didn't see any

[jira] Created: (NUTCH-530) Add a combiner to improve performance on updatedb

2007-07-29 Thread Emmanuel Joke (JIRA)
Add a combiner to improve performance on updatedb - Key: NUTCH-530 URL: https://issues.apache.org/jira/browse/NUTCH-530 Project: Nutch Issue Type: Improvement Environment: java 1.6

[jira] Updated: (NUTCH-530) Add a combiner to improve performance on updatedb

2007-07-29 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-530: Attachment: NUTCH-530.patch Patch provided. It reduced the process time by 20%. Output from the

[jira] Created: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-07-30 Thread Emmanuel Joke (JIRA)
CrawlDbMerger: wrong computation of last fetch time --- Key: NUTCH-532 URL: https://issues.apache.org/jira/browse/NUTCH-532 Project: Nutch Issue Type: Bug Reporter: Emmanuel Joke

[jira] Updated: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-532: Attachment: NUTCH-532.patch Patch provided. CrawlDbMerger: wrong computation of last fetch time

[jira] Updated: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-532: Attachment: (was: NUTCH-532.patch) CrawlDbMerger: wrong computation of last fetch time

[jira] Updated: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-532: Attachment: NUTCH-532.patch CrawlDbMerger: wrong computation of last fetch time

[jira] Created: (NUTCH-533) LinkDbMerger: url normalized is not updated in the key and inlinks list

2007-07-30 Thread Emmanuel Joke (JIRA)
LinkDbMerger: url normalized is not updated in the key and inlinks list --- Key: NUTCH-533 URL: https://issues.apache.org/jira/browse/NUTCH-533 Project: Nutch Issue Type:

[jira] Updated: (NUTCH-533) LinkDbMerger: url normalized is not updated in the key and inlinks list

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-533: Attachment: NUTCH-533.patch Patch provided LinkDbMerger: url normalized is not updated in the key

[jira] Commented: (NUTCH-526) Use a combiner in LinkDbMerger to improve the performance as in LinkDb

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516358 ] Emmanuel Joke commented on NUTCH-526: - Could you please wait a few more days? I would like to wait for a

[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516602 ] Emmanuel Joke commented on NUTCH-530: - I'm sure to follow your point regarding the outlinks number. I don't

[jira] Commented: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516618 ] Emmanuel Joke commented on NUTCH-532: - res.getFetchTime() - Math.round(res.getFetchInterval() * 1000d); always

[jira] Updated: (NUTCH-534) SegmentMerger: add -normalize option

2007-07-31 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-534: Attachment: NUTCH-534.patch Patch provided SegmentMerger: add -normalize option

[jira] Created: (NUTCH-534) SegmentMerger: add -normalize option

2007-07-31 Thread Emmanuel Joke (JIRA)
SegmentMerger: add -normalize option Key: NUTCH-534 URL: https://issues.apache.org/jira/browse/NUTCH-534 Project: Nutch Issue Type: Improvement Reporter: Emmanuel Joke Assignee:

[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

2007-07-31 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516675 ] Emmanuel Joke commented on NUTCH-530: - Actually I don't re-use CrawlDbReducer, I've defined a new class as

[jira] Updated: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-08-06 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-532: Attachment: NUTCH-532_v2.patch New patch provided * Add new method to CrawlDatum

[jira] Updated: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-08-06 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-532: Attachment: NUTCH-532_v3.patch My mistake, actually I'm not really familiar with the VERSION. I

[jira] Updated: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-08-06 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-532: Attachment: NUTCH-532_v4.patch I updated the code following Andrzej's comments. I've also updated the

[jira] Updated: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-09-03 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-532: Attachment: NUTCH-532-test.patch Please find a patch which fixes the JUnit test. CrawlDbMerger:

[jira] Closed: (NUTCH-526) Use a combiner in LinkDbMerger to improve the performance as in LinkDb

2007-09-03 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke closed NUTCH-526. --- Resolution: Won't Fix No improvement. Use a combiner in LinkDbMerger to improve the performance as

[jira] Updated: (NUTCH-528) CrawlDbReader: add some new stats + dump into a csv format

2007-09-04 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-528: Attachment: NUTCH-528_v2.patch New patch provided. It includes the new options as requested by DG.

[jira] Updated: (NUTCH-529) NodeWalker.skipChildren doesn't work for more than 1 child.

2007-09-04 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-529: Attachment: TestNodeWalker.java Junit test provided. NodeWalker.skipChildren doesn't work for

[jira] Created: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

2007-09-04 Thread Emmanuel Joke (JIRA)
Move URLNormalizer from Outlink to ParseOutputFormat Key: NUTCH-548 URL: https://issues.apache.org/jira/browse/NUTCH-548 Project: Nutch Issue Type: Improvement Components:

[jira] Updated: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

2007-09-04 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-548: Attachment: NUTCH-548.patch Patch provided Move URLNormalizer from Outlink to ParseOutputFormat

[jira] Commented: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

2007-09-04 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524669 ] Emmanuel Joke commented on NUTCH-548: - Actually I've one comment/question. I noticed that we normalize and filter

[jira] Commented: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

2007-09-06 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525452 ] Emmanuel Joke commented on NUTCH-548: - My mistake, you're right, I was using the crawl command to run my test,

[jira] Updated: (NUTCH-529) NodeWalker.skipChildren doesn't work for more than 1 child.

2007-09-11 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-529: Attachment: TestNodeWalker.java Another version without dependency to Neko.

[jira] Commented: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

2007-09-19 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528729 ] Emmanuel Joke commented on NUTCH-557: - Did you notice any difference in terms of performance? Improvement or

[jira] Updated: (NUTCH-529) NodeWalker.skipChildren doesn't work for more than 1 child.

2007-09-21 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-529: Attachment: (was: TestNodeWalker.java) NodeWalker.skipChildren doesn't work for more than 1

[jira] Commented: (NUTCH-508) ${hadoop.log.dir} and ${hadoop.log.file} are not propagated to the tasktracker

2007-10-04 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532552 ] Emmanuel Joke commented on NUTCH-508: - It is Mathijs Homminga ${hadoop.log.dir} and ${hadoop.log.file} are not

[jira] Updated: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

2007-10-27 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-548: Attachment: NUTCH-548.patch.v2 New patch which removes the unused parameter and fixes the plugin parser

[jira] Created: (NUTCH-592) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED

2007-12-16 Thread Emmanuel Joke (JIRA)
Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED - Key: NUTCH-592 URL: https://issues.apache.org/jira/browse/NUTCH-592 Project: Nutch Issue Type: Bug

[jira] Updated: (NUTCH-592) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED

2007-12-16 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-592: Attachment: patch.txt Patch provided. Fetcher2 : NPE for page with status

[jira] Commented: (NUTCH-528) CrawlDbReader: add some new stats + dump into a csv format

2007-12-27 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554555 ] Emmanuel Joke commented on NUTCH-528: - I'm wondering if somebody could review this patch and possibly commit it

[jira] Commented: (NUTCH-534) SegmentMerger: add -normalize option

2007-12-27 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554572 ] Emmanuel Joke commented on NUTCH-534: - Hi Andrzej, would you mind reviewing this patch too and giving us your

[jira] Commented: (NUTCH-595) Target file:/.... already exists

2007-12-27 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554571 ] Emmanuel Joke commented on NUTCH-595: - I had a similar issue; I followed the instructions given by Dennis and it

[jira] Updated: (NUTCH-528) CrawlDbReader: add some new stats + dump into a csv format

2007-12-27 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-528: Attachment: NUTCH-528_v3.patch New patch provided following Andrzej's recommendations: ??*

[jira] Created: (NUTCH-596) ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS

2007-12-30 Thread Emmanuel Joke (JIRA)
ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS --- Key: NUTCH-596 URL: https://issues.apache.org/jira/browse/NUTCH-596 Project: Nutch Issue

[jira] Created: (NUTCH-598) Remove deprecated use of ToolBase, Migration to the new implementation

2008-01-02 Thread Emmanuel Joke (JIRA)
Remove deprecated use of ToolBase, Migration to the new implementation -- Key: NUTCH-598 URL: https://issues.apache.org/jira/browse/NUTCH-598 Project: Nutch Issue Type:

[jira] Updated: (NUTCH-598) Remove deprecated use of ToolBase, Migration to the new implementation

2008-01-02 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-598: Attachment: NUTCH-598.patch Patch provided It includes: - remove ToolBase call and move to the new

[jira] Commented: (NUTCH-559) NTLM, Basic and Digest Authentication schemes for web/proxy server

2008-01-03 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555840#action_12555840 ] Emmanuel Joke commented on NUTCH-559: - Dogacan, is there any chance that you commit this

[jira] Commented: (NUTCH-580) Remove deprecated hadoop api calls (FS)

2008-01-03 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555841#action_12555841 ] Emmanuel Joke commented on NUTCH-580: - I've been using your patch for a while now and it

[jira] Commented: (NUTCH-531) Pages with no ContentType cause a Null Pointer exception

2008-01-03 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555843#action_12555843 ] Emmanuel Joke commented on NUTCH-531: - It looks like this issue has been solved with the

[jira] Issue Comment Edited: (NUTCH-559) NTLM, Basic and Digest Authentication schemes for web/proxy server

2008-01-04 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555840#action_12555840 ] jokeout edited comment on NUTCH-559 at 1/4/08 1:55 AM: - Dogacan,

[jira] Commented: (NUTCH-596) ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS

2008-01-04 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555890#action_12555890 ] Emmanuel Joke commented on NUTCH-596: - I agree with you: the proper solution will be the

[jira] Updated: (NUTCH-598) Remove deprecated use of ToolBase, Migration to the new implementation

2008-01-06 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-598: Attachment: NUTCH-598.v2.patch Thanks Dogacan for your update. New patch provided. Most of the

[jira] Commented: (NUTCH-528) CrawlDbReader: add some new stats + dump into a csv format

2008-01-11 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557969#action_12557969 ] Emmanuel Joke commented on NUTCH-528: - Andrzej, did you have the time to review my new

[jira] Commented: (NUTCH-534) SegmentMerger: add -normalize option

2008-01-11 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557968#action_12557968 ] Emmanuel Joke commented on NUTCH-534: - Andrzej, do you think it will be possible to

[jira] Commented: (NUTCH-363) Fetcher normalizes everything at least twice

2008-01-15 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559378#action_12559378 ] Emmanuel Joke commented on NUTCH-363: - FYI, the operation to normalize links within the

[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2008-02-08 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566950#action_12566950 ] Emmanuel Joke commented on NUTCH-567: - Hi Dogacan, do you think you will commit this new

[jira] Commented: (NUTCH-596) ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS

2008-02-11 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567693#action_12567693 ] Emmanuel Joke commented on NUTCH-596: - I didn't find any useful information in the

[jira] Commented: (NUTCH-613) Empty Summaries and Cached Pages

2008-02-23 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12571874#action_12571874 ] Emmanuel Joke commented on NUTCH-613: - I have the same analysis. I just changed my local

[jira] Commented: (NUTCH-598) Remove deprecated use of ToolBase, Migration to the new implementation

2008-02-23 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12571875#action_12571875 ] Emmanuel Joke commented on NUTCH-598: - Hi Dogacan, did you finish reviewing my patch? Is

[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again

2008-02-24 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-578: Attachment: NUTCH-578.patch I've got the same error for a page with an HTTP status code = 503. I

[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again

2008-02-24 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-578: Attachment: NUTCH-578_v2.patch Actually I just realised that the setPageRetrySchedule in

[jira] Updated: (NUTCH-615) Redirected URL are fetched without setting any FetchInterval

2008-02-26 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-615: Attachment: NUTCH-615.patch Redirected URL are fetched without setting any FetchInterval

[jira] Created: (NUTCH-615) Redirected URL are fetched without setting any FetchInterval

2008-02-26 Thread Emmanuel Joke (JIRA)
Redirected URL are fetched without setting any FetchInterval Key: NUTCH-615 URL: https://issues.apache.org/jira/browse/NUTCH-615 Project: Nutch Issue Type: Bug

[jira] Updated: (NUTCH-616) Reset Fetch Retry counter when fetch is successful

2008-02-26 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-616: Attachment: NUTCH-616.patch Patch provided Reset Fetch Retry counter when fetch is successful

[jira] Created: (NUTCH-616) Reset Fetch Retry counter when fetch is successful

2008-02-26 Thread Emmanuel Joke (JIRA)
Reset Fetch Retry counter when fetch is successful -- Key: NUTCH-616 URL: https://issues.apache.org/jira/browse/NUTCH-616 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0

[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

2008-03-16 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579285#action_12579285 ] Emmanuel Joke commented on NUTCH-530: - OK Add a combiner to improve performance on