[jira] [Commented] (NUTCH-872) Change the default fetcher.parse to FALSE
[ https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446511#comment-13446511 ]

Ferdy Galema commented on NUTCH-872:

Yes, that is correct.

Change the default fetcher.parse to FALSE

Key: NUTCH-872
URL: https://issues.apache.org/jira/browse/NUTCH-872
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.2, 1.3, nutchgora
Reporter: Andrzej Bialecki

I propose to change this property to false. The reason is that it's a safer default - parsing issues don't lead to a loss of the downloaded content. For larger crawls this is the recommended way to run Fetcher. Users that run smaller crawls can still override it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
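
For reference, the override mentioned in the issue description is an ordinary property entry in the site configuration. A minimal sketch, assuming the usual conf/nutch-site.xml override file; the description text is illustrative wording and not taken from Nutch:

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (assumed location of local overrides) -->
<configuration>
  <property>
    <name>fetcher.parse</name>
    <!-- true: parse while fetching; false: parse in a separate step -->
    <value>true</value>
    <description>Illustrative override for a small crawl.</description>
  </property>
</configuration>
```

With the value left at false, parsing runs as a separate step after the fetch, so a parser failure does not discard content that was already downloaded.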
[jira] [Commented] (NUTCH-872) Change the default fetcher.parse to FALSE
[ https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446568#comment-13446568 ]

Christian Johnsson commented on NUTCH-872:

I applied the patch and did a test run with Nutch 2 against HBase, but it still stores the f:cnt field with the entire source document. I have only run the fetch, with parse set to true and store set to false.

Snippet:

    f:bas timestamp=1346459170179, value=http://www.w.www/
    f:cnt timestamp=1346459170179, value=!DOCTYPE html PUBLIC -//W3C//DTD XHTML 1.0 Transitional//EN\x0Ahttp://www.w3.org/TR/xhtml1/DTD/xhtml1 -transitional.dtd\x0Ahtml xmlns=http://www.w3.org/1999/xhtml; xml:lang=en lang=en\x0A head\x0A

This shouldn't be there if I understood it correctly, right? :-)
[jira] Commented: (NUTCH-872) Change the default fetcher.parse to FALSE
[ https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008397#comment-13008397 ]

Markus Jelsma commented on NUTCH-872:

To all: Andrzej has committed this to 1.3 as well, in r1079746 on 2011-03-09.

Change the default fetcher.parse to FALSE

Key: NUTCH-872
URL: https://issues.apache.org/jira/browse/NUTCH-872
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.2, 2.0
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
Fix For: 2.0