[jira] [Commented] (NUTCH-872) Change the default fetcher.parse to FALSE

2012-08-31 Thread Ferdy Galema (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446511#comment-13446511
 ] 

Ferdy Galema commented on NUTCH-872:


Yes, that is correct.

 Change the default fetcher.parse to FALSE
 -

 Key: NUTCH-872
 URL: https://issues.apache.org/jira/browse/NUTCH-872
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.2, 1.3, nutchgora
Reporter: Andrzej Bialecki 

 I propose to change this property to false. The reason is that it's a safer 
 default - parsing issues don't lead to a loss of the downloaded content. For 
 larger crawls this is the recommended way to run Fetcher. Users that run 
 smaller crawls can still override it.
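
The override mentioned above can be illustrated with a small sketch. It assumes the usual Hadoop-style layout in which a site file (nutch-site.xml) takes precedence over shipped defaults (nutch-default.xml). The XML strings and the helper names (load_properties, resolve, fetcher_parses) are hypothetical stand-ins for illustration, not Nutch code.

```python
import xml.etree.ElementTree as ET

# Stand-in for nutch-default.xml with the proposed default (illustrative only).
DEFAULT_XML = """
<configuration>
  <property>
    <name>fetcher.parse</name>
    <value>false</value>
  </property>
</configuration>
"""

# Stand-in for a user's nutch-site.xml that opts back in to parsing while fetching.
SITE_XML = """
<configuration>
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>
</configuration>
"""


def load_properties(xml_text):
    """Return a {name: value} dict from a Hadoop-style configuration document."""
    root = ET.fromstring(xml_text.strip())
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def resolve(default_xml, site_xml=None):
    """Merge defaults with optional site overrides; site values win."""
    merged = load_properties(default_xml)
    if site_xml is not None:
        merged.update(load_properties(site_xml))
    return merged


def fetcher_parses(props):
    """Interpret fetcher.parse as a boolean, treating a missing value as false."""
    return props.get("fetcher.parse", "false").lower() == "true"
```

With only the defaults, fetcher_parses(resolve(DEFAULT_XML)) is false, so fetching and parsing run as separate steps. Passing SITE_XML as the second argument turns parsing during fetch back on, which is the opt-in path for smaller crawls.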

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-872) Change the default fetcher.parse to FALSE

2012-08-31 Thread Christian Johnsson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446568#comment-13446568
 ] 

Christian Johnsson commented on NUTCH-872:

I applied the patch and did a test run with Nutch 2 against HBase, but it still
stores the f:cnt field with the entire source document. I have only run the
fetch, with parse set to true and store set to false.

Snippet:

 f:bas  timestamp=1346459170179, value=http://www.w.www/

 f:cnt  timestamp=1346459170179, value=<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\x0A"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\x0A<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\x0A <head>\x0A

This shouldn't be there if I understood it correctly, right? :-)



[jira] Commented: (NUTCH-872) Change the default fetcher.parse to FALSE

2011-03-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008397#comment-13008397
 ] 

Markus Jelsma commented on NUTCH-872:

To all: Andrzej has committed this to 1.3 as well, in r1079746 on 2011-03-09.

 Change the default fetcher.parse to FALSE
 -

 Key: NUTCH-872
 URL: https://issues.apache.org/jira/browse/NUTCH-872
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.2, 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0


