Hi,

paragraph breaks have been added by
  https://github.com/apache/nutch/pull/190
and
  https://issues.apache.org/jira/browse/NUTCH-2397

It's not configurable. A simple s/\n/ /g should restore
the old "look" of extracted plain texts.

Best,
Sebastian

On 02/26/2018 04:17 PM, BlackIce wrote:
> Hi,
>
> I ran into a problem with Nutch 1.14 which I don't recall having in
> previous versions.
>
> I'm finding a lot of "\n" (newline?) in the content of my crawled sites.
>
> I've tried different configurations/constellations of the HTML parser
> and Tika, and just Tika, to no avail.
>
> All the info I can find on this is regarding older versions of Nutch...
> like ancient versions...
>
> Did something change to where there is an extra configuration step now
> required?
>
> Greetz
>
> RRK
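
For reference, the s/\n/ /g substitution mentioned in the reply could look roughly like this in Python. This is only a sketch: it assumes the extracted text is already available as a string, and the function names are made up for illustration and are not part of Nutch.

```python
import re


def flatten_newlines(text: str) -> str:
    # Direct equivalent of s/\n/ /g: every newline becomes one space.
    return text.replace("\n", " ")


def collapse_whitespace(text: str) -> str:
    # Optional variant (not from the original mail): also squeezes runs
    # of whitespace into a single space and trims both ends.
    return re.sub(r"\s+", " ", text).strip()
```

The first function keeps the text otherwise unchanged, so consecutive newlines turn into consecutive spaces. The second one may be preferable if double spaces are unwanted in the indexed content.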

