So this is a "newline"? And Nutch is somewhat preserving the appearance of the original text? I haven't set up the frontend yet, so I'm only seeing the Solr content. In other words, I may actually want to keep these?
This is what I'm seeing a lot of: " sitio\nÍndice del sitio\nAdministración". It's actually stuff that should be removed anyway (those are links), but I haven't gotten to that part yet.

Where would that s/\n/ /g go anyway?

I've just noticed that I've deleted all my Nutch/Solr/frontend configurations, so I'm starting from scratch... probably not a bad thing.

On Mon, Feb 26, 2018 at 4:31 PM, Sebastian Nagel <[email protected]> wrote:

> Hi,
>
> paragraph breaks have been added by
>
> https://github.com/apache/nutch/pull/190
> and
> https://issues.apache.org/jira/browse/NUTCH-2397
>
> It's not configurable.
>
> A simple
> s/\n/ /g
> should restore the old "look" of extracted plain texts.
>
> Best,
> Sebastian
>
>
> On 02/26/2018 04:17 PM, BlackIce wrote:
> > Hi,
> >
> > I ran into a problem with Nutch 1.14 which I don't recall having in
> > previous versions.
> >
> > I'm finding a lot of "\n" (newline?) in the content of my crawled sites.
> >
> > I've tried different configurations/constellations of the HTML parser and
> > Tika, and just Tika, to no avail.
> >
> > All the info I can find on this is regarding older versions of Nutch,
> > like ancient versions...
> >
> > Did something change to where there is an extra configuration step now
> > required?
> >
> > Greetz
> >
> > RRK
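
For reference, the s/\n/ /g substitution quoted above just replaces every newline with a space. A minimal sketch of the same thing in Python, assuming it is applied to the extracted content string somewhere outside Nutch (for example in a post-processing script or in the frontend before display). The function name flatten_newlines is made up for illustration and is not part of Nutch or Solr:

```python
def flatten_newlines(text: str) -> str:
    # Hypothetical helper, not a Nutch/Solr API.
    # Same effect as s/\n/ /g: every newline becomes a single space.
    return text.replace("\n", " ")


# Sample taken from the content shown in this thread.
sample = " sitio\nÍndice del sitio\nAdministración"
print(flatten_newlines(sample))
```

Whether to apply it at all depends on whether the paragraph breaks are wanted in the final display.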

