> And Nutch is preserving somewhat the appearance of the original text?

Yes. Of course, this is useful mostly for the main text
(without navigation / boilerplate).

> where would that : s/\n/ /g
> go anyway?

In a Lucene analyzer?

> I haven't set up the frontend yet... so I'm just seeing the Solr
> content.... in other words I may actually want to keep these?

Yes, it may help to generate snippets.


On 02/26/2018 04:46 PM, BlackIce wrote:
> So this is a "newline"? And Nutch is preserving somewhat the appearance of
> the original text?
> I haven't set up the frontend yet... so I'm just seeing the Solr
> content.... in other words I may actually want to keep these?
>
> this is what I'm having a lot of: " sitio\nÍndice del
> sitio\nAdministración"
> Actually stuff that should be removed anyway, but I haven't gotten to that
> part yet (Those are links)
>
> where would that : s/\n/ /g
> go anyway?
>
> I've just noticed that I've deleted all my Nutch/Solr/Frontend
> configurations, so I'm starting from scratch... probably not a bad thing
>
> On Mon, Feb 26, 2018 at 4:31 PM, Sebastian Nagel <[email protected]> wrote:
>
>> Hi,
>>
>> paragraph breaks have been added by
>>
>> https://github.com/apache/nutch/pull/190
>> and
>> https://issues.apache.org/jira/browse/NUTCH-2397
>>
>> It's not configurable.
>>
>> A simple
>> s/\n/ /g
>> should restore the old "look" of extracted plain texts.
>>
>> Best,
>> Sebastian
>>
>>
>> On 02/26/2018 04:17 PM, BlackIce wrote:
>>> Hi,
>>>
>>> I ran into a problem with Nutch 1.14 which I don't recall having in
>>> previous versions.
>>>
>>> I'm finding a lot of "\n" (newline?) in the content of my crawled sites.
>>>
>>> I've tried different configurations/constellations of the HTML parser and
>>> Tika, and just Tika, to no avail.
>>>
>>> All the info I can find on this is regarding older versions of Nutch..
>>> like ancient versions...
>>>
>>> Did something change so that there is an extra configuration step now
>>> required?
>>>
>>> Greetz
>>>
>>> RRK
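
For illustration only, here is what the s/\n/ /g substitution from the thread does to the sample text quoted above. This is a minimal Python sketch run outside Nutch and Solr; the function name flatten_newlines is hypothetical and not part of either project, and where the substitution should live in a real setup is still the open question.

```python
def flatten_newlines(text: str) -> str:
    # Equivalent of the sed-style s/\n/ /g:
    # every newline character becomes a single space.
    return text.replace("\n", " ")


# Sample taken from the thread (extracted link text with paragraph breaks).
sample = " sitio\nÍndice del sitio\nAdministración"
flattened = flatten_newlines(sample)
print(repr(flattened))
```

Keeping the newlines in the indexed content and only flattening at display time would leave the paragraph breaks available for snippet generation, as suggested above.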

