So this is a "newline"? And Nutch is preserving some of the appearance of
the original text?
I haven't set up the frontend yet, so I'm just seeing the Solr
content. In other words, I may actually want to keep these?

This is what I'm seeing a lot of: " sitio\nÍndice del
sitio\nAdministración"
That's actually stuff that should be removed anyway, but I haven't gotten to
that part yet (those are links).

Where would that
  s/\n/ /g
go, anyway?
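
For reference, the substitution itself just replaces every newline with a
space. Below is a minimal Python sketch of what it does to the sample above.
The function name flatten_newlines is hypothetical, and the thread does not
settle where such a step would hook into Nutch or Solr, so this only
illustrates the effect on a string:

```python
def flatten_newlines(text: str) -> str:
    """Replace every newline with a single space (same effect as s/\\n/ /g)."""
    return text.replace("\n", " ")


# Sample taken from the crawled content quoted above.
sample = " sitio\nÍndice del sitio\nAdministración"
print(flatten_newlines(sample))
```

The same replacement could presumably be applied wherever the extracted text
is available as a string, either before indexing or when the frontend
renders it.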

I've just noticed that I've deleted all my Nutch/Solr/Frontend
configurations, so I'm starting from scratch... probably not a bad thing

On Mon, Feb 26, 2018 at 4:31 PM, Sebastian Nagel wrote:

> Hi,
>
> paragraph breaks have been added by
>
> https://github.com/apache/nutch/pull/190
>  and
> https://issues.apache.org/jira/browse/NUTCH-2397
>
> It's not configurable.
>
> A simple
>   s/\n/ /g
> should restore the old "look" of extracted plain texts.
>
> Best,
> Sebastian
>
>
> On 02/26/2018 04:17 PM, BlackIce wrote:
> > Hi,
> >
> > I ran into a problem with Nutch 1.14 which I don't recall having in
> > previous versions.
> >
> > I'm finding a lot of "\n" (newline?) in the content of crawled sites.
> >
> > I've tried different configurations/constellations of the HTML parser and
> > Tika, and just Tika, to no avail.
> >
> > All the info I can find on this is about older versions of Nutch,
> > like ancient versions...
> >
> > Did something change so that an extra configuration step is now
> > required?
> >
> > Greetz
> >
> > RRK
> >
>
>
