Re: removing "\n"... Nutch 1.14

Sebastian Nagel Mon, 26 Feb 2018 08:02:34 -0800

> And Nutch is preserving somewhat the appearance of the original text?

Yes. Of course, this is mostly useful for the main text (without navigation /
boilerplate).


> where would that :   s/\n/ /g
> go anyway?

In a Lucene analyzer?
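
Wherever it ends up, the substitution itself is small. Below is a minimal Java sketch of what s/\n/ /g does to an extracted text. The class name, method name and sample string are made up for illustration and are not part of Nutch or Lucene.

```java
// Hypothetical helper, not part of Nutch or Lucene:
// the Java equivalent of the substitution s/\n/ /g.
final class NewlineCleaner {

    // Replace every newline character with a single space.
    static String collapseNewlines(String text) {
        return text.replace('\n', ' ');
    }

    public static void main(String[] args) {
        // Made-up sample resembling link text separated by paragraph breaks.
        String extracted = "Site index\nAdministration\nContact";
        System.out.println(collapseNewlines(extracted));
    }
}
```

Note that this only replaces the newline character itself. If the text also contains carriage returns or runs of blank lines, a broader pattern would be needed.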

> I haven't set up the frontend yet... so I'm just seeing the Solr
> content... in other words, I may actually want to keep these?

Yes, it may help to generate snippets.
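
As a rough illustration of why the breaks can be useful: a frontend could treat each newline-delimited paragraph as a candidate snippet and show the first one that mentions the query term. The sketch below assumes exactly that; the class, method and sample text are hypothetical, and this is not how Solr's own highlighting works.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch or Solr.
final class SnippetPicker {

    // Return the first newline-delimited paragraph that contains the term
    // (case-insensitive), or an empty string if no paragraph matches.
    static String firstMatchingParagraph(String content, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        for (String paragraph : content.split("\n")) {
            if (paragraph.toLowerCase(Locale.ROOT).contains(needle)) {
                return paragraph.trim();
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // Made-up content with paragraph breaks preserved.
        String content = "Welcome to the site\nNutch is a web crawler\nContact us";
        // prints "Nutch is a web crawler"
        System.out.println(firstMatchingParagraph(content, "crawler"));
    }
}
```

Once the newlines have been replaced by spaces, these paragraph boundaries are gone, so a snippet would have to be cut at an arbitrary offset instead.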


On 02/26/2018 04:46 PM, BlackIce wrote:
> So this is a "newline"? And Nutch is preserving somewhat the appearance of
> the original text?
> I haven't set up the frontend yet... so I'm just seeing the Solr
> content... in other words, I may actually want to keep these?
> 
> This is what I'm seeing a lot of: " sitio\nÍndice del
> sitio\nAdministración"
> Actually, that is stuff that should be removed anyway, but I haven't gotten
> to that part yet (those are links).
> 
> where would that :   s/\n/ /g
> go anyway?
> 
> I've just noticed that I've deleted all my Nutch/Solr/Frontend
> configurations, so I'm starting from scratch... probably not a bad thing
> 
> On Mon, Feb 26, 2018 at 4:31 PM, Sebastian Nagel <[email protected]> wrote:
> 
>> Hi,
>>
>> paragraph breaks have been added by
>>
>> https://github.com/apache/nutch/pull/190
>>  and
>> https://issues.apache.org/jira/browse/NUTCH-2397
>>
>> It's not configurable.
>>
>> A simple
>>   s/\n/ /g
>> should restore the old "look" of extracted plain texts.
>>
>> Best,
>> Sebastian
>>
>>
>> On 02/26/2018 04:17 PM, BlackIce wrote:
>>> Hi,
>>>
>>> I ran into a problem with Nutch 1.14 which I don't recall having in
>>> previous versions.
>>>
>>> I'm finding a lot of "\n" (newline?) in the content of my crawled sites.
>>>
>>> I've tried different configurations/combinations of the HTML parser and
>>> Tika, and just Tika, to no avail.
>>>
>>> All the info I can find on this is about older versions of Nutch...
>>> like, ancient versions...
>>>
>>> Did something change so that there is now an extra configuration step
>>> required?
>>>
>>> Greetz
>>>
>>> RRK
>>>
>>
>>
>
