Re: Unnecessary Newline Characters and Metadata at Content

Furkan KAMACI Fri, 25 Nov 2016 02:24:33 -0800

Hi Karl,

I used the default values for Solr. In my Solr output connector, "Use the
Extract Update Handler" is checked, and the update handler is defined as
"/update/extract". There is no Tika content extractor defined in the job
pipeline.
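
For comparison with the cURL test from my first mail, here is a minimal
sketch of how the request URL for a direct post to the extracting handler
could be built. The base URL, the core name "collection1" and the helper
name build_extract_url are only assumptions for illustration; they are not
taken from my actual setup. "literal.id" and "commit" are standard
parameters of Solr's extracting request handler.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, core, doc_id, handler="/update/extract", commit=True):
    """Build the URL for posting a file straight to Solr's extracting handler.

    base_url and core are placeholders; adjust them to the local installation.
    """
    # literal.id sets the document's unique key without relying on extraction.
    params = {"literal.id": doc_id}
    if commit:
        # Ask Solr to commit right away so the result can be inspected.
        params["commit"] = "true"
    return f"{base_url.rstrip('/')}/{core}{handler}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr instance and core name.
    print(build_extract_url("http://localhost:8983/solr", "collection1", "MARLON BRANDO.rtf"))
```

The file itself would then be sent as the body of a POST to that URL, so
the content field of the resulting document can be compared with what
ManifoldCF produces through the same handler.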


I have Wireshark captures and logs from both ManifoldCF and Solr. I can
share them if you want.

Kind Regards,
Furkan KAMACI

On Fri, Nov 25, 2016 at 12:02 AM, Karl Wright <[email protected]> wrote:

> Is this being indexed via the extracting update handler?  What does your
> pipeline look like?  Is the tika extractor in the pipeline?
>
>
> Karl
>
>
> On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI <[email protected]>
> wrote:
>
>> I've indexed a file into Solr via ManifoldCF. Its content starts
>> with:
>>
>> *1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire" directed
>> by Elia Kazan, 1951*
>>
>> *2. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by
>> Elia Kazan, 1951*
>>
>> *3. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by
>> Elia Kazan, 1951*
>>
>> However, when I check Solr, I see that the content field contains:
>>
>> * " \n \nstream_source_info MARLON BRANDO.rtf   \nstream_content_type
>> application/rtf   \nstream_size 13580   \nstream_name MARLON BRANDO.rtf
>> \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf   \n  \n
>> \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
>> directed by Elia Kazan \n"*
>>
>> There are two problems here.
>>
>> 1) There are unnecessary newline characters.
>>
>> 2) Metadata is prepended to the content field, which should not happen.
>>
>> So, the problem may be in either Solr or ManifoldCF (related to
>> Tika). When I index the same document to Solr via cURL, there are no
>> newline characters or metadata prepended.
>>
>> What do you think would be a solution?
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>>
>
