Re: Unnecessary Newline Characters and Metadata at Content

Karl Wright Thu, 24 Nov 2016 14:03:06 -0800

Is this being indexed via the extracting update handler?  What does your
pipeline look like?  Is the Tika extractor in the pipeline?



Karl


On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI <[email protected]>
wrote:

> I've indexed a file via ManifoldCF to Solr whose content starts with:
>
> *1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire" directed
> by Elia Kazan, 1951*
>
> *2. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by
> Elia Kazan, 1951*
>
> *3. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by
> Elia Kazan, 1951*
>
> However, when I check Solr, I see this in the content field:
>
> * " \n \nstream_source_info MARLON BRANDO.rtf   \nstream_content_type
> application/rtf   \nstream_size 13580   \nstream_name MARLON BRANDO.rtf
> \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf   \n  \n
> \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
> directed by Elia Kazan \n"*
>
> There are two problems here.
>
> 1) There are unnecessary newline characters.
>
> 2) Metadata is prepended to the content field, which should not happen.
>
> So the problem may be in Solr or in ManifoldCF (related to Tika). When I
> index the same document to Solr via cURL, no newline characters or
> metadata are prepended.
>
> What do you think would be a solution?
>
> Kind Regards,
> Furkan KAMACI
>
>
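
For illustration only: the two symptoms in the sample above (metadata lines
in front of the text, plus stray newlines) could be cleaned up on the client
side after reading the field back from Solr. This is a minimal sketch, not
the ManifoldCF or Solr fix. It assumes each metadata entry sits on its own
line as "<key> <value>", and the key list and the function name
clean_content are hypothetical, taken only from the keys visible in the
sample. A body line that happens to start with one of these keys would also
be dropped, and paragraph breaks are flattened to single spaces.

```python
import re

# Assumption: only the metadata keys that appear in the sample above.
METADATA_KEYS = (
    "stream_source_info",
    "stream_content_type",
    "stream_size",
    "stream_name",
    "Content-Type",
    "resourceName",
)


def clean_content(raw, metadata_keys=METADATA_KEYS):
    """Drop '<key> <value>' metadata lines and collapse whitespace."""
    kept = []
    for line in raw.split("\n"):
        text = line.strip()
        if not text:
            # Skip lines that are empty or whitespace-only.
            continue
        if any(text == key or text.startswith(key + " ") for key in metadata_keys):
            # Skip a line that consists of a known metadata key and its value.
            continue
        # Collapse runs of whitespace inside the remaining line.
        kept.append(re.sub(r"\s+", " ", text))
    return " ".join(kept)


# The content value from the sample above, with "\n" as real newlines.
sample = (
    " \n \nstream_source_info MARLON BRANDO.rtf   \n"
    "stream_content_type application/rtf   \nstream_size 13580   \n"
    "stream_name MARLON BRANDO.rtf \nContent-Type application/rtf   \n"
    "resourceName MARLON BRANDO.rtf   \n  \n \n"
    '  1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire" '
    "directed by Elia Kazan \n"
)

print(clean_content(sample))
```

Finding out why the metadata reaches the content field in the first place,
as asked at the top of this message, is still the better route; the sketch
only hides the symptom.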
