Is this being indexed via the extracting update handler? What does your pipeline look like? Is the Tika extractor in the pipeline?
Karl

On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI <[email protected]> wrote:

> I've indexed a file via ManifoldCF to Solr whose content starts with:
>
> *1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire" directed
> by Elia Kazan, 1951*
>
> *2. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by
> Elia Kazan, 1951*
>
> *3. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by
> Elia Kazan, 1951*
>
> However, when I check Solr, I see this in the content field:
>
> * " \n \nstream_source_info MARLON BRANDO.rtf \nstream_content_type
> application/rtf \nstream_size 13580 \nstream_name MARLON BRANDO.rtf
> \nContent-Type application/rtf \nresourceName MARLON BRANDO.rtf \n \n
> \n 1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
> directed by Elia Kazan \n"*
>
> There are two problems here:
>
> 1) There are unnecessary newline characters.
>
> 2) Metadata is prepended to the content field, which should not happen.
>
> So the problem may be in Solr or in ManifoldCF (related to
> Tika). When I index the same document to Solr via cURL, no newline
> characters or metadata are prepended.
>
> What do you think would be a solution?
>
> Kind Regards,
> Furkan KAMACI
