Hi Karl,

I used the default values for Solr. In my Solr output connector, "Use the Extract Update Handler" is checked, and the update handler is set to "/update/extract". There is no Tika content extractor defined in the job pipeline.
I have WireShark captures and logs from both ManifoldCF and Solr. I can share them if you want.

Kind Regards,
Furkan KAMACI

On Fri, Nov 25, 2016 at 12:02 AM, Karl Wright <[email protected]> wrote:

> Is this being indexed via the extracting update handler? What does your
> pipeline look like? Is the tika extractor in the pipeline?
>
> Karl
>
> On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI <[email protected]>
> wrote:
>
>> I've indexed a file via ManifoldCF to Solr which has a content starts
>> with:
>>
>> *1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire" directed
>> by Elia Kazan, 1951*
>>
>> *2. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by
>> Elia Kazan, 1951*
>>
>> *3. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by
>> Elia Kazan, 1951*
>>
>> However when I check Solr I see that at content:
>>
>> * " \n \nstream_source_info MARLON BRANDO.rtf \nstream_content_type
>> application/rtf \nstream_size 13580 \nstream_name MARLON BRANDO.rtf
>> \nContent-Type application/rtf \nresourceName MARLON BRANDO.rtf \n \n
>> \n 1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
>> directed by Elia Kazan \n"*
>>
>> There are 2 problems at here.
>>
>> 1) There are newline characters which are unnecessary.
>>
>> 2) There are metadata prepended to content field which should not be.
>>
>> So, one can think that problem maybe at Solr or ManifoldCF (related to
>> Tika). When I index same document to Solr via cURL there are not new line
>> characters or metadata prepended.
>>
>> What do you think about for a solution?
>>
>> Kind Regards,
>> Furkan KAMACI
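For anyone comparing the ManifoldCF-indexed document with the cURL-indexed one, the two symptoms in the original report (extra newline characters and metadata keys prepended to the content field) can be checked with a small script. This is only a diagnostic sketch, not a fix. The function names are hypothetical, the key list is taken from the content value quoted above, and the whitespace in the sample string is approximated from the message.

```python
# Metadata keys that appear in the content value quoted in this thread.
METADATA_KEYS = (
    "stream_source_info",
    "stream_content_type",
    "stream_size",
    "stream_name",
    "Content-Type",
    "resourceName",
)


def find_prepended_metadata(content):
    """Return the known metadata keys found in content, in order of appearance."""
    found = [(content.find(key), key) for key in METADATA_KEYS if key in content]
    return [key for _, key in sorted(found)]


def count_newlines(content):
    """Return the number of newline characters in content."""
    return content.count("\n")


# Content value as reported in the thread (whitespace approximated).
sample = (
    " \n \nstream_source_info MARLON BRANDO.rtf \nstream_content_type application/rtf "
    "\nstream_size 13580 \nstream_name MARLON BRANDO.rtf "
    "\nContent-Type application/rtf \nresourceName MARLON BRANDO.rtf \n \n "
    "\n 1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\" "
    "directed by Elia Kazan \n"
)

if __name__ == "__main__":
    print("metadata keys found:", find_prepended_metadata(sample))
    print("newline characters:", count_newlines(sample))
```

Running the same two checks on the content value of the document indexed via cURL should show whether the difference comes from the ManifoldCF path or from Solr itself.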
