Re: Metadata and Newline Characters at Content

2016-11-26 Thread Furkan KAMACI
PS: \n characters are not shown in the browser, but they break how the highlighter works. \n characters are counted toward fragsize too.

Re: Metadata and Newline Characters at Content

2016-11-26 Thread Furkan KAMACI
Hi Erick, I resolved my metadata problem by configuring solrconfig.xml. However, even when I post data with post.sh, I see content like: CANADA �1 \n \n \n \n Place I have newline characters as \n and some non-ASCII characters. As far as I understand it is usual to have such characters because

Re: Metadata and Newline Characters at Content

2016-11-24 Thread Erick Erickson
Not sure. What have you tried? For production situations or when you want to take total control of the indexing process, I strongly recommend that you put the Tika parsing on the _client_. Here's a writeup on this topic: https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/ Best, Erick

Re: Metadata and Newline Characters at Content

2016-11-24 Thread Furkan KAMACI
Hi Erick, When I check the *Solr* documentation I see that [1]: *In addition to Tika's metadata, Solr adds the following metadata (defined in ExtractingMetadataConstants):* *"stream_name" - The name of the ContentStream as uploaded to Solr. Depending on how the file is uploaded, this may or may

Re: Metadata and Newline Characters at Content

2016-11-24 Thread Erick Erickson
about PatternCaptureGroupFilterFactory. This isn't going to help. The data you see when you return stored data is _before_ any analysis so the PatternFactory won't be applied. You could do this in a ScriptUpdateProcessorFactory. Or, just don't worry about it and have the real app deal with it.
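
As an illustration of the "have the real app deal with it" option above, here is a minimal Python sketch that collapses runs of whitespace (including newlines) in a stored content value, either on the client before indexing or in the app after retrieval. The function name `collapse_whitespace` is hypothetical and not part of Solr or Tika; it is not the ScriptUpdateProcessorFactory approach itself, only the same idea applied outside Solr.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    # \s+ matches one or more whitespace characters, including "\n".
    return re.sub(r"\s+", " ", text).strip()


# Sample value modeled on the content shown in this thread (real newline characters).
raw = "CANADA \n \n \n \n Place"
cleaned = collapse_whitespace(raw)
print(cleaned)
```

Because stored values are a verbatim copy of what was sent, cleaning the text before it reaches Solr is the only way this kind of normalization shows up in the stored field.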

Re: Metadata and Newline Characters at Content

2016-11-24 Thread Furkan KAMACI
Hi Erick, 1) I am looking at the stored data via the Solr Admin UI. I send the query and check what is in the content field. 2) I can debug the Tika settings if you think that it is not the desired behaviour to have such metadata fields combined into the content field. *PS: *Is there any solution to get rid of

Re: Metadata and Newline Characters at Content

2016-11-24 Thread Erick Erickson
1> I'm assuming when you "see" this data you're looking at the stored data, right? It's a verbatim copy of whatever you sent to the field. I'm guessing it's a character-encoding mismatch between the source and what you use to display. 2> How are you extracting this data? There are Tika options I
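
To make the character-encoding guess in point 1 concrete, here is a small Python sketch of how a mismatch can turn one character into the replacement character "�". The choice of "£" and Latin-1 is purely an assumption for illustration; the thread does not say what the original character or encoding was.

```python
# Assumed original text: a pound sign followed by "1" (hypothetical example).
original = "£1"

# The source writes the text as Latin-1 bytes...
latin1_bytes = original.encode("latin-1")

# ...but the reader decodes those bytes as UTF-8. The byte 0xA3 is not a
# valid UTF-8 start byte, so it is replaced with U+FFFD.
misread = latin1_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding that was actually used recovers the text.
recovered = latin1_bytes.decode("latin-1")

print(misread)
print(recovered)
```

If the stored value already contains U+FFFD, the original character is lost at that point, so the mismatch has to be fixed where the bytes are first decoded.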

Metadata and Newline Characters at Content

2016-11-24 Thread Furkan KAMACI
Hi, I'm testing Solr 4.9.1. I've indexed documents with it. The content field in the schema has the text_general field type, which is not modified from the original. I do not copy any fields to content. When I check the data I see content values like: " \n \nstream_source_info MARLON BRANDO.rtf