RE: Unicode Character Problem

2016-12-12 Thread Allison, Timothy B.
> I don't see any weird character when I manually copy it to any text editor.

That's a good diagnostic step, but it doesn't rule out extraction: Adobe (or
your viewer) may be getting it right while Tika or PDFBox is not.

If you run tika-app on the file [0], do you get the same problem?  See our stub 
on common text extraction challenges with PDFs [1] and how to run PDFBox's 
ExtractText against your file [2].

[0] java -jar tika-app.jar -i <input_dir> -o <output_dir>
[1] https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29
[2] https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems 
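
One way to see what the "weird character" actually is would be to dump the Unicode code points of the extracted text. The following is a minimal Python sketch, not something taken from Tika, PDFBox, or Solr. The function names and the sample strings are assumptions for illustration; in particular, the "bad" string uses U+FFFD as it is rendered in this archive, which may differ from what is really stored in the index.

```python
import unicodedata

# General categories that rarely belong in extracted body text:
# Cc = control, Cf = format, Co = private use, Cn = unassigned.
SUSPECT_CATEGORIES = {"Cc", "Cf", "Co", "Cn"}
ORDINARY_WHITESPACE = {"\n", "\r", "\t"}


def describe_chars(text):
    """Return (character, 'U+XXXX', Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def suspicious_chars(text):
    """Return (index, 'U+XXXX', category) for characters that often
    signal extraction trouble: U+FFFD or a suspect general category."""
    hits = []
    for i, ch in enumerate(text):
        if ch in ORDINARY_WHITESPACE:
            continue
        category = unicodedata.category(ch)
        if ch == "\ufffd" or category in SUSPECT_CATEGORIES:
            hits.append((i, f"U+{ord(ch):04X}", category))
    return hits


if __name__ == "__main__":
    # Hypothetical samples modelled on the two forms reported in the thread.
    good = "a\u00e7\u0131klama"        # expected word, with dotless i (U+0131)
    bad = "a\u00e7 \ufffdklama"        # assumed form of the broken variant
    for label, sample in (("good", good), ("bad", bad)):
        print(label, describe_chars(sample))
        print(label, "suspicious:", suspicious_chars(sample))
```

Running the same functions over the output of tika-app and of PDFBox's ExtractText for the same file would show whether both tools produce the same code points at the problem position.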

-Original Message-
From: Furkan KAMACI [mailto:furkankam...@gmail.com] 
Sent: Monday, December 12, 2016 10:55 AM
To: solr-user@lucene.apache.org; Ahmet Arslan <iori...@yahoo.com>
Subject: Re: Unicode Character Problem

Hi Ahmet,

I don't see any weird character when I manually copy it to any text editor.

On Sat, Dec 10, 2016 at 6:19 PM, Ahmet Arslan <iori...@yahoo.com.invalid>
wrote:

> Hi Furkan,
>
> I am pretty sure this is a PDF extraction issue.
> Turkish characters have caused us trouble in the past when extracting
> text from PDF files.
> You can confirm by manually copying and pasting from the original PDF file.
>
> Ahmet
>
>
> On Friday, December 9, 2016 8:44 PM, Furkan KAMACI 
> <furkankam...@gmail.com>
> wrote:
> Hi,
>
> I'm trying to index text containing Turkish characters. This is what I see
> in my index (both forms appear in different places in my content):
>
> aç  klama
> açıklama
>
> These are the same word but indexed differently (the first one contains a
> weird character). There is no weird character when I check the original
> PDF file.
>
> What do you think? Is it related to Solr or Tika?
>
> PS: I use text_general as the analyzer for the content field.
>
> Kind Regards,
> Furkan KAMACI
>


Re: Unicode Character Problem

2016-12-12 Thread Furkan KAMACI
Hi Ahmet,

I don't see any weird character when I manually copy it to any text editor.



Re: Unicode Character Problem

2016-12-10 Thread Ahmet Arslan
Hi Furkan,

I am pretty sure this is a PDF extraction issue.
Turkish characters have caused us trouble in the past when extracting text from
PDF files.
You can confirm by manually copying and pasting from the original PDF file.

Ahmet




Unicode Character Problem

2016-12-09 Thread Furkan KAMACI
Hi,

I'm trying to index text containing Turkish characters. This is what I see in
my index (both forms appear in different places in my content):

aç �klama
açıklama

These are the same word but indexed differently (the first one contains a
weird character). There is no weird character when I check the original
PDF file.

What do you think? Is it related to Solr or Tika?

PS: I use text_general as the analyzer for the content field.

Kind Regards,
Furkan KAMACI