Re: Issues when indexing PDF files

2015-12-18 Thread Zheng Lin Edwin Yeo
Thanks for all your replies. I came across this Stack Overflow question, which is said to solve the issue: http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files/ However, when I tried it, it still produced the same "?" output in the content, the

Re: Issues when indexing PDF files

2015-12-18 Thread Erick Erickson
This could also simply be that your browser isn't set up to display UTF-8; the characters may be just fine. Best, Erick On Fri, Dec 18, 2015 at 12:58 AM, Zheng Lin Edwin Yeo wrote: > Thanks for all your replies. > > I did chance upon this question from stackoverflow which it

Re: Issues when indexing PDF files

2015-12-18 Thread Zheng Lin Edwin Yeo
Hi Erick, Thanks for your reply. However, it is unlikely to be a browser issue, as the same result occurred when I tried it in the Tika app. Regards, Edwin On 18 December 2015 at 23:39, Erick Erickson wrote: > This could also simply be your browser isn't set up to >

Re: Issues when indexing PDF files

2015-12-17 Thread Zheng Lin Edwin Yeo
Hi Alexandre, Thanks for your reply. So the only way to solve this issue is to explore with PDF specific tools and change the encoding of the file? Is there any way to configure it in Solr? Regards, Edwin On 17 December 2015 at 15:42, Alexandre Rafalovitch wrote: > They

Re: Issues when indexing PDF files

2015-12-17 Thread Binoy Dalal
You can always write an update handler plugin to convert your PDFs to UTF-8 and then push them to Solr. On Thu, 17 Dec 2015, 14:16 Zheng Lin Edwin Yeo wrote: > Hi Alexandre, > > Thanks for your reply. > > So the only way to solve this issue is to explore with PDF specific

RE: Issues when indexing PDF files

2015-12-17 Thread Allison, Timothy B.
17, 2015 5:48 AM To: solr-user@lucene.apache.org Subject: Re: Issues when indexing PDF files On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote: > Hi Alexandre, > > Thanks for your reply. > > So the only way to solve this issue is to explore with PDF specific > tools and change the en

Re: Issues when indexing PDF files

2015-12-17 Thread Walter Underwood
PDF isn’t really text. For example, it doesn’t have spaces, it just moves the next letter over farther. Letters might not be in reading order — two column text could be printed as horizontal scans. Custom fonts might not use an encoding that matches Unicode, which makes them encrypted (badly).
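To make the two points above concrete, here is a toy sketch in Python. It is not how Tika or PDFBox work internally; the glyph tuples, the gap_ratio threshold, and the CUSTOM_FONT table are all hypothetical names invented for this illustration. It shows (1) rebuilding spaces from horizontal gaps between positioned glyphs, and (2) how glyph codes from a custom font only become readable text when a code-to-Unicode table is available.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical glyph record: (character, x_position, width) in page units.
Glyph = Tuple[str, float, float]


def rebuild_text(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Order glyphs left to right and insert a space wherever the gap
    after the previous glyph exceeds gap_ratio times that glyph's width.
    The threshold is an assumption; real extractors use other heuristics."""
    ordered = sorted(glyphs, key=lambda g: g[1])
    out: List[str] = []
    prev_end: Optional[float] = None
    prev_width = 0.0
    for ch, x, width in ordered:
        if prev_end is not None and x - prev_end > gap_ratio * prev_width:
            out.append(" ")
        out.append(ch)
        prev_end = x + width
        prev_width = width
    return "".join(out)


# Hypothetical custom font: its glyph codes do not correspond to Unicode.
CUSTOM_FONT: Dict[int, str] = {1: "S", 2: "o", 3: "l", 4: "r"}


def decode_codes(codes: List[int],
                 to_unicode: Optional[Dict[int, str]] = None) -> str:
    """Map glyph codes to characters. Codes with no mapping become
    U+FFFD here; an actual extractor may emit something else."""
    table = to_unicode or {}
    return "".join(table.get(code, "\ufffd") for code in codes)


if __name__ == "__main__":
    # Glyphs are listed out of reading order, as a PDF may store them.
    page = [("a", 12.0, 5.0), ("H", 0.0, 5.0), ("l", 19.0, 2.0),
            ("i", 5.0, 2.0), ("l", 17.0, 2.0)]
    print(rebuild_text(page))
    print(decode_codes([1, 2, 3, 4], CUSTOM_FONT))
    print(decode_codes([1, 2, 3, 4]))  # no mapping table available
```

When the mapping table is missing, every character is lost even though the page renders correctly, which is one plausible way to end up with placeholder characters in an index.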

Re: Issues when indexing PDF files

2015-12-17 Thread Charlie Hull
On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote: Hi Alexandre, Thanks for your reply. So the only way to solve this issue is to explore with PDF specific tools and change the encoding of the file? Is there any way to configure it in Solr? Solr uses Tika to extract plain text from PDFs. If the

Re: Issues when indexing PDF files

2015-12-16 Thread Zheng Lin Edwin Yeo
I've used the Tika app to check all the files whose content has problems in the Solr index. All of them show the same issues that I see in the Solr index. So does the issue lie with the encoding of the files? Are we able to check the encoding of a file? Regards, Edwin On 17

Re: Issues when indexing PDF files

2015-12-16 Thread Alexandre Rafalovitch
They could be using custom fonts and non-Unicode characters. That's probably something to explore with PDF specific tools. On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" wrote: > I've checked all the files which has problem with the content in the Solr > index using the Tika

Re: Issues when indexing PDF files

2015-12-16 Thread Erik Hatcher
Edwin - Can you share one of those PDF files? Also, drop the file into the Tika app and see what it sees directly - get the tika-app JAR and run that desktop application. Could be an encoding issue? Erik — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com
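One way to narrow down the "encoding issue" question is to look at the code points in the extracted text rather than at how it is displayed. The sketch below is a small Python helper; the function names are made up for this example and it assumes you already have the extracted text as a string (for instance, copied from the Tika app output). If the text holds real non-ASCII code points, the display is the likely culprit; if it holds literal U+003F or U+FFFD, the characters were probably lost during extraction.

```python
import unicodedata
from collections import Counter
from typing import Dict, List


def summarize_suspect_chars(text: str) -> Dict[str, int]:
    """Count characters that hint at where the text went wrong."""
    counts: Counter = Counter()
    for ch in text:
        if ch == "\ufffd":
            counts["replacement_char"] += 1
        elif ch == "?":
            counts["literal_question_mark"] += 1
        elif ord(ch) > 127:
            counts["non_ascii"] += 1
    return dict(counts)


def describe(text: str) -> List[str]:
    """List each character as 'U+XXXX NAME' so nothing depends on fonts
    or on the terminal's or browser's encoding settings."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"
            for ch in text]


if __name__ == "__main__":
    sample = "caf\u00e9 ?? \ufffd"  # stand-in for extracted content
    print(summarize_suspect_chars(sample))
    for line in describe(sample):
        print(line)
```

A high count of literal question marks with no non-ASCII characters at all would point away from a browser display problem and toward the extraction step.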

Re: Issues when indexing PDF files

2015-12-16 Thread Zheng Lin Edwin Yeo
Hi Erik, I've shared the file on dropbox, which you can access via the link here: https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0 This is what I get from the Tika app after dropping the file in. Content-Length: 75092 Content-Type: application/pdf Type: COSName{Info}