Thanks for all your replies.
I came across this question on Stack Overflow, which is said to solve the
issue:
http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files/
However, when I tried to run it, I still got the same "?" output in
the content, the
This could also simply be that your browser isn't set up to
display UTF-8; the characters may be just fine.
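
One way to tell a display problem from an extraction problem is to look at the actual code points in the indexed text rather than at how they render. The following is only a rough sketch: the function name `describe_chars` and the sample string are made up for illustration, and it assumes you have already copied the field value out of Solr or Tika into a Python string. A literal QUESTION MARK (U+003F) or a REPLACEMENT CHARACTER (U+FFFD) in the data suggests the text was already damaged before it reached the browser.

```python
import unicodedata

def describe_chars(text):
    """List every '?' and every non-ASCII character with its code point and name."""
    report = []
    for ch in text:
        if ch == "?" or ord(ch) > 127:
            # unicodedata.name() falls back to "UNKNOWN" for unnamed code points
            report.append((ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return report

# Hypothetical sample, not taken from the real PDF
sample = "caf\u00e9 \ufffd ?"
for row in describe_chars(sample):
    print(row)
```

If the suspicious characters turn out to be ordinary accented letters, the browser or terminal encoding is the more likely culprit.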
Best,
Erick
On Fri, Dec 18, 2015 at 12:58 AM, Zheng Lin Edwin Yeo
wrote:
> Thanks for all your replies.
>
> I did chance upon this question from stackoverflow which it
Hi Erick,
Thanks for your reply.
However, it is unlikely to be a browser issue, as I get the same result
when I try it in the Tika app.
Regards,
Edwin
On 18 December 2015 at 23:39, Erick Erickson
wrote:
> This could also simply be your browser isn't set up to
>
Hi Alexandre,
Thanks for your reply.
So the only way to solve this issue is to investigate with PDF-specific tools
and change the encoding of the file?
Is there any way to configure it in Solr?
Regards,
Edwin
On 17 December 2015 at 15:42, Alexandre Rafalovitch
wrote:
> They
You can always write an update handler plugin to convert your PDFs to UTF-8
and then push them to Solr.
On Thu, 17 Dec 2015, 14:16 Zheng Lin Edwin Yeo wrote:
> Hi Alexandre,
>
> Thanks for your reply.
>
> So the only way to solve this issue is to explore with PDF specific
17, 2015 5:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Issues when indexing PDF files
On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
> Hi Alexandre,
>
> Thanks for your reply.
>
> So the only way to solve this issue is to explore with PDF specific
> tools and change the en
PDF isn’t really text. For example, it doesn’t have spaces; it just moves the
next letter over farther. Letters might not be in reading order: two-column
text could be printed as horizontal scans. Custom fonts might not use an
encoding that matches Unicode, which makes them encrypted (badly).
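
To make the point above concrete, here is a toy model of the idea. It is not how Tika or any real PDF parser works internally; the glyph codes, the `extract` function, the font table, and the gap threshold are all invented for illustration. Each glyph is a (code, x position) pair, spaces are guessed from horizontal gaps, and a glyph code with no known mapping to Unicode comes out as "?".

```python
def extract(glyphs, to_unicode, gap=1.5):
    """Rebuild text from positioned glyphs, guessing spaces from horizontal gaps."""
    out = []
    prev_x = None
    for code, x in glyphs:
        # A jump larger than `gap` between glyphs is treated as a word break
        if prev_x is not None and x - prev_x > gap:
            out.append(" ")
        # Without a code-to-Unicode mapping, the extractor can only emit a placeholder
        out.append(to_unicode.get(code, "?"))
        prev_x = x
    return "".join(out)

# Hypothetical custom font: the codes are arbitrary, not Unicode code points
custom_font = {1: "H", 2: "i", 3: "y", 4: "o"}
page = [(1, 0.0), (2, 1.0), (3, 3.5), (4, 4.5)]

print(extract(page, custom_font))  # Hi yo
print(extract(page, {}))           # ?? ??
```

The second call shows what happens when the mapping is missing: the layout survives, but every character is lost. Changing the output encoding afterwards cannot recover it.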
On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
> Hi Alexandre,
>
> Thanks for your reply.
>
> So the only way to solve this issue is to explore with PDF specific tools
> and change the encoding of the file?
>
> Is there any way to configure it in Solr?
Solr uses Tika to extract plain text from PDFs. If the
I've used the Tika app to check all the files whose content has problems in
the Solr index. All of them show the same issues that I see in the Solr
index.
So does the issue lie with the encoding of the file? Are we able to check
the encoding of the file?
Regards,
Edwin
On 17
They could be using custom fonts and non-Unicode characters. That's
probably something to explore with PDF-specific tools.
On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" wrote:
> I've checked all the files which has problem with the content in the Solr
> index using the Tika
Edwin - Can you share one of those PDF files?
Also, drop the file into the Tika app and see what it sees directly - get the
tika-app JAR and run that desktop application.
Could be an encoding issue?
Erik
—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com
Hi Erik,
I've shared the file on dropbox, which you can access via the link here:
https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
This is what I get from the Tika app after dropping the file in.
Content-Length: 75092
Content-Type: application/pdf
Type: COSName{Info}