Re: Issues when indexing PDF files

Zheng Lin Edwin Yeo Thu, 17 Dec 2015 00:46:43 -0800

Hi Alexandre,

Thanks for your reply.


So the only way to solve this is to inspect the files with PDF-specific
tools and change their encoding?
Is there any way to configure this in Solr?

Regards,
Edwin
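
The symptom described further down this thread (indexed content that is
either a run of "??????" or empty) can at least be detected automatically, so
that affected files can be set aside for inspection with PDF-specific tools.
Below is a minimal sketch, assuming the extracted text is already available
as a Python string. The function name looks_garbled and the 0.5 threshold are
hypothetical choices for illustration, not part of Solr or Tika, and the
check only flags suspicious output; it does not repair the extraction.

```python
def looks_garbled(text, threshold=0.5):
    """Return True if extracted text is empty or mostly placeholder characters.

    Hypothetical helper: the name and the default threshold are assumptions
    made for illustration, not part of Solr or Tika.
    """
    # Ignore whitespace so that padding does not dilute the ratio.
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # Empty content is one of the two symptoms reported in the thread.
        return True
    # "?" commonly appears after a lossy character conversion; U+FFFD is the
    # Unicode replacement character.
    placeholders = sum(1 for ch in visible if ch in ("?", "\ufffd"))
    return placeholders / len(visible) >= threshold


if __name__ == "__main__":
    # Print each sample alongside the verdict of the check.
    for sample in ["??????", "", "Desmophen 670 BA"]:
        print(repr(sample), looks_garbled(sample))
```

A file flagged this way is only a candidate: text that legitimately contains
many question marks would also trip the check, so the threshold may need
tuning for a given collection.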


On 17 December 2015 at 15:42, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> They could be using custom fonts and non-Unicode characters. That's
> probably something to explore with PDF specific tools.
> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com>
> wrote:
>
> > I've checked all the files that have problems with their content in the
> > Solr index using the Tika app. All of them show the same issues that I
> > see in the Solr index.
> >
> > So does the issue lie with the encoding of the file? Are we able to
> > check the encoding of the file?
> >
> >
> > Regards,
> > Edwin
> >
> >
> > On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > wrote:
> >
> > > Hi Erik,
> > >
> > > I've shared the file on Dropbox, which you can access via the link here:
> > >
> > > https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> > >
> > > This is what I get from the Tika app after dropping the file in.
> > >
> > > Content-Length: 75092
> > > Content-Type: application/pdf
> > > Type: COSName{Info}
> > > X-Parsed-By: org.apache.tika.parser.DefaultParser
> > > X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> > > X-TIKA:digest:SHA256: d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> > > access_permission:assemble_document: true
> > > access_permission:can_modify: true
> > > access_permission:can_print: true
> > > access_permission:can_print_degraded: true
> > > access_permission:extract_content: true
> > > access_permission:extract_for_accessibility: true
> > > access_permission:fill_in_form: true
> > > access_permission:modify_annotations: true
> > > dc:format: application/pdf; version=1.3
> > > pdf:PDFVersion: 1.3
> > > pdf:encrypted: false
> > > producer: null
> > > resourceName: Desmophen+670+BAe.pdf
> > > xmpTPg:NPages: 3
> > >
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 17 December 2015 at 00:15, Erik Hatcher <erik.hatc...@gmail.com>
> > > wrote:
> > >
> > >> Edwin - Can you share one of those PDF files?
> > >>
> > >> Also, drop the file into the Tika app and see what it sees directly -
> > >> get the tika-app JAR and run that desktop application.
> > >>
> > >> Could be an encoding issue?
> > >>
> > >>         Erik
> > >>
> > >> —
> > >> Erik Hatcher, Senior Solutions Architect
> > >> http://www.lucidworks.com
> > >>
> > >>
> > >>
> > >> > On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > I'm using Solr 5.3.0
> > >> >
> > >> > I'm indexing some PDF documents. Certain PDF files contain Chinese
> > >> > text, but after indexing, the indexed content is either a series of
> > >> > "??????" or empty.
> > >> >
> > >> > I'm using the post.jar that ships with Solr.
> > >> >
> > >> > What could be causing this?
> > >> >
> > >> > Regards,
> > >> > Edwin
> > >>
> > >>
> > >
> >
>
