You can always write an update handler plugin to convert your PDFs to UTF-8 and then push them to Solr.
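
For illustration, here is a minimal client-side sketch of that idea in Python. It is not a Solr update handler plugin (that would be Java code running inside Solr); it only shows the check-then-push flow from outside Solr. It assumes the text has already been extracted from the PDF by some other tool. The core name `mycore`, the field names `id` and `content`, and all helper names are hypothetical. If the PDF uses custom fonts without a Unicode mapping, re-encoding alone may not recover the text.

```python
import json
import urllib.request

# Hypothetical endpoint: the core name "mycore" is an assumption, not from the thread.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"


def looks_garbled(text, threshold=0.5):
    """Return True if the text is empty or mostly '?' / U+FFFD characters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True
    bad = sum(1 for ch in visible if ch in ("?", "\ufffd"))
    return bad / len(visible) >= threshold


def build_update_payload(doc_id, text):
    """Serialize one document as UTF-8 encoded JSON for Solr's /update handler.

    The field names "id" and "content" are assumptions about the schema.
    """
    docs = [{"id": doc_id, "content": text}]
    # ensure_ascii=False keeps Chinese characters as real UTF-8 bytes
    # instead of \uXXXX escapes.
    return json.dumps(docs, ensure_ascii=False).encode("utf-8")


def post_to_solr(payload, url=SOLR_UPDATE_URL):
    """POST the payload to Solr. Needs a running Solr instance at `url`."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A caller would skip or flag any document for which `looks_garbled` returns True, and call `post_to_solr(build_update_payload(doc_id, text))` only for the rest.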

On Thu, 17 Dec 2015, 14:16 Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> Hi Alexandre,
>
> Thanks for your reply.
>
> So the only way to solve this issue is to explore with PDF specific tools
> and change the encoding of the file?
> Is there any way to configure it in Solr?
>
> Regards,
> Edwin
>
> On 17 December 2015 at 15:42, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>
> > They could be using custom fonts and non-Unicode characters. That's
> > probably something to explore with PDF specific tools.
> >
> > On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com> wrote:
> >
> > > I've checked all the files which has problem with the content in the Solr
> > > index using the Tika app. All of them shows the same issues as what I see
> > > in the Solr index.
> > >
> > > So does the issues lies with the encoding of the file? Are we able to check
> > > the encoding of the file?
> > >
> > > Regards,
> > > Edwin
> > >
> > > On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> > >
> > > > Hi Erik,
> > > >
> > > > I've shared the file on dropbox, which you can access via the link here:
> > > > https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> > > >
> > > > This is what I get from the Tika app after dropping the file in.
> > > >
> > > > Content-Length: 75092
> > > > Content-Type: application/pdf
> > > > Type: COSName{Info}
> > > > X-Parsed-By: org.apache.tika.parser.DefaultParser
> > > > X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> > > > X-TIKA:digest:SHA256: d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> > > > access_permission:assemble_document: true
> > > > access_permission:can_modify: true
> > > > access_permission:can_print: true
> > > > access_permission:can_print_degraded: true
> > > > access_permission:extract_content: true
> > > > access_permission:extract_for_accessibility: true
> > > > access_permission:fill_in_form: true
> > > > access_permission:modify_annotations: true
> > > > dc:format: application/pdf; version=1.3
> > > > pdf:PDFVersion: 1.3
> > > > pdf:encrypted: false
> > > > producer: null
> > > > resourceName: Desmophen+670+BAe.pdf
> > > > xmpTPg:NPages: 3
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > > > On 17 December 2015 at 00:15, Erik Hatcher <erik.hatc...@gmail.com> wrote:
> > > >
> > > >> Edwin - Can you share one of those PDF files?
> > > >>
> > > >> Also, drop the file into the Tika app and see what it sees directly - get
> > > >> the tika-app JAR and run that desktop application.
> > > >>
> > > >> Could be an encoding issue?
> > > >>
> > > >> Erik
> > > >>
> > > >> --
> > > >> Erik Hatcher, Senior Solutions Architect
> > > >> http://www.lucidworks.com
> > > >>
> > > >> > On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> > > >> >
> > > >> > Hi,
> > > >> >
> > > >> > I'm using Solr 5.3.0
> > > >> >
> > > >> > I'm indexing some PDF documents. However, for certain PDF files, there are
> > > >> > chinese text in the documents, but after indexing, what is indexed in the
> > > >> > content is either a series of "??????" or an empty content.
> > > >> >
> > > >> > I'm using the post.jar that comes together with Solr.
> > > >> >
> > > >> > What could be the reason that causes this?
> > > >> >
> > > >> > Regards,
> > > >> > Edwin

--
Regards,
Binoy Dalal