Re: Problem with Solr indexing "non-searchable" pdf files

Erick Erickson Thu, 17 Dec 2015 08:49:56 -0800

I'm not sure how much help I can be; I have no clue what DSpace is
doing with Solr.

If you're willing to try indexing straight to Solr, you can always write
a small SolrJ program that parses the files and sends them to Solr (the
linked example uses Tika for the parsing). It's actually not very hard.
Here's an example:
https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/

Some database code is mixed in there, but that can be removed.
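
As a rough illustration of the "index the metadata even if the PDF
yields no text" idea, here is a minimal sketch. It is not SolrJ and not
what DSpace does: it uses plain Python and Solr's JSON update handler
instead. The URL, the core name "mycore", the field names "title" and
"text", and the extractor callable are all assumptions for illustration;
a real extractor would wrap something like Tika.

```python
import json
import urllib.request

# Assumed Solr location and core name; adjust for your installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"


def build_doc(doc_id, metadata, path, extractor):
    """Build one Solr document, keeping the metadata even when text
    extraction fails or returns nothing (e.g. a scanned PDF without OCR).

    `extractor` is a hypothetical callable: path -> extracted text.
    """
    doc = dict(metadata)
    doc["id"] = doc_id
    try:
        text = extractor(path)
    except Exception as exc:
        # Extraction failed: report it and fall back to metadata only.
        print(f"no text extracted from {path}: {exc}")
        text = ""
    if text and text.strip():
        doc["text"] = text  # "text" is an assumed field name
    return doc


def post_docs(docs, url=SOLR_UPDATE_URL):
    """POST a list of documents to Solr as a JSON array."""
    body = json.dumps(docs).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

With something like this, a file whose extractor raises or returns an
empty string still ends up as a document containing its id and metadata;
calling post_docs([doc]) would then send it to the assumed core.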

Otherwise, perhaps the DSpace folks have more guidance on
how they expect PDFs to be handled.

Best,
Erick

On Thu, Dec 17, 2015 at 6:54 AM, RICARDO EITO BRUN <re...@bib.uc3m.es> wrote:
> Hi,
> I am using Solr as part of the DSpace 5.4 software application.
> I have a problem when running the dspace indexing command
> (index-discovery). Most of the files are not being added to the index, and
> an exception is raised.
>
> It seems that Solr does not process PDF files that are the result of
> scanning without OCR (non-searchable PDF files).
>
> Is there any way to tell Solr that the document metadata should be
> processed even if the PDF file itself cannot be indexed?
>
> Any suggestion on how to make the pdf files "searchable" using some kind of
> batch process/tool?
>
> Thanks in advance,
> Ricardo
>
> --
> RICARDO EITO BRUN
> Universidad Carlos III de Madrid
