I'll second the SolrJ method. You don't want to be doing this on your Solr 
instance. One question is whether your PDFs are scanned or already 
searchable. I use Tesseract offline to convert all scanned PDFs into searchable 
PDFs, so I don't want Tika doing OCR. The core of my code is:
    // Tika setup: auto-detect the file type and capture the body text.
    File f = new File(filename);
    ContentHandler textHandler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    if (filename.toLowerCase().contains("pdf")) {
      PDFParserConfig pdfConfig = new PDFParserConfig();
      pdfConfig.setExtractInlineImages(false);
      // Remove this line (in fact remove the whole PDFParserConfig) if you want Tika to OCR.
      pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
      context.set(PDFParserConfig.class, pdfConfig);
      context.set(Parser.class, parser);
    }
    try (InputStream input = new FileInputStream(f)) {
      parser.parse(input, textHandler, metadata, context);
    } catch (Exception e) {
      e.printStackTrace();
      return false;
    }
    // Build the Solr document; load up whatever fields you are using.
    SolrInputDocument up = new SolrInputDocument();
    if (title == null) title = metadata.get("title");
    if (author == null) author = metadata.get("author");
    up.addField("id", f.getCanonicalPath());
    up.addField("location", idString);
    up.addField("access", access);
    up.addField("datasource", datasource);
    up.addField("title", title);
    up.addField("author", author);
    if (year > 0) up.addField("year", year);
    if (opfyear > 0) up.addField("opfyear", opfyear);
    String content = textHandler.toString();
    up.addField("_text_", content);
    // Send the document to the collection and commit.
    UpdateRequest req = new UpdateRequest();
    req.add(up);
    req.setBasicAuthCredentials("solrAdmin", password);
    UpdateResponse ur = req.process(solr, "prindex");
    req.commit(solr, "prindex");
    return true;
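
The snippet assumes the usual Tika (org.apache.tika.*) and SolrJ 
(org.apache.solr.client.solrj.*) imports, and that a SolrClient named solr and 
the caller-supplied fields (filename, title, password, etc.) already exist. As a 
minimal sketch of creating the client, assuming a single local Solr node on the 
default port (adjust the URL and timeouts for your setup):

    // Assumption: single local Solr node at the default port.
    SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr")
        .withConnectionTimeout(10000)  // milliseconds
        .withSocketTimeout(60000)
        .build();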

-----Original Message-----
From: Erick Erickson <erickerick...@gmail.com>
Sent: Wednesday, 31 October 2018 06:00
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Indexing PDF file in Apache SOLR via Apache TIKA

All of the above work, but for robust production situations you'll want to 
consider a SolrJ client, see:
https://lucidworks.com/2012/02/14/indexing-with-solrj/. That blog combines 
indexing from a DB and using Tika, but those are independent.

Best,
Erick
On Tue, Oct 30, 2018 at 12:21 AM Kamuela Lau <kamuela....@gmail.com> wrote:
>
> Hi there,
>
> Here are a couple of ways I'm aware of:
>
> 1. Extract-handler / post tool
> You can use the curl command with the extract handler or bin/post to
> upload a single document.
> Reference:
> https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html
>
> 2. DataImportHandler
> This could be used for, say, uploading multiple documents with Tika.
> Reference:
> https://lucene.apache.org/solr/guide/7_5/uploading-structured-data-store-data-with-the-data-import-handler.html#the-tikaentityprocessor
>
> You should also be able to do it via the admin page, so long as you
> define and modify the extract handler in solrconfig.xml.
> Reference:
> https://lucene.apache.org/solr/guide/7_5/documents-screen.html#file-upload
>
> Hope this helps!
>
> On Tue, Oct 30, 2018 at 3:40 PM adiyaksa kevin
> <adiyaksake...@gmail.com>
> wrote:
>
> > Hello there, let me introduce myself. My name is Mohammad Kevin
> > Putra (you can call me Kevin), from Indonesia. I am a beginner backend
> > developer; I use Linux Mint, Apache Solr 7.5.0 and Apache Tika 1.91.0.
> >
> > I have a small problem with putting a PDF file into Solr via Apache
> > Tika. I understand how Solr and Tika each work, but I don't know how
> > they are integrated with each other.
> > As far as I know, Tika can extract the PDF file I upload and parse it
> > into data/metadata automatically, and then I just have to copy & paste
> > that into the "Documents" tab of the Solr core.
> > My questions are:
> > 1. Can I upload a PDF file to Solr via Tika in GUI mode, or only in
> > CLI mode? If only in CLI mode, can you explain it to me please?
> > 2. Is it possible to show the extracted text as a result in the "Query" tab?
> >
> > The background to my question is that I want to index PDFs on my local
> > system, just upload them with something like "drag & drop" into Solr
> > (is that possible?), and then, when I type something in the search box,
> > get a result like this:
> > (Title of doc)
> > blablablabla (highlighted result) blablabla.
> > The blablabla text is a couple of sentences. That's all I need.
> > Sorry for my bad English.
> > Thanks for reading and replying, it will be very helpful to me.
> > Thanks a lot
> >
