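I will second the SolrJ method. You don’t want to be running Tika extraction inside your Solr instance. One question is whether your PDFs are scanned or already searchable. I use Tesseract offline to convert all scanned PDFs into searchable PDFs, so I don’t want Tika doing OCR as well. The core of my code is:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.client.solrj.response.UpdateResponse;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    // Fragment from a method that returns boolean. filename, title, author,
    // idString, access, datasource, year, opfyear, password and solr
    // (a SolrClient) are defined elsewhere.
    File f = new File(filename);
    ContentHandler textHandler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    if (filename.toLowerCase().contains("pdf")) {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(false);
        // Remove this line (in fact remove the whole PDFParserConfig)
        // if you want Tika to OCR
        pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(Parser.class, parser);
    }

    // try-with-resources closes the stream even if parsing fails
    try (InputStream input = new FileInputStream(f)) {
        parser.parse(input, textHandler, metadata, context);
    } catch (Exception e) {
        e.printStackTrace();
        return false;
    }

    SolrInputDocument up = new SolrInputDocument();
    // Fall back to the document's own metadata when no value was supplied
    if (title == null) title = metadata.get("title");
    if (author == null) author = metadata.get("author");
    up.addField("id", f.getCanonicalPath());
    // load up whatever fields you are using
    up.addField("location", idString);
    up.addField("access", access);
    up.addField("datasource", datasource);
    up.addField("title", title);
    up.addField("author", author);
    if (year > 0) up.addField("year", year);
    if (opfyear > 0) up.addField("opfyear", opfyear);
    String content = textHandler.toString();
    up.addField("_text_", content);

    UpdateRequest req = new UpdateRequest();
    req.add(up);
    req.setBasicAuthCredentials("solrAdmin", password);
    UpdateResponse ur = req.process(solr, "prindex");
    req.commit(solr, "prindex");
    return true;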
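One caveat on the filename.toLowerCase().contains("pdf") test: it also matches a path such as pdfs/notes.txt. If that matters for your files, a stricter check is to look at the first bytes of the file, since a well-formed PDF begins with the ASCII marker "%PDF-". What follows is only a minimal sketch using the JDK alone (Java 11 or later for readNBytes). PdfSniffer and its method names are hypothetical, not part of Tika or SolrJ, and it will reject the occasional PDF that has junk bytes before the header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper (not part of Tika or SolrJ): detects PDFs by their header bytes. */
final class PdfSniffer {
    // A well-formed PDF starts with the ASCII marker "%PDF-" followed by a version number.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfSniffer() {}

    /** Returns true if the given leading bytes begin with the "%PDF-" marker. */
    static boolean hasPdfHeader(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, MAGIC.length), MAGIC);
    }

    /** Reads only the first few bytes of the file and checks them for the marker. */
    static boolean looksLikePdf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return hasPdfHeader(in.readNBytes(MAGIC.length));
        }
    }
}
```

In the fragment above, the condition would then become something like if (PdfSniffer.looksLikePdf(f.toPath())), so the PDFParserConfig is applied based on the file contents rather than on its name.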

-----Original Message-----
From: Erick Erickson <erickerick...@gmail.com>
Sent: Wednesday, 31 October 2018 06:00
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Indexing PDF file in Apache SOLR via Apache TIKA

All of the above work, but for robust production situations you'll want to consider a SolrJ client, see: https://lucidworks.com/2012/02/14/indexing-with-solrj/. That blog combines indexing from a DB and using Tika, but those are independent.

Best,
Erick

On Tue, Oct 30, 2018 at 12:21 AM Kamuela Lau <kamuela....@gmail.com> wrote:
>
> Hi there,
>
> Here are a couple of ways I'm aware of:
>
> 1. Extract-handler / post tool
> You can use the curl command with the extract handler or bin/post to
> upload a single document.
> Reference:
> https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html
>
> 2. DataImportHandler
> This could be used for, say, uploading multiple documents with Tika.
> Reference:
> https://lucene.apache.org/solr/guide/7_5/uploading-structured-data-store-data-with-the-data-import-handler.html#the-tikaentityprocessor
>
> You should also be able to do it via the admin page, so long as you
> define and modify the extract handler in solrconfig.xml.
> Reference:
> https://lucene.apache.org/solr/guide/7_5/documents-screen.html#file-upload
>
> Hope this helps!
>
> On Tue, Oct 30, 2018 at 3:40 PM adiyaksa kevin <adiyaksake...@gmail.com>
> wrote:
>
> > Hello there, let me introduce myself. My name is Mohammad Kevin
> > Putra (you can call me Kevin), from Indonesia. I am a beginner
> > backend developer. I use Linux Mint, Apache SOLR 7.5.0 and Apache
> > TIKA 1.91.0.
> >
> > I have a bit of a problem with how to put a PDF file into SOLR via
> > Apache TIKA. I understand how SOLR and TIKA each work, but I don't
> > know how the two are integrated.
> > The last thing I know is that TIKA can extract the PDF file I upload
> > and parse it into data/metadata automatically.
> > And I just have to copy & paste it into the "Documents" tab of the
> > SOLR core.
> > The questions are:
> > 1. Can I upload a PDF file to SOLR via TIKA in GUI mode, or is it
> > only possible in CLI mode? If it is only possible in CLI mode, can
> > you explain it to me please?
> > 2. Is it possible to add a text result in the "Query" tab?
> >
> > The background to my question is that I want to index PDFs on my
> > local system, then just upload them, like "drag & drop", into SOLR
> > (is that possible?). Then when I type something in the search box,
> > the result looks like this:
> > (Title of doc)
> > blablablabla (yellow stabilo result) blablabla.
> > The blablabla text is a couple of sentences. That's all I need.
> > Sorry for my bad English.
> > Thanks for reading and replying; it will be very helpful to me.
> > Thanks a lot

Notice: This email and any attachments are confidential and may not be used, published or redistributed without the prior written consent of the Institute of Geological and Nuclear Sciences Limited (GNS Science). If received in error please destroy and immediately notify GNS Science. Do not copy or disclose the contents.