SolrJ/Tika custom indexer not indexing CERTAIN .doc text?

Paden Thu, 09 Jul 2015 06:37:00 -0700

Hello, 

I've been working to get a search engine up and running for a little while
now. I'm using Solr to index from both a database and a file system.
Specifically, I use the file path stored in the database to find the file
in the filesystem and then merge the metadata in the DB with the file
system content. I pretty much figured out I had two options. I could use the
DIH or I could create my own custom indexer in Java. I got pretty far on the
indexer, almost complete actually. But I defaulted to the DIH because it
indexed all the files I had at the time well.


Now I'm taking the project to the next stage of development and I'm worried
that the larger PDFs that I have to index might just kill Tika/Solr,
thereby stopping me in my tracks. So I want to have that custom indexer as a
backup. As I said, I got pretty far with the custom indexer but I encountered
one problem at the end. Tika wouldn't index the text of all the .doc
files. It would parse the file without complaint, but when I got the results
in Solr the text field would be blank:

{
Author: "Some name" 
text: ""
}

Some context: I got these files from a .zip that was given to me by another
department, so they were all sitting in a single file system. After trying a
few things I finally created a NEW .doc and copied the text from another
.doc file in the system to see if that would work. And it did. So it's not
that it wasn't indexing the text of .doc files in general. It was just THOSE
.doc files I was given in the .zip. I didn't request another zip with fresh
files because that would mean jumping through some hoops, but I wonder if I
should. I don't feel like this is really a code issue; I feel like it might
be some bizarre file issue. I've posted the code below anyway, but really I
was just wondering whether anyone has run into this particular brand of
problem before and how they solved it. I'm using a Linux file system, so
there's that, and ALL data except for the text comes from the database. That
means author, id, etc. all come from the database. That's why I could get
the "Some name" author above in the response.
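
One cheap way to test the "bizarre file" theory without involving Tika or
Solr at all is to compare the first few bytes of a problem .doc from the
.zip with the freshly created .doc that worked. Binary Word 97-2003 files are
OLE2 containers, .docx files are ZIP archives, and RTF files start with the
text {\rtf, so a file that merely carries a .doc extension can be any of
these. The sketch below is only an assumption about how such a check could
look: DocSniffer, sniff and sniffFile are made-up names, it uses nothing but
the JDK, and it identifies the container format only. It does not prove
that Tika will or will not extract text.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: guesses the container format of a file from its first bytes. */
final class DocSniffer {

    // OLE2 compound file signature, used by binary Word 97-2003 .doc files.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header ("PK\3\4"), which is how .docx files begin.
    private static final byte[] ZIP = { 0x50, 0x4B, 0x03, 0x04 };

    // RTF documents begin with this ASCII text.
    private static final byte[] RTF = "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    /** Returns "ole2", "zip", "rtf" or "unknown" for the given leading bytes. */
    static String sniff(byte[] header) {
        if (startsWith(header, OLE2)) return "ole2";
        if (startsWith(header, ZIP)) return "zip";
        if (startsWith(header, RTF)) return "rtf";
        return "unknown";
    }

    /** Reads up to the first 8 bytes of the file and classifies them. */
    static String sniffFile(Path file) throws IOException {
        byte[] buf = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int r;
            while (n < buf.length && (r = in.read(buf, n, buf.length - n)) > 0) {
                n += r;
            }
        }
        return sniff(Arrays.copyOf(buf, n));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

Calling DocSniffer.sniffFile(...) on one file from the .zip and on the new
.doc, then comparing the two answers, would at least show whether the two
groups of files are the same kind of file underneath the extension.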


import org.apache.solr.client.solrj.impl.HttpSolrClient; 

import org.apache.solr.client.solrj.SolrServerException; 

import org.apache.solr.client.solrj.impl.XMLResponseParser; 

import org.apache.solr.client.solrj.SolrClient; 

import org.apache.solr.client.solrj.response.UpdateResponse; 

import org.apache.solr.common.SolrInputDocument; 

import org.apache.pdfbox.pdmodel.PDDocument; 

/* Tika jars need to be retrieved online */ 

import org.apache.tika.metadata.Metadata; 

import org.apache.tika.parser.pdf.PDFParser; 

import org.apache.tika.parser.AutoDetectParser; 

import org.apache.tika.parser.ParseContext; 

import org.apache.tika.sax.BodyContentHandler; 

import org.xml.sax.ContentHandler; 


import java.io.File;

import java.io.FileInputStream; 

import java.io.IOException; 

import java.io.InputStream; 

import java.sql.*; 

import java.util.ArrayList; 

import java.util.Collection;




public class TikaSqlIndexer {
        
        private SolrClient server = new HttpSolrClient("http://localhost:8983/solr/Testcore3");
        
        private long _start = System.currentTimeMillis(); 
        
        private AutoDetectParser autoParser;
        
        private PDFParser pdfParser; 
        
        private int _totalTika = 0; 
        
        private int _totalSql = 0; 
        
        
        private Collection<SolrInputDocument> _docs = new ArrayList<>(); 


        
        public static void main(String[] args) {
                
                try{ 
                
                TikaSqlIndexer idxer = new TikaSqlIndexer("http://localhost:8983/solr/Testcore3"); 

                //idxer.Index(); 
                
                idxer.doTikaDocuments(new
File("/home/paden/Documents/LWP_Files/BIGDATA")); 
                
                } catch (Exception e) { 
                        
                        e.printStackTrace(); 
                        
                }
        }
        
        private TikaSqlIndexer(String url) throws IOException, SolrServerException { 
                
                // creates a channel with the Solr server 
                
                server = new HttpSolrClient(url); 
                
                //server.setParser(new XMLResponseParser()); 
                
                autoParser = new AutoDetectParser(); 
                
                pdfParser = new PDFParser(); 
                 
                
        }
        
        private void Index() throws SQLException, SolrServerException {
        
                Connection con = null; 
                
                try{ 
                        
                        Class.forName("com.mysql.jdbc.Driver").newInstance(); 
                        
                        log("Driver Loaded ......"); 
                        
                        String URL = 
"jdbc:mysql://localhost:3306/EDMS_Metadata"; 
                        
                        String user = "root";
                        
                        String pass = "Natsulucyerzagrey7";
                        
                        con = DriverManager.getConnection(URL, user, pass); 
                        
                        Statement st = con.createStatement();
                        
                        ResultSet rs = st.executeQuery(
                                "select ID,Title,TextContentURL,AuthorCreator from MasterIndex"); 
                        
                        while(rs.next()){
                                
                                // DO NOT move this outside the while loop 
                                
                                SolrInputDocument doc = new 
SolrInputDocument(); 
                                
                                String id = rs.getString("id"); 
                                
                                String title = rs.getString("Title"); 
                                
                                String filepath = 
rs.getString("TextContentURL"); 
                                
                                //String key = rs.getString("OriginalRecordKey"); 
                                                
                        /*      
                        if(key != null){
                                
                                doc.addField("key", key); 
                        } */ 
                                
                                
                                //if(id != null){       
                                        
                        //      doc.addField("id", id);
                                
                                //}
                                
                                
                                //if(title !=null){ 
                                
                            //doc.addField("title", title);
                                
                        //      }
                            
                            if(filepath != null){
                                
                                doc.addField("filepath", filepath);
                                
                            }
                            
                                System.out.println(filepath); 
                                
                                
                                // Guard against a NULL TextContentURL as well as an empty one.
                                if(filepath != null && !filepath.isEmpty()){
                                
                                File file = new File(filepath);                 
                                
                                 // Get ready to parse the file.
                              ContentHandler textHandler = new 
BodyContentHandler();
                              
                              Metadata metadata = new Metadata();
                              
                              ParseContext context = new ParseContext();
                         
                                // Try parsing the file. Note we haven't checked at all to
                                // see whether this file is a good candidate.
                                // try-with-resources closes the stream after each file;
                                // a file that cannot be opened is logged and skipped.
                              try (InputStream input = new FileInputStream(file)) {
                                          
                                autoParser.parse(input, textHandler, metadata, context);
                                
                              } catch (Exception e) {
                                  
                                  // Needs better logging of what went wrong in order to
                                  // track down "bad" documents.
                                  
                                log(String.format("File %s failed", 
file.getCanonicalPath()));
                                
                                e.printStackTrace();
                                
                                continue;
                                
                              }
                              // Just to show how much meta-data and what form it's in.
                              // Index just a couple of the meta-data fields.
                              // Crude way to get known meta-data fields.
                              // Also possible to write a simple loop to examine all the
                              // metadata returned and selectively index it and/or
                              // just get a list of them.
                              // One can also use the LucidWorks field mapping to
                              // accomplish much the same thing.
                          
                            
                         if(textHandler != null){
                
                            doc.addField("text", textHandler.toString());       
                                            
                              
                         }
                               
                                
                              try{      
                                        
                                  server.add(doc); 
                                  
                                  server.commit(true, true); 
                                  
                                  
                              }catch (Exception ex){
                                                        
                                        log(String.format("File %s failed", 
file.getCanonicalPath()));
                                                
                                  ex.printStackTrace();
                                  
                                 
                                                        
                                  continue;  
                              }  
                              
                                }
                              
                        }
                        /*
                              if(_docs.size() > 0) { 
                                  
                                  UpdateResponse resp = server.add(_docs); 
                                  
                                        if(resp.getStatus() != 0) { 
                                                
                                                log("Some horrible error has occurred, status is: "
                                                        + resp.getStatus()); 
                                        } 
                              
                              _docs.clear(); 
                              
                                } */ 
                                
                        
                                  
                        
                } catch (Exception ex) { 
                                
                        ex.printStackTrace(); 
                                
                                } finally {
                                
                                        if (con != null){
                                        
                                                con.close(); 
                                        
                                                }
                                }
        }
        
        
        
        
        
          private static void log(String msg) {

                    System.out.println(msg);

                  }
          
          
          private void doTikaDocuments(File root) throws IOException,
SolrServerException {
                  
                    // Simple loop for recursively indexing all the files
                    // in the root directory passed in.
                    for (File file : root.listFiles()) {
                      if (file.isDirectory()) {
                        doTikaDocuments(file);
                        continue;
                      }
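                        // Get ready to parse the file. Note: the no-arg
                        // BodyContentHandler caps extracted text at 100,000
                        // characters; new BodyContentHandler(-1) removes the cap.
                      ContentHandler textHandler = new BodyContentHandler();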
                      Metadata metadata = new Metadata();
                      ParseContext context = new ParseContext();
                 
                        // Try parsing the file. Note we haven't checked at all to
                        // see whether this file is a good candidate.
                        // try-with-resources closes the stream after each file;
                        // a file that cannot be opened is logged and skipped.
                      try (InputStream input = new FileInputStream(file)) {
                        autoParser.parse(input, textHandler, metadata, context);
                      } catch (Exception e) {
                          // Needs better logging of what went wrong in order to
                          // track down "bad" documents.
                        log(String.format("File %s failed", 
file.getCanonicalPath()));
                        e.printStackTrace();
                        continue;
                      }
                      // Just to show how much meta-data and what form it's in.
                     // dumpMetadata(file.getCanonicalPath(), metadata);
                 
                      // Index just a couple of the meta-data fields.
                      
                      SolrInputDocument doc = new SolrInputDocument();
                 
                      doc.addField("id", file.getCanonicalPath());
                 
                      // Crude way to get known meta-data fields.
                      // Also possible to write a simple loop to examine all the
                      // metadata returned and selectively index it and/or
                      // just get a list of them.
                      // One can also use the LucidWorks field mapping to
                      // accomplish much the same thing.
                      
                      String author = metadata.get("Author");
                 
                      if (author != null) {
                          
                        doc.addField("author", author);
                      }
                 
                      doc.addField("text", textHandler.toString());
                 
                      System.out.println(file.getCanonicalPath()); 
                      System.out.println(textHandler); 
                 
                    
                      ++_totalTika;
                 
                      // Completely arbitrary, just batch up more than one document
                      // for throughput!
                      // Commit within 5 minutes.
                      
                        System.out.println("id"); 
                        server.add(doc);
                     
                        }
                        _docs.clear();
                      }
                    }
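
The comments near the end of doTikaDocuments talk about batching documents
for throughput, but the listing sends one document per server.add(doc) call
and never puts anything into _docs. A minimal sketch of what the batching
could look like, using only the JDK: Batcher is a made-up name, and the
"sink" stands in for whatever sends a batch on. In the indexer that would
presumably be a lambda wrapping server.add(docs), which would have to catch
the checked exceptions that call can throw.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: collects items and hands them to a sink in fixed-size batches. */
final class Batcher<T> {

    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> pending = new ArrayList<>();

    Batcher(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Queues one item and sends a batch as soon as batchSize items are pending. */
    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is pending; call once more after the last add(). */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Hand over a copy so the sink can keep the list after pending is cleared.
        sink.accept(new ArrayList<>(pending));
        pending.clear();
    }
}
```

The final flush() matters: without it, a last partial batch would never reach
the sink.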




--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrJ-Tika-custom-indexer-not-indexing-CERTAIN-doc-text-tp4216541.html
Sent from the Solr - User mailing list archive at Nabble.com.
