ignoreTikaException flag not working
Hi!

I'm importing BLOBs from an Oracle DB and want to retrieve the plaintext body of each document for analysis and indexing. I'm using TikaEntityProcessor to parse the documents, which works fine for most of them. But in some cases, e.g. when a document is password protected, the parsing fails and Tika throws a "TIKA-198: Illegal IOException" (see the stack trace at the end of this post). This causes the entire data import to be rolled back, which is really unfortunate behavior.

After finding the ignoreTikaException flag (patch from Jira issue https://issues.apache.org/jira/browse/SOLR-2480), I thought my problem was fixed, but adding this flag to my extractingRequestHandler doesn't seem to do anything.

My requestHandler:

I've also tried adding ignoreTikaException=true as a custom parameter when running the data import, but that doesn't do anything either. Did I miss something, or has the ignoreTikaException mechanism changed in later versions of Solr?

The ERROR stack trace:

--
View this message in context: http://lucene.472066.n3.nabble.com/gnoreTikaException-flag-not-working-tp4168526.html
Sent from the Solr - User mailing list archive at Nabble.com.
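For readers following along, a handler definition with the flag set as a default might look roughly like the sketch below. This is not the poster's actual config, which is not shown; the handler name /update/extract and the surrounding structure are assumptions.

```xml
<!-- Sketch only: handler name and layout are assumed, not taken from the post. -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- The parameter discussed in SOLR-2480 -->
    <str name="ignoreTikaException">true</str>
  </lst>
</requestHandler>
```

Note that this parameter is read by ExtractingRequestHandler. Entities in a DataImportHandler config have their own onError attribute (abort, skip, or continue), so a flag set on the extracting handler would not necessarily reach TikaEntityProcessor.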
The exact same query gets executed n times for the nth row when retrieving body (plaintext) from BLOB column with Tika Entity Processor
Hi!

I'm not sure whether this is a problem or whether I just don't understand the debug response, but it seems somewhat odd to me.

The main entity can have multiple BLOB documents. I'm using Tika Entity Processor to retrieve the body (plaintext) from these documents and put the result in a multivalued field, filedata. The data-config looks like this:

It seems to work properly, but when I debug the data import, the query on the BLOB column (FILEDATA_BIN) of TABLE2 gets executed once for document #1, which is correct, but twice for document #2, three times for document #3, and so on.

I.e. for document #1:

And for document #2:

The result seems correct, i.e. it doesn't duplicate the filedata. But why does it query the DB twice for document #2? Any ideas? Maybe something is wrong in my config?

--
View this message in context: http://lucene.472066.n3.nabble.com/The-exact-same-query-gets-executed-n-times-for-the-nth-row-when-retrieving-body-plaintext-from-BLOB-r-tp4166822.html
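The poster's data-config is not shown. As a rough illustration of the kind of setup described (a parent entity, a child query on TABLE2 that returns the BLOB column, and a TikaEntityProcessor that reads the BLOB through a FieldStreamDataSource), a config could look like the sketch below. TABLE2, FILEDATA_BIN and filedata come from the post; TABLE1, ID, PARENT_ID, the entity and data source names, and the connection details are hypothetical placeholders.

```xml
<dataConfig>
  <!-- Hypothetical JDBC connection; replace the placeholders with real values -->
  <dataSource name="db" driver="oracle.jdbc.OracleDriver"
              url="jdbc:oracle:thin:@//HOST:1521/SERVICE"
              user="USER" password="PASSWORD"/>
  <!-- Streams the content of a column returned by a parent entity -->
  <dataSource name="fieldReader" type="FieldStreamDataSource"/>
  <document>
    <!-- Main entity: TABLE1 and ID are assumed names -->
    <entity name="main" dataSource="db" query="SELECT ID FROM TABLE1">
      <!-- One row per BLOB document that belongs to the main row -->
      <entity name="blob" dataSource="db"
              query="SELECT FILEDATA_BIN FROM TABLE2 WHERE PARENT_ID = '${main.ID}'">
        <!-- Tika parses the BLOB and emits its plaintext in the "text" column -->
        <entity name="tika" processor="TikaEntityProcessor"
                dataSource="fieldReader" dataField="blob.FILEDATA_BIN"
                format="text">
          <field column="text" name="filedata"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
```

With this nesting, the TABLE2 query is expected to run once per row of the main entity, which is the behavior the debug output is being compared against.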
Re: Issue with multivalued fields in UIMA
I had to overcome this issue, as I needed to analyze multivalued fields. That UIMA doesn't analyze multivalued fields is a known bug. With the help of Maryam, I solved the issue. The JIRA issue, along with a working patch, can be found here: https://issues.apache.org/jira/browse/SOLR-6622

--
View this message in context: http://lucene.472066.n3.nabble.com/Issue-with-multivalued-fields-in-UIMA-tp4155609p4166576.html