ignoreTikaException flag not working

2014-11-10 Thread 5ton3
Hi!

I'm importing BLOBs from an Oracle DB, and want to retrieve the textual
body/plaintext content for analyzing/indexing purposes. I'm using
TikaEntityProcessor to do the parsing of the documents, which works fine for
most of the documents. But in some cases, e.g. when a document is password
protected, the parsing fails and Tika throws a "TIKA-198: Illegal IOException"
error (see stack trace at the end of post). This causes the entire data import
to be rolled back, which is really unfortunate behavior.

After finding the patch that adds the ignoreTikaException flag (Jira issue
https://issues.apache.org/jira/browse/SOLR-2480), I thought my problem
was fixed, but adding this flag to my extractingRequestHandler doesn't seem
to do anything.

My requestHandler:


I've also tried adding ignoreTikaException=true as a custom parameter when
running the data import, but that doesn't do anything either.
Did I miss something, or has the ignoreTikaException mechanism changed in
later versions of Solr?
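
For context: SOLR-2480 adds ignoreTikaException to the ExtractingRequestHandler (Solr Cell), which is a separate code path from TikaEntityProcessor inside the DataImportHandler, so the flag may simply not apply to a data import. A minimal sketch of the DIH-side alternative follows, assuming the entity-level onError attribute (documented values: abort, skip, continue) is honoured by TikaEntityProcessor in the Solr version in use. The data source names, entity names, ID column, connection placeholders and field mapping are hypothetical; only TABLE2 and FILEDATA_BIN are taken from my actual setup.

```xml
<dataConfig>
  <!-- JDBC source for the rows; url, user and password are placeholders -->
  <dataSource name="db" type="JdbcDataSource" driver="oracle.jdbc.OracleDriver"
              url="jdbc:oracle:thin:@//HOST:1521/SERVICE"
              user="USER" password="PASSWORD"/>
  <!-- Exposes a BLOB column of the parent row as a stream for Tika -->
  <dataSource name="blob" type="FieldStreamDataSource"/>
  <document>
    <!-- Hypothetical parent entity; the ID column is an assumption -->
    <entity name="attachment" dataSource="db"
            query="SELECT ID, FILEDATA_BIN FROM TABLE2">
      <field column="ID" name="id"/>
      <!-- onError="skip" is meant to skip the failing document instead of
           aborting (and rolling back) the whole import; verify this against
           the Solr version in use -->
      <entity name="tika" processor="TikaEntityProcessor" dataSource="blob"
              dataField="attachment.FILEDATA_BIN" format="text"
              onError="skip">
        <field column="text" name="filedata"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

With onError="continue" the import would instead be expected to carry on as if the error had not happened, which may leave the document indexed without its extracted text.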

The ERROR stack trace:




--
View this message in context: 
http://lucene.472066.n3.nabble.com/gnoreTikaException-flag-not-working-tp4168526.html
Sent from the Solr - User mailing list archive at Nabble.com.


The exact same query gets executed n times for the nth row when retrieving body (plaintext) from BLOB column with Tika Entity Processor

2014-10-31 Thread 5ton3
Hi!

Not sure if this is a problem or if I just don't understand the debug
response, but it seems somewhat odd to me.
The main entity can have multiple BLOB documents. I'm using
TikaEntityProcessor to retrieve the body (plaintext) from these documents and
put the result in a multivalued field, filedata. The data-config looks like this:


It seems to work properly, but when I debug the data import, the query on
the BLOB column (FILEDATA_BIN) of TABLE2 gets executed once for document #1,
which is correct, but twice for document #2, three times for document #3,
and so on.
For example, for document #1:

And for document #2:

The result seems correct, i.e. it doesn't duplicate the filedata. But why
does it query the DB twice for document #2? Any ideas? Is something
wrong in my config?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/The-exact-same-query-gets-executed-n-times-for-the-nth-row-when-retrieving-body-plaintext-from-BLOB-r-tp4166822.html


Re: Issue with multivalued fields in UIMA

2014-10-30 Thread 5ton3
I had to overcome this issue, as I needed to analyze multivalued fields. The
fact that UIMA doesn't analyze multivalued fields is a known bug in UIMA. With
the help of Maryam, I solved the issue. The JIRA issue, along with a working
patch, can be found here: https://issues.apache.org/jira/browse/SOLR-6622



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Issue-with-multivalued-fields-in-UIMA-tp4155609p4166576.html