[jira] [Comment Edited] (TIKA-1764) Provide information on failed document parsing in ParsingEmbeddedDocumentExtractor

Tim Allison (JIRA) Fri, 09 Oct 2015 04:57:48 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950277#comment-14950277
 ]


Tim Allison edited comment on TIKA-1764 at 10/9/15 11:57 AM:
-------------------------------------------------------------

Yes, I completely agree that we all need to see when embedded documents are 
failing.  The RecursiveParserWrapper allowed me to discover TIKA-1651, for 
example, and I suspect that there are lots of other discoveries to be made with 
embedded objects.

I think I now remember why I haven't gotten around to fixing this...

The problem with logging the full metadata at that point in the code is that, 
when parsing with the standard AutoDetectParser, the metadata object holds no 
information about the container document.  So, all you'd get would be the 
detected mime type, the embedded object's name and any metadata that was 
pulled out before the parse failed.  In short, without other changes in our 
code, there would be no way to link that stacktrace or the metadata back to 
the source document with the AutoDetectParser.
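
To make the limitation concrete, here is a minimal sketch of what logging in 
that catch block could look like.  It uses only the JDK: {{ParseFailure}}, 
{{delegateParse}} and the map-based metadata are hypothetical stand-ins rather 
than Tika classes, and the exception message is made up for illustration.  The 
point is that the only thing available to log is the embedded document's own 
metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative sketch only; none of these types are Tika classes.
class EmbeddedFailureLoggingSketch {
    static final Logger LOG = Logger.getLogger("EmbeddedFailureLoggingSketch");

    /** Hypothetical stand-in for TikaException. */
    static class ParseFailure extends Exception {
        ParseFailure(String message) {
            super(message);
        }
    }

    /** Hypothetical delegate parser: records one field, then fails. */
    static void delegateParse(Map<String, String> metadata) throws ParseFailure {
        metadata.put("Content-Type", "application/vnd.ms-powerpoint");
        throw new ParseFailure("unexpected end of stream");
    }

    /** Returns true on success; logs a warning instead of failing silently. */
    static boolean parseEmbedded(Map<String, String> metadata) {
        try {
            delegateParse(metadata);
            return true;
        } catch (ParseFailure e) {
            // Only the embedded document's metadata is in scope here;
            // nothing identifies the container it came from.
            LOG.log(Level.WARNING,
                    "Could not parse embedded document; metadata so far: " + metadata, e);
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("resourceName", "testPPT_comment_broken.ppt");
        parseEmbedded(metadata);
    }
}
```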

I (re)tested this just now to confirm.  I truncated a ppt file and zipped it 
up.  This is what I got at that point in the code:
{noformat}
inside parsingEmbdeddedExtractor: date ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: embeddedRelationshipId ; testPPT_comment_broken.ppt
inside parsingEmbdeddedExtractor: X-Parsed-By ; org.apache.tika.parser.DefaultParser
inside parsingEmbdeddedExtractor: meta:save-date ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: modified ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: resourceName ; testPPT_comment_broken.ppt
inside parsingEmbdeddedExtractor: dcterms:modified ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: Last-Modified ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: Content-Length ; 63760
inside parsingEmbdeddedExtractor: Last-Save-Date ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: Content-Type ; application/vnd.ms-powerpoint
{noformat}

So, the only way to get the container doc's information would be to cache it as 
you're parsing the embedded documents and transmit that information through the 
ParseContext.  This is exactly what the RecursiveMetadataParser does, so I'm 
not sure that we'd want to modify anything within Tika to solve this problem 
because I think the existing solution is sufficient.
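
As a rough illustration of the caching idea (not the actual Tika 
implementation), the sketch below stores the container's name in a context 
object before descending into the embedded documents, then reads it back when 
an embedded parse fails.  {{Context}} only loosely imitates the type-keyed 
lookup style of ParseContext.  {{ContainerInfo}}, {{parseEmbedded}} and the 
file name {{test.zip}} are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; none of these types are Tika classes.
class ContainerContextSketch {

    /** Hypothetical type-keyed bag handed down through the parse. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key));
        }
    }

    /** What gets cached about the container before its children are parsed. */
    static final class ContainerInfo {
        final String resourceName;

        ContainerInfo(String resourceName) {
            this.resourceName = resourceName;
        }
    }

    /** Simulated embedded parse: fails for any name containing "broken". */
    static void parseEmbedded(String embeddedName) throws Exception {
        if (embeddedName.contains("broken")) {
            throw new Exception("truncated file: " + embeddedName);
        }
    }

    /** Parses each embedded entry; returns one message per failed entry. */
    static List<String> parseContainer(String containerName, List<String> embeddedNames) {
        Context context = new Context();
        context.set(ContainerInfo.class, new ContainerInfo(containerName));

        List<String> failures = new ArrayList<>();
        for (String name : embeddedNames) {
            try {
                parseEmbedded(name);
            } catch (Exception e) {
                // The cached container info links the failure to its source.
                ContainerInfo container = context.get(ContainerInfo.class);
                failures.add(container.resourceName + " -> " + name + ": " + e.getMessage());
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<String> embedded = Arrays.asList("ok.txt", "testPPT_comment_broken.ppt");
        parseContainer("test.zip", embedded).forEach(System.out::println);
    }
}
```

With the container name carried in the context, each failure message names 
both the container and the embedded entry, which is the link that is missing 
from the log output above.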

If you're using Solr Cell... I opened a ticket a while ago to parameterize the 
use of the RecursiveMetadataParser in Solr Cell/DIH (SOLR-7229), but I haven't 
worked on it at all.  If you'd like to help on that by giving feedback on what 
you'd need, I think the Solr community would be receptive.  We had very quick 
commits on SOLR-7189 and SOLR-7231.

As a side note, I would very strongly encourage you to support SOLR-7632 and 
move Tika out of the same JVM that is sending updates to Solr.  I don't think 
this should be the default, but I do think that users should be able to 
configure the use of tika-server instead of the current embedded use of Tika.
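
For anyone who hasn't tried the out-of-process route, the sketch below shows 
roughly what a client-side call to tika-server's {{/rmeta}} endpoint might 
look like using the JDK 11+ HTTP client.  The base URL (localhost on port 
9998) and the class and method names are assumptions for illustration.  This 
is not how Solr does or would integrate it.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative sketch only; class and method names are hypothetical.
class TikaServerRequestSketch {
    // Assumption: a tika-server instance is listening here.
    static final String ASSUMED_BASE_URL = "http://localhost:9998";

    /** Builds a PUT to /rmeta asking for JSON metadata of the document. */
    static HttpRequest buildRmetaRequest(String baseUrl, byte[] document) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/rmeta"))
                .header("Accept", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    /** Sends the request; a parser crash then happens in the server's JVM. */
    static String send(HttpClient client, HttpRequest request)
            throws IOException, InterruptedException {
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) {
        // Only builds and prints the request, so no running server is needed.
        HttpRequest request = buildRmetaRequest(ASSUMED_BASE_URL, new byte[0]);
        System.out.println(request.method() + " " + request.uri());
    }
}
```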

Finally, speaking of embedded documents, if you have any friends over on Kite, 
I'd encourage them to look at Kite's failure to handle embedded documents 
[here|https://github.com/kite-sdk/kite/issues/397].  There's every chance 
they've fixed this by now, but as of July, no dice.


 




> Provide information on failed document parsing in 
> ParsingEmbeddedDocumentExtractor
> ----------------------------------------------------------------------------------
>
>                 Key: TIKA-1764
>                 URL: https://issues.apache.org/jira/browse/TIKA-1764
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 1.5, 1.10
>            Reporter: Odilo Oehmichen
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The {{ParsingEmbeddedDocumentExtractor}} delegates the parsing of documents 
> to a {{Parser}}-instance.  
> If this parser fails with a {{TikaException}} the extractor class returns 
> silently:
> {code}
>  catch (TikaException e) {
>             // TODO: can we log a warning somehow?
>             // Could not parse the entry, just skip the content
>         }
> {code}
> This behaviour makes it very hard to detect problems concerning parsing.
> As the {{TODO}} in the source already states, please add some logging of the 
> exception here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
