[ https://issues.apache.org/jira/browse/TIKA-203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732522#action_12732522 ]
Daan de Wit commented on TIKA-203:
----------------------------------

Created TIKA-262; also reproducible on WinXP, so it does not seem to be related to the OS.

> Earlier metadata extraction in ParsingReader
> --------------------------------------------
>
>                 Key: TIKA-203
>                 URL: https://issues.apache.org/jira/browse/TIKA-203
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.3
>
>         Attachments: lipsum.doc
>
>
> The normal parse() method guarantees that all extracted metadata will be
> available in the metadata object once the method returns. But since the
> ParsingReader class runs the parse() method in a background thread, one can
> only assume that extracted metadata is available once the entire character
> stream has been consumed. This is troublesome for example when creating
> Lucene Document objects, as Lucene postpones reading the given character
> stream to when the already constructed Document is passed to an IndexWriter.
> The result is that (depending on thread scheduling and the structure of the
> input document format) metadata may not be available for inclusion in the
> indexed Document.
> One way of fixing this issue is to add a small character buffer in
> ParsingReader, and to make sure that the buffer is filled with extracted text
> before the ParsingReader constructor returns. This should ensure that
> relevant document metadata is almost always available, since the majority of
> document formats have all or most metadata at the beginning of the document
> stream.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
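For readers following the issue, the buffering idea in the quoted description could be sketched roughly as below. This is only an illustration under stated assumptions, not the actual Tika code: `SketchParser` and `EagerParsingReader` are hypothetical names, the metadata object is replaced by a plain `Map`, and the "small character buffer" is approximated with a `BufferedReader` over a `PipedReader`. The constructor reads one character and then rewinds, so it only returns once the background parse has produced its first text (or has finished).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.Map;

/** Hypothetical stand-in for a parser: records metadata and writes extracted text. */
interface SketchParser {
    void parse(Writer out, Map<String, String> metadata) throws IOException;
}

/** Sketch of a reader whose constructor waits for the first extracted character. */
class EagerParsingReader extends Reader {
    private final BufferedReader reader;

    EagerParsingReader(SketchParser parser, Map<String, String> metadata)
            throws IOException {
        PipedReader piped = new PipedReader();
        PipedWriter writer = new PipedWriter(piped);
        this.reader = new BufferedReader(piped);

        // Run the parse in a background thread; closing the writer signals end of text.
        Thread worker = new Thread(() -> {
            try (Writer out = writer) {
                parser.parse(out, metadata);
            } catch (IOException e) {
                // Sketch only: a real implementation would surface this to the consumer.
            }
        });
        worker.setDaemon(true);
        worker.start();

        // Block until the first character (or end of stream) arrives, then rewind
        // so the caller still sees the complete text.
        reader.mark(1);
        reader.read();
        reader.reset();
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}
```

With this shape, any metadata the parser records before emitting its first character is visible to the caller as soon as the constructor returns. Metadata that a format stores after the text would still only become available later, which matches the "almost always" caveat in the description.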