[
https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530822
]
Keith R. Bennett commented on TIKA-35:
--------------------------------------
Rida -
The big question is: do we support the ability of parser implementations to
make multiple passes over a stream? If so, then we need to incorporate this
cleanly into the architectural design. Possible solutions are:
1) Save the contents of the stream during the first pass. Or, if the stream
supports it, use mark() and reset().
2) Pass the Parsers a URL instead of an InputStream so that we can create a
stream multiple times. This is simpler, but runs the risk of the resource
changing between stream instantiations.
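
To make option 1 concrete, here is a minimal sketch using only java.io. The
class and method names (MultiPassSketch, readFully, rewindable) are
hypothetical and not part of Tika. It assumes the document fits in memory:
the helper reuses the stream if it already supports mark()/reset(), and
otherwise buffers the whole content into a byte array.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: illustrates multi-pass reading.
class MultiPassSketch {

    /** Reads the stream to its end and returns everything that was read. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    /**
     * Returns a stream that supports mark()/reset(). If the given stream
     * already does, it is returned unchanged; otherwise its remaining
     * content is buffered in memory (an assumption that may not hold for
     * very large documents).
     */
    static InputStream rewindable(InputStream in) throws IOException {
        if (in.markSupported()) {
            return in;
        }
        return new ByteArrayInputStream(readFully(in));
    }

    public static void main(String[] args) throws IOException {
        InputStream raw = new ByteArrayInputStream(
                "properties and full text".getBytes(StandardCharsets.UTF_8));
        InputStream in = rewindable(raw);

        // The read limit must cover everything read before reset();
        // ByteArrayInputStream ignores it, other streams may not.
        in.mark(Integer.MAX_VALUE);
        byte[] firstPass = readFully(in);   // e.g. property extraction
        in.reset();
        byte[] secondPass = readFully(in);  // e.g. full-text extraction

        System.out.println(firstPass.length == secondPass.length);
    }
}
```

With something like this, a parser would receive only the stream as an
argument and hold no resource identifier itself.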
IMO it would not be a good idea to put a resource identifier in the Parser
class, even temporarily -- this is the reverse direction from our goal of
making the parsers stateless.
Instead, we could start discussing (or should I say continue to discuss?) how
to support multiple passes cleanly in the architecture.
Thanks,
Keith
P.S. For anyone having trouble applying Rida's patch, passing the "-p5" option
to patch worked for me.
> Extract MsOffice properties
> ---------------------------
>
> Key: TIKA-35
> URL: https://issues.apache.org/jira/browse/TIKA-35
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 0.1-incubator
> Reporter: Rida Benjelloun
> Fix For: 0.1-incubator
>
> Attachments: tika35.patch
>
>
> Hi,
> I have developed a patch that allows MsOffice properties extraction. I wasn't
> able to extract the MsOffice properties and full text from a single
> InputStream; I always get this error:
> java.io.IOException: Unable to read entire header; -1 bytes read;
> expected 512 bytes.
> I don't know how they make it work in Nutch (any ideas?).
> To get it to work, I added a "filePath" variable to the parser class, which I
> populate from the ParseUtils class. I then create an InputStream from the
> filePath or URL to extract the properties, and use the default InputStream
> to extract the full text.
> I haven't committed this modification; I would like to hear your opinions first.
> Regards.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.