Jukka -
Thanks for these improvements. I noticed that, unlike me, you put the
stream closes in finally blocks. Thank you. My bad. ;)
Some questions:
1) While you are modifying the Parser class, could we change getContents()
to not swallow exceptions? For example, change:
} catch (Exception e) {
logger.error("Parse error: " + e.getMessage(), e);
return "";
}
--- to something like: ---
} catch (TikaException e) {
logger.error("Parse error: " + e.getMessage(), e);
throw e;
} catch (Exception e) {
logger.error("Parse error: " + e.getMessage(), e);
throw new TikaException(e);
}
and modify the method declaration to throw a TikaException?
2) In ParserFactory, we have:
} catch (Exception e) {
logger.error("Unable to instantiate parser: " + className, e);
throw new TikaException(e.getMessage());
}
When we adapt an exception to a TikaException, would it make sense to wrap
the entire exception, and not just its getMessage()?
TikaException.getMessage() could be modified to include the cause's
getMessage() in its return value if there is one. The advantages are not
huge, but it seems to me marginally better to do it that way. One reason to
preserve the exception cause is that Throwable has a getLocalizedMessage()
that the wrapped exception might implement, and we lose access to it if we
only wrap its getMessage().
3) In Parser.getContents(), we could use Commons Lang StringUtils to make
the code more nullsafe and a bit more concise by replacing:
int length = Math.min(contentStr.length(), 500);
String summary = contentStr.substring(0, length);
--- with: ---
String summary = StringUtils.left(contentStr, 500);
(Javadoc for StringUtils.left() is at
http://commons.apache.org/lang/apidocs/org/apache/commons/lang/StringUtils.html#left(java.lang.String,%20int%29).
It's too bad we can't have a custom object...then we could have a
getSummary() method that would do this so we don't run the risk of the
summary getting out of sync with respect to the fulltext content. Same for
getValue() always being getValues().get(0).
Regards,
Keith
JIRA [EMAIL PROTECTED] wrote:
>
>
> [
> https://issues.apache.org/jira/browse/TIKA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Jukka Zitting updated TIKA-33:
> ------------------------------
>
> Attachment: TIKA-33.patch
>
> Proposed patch.
>
> This patch replaces the getContentStr(), getContents() and
> getContent(String) methods on Parser with the following method:
>
> String getContents(InputStream stream, Map<String, Content> contents)
>
> The internal implementation is still the same as before, minus the state
> handling parts.
>
> This change makes the current ParserFactory, ParseUtils, and test case
> code somewhat cumbersome, but I didn't want to streamline them yet too
> much to keep the patch clearly focused.
>
>
--
View this message in context:
http://www.nabble.com/-jira--Created%3A-%28TIKA-33%29-Stateless-parsers-tf4522441.html#a12903197
Sent from the Apache Tika - Development mailing list archive at Nabble.com.