Classification: UK OFFICIAL
Morning all,

A new version of Baleen, the UIMA-based entity extraction and text analytics 
framework developed by Dstl (part of the UK Ministry of Defence), has been 
released. This version includes the following improvements:


*         New Annotator: MongoStemming uses a gazetteer and stemming to perform 
a pseudo-fuzzy match, finding gazetteer terms in different tenses and plural 
forms

*         New Cleaner: MergeAdjacent will merge adjacent entities of the same 
type

*         New Content Extractor: CsvContentExtractor splits CSV fields into 
content and metadata

*         New Collection Reader: LineReader will read a single file into 
multiple documents by line

*         New REST API to get configuration parameters for components (e.g. 
annotators)

*         Significant changes to the way gazetteer annotators work, including 
changing from RadixTrees to MultiMaps and implementing the Aho-Corasick 
algorithm, resulting in performance improvements for large gazetteers in the 
order of 100s

*         Lots of bug fixes and minor improvements
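
For anyone unfamiliar with Aho-Corasick, here is a minimal sketch of the idea 
in Python. It is an illustration only and is not Baleen's implementation 
(Baleen is written in Java); the names build_automaton and find_terms are 
made up for this example. All gazetteer terms are compiled into a single trie 
with failure links, so the text is scanned once regardless of how many terms 
the gazetteer holds.

```python
from collections import deque


def build_automaton(terms):
    """Compile terms into a trie with failure links (hypothetical helper)."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix in the trie
    out = [[]]    # out[node] -> terms that end at this node
    for term in terms:
        node = 0
        for ch in term:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(term)

    # Breadth-first pass: shallower nodes are finished before deeper ones,
    # so each child can inherit the outputs of its failure target.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_terms(text, automaton):
    """Return (start_index, term) for every occurrence, in a single pass."""
    goto, fail, out = automaton
    node = 0
    matches = []
    for i, ch in enumerate(text):
        # Follow failure links until some suffix can be extended by ch.
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for term in out[node]:
            matches.append((i - len(term) + 1, term))
    return matches


automaton = build_automaton(["he", "she", "his", "hers"])
print(find_terms("ushers", automaton))
```

The sketch matches raw characters and ignores case and token boundaries, 
which a real gazetteer annotator would need to handle.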

The latest release is available on GitHub: https://github.com/dstl/baleen

Any feedback, suggestions, comments, issues and code contributions are welcome! 
We're keen for people to help us improve it so that it's a useful tool for a 
wide range of people.

James