Hi, while not strictly a UIMA issue, we have a problem that seems very
relevant in the context of UIMA analysis engines: how to manage large
binary resources, such as the trained models used by an AE.
So far, we have managed to achieve a good separation between code
development and the actual AEs, using Maven (and git for version
control). An AE thus consists only of a POM referencing the code, the AE
descriptor, and the resources used for the AE. The AE POMs are
configured to generate PEAR archives that include all dependencies and
resources.
At this point the code is in git, along with each AE's POM and
descriptor, but we copy the resources into the AE directory by hand
before running `mvn package` (and exclude those resources from git).
What we're missing is a way to manage those resources, including
versioning.
I'm guessing that this is a rather typical problem, so what solutions do
you use? We're thinking of also putting all resources in a Maven
repository (e.g. Artifactory) so we can reference them with a unique
identifier and version. This would also help when we move to more
complex pipeline assemblies built with uimaFIT, so that we can create
complete packages instead of generating individual PEARs for each
component.
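To make the "unique identifier and version" idea concrete, here is a
rough sketch of how a resource could be addressed once it sits in a
Maven repository. The class name `ResourceCoordinates`, the group id
`com.example.models`, and the artifact id `spanish-pos-model` are made
up for illustration; the only thing taken as given is the standard
Maven repository layout (group id as directories, then artifact id,
then version, then `artifactId-version.extension`).

```java
import java.util.Objects;

/**
 * Hypothetical helper (not part of UIMA, uimaFIT or Maven): Maven-style
 * coordinates for a binary resource such as a trained model.
 */
public final class ResourceCoordinates {
    private final String groupId;
    private final String artifactId;
    private final String version;

    public ResourceCoordinates(String groupId, String artifactId, String version) {
        this.groupId = Objects.requireNonNull(groupId, "groupId");
        this.artifactId = Objects.requireNonNull(artifactId, "artifactId");
        this.version = Objects.requireNonNull(version, "version");
    }

    /** Parses a string of the form "groupId:artifactId:version". */
    public static ResourceCoordinates parse(String gav) {
        String[] parts = gav.split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                    "Expected groupId:artifactId:version but got: " + gav);
        }
        return new ResourceCoordinates(parts[0], parts[1], parts[2]);
    }

    /**
     * Path of the artifact relative to the repository root, following the
     * standard Maven repository layout. The extension (e.g. "bin") is the
     * artifact's packaging type and is an assumption of this sketch.
     */
    public String repositoryPath(String extension) {
        return groupId.replace('.', '/')
                + "/" + artifactId
                + "/" + version
                + "/" + artifactId + "-" + version + "." + extension;
    }

    public static void main(String[] args) {
        // Assumed coordinates; the group id is invented for this example.
        ResourceCoordinates model =
                parse("com.example.models:spanish-pos-model:1.2.3");
        System.out.println(model.repositoryPath("bin"));
    }
}
```

With coordinates like these, an AE POM could declare the model as an
ordinary dependency, and each released version of the model would stay
retrievable under its own path.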
Btw, we have only a few core developers, with most of the team made up
of linguists, so we want to make it easy for them to save versions of
the resources they create and to assemble AEs by just referencing the
algorithm and resource (e.g. "create a new OpenNLP POS tagger using
spanish-pos-model.bin, version 1.2.3").
Thanks for sharing your experiences with this...
Jens