I would like to weigh in on the recent discussion (previously titled
"Parameters in uima descriptors") w.r.t. our thinking about descriptor
files in our UIMA project, ClearTK. The last time we got together we
decided that we were going to move away from providing descriptor files
for our project and move towards providing static factory methods for
creating *Description objects (e.g. AnalysisEngineDescription). If you
check out the code now (see http://cleartk.googlecode.com), you will see
that descriptor files are still scattered throughout it and that we have
started adding these factory methods, so this goal is still a work in
progress. These methods will serve two purposes: 1) to allow users to
instantiate our components directly in Java and 2) to guide users in
writing descriptor files for our components. While we understand the
purpose and necessity
of descriptor files, we are not going to provide them for the following
reasons:
1) Maintaining descriptor files is a giant pain in the butt. The
developers of ClearTK are two graduate students and a postdoc and we do
not have the resources (or patience) to maintain these files. We have
found that as we have evolved and refactored our code, our descriptor
files constantly break and are absurdly burdensome to
maintain. I don't want to call out others in this conversation (please
chime in as you will!) but I have had a number of conversations with
developers on several other UIMA projects and I am not alone in my
loathing of maintaining descriptor files. The maintenance is
particularly burdensome for descriptor files that you might create for
your unit tests. They are constantly breaking, they are tedious to fix,
and they discourage code refactoring and evolution by their mere
presence (let me tell you how I really feel!)
2) We cannot create all possible descriptor files that might be needed
to use ClearTK in the ways desired by the user. Our library relies
heavily on dynamic class loading driven by class names provided in
configuration parameters. For example, when you are writing training
data for a particular machine learning classifier you can specify the
class name of the data writer to be used (e.g. one for maxent or
libsvm). These data writers may require additional configuration
parameters that must be set in the descriptor file. Therefore, what
ends up in the descriptor file is determined by a specific use-case and
is not constrained to a fixed set of configuration parameters.
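
To make point 2 concrete, here is a minimal sketch in plain Java of the
pattern, not the actual ClearTK code. The DataWriter interface, the two
writer classes, and all of the parameter names below are hypothetical
stand-ins. The point is only that the set of parameters that matter
depends on which class name the user supplies.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a data writer interface; not the real ClearTK API.
interface DataWriter {
    // Each writer picks the parameters it understands out of the full map.
    void configure(Map<String, Object> params);
    String describe();
}

// Hypothetical writer with its own extra parameter (an iteration count).
class MaxentDataWriter implements DataWriter {
    private int iterations = 100;
    public void configure(Map<String, Object> params) {
        Object value = params.get("MaxentDataWriter.PARAM_ITERATIONS");
        if (value != null) iterations = (Integer) value;
    }
    public String describe() { return "maxent(iterations=" + iterations + ")"; }
}

// Hypothetical writer with a different extra parameter (a kernel name).
class LibsvmDataWriter implements DataWriter {
    private String kernel = "linear";
    public void configure(Map<String, Object> params) {
        Object value = params.get("LibsvmDataWriter.PARAM_KERNEL");
        if (value != null) kernel = (String) value;
    }
    public String describe() { return "libsvm(kernel=" + kernel + ")"; }
}

class DataWriterLoader {
    // Hypothetical parameter that names the writer class to instantiate.
    static final String PARAM_DATA_WRITER_CLASS = "DataWriterLoader.PARAM_DATA_WRITER_CLASS";

    // Loads the writer named in the parameters, then lets it configure itself.
    static DataWriter load(Map<String, Object> params) {
        String className = (String) params.get(PARAM_DATA_WRITER_CLASS);
        if (className == null) {
            throw new IllegalArgumentException("missing parameter: " + PARAM_DATA_WRITER_CLASS);
        }
        try {
            DataWriter writer = Class.forName(className)
                    .asSubclass(DataWriter.class)
                    .getDeclaredConstructor()
                    .newInstance();
            writer.configure(params);
            return writer;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot create data writer " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> params = new HashMap<>();
        params.put(PARAM_DATA_WRITER_CLASS, LibsvmDataWriter.class.getName());
        params.put("LibsvmDataWriter.PARAM_KERNEL", "rbf");
        System.out.println(load(params).describe()); // prints "libsvm(kernel=rbf)"
    }
}
```

Switching the class name to the maxent writer changes which extra
parameters are meaningful, which is why no single descriptor file can
cover every use.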
It is our goal to make it easy for users of ClearTK to write
descriptor files that are specific to their own use case by 1)
creating factory methods that demonstrate common ways that our
components can be described (i.e. the user can study these methods when
writing their own descriptor file), 2) naming our configuration
parameters according to a strict naming convention which points the user
to the canonical definition and documentation for a configuration
parameter (e.g.
"org.cleartk.classifier.InstanceConsumer.PARAM_ANNOTATION_HANDLER"), and
3) providing documentation on how to do this.
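
As a rough illustration of the first two items, here is a sketch in
plain Java. It assumes the naming convention is "fully qualified class
name, then the constant name", which is how I read the example above.
The classes below are hypothetical stand-ins: a real factory method
would return a UIMA AnalysisEngineDescription rather than a Map, and the
"implementationName" key and the com.example handler class are made up
for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder for the parameter constant; in the real project the
// constant would be declared on the component class itself.
class InstanceConsumerParams {
    // The parameter name doubles as a pointer to where it is defined and documented.
    static final String PARAM_ANNOTATION_HANDLER =
            "org.cleartk.classifier.InstanceConsumer.PARAM_ANNOTATION_HANDLER";
}

class DescriptionSketch {
    // Stand-in for a static factory method: collects the settings that a
    // descriptor file for this component would otherwise have to contain.
    static Map<String, Object> createInstanceConsumerDescription(String handlerClassName) {
        Map<String, Object> description = new LinkedHashMap<>();
        description.put("implementationName", "org.cleartk.classifier.InstanceConsumer");
        description.put(InstanceConsumerParams.PARAM_ANNOTATION_HANDLER, handlerClassName);
        return description;
    }

    // Recovers the class that defines a parameter from the parameter's name.
    static String declaringClassOf(String paramName) {
        int lastDot = paramName.lastIndexOf('.');
        if (lastDot < 0) {
            throw new IllegalArgumentException("not a qualified parameter name: " + paramName);
        }
        return paramName.substring(0, lastDot);
    }

    public static void main(String[] args) {
        Map<String, Object> description =
                createInstanceConsumerDescription("com.example.MyAnnotationHandler");
        for (Map.Entry<String, Object> entry : description.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
        // Shows where a user should look for the parameter's documentation.
        System.out.println(declaringClassOf(InstanceConsumerParams.PARAM_ANNOTATION_HANDLER));
    }
}
```

A user writing a descriptor file by hand can read such a method to see
which parameters to set, and can read each parameter name to find the
class that documents it.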
Here are a few more points that I want to make:
- we are not ruling out providing some descriptor files - esp. for
configurations that we think/hope will be useful to have for running
some of our components "out-of-the-box". Much of our code is intended
as a framework for users of ClearTK to create their own components using
common machine learning approaches. While we could anticipate the
general structure of descriptor files for such user-generated components
(and we have tried), we have decided that descriptors such as these in
particular will no longer be provided in ClearTK.
- we are not getting rid of type system descriptor files.
- we are not saying that descriptor files are of no use. They are
clearly very nice to have for sharing and for deployment. I do not use
them myself for setting up and running experiments in my research and we
feel that for the above reasons we do not have to provide them.
I hope this clarifies the discussion.
Philip