I would like to weigh in on the recent discussion (previously titled "Parameters in uima descriptors") w.r.t. our thinking about descriptor files in our UIMA project, ClearTK (see http://cleartk.googlecode.com). The last time we got together we decided to move away from providing descriptor files for our project and towards providing static factory methods for creating *Description objects (e.g. AnalysisEngineDescription). If you check out the code now, you will see that descriptor files are still scattered throughout it and that we have started adding these factory methods, so realizing this goal is still in progress. These methods will serve two purposes: 1) to allow users to directly instantiate our components in Java and 2) to guide users in how to write descriptor files for our components. While we understand the purpose and necessity of descriptor files, we are not going to provide them for the following reasons:

1) Maintaining descriptor files is a giant pain in the butt. The developers of ClearTK are two graduate students and a postdoc and we do not have the resources (or patience) to maintain these files. We have found that as we have evolved and refactored our code, our descriptor files are constantly breaking and are absurdly burdensome to maintain. I don't want to call out others in this conversation (please chime in as you will!) but I have had a number of conversations with developers on several other UIMA projects and I am not alone in my loathing of maintaining descriptor files. The maintenance is particularly burdensome for descriptor files that you might create for your unit tests. They are constantly breaking, they are tedious to fix, and they discourage code refactoring and evolution by their mere presence (let me tell you how I really feel!).

2) We cannot create all possible descriptor files that might be needed to use ClearTK in the ways desired by the user. Our library relies heavily on dynamic class loading driven by class names provided in configuration parameters. For example, when you are writing training data for a particular machine learning classifier you can specify the class name of the data writer to be used (e.g. one for maxent or libsvm). These data writers may require additional configuration parameters that must be set in the descriptor file. Therefore, what ends up in the descriptor file is determined by a specific use case and is not constrained to a fixed set of configuration parameters. It is our goal to make it easy for users of ClearTK to write descriptor files that are specific to their own use case/scenario by 1) creating factory methods that demonstrate common ways that our components can be described (i.e. the user can study these methods when writing their own descriptor file), 2) naming our configuration parameters according to a strict naming convention which points the user to the canonical definition and documentation for a configuration parameter (e.g. "org.cleartk.classifier.InstanceConsumer.PARAM_ANNOTATION_HANDLER") and 3) providing documentation on how to do this.
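To make the idea in 2) a bit more concrete, here is a rough sketch in plain Java. It deliberately does not use the UIMA or ClearTK APIs: ComponentDescription is a hypothetical stand-in for a *Description object, and every class, method and parameter name in it (TrainingDataAnnotator, MaxentStyleWriter, LibsvmStyleWriter, PARAM_COST, createTrainingDescription, loadDataWriter) is made up for illustration. It only shows the pattern: parameter names that point back to the class that defines them, a factory method that fills in the description, and a data writer that is loaded from a class name at runtime.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DescriptionFactorySketch {

    /** Hypothetical stand-in for a UIMA *Description object (not the real API). */
    public static final class ComponentDescription {
        public final String componentClassName;
        public final Map<String, Object> parameters = new LinkedHashMap<>();

        ComponentDescription(String componentClassName) {
            this.componentClassName = componentClassName;
        }
    }

    /** Naming convention: "<name of the defining class>.<constant name>". */
    public static String paramName(Class<?> owner, String constantName) {
        return owner.getName() + "." + constantName;
    }

    /** Hypothetical data writer; the implementation is chosen by class name. */
    public interface DataWriter {
        String describe();
    }

    /** Hypothetical writer that needs no extra parameters. */
    public static class MaxentStyleWriter implements DataWriter {
        public String describe() { return "maxent-style writer"; }
    }

    /** Hypothetical writer that brings one extra parameter of its own. */
    public static class LibsvmStyleWriter implements DataWriter {
        public static final String PARAM_COST =
                paramName(LibsvmStyleWriter.class, "PARAM_COST");
        public String describe() { return "libsvm-style writer"; }
    }

    /** Hypothetical component that writes training data via a data writer. */
    public static class TrainingDataAnnotator {
        public static final String PARAM_DATA_WRITER =
                paramName(TrainingDataAnnotator.class, "PARAM_DATA_WRITER");
    }

    /** Factory method: the set of parameters depends on the chosen writer. */
    public static ComponentDescription createTrainingDescription(
            Class<? extends DataWriter> writerClass, Object... extraParams) {
        if (extraParams.length % 2 != 0) {
            throw new IllegalArgumentException("parameters must be name/value pairs");
        }
        ComponentDescription description =
                new ComponentDescription(TrainingDataAnnotator.class.getName());
        description.parameters.put(
                TrainingDataAnnotator.PARAM_DATA_WRITER, writerClass.getName());
        for (int i = 0; i < extraParams.length; i += 2) {
            description.parameters.put((String) extraParams[i], extraParams[i + 1]);
        }
        return description;
    }

    /** Dynamic class loading driven by the class name stored in the description. */
    public static DataWriter loadDataWriter(ComponentDescription description) {
        String className = (String) description.parameters.get(
                TrainingDataAnnotator.PARAM_DATA_WRITER);
        try {
            return (DataWriter) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot load data writer " + className, e);
        }
    }

    public static void main(String[] args) {
        ComponentDescription description = createTrainingDescription(
                LibsvmStyleWriter.class, LibsvmStyleWriter.PARAM_COST, 1.0);
        // Each parameter name tells the reader which class documents it.
        description.parameters.forEach(
                (name, value) -> System.out.println(name + " = " + value));
        System.out.println(loadDataWriter(description).describe());
    }
}
```

Swapping LibsvmStyleWriter for MaxentStyleWriter in main changes which parameters end up in the description, which is the reason a fixed, pre-written descriptor file cannot cover every case.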

Here are a few more points that I want to make:

- We are not ruling out providing some descriptor files - esp. for configurations that we think/hope will be useful to have for running some of our components "out-of-the-box". Much of our code is intended as a framework for users of ClearTK to create their own components using common machine learning approaches. While we could anticipate the general structure of descriptor files for such user-generated components (and we have tried), we have decided that descriptors such as these in particular will no longer be provided in ClearTK.
- We are not getting rid of type system descriptor files.
- We are not saying that descriptor files are of no use. They are clearly very nice to have for sharing and for deployment. I do not use them myself for setting up and running experiments in my research and we feel that for the above reasons we do not have to provide them.

I hope this clarifies the discussion.

Philip
