Hi Guys,

 I'd like to chime in on this one:

> Let's take as an example Chris' great feature of turning on or off byte
> array MIME type detection.   If I'm not mistaken, overriding the default
> setting now requires creation of a new XML file (since using the default
> settings does not require creation of, or even knowledge of, such a file).

 I think we need to divorce ourselves from the idea that Tika (or any other
system that uses XML configuration files) requires XML to configure it.
Simply stated, many of Tika's configuration classes, e.g., TikaConfig and
MimeTypes, expose APIs that control their functionality. As good coding
practice, and so that users don't have to recompile Tika every time they
want to change a configuration property, a lot of properties are factored
out into separate XML configuration files, which are read and used to
construct the user's desired internal Tika configuration object. The use of
XML here is not mandated: we could have used a Java properties-style file
(a=b\nc=d\n, etc.), we could have loaded the configuration from an external
database, or we could have provided no external configuration files at all
and required Tika to be configured programmatically as the only option.
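
 To make that concrete, here is a minimal sketch of the same setting being
applied two ways: programmatically, and from a properties-style file. This
is not Tika code. The DetectionSettings class and the "mime.magic" key are
made up purely for illustration; only java.util.Properties is real.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical settings holder for illustration; not a real Tika class. */
public class DetectionSettings {
    // Built-in default, used when nothing external overrides it.
    private boolean magicDetection = true;

    public boolean isMagicDetection() { return magicDetection; }

    public void setMagicDetection(boolean enabled) { this.magicDetection = enabled; }

    /** Builds settings from properties-style text such as "mime.magic=false". */
    public static DetectionSettings fromProperties(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        DetectionSettings settings = new DetectionSettings();
        // "mime.magic" is an assumed key name; fall back to the default if absent.
        String value = props.getProperty(
                "mime.magic", String.valueOf(settings.isMagicDetection()));
        settings.setMagicDetection(Boolean.parseBoolean(value));
        return settings;
    }

    public static void main(String[] args) throws IOException {
        // Route 1: purely programmatic, no configuration file involved.
        DetectionSettings programmatic = new DetectionSettings();
        programmatic.setMagicDetection(false);

        // Route 2: the same setting read from externalized text.
        DetectionSettings external = fromProperties("mime.magic=false\n");

        System.out.println(programmatic.isMagicDetection() == external.isMagicDetection());
    }
}
```

 Either route ends up calling the same setter on the same object; the file
format is just a front end to the API, which is the point I'm trying to
make about XML.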

 However, it's important to understand that XML is a convenience only, not
a necessity. Additionally, just because Tika ships with a tika-config.xml
file, or a tika-mimetypes.xml, doesn't mean that those files are the
end-all for configuration, or that they're generally applicable to
everyone's deployment environment or use case. We should probably emphasize
that these files, because they are easy and convenient to edit, are meant to
be changed to meet users' use cases. We ship the XML configuration files
with the best, most general guesses that we could make; that doesn't mean
that they'll never need to be changed.

The best example of this I can imagine would be something like the Apache
webserver. It ships with a mime types configuration file -- however, it
doesn't try to include every possible mime type in there as a default. In
fact, if you have exotic content types that you create (e.g., within the
scientific domain there are a lot of .hdf files), then you need to manually
edit this file yourself and add in your new exotic mime types. Additionally,
think about the httpd.conf file that ships with Apache. There are parameters
in there such as ServerAdmin and Listen, things that come with default
values but most likely need to be configured by each person who downloads
and uses the software. It's the same case here, with Tika.

> 
> I suppose even that is ok for a 0.1 release, but here's another thing.  I
> was having problems using this feature.  I looked through the code, and
> could not see where the byte array MIME type detection was being used.  If
> there had been a minimal test exercising the feature I could have run it,
> stepped through it, or just looked at it and been sure that it was me and
> not Tika that needed correction.  After some review of the code, I came to
> the conclusion that it was only used where the user passes the byte header
> him/herself; I was thinking that one of our utilities would have read the
> header.
> 
> If others with no familiarity with Tika internals will find it even more
> difficult to figure this kind of stuff out, then working with Tika may be
> frustrating for them.  There have been many times when I have explored a new
> piece of software, and the amount of effort to understand it and get it to
> work exceeded my patience.  I wouldn't want this to happen with Tika.

While I understand what you're saying, I am basically in agreement with
Bertrand: release early, and release often. Releases are good to "get the
software out there", but more so from a perspective of having a tangible,
stable artifact. To tell someone, "Oh you should use Tika for your project.
Just go to the .../trunk and check out the latest source from there" doesn't
exactly exude confidence that what the user downloads/checks out will be
stable, or even the same code within a few hours' time. Having a release is
something that we can always point back to, and something that we can use to
version track the differences in the software as it evolves over time. I'm
reminded of my work, where some software deliveries aren't necessarily even
intended for outside use: they are simply "feature deliveries" that show
progress towards the overall deliverable. I think that's what this 0.1
release of Tika is: progress towards the overall 1.0 deliverable. We're not
mandating that Tika become a household name with this release -- just
showing that we are making measurable, tangible progress towards something
generally useful.

> 
> I don't mean this in any way as a criticism to anyone.  You are very
> generously giving your personal time and expertise to this code, and I truly
> appreciate it.  My point was to elevate the importance of user friendliness
> in our release criteria. I would like to be helpful in this area, creating
> unit tests, providing documentation, etc.

Agree with this point wholeheartedly, though I think we need to clarify
that 100% (or even 10%) user-friendliness need not be part of a 0.1
(alpha-type) release. I think we can set user-friendliness as a measure for
each release, e.g., by saying, "by release 0.4 we'll be XXX user friendly,
by 0.6 we'll be YYY user friendly, and finally by 1.0, we'll be 100%
bona-fide user buddy-buddy" (beyond friendly) ;) But I don't think we should
stymie the 0.1 release by expecting it to be production quality from a
user-friendliness point of view.

Cheers,
  Chris

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

