On Jun 9, 2009, at 1:39 PM, [email protected] wrote:

Well, our system covers everything for text mining/NLP; we already have
several sentence splitters, tokenizers, named entity taggers,
parsers, and many other kinds of tools ready to use.
Our tools are completely interoperable, based on the UIMA framework; no
programming is required to create a workflow, since data type compatibility
is guaranteed.
Please visit http://u-compare.org/ for details.
If what you need is text mining/NLP with less human labor, U-Compare is for you.

Excellent! I'll take a look; our Syracuse-based Center for NLP has several such tools as well - POS taggers and the like. However, I'm not able to access, implement, or redistribute any of it on my own due to licensing considerations.

Less human labor as a goal is an understatement; one research project requires two graduate students for a full year to code enough data for a paper. Even so, this generates very small training data sets for some content code categories.

What I am not sure about is whether we can assume that the input is a
raw text document in this community...

Yes, I agree - our analyses often have to handle (computer) code snippets as well, which is quite problematic when one of the content analysis schema items is punctuation usage! This might not be so problematic if we were handling text with only one sort of code snippet embedded in it, but that is not the case. We already do quite a bit of pre-processing because the texts are email messages, so we must ensure a uniform character set and remove headers and signatures.
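
For a rough sense of what that pre-processing involves, here is a minimal Python sketch (the Unicode normalization and the "-- " signature heuristic are illustrative assumptions, not our actual pipeline):

    import email
    import unicodedata

    def preprocess(raw_message):
        # Strip the headers by parsing the message and keeping only the body
        msg = email.message_from_string(raw_message)
        body = msg.get_payload()
        if isinstance(body, list):  # multipart message: take the first part
            body = body[0].get_payload()
        # Normalize to a uniform character set (Unicode NFKC here)
        body = unicodedata.normalize("NFKC", body)
        # Drop everything after the conventional "-- " signature separator
        kept = []
        for line in body.splitlines():
            if line.rstrip() == "--":
                break
            kept.append(line)
        return "\n".join(kept)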

one piece of analysis. For our uses, the applications of the tools are
drastically restricted when only one level of text is allowable, but
transcending the sentence structure is difficult for NLP.
Do you mean something with dependencies across sentences, like
coreference resolution?

I believe that's it - I'm not an NLP person, just sufficiently familiar with the lingo from hearing many, many reports in research group meetings... The larger issue is that a whole message unit could be considered representative of some construct or other, while at the same time, a few words in one of several paragraphs of the same message will represent other constructs.
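
Concretely, the issue is that codes live at two granularities over the same text. Stand-off annotations (just character offsets plus a label) can represent both at once; here is a minimal sketch, with made-up construct labels:

    from dataclasses import dataclass

    @dataclass
    class Annotation:
        start: int  # character offset into the message
        end: int    # exclusive end offset
        label: str

    message = "Thanks everyone! We really appreciate the quick turnaround."

    annotations = [
        Annotation(0, len(message), "group_maintenance"),  # whole-message code
        Annotation(17, 19, "inclusive_pronoun"),           # word-span code: "We"
    ]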

Just out of curiosity as an NLP researcher, what sort of analysis are
you planning to perform?

Generally, it's standard qualitative content analysis of the positivistic (rather than hermeneutic) variety. For example, the coding schema for a recent effort included vocatives, inclusive pronouns, jargon, punctuation, apologies, self-deprecation, and appreciation as the specific operationalizations for Face Theory and Politeness Theory. The context was group maintenance (the efforts made by group members to maintain group cohesion) in open source software development projects. For some of the content codes it is very easy to achieve high precision and recall (e.g. inclusive pronouns and vocatives perform very well), while others had to be dropped from the coding schema entirely, either because there were too few examples in the gold standard or because we couldn't even achieve human inter-rater reliability, as in the case of humor. It turns out that humor is simply too subjective even for humans to code effectively, particularly when it's in email form.
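
The easy codes are easy precisely because they reduce to closed word lists; a sketch like the following (the word list is illustrative, not our actual schema) already gets most of the way there for inclusive pronouns:

    import re

    # A closed-class lookup over lowercased tokens suffices for this code
    INCLUSIVE_PRONOUNS = {"we", "us", "our", "ours", "ourselves"}

    def code_inclusive_pronouns(text):
        tokens = re.findall(r"[a-z']+", text.lower())
        return [t for t in tokens if t in INCLUSIVE_PRONOUNS]

    print(code_inclusive_pronouns("We should document our build process."))
    # -> ['we', 'our']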

I would definitely be interested in getting my hands on a text mining
plugin or service for Taverna. I would immediately be able to do quite
a few interesting snippets of research that are currently impossible
for me, starting with analysis of the 3.42 GB CSV of juicy search log
data that's just gathering virtual dust on my hard drive...
Well, the performance/output issue is another sort of problem.
Our system is scalable and can be launched locally, but you would need
to prepare your own servers to handle such a large dataset.
I think we could provide our system as a Taverna-linked service soon.

Of course - that particular file will have to be cut into chunks to be usable anyway, or else dumped into a simple database just to be able to select portions of it with greater flexibility. Generally speaking, I find it difficult to judge just how much data can be pushed through a workflow given the limits of system memory - at least until I generate a stack overflow.
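
By "dumped into a simple database" I mean something like streaming the CSV into SQLite in bounded batches (the file name and the two columns are hypothetical stand-ins for the real log schema):

    import csv
    import sqlite3

    BATCH = 50000  # rows per insert batch, to keep memory use bounded

    conn = sqlite3.connect("searchlog.db")
    conn.execute("CREATE TABLE IF NOT EXISTS log (ts TEXT, query TEXT)")

    with open("searchlog.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        batch = []
        for row in reader:
            batch.append(row[:2])  # keep only the columns we need
            if len(batch) >= BATCH:
                conn.executemany("INSERT INTO log VALUES (?, ?)", batch)
                conn.commit()
                batch = []
        if batch:  # flush the final partial batch
            conn.executemany("INSERT INTO log VALUES (?, ?)", batch)
            conn.commit()
    conn.close()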

Cheers,

Andrea


Andrea Wiggins
PhD Student, School of Information Studies
Syracuse University

337 Hinds Hall
Syracuse, NY 13244
[email protected]
www.andreawiggins.com
