On Jun 9, 2009, at 1:39 PM, [email protected] wrote:

Well, our system covers everything for text mining/NLP; we already have
several sentence splitters, tokenizers, named entity taggers,
parsers, and many other kinds of tools ready to use.
Our tools are completely interoperable, based on the UIMA framework; no
programming is required to create a workflow, since data type compatibility
is guaranteed.
Please visit http://u-compare.org/ for details.
If what you need is text mining/NLP with less human labor, U-Compare is for you.

Excellent! I'll take a look; our Syracuse-based Center for NLP has several such tools as well - POS taggers and the like. However, I'm not able to access, implement, or redistribute any of it on my own due to licensing considerations.

Less human labor as a goal is an understatement; one research project requires two graduate students for a full year to code enough data for a paper. Even so, this generates very small training data sets for some content code categories.

What I am not sure about is whether we can assume that the input is a
raw text document in this community...

Yes, I agree - our analyses often have to handle (computer) code snippets as well, which is quite problematic when one of the content analysis schema items is punctuation usage! This might not be so problematic if we were handling text with only one sort of code snippet embedded in it, but that is not the case. We already do quite a bit of pre-processing because the texts are email messages, so we must ensure a uniform character set and remove headers and signatures.
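
For a rough sense of what that pre-processing involves, here is a minimal Python sketch (the Unicode normalization and the "-- " signature heuristic are illustrative assumptions, not our actual pipeline):

    import email
    import unicodedata

    def preprocess(raw_message):
        # Strip the headers by parsing the message and keeping only the body
        msg = email.message_from_string(raw_message)
        body = msg.get_payload()
        if isinstance(body, list):  # multipart message: take the first part
            body = body[0].get_payload()
        # Normalize to a uniform character set (Unicode NFKC here)
        body = unicodedata.normalize("NFKC", body)
        # Drop everything after the conventional "-- " signature separator
        kept = []
        for line in body.splitlines():
            if line.rstrip() == "--":
                break
            kept.append(line)
        return "\n".join(kept)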

one piece of analysis. For our uses, the applications of the tools are
drastically restricted when only one level of text is allowable, but
transcending the sentence structure is difficult for NLP.
Do you mean something with dependencies across sentences, like
coreference resolution?

I believe that's it - I'm not an NLP person, just sufficiently familiar with the lingo from hearing many, many reports in research group meetings... The larger issue is that a whole message unit could be considered representative of some construct or other, while at the same time, a few words in one of several paragraphs of the same message will represent other constructs.
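
Concretely, the issue is that codes live at two granularities over the same text. Stand-off annotations (just character offsets plus a label) can represent both at once; here is a minimal sketch, with made-up construct labels:

    from dataclasses import dataclass

    @dataclass
    class Annotation:
        start: int  # character offset into the message
        end: int    # exclusive end offset
        label: str

    message = "Thanks everyone! We really appreciate the quick turnaround."

    annotations = [
        Annotation(0, len(message), "group_maintenance"),  # whole-message code
        Annotation(17, 19, "inclusive_pronoun"),           # word-span code: "We"
    ]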

Just out of curiosity as an NLP researcher, what sort of analysis are
you planning to perform?

Generally, it's standard qualitative content analysis of the positivistic (rather than hermeneutic) variety. For example, the coding schema for a recent effort included vocatives, inclusive pronouns, jargon, punctuation, apologies, self-deprecation, and appreciation as the specific operationalizations for Face Theory and Politeness Theory. The context was group maintenance (the efforts made by group members to maintain group cohesion) in open source software development projects. For some of the content codes it is very easy to achieve high precision and recall (e.g. inclusive pronouns and vocatives perform very well), while others had to be dropped from the coding schema entirely, either because there were too few examples in the gold standard or because we couldn't even achieve human inter-rater reliability, as in the case of humor. It turns out that humor is simply too subjective even for humans to code effectively, particularly when it's in email form.
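
The easy codes are easy precisely because they reduce to closed word lists; a sketch like the following (the word list is illustrative, not our actual schema) already gets most of the way there for inclusive pronouns:

    import re

    # A closed-class lookup over lowercased tokens suffices for this code
    INCLUSIVE_PRONOUNS = {"we", "us", "our", "ours", "ourselves"}

    def code_inclusive_pronouns(text):
        tokens = re.findall(r"[a-z']+", text.lower())
        return [t for t in tokens if t in INCLUSIVE_PRONOUNS]

    print(code_inclusive_pronouns("We should document our build process."))
    # -> ['we', 'our']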

I would definitely be interested in getting my hands on a text mining
plugin or service for Taverna. I would immediately be able to do quite
a few interesting snippets of research that are currently impossible
for me, starting with analysis of the 3.42 GB CSV of juicy search log
data that's just gathering virtual dust on my hard drive...
Well, the performance/output issue is another sort of problem.
Our system is scalable and can be launched locally, but you would need
to prepare your own servers to handle such a large dataset.
I think we could provide our system as a Taverna-linked service soon.

Of course - that particular file will have to be cut into chunks to be usable anyway, or else dumped into a simple database just to be able to select portions of it with greater flexibility. Generally speaking, I find it difficult to judge just how much data can be pushed through a workflow given the limits of system memory - at least until I generate a stack overflow.
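
By "dumped into a simple database" I mean something like streaming the CSV into SQLite in bounded batches (the file name and the two columns are hypothetical stand-ins for the real log schema):

    import csv
    import sqlite3

    BATCH = 50000  # rows per insert batch, to keep memory use bounded

    conn = sqlite3.connect("searchlog.db")
    conn.execute("CREATE TABLE IF NOT EXISTS log (ts TEXT, query TEXT)")

    with open("searchlog.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        batch = []
        for row in reader:
            batch.append(row[:2])  # keep only the columns we need
            if len(batch) >= BATCH:
                conn.executemany("INSERT INTO log VALUES (?, ?)", batch)
                conn.commit()
                batch = []
        if batch:  # flush the final partial batch
            conn.executemany("INSERT INTO log VALUES (?, ?)", batch)
            conn.commit()
    conn.close()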

Cheers,

Andrea


Andrea Wiggins
PhD Student, School of Information Studies
Syracuse University

337 Hinds Hall
Syracuse, NY 13244
[email protected]
www.andreawiggins.com
