Re: [Haskell-cafe] NLP libraries and tools?

2011-07-10 Thread Jason Dagit
On Sun, Jul 10, 2011 at 12:59 PM, ivan vadovic  wrote:
> Hi,
>
> Also a library for string normalization in the sense of stripping diacritical
> marks would be handy too. Does anything in this respect exist that would be
> usable from Haskell?

The closest thing I know of is this:
http://hackage.haskell.org/package/text-icu

You still have to install ICU separately; that library is just a Haskell
binding to it.

Jason
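
For the archives, a minimal sketch of the stripping step: Unicode-decompose to NFD, then drop combining marks. With text-icu the decomposition would come from `normalize NFD`; to keep this runnable with base alone, the input literal below is already written in decomposed form.

```haskell
import Data.Char (isMark)

-- After NFD normalization (text-icu: `normalize NFD`), diacritics sit in
-- separate combining-mark code points; filtering those out strips accents.
stripMarks :: String -> String
stripMarks = filter (not . isMark)

main :: IO ()
main = putStrLn (stripMarks "re\769sume\769")  -- "résumé" in NFD; prints "resume"
```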

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] NLP libraries and tools?

2011-07-10 Thread ivan vadovic
Hi,

Also a library for string normalization in the sense of stripping diacritical
marks would be handy too. Does anything in this respect exist that would be
usable from Haskell?

Thanks

On Fri, Jul 01, 2011 at 02:31:34PM +0400, Dmitri O.Kondratiev wrote:
> Hi,
> Please advise on NLP libraries similar to Natural Language Toolkit (
> www.nltk.org)
> First of all I need:
> - tools to construct 'bag of words' (
> http://en.wikipedia.org/wiki/Bag_of_words_model), which is a list of words
> in the article.
> - tools to prune common words, such as prepositions and conjunctions, as
> well as extremely rare words, such as the ones with typos.
> - stemming tools
> - Naive Bayes classifier
> - SVM classifier
> - k-means clustering
> 
> Thanks!
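
As a back-of-the-envelope sketch of the first two items Dmitri asks for (a bag of words, plus pruning of common and rare words), something like the following works in plain Haskell; the stopword list here is a tiny placeholder, not from any library.

```haskell
import Data.Char (isAlpha, toLower)
import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- Hypothetical stopword list; a real one would be far larger.
stopwords :: S.Set String
stopwords = S.fromList ["the", "a", "of", "and", "in"]

-- Lowercase, keep alphabetic tokens, count occurrences, then prune
-- stopwords and frequency-1 words (where typos tend to hide).
bagOfWords :: String -> M.Map String Int
bagOfWords = M.filterWithKey keep . M.fromListWith (+) . map (\w -> (w, 1)) . tokens
  where
    tokens = words . map (\c -> if isAlpha c then toLower c else ' ')
    keep w n = n > 1 && not (S.member w stopwords)

main :: IO ()
main = print (bagOfWords "The dog and the dog chased a cat")
-- prints fromList [("dog",2)]: "the"/"and"/"a" are stopwords,
-- "chased"/"cat" occur only once
```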





Re: [Haskell-cafe] NLP libraries and tools?

2011-07-10 Thread Ketil Malde

Perhaps this is interesting?  On the relationship between exploratory
(a.k.a. sloppy or theoretical) and rigorous math.

http://arxiv.org/pdf/math/9307227v1

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants



Re: [Haskell-cafe] NLP libraries and tools?

2011-07-09 Thread Jack Henahan
Heh, I just hit Reply All and I guess the address came in wrong. Ah, well.

I strongly agree with you on the state of linguistics, et al. Having done 
little bits of work in a few of those fields (or at least work _with_ people in 
them), your comments are spot on. Though perhaps I wouldn't say that 
mathematics isn't a science (merely because most fields therein satisfy the 
scientific method). But my glasses may be just a little rosy. :)

All that said, I find your points insightful. And don't even get me started on 
the sloppy math in the social sciences. :D

A major issue in the matter of theory/practice drift seems (to me, at least) to 
be the subject matter's ability to assimilate into pop culture and 
pseudo-scientific meandering. String theory and some of Penrose's works spring 
to mind. Sapir-Whorf, "relational" databases, and the events (perhaps to be 
read 'hype') leading up to the AI Winter also come to mind. A little knowledge 
is a dangerous thing, as they say.

Perhaps that's just confirmation bias. I may just think of them as examples 
because they're pet peeves. :D

And, naturally, every field wishes it could be mathematics. (Tongue in cheek… 
mostly)
http://xkcd.com/435/

On Jul 9, 2011, at 7:55 PM, wren ng thornton wrote:

> (Psst, the nlp list is  :)
> 
> On 7/9/11 3:10 AM, Jack Henahan wrote:
>> On Jul 7, 2011, at 10:53 PM, wren ng thornton wrote:
>>> I can't help but be a (meta)theorist. But then, I'm of the firm opinion
>>> that theory must be grounded in actual practice, else it belongs more to
>>> the realm of theology than science.
>> 
>> Oof, you're liable to wound my (pure) mathematician's pride with remarks
>> like that, wren. :P
> 
> How's that now? Pure mathematics is perfectly grounded in the practice of
> mathematics :)
> 
> I've no qualms with pure maths. After all, mathematics isn't trying to
> model anything (except itself). The problems I have are when the theory
> branch of a field loses touch with what the field is trying to do in the
> first place, and consequently ends up arguing over details which can be
> neither proven nor disproven. It is this which makes them non-scientific
> and not particularly helpful for practicing scientists. Linguistics is one
> of the fields where this has happened, but it's by no means the only one
> (AI, declarative databases, postmodernism,...)
> 
> There's nothing wrong with not being science. I'm a big fan of the
> humanities, mathematics, and philosophy. It's only a problem when
> non-science is pretending to be science: it derails the scientists and it
> does a disservice to the non-science itself. Non-science is fine;
> pseudo-science is the problem. For the same reason, I despise math envy
> and all the pseudo-math that gets bandied about in social sciences wishing
> they were economics (or economics wishing it were statistics, or
> statistics wishing it were mathematics).
> 
> -- 
> Live well,
> ~wren
> 
> 




Re: [Haskell-cafe] NLP libraries and tools?

2011-07-09 Thread wren ng thornton
(Psst, the nlp list is  :)

On 7/9/11 3:10 AM, Jack Henahan wrote:
> On Jul 7, 2011, at 10:53 PM, wren ng thornton wrote:
>> I can't help but be a (meta)theorist. But then, I'm of the firm opinion
>> that theory must be grounded in actual practice, else it belongs more to
>> the realm of theology than science.
>
> Oof, you're liable to wound my (pure) mathematician's pride with remarks
> like that, wren. :P

How's that now? Pure mathematics is perfectly grounded in the practice of
mathematics :)

I've no qualms with pure maths. After all, mathematics isn't trying to
model anything (except itself). The problems I have are when the theory
branch of a field loses touch with what the field is trying to do in the
first place, and consequently ends up arguing over details which can be
neither proven nor disproven. It is this which makes them non-scientific
and not particularly helpful for practicing scientists. Linguistics is one
of the fields where this has happened, but it's by no means the only one
(AI, declarative databases, postmodernism,...)

There's nothing wrong with not being science. I'm a big fan of the
humanities, mathematics, and philosophy. It's only a problem when
non-science is pretending to be science: it derails the scientists and it
does a disservice to the non-science itself. Non-science is fine;
pseudo-science is the problem. For the same reason, I despise math envy
and all the pseudo-math that gets bandied about in social sciences wishing
they were economics (or economics wishing it were statistics, or
statistics wishing it were mathematics).

-- 
Live well,
~wren




Re: [Haskell-cafe] NLP libraries and tools?

2011-07-09 Thread Jack Henahan
Oof, you're liable to wound my (pure) mathematician's pride with remarks like 
that, wren. :P

Now go intone the Litany of Categories as penance. :D I'll start you off… Set, 
Rel, Top, Ring, Grp, Cat, Hask…


On Jul 7, 2011, at 10:53 PM, wren ng thornton wrote:

> I can't help but be a (meta)theorist. But then, I'm of the firm opinion
> that theory must be grounded in actual practice, else it belongs more to
> the realm of theology than science.
> 
> -- 
> Live well,
> ~wren
> 
> 
> 




Re: [Haskell-cafe] NLP libraries and tools?

2011-07-07 Thread wren ng thornton
On 7/7/11 3:50 AM, Aleksandar Dimitrov wrote:
> It's actually a shame we're discussing this on -cafe and not on -nlp. Then
> again, maybe it's going to prompt somebody to join -nlp, and I'm gonna CC it
> there, because some folks over there might be interested, but not read -cafe.

Quite :)

> When you mentioned Arabic for producing sentences that go on for ages —
> you don't really need to go that far. I have had the doubtful pleasure of
> reading Kant and Hegel in their original versions. In German, it is
> sometimes still considered good style to write huge sentences. I once made
> it a point, just to stick it to a Kant-loving person, to produce a sentence
> that spanned 2 whole pages (A4.) It wasn't even difficult.

The Romans were big fans of that too (though there's only a small group of
folks interested in doing NLP on Latin these days). I've only read Hegel
et al. in translation, but the Latin I've read falls nicely into the
notion of "span". It doesn't, however, always fall nicely into a
clause-based approach like Japanese does. Then again, that could be due to
the poetic/rhetorical nature of the texts in question.

I wonder if there's been any computational attempt to make the notion of
span or discourse atoms rigorous enough for pragmatic use...


> I'm very much a "works for me" person in these matters. Mostly because I'm
> tired of linguists fighting each other over trivial matters. Give me
> something I can work with already!

I can't help but be a (meta)theorist. But then, I'm of the firm opinion
that theory must be grounded in actual practice, else it belongs more to
the realm of theology than science.

-- 
Live well,
~wren





Re: [Haskell-cafe] NLP libraries and tools?

2011-07-07 Thread wren ng thornton
On 7/7/11 3:38 AM, Aleksandar Dimitrov wrote:
> On Wed, Jul 06, 2011 at 07:27:10PM -0700, wren ng thornton wrote:
>> I definitely agree with the iteratees comment, but I'm curious about the
>> leaks you mention. I haven't run into leakiness issues (that I'm aware of)
>> in my use of ByteStrings for NLP.
>
> The issue is this: strict ByteStrings retain pointers to the original
> chunk. The chunk is probably bigger than you'd want to keep in memory, if
> you, say, wanted to just keep one or two words. In my case, the chunk was
> some 65K (that was my Iteratee chunk size.)

Oh, that issue. Yeah, I maintain an intern table and make sure that the
copy in the table is a trimmed copy instead of keeping the whole string
alive. I guess I should factor that part of my tagger out into a separate
package :)
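
A minimal sketch of such an intern table (not wren's actual code): `Data.ByteString.copy` allocates a fresh buffer for the slice, so the table no longer keeps the slice's parent chunk alive.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.Map.Strict as M

-- Look a string up in the intern table; on a miss, insert a *trimmed*
-- copy so the table doesn't pin the original chunk the slice came from.
intern :: M.Map B.ByteString B.ByteString -> B.ByteString
       -> (M.Map B.ByteString B.ByteString, B.ByteString)
intern tbl s = case M.lookup s tbl of
  Just s' -> (tbl, s')              -- already interned: share the copy
  Nothing -> let s' = B.copy s      -- fresh buffer, drops the parent chunk
             in (M.insert s' s' tbl, s')

main :: IO ()
main = do
  let chunk  = B.pack "a very large chunk of corpus text"
      word   = B.take 5 (B.drop 7 chunk)  -- a slice that retains `chunk`
      (_, w) = intern M.empty word
  B.putStrLn w                            -- prints "large"
```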

I didn't know if you meant there was a technical issue, e.g. something
about the fact that ByteStrings uses pinned memory (whereas Text doesn't
IIRC).

-- 
Live well,
~wren




Re: [Haskell-cafe] NLP libraries and tools?

2011-07-07 Thread Aleksandar Dimitrov
It's actually a shame we're discussing this on -cafe and not on -nlp. Then
again, maybe it's going to prompt somebody to join -nlp, and I'm gonna CC it
there, because some folks over there might be interested, but not read -cafe.

On Wed, Jul 06, 2011 at 07:22:41PM -0700, wren ng thornton wrote:
> On 7/6/11 5:58 PM, Aleksandar Dimitrov wrote:
> > On Wed, Jul 06, 2011 at 09:32:27AM -0700, wren ng thornton wrote:
> >> On 7/6/11 9:27 AM, Dmitri O.Kondratiev wrote:
> >>> Hi,
> >>> Continuing my search of Haskell NLP tools and libs, I wonder if the
> >>> following Haskell libraries exist (googling them does not help):
> >>> 1) End of Sentence (EOS) Detection. Break text into a collection of
> >>> meaningful sentences.
> >>
> >> Depending on how you mean, this is either fairly trivial (for English) or
> >> an ill-defined problem. For things like determining whether the "."
> >> character is intended as a full stop vs part of an abbreviation; that's
> >> trivial.
> >
> > I disagree. It's not exactly trivial in the sense that it is solved. It is
> > trivial in the sense that, usually, one would use a list of known
> > abbreviations and just compare. This, however, just says that the most
> > common approach is trivial, not that the problem is.
> 
> Perhaps. I recall David Yarowsky suggesting it was considered solved (for
> English, as I qualified earlier).
> 
> The solution I use is just to look at a window around the point and run a
> standard feature-based machine learning algorithm over it[1]. Memorizing
> known abbreviations is actually quite fragile, for reasons you mention.
> This approach will give you accuracy in the high 90s, though I forget the
> exact numbers.

That is indeed one of the best ways to do it (for Indo-European languages,
anyway.) When you mentioned Arabic for producing sentences that go on for ages —
you don't really need to go that far. I have had the doubtful pleasure of
reading Kant and Hegel in their original versions. In German, it is sometimes
still considered good style to write huge sentences. I once made it a point,
just to stick it to a Kant-loving person, to produce a sentence that spanned 2
whole pages (A4.) It wasn't even difficult.

I sometimes think that we should just adopt a similar notion of "span," like
rhetorical structure theorists do. In that case, you're not segmenting
sentences, but discourse atoms — those are even more ill-defined, however.

> But the problem is that what constitutes an appropriate solution for
> computational needs is still very ill-defined. 

Well, yes, and, well, no. Tokens are ill-defined. There's no good consensus on
how you should parse tokens (i.e., is "in spite of" one token or three?) either,
and so you just pick one that works for you. Same for sentence boundaries:
they're sometimes also ill-defined, but who says you need to define it well?

Maybe a purpose-driven definition is all people can agree on, anyway. My
purpose is either tagging, or parsing, or NE detection, or computational
semantics… In all cases, I'm choosing the definition my tools can use. Not
because that's "correct," but because I don't really need it to be, no?

I'm very much a "works for me" person in these matters. Mostly because I'm tired
of linguists fighting each other over trivial matters. Give me something I can
work with already!

Regards,
Aleks




Re: [Haskell-cafe] NLP libraries and tools?

2011-07-06 Thread wren ng thornton
On 7/6/11 8:46 PM, Richard O'Keefe wrote:
>> I've been working over the last year+ on an optimized HMM-based POS
>> tagger/supertagger with online tagging and anytime n-best tagging. I'm
>> planning to release it this summer (i.e., by the end of August), though
>> there are a few things I'd like to polish up before doing so. In
>> particular, I want to make the package less monolithic. When I release it
>> I'll make announcements here and on the nlp@ list.
>
> One of the issues I've had with a POS tagger I've been using is that it
> makes some really stupid decisions which can be patched up with a few
> simple rules, but since it's distributed as a .jar file I cannot add
> those rules.

How horrid. I assume the problem is really that the trained model is in
the jar and you can't do your own training? Or is this a Brill-like tagger
where you really mean to add new rules?

If an HMM-based tagger is amenable, you could try switching to Daniël de
Kok's Java port of TnT:

https://github.com/danieldk/jitar


The tagger I'm working on does support being hooked up to a Java client
(i.e., consumer of tagging info), but it's fairly ugly due to Java's
refusal to believe in IPC.

-- 
Live well,
~wren




Re: [Haskell-cafe] NLP libraries and tools?

2011-07-06 Thread wren ng thornton
On 7/6/11 6:45 PM, Aleksandar Dimitrov wrote:
> One hint, if you ever find yourself reading in quantitative linguistic data
> with Haskell: forget lazy IO. Forget strict IO too, unless your documents
> never exceed a few hundred megs. If you're not keeping the whole document in
> memory but are keeping some of it around, never keep it in ByteStrings; use
> Text or SmallString instead (ByteStrings will invariably leak space in this
> scenario). Learn how to use Iteratees and use them judiciously.

I definitely agree with the iteratees comment, but I'm curious about the
leaks you mention. I haven't run into leakiness issues (that I'm aware of)
in my use of ByteStrings for NLP.

-- 
Live well,
~wren




Re: [Haskell-cafe] NLP libraries and tools?

2011-07-06 Thread wren ng thornton
On 7/6/11 5:58 PM, Aleksandar Dimitrov wrote:
> On Wed, Jul 06, 2011 at 09:32:27AM -0700, wren ng thornton wrote:
>> On 7/6/11 9:27 AM, Dmitri O.Kondratiev wrote:
>>> Hi,
>>> Continuing my search of Haskell NLP tools and libs, I wonder if the
>>> following Haskell libraries exist (googling them does not help):
>>> 1) End of Sentence (EOS) Detection. Break text into a collection of
>>> meaningful sentences.
>>
>> Depending on how you mean, this is either fairly trivial (for English) or
>> an ill-defined problem. For things like determining whether the "."
>> character is intended as a full stop vs part of an abbreviation; that's
>> trivial.
>
> I disagree. It's not exactly trivial in the sense that it is solved. It is
> trivial in the sense that, usually, one would use a list of known
> abbreviations and just compare. This, however, just says that the most
> common approach is trivial, not that the problem is.

Perhaps. I recall David Yarowsky suggesting it was considered solved (for
English, as I qualified earlier).

The solution I use is just to look at a window around the point and run a
standard feature-based machine learning algorithm over it[1]. Memorizing
known abbreviations is actually quite fragile, for reasons you mention.
This approach will give you accuracy in the high 90s, though I forget the
exact numbers.


[1] With obvious features like whether the following word is capitalized,
whether the preceding word is capitalized, length of the preceding word,
whether there's another period after the next word,...
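
Sketched in code, the feature extraction from [1] might look like this; the `isEOS` scorer is a toy stand-in (with made-up weights, and ignoring some of the features) for the trained classifier.

```haskell
import Data.Char (isUpper)

-- The window features from footnote [1], for a candidate "." with the
-- token before it and the token after it.
data Feats = Feats
  { nextCapitalized :: Bool
  , prevCapitalized :: Bool
  , prevLength      :: Int
  , periodAfterNext :: Bool
  } deriving Show

features :: String -> String -> Feats
features prev next = Feats
  { nextCapitalized = startsUpper next
  , prevCapitalized = startsUpper prev
  , prevLength      = length prev
  , periodAfterNext = '.' `elem` next
  }
  where startsUpper (c:_) = isUpper c
        startsUpper _     = False

-- Toy linear scorer standing in for the trained model; a real system
-- would learn these weights (maxent, perceptron, ...).
isEOS :: Feats -> Bool
isEOS f = score > 0
  where score = (if nextCapitalized f then 2 else -1)
              + (if prevLength f <= 2 then -2 else 1) :: Int

main :: IO ()
main = do
  print (isEOS (features "Dr" "Smith"))  -- False: abbreviation context
  print (isEOS (features "done" "The"))  -- True: sentence boundary
```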


>> But for general sentence breaking, how do you intend to deal with
>> quotations? What about when news articles quote someone uttering a few
>> sentences before the end-quote marker? So far as I'm aware, there's no
>> satisfactory definition of what the solution should be in all reasonable
>> cases. A "sentence" isn't really very well-defined in practice.
>
> As long as you have one routine and stick to it, you don't need a formal
> definition every linguist will agree on. Computational linguists (and their
> tools), more often than not, just need a dependable solution, not a correct
> one.

But the problem is that what constitutes an appropriate solution for
computational needs is still very ill-defined. For example, the treatment
of quotations will depend on the grammar theory used in the tagger,
parser, translator, etc. The quality of output is often quite susceptible
to EOS being meaningfully[2] distributed. Thus, what constitutes a
"dependable" solution often varies depending on the task in question.[3]

Also, a lot of the tools in this area assume there's some sort of
punctuation marking the end of sentences, even if it's unreliable as an
EOS indicator. That works well enough for languages with European-like
orthographic traditions, but it falls apart quite rapidly when moving to
East Asian languages (e.g., Burmese, Thai,...). And languages like
Japanese or Arabic can have "sentences" that go on forever, but are best
handled by chunking them into clauses.


[2] In a statistical sense, relative to the structure of the model.

[3] Personally, I think the idea of having a single EOS type is the bulk
of the problem. If we allowed for different kinds of EOS in grammars then
the upstream tools could handle sentence fragments better, which would
make it easier to make fragment breaking reliable.


>> I've been working over the last year+ on an optimized HMM-based POS
>> tagger/supertagger with online tagging and anytime n-best tagging. I'm
>> planning to release it this summer (i.e., by the end of August), though
>> there are a few things I'd like to polish up before doing so. In
>> particular, I want to make the package less monolithic. When I release it
>> I'll make announcements here and on the nlp@ list.
>
> I'm very interested in your progress! Keep us posted :-)

Will do :)

-- 
Live well,
~wren





Re: [Haskell-cafe] NLP libraries and tools?

2011-07-06 Thread Richard O'Keefe

On 7/07/2011, at 7:04 AM, Dmitri O.Kondratiev wrote:
> I am looking for a Haskell implementation of a sentence tokenizer such as the
> one described by Tibor Kiss and Jan Strunk in “Unsupervised Multilingual
> Sentence Boundary Detection”, which is implemented in NLTK:

That method is multilingual but relies on the text being written using
fairly modern Western conventions, and tackles the problem of "too many
dots" and not knowing which are abbreviation points and which full stops.

I don't suppose anyone knows something that might help with the problem
of too few dots? Run-on sentences are one example.
> 
> I've been working over the last year+ on an optimized HMM-based POS
> tagger/supertagger with online tagging and anytime n-best tagging. I'm
> planning to release it this summer (i.e., by the end of August), though
> there are a few things I'd like to polish up before doing so. In
> particular, I want to make the package less monolithic. When I release it
> I'll make announcements here and on the nlp@ list.

One of the issues I've had with a POS tagger I've been using is that it
makes some really stupid decisions which can be patched up with a few
simple rules, but since it's distributed as a .jar file I cannot add
those rules.





Re: [Haskell-cafe] NLP libraries and tools?

2011-07-06 Thread Aleksandar Dimitrov
On Wed, Jul 06, 2011 at 03:14:07PM -0700, Rogan Creswick wrote:
> Have you used that particular combination yet? I'd like to know the
> details of how you hooked everything together if that's something you
> can share.  (We're working on a similar Frankenstein at the moment.)

These Frankensteins, as you so dearly call them, are always very task-specific.
Here's a setup I've used:

- Take some sort of corpus you want to work with, and annotate it with, say,
  Java tools. This will probably require you to massage the input corpus into
  something your tools can read, and then call the tools to process it.
- Let your Java stuff write everything to disk in a format that you can easily
  read in with Haskell. If your annotations don't interleave, you're lucky,
  because you can probably just use a word-per-line format with columns for
  markup. That's trivial to read in with Haskell. More complicated stuff should
  probably be handled in XML fashion. I like HXT for reading in XML, but it's
  slow (as are its competitors, although it's been a while since I've used it;
  maybe it supports Text or ByteStrings by now).
- Advanced mode: instead of dumping to files, use named pipes or TCP sockets to
  transfer data. Good luck!
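
For concreteness, here is a sketch of reading such a word-per-line dump back in with Haskell; the three-column layout (word, POS tag, NE label) and the blank-line sentence separator are hypothetical, so adjust to whatever your annotators emit.

```haskell
import qualified Data.Text as T

-- Hypothetical layout: word TAB pos TAB ne, one token per line,
-- blank line between sentences.
type Token = (T.Text, T.Text, T.Text)

parseCorpus :: T.Text -> [[Token]]
parseCorpus = map (map parseLine) . splitOnBlank . T.lines
  where
    -- Group lines into sentences, starting a new group at each blank line.
    splitOnBlank = foldr step [[]]
      where step l acc@(s:ss)
              | T.null l  = [] : acc
              | otherwise = (l : s) : ss
            step _ []     = [[]]   -- unreachable; keeps the match total
    parseLine l = case T.splitOn (T.pack "\t") l of
      (w:t:n:_) -> (w, t, n)
      _         -> error ("bad line: " ++ T.unpack l)

main :: IO ()
main = print (parseCorpus (T.pack "The\tDT\tO\ndog\tNN\tO\n\nIt\tPRP\tO"))
```

On the sample input this yields two sentences: `[[("The","DT","O"),("dog","NN","O")],[("It","PRP","O")]]`.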

Shell scripting comes in *very* handy here, in order to automate this process.

Now, everything I've done so far is only *research*, no finished product that
the end user wants to poke on their desktop and have it work interactively. For
that, it might be useful to have some sort of standing server architecture: you
have multiple annotation servers (one that runs in Java, one that runs in
Haskell) and have them communicate the data.

At this point, the benefits might be outweighed by the drawbacks. My love for
Haskell only goes that far.

One hint, if you ever find yourself reading in quantitative linguistic data with
Haskell: forget lazy IO. Forget strict IO too, unless your documents never
exceed a few hundred megs. If you're not keeping the whole document in memory
but are keeping some of it around, never keep it in ByteStrings; use Text or
SmallString instead (ByteStrings will invariably leak space in this scenario).
Learn how to use Iteratees and use them judiciously.

Keep in touch on the Haskell NLP list :-)
Regards,
Aleks




Re: [Haskell-cafe] NLP libraries and tools?

2011-07-06 Thread Rogan Creswick
On Wed, Jul 6, 2011 at 3:03 PM, Aleksandar Dimitrov
 wrote:
>
> So you'd use, say, UIMA+OpenNLP to do sentence boundaries, tokens, tags,
> named-entities whatnot, then spit out some annotated format, read it in with
> Haskell, and do the logic/magic there.

Have you used that particular combination yet? I'd like to know the
details of how you hooked everything together if that's something you
can share.  (We're working on a similar Frankenstein at the moment.)

--Rogan



Re: [Haskell-cafe] NLP libraries and tools?

2011-07-06 Thread Aleksandar Dimitrov
On Wed, Jul 06, 2011 at 11:04:30PM +0400, Dmitri O.Kondratiev wrote:
> On Wed, Jul 6, 2011 at 8:32 PM, wren ng thornton  wrote:
> 
> > On 7/6/11 9:27 AM, Dmitri O.Kondratiev wrote:
> > > Hi,
> > > Continuing my search of Haskell NLP tools and libs, I wonder if the
> > > following Haskell libraries exist (googling them does not help):
> > > 1) End of Sentence (EOS) Detection. Break text into a collection of
> > > meaningful sentences.
> >
> > Depending on how you mean, this is either fairly trivial (for English) or
> > an ill-defined problem. For things like determining whether the "."
> > character is intended as a full stop vs part of an abbreviation; that's
> > trivial.
> >
> > But for general sentence breaking, how do you intend to deal with
> > quotations? What about when news articles quote someone uttering a few
> > sentences before the end-quote marker? So far as I'm aware, there's no
> > satisfactory definition of what the solution should be in all reasonable
> > cases. A "sentence" isn't really very well-defined in practice.
> >
> 
> I am looking for a Haskell implementation of a sentence tokenizer such as the
> one described by Tibor Kiss and Jan Strunk in “Unsupervised Multilingual
> Sentence Boundary Detection”, which is implemented in NLTK:
> 
> http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt-module.html
> 
> 
> > > 2) Part-of-Speech (POS) Tagging. Assign part-of-speech information to
> > each
> > > token.
> >
> > There are numerous approaches to this problem; do you care about the
> > solution, or will any one of them suffice?
> >
> > I've been working over the last year+ on an optimized HMM-based POS
> > tagger/supertagger with online tagging and anytime n-best tagging. I'm
> > planning to release it this summer (i.e., by the end of August), though
> > there are a few things I'd like to polish up before doing so. In
> > particular, I want to make the package less monolithic. When I release it
> > I'll make announcements here and on the nlp@ list.
> 
> 
> I am looking for some already working POS tagging framework that can be
> customized for different pidgin languages.
> 
> 
> > > 3) Chunking. Analyze each tagged token within a sentence and assemble
> > > compound tokens that express logical concepts. Define a custom grammar.
> > >
> > > 4) Extraction. Analyze each chunk and further tag the chunks as named
> > > entities, such as people, organizations, locations, etc.
> > >
> > > Any ideas where to look for similar Haskell libraries?
> >
> > I don't know of any work in these areas in Haskell (though I'd love to
> > hear about it). You should try asking on the nlp@ list where the other
> > linguists and NLPers are more likely to see it.
> >
> >
> I will, though n...@projects.haskell.org looks very quiet...

Quiet, yes, but, hey, we all start out… nevermind, humans start out loud.

Well anyhow, it's quiet, but it's gotta start somewhere. I wouldn't hold my
breath for a full-scale Haskell-native solution to your problem just yet though.

Here's what I'm doing: I usually use external programs to do the heavy lifting
for which there aren't Haskell programs. Then I use Haskell (where applicable)
to do the logic, and shell scripts to glue together everything.

So you'd use, say, UIMA+OpenNLP to do sentence boundaries, tokens, tags,
named-entities whatnot, then spit out some annotated format, read it in with
Haskell, and do the logic/magic there.

Complicated, yes. But it gets me around having to code too much in Java. That's
a gain if I've ever seen one.

Regards,
Aleks




Re: [Haskell-cafe] NLP libraries and tools?

2011-07-06 Thread Aleksandar Dimitrov
On Wed, Jul 06, 2011 at 09:32:27AM -0700, wren ng thornton wrote:
> On 7/6/11 9:27 AM, Dmitri O.Kondratiev wrote:
> > Hi,
> > Continuing my search of Haskell NLP tools and libs, I wonder if the
> > following Haskell libraries exist (googling them does not help):
> > 1) End of Sentence (EOS) Detection. Break text into a collection of
> > meaningful sentences.
> 
> Depending on how you mean, this is either fairly trivial (for English) or
> an ill-defined problem. For things like determining whether the "."
> character is intended as a full stop vs part of an abbreviation; that's
> trivial.

I disagree. It's not exactly trivial in the sense that it is solved. It is
trivial in the sense that, usually, one would use a list of known
abbreviations and just compare. This, however, just says that the most common
approach is trivial, not that the problem is.

There are cases where, for example, an abbreviation and a full stop will
coincide. In these cases, you'll often need full-blown parsing or at least a
well-trained maxent classifier.

There are other problems: ordinals, acronyms which themselves also have periods
in them, weird names (like Yahoo!) and initials, to name a few. This is only for
English and similar languages, mind you.
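
The "trivial" list-based approach, sketched below with a tiny hypothetical abbreviation list, exhibits exactly the failure modes above: it mishandles ordinals, dotted acronyms, names like "Yahoo!", and initials.

```haskell
import qualified Data.Set as S

-- Tiny illustrative abbreviation list; real lists have thousands of
-- entries and are still incomplete.
abbrevs :: S.Set String
abbrevs = S.fromList ["Dr.", "Mr.", "Mrs.", "e.g.", "i.e.", "etc."]

-- Split after ".", "?" or "!" unless the token is a known abbreviation.
splitSentences :: String -> [[String]]
splitSentences = go [] . words
  where
    go acc [] = [reverse acc | not (null acc)]
    go acc (w:ws)
      | endsSentence w = reverse (w : acc) : go [] ws
      | otherwise      = go (w : acc) ws
    endsSentence w = last w `elem` ".?!" && not (S.member w abbrevs)

main :: IO ()
main = mapM_ print (splitSentences "Dr. Smith arrived. He sat down.")
-- prints ["Dr.","Smith","arrived."] then ["He","sat","down."]
```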

> But for general sentence breaking, how do you intend to deal with
> quotations? What about when news articles quote someone uttering a few
> sentences before the end-quote marker? So far as I'm aware, there's no
> satisfactory definition of what the solution should be in all reasonable
> cases. A "sentence" isn't really very well-defined in practice.

As long as you have one routine and stick to it, you don't need a formal
definition every linguist will agree on. Computational linguists (and their
tools), more often than not, just need a dependable solution, not a correct
one.

> > 2) Part-of-Speech (POS) Tagging. Assign part-of-speech information to each
> > token.
> 
> There are numerous approaches to this problem; do you care about the
> solution, or will any one of them suffice?
> 
> I've been working over the last year+ on an optimized HMM-based POS
> tagger/supertagger with online tagging and anytime n-best tagging. I'm
> planning to release it this summer (i.e., by the end of August), though
> there are a few things I'd like to polish up before doing so. In
> particular, I want to make the package less monolithic. When I release it
> I'll make announcements here and on the nlp@ list.
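
For readers who haven't met HMM taggers: the decoding core of such a tagger is the Viterbi algorithm. A toy sketch follows, with a made-up tagset and probabilities; nothing here is from wren's actual package.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)
import qualified Data.Map as M

type Tag = String

tags :: [Tag]
tags = ["DT", "NN", "VB"]

-- Toy probabilities with smoothing-like defaults; a real tagger
-- estimates these from an annotated corpus.
transP :: Tag -> Tag -> Double
transP a b = M.findWithDefault 0.1 (a, b) $ M.fromList
  [ (("DT","NN"), 0.7), (("NN","VB"), 0.6), (("VB","DT"), 0.5) ]

emitP :: Tag -> String -> Double
emitP t w = M.findWithDefault 0.01 (t, w) $ M.fromList
  [ (("DT","the"), 0.9), (("NN","dog"), 0.8), (("VB","runs"), 0.8) ]

-- Each state carries the probability of the best path ending in that
-- tag, together with that path (reversed).
viterbi :: [String] -> [Tag]
viterbi []     = []
viterbi (w:ws) = reverse . snd . maximumBy (comparing fst) $
                 foldl step [ (emitP t w, [t]) | t <- tags ] ws
  where
    step states w' =
      [ maximumBy (comparing fst)
          [ (p * transP prev t * emitP t w', t : path)
          | (p, path@(prev:_)) <- states ]
      | t <- tags ]

main :: IO ()
main = print (viterbi ["the", "dog", "runs"])  -- prints ["DT","NN","VB"]
```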

I'm very interested in your progress! Keep us posted :-)

Regards,
Aleks


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] NLP libraries and tools?

2011-07-06 Thread Dmitri O.Kondratiev
On Wed, Jul 6, 2011 at 8:32 PM, wren ng thornton  wrote:

> On 7/6/11 9:27 AM, Dmitri O.Kondratiev wrote:
> > Hi,
> > Continuing my search of Haskell NLP tools and libs, I wonder if the
> > following Haskell libraries exist (googling them does not help):
> > 1) End of Sentence (EOS) Detection. Break text into a collection of
> > meaningful sentences.
>
> Depending on how you mean, this is either fairly trivial (for English) or
> an ill-defined problem. For things like determining whether the "."
> character is intended as a full stop vs. part of an abbreviation, that's
> trivial.
>
> But for general sentence breaking, how do you intend to deal with
> quotations? What about when news articles quote someone uttering a few
> sentences before the end-quote marker? So far as I'm aware, there's no
> satisfactory definition of what the solution should be in all reasonable
> cases. A "sentence" isn't really very well-defined in practice.
>

I am looking for a Haskell implementation of a sentence tokenizer such as the
one described by Tibor Kiss and Jan Strunk in "Unsupervised Multilingual
Sentence Boundary Detection", which is implemented in NLTK:

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt-module.html


> > 2) Part-of-Speech (POS) Tagging. Assign part-of-speech information to
> each
> > token.
>
> There are numerous approaches to this problem; do you care about the
> solution, or will any one of them suffice?
>
> I've been working over the last year+ on an optimized HMM-based POS
> tagger/supertagger with online tagging and anytime n-best tagging. I'm
> planning to release it this summer (i.e., by the end of August), though
> there are a few things I'd like to polish up before doing so. In
> particular, I want to make the package less monolithic. When I release it
> I'll make announcements here and on the nlp@ list.


I am looking for an already-working POS tagging framework that can be
customized for different pidgin languages.


> > 3) Chunking. Analyze each tagged token within a sentence and assemble
> > compound tokens that express logical concepts. Define a custom grammar.
> >
> > 4) Extraction. Analyze each chunk and further tag the chunks as named
> > entities, such as people, organizations, locations, etc.
> >
> > Any ideas where to look for similar Haskell libraries?
>
> I don't know of any work in these areas in Haskell (though I'd love to
> hear about it). You should try asking on the nlp@ list where the other
> linguists and NLPers are more likely to see it.
>
>
I will, though n...@projects.haskell.org looks very quiet...


Re: [Haskell-cafe] NLP libraries and tools?

2011-07-06 Thread wren ng thornton
On 7/6/11 9:27 AM, Dmitri O.Kondratiev wrote:
> Hi,
> Continuing my search of Haskell NLP tools and libs, I wonder if the
> following Haskell libraries exist (googling them does not help):
> 1) End of Sentence (EOS) Detection. Break text into a collection of
> meaningful sentences.

Depending on how you mean, this is either fairly trivial (for English) or
an ill-defined problem. For things like determining whether the "."
character is intended as a full stop vs. part of an abbreviation, that's
trivial.

But for general sentence breaking, how do you intend to deal with
quotations? What about when news articles quote someone uttering a few
sentences before the end-quote marker? So far as I'm aware, there's no
satisfactory definition of what the solution should be in all reasonable
cases. A "sentence" isn't really very well-defined in practice.

> 2) Part-of-Speech (POS) Tagging. Assign part-of-speech information to each
> token.

There are numerous approaches to this problem; do you care about the
solution, or will any one of them suffice?

I've been working over the last year+ on an optimized HMM-based POS
tagger/supertagger with online tagging and anytime n-best tagging. I'm
planning to release it this summer (i.e., by the end of August), though
there are a few things I'd like to polish up before doing so. In
particular, I want to make the package less monolithic. When I release it
I'll make announcements here and on the nlp@ list.


> 3) Chunking. Analyze each tagged token within a sentence and assemble
> compound tokens that express logical concepts. Define a custom grammar.
>
> 4) Extraction. Analyze each chunk and further tag the chunks as named
> entities, such as people, organizations, locations, etc.
>
> Any ideas where to look for similar Haskell libraries?

I don't know of any work in these areas in Haskell (though I'd love to
hear about it). You should try asking on the nlp@ list where the other
linguists and NLPers are more likely to see it.

-- 
Live well,
~wren




Re: [Haskell-cafe] NLP libraries and tools?

2011-07-06 Thread Dmitri O.Kondratiev
Hi,
Continuing my search of Haskell NLP tools and libs, I wonder if the
following Haskell libraries exist (googling them does not help):
1) End of Sentence (EOS) Detection. Break text into a collection of
meaningful sentences.
2) Part-of-Speech (POS) Tagging. Assign part-of-speech information to each
token.
3) Chunking. Analyze each tagged token within a sentence and assemble
compound tokens that express logical concepts. Define a custom grammar.
4) Extraction. Analyze each chunk and further tag the chunks as named
entities, such as people, organizations, locations, etc.

Any ideas where to look for similar Haskell libraries?


Re: [Haskell-cafe] NLP libraries and tools?

2011-07-01 Thread Rogan Creswick
On Fri, Jul 1, 2011 at 2:52 PM, Dmitri O.Kondratiev  wrote:
> Is there any Haskell word tokenizer other than 'toktok' that compiles and works?
> I need something like:
> http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.regexp.WordPunctTokenizer-class.html
>

I don't think this exists out of the box, but since it appears to be a
basic regex tokenizer, you could use Data.List.Split to create one.
(or one of the regex libraries may be able to do this more simply).

If you go the Data.List.Split route, I suspect you'll want to create a
Splitter based on the whenElt Splitter:

http://hackage.haskell.org/packages/archive/split/0.1.1/doc/html/Data-List-Split.html#v:whenElt

which takes a predicate on elements. You can implement the predicate however
you wish, possibly with a regular expression, although it will have to be
pure.
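If pulling in the split package is overkill, a WordPunctTokenizer-style tokenizer (roughly NLTK's `\w+|[^\w\s]+` behaviour) can also be sketched with just base, using Data.List.groupBy. This is my own illustrative sketch, not an existing package:

```haskell
import Data.Char (isAlphaNum, isSpace)
import Data.Function (on)
import Data.List (groupBy)

-- Character classes for the WordPunct scheme: runs of word characters,
-- runs of punctuation, and whitespace runs (which are discarded).
data CharClass = Word | Punct | Space deriving (Eq)

classify :: Char -> CharClass
classify c
  | isAlphaNum c || c == '_' = Word
  | isSpace c                = Space
  | otherwise                = Punct

wordPunctTokenize :: String -> [String]
wordPunctTokenize =
    filter (not . any isSpace)       -- drop the whitespace runs
  . groupBy ((==) `on` classify)     -- maximal runs of one class
```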

If you want something like a maxent tokenizer, then you're currently
out of luck :( (as far as I know).

--Rogan



Re: [Haskell-cafe] NLP libraries and tools?

2011-07-01 Thread Dmitri O.Kondratiev
On Fri, Jul 1, 2011 at 11:58 PM, Rogan Creswick  wrote:

> On Fri, Jul 1, 2011 at 12:38 PM, Dmitri O.Kondratiev 
> wrote:
> > On Fri, Jul 1, 2011 at 9:34 PM, Rogan Creswick 
> wrote:
> >>
> >> On Fri, Jul 1, 2011 at 3:31 AM, Dmitri O.Kondratiev 
> >> wrote:> First of all I need:
> >
> > Unfortunately 'cabal install' fails with toktok:
> >
> > tools/ExtractLexicon.hs:5:35:
> > Module `PGF' does not export `getLexicon'
> > cabal: Error: some packages failed to install:
> > toktok-0.5 failed during the building phase. The exception was:
> > ExitFailure 1
>
> Oh, right - I ran into this problem too, and forgot about it (I should
> have reported a bug...) I think this fails because of (relatively)
> recent changes in GF, which isn't constrained to specific versions in
> the toktok cabal file...
>
> --Rogan
>
Is there any Haskell word tokenizer other than 'toktok' that compiles and works?
I need something like:
http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.regexp.WordPunctTokenizer-class.html

Thanks!


Re: [Haskell-cafe] NLP libraries and tools?

2011-07-01 Thread Rogan Creswick
On Fri, Jul 1, 2011 at 12:38 PM, Dmitri O.Kondratiev  wrote:
> On Fri, Jul 1, 2011 at 9:34 PM, Rogan Creswick  wrote:
>>
>> On Fri, Jul 1, 2011 at 3:31 AM, Dmitri O.Kondratiev 
>> wrote:> First of all I need:
>
> Unfortunately 'cabal install' fails with toktok:
>
> tools/ExtractLexicon.hs:5:35:
>     Module `PGF' does not export `getLexicon'
> cabal: Error: some packages failed to install:
> toktok-0.5 failed during the building phase. The exception was:
> ExitFailure 1

Oh, right - I ran into this problem too, and forgot about it (I should
have reported a bug...) I think this fails because of (relatively)
recent changes in GF, which isn't constrained to specific versions in
the toktok cabal file...

--Rogan

>
> Any ideas how to solve this?
>



Re: [Haskell-cafe] NLP libraries and tools?

2011-07-01 Thread Dmitri O.Kondratiev
On Fri, Jul 1, 2011 at 9:34 PM, Rogan Creswick  wrote:

> On Fri, Jul 1, 2011 at 3:31 AM, Dmitri O.Kondratiev 
> wrote:> First of all I need:
>

...

> > - tools to construct 'bag of words'
> > (http://en.wikipedia.org/wiki/Bag_of_words_model), which is a list of
> words
> > in the
> > article.
>
> This is trivially implemented if you have a natural language tokenizer
> you're happy with.
>
> Toktok might be worth looking at:
> http://hackage.haskell.org/package/toktok but I *think* it takes a
> pretty simple view of tokens (assuming it is the tokenizer I've been
> using with GF).
>

Unfortunately 'cabal install' fails with toktok:

Building toktok-0.5...
[1 of 7] Compiling Toktok.Stack ( Toktok/Stack.hs,
dist/build/Toktok/Stack.o )
[2 of 7] Compiling Toktok.Sandhi( Toktok/Sandhi.hs,
dist/build/Toktok/Sandhi.o )
[3 of 7] Compiling Toktok.Trie  ( Toktok/Trie.hs,
dist/build/Toktok/Trie.o )
[4 of 7] Compiling Toktok.Lattice   ( Toktok/Lattice.hs,
dist/build/Toktok/Lattice.o )
[5 of 7] Compiling Toktok.Transducer ( Toktok/Transducer.hs,
dist/build/Toktok/Transducer.o )
[6 of 7] Compiling Toktok.Lexer ( Toktok/Lexer.hs,
dist/build/Toktok/Lexer.o )
[7 of 7] Compiling Toktok   ( Toktok.hs, dist/build/Toktok.o )
Registering toktok-0.5...
[1 of 1] Compiling Main ( Main.hs,
dist/build/toktok/toktok-tmp/Main.o )
Linking dist/build/toktok/toktok ...
[1 of 1] Compiling Main ( tools/ExtractLexicon.hs,
dist/build/gf-extract-lexicon/gf-extract-lexicon-tmp/Main.o )

tools/ExtractLexicon.hs:5:35:
Module `PGF' does not export `getLexicon'
cabal: Error: some packages failed to install:
toktok-0.5 failed during the building phase. The exception was:
ExitFailure 1

Any ideas how to solve this?


Re: [Haskell-cafe] NLP libraries and tools?

2011-07-01 Thread Rogan Creswick
On Fri, Jul 1, 2011 at 3:31 AM, Dmitri O.Kondratiev  wrote:
> Hi,
> Please advise on NLP libraries similar to Natural Language Toolkit

There is a (slowly?) growing NLP community for haskell over at:

http://projects.haskell.org/nlp/

The nlp mailing list may be a better place to ask for details.  To the
best of my knowledge, most of the NLTK / OpenNLP capabilities have yet
to be implemented/ported to Haskell, but there are some packages to
take a look at on Hackage.

> First of all I need:
> - tools to construct 'bag of words'
> (http://en.wikipedia.org/wiki/Bag_of_words_model), which is a list of words
> in the
> article.

This is trivially implemented if you have a natural language tokenizer
you're happy with.
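For instance, given any tokenizer's output, a bag of words is one fold away (a sketch using the containers package; `bagOfWords` is a made-up name):

```haskell
import qualified Data.Map.Strict as M

-- A bag of words is just a multiset: each token mapped to its count.
bagOfWords :: [String] -> M.Map String Int
bagOfWords tokens = M.fromListWith (+) [ (t, 1) | t <- tokens ]
```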

Toktok might be worth looking at:
http://hackage.haskell.org/package/toktok but I *think* it takes a
pretty simple view of tokens (assuming it is the tokenizer I've been
using with GF).

Eric Kow (?) has a tokenizer implementation, which I can't seem to
find at the moment - if I recall correctly, it is also very simple,
but it would be a great place to implement a more complex tokenizer :)

> - tools to prune common words, such as prepositions and conjunctions, as
> well as extremely rare words, such as the ones with typos.

I'm not sure what you mean by 'prune'.  Are you looking for a stopword
list to remove irrelevant / confusing words from something like a
search query? (that's not hard to do with a stemmer and a set)
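A sketch of that set-based approach follows; the stopword list here is a tiny made-up sample, and in practice you would load a real list (or derive one from corpus frequencies):

```haskell
import Data.Char (toLower)
import qualified Data.Set as Set

-- Illustrative stopword list; a real application would load one from
-- a file or derive it from a frequency cutoff over the corpus.
stopwords :: Set.Set String
stopwords = Set.fromList ["the", "a", "an", "and", "or", "of", "in", "to"]

pruneStopwords :: [String] -> [String]
pruneStopwords = filter (\w -> map toLower w `Set.notMember` stopwords)
```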

> - stemming tools

There is an implementation of the porter stemmer on Hackage:

 - http://hackage.haskell.org/package/porter

> - Naive Bayes classifier

I'm not aware of a general-purpose Bayesian classifier library for
Haskell, but it *would* be great to have :)  There are probably some
general-purpose statistical packages that I'm unaware of that offer a
larger set of capabilities...
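For what it's worth, a toy multinomial Naive Bayes with add-one (Laplace) smoothing fits in a few lines of Haskell. All names here are illustrative sketches of mine, not any published API, and this is no substitute for a proper library:

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)
import qualified Data.Map.Strict as M

-- Toy multinomial Naive Bayes with add-one smoothing.
type Label = String

data NB = NB
  { classCounts :: M.Map Label Int            -- documents per class
  , wordCounts  :: M.Map (Label, String) Int  -- token counts per class
  , vocabSize   :: Int
  }

trainNB :: [(Label, [String])] -> NB
trainNB docs = NB cc wc v
  where
    cc = M.fromListWith (+) [ (l, 1) | (l, _) <- docs ]
    wc = M.fromListWith (+) [ ((l, w), 1) | (l, ws) <- docs, w <- ws ]
    v  = M.size (M.fromList [ (w, ()) | (_, ws) <- docs, w <- ws ])

-- Pick the class maximizing log P(class) + sum of log P(word | class).
classifyNB :: NB -> [String] -> Label
classifyNB (NB cc wc v) ws =
    fst (maximumBy (comparing snd) [ (l, score l n) | (l, n) <- M.toList cc ])
  where
    totalDocs = fromIntegral (sum (M.elems cc)) :: Double
    classTotal l =
      fromIntegral (sum [ c | ((l', _), c) <- M.toList wc, l' == l ])
    score l n =
      log (fromIntegral n / totalDocs)
        + sum [ log ((fromIntegral (M.findWithDefault 0 (l, w) wc) + 1)
                     / (classTotal l + fromIntegral v))
              | w <- ws ]
```

For real use you would at least cache the per-class totals instead of recomputing them, but the structure is this simple.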

> - SVM classifier

There are a few of these.  Take a look at the AI category on hackage:

 - http://hackage.haskell.org/packages/archive/pkg-list.html#cat:ai

--Rogan

> -  k-means clustering



[Haskell-cafe] NLP libraries and tools?

2011-07-01 Thread Dmitri O.Kondratiev
Hi,
Please advise on NLP libraries similar to Natural Language Toolkit (
www.nltk.org)
First of all I need:
- tools to construct 'bag of words' (
http://en.wikipedia.org/wiki/Bag_of_words_model), which is a list of words
in the
article.
- tools to prune common words, such as prepositions and conjunctions, as
well as extremely rare words, such as the ones with typos.
- stemming tools
- Naive Bayes classifier
- SVM classifier
-  k-means clustering

Thanks!