Re: [Haskell-cafe] NLP libraries and tools?
On Sun, Jul 10, 2011 at 12:59 PM, ivan vadovic wrote:
> Hi,
>
> Also a library for string normalization in the sense of stripping diacritical
> marks would be handy too. Does anything in this respect exist that would be
> usable from Haskell?

The closest thing I know of is this: http://hackage.haskell.org/package/text-icu

You still have to install ICU separately; that library is just a binding for working with it from Haskell.

Jason

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
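For the curious, the ICU route amounts to decomposing to NFD and dropping combining marks (with `text-icu` that decomposition step would be `normalize NFD` from `Data.Text.ICU.Normalize`). A minimal sketch, using a tiny hand-rolled decomposition table as a stand-in so the example runs without ICU installed; the table is illustrative only and covers just a few characters:

```haskell
import Data.Char (isMark)

-- Illustrative decomposition table; with text-icu you would instead use
-- `normalize NFD` so that every precomposed character is handled correctly.
decompose :: Char -> String
decompose 'é' = "e\x0301"  -- e + combining acute
decompose 'ü' = "u\x0308"  -- u + combining diaeresis
decompose 'ñ' = "n\x0303"  -- n + combining tilde
decompose c   = [c]

-- Strip diacritics: decompose, then drop the combining marks.
stripDiacritics :: String -> String
stripDiacritics = filter (not . isMark) . concatMap decompose

main :: IO ()
main = putStrLn (stripDiacritics "résumé, mañana")  -- prints "resume, manana"
```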
Re: [Haskell-cafe] NLP libraries and tools?
Hi,

Also a library for string normalization in the sense of stripping diacritical marks would be handy too. Does anything in this respect exist that would be usable from Haskell?

Thanks

On Fri, Jul 01, 2011 at 02:31:34PM +0400, Dmitri O.Kondratiev wrote:
> Hi,
> Please advise on NLP libraries similar to Natural Language Toolkit (www.nltk.org)
> First of all I need:
> - tools to construct a 'bag of words' (http://en.wikipedia.org/wiki/Bag_of_words_model), which is a list of the words in the article.
> - tools to prune common words, such as prepositions and conjunctions, as well as extremely rare words, such as the ones with typos.
> - stemming tools
> - Naive Bayes classifier
> - SVM classifier
> - k-means clustering
>
> Thanks!
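Nothing as complete as NLTK exists in Haskell, but the first two items on the quoted list are small enough to sketch with just `base` and `containers`. The stop-word list and the occurs-more-than-once threshold below are illustrative assumptions, not a real linguistic resource:

```haskell
import qualified Data.Map.Strict as M
import Data.Char (isAlpha, toLower)

-- Bag of words: token -> frequency.
type Bag = M.Map String Int

-- Lowercase, replace non-letters with spaces, split, and count.
bagOfWords :: String -> Bag
bagOfWords = M.fromListWith (+) . map (\w -> (w, 1)) . words . map normalize
  where normalize c = if isAlpha c then toLower c else ' '

-- Prune stop words and hapax legomena (words occurring once, often typos).
prune :: [String] -> Bag -> Bag
prune stops = M.filterWithKey keep
  where keep w n = n > 1 && w `notElem` stops

main :: IO ()
main = print (prune ["the"] (bagOfWords "The cat saw the cat"))
-- prints fromList [("cat",2)]
```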
Re: [Haskell-cafe] NLP libraries and tools?
Perhaps this is interesting? On the relationship between exploratory (a.k.a. sloppy or theoretical) and rigorous math: http://arxiv.org/pdf/math/9307227v1

-k
--
If I haven't seen further, it is by standing in the footprints of giants
Re: [Haskell-cafe] NLP libraries and tools?
Heh, I just hit Reply All and I guess the address came in wrong. Ah, well.

I strongly agree with you on the state of linguistics, et al. Having done little bits of work in a few of those fields (or at least work _with_ people in them), your comments are spot on. Though perhaps I wouldn't say that mathematics isn't a science (merely because most fields therein satisfy the scientific method). But my glasses may be just a little rosy. :)

All that said, I find your points insightful. And don't even get me started on the sloppy math in the social sciences. :D

A major issue in the matter of theory/practice drift seems (to me, at least) to be the subject matter's ability to assimilate into pop culture and pseudo-scientific meandering. String theory and some of Penrose's works spring to mind. Sapir-Whorf, "relational" databases, and the events (perhaps to be read 'hype') leading up to the AI Winter also come to mind. A little knowledge is a dangerous thing, as they say. Perhaps that's just confirmation bias; I may just think of them as examples because they're pet peeves. :D

And, naturally, every field wishes it could be mathematics. (Tongue in cheek… mostly) http://xkcd.com/435/

On Jul 9, 2011, at 7:55 PM, wren ng thornton wrote:
> (Psst, the nlp list is :)
>
> On 7/9/11 3:10 AM, Jack Henahan wrote:
>> On Jul 7, 2011, at 10:53 PM, wren ng thornton wrote:
>>> I can't help but be a (meta)theorist. But then, I'm of the firm opinion that theory must be grounded in actual practice, else it belongs more to the realm of theology than science.
>>
>> Oof, you're liable to wound my (pure) mathematician's pride with remarks like that, wren. :P
>
> How's that now? Pure mathematics is perfectly grounded in the practice of mathematics :)
>
> I've no qualms with pure maths. After all, mathematics isn't trying to model anything (except itself). The problems I have are when the theory branch of a field loses touch with what the field is trying to do in the first place, and consequently ends up arguing over details which can be neither proven nor disproven. It is this which makes them non-scientific and not particularly helpful for practicing scientists. Linguistics is one of the fields where this has happened, but it's by no means the only one (AI, declarative databases, postmodernism, ...).
>
> There's nothing wrong with not being science. I'm a big fan of the humanities, mathematics, and philosophy. It's only a problem when non-science is pretending to be science: it derails the scientists and it does a disservice to the non-science itself. Non-science is fine; pseudo-science is the problem. For the same reason, I despise math envy and all the pseudo-math that gets bandied about in social sciences wishing they were economics (or economics wishing it were statistics, or statistics wishing it were mathematics).
>
> --
> Live well,
> ~wren
Re: [Haskell-cafe] NLP libraries and tools?
(Psst, the nlp list is :)

On 7/9/11 3:10 AM, Jack Henahan wrote:
> On Jul 7, 2011, at 10:53 PM, wren ng thornton wrote:
>> I can't help but be a (meta)theorist. But then, I'm of the firm opinion that theory must be grounded in actual practice, else it belongs more to the realm of theology than science.
>
> Oof, you're liable to wound my (pure) mathematician's pride with remarks like that, wren. :P

How's that now? Pure mathematics is perfectly grounded in the practice of mathematics :)

I've no qualms with pure maths. After all, mathematics isn't trying to model anything (except itself). The problems I have are when the theory branch of a field loses touch with what the field is trying to do in the first place, and consequently ends up arguing over details which can be neither proven nor disproven. It is this which makes them non-scientific and not particularly helpful for practicing scientists. Linguistics is one of the fields where this has happened, but it's by no means the only one (AI, declarative databases, postmodernism, ...).

There's nothing wrong with not being science. I'm a big fan of the humanities, mathematics, and philosophy. It's only a problem when non-science is pretending to be science: it derails the scientists and it does a disservice to the non-science itself. Non-science is fine; pseudo-science is the problem. For the same reason, I despise math envy and all the pseudo-math that gets bandied about in social sciences wishing they were economics (or economics wishing it were statistics, or statistics wishing it were mathematics).

--
Live well,
~wren
Re: [Haskell-cafe] NLP libraries and tools?
Oof, you're liable to wound my (pure) mathematician's pride with remarks like that, wren. :P

Now go intone the Litany of Categories as penance. :D I'll start you off… Set, Rel, Top, Ring, Grp, Cat, Hask…

On Jul 7, 2011, at 10:53 PM, wren ng thornton wrote:
> I can't help but be a (meta)theorist. But then, I'm of the firm opinion that theory must be grounded in actual practice, else it belongs more to the realm of theology than science.
>
> --
> Live well,
> ~wren
Re: [Haskell-cafe] NLP libraries and tools?
On 7/7/11 3:50 AM, Aleksandar Dimitrov wrote:
> It's actually a shame we're discussing this on -cafe and not on -nlp. Then again, maybe it's going to prompt somebody to join -nlp, and I'm gonna CC it there, because some folks over there might be interested, but not read -cafe.

Quite :)

> When you mentioned Arabic for producing sentences that go on for ages, you don't really need to go that far. I have had the doubtful pleasure of reading Kant and Hegel in their original versions. In German, it is sometimes still considered good style to write huge sentences. I once made it a point, just to stick it to a Kant-loving person, to produce a sentence that spanned 2 whole pages (A4). It wasn't even difficult.

The Romans were big fans of that too (though there's only a small group of folks interested in doing NLP on Latin these days). I've only read Hegel et al. in translation, but the Latin I've read falls nicely into the notion of "span". It doesn't, however, always fall nicely into a clause-based approach like Japanese does. Then again, that could be due to the poetic/rhetorical nature of the texts in question. I wonder if there's been any computational attempt to make the notion of span or discourse atoms rigorous enough for pragmatic use...

> I'm very much a "works for me" person in these matters. Mostly because I'm tired of linguists fighting each other over trivial matters. Give me something I can work with already!

I can't help but be a (meta)theorist. But then, I'm of the firm opinion that theory must be grounded in actual practice, else it belongs more to the realm of theology than science.

--
Live well,
~wren
Re: [Haskell-cafe] NLP libraries and tools?
On 7/7/11 3:38 AM, Aleksandar Dimitrov wrote:
> On Wed, Jul 06, 2011 at 07:27:10PM -0700, wren ng thornton wrote:
>> I definitely agree with the iteratees comment, but I'm curious about the leaks you mention. I haven't run into leakiness issues (that I'm aware of) in my use of ByteStrings for NLP.
>
> The issue is this: strict ByteStrings retain pointers to the original chunk. The chunk is probably bigger than you'd want to keep in memory if you, say, wanted to just keep one or two words. In my case, the chunk was some 65K (that was my Iteratee chunk size).

Oh, that issue. Yeah, I maintain an intern table and make sure that the copy in the table is a trimmed copy instead of keeping the whole string alive. I guess I should factor that part of my tagger out into a separate package :)

I didn't know if you meant there was a technical issue, e.g. something about the fact that ByteStrings use pinned memory (whereas Text doesn't, IIRC).

--
Live well,
~wren
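The trimming/interning trick described above can be sketched in a few lines (the table layout here is a guess at the idea, not wren's actual code). The key point is that `Data.ByteString.copy` allocates a fresh, minimally-sized buffer, so the interned key no longer pins the original 65K input chunk:

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.Map.Strict as M

type InternTable = M.Map B.ByteString B.ByteString

-- Intern a token: if we've seen it before, reuse the stored copy;
-- otherwise store a trimmed copy (B.copy) so the table does not
-- retain a pointer into the large input chunk the slice came from.
intern :: B.ByteString -> InternTable -> (B.ByteString, InternTable)
intern s tbl = case M.lookup s tbl of
  Just s' -> (s', tbl)
  Nothing -> let s' = B.copy s  -- fresh buffer, breaks sharing
             in (s', M.insert s' s' tbl)

main :: IO ()
main = do
  let (tok, _) = intern (B.pack "cat") M.empty
  B.putStrLn tok
```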
Re: [Haskell-cafe] NLP libraries and tools?
It's actually a shame we're discussing this on -cafe and not on -nlp. Then again, maybe it's going to prompt somebody to join -nlp, and I'm gonna CC it there, because some folks over there might be interested, but not read -cafe.

On Wed, Jul 06, 2011 at 07:22:41PM -0700, wren ng thornton wrote:
> On 7/6/11 5:58 PM, Aleksandar Dimitrov wrote:
>> On Wed, Jul 06, 2011 at 09:32:27AM -0700, wren ng thornton wrote:
>>> On 7/6/11 9:27 AM, Dmitri O.Kondratiev wrote:
>>>> Hi,
>>>> Continuing my search of Haskell NLP tools and libs, I wonder if the following Haskell libraries exist (googling them does not help):
>>>> 1) End of Sentence (EOS) Detection. Break text into a collection of meaningful sentences.
>>>
>>> Depending on how you mean, this is either fairly trivial (for English) or an ill-defined problem. For things like determining whether the "." character is intended as a full stop vs part of an abbreviation; that's trivial.
>>
>> I disagree. It's not exactly trivial in the sense that it is solved. It is trivial in the sense that, usually, one would use a list of known abbreviations and just compare. This, however, just says that the most common approach is trivial, not that the problem is.
>
> Perhaps. I recall David Yarowsky suggesting it was considered solved (for English, as I qualified earlier).
>
> The solution I use is just to look at a window around the point and run a standard feature-based machine learning algorithm over it[1]. Memorizing known abbreviations is actually quite fragile, for reasons you mention. This approach will give you accuracy in the high 90s, though I forget the exact numbers.

That is indeed one of the best ways to do it (for Indo-European languages, anyway). When you mentioned Arabic for producing sentences that go on for ages, you don't really need to go that far: I have had the doubtful pleasure of reading Kant and Hegel in their original versions. In German, it is sometimes still considered good style to write huge sentences. I once made it a point, just to stick it to a Kant-loving person, to produce a sentence that spanned 2 whole pages (A4). It wasn't even difficult.

I sometimes think that we should just adopt a similar notion of "span," like rhetorical structure theorists do. In that case, you're not segmenting sentences, but discourse atoms. Those are even more ill-defined, however.

> But the problem is that what constitutes an appropriate solution for computational needs is still very ill-defined.

Well, yes, and, well, no. Tokens are ill-defined. There's no good consensus on how you should parse tokens (i.e., is "in spite of" one token or three?) either, and so you just pick one that works for you. Same for sentence boundaries: they're sometimes also ill-defined, but who says you need to define them well? Maybe there's just a purpose-driven definition that people can agree on, anyways. My purpose is either tagging, or parsing, or NE-detection, or computational semantics… In all cases, I'm choosing the definition my tools can use. Not because that's "correct," but I don't really need it to be, no?

I'm very much a "works for me" person in these matters. Mostly because I'm tired of linguists fighting each other over trivial matters. Give me something I can work with already!

Regards,
Aleks
Re: [Haskell-cafe] NLP libraries and tools?
On 7/6/11 8:46 PM, Richard O'Keefe wrote:
>> I've been working over the last year+ on an optimized HMM-based POS tagger/supertagger with online tagging and anytime n-best tagging. I'm planning to release it this summer (i.e., by the end of August), though there are a few things I'd like to polish up before doing so. In particular, I want to make the package less monolithic. When I release it I'll make announcements here and on the nlp@ list.
>
> One of the issues I've had with a POS tagger I've been using is that it makes some really stupid decisions which can be patched up with a few simple rules, but since it's distributed as a .jar file I cannot add those rules.

How horrid. I assume the problem is really that the trained model is in the jar and you can't do your own training? Or is this a Brill-like tagger where you really mean to add new rules?

If an HMM-based tagger is amenable, you could try switching to Daniël de Kok's Java port of TnT: https://github.com/danieldk/jitar

The tagger I'm working on does support being hooked up to a Java client (i.e., a consumer of tagging info), but it's fairly ugly due to Java's refusal to believe in IPC.

--
Live well,
~wren
Re: [Haskell-cafe] NLP libraries and tools?
On 7/6/11 6:45 PM, Aleksandar Dimitrov wrote:
> One hint, if you ever find yourself reading in quantitative linguistic data with Haskell: forget lazy IO. Forget strict IO too, unless your documents are never bigger than a few hundred megs. In case you're not keeping the whole document in memory, but you're keeping some stuff in memory, never keep it around in ByteStrings, but use Text or SmallString (ByteStrings will invariably leak space in this scenario). Learn how to use Iteratees and use them judiciously.

I definitely agree with the iteratees comment, but I'm curious about the leaks you mention. I haven't run into leakiness issues (that I'm aware of) in my use of ByteStrings for NLP.

--
Live well,
~wren
Re: [Haskell-cafe] NLP libraries and tools?
On 7/6/11 5:58 PM, Aleksandar Dimitrov wrote:
> On Wed, Jul 06, 2011 at 09:32:27AM -0700, wren ng thornton wrote:
>> On 7/6/11 9:27 AM, Dmitri O.Kondratiev wrote:
>>> Hi,
>>> Continuing my search of Haskell NLP tools and libs, I wonder if the following Haskell libraries exist (googling them does not help):
>>> 1) End of Sentence (EOS) Detection. Break text into a collection of meaningful sentences.
>>
>> Depending on how you mean, this is either fairly trivial (for English) or an ill-defined problem. For things like determining whether the "." character is intended as a full stop vs part of an abbreviation; that's trivial.
>
> I disagree. It's not exactly trivial in the sense that it is solved. It is trivial in the sense that, usually, one would use a list of known abbreviations and just compare. This, however, just says that the most common approach is trivial, not that the problem is.

Perhaps. I recall David Yarowsky suggesting it was considered solved (for English, as I qualified earlier).

The solution I use is just to look at a window around the point and run a standard feature-based machine learning algorithm over it[1]. Memorizing known abbreviations is actually quite fragile, for reasons you mention. This approach will give you accuracy in the high 90s, though I forget the exact numbers.

[1] With obvious features like whether the following word is capitalized, whether the preceding word is capitalized, the length of the preceding word, whether there's another period after the next word, ...

>> But for general sentence breaking, how do you intend to deal with quotations? What about when news articles quote someone uttering a few sentences before the end-quote marker? So far as I'm aware, there's no satisfactory definition of what the solution should be in all reasonable cases. A "sentence" isn't really very well-defined in practice.
>
> As long as you have one routine and stick to it, you don't need a formal definition every linguist will agree on. Computational Linguists (and their tools), more often than not, just need a dependable solution, not a correct one.

But the problem is that what constitutes an appropriate solution for computational needs is still very ill-defined. For example, the treatment of quotations will depend on the grammar theory used in the tagger, parser, translator, etc. The quality of output is often quite susceptible to EOS being meaningfully[2] distributed. Thus, what constitutes a "dependable" solution often varies depending on the task in question.[3]

Also, a lot of the tools in this area assume there's some sort of punctuation marking the end of sentences, even if it's unreliable as an EOS indicator. That works well enough for languages with European-like orthographic traditions, but it falls apart quite rapidly when moving to East Asian languages (e.g., Burmese, Thai, ...). And languages like Japanese or Arabic can have "sentences" that go on forever, but are best handled by chunking them into clauses.

[2] In a statistical sense, relative to the structure of the model.

[3] Personally, I think the idea of having a single EOS type is the bulk of the problem. If we allowed for different kinds of EOS in grammars then the upstream tools could handle sentence fragments better, which would make it easier to make fragment breaking reliable.

>> I've been working over the last year+ on an optimized HMM-based POS tagger/supertagger with online tagging and anytime n-best tagging. I'm planning to release it this summer (i.e., by the end of August), though there are a few things I'd like to polish up before doing so. In particular, I want to make the package less monolithic. When I release it I'll make announcements here and on the nlp@ list.
>
> I'm very interested in your progress! Keep us posted :-)

Will do :)

--
Live well,
~wren
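The windowed, feature-based approach described above can be sketched as a feature extractor; the feature set mirrors the footnote (capitalization of neighbors, length of the preceding word, a nearby second period), and the names are illustrative. Any off-the-shelf classifier, e.g. a maxent or logistic-regression model, would consume these:

```haskell
import Data.Char (isUpper)

-- Features around a candidate "." token at position i in a token list.
data EOSFeatures = EOSFeatures
  { nextCapitalized :: Bool  -- is the following word capitalized?
  , prevCapitalized :: Bool  -- is the preceding word capitalized?
  , prevLength      :: Int   -- length of the preceding word
  , periodAfterNext :: Bool  -- another period shortly after?
  } deriving (Eq, Show)

eosFeatures :: [String] -> Int -> EOSFeatures
eosFeatures toks i = EOSFeatures
  { nextCapitalized = startsUpper (tok (i + 1))
  , prevCapitalized = startsUpper (tok (i - 1))
  , prevLength      = length (tok (i - 1))
  , periodAfterNext = tok (i + 2) == "."
  }
  where
    -- Out-of-range positions yield the empty token.
    tok j | j >= 0 && j < length toks = toks !! j
          | otherwise                 = ""
    startsUpper (c:_) = isUpper c
    startsUpper _     = False

main :: IO ()
main = print (eosFeatures ["Dr", ".", "Smith"] 1)
```

Here the "." after "Dr" is followed by a capitalized word but preceded by a short word, exactly the kind of evidence a trained classifier weighs against each other.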
Re: [Haskell-cafe] NLP libraries and tools?
On 7/07/2011, at 7:04 AM, Dmitri O.Kondratiev wrote:
> I am looking for a Haskell implementation of a sentence tokenizer such as described by Tibor Kiss and Jan Strunk in "Unsupervised Multilingual Sentence Boundary Detection", which is implemented in NLTK:

That method is multilingual but relies on the text being written using fairly modern Western conventions, and tackles the problem of "too many dots": not knowing which are abbreviation points and which are full stops. I don't suppose anyone knows something that might help with the problem of too few dots? Run-on sentences are one example.

>> I've been working over the last year+ on an optimized HMM-based POS tagger/supertagger with online tagging and anytime n-best tagging. I'm planning to release it this summer (i.e., by the end of August), though there are a few things I'd like to polish up before doing so. In particular, I want to make the package less monolithic. When I release it I'll make announcements here and on the nlp@ list.

One of the issues I've had with a POS tagger I've been using is that it makes some really stupid decisions which can be patched up with a few simple rules, but since it's distributed as a .jar file I cannot add those rules.
Re: [Haskell-cafe] NLP libraries and tools?
On Wed, Jul 06, 2011 at 03:14:07PM -0700, Rogan Creswick wrote:
> Have you used that particular combination yet? I'd like to know the details of how you hooked everything together if that's something you can share. (We're working on a similar Frankenstein at the moment.)

These Frankensteins, as you so dearly call them, are always very task-specific. Here's a setup I've used:

- Take some sort of corpus you want to work with, and annotate it with, say, Java tools. This will probably require you to massage the input corpus into something your tools can read, and then call the tools to process it.
- Let your Java stuff write everything to disk in a format that you can easily read in with Haskell. If your annotations are not interleaving, you're lucky, because you can probably just use a word-per-line format with columns for markup. That's trivial to read in with Haskell. More complicated stuff should probably be handled in XML fashion. I like HXT for reading in XML, but it's slow (as are its competitors; although it's been a while since I've used it, maybe it supports Text or ByteStrings by now).
- Advanced mode: instead of dumping to files, use named pipes or TCP sockets to transfer data. Good luck.

Shell scripting comes in *very* handy here, in order to automate this process.

Now, everything I've done so far is only *research*, no finished product that the end user wants to poke on their desktop and have it work interactively. For that, it might be useful to have some sort of standing server architecture: you have multiple annotation servers (one that runs in Java, one that runs in Haskell) and have them communicate the data. At this point, the benefits might be outweighed by the drawbacks. My love for Haskell only goes that far.

One hint, if you ever find yourself reading in quantitative linguistic data with Haskell: forget lazy IO. Forget strict IO too, unless your documents are never bigger than a few hundred megs. In case you're not keeping the whole document in memory, but you're keeping some stuff in memory, never keep it around in ByteStrings, but use Text or SmallString (ByteStrings will invariably leak space in this scenario). Learn how to use Iteratees and use them judiciously.

Keep in touch on the Haskell NLP list :-)

Regards,
Aleks
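The word-per-line column format mentioned above can indeed be read with a few lines of Haskell. This sketch assumes one "token<TAB>tag" pair per line with blank lines separating sentences, which is a common but here merely assumed layout:

```haskell
-- A token paired with its annotation (e.g. a POS tag).
type Token = (String, String)

-- Parse word-per-line annotations: one "token<TAB>tag" per line,
-- sentences separated by blank lines.
parseColumns :: String -> [[Token]]
parseColumns = map (map pair) . splitSentences . lines
  where
    -- First two whitespace-separated columns; extra columns are ignored.
    pair l = case words l of
      (tok : tag : _) -> (tok, tag)
      _               -> (l, "")
    -- Group lines into sentences at blank lines.
    splitSentences :: [String] -> [[String]]
    splitSentences = foldr step [[]]
      where
        step "" acc       = [] : acc
        step l (s : rest) = (l : s) : rest
        step l []         = [[l]]

main :: IO ()
main = print (parseColumns "the\tDT\ncat\tNN\n\nsat\tVBD")
-- prints [[("the","DT"),("cat","NN")],[("sat","VBD")]]
```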
Re: [Haskell-cafe] NLP libraries and tools?
On Wed, Jul 6, 2011 at 3:03 PM, Aleksandar Dimitrov wrote:
> So you'd use, say, UIMA+OpenNLP to do sentence boundaries, tokens, tags, named entities, whatnot, then spit out some annotated format, read it in with Haskell, and do the logic/magic there.

Have you used that particular combination yet? I'd like to know the details of how you hooked everything together if that's something you can share. (We're working on a similar Frankenstein at the moment.)

--Rogan
Re: [Haskell-cafe] NLP libraries and tools?
On Wed, Jul 06, 2011 at 11:04:30PM +0400, Dmitri O.Kondratiev wrote:
> On Wed, Jul 6, 2011 at 8:32 PM, wren ng thornton wrote:
>> On 7/6/11 9:27 AM, Dmitri O.Kondratiev wrote:
>>> Hi,
>>> Continuing my search of Haskell NLP tools and libs, I wonder if the following Haskell libraries exist (googling them does not help):
>>> 1) End of Sentence (EOS) Detection. Break text into a collection of meaningful sentences.
>>
>> Depending on how you mean, this is either fairly trivial (for English) or an ill-defined problem. For things like determining whether the "." character is intended as a full stop vs part of an abbreviation; that's trivial.
>>
>> But for general sentence breaking, how do you intend to deal with quotations? What about when news articles quote someone uttering a few sentences before the end-quote marker? So far as I'm aware, there's no satisfactory definition of what the solution should be in all reasonable cases. A "sentence" isn't really very well-defined in practice.
>
> I am looking for a Haskell implementation of a sentence tokenizer such as described by Tibor Kiss and Jan Strunk in "Unsupervised Multilingual Sentence Boundary Detection", which is implemented in NLTK:
> http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt-module.html
>
>>> 2) Part-of-Speech (POS) Tagging. Assign part-of-speech information to each token.
>>
>> There are numerous approaches to this problem; do you care about the solution, or will any one of them suffice?
>>
>> I've been working over the last year+ on an optimized HMM-based POS tagger/supertagger with online tagging and anytime n-best tagging. I'm planning to release it this summer (i.e., by the end of August), though there are a few things I'd like to polish up before doing so. In particular, I want to make the package less monolithic. When I release it I'll make announcements here and on the nlp@ list.
>
> I am looking for some already working POS tagging framework that can be customized for different pidgin languages.
>
>>> 3) Chunking. Analyze each tagged token within a sentence and assemble compound tokens that express logical concepts. Define a custom grammar.
>>>
>>> 4) Extraction. Analyze each chunk and further tag the chunks as named entities, such as people, organizations, locations, etc.
>>>
>>> Any ideas where to look for similar Haskell libraries?
>>
>> I don't know of any work in these areas in Haskell (though I'd love to hear about it). You should try asking on the nlp@ list where the other linguists and NLPers are more likely to see it.
>
> I will, though n...@projects.haskell.org looks very quiet...

Quiet, yes, but, hey, we all start out… nevermind, humans start out loud. Well anyhow, it's quiet, but it's gotta start somewhere.

I wouldn't hold my breath for a full-scale Haskell-native solution to your problem just yet, though. Here's what I'm doing: I usually use external programs to do the heavy lifting for which there aren't Haskell programs. Then I use Haskell (where applicable) to do the logic, and shell scripts to glue everything together.

So you'd use, say, UIMA+OpenNLP to do sentence boundaries, tokens, tags, named entities, whatnot, then spit out some annotated format, read it in with Haskell, and do the logic/magic there.

Complicated, yes. But it gets me around having to code too much in Java. That's a gain if I've ever seen one.

Regards,
Aleks
Re: [Haskell-cafe] NLP libraries and tools?
On Wed, Jul 06, 2011 at 09:32:27AM -0700, wren ng thornton wrote:
> On 7/6/11 9:27 AM, Dmitri O.Kondratiev wrote:
>> Hi,
>> Continuing my search of Haskell NLP tools and libs, I wonder if the following Haskell libraries exist (googling them does not help):
>> 1) End of Sentence (EOS) Detection. Break text into a collection of meaningful sentences.
>
> Depending on how you mean, this is either fairly trivial (for English) or an ill-defined problem. For things like determining whether the "." character is intended as a full stop vs part of an abbreviation; that's trivial.

I disagree. It's not exactly trivial in the sense that it is solved. It is trivial in the sense that, usually, one would use a list of known abbreviations and just compare. This, however, just says that the most common approach is trivial, not that the problem is.

There are cases where, for example, an abbreviation and a full stop will coincide. In these cases, you'll often need full-blown parsing or at least a well-trained maxent classifier. There are other problems: ordinals, acronyms which themselves also have periods in them, weird names (like Yahoo!) and initials, to name a few. This is only for English and similar languages, mind you.

> But for general sentence breaking, how do you intend to deal with quotations? What about when news articles quote someone uttering a few sentences before the end-quote marker? So far as I'm aware, there's no satisfactory definition of what the solution should be in all reasonable cases. A "sentence" isn't really very well-defined in practice.

As long as you have one routine and stick to it, you don't need a formal definition every linguist will agree on. Computational Linguists (and their tools), more often than not, just need a dependable solution, not a correct one.

>> 2) Part-of-Speech (POS) Tagging. Assign part-of-speech information to each token.
>
> There are numerous approaches to this problem; do you care about the solution, or will any one of them suffice?
>
> I've been working over the last year+ on an optimized HMM-based POS tagger/supertagger with online tagging and anytime n-best tagging. I'm planning to release it this summer (i.e., by the end of August), though there are a few things I'd like to polish up before doing so. In particular, I want to make the package less monolithic. When I release it I'll make announcements here and on the nlp@ list.

I'm very interested in your progress! Keep us posted :-)

Regards,
Aleks
Re: [Haskell-cafe] NLP libraries and tools?
On Wed, Jul 6, 2011 at 8:32 PM, wren ng thornton wrote:
> On 7/6/11 9:27 AM, Dmitri O.Kondratiev wrote:
> > Hi,
> > Continuing my search of Haskell NLP tools and libs, I wonder if the
> > following Haskell libraries exist (googling them does not help):
> > 1) End of Sentence (EOS) Detection. Break text into a collection of
> > meaningful sentences.
>
> Depending on what you mean, this is either fairly trivial (for English) or
> an ill-defined problem. For things like determining whether the "."
> character is intended as a full stop vs. part of an abbreviation, that's
> trivial.
>
> But for general sentence breaking, how do you intend to deal with
> quotations? What about when news articles quote someone uttering a few
> sentences before the end-quote marker? So far as I'm aware, there's no
> satisfactory definition of what the solution should be in all reasonable
> cases. A "sentence" isn't really very well-defined in practice.

I am looking for a Haskell implementation of a sentence tokenizer such as
the one described by Tibor Kiss and Jan Strunk in "Unsupervised
Multilingual Sentence Boundary Detection", which is implemented in NLTK:
http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt-module.html

> > 2) Part-of-Speech (POS) Tagging. Assign part-of-speech information to
> > each token.
>
> There are numerous approaches to this problem; do you care about the
> solution, or will any one of them suffice?
>
> I've been working over the last year+ on an optimized HMM-based POS
> tagger/supertagger with online tagging and anytime n-best tagging. I'm
> planning to release it this summer (i.e., by the end of August), though
> there are a few things I'd like to polish up before doing so. In
> particular, I want to make the package less monolithic. When I release it
> I'll make announcements here and on the nlp@ list.

I am looking for an already working POS tagging framework that can be
customized for different pidgin languages.

> > 3) Chunking. Analyze each tagged token within a sentence and assemble
> > compound tokens that express logical concepts. Define a custom grammar.
> >
> > 4) Extraction. Analyze each chunk and further tag the chunks as named
> > entities, such as people, organizations, locations, etc.
> >
> > Any ideas where to look for similar Haskell libraries?
>
> I don't know of any work in these areas in Haskell (though I'd love to
> hear about it). You should try asking on the nlp@ list where the other
> linguists and NLPers are more likely to see it.

I will, though n...@projects.haskell.org looks very quiet...
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
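The "trivial for English" side of wren's point can be made concrete. Below is a naive sentence splitter of exactly the kind he warns is insufficient in general: it treats a word-final '.', '!' or '?' as a boundary unless the word is a known abbreviation. This is a sketch, not Punkt; the function names and the (tiny) abbreviation list are made up for illustration, and the rule fails on quotations and nested punctuation.

```haskell
import qualified Data.Set as Set

-- A small, obviously incomplete abbreviation list (illustrative only).
abbrevs :: Set.Set String
abbrevs = Set.fromList ["Mr.", "Mrs.", "Dr.", "Prof.", "e.g.", "i.e.", "etc."]

-- Naive rule: '.', '!' or '?' at the end of a word closes a sentence,
-- unless the whole word is a known abbreviation.
sentences :: String -> [String]
sentences = map unwords . go [] . words
  where
    go acc []     = [reverse acc | not (null acc)]
    go acc (w:ws)
      | last w `elem` ".!?" && not (w `Set.member` abbrevs) =
          reverse (w : acc) : go [] ws
      | otherwise = go (w : acc) ws
```

A sentence-final abbreviation ("... and so on, etc.") is never split here, which is one of the ambiguities the Kiss and Strunk approach learns to resolve instead of hard-coding.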
Re: [Haskell-cafe] NLP libraries and tools?
On 7/6/11 9:27 AM, Dmitri O.Kondratiev wrote:
> Hi,
> Continuing my search of Haskell NLP tools and libs, I wonder if the
> following Haskell libraries exist (googling them does not help):
> 1) End of Sentence (EOS) Detection. Break text into a collection of
> meaningful sentences.

Depending on what you mean, this is either fairly trivial (for English) or
an ill-defined problem. For things like determining whether the "."
character is intended as a full stop vs. part of an abbreviation, that's
trivial.

But for general sentence breaking, how do you intend to deal with
quotations? What about when news articles quote someone uttering a few
sentences before the end-quote marker? So far as I'm aware, there's no
satisfactory definition of what the solution should be in all reasonable
cases. A "sentence" isn't really very well-defined in practice.

> 2) Part-of-Speech (POS) Tagging. Assign part-of-speech information to each
> token.

There are numerous approaches to this problem; do you care about the
solution, or will any one of them suffice?

I've been working over the last year+ on an optimized HMM-based POS
tagger/supertagger with online tagging and anytime n-best tagging. I'm
planning to release it this summer (i.e., by the end of August), though
there are a few things I'd like to polish up before doing so. In
particular, I want to make the package less monolithic. When I release it
I'll make announcements here and on the nlp@ list.

> 3) Chunking. Analyze each tagged token within a sentence and assemble
> compound tokens that express logical concepts. Define a custom grammar.
>
> 4) Extraction. Analyze each chunk and further tag the chunks as named
> entities, such as people, organizations, locations, etc.
>
> Any ideas where to look for similar Haskell libraries?

I don't know of any work in these areas in Haskell (though I'd love to
hear about it). You should try asking on the nlp@ list where the other
linguists and NLPers are more likely to see it.

--
Live well,
~wren
Re: [Haskell-cafe] NLP libraries and tools?
Hi,
Continuing my search of Haskell NLP tools and libs, I wonder if the
following Haskell libraries exist (googling them does not help):

1) End of Sentence (EOS) Detection. Break text into a collection of
meaningful sentences.

2) Part-of-Speech (POS) Tagging. Assign part-of-speech information to each
token.

3) Chunking. Analyze each tagged token within a sentence and assemble
compound tokens that express logical concepts. Define a custom grammar.

4) Extraction. Analyze each chunk and further tag the chunks as named
entities, such as people, organizations, locations, etc.

Any ideas where to look for similar Haskell libraries?
Re: [Haskell-cafe] NLP libraries and tools?
On Fri, Jul 1, 2011 at 2:52 PM, Dmitri O.Kondratiev wrote:
> Is there any Haskell word tokenizer other than 'toktok' that compiles
> and works? I need something like:
> http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.regexp.WordPunctTokenizer-class.html

I don't think this exists out of the box, but since it appears to be a
basic regex tokenizer, you could use Data.List.Split to create one (or one
of the regex libraries may be able to do this more simply).

If you go the Data.List.Split route, I suspect you'll want to create a
Splitter based on the whenElt Splitter:

http://hackage.haskell.org/packages/archive/split/0.1.1/doc/html/Data-List-Split.html#v:whenElt

which takes a function from an element to a Bool (which you can implement
however you wish, possibly with a regular expression, although it will
have to be pure).

If you want something like a maxent tokenizer, then you're currently out
of luck :( (as far as I know).

--Rogan
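For what it's worth, a WordPunct-style tokenizer (alternating alphanumeric and punctuation runs, roughly NLTK's \w+|[^\w\s]+) does not even need the split package; grouping by character class with base's Data.List.groupBy is enough. A sketch, with illustrative names:

```haskell
import Data.Char (isAlphaNum, isSpace)
import Data.List (groupBy)

-- Classify characters so that alphanumeric runs and punctuation runs
-- form separate tokens, and whitespace runs can be discarded.
data CharClass = Word | Punct | Space deriving (Eq)

classify :: Char -> CharClass
classify c
  | isAlphaNum c || c == '_' = Word
  | isSpace c                = Space
  | otherwise                = Punct

-- Split at every class boundary, then drop the whitespace runs.
-- groupBy never yields empty groups, so 'head' below is safe.
wordPunctTokenize :: String -> [String]
wordPunctTokenize =
      filter (\t -> classify (head t) /= Space)
    . groupBy (\a b -> classify a == classify b)
```

So `wordPunctTokenize "Good muffins, cost $3.88."` splits "$3.88." into "$", "3", ".", "88", "." just as the regex version would.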
Re: [Haskell-cafe] NLP libraries and tools?
On Fri, Jul 1, 2011 at 11:58 PM, Rogan Creswick wrote:
> On Fri, Jul 1, 2011 at 12:38 PM, Dmitri O.Kondratiev wrote:
> > On Fri, Jul 1, 2011 at 9:34 PM, Rogan Creswick wrote:
> >> On Fri, Jul 1, 2011 at 3:31 AM, Dmitri O.Kondratiev wrote:
> >> > First of all I need:
> >
> > Unfortunately 'cabal install' fails with toktok:
> >
> > tools/ExtractLexicon.hs:5:35:
> >     Module `PGF' does not export `getLexicon'
> > cabal: Error: some packages failed to install:
> > toktok-0.5 failed during the building phase. The exception was:
> > ExitFailure 1
>
> Oh, right - I ran into this problem too and forgot about it (I should
> have reported a bug...). I think this fails because of (relatively)
> recent changes in GF, which isn't constrained to specific versions in
> the toktok cabal file...
>
> --Rogan

Is there any Haskell word tokenizer other than 'toktok' that compiles
and works? I need something like:
http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.regexp.WordPunctTokenizer-class.html

Thanks!
Re: [Haskell-cafe] NLP libraries and tools?
On Fri, Jul 1, 2011 at 12:38 PM, Dmitri O.Kondratiev wrote:
> On Fri, Jul 1, 2011 at 9:34 PM, Rogan Creswick wrote:
>> On Fri, Jul 1, 2011 at 3:31 AM, Dmitri O.Kondratiev wrote:
>> > First of all I need:
>
> Unfortunately 'cabal install' fails with toktok:
>
> tools/ExtractLexicon.hs:5:35:
>     Module `PGF' does not export `getLexicon'
> cabal: Error: some packages failed to install:
> toktok-0.5 failed during the building phase. The exception was:
> ExitFailure 1

Oh, right - I ran into this problem too and forgot about it (I should
have reported a bug...). I think this fails because of (relatively)
recent changes in GF, which isn't constrained to specific versions in
the toktok cabal file...

--Rogan

> Any ideas how to solve this?
Re: [Haskell-cafe] NLP libraries and tools?
On Fri, Jul 1, 2011 at 9:34 PM, Rogan Creswick wrote:
> On Fri, Jul 1, 2011 at 3:31 AM, Dmitri O.Kondratiev wrote:
> > First of all I need:
> ...
> > - tools to construct 'bag of words'
> > (http://en.wikipedia.org/wiki/Bag_of_words_model), which is a list of
> > the words in the article.
>
> This is trivially implemented if you have a natural language tokenizer
> you're happy with.
>
> Toktok might be worth looking at:
> http://hackage.haskell.org/package/toktok but I *think* it takes a
> pretty simple view of tokens (assuming it is the tokenizer I've been
> using with GF).

Unfortunately 'cabal install' fails with toktok:

Building toktok-0.5...
[1 of 7] Compiling Toktok.Stack      ( Toktok/Stack.hs, dist/build/Toktok/Stack.o )
[2 of 7] Compiling Toktok.Sandhi     ( Toktok/Sandhi.hs, dist/build/Toktok/Sandhi.o )
[3 of 7] Compiling Toktok.Trie       ( Toktok/Trie.hs, dist/build/Toktok/Trie.o )
[4 of 7] Compiling Toktok.Lattice    ( Toktok/Lattice.hs, dist/build/Toktok/Lattice.o )
[5 of 7] Compiling Toktok.Transducer ( Toktok/Transducer.hs, dist/build/Toktok/Transducer.o )
[6 of 7] Compiling Toktok.Lexer      ( Toktok/Lexer.hs, dist/build/Toktok/Lexer.o )
[7 of 7] Compiling Toktok            ( Toktok.hs, dist/build/Toktok.o )
Registering toktok-0.5...
[1 of 1] Compiling Main              ( Main.hs, dist/build/toktok/toktok-tmp/Main.o )
Linking dist/build/toktok/toktok ...
[1 of 1] Compiling Main              ( tools/ExtractLexicon.hs, dist/build/gf-extract-lexicon/gf-extract-lexicon-tmp/Main.o )

tools/ExtractLexicon.hs:5:35:
    Module `PGF' does not export `getLexicon'
cabal: Error: some packages failed to install:
toktok-0.5 failed during the building phase. The exception was:
ExitFailure 1

Any ideas how to solve this?
Re: [Haskell-cafe] NLP libraries and tools?
On Fri, Jul 1, 2011 at 3:31 AM, Dmitri O.Kondratiev wrote:
> Hi,
> Please advise on NLP libraries similar to Natural Language Toolkit

There is a (slowly?) growing NLP community for Haskell over at:
http://projects.haskell.org/nlp/

The nlp mailing list may be a better place to ask for details. To the
best of my knowledge, most of the NLTK / OpenNLP capabilities have yet
to be implemented/ported to Haskell, but there are some packages to take
a look at on Hackage.

> First of all I need:
> - tools to construct 'bag of words'
> (http://en.wikipedia.org/wiki/Bag_of_words_model), which is a list of
> the words in the article.

This is trivially implemented if you have a natural language tokenizer
you're happy with.

Toktok might be worth looking at:
http://hackage.haskell.org/package/toktok but I *think* it takes a
pretty simple view of tokens (assuming it is the tokenizer I've been
using with GF).

Eric Kow (?) has a tokenizer implementation, which I can't seem to find
at the moment - if I recall correctly, it is also very simple, but it
would be a great place to implement a more complex tokenizer :)

> - tools to prune common words, such as prepositions and conjunctions, as
> well as extremely rare words, such as the ones with typos.

I'm not sure what you mean by 'prune'. Are you looking for a stopword
list to remove irrelevant / confusing words from something like a search
query? (That's not hard to do with a stemmer and a set.)

> - stemming tools

There is an implementation of the Porter stemmer on Hackage:
 - http://hackage.haskell.org/package/porter

> - Naive Bayes classifier

I'm not aware of a general-purpose Bayesian classifier library for
Haskell, but it *would* be great to have :) There are probably some
general-purpose statistical packages that I'm unaware of that offer a
larger set of capabilities...

> - SVM classifier
> - k-means clustering

There are a few of these. Take a look at the AI category on Hackage:
 - http://hackage.haskell.org/packages/archive/pkg-list.html#cat:ai

--Rogan
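Rogan's point that a bag of words is trivial given a tokenizer, and that stopword pruning is just a set lookup, fits in a few lines with containers. A sketch under obvious simplifications (naive whitespace tokenization, a tiny illustrative stopword list, all names made up):

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set
import Data.Char (isAlpha, toLower)

-- A tiny illustrative stopword list; a real one would be far larger.
stopwords :: Set.Set String
stopwords = Set.fromList ["the", "a", "an", "of", "and", "or", "in", "to", "is"]

-- Lowercase, split on non-letters, drop stopwords, count occurrences.
bagOfWords :: String -> Map.Map String Int
bagOfWords =
      Map.fromListWith (+)
    . map (\w -> (w, 1))
    . filter (`Set.notMember` stopwords)
    . words
    . map (\c -> if isAlpha c then toLower c else ' ')

-- Pruning extremely rare words (e.g. typos) is then a Map filter:
pruneRare :: Int -> Map.Map String Int -> Map.Map String Int
pruneRare n = Map.filter (>= n)
```

Stemming before the count (e.g. with the porter package mentioned above) would merge inflected forms into one key, but the counting itself is independent of that choice.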
[Haskell-cafe] NLP libraries and tools?
Hi,
Please advise on NLP libraries similar to Natural Language Toolkit
(www.nltk.org).

First of all I need:
- tools to construct 'bag of words'
  (http://en.wikipedia.org/wiki/Bag_of_words_model), which is a list of
  the words in the article.
- tools to prune common words, such as prepositions and conjunctions, as
  well as extremely rare words, such as the ones with typos.
- stemming tools
- Naive Bayes classifier
- SVM classifier
- k-means clustering

Thanks!
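Since no ready-made Naive Bayes package turned up in this thread, here is roughly what a multinomial Naive Bayes classifier with add-one (Laplace) smoothing looks like in Haskell, using only containers. This is a sketch with made-up names, not a library recommendation:

```haskell
import qualified Data.Map.Strict as M
import Data.List (maximumBy)
import Data.Ord (comparing)

type Label = String

data Model = Model
  { labelCount :: M.Map Label Int            -- documents per label
  , wordCount  :: M.Map (Label, String) Int  -- word occurrences per label
  , tokenCount :: M.Map Label Int            -- total tokens per label
  , vocabSize  :: Int                        -- distinct words overall
  }

train :: [(Label, [String])] -> Model
train docs = Model lc wc tc v
  where
    lc = M.fromListWith (+) [ (l, 1)        | (l, _) <- docs ]
    wc = M.fromListWith (+) [ ((l, w), 1)   | (l, d) <- docs, w <- d ]
    tc = M.fromListWith (+) [ (l, length d) | (l, d) <- docs ]
    v  = M.size (M.fromListWith (+) [ (w, 1 :: Int) | (_, d) <- docs, w <- d ])

-- Log-probability of a document under one label, add-one smoothed.
score :: Model -> [String] -> Label -> Double
score m doc l = logPrior + sum (map logLik doc)
  where
    nDocs    = fromIntegral (sum (M.elems (labelCount m)))
    logPrior = log (fromIntegral (labelCount m M.! l) / nDocs)
    denom    = fromIntegral (M.findWithDefault 0 l (tokenCount m) + vocabSize m)
    logLik w = log ((fromIntegral (M.findWithDefault 0 (l, w) (wordCount m)) + 1)
                    / denom)

classify :: Model -> [String] -> Label
classify m doc = maximumBy (comparing (score m doc)) (M.keys (labelCount m))
```

Feeding it the bag-of-words style token lists discussed earlier in the thread is enough for a first experiment; log-space scoring avoids underflow on longer documents.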