Re: [julia-users] Natural language processing in Julia
I'm a corpus linguist who uses R for just about everything I do, but I was recently disappointed when an R script took four and a half hours to create three frequency lists (single words, bigrams, and trigrams), despite using a function that calls compiled C code at one bottleneck. Curious to see whether Julia could do it faster, I set out to translate the script into Julia. So far, my Julia script correctly gets the frequency of words, the normalized frequency (per million words), and the logged frequency. What I'm having trouble with now is getting dispersion measurements, specifically:

- Range: the number of files that a word occurs in at least once
- Stefan Gries' Deviation of Proportions (DP) measurement: a measurement that takes into account the relative size of the files that the words come from

In R code, the DP measurement is calculated as follows:

    corpus = c("hello", "hola", "howdy", "what's up", "hello", "hey", "hello", "hello", "hola", "hello") # all words from all files in a vector
    corpusParts = c("file1", "file1", "file1", "file1", "file1", "file2", "file2", "file2", "file3", "file3") # indicates from which file each word comes; has same length as corpus
    corpusPartSizes = c(5, 3, 2) # number of words in each file
    word = "hello"
    f = sum(corpus == word) # 5 (there are five hellos in the corpus)
    if(f == 0) { return(NA); break() } # if word doesn't occur in corpus, move on to next word
    l = length(corpus) # 10 (total number of words in corpus)
    s = corpusPartSizes / l # 0.5, 0.3, 0.2 (proportion that each file represents of the whole corpus)
    v = rowsum(as.integer(corpus == word), corpusParts) # 2, 2, 1 (hello occurs twice in file1, twice in file2, and once in file3)
    DP = sum(abs((v / f) - s)) / 2 # 0.1

"hello" has a range of 3 (it occurs in three files) and a DP of 0.1 in this corpus.
The line I can't figure out how to translate into Julia is:

    v = rowsum(as.integer(corpus == word), corpusParts)

I assume there is a better way to approach this problem than simply translating my R script (nearly) line by line, but I don't know what that might be right now. But, for what it's worth, below is my Julia script as it now stands. Thanks in advance for any help.

Earl Brown

PS: Why does uppercase("más") return "MáS" (or "M\ue1S" in Julia Studio) rather than the expected "MÁS"?

    ##
    # Julia script to loop over files and create a frequency list
    using StatsBase
    using DataFrames

    cd("/Users/earlbrown/Corpora/United_States/California/Salinas/Textos/Finished/")
    filesNames = filter(r"\.txt$"i, readdir())
    outputFile = "/Users/earlbrown/Desktop/output_julia.csv"

    # puts all lines into an array
    strings = String[]
    for i in 1:length(filesNames)
        #i = 2
        curFile = open(filesNames[i])
        curText = uppercase(readall(curFile))
        close(curFile)
        push!(strings, curText)
    end # next file

    # breaks up the lines and puts them into an array
    words = Array[]
    for i in 1:length(strings)
        curWds = split(strings[i], r"[^a-záéíóúüñ']+"i)
        push!(words, curWds)
    end # next element

    # gets frequency of words
    words = join(words, "")
    words = split(words, r"\n+")
    words = countmap(words)

    # gets normalized frequency
    wds = collect(keys(words))
    freqs = collect(values(words))
    normFreq = (freqs / sum(freqs)) * 100

    # sorts words by frequency
    df = @DataFrame(WORD = wds, FREQ = freqs, NORM = normFreq)
    df = df[sortperm(df["FREQ"], rev=true), :]

    # saves headers to output file
    f = open(outputFile, "w")
    headings = "RANK\tWORD\tFREQ\tLOG\tNORM"
    write(f, headings, "\n")
    close(f)

    # saves words and their frequencies to output file
    sep = "\t"
    f = open(outputFile, "a")
    rank = 0
    for i in 1:length(wds)
        if (df[i, 1] == "")
            continue
        end
        rank = rank + 1
        curWd = df[i, 1]
        curFreq = df[i, 2]
        curLog = log(curFreq)
        curNorm = df[i, 3]
        output = string(rank, sep, curWd, sep, curFreq, sep, curLog, sep, curNorm)
        write(f, output, "\n")
    end
    close(f)
    println("All done!")
    ##
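One possible way to translate the `rowsum(as.integer(corpus == word), corpusParts)` step is to tally, in a single pass, how many tokens each file contributes and how many of them are the target word. The sketch below is only an illustration under stated assumptions: it uses current Julia (1.x) syntax rather than the 2014 syntax of the script above, the helper name `dispersion` and the variable `corpus_parts` are hypothetical, and the toy data are the ones from the R example.

```julia
# Toy data copied from the R example in the question.
corpus = ["hello", "hola", "howdy", "what's up", "hello",
          "hey", "hello", "hello", "hola", "hello"]
corpus_parts = ["file1", "file1", "file1", "file1", "file1",
                "file2", "file2", "file2", "file3", "file3"]

# Hypothetical helper (not from any package): returns the range and DP of
# `word`, or `nothing` if the word does not occur in the corpus.
function dispersion(word, corpus, corpus_parts)
    parts = sort(unique(corpus_parts))            # rowsum() also sorts its groups
    index = Dict(p => i for (i, p) in enumerate(parts))
    sizes = zeros(Int, length(parts))             # tokens per file
    v = zeros(Int, length(parts))                 # occurrences of `word` per file
    for (w, p) in zip(corpus, corpus_parts)
        i = index[p]
        sizes[i] += 1
        if w == word
            v[i] += 1
        end
    end
    f = sum(v)                                    # overall frequency of `word`
    f == 0 && return nothing
    s = sizes ./ length(corpus)                   # share of the corpus held by each file
    dp = sum(abs.(v ./ f .- s)) / 2               # Deviation of Proportions
    return (range = count(>(0), v), dp = dp)
end

result = dispersion("hello", corpus, corpus_parts)
println(result)
```

For "hello" this gives a range of 3 and a DP of approximately 0.1 (up to floating-point rounding), matching the R walk-through. Calling the helper once per word rescans the whole corpus each time, so for a full frequency list a single pass that builds a word-by-file count table would likely be the better design.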
Re: [julia-users] Natural language processing in Julia
I must have missed that then.

On Tuesday, June 10, 2014 04:38:54 UTC+2, John Myles White wrote: There's lots of tools for working with files without loading them in TextAnalysis.jl. -- John
Re: [julia-users] Natural language processing in Julia
On Sunday, June 8, 2014 at 13:57 -0700, Matías Guzmán Naranjo wrote: I am working on a library for doing corpus linguistics with Julia (out of my frustration with Python). It is mostly designed for my personal needs, written in the little free time I have, and it isn't ready for others to use easily. That being said, any ideas or contributions would be more than welcome. https://github.com/mguzmann/CorpusTools/

Are you aware of this package? https://github.com/johnmyleswhite/TextAnalysis.jl Looks like there's room for collaboration.

Regards
Re: [julia-users] Natural language processing in Julia
It would be great if somebody with experience with NLP took over leadership for JuliaText. I just can't keep up with TextAnalysis.jl anymore. -- John
Re: [julia-users] Natural language processing in Julia
I don't have much experience in NLP; I am a corpus linguist, so CorpusTools is mainly aimed at corpus linguistics tasks (collocations, collostructions, concordances, etc.). I did see your TextAnalysis library but didn't like some of its aspects (in particular, it did not seem possible to work with files without loading them into memory, which is a must for large corpora). But I guess both could be integrated with some work.
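The requirement mentioned here, processing a file without loading it whole, can be sketched with Base Julia alone. This is a minimal illustration, not the API of CorpusTools or TextAnalysis.jl (neither is shown in this thread): the helper name `count_words!` is hypothetical, the token pattern is borrowed from the frequency-list script earlier in the thread, and a temporary file stands in for a real corpus file.

```julia
# Hypothetical helper: add the token counts of one file to `counts`,
# reading one line at a time so the whole file is never held in memory.
function count_words!(counts::Dict{String,Int}, path::AbstractString)
    for line in eachline(path)
        # Tokens are runs of letters/apostrophes; everything else separates them.
        for m in eachmatch(r"[a-záéíóúüñ']+"i, line)
            w = uppercase(m.match)
            counts[w] = get(counts, w, 0) + 1
        end
    end
    return counts
end

# Small demonstration on a temporary file (a stand-in for a corpus file).
path, io = mktemp()
write(io, "Hola, más hola.\n¿Qué tal? hola\n")
close(io)

counts = count_words!(Dict{String,Int}(), path)
for (w, n) in sort(collect(counts); by = last, rev = true)
    println(w, "\t", n)
end
rm(path)
```

Because the same dictionary can be passed to successive calls, a directory of files can be folded into one frequency table while only one line is in memory at any time.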
Re: [julia-users] Natural language processing in Julia
I'll eventually add some basic POS tagging and a parser. This may take a bit of time, though. For now I'm thinking of just making a Python wrapper for FreeLing.
Re: [julia-users] Natural language processing in Julia
There are lots of tools in TextAnalysis.jl for working with files without loading them. -- John
Re: [julia-users] Natural language processing in Julia
Hi Sorami,

Yes, JuliaText is meant to be the repository of Julia NLP packages, and I agree with you about Julia's potential in the NLP domain. There hasn't been a lot of action there, I think because there aren't many people using Julia for NLP yet (although I hope that changes). Any contributions you want to make would be most appreciated.
Re: [julia-users] Natural language processing in Julia
Hi all,

I am interested in writing Natural Language Processing (NLP) tools in Julia. My name is Sorami; I am a data scientist at BrainPad Inc. in Tokyo, Japan. I used to be a graduate student doing NLP. I am very interested in Julia, and I see its great potential as a powerful NLP / text mining tool.

- I have read the "Natural language processing in Julia" posts in the julia-users group ( https://groups.google.com/forum/#!searchin/julia-users/nlp/julia-users/SxB16X6lM1c/IWidFfJaDhUJ ). Are there any updates on Julia+NLP since then? Is JuliaText ( https://github.com/JuliaText ) the center for NLP stuff? How about TextAnalysis.jl?
- FYI: I have uploaded the source code for a naïve dependency parser in Julia, which I wrote when I was playing around with Julia: https://github.com/sorami/DependencyParser.jl (A dependency parser is a kind of syntactic analysis tool for natural languages like English or Japanese.)

Cheers,
-- Sorami Hisamoto http://89.io
Re: [julia-users] Natural language processing in Julia
I was thinking of starting up a Julia NLP meta-project on GitHub if there's enough interest. It could host projects like TextAnalysis.jl, a Julia interface to NLTK, a Julia interface to some of Stanford's NLP tools, and whatever more native solutions people put together.

On Friday, October 25, 2013 9:32:10 AM UTC-4, Dahua Lin wrote: I wish there were something comparable to NLTK in Julia. In a recent project that involves text parsing, I had to implement the text handling module in Python, simply for the purpose of using NLTK and Jinja2. If we can get the attention of the NLP community, I believe some NLP people will build such things very soon. - Dahua

On Tuesday, October 22, 2013 7:35:57 PM UTC-5, John Myles White wrote: There's a package called TextAnalysis.jl that has stemming and very basic tokenization. Patches to do POS tagging would be very welcome. -- John

On Oct 22, 2013, at 5:29 PM, Jonathan Malmaud mal...@gmail.com wrote: Is anyone working on, or does anyone know of, a package to do NLP tasks with Julia, like part-of-speech tagging and stemming? PyCall works fine with Python's NLTK, so that would be my default choice if there isn't anything more native at the moment.
Re: [julia-users] Natural language processing in Julia
JuliaText would be great. TextAnalysis.jl really needs a lot of love to move forward. For now, I'd strongly push people towards NLTK. -- John
Re: [julia-users] Natural language processing in Julia
How do you convert a Synset to a string in Python? Presumably Synset has some Python method for this?
Re: [julia-users] Natural language processing in Julia
They have a 'name' property, although I'm not sure that's what you're after. If your Julia array of synsets is called 'synsets', you could do:

    [synset[:name] for synset in synsets]
Re: [julia-users] Natural language processing in Julia
That's great; now I am starting to be able to do what I want. Any idea how one can list all the properties and methods of a PyObject?
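One way to approach this question is to call Python's own `dir()` on the object from Julia. The sketch below assumes PyCall.jl is installed and linked to a working Python; it uses PyCall's `pyimport` and `pybuiltin` functions, and it inspects the standard-library `math` module purely as a stand-in for an NLTK object such as a synset.

```julia
using PyCall  # assumption: PyCall.jl is installed and can find a Python interpreter

# Any PyObject works here; a standard-library module keeps the sketch free of NLTK.
math = pyimport("math")

# Python's dir() returns the names of the object's attributes and methods.
attrs = pybuiltin("dir")(math)

# Hide names starting with an underscore, which are private by convention.
public_attrs = filter(a -> !startswith(a, "_"), attrs)
println(public_attrs)
```

The same call should work on a synset or any other wrapped Python object, since `dir()` is a Python builtin and does not depend on the object's type.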
Re: [julia-users] Natural language processing in Julia
Great. And how does one get the text from a PyObject into a Julia string? Thanks very much.
Re: [julia-users] Natural language processing in Julia
Hi Jonathan, would you by any chance have some example code to share showing how you work with NLTK using PyCall? I wish there were more Julia example scripts available for browsing and learning. Thanks,
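As a rough illustration of the kind of script being asked for (not the code anyone in this thread actually used), the sketch below calls NLTK's Porter stemmer from Julia. It assumes PyCall.jl is installed and that the linked Python environment has the `nltk` package; the stemmer needs no extra NLTK data downloads. The dot syntax for Python attributes is the one used by recent PyCall versions, whereas the thread's `synset[:name]` form is the older indexing syntax. The word list is made up for the example.

```julia
using PyCall  # assumption: PyCall.jl plus a Python environment with nltk installed

# Import the nltk.stem submodule and build a stemmer object.
stem_mod = pyimport("nltk.stem")
stemmer = stem_mod.PorterStemmer()

# Hypothetical input; PyCall converts the returned Python strings to Julia Strings.
words = ["running", "parsers", "tagging"]
stems = [stemmer.stem(w) for w in words]

for (w, s) in zip(words, stems)
    println(w, " => ", s)
end
```

Other NLTK components (tokenizers, taggers, WordNet) can be reached the same way, although several of them require corpus data to be downloaded through NLTK first.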