Re: [julia-users] Natural language processing in Julia

2014-08-28 Thread Earl Brown
I'm a corpus linguist who uses R for just about everything I do, but was 
recently disappointed when an R script took four and a half hours to create 
three frequency lists of single words, bigrams, and trigrams, despite using 
a function which calls compiled C code in one bottleneck. Curious to see if 
Julia could do it faster, I set out to translate the script into Julia. So 
far, my Julia script correctly gets the frequency of words, normalized 
frequency (per million words), and logged frequency. 
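
The three frequency lists described above (single words, bigrams, trigrams) can be sketched with one small counting function. This is only an illustration using Base Julia on a recent 1.x release, not the script from this message: the name `ngram_counts` and the toy token vector are hypothetical, and a real run would read and tokenize the corpus files first.

```julia
# Count n-grams (n = 1 gives single words) in a vector of tokens.
function ngram_counts(tokens::Vector{<:AbstractString}, n::Int)
    counts = Dict{String,Int}()
    for i in 1:(length(tokens) - n + 1)
        gram = join(tokens[i:i+n-1], " ")
        counts[gram] = get(counts, gram, 0) + 1
    end
    return counts
end

# Hypothetical toy input standing in for the tokenized corpus.
tokens = String.(split(uppercase("hola hola que tal hola que"), r"\s+"))

unigrams = ngram_counts(tokens, 1)
bigrams  = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

# Frequency, frequency per million words, and logged frequency,
# sorted from most to least frequent.
total = sum(values(unigrams))
for (word, freq) in sort(collect(unigrams), by = last, rev = true)
    per_million = freq / total * 1_000_000
    println(word, "\t", freq, "\t", round(per_million, digits = 1), "\t",
            round(log(freq), digits = 3))
end
```

A plain `Dict` keeps the sketch dependency-free; `countmap` from StatsBase, used in the script below, does the same job for single words.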

What I'm having trouble with now is getting dispersion measurements, 
specifically:
Range: the number of files that a word occurs in at least once
Stefan Gries' Deviation of Proportions (DP) measurement: a measurement that 
takes into account the relative size of the files that the words come from

In R code, the DP measurement is calculated as follows:
corpus = c("hello", "hola", "howdy", "what's up", "hello", "hey", "hello",
           "hello", "hola", "hello") # all words from all files in a vector
corpusParts = c("file1", "file1", "file1", "file1", "file1", "file2",
                "file2", "file2", "file3", "file3") # indicates from which file each
                                                     # word comes; has same length as corpus
corpusPartSizes = c(5, 3, 2) # number of words in each file

word = "hello"
f = sum(corpus == word) # 5 (there are five hellos in the corpus)
if(f == 0) { return(NA); break() } # if word doesn't occur in corpus, move on to next word
l <- length(corpus) # 10 (total number of words in corpus)
s = corpusPartSizes / l # 0.5, 0.3, 0.2 (proportion that each file represents of the whole corpus)
v = rowsum(as.integer(corpus == word), corpusParts) # 2, 2, 1 ("hello" occurs twice in file1, twice in file2, and once in file3)
DP = sum(abs((v / f) - s)) / 2 # 0.1

hello has a range of 3 (it occurs in three files) and a DP of 0.1 in this 
corpus.

The line I can't figure out how to translate into Julia is:
v = rowsum(as.integer(corpus == word), corpusParts)

I assume there is a better way to approach this problem than simply 
translating my R script (nearly) line by line, but I don't know what that 
might be right now. But, for what it's worth, below is my Julia script as 
it now stands.

Thanks in advance for any help. Earl Brown

PS: Why does uppercase("más") return "MáS" (or "M\ue1S" in Julia Studio) 
rather than the expected "MÁS"?
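
For comparison, here is a minimal check of the same call. It assumes a current Julia 1.x release, where `uppercase` uses Unicode case mapping; it says nothing about why the 2014 version used in this message behaved differently. The escapes `\u00e1` and `\u00c1` spell out the precomposed "á" and "Á" so the comparison does not depend on how the source file is encoded.

```julia
word = "m\u00e1s"            # "más" with a precomposed á
upper = uppercase(word)

println(upper)
println(upper == "M\u00c1S") # compares against "MÁS" with a precomposed Á
```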

##
# Julia script to loop over files and create a frequency list

using StatsBase
using DataFrames

cd("/Users/earlbrown/Corpora/United_States/California/Salinas/Textos/Finished/")
filesNames = filter(r"\.txt$"i, readdir())
outputFile = "/Users/earlbrown/Desktop/output_julia.csv"

# puts all lines into an array
strings = String[]
for i in 1:length(filesNames)
    #i = 2
    curFile = open(filesNames[i])
    curText = uppercase(readall(curFile))
    close(curFile)
    push!(strings, curText)
end # next file

# breaks up the lines and puts them into an array
words = Array[]
for i in 1:length(strings)
    curWds = split(strings[i], r"[^a-záéíóúüñ']+"i)
    push!(words, curWds)
end # next element

# gets frequency of words
words = join(words, " ")
words = split(words, r"\n+")
words = countmap(words)

# gets normalized frequency
wds = collect(keys(words))
freqs = collect(values(words))
normFreq = (freqs / sum(freqs)) * 100

# sorts words by frequency
df = @DataFrame(WORD = wds, FREQ = freqs, NORM = normFreq)
df = df[sortperm(df["FREQ"], rev=true), :]

# saves headers to output file
f = open(outputFile, "w")
headings = "RANK\tWORD\tFREQ\tLOG\tNORM"
write(f, headings, "\n")
close(f)

# saves words and their frequencies to output file
sep = "\t"
f = open(outputFile, "a")
rank = 0
for i in 1:length(wds)
    if (df[i, 1] == "")
        continue
    end
    rank = rank + 1
    curWd = df[i, 1]
    curFreq = df[i, 2]
    curLog = log(curFreq)
    curNorm = df[i, 3]
    output = string(rank, sep, curWd, sep, curFreq, sep, curLog, sep, curNorm)
    write(f, output, "\n")
end
close(f)

println("All done!")
##
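
The per-file count that `rowsum(as.integer(corpus == word), corpusParts)` produces in the R code above can be obtained in Julia by accumulating into a dictionary keyed by file name. The following is a minimal sketch under some assumptions, not a tested replacement for the full script: it uses only Base Julia (1.2 or later), the function name `dispersion` and the variable name `corpus_parts` are made up for illustration, and file sizes are derived from `corpus_parts` instead of being passed in as `corpusPartSizes`.

```julia
# Toy data mirroring the R example: one entry per token, with a parallel
# vector naming the file each token came from.
corpus = ["hello", "hola", "howdy", "what's up", "hello",
          "hey", "hello", "hello", "hola", "hello"]
corpus_parts = ["file1", "file1", "file1", "file1", "file1",
                "file2", "file2", "file2", "file3", "file3"]

# Range and Deviation of Proportions (DP) for one word.
# Returns `nothing` when the word does not occur in the corpus.
function dispersion(word, corpus, corpus_parts)
    parts = sort(unique(corpus_parts))
    part_sizes = Dict(p => 0 for p in parts)   # tokens per file
    hits = Dict(p => 0 for p in parts)         # occurrences of `word` per file
    for (w, p) in zip(corpus, corpus_parts)
        part_sizes[p] += 1
        if w == word
            hits[p] += 1
        end
    end
    f = sum(values(hits))                      # total occurrences of `word`
    f == 0 && return nothing
    l = length(corpus)                         # total number of tokens
    v = [hits[p] for p in parts]               # per-file counts, as rowsum() gives in R
    s = [part_sizes[p] / l for p in parts]     # share of the corpus in each file
    dp = sum(abs.(v ./ f .- s)) / 2
    word_range = count(>(0), v)                # files with at least one occurrence
    return (range = word_range, dp = dp, v = v)
end

result = dispersion("hello", corpus, corpus_parts)
println(result)
```

Calling `dispersion` once per word rescans the whole corpus each time. For a full frequency list it would probably be faster to build the per-file counts for all words in a single pass and compute DP from those, but that is a design choice beyond this sketch.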


Re: [julia-users] Natural language processing in Julia

2014-06-10 Thread Matías Guzmán Naranjo
I must have missed that then.

On Tuesday, June 10, 2014 04:38:54 UTC+2, John Myles White wrote:

 There’s lots of tools for working with files without loading them in 
 TextAnalysis.jl.

  — John





Re: [julia-users] Natural language processing in Julia

2014-06-09 Thread Milan Bouchet-Valat
On Sunday, June 8, 2014 at 13:57 -0700, Matías Guzmán Naranjo wrote:
 I am working on a library for doing corpus linguistics with Julia (out
 of my frustration with python). It is mostly designed for my personal
 needs, written in the little free time I have, and it isn't ready for
 others to use easily. That being said, any ideas or contributions
 would be more than welcome. 
 
 https://github.com/mguzmann/CorpusTools/
Are you aware of this package?
https://github.com/johnmyleswhite/TextAnalysis.jl

Looks like there's room for collaboration.


Regards



Re: [julia-users] Natural language processing in Julia

2014-06-09 Thread John Myles White
It would be great if somebody with experience with NLP took over leadership for 
JuliaText. I just can’t keep up with TextAnalysis.jl anymore.

 — John




Re: [julia-users] Natural language processing in Julia

2014-06-09 Thread Matías Guzmán Naranjo
I don't have much experience in NLP; I am a corpus linguist, so 
CorpusTools is mainly aimed at corpus linguistics tasks 
(collocations, collostructions, concordances, etc.). I did see your 
TextAnalysis library but didn't like some aspects of it (not being able to 
work with files without loading them, which is a must for large corpora). 
But I guess both could be integrated with some work.



Re: [julia-users] Natural language processing in Julia

2014-06-09 Thread Matías Guzmán Naranjo
I'll eventually add some basic POS tagging and a parser. This may take a bit 
of time though. For now I'm thinking of just making a Python wrapper for 
FreeLing.



Re: [julia-users] Natural language processing in Julia

2014-06-09 Thread John Myles White
There’s lots of tools for working with files without loading them in 
TextAnalysis.jl.

 — John




Re: [julia-users] Natural language processing in Julia

2014-05-16 Thread Jonathan Malmaud
Hi Sorami,
Yes, JuliaText is meant to be the repository of Julia NLP packages and I 
agree with you about Julia's potential in the NLP domain. There hasn't been 
a lot of action there since I think there aren't many people using Julia 
for NLP yet (although I hope that changes). Any contributions you wanted to 
make would be most appreciated. 





Re: [julia-users] Natural language processing in Julia

2014-05-08 Thread theremins
Hi all,


I am interested in writing Natural Language Processing (NLP) tools in Julia.

My name is Sorami, I am a data scientist at BrainPad Inc. in Tokyo, Japan. 
I used to be a graduate student doing NLP.

I am very interested in Julia, and I see its great potential as a powerful 
NLP / text mining tool. 

-

I have read the "Natural language processing in Julia" posts in the 
julia-users group ( 
https://groups.google.com/forum/#!searchin/julia-users/nlp/julia-users/SxB16X6lM1c/IWidFfJaDhUJ
).
Are there any updates on Julia+NLP since then? Is JuliaText ( 
https://github.com/JuliaText ) the center for NLP stuff? How about 
TextAnalysis.jl?

-

FYI: I have uploaded the source code for a naïve Dependency Parser in 
Julia, which I wrote when I was playing around with Julia. 
https://github.com/sorami/DependencyParser.jl

(A dependency parser is a kind of syntactic analysis tool for natural 
languages like English or Japanese.)


Cheers,
--
Sorami Hisamoto
http://89.io







Re: [julia-users] Natural language processing in Julia

2014-01-27 Thread Jonathan Malmaud
I was thinking of starting up a Julia NLP meta-project on github if there's 
enough interest. It could host projects like textanalysis.jl, a Julia 
interface to NLTK, a Julia interface to some of Stanford's NLP tools, and 
whatever more native solutions people put together.

On Friday, October 25, 2013 9:32:10 AM UTC-4, Dahua Lin wrote:

 I wish there were something comparable to NLTK in Julia. In a recent project 
 that involves text parsing, I had to implement the text handling module in 
 Python, simply for the purpose of using NLTK and Jinja2. 

 If we can get the attention of the NLP community, I believe some NLP 
 people will build such things very soon.

 - Dahua


 On Tuesday, October 22, 2013 7:35:57 PM UTC-5, John Myles White wrote:

 There's a package called TextAnalysis.jl that has stemming and very basic 
 tokenization. Patches to do POS tagging would be very welcome. 

  -- John 

 On Oct 22, 2013, at 5:29 PM, Jonathan Malmaud mal...@gmail.com wrote: 

  Is anyone working on or know of a package to do NLP tasks with Julia, 
 like part-of-speech tagging and stemming? PyCall works fine with Python's 
 NLTK, so that would be my default choice if there isn't anything more 
 native at the moment. 



Re: [julia-users] Natural language processing in Julia

2014-01-27 Thread John Myles White
JuliaText would be great.

TextAnalysis.jl really needs a lot of love to move forward. For now, I’d 
strongly push people towards NLTK.

 — John

 



Re: [julia-users] Natural language processing in Julia

2014-01-19 Thread Steven G. Johnson
How do you convert a Synset to a string in Python?  Presumably Synset has 
some Python method for this?

On Saturday, January 18, 2014 7:32:58 AM UTC-5, Jon Norberg wrote:

 Great. And how does one get the text from a PyObject into a Julia string?

 Thanks very much



Re: [julia-users] Natural language processing in Julia

2014-01-19 Thread Jonathan Malmaud
They have a 'name' property, although I'm not sure that's what you're after. If 
your Julia array of synsets is called 'synsets', you could do

[synset[:name] for synset in synsets]




Re: [julia-users] Natural language processing in Julia

2014-01-19 Thread Jon Norberg
That's great, now I am starting to be able to do what I want. Any idea how 
one can list all properties and methods of a PyObject?


Re: [julia-users] Natural language processing in Julia

2014-01-18 Thread Jon Norberg
Great. And how does one get the text from a PyObject into a Julia string?

Thanks very much


Re: [julia-users] Natural language processing in Julia

2014-01-16 Thread Jon Norberg
Hi Jonathan

Would you by any chance have some example code to share showing how you work 
with NLTK using PyCall? I wish there were more Julia example scripts available 
for browsing and learning.

Thanks,