Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-16 Thread Michiaki ARIGA
Thanks for Pontus's kind explanation. He answered what I want to know.
I want to know the standard way to create dictionary (which is a set of
words for ASR or NLP).

To create dictionary for speech recognition or something NLP, we often
control size of vocabulary. There are two ways to limit size of vocabulary,
one is to cut under threshold frequency that Pontus showed, and the other
is to pick up top N frequent words (ngram tool kit such as IRSTLM supports
this situation and it is popular way to control necessary memory size). If
I want to pick frequent words, I think I'll use DataFrame.

On Tue Dec 16 2014 at 15:31:00 Todd Leo sliznmail...@gmail.com wrote:

 Could you provide any clue to guide me locate the issue? I'm willing to
 make a PR but I am unable to find the related issue.


 On Tuesday, December 16, 2014 3:38:11 AM UTC+8, Stefan Karpinski wrote:

 There is not, but if I recall, there may be an open issue about this
 functionality.


 On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo sliznm...@gmail.com wrote:

 Is there a partial sort equivalent to sortperm! ? Supposingly
 selectperm! ?

 On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote:

 We have a select function as part of Base, which can do O(n) selection
 of the top n:

 julia v = randn(10^7);

 julia let w = copy(v); @time sort!(w)[1:1000]; end;
 elapsed time: 0.882989281 seconds (8168 bytes allocated)

 julia let w = copy(v); @time select!(w,1:1000); end;
 elapsed time: 0.054981192 seconds (8192 bytes allocated)


 So for large arrays, this is substantially faster.

 On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller trut...@gmail.com wrote:

 This can be done in O(N).  Avoid sorting as it will be O(NlogN)

 Here's one of many Q on how http://stackoverflow.com/q
 uestions/7272534/finding-the-first-n-largest-elements-in-an-array






Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-16 Thread John Myles White
If you want to retain N words, perhaps a priority queue would be useful?

http://julia.readthedocs.org/en/latest/stdlib/collections/#priorityqueue

I'd be cautious about drawing many coding lessons from the TextAnalysis 
package, which has been never been optimized for performance.

 -- John

On Dec 16, 2014, at 3:30 AM, Michiaki ARIGA che...@gmail.com wrote:

 Thanks for Pontus's kind explanation. He answered what I want to know.
 I want to know the standard way to create dictionary (which is a set of words 
 for ASR or NLP).
 
 To create dictionary for speech recognition or something NLP, we often 
 control size of vocabulary. There are two ways to limit size of vocabulary, 
 one is to cut under threshold frequency that Pontus showed, and the other is 
 to pick up top N frequent words (ngram tool kit such as IRSTLM supports this 
 situation and it is popular way to control necessary memory size). If I want 
 to pick frequent words, I think I'll use DataFrame.
 
 On Tue Dec 16 2014 at 15:31:00 Todd Leo sliznmail...@gmail.com wrote:
 Could you provide any clue to guide me locate the issue? I'm willing to make 
 a PR but I am unable to find the related issue.
 
 
 On Tuesday, December 16, 2014 3:38:11 AM UTC+8, Stefan Karpinski wrote:
 There is not, but if I recall, there may be an open issue about this 
 functionality.
 
 On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo sliznm...@gmail.com wrote:
 Is there a partial sort equivalent to sortperm! ? Supposingly selectperm! ?
 
 On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote:
 We have a select function as part of Base, which can do O(n) selection of the 
 top n:
 
 julia v = randn(10^7);
 
 julia let w = copy(v); @time sort!(w)[1:1000]; end;
 elapsed time: 0.882989281 seconds (8168 bytes allocated)
 
 julia let w = copy(v); @time select!(w,1:1000); end;
 elapsed time: 0.054981192 seconds (8192 bytes allocated)
 
 So for large arrays, this is substantially faster.
 
 On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller trut...@gmail.com wrote:
 This can be done in O(N).  Avoid sorting as it will be O(NlogN)
 
 Here's one of many Q on how 
 http://stackoverflow.com/questions/7272534/finding-the-first-n-largest-elements-in-an-array
 
 



Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-16 Thread Michiaki ARIGA
Thanks for your answer, John.
I'll use PriorityQueue to get top N words, and if I want to cut off using
threshold I'll create a dictionary with Dict.

---
Michiaki

On Tue Dec 16 2014 at 22:53:34 John Myles White johnmyleswh...@gmail.com
wrote:

 If you want to retain N words, perhaps a priority queue would be useful?

 http://julia.readthedocs.org/en/latest/stdlib/collections/#priorityqueue

 I'd be cautious about drawing many coding lessons from the TextAnalysis
 package, which has been never been optimized for performance.

  -- John

 On Dec 16, 2014, at 3:30 AM, Michiaki ARIGA che...@gmail.com wrote:

 Thanks for Pontus's kind explanation. He answered what I want to know.
 I want to know the standard way to create dictionary (which is a set of
 words for ASR or NLP).

 To create dictionary for speech recognition or something NLP, we often
 control size of vocabulary. There are two ways to limit size of vocabulary,
 one is to cut under threshold frequency that Pontus showed, and the other
 is to pick up top N frequent words (ngram tool kit such as IRSTLM supports
 this situation and it is popular way to control necessary memory size). If
 I want to pick frequent words, I think I'll use DataFrame.

 On Tue Dec 16 2014 at 15:31:00 Todd Leo sliznmail...@gmail.com wrote:

 Could you provide any clue to guide me locate the issue? I'm willing to
 make a PR but I am unable to find the related issue.


 On Tuesday, December 16, 2014 3:38:11 AM UTC+8, Stefan Karpinski wrote:

 There is not, but if I recall, there may be an open issue about this
 functionality.


 On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo sliznm...@gmail.com wrote:

 Is there a partial sort equivalent to sortperm! ? Supposingly
 selectperm! ?

 On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote:

 We have a select function as part of Base, which can do O(n) selection
 of the top n:

 julia v = randn(10^7);

 julia let w = copy(v); @time sort!(w)[1:1000]; end;
 elapsed time: 0.882989281 seconds (8168 bytes allocated)

 julia let w = copy(v); @time select!(w,1:1000); end;
 elapsed time: 0.054981192 seconds (8192 bytes allocated)


 So for large arrays, this is substantially faster.

 On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller trut...@gmail.com wrote:

 This can be done in O(N).  Avoid sorting as it will be O(NlogN)

 Here's one of many Q on how http://stackoverflow.com/q
 uestions/7272534/finding-the-first-n-largest-elements-in-an-array







Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-15 Thread Stefan Karpinski
There is not, but if I recall, there may be an open issue about this
functionality.

On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo sliznmail...@gmail.com wrote:

 Is there a partial sort equivalent to sortperm! ? Supposingly selectperm! ?

 On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote:

 We have a select function as part of Base, which can do O(n) selection of
 the top n:

 julia v = randn(10^7);

 julia let w = copy(v); @time sort!(w)[1:1000]; end;
 elapsed time: 0.882989281 seconds (8168 bytes allocated)

 julia let w = copy(v); @time select!(w,1:1000); end;
 elapsed time: 0.054981192 seconds (8192 bytes allocated)


 So for large arrays, this is substantially faster.

 On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller trut...@gmail.com wrote:

 This can be done in O(N).  Avoid sorting as it will be O(NlogN)

 Here's one of many Q on how http://stackoverflow.com/
 questions/7272534/finding-the-first-n-largest-elements-in-an-array





Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-15 Thread Stefan Karpinski
On Sun, Dec 14, 2014 at 9:13 PM, Michiaki ARIGA che...@gmail.com wrote:

 Of course Julia can work fast with Array, I know.
 But in natural language processing or text analyzing, we often count word
 frequency and create dictionary. We usually store word frequency in kind-a
 Dict and we always cut off non-frequent words (its frequency are under
 threshold) to exclude noisy words. So I want remove keys which values
 follow some condition.

 Finally, I found John Myles White's implementation creating n-gram. So, I
 will refer this.


 https://github.com/johnmyleswhite/TextAnalysis.jl/blob/master/src/ngramizer.jl

I'm not sure how this technique applies any less to strings than to
numbers. Can you clarify what your concern is?


Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-15 Thread Pontus Stenetorp
On 16 December 2014 at 04:39, Stefan Karpinski ste...@karpinski.org wrote:

 I'm not sure how this technique applies any less to strings than to numbers.
 Can you clarify what your concern is?

I am suspecting that what he wants to do is to limit the size of a
vocabulary (set of tokens) by imposing an occurrence threshold.  This
is fairly standard and what you really want would be a sorted
dictionary.  However, since this is usually just a pre-processing step
I don't even bother optimising and just do something like:

UNKFORM = UNK
unkcutoff = 1
tokoccs = Dict{UTF8String,Int}()
for tok in data
tokoccs[tok] = get(tokoccs, tok, 0) + 1
end
vocab = Set{UTF8String}()
unks = Set{UTF8String}()
push!(vocab, UNKFORM)
for (tok, occs) in tokoccs
if occs  unkcutoff
push!(vocab, tok)
else
push!(unks, tok)
end
end

This is O(n) + O(n) which should be fine unless you want to focus on
optimising this specific region of code.

Pontus


Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-15 Thread Todd Leo
Could you provide any clue to guide me locate the issue? I'm willing to 
make a PR but I am unable to find the related issue.

On Tuesday, December 16, 2014 3:38:11 AM UTC+8, Stefan Karpinski wrote:

 There is not, but if I recall, there may be an open issue about this 
 functionality.

 On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo sliznm...@gmail.com 
 javascript: wrote:

 Is there a partial sort equivalent to sortperm! ? Supposingly selectperm! 
 ?

 On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote:

 We have a select function as part of Base, which can do O(n) selection 
 of the top n:

 julia v = randn(10^7);

 julia let w = copy(v); @time sort!(w)[1:1000]; end;
 elapsed time: 0.882989281 seconds (8168 bytes allocated)

 julia let w = copy(v); @time select!(w,1:1000); end;
 elapsed time: 0.054981192 seconds (8192 bytes allocated)


 So for large arrays, this is substantially faster.

 On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller trut...@gmail.com wrote:

 This can be done in O(N).  Avoid sorting as it will be O(NlogN)

 Here's one of many Q on how http://stackoverflow.com/
 questions/7272534/finding-the-first-n-largest-elements-in-an-array





Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-14 Thread Michiaki ARIGA
Of course Julia can work fast with Array, I know.
But in natural language processing or text analyzing, we often count word
frequency and create dictionary. We usually store word frequency in kind-a
Dict and we always cut off non-frequent words (its frequency are under
threshold) to exclude noisy words. So I want remove keys which values
follow some condition.

Finally, I found John Myles White's implementation creating n-gram. So, I
will refer this.

https://github.com/johnmyleswhite/TextAnalysis.jl/blob/master/src/ngramizer.jl


Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-14 Thread Todd Leo
Is there a partial sort equivalent to sortperm! ? Supposingly selectperm! ?

On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote:

 We have a select function as part of Base, which can do O(n) selection of 
 the top n:

 julia v = randn(10^7);

 julia let w = copy(v); @time sort!(w)[1:1000]; end;
 elapsed time: 0.882989281 seconds (8168 bytes allocated)

 julia let w = copy(v); @time select!(w,1:1000); end;
 elapsed time: 0.054981192 seconds (8192 bytes allocated)


 So for large arrays, this is substantially faster.

 On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller trut...@gmail.com 
 javascript: wrote:

 This can be done in O(N).  Avoid sorting as it will be O(NlogN)

 Here's one of many Q on how 
 http://stackoverflow.com/questions/7272534/finding-the-first-n-largest-elements-in-an-array




Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-08 Thread Jeff Waller
This can be done in O(N).  Avoid sorting as it will be O(NlogN)

Here's one of many Q on how 
http://stackoverflow.com/questions/7272534/finding-the-first-n-largest-elements-in-an-array


Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-08 Thread Stefan Karpinski
We have a select function as part of Base, which can do O(n) selection of
the top n:

julia v = randn(10^7);

julia let w = copy(v); @time sort!(w)[1:1000]; end;
elapsed time: 0.882989281 seconds (8168 bytes allocated)

julia let w = copy(v); @time select!(w,1:1000); end;
elapsed time: 0.054981192 seconds (8192 bytes allocated)


So for large arrays, this is substantially faster.

On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller truth...@gmail.com wrote:

 This can be done in O(N).  Avoid sorting as it will be O(NlogN)

 Here's one of many Q on how
 http://stackoverflow.com/questions/7272534/finding-the-first-n-largest-elements-in-an-array



Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-07 Thread Michiaki ARIGA
I'm sorry that my example is not good to explain what I want to do.

I tried to count up words and get top N frequent words and I referred to
following example.

https://github.com/JuliaLang/julia/blob/master/examples/wordcount.jl

I don't think it should return Dict for top N words, so I think DataFrame
is good to get top N words using head(). But I wonder if DataFrame isn't
suitable because DataFrame converted from Dict is not sortable style using
its values of Dict.

On Sat Dec 06 2014 at 11:24:02 Steven G. Johnson stevenj@gmail.com
wrote:



 On Friday, December 5, 2014 9:57:28 AM UTC-5, Michiaki Ariga wrote:

 I found there are no method such as sort_by() after v0.3.
 But I want to count word frequency with Dict() and sort by its value to
 find frequent word.
 So, how can I sort Dict efficiently?


  You may want to use a different data structure.  For example, you can
 store word frequencies in a PriorityQueue and then pull out the most
 frequent word with peek or dequeue.  See:

 http://julia.readthedocs.org/en/latest/stdlib/collections/

 (A PriorityQueue lets you quickly fetch the smallest value, whereas you
 want the largest frequency, but you can work around this by just storing
 frequency * -1.)

 If you need all of the values in order, you can instead use an OrderedDict
 from https://github.com/JuliaLang/DataStructures.jl



[julia-users] Re: How can I sort Dict efficiently?

2014-12-05 Thread Steven G. Johnson


On Friday, December 5, 2014 9:57:28 AM UTC-5, Michiaki Ariga wrote:

 I found there are no method such as sort_by() after v0.3.
 But I want to count word frequency with Dict() and sort by its value to 
 find frequent word.
 So, how can I sort Dict efficiently?


 You may want to use a different data structure.  For example, you can 
store word frequencies in a PriorityQueue and then pull out the most 
frequent word with peek or dequeue.  See:

http://julia.readthedocs.org/en/latest/stdlib/collections/

(A PriorityQueue lets you quickly fetch the smallest value, whereas you 
want the largest frequency, but you can work around this by just storing 
frequency * -1.)

If you need all of the values in order, you can instead use an OrderedDict 
from https://github.com/JuliaLang/DataStructures.jl