Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-16 Thread Michiaki ARIGA
Thanks for your answer, John.
I'll use PriorityQueue to get top N words, and if I want to cut off using
threshold I'll create a dictionary with Dict.

---
Michiaki

On Tue Dec 16 2014 at 22:53:34 John Myles White 
wrote:

> If you want to retain N words, perhaps a priority queue would be useful?
>
> http://julia.readthedocs.org/en/latest/stdlib/collections/#priorityqueue
>
> I'd be cautious about drawing many coding lessons from the TextAnalysis
> package, which has been never been optimized for performance.
>
>  -- John
>
> On Dec 16, 2014, at 3:30 AM, Michiaki ARIGA  wrote:
>
> Thanks for Pontus's kind explanation. He answered what I want to know.
> I want to know the standard way to create dictionary (which is a set of
> words for ASR or NLP).
>
> To create dictionary for speech recognition or something NLP, we often
> control size of vocabulary. There are two ways to limit size of vocabulary,
> one is to cut under threshold frequency that Pontus showed, and the other
> is to pick up top N frequent words (ngram tool kit such as IRSTLM supports
> this situation and it is popular way to control necessary memory size). If
> I want to pick frequent words, I think I'll use DataFrame.
>
> On Tue Dec 16 2014 at 15:31:00 Todd Leo  wrote:
>
>> Could you provide any clue to guide me locate the issue? I'm willing to
>> make a PR but I am unable to find the related issue.
>>
>>
>> On Tuesday, December 16, 2014 3:38:11 AM UTC+8, Stefan Karpinski wrote:
>>
>>> There is not, but if I recall, there may be an open issue about this
>>> functionality.
>>>
>>
>>> On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo  wrote:
>>>
 Is there a partial sort equivalent to sortperm! ? Supposingly
 selectperm! ?

 On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote:
>
> We have a select function as part of Base, which can do O(n) selection
> of the top n:
>
> julia> v = randn(10^7);
>
> julia> let w = copy(v); @time sort!(w)[1:1000]; end;
> elapsed time: 0.882989281 seconds (8168 bytes allocated)
>
> julia> let w = copy(v); @time select!(w,1:1000); end;
> elapsed time: 0.054981192 seconds (8192 bytes allocated)
>
>
> So for large arrays, this is substantially faster.
>
> On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller  wrote:
>
>> This can be done in O(N).  Avoid sorting as it will be O(NlogN)
>>
>> Here's one of many Q on how http://stackoverflow.com/q
>> uestions/7272534/finding-the-first-n-largest-elements-in-an-array
>>
>
>
>>>
>


Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-16 Thread John Myles White
If you want to retain N words, perhaps a priority queue would be useful?

http://julia.readthedocs.org/en/latest/stdlib/collections/#priorityqueue

I'd be cautious about drawing many coding lessons from the TextAnalysis 
package, which has been never been optimized for performance.

 -- John

On Dec 16, 2014, at 3:30 AM, Michiaki ARIGA  wrote:

> Thanks for Pontus's kind explanation. He answered what I want to know.
> I want to know the standard way to create dictionary (which is a set of words 
> for ASR or NLP).
> 
> To create dictionary for speech recognition or something NLP, we often 
> control size of vocabulary. There are two ways to limit size of vocabulary, 
> one is to cut under threshold frequency that Pontus showed, and the other is 
> to pick up top N frequent words (ngram tool kit such as IRSTLM supports this 
> situation and it is popular way to control necessary memory size). If I want 
> to pick frequent words, I think I'll use DataFrame.
> 
> On Tue Dec 16 2014 at 15:31:00 Todd Leo  wrote:
> Could you provide any clue to guide me locate the issue? I'm willing to make 
> a PR but I am unable to find the related issue.
> 
> 
> On Tuesday, December 16, 2014 3:38:11 AM UTC+8, Stefan Karpinski wrote:
> There is not, but if I recall, there may be an open issue about this 
> functionality.
> 
> On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo  wrote:
> Is there a partial sort equivalent to sortperm! ? Supposingly selectperm! ?
> 
> On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote:
> We have a select function as part of Base, which can do O(n) selection of the 
> top n:
> 
> julia> v = randn(10^7);
> 
> julia> let w = copy(v); @time sort!(w)[1:1000]; end;
> elapsed time: 0.882989281 seconds (8168 bytes allocated)
> 
> julia> let w = copy(v); @time select!(w,1:1000); end;
> elapsed time: 0.054981192 seconds (8192 bytes allocated)
> 
> So for large arrays, this is substantially faster.
> 
> On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller  wrote:
> This can be done in O(N).  Avoid sorting as it will be O(NlogN)
> 
> Here's one of many Q on how 
> http://stackoverflow.com/questions/7272534/finding-the-first-n-largest-elements-in-an-array
> 
> 



Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-16 Thread Michiaki ARIGA
Thanks for Pontus's kind explanation. He answered what I want to know.
I want to know the standard way to create dictionary (which is a set of
words for ASR or NLP).

To create dictionary for speech recognition or something NLP, we often
control size of vocabulary. There are two ways to limit size of vocabulary,
one is to cut under threshold frequency that Pontus showed, and the other
is to pick up top N frequent words (ngram tool kit such as IRSTLM supports
this situation and it is popular way to control necessary memory size). If
I want to pick frequent words, I think I'll use DataFrame.

On Tue Dec 16 2014 at 15:31:00 Todd Leo  wrote:

> Could you provide any clue to guide me locate the issue? I'm willing to
> make a PR but I am unable to find the related issue.
>
>
> On Tuesday, December 16, 2014 3:38:11 AM UTC+8, Stefan Karpinski wrote:
>
>> There is not, but if I recall, there may be an open issue about this
>> functionality.
>>
>
>> On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo  wrote:
>>
>>> Is there a partial sort equivalent to sortperm! ? Supposingly
>>> selectperm! ?
>>>
>>> On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote:

 We have a select function as part of Base, which can do O(n) selection
 of the top n:

 julia> v = randn(10^7);

 julia> let w = copy(v); @time sort!(w)[1:1000]; end;
 elapsed time: 0.882989281 seconds (8168 bytes allocated)

 julia> let w = copy(v); @time select!(w,1:1000); end;
 elapsed time: 0.054981192 seconds (8192 bytes allocated)


 So for large arrays, this is substantially faster.

 On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller  wrote:

> This can be done in O(N).  Avoid sorting as it will be O(NlogN)
>
> Here's one of many Q on how http://stackoverflow.com/q
> uestions/7272534/finding-the-first-n-largest-elements-in-an-array
>


>>


Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-15 Thread Todd Leo
Could you provide any clue to guide me locate the issue? I'm willing to 
make a PR but I am unable to find the related issue.

On Tuesday, December 16, 2014 3:38:11 AM UTC+8, Stefan Karpinski wrote:
>
> There is not, but if I recall, there may be an open issue about this 
> functionality.
>
> On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo  > wrote:
>
>> Is there a partial sort equivalent to sortperm! ? Supposingly selectperm! 
>> ?
>>
>> On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote:
>>>
>>> We have a select function as part of Base, which can do O(n) selection 
>>> of the top n:
>>>
>>> julia> v = randn(10^7);
>>>
>>> julia> let w = copy(v); @time sort!(w)[1:1000]; end;
>>> elapsed time: 0.882989281 seconds (8168 bytes allocated)
>>>
>>> julia> let w = copy(v); @time select!(w,1:1000); end;
>>> elapsed time: 0.054981192 seconds (8192 bytes allocated)
>>>
>>>
>>> So for large arrays, this is substantially faster.
>>>
>>> On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller  wrote:
>>>
 This can be done in O(N).  Avoid sorting as it will be O(NlogN)

 Here's one of many Q on how http://stackoverflow.com/
 questions/7272534/finding-the-first-n-largest-elements-in-an-array

>>>
>>>
>

Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-15 Thread Pontus Stenetorp
On 16 December 2014 at 04:39, Stefan Karpinski  wrote:
>
> I'm not sure how this technique applies any less to strings than to numbers.
> Can you clarify what your concern is?

I am suspecting that what he wants to do is to limit the size of a
vocabulary (set of tokens) by imposing an occurrence threshold.  This
is fairly standard and what you really want would be a sorted
dictionary.  However, since this is usually just a pre-processing step
I don't even bother optimising and just do something like:

UNKFORM = ""
unkcutoff = 1
tokoccs = Dict{UTF8String,Int}()
for tok in data
tokoccs[tok] = get(tokoccs, tok, 0) + 1
end
vocab = Set{UTF8String}()
unks = Set{UTF8String}()
push!(vocab, UNKFORM)
for (tok, occs) in tokoccs
if occs > unkcutoff
push!(vocab, tok)
else
push!(unks, tok)
end
end

This is O(n) + O(n) which should be fine unless you want to focus on
optimising this specific region of code.

Pontus


Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-15 Thread Stefan Karpinski
On Sun, Dec 14, 2014 at 9:13 PM, Michiaki ARIGA  wrote:

> Of course Julia can work fast with Array, I know.
> But in natural language processing or text analyzing, we often count word
> frequency and create dictionary. We usually store word frequency in kind-a
> Dict and we always cut off non-frequent words (its frequency are under
> threshold) to exclude noisy words. So I want remove keys which values
> follow some condition.
>
> Finally, I found John Myles White's implementation creating n-gram. So, I
> will refer this.
>
>
> https://github.com/johnmyleswhite/TextAnalysis.jl/blob/master/src/ngramizer.jl
>
I'm not sure how this technique applies any less to strings than to
numbers. Can you clarify what your concern is?


Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-15 Thread Stefan Karpinski
There is not, but if I recall, there may be an open issue about this
functionality.

On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo  wrote:

> Is there a partial sort equivalent to sortperm! ? Supposingly selectperm! ?
>
> On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote:
>>
>> We have a select function as part of Base, which can do O(n) selection of
>> the top n:
>>
>> julia> v = randn(10^7);
>>
>> julia> let w = copy(v); @time sort!(w)[1:1000]; end;
>> elapsed time: 0.882989281 seconds (8168 bytes allocated)
>>
>> julia> let w = copy(v); @time select!(w,1:1000); end;
>> elapsed time: 0.054981192 seconds (8192 bytes allocated)
>>
>>
>> So for large arrays, this is substantially faster.
>>
>> On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller  wrote:
>>
>>> This can be done in O(N).  Avoid sorting as it will be O(NlogN)
>>>
>>> Here's one of many Q on how http://stackoverflow.com/
>>> questions/7272534/finding-the-first-n-largest-elements-in-an-array
>>>
>>
>>


Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-14 Thread Todd Leo
Is there a partial sort equivalent to sortperm! ? Supposingly selectperm! ?

On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote:
>
> We have a select function as part of Base, which can do O(n) selection of 
> the top n:
>
> julia> v = randn(10^7);
>
> julia> let w = copy(v); @time sort!(w)[1:1000]; end;
> elapsed time: 0.882989281 seconds (8168 bytes allocated)
>
> julia> let w = copy(v); @time select!(w,1:1000); end;
> elapsed time: 0.054981192 seconds (8192 bytes allocated)
>
>
> So for large arrays, this is substantially faster.
>
> On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller  > wrote:
>
>> This can be done in O(N).  Avoid sorting as it will be O(NlogN)
>>
>> Here's one of many Q on how 
>> http://stackoverflow.com/questions/7272534/finding-the-first-n-largest-elements-in-an-array
>>
>
>

Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-14 Thread Michiaki ARIGA
Of course Julia can work fast with Array, I know.
But in natural language processing or text analyzing, we often count word
frequency and create dictionary. We usually store word frequency in kind-a
Dict and we always cut off non-frequent words (its frequency are under
threshold) to exclude noisy words. So I want remove keys which values
follow some condition.

Finally, I found John Myles White's implementation creating n-gram. So, I
will refer this.

https://github.com/johnmyleswhite/TextAnalysis.jl/blob/master/src/ngramizer.jl


Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-08 Thread Stefan Karpinski
We have a select function as part of Base, which can do O(n) selection of
the top n:

julia> v = randn(10^7);

julia> let w = copy(v); @time sort!(w)[1:1000]; end;
elapsed time: 0.882989281 seconds (8168 bytes allocated)

julia> let w = copy(v); @time select!(w,1:1000); end;
elapsed time: 0.054981192 seconds (8192 bytes allocated)


So for large arrays, this is substantially faster.

On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller  wrote:

> This can be done in O(N).  Avoid sorting as it will be O(NlogN)
>
> Here's one of many Q on how
> http://stackoverflow.com/questions/7272534/finding-the-first-n-largest-elements-in-an-array
>


Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-08 Thread Jeff Waller
This can be done in O(N).  Avoid sorting as it will be O(NlogN)

Here's one of many Q on how 
http://stackoverflow.com/questions/7272534/finding-the-first-n-largest-elements-in-an-array


Re: [julia-users] Re: How can I sort Dict efficiently?

2014-12-07 Thread Michiaki ARIGA
I'm sorry that my example is not good to explain what I want to do.

I tried to count up words and get top N frequent words and I referred to
following example.

https://github.com/JuliaLang/julia/blob/master/examples/wordcount.jl

I don't think it should return Dict for top N words, so I think DataFrame
is good to get top N words using head(). But I wonder if DataFrame isn't
suitable because DataFrame converted from Dict is not sortable style using
its values of Dict.

On Sat Dec 06 2014 at 11:24:02 Steven G. Johnson 
wrote:

>
>
> On Friday, December 5, 2014 9:57:28 AM UTC-5, Michiaki Ariga wrote:
>>
>> I found there are no method such as sort_by() after v0.3.
>> But I want to count word frequency with Dict() and sort by its value to
>> find frequent word.
>> So, how can I sort Dict efficiently?
>>
>
>  You may want to use a different data structure.  For example, you can
> store word frequencies in a PriorityQueue and then pull out the most
> frequent word with peek or dequeue.  See:
>
> http://julia.readthedocs.org/en/latest/stdlib/collections/
>
> (A PriorityQueue lets you quickly fetch the smallest value, whereas you
> want the largest frequency, but you can work around this by just storing
> frequency * -1.)
>
> If you need all of the values in order, you can instead use an OrderedDict
> from https://github.com/JuliaLang/DataStructures.jl
>