Re: [julia-users] Re: How can I sort Dict efficiently?
Thanks for your answer, John. I'll use PriorityQueue to get top N words, and if I want to cut off using threshold I'll create a dictionary with Dict. --- Michiaki On Tue Dec 16 2014 at 22:53:34 John Myles White wrote: > If you want to retain N words, perhaps a priority queue would be useful? > > http://julia.readthedocs.org/en/latest/stdlib/collections/#priorityqueue > > I'd be cautious about drawing many coding lessons from the TextAnalysis > package, which has been never been optimized for performance. > > -- John > > On Dec 16, 2014, at 3:30 AM, Michiaki ARIGA wrote: > > Thanks for Pontus's kind explanation. He answered what I want to know. > I want to know the standard way to create dictionary (which is a set of > words for ASR or NLP). > > To create dictionary for speech recognition or something NLP, we often > control size of vocabulary. There are two ways to limit size of vocabulary, > one is to cut under threshold frequency that Pontus showed, and the other > is to pick up top N frequent words (ngram tool kit such as IRSTLM supports > this situation and it is popular way to control necessary memory size). If > I want to pick frequent words, I think I'll use DataFrame. > > On Tue Dec 16 2014 at 15:31:00 Todd Leo wrote: > >> Could you provide any clue to guide me locate the issue? I'm willing to >> make a PR but I am unable to find the related issue. >> >> >> On Tuesday, December 16, 2014 3:38:11 AM UTC+8, Stefan Karpinski wrote: >> >>> There is not, but if I recall, there may be an open issue about this >>> functionality. >>> >> >>> On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo wrote: >>> Is there a partial sort equivalent to sortperm! ? Supposingly selectperm! ? On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote: > > We have a select function as part of Base, which can do O(n) selection > of the top n: > > julia> v = randn(10^7); > > julia> let w = copy(v); @time sort!(w)[1:1000]; end; > elapsed time: 0.882989281 seconds (8168 bytes allocated) > > julia> let w = copy(v); @time select!(w,1:1000); end; > elapsed time: 0.054981192 seconds (8192 bytes allocated) > > > So for large arrays, this is substantially faster. > > On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller wrote: > >> This can be done in O(N). Avoid sorting as it will be O(NlogN) >> >> Here's one of many Q on how http://stackoverflow.com/q >> uestions/7272534/finding-the-first-n-largest-elements-in-an-array >> > > >>> >
Re: [julia-users] Re: How can I sort Dict efficiently?
If you want to retain N words, perhaps a priority queue would be useful? http://julia.readthedocs.org/en/latest/stdlib/collections/#priorityqueue I'd be cautious about drawing many coding lessons from the TextAnalysis package, which has been never been optimized for performance. -- John On Dec 16, 2014, at 3:30 AM, Michiaki ARIGA wrote: > Thanks for Pontus's kind explanation. He answered what I want to know. > I want to know the standard way to create dictionary (which is a set of words > for ASR or NLP). > > To create dictionary for speech recognition or something NLP, we often > control size of vocabulary. There are two ways to limit size of vocabulary, > one is to cut under threshold frequency that Pontus showed, and the other is > to pick up top N frequent words (ngram tool kit such as IRSTLM supports this > situation and it is popular way to control necessary memory size). If I want > to pick frequent words, I think I'll use DataFrame. > > On Tue Dec 16 2014 at 15:31:00 Todd Leo wrote: > Could you provide any clue to guide me locate the issue? I'm willing to make > a PR but I am unable to find the related issue. > > > On Tuesday, December 16, 2014 3:38:11 AM UTC+8, Stefan Karpinski wrote: > There is not, but if I recall, there may be an open issue about this > functionality. > > On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo wrote: > Is there a partial sort equivalent to sortperm! ? Supposingly selectperm! ? > > On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote: > We have a select function as part of Base, which can do O(n) selection of the > top n: > > julia> v = randn(10^7); > > julia> let w = copy(v); @time sort!(w)[1:1000]; end; > elapsed time: 0.882989281 seconds (8168 bytes allocated) > > julia> let w = copy(v); @time select!(w,1:1000); end; > elapsed time: 0.054981192 seconds (8192 bytes allocated) > > So for large arrays, this is substantially faster. > > On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller wrote: > This can be done in O(N). Avoid sorting as it will be O(NlogN) > > Here's one of many Q on how > http://stackoverflow.com/questions/7272534/finding-the-first-n-largest-elements-in-an-array > >
Re: [julia-users] Re: How can I sort Dict efficiently?
Thanks for Pontus's kind explanation. He answered what I want to know. I want to know the standard way to create dictionary (which is a set of words for ASR or NLP). To create dictionary for speech recognition or something NLP, we often control size of vocabulary. There are two ways to limit size of vocabulary, one is to cut under threshold frequency that Pontus showed, and the other is to pick up top N frequent words (ngram tool kit such as IRSTLM supports this situation and it is popular way to control necessary memory size). If I want to pick frequent words, I think I'll use DataFrame. On Tue Dec 16 2014 at 15:31:00 Todd Leo wrote: > Could you provide any clue to guide me locate the issue? I'm willing to > make a PR but I am unable to find the related issue. > > > On Tuesday, December 16, 2014 3:38:11 AM UTC+8, Stefan Karpinski wrote: > >> There is not, but if I recall, there may be an open issue about this >> functionality. >> > >> On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo wrote: >> >>> Is there a partial sort equivalent to sortperm! ? Supposingly >>> selectperm! ? >>> >>> On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote: We have a select function as part of Base, which can do O(n) selection of the top n: julia> v = randn(10^7); julia> let w = copy(v); @time sort!(w)[1:1000]; end; elapsed time: 0.882989281 seconds (8168 bytes allocated) julia> let w = copy(v); @time select!(w,1:1000); end; elapsed time: 0.054981192 seconds (8192 bytes allocated) So for large arrays, this is substantially faster. On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller wrote: > This can be done in O(N). Avoid sorting as it will be O(NlogN) > > Here's one of many Q on how http://stackoverflow.com/q > uestions/7272534/finding-the-first-n-largest-elements-in-an-array > >>
Re: [julia-users] Re: How can I sort Dict efficiently?
Could you provide any clue to guide me locate the issue? I'm willing to make a PR but I am unable to find the related issue. On Tuesday, December 16, 2014 3:38:11 AM UTC+8, Stefan Karpinski wrote: > > There is not, but if I recall, there may be an open issue about this > functionality. > > On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo > wrote: > >> Is there a partial sort equivalent to sortperm! ? Supposingly selectperm! >> ? >> >> On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote: >>> >>> We have a select function as part of Base, which can do O(n) selection >>> of the top n: >>> >>> julia> v = randn(10^7); >>> >>> julia> let w = copy(v); @time sort!(w)[1:1000]; end; >>> elapsed time: 0.882989281 seconds (8168 bytes allocated) >>> >>> julia> let w = copy(v); @time select!(w,1:1000); end; >>> elapsed time: 0.054981192 seconds (8192 bytes allocated) >>> >>> >>> So for large arrays, this is substantially faster. >>> >>> On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller wrote: >>> This can be done in O(N). Avoid sorting as it will be O(NlogN) Here's one of many Q on how http://stackoverflow.com/ questions/7272534/finding-the-first-n-largest-elements-in-an-array >>> >>> >
Re: [julia-users] Re: How can I sort Dict efficiently?
On 16 December 2014 at 04:39, Stefan Karpinski wrote: > > I'm not sure how this technique applies any less to strings than to numbers. > Can you clarify what your concern is? I am suspecting that what he wants to do is to limit the size of a vocabulary (set of tokens) by imposing an occurrence threshold. This is fairly standard and what you really want would be a sorted dictionary. However, since this is usually just a pre-processing step I don't even bother optimising and just do something like: UNKFORM = "" unkcutoff = 1 tokoccs = Dict{UTF8String,Int}() for tok in data tokoccs[tok] = get(tokoccs, tok, 0) + 1 end vocab = Set{UTF8String}() unks = Set{UTF8String}() push!(vocab, UNKFORM) for (tok, occs) in tokoccs if occs > unkcutoff push!(vocab, tok) else push!(unks, tok) end end This is O(n) + O(n) which should be fine unless you want to focus on optimising this specific region of code. Pontus
Re: [julia-users] Re: How can I sort Dict efficiently?
On Sun, Dec 14, 2014 at 9:13 PM, Michiaki ARIGA wrote: > Of course Julia can work fast with Array, I know. > But in natural language processing or text analyzing, we often count word > frequency and create dictionary. We usually store word frequency in kind-a > Dict and we always cut off non-frequent words (its frequency are under > threshold) to exclude noisy words. So I want remove keys which values > follow some condition. > > Finally, I found John Myles White's implementation creating n-gram. So, I > will refer this. > > > https://github.com/johnmyleswhite/TextAnalysis.jl/blob/master/src/ngramizer.jl > I'm not sure how this technique applies any less to strings than to numbers. Can you clarify what your concern is?
Re: [julia-users] Re: How can I sort Dict efficiently?
There is not, but if I recall, there may be an open issue about this functionality. On Sun, Dec 14, 2014 at 10:15 PM, Todd Leo wrote: > Is there a partial sort equivalent to sortperm! ? Supposingly selectperm! ? > > On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote: >> >> We have a select function as part of Base, which can do O(n) selection of >> the top n: >> >> julia> v = randn(10^7); >> >> julia> let w = copy(v); @time sort!(w)[1:1000]; end; >> elapsed time: 0.882989281 seconds (8168 bytes allocated) >> >> julia> let w = copy(v); @time select!(w,1:1000); end; >> elapsed time: 0.054981192 seconds (8192 bytes allocated) >> >> >> So for large arrays, this is substantially faster. >> >> On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller wrote: >> >>> This can be done in O(N). Avoid sorting as it will be O(NlogN) >>> >>> Here's one of many Q on how http://stackoverflow.com/ >>> questions/7272534/finding-the-first-n-largest-elements-in-an-array >>> >> >>
Re: [julia-users] Re: How can I sort Dict efficiently?
Is there a partial sort equivalent to sortperm! ? Supposingly selectperm! ? On Monday, December 8, 2014 8:21:33 PM UTC+8, Stefan Karpinski wrote: > > We have a select function as part of Base, which can do O(n) selection of > the top n: > > julia> v = randn(10^7); > > julia> let w = copy(v); @time sort!(w)[1:1000]; end; > elapsed time: 0.882989281 seconds (8168 bytes allocated) > > julia> let w = copy(v); @time select!(w,1:1000); end; > elapsed time: 0.054981192 seconds (8192 bytes allocated) > > > So for large arrays, this is substantially faster. > > On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller > wrote: > >> This can be done in O(N). Avoid sorting as it will be O(NlogN) >> >> Here's one of many Q on how >> http://stackoverflow.com/questions/7272534/finding-the-first-n-largest-elements-in-an-array >> > >
Re: [julia-users] Re: How can I sort Dict efficiently?
Of course Julia can work fast with Array, I know. But in natural language processing or text analyzing, we often count word frequency and create dictionary. We usually store word frequency in kind-a Dict and we always cut off non-frequent words (its frequency are under threshold) to exclude noisy words. So I want remove keys which values follow some condition. Finally, I found John Myles White's implementation creating n-gram. So, I will refer this. https://github.com/johnmyleswhite/TextAnalysis.jl/blob/master/src/ngramizer.jl
Re: [julia-users] Re: How can I sort Dict efficiently?
We have a select function as part of Base, which can do O(n) selection of the top n: julia> v = randn(10^7); julia> let w = copy(v); @time sort!(w)[1:1000]; end; elapsed time: 0.882989281 seconds (8168 bytes allocated) julia> let w = copy(v); @time select!(w,1:1000); end; elapsed time: 0.054981192 seconds (8192 bytes allocated) So for large arrays, this is substantially faster. On Mon, Dec 8, 2014 at 3:50 AM, Jeff Waller wrote: > This can be done in O(N). Avoid sorting as it will be O(NlogN) > > Here's one of many Q on how > http://stackoverflow.com/questions/7272534/finding-the-first-n-largest-elements-in-an-array >
Re: [julia-users] Re: How can I sort Dict efficiently?
This can be done in O(N). Avoid sorting as it will be O(NlogN) Here's one of many Q on how http://stackoverflow.com/questions/7272534/finding-the-first-n-largest-elements-in-an-array
Re: [julia-users] Re: How can I sort Dict efficiently?
I'm sorry that my example is not good to explain what I want to do. I tried to count up words and get top N frequent words and I referred to following example. https://github.com/JuliaLang/julia/blob/master/examples/wordcount.jl I don't think it should return Dict for top N words, so I think DataFrame is good to get top N words using head(). But I wonder if DataFrame isn't suitable because DataFrame converted from Dict is not sortable style using its values of Dict. On Sat Dec 06 2014 at 11:24:02 Steven G. Johnson wrote: > > > On Friday, December 5, 2014 9:57:28 AM UTC-5, Michiaki Ariga wrote: >> >> I found there are no method such as sort_by() after v0.3. >> But I want to count word frequency with Dict() and sort by its value to >> find frequent word. >> So, how can I sort Dict efficiently? >> > > You may want to use a different data structure. For example, you can > store word frequencies in a PriorityQueue and then pull out the most > frequent word with peek or dequeue. See: > > http://julia.readthedocs.org/en/latest/stdlib/collections/ > > (A PriorityQueue lets you quickly fetch the smallest value, whereas you > want the largest frequency, but you can work around this by just storing > frequency * -1.) > > If you need all of the values in order, you can instead use an OrderedDict > from https://github.com/JuliaLang/DataStructures.jl >