Re: Comparing strings from the back?
Neil Hodgson wrote: Ethan Furman: *plonk* I can't work out who you are plonking. While more than one of the posters on this thread seem worthy of a good plonk, by not including sufficient context, you've left me feeling puzzled. Is there a guideline for this in basic netiquette? You're right, my apologies. Dwight Hutto is the one I plonked. His signal to noise ratio seems incredibly low. alex23 can definitely be abrasive, but his ratio is high. (Now I'm wondering what that says about me as far as who I agree with and why... I'll have to think about that.) ~Ethan~ -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Tue, Sep 18, 2012 at 11:12 AM, Ethan Furman et...@stoneleaf.us wrote: Neil Hodgson wrote: Ethan Furman: *plonk* I can't work out who you are plonking. While more than one of the posters on this thread seem worthy of a good plonk, by not including sufficient context, you've left me feeling puzzled. Is there a guideline for this in basic netiquette? You're right, my apologies. Dwight Hutto is the one I plonked. His signal to noise ratio seems incredibly low. Is this just because of the lack of context, or because of the content of my response to the OP following the question? And with all the netiquette talk being passed around, why isn't this a new thread instead of hijacking the OP's original question, and engaging in bickering about netiquette that goes on all the time? And it seems redundant to argue this topic other than to comment, and let it pass, unless there's seems to be a legitimate reason why someone 'disobeyed' posting policy. alex23 can definitely be abrasive, but his ratio is high. (Now I'm wondering what that says about me as far as who I agree with and why... I'll have to think about that.) ~Ethan~ -- http://mail.python.org/mailman/listinfo/python-list -- Best Regards, David Hutto CEO: http://www.hitwebdevelopment.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
sufficient context, you've left me feeling puzzled. Is there a guideline for this in basic netiquette? www.woodgate.org/FAQs/netiquette.html -- Best Regards, David Hutto CEO: http://www.hitwebdevelopment.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
You're right, my apologies. Dwight Hutto is the one I plonked. You can call me David. I go by my middle name. And it seem to me I made some valid points about a few simple trimming of postings, that didn't seem necessary in the context of a small quick conversation. -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Wed, Sep 19, 2012 at 2:17 AM, Dwight Hutto dwightdhu...@gmail.com wrote: You're right, my apologies. Dwight Hutto is the one I plonked. You can call me David. I go by my middle name. You're most often going to be addressed by the name that's given in your post headers. In this case David has been reduced to an initial, and is visible only in your email address, whereas Dwight is right there in your real-name. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
You're most often going to be addressed by the name that's given in your post headers. In this case David has been reduced to an initial, and is visible only in your email address, whereas Dwight My sig says David, but it was just to let him know he can call me by my used name. -- Best Regards, David Hutto CEO: http://www.hitwebdevelopment.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 18/09/2012 21:40, Dwight Hutto wrote: You're most often going to be addressed by the name that's given in your post headers. In this case David has been reduced to an initial, and is visible only in your email address, whereas Dwight My sig says David, but it was just to let him know he can call me by my used name. Any particular him? -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Tue, 18 Sep 2012 12:17:40 -0400, Dwight Hutto wrote: You can call me David. I go by my middle name. If you want to be known as David, why do you give your name as Dwight? In your email client or newsreader, set your name as David and attributions will be to David, and people will know to call you David. Otherwise, people will continue to call you Dwight. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Sep 17, 2:05 pm, Chris Angelico ros...@gmail.com wrote: I would hate to see people leaned hard on to make their speech entirely politically correct. Use common sense, on both sides. Don't be offensive, and don't be offended. While I would like to agree, this would require there to be no power disparities between the parties involved, which isn't always the case. I hate the term politically correct, it always seems to be used in an eye-rolling, what a load of nonsense way, probably because I believe there's at least some validity in the Sapir/Whorf hypothesis that language dictates thought. If Ramit's name was Barbie and e was told to get back to your dream home or some other guff, it would be quite rightly viewed as sexist, whether intentional or not. Such behaviour has been called out on this list in the past, and I'm genuinely surprised to see so little reaction to this. -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
*plonk* -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
Ethan Furman: *plonk* I can't work out who you are plonking. While more than one of the posters on this thread seem worthy of a good plonk, by not including sufficient context, you've left me feeling puzzled. Is there a guideline for this in basic netiquette? Neil -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Sep 15, 1:10 pm, Dwight Hutto dwightdhu...@gmail.com wrote: On Fri, Sep 14, 2012 at 6:43 PM, Prasad, Ramit Since I was unsure myself if you were trying to be offensive or racist, I would disagree with everyone can know it wasn't meant as racist. If you're unsure if it was racist, you should err on the side of caution. If your comments are mistakable as racism, maybe *you* should be more cautious and *not make them*. -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Mon, Sep 17, 2012 at 11:11 AM, alex23 wuwe...@gmail.com wrote: On Sep 15, 1:10 pm, Dwight Hutto dwightdhu...@gmail.com wrote: On Fri, Sep 14, 2012 at 6:43 PM, Prasad, Ramit Since I was unsure myself if you were trying to be offensive or racist, I would disagree with everyone can know it wasn't meant as racist. If you're unsure if it was racist, you should err on the side of caution. If your comments are mistakable as racism, maybe *you* should be more cautious and *not make them*. That applies to the most obvious examples (for instance, there's a line in a 19th century opera that uses a six-letter word starting with 'n' to refer to a dark-skinned person - for obvious reasons, that line is usually changed in modern performances, even though it was descriptive and not offensive when the opera was written - like referring to a Caucasian). However, there are many things which could be misinterpreted as racist, and I would hate to see people leaned hard on to make their speech entirely politically correct. Use common sense, on both sides. Don't be offensive, and don't be offended. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Sep 14, 3:39 pm, Dwight Hutto dwightdhu...@gmail.com wrote: Please explain any logic whatsoever that would give you that conclusion. Well, this: I think you're referring to a play on words(ramit). Using foreign names derogatively is a common tactic of the racist. Ain't I so punny. Not really, no. -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
I think you're referring to a play on words(ramit). Using foreign names derogatively is a common tactic of the racist. Not really. But nice spin on my pun to make me look bad. Keep trying, and maybe you'll come up with an insult/ propaganda that's less obvious to the viewer that you're a liar, and a person who couldn't end this with out throwing a blatant race card. It's similar to if I said, this is real 'queer' of you to do ya big pansy, and next you'll be calling me a homophobe. Try harder, because no one would ever believe I was a racist, and if they did, it's an uninformed decision based off of cherry picken phrases out of context. Very text book propaganda of you though. -- Best Regards, David Hutto CEO: http://www.hitwebdevelopment.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Sep 14, 6:04 pm, Dwight Hutto dwightdhu...@gmail.com wrote: Using foreign names derogatively is a common tactic of the racist. Not really. But nice spin on my pun to make me look bad. It actually *is* common behaviour of racists. It's similar to if I said, this is real 'queer' of you to do ya big pansy, and next you'll be calling me a homophobe. Well, *yes*. Because your choice of that terminology as derogatory shows you view it as derogatory. Try harder, because no one would ever believe I was a racist, and if they did, it's an uninformed decision based off of cherry picken phrases out of context. Oh, so *now* context is important. Very text book propaganda of you though. I don't think that word means what you think it does. -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Fri, Sep 14, 2012 at 4:20 AM, alex23 wuwe...@gmail.com wrote: On Sep 14, 6:04 pm, Dwight Hutto dwightdhu...@gmail.com wrote: Using foreign names derogatively is a common tactic of the racist. Not really. But nice spin on my pun to make me look bad. It actually *is* common behaviour of racists. Not if there name is ramit. What if your name was john? I'd say I'll be right back, I have to go take a crap on the john. It's a joke about a name, not where it originates. It's similar to if I said, this is real 'queer' of you to do ya big pansy, and next you'll be calling me a homophobe. Well, *yes*. Because your choice of that terminology as derogatory shows you view it as derogatory. No it was a loose analogy to show that he's just trying to use anything he can say to slam me, and everyone can know it wasn't meant as racist. Try harder, because no one would ever believe I was a racist, and if they did, it's an uninformed decision based off of cherry picken phrases out of context. Oh, so *now* context is important. Never said it wasn't. The whole conversation to me is the context, and the OP usually follows it, and that's who it is usually intended for. Very text book propaganda of you though. I don't think that word means what you think it does. from wiki: Propaganda is a form of communication that is aimed at influencing the attitude of a community toward some cause or position. Propaganda is usually repeated and dispersed over a wide variety of media in order to create the chosen result in audience attitudes. He wants people to think I'm a racist, and you now want to see that in me as well. It seems it has propagated, and that I know exactly what it means. And all over a silly post that had too little content, at the time, to have what I said require any other posts, especially if the OP is following the conversation. 
http://mail.python.org/mailman/listinfo/python-list -- Best Regards, David Hutto CEO: http://www.hitwebdevelopment.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Fri, 14 Sep 2012 01:20:53 -0700, alex23 wrote: On Sep 14, 6:04 pm, Dwight Hutto dwightdhu...@gmail.com wrote: [snip] Please don't feed the trolls. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Sep 14, 6:53 pm, Dwight Hutto dwightdhu...@gmail.com wrote: Not if there name is ramit. What if your name was john? I'd say I'll be right back, I have to go take a crap on the john. It's a joke about a name, not where it originates. I'd recommend reading up on white privilege but I'm pretty sure it'd be a wasted suggestion. It's similar to if I said, this is real 'queer' of you to do ya big pansy, and next you'll be calling me a homophobe. Well, *yes*. Because your choice of that terminology as derogatory shows you view it as derogatory. No it was a loose analogy You can't use something as an example to support your case, and then dismiss it as a loose analogy when it undermines it. he's just trying to use anything he can say to slam me everyone can know it wasn't meant as racist. The anything I'm using is what *you* have said. I'm not trying to slam you, I don't even know who you are. I just have a very short fuse for rudeness and an even shorter one for racism, even if it *was* intended in a hyuck hyuck, I'm so punny way. Ignorant racism is still racism. The whole conversation to me is the context And yet: He wants people to think I'm a racist, and you now want to see that in me as well. It seems it has propagated, and that I know exactly what it means. Again, so much for context. There is no he and me, I'm the person who made the original accusation and then followed up on it. That's not propagation, so it's not propaganda. Now, someone who starts a new thread to have a conversation *with themselves* in some bizarre piece of performance art that seems intended to brand as petty the *requests that he actually follow list etiquette*...*that* is someone I'd consider a propagandist. -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
I'd recommend reading up on white privilege but I'm pretty sure it'd be a wasted suggestion. Not really, I tend to like interdisciplinary study. But I'm a little of everything if you like Darwin. It's similar to if I said, this is real 'queer' of you to do ya big pansy, and next you'll be calling me a homophobe. Well, *yes*. Because your choice of that terminology as derogatory shows you view it as derogatory. No it was a loose analogy You can't use something as an example to support your case, and then dismiss it as a loose analogy when it undermines it. How did it undermine it? It's the same thing, just transferred to another well known group subject to violent bigotry. I used a play on words in a response, and because his names foreign that makes me a racist. There's no logic in that, other than to bring me down, and use the worst thing you can say...playing the race card. he's just trying to use anything he can say to slam me everyone can know it wasn't meant as racist. The anything I'm using is what *you* have said. I'm not trying to slam you, I don't even know who you are. I just have a very short fuse for rudeness and an even shorter one for racism, It wasn't rude in terms of these things that have been said about me. Then lengthen your fuse, because I'm not a racist, I just play with words a lot, including foreign names. even if it *was* intended in a hyuck hyuck, I'm so punny way. Ignorant racism is still racism. Still trying to propagate a thought that I'm racist based on a guy who for all I know is white, and uses the A.K.A ramit. I have no freakin clue if ramit is an ethnic name or nickname...we're on the internet tweedledick(badumchee, and hyuck,hyuck,hyuck) The whole conversation to me is the context And yet: He wants people to think I'm a racist, and you now want to see that in me as well. It seems it has propagated, and that I know exactly what it means. Again, so much for context. 
There is no he and me, I'm the person who made the original accusation and then followed up on it. That's not propagation, so it's not propaganda. You're still trying to prove the point, when we all know I'm not a racist. Now, someone who starts a new thread to have a conversation *with themselves* in some bizarre piece of performance art that seems intended to brand as petty the *requests that he actually follow list etiquette*...*that* is someone I'd consider a propagandist. -- You mean like you taking over this thread to call me a racist, and go on ad nauseam about it, when everyone can see what you're doing is trying to prove a point you know is wrong? Go back to debate 101, and flunk your professor if he passed you. -- Best Regards, David Hutto CEO: http://www.hitwebdevelopment.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
[snip] Please don't feed the trolls. You're down here under the bridge with the rest of us trolls too, Steven. 24/7 -- Best Regards, David Hutto CEO: http://www.hitwebdevelopment.com -- http://mail.python.org/mailman/listinfo/python-list
RE: Comparing strings from the back?
Dwight Hutto wrote: On Thu, Sep 13, 2012 at 5:17 PM, Mark Lawrence breamore...@yahoo.co.uk wrote: On 13/09/2012 21:34, Joshua Landau wrote: On 13 September 2012 20:53, Mark Lawrence breamore...@yahoo.co.uk wrote: On 13/09/2012 19:39, Prasad, Ramit wrote: Dwight Hutto wrote: Why don' you just time it,eit lops through incrementing thmax input/ What? Without context I have no idea what this means. You're wasting your time, I've been described as a jackass for having the audacity to ask for context :) Mark, Sometimes it is the style in which something is asked, and not what specifically is asked. ;) I'm pretty sure you are in the wrong, acting as if what he said didn't make sense! Just read it, he obviously was telling you to time it, as eit lops are inside thmax input/ which, as you should know if you *bothered to read the thread*, is incrementing. don' is short for don't, by the way. I do grovellingly apologize for my appalling breach of netiquette. I am of course assuming that the rules have changed and that it's now my responsibility to wade back through maybe a couple of hundred responses on a long thread to find the context. I also guess that I'm never going to achieve my ambition of being a pot smoking hippy CEO of a web development company :( -- Cheers. Cheers usually implies you're an alcoholic pressing buttons, with the half of rest of the coders on the net. Dwight/David: Not necessarily true if you happen to be from the UK (or maybe just England). It seems to be a cultural signature equivalent to Thanks--at least in the UK (not sure about other places). [snip] Ramit This email is confidential and subject to important disclaimers and conditions including on offers for the purchase or sale of securities, accuracy and completeness of information, viruses, confidentiality, legal privilege, and legal entity disclaimers, available at http://www.jpmorgan.com/pages/disclosures/email. -- http://mail.python.org/mailman/listinfo/python-list
RE: Comparing strings from the back?
Dwight Hutto wrote: On Fri, Sep 14, 2012 at 4:20 AM, alex23 wuwe...@gmail.com wrote: On Sep 14, 6:04 pm, Dwight Hutto dwightdhu...@gmail.com wrote: Using foreign names derogatively is a common tactic of the racist. Not really. But nice spin on my pun to make me look bad. It actually *is* common behaviour of racists. Not if there name is ramit. What if your name was john? I'd say I'll be right back, I have to go take a crap on the john. It's a joke about a name, not where it originates. Okay, so maybe not racist but instead offensive juvenile humor? I suppose that is better... It's similar to if I said, this is real 'queer' of you to do ya big pansy, and next you'll be calling me a homophobe. Um, yes! You are using 'queer' and 'pansy' as a derogatory comparison which falls under the definition for homophobia. Homophobia is a range of negative attitudes and feelings toward homosexuality or people who are identified or perceived as being lesbian, gay, bisexual or transgender (LGBT). Definitions refer variably to antipathy, contempt, prejudice, aversion, irrational fear, and hatred. ~ First two sentences on Wikipedia Well, *yes*. Because your choice of that terminology as derogatory shows you view it as derogatory. No it was a loose analogy to show that he's just trying to use anything he can say to slam me, and everyone can know it wasn't meant as racist. Since I was unsure myself if you were trying to be offensive or racist, I would disagree with everyone can know it wasn't meant as racist. Try harder, because no one would ever believe I was a racist, and if they did, it's an uninformed decision based off of cherry picken phrases out of context. Oh, so *now* context is important. Never said it wasn't. The whole conversation to me is the context, and the OP usually follows it, and that's who it is usually intended for. 
[ snip ] -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Sep 6, 4:04 am, Oscar Benjamin oscar.j.benja...@gmail.com wrote: On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel d...@davea.name wrote: For random strings (as defined below), the average compare time is effectively unrelated to the size of the string, once the size passes some point. Define random string as being a selection from a set of characters, with replacement. So if we pick some set of characters, say 10 (or 256, it doesn't really matter), the number of possible strings is 10**N. The likelihood of not finding a mismatch within k characters is (1/10)**k The likelihood of actually reaching the end of the random string is (1/10)**N. (which is the reciprocal of the number of strings, naturally) If we wanted an average number of comparisons, we'd have to calculate a series, where each term is a probability times a value for k. sum((k * 9*10**-k) for k in range(1, 10)) Those terms very rapidly approach 0, so it's safe to stop after a few. Looking at the first 9 items, I see a value of 1.111 This may not be quite right, but the value is certainly well under 2 for a population of 10 characters, chosen randomly. And notice that N doesn't really come into it. It's exactly right. You can obtain this result analytically from Johannes' formula above. Just replace 256 with 10 to get that the expected number of comparisons is (10/9)*(1 - 10**(-N)) The math here is simple, but of course the average running time is also driven by usage patterns. Let's say you have two independently produced strings with N number of random bits. The average amount of bit comparisons that it takes to determine that the strings are different is between 1 and 2 checks for any value of N, for the same reason that the average number of times you need to flip a coin before either coming up heads or exhausting your N tries is always less than 2, and pretty darn close to 2 for even relatively small values of N. 
def compute_checks_for_different_strings(n):
    x = 1
    p = 0.5
    for i in range(0, n):
        x += p * i
        p = p * 0.5
    print n, x

for n in [10, 100, 1000, 10000]:
    compute_checks_for_different_strings(n)

10 1.9892578125
100 2.0
1000 2.0
10000 2.0

Now let's say your string comparison function is used to verify 100-bit passwords, and 99% of the time the password is from a legit user who types it correctly. Except for the 1% of a time when somebody fat fingers the copy/paste of their password, every strcmp call is gonna have to compare 100 pairs of characters, or N pairs. If the usage pattern says that 99% of the time you're simply verifying that two strings are equal, then the number of bits is the main driving factor for running time. -- http://mail.python.org/mailman/listinfo/python-list
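[Editor's note: the expectation discussed in the two messages above can also be checked numerically. The sketch below is not from the thread; it is a Python 3 rewrite (the thread's code is Python 2) with hypothetical names, comparing the analytic formula quoted earlier, (s/(s-1)) * (1 - s**-N), against a Monte Carlo estimate for an alphabet of 10 symbols.]

```python
import random

def checks_until_mismatch(a, b):
    """Count character pairs examined by a left-to-right equality scan."""
    checks = 0
    for x, y in zip(a, b):
        checks += 1
        if x != y:
            return checks
    return checks  # strings were equal: every pair was examined

def expected_checks(s, n):
    """Analytic expectation from the thread: (s/(s-1)) * (1 - s**-n)."""
    return (s / (s - 1)) * (1 - s ** (-n))

if __name__ == "__main__":
    random.seed(1)
    alphabet, n, trials = "0123456789", 50, 50_000
    total = 0
    for _ in range(trials):
        a = "".join(random.choices(alphabet, k=n))
        b = "".join(random.choices(alphabet, k=n))
        total += checks_until_mismatch(a, b)
    print("simulated:", total / trials)
    print("analytic: ", expected_checks(len(alphabet), n))  # 10/9 = 1.111...
```

For a 10-symbol alphabet the simulated average lands near 10/9 regardless of string length, which is the point Dave Angel and Oscar Benjamin make above: for random strings, N drops out.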
Re: Comparing strings from the back?
On Fri, Sep 14, 2012 at 6:43 PM, Prasad, Ramit ramit.pra...@jpmorgan.com wrote: Dwight Hutto wrote: On Fri, Sep 14, 2012 at 4:20 AM, alex23 wuwe...@gmail.com wrote: On Sep 14, 6:04 pm, Dwight Hutto dwightdhu...@gmail.com wrote: Using foreign names derogatively is a common tactic of the racist. Not really. But nice spin on my pun to make me look bad. It actually *is* common behaviour of racists. Not if there name is ramit. What if your name was john? I'd say I'll be right back, I have to go take a crap on the john. It's a joke about a name, not where it originates. Okay, so maybe not racist but instead offensive juvenile humor? I suppose that is better... No, just shop talk. You hit me and hit I logarithmically hit back ...butterfly effect. It's similar to if I said, this is real 'queer' of you to do ya big pansy, and next you'll be calling me a homophobe. Um, yes! You are using 'queer' and 'pansy' as a derogatory comparison which falls under the definition for homophobia. Homophobia is a range of negative attitudes and feelings toward homosexuality or people who are identified or perceived as being lesbian, gay, bisexual or transgender (LGBT). Definitions refer variably to antipathy, contempt, prejudice, aversion, irrational fear, and hatred. ~ First two sentences on Wikipedia No, analogy, and a continued attack on a subject we know is propaganda. Well, *yes*. Because your choice of that terminology as derogatory shows you view it as derogatory. No it was a loose analogy to show that he's just trying to use anything he can say to slam me, and everyone can know it wasn't meant as racist. Since I was unsure myself if you were trying to be offensive or racist, I would disagree with everyone can know it wasn't meant as racist. If you're unsure if it was racist, you should err on the side of caution. There are many nicknames on the net. That could be an ethnic name, or an A.K.A..
Non-logical, pure speculation, that is biased that your minority needs more anit-you. Don't engage in conversations you're sure to lose. Try harder, because no one would ever believe I was a racist, and if they did, it's an uninformed decision based off of cherry picken phrases out of context. Oh, so *now* context is important. Never said it wasn't. The whole conversation to me is the context, and the OP usually follows it, and that's who it is usually intended for. -- Best Regards, David Hutto CEO: http://www.hitwebdevelopment.com -- http://mail.python.org/mailman/listinfo/python-list
RE: Comparing strings from the back?
Dwight Hutto wrote: Why don' you just time it,eit lops through incrementing thmax input/ What? Without context I have no idea what this means. Ramit -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Thu, Sep 13, 2012 at 2:39 PM, Prasad, Ramit ramit.pra...@jpmorgan.com wrote: Dwight Hutto wrote: Why don' you just time it,eit lops through incrementing thmax input/ What? Without context I have no idea what this means. Ramit Why don't you read the OP: Let's assume you're testing two strings for equality. You've already done the obvious quick tests (i.e they're the same length), and you're down to the O(n) part of comparing every character. I'm wondering if it might be faster to start at the ends of the strings instead of at the beginning? If the strings are indeed equal, it's the same amount of work starting from either end. But, if it turns out that for real-life situations, the ends of strings have more entropy than the beginnings, the odds are you'll discover that they're unequal quicker by starting at the end. and this one from me: First include len(string)/2, in order to include starting at the center of the string, and threading/weaving by 2 processes out. import timeit do the the rest, and see which has the fastest time. -- Why don't take the time to read the OP, and ramit in your head? Remember that you're in the middle of a conversation where the OP is following as it goes along, so anyone reading the entire set of postings should get it. But for people who just want to jump in, and assume that the only thing that matters is one piece, without reading the entire content of the conversation, will always have something out of context for them. -- Best Regards, David Hutto CEO: http://www.hitwebdevelopment.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 13/09/2012 19:39, Prasad, Ramit wrote: Dwight Hutto wrote: Why don' you just time it,eit lops through incrementing thmax input/ What? Without context I have no idea what this means. Ramit You're wasting your time, I've been described as a jackass for having the audacity to ask for context :) -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 13 September 2012 20:53, Mark Lawrence breamore...@yahoo.co.uk wrote: On 13/09/2012 19:39, Prasad, Ramit wrote: Dwight Hutto wrote: Why don' you just time it,eit lops through incrementing thmax input/ What? Without context I have no idea what this means. You're wasting your time, I've been described as a jackass for having the audacity to ask for context :) I'm pretty sure you are in the wrong, acting as if what he said didn't make sense! Just read it, he obviously was telling you to time it, as eit lops are inside thmax input/ which, as you should know if you *bothered to read the thread*, is incrementing. don' is short for don't, by the way. -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Thu, Sep 13, 2012 at 4:34 PM, Joshua Landau joshua.landau...@gmail.com wrote: On 13 September 2012 20:53, Mark Lawrence breamore...@yahoo.co.uk wrote: On 13/09/2012 19:39, Prasad, Ramit wrote: Dwight Hutto wrote: Why don' you just time it,eit lops through incrementing thmax input/ What? Without context I have no idea what this means. You're wasting your time, I've been described as a jackass for having the audacity to ask for context :) I'm pretty sure you are in the wrong, acting as if what he said didn't make sense! Just read it, he obviously was telling you to time it, as eit lops are inside thmax input/ which, as you should know if you bothered to read the thread, is incrementing. don' is short for don't, by the way. -- http://mail.python.org/mailman/listinfo/python-list It's the fact that I consider the entire conversation the context. Reading the OP's being the primary, and things they responded positively to. There are other things that get mixed up as well, like not hitting the ... in gmail, and so that content doesn't show, or hitting reply, instead of reply all, and things getting jumbled for the others involved in the conversation. Then there is the problem of people saying you posted too much of the context, or not inline with the OP, just at the end, or top posting. I try to keep it along the line of what the OP has read, and they know the context in which it's meant. -- Best Regards, David Hutto CEO: http://www.hitwebdevelopment.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 13/09/2012 21:34, Joshua Landau wrote: On 13 September 2012 20:53, Mark Lawrence breamore...@yahoo.co.uk wrote: On 13/09/2012 19:39, Prasad, Ramit wrote: Dwight Hutto wrote: Why don' you just time it,eit lops through incrementing thmax input/ What? Without context I have no idea what this means. You're wasting your time, I've been described as a jackass for having the audacity to ask for context :) I'm pretty sure you are in the wrong, acting as if what he said didn't make sense! Just read it, he obviously was telling you to time it, as eit lops are inside thmax input/ which, as you should know if you *bothered to read the thread*, is incrementing. don' is short for don't, by the way. I do grovellingly apologize for my appalling breach of netiquette. I am of course assuming that the rules have changed and that it's now my responsibility to wade back through maybe a couple of hundred responses on a long thread to find the context. I also guess that I'm never going to achieve my ambition of being a pot smoking hippy CEO of a web development company :( -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Thu, Sep 13, 2012 at 5:17 PM, Mark Lawrence breamore...@yahoo.co.uk wrote: On 13/09/2012 21:34, Joshua Landau wrote: On 13 September 2012 20:53, Mark Lawrence breamore...@yahoo.co.uk wrote: On 13/09/2012 19:39, Prasad, Ramit wrote: Dwight Hutto wrote: Why don' you just time it,eit lops through incrementing thmax input/ What? Without context I have no idea what this means. You're wasting your time, I've been described as a jackass for having the audacity to ask for context :) I'm pretty sure you are in the wrong, acting as if what he said didn't make sense! Just read it, he obviously was telling you to time it, as eit lops are inside thmax input/ which, as you should know if you *bothered to read the thread*, is incrementing. don' is short for don't, by the way. I do grovellingly apologize for my appalling breach of netiquette. I am of course assuming that the rules have changed and that it's now my responsibility to wade back through maybe a couple of hundred responses on a long thread to find the context. I also guess that I'm never going to achieve my ambition of being a pot smoking hippy CEO of a web development company :( -- Cheers. Cheers usually implies you're an alcoholic pressing buttons, with half of the rest of the coders on the net. So don't give up hope, you might be able to take a medication that doesn't impair your judgement with the side effects alcohol has, and of course leaves you without the ability to read any of the responses to say you were right in certain instances, so something must be wrong with your mail reader, or your alcoholic mind. Also, without reading the other posts, you're probably just placing in a duplicate answer sometimes, which might make you seem like a copy cat. -- Best Regards, David Hutto CEO: http://www.hitwebdevelopment.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Thu, 13 Sep 2012 17:06:23 -0400, Dwight Hutto wrote: Then there is the problem of people saying you posted too much of the context, or not inline with the OP, just at the end, or top posting. The solution to "you quoted too much unnecessary verbiage" is not "quote nothing". It is "quote only the parts that are relevant". I try to keep it along the line of what the OP has read, and they know the context in which it's meant. You're assuming that people read your posts immediately after they read the post you replied to. Always imagine that your reply will be read a week after the post you replied to. Do you still expect the reader to understand what you're talking about? -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Sep 14, 5:37 am, Dwight Hutto dwightdhu...@gmail.com wrote: Why don't take the time to read the OP, and ramit in your head? Please, don't be a dick. -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Fri, Sep 14, 2012 at 1:39 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: You're assuming that people read your posts immediately after they read the post you replied to. Always imagine that your reply will be read a week after the post you replied to. And a week is extremely generous too; these posts get archived on the web. I *frequently* find myself hitting mailing list archives when researching obscurities. This is also another good reason to post follow-ups to the list, rather than in private email. You might never be thanked, but somebody years down the track may find the question and an associated answer. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Thu, Sep 13, 2012 at 11:48 PM, alex23 wuwe...@gmail.com wrote: On Sep 14, 5:37 am, Dwight Hutto dwightdhu...@gmail.com wrote: Why don't take the time to read the OP, and ramit in your head? Please, don't be a dick. For telling him to ramit into his head that you should read the OP? -- Best Regards, David Hutto CEO: http://www.hitwebdevelopment.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Sep 14, 2:46 pm, Dwight Hutto dwightdhu...@gmail.com wrote: For telling him to ramit into his head that you should read the OP? Yes. I'm not sure if it was intentionally racist, but you come across as a bit of a dwight supremacist. -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Fri, Sep 14, 2012 at 12:54 AM, alex23 wuwe...@gmail.com wrote: On Sep 14, 2:46 pm, Dwight Hutto dwightdhu...@gmail.com wrote: For telling him to ramit into his head that you should read the OP? Yes. I'm not sure if it was intentionally racist, but you come across as a bit of a dwight supremacist. Please explain any logic whatsoever that would give you that conclusion. Seems more like propaganda, and you're not very good at it. I think you're referring to a play on words(ramit). Ain't I so punny. -- Best Regards, David Hutto CEO: http://www.hitwebdevelopment.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: But for the record, in principle string comparisons *could* be the bottleneck. Example: you have 10**4 strings, which are each created once and stored in a list. Then you iterate over the list, comparing every string against every other string. And due to some weird vagary of the data, the strings are nearly all equal. (Why would you do this? I don't know, maybe it's a programmers' challenge found on the Internet, make up your own scenario...) Total number of strings created: 10**4. Total number of strings compared: 10**8. which is exactly what I meant by doing a lot of comparisons (10**8) on the same string. Perhaps if I'd phrased it more as "you have to be doing many more comparisons than string creation operations" it would have been clearer what I meant. The overhead of creating the strings is trivial compared to the overhead of comparing them, and since each string is only created once anyway, interning them is just a waste of time. No, you created 10k strings many of which are equal and then did 10k comparisons on each most of which found 'yes' they are equal. Interning them would have reduced all the 'true' comparisons to an identity check at the cost of 1 hash and 1 full comparison per string. so at the expense of a single dictionary insertion when the string is created you can get guaranteed O(1) on all the comparisons. What interning buys you is that s == t is an O(1) pointer compare if they are equal. But if s and t differ in the last character, __eq__ will still inspect every character. There is no way to tell Python all strings are interned, if s is not t then s != t as well. Right, if the strings differ only in the last character, but the rest of this thread has been about how, for random strings, the not-equal case is O(1) as well. -- Duncan Booth http://kupuguy.blogspot.com -- http://mail.python.org/mailman/listinfo/python-list
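Duncan's trade-off is easy to sketch. This assumes Python 3, where `intern` has moved to `sys.intern` (in the Python 2 of the thread it was a builtin):

```python
import sys

# Two equal strings built through different code paths, so CPython
# cannot automatically share one object between them.
a = "xyz" * 10000
b = "xyz" + ("xyz" * 9999)

assert a == b        # equal by value...
assert a is not b    # ...but distinct objects, so == must scan characters

# Interning pays one hash plus one full comparison per string up front;
# afterwards equal strings are the *same* object, and == short-circuits
# on the identity check.
a = sys.intern(a)
b = sys.intern(b)
assert a is b
```

This is exactly the "single dictionary insertion when the string is created" cost Duncan describes: `sys.intern(b)` does one hash and one full compare to find the existing entry, and every later equal-case comparison is a pointer check.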
Re: Comparing strings from the back?
Oscar Benjamin oscar.j.benja...@gmail.com wrote: What interning buys you is that s == t is an O(1) pointer compare if they are equal. But if s and t differ in the last character, __eq__ will still inspect every character. There is no way to tell Python all strings are interned, if s is not t then s != t as well. I thought that if *both* strings were interned then a pointer comparison could decide if they were unequal without needing to check the characters. Have I misunderstood how intern() works? I don't think you've misunderstood how it works, but so far as I can see the code doesn't attempt to short circuit the "not equal but interned" case. The comparison code doesn't look at interning at all, it only looks for identity as a shortcut. -- Duncan Booth http://kupuguy.blogspot.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 11 September 2012 10:51, Duncan Booth duncan.booth@invalid.invalidwrote: Oscar Benjamin oscar.j.benja...@gmail.com wrote: What interning buys you is that s == t is an O(1) pointer compare if they are equal. But if s and t differ in the last character, __eq__ will still inspect every character. There is no way to tell Python all strings are interned, if s is not t then s != t as well. I thought that if *both* strings were interned then a pointer comparison could decide if they were unequal without needing to check the characters. Have I misunderstood how intern() works? I don't think you've misunderstood how it work, but so far as I can see the code doesn't attempt to short circuit the not equal but interned case. The comparison code doesn't look at interning at all, it only looks for identity as a shortcut. It also doesn't seem to check if the hash values have been set. I guess the cached hash value is only used in contexts where the hash is explicitly desired. That makes two optimisations that can bring worst case string comparison down to O(1) in many contexts that are available to cpython but unused. But then if full string comparison is already on average O(1) then the cost of checking the interned and hash flags for every string comparison would outweigh the benefits of avoiding the rare worst case O(N) comparisons. Oscar -- http://mail.python.org/mailman/listinfo/python-list
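The identity shortcut that the comparison code does use is easy to observe from Python: comparing a large string with itself is essentially free, while comparing it with an equal but distinct copy has to scan the characters. A rough demonstration (the timings are illustrative, not exact):

```python
import timeit

big = "a" * 1_000_000
same = big                  # the very same object
copy = (big + "x")[:-1]     # equal value, but a distinct object

ns = {"big": big, "same": same, "copy": copy}
t_identity = timeit.timeit("big == same", globals=ns, number=200)
t_scan = timeit.timeit("big == copy", globals=ns, number=200)

# The identity check makes the same-object comparison orders of
# magnitude faster than the full character scan.
print(t_identity < t_scan)
```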
Re: Comparing strings from the back?
On 9/11/2012 6:40 AM, Oscar Benjamin wrote: On 11 September 2012 10:51, Duncan Booth duncan.booth@invalid.invalid wrote: Oscar Benjamin oscar.j.benja...@gmail.com wrote: What interning buys you is that s == t is an O(1) pointer compare if they are equal. But if s and t differ in the last character, __eq__ will still inspect every character. There is no way to tell Python all strings are interned, if s is not t then s != t as well. I thought that if *both* strings were interned then a pointer comparison could decide if they were unequal without needing to check the characters. Have I misunderstood how intern() works? I don't think you've misunderstood how it work, but so far as I can see the code doesn't attempt to short circuit the not equal but interned case. The comparison code doesn't look at interning at all, it only looks for identity as a shortcut. It also doesn't seem to check if the hash values have been set. I guess the cached hash value is only used in contexts where the hash is explicitly desired. I believe the internal use of interning and hash comparison has varied from release to release. However, the main use of string comparison is for dict keys, especially the internal dicts for namespaces and attributes. Since the dict lookup code needs hash values anyway, to find slots with possible conflicts, I am sure it does not use the generic comparison operators but starts with hash comparisons. Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
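Terry's description can be mimicked in pure Python. This mirrors only the *order* of checks a hash-table lookup performs for one slot (hash, then identity, then full equality), not CPython's actual C code:

```python
def slot_matches(key, key_hash, stored_key, stored_hash):
    """Approximate the order of checks a dict-style lookup performs."""
    if key_hash != stored_hash:    # cheap int comparison rejects most slots
        return False
    if key is stored_key:          # identity shortcut
        return True
    return key == stored_key       # full O(N) compare only on a hash match

k = "namespace lookup"
assert slot_matches(k, hash(k), k, hash(k))
assert not slot_matches("spam", hash("spam"), "ham", hash("ham"))
```

Note that the last assertion holds even in the astronomically unlikely event of a hash collision: the lookup simply falls through to the full comparison, which is why collisions cost time but never correctness.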
Re: Comparing strings from the back?
Gelonida N gelon...@gmail.com wrote: On 09/07/2012 06:06 AM, Steven D'Aprano wrote: On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote: Also of some interest is the best case: O(1) for unequal strings (they differ at the first character) and O(N) for equal strings. The worst case is O(N) or N characters the average case is O(1) or two characters. For denial of service attacks or systems, that are NEVER allowed to fail the worst case is important. For most other cases the average complexity counts. However I still wonder for how many applications the complexity of string comparisons would be the limiting factor. and of course if you ever do find an application where that worst case matters there's an easy way round it: just call intern() on all the strings when they are created. For the comparison to be the limiting factor you have to be doing a lot of comparisons on the same string (otherwise creating the string would be the limiting factor), so at the expense of a single dictionary insertion when the string is created you can get guaranteed O(1) on all the comparisons. -- Duncan Booth http://kupuguy.blogspot.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Mon, 10 Sep 2012 08:59:37 +, Duncan Booth wrote: Gelonida N gelon...@gmail.com wrote: On 09/07/2012 06:06 AM, Steven D'Aprano wrote: On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote: Also of some interest is the best case: O(1) for unequal strings (they differ at the first character) and O(N) for equal strings. The worst case is O(N) or N characters the average case is O(1) or two characters. For denial of service attacks or systems, that are NEVER allowed to fail the worst case is important. For most other cases the average complexity counts. However I still wonder for how many applications the complexity of string comparisons would be the limiting factor. and of course if you ever do find an application where that worst case matters there's an easy way round it: just call intern() on all the strings when they are created. There are other good reasons for interning strings, but speeding up astring == bstring is not one of them. [steve@ando ~]$ python -m timeit -s s = 'abc'*10 -s \ t = s[:-1] + 'x' s == t 1000 loops, best of 3: 910 usec per loop [steve@ando ~]$ python -m timeit -s s = 'abc'*10 -s \ t = s[:-1] + 'x' -s intern(s); intern(t) s == t 1000 loops, best of 3: 914 usec per loop No significant difference. To state the bleedin' obvious, the computational effort required to *compare two strings* pre-supposes that those strings already exist, and is *completely independent* of the complexity of creating the strings. For the comparison to be the limiting factor you have to be doing a lot of comparisons Correct. on the same string (otherwise creating the string would be the limiting factor), Not necessarily. Look, it's really hard to come up with a realistic, non-contrived example where string comparisons are a significant bottleneck in a non-toy application. So let me first off say that *nearly always* you're not going to care whether s == t looks at all N characters or just the first 2 (or 20, or 100). 
This argument is rather academic (the best sort of argument!). Until you start getting up into truly huge strings, we're arguing about how many angels can dance on a CPU cache. But for the record, in principle string comparisons *could* be the bottleneck. Example: you have 10**4 strings, which are each created once and stored in a list. Then you iterate over the list, comparing every string against every other string. And due to some weird vagary of the data, the strings are nearly all equal. (Why would you do this? I don't know, maybe it's a programmers' challenge found on the Internet, make up your own scenario...) Total number of strings created: 10**4. Total number of strings compared: 10**8. The overhead of creating the strings is trivial compared to the overhead of comparing them, and since each string is only created once anyway, interning them is just a waste of time. so at the expense of a single dictionary insertion when the string is created you can get guaranteed O(1) on all the comparisons. What interning buys you is that s == t is an O(1) pointer compare if they are equal. But if s and t differ in the last character, __eq__ will still inspect every character. There is no way to tell Python "all strings are interned, if s is not t then s != t as well". -- Steven -- http://mail.python.org/mailman/listinfo/python-list
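Steven's contrived scenario, shrunk to 100 strings so it runs in a blink, might look like this (the strings and counts are made up purely for illustration):

```python
# 100 long strings, each created once; every string is then compared
# against every other one, and half of all pairs are equal in value.
strings = [("spam" * 1000) + str(i % 2) for i in range(100)]

comparisons = 0
equal_pairs = 0
for s in strings:
    for t in strings:
        comparisons += 1
        if s == t:          # equal case: must scan all ~4000 characters
            equal_pairs += 1

print(comparisons)   # 10000: creation cost is trivial next to this
print(equal_pairs)   # 5000: 50*50 pairs ending in '0' plus 50*50 ending in '1'
```

Interning the 100 strings first would turn every one of those 5000 equal-case scans into a pointer check, which is Duncan's side of the argument.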
Re: Comparing strings from the back?
On 2012-09-10, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Mon, 10 Sep 2012 08:59:37 +, Duncan Booth wrote: Gelonida N gelon...@gmail.com wrote: so at the expense of a single dictionary insertion when the string is created you can get guaranteed O(1) on all the comparisons. What interning buys you is that s == t is an O(1) pointer compare if they are equal. But if s and t differ in the last character, __eq__ will still inspect every character. There is no way to tell Python all strings are interned, if s is not t then s != t as well. I thought that if *both* strings were interned then a pointer comparison could decide if they were unequal without needing to check the characters. Have I misunderstood how intern() works? Oscar -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Tue, Sep 11, 2012 at 12:06 AM, Oscar Benjamin oscar.j.benja...@gmail.com wrote: On 2012-09-10, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: What interning buys you is that s == t is an O(1) pointer compare if they are equal. But if s and t differ in the last character, __eq__ will still inspect every character. There is no way to tell Python all strings are interned, if s is not t then s != t as well. I thought that if *both* strings were interned then a pointer comparison could decide if they were unequal without needing to check the characters. Have I misunderstood how intern() works? In a language where _all_ strings are guaranteed to be interned (such as Lua, I think), you do indeed gain this. Pointer inequality implies string inequality. But when interning is optional (as in Python), you cannot depend on that, unless there's some way of recognizing interned strings. Of course, that may indeed be the case; a simple bit flag this string has been interned would suffice, and if both strings are interned AND their pointers differ, THEN you can be sure the strings differ. I have no idea whether or not CPython version X.Y.Z does this. The value of such an optimization really depends on how likely strings are to be interned; for instance, if the compiler automatically interns all the names of builtins, this could be quite beneficial. Otherwise, probably not; most Python scripts don't bother interning anything. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 2012-09-10, Chris Angelico ros...@gmail.com wrote: On Tue, Sep 11, 2012 at 12:06 AM, Oscar Benjamin oscar.j.benja...@gmail.com wrote: On 2012-09-10, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: What interning buys you is that s == t is an O(1) pointer compare if they are equal. But if s and t differ in the last character, __eq__ will still inspect every character. There is no way to tell Python all strings are interned, if s is not t then s != t as well. I thought that if *both* strings were interned then a pointer comparison could decide if they were unequal without needing to check the characters. Have I misunderstood how intern() works? In a language where _all_ strings are guaranteed to be interned (such as Lua, I think), you do indeed gain this. Pointer inequality implies string inequality. But when interning is optional (as in Python), you cannot depend on that, unless there's some way of recognizing interned strings. Of course, that may indeed be the case; a simple bit flag this string has been interned would suffice, and if both strings are interned AND their pointers differ, THEN you can be sure the strings differ. I have no idea whether or not CPython version X.Y.Z does this. The value of such an optimization really depends on how likely strings are to be interned; for instance, if the compiler automatically interns all the names of builtins, this could be quite beneficial. Otherwise, probably not; most Python scripts don't bother interning anything. I haven't looked at the source but my understanding was precisely that there is an intern() bit and that not only the builtins module but all the literals in any byte-compiled module are interned. Oscar -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 2012-09-10, Oscar Benjamin oscar.j.benja...@gmail.com wrote: On 2012-09-10, Chris Angelico ros...@gmail.com wrote: On Tue, Sep 11, 2012 at 12:06 AM, Oscar Benjamin oscar.j.benja...@gmail.com wrote: On 2012-09-10, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: What interning buys you is that s == t is an O(1) pointer compare if they are equal. But if s and t differ in the last character, __eq__ will still inspect every character. There is no way to tell Python all strings are interned, if s is not t then s != t as well. I thought that if *both* strings were interned then a pointer comparison could decide if they were unequal without needing to check the characters. Have I misunderstood how intern() works? In a language where _all_ strings are guaranteed to be interned (such as Lua, I think), you do indeed gain this. Pointer inequality implies string inequality. But when interning is optional (as in Python), you cannot depend on that, unless there's some way of recognizing interned strings. Of course, that may indeed be the case; a simple bit flag this string has been interned would suffice, and if both strings are interned AND their pointers differ, THEN you can be sure the strings differ. I have no idea whether or not CPython version X.Y.Z does this. The value of such an optimization really depends on how likely strings are to be interned; for instance, if the compiler automatically interns all the names of builtins, this could be quite beneficial. Otherwise, probably not; most Python scripts don't bother interning anything. I haven't looked at the source but my understanding was precisely that there is an intern() bit and that not only the builtins module but all the literals in any byte-compiled module are interned. s/literals/identifiers/ You can see the interned flag in the PyUnicodeObject struct here: http://hg.python.org/cpython/file/3ffd6ad93fe4/Include/unicodeobject.h#l303 Oscar -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Tue, Sep 11, 2012 at 12:43 AM, Oscar Benjamin oscar.j.benja...@gmail.com wrote: On 2012-09-10, Oscar Benjamin oscar.j.benja...@gmail.com wrote: I haven't looked at the source but my understanding was precisely that there is an intern() bit and that not only the builtins module but all the literals in any byte-compiled module are interned. s/literals/identifiers/ You can see the interned flag in the PyUnicodeObject struct here: http://hg.python.org/cpython/file/3ffd6ad93fe4/Include/unicodeobject.h#l303 Ah, yep, so that's there. In that case, it's possible to have that optimization. However, I may be misreading this, but it seems the only Unicode comparison function is a rich compare, which is unable to take advantage of a known difference: http://hg.python.org/cpython/file/b48ef168d8c5/Objects/unicodeobject.c#l6114 Different pointers prove the strings differ, but don't tell you which is to be sorted earlier. You could use this if you roll your own comparison in C; or, if you already know the strings are interned, you can use 'is' / 'is not'. But that seems to be the extent of it. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 04/09/2012 03:54, Roy Smith wrote: Let's assume you're testing two strings for equality. You've already done the obvious quick tests (i.e. they're the same length), and you're down to the O(n) part of comparing every character. I'm wondering if it might be faster to start at the ends of the strings instead of at the beginning? If the strings are indeed equal, it's the same amount of work starting from either end. But, if it turns out that for real-life situations, the ends of strings have more entropy than the beginnings, the odds are you'll discover that they're unequal quicker by starting at the end. From the rest of the thread, it looks like in most situations it won't make much difference as typically very few characters need to be compared if they are unequal. However, if you were in a situation with many strings which were almost equal, the most general way to improve the situation might be to store a hash of the string along with the string, i.e. store (hash(x), x) and then compare equality of this tuple. Almost all of the time, if the strings are unequal the hash will be unequal. Or, as someone else suggested, use interned versions of the strings. This is basically the same solution but even better. In this case, your startup costs will be higher (creating the strings) but your comparisons will always be instant. Dan -- http://mail.python.org/mailman/listinfo/python-list
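Dan's (hash(x), x) idea works because tuple equality compares elements left to right and short-circuits: the two ints are checked before the strings ever are. A minimal sketch:

```python
def tagged(s):
    """Pair a string with its precomputed hash."""
    return (hash(s), s)

a = tagged(("x" * 100000) + "a")
b = tagged(("x" * 100000) + "b")

# The int hashes differ for almost all unequal strings, so the O(N)
# character scan is skipped; on a rare hash collision the tuple
# comparison simply falls through to the full string comparison.
assert a != b
assert tagged("spam") == tagged("spam")
```

As Oscar points out downthread, CPython caches hash(s) on the string object after the first call, so building the tuple does not recompute the hash on every comparison.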
Re: Comparing strings from the back?
On 2012-09-10, Dan Goodman dg.gm...@thesamovar.net wrote: On 04/09/2012 03:54, Roy Smith wrote: Let's assume you're testing two strings for equality. You've already done the obvious quick tests (i.e they're the same length), and you're down to the O(n) part of comparing every character. I'm wondering if it might be faster to start at the ends of the strings instead of at the beginning? If the strings are indeed equal, it's the same amount of work starting from either end. But, if it turns out that for real-life situations, the ends of strings have more entropy than the beginnings, the odds are you'll discover that they're unequal quicker by starting at the end. From the rest of the thread, it looks like in most situations it won't make much difference as typically very few characters need to be compared if they are unequal. However, if you were in a situation with many strings which were almost equal, the most general way to improve the situation might be store a hash of the string along with the string, i.e. store (hash(x), x) and then compare equality of this tuple. Almost all of the time, if the strings are unequal the hash will be unequal. Or, as someone else suggested, use interned versions of the strings. This is basically the same solution but even better. In this case, your startup costs will be higher (creating the strings) but your comparisons will always be instant. Computing the hash always requires iterating over all characters in the string so is best case O(N) where string comparison is best case (and often average case) O(1). Also, so far as I know the hash value once computed is stored on the string object itself [1] and used for subsequent string comparisons so there's no need for you to do that in your code. Oscar [1] http://hg.python.org/cpython/file/71d94e79b0c3/Include/unicodeobject.h#l293 -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 10/09/2012 18:33, Oscar Benjamin wrote: Computing the hash always requires iterating over all characters in the string so is best case O(N) where string comparison is best case (and often average case) O(1). Yep, but you already have O(N) costs just creating the strings in the first place, so it's absorbed into that. It's only worth doing if you do many comparisons. Also, so far as I know the hash value once computed is stored on the string object itself [1] and used for subsequent string comparisons so there's no need for you to do that in your code. Cool, Python is very clever as always! :) Dan -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 10/09/2012 18:07, Dan Goodman wrote: On 04/09/2012 03:54, Roy Smith wrote: Let's assume you're testing two strings for equality. You've already done the obvious quick tests (i.e they're the same length), and you're down to the O(n) part of comparing every character. I'm wondering if it might be faster to start at the ends of the strings instead of at the beginning? If the strings are indeed equal, it's the same amount of work starting from either end. But, if it turns out that for real-life situations, the ends of strings have more entropy than the beginnings, the odds are you'll discover that they're unequal quicker by starting at the end. From the rest of the thread, it looks like in most situations it won't make much difference as typically very few characters need to be compared if they are unequal. However, if you were in a situation with many strings which were almost equal, the most general way to improve the situation might be store a hash of the string along with the string, i.e. store (hash(x), x) and then compare equality of this tuple. Almost all of the time, if the strings are unequal the hash will be unequal. Or, as someone else suggested, use interned versions of the strings. This is basically the same solution but even better. In this case, your startup costs will be higher (creating the strings) but your comparisons will always be instant. Just had another thought about this. Although it's unlikely to be necessary in practice since (a) it's rarely necessary at all, and (b) when it is, hashing and optionally interning seems like the better approach, I had another idea that would be more general. Rather than starting from the beginning or the end, why not do something like: check the first and last character, then the len/2 character, then the len/4, then 3*len/4, then len/8, 3*len/8, etc. You'd need to be a bit clever about making sure you hit every character but I'm sure someone's already got an efficient algorithm for this. 
You could probably even make this cache efficient by working on cache line length blocks. Almost certainly entirely unnecessary, but I like the original question and it's a nice theoretical problem. Dan -- http://mail.python.org/mailman/listinfo/python-list
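Dan's probing order can be generated with a breadth-first bisection. This is only an illustrative sketch of the idea, not a claim about any real implementation (and in pure Python the index arithmetic would swamp any savings):

```python
from collections import deque

def probe_order(n):
    """Yield 0..n-1 as: first, last, then midpoints of ever-smaller halves."""
    if n == 0:
        return
    yield 0
    if n > 1:
        yield n - 1
    seen = {0, n - 1}
    queue = deque([(0, n - 1)])
    while queue:
        lo, hi = queue.popleft()
        if hi - lo < 2:
            continue
        mid = (lo + hi) // 2
        if mid not in seen:
            seen.add(mid)
            yield mid
        queue.append((lo, mid))
        queue.append((mid, hi))

def scattered_eq(s, t):
    """Equality check that probes characters in bisection order."""
    return len(s) == len(t) and all(s[i] == t[i] for i in probe_order(len(s)))
```

For example, `list(probe_order(8))` starts `0, 7, 3, ...`, so a difference buried in the middle of two long strings is found after a handful of probes instead of len/2 of them.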
Re: Comparing strings from the back?
On 2012-09-10, Dan Goodman dg.gm...@thesamovar.net wrote: On 10/09/2012 18:07, Dan Goodman wrote: On 04/09/2012 03:54, Roy Smith wrote: Let's assume you're testing two strings for equality. You've already done the obvious quick tests (i.e they're the same length), and you're down to the O(n) part of comparing every character. I'm wondering if it might be faster to start at the ends of the strings instead of at the beginning? If the strings are indeed equal, it's the same amount of work starting from either end. But, if it turns out that for real-life situations, the ends of strings have more entropy than the beginnings, the odds are you'll discover that they're unequal quicker by starting at the end. From the rest of the thread, it looks like in most situations it won't make much difference as typically very few characters need to be compared if they are unequal. However, if you were in a situation with many strings which were almost equal, the most general way to improve the situation might be store a hash of the string along with the string, i.e. store (hash(x), x) and then compare equality of this tuple. Almost all of the time, if the strings are unequal the hash will be unequal. Or, as someone else suggested, use interned versions of the strings. This is basically the same solution but even better. In this case, your startup costs will be higher (creating the strings) but your comparisons will always be instant. Just had another thought about this. Although it's unlikely to be necessary in practice since (a) it's rarely necessary at all, and (b) when it is, hashing and optionally interning seems like the better approach, I had another idea that would be more general. Rather than starting from the beginning or the end, why not do something like: check the first and last character, then the len/2 character, then the len/4, then 3*len/4, then len/8, 3*len/8, etc. 
You'd need to be a bit clever about making sure you hit every character but I'm sure someone's already got an efficient algorithm for this. You could probably even make this cache efficient by working on cache line length blocks. Almost certainly entirely unnecessary, but I like the original question and it's a nice theoretical problem. It's not totally theoretical in the sense that the reasoning applies to all sequence comparisons. If you needed to compare lists of objects where the comparison of each pair of elements was an expensive operation then you would want to think carefully about what order you used. Also in general you can't hash/intern all sequences. If I was going to change the order of comparisons for all strings then I would use a random order. This is essentially how dict gets away with claiming to have O(1) lookup. There are sequences of inputs that can cause every possible hash collision to occur but because the hash function acts as a kind of randomisation filter the pathological sequences are very unlikely to occur unless someone is going out of their way. The clever way that Python 3.3 prevents someone from even doing this on purpose is just to introduce additional per-process randomisation. The difference between dict lookup and string comparison is that string comparison always compares the characters in the same order and it corresponds to the natural ordering of the data. This means that some pefectly natural use cases like comparing file-paths can have close to worst case behaviour. If string/sequence comparison occurs in a random order then there can be no use case where the likely strings would induce close to worst case behaviour unless you really are just comparing lots of almost identical sequences. Oscar -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 2012-09-08, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Fri, 07 Sep 2012 19:10:16 +, Oscar Benjamin wrote: On 2012-09-07, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: snip Would you say, then, that dict insertion is O(N)? Pedantically, yes. But since we're allowed to state (or even imply *wink*) whatever assumptions we like, we're allowed to assume in the absence of significant numbers of hash collisions and come up with amortized O(1) for dict insertions and lookups. (Provided, of course, that your computer has an infinite amount of unfragmented memory and the OS never starts paging your dict to disk. Another unstated assumption that gets glossed over when we talk about complexity analysis -- on real world computers, for big enough N, *everything* is O(2**N) or worse.) Big Oh analysis, despite the formal mathematics used, is not an exact science. Essentially, it is a way of bringing some vague order to hand- wavy estimates of complexity, and the apparent mathematical rigour is built on some awfully shaky foundations. But despite that, it actually is useful. Coming back to strings... given that in any real-world application, you are likely to have some string comparisons on equal strings and some on unequal strings, and more importantly you don't know which are which ahead of time, which attitude is less likely to give you a nasty surprise when you run your code? I have many millions of 100K strings to compare against other 100K strings, and string comparisons are O(1) so that will be fast. I have many millions of 100K strings to compare against other 100K strings, and string comparisons are O(N) so that will be slow, better find another algorithm. True. I can't think of a situation where I've used string comparisons directly in any text heavy code. Rather, I would use a dict or a set (or a regex) and hash(str) is always O(N). Remember too that for small enough N, everything is O(1). 
Getting hung up on Big Oh is just as much a mistake as ignoring it completely. I can't think of a situation in my own work where O(N) vs O(1) string comparisons would cause a significant problem (except perhaps in libraries that I use but didn't write). However, I can find a number of cases where I compare numpy.ndarrays for equality. For example, I found if np.all(a == b): in some code that I recently wrote. Although np.all() short-circuits, a==b does not so that line forces O(N) behaviour onto a situation where the average case can be better. Unfortunately numpy doesn't seem to provide a short-circuit equals() function. array_equal() is what I want but it does the same as the above. In future, I'll consider using something like def cmparray(a, b): return a.shape == b.shape and a.dtype == b.dtype and buffer(a) == buffer(b) to take advantage of (what I assume are) short-circuit buffer comparisons. Since string comparison is only useful if the strings can be equal or unequal, the average case depends on how often they are equal/unequal as well as the average complexity of both. For random strings the frequency of equal strings decreases very fast as N increases so that the comparison of random strings is O(1). But that is not an upper bound, and Big Oh analysis is strictly defined in terms of upper bounds. It is an upper bound, but it is an upper bound on the *expectation value* assuming a particular distribution of inputs, rather than an upper bound on all possible inputs. (I'm talking about the average here -- the actual number of comparisons can range all the way up to N, but the average is <= 2.) The average is actually bounded by 1 / (1 - p) where p is the probability that two characters match. This bound can be arbitrarily large as p approaches 1 as would be the case if, say, one character was much more likely than others. The particular assumption that you have made p = 1/M where M is the number of characters is actually the *smallest* possible value of p.
For non-uniform real data (English words for example) p is significantly greater than 1/M but in a strict bounds sense we should say that 1/M <= p <= 1. Oscar -- http://mail.python.org/mailman/listinfo/python-list
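[Editor's sketch: Oscar's cmparray uses the Python 2 buffer() builtin. A rough Python 3 analogue, assuming the arrays are made C-contiguous first, can use memoryview, whose equality comparison works on the underlying buffers; this is my own illustration, not code from the thread.]

```python
import numpy as np

def cmparray(a, b):
    """Cheap O(1) metadata checks first; only then compare the data
    buffers. memoryview equality compares element by element and can
    stop at the first mismatch."""
    return (a.shape == b.shape
            and a.dtype == b.dtype
            and memoryview(np.ascontiguousarray(a)) ==
                memoryview(np.ascontiguousarray(b)))
```

Note that np.ascontiguousarray may copy a non-contiguous array, which costs O(N) up front; for already-contiguous arrays it is free, and the buffer comparison can then abort early on the first differing element.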
Re: Comparing strings from the back?
On 09/06/2012 10:33 AM, Steven D'Aprano wrote: On Wed, 05 Sep 2012 22:47:14 +, Oscar Benjamin wrote: I may have been overly-conservative earlier when I said that on average string equality has to compare half the characters. I thought I had remembered that from a computer science textbook, but I can't find that reference now, so possibly I was thinking of something else. (String searching perhaps?). Yeah, I think you mixed it up with searching for an entry in an unsorted list of length N. That's one of the most common N/2 cases, one that one hits daily with many straightforward implementations. -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 09/07/2012 06:06 AM, Steven D'Aprano wrote: On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote: Also of some interest is the best case: O(1) for unequal strings (they differ at the first character) and O(N) for equal strings. The worst case is O(N), or N characters; the average case is O(1), or two characters. For denial of service attacks, or systems that are NEVER allowed to fail, the worst case is important. For most other cases the average complexity counts. However, I still wonder for how many applications the complexity of string comparisons would be the limiting factor. -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 2012-09-07, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: snip After further thought, and giving consideration to the arguments given by people here, I'm now satisfied to say that for equal-length strings, string equality is best described as O(N). 1) If the strings are equal, a == b will always compare all N characters in each string. 2) If the strings are unequal, a == b will *at worst* compare all N characters. 3) Following usual practice in this field, worst case is the one which conventionally is meant when discussing Big Oh behaviour. See, for example, Introduction To Algorithms by Cormen, Leiserson and Rivest. Would you say, then, that dict insertion is O(N)? Also of some interest is the best case: O(1) for unequal strings (they differ at the first character) and O(N) for equal strings. Also of interest is the case that has caused the majority of the discussion, the average case. I am now satisfied that the average number of comparisons for unequal strings is O(1). To be precise, it is bounded below by 1 comparison (you always have to compare at least one pair of characters) and bounded above by 2 comparisons. I find this idea of separating into the comparison of equal strings versus the comparison of unequal strings rather odd. If the strings you compare come from a distribution where they are guaranteed to be equal (or unequal) then you can just use the O(0) comparison method. Since string comparison is only useful if the strings can be equal or unequal, the average case depends on how often they are equal/unequal as well as the average complexity of both. For random strings the frequency of equal strings decreases very fast as N increases so that the comparison of random strings is O(1). (I'm talking about the average here -- the actual number of comparisons can range all the way up to N, but the average is <= 2.)
If I've done the maths right, the exact value for the average is: ((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N) I'm not sure where the extra N comes from ^ but otherwise good. I would have written that as: (1 - p) * sum(i * p**(i-1) for i in range(1, N+1)) where p is the probability of a match (1/M for M equally likely characters) or in closed form: (1 - p**N * (1 + N*(1 - p))) / (1 - p) for random strings of length N taken from an alphabet of size M. For M = 2, that average approaches but never exceeds 2 as N increases; for M = 3, the average approaches 1.5, for M = 4 it approaches 1.333... and so forth. It approaches 1 / (1 - p) or, if you prefer: M / (M - 1) Oscar -- http://mail.python.org/mailman/listinfo/python-list
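[Editor's sketch: the quoted formula is easy to sanity-check by brute force over every ordered pair of strings from a small alphabet. This is my own verification code, not anything posted in the thread.]

```python
from fractions import Fraction
from itertools import product

def compare_count(a, b):
    # characters examined by == on equal-length strings:
    # stop at the first mismatch, otherwise look at all N
    for k, (x, y) in enumerate(zip(a, b), 1):
        if x != y:
            return k
    return len(a)

def brute_average(M, N):
    # exact average over all M**N * M**N ordered pairs,
    # including comparing each string with itself
    strings = list(product(range(M), repeat=N))
    total = sum(compare_count(a, b) for a in strings for b in strings)
    return Fraction(total, len(strings) ** 2)

def steven_formula(M, N):
    # ((M-1)*sum((N-i)*M**i for i in 0..N-1) + N) / M**N, as quoted above
    return Fraction((M - 1) * sum((N - i) * M**i for i in range(N)) + N,
                    M**N)
```

For M = 2, N = 3 both give exactly 7/4, and the two agree for every small M and N I tried; the "+ N" term is the contribution of the equal-string pairs, as noted later in the thread.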
Re: Comparing strings from the back?
On 2012-09-07, Oscar Benjamin oscar.j.benja...@gmail.com wrote: On 2012-09-07, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: snip Since string comparison is only useful if the strings can be equal or unequal, the average case depends on how often they are equal/unequal as well as the average complexity of both. For random strings the frequency of equal strings decreases very fast as N increases so that the comparison of random strings is O(1). (I'm talking about the average here -- the actual number of comparisons can range all the way up to N, but the average is <= 2.) If I've done the maths right, the exact value for the average is: ((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N) I'm not sure where the extra N comes from ^ but otherwise good. Ok, I see it's for the case where they're equal. Oscar -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
With unequal strings/lists to match, it would seem that one would regex through the larger string/list with the shorter string, and piece by piece begin to match for partial percentage matches in relation to the longer iterative item. -- Best Regards, David Hutto *CEO:* *http://www.hitwebdevelopment.com* -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Fri, Sep 7, 2012 at 5:59 PM, Dwight Hutto dwightdhu...@gmail.com wrote: With unequal strings/lists to match, it would seem that one would regex through the larger string/list with the shorter string, and piece by piece begin to match for partial percentage matches in relation to the longer iterative item. While iterating through the larger list character/item, one at a time, or arranging them in certain instances. -- Best Regards, David Hutto *CEO:* *http://www.hitwebdevelopment.com* -- Best Regards, David Hutto *CEO:* *http://www.hitwebdevelopment.com* -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Fri, 07 Sep 2012 19:10:16 +, Oscar Benjamin wrote: On 2012-09-07, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: snip After further thought, and giving consideration to the arguments given by people here, I'm now satisfied to say that for equal-length strings, string equality is best described as O(N). 1) If the strings are equal, a == b will always compare all N characters in each string. 2) If the strings are unequal, a == b will *at worst* compare all N characters. 3) Following usual practice in this field, worst case is the one which conventionally is meant when discussing Big Oh behaviour. See, for example, Introduction To Algorithms by Cormen, Leiserson and Rivest. Would you say, then, that dict insertion is O(N)? Pedantically, yes. But since we're allowed to state (or even imply *wink*) whatever assumptions we like, we're allowed to assume in the absence of significant numbers of hash collisions and come up with amortized O(1) for dict insertions and lookups. (Provided, of course, that your computer has an infinite amount of unfragmented memory and the OS never starts paging your dict to disk. Another unstated assumption that gets glossed over when we talk about complexity analysis -- on real world computers, for big enough N, *everything* is O(2**N) or worse.) Big Oh analysis, despite the formal mathematics used, is not an exact science. Essentially, it is a way of bringing some vague order to hand- wavy estimates of complexity, and the apparent mathematical rigour is built on some awfully shaky foundations. But despite that, it actually is useful. Coming back to strings... given that in any real-world application, you are likely to have some string comparisons on equal strings and some on unequal strings, and more importantly you don't know which are which ahead of time, which attitude is less likely to give you a nasty surprise when you run your code? 
I have many millions of 100K strings to compare against other 100K strings, and string comparisons are O(1) so that will be fast. I have many millions of 100K strings to compare against other 100K strings, and string comparisons are O(N) so that will be slow, better find another algorithm. Remember too that for small enough N, everything is O(1). Getting hung up on Big Oh is just as much a mistake as ignoring it completely. I find this idea of separating into the comparison of equal strings versus the comparison of unequal strings rather odd. If the strings you compare come from a distribution where they are guaranteed to be equal (or unequal) then you can just use the O(0) comparison method. If you know that they're (un)equal, you don't need to call == at all. If you know that most strings will be unequal, then you might be justified in treating comparisons as O(1) most of the time and not stress about the occasional slow call. But of course most of the time you don't know this, which is why it is appropriate to treat string comparisons as O(N) rather than O(1), since that's the upper bound. Since string comparison is only useful if the strings can be equal or unequal, the average case depends on how often they are equal/unequal as well as the average complexity of both. For random strings the frequency of equal strings decreases very fast as N increases so that the comparison of random strings is O(1). But that is not an upper bound, and Big Oh analysis is strictly defined in terms of upper bounds. (I'm talking about the average here -- the actual number of comparisons can range all the way up to N, but the average is <= 2.) If I've done the maths right, the exact value for the average is: ((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N) I'm not sure where the extra N comes from ^ but otherwise good. The extra N comes from the case where you compare string S with itself, which takes exactly N comparisons.
-- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
Why don't you just time it, with loops through incrementing the max input? -- Best Regards, David Hutto *CEO:* *http://www.hitwebdevelopment.com* -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Wed, 05 Sep 2012 22:47:14 +, Oscar Benjamin wrote: In news.gmane.comp.python.general, you wrote: On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote: [...] You are making unjustified assumptions about the distribution of letters in the words. This might be a list of long chemical compounds where the words typically differ only in their suffix. It might be a list of people with titles: Actually, I'm not. I'm stating exactly what assumptions I'm making to get my calculation. I'm comparing *random* character strings or bitstrings. Excuse me, you are not. You are comparing English words which are highly non-random. Evidently we have different understandings of what 'random' means. I don't think it's unreasonable to say that strings drawn uniformly from the set of all strings in the English language (having a given number of characters) is random. The distribution is not uniform over the set of all possible character strings but it is still random. I think Johannes deliberately chose these strings to emulate a particular kind of 'real' distribution of strings that might occur in practice. Of course for some real applications, your strings being compared will be English words. And for other applications, they will be strings of numeric digits uniformly distributed between 2 and 3. Or taken from a Gaussian distribution centered around 7000. Or strings made up of deeply nested sets of parentheses (( ... )). Whichever special distribution of strings you pick, I don't think calling them random is terribly useful in context, even if they are generated randomly from some non-uniform distribution. But you know, it really doesn't make a difference. Equality testing will *still* be O(N) since the asymptotic behaviour for sufficiently large string comparisons will be bounded by O(N). Multiplicative constants (half the string versus 0.001 of the string) do not matter.
I may have been overly-conservative earlier when I said that on average string equality has to compare half the characters. I thought I had remembered that from a computer science textbook, but I can't find that reference now, so possibly I was thinking of something else. (String searching perhaps?). In any case, the *worst* case for string equality testing is certainly O(N) (every character must be looked at), and the *best* case is O(1) obviously (the first character fails to match). But I'm not so sure about the average case. Further thought is required. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 09/06/2012 04:33 AM, Steven D'Aprano wrote: snip I may have been overly-conservative earlier when I said that on average string equality has to compare half the characters. I thought I had remembered that from a computer science textbook, but I can't find that reference now, so possibly I was thinking of something else. (String searching perhaps?). In any case, the *worst* case for string equality testing is certainly O(N) (every character must be looked at), and the *best* case is O(1) obviously (the first character fails to match). But I'm not so sure about the average case. Further thought is required. For random strings (as defined below), the average compare time is effectively unrelated to the size of the string, once the size passes some point. Define random string as being a selection from a set of characters, with replacement. So if we pick some set of characters, say 10 (or 256, it doesn't really matter), the number of possible strings is 10**N. The likelihood of not finding a mismatch within k characters is (1/10)**k The likelihood of actually reaching the end of the random string is (1/10)**N. (which is the reciprocal of the number of strings, naturally) If we wanted an average number of comparisons, we'd have to calculate a series, where each term is a probability times a value for k. sum((k * 9*10**-k) for k in range(1, 10)) Those terms very rapidly approach 0, so it's safe to stop after a few. Looking at the first 9 items, I see a value of 1.111 This may not be quite right, but the value is certainly well under 2 for a population of 10 characters, chosen randomly. And notice that N doesn't really come into it. -- DaveA -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel d...@davea.name wrote: For random strings (as defined below), the average compare time is effectively unrelated to the size of the string, once the size passes some point. Define random string as being a selection from a set of characters, with replacement. So if we pick some set of characters, say 10 (or 256, it doesn't really matter), the number of possible strings is 10**N. The likelihood of not finding a mismatch within k characters is (1/10)**k The likelihood of actually reaching the end of the random string is (1/10)**N. (which is the reciprocal of the number of strings, naturally) If we wanted an average number of comparisons, we'd have to calculate a series, where each term is a probability times a value for k. sum((k * 9*10**-k) for k in range(1, 10)) Those terms very rapidly approach 0, so it's safe to stop after a few. Looking at the first 9 items, I see a value of 1.111 This may not be quite right, but the value is certainly well under 2 for a population of 10 characters, chosen randomly. And notice that N doesn't really come into it. It's exactly right. You can obtain this result analytically from Johannes' formula above. Just replace 256 with 10 to get that the expected number of comparisons is (10/9)*(1 - 10**(-N)) The last term shows the dependence on N and is tiny even for N=9. Oscar -- http://mail.python.org/mailman/listinfo/python-list
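[Editor's sketch: both Dave's truncated series and the (10/9)*(1 - 10**(-N)) closed form quoted above are easy to check numerically; this verification code is mine, not from the thread.]

```python
# Dave's series: for a 10-character alphabet the probability of the
# first mismatch landing at position k is 9 * 10**-k, so the expected
# number of comparisons is sum(k * 9 * 10**-k); terms shrink so fast
# that truncating at k=9 is already within ~1e-8 of the limit 10/9.
series = sum(k * 9 * 10**-k for k in range(1, 10))

def closed_form(N):
    # expected comparisons for random length-N strings over a
    # 10-character alphabet, as given in Oscar's reply
    return (10 / 9) * (1 - 10**-N)
```

The series and closed_form(9) agree to well beyond six decimal places, and both sit just under 10/9 = 1.111..., matching Dave's observation that N "doesn't really come into it" once it is modestly large.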
Re: Comparing strings from the back?
In article 50485fca$0$29977$c3e8da3$54964...@news.astraweb.com, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: In any case, the *worst* case for string equality testing is certainly O(N) (every character must be looked at), and the *best* case is O(1) obviously (the first character fails to match). The best case is O(0), if either string is empty (ducking and running). -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Thu, Sep 6, 2012 at 10:13 PM, Roy Smith r...@panix.com wrote: In article 50485fca$0$29977$c3e8da3$54964...@news.astraweb.com, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: In any case, the *worst* case for string equality testing is certainly O(N) (every character must be looked at), and the *best* case is O(1) obviously (the first character fails to match). The best case is O(0), if either string is empty (ducking and running). No, O(0) would be when the application decides not to compare at all. ChrisA (also ducking) -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 05.09.2012 18:24, Steven D'Aprano wrote: On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote: [...] You are making unjustified assumptions about the distribution of letters in the words. This might be a list of long chemical compounds where the words typically differ only in their suffix. It might be a list of people with titles: Actually, I'm not. I'm stating exactly what assumptions I'm making to get my calculation. I'm comparing *random* character strings or bitstrings. Excuse me, you are not. You are comparing English words which are highly non-random. Not in my original post. If you read it again, you will clearly see that I was talking about purely random strings. And since you like to nitpick, I'll clarify further: I'm talking about bitstrings in which every bit of every character has the same probability of occurrence, 50%. You then replied by mentioning probability distributions of characters in different languages/encodings, to which I tried giving an example as it might (and does) occur in the real world, like sorting a dictionary. But the original point is still valid: Sorting of randomized bitstrings is definitely not O(N), you got that wrong. You, on the other hand, are making vague assumptions which you do not care to formalize and yet you claim that the number of comparisons is equally likely to be 1, 2, 3, ..., N. The average then is. Without any explanation for this. At all. I will accept that my explanation was not good enough, but I strongly disagree that I gave no explanation at all. What possible explanation could there be that comparison of purely random bitstrings yields an equal amount of comparisons for its length? Herr Professor Frederick Schmidt Herr Professor Frederick Wagner ... Is your assumption that we're comparing words that have the common prefix Herr Professor Frederick ? No, I am pointing out that *your* assumption that most string comparisons will halt close to the beginning of the string is an invalid assumption.
Your assumption only holds for some non-random strings. Definitely not. It *especially* holds true for random strings. For random strings with an alphabet of size n, the probability that the first character needs to be compared is n^0, i.e. 1. The probability that the second character needs to be compared is n^(-1). For the xth character, it is n^(-x). This is due to the lazy abort in comparison and it results in the average number of comparisons being extremely short no matter the bitlength n, or O(log n). it's counting the number of character comparisons. Correct. But only out of an extremely limited subset of all possible string comparisons, namely those very few that happen when sorting lists of English words using a very specific algorithm, namely timsort. Yes, but this was to look at a real-world example (in which way more comparisons are needed than in the random case). You first were talking about randomized bitstrings (and so was I), then you brought up character sets and languages (which, for the original problem, are entirely irrelevant). Whatever figure you have calculated by taking this non-random selection, it is irrelevant to the question of the average performance of string equality given random strings. Correct. I gave the estimate for random strings in my very first post. For the sake of simple calculations, let's pretend that there are only 1000 five-letter strings possible. We want to know how many character- comparisons are done on average when testing two random five-letter strings for equality. To calculate this average, you must compare every string to every other string, *including* itself. This gives 1000*1000 = one million equality tests to be performed. For each equality test, we count the number of character comparisons performed, sum all those counts, and divide by 1e6. That is the average number of char comparisons for random strings of five letters. Easy enough. 
Since you didn't look at my original equation, here are some numbers and the program attached:

64 combinations for length 6
Char comparisons: 8064
Comparison cnt : 4096
Avg comparison : 1.97

128 combinations for length 7
Char comparisons: 32512
Comparison cnt : 16384
Avg comparison : 1.98

256 combinations for length 8
Char comparisons: 130560
Comparison cnt : 65536
Avg comparison : 1.99

512 combinations for length 9
Char comparisons: 523264
Comparison cnt : 262144
Avg comparison : 2.00

1024 combinations for length 10
Char comparisons: 2095104
Comparison cnt : 1048576
Avg comparison : 2.00

2048 combinations for length 11
Char comparisons: 8384512
Comparison cnt : 4194304
Avg comparison : 2.00

4096 combinations for length 12
Char comparisons: 33546240
Comparison cnt : 16777216
Avg comparison : 2.00

Hopefully now you'll see that your assumption O(n) is dead wrong. But that's not what you do. First you eliminate 900 out of the 1000 possible strings by only
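[Editor's sketch: the attached program is not reproduced in this archive. A reconstruction that matches Johannes's table, assuming all bitstrings of a given length are compared pairwise with early abort on the first mismatch, could look like this; it is my own code, not the original attachment.]

```python
from itertools import product

def stats(length):
    """Compare every ordered pair of bitstrings of the given length,
    counting character comparisons with early abort on mismatch."""
    strings = list(product("01", repeat=length))
    char_cmps = 0
    cmp_cnt = 0
    for a in strings:
        for b in strings:
            cmp_cnt += 1
            for x, y in zip(a, b):
                char_cmps += 1
                if x != y:
                    break
    return len(strings), char_cmps, cmp_cnt, char_cmps / cmp_cnt
```

stats(6) reproduces the first row of the table: 64 combinations, 8064 character comparisons over 4096 pair comparisons, average about 1.97.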
Re: Comparing strings from the back?
On 06.09.2012 10:33, Steven D'Aprano wrote: But you know, it really doesn't make a difference. Equality testing will *still* be O(N) since the asymptotic behaviour for sufficiently large string comparisons will be bounded by O(N). Multiplicative constants (half the string versus 0.001 of the string) do not matter. Wrong, at least for randomized strings (i.e. every character with the same probability). O(N) is worst-case, O(log N) is correct for randomized strings. Then again, since you were nitpicking about Landau notation earlier this thread, every function bound by O(log N) is also bound by O(N), since, as you correctly stated, it's only an upper bound. We should be talking about the asymptotic sharp bound, i.e. capital Theta. I may have been overly-conservative earlier when I said that on average string equality has to compare half the characters. I thought I had remembered that from a computer science textbook, but I can't find that reference now, so possibly I was thinking of something else. (String searching perhaps?). In any case, the *worst* case for string equality testing is certainly O(N) (every character must be looked at), and the *best* case is O(1) obviously (the first character fails to match). But I'm not so sure about the average case. Further thought is required. Yes, worst-case is O(N), best case O(1). Average is O(n log n). Best regards, Johannes -- Wo hattest Du das Beben nochmal GENAU vorhergesagt? Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 09/06/2012 09:43 AM, Johannes Bauer wrote: snip Yes, worst-case is O(N), best case O(1). Average is O(n log n). Can't see how you came up with an average of n log(n). Fourteen minutes before you made this post, you demonstrated it was less than 2 for any N. And I previously posted that for a character set of size 10, the average was 1. -- DaveA -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 06.09.2012 16:23, Dave Angel wrote: On 09/06/2012 09:43 AM, Johannes Bauer wrote: snip Yes, worst-case is O(N), best case O(1). Average is O(n log n). Can't see how you came up with an average of n log(n). Fourteen minutes before you made this post, you demonstrated it was less than 2 for any N. O(log n) is what I meant, sorry! Damnit. And I previously posted that for a character set of size 10, the average was 1. For any given string length n and alphabet size l, the average is: sum(i = 0..n) (1 / (l ^ i)) So with larger word length, the total complexity constantly increases. The amount by which it increases however is shrinking exponentially with the word length. Therefore O(log n). Sorry about the n log n mistake, argh. Best regards thanks for the correction, Johannes -- Wo hattest Du das Beben nochmal GENAU vorhergesagt? Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 06.09.2012 15:43, Johannes Bauer wrote: Wrong, at least for randomized strings (i.e. every character with the same probability). O(N) is worst-case, O(log N) is correct for randomized strings. ^^ Here I write the right thing. Then further below... Yes, worst-case is O(N), best case O(1). Average is O(n log n). ...I write the wrong thing. O(log n) is what I meant, as Dave correctly noticed. Best regards, Johannes -- Wo hattest Du das Beben nochmal GENAU vorhergesagt? Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 06.09.2012 16:23, Dave Angel wrote: On 09/06/2012 09:43 AM, Johannes Bauer wrote: snip Yes, worst-case is O(N), best case O(1). Average is O(n log n). Can't see how you came up with an average of n log(n). Fourteen minutes before you made this post, you demonstrated it was less than 2 for any N. And I previously posted that for a character set of size 10, the average was 1. Again playing with the equations and thinking about it again. And I completely missed your point. It wasn't about n log n vs. log n. Both are wrong. This surprises me. I was somehow thinking about the limit of sum (1/n), n -> infty. But since the summands are shrinking exponentially, the limit is different. I think the limit of the average comparisons for a given wordlength n against infinity with alphabet size l is l / (l - 1) i.e. for bitstrings it's 2 and for bytestrings it's 256/255. This would mean string comparison of randomized strings is O(1). Can that really be true? It looks like it. Best regards, Johannes -- Wo hattest Du das Beben nochmal GENAU vorhergesagt? Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org -- http://mail.python.org/mailman/listinfo/python-list
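[Editor's sketch: the l / (l - 1) limit is easy to spot-check by simulation with a seeded RNG; for bitstrings (l = 2) the average early-abort comparison count should settle near 2, and for bytestrings (l = 256) near 256/255. This is my own check, not code from the thread.]

```python
import random

def avg_comparisons(alphabet_size, length, trials, seed=0):
    """Average number of character comparisons (with early abort on
    the first mismatch) over random pairs of equal-length strings."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    total = 0
    for _ in range(trials):
        a = [rng.randrange(alphabet_size) for _ in range(length)]
        b = [rng.randrange(alphabet_size) for _ in range(length)]
        for x, y in zip(a, b):
            total += 1
            if x != y:
                break
    return total / trials
```

With a few tens of thousands of trials, avg_comparisons(2, 30, 20000) comes out close to 2 and avg_comparisons(256, 30, 5000) close to 1.004, in line with the l / (l - 1) limit.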
Re: Comparing strings from the back?
On Thu, Sep 6, 2012 at 11:37 PM, Johannes Bauer dfnsonfsdu...@gmx.de wrote: Not in my original post. If you read it again, you will clearly see that I was talking about purely random strings. And since you like to nitpick, I'll clarify further: I'm talking about bitstrings in which every bit of every character has the same probability of occurence, 50%. That sort of string isn't a normal thing to be comparing, though. Here's an idea. Someone who's doing a lot of arguing in this thread should take Python, find the string comparison routine, and hack in some statistics-gathering. Then run *real code* on it. Maybe use this with one of those web frameworks and run your web site on it for an hour or two, or fire off some real scripts you really use. Then dump out the stats at the end. My guess: The bulk of string comparisons that get to actually comparing byte-for-byte will end up returning True. Most of the false comparisons will be proven earlier; if I understand correctly, Python will check for identity (easy true) and different lengths (easy false). But my guess could turn out to be flat wrong. In any case, it'll be far FAR more useful than arguing from totally random, or random word selection, or anything. Who's game? ChrisA -- http://mail.python.org/mailman/listinfo/python-list
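[Editor's sketch: short of patching the interpreter as Chris suggests, a rough pure-Python way to gather such statistics is a str subclass that counts character comparisons. CPython compares real strings in C, so this only instruments code you explicitly route through it; CountingStr is a hypothetical helper, not part of the thread.]

```python
class CountingStr(str):
    """str whose == counts the character comparisons it performs."""
    calls = 0            # total == invocations
    chars_compared = 0   # total characters examined across all calls

    def __eq__(self, other):
        type(self).calls += 1
        if not isinstance(other, str) or len(self) != len(other):
            return False  # length mismatch: no characters examined
        for a, b in zip(self, other):
            type(self).chars_compared += 1
            if a != b:
                return False  # early abort on first mismatch
        return True

    __hash__ = str.__hash__  # defining __eq__ would otherwise drop hashing
```

After running code that compares CountingStr instances, CountingStr.chars_compared / CountingStr.calls gives the observed average compare length, analogous to the strCmpCc / comparison-count ratio in the patched-interpreter stats below.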
Re: Comparing strings from the back?
On 06.09.2012 16:39, Chris Angelico wrote: In any case, it'll be far FAR more useful than arguing from totally random, or random word selection, or anything. Who's game? Okay, patched against Python 3.2.3: http://pastebin.com/PRRN53P6 To invoke display of the stats, compare the string "pleasedumpstats" as LHO (left-hand operand). Just ran it against a big script of mine which takes the stringified IMDb text files, parses them and dumps them into a sqlite3 database. Surprisingly few string comparisons, however (the sqlite module without doubt uses its own routines internally). Although the database in the end has about 2 million records, these were the stats:

strCmpEq 1017
strCmpLt 2802
strCmpGt 1633
strCmpTc 16219
strCmpCc 8570

which means 5452 comparisons, of which 19% were equal and the rest unequal. Average string length is about 2.97 characters, and comparison aborted on average after 1.57 characters. Maybe someone has a test which makes heavier use of string comparison. I don't want to make something up, however, since this is (by definition) going to be artificial. Best regards, Johannes -- http://mail.python.org/mailman/listinfo/python-list
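For what it's worth, the summary figures quoted in the post follow directly from the raw counters (assuming, as the names suggest, that strCmpTc and strCmpCc are the total and actually-compared character counts):

```python
# Raw counters from the dump above.
eq, lt, gt = 1017, 2802, 1633
total_chars, compared_chars = 16219, 8570

n = eq + lt + gt
print(n)                             # 5452 string comparisons
print(round(100 * eq / n))           # 19% compared equal
print(round(total_chars / n, 2))     # average string length ~2.97 chars
print(round(compared_chars / n, 2))  # aborted after ~1.57 chars on average
```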
Re: Comparing strings from the back?
On 06.09.2012 17:36, Johannes Bauer wrote: pleasedumpstats And against an XML-reading C code generator that uses mako:

strCmpEq 39670
strCmpLt 2766215
strCmpGt 2744002
strCmpTc 37430821
strCmpCc 14048656

Compared strings: 5549887
Equal: 0.7%
Average wordlength: 6.74 chars
Average comparelength: 2.53 chars

Best regards, Johannes -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 09/06/2012 10:42 AM, Johannes Bauer wrote: On 06.09.2012 16:23, Dave Angel wrote: On 09/06/2012 09:43 AM, Johannes Bauer wrote: snip Yes, worst-case is O(N), best case O(1). Average is O(n log n). Can't see how you came up with an average of n log(n). Fourteen minutes before you made this post, you demonstrated it was less than 2 for any N. And I previously posted that for a character set of size 10, the average was 1. I was playing with the equations and thinking about it again, and I completely missed your point. It wasn't about n log n vs. log n. Both are wrong. This surprises me. I was somehow thinking about the limit of sum(1/n) as n -> infinity. But since the summands are shrinking exponentially, the limit is different. I think the limit of the average comparisons for word length n going to infinity with alphabet size l is l / (l - 1), i.e. for bitstrings it's 2 and for bytestrings it's 256/255. This would mean string comparison of randomized strings is O(1). Can that really be true? It looks like it. (Just lost the internet in a storm, so I'm not sure how long it'll be before this sends.) Thanks, that was exactly my point. Since l is at least 2, the average number of comparisons is no larger than 2, for any value of N. That's why, when I'm visually comparing MD5 values, I usually only look at the first few, and last few hex digits. However, Chris Angelico (at 10:39) pointed out again that totally random strings aren't real-world equivalent. He has now proposed that somebody measure real-world cases here, by patching the string comparison to gather statistics. I think that's beyond me at this point. I only joined this thread when the cases under study were well-defined random, and therefore predictable. -- DaveA -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Fri, 07 Sep 2012 00:39:33 +1000, Chris Angelico wrote: On Thu, Sep 6, 2012 at 11:37 PM, Johannes Bauer dfnsonfsdu...@gmx.de wrote: Not in my original post. If you read it again, you will clearly see that I was talking about purely random strings. And since you like to nitpick, I'll clarify further: I'm talking about bitstrings in which every bit of every character has the same probability of occurence, 50%. That sort of string isn't a normal thing to be comparing, though. Here's an idea. Someone who's doing a lot of arguing in this thread should take Python, find the string comparison routine, and hack in some statistics-gathering. Then run *real code* on it. Where's the fun in that? :-P -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote: On 09/06/2012 04:33 AM, Steven D'Aprano wrote: snip I may have been overly-conservative earlier when I said that on average string equality has to compare half the characters. I thought I had remembered that from a computer science textbook, but I can't find that reference now, so possibly I was thinking of something else. (String searching perhaps?). In any case, the *worst* case for string equality testing is certainly O(N) (every character must be looked at), and the *best* case is O(1) obviously (the first character fails to match). But I'm not so sure about the average case. Further thought is required. For random strings (as defined below), the average compare time is effectively unrelated to the size of the string, once the size passes some point. Okay, I've spent some time looking for the reference, and can't find it, so I'm now sure I must have been conflating it with something else. After further thought, and giving consideration to the arguments given by people here, I'm now satisfied to say that for equal-length strings, string equality is best described as O(N). 1) If the strings are equal, a == b will always compare all N characters in each string. 2) If the strings are unequal, a == b will *at worst* compare all N characters. 3) Following usual practice in this field, worst case is the one which conventionally is meant when discussing Big Oh behaviour. See, for example, Introduction To Algorithms by Cormen, Leiserson and Rivest. Also of some interest is the best case: O(1) for unequal strings (they differ at the first character) and O(N) for equal strings. Also of interest is the case that has caused the majority of the discussion, the average case. I am now satisfied that the average number of comparisons for unequal strings is O(1). To be precise, it is bounded below by 1 comparison (you always have to compare at least one pair of characters) and bounded above by 2 comparisons. 
(I'm talking about the average here -- the actual number of comparisons can range all the way up to N, but the average is <= 2.) If I've done the maths right, the exact value for the average is: ((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N) for random strings of length N taken from an alphabet of size M. For M = 2, that average approaches but never exceeds 2 as N increases; for M = 3, the average approaches 1.5, for M = 4 it approaches 1.333... and so forth. Thanks to all who contributed. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
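The closed form above is easy to check against brute force for small alphabets (a verification sketch of my own; `compare_count`, `brute_average` and `formula` are made-up names). The brute-force average enumerates all M**N × M**N ordered pairs and counts the characters an early-abort equality test examines:

```python
from itertools import product

def compare_count(a, b):
    # Characters examined by an early-abort, left-to-right equality test;
    # equal strings cost the full length N.
    for i, (x, y) in enumerate(zip(a, b), 1):
        if x != y:
            return i
    return len(a)

def brute_average(M, N):
    # Average over all M**N * M**N ordered pairs of length-N strings.
    strings = list(product(range(M), repeat=N))
    total = sum(compare_count(a, b) for a in strings for b in strings)
    return total / len(strings) ** 2

def formula(M, N):
    # The closed form from the post.
    return ((M - 1) * sum((N - i) * M**i for i in range(N)) + N) / M**N

for M, N in [(2, 2), (2, 5), (3, 3), (4, 2)]:
    assert abs(brute_average(M, N) - formula(M, N)) < 1e-12
```

It also agrees with the geometric-series view: the average equals sum(M**-i for i in range(N)), whose limit is M/(M-1), consistent with the 2, 1.5, 1.333... figures quoted.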
Re: Comparing strings from the back?
Roy Smith wrote: There's been a bunch of threads lately about string implementations, and that got me thinking (which is often a dangerous thing). Let's assume you're testing two strings for equality. You've already done the obvious quick tests (i.e. they're the same length), and you're down to the O(n) part of comparing every character. I'm wondering if it might be faster to start at the ends of the strings instead of at the beginning? If the strings are indeed equal, it's the same amount of work starting from either end. But, if it turns out that for real-life situations, the ends of strings have more entropy than the beginnings, the odds are you'll discover that they're unequal quicker by starting at the end. It doesn't seem implausible that this is the case. For example, most of the filenames I work with begin with /home/roy/. Most of the strings I use as memcache keys have one of a small number of prefixes. Most of the strings I use as logger names have common leading substrings. Things like credit card and telephone numbers tend to have much more entropy in the trailing digits. But do you actually compare them with each other? Even if you do, I'd guess that Python looks at the hash values most of the time. On the other hand, hostnames (and thus email addresses) exhibit the opposite pattern. Nor do English words:

$ python3 reverse_compare2.py compare /usr/share/dict/words --word-len 8
comparing every pair in a sample of 1000 8-char words
taken from '/usr/share/dict/words'

head
1: 477222
2:  18870 **
3:   2870
4:    435
5:     74
6:     17
7:     12

tail
1: 386633
2:  74966 ***
3:  29698
4:   6536 *
5:   1475
6:    154
7:     28
8:     10

Anyway, it's just a thought. Has anybody studied this for real-life usage patterns? I'm also not sure how this would work with all the possible UCS/UTF encodings. With some of them, you may get the encoding semantics wrong if you don't start from the front. -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten __pete...@web.de wrote: comparing every pair in a sample of 1000 8-char words taken from '/usr/share/dict/words' head 1: 477222 2: 18870 ** ... Not understanding this. What are the statistics, and what (if it's not obvious from the previous answer) do they prove? ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
a random idea: you could compare strings by their hashes..

print hash("randomstring") == hash("randomstring")
print hash("randomstring") == hash("randoMstring")

-- http://mail.python.org/mailman/listinfo/python-list
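The catch with the idea above is that equal hashes do not imply equal strings, so a hash can only serve as a quick negative filter; something like this sketch (`eq_via_hash` is a made-up name). Note that CPython str objects cache their hash after the first call, but computing it the first time still scans the whole string, so this only pays off when the same strings are compared repeatedly:

```python
def eq_via_hash(a, b):
    # Different hashes prove the strings differ without a char-by-char scan
    # (cheap once the hashes are cached). Equal hashes may collide, so a
    # full comparison is still needed to confirm equality.
    if hash(a) != hash(b):
        return False
    return a == b

print(eq_via_hash("randomstring", "randomstring"))  # True
print(eq_via_hash("randomstring", "randoMstring"))  # False
```

This is essentially what dict and set lookups already do internally, which is why keys usually avoid full string comparisons.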
Re: Comparing strings from the back?
On 04.09.2012 20:07, Steven D'Aprano wrote: A reasonable, conservative assumption is to calculate the largest possible value of the average for random strings. That largest value occurs when the alphabet is as small as possible, namely two characters. In practice, strings come from a larger alphabet, up to 1114111 different characters for full Unicode strings, so the average for them will be less than the average we calculate now. So for unequal strings, the number of comparisons is equally likely to be 1, 2, 3, ..., N. The average then is: That is exactly what I don't agree with. For unequal random strings, the distribution is *not* equally likely to have the same amount of comparisons. For demonstration, let's look at bitstrings and the string 1010. There are 15 bitstrings of the same length which are unequal, and I'll put the number of comparisons needed next to each, going left to right:

0000 1
0001 1
0010 1
0011 1
0100 1
0101 1
0110 1
0111 1
1100 2
1101 2
1110 2
1111 2
1000 3
1001 3
1011 4

i.e. about 1.73 (26/15) comparisons on average with random strings, compared to the 2.5 comparisons you end up with. My formula does respect lazy termination (because the probability of more comparisons gets exponentially lower). sum([1, 2, 3, ..., N])/N which by a bit of simple algebra works out to be (N+1)/2, or half the characters as I said. Yes, but it's a flawed assumption you're making, since you're neglecting lazy evaluation and early abort of comparison. Best regards, Johannes -- http://mail.python.org/mailman/listinfo/python-list
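The 26/15 figure in the post is easy to reproduce by enumeration (my own sketch; `comparisons` is a made-up name):

```python
from itertools import product

TARGET = "1010"

def comparisons(a, b):
    # Characters examined before a left-to-right equality test aborts.
    n = 0
    for x, y in zip(a, b):
        n += 1
        if x != y:
            break
    return n

# All 4-bit strings except the target itself.
others = ["".join(bits) for bits in product("01", repeat=4)]
others.remove(TARGET)

counts = [comparisons(TARGET, s) for s in others]
print(sum(counts), len(counts))   # 26 15
print(sum(counts) / len(counts))  # 1.7333...
```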
Re: Comparing strings from the back?
On 04.09.2012 23:59, Chris Angelico wrote: n = (256 / 255) * (1 - 256 ^ (-c)) where n is the average number of character comparisons and c is the length of the strings in characters. The rationale is as follows: The first character has to be compared in any case. The second with a probability of 1/256, the third with 1/(256^2) and so on. That would be for comparing two random areas of memory. Python strings don't have 256 options per character; and in terms of actual strings, there's so many possibilities. The unit of comparison is completely arbitrary and can equally be used to compare bitstrings if changed accordingly. The strings that a program is going to compare for equality are going to use a vastly restricted alphabet; for a lot of cases, there might be only a few dozen plausible characters. This is very true. But if so, I would like to see the assumption that you're making (which might very well be plausible) in order to arrive at an average of N/2 comparisons (which just seems *way* off). But even so, it's going to scale approximately linearly with the string length. If they're really random, then yes, there's little chance that either a 1MB string or a 2MB string will be the same, but with real data, they might very well have a long common prefix. So it's still going to be more or less O(n). I understand what you're saying and surely this is true to some extent (someone mentioned filenames, which is an excellent example). However, in order to derive a *formula* you need to formalize your assumptions. With random data, it's definitely not O(n). And even in reality (with a more limited alphabet) it usually will abort early: Consider sorting a large dictionary. Sorting is in O(n log(n)), but this assumes O(1) comparisons! If the comparison operation itself were in O(n), the total sorting complexity would be O(n^2 log(n)), which is definitely false. Most comparisons will abort *very* early (after the first character). Best regards, Johannes -- http://mail.python.org/mailman/listinfo/python-list
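The 256/255 figure from the formula above can also be checked empirically (my own quick simulation, seeded for repeatability; `compared_chars` is a made-up name):

```python
import random

random.seed(12345)

def compared_chars(a, b):
    # Characters a left-to-right equality test looks at before aborting.
    n = 0
    for x, y in zip(a, b):
        n += 1
        if x != y:
            break
    return n

# Average characters examined when comparing pairs of random 8-byte strings.
trials = 50_000
total = 0
for _ in range(trials):
    a = bytes(random.randrange(256) for _ in range(8))
    b = bytes(random.randrange(256) for _ in range(8))
    total += compared_chars(a, b)
print(total / trials)  # close to 256/255 ~ 1.0039
```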
Re: Comparing strings from the back?
On 05.09.2012 11:24, Johannes Bauer wrote: Consider sorting a large dictionary. Sorting is in O(n log(n)), but this assumes O(1) comparisons! If the comparison operation itself were in O(n), the total sorting complexity would be O(n^2 log(n)), which is definitely false. Most comparisons will abort *very* early (after the first character). I've just written a program to demonstrate this (attached below). I used my aspell dictionary, which contains about 100k words. To keep complexity down I only used words of the most common length (in this case 9 characters, or ~13k words). Here are the results (note it's randomized, so they'll always differ a little, but are pretty stable):

Max Cmp: 100
Sum Cmp: 32111
Avg Cmp: 2.393842254361115
Num words : 13414
Min wordlen: 9
Max wordlen: 9
Avg wordlen: 9.0

Going up in word length (only showing part now):

Avg Cmp: 2.2374929735806632
Num words : 10674
Avg wordlen: 10.0

Avg Cmp: 2.261727078891258
Num words : 7504
Avg wordlen: 11.0

Avg Cmp: 2.2335647202939977
Num words : 4898
Avg wordlen: 12.0

Avg Cmp: 2.1743341404358354
Num words : 2891
Avg wordlen: 13.0

Avg Cmp: 2.156782549420586
Num words : 1467
Avg wordlen: 14.0

Avg Cmp: 2.112449799196787
Num words : 747
Avg wordlen: 15.0

So, interestingly, in this real-world case(tm), the complexity does not scale with O(n). It actually goes down (which is probably due to the limited amount of words of higher length). In summary, let's please not debate feelings or such. Either measurements (with clear indication of what data they're based on and how the test was conducted) or derivations (with an explanation of the assumptions). Feelings can be very misleading in cases like this.
Best regards, Johannes

#!/usr/bin/python3
import random

class Word():
    def __init__(self, s):
        self._s = s
        self._c = 0

    def __lt__(self, other):
        if len(self) < len(other):
            return True
        elif len(self) > len(other):
            return False
        for i in range(len(self)):
            self._c += 1
            if self._s[i] < other._s[i]:
                return True
            elif self._s[i] > other._s[i]:
                return False
        return False

    def __repr__(self):
        return "%s %d" % (self._s, self._c)

    def __len__(self):
        return len(self._s)

    def getcmp(self):
        return self._c

wordlist = [ ]
for line in open("dict", "r"):
    word = Word(line[:-1])
    if len(word) == 9:
        wordlist.append(word)
random.shuffle(wordlist)
wordlist.sort()

comparisons = [ word.getcmp() for word in wordlist ]
print("Max Cmp:", max(comparisons))
print("Sum Cmp:", sum(comparisons))
print("Avg Cmp:", sum(comparisons) / len(comparisons))

wordlengths = [ len(word) for word in wordlist ]
print("Min wordlen:", min(wordlengths))
print("Max wordlen:", max(wordlengths))
print("Avg wordlen:", sum(wordlengths) / len(wordlengths))

-- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
Chris Angelico wrote: On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten __pete...@web.de wrote: comparing every pair in a sample of 1000 8-char words taken from '/usr/share/dict/words' head 1: 477222 2: 18870 ** ... tail 1: 386633 2: 74966 *** Not understanding this. What are the statistics, I measured how many chars they have in common for all combinations of 1000 words taken from /usr/share/dict/words. 477222 pairs have one char in common, 18870 pairs have two chars in common when compared from the *beginning* of the string. 386633 pairs have one char in common, 74966 pairs have two chars in common when compared from the *end* of the string. and what (if it's not obvious from the previous answer) do they prove? They demonstrate that for two words from that particular corpus it is likely that a[::-1] == b[::-1] has to take more (1.34 versus 1.05 on average) characters into account than a == b, i.e. comparing from the back should be slower rather than faster. If that doesn't help, here's the code ;)

import random
import itertools
from contextlib import contextmanager
from collections import defaultdict

@contextmanager
def open_corpus(args):
    with open(args.name) as instream:
        words = (line.strip() for line in instream)
        #words = (word for word in words if not word.endswith("'s"))
        yield words

def pick(items, count):
    items = list(items)
    return random.sample(items, count)

def count_common(a, b):
    i = 0
    for i, (x, y) in enumerate(zip(a, b), 1):
        if x != y:
            break
    return i

def corpus(args):
    with open_corpus(args) as words:
        freq = defaultdict(int)
        for word in words:
            freq[len(word)] += 1
    show_histogram(freq)

def show_histogram(freq):
    peak = max(freq.values())
    freq_width = len(str(peak))
    max_freq = max(freq)
    len_width = len(str(max_freq))
    for i in range(min(freq), max_freq + 1):
        value = freq[i]
        print("{:{}}: {:{}} {}".format(
            i, len_width, value, freq_width,
            "*" * (60 * value // peak)))

def compare(args):
    with open_corpus(args) as words:
        words = pick(
            (word for word in words if len(word) == args.word_len),
            args.sample_size)
    n = 0
    head_common = defaultdict(int)
    tail_common = defaultdict(int)
    n_tail = n_head = 0
    for n, (a, b) in enumerate(itertools.combinations(words, 2), 1):
        cc = count_common(a, b)
        n_head += cc
        head_common[cc] += 1
        ccr = count_common(a[::-1], b[::-1])
        n_tail += ccr
        tail_common[ccr] += 1
    print(n, "combinations")
    print("average common chars (head) {:.3}".format(n_head / n))
    print("average common chars (tail) {:.3}".format(n_tail / n))
    print("comparing every pair in a sample of {sample} {len}-char words\n"
          "taken from {name!r}".format(
              sample=args.sample_size, len=args.word_len, name=args.name))
    print()
    print("head")
    show_histogram(head_common)
    print()
    print("tail")
    show_histogram(tail_common)

def main():
    import argparse
    parser = argparse.ArgumentParser()
    parsers = parser.add_subparsers()
    pcorpus = parsers.add_parser("corpus")
    pcorpus.add_argument("name")
    pcorpus.set_defaults(func=corpus)
    pcompare = parsers.add_parser("compare")
    pcompare.add_argument("name")
    pcompare.add_argument("--sample-size", type=int, default=1000)
    pcompare.add_argument("--word-len", type=int, default=8)
    pcompare.set_defaults(func=compare)
    args = parser.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()

-- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 5 September 2012 10:48, Peter Otten __pete...@web.de wrote: Chris Angelico wrote: On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten __pete...@web.de wrote: comparing every pair in a sample of 1000 8-char words taken from '/usr/share/dict/words' head 1: 477222 2: 18870 ** ... tail 1: 386633 2: 74966 *** Not understanding this. What are the statistics, I measured how many chars they have in common for all combinations of 1000 words taken from /usr/share/dict/words. 477222 pairs have one char in common, 18870 pairs have two chars in common when compared from the *beginning* of the string. 386633 pairs have one char in common, 74966 pairs have two chars in common when compared from the *end* of the string. and what (if it's not obvious from the previous answer) do they prove? They demonstrate that for two words from that particular corpus it is likely that a[::-1] == b[::-1] has to take more (1.34 versus 1.05 on average) characters into account than a == b, i. e. comparing from the back should be slower rather than faster. If that doesn't help, here's the code ;) snip def count_common(a, b): i = 0 for i, (x, y) in enumerate(zip(a, b), 1): if x != y: break return i This function will return 1 if the first character differs. It does not count the number of common characters but rather the more relevant quantity which is the number of comparisons required to decide if the strings are equal. It shows that, for the English words in this dictionary, the first letters of two randomly selected 8-character words are equal ~5% of the time while the last letters are equal ~20% of the time. But despite the non-uniformity in the distribution of these strings, this provides a good example of the fact that for many situations involving real data, average case comparison complexity is O(1). 
This is because the probability of stopping after N comparisons decreases exponentially with N, so that the sequence of counts forms something loosely like a geometric progression:

>>> cmp_forward = [477222, 18870, 2870, 435, 74, 17, 12]
>>> cmp_backward = [386633, 74966, 29698, 6536, 1475, 154, 28, 10]
>>> def ratios(seq):
...     for count1, count2 in zip(seq[:-1], seq[1:]):
...         yield count2 / float(count1)
...
>>> list(ratios(cmp_forward))
[0.03954134553729711, 0.15209326974032855, 0.15156794425087108,
 0.17011494252873563, 0.22972972972972974, 0.7058823529411765]
>>> list(ratios(cmp_backward))
[0.19389446839767946, 0.39615292265827173, 0.22008216041484274,
 0.22567319461444307, 0.10440677966101695, 0.18181818181818182,
 0.35714285714285715]

A notable outlier in these sequences is the ratio for comparing the first character of the two words, which is why for this string distribution it is better to start at the beginning than the end. Oscar -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Wed, 05 Sep 2012 11:43:08 +0200, Johannes Bauer wrote: On 05.09.2012 11:24, Johannes Bauer wrote: Consider sorting a large dictionary. Sorting is in O(n log(n)), but this assumes O(1) comparisons! If the comparison operation itself were in O(n), the total sorting complexity would be O(n^2 log(n)), This shows some significant confusion about Big Oh complexity analysis here. When you say that sorting the list of words is O(N log N), the N refers to the number of words in the dictionary. But when you speak about comparisons being O(N), the N doesn't refer to the size of the dict, but the size of the individual strings being compared. There is no reasonable comparison algorithm where the effort required to compare two strings depends on the size of the dictionary they have come from. Since these are *completely different Ns*, you can't combine them to get O(N**2 log N) as the overall algorithmic complexity. Doing so is sheer nonsense. If you state it like this:

sorting the list of words is O(N log N), where N = length of the list;
comparing the words is O(M), where M = the average length of the words

then the overall complexity is O(N log N)*O(M), but since M is a constant for any particular dictionary (it's just the average length of the words) that reduces down to O(N log N). To say that sorting a dictionary is O(N log N) does *not* imply that string comparisons are O(1). It only implies that string comparisons don't depend on the size of the dictionary. Comparing:

s1 = 'a'*30002
s2 = 'a'*30001 + 'b'

takes the same amount of time whether they are in a dictionary of 100 words or one of 10 words. For the purposes of calculating the algorithmic complexity of sorting a large list of words, we can treat comparisons as an elementary operation even though they actually aren't. which is definitely false. Most comparisons will abort *very* early (after the first character). You are making unjustified assumptions about the distribution of letters in the words.
This might be a list of long chemical compounds where the words typically differ only in their suffix. It might be a list of people with titles: Herr Professor Frederick Schmidt Herr Professor Frederick Wagner ... But either way, it doesn't matter for the Big Oh behaviour of sorting the words. I've just written a program to demonstrate this (below it's attached). I have no idea why you think this program demonstrates anything about string equality comparisons. What you are counting is not the average number of comparisons for string equality, but the number of comparisons when sorting. These are *completely different*, because sorting a list does not require testing each string against every other string. Whatever you have measured, it has nothing to do with the Big Oh complexity of string comparisons. It seems to me that you have misunderstood the purpose of Big Oh notation. It gives an asymptotic upper bound on the amount of work done, ignoring all lower terms and multiplicative factors, not an exact formula. If something is O(N), then this tells you: 1) the number of algorithmic steps is proportional to N, plus or minus some terms which are less than N (e.g. sqrt(N), or log(N), or 1/N, or some constant, etc.); 2) if you ignore those other terms, and keep all other influences equal, then doubling N will *at worst* double the number of algorithmic steps; 3) if all those algorithmic steps take exactly the same amount of time, then doubling N will take twice as much time. But of course *none* of those assumptions hold for measurements of actual elapsed time. In real life, operations rarely take constant time, you don't always see worst case behaviour, and the lower terms are not necessarily insignificant enough to ignore. [...] So, interestingly, in this real-world case(tm), the complexity does not scale with O(n). It actually goes down (which is probably due to the limit amount of words of higher length). 
I would guess that it probably has more to do with the ability of the timsort algorithm to exploit local order within a list and avoid performing comparisons. In summary, let's please not debate feelings or such. I have no idea what feelings you are referring to. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
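Steven's point that the cost of a comparison depends on the strings themselves, not on whatever container they live in, is easy to see with timeit (a sketch of my own; the absolute numbers are machine-dependent, which is why no expected timings are shown):

```python
import timeit

# Equality of nearly-identical strings must scan essentially all N
# characters, so the time grows roughly linearly with N -- regardless
# of what list or dict the strings happen to be stored in.
for n in (10_000, 100_000, 1_000_000):
    s1 = 'a' * n + 'a'
    s2 = 'a' * n + 'b'
    t = timeit.timeit('s1 == s2',
                      globals={'s1': s1, 's2': s2},
                      number=200)
    print(n, t)
```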
Re: Comparing strings from the back?
On 05.09.2012 04:18, Neil Hodgson wrote: The memcpy patch was controversial as it broke Adobe Flash which assumed memcpy was safe like memmove. Adobe Flash was broken before, making an assumption that is not guaranteed by the standard. The patch only showed the problem. Regards, Johannes -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 05.09.2012 16:30, Steven D'Aprano wrote: Since these are *completely different Ns*, you can't combine them to get O(N**2 log N) as the overall algorithmic complexity. Doing so is sheer nonsense. If you state it like this: Yes, you are correct here. You are making unjustified assumptions about the distribution of letters in the words. This might be a list of long chemical compounds where the words typically differ only in their suffix. It might be a list of people with titles: Actually, I'm not. I'm stating exactly what assumptions I'm making to get my calculation. I'm comparing *random* character strings or bitstrings. You, on the other hand, are making vague assumptions which you do not care to formalize, and yet you claim that "the number of comparisons is equally likely to be 1, 2, 3, ..., N. The average then is..." without any explanation for this. At all. Herr Professor Frederick Schmidt Herr Professor Frederick Wagner ... Is your assumption that we're comparing words that have the common prefix "Herr Professor Frederick"? Why don't you clearly state what you're comparing? In the example you gave in the other thread you claimed "Note that this average assumes the strings are completely random.", yet now you're talking about specific word patterns. I have no idea why you think this program demonstrates anything about string equality comparisons. What you are counting is not the average number of comparisons for string equality, but the number of comparisons when sorting. These are *completely different*, because sorting a list does not require testing each string against every other string. You're wrong. It's not counting the number of string comparisons, it's counting the number of character comparisons. Which is exactly what we're talking about in this thread, are we not? The sorting of the list is just to show some example where lots of strings are compared. So, interestingly, in this real-world case(tm), the complexity does not scale with O(n).
It actually goes down (which is probably due to the limited amount of words of higher length). I would guess that it probably has more to do with the ability of the timsort algorithm to exploit local order within a list and avoid performing comparisons. Again, it's not counting the number of string comparisons. It's counting the average number of character comparisons per string comparison. The total amount of string comparisons has nothing to do with it. In summary, let's please not debate feelings or such. I have no idea what feelings you are referring to. I'm referring to "the number of comparisons is equally likely to be 1, 2, 3, ..., N. The average then is..." without any deduction or rationale. Without any explanation or assumption. Regards, Johannes -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
Oscar Benjamin wrote: On 5 September 2012 10:48, Peter Otten __pete...@web.de wrote: def count_common(a, b): [sorry for seriously broken implementation] This function will return 1 if the first character differs. It does not count the number of common characters but rather the more relevant quantity, which is the number of comparisons required to decide if the strings are equal. I remember stumbling over the non-zero frequency for len(word) but somehow plodded on with beautifying the broken results :(

def count_common(a, b):
    """
    >>> count_common("alpha", "beta")
    0
    >>> count_common("", "")
    0
    >>> count_common("alpha", "argument")
    1
    >>> count_common("alpha", "alpha")
    5
    >>> count_common("alpha", "alphx")
    4
    """
    i = 0
    for x, y in zip(a, b):
        if x != y:
            break
        i += 1
    return i

$ python3 reverse_compare2.py compare /usr/share/dict/words
499500 combinations
average common chars (head) 0.0535
average common chars (tail) 0.322
comparing every pair in a sample of 1000 8-char words
taken from '/usr/share/dict/words'

head
0: 477371
1:  18310 **
2:   3191
3:    523
4:     79
5:     15
6:     10
7:      1

tail
0: 385069
1:  78742
2:  26734
3:   7462 *
4:   1297
5:    168
6:     22
7:      6

-- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote: [...] You are making unjustified assumptions about the distribution of letters in the words. This might be a list of long chemical compounds where the words typically differ only in their suffix. It might be a list of people with titles: Actually, I'm not. I'm stating exactly what assumptions I'm making to get my calculation. I'm comparing *random* character strings or bitstrings. Excuse me, you are not. You are comparing English words which are highly non-random. You, on the other hand, are making vague assumptions which you do not care to formalize, and yet you claim that the number of comparisons is equally likely to be 1, 2, 3, ..., N. The average then is (N+1)/2. Without any explanation for this. At all. I will accept that my explanation was not good enough, but I strongly disagree that I gave no explanation at all. Herr Professor Frederick Schmidt Herr Professor Frederick Wagner ... Is your assumption that we're comparing words that have the common prefix "Herr Professor Frederick"? No, I am pointing out that *your* assumption that most string comparisons will halt close to the beginning of the string is an invalid assumption. Your assumption only holds for some non-random strings. [...] I have no idea why you think this program demonstrates anything about string equality comparisons. What you are counting is not the average number of comparisons for string equality, but the number of comparisons when sorting. These are *completely different*, because sorting a list does not require testing each string against every other string. You're wrong. It's not counting the number of string comparisons, it's counting the number of character comparisons. I didn't say it was. Correct. But only out of an extremely limited subset of all possible string comparisons, namely those very few that happen when sorting lists of English words using a very specific algorithm, namely timsort. 
Which is exactly what we're talking about in this thread, are we not? The sorting of the list is just to show some example where lots of strings are compared. It is not good enough to extract a non-random subset of strings (only English words) and then decrease the data set even further by only comparing an extremely non-uniform subset of those strings: specifically those few string comparisons that happen to occur using timsort. Whatever figure you have calculated by taking this non-random selection, it is irrelevant to the question of the average performance of string equality given random strings. So, interestingly, in this real-world case(tm), the complexity does not scale with O(n). It actually goes down (which is probably due to the limited number of longer words). I would guess that it probably has more to do with the ability of the timsort algorithm to exploit local order within a list and avoid performing comparisons. Again, it's not counting the number of string comparisons. It's counting the average number of character comparisons per string comparison. The total number of string comparisons has nothing to do with it. I didn't say anything about counting string comparisons. But you are wrong -- the numbers you get are *strongly* influenced by the number of string comparisons performed. That is just one of the reasons why the results you get are biased. Let me spell it out for you: given a list of five-letter words:

['fruit', 'apple', 'claim', 'pears', ... , 'toast']

sorting the list may compare:

'fruit' with 'apple'
'apple' with 'claim'

but almost certainly will not compare:

'apple' with 'toast'

and it definitely won't compare:

'zzzax' with 'zzzaq'

because strings like those never get into the list in the first place. For the sake of simple calculations, let's pretend that there are only 1000 five-letter strings possible. We want to know how many character comparisons are done on average when testing two random five-letter strings for equality. 
To calculate this average, you must compare every string to every other string, *including* itself. This gives 1000*1000 = one million equality tests to be performed. For each equality test, we count the number of character comparisons performed, sum all those counts, and divide by 1e6. That is the average number of char comparisons for random strings of five letters. But that's not what you do. First you eliminate 900 out of the 1000 possible strings by only sampling English words -- strings like xxoij cannot possibly be selected. So you are left with the highly non-random subset of 100*100 = 10,000 equality tests. But you haven't finished. Instead of checking all 10,000 equality tests, you then decide to only check the 2000 tests that happen to randomly occur when using the timsort algorithm to sort a list. So from the population of one million possible equality tests, you throw away the vast majority, then average the *strongly* non-random few that remain.
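The averaging procedure described above can be sketched in a few lines of Python, shrunk to a two-letter alphabet so every pair can be checked exhaustively. The function names here are mine, not from the thread:

```python
from itertools import product

def char_comparisons(a, b):
    # One comparison per position until the first mismatch;
    # N comparisons when the equal-length strings are equal.
    count = 0
    for x, y in zip(a, b):
        count += 1
        if x != y:
            break
    return count

def average_comparisons(alphabet, length):
    # Compare every string with every string, *including* itself,
    # exactly as the post prescribes, and average the counts.
    words = [''.join(p) for p in product(alphabet, repeat=length)]
    total = sum(char_comparisons(a, b) for a in words for b in words)
    return total / len(words) ** 2

# With two letters the average works out to 2 - 2**-(length - 1):
# it approaches a constant 2, not length/2.
```

Running this for lengths 2 and 4 gives 1.5 and 1.875, so the exhaustive average over all pairs tends to a constant rather than growing with the string length.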
Re: Comparing strings from the back?
In news.gmane.comp.python.general, you wrote: On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote: [...] You are making unjustified assumptions about the distribution of letters in the words. This might be a list of long chemical compounds where the words typically differ only in their suffix. It might be a list of people with titles: Actually, I'm not. I'm stating exactly what assumptions I'm making to get my calculation. I'm comparing *random* character strings or bitstrings. Excuse me, you are not. You are comparing English words which are highly non-random. Evidently we have different understandings of what 'random' means. I don't think it's unreasonable to say that strings drawn uniformly from the set of all strings in the English language (having a given number of characters) are random. The distribution is not uniform over the set of all possible character strings but it is still random. I think Johannes deliberately chose these strings to emulate a particular kind of 'real' distribution of strings that might occur in practice. You, on the other hand, are making vague assumptions which you do not care to formalize, and yet you claim that the number of comparisons is equally likely to be 1, 2, 3, ..., N. The average then is (N+1)/2. Without any explanation for this. At all. I will accept that my explanation was not good enough, but I strongly disagree that I gave no explanation at all. Herr Professor Frederick Schmidt Herr Professor Frederick Wagner ... Is your assumption that we're comparing words that have the common prefix "Herr Professor Frederick"? No, I am pointing out that *your* assumption that most string comparisons will halt close to the beginning of the string is an invalid assumption. Your assumption only holds for some non-random strings. I think you have this backwards. The case where this assumption is provably true is precisely that of random strings. 
To be clear, when I say 'random' in this context I mean that each character is chosen independently from the same probability distribution over the possible characters regardless of which index it has in the string and regardless of what the other characters are (IID). In this case the probability that comparison terminates at the jth character decreases exponentially with j. This means that for large strings the expected number of character comparisons is independent of the number of characters in the string as the probability of reaching the later parts of the string is too small for them to have any significant effect. This is provable and applies regardless of how many possible characters there are and whether or not each character is equally likely (except for the pathological case where one character has a probability of 1). For strings from 'real' distributions it is harder to make statements about the 'average case' and it is possible to construct situations where the comparison would regularly need to compare a common prefix. However, to get asymptotic performance worse than O(1) it is not sufficient to say that there may be a common prefix such as 'Herr' in the distribution of strings. It is necessary that, somehow, the common prefix is likely to grow as the size of the strings grows. For example, the set of all strings of length N whose first N//2 characters are always 'a' and whose remaining characters are chosen IID would lead to O(N) performance. This is why the file paths example chosen at the start of this thread is a good one. If a program is dealing with a number of large strings representing file paths then it is not uncommon that many of those paths would refer to files in the same deeply nested directory and hence compare equal for a significant number of characters. This could lead to average case O(N) performance. 
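Both cases above are easy to check numerically. The following rough sketch (function names and parameters are mine) shows that for IID strings the simulated average stays near the constant 1/(1-p), while a long shared prefix, as in the file path example, forces the scan through the whole prefix:

```python
import random

def comparisons(a, b):
    # Character comparisons made by a left-to-right equality scan.
    n = 0
    for x, y in zip(a, b):
        n += 1
        if x != y:
            break
    return n

def avg_iid(length, alphabet='ab', trials=10000, seed=0):
    # Average comparisons between pairs of IID random strings.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = ''.join(rng.choice(alphabet) for _ in range(length))
        b = ''.join(rng.choice(alphabet) for _ in range(length))
        total += comparisons(a, b)
    return total / trials

# With a two-letter alphabet (p = 1/2) the average is near 2 whether
# the strings have 10 or 100 characters, while two paths sharing a
# 50-character prefix cost 51 comparisons every time.
```

This matches the argument above: the expected cost is independent of length for IID strings, but O(N) when the common prefix grows with N.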
I think it's appropriate to compare string comparison with dict insertion:

Best case: O(1) (no hash collisions)
Worst case: O(N) (collides with every key)
Average case: O(1) (as long as you don't use pathological data)

The only difference with string comparison is that there are some conceivable, non-malicious cases where the pathological data can occur (such as with file paths). However, I suspect that out of all the different uses of python strings these cases are a small minority. In saying that, it's not inconceivable that someone could exploit string comparison by providing pathological data to make normally O(1) operations behave as O(N). If I understand correctly it was precisely this kind of problem with dict insertion/lookup that led to the recent hash-seed security update. Oscar -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 04/09/2012 05:56, Dan Sommers wrote: That said, if I really wanted bloodshed, I would propose = for the third string-equality operator! ;-) Dan Dan agent provocateur Sommers? :) -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 04/09/2012 02:54, Roy Smith wrote: There's been a bunch of threads lately about string implementations, and that got me thinking (which is often a dangerous thing). Let's assume you're testing two strings for equality. You've already done the obvious quick tests (i.e they're the same length), and you're down to the O(n) part of comparing every character. I'm wondering if it might be faster to start at the ends of the strings instead of at the beginning? If the strings are indeed equal, it's the same amount of work starting from either end. But, if it turns out that for real-life situations, the ends of strings have more entropy than the beginnings, the odds are you'll discover that they're unequal quicker by starting at the end. It doesn't seem un-plausible that this is the case. For example, most of the filenames I work with begin with /home/roy/. Most of the strings I use as memcache keys have one of a small number of prefixes. Most of the strings I use as logger names have common leading substrings. Things like credit card and telephone numbers tend to have much more entropy in the trailing digits. On the other hand, hostnames (and thus email addresses) exhibit the opposite pattern. Anyway, it's just a thought. Has anybody studied this for real-life usage patterns? I'm also not sure how this work with all the possible UCS/UTF encodings. With some of them, you may get the encoding semantics wrong if you don't start from the front. Good grief, what timing!!! The dust has yet to settle over the Flexible string representation, unicode, typography, ... thread and you ask this. I'll just cross my fingers and hope that people stick with facts, realistic benchmarks etc, and not good old fashioned FUD. -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
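For concreteness, the proposal in the original post might be sketched as a hypothetical helper like the one below. This is purely illustrative of the idea of scanning from the end; it is not how any Python implementation actually compares strings:

```python
def eq_from_back(a, b):
    # Equality test scanning from the last character backwards,
    # after the cheap length check the original post mentions.
    if len(a) != len(b):
        return False
    for i in range(len(a) - 1, -1, -1):
        if a[i] != b[i]:
            return False
    return True
```

For strings like file paths with a long common prefix ('/home/roy/...'), this finds the mismatch on the first character examined instead of the last.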
Re: Comparing strings from the back?
Roy Smith r...@panix.com writes: There's been a bunch of threads lately about string implementations, and that got me thinking (which is often a dangerous thing). Let's assume you're testing two strings for equality. You've already done the obvious quick tests (i.e they're the same length), and you're down to the O(n) part of comparing every character. I'm wondering if it might be faster to start at the ends of the strings instead of at the beginning? If the strings are indeed equal, it's the same amount of work starting from either end. But, if it turns out that for real-life situations, the ends of strings have more entropy than the beginnings, the odds are you'll discover that they're unequal quicker by starting at the end. I guess we all have examples with specific entropy distribution. Going backwards on long strings can be profitable with file paths. If that is the case, and if you spend a lot of time comparing strings, why not store them in reverse? It doesn't seem un-plausible that this is the case. For example, most of the filenames I work with begin with /home/roy/. Most of the strings I use as memcache keys have one of a small number of prefixes. Most of the strings I use as logger names have common leading substrings. In that case I would split the paths, and use some sort of string-id, or use intern() and then compare components with is instead of ==. (Basically, turning the string of chars into a strings of ids/ints/pointers.) Things like credit card and telephone numbers tend to have much more entropy in the trailing digits. On the other hand, hostnames (and thus email addresses) exhibit the opposite pattern. Yes, real-world is a mess. Anyway, it's just a thought. Has anybody studied this for real-life usage patterns? I've tried the intern() trick several times with lists of strings, and was satisfied, but this was in specific cases (with a finite set of strings). 
For short strings of chars (up to, say, a hundred characters), I've never found anything significantly faster than the default sweep (that was in C/C++, not python). For longer and/or more structured strings/lists, making use of the structure is a good idea, but I see no general pattern here (as you note). I'm also not sure how this works with all the possible UCS/UTF encodings. With some of them, you may get the encoding semantics wrong if you don't start from the front. If you're testing for equality, I can't see how this could matter, even with variable-length encodings. If you're comparing different encodings, then you need different access methods to random characters (but this doesn't affect the algorithm). If you're using a variable-length encoding, e.g. UTF-8, accessing a character at a specific position is not possible in constant time. -- Alain. -- http://mail.python.org/mailman/listinfo/python-list
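The intern() trick mentioned above might look like the following in Python 3, where the builtin moved to sys.intern. The helper names are mine, purely for illustration:

```python
import sys

def intern_path(path):
    # Split a path into components and intern each one, so that equal
    # components become the very same string object.
    return tuple(sys.intern(part) for part in path.split('/'))

def paths_equal(p1, p2):
    # Compare component-wise with 'is' (object identity), a pointer
    # comparison instead of a character-by-character one.
    return len(p1) == len(p2) and all(a is b for a, b in zip(p1, p2))

p1 = intern_path('/home/roy/project/file.txt')
p2 = intern_path('/home/roy/project/file.txt')
p3 = intern_path('/home/roy/project/other.txt')
```

As the post says, this pays off when the same finite set of strings is compared many times: the O(N) character scan happens once at interning time, and every later comparison is O(1) per component.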
Re: Comparing strings from the back?
On 04.09.2012 04:17, Steven D'Aprano wrote: On average, string equality needs to check half the characters in the string. How do you arrive at that conclusion? When comparing two random strings, I just derived

n = (256 / 255) * (1 - 256 ^ (-c))

where n is the average number of character comparisons and c is the length of the strings. The rationale is as follows: the first character has to be compared in any case. The second with a probability of 1/256, the third with 1/(256^2), and so on. Best regards, Johannes -- "Where again had you predicted the quake EXACTLY?" "Well, at least not publicly!" Ah, the newest and to this day most ingenious prank of our great cosmologists: the secret prediction. - Karl Kaos on Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org -- http://mail.python.org/mailman/listinfo/python-list
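Johannes' closed form can be checked against the term-by-term sum his rationale describes. A quick sketch (function names are mine):

```python
def avg_comparisons_sum(c, s=256):
    # The k-th character (0-based) is compared with probability s**-k,
    # so the expected comparison count is a finite geometric series.
    return sum(s ** (-k) for k in range(c))

def avg_comparisons_closed(c, s=256):
    # Closed form of the same series: (s/(s-1)) * (1 - s**-c).
    return (s / (s - 1)) * (1 - s ** (-c))
```

Both agree for every c, and for c = 1 they give exactly 1 comparison; as c grows the value saturates at 256/255, barely above a single comparison.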
Re: Comparing strings from the back?
On Tue, 04 Sep 2012 18:32:57 +0200, Johannes Bauer wrote: On 04.09.2012 04:17, Steven D'Aprano wrote: On average, string equality needs to check half the characters in the string. How do you arrive at that conclusion? Take two non-empty strings of the same length, N. If the strings are equal, you have to make N comparisons to be sure they are equal (you check all N pairs of characters). If they are unequal, you have to check each pair of characters up to the first pair that are different:

def string_equality(astr, bstr):
    # Assumes lengths are equal.
    for achr, bchr in zip(astr, bstr):
        if achr != bchr:
            return False
    return True

For the unequal case, how many comparisons do you do? Ahead of time, we know very little about the strings. We don't even know how many possible characters there are (are they English alphanumeric, ASCII, Greek, Thai, or from the full range of 1114111 Unicode code points?) or what their probability distribution is. A reasonable, conservative assumption is to calculate the largest possible value of the average for random strings. That largest value occurs when the alphabet is as small as possible, namely two characters. In practice, strings come from a larger alphabet, up to 1114111 different characters for full Unicode strings, so the average for them will be less than the average we calculate now. So for unequal strings, the number of comparisons is equally likely to be 1, 2, 3, ..., N. The average then is:

sum([1, 2, 3, ..., N])/N

which by a bit of simple algebra works out to be (N+1)/2, or half the characters as I said. (Note that this average assumes the strings are completely random. In practice, strings tend to come from strongly non-uniform distributions of characters. For instance, in English texts, 'e' is much more likely than 'q' and most characters are vanishingly rare, and in practice, strings are likely to be highly non-random.) -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing strings from the back?
On 4 September 2012 19:07, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Tue, 04 Sep 2012 18:32:57 +0200, Johannes Bauer wrote: On 04.09.2012 04:17, Steven D'Aprano wrote: On average, string equality needs to check half the characters in the string. How do you arrive at that conclusion? Take two non-empty strings of the same length, N. If the strings are equal, you have to make N comparisons to be sure they are equal (you check all N pairs of characters). If they are unequal, you have to check each pair of characters up to the first pair that are different:

def string_equality(astr, bstr):
    # Assumes lengths are equal.
    for achr, bchr in zip(astr, bstr):
        if achr != bchr:
            return False
    return True

For the unequal case, how many comparisons do you do? Ahead of time, we know very little about the strings. We don't even know how many possible characters there are (are they English alphanumeric, ASCII, Greek, Thai, or from the full range of 1114111 Unicode code points?) or what their probability distribution is. A reasonable, conservative assumption is to calculate the largest possible value of the average for random strings. That largest value occurs when the alphabet is as small as possible, namely two characters. In practice, strings come from a larger alphabet, up to 1114111 different characters for full Unicode strings, so the average for them will be less than the average we calculate now. So for unequal strings, the number of comparisons is equally likely to be 1, 2, 3, ..., N. The average then is: What?

sum([1, 2, 3, ..., N])/N

which by a bit of simple algebra works out to be (N+1)/2, or half the characters as I said. (Note that this average assumes the strings are completely random. In practice, strings tend to come from strongly non-uniform distributions of characters. For instance, in English texts, 'e' is much more likely than 'q' and most characters are vanishingly rare, and in practice, strings are likely to be highly non-random.) 
If the strings are 'completely random' (by which I assume you mean that each character is IID) then the probability of a match for the character at any one index is the same as the probability for a match at any other index. Let's say the probability for a match is p and that p < 1. Then for the first comparison:

1) with probability (1 - p) we terminate the loop after 1 comparison.
2) with probability p we continue to the second comparison.

The second comparison occurs with probability p (from 2 above) and if we reach this point then:

1) with probability (1 - p) we terminate the loop after this second comparison.
2) with probability p we continue to the third comparison.

The probability of reaching the second comparison is p and the probability of terminating at this comparison *if we reach it* is (1-p). So the probability from the outset that we terminate at the second comparison is p*(1 - p).

Prob(1 comparison) = (1-p) > p*(1-p) = Prob(2 comparisons) (since p < 1)

This can easily be extended by induction or otherwise to show that the probability of terminating after N comparisons decreases as N increases. In fact, since it decreases by a factor of p each time, even if p is 1/2 (as would be the case if there were two equally likely characters) the probability of continuing to N comparisons becomes vanishingly small very quickly as N increases. In practice p is smaller for most ascii/bmp/unicode problems so the probability vanishes even more quickly. If you follow this through you should arrive at Johannes' result above. Applying this to real data is obviously more complicated since successive characters are not independent, but it would be a fairly unusual situation if you were not more likely to stop at an earlier point in the comparison loop than a later one. Oscar. -- http://mail.python.org/mailman/listinfo/python-list
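The termination probabilities derived above can be verified exhaustively for a small alphabet with two equally likely characters (p = 1/2). A sketch, with names of my own choosing:

```python
from itertools import product

def stop_index(a, b):
    # Number of character comparisons a left-to-right scan performs.
    for k, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return k
    return len(a)

def termination_distribution(length, alphabet='ab'):
    # Probability of stopping after exactly k comparisons, over all
    # ordered pairs of strings of the given length.
    words = [''.join(w) for w in product(alphabet, repeat=length)]
    counts = {}
    for a in words:
        for b in words:
            k = stop_index(a, b)
            counts[k] = counts.get(k, 0) + 1
    total = len(words) ** 2
    return {k: v / total for k, v in sorted(counts.items())}

# termination_distribution(3) gives {1: 0.5, 2: 0.25, 3: 0.25}: each
# step is half as likely as the previous, with the final position
# absorbing both "mismatch at the end" and "strings equal".
```

The geometric decay by a factor of p at each step is exactly what the induction argument above predicts.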