Re: Comparing strings from the back?

2012-09-18 Thread Ethan Furman

Neil Hodgson wrote:

Ethan Furman:

*plonk*


   I can't work out who you are plonking. While more than one of the 
posters on this thread seem worthy of a good plonk, by not including 
sufficient context, you've left me feeling puzzled. Is there a guideline 
for this in basic netiquette?


You're right, my apologies.  Dwight Hutto is the one I plonked.  His 
signal to noise ratio seems incredibly low.


alex23 can definitely be abrasive, but his ratio is high.

(Now I'm wondering what that says about me as far as who I agree with 
and why... I'll have to think about that.)


~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-18 Thread Dwight Hutto
On Tue, Sep 18, 2012 at 11:12 AM, Ethan Furman et...@stoneleaf.us wrote:
 Neil Hodgson wrote:

 Ethan Furman:

 *plonk*


I can't work out who you are plonking. While more than one of the
 posters on this thread seem worthy of a good plonk, by not including
 sufficient context, you've left me feeling puzzled. Is there a guideline for
 this in basic netiquette?


 You're right, my apologies.  Dwight Hutto is the one I plonked.  His signal
 to noise ratio seems incredibly low.

Is this just because of the lack of context, or because of the content
of my response to the OP following the question?

And with all the netiquette talk being passed around, why isn't this a
new thread instead of hijacking the OP's original question, and
engaging in bickering about netiquette that goes
on all the time?

And it seems redundant to argue this topic other than to comment, and
let it pass, unless there seems to be a legitimate reason why
someone 'disobeyed' posting policy.

 alex23 can definitely be abrasive, but his ratio is high.

 (Now I'm wondering what that says about me as far as who I agree with and
 why... I'll have to think about that.)

 ~Ethan~
 --
 http://mail.python.org/mailman/listinfo/python-list



-- 
Best Regards,
David Hutto
CEO: http://www.hitwebdevelopment.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-18 Thread Dwight Hutto
 sufficient context, you've left me feeling puzzled. Is there a guideline for
 this in basic netiquette?


www.woodgate.org/FAQs/netiquette.html



-- 
Best Regards,
David Hutto
CEO: http://www.hitwebdevelopment.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-18 Thread Dwight Hutto

 You're right, my apologies.  Dwight Hutto is the one I plonked.
You can call me David. I go by my middle name.

And it seems to me I made some valid points about a few simple trimmings
of postings that didn't seem necessary in the context of a small,
quick conversation.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-18 Thread Chris Angelico
On Wed, Sep 19, 2012 at 2:17 AM, Dwight Hutto dwightdhu...@gmail.com wrote:

 You're right, my apologies.  Dwight Hutto is the one I plonked.
 You can call me David. I go by my middle name.

You're most often going to be addressed by the name that's given in
your post headers. In this case David has been reduced to an
initial, and is visible only in your email address, whereas Dwight
is right there in your real-name.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-18 Thread Dwight Hutto
 You're most often going to be addressed by the name that's given in
 your post headers. In this case David has been reduced to an
 initial, and is visible only in your email address, whereas Dwight
 is right there in your real-name.

My sig says David, but it was just to let him know he can call me by
my used name.


-- 
Best Regards,
David Hutto
CEO: http://www.hitwebdevelopment.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-18 Thread Mark Lawrence

On 18/09/2012 21:40, Dwight Hutto wrote:

You're most often going to be addressed by the name that's given in
your post headers. In this case David has been reduced to an
initial, and is visible only in your email address, whereas Dwight

  My sig says David, but it was just to let him know he can call me by
my used name.




Any particular him?

--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-18 Thread Steven D'Aprano
On Tue, 18 Sep 2012 12:17:40 -0400, Dwight Hutto wrote:

 You can call me David. I go by my middle name.

If you want to be known as David, why do you give your name as Dwight? In 
your email client or newsreader, set your name as David and attributions 
will be to David, and people will know to call you David.

Otherwise, people will continue to call you Dwight.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-17 Thread alex23
On Sep 17, 2:05 pm, Chris Angelico ros...@gmail.com wrote:
 I would hate to see people leaned
 hard on to make their speech entirely politically correct. Use common
 sense, on both sides. Don't be offensive, and don't be offended.

While I would like to agree, this would require there to be no power
disparities between the parties involved, which isn't always the case.
I hate the term "politically correct", it always seems to be used in
an eye-rolling, "what a load of nonsense" way, probably because I
believe there's at least some validity in the Sapir/Whorf hypothesis
that language dictates thought.

If Ramit's name was "Barbie" and e was told to "get back to your dream
home" or some other guff, it would be quite rightly viewed as sexist,
whether intentional or not. Such behaviour has been called out on this
list in the past, and I'm genuinely surprised to see so little
reaction to this.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-17 Thread Ethan Furman

*plonk*
--
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-17 Thread Neil Hodgson

Ethan Furman:

*plonk*


   I can't work out who you are plonking. While more than one of the 
posters on this thread seem worthy of a good plonk, by not including 
sufficient context, you've left me feeling puzzled. Is there a guideline 
for this in basic netiquette?


   Neil

--
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-16 Thread alex23
On Sep 15, 1:10 pm, Dwight Hutto dwightdhu...@gmail.com wrote:
 On Fri, Sep 14, 2012 at 6:43 PM, Prasad, Ramit
  Since I was unsure myself if you were trying to be offensive or racist,
  I would disagree with "everyone can know it wasn't meant as racist".

 If you're unsure if it was racist, you should err on the side of
 caution.

If your comments are mistakable as racism, maybe *you* should be more
cautious and *not make them*.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-16 Thread Chris Angelico
On Mon, Sep 17, 2012 at 11:11 AM, alex23 wuwe...@gmail.com wrote:
 On Sep 15, 1:10 pm, Dwight Hutto dwightdhu...@gmail.com wrote:
 On Fri, Sep 14, 2012 at 6:43 PM, Prasad, Ramit
  Since I was unsure myself if you were trying to be offensive or racist,
  I would disagree with everyone can know it wasn't meant as racist.

 If you're unsure if it was racist, you should err on the side of
 caution.

 If your comments are mistakable as racism, maybe *you* should be more
 cautious and *not make them*.

That applies to the most obvious examples (for instance, there's a
line in a 19th century opera that uses a six-letter word starting with
'n' to refer to a dark-skinned person - for obvious reasons, that line
is usually changed in modern performances, even though it was
descriptive and not offensive when the opera was written - like
referring to a Caucasian). However, there are many things which could
be misinterpreted as racist, and I would hate to see people leaned
hard on to make their speech entirely politically correct. Use common
sense, on both sides. Don't be offensive, and don't be offended.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-14 Thread alex23
On Sep 14, 3:39 pm, Dwight Hutto dwightdhu...@gmail.com wrote:
 Please explain any logic whatsoever that would give you that conclusion.

Well, this:

 I think you're referring to a play on words(ramit).

Using foreign names derogatively is a common tactic of the racist.

 Ain't I so punny.

Not really, no.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-14 Thread Dwight Hutto
 I think you're referring to a play on words(ramit).

 Using foreign names derogatively is a common tactic of the racist.

Not really. But nice spin on my pun to make me look bad.

Keep trying, and maybe you'll come up with an insult/propaganda
that's less obvious to the viewer that you're a liar, and a person who
couldn't end this without throwing a blatant race card.

It's similar to if I said, this is real 'queer' of you to do ya big
pansy, and next you'll be calling me a homophobe.

Try harder, because no one would ever believe I was a racist, and if
they did, it's an uninformed decision based off of cherry picken
phrases out of context.

Very text book propaganda of you though.

-- 
Best Regards,
David Hutto
CEO: http://www.hitwebdevelopment.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-14 Thread alex23
On Sep 14, 6:04 pm, Dwight Hutto dwightdhu...@gmail.com wrote:
  Using foreign names derogatively is a common tactic of the racist.

 Not really. But nice spin on my pun to make me look bad.

It actually *is* common behaviour of racists.

 It's similar to if I said, this is real 'queer' of you to do ya big
 pansy, and next you'll be calling me a homophobe.

Well, *yes*. Because your choice of that terminology as derogatory
shows you view it as derogatory.

 Try harder, because no one would ever believe I was a racist, and if
 they did, it's an uninformed decision based off of cherry picken
 phrases out of context.

Oh, so *now* context is important.

 Very text book propaganda of you though.

I don't think that word means what you think it does.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-14 Thread Dwight Hutto
On Fri, Sep 14, 2012 at 4:20 AM, alex23 wuwe...@gmail.com wrote:
 On Sep 14, 6:04 pm, Dwight Hutto dwightdhu...@gmail.com wrote:
  Using foreign names derogatively is a common tactic of the racist.

 Not really. But nice spin on my pun to make me look bad.

 It actually *is* common behaviour of racists.


Not if their name is ramit. What if your name was john? I'd say I'll
be right back, I have to go take a crap on the john. It's a joke about
a name, not where it originates.

 It's similar to if I said, this is real 'queer' of you to do ya big
 pansy, and next you'll be calling me a homophobe.

 Well, *yes*. Because your choice of that terminology as derogatory
 shows you view it as derogatory.

No it was a loose analogy to show that he's just trying to use
anything he can say to slam me, and everyone can know it wasn't meant
as racist.

 Try harder, because no one would ever believe I was a racist, and if
 they did, it's an uninformed decision based off of cherry picken
 phrases out of context.

 Oh, so *now* context is important.

Never said it wasn't. The whole conversation to me is the context, and
the OP usually follows it, and that's who it is usually intended for.

 Very text book propaganda of you though.

 I don't think that word means what you think it does.

from wiki:

Propaganda is a form of communication that is aimed at influencing the
attitude of a community toward some cause or position. Propaganda is
usually repeated and dispersed over a wide variety of media in order
to create the chosen result in audience attitudes.

He wants people to think I'm a racist, and you now want to see that in
me as well. It seems it has propagated, and that I know exactly what
it means.

And all over a silly post that had too little content, at the time, to
have what I said require any other posts, especially if the OP is
following the conversation.

 http://mail.python.org/mailman/listinfo/python-list



-- 
Best Regards,
David Hutto
CEO: http://www.hitwebdevelopment.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-14 Thread Steven D'Aprano
On Fri, 14 Sep 2012 01:20:53 -0700, alex23 wrote:
 On Sep 14, 6:04 pm, Dwight Hutto dwightdhu...@gmail.com wrote:
[snip]


Please don't feed the trolls.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-14 Thread alex23
On Sep 14, 6:53 pm, Dwight Hutto dwightdhu...@gmail.com wrote:
 Not if their name is ramit. What if your name was john? I'd say I'll
 be right back, I have to go take a crap on the john. It's a joke about
 a name, not where it originates.

I'd recommend reading up on white privilege but I'm pretty sure it'd
be a wasted suggestion.

  It's similar to if I said, this is real 'queer' of you to do ya big
  pansy, and next you'll be calling me a homophobe.

  Well, *yes*. Because your choice of that terminology as derogatory
  shows you view it as derogatory.

 No it was a loose analogy

You can't use something as an example to support your case, and then
dismiss it as a loose analogy when it undermines it.

 he's just trying to use anything he can say to slam me everyone
 can know it wasn't meant as racist.

The "anything" I'm using is what *you* have said. I'm not trying to
slam you, I don't even know who you are. I just have a very short
fuse for rudeness and an even shorter one for racism, even if it *was*
intended in a "hyuck hyuck, I'm so punny" way. Ignorant racism is
still racism.

 The whole conversation to me is the context

And yet:

 He wants people to think I'm a racist, and you now want to see that in
 me as well. It seems it has propagated, and that I know exactly what
 it means.

Again, so much for context. There is no "he" and "me", I'm the person
who made the original accusation and then followed up on it. That's
not propagation, so it's not propaganda.

Now, someone who starts a new thread to have a conversation *with
themselves* in some bizarre piece of performance art that seems
intended to brand as petty the *requests that he actually follow list
etiquette*...*that* is someone I'd consider a propagandist.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-14 Thread Dwight Hutto
 I'd recommend reading up on white privilege but I'm pretty sure it'd
 be a wasted suggestion.

Not really, I tend to like interdisciplinary study. But I'm a little
of everything if you like Darwin.

  It's similar to if I said, this is real 'queer' of you to do ya big
  pansy, and next you'll be calling me a homophobe.

  Well, *yes*. Because your choice of that terminology as derogatory
  shows you view it as derogatory.

 No it was a loose analogy

 You can't use something as an example to support your case, and then
 dismiss it as a loose analogy when it undermines it.


How did it undermine it? It's the same thing, just transferred to
another well known group subject to violent bigotry.

I used a play on words in a response, and because his name's foreign
that makes me a racist. There's no logic in that, other than to bring
me down, and use the worst thing you can say...playing the race card.



 he's just trying to use anything he can say to slam me ... everyone
 can know it wasn't meant as racist.

 The anything I'm using is what *you* have said. I'm not trying to
 slam you, I don't even know who you are. I just have a very short
 fuse for rudeness and an even shorter one for racism,
It wasn't rude in terms of these things that have been said about me.
Then lengthen your fuse, because I'm not a racist, I just play with
words a lot, including foreign names.
 even if it *was*
 intended in a hyuck hyuck, I'm so punny way. Ignorant racism is
 still racism.
Still trying to propagate a thought that I'm racist based on a guy who
for all I know is white, and uses the A.K.A ramit.

I have no freakin clue if ramit is an ethnic name or nickname...we're
on the internet tweedledick(badumchee, and hyuck,hyuck,hyuck)


 The whole conversation to me is the context

 And yet:

 He wants people to think I'm a racist, and you now want to see that in
 me as well. It seems it has propagated, and that I know exactly what
 it means.

 Again, so much for context. There is no he and me, I'm the person
 who made the original accusation and then followed up on it. That's
 not propagation, so it's not propaganda.
You're still trying to prove the point, when we all know I'm not a racist.


 Now, someone who starts a new thread to have a conversation *with
 themselves* in some bizarre piece of performance art that seems
 intended to brand as petty the *requests that he actually follow list
 etiquette*...*that* is someone I'd consider a propagandist.
 --

You mean like you taking over this thread to call me a racist, and go
on ad nauseam about it, when everyone can see what you're doing is
trying to prove a point you know is wrong?

Go back to debate 101, and flunk your professor if he passed you.


-- 
Best Regards,
David Hutto
CEO: http://www.hitwebdevelopment.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-14 Thread Dwight Hutto
 [snip]


 Please don't feed the trolls.

You're down here under the bridge with the rest of us trolls too, Steven. 24/7




-- 
Best Regards,
David Hutto
CEO: http://www.hitwebdevelopment.com
-- 
http://mail.python.org/mailman/listinfo/python-list


RE: Comparing strings from the back?

2012-09-14 Thread Prasad, Ramit
Dwight Hutto wrote:
 On Thu, Sep 13, 2012 at 5:17 PM, Mark Lawrence breamore...@yahoo.co.uk
 wrote:
  On 13/09/2012 21:34, Joshua Landau wrote:
 
  On 13 September 2012 20:53, Mark Lawrence breamore...@yahoo.co.uk
  wrote:
 
  On 13/09/2012 19:39, Prasad, Ramit wrote:
 
  Dwight Hutto wrote:
 
  Why don' you just time it,eit lops through incrementing thmax input/
 
 
  What? Without context I have no idea what this means.
 
 
You're wasting your time, I've been described as a jackass for having
  the
  audacity to ask for context :)
 

Mark,
Sometimes it is the style in which something is asked, and not what 
specifically is asked. ;)

 
 
  I'm pretty sure you are in the wrong, acting as if what he said didn't
  make
  sense! Just read it, he obviously was telling you to time it, as eit lops
  are inside thmax input/ which, as you should know if you *bothered to read
  the thread*, is incrementing.
 
 
  don' is short for don't, by the way.
 
  I do grovellingly apologize for my appalling breach of netiquette.  I am of
  course assuming that the rules have changed and that it's now my
  responsibility to wade back through maybe a couple of hundred responses on a
  long thread to find the context.
   I also guess that I'm never going to
  achieve my ambition of being a pot smoking hippy CEO of a web development
  company :(
  --
  Cheers.

 Cheers usually implies you're an alcoholic pressing buttons, with
 half of the rest of the coders on the net.

Dwight/David: 
Not necessarily true if you happen to be from the UK (or maybe just England). 
It seems to be a cultural signature equivalent to Thanks--at least in the 
UK (not sure about other places). 

[snip] 


Ramit


-- 
http://mail.python.org/mailman/listinfo/python-list


RE: Comparing strings from the back?

2012-09-14 Thread Prasad, Ramit
Dwight Hutto wrote:
 On Fri, Sep 14, 2012 at 4:20 AM, alex23 wuwe...@gmail.com wrote:
  On Sep 14, 6:04 pm, Dwight Hutto dwightdhu...@gmail.com wrote:
   Using foreign names derogatively is a common tactic of the racist.
 
  Not really. But nice spin on my pun to make me look bad.
 
  It actually *is* common behaviour of racists.
 
 
 Not if their name is ramit. What if your name was john? I'd say I'll
 be right back, I have to go take a crap on the john. It's a joke about
 a name, not where it originates.

Okay, so maybe not racist but instead offensive juvenile humor?
I suppose that is better...

 
  It's similar to if I said, this is real 'queer' of you to do ya big
  pansy, and next you'll be calling me a homophobe.

Um, yes! You are using 'queer' and 'pansy' as a derogatory comparison 
which falls under the definition for homophobia.

Homophobia is a range of negative attitudes and feelings toward homosexuality 
or people who are identified or perceived as being lesbian, gay, bisexual or 
transgender (LGBT). Definitions refer variably to antipathy, contempt, 
prejudice, aversion, irrational fear, and hatred.
~ First two sentences on Wikipedia

 
  Well, *yes*. Because your choice of that terminology as derogatory
  shows you view it as derogatory.

 
 No it was a loose analogy to show that he's just trying to use
 anything he can say to slam me, and everyone can know it wasn't meant
 as racist.

Since I was unsure myself if you were trying to be offensive or racist,
I would disagree with "everyone can know it wasn't meant as racist".

 
  Try harder, because no one would ever believe I was a racist, and if
  they did, it's an uninformed decision based off of cherry picken
  phrases out of context.
 
  Oh, so *now* context is important.
 
 Never said it wasn't. The whole conversation to me is the context, and
 the OP usually follows it, and that's who it is usually intended for.
 
[ snip ]

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-14 Thread Steve Howell
On Sep 6, 4:04 am, Oscar Benjamin oscar.j.benja...@gmail.com wrote:
 On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel d...@davea.name wrote:
  For random strings (as defined below), the average compare time is
  effectively unrelated to the size of the string, once the size passes
  some point.
  Define random string as being a selection from a set of characters,
  with replacement.  So if we pick some set of characters, say 10 (or
  256, it doesn't really matter), the number of possible strings is 10**N.
  The likelihood of not finding a mismatch within k characters is
  (1/10)**k.  The likelihood of actually reaching the end of the random
  string is (1/10)**N (which is the reciprocal of the number of strings,
  naturally).
  If we wanted an average number of comparisons, we'd have to calculate
  a series, where each term is a probability times a value for k.
     sum((k * 9*10**-k) for k in range(1, 10))
  Those terms very rapidly approach 0, so it's safe to stop after a few.
  Looking at the first 9 items, I see a value of 1.111
  This may not be quite right, but the value is certainly well under 2
  for a population of 10 characters, chosen randomly.  And notice that N
  doesn't really come into it.

 It's exactly right. You can obtain this result analytically from
 Johannes' formula above. Just replace 256 with 10 to get that the
 expected number of comparisons is

 (10/9)*(1 - 10**(-N))
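
A quick numerical check of that closed form against Dave's series (a
sketch, with the end-of-string term for equal strings included):

# Expected comparisons for an alphabet of 10 symbols and length N:
# sum of k * P(first mismatch at position k), plus N * P(all equal).
N = 9
series = sum(k * 9 * 10**-k for k in range(1, N + 1)) + N * 10**-N
closed = (10.0 / 9) * (1 - 10**-N)
print(series, closed)   # both ~1.111..., matching the 1.111 above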

The math here is simple, but of course the average running time is
also driven by usage patterns.

Let's say you have two independently produced strings with N number of
random bits.  The average amount of bit comparisons that it takes to
determine that the strings are different is between 1 and 2 checks for
any value of N, for the same reason that the average number of times
you need to flip a coin before either coming up heads or exhausting
your N tries is always less than 2, and pretty darn close to 2 for
even relatively small values of N.

def compute_checks_for_different_strings(n):
  # approximate expected number of bit comparisons needed to tell
  # two random n-bit strings apart
  x = 1
  p = 0.5
  for i in range(0, n):
    x += p * i
    p = p * 0.5
  print n, x

for n in [10, 100, 1000, 10000]:
  compute_checks_for_different_strings(n)

 10 1.9892578125
 100 2.0
 1000 2.0
 10000 2.0


Now let's say your string comparison function is used to verify 100-
bit passwords, and 99% of the time the password is from a legit user
who types it correctly.  Except for the 1% of the time when somebody
fat-fingers the copy/paste of their password, every strcmp call is gonna
have to compare 100 pairs of characters, or N pairs.  If the usage
pattern says that 99% of the time you're simply verifying that two
strings are equal, then the number of bits is the main driving factor
for running time.
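
Back-of-envelope arithmetic for that usage mix (a sketch using the
illustrative numbers above, not a measurement):

# Expected pair comparisons per verification call:
# 99% of calls compare equal strings (all n pairs inspected),
# 1% hit an early mismatch (~2 comparisons on average, as shown above).
n = 100
expected = 0.99 * n + 0.01 * 2
print(expected)   # 99.02 -- n dominates under this usage pattern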







-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-14 Thread Dwight Hutto
On Fri, Sep 14, 2012 at 6:43 PM, Prasad, Ramit
ramit.pra...@jpmorgan.com wrote:
 Dwight Hutto wrote:
 On Fri, Sep 14, 2012 at 4:20 AM, alex23 wuwe...@gmail.com
 wrote:
  On Sep 14, 6:04 pm, Dwight Hutto dwightdhu...@gmail.com wrote:
   Using foreign names derogatively is a common tactic of the racist.
 
  Not really. But nice spin on my pun to make me look bad.
 
  It actually *is* common behaviour of racists.
 

  Not if their name is ramit. What if your name was john? I'd say I'll
 be right back, I have to go take a crap on the john. It's a joke about
 a name, not where it originates.

 Okay, so maybe not racist but instead offensive juvenile humor?
 I suppose that is better...

No, just shop talk. You hit me, and I logarithmically hit back
...butterfly effect.



  It's similar to if I said, this is real 'queer' of you to do ya big
  pansy, and next you'll be calling me a homophobe.

 Um, yes! You are using 'queer' and 'pansy' as a derogatory comparison
 which falls under the definition for homophobia.

 Homophobia is a range of negative attitudes and feelings toward 
 homosexuality or people who are identified or perceived as being lesbian, 
 gay, bisexual or transgender (LGBT). Definitions refer variably to antipathy, 
 contempt, prejudice, aversion, irrational fear, and hatred.
 ~ First two sentences on Wikipedia
No, analogy, and a continued attack on a subject we know is propaganda.


 
  Well, *yes*. Because your choice of that terminology as derogatory
  shows you view it as derogatory.


 No it was a loose analogy to show that he's just trying to use
 anything he can say to slam me, and everyone can know it wasn't meant
 as racist.

 Since I was unsure myself if you were trying to be offensive or racist,
 I would disagree with everyone can know it wasn't meant as racist.


If you're unsure if it was racist, you should err on the side of
caution. There are many nicknames on the net. That could be an ethnic
name, or an A.K.A. Non-logical, pure speculation, that is biased that
your minority needs more anti-you.

Don't engage in conversations you're sure to lose.


  Try harder, because no one would ever believe I was a racist, and if
  they did, it's an uninformed decision based off of cherry picken
  phrases out of context.
 
  Oh, so *now* context is important.

 Never said it wasn't. The whole conversation to me is the context, and
 the OP usually follows it, and that's who it is usually intended for.



-- 
Best Regards,
David Hutto
CEO: http://www.hitwebdevelopment.com
-- 
http://mail.python.org/mailman/listinfo/python-list


RE: Comparing strings from the back?

2012-09-13 Thread Prasad, Ramit
Dwight Hutto wrote:
 Why don' you just time it,eit lops through incrementing thmax input/

What? Without context I have no idea what this means.


Ramit

--


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-13 Thread Dwight Hutto
On Thu, Sep 13, 2012 at 2:39 PM, Prasad, Ramit
ramit.pra...@jpmorgan.com wrote:
 Dwight Hutto wrote:
 Why don' you just time it,eit lops through incrementing thmax input/

 What? Without context I have no idea what this means.


 Ramit


Why don't you read the OP:


Let's assume you're testing two strings for equality.  You've already
done the obvious quick tests (i.e they're the same length), and you're
down to the O(n) part of comparing every character.

I'm wondering if it might be faster to start at the ends of the strings
instead of at the beginning?  If the strings are indeed equal, it's the
same amount of work starting from either end.  But, if it turns out that
for real-life situations, the ends of strings have more entropy than the
beginnings, the odds are you'll discover that they're unequal quicker by
starting at the end.



and this one from me:

First include len(string)/2, in order to include starting at the
center of the string, and threading/weaving by 2 processes out.

import timeit

 do the rest, and see which has the fastest time.
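
For instance, a minimal sketch of that timing experiment (illustrative
only: the helper names, string size, and pure-Python scans are made up
here so the direction of comparison is visible; the built-in ==
doesn't expose one):

import timeit

setup = '''
s = 'abc' * 100000
t = s[:-1] + 'x'      # differs only in the final character

def eq_forward(a, b):
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):                 # scan from the front
        if x != y:
            return False
    return True

def eq_backward(a, b):
    if len(a) != len(b):
        return False
    for i in range(len(a) - 1, -1, -1):    # scan from the back
        if a[i] != b[i]:
            return False
    return True
'''

print(timeit.timeit('eq_forward(s, t)', setup=setup, number=10))
print(timeit.timeit('eq_backward(s, t)', setup=setup, number=10))

For these particular strings the backward scan wins by a huge margin,
which is exactly the entropy-at-the-end effect the OP is asking about.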


Why don't take the time to read the OP, and ramit in your head?

Remember that you're in the middle of a conversation where the OP is
following as it goes along, so anyone reading the entire set of
postings should get it.

But people who just want to jump in, and assume that the only
thing that matters is one piece, without reading the entire content of
the conversation, will always have something out of context for them.

-- 
Best Regards,
David Hutto
CEO: http://www.hitwebdevelopment.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-13 Thread Mark Lawrence

On 13/09/2012 19:39, Prasad, Ramit wrote:

Dwight Hutto wrote:

Why don' you just time it,eit lops through incrementing thmax input/


What? Without context I have no idea what this means.


Ramit



You're wasting your time, I've been described as a jackass for having 
the audacity to ask for context :)


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-13 Thread Joshua Landau
On 13 September 2012 20:53, Mark Lawrence breamore...@yahoo.co.uk wrote:

 On 13/09/2012 19:39, Prasad, Ramit wrote:

 Dwight Hutto wrote:

 Why don' you just time it,eit lops through incrementing thmax input/


 What? Without context I have no idea what this means.


  You're wasting your time, I've been described as a jackass for having the
 audacity to ask for context :)


I'm pretty sure you are in the wrong, acting as if what he said didn't make
sense! Just read it, he obviously was telling you to time it, as eit lops
are inside thmax input/ which, as you should know if you *bothered to read
the thread*, is incrementing.

don' is short for don't, by the way.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-13 Thread Dwight Hutto
On Thu, Sep 13, 2012 at 4:34 PM, Joshua Landau
joshua.landau...@gmail.com wrote:
 On 13 September 2012 20:53, Mark Lawrence breamore...@yahoo.co.uk wrote:

 On 13/09/2012 19:39, Prasad, Ramit wrote:

 Dwight Hutto wrote:

 Why don' you just time it,eit lops through incrementing thmax input/


 What? Without context I have no idea what this means.


 You're wasting your time, I've been described as a jackass for having the
 audacity to ask for context :)


 I'm pretty sure you are in the wrong, acting as if what he said didn't make
 sense! Just read it, he obviously was telling you to time it, as eit lops
 are inside thmax input/ which, as you should know if you bothered to read
 the thread, is incrementing.

 don' is short for don't, by the way.

 --
 http://mail.python.org/mailman/listinfo/python-list


It's the fact that I consider the entire conversation the context,
reading the OP's posts being the primary, and the things they responded
positively to.

There are other things that get mixed up as well, like not hitting the
... in gmail, and so that content doesn't show, or hitting reply,
instead of reply all, and things getting jumbled for the others
involved in the conversation.

Then there is the problem of people saying you posted too much of the
context, or not inline with the OP, just at the end, or top posting.

I try to keep it along the line of what the OP has read, and they know
the context in which it's meant.



-- 
Best Regards,
David Hutto
CEO: http://www.hitwebdevelopment.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-13 Thread Mark Lawrence

On 13/09/2012 21:34, Joshua Landau wrote:

On 13 September 2012 20:53, Mark Lawrence breamore...@yahoo.co.uk wrote:


On 13/09/2012 19:39, Prasad, Ramit wrote:


Dwight Hutto wrote:


Why don' you just time it,eit lops through incrementing thmax input/



What? Without context I have no idea what this means.



  You're wasting your time, I've been described as a jackass for having the
audacity to ask for context :)



I'm pretty sure you are in the wrong, acting as if what he said didn't make
sense! Just read it, he obviously was telling you to time it, as eit lops
are inside thmax input/ which, as you should know if you *bothered to read
the thread*, is incrementing.

don' is short for don't, by the way.





I do grovellingly apologize for my appalling breach of netiquette.  I am 
of course assuming that the rules have changed and that it's now my 
responsibility to wade back through maybe a couple of hundred responses 
on a long thread to find the context.  I also guess that I'm never going 
to achieve my ambition of being a pot smoking hippy CEO of a web 
development company :(


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-13 Thread Dwight Hutto
On Thu, Sep 13, 2012 at 5:17 PM, Mark Lawrence breamore...@yahoo.co.uk wrote:
 On 13/09/2012 21:34, Joshua Landau wrote:

 On 13 September 2012 20:53, Mark Lawrence breamore...@yahoo.co.uk 
  wrote:

 On 13/09/2012 19:39, Prasad, Ramit wrote:

 Dwight Hutto wrote:

 Why don' you just time it,eit lops through incrementing thmax input/


 What? Without context I have no idea what this means.


   You're wasting your time, I've been described as a jackass for having
 the
 audacity to ask for context :)



 I'm pretty sure you are in the wrong, acting as if what he said didn't
 make
 sense! Just read it, he obviously was telling you to time it, as eit lops
 are inside thmax input/ which, as you should know if you *bothered to read
 the thread*, is incrementing.


 don' is short for don't, by the way.




 I do grovellingly apologize for my appalling breach of netiquette.  I am of
 course assuming that the rules have changed and that it's now my
 responsibility to wade back through maybe a couple of hundred responses on a
 long thread to find the context.
  I also guess that I'm never going to
 achieve my ambition of being a pot smoking hippy CEO of a web development
 company :(





 --
 Cheers.
Cheers usually implies you're an alcoholic pressing buttons, with
half of the rest of the coders on the net.

So don't give up hope, you might be able to take a medication that
doesn't impair your judgement with the side effects alcohol has.

And of course leaves you without the ability to read any of the
responses to say you were right in certain instances, so something
must be wrong with your mail reader, or your alcoholic mind.

Also, without reading the other posts, you're probably just placing in
a duplicate answer sometimes, which might make you seem like a copy
cat.


-- 
Best Regards,
David Hutto
CEO: http://www.hitwebdevelopment.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-13 Thread Steven D'Aprano
On Thu, 13 Sep 2012 17:06:23 -0400, Dwight Hutto wrote:

 Then there is the problem of people saying you posted too much of the
 context, or not inline with the OP, just at the end, or top posting.

The solution to "you quoted too much unnecessary verbiage" is not "quote
nothing". It is "quote only the parts that are relevant".

 I try to keep it along the line of what the OP has read, and they know
 the context in which it's meant.

You're assuming that people read your posts immediately after they read 
the post you replied to. Always imagine that your reply will be read a 
week after the post you replied to. Do you still expect the reader to 
understand what you're talking about?



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-13 Thread alex23
On Sep 14, 5:37 am, Dwight Hutto dwightdhu...@gmail.com wrote:
 Why don't take the time to read the OP, and ramit in your head?

Please, don't be a dick.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-13 Thread Chris Angelico
On Fri, Sep 14, 2012 at 1:39 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 You're assuming that people read your posts immediately after they read
 the post you replied to. Always imagine that your reply will be read a
 week after the post you replied to.

And a week is extremely generous too; these posts get archived on the
web. I *frequently* find myself hitting mailing list archives when
researching obscurities. This is also another good reason to post
follow-ups to the list, rather than in private email. You might never
be thanked, but somebody years down the track may find the question
and an associated answer.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-13 Thread Dwight Hutto
On Thu, Sep 13, 2012 at 11:48 PM, alex23 wuwe...@gmail.com wrote:
 On Sep 14, 5:37 am, Dwight Hutto dwightdhu...@gmail.com wrote:
 Why don't take the time to read the OP, and ramit in your head?

 Please, don't be a dick.



For telling him to ramit into his head that you should read the OP?




-- 
Best Regards,
David Hutto
CEO: http://www.hitwebdevelopment.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-13 Thread alex23
On Sep 14, 2:46 pm, Dwight Hutto dwightdhu...@gmail.com wrote:
 For telling him to ramit into his head that you should read the OP?

Yes. I'm not sure if it was intentionally racist, but you come across
as a bit of a dwight supremacist.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-13 Thread Dwight Hutto
On Fri, Sep 14, 2012 at 12:54 AM, alex23 wuwe...@gmail.com wrote:
 On Sep 14, 2:46 pm, Dwight Hutto dwightdhu...@gmail.com wrote:
 For telling him to ramit into his head that you should read the OP?

 Yes. I'm not sure if it was intentionally racist, but you come across
 as a bit of a dwight supremacist.

Please explain any logic whatsoever that would give you that conclusion.

Seems more like propaganda, and you're not very good at it.

I think you're referring to a play on words(ramit). Ain't I so punny.

-- 
Best Regards,
David Hutto
CEO: http://www.hitwebdevelopment.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-11 Thread Duncan Booth
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:

 But for the record, in principle string comparisons *could* be the 
 bottleneck. Example: you have 10,000 strings, which are each created
 once and stored in a list. Then you iterate over the list, comparing
 every string against every other string. And due to some weird vagary
 of the data, the strings are nearly all equal.
 
 (Why would you do this? I don't know, maybe it's a programmers'
 challenge found on the Internet, make up your own scenario...)
 
 Total number of strings created: 10,000.
 Total number of strings compared: 100,000,000.

which is exactly what I meant by doing a lot of comparisons (10,000) on
the same string. Perhaps if I'd phrased it more as "you have to be doing
many more comparisons than string creation operations" it would have
been clearer what I meant.


 The overhead of creating the strings is trivial compared to the
 overhead of comparing them, and since each string is only created once
 anyway, interning them is just a waste of time.

No, you created 10k strings many of which are equal and then did 10k
comparisons on each most of which found 'yes' they are equal. Interning
them would have reduced all the 'true' comparisons to an identity check 
at the cost of 1 hash and 1 full comparison per string. 


 so at the expense of a single dictionary
 insertion when the string is created you can get guaranteed O(1) on
 all the comparisons.
 
 What interning buys you is that "s == t" is an O(1) pointer compare if
 they are equal. But if s and t differ in the last character, __eq__
 will still inspect every character. There is no way to tell Python
 "all strings are interned, if s is not t then s != t as well".

Right if the strings differ only in the last character, but the rest of
this thread has been about how, for random strings, the not-equal case
is O(1) as well. 

-- 
Duncan Booth http://kupuguy.blogspot.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-11 Thread Duncan Booth
Oscar Benjamin oscar.j.benja...@gmail.com wrote:

 What interning buys you is that s == t is an O(1) pointer compare
 if they are equal. But if s and t differ in the last character,
 __eq__ will still inspect every character. There is no way to tell
 Python all strings are interned, if s is not t then s != t as well.

 
 I thought that if *both* strings were interned then a pointer
 comparison could decide if they were unequal without needing to check
 the characters. 
 
 Have I misunderstood how intern() works?
 

I don't think you've misunderstood how it works, but so far as I can see the
code doesn't attempt to short circuit the "not equal but interned" case.
The comparison code doesn't look at interning at all, it only looks for 
identity as a shortcut.

-- 
Duncan Booth http://kupuguy.blogspot.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-11 Thread Oscar Benjamin
On 11 September 2012 10:51, Duncan Booth duncan.booth@invalid.invalidwrote:

 Oscar Benjamin oscar.j.benja...@gmail.com wrote:

  What interning buys you is that s == t is an O(1) pointer compare
  if they are equal. But if s and t differ in the last character,
  __eq__ will still inspect every character. There is no way to tell
  Python all strings are interned, if s is not t then s != t as well.
 
 
  I thought that if *both* strings were interned then a pointer
  comparison could decide if they were unequal without needing to check
  the characters.
 
  Have I misunderstood how intern() works?
 

 I don't think you've misunderstood how it work, but so far as I can see the
 code doesn't attempt to short circuit the not equal but interned case.
 The comparison code doesn't look at interning at all, it only looks for
 identity as a shortcut.


It also doesn't seem to check if the hash values have been set. I guess the
cached hash value is only used in contexts where the hash is explicitly
desired.

That makes two optimisations that can bring worst case string comparison
down to O(1) in many contexts that are available to cpython but unused. But
then if full string comparison is already on average O(1) then the cost of
checking the interned and hash flags for every string comparison would
outweigh the benefits of avoiding the rare worst case O(N) comparisons.

Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-11 Thread Terry Reedy

On 9/11/2012 6:40 AM, Oscar Benjamin wrote:

On 11 September 2012 10:51, Duncan Booth duncan.booth@invalid.invalid
mailto:duncan.booth@invalid.invalid wrote:

Oscar Benjamin oscar.j.benja...@gmail.com
mailto:oscar.j.benja...@gmail.com wrote:

  What interning buys you is that s == t is an O(1) pointer compare
  if they are equal. But if s and t differ in the last character,
  __eq__ will still inspect every character. There is no way to tell
  Python all strings are interned, if s is not t then s != t as
well.
 
 
  I thought that if *both* strings were interned then a pointer
  comparison could decide if they were unequal without needing to check
  the characters.
 
  Have I misunderstood how intern() works?
 

I don't think you've misunderstood how it work, but so far as I can
see the
code doesn't attempt to short circuit the not equal but interned case.
The comparison code doesn't look at interning at all, it only looks for
identity as a shortcut.


It also doesn't seem to check if the hash values have been set. I guess
the cached hash value is only used in contexts where the hash is
explicitly desired.


I believe the internal use of interning and hash comparison has varied 
from release to release. However, the main use of string comparison is 
for dict keys, especially the internal dicts for namespaces and 
attributes. Since the dict lookup code needs hash values anyway, to find 
slots with possible conflicts, I am sure it does not use the generic 
comparison operators but starts with hash comparisons.
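
A simplified sketch of that lookup order (assumed for illustration;
CPython's actual dict probing is written in C and is more involved):

def slot_matches(slot_hash, slot_key, query_hash, query_key):
    # cheap integer comparison first
    if slot_hash != query_hash:
        return False
    # then the identity shortcut
    if slot_key is query_key:
        return True
    # full character comparison only when the hashes collide
    return slot_key == query_key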


Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-10 Thread Duncan Booth
Gelonida N gelon...@gmail.com wrote:

 On 09/07/2012 06:06 AM, Steven D'Aprano wrote:
 On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote:


 Also of some interest is the best case: O(1) for unequal strings (they
 differ at the first character) and O(N) for equal strings.
 
 The worst case is O(N) or N characters;
 the average case is O(1) or two characters.
 
 For denial of service attacks or systems, that are NEVER allowed to fail 
 the worst case is important.
 
 For most other cases the average complexity counts.
 
 However I still wonder for how many applications the complexity of 
 string comparisons would be the limiting factor.
 
 
and of course if you ever do find an application where that worst case 
matters there's an easy way round it: just call intern() on all the strings 
when they are created.

For the comparison to be the limiting factor you have to be doing a lot of 
comparisons on the same string (otherwise creating the string would be the 
limiting factor), so at the expense of a single dictionary insertion when 
the string is created you can get guaranteed O(1) on all the comparisons.
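
A minimal sketch of that recipe (illustrative; note that in Python 3
the builtin moved to sys.intern):

try:
    from sys import intern    # Python 3
except ImportError:
    pass                      # Python 2: intern() is a builtin

words = [intern(w) for w in ['spam', 'spa' + 'm', 'eggs']]

# Equal interned strings are the same object, so the identity check
# that == already performs settles the comparison in O(1):
assert words[0] is words[1]
assert words[0] == words[1]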

-- 
Duncan Booth http://kupuguy.blogspot.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-10 Thread Steven D'Aprano
On Mon, 10 Sep 2012 08:59:37 +, Duncan Booth wrote:

 Gelonida N gelon...@gmail.com wrote:
 
 On 09/07/2012 06:06 AM, Steven D'Aprano wrote:
 On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote:


 Also of some interest is the best case: O(1) for unequal strings (they
 differ at the first character) and O(N) for equal strings.
 
 The worst case is O(N) or N characters; the average case is O(1) or two
 characters.
 
 For denial of service attacks or systems, that are NEVER allowed to
 fail the worst case is important.
 
 For most other cases the average complexity counts.
 
 However I still wonder for how many applications the complexity of
 string comparisons would be the limiting factor.
 
 
 and of course if you ever do find an application where that worst case
 matters there's an easy way round it: just call intern() on all the
 strings when they are created.


There are other good reasons for interning strings, but speeding up 
astring == bstring is not one of them.


[steve@ando ~]$ python -m timeit -s "s = 'abc'*100000" -s \
"t = s[:-1] + 'x'" "s == t"
1000 loops, best of 3: 910 usec per loop
[steve@ando ~]$ python -m timeit -s "s = 'abc'*100000" -s \
"t = s[:-1] + 'x'" -s "intern(s); intern(t)" "s == t"
1000 loops, best of 3: 914 usec per loop

No significant difference.

To state the bleedin' obvious, the computational effort required to 
*compare two strings* pre-supposes that those strings already exist, and 
is *completely independent* of the complexity of creating the strings.


 For the comparison to be the limiting factor you have to be doing a lot
 of comparisons 

Correct.

 on the same string (otherwise creating the string would be the limiting
 factor),

Not necessarily.

Look, it's really hard to come up with a realistic, non-contrived example 
where string comparisons are a significant bottleneck in a non-toy 
application. So let me first off say that *nearly always* you're not 
going to care whether s == t looks at all N characters or just the 
first 2 (or 20, or 100). This argument is rather academic (the best sort 
of argument!). Until you start getting up into truly huge strings, we're 
arguing about how many angels can dance on a CPU cache.

But for the record, in principle string comparisons *could* be the 
bottleneck. Example: you have 10,000 strings, which are each created once 
and stored in a list. Then you iterate over the list, comparing every 
string against every other string. And due to some weird vagary of the 
data, the strings are nearly all equal.

(Why would you do this? I don't know, maybe it's a programmers' challenge 
found on the Internet, make up your own scenario...)

Total number of strings created: 10,000.
Total number of strings compared: 100,000,000.

The overhead of creating the strings is trivial compared to the overhead 
of comparing them, and since each string is only created once anyway, 
interning them is just a waste of time.


 so at the expense of a single dictionary
 insertion when the string is created you can get guaranteed O(1) on all
 the comparisons.

What interning buys you is that "s == t" is an O(1) pointer compare if 
they are equal. But if s and t differ in the last character, __eq__ will 
still inspect every character. There is no way to tell Python "all 
strings are interned, if s is not t then s != t as well".


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-10 Thread Oscar Benjamin
On 2012-09-10, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
 On Mon, 10 Sep 2012 08:59:37 +, Duncan Booth wrote:

 Gelonida N gelon...@gmail.com wrote:
 
 so at the expense of a single dictionary
 insertion when the string is created you can get guaranteed O(1) on all
 the comparisons.

 What interning buys you is that s == t is an O(1) pointer compare if 
 they are equal. But if s and t differ in the last character, __eq__ will 
 still inspect every character. There is no way to tell Python all 
 strings are interned, if s is not t then s != t as well.


I thought that if *both* strings were interned then a pointer comparison could
decide if they were unequal without needing to check the characters.

Have I misunderstood how intern() works?

Oscar

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-10 Thread Chris Angelico
On Tue, Sep 11, 2012 at 12:06 AM, Oscar Benjamin
oscar.j.benja...@gmail.com wrote:
 On 2012-09-10, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
 What interning buys you is that s == t is an O(1) pointer compare if
 they are equal. But if s and t differ in the last character, __eq__ will
 still inspect every character. There is no way to tell Python all
 strings are interned, if s is not t then s != t as well.


 I thought that if *both* strings were interned then a pointer comparison could
 decide if they were unequal without needing to check the characters.

 Have I misunderstood how intern() works?

In a language where _all_ strings are guaranteed to be interned (such
as Lua, I think), you do indeed gain this. Pointer inequality implies
string inequality. But when interning is optional (as in Python), you
cannot depend on that, unless there's some way of recognizing interned
strings. Of course, that may indeed be the case; a simple bit flag
this string has been interned would suffice, and if both strings are
interned AND their pointers differ, THEN you can be sure the strings
differ.

I have no idea whether or not CPython version X.Y.Z does this. The
value of such an optimization really depends on how likely strings are
to be interned; for instance, if the compiler automatically interns
all the names of builtins, this could be quite beneficial. Otherwise,
probably not; most Python scripts don't bother interning anything.
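
A rough sketch of how such a flag could be used (hypothetical: this is
not CPython's code, and a set of object ids merely stands in for the
per-object interned bit):

_pool = {}
_interned_ids = set()

def my_intern(s):
    s = _pool.setdefault(s, s)      # canonical object for this value
    _interned_ids.add(id(s))
    return s

def fast_eq(s, t):
    if s is t:
        return True                 # same object: equal for sure
    if id(s) in _interned_ids and id(t) in _interned_ids:
        return False                # both interned, different objects
    return s == t                   # fall back to a character scan

a = my_intern('hello')
b = my_intern('hell' + 'o')
print(fast_eq(a, b))                # True, decided by identity alone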

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-10 Thread Oscar Benjamin
On 2012-09-10, Chris Angelico ros...@gmail.com wrote:
 On Tue, Sep 11, 2012 at 12:06 AM, Oscar Benjamin
oscar.j.benja...@gmail.com wrote:
 On 2012-09-10, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
 What interning buys you is that s == t is an O(1) pointer compare if
 they are equal. But if s and t differ in the last character, __eq__ will
 still inspect every character. There is no way to tell Python all
 strings are interned, if s is not t then s != t as well.


 I thought that if *both* strings were interned then a pointer comparison
 could decide if they were unequal without needing to check the characters.

 Have I misunderstood how intern() works?

 In a language where _all_ strings are guaranteed to be interned (such
 as Lua, I think), you do indeed gain this. Pointer inequality implies
 string inequality. But when interning is optional (as in Python), you
 cannot depend on that, unless there's some way of recognizing interned
 strings. Of course, that may indeed be the case; a simple bit flag
 this string has been interned would suffice, and if both strings are
 interned AND their pointers differ, THEN you can be sure the strings
 differ.

 I have no idea whether or not CPython version X.Y.Z does this. The
 value of such an optimization really depends on how likely strings are
 to be interned; for instance, if the compiler automatically interns
 all the names of builtins, this could be quite beneficial. Otherwise,
 probably not; most Python scripts don't bother interning anything.


I haven't looked at the source but my understanding was precisely that there
is an intern() bit and that not only the builtins module but all the literals
in any byte-compiled module are interned.

Oscar

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-10 Thread Oscar Benjamin
On 2012-09-10, Oscar Benjamin oscar.j.benja...@gmail.com wrote:
 On 2012-09-10, Chris Angelico ros...@gmail.com wrote:
 On Tue, Sep 11, 2012 at 12:06 AM, Oscar Benjamin
oscar.j.benja...@gmail.com wrote:
 On 2012-09-10, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
 What interning buys you is that s == t is an O(1) pointer compare if
 they are equal. But if s and t differ in the last character, __eq__ will
 still inspect every character. There is no way to tell Python all
 strings are interned, if s is not t then s != t as well.


 I thought that if *both* strings were interned then a pointer comparison
 could decide if they were unequal without needing to check the characters.

 Have I misunderstood how intern() works?

 In a language where _all_ strings are guaranteed to be interned (such
 as Lua, I think), you do indeed gain this. Pointer inequality implies
 string inequality. But when interning is optional (as in Python), you
 cannot depend on that, unless there's some way of recognizing interned
 strings. Of course, that may indeed be the case; a simple bit flag
 this string has been interned would suffice, and if both strings are
 interned AND their pointers differ, THEN you can be sure the strings
 differ.

 I have no idea whether or not CPython version X.Y.Z does this. The
 value of such an optimization really depends on how likely strings are
 to be interned; for instance, if the compiler automatically interns
 all the names of builtins, this could be quite beneficial. Otherwise,
 probably not; most Python scripts don't bother interning anything.


 I haven't looked at the source but my understanding was precisely that there
 is an intern() bit and that not only the builtins module but all the literals
 in any byte-compiled module are interned.


s/literals/identifiers/

You can see the interned flag in the PyUnicodeObject struct here:
http://hg.python.org/cpython/file/3ffd6ad93fe4/Include/unicodeobject.h#l303

Oscar

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-10 Thread Chris Angelico
On Tue, Sep 11, 2012 at 12:43 AM, Oscar Benjamin
oscar.j.benja...@gmail.com wrote:
 On 2012-09-10, Oscar Benjamin oscar.j.benja...@gmail.com wrote:
 I haven't looked at the source but my understanding was precisely that there
 is an intern() bit and that not only the builtins module but all the literals
 in any byte-compiled module are interned.


 s/literals/identifiers/

 You can see the interned flag in the PyUnicodeObject struct here:
 http://hg.python.org/cpython/file/3ffd6ad93fe4/Include/unicodeobject.h#l303

Ah, yep, so that's there. In that case, it's possible to have that
optimization. However, I may be misreading this, but it seems the only
Unicode comparison function is a rich compare, which is unable to take
advantage of a known difference:
http://hg.python.org/cpython/file/b48ef168d8c5/Objects/unicodeobject.c#l6114

Different pointers prove the strings differ, but don't tell you which
is to be sorted earlier. You could use this if you roll your own
comparison in C; or, if you already know the strings are interned, you
can use 'is' / 'is not'. But that seems to be the extent of it.
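
For illustration, a minimal sketch of that 'is' trick (Python 3
spelling, where intern() lives in the sys module; in 2.x it's a
builtin; the strings here are made up for the example):

import sys

a = sys.intern("some reasonably long string")
b = sys.intern("some reasonably long string")
c = sys.intern("some reasonably long strinG")

print(a is b)  # True: one shared object, equality decided by pointer
print(a is c)  # False: for two *interned* strings this alone proves a != c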

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-10 Thread Dan Goodman

On 04/09/2012 03:54, Roy Smith wrote:

Let's assume you're testing two strings for equality.  You've already
done the obvious quick tests (i.e they're the same length), and you're
down to the O(n) part of comparing every character.

I'm wondering if it might be faster to start at the ends of the strings
instead of at the beginning?  If the strings are indeed equal, it's the
same amount of work starting from either end.  But, if it turns out that
for real-life situations, the ends of strings have more entropy than the
beginnings, the odds are you'll discover that they're unequal quicker by
starting at the end.


From the rest of the thread, it looks like in most situations it won't 
make much difference as typically very few characters need to be 
compared if they are unequal.


However, if you were in a situation with many strings which were almost 
equal, the most general way to improve the situation might be store a 
hash of the string along with the string, i.e. store (hash(x), x) and 
then compare equality of this tuple. Almost all of the time, if the 
strings are unequal the hash will be unequal. Or, as someone else 
suggested, use interned versions of the strings. This is basically the 
same solution but even better. In this case, your startup costs will be 
higher (creating the strings) but your comparisons will always be instant.
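
A sketch of that (hash, string) pairing (my own illustration; note that
CPython str objects already cache their hash, as comes up later in the
thread):

items = [("abc" * 1000) + suffix for suffix in ("x", "y", "y")]
keyed = [(hash(s), s) for s in items]

print(keyed[0] == keyed[1])  # hashes differ: decided without a char scan
print(keyed[1] == keyed[2])  # hashes match: falls through to full s == s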


Dan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-10 Thread Oscar Benjamin
On 2012-09-10, Dan Goodman dg.gm...@thesamovar.net wrote:
 On 04/09/2012 03:54, Roy Smith wrote:
 Let's assume you're testing two strings for equality.  You've already
 done the obvious quick tests (i.e they're the same length), and you're
 down to the O(n) part of comparing every character.

 I'm wondering if it might be faster to start at the ends of the strings
 instead of at the beginning?  If the strings are indeed equal, it's the
 same amount of work starting from either end.  But, if it turns out that
 for real-life situations, the ends of strings have more entropy than the
 beginnings, the odds are you'll discover that they're unequal quicker by
 starting at the end.

  From the rest of the thread, it looks like in most situations it won't 
 make much difference as typically very few characters need to be 
 compared if they are unequal.

 However, if you were in a situation with many strings which were almost 
 equal, the most general way to improve the situation might be store a 
 hash of the string along with the string, i.e. store (hash(x), x) and 
 then compare equality of this tuple. Almost all of the time, if the 
 strings are unequal the hash will be unequal. Or, as someone else 
 suggested, use interned versions of the strings. This is basically the 
 same solution but even better. In this case, your startup costs will be 
 higher (creating the strings) but your comparisons will always be instant.


Computing the hash always requires iterating over all characters in the string
so is best case O(N) where string comparison is best case (and often average
case) O(1).

Also, so far as I know the hash value once computed is stored on the string
object itself [1] and used for subsequent string comparisons so there's no
need for you to do that in your code.

Oscar

[1] http://hg.python.org/cpython/file/71d94e79b0c3/Include/unicodeobject.h#l293

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-10 Thread Dan Goodman

On 10/09/2012 18:33, Oscar Benjamin wrote:

Computing the hash always requires iterating over all characters in the string
so is best case O(N) where string comparison is best case (and often average
case) O(1).


Yep, but you already have O(N) costs just creating the strings in the 
first place, so it's absorbed into that. It's only worth doing if you do 
many comparisons.



Also, so far as I know the hash value once computed is stored on the string
object itself [1] and used for subsequent string comparisons so there's no
need for you to do that in your code.


Cool, Python is very clever as always! :)

Dan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-10 Thread Dan Goodman

On 10/09/2012 18:07, Dan Goodman wrote:

On 04/09/2012 03:54, Roy Smith wrote:

Let's assume you're testing two strings for equality.  You've already
done the obvious quick tests (i.e they're the same length), and you're
down to the O(n) part of comparing every character.

I'm wondering if it might be faster to start at the ends of the strings
instead of at the beginning?  If the strings are indeed equal, it's the
same amount of work starting from either end.  But, if it turns out that
for real-life situations, the ends of strings have more entropy than the
beginnings, the odds are you'll discover that they're unequal quicker by
starting at the end.


 From the rest of the thread, it looks like in most situations it won't
make much difference as typically very few characters need to be
compared if they are unequal.

However, if you were in a situation with many strings which were almost
equal, the most general way to improve the situation might be store a
hash of the string along with the string, i.e. store (hash(x), x) and
then compare equality of this tuple. Almost all of the time, if the
strings are unequal the hash will be unequal. Or, as someone else
suggested, use interned versions of the strings. This is basically the
same solution but even better. In this case, your startup costs will be
higher (creating the strings) but your comparisons will always be instant.


Just had another thought about this. Although it's unlikely to be 
necessary in practice since (a) it's rarely necessary at all, and (b) 
when it is, hashing and optionally interning seems like the better 
approach, I had another idea that would be more general. Rather than 
starting from the beginning or the end, why not do something like: check 
the first and last character, then the len/2 character, then the len/4, 
then 3*len/4, then len/8, 3*len/8, etc. You'd need to be a bit clever 
about making sure you hit every character but I'm sure someone's already 
got an efficient algorithm for this. You could probably even make this 
cache efficient by working on cache line length blocks. Almost certainly 
entirely unnecessary, but I like the original question and it's a nice 
theoretical problem.
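
A rough sketch of one such coarse-to-fine index order (mine, purely
illustrative; it visits the ends first, then midpoints at ever finer
spacing, hitting every index exactly once):

def scattered_indices(n):
    seen = set()
    for i in (0, n - 1):
        if 0 <= i < n and i not in seen:
            seen.add(i)
            yield i
    step = n
    while step > 1:
        for i in range(step // 2, n, step):
            if i not in seen:
                seen.add(i)
                yield i
        step //= 2

def eq_scattered(a, b):
    return len(a) == len(b) and all(
        a[i] == b[i] for i in scattered_indices(len(a)))

print(list(scattered_indices(8)))            # [0, 7, 4, 2, 6, 1, 3, 5]
print(eq_scattered("abcdefgh", "abcdefgh"))  # True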


Dan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-10 Thread Oscar Benjamin
On 2012-09-10, Dan Goodman dg.gm...@thesamovar.net wrote:
 On 10/09/2012 18:07, Dan Goodman wrote:
 On 04/09/2012 03:54, Roy Smith wrote:
 Let's assume you're testing two strings for equality.  You've already
 done the obvious quick tests (i.e they're the same length), and you're
 down to the O(n) part of comparing every character.

 I'm wondering if it might be faster to start at the ends of the strings
 instead of at the beginning?  If the strings are indeed equal, it's the
 same amount of work starting from either end.  But, if it turns out that
 for real-life situations, the ends of strings have more entropy than the
 beginnings, the odds are you'll discover that they're unequal quicker by
 starting at the end.

  From the rest of the thread, it looks like in most situations it won't
 make much difference as typically very few characters need to be
 compared if they are unequal.

 However, if you were in a situation with many strings which were almost
 equal, the most general way to improve the situation might be store a
 hash of the string along with the string, i.e. store (hash(x), x) and
 then compare equality of this tuple. Almost all of the time, if the
 strings are unequal the hash will be unequal. Or, as someone else
 suggested, use interned versions of the strings. This is basically the
 same solution but even better. In this case, your startup costs will be
 higher (creating the strings) but your comparisons will always be instant.

 Just had another thought about this. Although it's unlikely to be 
 necessary in practice since (a) it's rarely necessary at all, and (b) 
 when it is, hashing and optionally interning seems like the better 
 approach, I had another idea that would be more general. Rather than 
 starting from the beginning or the end, why not do something like: check 
 the first and last character, then the len/2 character, then the len/4, 
 then 3*len/4, then len/8, 3*len/8, etc. You'd need to be a bit clever 
 about making sure you hit every character but I'm sure someone's already 
 got an efficient algorithm for this. You could probably even make this 
 cache efficient by working on cache line length blocks. Almost certainly 
 entirely unnecessary, but I like the original question and it's a nice 
 theoretical problem.

It's not totally theoretical in the sense that the reasoning applies to all
sequence comparisons. If you needed to compare lists of objects where the
comparison of each pair of elements was an expensive operation then you would
want to think carefully about what order you used. Also in general you can't
hash/intern all sequences.

If I was going to change the order of comparisons for all strings then I would
use a random order. This is essentially how dict gets away with claiming to
have O(1) lookup. There are sequences of inputs that can cause every possible
hash collision to occur but because the hash function acts as a kind of
randomisation filter the pathological sequences are very unlikely to occur
unless someone is going out of their way. The clever way that Python 3.3
prevents someone from even doing this on purpose is just to introduce
additional per-process randomisation.

The difference between dict lookup and string comparison is that string
comparison always compares the characters in the same order and it corresponds
to the natural ordering of the data. This means that some perfectly natural use
cases like comparing file-paths can have close to worst case behaviour. If
string/sequence comparison occurs in a random order then there can be no use
case where the likely strings would induce close to worst case behaviour
unless you really are just comparing lots of almost identical sequences.
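
A toy version of that idea (mine, just to make it concrete; a real
implementation would use a cheap pseudo-random permutation rather than
materialising and shuffling an index list, which is itself O(N)):

import random

_rng = random.Random()  # per-process state, akin to hash randomisation

def eq_random_order(a, b):
    if len(a) != len(b):
        return False
    order = list(range(len(a)))
    _rng.shuffle(order)
    # No particular input distribution can now force near-worst-case
    # behaviour, short of the strings actually being equal.
    return all(a[i] == b[i] for i in order)

print(eq_random_order("/home/roy/a.txt", "/home/roy/b.txt"))  # False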

Oscar

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-08 Thread Oscar Benjamin
On 2012-09-08, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
 On Fri, 07 Sep 2012 19:10:16 +, Oscar Benjamin wrote:

 On 2012-09-07, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info
 wrote:
 snip
 
 Would you say, then, that dict insertion is O(N)?

 Pedantically, yes. 

 But since we're allowed to state (or even imply *wink*) whatever 
 assumptions we like, we're allowed to assume in the absence of 
 significant numbers of hash collisions and come up with amortized O(1) 
 for dict insertions and lookups.

 (Provided, of course, that your computer has an infinite amount of 
 unfragmented memory and the OS never starts paging your dict to disk. 
 Another unstated assumption that gets glossed over when we talk about 
 complexity analysis -- on real world computers, for big enough N, 
 *everything* is O(2**N) or worse.)

 Big Oh analysis, despite the formal mathematics used, is not an exact 
 science. Essentially, it is a way of bringing some vague order to hand-
 wavy estimates of complexity, and the apparent mathematical rigour is 
 built on some awfully shaky foundations. But despite that, it actually is 
 useful.

 Coming back to strings... given that in any real-world application, you 
 are likely to have some string comparisons on equal strings and some on 
 unequal strings, and more importantly you don't know which are which 
 ahead of time, which attitude is less likely to give you a nasty surprise 
 when you run your code?

 I have many millions of 100K strings to compare against other 100K 
 strings, and string comparisons are O(1) so that will be fast.

 I have many millions of 100K strings to compare against other 100K 
 strings, and string comparisons are O(N) so that will be slow, better 
 find another algorithm.

True. I can't think of a situation where I've used string comparisons
directly in any text heavy code. Rather, I would use a dict or a set (or a
regex) and hash(str) is always O(N).



 Remember too that for small enough N, everything is O(1). Getting hung 
 up on Big Oh is just as much a mistake as ignoring it completely.



I can't think of a situation in my own work where O(N) vs O(1) string
comparisons would cause a significant problem (except perhaps in libraries
that I use but didn't write). However, I can find a number of cases where I
compare numpy.ndarrays for equality. For example, I found

if np.all(a == b):

in some code that I recently wrote. Although np.all() short-circuits, a==b
does not so that line forces O(N) behaviour onto a situation where the average
case can be better. Unfortunately numpy doesn't seem to provide a
short-circuit equals() function. array_equal() is what I want but it does the
same as the above. In future, I'll consider using something like

def cmparray(a, b):
  return a.shape == b.shape and a.dtype == b.dtype and buffer(a) == buffer(b)

to take advantage of (what I assume are) short-circuit buffer comparisons.
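
A Python 3 flavour of the same sketch (again an assumption on my part,
not something I've profiled: that memoryview equality bails out at the
first mismatching element; it also assumes C-contiguous arrays):

import numpy as np

def cmparray3(a, b):
    return (a.shape == b.shape and a.dtype == b.dtype
            and memoryview(a) == memoryview(b))

a = np.zeros(10**6)
b = a.copy()
b[0] = 1.0              # differ at the very first element
print(cmparray3(a, b))  # False
print(cmparray3(a, a))  # True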

 Since string comparison is only useful if the strings can be equal or
 unequal, the average case depends on how often they are equal/unequal as
 well as the average complexity of both. For random strings the frequency
 of equal strings decreases very fast as N increases so that the
 comparison of random strings is O(1).

 But that is not an upper bound, and Big Oh analysis is strictly defined 
 in terms of upper bounds.

It is an upper bound, but it is an upper bound on the *expectation value*
assuming a particular distribution of inputs, rather than an upper bound on
all possible inputs.

 (I'm talking about the average here -- the actual number of comparisons
 can range all the way up to N, but the average is <= 2.)

The average is actually bounded by 1 / (1 - p) where p is the probability that
two characters match. This bound can be arbitrarily large as p approaches 1 as
would be the case if, say, one character was much more likely than others. The
particular assumption that you have made p = 1/M where M is the number of
characters is actually the *smallest* possible value of p. For non-uniform
real data (English words for example) p is significantly greater than 1/M but
in a strict bounds sense we should say that 1/M <= p <= 1.
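
A quick Monte Carlo check of those figures (my own sketch; uniform
alphabet, so p = 1/M, the most favourable case):

import random

def avg_comparisons(M, N=20, trials=100000, rng=random.Random(1)):
    total = 0
    for _ in range(trials):
        a = [rng.randrange(M) for _ in range(N)]
        b = [rng.randrange(M) for _ in range(N)]
        k = 1
        while k < N and a[k - 1] == b[k - 1]:
            k += 1  # characters inspected up to the first mismatch
        total += k
    return total / float(trials)

print(avg_comparisons(2))   # ~2.00, i.e. 1/(1 - 1/2)
print(avg_comparisons(10))  # ~1.11, i.e. 1/(1 - 1/10)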

Oscar

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-08 Thread Gelonida N

On 09/06/2012 10:33 AM, Steven D'Aprano wrote:

On Wed, 05 Sep 2012 22:47:14 +, Oscar Benjamin wrote:

I may have been overly-conservative earlier when I said that on average
string equality has to compare half the characters. I thought I had
remembered that from a computer science textbook, but I can't find that
reference now, so possibly I was thinking of something else. (String
searching perhaps?).


Yeah, I think you mixed it up with searching for an entry in an unsorted
list of length N.



That's one of the most common N/2 cases, one that you hit daily with many
straightforward implementations.



--
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-08 Thread Gelonida N

On 09/07/2012 06:06 AM, Steven D'Aprano wrote:

On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote:


Also of some interest is the best case: O(1) for unequal strings (they
differ at the first character) and O(N) for equal strings.


The worst case is O(N), or N characters;
the average case is O(1), or two characters.

For denial of service attacks, or systems that are NEVER allowed to fail,
the worst case is important.


For most other cases the average complexity counts.

However I still wonder for how many applications the complexity of 
string comparisons would be the limiting factor.





--
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-07 Thread Oscar Benjamin
On 2012-09-07, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
 snip

 After further thought, and giving consideration to the arguments given by 
 people here, I'm now satisfied to say that for equal-length strings, 
 string equality is best described as O(N).

 1) If the strings are equal, a == b will always compare all N 
characters in each string.

 2) If the strings are unequal, a == b will *at worst* compare
all N characters.

 3) Following usual practice in this field, worst case is the
one which conventionally is meant when discussing Big Oh
behaviour. See, for example, Introduction To Algorithms 
by Cormen, Leiserson and Rivest.

Would you say, then, that dict insertion is O(N)?


 Also of some interest is the best case: O(1) for unequal strings (they 
 differ at the first character) and O(N) for equal strings.

 Also of interest is the case that has caused the majority of the 
 discussion, the average case. I am now satisfied that the average number 
 of comparisons for unequal strings is O(1). To be precise, it is bounded 
 below by 1 comparison (you always have to compare at least one pair of 
 characters) and bounded above by 2 comparisons.

I find this idea of separating into the comparison of equal strings versus the
comparison of unequal strings rather odd. If the strings you compare come from
a distribution where they are guaranteed to be equal (or unequal) then you can
just use the O(0) comparison method.

Since string comparison is only useful if the strings can be equal or unequal,
the average case depends on how often they are equal/unequal as well as the
average complexity of both. For random strings the frequency of equal strings
decreases very fast as N increases so that the comparison of random strings is
O(1).


 (I'm talking about the average here -- the actual number of comparisons 
 can range all the way up to N, but the average is <= 2.)

 If I've done the maths right, the exact value for the average is:

 ((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)

I'm not sure where the extra N comes from ^ but otherwise good.

I would have written that as:

(1 - p) * sum(i * p**(i-1) for i in range(1, N+1))

where p is the probability of a match (1/M for M equally likely characters) or
in closed form:

 (1 - p**N * (1 + N*(1 - p))) / (1 - p)
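
As a numeric aside, that closed form agrees with the sum (a quick check
of mine):

p, N = 1.0/3, 12
series = (1 - p) * sum(i * p**(i - 1) for i in range(1, N + 1))
closed = (1 - p**N * (1 + N*(1 - p))) / (1 - p)
print(abs(series - closed) < 1e-12)  # True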


 for random strings of length N taken from an alphabet of size M.

 For M = 2, that average approaches but never exceeds 2 as N increases; 
 for M = 3, the average approaches 1.5, for M = 4 it approaches 1.333... 
 and so forth.

It approaches 1 / (1 - p) or, if you prefer: M / (M - 1)

Oscar

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-07 Thread Oscar Benjamin
On 2012-09-07, Oscar Benjamin oscar.j.benja...@gmail.com wrote:
 On 2012-09-07, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
 snip

 Since string comparison is only useful if the strings can be equal or unequal,
 the average case depends on how often they are equal/unequal as well as the
 average complexity of both. For random strings the frequency of equal strings
 decreases very fast as N increases so that the comparison of random strings is
 O(1).


 (I'm talking about the average here -- the actual number of comparisons 
 can range all the way up to N, but the average is <= 2.)

 If I've done the maths right, the exact value for the average is:

 ((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)

 I'm not sure where the extra N comes from ^ but otherwise good.

Ok, I see it's for the case where they're equal.

Oscar

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-07 Thread Dwight Hutto
With unequal strings/lists to match, it would seem that one would regex
through the larger string/list with the shorter string, and piece by piece
begin to match for partial percentage matches in relation to the longer
iterative item.

-- 
Best Regards,
David Hutto
*CEO:* *http://www.hitwebdevelopment.com*
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-07 Thread Dwight Hutto
On Fri, Sep 7, 2012 at 5:59 PM, Dwight Hutto dwightdhu...@gmail.com wrote:

 With unequal strings/lists to match, it would seem that one would regex
 through the larger string/list with the shorter string, and piece by piece
 begin to match for partial percentage matches in relation to the longer
 iterative item.


While iterating through the larger list character/item, one at a time, or
arranging them in certain instances.


 --
 Best Regards,
 David Hutto
 *CEO:* *http://www.hitwebdevelopment.com*




-- 
Best Regards,
David Hutto
*CEO:* *http://www.hitwebdevelopment.com*
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-07 Thread Steven D'Aprano
On Fri, 07 Sep 2012 19:10:16 +, Oscar Benjamin wrote:

 On 2012-09-07, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info
 wrote:
 snip

 After further thought, and giving consideration to the arguments given
 by people here, I'm now satisfied to say that for equal-length strings,
 string equality is best described as O(N).

 1) If the strings are equal, a == b will always compare all N
characters in each string.

 2) If the strings are unequal, a == b will *at worst* compare
all N characters.

 3) Following usual practice in this field, worst case is the
one which conventionally is meant when discussing Big Oh behaviour.
See, for example, Introduction To Algorithms by Cormen, Leiserson
and Rivest.
 
 Would you say, then, that dict insertion is O(N)?

Pedantically, yes. 

But since we're allowed to state (or even imply *wink*) whatever 
assumptions we like, we're allowed to assume in the absence of 
significant numbers of hash collisions and come up with amortized O(1) 
for dict insertions and lookups.

(Provided, of course, that your computer has an infinite amount of 
unfragmented memory and the OS never starts paging your dict to disk. 
Another unstated assumption that gets glossed over when we talk about 
complexity analysis -- on real world computers, for big enough N, 
*everything* is O(2**N) or worse.)

Big Oh analysis, despite the formal mathematics used, is not an exact 
science. Essentially, it is a way of bringing some vague order to hand-
wavy estimates of complexity, and the apparent mathematical rigour is 
built on some awfully shaky foundations. But despite that, it actually is 
useful.

Coming back to strings... given that in any real-world application, you 
are likely to have some string comparisons on equal strings and some on 
unequal strings, and more importantly you don't know which are which 
ahead of time, which attitude is less likely to give you a nasty surprise 
when you run your code?

I have many millions of 100K strings to compare against other 100K 
strings, and string comparisons are O(1) so that will be fast.

I have many millions of 100K strings to compare against other 100K 
strings, and string comparisons are O(N) so that will be slow, better 
find another algorithm.


Remember too that for small enough N, everything is O(1). Getting hung 
up on Big Oh is just as much a mistake as ignoring it completely.


 I find this idea of separating into the comparison of equal strings
 versus the comparison of unequal strings rather odd. If the strings you
 compare come from a distribution where they are guaranteed to be equal
 (or unequal) then you can just use the O(0) comparison method.

If you know that they're (un)equal, you don't need to call == at all.

If you know that most strings will be unequal, then you might be 
justified in treating comparisons as O(1) most of the time and not 
stress about the occasional slow call. But of course most of the time you 
don't know this, which is why it is appropriate to treat string 
comparisons as O(N) rather than O(1), since that's the upper bound.


 Since string comparison is only useful if the strings can be equal or
 unequal, the average case depends on how often they are equal/unequal as
 well as the average complexity of both. For random strings the frequency
 of equal strings decreases very fast as N increases so that the
 comparison of random strings is O(1).

But that is not an upper bound, and Big Oh analysis is strictly defined 
in terms of upper bounds.



 (I'm talking about the average here -- the actual number of comparisons
 can range all the way up to N, but the average is <= 2.)

 If I've done the maths right, the exact value for the average is:

 ((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)
 
 I'm not sure where the extra N comes from ^ but otherwise good.

The extra N comes from the case where you compare string S with itself, 
which takes exactly N comparisons.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-07 Thread Dwight Hutto
Why don't you just time it, with loops through incrementing the max input?

-- 
Best Regards,
David Hutto
*CEO:* *http://www.hitwebdevelopment.com*
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-06 Thread Steven D'Aprano
On Wed, 05 Sep 2012 22:47:14 +, Oscar Benjamin wrote:

 In news.gmane.comp.python.general, you wrote:
 On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote: [...]
 You are making unjustified assumptions about the distribution of
 letters in the words. This might be a list of long chemical compounds
 where the words typically differ only in their suffix. It might be a
 list of people with titles:
 
 Actually, I'm not. I'm stating exactly what assumptions I'm making to
 get my calculation. I'm comparing *random* character strings or
 bitstrings.

 Excuse me, you are not. You are comparing English words which are
 highly non-random.
 
 Evidently we have different understandings of what 'random' means. I
 don't think it's unreasonable to say that strings drawn uniformly from
 the set of all strings in the English language (having a given number of
 characters) is random. The distribution is not uniform over the set of
 all possible character strings but it is still random. I think Johannes
 deliberately chose these strings to emulate a particular kind of 'real'
 distribution of strings that might occur in practise.

Of course for some real applications, your strings being compared will 
be English words. And for other applications, they will be strings of 
numeric digits uniformly distributed between 2 and 3. Or taken 
from a Gaussian distribution centered around 7000. Or strings made up of 
deeply nested sets of parentheses (( ... )). Whichever special 
distribution of strings you pick, I don't think calling them random is 
terribly useful in context, even if they are generated randomly from some 
non-uniform distribution.

But you know, it really doesn't make a difference. Equality testing will 
*still* be O(N) since the asymptotic behaviour for sufficiently large 
string comparisons will be bounded by O(N). Multiplicative constants 
(half the string versus 0.001 of the string) do not matter.

I may have been overly-conservative earlier when I said that on average 
string equality has to compare half the characters. I thought I had 
remembered that from a computer science textbook, but I can't find that 
reference now, so possibly I was thinking of something else. (String 
searching perhaps?). In any case, the *worst* case for string equality 
testing is certainly O(N) (every character must be looked at), and the 
*best* case is O(1) obviously (the first character fails to match). But 
I'm not so sure about the average case. Further thought is required.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-06 Thread Dave Angel
On 09/06/2012 04:33 AM, Steven D'Aprano wrote:
 snip

 I may have been overly-conservative earlier when I said that on
 average string equality has to compare half the characters. I thought
 I had remembered that from a computer science textbook, but I can't
 find that reference now, so possibly I was thinking of something else.
 (String searching perhaps?). In any case, the *worst* case for string
 equality testing is certainly O(N) (every character must be looked
 at), and the *best* case is O(1) obviously (the first character fails
 to match). But I'm not so sure about the average case. Further thought
 is required. 

For random strings (as defined below), the average compare time is
effectively unrelated to the size of the string, once the size passes
some point.

Define random string as being a selection from a set of characters, with
replacement.  So if we pick some set of characters, say 10 (or 256, it
doesn't really matter), the number of possible strings is 10**N.

The likelihood of not finding a mismatch within k characters is
(1/10)**k.  The likelihood of actually reaching the end of the random
string is (1/10)**N (which is the reciprocal of the number of strings,
naturally).

If we wanted an average number of comparisons, we'd have to calculate a
series, where each term is a probability times a value for k.
   sum((k * 9*10**-k) for k in range(1, 10))


Those terms very rapidly approach 0, so it's safe to stop after a few. 
Looking at the first 9 items, I see a value of 1.111

This may not be quite right, but the value is certainly well under 2 for
a population of 10 characters, chosen randomly.  And notice that N
doesn't really come into it. 
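
Evaluating that series directly (the first nine terms):

print(sum(k * 9 * 10**-k for k in range(1, 10)))  # -> ~1.1111111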

-- 

DaveA

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-06 Thread Oscar Benjamin

On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel d...@davea.name wrote:

For random strings (as defined below), the average compare time is
effectively unrelated to the size of the string, once the size passes
some point.

Define random string as being a selection from a set of characters, with
replacement.  So if we pick some set of characters, say 10 (or 256, it
doesn't really matter), the number of possible strings is 10**N.

The likelihood of not finding a mismatch within k characters is
(1/10)**k.  The likelihood of actually reaching the end of the random
string is (1/10)**N (which is the reciprocal of the number of strings,
naturally).

If we wanted an average number of comparisons, we'd have to calculate a
series, where each term is a probability times a value for k.
   sum((k * 9*10**-k) for k in range(1, 10))

Those terms very rapidly approach 0, so it's safe to stop after a few.
Looking at the first 9 items, I see a value of 1.111

This may not be quite right, but the value is certainly well under 2 for
a population of 10 characters, chosen randomly.  And notice that N
doesn't really come into it.


It's exactly right. You can obtain this result analytically from 
Johannes' formula above. Just replace 256 with 10 to get that the 
expected number of comparisons is


(10/9)*(1 - 10**(-N))

The last term shows the dependence on N and is tiny even for N=9.

Oscar

--
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-06 Thread Roy Smith
In article 50485fca$0$29977$c3e8da3$54964...@news.astraweb.com,
 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:

 In any case, the *worst* case for string equality 
 testing is certainly O(N) (every character must be looked at), and the 
 *best* case is O(1) obviously (the first character fails to match).

The best case is O(0), if either string is empty (ducking and running).
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-06 Thread Chris Angelico
On Thu, Sep 6, 2012 at 10:13 PM, Roy Smith r...@panix.com wrote:
 In article 50485fca$0$29977$c3e8da3$54964...@news.astraweb.com,
  Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:

 In any case, the *worst* case for string equality
 testing is certainly O(N) (every character must be looked at), and the
 *best* case is O(1) obviously (the first character fails to match).

 The best case is O(0), if either string is empty (ducking and running).

No, O(0) would be when the application decides not to compare at all.

ChrisA (also ducking)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-06 Thread Johannes Bauer
On 05.09.2012 18:24, Steven D'Aprano wrote:
 On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote:
 [...]
 You are making unjustified assumptions about the distribution of
 letters in the words. This might be a list of long chemical compounds
 where the words typically differ only in their suffix. It might be a
 list of people with titles:

 Actually, I'm not. I'm stating exactly what assumptions I'm making to
 get my calculation. I'm comparing *random* character strings or
 bitstrings.
 
 Excuse me, you are not. You are comparing English words which are highly 
 non-random.

Not in my original post. If you read it again, you will clearly see that
I was talking about purely random strings. And since you like to
nitpick, I'll clarify further: I'm talking about bitstrings in which
every bit of every character has the same probability of occurence, 50%.

You then replied by mentioning probability distributions of characters
in different languages/encodings, to which I tried giving a example as
it might (and does) occur in the real world, like sorting a dictionary.

But the original point is still valid: Sorting of randomized bitstrings
is definitely not O(N), you got that wrong.

 You, on the other hand, are making vague assumptions which you do not
 care to formalize and yet you claim that the number of comparisons is
 equally likely to be 1, 2, 3, ..., N. The average then is (N+1)/2.
 Without any explanation for this. At all.
 
 I will accept that my explanation was not good enough, but I strongly 
 disagree that I gave no explanation at all.

What possible explanation could there be that comparison of purely
random bitstrings yields an equal amount of comparisons for its length?

 Herr Professor Frederick Schmidt
 Herr Professor Frederick Wagner
 ...

 Is your assumtion that we're comparing words that have the common prefix
 Herr Professor Frederick ? 
 
 No, I am pointing out that *your* assumption that most string comparisons 
 will halt close to the beginning of the string is an invalid assumption. 
 Your assumption only holds for some non-random strings.

Definitely not. It *especially* holds true for random strings. For
random strings with an alphabet of size n, the probability that the
first character needs to be compared is n^0, i.e. 1. The probability
that the second character needs to be compared is n^(-1). For the xth
character, it is n^(-x). This is due to the lazy abort in comparison and
it results in the average number of comparisons being extremely small no
matter the bitlength n, or O(log n).

 it's counting the number of character comparisons.
 
 Correct. But only out of an extremely limited subset of all possible 
 string comparisons, namely those very few that happen when sorting lists 
 of English words using a very specific algorithm, namely timsort.

Yes, but this was to look at a real-world example (in which way more
comparisons are needed than in the random case). You first were talking
about randomized bitstrings (and so was I), then you brought up
character sets and languages (which, for the original problem, are
entirely irrelevant).

 Whatever figure you have calculated by taking this non-random selection, 
 it is irrelevant to the question of the average performance of string 
 equality given random strings.

Correct. I gave the estimate for random strings in my very first post.

 For the sake of simple calculations, let's pretend that there are only 
 1000 five-letter strings possible. We want to know how many character-
 comparisons are done on average when testing two random five-letter 
 strings for equality. To calculate this average, you must compare every 
 string to every other string, *including* itself.
 
 This gives 1000*1000 = one million equality tests to be performed. For 
 each equality test, we count the number of character comparisons 
 performed, sum all those counts, and divide by 1e6. That is the average 
 number of char comparisons for random strings of five letters.

Easy enough. Since you didn't look at my original equation, here are
some numbers and the program attached:

64 combinations for length 6
Char comparisons: 8064
Comparison cnt  : 4096
Avg comparison  : 1.97

128 combinations for length 7
Char comparisons: 32512
Comparison cnt  : 16384
Avg comparison  : 1.98

256 combinations for length 8
Char comparisons: 130560
Comparison cnt  : 65536
Avg comparison  : 1.99

512 combinations for length 9
Char comparisons: 523264
Comparison cnt  : 262144
Avg comparison  : 2.00

1024 combinations for length 10
Char comparisons: 2095104
Comparison cnt  : 1048576
Avg comparison  : 2.00

2048 combinations for length 11
Char comparisons: 8384512
Comparison cnt  : 4194304
Avg comparison  : 2.00

4096 combinations for length 12
Char comparisons: 33546240
Comparison cnt  : 16777216
Avg comparison  : 2.00
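
The program itself was an attachment that didn't survive the archive; a
minimal reconstruction of mine that reproduces the counts above (Python
3, and slow for the larger lengths):

import itertools

def chars_compared(a, b):
    for k, (x, y) in enumerate(zip(a, b), 1):
        if x != y:
            return k   # short-circuit at the first mismatch
    return len(a)      # equal strings: every character inspected

for n in range(6, 13):
    strings = ["".join(bits) for bits in itertools.product("01", repeat=n)]
    total = sum(chars_compared(a, b) for a in strings for b in strings)
    pairs = len(strings) ** 2
    print("%d combinations for length %d" % (len(strings), n))
    print("Char comparisons: %d" % total)
    print("Comparison cnt  : %d" % pairs)
    print("Avg comparison  : %.2f" % (total / pairs))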

Hopefully now you'll see that your assumption O(n) is dead wrong.

 But that's not what you do. First you eliminate 900 out of the 1000 
 possible strings by only 

Re: Comparing strings from the back?

2012-09-06 Thread Johannes Bauer
On 06.09.2012 10:33, Steven D'Aprano wrote:

 But you know, it really doesn't make a difference. Equality testing will 
 *still* be O(N) since the asymptotic behaviour for sufficiently large 
 string comparisons will be bounded by O(N). Multiplicative constants 
 (half the string versus 0.001 of the string) do not matter.

Wrong, at least for randomized strings (i.e. every character with the
same probability). O(N) is worst-case, O(log N) is correct for
randomized strings.

Then again, since you were nitpicking about Landau notation earlier in
this thread, every function bound by O(log N) is also bound by O(N),
since, as you correctly stated, it's only an upper bound. We should be
talking about the asymptotic sharp bound, i.e. capital Theta.

 I may have been overly-conservative earlier when I said that on average 
 string equality has to compare half the characters. I thought I had 
 remembered that from a computer science textbook, but I can't find that 
 reference now, so possibly I was thinking of something else. (String 
 searching perhaps?). In any case, the *worst* case for string equality 
 testing is certainly O(N) (every character must be looked at), and the 
 *best* case is O(1) obviously (the first character fails to match). But 
 I'm not so sure about the average case. Further thought is required.

Yes, worst-case is O(N), best case O(1). Average is O(n log n).

Best regards,
Johannes

-- 
 Where did you predict the quake again, EXACTLY?
 Well, at least not publicly!
Ah, the latest and to this day most ingenious stroke of our great
cosmologists: the secret prediction.
 - Karl Kaos about Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-06 Thread Dave Angel
On 09/06/2012 09:43 AM, Johannes Bauer wrote:
 snip
 Yes, worst-case is O(N), best case O(1). Average is O(n log n).

Can't see how you came up with an average of n log(n).  Fourteen minutes
before you made this post, you demonstrated it was less than 2 for any N.

And I previously posted that for a character set of size 10, the average
was 1.111.



-- 

DaveA

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-06 Thread Johannes Bauer
On 06.09.2012 16:23, Dave Angel wrote:
 On 09/06/2012 09:43 AM, Johannes Bauer wrote:
 snip
 Yes, worst-case is O(N), best case O(1). Average is O(n log n).
 
 Can't see how you came up with an average of n log(n).  Fourteen minutes
 before you made this post, you demonstrated it was less than 2 for any N.

O(log n) is what I meant, sorry! Damnit.

 And I previously posted that for a character set of size 10, the average
 was 1.111.

For any given string length n and alphabet size l, the average is:

sum(i = 0..n) (1 / (l ^ i))

So with larger word length, the total complexity constantly increases.
The amount by which it increases however is shrinking exponentially with
the word length. Therefore O(log n).

Sorry about the n log n mistake, argh.

Best regards  thanks for the correction,
Johannes


-- 
 Where did you predict the quake again, EXACTLY?
 Well, at least not publicly!
Ah, the latest and to this day most ingenious stroke of our great
cosmologists: the secret prediction.
 - Karl Kaos about Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-06 Thread Johannes Bauer
On 06.09.2012 15:43, Johannes Bauer wrote:

 Wrong, at least for randomized strings (i.e. every character with the
 same probability). O(N) is worst-case, O(log N) is correct for
 randomized strings.
^^
Here I write the right thing. Then further below...

 Yes, worst-case is O(N), best case O(1). Average is O(n log n).

...I write the wrong thing. O(log n) is what I meant, as Dave correctly
noticed.

Best regards,
Johannes

-- 
 Where did you predict the quake again, EXACTLY?
 Well, at least not publicly!
Ah, the latest and to this day most ingenious stroke of our great
cosmologists: the secret prediction.
 - Karl Kaos about Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-06 Thread Johannes Bauer
On 06.09.2012 16:23, Dave Angel wrote:
 On 09/06/2012 09:43 AM, Johannes Bauer wrote:
 snip
 Yes, worst-case is O(N), best case O(1). Average is O(n log n).
 
 Can't see how you came up with an average of n log(n).  Fourteen minutes
 before you made this post, you demonstrated it was less than 2 for any N.
 
 And I previously posted that for a character set of size 10, the average
 was 1.111.

Again playing with the equations and thinking about it again. And I
completely missed your point. It wasn't about n log n vs. log n. Both
are wrong. This surprises me. I was somehow thinking about the limit of
sum (1/n), n -> infty. But since the summands are shrinking
exponentially, the limit is different.

I think the limit of the average comparisons for a given wordlength n
against infinity with alphabet size l is

l / (l - 1)

i.e. for bitstrings it's 2 and for bytestrings it's 256/255.

This would mean string comparison of randomized strings is O(1). Can
that really be true? It looks like it.
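
Checking that numerically, by summing the exact expectation for word
length n and alphabet size l (my own quick check):

def avg(l, n):
    p = 1.0 / l
    # expected first-mismatch position, plus n for the all-equal case
    return sum(k * (1 - p) * p**(k - 1) for k in range(1, n + 1)) + n * p**n

print(avg(2, 50))    # -> ~2.0     (l/(l-1) for bitstrings)
print(avg(256, 50))  # -> ~1.0039  (256/255 for bytestrings)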

Best regards,
Johannes

-- 
 Where did you predict the quake again, EXACTLY?
 Well, at least not publicly!
Ah, the latest and to this day most ingenious stroke of our great
cosmologists: the secret prediction.
 - Karl Kaos about Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-06 Thread Chris Angelico
On Thu, Sep 6, 2012 at 11:37 PM, Johannes Bauer dfnsonfsdu...@gmx.de wrote:
 Not in my original post. If you read it again, you will clearly see that
 I was talking about purely random strings. And since you like to
 nitpick, I'll clarify further: I'm talking about bitstrings in which
 every bit of every character has the same probability of occurence, 50%.

That sort of string isn't a normal thing to be comparing, though.

Here's an idea. Someone who's doing a lot of arguing in this thread
should take Python, find the string comparison routine, and hack in
some statistics-gathering. Then run *real code* on it. Maybe use this
with one of those web frameworks and run your web site on it for an
hour or two, or fire off some real scripts you really use. Then dump
out the stats at the end. My guess: The bulk of string comparisons
that get to actually comparing byte-for-byte will end up returning
True. Most of the false comparisons will be proven earlier; if I
understand correctly, Python will check for identity (easy true) and
different lengths (easy false). But my guess could turn out to be flat
wrong. In any case, it'll be far FAR more useful than arguing from
totally random, or random word selection, or anything.

Who's game?

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-06 Thread Johannes Bauer
On 06.09.2012 16:39, Chris Angelico wrote:

 In any case, it'll be far FAR more useful than arguing from
 totally random, or random word selection, or anything.
 
 Who's game?

Okay, patched against Python 3.2.3: http://pastebin.com/PRRN53P6

To invoke display of the stats, compare the string pleasedumpstats as
LHO, i.e.:

pleasedumpstats < ""

Just ran it against a big script of mine which takes the stringified
IMDb text files, parses it and dumps it into a sqlite3 database.
Surprisingly few string comparisons, however (the sqlite module
without doubt uses its own routines internally). Although the database
in the end has about 2 million records, these were the stats:

strCmpEq 1017
strCmpLt 2802
strCmpGt 1633
strCmpTc 16219
strCmpCc 8570

which means 5452 comparisons, of which 19% were equal and the rest unequal.

Average string length is about 2.97 characters, and comparison aborted
on average after 1.57 characters.

Maybe someone has a test which makes heavier use of string comparison. I
don't want to make up something, however, since this is (by definition)
going to be artificial.

Best regards,
Johannes

-- 
 Where did you predict the quake again, EXACTLY?
 Well, at least not publicly!
Ah, the latest and to this day most ingenious stroke of our great
cosmologists: the secret prediction.
 - Karl Kaos about Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-06 Thread Johannes Bauer
On 06.09.2012 17:36, Johannes Bauer wrote:
 pleasedumpstats < ""

And against an XML-reading C code generator that uses mako:

strCmpEq 39670
strCmpLt 2766215
strCmpGt 2744002
strCmpTc 37430821
strCmpCc 14048656

Compared strings: 5549887
Equal: 0.7%
Average wordlength: 6.74 chars
Average comparelength: 2.53 chars

Best regards,
Johannes

-- 
 Where did you predict the quake again, EXACTLY?
 Well, at least not publicly!
Ah, the latest and to this day most ingenious stroke of our great
cosmologists: the secret prediction.
 - Karl Kaos about Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-06 Thread Dave Angel
On 09/06/2012 10:42 AM, Johannes Bauer wrote:
 On 06.09.2012 16:23, Dave Angel wrote:
 On 09/06/2012 09:43 AM, Johannes Bauer wrote:
 snip
 Yes, worst-case is O(N), best case O(1). Average is O(n log n).
 Can't see how you came up with an average of n log(n).  Fourteen minutes
 before you made this post, you demonstrated it was less than 2 for any N.

 And I previously posted that for a character set of size 10, the average
 was 1.111.
 Again playing with the equations and thinking about it again. And I
 completely missed your point. It wasn't about n log n vs. log n. Both
 are wrong. This surprises me. I was somehow thinking about the limit of
 sum (1/n), n -> infty. But since the summands are shrinking
 exponentially, the limit is different.

 I think the limit of the average comparisons for a given wordlength n
 against infinity with alphabet size l is

 l / (l - 1)

 i.e. for bitstrings it's 2 and for bytestrings it's 256/255.

 This would mean string comparison of randomized strings is O(1). Can
 that really be true? It looks like it.

(Just lost the internet in a storm, so I'm not sure how long it'll be
before this sends.)

Thanks, that was exactly my point.  Since el is at least 2, the average
number of comparisons is no larger than 2, for any value of N.  That's
why, when I'm visually comparing MD5 values, I usually only look at the
first few, and last few hex digits.

However, Chris Angelico (at 10:39) pointed out again that totally random
strings aren't real-world equivalent.

He has now proposed that somebody measure real-world cases here, by
patching the string comparison to gather statistics.  I think that's
beyond me at this point.  I only joined this thread when the cases under
study were well-defined random, and therefore predictable.  <g>



-- 

DaveA

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-06 Thread Steven D'Aprano
On Fri, 07 Sep 2012 00:39:33 +1000, Chris Angelico wrote:

 On Thu, Sep 6, 2012 at 11:37 PM, Johannes Bauer dfnsonfsdu...@gmx.de
 wrote:
 Not in my original post. If you read it again, you will clearly see
 that I was talking about purely random strings. And since you like to
 nitpick, I'll clarify further: I'm talking about bitstrings in which
 every bit of every character has the same probability of occurence,
 50%.
 
 That sort of string isn't a normal thing to be comparing, though.
 
 Here's an idea. Someone who's doing a lot of arguing in this thread
 should take Python, find the string comparison routine, and hack in some
 statistics-gathering. Then run *real code* on it. 

Where's the fun in that?

:-P



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-06 Thread Steven D'Aprano
On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote:

 On 09/06/2012 04:33 AM, Steven D'Aprano wrote:
 snip

 I may have been overly-conservative earlier when I said that on average
 string equality has to compare half the characters. I thought I had
 remembered that from a computer science textbook, but I can't find that
 reference now, so possibly I was thinking of something else. (String
 searching perhaps?). In any case, the *worst* case for string equality
 testing is certainly O(N) (every character must be looked at), and the
 *best* case is O(1) obviously (the first character fails to match). But
 I'm not so sure about the average case. Further thought is required.
 
 For random strings (as defined below), the average compare time is
 effectively unrelated to the size of the string, once the size passes
 some point.

Okay, I've spent some time looking for the reference, and can't find it, 
so I'm now sure I must have been conflating it with something else.

After further thought, and giving consideration to the arguments given by 
people here, I'm now satisfied to say that for equal-length strings, 
string equality is best described as O(N).

1) If the strings are equal, a == b will always compare all N 
   characters in each string.

2) If the strings are unequal, a == b will *at worst* compare
   all N characters.

3) Following usual practice in this field, worst case is the
   one which conventionally is meant when discussing Big Oh
   behaviour. See, for example, Introduction To Algorithms 
   by Cormen, Leiserson and Rivest.

Also of some interest is the best case: O(1) for unequal strings (they 
differ at the first character) and O(N) for equal strings.

Also of interest is the case that has caused the majority of the 
discussion, the average case. I am now satisfied that the average number 
of comparisons for unequal strings is O(1). To be precise, it is bounded 
below by 1 comparison (you always have to compare at least one pair of 
characters) and bounded above by 2 comparisons.

(I'm talking about the average here -- the actual number of comparisons 
can range all the way up to N, but the average is <= 2.)

If I've done the maths right, the exact value for the average is:

((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)

for random strings of length N taken from an alphabet of size M.

For M = 2, that average approaches but never exceeds 2 as N increases; 
for M = 3, the average approaches 1.5, for M = 4 it approaches 1.333... 
and so forth.
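
A direct evaluation of that average, for anyone who wants to check the
convergence (Python 3, for the true division):

def average_comparisons(M, N):
    return ((M - 1)*sum((N - i)*M**i for i in range(0, N)) + N) / M**N

for M in (2, 3, 4):
    print(M, average_comparisons(M, 30))  # -> ~2.0, ~1.5, ~1.333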



Thanks to all who contributed.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-05 Thread Peter Otten
Roy Smith wrote:

 There's been a bunch of threads lately about string implementations, and
 that got me thinking (which is often a dangerous thing).
 
 Let's assume you're testing two strings for equality.  You've already
 done the obvious quick tests (i.e they're the same length), and you're
 down to the O(n) part of comparing every character.
 
 I'm wondering if it might be faster to start at the ends of the strings
 instead of at the beginning?  If the strings are indeed equal, it's the
 same amount of work starting from either end.  But, if it turns out that
 for real-life situations, the ends of strings have more entropy than the
 beginnings, the odds are you'll discover that they're unequal quicker by
 starting at the end.
 
 It doesn't seem un-plausible that this is the case.  For example, most
 of the filenames I work with begin with /home/roy/.  Most of the
 strings I use as memcache keys have one of a small number of prefixes.
 Most of the strings I use as logger names have common leading
 substrings.  Things like credit card and telephone numbers tend to have
 much more entropy in the trailing digits.

But do you actually compare them with each other? Even if you do I'd guess 
that Python looks at the hash values most of the time.
 
 On the other hand, hostnames (and thus email addresses) exhibit the
 opposite pattern.

Nor do English words:

$ python3 reverse_compare2.py compare /usr/share/dict/words --word-len 8

comparing every pair in a sample of 1000 8-char words
taken from '/usr/share/dict/words'

head
1: 477222
2:  18870
3:   2870
4:    435
5:     74
6:     17
7:     12

tail
1: 386633
2:  74966
3:  29698
4:   6536
5:   1475
6:    154
7:     28
8:     10
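
The script itself wasn't posted; here is a guess at its core, assuming
it samples 1000 words and tallies how many characters a short-circuiting
comparison inspects from either end:

import itertools, random

def depth(a, b):
    for k, (x, y) in enumerate(zip(a, b), 1):
        if x != y:
            return k
    return len(a)

words = [w for w in open("/usr/share/dict/words").read().split()
         if len(w) == 8]
sample = random.sample(words, 1000)

head, tail = {}, {}
for a, b in itertools.combinations(sample, 2):
    k = depth(a, b)
    head[k] = head.get(k, 0) + 1
    k = depth(a[::-1], b[::-1])
    tail[k] = tail.get(k, 0) + 1

for name, hist in ("head", head), ("tail", tail):
    print(name)
    for k in sorted(hist):
        print("%d: %6d" % (k, hist[k]))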

 Anyway, it's just a thought.  Has anybody studied this for real-life
 usage patterns?
 
 I'm also not sure how this work with all the possible UCS/UTF encodings.
 With some of them, you may get the encoding semantics wrong if you don't
 start from the front.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-05 Thread Chris Angelico
On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten __pete...@web.de wrote:
 comparing every pair in a sample of 1000 8-char words
 taken from '/usr/share/dict/words'

 head
 1: 477222 
 2:  18870
 ...

Not understanding this. What are the statistics, and what (if it's not
obvious from the previous answer) do they prove?

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-05 Thread Andrew Artajos
a random idea:

you could compare strings by their hashes...

print hash("randomstring") == hash("randomstring")
print hash("randomstring") == hash("randoMstring")
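
Equal hashes don't imply equal strings, though -- unequal strings can
collide, so a hash can only rule equality *out*. A minimal sketch of the
safe pattern (the variable names are mine):

a, b = "randomstring", "randoMstring"
if hash(a) != hash(b):
    equal = False       # different hashes guarantee inequality
else:
    equal = (a == b)    # a hash match still needs the real comparison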

--
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-05 Thread Johannes Bauer
On 04.09.2012 20:07, Steven D'Aprano wrote:
 A reasonable, conservative assumption is to calculate the largest 
 possible value of the average for random strings. That largest value 
 occurs when the alphabet is as small as possible, namely two characters. 
 In practice, strings come from a larger alphabet, up to 1114111 different 
 characters for full Unicode strings, so the average for them will be less 
 than the average we calculate now.
 
 So for unequal strings, the number of comparisons is equally likely to be 
 1, 2, 3, ..., N. The average then is:

That is exactly what I don't agree with. For unequal random strings, the
number of comparisons needed is *not* uniformly distributed. For
demonstration, let's look at bitstrings and the string 1010. There are 15
bitstrings of the same length which are unequal, and I'll put the number
of comparisons needed next to them, going left to right:

0000  1
0001  1
0010  1
0011  1
0100  1
0101  1
0110  1
0111  1
1100  2
1101  2
1110  2
1111  2
1000  3
1001  3
1011  4

i.e. about 1.73 (26/15) comparisons on average with random strings,
compared to the 2.5 comparisons you end up with. My formula does respect
lazy termination (because the probability of longer comparison runs gets
exponentially lower).
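
That 26/15 figure is easy to reproduce by exhaustive enumeration (a
small sketch, assuming left-to-right comparison as above):

from itertools import product

ref = "1010"
counts = []
for bits in product("01", repeat=4):
    s = "".join(bits)
    if s == ref:
        continue                  # skip the one equal string
    n = 0
    for x, y in zip(s, ref):
        n += 1
        if x != y:
            break
    counts.append(n)

print(sum(counts), len(counts), sum(counts) / len(counts))
# 26 15 1.7333...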

 sum([1, 2, 3, ..., N])/N
 
 which by a bit of simple algebra works out to be (N+1)/2, or half the 
 characters as I said.

Yes, but it's a flawed assumption you're making since you're neglecting
lazy evaluation and early abort of comparison.

Best regards,
Johannes

-- 
 Wo hattest Du das Beben nochmal GENAU vorhergesagt?
 Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-05 Thread Johannes Bauer
On 04.09.2012 23:59, Chris Angelico wrote:

 n = (256 / 255) * (1 - 256 ^ (-c))

 where n is the average number of character comparisons and c the string
 length. The rationale is as follows: the first character has to be
 compared in any case, the second with a probability of 1/256, the third
 with 1/(256^2), and so on.
 
 That would be for comparing two random areas of memory. Python strings
 don't have 256 options per character; and in terms of actual strings,
 there's so many possibilities. 

The unit of comparison is completely arbitrary and can equally be used
to compare bitstrings if changed accordingly.

 The strings that a program is going to
 compare for equality are going to use a vastly restricted alphabet;
 for a lot of cases, there might be only a few dozen plausible
 characters.

This is very true. But if so, I would like to see the assumption that
you're making (which might very well be plausible) in order to arrive at
an average of N/2 comparisons (which just seems *way* off).

 But even so, it's going to scale approximately linearly with the
 string length. If they're really random, then yes, there's little
 chance that either a 1MB string or a 2MB string will be the same, but
 with real data, they might very well have a long common prefix. So
 it's still going to be more or less O(n).

I understand what you're saying and surely this is true to some extent
(someone mentioned filenames, which is an excellent example). However, in
order to derive a *formula* you need to formalize your assumptions.
With random data, it's definitely not O(n). And even in reality (with a
more limited alphabet) it will usually abort early:

Consider sorting a large dictionary. Sorting is in O(n log(n)), but this
assumes O(1) comparisons! If the comparison operation itself were in
O(n), the total sorting complexity would be O(n^2 log(n)), which is
definitely false. Most comparisons will abort *very* early (after the
first character).

Best regards,
Johannes

-- 
 Wo hattest Du das Beben nochmal GENAU vorhergesagt?
 Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-05 Thread Johannes Bauer
On 05.09.2012 11:24, Johannes Bauer wrote:

 Consider sorting a large dictionary. Sorting is in O(n log(n)), but this
 assumes O(1) comparisons! If the comparison operation itself were in
 O(n), the total sorting complexity would be O(n^2 log(n)), which is
 definitely false. Most comparisons will abort *very* early (after the
 first character).

I've just written a program to demonstrate this (it's attached below). I
used my aspell dictionary, which contains about 100k words. To keep the
run time down I used only the most common word length (in this case 9
characters, or ~13k words). Here are the results (note it's randomized,
so they'll always differ a little, but they're pretty stable):

Max Cmp: 100
Sum Cmp: 32111
Avg Cmp: 2.393842254361115
Num words  : 13414
Min wordlen: 9
Max wordlen: 9
Avg wordlen: 9.0

Going up in wordlength (only showing part now)

Avg Cmp: 2.2374929735806632
Num words  : 10674
Avg wordlen: 10.0

Avg Cmp: 2.261727078891258
Num words  : 7504
Avg wordlen: 11.0

Avg Cmp: 2.2335647202939977
Num words  : 4898
Avg wordlen: 12.0

Avg Cmp: 2.1743341404358354
Num words  : 2891
Avg wordlen: 13.0

Avg Cmp: 2.156782549420586
Num words  : 1467
Avg wordlen: 14.0

Avg Cmp: 2.112449799196787
Num words  : 747
Avg wordlen: 15.0

So, interestingly, in this real-world case(tm), the complexity does not
scale with O(n). It actually goes down (which is probably due to the
limited number of longer words).

In summary, let's please not debate feelings or such. Either
measurements (with a clear indication of what data they're based on and
how the test was conducted) or derivations (with an explanation of the
assumptions). Feelings can be very misleading in cases like this.

Best regards,
Johannes



#!/usr/bin/python3
import random

class Word():
    def __init__(self, s):
        self._s = s
        self._c = 0

    def __lt__(self, other):
        if len(self) < len(other):
            return True
        elif len(self) > len(other):
            return False

        for i in range(len(self)):
            self._c += 1
            if self._s[i] < other._s[i]:
                return True
            elif self._s[i] > other._s[i]:
                return False
        return False

    def __repr__(self):
        return "%s<%d>" % (self._s, self._c)

    def __len__(self):
        return len(self._s)

    def getcmp(self):
        return self._c

wordlist = [ ]
for line in open("dict", "r"):
    word = Word(line[:-1])
    if len(word) == 9:
        wordlist.append(word)

random.shuffle(wordlist)
wordlist.sort()

comparisons = [ word.getcmp() for word in wordlist ]
print("Max Cmp:", max(comparisons))
print("Sum Cmp:", sum(comparisons))
print("Avg Cmp:", sum(comparisons) / len(comparisons))

wordlengths = [ len(word) for word in wordlist ]
print("Min wordlen:", min(wordlengths))
print("Max wordlen:", max(wordlengths))
print("Avg wordlen:", sum(wordlengths) / len(wordlengths))


-- 
 Wo hattest Du das Beben nochmal GENAU vorhergesagt?
 Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-05 Thread Peter Otten
Chris Angelico wrote:

 On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten __pete...@web.de wrote:
 comparing every pair in a sample of 1000 8-char words
 taken from '/usr/share/dict/words'

 head
 1: 477222 ************************************************************
 2:  18870 **
 ...

 tail
 1: 386633 ************************************************************
 2:  74966 ***********

 
 Not understanding this. What are the statistics, 

I measured how many chars they have in common for all combinations of 1000 
words taken from /usr/share/dict/words.

477222 pairs have one char in common, 18870 pairs have two chars in common 
when compared from the *beginning* of the string.

386633 pairs have one char in common, 74966 pairs have two chars in common 
when compared from the *end* of the string.

 and what (if it's not obvious from the previous answer) do they prove?

They demonstrate that for two words from that particular corpus it is likely 
that a[::-1] == b[::-1] has to take more characters into account (1.34 versus 
1.05 on average) than a == b, i.e. comparing from the back should be 
slower rather than faster.
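
The averages fall straight out of the histogram counts above (a quick
sketch; head/tail are the per-pair comparison counts from one sample, so
the exact figures vary from run to run):

head = {1: 477222, 2: 18870, 3: 2870, 4: 435, 5: 74, 6: 17, 7: 12}
tail = {1: 386633, 2: 74966, 3: 29698, 4: 6536, 5: 1475, 6: 154,
        7: 28, 8: 10}
for name, freq in [("head", head), ("tail", tail)]:
    pairs = sum(freq.values())
    print(name, sum(k*v for k, v in freq.items()) / pairs)
# head 1.05..., tail 1.32...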

If that doesn't help, here's the code ;)

import random
import itertools
from contextlib import contextmanager
from collections import defaultdict

@contextmanager
def open_corpus(args):
    with open(args.name) as instream:
        words = (line.strip() for line in instream)
        #words = (word for word in words if not word.endswith("'s"))
        yield words

def pick(items, count):
    items = list(items)
    return random.sample(items, count)

def count_common(a, b):
    i = 0
    for i, (x, y) in enumerate(zip(a, b), 1):
        if x != y:
            break
    return i

def corpus(args):
    with open_corpus(args) as words:
        wordlens = (len(line.strip()) for line in words)
        freq = defaultdict(int)
        for wordlen in wordlens:
            freq[wordlen] += 1
        show_histogram(freq)

def show_histogram(freq):
    peak = max(freq.values())
    freq_width = len(str(peak))
    max_freq = max(freq)
    len_width = len(str(max_freq))
    for i in range(min(freq), max_freq+1):
        value = freq[i]
        print("{:{}}: {:{}} {}".format(
            i, len_width, value,
            freq_width, "*" * (60 * value // peak)))

def compare(args):
    with open_corpus(args) as words:
        words = pick(
            (word for word in words if len(word) == args.word_len),
            args.sample_size)
    n = 0
    head_common = defaultdict(int)
    tail_common = defaultdict(int)
    n_tail = n_head = 0
    for n, (a, b) in enumerate(itertools.combinations(words, 2), 1):
        cc = count_common(a, b)
        n_head += cc
        head_common[cc] += 1
        ccr = count_common(a[::-1], b[::-1])
        n_tail += ccr
        tail_common[ccr] += 1

    print(n, "combinations")
    print("average common chars (head) {:.3}".format(n_head/n))
    print("average common chars (tail) {:.3}".format(n_tail/n))

    print("comparing every pair in a "
          "sample of {sample} {len}-char words\n"
          "taken from {name!r}".format(
              sample=args.sample_size,
              len=args.word_len,
              name=args.name))
    print()
    print("head")
    show_histogram(head_common)
    print()
    print("tail")
    show_histogram(tail_common)

def main():
    import argparse
    parser = argparse.ArgumentParser()
    parsers = parser.add_subparsers()

    pcorpus = parsers.add_parser("corpus")
    pcorpus.add_argument("name")
    pcorpus.set_defaults(func=corpus)

    pcompare = parsers.add_parser("compare")
    pcompare.add_argument("name")
    pcompare.add_argument("--sample-size", type=int, default=1000)
    pcompare.add_argument("--word-len", type=int, default=8)
    pcompare.set_defaults(func=compare)
    args = parser.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-05 Thread Oscar Benjamin
On 5 September 2012 10:48, Peter Otten __pete...@web.de wrote:

 Chris Angelico wrote:

  On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten __pete...@web.de wrote:
  comparing every pair in a sample of 1000 8-char words
  taken from '/usr/share/dict/words'
 
  head
  1: 477222 ************************************************************
  2:  18870 **
  ...

 tail
 1: 386633 ************************************************************
 2:  74966 ***********


  Not understanding this. What are the statistics,

 I measured how many chars they have in common for all combinations of 1000
 words taken from /usr/share/dict/words.

 477222 pairs have one char in common, 18870 pairs have two chars in common
 when compared from the *beginning* of the string.

 386633 pairs have one char in common, 74966 pairs have two chars in common
 when compared from the *end* of the string.

 and what (if it's not obvious from the previous answer) do they prove?

 They demonstrate that for two words from that particular corpus it is
 likely
 that a[::-1] == b[::-1] has to take more (1.34 versus 1.05 on average)
 characters into account than a == b, i. e. comparing from the back should
 be
 slower rather than faster.

 If that doesn't help, here's the code ;)
  snip

def count_common(a, b):
    i = 0
    for i, (x, y) in enumerate(zip(a, b), 1):
        if x != y:
            break
    return i


This function will return 1 if the first character differs. It does not
count the number of common characters but rather the more relevant quantity
which is the number of comparisons required to decide if the strings are
equal.

It shows that, for the English words in this dictionary, the first letters
of two randomly selected 8-character words are equal ~5% of the time while
the last letters are equal ~20% of the time.

But despite the non-uniformity in the distribution of these strings,
this provides a good example of the fact that for many situations involving
real data, average case comparison complexity is O(1). This is because the
probability of stopping after N comparisons decreases exponentially with N,
so that the sequence of counts forms something loosely like a geometric
progression:

>>> cmp_forward = [477222, 18870, 2870, 435, 74, 17, 12]
>>> cmp_backward = [386633, 74966, 29698, 6536, 1475, 154, 28, 10]
>>> def ratios(seq):
...     for count1, count2 in zip(seq[:-1], seq[1:]):
...         yield count2 / float(count1)
...
>>> list(ratios(cmp_forward))
[0.03954134553729711, 0.15209326974032855, 0.15156794425087108,
0.17011494252873563, 0.22972972972972974, 0.7058823529411765]
>>> list(ratios(cmp_backward))
[0.19389446839767946, 0.39615292265827173, 0.22008216041484274,
0.22567319461444307, 0.10440677966101695, 0.18181818181818182,
0.35714285714285715]

A notable outlier in these sequences is for comparing the first character
of the two words which is why for this string distribution it is better to
start at the beginning than the end.

Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-05 Thread Steven D'Aprano
On Wed, 05 Sep 2012 11:43:08 +0200, Johannes Bauer wrote:

 On 05.09.2012 11:24, Johannes Bauer wrote:
 
 Consider sorting a large dictionary. Sorting is in O(n log(n)), but
 this assumes O(1) comparisons! If the comparison operation itself were
 in O(n), the total sorting complexity would be O(n^2 log(n)), 

This shows some significant confusion about Big Oh complexity analysis 
here. When you say that sorting the list of words is O(N log N), the N 
refers to the number of words in the dictionary. But when you speak about 
comparisons being O(N), the N doesn't refer to the size of the dict, but 
the size of the individual strings being compared. There is no reasonable 
comparison algorithm where the effort required to compare two strings 
depends on the size of the dictionary they have come from.

Since these are *completely different Ns*, you can't combine them to get 
O(N**2 log N) as the overall algorithmic complexity. Doing so is sheer 
nonsense. If you state it like this:

sorting the list of words is O(N log N) where N = length of the list
comparing the words is O(M) where M = the average length of the words

then the overall complexity is O(N log N)*O(M), but since M is a constant 
for any particular dictionary (it's just the average length of the words) 
that reduces down to O(N log N).

To say that sorting a dictionary is O(N log N) does *not* imply that 
string comparisons are O(1). It only implies that string comparisons 
don't depend on the size of the dictionary. Comparing:

s1 = 'a'*30002
s2 = 'a'*30001 + 'b'

takes the same amount of time whether they are in a dictionary of 100 
words or one of 10 words. For the purposes of calculating the 
algorithmic complexity of sorting a large list of words, we can treat 
comparisons as an elementary operation even though they actually aren't.
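
A quick way to convince yourself of that with the strings above (a
sketch using timeit; absolute figures are machine-dependent):

import timeit

setup = "s1 = 'a'*30002; s2 = 'a'*30001 + 'b'"
# the cost of s1 == s2 is the same no matter how many other strings
# exist elsewhere in the program
print(timeit.timeit("s1 == s2", setup=setup, number=100000))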


 which is
 definitely false. Most comparisons will abort *very* early (after the
 first character).

You are making unjustified assumptions about the distribution of letters 
in the words. This might be a list of long chemical compounds where the 
words typically differ only in their suffix. It might be a list of people 
with titles:

Herr Professor Frederick Schmidt
Herr Professor Frederick Wagner
...

But either way, it doesn't matter for the Big Oh behaviour of sorting the 
words.


 I've just written a program to demonstrate this (below it's attached).

I have no idea why you think this program demonstrates anything about 
string equality comparisons. What you are counting is not the average 
number of comparisons for string equality, but the number of comparisons 
when sorting. These are *completely different*, because sorting a list 
does not require testing each string against every other string.

Whatever you have measured, it has nothing to do with the Big Oh 
complexity of string comparisons.

It seems to me that you have misunderstood the purpose of Big Oh 
notation. It gives an asymptotic upper bound on the amount of work done, 
ignoring all lower terms and multiplicative factors, not an exact formula.

If something is O(N), then this tells you:

1) the number of algorithmic steps is proportional to N, plus or
   minus some terms which are less than N (e.g. sqrt(N), or log(N), 
   or 1/N, or some constant, etc.);

2) if you ignore those other terms, and keep all other influences
   equal, then doubling N will *at worst* double the number of
   algorithmic steps;

3) if all those algorithmic steps take exactly the same amount of
   time, then doubling N will take twice as much time.

But of course *none* of those assumptions hold for measurements of actual 
elapsed time. In real life, operations rarely take constant time, you 
don't always see worst case behaviour, and the lower terms are not 
necessarily insignificant enough to ignore.


[...]
 So, interestingly, in this real-world case(tm), the complexity does not
 scale with O(n). It actually goes down (which is probably due to the
 limited number of longer words).

I would guess that it probably has more to do with the ability of the 
timsort algorithm to exploit local order within a list and avoid 
performing comparisons.


 In summary, let's please not debate feelings or such.

I have no idea what feelings you are referring to.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-05 Thread Johannes Bauer
On 05.09.2012 04:18, Neil Hodgson wrote:

The memcpy patch was controversial as it broke Adobe Flash which
 assumed memcpy was safe like memmove.

Adobe Flash was broken before, making an assumption that is not
guaranteed by the standard. The patch only showed the problem.

Regards,
Johannes

-- 
 Wo hattest Du das Beben nochmal GENAU vorhergesagt?
 Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-05 Thread Johannes Bauer
On 05.09.2012 16:30, Steven D'Aprano wrote:

 Since these are *completely different Ns*, you can't combine them to get 
 O(N**2 log N) as the overall algorithmic complexity. Doing so is sheer 
 nonsense. If you state it like this:

Yes, you are correct here.

 You are making unjustified assumptions about the distribution of letters 
 in the words. This might be a list of long chemical compounds where the 
 words typically differ only in their suffix. It might be a list of people 
 with titles:

Actually, I'm not. I'm stating exactly what assumptions I'm making to
get my calculation. I'm comparing *random* character strings or bitstrings.

You, on the other hand, are making vague assumptions which you do not
care to formalize, and yet you claim that "the number of comparisons is
equally likely to be 1, 2, 3, ..., N. The average then is ...". Without
any explanation for this. At all.

 Herr Professor Frederick Schmidt
 Herr Professor Frederick Wagner
 ...

Is your assumption that we're comparing words that have the common prefix
"Herr Professor Frederick"? Why don't you clearly state what you're
comparing? In the example you gave in the other thread you claimed "Note
that this average assumes the strings are completely random.", yet now
you're talking about specific word patterns.

 I have no idea why you think this program demonstrates anything about 
 string equality comparisons. What you are counting is not the average 
 number of comparisons for string equality, but the number of comparisons 
 when sorting. These are *completely different*, because sorting a list 
 does not require testing each string against every other string.

You're wrong. It's not counting the number of string comparisons, it's
counting the number of character comparisons. Which is exactly what
we're talking about in this thread, are we not? The sorting of the list
is just to show some example where lots of strings are compared.

 So, interestingly, in this real-world case(tm), the complexity does not
 scale with O(n). It actually goes down (which is probably due to the
 limited number of longer words).
 
 I would guess that it probably has more to do with the ability of the 
 timsort algorithm to exploit local order within a list and avoid 
 performing comparisons.

Again, it's not counting the number of string comparisons. It's counting
the average number of character comparisons per string comparison. The
total amount of string comparisons has nothing to do with it.

 In summary, let's please not debate feelings or such.
 
 I have no idea what feelings you are referring to.

I'm referring to "the number of comparisons is equally likely to be 1,
2, 3, ..., N. The average then is ..." without any deduction or rationale.
Without any explanation or assumption.

Regards,
Johannes

-- 
 Wo hattest Du das Beben nochmal GENAU vorhergesagt?
 Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-05 Thread Peter Otten
Oscar Benjamin wrote:

 On 5 September 2012 10:48, Peter Otten __pete...@web.de wrote:

 def count_common(a, b):

[sorry for seriously broken implementation]

 This function will return 1 if the first character differs. It does not
 count the number of common characters but rather the more relevant
 quantity which is the number of comparisons required to decide if the
 strings are equal.

I remember stumbling over the non-zero frequency for len(word) but somehow 
plodded on with beautifying the broken results :(

def count_common(a, b):
    """
    >>> count_common("alpha", "beta")
    0
    >>> count_common("", "")
    0
    >>> count_common("alpha", "argument")
    1
    >>> count_common("alpha", "alpha")
    5
    >>> count_common("alpha", "alphx")
    4
    """
    i = 0
    for x, y in zip(a, b):
        if x != y:
            break
        i += 1
    return i


$ python3 reverse_compare2.py compare /usr/share/dict/words
499500 combinations
average common chars (head) 0.0535
average common chars (tail) 0.322
comparing every pair in a sample of 1000 8-char words
taken from '/usr/share/dict/words'

head
0: 477371 ************************************************************
1:  18310 **
2:   3191 
3:    523 
4:     79 
5:     15 
6:     10 
7:      1 

tail
0: 385069 ************************************************************
1:  78742 ************
2:  26734 ****
3:   7462 *
4:   1297 
5:    168 
6:     22 
7:      6 


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-05 Thread Steven D'Aprano
On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote:
[...]
 You are making unjustified assumptions about the distribution of
 letters in the words. This might be a list of long chemical compounds
 where the words typically differ only in their suffix. It might be a
 list of people with titles:
 
 Actually, I'm not. I'm stating exactly what assumptions I'm making to
 get my calculation. I'm comparing *random* character strings or
 bitstrings.

Excuse me, you are not. You are comparing English words which are highly 
non-random.


 You, on the other hand, are making vague assumptions which you do not
 care to formalize, and yet you claim that "the number of comparisons is
 equally likely to be 1, 2, 3, ..., N. The average then is ...". Without
 any explanation for this. At all.

I will accept that my explanation was not good enough, but I strongly 
disagree that I gave no explanation at all.


 Herr Professor Frederick Schmidt
 Herr Professor Frederick Wagner
 ...
 
 Is your assumption that we're comparing words that have the common prefix
 "Herr Professor Frederick"? 

No, I am pointing out that *your* assumption that most string comparisons 
will halt close to the beginning of the string is an invalid assumption. 
Your assumption only holds for some non-random strings.


[...]
 I have no idea why you think this program demonstrates anything about
 string equality comparisons. What you are counting is not the average
 number of comparisons for string equality, but the number of
 comparisons when sorting. These are *completely different*, because
 sorting a list does not require testing each string against every other
 string.
 
 You're wrong. It's not counting the number of string comparisons, 

I didn't say it was. 


 it's counting the number of character comparisons.

Correct. But only out of an extremely limited subset of all possible 
string comparisons, namely those very few that happen when sorting lists 
of English words using a very specific algorithm, namely timsort.


 Which is exactly what
 we're talking about in this thread, are we not? The sorting of the list
 is just to show some example where lots of strings are compared.

It is not good enough to extract a non-random subset of strings (only 
English words) and then decrease the data set even further by only 
comparing an extremely non-uniform subset of those strings: specifically 
those few string comparisons that happen to occur using timsort.

Whatever figure you have calculated by taking this non-random selection, 
it is irrelevant to the question of the average performance of string 
equality given random strings.



 So, interestingly, in this real-world case(tm), the complexity does
 not scale with O(n). It actually goes down (which is probably due to
 the limited number of longer words).
 
 I would guess that it probably has more to do with the ability of the
 timsort algorithm to exploit local order within a list and avoid
 performing comparisons.
 
 Again, it's not counting the number of string comparisons. It's counting
 the average number of character comparisons per string comparison. The
 total amount of string comparisons has nothing to do with it.

I didn't say anything about counting string comparisons. But you are 
wrong -- the numbers you get are *strongly* influenced by the number of 
string comparisons performed. That is just one of the reasons why the 
results you get are biased.

Let me spell it out for you: given a list of five letter words:

['fruit', 'apple', 'claim', 'pears', ... , 'toast']

sorting the list may compare:

'fruit' with 'apple'
'apple' with 'claim'

but almost certainly will not compare:

'apple' with 'toast'

and it definitely won't compare:

'zzzax' with 'zzzaq'

because strings like those never get into the list in the first place.

For the sake of simple calculations, let's pretend that there are only 
1000 five-letter strings possible. We want to know how many character-
comparisons are done on average when testing two random five-letter 
strings for equality. To calculate this average, you must compare every 
string to every other string, *including* itself.

This gives 1000*1000 = one million equality tests to be performed. For 
each equality test, we count the number of character comparisons 
performed, sum all those counts, and divide by 1e6. That is the average 
number of char comparisons for random strings of five letters.
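
On a scale small enough to enumerate, that procedure looks like this (a
sketch over a 3-letter alphabet and 2-char strings instead):

from itertools import product

def comparisons(a, b):
    n = 0
    for x, y in zip(a, b):
        n += 1
        if x != y:
            break
    return n

# every pair of 2-char strings over 'abc', self-pairs included
strings = ["".join(p) for p in product("abc", repeat=2)]
total = sum(comparisons(a, b) for a in strings for b in strings)
print(total / len(strings)**2)   # 1.333... = 1 + 1/3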

But that's not what you do. First you eliminate 900 out of the 1000 
possible strings by only sampling English words -- strings like "xxoij" 
cannot possibly be selected. So you are left with the highly non-random 
subset of 10,000 equality tests.

But you haven't finished. Instead of checking all 10,000 equality tests, 
you then decide to only check the 2000 tests that happen to randomly 
occur when using the timsort algorithm to sort a list.

So from the population of one million possible equality tests, you throw 
away the vast majority, then average the *strongly non-random* remainder.

Re: Comparing strings from the back?

2012-09-05 Thread Oscar Benjamin
In news.gmane.comp.python.general, you wrote:
 On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote:
 [...]
 You are making unjustified assumptions about the distribution of
 letters in the words. This might be a list of long chemical compounds
 where the words typically differ only in their suffix. It might be a
 list of people with titles:
 
 Actually, I'm not. I'm stating exactly what assumptions I'm making to
 get my calculation. I'm comparing *random* character strings or
 bitstrings.

 Excuse me, you are not. You are comparing English words which are highly 
 non-random.

Evidently we have different understandings of what 'random' means. I don't
think it's unreasonable to say that a string drawn uniformly from the set of
all strings in the English language (having a given number of characters) is
random. The distribution is not uniform over the set of all possible character
strings but it is still random. I think Johannes deliberately chose these
strings to emulate a particular kind of 'real' distribution of strings that
might occur in practice.



 You, on the other hand, are making vague assumptions which you do not
 care to formalize, and yet you claim that "the number of comparisons is
 equally likely to be 1, 2, 3, ..., N. The average then is ...". Without
 any explanation for this. At all.

 I will accept that my explanation was not good enough, but I strongly 
 disagree that I gave no explanation at all.


 Herr Professor Frederick Schmidt
 Herr Professor Frederick Wagner
 ...
 
 Is your assumption that we're comparing words that have the common prefix
 "Herr Professor Frederick"? 

 No, I am pointing out that *your* assumption that most string comparisons 
 will halt close to the beginning of the string is an invalid assumption. 
 Your assumption only holds for some non-random strings.

I think you have this backwards. The case where this assumption is provably
true is precisely for random strings. To be clear, when I say 'random' in this
context I mean that each character is chosen independently from the same
probability distribution over the possible characters regardless of which
index it has in the string and regardless of what the other characters are
(IID). In this case the probability that comparison terminates at the jth
character decreases exponentially with j. This means that for large strings
the expected number of character comparisons is independent of the number of
characters in the string as the probability of reaching the later parts of the
string is too small for them to have any significant effect. This is provable
and applies regardless of how many possible characters there are and whether
or not each character is equally likely (except for the pathological case
where one character has a probability of 1).

For strings from 'real' distributions it is harder to make statements about
the 'average case' and it is possible to construct situations where the
comparison would regularly need to compare a common prefix. However, to get
asymptotic performance worse than O(1) it is not sufficient to say that there
may be a common prefix such as 'Herr' in the distribution of strings. It is
necessary that, somehow, the common prefix is likely to grow as the size of
the strings grows.

For example, the set of all strings of length N whose first N//2 characters
are always 'a' and whose remaining characters are chosen IID would lead to
O(N) performance. This is why the file paths example chosen at the start of
this thread is a good one. If a program is dealing with a number of large
strings representing file paths then it is not uncommon that many of those
paths would refer to files in the same deeply nested directory and hence
compare equal for a significant number of characters. This could lead to
average case O(N) performance.
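
That pathological case is easy to demonstrate (a sketch; the path
fragment is made up and the absolute times are machine-dependent, but
the growth with n is the point):

import timeit

for n in (1000, 10000, 100000):
    # equal-length strings differing only in the last character:
    # the comparison has to scan the whole common prefix
    setup = "a = '/same/deep/dir/' * %d + 'x'; b = a[:-1] + 'y'" % n
    print(n, timeit.timeit("a == b", setup=setup, number=1000))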

I think it's appropriate to compare string comparison with dict insertion:
Best case O(1) (no hash collisions)
Worst case O(N) (collides with every key)
Average case O(1) (as long as you don't use pathological data)

The only difference with string comparison is that there are some conceivable,
non-malicious cases where the pathological data can occur (such as with file
paths). However, I suspect that out of all the different uses of Python
strings these cases are a small minority.

In saying that, it's not inconceivable that someone could exploit string
comparison by providing pathological data to make normally O(1) operations
behave as O(N). If I understand correctly it was precisely this kind of
problem with dict insertion/lookup that lead to the recent hash-seed security
update.

Oscar

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-04 Thread Mark Lawrence

On 04/09/2012 05:56, Dan Sommers wrote:


That said, if I really wanted bloodshed, I would propose = for the third
string-equality operator!  ;-)

Dan



Dan "agent provocateur" Sommers? :)

--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-04 Thread Mark Lawrence

On 04/09/2012 02:54, Roy Smith wrote:

There's been a bunch of threads lately about string implementations, and
that got me thinking (which is often a dangerous thing).

Let's assume you're testing two strings for equality.  You've already
done the obvious quick tests (i.e they're the same length), and you're
down to the O(n) part of comparing every character.

I'm wondering if it might be faster to start at the ends of the strings
instead of at the beginning?  If the strings are indeed equal, it's the
same amount of work starting from either end.  But, if it turns out that
for real-life situations, the ends of strings have more entropy than the
beginnings, the odds are you'll discover that they're unequal quicker by
starting at the end.

It doesn't seem un-plausible that this is the case.  For example, most
of the filenames I work with begin with /home/roy/.  Most of the
strings I use as memcache keys have one of a small number of prefixes.
Most of the strings I use as logger names have common leading
substrings.  Things like credit card and telephone numbers tend to have
much more entropy in the trailing digits.

On the other hand, hostnames (and thus email addresses) exhibit the
opposite pattern.

Anyway, it's just a thought.  Has anybody studied this for real-life
usage patterns?

I'm also not sure how this work with all the possible UCS/UTF encodings.
With some of them, you may get the encoding semantics wrong if you don't
start from the front.



Good grief, what timing!!!  The dust has yet to settle over the 
"Flexible string representation, unicode, typography, ..." thread and 
you ask this.  I'll just cross my fingers and hope that people stick 
with facts, realistic benchmarks etc., and not good old fashioned FUD.


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-04 Thread Alain Ketterlin
Roy Smith r...@panix.com writes:

 There's been a bunch of threads lately about string implementations, and 
 that got me thinking (which is often a dangerous thing).

 Let's assume you're testing two strings for equality.  You've already 
 done the obvious quick tests (i.e they're the same length), and you're 
 down to the O(n) part of comparing every character.

 I'm wondering if it might be faster to start at the ends of the strings 
 instead of at the beginning?  If the strings are indeed equal, it's the 
 same amount of work starting from either end.  But, if it turns out that 
 for real-life situations, the ends of strings have more entropy than the 
 beginnings, the odds are you'll discover that they're unequal quicker by 
 starting at the end.

I guess we all have examples with specific entropy distribution. Going
backwards on long strings can be profitable with file paths. If that is
the case, and if you spend a lot of time comparing strings, why not
store them in reverse?

 It doesn't seem un-plausible that this is the case.  For example, most 
 of the filenames I work with begin with /home/roy/.  Most of the 
 strings I use as memcache keys have one of a small number of prefixes.  
 Most of the strings I use as logger names have common leading 
 substrings.

In that case I would split the paths, and use some sort of string-id, or
use intern() and then compare components with "is" instead of "==".
(Basically, turning the string of chars into a string of
ids/ints/pointers.)
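
A minimal sketch of that intern() trick (Python 3 moved intern() to
sys.intern(); the path is made up):

import sys

def key(path):
    return [sys.intern(part) for part in path.split("/")]

p1 = key("/home/roy/project/file.txt")
p2 = key("/home/roy/project/file.txt")
# interned components compare by identity, i.e. by pointer
print(all(a is b for a, b in zip(p1, p2)))   # True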

 Things like credit card and telephone numbers tend to have much more
 entropy in the trailing digits. On the other hand, hostnames (and thus
 email addresses) exhibit the opposite pattern.

Yes, real-world is a mess.

 Anyway, it's just a thought.  Has anybody studied this for real-life 
 usage patterns?

I've tried the intern() trick several times with lists of strings, and
was satisfied, but this was in specific cases (with a finite set of
strings). For short strings of chars (up to, say, a hundred
characters), I've never found anything significantly faster than the
default sweep (that was in C/C++, not Python). For longer and/or more
structured strings/lists, making use of the structure is a good idea,
but I see no general pattern here (as you note).

 I'm also not sure how this work with all the possible UCS/UTF encodings.  
 With some of them, you may get the encoding semantics wrong if you don't 
 start from the front.

If you're testing for equality, I can't see how this could matter, even
with variable-length encodings.

If you're comparing different encodings, then you need different access
methods to random characters (but this doesn't affect the algorithm). If
you're using variable-length encoding, e.g., UTF-8, accessing at a
specific position is not possible.
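
Indeed, for pure equality the encoded bytes can be compared directly,
since equal strings encode to equal byte sequences -- no decoding or
random access needed (a minimal sketch):

a = "naïve".encode("utf-8")
b = "naïve".encode("utf-8")
print(a == b)   # True: byte-wise comparison, no character decoding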

-- Alain.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-04 Thread Johannes Bauer
On 04.09.2012 04:17, Steven D'Aprano wrote:

 On average, string equality needs to check half the characters in the 
 string.

How do you arrive at that conclusion? When comparing two random strings,
I just derived

n = (256 / 255) * (1 - 256 ^ (-c))

where n is the average number of character comparisons and c the string
length. The rationale is as follows: the first character has to be
compared in any case, the second with a probability of 1/256, the third
with 1/(256^2), and so on.
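
Numerically that's just a truncated geometric series (a quick check,
with c the string length and a 256-character alphabet):

def avg_cmp(c, M=256):
    # 1 + 1/M + 1/M**2 + ... + 1/M**(c-1)
    return sum(M**-i for i in range(c))

print(avg_cmp(4))                    # 1.00392...
print((256/255) * (1 - 256**(-4)))   # same value, closed form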

Best regards,
Johannes

-- 
 Wo hattest Du das Beben nochmal GENAU vorhergesagt?
 Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-04 Thread Steven D'Aprano
On Tue, 04 Sep 2012 18:32:57 +0200, Johannes Bauer wrote:

 On 04.09.2012 04:17, Steven D'Aprano wrote:
 
 On average, string equality needs to check half the characters in the
 string.
 
 How do you arrive at that conclusion? 

Take two non-empty strings of the same length, N. If the strings are 
equal, you have to make N comparisons to be sure they are equal (you 
check all N pairs of characters). If they are unequal, you have to check 
each pair of characters up to the first pair that are different:

def string_equality(astr, bstr):
# Assumes lengths are equal.
for achr, bchr in zip(astr, bstr):
if achr != bchr:
return False
return True

For the unequal case, how many comparisons do you do? Ahead of time, we 
know very little about the strings. We don't even know how many possible 
characters there are (are they English alphanumeric, ASCII, Greek, Thai, 
or from the full range of 1114111 Unicode code points?) or what their 
probability distribution is.

A reasonable, conservative assumption is to calculate the largest 
possible value of the average for random strings. That largest value 
occurs when the alphabet is as small as possible, namely two characters. 
In practice, strings come from a larger alphabet, up to 1114111 different 
characters for full Unicode strings, so the average for them will be less 
than the average we calculate now.

So for unequal strings, the number of comparisons is equally likely to be 
1, 2, 3, ..., N. The average then is:

sum([1, 2, 3, ..., N])/N

which by a bit of simple algebra works out to be (N+1)/2, or half the 
characters as I said.
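
A one-line check of that algebra (any N will do):

N = 1000
print(sum(range(1, N+1)) / N, (N + 1) / 2)   # 500.5 500.5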

(Note that this average assumes the strings are completely random. In 
practice, strings tend to come from strongly non-uniform distributions of 
characters. For instance, in English texts, 'e' is much more likely than 
'q' and most characters are vanishingly rare, and in practice, strings 
are likely to be highly non-random.)



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing strings from the back?

2012-09-04 Thread Oscar Benjamin
On 4 September 2012 19:07, Steven D'Aprano 
steve+comp.lang.pyt...@pearwood.info wrote:

 On Tue, 04 Sep 2012 18:32:57 +0200, Johannes Bauer wrote:

  On 04.09.2012 04:17, Steven D'Aprano wrote:
 
  On average, string equality needs to check half the characters in the
  string.
 
  How do you arrive at that conclusion?

 Take two non-empty strings of the same length, N. If the strings are
 equal, you have to make N comparisons to be sure they are equal (you
 check all N pairs of characters). If they are unequal, you have to check
 each pair of characters up to the first pair that are different:

 def string_equality(astr, bstr):
 # Assumes lengths are equal.
 for achr, bchr in zip(astr, bstr):
 if achr != bchr:
 return False
 return True

 For the unequal case, how many comparisons do you do? Ahead of time, we
 know very little about the strings. We don't even know how many possible
 characters there are (are they English alphanumeric, ASCII, Greek, Thai,
 or from the full range of 1114111 Unicode code points?) or what their
 probability distribution is.

 A reasonable, conservative assumption is to calculate the largest
 possible value of the average for random strings. That largest value
 occurs when the alphabet is as small as possible, namely two characters.
 In practice, strings come from a larger alphabet, up to 1114111 different
 characters for full Unicode strings, so the average for them will be less
 than the average we calculate now.

 So for unequal strings, the number of comparisons is equally likely to be
 1, 2, 3, ..., N. The average then is:


What?



 sum([1, 2, 3, ..., N])/N

 which by a bit of simple algebra works out to be (N+1)/2, or half the
 characters as I said.

 (Note that this average assumes the strings are completely random. In
 practice, strings tend to come from strongly non-uniform distributions of
 characters. For instance, in English texts, 'e' is much more likely than
 'q' and most characters are vanishingly rare, and in practice, strings
 are likely to be highly non-random.)


If the strings are 'completely random' (by which I assume you mean that
each character is IID) then the probability of a match for the character at
any one index is the same as the probability for a match at any other
index. Let's say the probability for a match is p and that p < 1.

Then for the first comparison:
1) with probability (1 - p) we terminate the loop after 1 comparison.
2) With probability p we continue to the second comparison

The second comparison occurs with probability p (from 2 above) and if we
reach this point then:
1) with probability (1 - p) we terminate the loop after this second
comparison
2) With probability p we continue to the third comparison

The probability of reaching the second comparison is p and the probability
of terminating at this comparison *if we reach it* is (1-p). So the
probability from the outset that we terminate at the second comparison is
p*(1 - p).

Prob(1 comparison) = (1-p) > p*(1-p) = Prob(2 comparisons)

(since p < 1)

This can easily be extended by induction or otherwise to show that the
probability of terminating after N comparisons decreases as N increases. In
fact since it decreases by a factor of p each time, even if p is 1/2 (as
would be the case if there were two equally likely characters) the
probability of continuing to N comparisons becomes vanishingly small very
quickly as N increases. In practise p is larger for mast ascii/bmp/unicode
problems so the probability vanishes even more quickly.

If you follow this through you should arrive at Johannes' result
above. Applying this to real data is obviously more complicated since
successive characters are not independent but it would be a fairly unusual
situation if you were not more likely to stop at an earlier point in the
comparison loop than a later one.
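
A sketch of those numbers for the worst case p = 1/2 (assuming IID
characters as above):

p = 0.5
stop = [(1 - p) * p**(k - 1) for k in range(1, 9)]
print(stop)        # halves at every step: 0.5, 0.25, 0.125, ...
print(sum(k * q for k, q in zip(range(1, 9), stop)))
# ~1.96 already, close to the limit 1/(1-p) = 2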

Oscar.
-- 
http://mail.python.org/mailman/listinfo/python-list


  1   2   >