Re: [Python-Dev] New string method - splitquoted

2006-05-20 Thread Heiko Wundram
On Thursday 18 May 2006 06:06, Dave Cinege wrote:
> This is useful, but possibly better put into practice as a separate
> method??

I personally don't think it's particularly useful, at least not in the
special case that your patch tries to address.

1) Generally, you won't only have one character that does quoting, but
several. Think of the Python syntax, where you have ", ', """ and ''', which
all behave slightly differently. The logic for " and ' is simple enough to
implement (basically that's what your patch does, and I'm sure it's easy
enough to extend it to accept a range of characters as splitters), but if you
have more complicated quoting operators (such as """), are you sure it's
sensible to implement the logic in split()?

2) What should the result of "this is a \"test string".split(None,-1,'"') be?
An exception (ParseError)? Silently ignoring the missing delimiter, and
returning ['this','is','a','test string']? Ignoring the delimiter altogether,
returning ['this','is','a','test','string']? I don't think there's one case
to satisfy all here...

3) What about escapes of the delimiter? Your current patch doesn't address
them at all (AFAICT) at the moment, but what should the escaping character
be? Should escape processing take place, i.e. what should the result
of "this is a \\\"delimiter \\test".split(None,-1,'"') be?

Don't get me wrong, I personally find this functionality very, very 
interesting (I'm +0.5 on adding it in some way or another), especially as a 
part of the standard library (not necessarily as an extension to .split()).

But there's quite a lot of semantic stuff to get right before you can 
implement it properly; see the complexity of the csv module, where you have 
to define pretty much all of this in the dialect you use to parse the csv 
file...

Why not write up a PEP?

--- Heiko.
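
For comparison, one existing answer to question 2 can be observed in the standard library: shlex.split() treats a missing closing quote as an error. The snippet below is only an illustration of that behaviour; the helper name try_split is hypothetical and not part of the proposed patch.

```python
import shlex


def try_split(text):
    """Return shlex's tokens, or a short error string if the quoting is malformed."""
    try:
        return shlex.split(text)
    except ValueError as exc:
        # shlex raises ValueError when a quoted run is never closed.
        return 'ValueError: %s' % exc


balanced = try_split('this is a "test string"')
unbalanced = try_split('this is a "test string')
print(balanced)
print(unbalanced)
```

Raising on the unterminated quote is just one of the three policies listed above; a str.split() extension would have to pick one of them for every caller.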
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New string method - splitquoted

2006-05-20 Thread Dave Cinege
On Thursday 18 May 2006 11:11, Guido van Rossum wrote:
> This is not an appropriate function to add as a string method. There
> are too many conventions for quoting and too many details to get
> right. One method can't possibly handle them all without an enormous
> number of weird options. It's better to figure out how to do this with
> regexps or use some of the other approaches that have been suggested.
> (Did anyone mention the csv module yet? It deals with this too.)

Maybe my idea is better called splitexcept instead of splitquoted, as my goal
is to (simply) provide a way to limit the split by delimiters, and not dive
into an all-encompassing quoting algorithm.

To me, this is in the spirit of the maxsplit option already present.

Dave



Re: [Python-Dev] New string method - splitquoted

2006-05-20 Thread Dave Cinege
On Thursday 18 May 2006 16:13, you wrote:
> Dave Cinege wrote:
> > For example:
> >
> > s = '  Chan: 11  SNR: 22  ESSID: "Spaced Out Wifi"  Enc: On'
>
> My complaint with this example is that you are just using the wrong tool
> to do this job. If I was going to do this, I would've immediately jumped
> on the regex-press train.
>
> wifi_info = re.match(r'^\s+'
>    r'Chan:\s+(?P<channel>[0-9]+)\s+'
>    r'SNR:\s+(?P<snr>[0-9]+)\s+'
>    r'ESSID:\s+(?P<essid>"[^"]*")\s+'
>    r'Enc:\s+(?P<encryption>[a-zA-Z]+)'
>    , s)

In the 5 years I've been pythoning, I've used re probably twice.
I find regex to be a tool of last resort, and quite a bit of effort to get
right, as regex (for me) is quite prone to giving unintended results without
a good deal of thought. I don't want to have to think. That's why I use
python.  : )

.split() and slicing have always been python's holy grail for me, and I find it
a lot easier to .replace() 'stray' chars with spaces or a delimiter and then
split() that.  It's easier to read and (should be) a lot quicker to process
than regex. (Which I care about, as I'm also often on embedded CPUs of a few
hundred MHz)

So .split works just super duper...but I keep running into situations where
I'd like a substr to be excluded from the split'ing.

The clearest one is excluding a 'quoted' string that has whitespace.
Here's another, albeit a very poor example:

# This is real output from iwlist:
s = '\t\tFrequency:2.462 GHz (Channel 11)'
s.replace(':',')').replace(' (','))').split(None,-1,')')
['Frequency', '2.462 GHz', 'Channel 11']

I wanted to preserve the '2.462 GHz' substr. Let's assume, that could come out 
as '900 MHz' or '11.3409 GHz'. The above code gets what I want in 1 shot, 
either way. Show me an easier way, that doesn't need multiple splits, and 
string re-assembly, and I'll use it.

Dave
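
One possible single-pass alternative for the iwlist line above, offered only as a sketch: splitting on the three punctuation characters with re.split() and discarding empty pieces. It assumes that ':', '(' and ')' never occur inside a field, which holds for this sample but is not guaranteed for all iwlist output.

```python
import re

s = '\t\tFrequency:2.462 GHz (Channel 11)'

# Split on ':', '(' or ')', then trim whitespace and drop empty pieces.
fields = [part.strip() for part in re.split(r'[:()]', s) if part.strip()]
print(fields)
```

The frequency field keeps its internal space, so '900 MHz' or '11.3409 GHz' would come through the same way.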



Re: [Python-Dev] New string method - splitquoted

2006-05-20 Thread Guido van Rossum
I'm sorry Dave, I'm afraid I can't do that.

We hear you, Dave, but this is not a suitable function to add to the
standard library. Many respondents are trying to tell you that in many
different ways. If you keep arguing for it, we'll just ignore you.

--Guido

PS. Give up TDMA. Try Spambayes instead. It works much better and is
less annoying for your correspondents.

On 5/18/06, Dave Cinege
[EMAIL PROTECTED] wrote:
> On Thursday 18 May 2006 16:13, you wrote:
> > Dave Cinege wrote:
> > > For example:
> > >
> > > s = '  Chan: 11  SNR: 22  ESSID: "Spaced Out Wifi"  Enc: On'
> >
> > My complaint with this example is that you are just using the wrong tool
> > to do this job. If I was going to do this, I would've immediately jumped
> > on the regex-press train.
> >
> > wifi_info = re.match(r'^\s+'
> >    r'Chan:\s+(?P<channel>[0-9]+)\s+'
> >    r'SNR:\s+(?P<snr>[0-9]+)\s+'
> >    r'ESSID:\s+(?P<essid>"[^"]*")\s+'
> >    r'Enc:\s+(?P<encryption>[a-zA-Z]+)'
> >    , s)
>
> In the 5 years I've been pythoning, I've used re probably twice.
> I find regex to be a tool of last resort, and quite a bit of effort to get
> right, as regex (for me) is quite prone to giving unintended results without
> a good deal of thought. I don't want to have to think. That's why I use
> python.  : )
>
> .split() and slicing have always been python's holy grail for me, and I find it
> a lot easier to .replace() 'stray' chars with spaces or a delimiter and then
> split() that.  It's easier to read and (should be) a lot quicker to process
> than regex. (Which I care about, as I'm also often on embedded CPUs of a few
> hundred MHz)
>
> So .split works just super duper...but I keep running into situations where
> I'd like a substr to be excluded from the split'ing.
>
> The clearest one is excluding a 'quoted' string that has whitespace.
> Here's another, albeit a very poor example:
>
> # This is real output from iwlist:
> s = '\t\tFrequency:2.462 GHz (Channel 11)'
> s.replace(':',')').replace(' (','))').split(None,-1,')')
> ['Frequency', '2.462 GHz', 'Channel 11']
>
> I wanted to preserve the '2.462 GHz' substr. Let's assume, that could come out
> as '900 MHz' or '11.3409 GHz'. The above code gets what I want in 1 shot,
> either way. Show me an easier way, that doesn't need multiple splits, and
> string re-assembly, and I'll use it.
>
> Dave


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] New string method - splitquoted

2006-05-18 Thread Neal Norwitz
On 5/17/06, Dave Cinege
[EMAIL PROTECTED] wrote:
> Very often...make that very very very very very very very very very often,
> I find myself processing text in python where, when .split()'ing a line, I'd
> like to exclude the split for a 'quoted' item...quoted because it contains
> whitespace or the sep char.
>
> For example:
>
> s = '  Chan: 11  SNR: 22  ESSID: "Spaced Out Wifi"  Enc: On'
>
> If I want to yank the essid in the above example, it's a pain. But with my new
> dandy split quoted method, we have a 3rd argument to .split() that lets us
> spec the quote delimiter where no splitting will occur, and the quote char
> will be dropped:
>
> s.split(None,-1,'"')[5]
> 'Spaced Out Wifi'
>
> Attached is a proof of concept patch against
> Python-2.4.1/Objects/stringobject.c  that implements this. It is limited to
> whitespace splitting only. (sep == None)
>
> As implemented the quote delimiter also doubles as an additional separator for
> splitting out a substr.
>
> For example:
> 'There is"no whitespace before these"quotes'.split(None,-1,'"')
> ['There', 'is', 'no whitespace before these', 'quotes']
>
> This is useful, but possibly better put into practice as a separate method??
>
> Comments please.

What's wrong with:  re.findall(r'"[^"]*"|[^\s]+', s)

YMMV,
n
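
A runnable sketch of the findall approach, assuming the pattern tries a double-quoted run first and falls back to a run of non-whitespace. Note that findall keeps the surrounding quotes, so this sketch strips them in a second step; that stripping step is an addition for illustration, not part of the one-liner above.

```python
import re

s = '  Chan: 11  SNR: 22  ESSID: "Spaced Out Wifi"  Enc: On'

# Either a complete "..." run, or a run of non-whitespace characters.
tokens = re.findall(r'"[^"]*"|[^\s]+', s)

# The matched quotes are still present; drop them to get the bare values.
unquoted = [token.strip('"') for token in tokens]
essid = unquoted[5]
print(tokens)
print(essid)
```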


Re: [Python-Dev] New string method - splitquoted

2006-05-18 Thread Giovanni Bajo
Heiko Wundram [EMAIL PROTECTED] wrote:

> Don't get me wrong, I personally find this functionality very, very
> interesting (I'm +0.5 on adding it in some way or another),
> especially as a part of the standard library (not necessarily as an
> extension to .split()).

It's already there. It's called shlex.split(), and follows the semantics of a
standard UNIX shell, including escaping and other things.

>>> import shlex
>>> shlex.split(r"""Hey I\'m a "bad guy" for you""")
['Hey', "I'm", 'a', 'bad guy', 'for', 'you']

Giovanni Bajo



Re: [Python-Dev] New string method - splitquoted

2006-05-18 Thread Heiko Wundram
On Thursday 18 May 2006 10:21, Giovanni Bajo wrote:
> Heiko Wundram [EMAIL PROTECTED] wrote:
> > Don't get me wrong, I personally find this functionality very, very
> > interesting (I'm +0.5 on adding it in some way or another),
> > especially as a part of the standard library (not necessarily as an
> > extension to .split()).
>
> It's already there. It's called shlex.split(), and follows the semantics of
> a standard UNIX shell, including escaping and other things.

I knew about *nix shell escaping, but that isn't necessarily what I find in 
input I have to process (although generally it's what you see, yeah). That's 
why I said that it would be interesting to have a generalized method, sort of 
like the csv module but only for string interpretation, which takes a 
dialect, and parses a string for the specified dialect.

Remember, there's also escaping by doubling the end of string marker (for
example, 'this is not a single argument'.split() should be parsed as 
['this','is','not','a',]), and I know programs that use exactly this 
format for file storage.

Maybe, one could simply export the function the csv module uses to parse the 
actual data fields as a more prominent method, which accepts keyword 
arguments, instead of a Dialect-derived class.

--- Heiko.
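
As a rough illustration of the keyword-argument idea: csv.reader() already accepts dialect parameters such as delimiter and quotechar as keyword arguments, so a one-line wrapper can split a single string. The wrapper name split_fields is hypothetical, and the sample lines use single spaces between fields because repeated delimiters would otherwise yield empty fields.

```python
import csv


def split_fields(line, **fmtparams):
    """Parse one line with the csv field parser (hypothetical convenience wrapper)."""
    return next(csv.reader([line], **fmtparams))


row = split_fields('Chan: 11 SNR: 22 ESSID: "Spaced Out Wifi" Enc: On',
                   delimiter=' ', quotechar='"')

# With the default doublequote=True, a doubled quote inside a quoted
# field stands for one literal quote character.
doubled = split_fields('say "a ""quoted"" word" twice', delimiter=' ')
print(row)
print(doubled)
```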


Re: [Python-Dev] New string method - splitquoted

2006-05-18 Thread Giovanni Bajo
Heiko Wundram [EMAIL PROTECTED] wrote:

> > > Don't get me wrong, I personally find this functionality very, very
> > > interesting (I'm +0.5 on adding it in some way or another),
> > > especially as a part of the standard library (not necessarily as an
> > > extension to .split()).
> >
> > It's already there. It's called shlex.split(), and follows the
> > semantics of a standard UNIX shell, including escaping and other
> > things.
>
> I knew about *nix shell escaping, but that isn't necessarily what I
> find in input I have to process (although generally it's what you
> see, yeah). That's why I said that it would be interesting to have a
> generalized method, sort of like the csv module but only for string
> interpretation, which takes a dialect, and parses a string for the
> specified dialect.
>
> Remember, there's also escaping by doubling the end of string marker
> (for example, 'this is not a single argument'.split() should be
> parsed as ['this','is','not','a',]), and I know programs that
> use exactly this format for file storage.

I never met this one. Anyway, I don't think it's harder than:

>>> def mysplit(s):
...     """Allow doubled quotes to escape a quote"""
...     return shlex.split(s.replace(r'""', r'\"'))
...
>>> mysplit('This is not a single argument')
['This', 'is', 'not', 'a', 'single', 'argument']


> Maybe, one could simply export the function the csv module uses to
> parse the actual data fields as a more prominent method, which
> accepts keyword arguments, instead of a Dialect-derived class.


I think you're over-generalizing a very simple problem. I believe that
str.split, shlex.split, and some simple variation like the one above (maybe
using regular expressions to do the substitution if you have slightly more
complex cases) can handle 99.99% of the splitting cases. They surely handle
100% of those I myself had to parse.

I believe the standard library already covers common usage. There will surely
be cases where a custom lexer/splitter will have to be written, but that's life
:)

Giovanni Bajo



Re: [Python-Dev] New string method - splitquoted

2006-05-18 Thread Heiko Wundram
On Thursday 18 May 2006 12:26, Giovanni Bajo wrote:
> I believe the standard library already covers common usage. There will
> surely be cases where a custom lexer/splitter will have to be written, but
> that's life

The csv data field parser handles all common usage I have encountered so far, 
yes. ;-) But, generally, you can't (easily) get at the method that parses a 
data field directly, that's why I proposed to publish that method with 
keyword arguments. (actually, I've only tried getting at it when the csv 
module was still plain-python, I wouldn't even know whether the method is 
exported now that the module is written in C).

I've had the need to write a custom lexer time and again, and generally, I'd
love to have a slightly more general string interpretation facility available
to spare me from writing a state automaton... But as I said before,
the simple patch that was proposed here won't do for my case. But I don't
know if it's worth the trouble to actually write a more general version,
because there are quite a few different pitfalls that have to be overcome... I
still remain +0.5 for adding something like this to the stdlib, but only if
it's general enough to handle all the cases the csv module can handle.

--- Heiko.
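
To give a feel for the kind of state automaton meant here, the following is a minimal sketch, not a proposal. The function name split_with_quotes and its parameters (quote, escape, strict) are assumptions made for illustration; it handles one quote character, one escape character, and a switch for how to treat an unterminated quote.

```python
def split_with_quotes(text, quote='"', escape='\\', strict=True):
    """Split on whitespace, keeping quoted runs together (hypothetical helper)."""
    tokens = []
    current = []       # characters of the token being built
    in_token = False   # True once a token has started (allows empty "" tokens)
    in_quote = False   # True while inside a quoted run
    chars = iter(text)
    for ch in chars:
        if ch == escape:
            # Take the next character literally; a trailing escape is kept as-is.
            current.append(next(chars, escape))
            in_token = True
        elif ch == quote:
            in_quote = not in_quote
            in_token = True
        elif ch.isspace() and not in_quote:
            if in_token:
                tokens.append(''.join(current))
                current, in_token = [], False
        else:
            current.append(ch)
            in_token = True
    if in_quote and strict:
        raise ValueError('unterminated quote')
    if in_token:
        tokens.append(''.join(current))
    return tokens


print(split_with_quotes('this is a "test string"'))
print(split_with_quotes('this is a "test string', strict=False))
print(split_with_quotes(r'say \"hi\" now'))
```

Even this small version needs three options, and it still ignores multiple quote characters, doubled-quote escaping and maxsplit, which is roughly the list of pitfalls mentioned above.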


Re: [Python-Dev] New string method - splitquoted

2006-05-18 Thread Nick Coghlan
Dave Cinege wrote:
> Very often...make that very very very very very very very very very often,
> I find myself processing text in python where, when .split()'ing a line, I'd
> like to exclude the split for a 'quoted' item...quoted because it contains
> whitespace or the sep char.
>
> For example:
>
> s = '  Chan: 11  SNR: 22  ESSID: "Spaced Out Wifi"  Enc: On'

Even if you don't like Neal's more efficient regex-based version, the 
necessary utility function to do a two-pass split operation really isn't that 
tricky:

def split_quoted(text, sep=None, quote='"'):
    sections = text.split(quote)
    result = []
    for idx, unquoted_text in enumerate(sections[::2]):
        result.extend(unquoted_text.split(sep))
        quoted = 2*idx+1
        quoted_text = sections[quoted:quoted+1]
        result.extend(quoted_text)
    return result

>>> split_quoted('  Chan: 11  SNR: 22  ESSID: "Spaced Out Wifi"  Enc: On')
['Chan:', '11', 'SNR:', '22', 'ESSID:', 'Spaced Out Wifi', 'Enc:', 'On']

Given that this function (or a regex based equivalent) is easy enough to add 
if you do need it, I don't find the idea of increasing the complexity of the 
basic split API particularly compelling.

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---
 http://www.boredomandlaziness.org


Re: [Python-Dev] New string method - splitquoted

2006-05-18 Thread Guido van Rossum
This is not an appropriate function to add as a string method. There
are too many conventions for quoting and too many details to get
right. One method can't possibly handle them all without an enormous
number of weird options. It's better to figure out how to do this with
regexps or use some of the other approaches that have been suggested.
(Did anyone mention the csv module yet? It deals with this too.)

--Guido

On 5/17/06, Dave Cinege
[EMAIL PROTECTED] wrote:
> Very often...make that very very very very very very very very very often,
> I find myself processing text in python where, when .split()'ing a line, I'd
> like to exclude the split for a 'quoted' item...quoted because it contains
> whitespace or the sep char.
>
> For example:
>
> s = '  Chan: 11  SNR: 22  ESSID: "Spaced Out Wifi"  Enc: On'
>
> If I want to yank the essid in the above example, it's a pain. But with my new
> dandy split quoted method, we have a 3rd argument to .split() that lets us
> spec the quote delimiter where no splitting will occur, and the quote char
> will be dropped:
>
> s.split(None,-1,'"')[5]
> 'Spaced Out Wifi'
>
> Attached is a proof of concept patch against
> Python-2.4.1/Objects/stringobject.c  that implements this. It is limited to
> whitespace splitting only. (sep == None)
>
> As implemented the quote delimiter also doubles as an additional separator for
> splitting out a substr.
>
> For example:
> 'There is"no whitespace before these"quotes'.split(None,-1,'"')
> ['There', 'is', 'no whitespace before these', 'quotes']
>
> This is useful, but possibly better put into practice as a separate method??
>
> Comments please.
>
> Dave



-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] New string method - splitquoted

2006-05-18 Thread Heiko Wundram
On Thursday 18 May 2006 17:11, Guido van Rossum wrote:
> (Did anyone mention the csv module yet? It deals with this too.)

Yes, mentioned it thrice. ;-)

--- Heiko.


Re: [Python-Dev] New string method - splitquoted

2006-05-18 Thread Dave Cinege

On Thursday 18 May 2006 03:00, Heiko Wundram wrote:
> On Thursday 18 May 2006 06:06, Dave Cinege wrote:
> > This is useful, but possibly better put into practice as a separate
> > method??
>
> I personally don't think it's particularly useful, at least not in the
> special case that your patch tries to address.

Well I'm thinking along the lines of a method to extract only quoted substr's:
' this is "something "and"nothing else"but junk'.splitout('"')
['something ', 'nothing else']

Useful? I dunno

> splitters), but if you have more complicated quoting operators (such as
> """), are you sure it's sensible to implement the logic in split()?

Probably not. See below...

> 2) What should the result of "this is a \"test string".split(None,-1,'"')
> be? An exception (ParseError)?

I'd probably vote for that. However my current patch will simply play dumb and stop split'ing the rest of the line, dropping the first quote.

'this is a "test string'.split(None,-1,'"')
['this', 'is', 'a', 'test string']

> Silently ignoring the missing delimiter, and
> returning ['this','is','a','test string']? Ignoring the delimiter
> altogether, returning ['this','is','a','test','string']? I don't think
> there's one case to satisfy all here...

Well the point to the patch is a KISS approach to extending the split() method just slightly to exclude a range of substr from split'ing by delimiter, not to engage in further text processing. 

I'm dealing with this ALL the time, while processing output from other programs. (Windope) filenames, (poorly considered) wifi network names, etc. For me it's always some element with whitespace in it and double quotes surrounding it, where otherwise I could just use a slice to dump the quotes for the needed element:

'filename: "/root/tmp.txt"'.split()[1] [1:-1]
'/root/tmp.txt'
OK

'filename: "/root/is a bit slow.txt"'.split()[1] [1:-1]
'/root/i'
NOT OK

This exact bug just zapped me in a product I have, where I didn't foresee whitespace turning up in that element.

Thus my patch:
'filename: "/root/is a bit slow.txt"'.split(None,-1,'"')[1]
'/root/is a bit slow.txt'
LIFE IS GOOD

> 3) What about escapes of the delimiter? Your current patch doesn't address
> them at all (AFAICT) at the moment,

And it wouldn't, just like the current split doesn't.
'this is a \ test string'.split()
['this', 'is', 'a', '\\', 'test', 'string']

> Don't get me wrong, I personally find this functionality very, very
> interesting (I'm +0.5 on adding it in some way or another), especially as a
> part of the standard library (not necessarily as an extension to .split()).

I'd be happy to have this in as .splitquoted(), but once you use it, it seems more to me like a natural 'ought to be there' extension to split itself.

> Why not write up a PEP?

Because I have no idea of the procedure.   : )  URL?

Dave
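
For reference, the filename case above can also be handled with what the standard library offers today. This is only a sketch of the shlex.split() route raised elsewhere in the thread; it assumes the input follows shell-style quoting, so a backslash in the path (as in Windows filenames) would be treated as an escape character.

```python
import shlex

line = 'filename: "/root/is a bit slow.txt"'

# The slicing idiom breaks as soon as the quoted value contains whitespace.
naive = line.split()[1][1:-1]

# shlex.split() keeps the quoted run together and drops the quotes.
robust = shlex.split(line)[1]
print(repr(naive))
print(repr(robust))
```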



Re: [Python-Dev] New string method - splitquoted

2006-05-18 Thread Dave Cinege
On Thursday 18 May 2006 04:21, Giovanni Bajo wrote:

> It's already there. It's called shlex.split(), and follows the semantics of
> a standard UNIX shell, including escaping and other things.

Not quite. As I said in my other post, simple is the idea for this, just like
the split method itself.  (no escaping, etc...just recognizing delimiters
as an exception to the split separation)

shlex.split() does not let one choose the separator or use a maxsplit, nor is 
it a pure method to strings.

Dave
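
On the separator point, the lower-level shlex.shlex class (as opposed to the shlex.split() convenience function) exposes the whitespace and whitespace_split attributes, which can stand in for a custom separator. The sketch below assumes comma-separated input and does not address maxsplit, which shlex indeed lacks.

```python
import shlex

lexer = shlex.shlex('one,two,"three, with comma",four', posix=True)
lexer.whitespace = ','         # treat only commas as token separators
lexer.whitespace_split = True  # split on the separator set alone

fields = list(lexer)
print(fields)
```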



Re: [Python-Dev] New string method - splitquoted

2006-05-18 Thread Giovanni Bajo
Dave Cinege wrote:

> > It's already there. It's called shlex.split(), and follows the
> > semantics of a standard UNIX shell, including escaping and other
> > things.
>
> Not quite. As I said in my other post, simple is the idea for this,
> just like the split method itself.  (no escaping, etc...just
> recognizing delimiters as an exception to the split separation)

And what's the actual problem? You either have a syntax which does not
support escaping or one that does. If it can't be escaped, there won't be
any weird characters in the way, and shlex.split() will do it. If it does
support escaping in a decent way, you can either use shlex.split() directly
or modify the string before (like I've shown in the other message). In any
case, you get your job done.

Do you have any real-world case where you are still not able to split a
string? And if you do, are they really so many as to warrant a place in the
standard library? As I said before, I think that split() and shlex.split()
cover the majority of real-world use cases.

> shlex.split() does not let one choose the separator
> or use a maxsplit

Real-world use case? Show me what you need to parse, and I assume this weird
format is generated by a program you have not written yourself (or you could
just change it to generate a more standard and simple format!)

> , nor is it a pure method to strings.

This is a totally different problem. It doesn't make it less useful, nor does
it provide a need for adding a new method to the string.
-- 
Giovanni Bajo



[Python-Dev] New string method - splitquoted

2006-05-17 Thread Dave Cinege
Very often...make that very very very very very very very very very often,
I find myself processing text in python where, when .split()'ing a line, I'd
like to exclude the split for a 'quoted' item...quoted because it contains
whitespace or the sep char.

For example:

s = '  Chan: 11  SNR: 22  ESSID: "Spaced Out Wifi"  Enc: On'

If I want to yank the essid in the above example, it's a pain. But with my new
dandy split quoted method, we have a 3rd argument to .split() that lets us
spec the quote delimiter where no splitting will occur, and the quote char
will be dropped:

s.split(None,-1,'"')[5]
'Spaced Out Wifi'

Attached is a proof of concept patch against
Python-2.4.1/Objects/stringobject.c  that implements this. It is limited to
whitespace splitting only. (sep == None)

As implemented the quote delimiter also doubles as an additional separator for
splitting out a substr.

For example:
'There is"no whitespace before these"quotes'.split(None,-1,'"')
['There', 'is', 'no whitespace before these', 'quotes']

This is useful, but possibly better put into practice as a separate method??

Comments please.

Dave
--- stringobject.c.orig	2006-05-17 16:12:13.0 -0400
+++ stringobject.c	2006-05-17 23:49:52.0 -0400
@@ -1336,6 +1336,85 @@
 	return NULL;
 }
 
+// dc: split quoted example
+// 'This string has  "not only this" "and this" but"this mixed in string"as well as this "" empty one and two more at the end""""'.split(None,-1,'"')
+// CORRECT: ['This', 'string', 'has', 'not only this', 'and this', 'but', 'this mixed in string', 'as', 'well', 'as', 'this', '', 'empty', 'one', 'and', 'two', 'more', 'at', 'the', 'end', '', '']
+static PyObject *
+split_whitespace_quoted(const char *s, int len, int maxsplit, const char *qsub)
+{
+	int i, j, quoted = 0;
+	PyObject *str;
+	PyObject *list = PyList_New(0);
+
+	if (list == NULL)
+		return NULL;
+
+	for (i = j = 0; i < len; ) {
+
+		if (!quoted) {
+			while (i < len && isspace(Py_CHARMASK(s[i])) )
+				i++;
+		}
+
+		if (Py_CHARMASK(s[i]) == Py_CHARMASK(qsub[0])) {
+			quoted = 1;
+			i++;
+		}
+
+		j = i;
+
+		while (i < len) {
+			if (Py_CHARMASK(s[i]) == Py_CHARMASK(qsub[0])) {
+				if (quoted)
+					quoted = 2;	// End of quotes found
+				else {
+					quoted = 1;	// Else start of new quotes in the middle of a string
+				}
+				break;
+			} else if (!quoted && isspace(Py_CHARMASK(s[i])))
+				break;
+			i++;
+		}
+
+		if (quoted == 2 && j == i) {	// Empty string in quotes
+			SPLIT_APPEND("", 0, 0);
+			quoted = 0;
+			i++;
+			j = i;
+
+		} else if (j < i) {
+			if (maxsplit-- <= 0)
+				break;
+			SPLIT_APPEND(s, j, i);
+
+			if (quoted == 2) {
+				quoted = 0;
+				i++;
+			} else if (quoted == 1) {
+				i++;
+				if (Py_CHARMASK(s[i]) == Py_CHARMASK(qsub[0])) { // Embedded empty string in quotes (at end of string?)
+					SPLIT_APPEND("", 0, 0);
+					quoted = 0;
+					i++;
+				}
+			} else {
+				while (i < len && isspace(Py_CHARMASK(s[i])))
+					i++;
+			}
+
+			j = i;
+		}
+	}
+	if (j < len) {
+		SPLIT_APPEND(s, j, len);
+	}
+	return list;
+  onError:
+	Py_DECREF(list);
+	return NULL;
+}
+
+
 static PyObject *
 split_char(const char *s, int len, char ch, int maxcount)
 {
@@ -1376,15 +1455,27 @@
 static PyObject *
 string_split(PyStringObject *self, PyObject *args)
 {
-	int len = PyString_GET_SIZE(self), n, i, j, err;
+	int len = PyString_GET_SIZE(self), n, qn, i, j, err;
 	int maxsplit = -1;
-	const char *s = PyString_AS_STRING(self), *sub;
-	PyObject *list, *item, *subobj = Py_None;
+	const char *s = PyString_AS_STRING(self), *sub, *qsub;
+	PyObject *list, *item, *subobj = Py_None, *qsubobj = Py_None;
 
-	if (!PyArg_ParseTuple(args, "|Oi:split", &subobj, &maxsplit))
+	if (!PyArg_ParseTuple(args, "|OiO:split", &subobj, &maxsplit, &qsubobj))
 		return NULL;
 	if (maxsplit < 0)
 		maxsplit = INT_MAX;
+	if (qsubobj != Py_None) {
+		if (PyString_Check(qsubobj)) {
+			qsub = PyString_AS_STRING(qsubobj);
+			qn = PyString_GET_SIZE(qsubobj);
+		}
+		if (qn == 0) {
+			PyErr_SetString(PyExc_ValueError, "empty delimiter");
+			return NULL;
+		}
+		if (subobj == Py_None)
+			return split_whitespace_quoted(s, len, maxsplit, qsub);
+	}		
 	if (subobj == Py_None)
 		return split_whitespace(s, len, maxsplit);
 	if (PyString_Check(subobj)) {