Re: Regexps

2004-02-11 Thread Nadav Har'El
This thread is becoming hackers-il material...

On Wed, Feb 11, 2004, Arik Baratz wrote about RE: Regexps:
> I would like to suggest that there are two forms of multiple-field
> variable-field-length flat-file formats.
>
> One form uses 'enclosure' for fields that need separation, like CSV. In this method,
...
> The first form cannot be parsed using a regular expression IMHO. I bet I can

I showed in a previous email how CSV *can* be parsed by a regular expression,
albeit one that uses Perl extended syntax lookahead.

I agree with you that it's impossible to do with the classic regular-expression
syntax (which basically has only *, |). But it *is* possible to do it
with a deterministic finite automaton (state machine), and even easier than
writing a regular expression: all the state you need to remember is the
parity of the number of double-quotes you have seen so far. When you have seen
an even number of quotes and come across a comma, this is a field split.
Otherwise, the comma is part of the field. It's as easy as that.

Here's the automaton to find a comma marking an end of the field:
(the default action here on an unknown character, not drawn, is to loop into
the same state)


+------+           +-----+
|      | (")-----> |     |
| EVEN |           | ODD |
|      | <-----(") |     |
+------+           +-----+
   (,)
    |
    v
  END OF FIELD
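
The two-state machine above can be sketched in a few lines of Python. This is only an illustration of the parity idea, not code from the thread; the function name split_fields is made up here, and the quotes are left in the fields untouched.

```python
def split_fields(record: str) -> list[str]:
    """Split one CSV record on commas seen while the quote count is even."""
    fields = []
    current = []
    even = True  # the single bit of state: parity of the quotes seen so far
    for ch in record:
        if ch == '"':
            even = not even        # EVEN <-> ODD
            current.append(ch)     # quotes are kept; this version only splits
        elif ch == ',' and even:
            fields.append(''.join(current))  # END OF FIELD
            current = []
        else:
            current.append(ch)     # default action: stay in the same state
    fields.append(''.join(current))
    return fields


print(split_fields('one,"1,2,3",three'))  # -> ['one', '"1,2,3"', 'three']
```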

If you also want to replace two consecutive quotes by one, this slightly
complicates the automaton and adds a few more states. Here's an automaton
that *outputs* the next field (with out(..) actions on some edges):

                  +-----+
                  | ODD |          out(c)
              --->| TMP |--(c)--------\
             /    +-----+             |
           (")      (")               v
       +------+  out(")|           +-----+
 +-(c)-|      |<-------/   out(")  |     |-(c)-+
 |     | EVEN |        /---------->| ODD |     | out(c)
 +---->|      |        |           |     |<----+
out(c) +------+       (")          +-----+
   (,)     ^        +-----+          (")
    |       \--(c)--|EVEN |<---------/
    |       out(c)  | TMP |
    v               +-----+
  END OF FIELD

(at least, I think. I'm just inventing this as I go along :) it's much easier
to program in a normal programming language...)
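
For comparison, here is a rough Python sketch of a four-state machine of this kind (EVEN, ODD TMP, ODD, EVEN TMP). The function names next_field and split_record are hypothetical. One transition is my own assumption rather than something stated in the thread: a comma that directly follows a closing quote (state EVEN TMP) is treated as the end of the field.

```python
def next_field(record, start=0):
    """Return (field, next_index, ended_on_comma) for the field at start."""
    out = []
    state = 'EVEN'
    i = start
    while i < len(record):
        c = record[i]
        i += 1
        if state == 'EVEN':
            if c == ',':
                return ''.join(out), i, True      # END OF FIELD
            if c == '"':
                state = 'ODD_TMP'                 # opening quote, or first of ""
            else:
                out.append(c)
        elif state == 'ODD_TMP':
            out.append(c)                         # out(") or out(c)
            state = 'EVEN' if c == '"' else 'ODD'
        elif state == 'ODD':
            if c == '"':
                state = 'EVEN_TMP'                # closing quote, or first of ""
            else:
                out.append(c)
        else:  # EVEN_TMP
            if c == ',':
                return ''.join(out), i, True      # assumption: ends the field
            out.append(c)                         # out(") or out(c)
            state = 'ODD' if c == '"' else 'EVEN'
    return ''.join(out), i, False                 # ran off the end of the record


def split_record(record):
    """Collect every field of one record by calling next_field repeatedly."""
    fields, pos, more = [], 0, True
    while more:
        field, pos, more = next_field(record, pos)
        fields.append(field)
    return fields
```

With these transitions a field consisting of just two quotes comes out as a single quote character rather than as an empty string, so a real CSV reader needs more care than this.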


> (like keeping the state with an outside variable). The problem IMHO is that
> you must use a lookahead greater than 1, that's why the perl lookahead
> extension works.

Perl's lookahead, which counts the number of quotes *after* the current comma,
probably generates some sort of non-deterministic finite automaton code. As
I explained above, unless I'm mistaken, it's even simpler to remember the
parity of the quotes *before* this comma (one bit of state) and write a
deterministic finite automaton directly.

The reason I gave a regular expression is because this is what the original
poster asked for. I also think the trick I came up with (looking for an
even number of quotes) is cool ;)
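
The regular expression from the earlier email is not reproduced in this thread, so the pattern below is only my reconstruction of the idea, written for Python's re module (which supports the same (?=...) lookahead syntax). It splits on a comma only when an even number of quotes follows it up to the end of the record. The names SPLIT_COMMA and split_with_lookahead are made up for this sketch.

```python
import re

# A comma, then (looking ahead only) zero or more pairs of quotes
# and no further quote before the end of the record.
SPLIT_COMMA = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def split_with_lookahead(record):
    """Split a record on commas followed by an even number of quotes."""
    return SPLIT_COMMA.split(record)


print(split_with_lookahead('one,"1,2,3",three'))
```

Each candidate comma makes the engine rescan the rest of the record, so this is slower than the one-bit state machine on long lines.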


-- 
Nadav Har'El|   Wednesday, Feb 11 2004, 19 Shevat 5764
[EMAIL PROTECTED] |-
Phone: +972-53-790466, ICQ 13349191 |Share your knowledge. It's a way to
http://nadav.harel.org.il   |achieve immortality.

=
To unsubscribe, send mail to [EMAIL PROTECTED] with
the word unsubscribe in the message body, e.g., run the command
echo unsubscribe | mail [EMAIL PROTECTED]



RE: Regexps

2004-02-10 Thread Hagay Unterman



tr " " "\n"


From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tal Achituv
Sent: Tuesday, February 10, 2004 15:04 PM
To: [EMAIL PROTECTED]
Subject: Regexps

Hi Guys,

can anyone give me a regular expression that turns
bla foo bar "kuku 2" test test

into:
bla
foo
bar
kuku 2
test
test

(replaces spaces with \n only if it's not encapsulated in ")

Thanks,
Tal.


Re: Regexps

2004-02-10 Thread Oleg Goldshmidt
Hagay Unterman [EMAIL PROTECTED] writes:

> tr " " "\n"

This would be excusable if it were prepended with UNTESTED. It does
not do what the OP wants.
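
To see the difference concretely, here is a small Python comparison. The first result mimics the blanket space-to-newline translation. The second uses the standard shlex module, which applies POSIX-shell quoting rules; whether those rules are what the OP wants is an assumption on my part.

```python
import shlex

line = 'bla foo bar "kuku 2" test test'

# What a blanket translation does: the quoted field is broken in two.
naive = line.replace(' ', '\n').split('\n')

# Shell-style splitting keeps "kuku 2" together and drops the quotes.
wanted = shlex.split(line)

print(naive)   # -> ['bla', 'foo', 'bar', '"kuku', '2"', 'test', 'test']
print(wanted)  # -> ['bla', 'foo', 'bar', 'kuku 2', 'test', 'test']
```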

In general, handling string literals with regexps is not trivial,
because you need to take into account escaped ", as in

"foo \"sna fu\" bar"

and more complicated variants. Also, what if there are newlines inside
the string?

Assuming there are only simple cases in your input (and some
other things, like there being no

foo "sna fu"bar

i.e. quoted strings are always whitespace-separated fields) here is a
simple gawk parser that works on your example:

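#!/bin/gawk -f

# Return what is left of str after its first `head` characters.
function tail(str,head) { return substr(str,head+1,length(str)-head+1); }

# Print str with every run of blanks or tabs replaced by a newline.
function trprint(str) { gsub(/[ \t]+/,"\n",str); printf("%s",str); }

{
    if (!NF) next;
    str = $0;
    # len is the position of the next opening quote, 0 if none is left
    while (len = index(str,"\"")) {
        trprint(substr(str,1,len-1));
        str = tail(str,len);
        end = index(str,"\"");
        if (!end) {
            printf("%s:%d: unmatched quote at position %d\n",
                   FILENAME,NR,len) > "/dev/stderr";
            exit(1);
        }
        printf("%s\n",substr(str,1,end-1));
        # skip the closing quote and the whitespace assumed to follow it
        str = tail(str,end+1);
    }
    trprint(str);
    printf("\n");
}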


 
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
> Behalf Of Tal Achituv
> Sent: Tuesday, February 10, 2004 15:04 PM
> To: [EMAIL PROTECTED]
> Subject: Regexps
>
> Hi Guys,
>
> can anyone give me a regular expression that turns
>
> bla foo bar "kuku 2" test test
>
> into:
>
> bla
> foo
> bar
> kuku 2
> test
> test
>
> (replaces spaces with \n only if its not encapsulated in ")
>
> Thanks,
> Tal.

Hope it helps,

-- 
Oleg Goldshmidt | [EMAIL PROTECTED]




Re: Regexps

2004-02-10 Thread Nadav Har'El
On Tue, Feb 10, 2004, Oleg Goldshmidt wrote about Re: Regexps:
> In general, handling string literals with regexps is not trivial,
> because you need to take into account escaped ", as in
>
> "foo \"sna fu\" bar"

This may not be relevant for his situation. One situation in which I once
used a similar trick to the one I posted earlier was in breaking up a
CSV - comma separated values. In a CSV, the comma is the field
separator (rather than the space in the poster's question), so a record might
look like

one,two,three,four,five

Now, the convention is that if field 'two' is to be replaced by something
containing a comma, say '1,2,3', the field is quoted with double-quotes:

one,"1,2,3",three,four,five

And you're supposed to split this record up on commas that are not inside
quotes.

What happens if there are quotes in one of the fields? Each double-quote is
replaced by two of them, keeping the evenness of the number of quotes
(quote parity) and allowing exactly the same method of splitting on commas,
and allowing for an easy reverse transformation.

For example,

one,"1,2,3",three,"he said ""hello!""",five
or
one,"1,2,3",three,"he said ""hi, man!""",five

In the last line you know you shouldn't separate on the comma before 'man'
because it has an odd number of quotes before (or after) it. Nice and simple :)

At least, that is what I remember. Sadly, the Wikipedia entry on CSV is
nonexistent, so I'm using my memory as the source ;)
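
As a cross-check on the convention described above, Python's standard csv module doubles embedded quotes by default and reverses the transformation on reading. The sample row below is made up for illustration.

```python
import csv
import io

row = ['one', '1,2,3', 'three', 'he said "hi, man!"', 'five']

# Write one record; the default dialect quotes fields that need it
# and doubles any quote character inside them.
buf = io.StringIO()
csv.writer(buf).writerow(row)
encoded = buf.getvalue().rstrip('\r\n')

# Reading the record back undoes the quoting and the doubling.
decoded = next(csv.reader([encoded]))

print(encoded)  # -> one,"1,2,3",three,"he said ""hi, man!""",five
print(decoded == row)
```

The encoded record always contains an even number of quotes, which is what makes the parity trick work.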

Anyway, CSV is a simple record/field representation method, but it is very
rarely used in Unix (it is more common in the Windows world). Tab-separated
fields are, justifiably, much more common - they are easier to use and usually
enough (and if you need tabs, separate the fields with some other character).


-- 
Nadav Har'El| Tuesday, Feb 10 2004, 19 Shevat 5764
[EMAIL PROTECTED] |-
Phone: +972-53-790466, ICQ 13349191 |A messy desk is a sign of a messy mind.
http://nadav.harel.org.il   |An empty desk is a sign of an empty mind.




Re: Regexps

2004-02-10 Thread Oleg Goldshmidt

Nadav Har'El [EMAIL PROTECTED] writes:

> On Tue, Feb 10, 2004, Oleg Goldshmidt wrote about Re: Regexps:
> > In general, handling string literals with regexps is not trivial,
> > because you need to take into account escaped ", as in
> >
> > "foo \"sna fu\" bar"
>
> This may not be relevant for his situation.

True, which is why I suggested a simple solution. 

To tell you the truth, I knew of perl's lookahead, but I am not much
of a perl-monger and I didn't remember the syntax, and I don't know of
any regexp engine other than perl that supports this very useful
feature.[1]

So I thought of the most straightforward (not necessarily the best)
way to pair the quotes and process portions of the input accordingly.

> What happens if there are quotes in one of the fields? Each
> double-quote is replaced by two of them, keeping the evenness of the
> number of quotes (quote parity) and allowing exactly the same method
> of splitting on commas, and allowing for an easy reverse
> transformation.

Well, you are specifying an input convention that may or may not be
applicable. I am sure I don't need to give examples of usage of
backslash-escaped quotes in string literals.

The escape convention should be specified. From Tal's description, for
instance, it is not quite clear what the output from

"Nadav said, "Hi, Oleg", and turned back to his code."

should be - maybe the right output is reproducing the input verbatim
(there is no unquoted whitespace)? So I put a disclaimer about all
sorts of assumptions made, and only went through paired quotes, not
checking for more general odd/even cases. Of course, with an escape
sequence that is not based on merely doubling the escaped character
the odd/even rule breaks down, and I didn't think of it at all.
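
A small Python illustration of that breakdown, using two made-up records that differ only in how the inner quote is escaped. The helper name quote_parity_before is hypothetical.

```python
def quote_parity_before(text, index):
    """Parity (0 = even, 1 = odd) of the double quotes in text[:index]."""
    return text[:index].count('"') % 2


doubled = 'a,"x "", y",b'    # inner quote escaped by doubling it
escaped = 'a,"x \\", y",b'   # inner quote escaped with a backslash

# In both records the comma before ' y' lies inside the quoted field.
print(quote_parity_before(doubled, doubled.index(', y')))  # -> 1, odd: not a split
print(quote_parity_before(escaped, escaped.index(', y')))  # -> 0, even: wrongly a split
```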

Another potential regexp pitfall is that - for better and for worse -
different regexp engines behave differently, to the point of matching
different things given the same regexp. Therefore, it may be unsafe to
ask for a regexp without specifying the type of engine (or a specific
tool, such as perl or awk). Find some issues with the regexp in the code
below[2].

[1] A really useful feature would be lookbehind, i.e. match
anything but a double quote unless *preceded* by an odd number of
consecutive backslashes. Not even perl supports that.

[2] Basing a parser on matching quoted strings as a whole will make it
a bit difficult to report unmatched quotes. The code below does a
pretty good job on backslash-escaped quotes, but no warranty is
implied ;-)

#!/bin/gawk -f

function tail(str,len) { return substr(str,len+1,length(str)-len+1); }

function trprint(str) { gsub(/[ \t]+/,"\n",str); printf("%s",str); }

{
    str = $0;
    pos = 0;
    # a quoted string: a quote, then ordinary or backslash-escaped
    # characters, then the closing quote (3-argument match() is gawk-only)
    while (q = match(str,/"([^"\\]|\\.)*"/,quoted)) {
        # process as appropriate
        trprint(substr(str,1,q-1));
        # quoted[0] is the whole match, surrounding quotes included
        printf("%s\n",quoted[0]);
        # track progress for error reporting below
        len = q+length(quoted[0]);
        pos += len;
        # move on
        str = tail(str,len);
    }
    # at this point we have no quoted strings left
    if (q = match(str,/".*$/)) {
        printf("%s:%d: unmatched quote at position %d\n",
               FILENAME,NR,pos+q) > "/dev/stderr";
        exit(1);
    }
    # process what remains
    trprint(str);
    printf("\n");
}
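
The same regexp idea can be tried in Python. This is a loose sketch, not a port of the gawk script: the names TOKEN and tokens are made up, the surrounding quotes are dropped from each quoted string, and backslashes inside it are left in place. It reports an unmatched quote by raising ValueError.

```python
import re

# Either a quoted string (quote, then ordinary or backslash-escaped
# characters, then the closing quote) or a run of unquoted characters.
TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"|([^\s"]+)')


def tokens(line):
    """Split a line on unquoted whitespace, honoring backslash-escaped quotes."""
    result = []
    pos = 0
    while pos < len(line):
        if line[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(line, pos)
        if m is None:
            # only a quote without a partner can fail to match here
            raise ValueError(f'unmatched quote at position {pos + 1}')
        quoted, bare = m.groups()
        result.append(quoted if quoted is not None else bare)
        pos = m.end()
    return result


print('\n'.join(tokens('bla foo bar "kuku 2" test test')))
```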

-- 
Oleg Goldshmidt | [EMAIL PROTECTED]
