Re: Regexps
This thread is becoming hackers-il material...

On Wed, Feb 11, 2004, Arik Baratz wrote about "RE: Regexps":
> I would like to suggest that there are two forms of multiple-field
> variable-field-length flat-file formats. One form uses 'enclosure' for
> fields that need separation, like CSV. In this method, ...
> The first form cannot be parsed using a regular expression IMHO. I bet I can

I showed in a previous email how CSV *can* be parsed by a regular expression, albeit one that uses Perl's extended lookahead syntax. I agree with you that it's impossible to do with the classic regular-expression syntax (which basically has only *, |).

But it *is* possible to do it with a deterministic finite automaton (state machine), and that is even easier than writing a regular expression: all the state you need to remember is the parity of the number of double-quotes you have seen so far. If you have seen an even number of quotes and come across a comma, that comma is a field split. Otherwise, the comma is part of the field. It's as easy as that.

Here's the automaton to find a comma marking the end of a field (the default action on any other character, not drawn, is to loop into the same state):

    +------+  (")   +-----+
    |      | -----> |     |
    | EVEN |        | ODD |
    |      | <----- |     |
    +------+  (")   +-----+
       |
      (,)
       |
       v
    END OF FIELD

If you also want to replace two consecutive quotes by one, this slightly complicates the automaton and adds a few more states. Here's an automaton that *outputs* the next field (with out(..) actions on some edges):

    [diagram garbled in the archive: states EVEN, ODD, "ODD TMP" and
    "EVEN TMP"; edges labelled ("), (c) and (,); output actions out(c)
    and out("); exit to END OF FIELD on (,)]

(at least, I think. I'm just inventing this as I go along :) it's much easier to program in a normal programming language...)

> (like keeping the state with an outside variable). The problem IMHO is
> that you must use a lookahead greater than 1, that's why the perl
> lookahead extension works.

Perl's lookahead, which counts the number of quotes *after* the current comma, probably generates some sort of non-deterministic finite automaton code. As I explained above, unless I'm mistaken, it's even simpler to remember the parity of the quotes *before* this comma (one bit of state) and write a deterministic finite automaton directly.

The reason I gave a regular expression is that this is what the original poster asked for. I also think the trick I came up with (looking for an even number of quotes) is cool ;)

--
Nadav Har'El                        | Wednesday, Feb 11 2004, 19 Shevat 5764
[EMAIL PROTECTED]                   |-----------------------------------------
Phone: +972-53-790466, ICQ 13349191 |Share your knowledge. It's a way to
http://nadav.harel.org.il           |achieve immortality.

=================================================================
To unsubscribe, send mail to [EMAIL PROTECTED] with the word "unsubscribe" in the message body, e.g., run the command
echo unsubscribe | mail [EMAIL PROTECTED]
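As a rough illustration of the one-bit parity automaton described in this message, here is a minimal Python sketch. The thread itself uses Perl and gawk; Python is used here only for convenience, and the names `split_fields` and `unquote` are made up for this example. It assumes quotes inside a field are doubled, and it does not validate malformed records.

```python
def split_fields(record: str) -> list[str]:
    """Split a record on commas seen after an even number of quotes."""
    fields = []
    current = []
    even = True  # the single bit of state: parity of quotes seen so far
    for ch in record:
        if ch == '"':
            even = not even            # EVEN <-> ODD transition
            current.append(ch)
        elif ch == ',' and even:
            fields.append(''.join(current))  # END OF FIELD
            current = []
        else:
            current.append(ch)         # default edge: stay in the same state
    fields.append(''.join(current))    # last field has no trailing comma
    return fields


def unquote(field: str) -> str:
    """Drop enclosing quotes and turn each doubled quote back into one."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1].replace('""', '"')
    return field
```

Doing the un-doubling in a second pass avoids the extra TMP states that an output-producing automaton needs, at the cost of scanning each field twice.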
RE: Regexps
tr " " "\n"

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tal Achituv
Sent: Tuesday, February 10, 2004 15:04 PM
To: [EMAIL PROTECTED]
Subject: Regexps

Hi Guys,

can anyone give me a regular expression that turns

    bla foo bar "kuku 2" test test

into:

    bla
    foo
    bar
    kuku 2
    test
    test

(replaces spaces with \n only if it's not encapsulated in ")

Thanks,
Tal.
Re: Regexps
Hagay Unterman [EMAIL PROTECTED] writes:

> tr " " "\n"

This would be excusable if it were prepended with "UNTESTED". It does not do what the OP wants.

In general, handling string literals with regexps is not trivial, because you need to take into account escaped ", as in

    "foo \"sna fu\" bar"

and more complicated variants. Also, what if there are newlines inside the string?

Assuming there are only simple cases in your input (and some other things, like there is no

    foo "sna fu"bar

i.e. quoted strings are always whitespace-separated fields), here is a simple gawk parser that works on your example:

    #!/bin/gawk -f

    # return everything in str after position head
    function tail(str,head)
    {
        return substr(str,head+1,length(str)-head+1);
    }

    # print str with each run of blanks replaced by a newline
    function trprint(str)
    {
        gsub(/[ \t]+/,"\n",str);
        printf("%s",str);
    }

    {
        if (!NF) next;
        str = $0;
        while (len = index(str,"\"")) {
            trprint(substr(str,1,len-1));
            str = tail(str,len);
            end = index(str,"\"");
            if (!end) {
                printf("%s:%d: unmatched quote at position %d\n",
                       FILENAME,NR,len) > "/dev/stderr";
                exit(1);
            }
            printf("%s\n",substr(str,1,end-1));
            str = tail(str,end+1);
        }
        trprint(str);
        printf("\n");
    }

Hope it helps,

--
Oleg Goldshmidt | [EMAIL PROTECTED]
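For comparison, the original question can also be approached with a single lookahead regexp, in the spirit of the "even number of quotes" trick mentioned elsewhere in this thread. The sketch below is in Python rather than Perl, and the names `OUTSIDE` and `split_unquoted` are invented for illustration; it is not the expression that was actually posted. It assumes balanced quotes, no escaped quotes, and (following the example output in the question) that the quote characters themselves are dropped.

```python
import re

# A run of blanks is "outside quotes" when an even number of
# double-quotes follows it on the line. The lookahead only checks
# this; it consumes nothing.
OUTSIDE = re.compile(r'[ \t]+(?=(?:[^"]*"[^"]*")*[^"]*$)')


def split_unquoted(line: str) -> str:
    """Replace unquoted blanks with newlines, then drop the quotes."""
    broken = OUTSIDE.sub("\n", line)
    # Assumption: the quotes are not wanted in the output.
    return broken.replace('"', "")


print(split_unquoted('bla foo bar "kuku 2" test test'))
```

Because the lookahead rescans the rest of the line for every candidate blank, this is quadratic in the line length, which is one reason a small state machine can be preferable for long inputs.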
Re: Regexps
On Tue, Feb 10, 2004, Oleg Goldshmidt wrote about "Re: Regexps":
> In general, handling string literals with regexps is not trivial,
> because you need to take into account escaped ", as in
>
> "foo \"sna fu\" bar"

This may not be relevant for his situation.

One situation in which I once used a trick similar to the one I posted earlier was in breaking up a CSV (comma-separated values) file. In a CSV, the comma is the field separator (rather than the space in the poster's question), so a record might look like

    one,two,three,four,five

Now, the convention is that if field 'two' is to be replaced by something containing a comma, say '1,2,3', the field is quoted with double-quotes:

    one,"1,2,3",three,four,five

And you're supposed to split this record up on commas that are not inside quotes.

What happens if there are quotes in one of the fields? Each double-quote is replaced by two of them, which keeps the number of quotes even (quote parity), allows exactly the same method of splitting on commas, and allows for an easy reverse transformation. For example,

    one,"1,2,3",three,"he said ""hello!""",five

or

    one,"1,2,3",three,"he said ""hi, man!""",five

In the last line you know you shouldn't separate on the comma before 'man' because it has an odd number of quotes before (or after) it. Nice and simple :)

At least, that is what I remember. Sadly, the Wikipedia entry on CSV is non-existent, so I'm using my memory as the source ;)

Anyway, CSV is a simple record/field representation method, but it is very rarely used in Unix (it is more common in the Windows world). Tab-separated fields are, justifiably, much more common: they are easier to use and usually enough (and if you need tabs, separate the fields with some other character).

--
Nadav Har'El                        | Tuesday, Feb 10 2004, 19 Shevat 5764
[EMAIL PROTECTED]                   |-----------------------------------------
Phone: +972-53-790466, ICQ 13349191 |A messy desk is a sign of a messy mind.
http://nadav.harel.org.il           |An empty desk is a sign of an empty mind.
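The quoting convention described in this message (enclose a field in double-quotes when needed, and double any quote inside it) can be sketched in Python as follows. The helper names `quote_field`, `join_record` and `split_record` are hypothetical, and the convention is the one recalled from memory in the message, not a checked specification. The split uses a lookahead that requires an even number of quotes *after* the comma.

```python
import re

# A comma is a separator when an even number of quotes follows it.
SEPARATOR = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def quote_field(value: str) -> str:
    """Enclose the value in quotes if it contains a comma or a quote."""
    if ',' in value or '"' in value:
        # Doubling every quote keeps the total quote count even.
        return '"' + value.replace('"', '""') + '"'
    return value


def join_record(values: list[str]) -> str:
    return ','.join(quote_field(v) for v in values)


def split_record(record: str) -> list[str]:
    """Split on separator commas; fields are returned still quoted."""
    return SEPARATOR.split(record)


record = join_record(['one', '1,2,3', 'three', 'he said "hi, man!"', 'five'])
print(record)
print(split_record(record))
```

Every encoded field contributes an even number of quotes, so the parity test gives the same answer whether the quotes are counted before or after a comma.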
Re: Regexps
Nadav Har'El [EMAIL PROTECTED] writes:

> On Tue, Feb 10, 2004, Oleg Goldshmidt wrote about "Re: Regexps":
> > In general, handling string literals with regexps is not trivial,
> > because you need to take into account escaped ", as in
> >
> > "foo \"sna fu\" bar"
>
> This may not be relevant for his situation.

True, which is why I suggested a simple solution. To tell you the truth, I knew of perl's lookahead, but I am not much of a perl-monger and I didn't remember the syntax, and I don't know of any regexp engine other than perl that supports this very useful feature.[1] So I thought of the most straightforward (not necessarily the best) way to pair the quotes and process portions of the input accordingly.

> What happens if there are quotes in one of the fields? Each
> double-quote is replaced by two of them, keeping the evenness of the
> number of quotes (quote parity) and allowing exactly the same method
> of splitting on commas, and allowing for an easy reverse
> transformation.

Well, you are specifying an input convention that may or may not be applicable. I am sure I don't need to give examples of usage of backslash-escaped quotes in string literals. The escape convention should be specified. From Tal's description, for instance, it is not quite clear what the output from

    Nadav said, Hi, Oleg, and turned back to his code.

should be. Maybe the right output is reproducing the input verbatim (there is no unquoted whitespace)?

So I put a disclaimer about all sorts of assumptions made, and only went through paired quotes, not checking for more general odd/even cases. Of course, with an escape sequence that is not based on merely doubling the escaped character the odd/even rule breaks down, and I didn't think of it at all.

Another potential regexp pitfall is that, for better and for worse, different regexp engines behave differently, to the point of matching different things given the same regexp. Therefore, it may be unsafe to ask for a regexp without specifying the type of engine (or a specific tool, such as perl or awk).

Find some issues with the regexp in the code below[2].

[1] A really useful feature would be lookbehind, i.e. match anything
    but a double quote unless *preceded* by an odd number of
    consecutive backslashes. Not even perl supports that.

[2] Basing a parser on matching quoted strings as a whole will make it
    a bit difficult reporting unmatched quotes. The code below does a
    pretty good job on backslash-escaped quotes, but no warranty is
    implied ;-)

    #!/bin/gawk -f

    # return everything in str after position len
    function tail(str,len)
    {
        return substr(str,len+1,length(str)-len+1);
    }

    # print str with each run of blanks replaced by a newline
    function trprint(str)
    {
        gsub(/[ \t]+/,"\n",str);
        printf("%s",str);
    }

    {
        str = $0;
        pos = 0;
        while (q = match(str,/"([^"\\]|\\.)*"/,quoted)) {
            # process as appropriate
            trprint(substr(str,1,q-1));
            printf("%s\n",quoted[0]);
            # track progress for error reporting below
            len = q+length(quoted[0]);
            pos += len;
            # move on
            str = tail(str,len);
        }
        # at this point we have no quoted strings left
        if (q = match(str,/".*$/)) {
            printf("%s:%d: unmatched quote at position %d\n",
                   FILENAME,NR,pos+q) > "/dev/stderr";
            exit(1);
        }
        # process what remains
        trprint(str);
        printf("\n");
    }

--
Oleg Goldshmidt | [EMAIL PROTECTED]
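The same idea (match each quoted string as a whole, allowing backslash-escaped characters inside it) can be sketched in Python. This is an illustrative port, not a faithful one: the names `QUOTED` and `split_line` are made up, the function returns a list of pieces instead of printing, and it keeps the enclosing quotes. It shares the limitation that a backslash-escaped quote *outside* any quoted string is still treated as an opening quote.

```python
import re

# A quoted string: an opening quote, then any mix of ordinary
# characters and backslash-escaped characters, then a closing quote.
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')


def split_line(line: str) -> list[str]:
    """Split on blanks outside quoted strings; keep quoted strings whole."""
    pieces = []
    pos = 0
    for m in QUOTED.finditer(line):
        pieces.extend(line[pos:m.start()].split())  # unquoted part
        pieces.append(m.group(0))                   # quoted string, as is
        pos = m.end()
    rest = line[pos:]
    q = rest.find('"')
    if q != -1:
        # A quote that no complete quoted string accounted for.
        raise ValueError(f"unmatched quote at position {pos + q + 1}")
    pieces.extend(rest.split())
    return pieces
```

Tracking `pos` as an absolute offset into the original line, instead of repeatedly truncating the string, keeps the reported error position relative to the whole line.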