Re: Regex and Mac vs UNIX line endings

2006-07-21 Thread Peter N Lewis

At 19:25 +0200 20/7/06, kurtz le pirate wrote:

hum... is the 'end of line' character important ?

if not, you can do something like that :
while (<FILE>) {
  chomp;
  if (/.../) { ... }
  }

yes ? no ?


Not really, because if the file has Mac line endings, then that will 
read the entire file in a single gulp.  Also, if the file has DOS line 
endings, then the chomp will remove only the linefeed and leave the 
carriage return (unless you have changed $/ to CRLF, in which case it 
will not remove a lone linefeed).
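
Both effects can be seen in a small sketch. It uses in-memory filehandles in place of real files, the sample strings and the count_records helper are made up for illustration, and it assumes a Perl where "\n" is \012 (any UNIX-like system, including Mac OS X):

```perl
use strict;
use warnings;

# Count the records that <$fh> returns under the default $/ ("\n").
sub count_records {
    my ($data) = @_;
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    return scalar @records;
}

my $mac_records  = count_records("one\015two\015three\015");   # CR only
my $unix_records = count_records("one\012two\012three\012");   # LF only

# chomp removes only a trailing $/, so a DOS line keeps its CR.
my $dos_line = "one\015\012";
chomp $dos_line;

printf "Mac records: %d, UNIX records: %d\n", $mac_records, $unix_records;
printf "DOS line still ends in CR after chomp: %s\n",
    $dos_line =~ /\015\z/ ? "yes" : "no";
```

The CR-only data comes back as a single record because no "\n" is ever found to end a line.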


If you first check the file and determine the line endings (and the 
file has consistent line endings, which is not always the case) and 
set $/ appropriately, then what you suggest will work.
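
A minimal sketch of that check-then-read approach follows. The names guess_eol and read_lines and the 512-byte sample size are assumptions chosen for illustration, and the guess is only as good as the sample: a file with mixed endings, or with a first line longer than the sample, will fool it.

```perl
use strict;
use warnings;

# Guess the line ending from a sample of text. CRLF is tested first,
# because a CRLF file also contains \015 and \012 characters.
sub guess_eol {
    my ($sample) = @_;
    return "\015\012" if $sample =~ /\015\012/;
    return "\015"     if $sample =~ /\015/;
    return "\012";    # LF, or no line ending seen at all
}

# Read all lines from a seekable filehandle using the guessed ending.
sub read_lines {
    my ($fh) = @_;
    binmode $fh;                       # no CRLF translation on any platform
    read $fh, my $sample, 512;         # peek at the first 512 bytes
    $sample = '' unless defined $sample;
    seek $fh, 0, 0 or die "cannot rewind: $!";
    local $/ = guess_eol($sample);     # restored automatically on return
    my @lines = <$fh>;
    chomp @lines;                      # chomp removes the current $/
    return @lines;
}

my $data = "first\015second\015third\015";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my @lines = read_lines($fh);
close $fh;
print scalar(@lines), " lines: @lines\n";
```

Using local keeps the change to $/ from leaking into the rest of the program.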


Enjoy,
   Peter.

--
Check out Interarchy 8.1.1, just released, now with Amazon S3 support.
http://www.stairways.com/  http://download.stairways.com/


Re: Regex and Mac vs UNIX line endings

2006-07-20 Thread Peter N Lewis

I'm processing a string with embedded newlines. For testing I was
storing the text in __DATA__ and slurping it into a string. This works
fine. However when I read in a file, I'm having trouble with the line
endings. Matching beginning/end of logical lines is not working as I
expect. Regexes like the one below match when using the DATA filehandle,
but don't when opening other text files on my Mac.

$text =~ s/^Text to match.*$//m;

Is this due to UNIX '\n' vs. Mac '\r' line endings? I assumed the 'm'
modifier would recognize any line ending.

Oh what to do?


You have several possibilities, depending on what you are trying to do.

You could explicitly match either line ending, as in:

$text =~ s/(\012|\015|\A)Text to match[^\012\015]*(\012|\015|\z)/$1$2/;

or using backward/forward (look-behind/look-ahead) assertions:

$text =~ s/(?:\A|(?<=\012|\015))Text to match[^\012\015]*(?=\012|\015|\z)//;

(the convoluted backward assertion is required because backward 
assertions must have a fixed length, and \A has zero width while 
\012 and \015 have width one)
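
As a quick check, the look-behind form can be tried against all three line-ending styles. This is only a sketch: the strip_line wrapper and the sample lines are made up for the test.

```perl
use strict;
use warnings;

# The look-behind/look-ahead substitution, wrapped in a sub so it can
# be tried on each line-ending style.
sub strip_line {
    my ($text) = @_;
    $text =~ s/(?:\A|(?<=\012|\015))Text to match[^\012\015]*(?=\012|\015|\z)//;
    return $text;
}

for my $eol ("\012", "\015", "\015\012") {
    my $before = join $eol, 'keep 1', 'Text to match here', 'keep 2', '';
    my $after  = strip_line($before);
    # The line's content is removed; its line ending stays in place.
    my $expected = join $eol, 'keep 1', '', 'keep 2', '';
    print $after eq $expected ? "ok\n" : "not ok\n";
}
```

Because the assertions consume nothing, the surrounding line endings are left untouched and no $1$2 is needed in the replacement.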


Or you could convert $text to \n line endings:

$text =~ s/(\015\012|\012|\015)/\n/g;
$text =~ s/^Text to match.*$//m;
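
A short sketch of why the conversion helps, using a made-up CR-only sample string. Without conversion, ^ under /m can match only at the very start of the string, because there is no "\n" anywhere in it.

```perl
use strict;
use warnings;

my $text = "keep 1\015Text to match here\015keep 2\015";   # Mac (CR) endings

# No "\n" in the string, so ^ matches only at position 0 and the
# original substitution finds nothing to remove.
(my $unconverted = $text) =~ s/^Text to match.*$//m;
my $changed_before = $unconverted ne $text ? 1 : 0;

# After conversion, ^ and $ see the line boundaries they expect.
(my $converted = $text) =~ s/(\015\012|\012|\015)/\n/g;
$converted =~ s/^Text to match.*$//m;

print "changed without conversion: $changed_before\n";
print "converted text has ", ($converted =~ tr/\n//), " newlines\n";
```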

Or you could detect the line ending and explicitly use it.

Enjoy,
   Peter.



Re: Regex and Mac vs UNIX line endings

2006-07-20 Thread Bruce Van Allen
Peter gave some good examples, so I shortened this to supplement his
suggestions.

I prefer to determine what the end-of-line (eol) character is using
something less slippery than \r and \n. In Perl, \n is the native eol
for the OS that Perl is executing under, so it could be any of the \n,
\r, \r\n, etc., constructs.

Instead, use the octal characters, which for this are:

Mac               CR    (Carriage Return)   \015
UNIX, Linux, VMS  LF    (Line Feed)         \012
Win               CRLF                      \015\012

BTW, many apps in Mac OS X (Excel, Filemaker Pro) continue to use the
eol used in OS 9 and before (CR), not the UNIX eol (LF).

Here's my favorite way to get the eol and convert it to native, no
matter what's in the original file (at least in the popular OSes):

$text   =~ s/(\015?\012|\015)/\n/gs;

You could also specify what you want, if that isn't simply the native
eol:

my $new_eol = "\015";  # or "\012" or "\015\012"
$text   =~ s/(\015?\012|\015)/$new_eol/gs;
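
The same substitution can be wrapped in a small helper. This is a sketch only: convert_eol and the mixed sample string are hypothetical names and data, not anything from a module.

```perl
use strict;
use warnings;

# Convert every line ending in $text to $new_eol (default: native "\n").
# \015?\012 comes first so that a CRLF pair counts as one ending, not two.
sub convert_eol {
    my ($text, $new_eol) = @_;
    $new_eol = "\n" unless defined $new_eol;
    $text =~ s/\015?\012|\015/$new_eol/g;
    return $text;
}

my $mixed   = "dos\015\012mac\015unix\012";
my $as_lf   = convert_eol($mixed, "\012");
my $as_crlf = convert_eol($mixed, "\015\012");

printf "LF version has %d line endings\n", ($as_lf =~ tr/\012//);
printf "CRLF version is %d bytes long\n", length $as_crlf;
```

Note that the target eol has to be a quoted string; a bare \015 outside quotes is a reference to a number, not a carriage return.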

If the file is large, then you may need to use a heuristic (that is,
test some of the text trying to detect a pattern), as Doug suggests,
testing the first x characters of the file to find one of the above eol
constructs, and then seeing whether it shows up again, and then backing
up and processing the whole file. Or use the look-ahead/behind
approaches that Peter suggests.

1;


- Bruce

__bruce__van_allen__santa_cruz__ca__


Re: Regex and Mac vs UNIX line endings

2006-07-20 Thread kurtz le pirate
In article 
[EMAIL PROTECTED],
 [EMAIL PROTECTED] (Andrew Brosnan) wrote:

 I'm processing a string with embedded newlines. For testing I was
 storing the text in __DATA__ and slurping it into a string. This works
 fine. However when I read in a file, I'm having trouble with the line
 endings. Matching beginning/end of logical lines is not working as I
 expect. Regexes like the one below match when using the DATA filehandle,
 but don't when opening other text files on my Mac.
 
 $text =~ s/^Text to match.*$//m;
 
 Is this due to UNIX '\n' vs. Mac '\r' line endings? I assumed the 'm'
 modifier would recognize any line ending.
 
 Oh what to do?
 
 Andrew

hum... is the 'end of line' character important ?

if not, you can do something like that :
while (<FILE>) {
  chomp;
  if (/.../) { ... }
  }


yes ? no ?

-- 
klp


Solved - Re: Regex and Mac vs UNIX line endings

2006-07-20 Thread Andrew Brosnan
All set with this. Converting the line endings worked fine. Thanks.

Andrew




Regex and Mac vs UNIX line endings

2006-07-19 Thread Andrew Brosnan
I'm processing a string with embedded newlines. For testing I was
storing the text in __DATA__ and slurping it into a string. This works
fine. However when I read in a file, I'm having trouble with the line
endings. Matching beginning/end of logical lines is not working as I
expect. Regexes like the one below match when using the DATA filehandle,
but don't when opening other text files on my Mac.

$text =~ s/^Text to match.*$//m;

Is this due to UNIX '\n' vs. Mac '\r' line endings? I assumed the 'm'
modifier would recognize any line ending.

Oh what to do?

Andrew



Re: Regex and Mac vs UNIX line endings

2006-07-19 Thread Robert Hicks

Andrew Brosnan wrote:

I'm processing a string with embedded newlines. For testing I was
storing the text in __DATA__ and slurping it into a string. This works
fine. However when I read in a file, I'm having trouble with the line
endings. Matching beginning/end of logical lines is not working as I
expect. Regexes like the one below match when using the DATA filehandle,
but don't when opening other text files on my Mac.

$text =~ s/^Text to match.*$//m;

Is this due to UNIX '\n' vs. Mac '\r' line endings? I assumed the 'm'
modifier would recognize any line ending.

Oh what to do?

Andrew

What version of the Mac? Anything in the OSX family is Unix and uses the 
standard \n line ending/new line. If you brought the files over then 
yes you are going to have the '\r' line ending.


:Robert


Re: Regex and Mac vs UNIX line endings

2006-07-19 Thread Andrew Brosnan
On 7/19/06 at 9:51 PM, [EMAIL PROTECTED] (Robert Hicks) wrote:

 Andrew Brosnan wrote:
  I'm processing a string with embedded newlines. For testing I was
  storing the text in __DATA__ and slurping it into a string. This 
  works fine. However when I read in a file, I'm having trouble with 
  the line endings. Matching beginning/end of logical lines is not 
  working as I expect. Regexes like the one below match when using 
  the DATA filehandle, but don't when opening other text files on my 
  Mac.
  
  $text =~ s/^Text to match.*$//m;
  
  Is this due to UNIX '\n' vs. Mac '\r' line endings? I assumed the 
  'm' modifier would recognize any line ending.
  
  Oh what to do?
  
  Andrew
  
 What version of the Mac?

10.3.9

 Anything in the OSX family is Unix and uses the 
 standard \n line ending

I don't think that is the case. These are text files created on 10.3.9
and they use \r for line endings. The problem is that /^.*$/ won't match
lines ending with \r even with the m modifier.

Andrew




Re: Regex and Mac vs UNIX line endings

2006-07-19 Thread Doug McNutt
If you want to adjust the line ends in the files have a look at:

ftp://ftp.macnauchtan.com/Software/LineEnds/FixEndsFolder.sit  52 kB
ftp://ftp.macnauchtan.com/Software/LineEnds/ReadMe_fixends.txt  4 kB

Yeah. It's pretty easy in perl too.

I have, on occasion, read the first few hundred characters of a file and then 
searched for \n and \r and \r\n. From that I make a guess and reopen the file 
for line-by-line reading after setting $/ to what I found.

If you slurp in the whole string you can play with

$option1 = split /\n/, $thedata;  # scalar context: number of fields
$option2 = split /\r/, $thedata;

Which option has the most elements?

split /(\r|\n)/, $thedata; # is an idea I just had. I wonder? 
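
For what it's worth, here is a rough sketch of both ideas on a made-up CRLF sample. It uses the octal escapes rather than \r and \n, counts the candidates directly instead of comparing split results, and shows what the capturing split returns.

```perl
use strict;
use warnings;

my $thedata = "alpha\015\012beta\015\012gamma\015\012";   # DOS (CRLF) sample

# Count each candidate directly; tr/// returns the number of matches.
my $cr_count   = ($thedata =~ tr/\015//);
my $lf_count   = ($thedata =~ tr/\012//);
my $crlf_count = () = $thedata =~ /\015\012/g;

# Capturing parentheses make split return the separators as fields too,
# and a CRLF pair yields an empty field between its CR and its LF.
my @with_separators = split /(\015|\012)/, $thedata;

# Treating CRLF as a single separator gives just the lines.
my @lines = split /\015\012|\012|\015/, $thedata;

print "CR=$cr_count LF=$lf_count CRLF=$crlf_count\n";
print scalar(@with_separators), " fields with separators, ",
      scalar(@lines), " lines\n";
```

Equal CR, LF and CRLF counts point to DOS endings; a CR count with no LFs points to old Mac endings, and the reverse to UNIX.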
-- 

-- Science is the business of discovering and codifying the rules and methods 
employed by the Intelligent Designer. Religions provide myths to mollify the 
anxiety experienced by those who choose not to participate. --