Re: Removing BOM from UTF-8

2006-02-19 Thread J. Erik Heinz
Hi,

Gerard Seibert [EMAIL PROTECTED] wrote
on 18.02.2006 - 16:57 (-0500 Zulu-Time):

 Benjamin A'Lee wrote:
  It shouldn't be writing any new files; it prints the filtered text to
  stdout.
  
  Ben
 
 OK, then that is the problem. I need it to actually write the file. It
 could either rename the old file and then rewrite it, which would be nice,
 or just overwrite the old file. The BOM is just the first three
 bytes in the file. I am assuming that it would not be removing
 anything else in the file.

Use a for loop in your shell:

# bash
# cd to/your/directory
# for i in *; do
#  nobom.sh "$i" > "$i.new"
# done

This will take every file in your directory and process each one with
nobom.sh, which will then write the result to a new file.
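
A slightly more defensive variant of this loop can be wrapped in a shell
function. This is only a sketch: the name filter_all and the rule of
skipping *.sh and *.new files are assumptions made for illustration, not
part of the original suggestion.

```sh
# filter_all FILTER: run FILTER on every regular file in the current
# directory and write its output to FILE.new.
# (filter_all is a hypothetical helper name.)
filter_all() {
    filter=$1
    for f in *; do
        [ -f "$f" ] || continue        # skip directories and other non-files
        case $f in
            *.new|*.sh) continue ;;    # assumed: skip earlier output and scripts
        esac
        "$filter" "$f" > "$f.new"
    done
}
```

Called as 'filter_all ./nobom.sh' from the directory that holds the text
files, it leaves the originals in place and writes one .new file per input.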

Make sure the #! line of your Perl script points to the perl
installation on your system. You can use 'which perl' to get its
location.

Cheers Erik

-- 
J. Erik Heinz
Keyboard-samuraing in process

:: All non-mailinglist mail to this email address will be deleted.

OpenBC: https://www.openbc.com/hp/JErik_Heinz
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Removing BOM from UTF-8

2006-02-19 Thread Gerard Seibert
J. Erik Heinz wrote:

 Use a for loop in your shell:
 
 # bash
 # cd to/your/directory
 # for i in *; do
 #  nobom.sh "$i" > "$i.new"
 # done
 
 This will take every file in your directory and process each one with
 nobom.sh, which will then write the result to a new file.
 
 Make sure the #! line of your Perl script points to the perl
 installation on your system. You can use 'which perl' to get its
 location.
 
 Cheers Erik

Thanks! I'll give it a try when I return to work.

-- 
Gerard


Removing BOM from UTF-8

2006-02-18 Thread Gerard Seibert
I have a large number of text files created in MS Word and saved in
UTF-8 format. Unfortunately, MS Word adds the BOM to each file. I need
to remove the BOM.

Information regarding BOM and UTF-8 can be found here:

http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www.w3.org/International/questions/qa-utf8-bom

A brief excerpt:

It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF)
as a signature to mark the beginning of a UTF-8 file. This practice
should definitely not be used on POSIX systems for several reasons:

* On POSIX systems, the locale and not magic file type codes define
 the encoding of plain text files. Mixing the two concepts would add a
 lot of complexity and break existing functionality.

* Adding a UTF-8 signature at the start of a file would interfere
 with many established conventions such as the kernel looking for “#!” at
 the beginning of a plaintext executable to locate the appropriate
 interpreter.

* Handling BOMs properly would add undesirable complexity even to
 simple programs like cat or grep that mix contents of several files into
 one.

It has been suggested that a script could be written to eliminate the
BOM from a file(s). My script writing skills suck. I have been unable to
locate one using Google, so I was hoping that someone might know where I
could either locate such a program, or perhaps give me an idea on how to
script one.

Thanks!

-- 
Gerard Seibert
[EMAIL PROTECTED]


 I'm interested in the fact that the less secure a man is, the more
 likely he is to have extreme prejudice.

  Clint Eastwood


Re: Removing BOM from UTF-8

2006-02-18 Thread Benjamin A'Lee
On Sat, 2006-02-18 at 11:28 -0500, Gerard Seibert wrote:
 It has been suggested that a script could be written to eliminate the
 BOM from a file(s). My script writing skills suck. I have been unable to
 locate one using Google, so I was hoping that someone might know where I
 could either locate such a program, or perhaps give me an idea on how to
 script one.

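#!/usr/bin/perl
# Read all input (stdin, or any files named as arguments) as a list of lines.
@file = <>;
# Strip the UTF-8 BOM bytes from the start of the first line, if present.
$file[0] =~ s/^\xEF\xBB\xBF//;
print(@file);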

That'll read a file from stdin, remove the BOM from the beginning of the
first line if it's present, and print it to stdout.

Hope it helps.

Ben

-- 
Termisoc Tech Officer: http://termisoc.org/
My Homepage: http://benalee.co.uk/
People demand freedom of speech as compensation for the freedom of
thought which they have but seldom use. -- Søren Kierkegaard 


Re: Removing BOM from UTF-8

2006-02-18 Thread Gerard Seibert
Benjamin A'Lee wrote:

 On Sat, 2006-02-18 at 11:28 -0500, Gerard Seibert wrote:
  It has been suggested that a script could be written to eliminate
  the BOM from a file(s). My script writing skills suck. I have been
  unable to locate one using Google, so I was hoping that someone
  might know where I could either locate such a program, or perhaps
  give me an idea on how to script one.

 #!/usr/bin/perl
 @file = <>;
 $file[0] =~ s/^\xEF\xBB\xBF//;
 print(@file);

 That'll read a file from stdin, remove the BOM from the beginning of
 the first line if it's present, and print it to stdout.

 Hope it helps.

     Ben

 --
 Termisoc Tech Officer: http://termisoc.org/
 My Homepage: http://benalee.co.uk/
 People demand freedom of speech as compensation for the freedom of
 thought which they have but seldom use. -- Søren Kierkegaard

Maybe I am doing something wrong, but it does not appear to be working 
correctly. I named the file nobom.sh and put it in the same directory 
as the files I want to convert. I also set the program permission to 
0755.

Typing the program name does nothing; I have to precede it with
'perl'. Even then, it does not appear to work correctly. In the 
following example, the file is parsed, but not converted.

perl nobom.sh testfile

Am I doing something incorrectly here?

Thanks!

-- 
Gerard Seibert
[EMAIL PROTECTED]

PGP: http://www.seibercom.net/sig/gerard.asc




Re: Removing BOM from UTF-8

2006-02-18 Thread Benjamin A'Lee
On Sat, 2006-02-18 at 14:34 -0500, Gerard Seibert wrote:
 Maybe I am doing something wrong, but it does not appear to be working 
 correctly. I named the file nobom.sh and put it in the same directory 
 as the files I want to convert. I also set the program permission to 
 0755.
 
 Typing the program name does nothing; I have to precede it with
 'perl'. Even then, it does not appear to work correctly. In the 
 following example, the file is parsed, but not converted.

Sorry; try changing the first line to #!/usr/local/bin/perl

 perl nobom.sh testfile
 
 Am I doing something incorrectly here?

Try:

cat testfile | nobom.sh

Though the way you describe appears to work here:

$ cat bom-testfile | hd
00000000  ef bb bf 23 20 42 4f 4d  20 74 65 73 74 20 66 69  |...# BOM test fi|
00000010  6c 65 0a                                          |le.|
00000013
$ bomkill.pl bom-testfile | hd
00000000  23 20 42 4f 4d 20 74 65  73 74 20 66 69 6c 65 0a  |# BOM test file.|
00000010
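
On a system without hd, od can show the same leading bytes. A small
sketch, assuming a POSIX shell and a head that supports -c; sample.txt is
a hypothetical file created only for the demonstration:

```sh
# Create a hypothetical sample file that starts with the UTF-8 BOM
# (bytes EF BB BF, written here as octal escapes).
printf '\357\273\277Subject: test\n' > sample.txt

# Dump the first three bytes in hex; a BOM shows up as "ef bb bf".
head -c 3 sample.txt | od -An -tx1
```

Running the same head/od pipeline on the filtered output should then show
the first three bytes of the real text instead.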


Ben

-- 
Termisoc Tech Officer: http://termisoc.org/
My Homepage: http://benalee.co.uk/
People demand freedom of speech as compensation for the freedom of
thought which they have but seldom use. -- Søren Kierkegaard 


Re: Removing BOM from UTF-8

2006-02-18 Thread Gerard Seibert
Benjamin A'Lee wrote:

 On Sat, 2006-02-18 at 14:34 -0500, Gerard Seibert wrote:
  Maybe I am doing something wrong, but it does not appear to be
  working correctly. I named the file nobom.sh and put it in the same
  directory as the files I want to convert. I also set the program
  permission to 0755.
 
  Typing the program name does nothing; I have to precede it with
  'perl'. Even then, it does not appear to work correctly. In the
  following example, the file is parsed, but not converted.

 Sorry; try changing the first line to #!/usr/local/bin/perl

  perl nobom.sh testfile
 
  Am I doing something incorrectly here?

 Try:

 cat testfile | nobom.sh

 Though the way you describe appears to work here:

 $ cat bom-testfile | hd
 00000000  ef bb bf 23 20 42 4f 4d  20 74 65 73 74 20 66 69  |...# BOM test fi|
 00000010  6c 65 0a                                          |le.|
 00000013
 $ bomkill.pl bom-testfile | hd
 00000000  23 20 42 4f 4d 20 74 65  73 74 20 66 69 6c 65 0a  |# BOM test file.|
 00000010


     Ben

Something appears to be wrong here. First, the file will not run unless 
I precede it with 'perl'. I have another perl script in the same 
directory that runs just fine without any special prefixes. Also, the 
script does not seem to remove the BOM entity.

This is the script as I have it entered:

#!/usr/local/bin/perl
use warnings;
use diagnostics -verbose;
@file = <>;
$file[0] =~ s/^\xEF\xBB\xBF//;
print(@file);

I have the file permissions set to 0755. Is there anything else that 
could be causing this to fail?

This is the first line of the file I am attempting to fix (well one of 
them).

Subject:

That is what appears when I use pico to view the file.

-- 
Gerard Seibert
[EMAIL PROTECTED]

PGP: http://www.seibercom.net/sig/gerard.asc




Re: Removing BOM from UTF-8

2006-02-18 Thread Gerard Seibert
As I continue to play with this, it has become apparent that the new
file is not being written, or at least I cannot locate it. Since I do
not know perl, I have no idea where to look for answers.

-- 
Gerard Seibert
[EMAIL PROTECTED]



Re: Removing BOM from UTF-8

2006-02-18 Thread Benjamin A'Lee
On Sat, 2006-02-18 at 16:14 -0500, Gerard Seibert wrote:
 As I continue to play with this, it has become apparent that the new
 file is not being written, or at least I cannot locate it. Since I do
 not know perl, I have no idea where to look for answers.

It shouldn't be writing any new files; it prints the filtered text to
stdout.

Ben

-- 
Termisoc Tech Officer: http://termisoc.org/
My Homepage: http://benalee.co.uk/
People demand freedom of speech as compensation for the freedom of
thought which they have but seldom use. -- Søren Kierkegaard 


Re: Removing BOM from UTF-8

2006-02-18 Thread Andrew Pantyukhin
I use this to add BOM:
http://search.cpan.org/~lyokato/UTF8BOM-1.01/lib/UTF8BOM.pm

You shouldn't be so fixed on eliminating BOMs; it's
quite a nice concept, and it causes less trouble than you
think.


Re: Removing BOM from UTF-8

2006-02-18 Thread Gerard Seibert
Benjamin A'Lee wrote:

 It shouldn't be writing any new files; it prints the filtered text to
 stdout.
 
 Ben

OK, then that is the problem. I need it to actually write the file. It
could either rename the old file and then rewrite it, which would be nice,
or just overwrite the old file. The BOM is just the first three
bytes in the file. I am assuming that it would not be removing
anything else in the file.
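
One way to get that behaviour without Perl is sketched below. It assumes
a POSIX shell plus a head that accepts -c; the function name strip_bom and
the .bak suffix are made up for this example.

```sh
# strip_bom FILE: if FILE starts with the bytes EF BB BF, rename the
# original to FILE.bak and write a copy without those three bytes to FILE.
# (strip_bom and the .bak suffix are hypothetical choices.)
strip_bom() {
    first3=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$first3" = "efbbbf" ] || return 0   # no BOM: leave the file alone
    mv "$1" "$1.bak" || return 1           # keep the original as a backup
    tail -c +4 "$1.bak" > "$1"             # copy from byte 4 onward
}
```

Running strip_bom on each file in a loop (for f in *.txt; do strip_bom
"$f"; done) would cover a whole directory; files without a BOM are left
untouched. Note that the rewritten file is created fresh, so it gets
default permissions rather than those of the original.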

-- 
Gerard Seibert
[EMAIL PROTECTED]
