Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4

2009-12-03 Thread Avi Rozen
Craig Ringer wrote:
 Kern Sibbald wrote:
   
 Hello,

 Thanks for all the answers; I am a bit overwhelmed by the number, so I am 
 going to try to answer everyone in one email.

 The first thing to understand is that it is *impossible* to know what the 
 encoding is on the client machine (FD -- or File daemon).  On say a 
 Unix/Linux system, the user could create filenames with non-UTF-8 then 
 switch 
 to UTF-8, or restore files that were tarred on Windows or on Mac, or simply 
 copy a Mac directory.  Finally, using system calls to create a file, you can 
 put *any* character into a filename.
 

 While true in theory, in practice it's pretty unusual to have filenames
 encoded with an encoding other than the system LC_CTYPE on a modern
 UNIX/Linux/BSD machine.
   

In my case garbage filenames are all too common. It's the sad
*reality*, when you're mixing languages (Hebrew and English in my case)
and operating systems. Garbage filenames are everywhere: directories and
files shared between different operating systems and file systems, mail
attachments, mp3 file names based on garbage id3 tags, files in zip
archives (which seem to not handle filename encoding at all), etc.

When I first tried Bacula (version 1.38), I expected to have trouble
with filenames, since this is what I'm used to. I was rather pleased to
find out that it could both back up and restore files, regardless of
origin and destination filename encoding.

I like Bacula because, among other things, it can take the punishment
and chug along, without me even noticing that there was supposed to be a
problem (a recent example: backup/restore files with a negative mtime ...)

My 2c
Avi

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4

2009-12-03 Thread Kern Sibbald

 Craig Ringer wrote:
 Kern Sibbald wrote:

 Hello,

 Thanks for all the answers; I am a bit overwhelmed by the number, so I
 am
 going to try to answer everyone in one email.

 The first thing to understand is that it is *impossible* to know what
 the
 encoding is on the client machine (FD -- or File daemon).  On say a
 Unix/Linux system, the user could create filenames with non-UTF-8 then
 switch
 to UTF-8, or restore files that were tarred on Windows or on Mac, or
 simply
 copy a Mac directory.  Finally, using system calls to create a file,
 you can
 put *any* character into a filename.


 While true in theory, in practice it's pretty unusual to have filenames
 encoded with an encoding other than the system LC_CTYPE on a modern
 UNIX/Linux/BSD machine.


 In my case garbage filenames are all too common. It's the sad
 *reality*, when you're mixing languages (Hebrew and English in my case)
 and operating systems. Garbage filenames are everywhere: directories and
 files shared between different operating systems and file systems, mail
 attachments, mp3 file names based on garbage id3 tags, files in zip
 archives (which seem to not handle filename encoding at all), etc.

Yes, that is my experience too.  I understand Craig's comments, but I
would much prefer that Bacula just back up and restore and leave the
checking of filename consistency to other programs.  At least for the
moment, that seems to work quite well.  Obviously, if users mix character
sets, the display of filenames in Bacula will sometimes be weird, but
nevertheless Bacula will back up and restore them so that what was on the
system before the backup is what is restored.


 When I first tried Bacula (version 1.38), I expected to have trouble
 with filenames, since this is what I'm used to. I was rather pleased to
 find out that it could both back up and restore files, regardless of
 origin and destination filename encoding.

 I like Bacula because, among other things, it can take the punishment
 and chug along, without me even noticing that there was supposed to be a
 problem (a recent example: backup/restore files with a negative mtime ...)


Thanks.  Thanks also for using Bacula :-)

Best regards,

Kern





Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4

2009-12-03 Thread Craig Ringer
Frank Sweetser wrote:

 Unless, of course, you're at a good sized school with lots of
 international students, and have fileservers holding filenames created
 on desktops running in Chinese, Turkish, Russian, and other locales.

What I struggle with here is why they're not using ru_RU.UTF-8,
zh_CN.UTF-8, etc. as their locales. Why mix charsets?

I don't think that these people should be forced to use a utf-8 database
and encoding conversion if they want to do things like mix-and-match
charsets for file name chaos on their machines, though. I'd just like to
be able to back up systems that _do_ have consistent charsets in ways
that permit me to later reliably search for files by name, restore to
any host, etc.

Perhaps I'm strange in thinking that all this mix-and-match encodings
stuff is bizarre and backward. The Mac OS X and Windows folks seem to
agree, though. Let the file system store unicode data, and translate at
the file system or libc layer for applications that insist on using
other encodings.

I do take Greg Stark's point (a) though. As *nix systems stand,
solutions will only ever be mostly-works, not always-works, which I
agree isn't good enough. Since there's no sane agreement about encodings
on *nix systems and everything is just byte strings that different apps
can interpret in different ways under different environmental
conditions, we may as well throw up our hands in disgust and give up
trying to do anything sensible. The alternative is saying that files the
file system considers legal can't be backed up because of file naming,
which I do agree isn't ok.

The system shouldn't permit those files to exist, either, but I suspect
we'll still have borked encoding-agnostic wackiness as long as we have
*nix systems at all since nobody will ever agree on anything for long
enough to change it.

Sigh. I think this is about the only time I've ever wished I was using
Windows (or Mac OS X).

Also: Greg, your point (c) goes two ways. If I can't trust my backup
software to restore my filenames from one host exactly correctly to
another host that may have configuration differences not reflected in
the backup metadata, a different OS revision, etc, then what good is it
for disaster recovery? How do I even know what those byte strings
*mean*? Bacula doesn't even record the default system encoding with
backup jobs so there's no way for even the end user to try to fix up the
file names for a different encoding. You're faced with some byte strings
in wtf-is-this-anyway encoding and guesswork. Even recording lc_ctype in
the backup job metadata and offering the _option_ to convert encoding on
restore would be a big step (though it wouldn't fix the breakage with
searches by filename not matching due to encoding mismatches).
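
A minimal sketch, in Python, of what recording the encoding and offering conversion on restore might look like. Nothing here is Bacula code: the function names and the metadata key are hypothetical, and the sketch assumes the whole backup job used one encoding.

```python
import locale


def job_metadata():
    """Build hypothetical per-job metadata that records the client's encoding."""
    # getpreferredencoding(False) reports the encoding implied by the current
    # locale settings without modifying them.
    return {"client_encoding": locale.getpreferredencoding(False)}


def convert_on_restore(raw_name, recorded_encoding, target_encoding):
    """Re-encode a stored file name for a host that uses another encoding.

    Raises UnicodeError if the bytes are not valid in the recorded encoding
    or cannot be represented in the target encoding.
    """
    return raw_name.decode(recorded_encoding).encode(target_encoding)
```

For example, a name saved on a Latin-1 host as b"caf\xe9.txt" would be restored on a UTF-8 host as b"caf\xc3\xa9.txt". Conversion stays optional, so a byte-for-byte restore remains the default.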

Personally, I'm just going to stick to a utf-8 only policy for all my
hosts, working around the limitation that way. It's worked ok thus far,
though I don't much like the way that different normalizations of
unicode won't match equal under SQL_ASCII so I can't reliably search for
file names.
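
To illustrate the normalization problem: the two byte strings below both display as the same name, yet a byte-wise comparison (which is what an SQL_ASCII database effectively does) treats them as different. This is a small Python sketch; the helper name is made up for the example.

```python
import unicodedata


def same_name(a, b):
    """Compare two UTF-8 encoded file names after normalizing both to NFC."""
    na = unicodedata.normalize("NFC", a.decode("utf-8"))
    nb = unicodedata.normalize("NFC", b.decode("utf-8"))
    return na == nb


# Precomposed form: a single code point for e-acute (typical on Linux).
composed = "caf\u00e9.txt".encode("utf-8")
# Decomposed form: 'e' followed by a combining acute accent (typical of
# names created on Mac OS X).
decomposed = "cafe\u0301.txt".encode("utf-8")

bytes_equal = composed == decomposed              # False: the bytes differ
names_equal = same_name(composed, decomposed)     # True after normalization
```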

--
Craig Ringer



Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4

2009-12-03 Thread Ivan Sergio Borgonovo
On Thu, 3 Dec 2009 12:22:50 +0100 (CET)
Kern Sibbald k...@sibbald.com wrote:

 Yes, that is my experience too.  I understand Craig's comments,
 but I would much prefer that Bacula just back up and restore and
 leave the checking of filename consistency to other programs.
 At least for the moment, that seems to work quite well.  Obviously,
 if users mix character sets, the display of filenames in
 Bacula will sometimes be weird, but nevertheless Bacula will back up
 and restore them so that what was on the system before the backup is
 what is restored.

I expect backup software to have predictable, reversible behaviour
and to warn me if I'm shooting myself in the foot.

It should be the responsibility of the admin to restore files in a
proper place knowing that locales may be a problem.

I think Bacula is taking the right approach.

Still, I'd certainly appreciate a tool that helps me restore files
to a system with a different locale than the original one, or that
warns me if the locale is different or if it can't be sure it is
the same.
That's exactly what PostgreSQL is doing: at least warning you.
PostgreSQL, too, is taking the right approach.

An additional guessed original locale field and a tool/option to
convert/restore with selected locale could be an interesting feature.
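
A rough Python sketch of how such a "guessed locale" field could be filled in. It is only a heuristic under stated assumptions: the caller supplies the candidate encodings in order of preference, and the function name is hypothetical. Single-byte encodings such as Latin-1 accept almost any byte string, so they only make sense as the last candidate.

```python
def guess_encoding(raw_name, candidates):
    """Return the first candidate encoding that decodes raw_name, else None."""
    for enc in candidates:
        try:
            raw_name.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None
```

With candidates ("ascii", "utf-8", "cp1255"), a Hebrew name stored as CP1255 bytes fails the first two decoders and is reported as "cp1255". A successful decode does not prove the guess is right, so the result should be stored as a hint, not as a fact.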

What is Bacula going to do with xattr on different systems?

PostgreSQL seems to offer a good choice of tools to convert between
encodings and deal with bytea.
Formally I'd prefer bytea, but in real use it may just be an
additional pain, and other databases may not offer the same tools for
encoding/bytea conversions.

Is it possible to search for a file in a backup set?
What is going to happen if I'm searching from a system that has a
different locale from the one the backup was made on?
Can I use regexp? Can accents be ignored during searches?


-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it




Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4

2009-12-03 Thread Frank Sweetser

On 12/03/2009 10:54 AM, Craig Ringer wrote:

Frank Sweetser wrote:


Unless, of course, you're at a good sized school with lots of
international students, and have fileservers holding filenames created
on desktops running in Chinese, Turkish, Russian, and other locales.


What I struggle with here is why they're not using ru_RU.UTF-8,
zh_CN.UTF-8, etc. as their locales. Why mix charsets?


The problem isn't so much what they're using on their unmanaged desktops.  The 
problem is that the server, which is the one getting backed up, holds an 
aggregation of files created by an unknown collection of applications running 
on a mish-mash of operating systems (every large edu has its horror story of 
the 15+ year old, unpatched, mission critical machine that no one dares touch) 
with wildly varying charset configurations, no doubt including horribly broken 
and pre-UTF ones.


The end result is a fileset full of filenames created on a hacked Chinese copy 
of XP, a Russian copy of WinME, Romanian Red Hat 4.0, and Mac OS 8.


This kind of junk is, sadly, not uncommon in academic environments, where IT 
is often required to support stuff that they don't get to manage.


--
Frank Sweetser fs at wpi.edu  |  For every problem, there is a solution that
WPI Senior Network Engineer   |  is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4  E8A3 8E39 29E3 E2E8 8CEC



Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4

2009-12-03 Thread Eitan Talmi
Hi Avi

Please have a look at this link; it shows how to install Bacula with a
MySQL database with Hebrew support

Eitan


On Thu, Dec 3, 2009 at 12:35 PM, Avi Rozen avi.ro...@gmail.com wrote:

 Craig Ringer wrote:
  Kern Sibbald wrote:
 
  Hello,
 
  Thanks for all the answers; I am a bit overwhelmed by the number, so I
 am
  going to try to answer everyone in one email.
 
  The first thing to understand is that it is *impossible* to know what
 the
  encoding is on the client machine (FD -- or File daemon).  On say a
  Unix/Linux system, the user could create filenames with non-UTF-8 then
 switch
  to UTF-8, or restore files that were tarred on Windows or on Mac, or
 simply
  copy a Mac directory.  Finally, using system calls to create a file, you
 can
  put *any* character into a filename.
 
 
  While true in theory, in practice it's pretty unusual to have filenames
  encoded with an encoding other than the system LC_CTYPE on a modern
  UNIX/Linux/BSD machine.
 

 In my case garbage filenames are all too common. It's the sad
 *reality*, when you're mixing languages (Hebrew and English in my case)
 and operating systems. Garbage filenames are everywhere: directories and
 files shared between different operating systems and file systems, mail
 attachments, mp3 file names based on garbage id3 tags, files in zip
 archives (which seem to not handle filename encoding at all), etc.

 When I first tried Bacula (version 1.38), I expected to have trouble
 with filenames, since this is what I'm used to. I was rather pleased to
 find out that it could both back up and restore files, regardless of
 origin and destination filename encoding.

 I like Bacula because, among other things, it can take the punishment
 and chug along, without me even noticing that there was supposed to be a
 problem (a recent example: backup/restore files with a negative mtime ...)

 My 2c
 Avi


 --
 Join us December 9, 2009 for the Red Hat Virtual Experience,
 a free event focused on virtualization and cloud computing.
 Attend in-depth sessions from your desk. Your couch. Anywhere.
 http://p.sf.net/sfu/redhat-sfdev2dev
 ___
 Bacula-users mailing list
 bacula-us...@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/bacula-users



Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4

2009-12-03 Thread Frank Sweetser

On 12/3/2009 3:33 AM, Craig Ringer wrote:

Kern Sibbald wrote:

Hello,

Thanks for all the answers; I am a bit overwhelmed by the number, so I am
going to try to answer everyone in one email.

The first thing to understand is that it is *impossible* to know what the
encoding is on the client machine (FD -- or File daemon).  On say a


Or, even worse, which encoding the user or application was thinking of when it 
wrote a particular filename out.  There's no guarantee that any two files on a system
were intended to be looked at with the same encoding.



Unix/Linux system, the user could create filenames with non-UTF-8 then switch
to UTF-8, or restore files that were tarred on Windows or on Mac, or simply
copy a Mac directory.  Finally, using system calls to create a file, you can
put *any* character into a filename.


While true in theory, in practice it's pretty unusual to have filenames
encoded with an encoding other than the system LC_CTYPE on a modern
UNIX/Linux/BSD machine.


Unless, of course, you're at a good sized school with lots of international 
students, and have fileservers holding filenames created on desktops running 
in Chinese, Turkish, Russian, and other locales.


In the end, a filename is (under Linux, at least) just a string of arbitrary
bytes containing anything except / and NUL.  If Bacula tries to get too
clever, and munges or misinterprets those byte strings - or, worse yet, if
the database does it behind your back - then stuff _will_ end up breaking.
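
A short Python sketch of the "arbitrary bytes" point. The helper names are hypothetical; the round trip relies on Python's "surrogateescape" error handler, which carries undecodable bytes through a text string unchanged so that nothing is munged along the way.

```python
def is_legal_linux_name(raw):
    """Rough check for one path component: non-empty, no '/' and no NUL byte.

    The reserved names "." and ".." are not treated specially here.
    """
    return len(raw) > 0 and b"/" not in raw and b"\x00" not in raw


def name_to_text(raw):
    """Decode as UTF-8, smuggling invalid bytes through as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_name(text):
    """Inverse of name_to_text: recover exactly the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")
```

A name such as b"caf\xe9\xff.txt" is legal on Linux although it is not valid UTF-8, and text_to_name(name_to_text(raw)) returns the same bytes. A strict decode of the same name would raise an error instead.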


(A few years back, someone heavily involved in Linux kernel filesystem work
was talking about this exact issue, and made the remark that many doing
internationalization work secretly feel it would be easier to just teach
everyone English.  Impossible as this may be, I have since come to understand
what they were talking about...)


--
Frank Sweetser fs at wpi.edu  |  For every problem, there is a solution that
WPI Senior Network Engineer   |  is simple, elegant, and wrong. - HL Mencken
 GPG fingerprint = 6174 1257 129E 0D21 D8D4  E8A3 8E39 29E3 E2E8 8CEC



Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4

2009-12-03 Thread Alvaro Herrera
Craig Ringer wrote:
 Frank Sweetser wrote:
 
  Unless, of course, you're at a good sized school with lots of
  international students, and have fileservers holding filenames created
  on desktops running in Chinese, Turkish, Russian, and other locales.
 
 What I struggle with here is why they're not using ru_RU.UTF-8,
 zh_CN.UTF-8, etc. as their locales. Why mix charsets?

On my own desktop computer, I switched from Latin1 to UTF8 some two
years ago, and I still have a mixture of file name encodings.

-- 
Alvaro Herrera                        http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4

2009-12-02 Thread Craig Ringer

On 3/12/2009 11:09 AM, Jerome Alet wrote:

On Thu, Dec 03, 2009 at 10:54:07AM +0800, Craig Ringer wrote:


Anyway, it'd be nice if Bacula would convert file names to utf-8 at the
file daemon, using the encoding of the client, for storage in a utf-8
database.


+1 for me.

this is the way to go.

I understand people with an existing backup history won't be very happy
with this unless you provide them the appropriate tools or instructions
to convert their database's content, though.


I just noticed, while reading src/cats/create_postgresql_database:

# use SQL_ASCII to be able to put any filename into
#  the database even those created with unusual character sets
ENCODING="ENCODING 'SQL_ASCII'"

# use UTF8 if you are using standard Unix/Linux LANG specifications
#  that use UTF8 -- this is normally the default and *should* be
#  your standard.  Bacula works correctly *only* with correct UTF8.
#
#  Note, with this encoding, if you have any weird filenames on
#  your system (names generated from Win32 or Mac OS), you may
#  get Bacula batch insert failures.
#
#ENCODING="ENCODING 'UTF8'"



... so it's defaulting to SQL_ASCII, but actually supports utf-8 if your 
systems are all in a utf-8 locale. Assuming there's some way for the 
filed to find out the encoding of the director's database, it probably 
wouldn't be too tricky to convert non-matching file names to the 
director's encoding in the fd (when the director's encoding isn't 
SQL_ASCII, of course).
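
Roughly what I have in mind, as a Python sketch and not as actual filed code. The function name and the small encoding table are made up for illustration, and it assumes the fd already knows both its own encoding and the director's database encoding (written here with PostgreSQL's encoding names).

```python
# Hypothetical, deliberately partial map from PostgreSQL encoding names to
# Python codec names.
PG_TO_PYTHON = {"UTF8": "utf-8", "LATIN1": "latin-1"}


def to_director_encoding(raw_name, client_encoding, director_encoding):
    """Convert a file name from the client's encoding to the director's."""
    # SQL_ASCII means "no conversion": the database stores the bytes as-is.
    if director_encoding == "SQL_ASCII":
        return raw_name
    # Matching encodings need no conversion either.
    if client_encoding == director_encoding:
        return raw_name
    source = PG_TO_PYTHON[client_encoding]
    target = PG_TO_PYTHON[director_encoding]
    # Raises UnicodeError if the name cannot be converted.
    return raw_name.decode(source).encode(target)
```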


This also makes me wonder how filenames on Mac OS X and Windows are 
handled. I didn't see any use of the unicode-form APIs or any UTF-16 to 
UTF-8 conversion in an admittedly _very_ quick glance at the filed/ 
sources. How does bacula handle file names on those platforms? Read them 
with the non-unicode APIs and hope they fit into the current non-unicode 
encoding? Or am I missing something?


--
Craig Ringer



Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4

2009-12-02 Thread Stephen Frost
* Craig Ringer (cr...@postnewspapers.com.au) wrote:
 ... so it's defaulting to SQL_ASCII, but actually supports utf-8 if your  
 systems are all in a utf-8 locale. Assuming there's some way for the  
 filed to find out the encoding of the director's database, it probably  
 wouldn't be too tricky to convert non-matching file names to the  
 director's encoding in the fd (when the director's encoding isn't  
 SQL_ASCII, of course).

I'm not sure which piece of bacula connects to PostgreSQL, but whatever
it is, it could just send a 'set client_encoding' to the PG backend and
all the conversion will be done by PG..

 This also makes me wonder how filenames on Mac OS X and Windows are  
 handled. I didn't see any use of the unicode-form APIs or any UTF-16 to  
 UTF-8 conversion in an admittedly _very_ quick glance at the filed/  
 sources. How does bacula handle file names on those platforms? Read them  
 with the non-unicode APIs and hope they fit into the current non-unicode  
 encoding? Or am I missing something?

Good question.

Stephen




Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4

2009-12-02 Thread Jerome Alet
On Thu, Dec 03, 2009 at 10:54:07AM +0800, Craig Ringer wrote:

 Anyway, it'd be nice if Bacula would convert file names to utf-8 at the
 file daemon, using the encoding of the client, for storage in a utf-8
 database.

+1 for me.

this is the way to go.

I understand people with an existing backup history won't be very happy
with this unless you provide them the appropriate tools or instructions
to convert their database's content, though.

bye

--
Jérôme Alet - jerome.a...@univ-nc.nc - Centre de Ressources Informatiques
  Université de la Nouvelle-Calédonie - BPR4 - 98851 NOUMEA CEDEX
   Tél : +687 266754  Fax : +687 254829



Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4

2009-12-02 Thread Jose Ildefonso Camargo Tolosa
Hi!

On Thu, Dec 3, 2009 at 10:39 PM, Jerome Alet jerome.a...@univ-nc.nc wrote:
 On Thu, Dec 03, 2009 at 10:54:07AM +0800, Craig Ringer wrote:

 Anyway, it'd be nice if Bacula would convert file names to utf-8 at the
 file daemon, using the encoding of the client, for storage in a utf-8
 database.

 +1 for me.

+1 here: I have, in fact, had problems when restoring to a server with
a different code page from the original one.


 this is the way to go.

 I understand people with an existing backup history won't be very happy
 with this unless you provide them the appropriate tools or instructions
 to convert their database's content, though.

 bye

 --
 Jérôme Alet - jerome.a...@univ-nc.nc - Centre de Ressources Informatiques
      Université de la Nouvelle-Calédonie - BPR4 - 98851 NOUMEA CEDEX
   Tél : +687 266754                                  Fax : +687 254829





Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4

2009-12-02 Thread Craig Ringer
Stephen Frost wrote:
 * Craig Ringer (cr...@postnewspapers.com.au) wrote:
 ... so it's defaulting to SQL_ASCII, but actually supports utf-8 if your  
 systems are all in a utf-8 locale. Assuming there's some way for the  
 filed to find out the encoding of the director's database, it probably  
 wouldn't be too tricky to convert non-matching file names to the  
 director's encoding in the fd (when the director's encoding isn't  
 SQL_ASCII, of course).
 
 I'm not sure which piece of bacula connects to PostgreSQL, but whatever
 it is, it could just send a 'set client_encoding' to the PG backend and
 all the conversion will be done by PG.

The director is responsible for managing all the metadata, and it's the
component that connects to Pg.

If the fd sent the system charset along with the bundle of filenames etc
that it sends to the director, then I don't see why the director
couldn't `SET client_encoding' appropriately before inserting data from
that fd, then `RESET client_encoding' once the batch insert was done.
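
As a sketch of the statement sequence (Python, with no database connection made): the function only builds the list of statements the director would send. The function name, the table name filename_batch and the whitelist are hypothetical; the %s placeholder follows the usual Python DB-API style, and the encoding name is checked against a whitelist because it is pasted into the SET statement as text.

```python
# Hypothetical whitelist of PostgreSQL encoding names a client may report.
ALLOWED_ENCODINGS = {"UTF8", "LATIN1", "WIN1255"}


def batch_insert_plan(client_encoding, file_names):
    """Return (sql, params) pairs for one fd's batch of file names."""
    if client_encoding not in ALLOWED_ENCODINGS:
        raise ValueError("unexpected client encoding: %r" % (client_encoding,))
    plan = [(f"SET client_encoding TO '{client_encoding}'", None)]
    for name in file_names:
        # filename_batch is a made-up table name for this sketch.
        plan.append(("INSERT INTO filename_batch (name) VALUES (%s)", (name,)))
    plan.append(("RESET client_encoding", None))
    return plan
```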

The only downside is that if even one file has invalidly encoded data,
the whole batch insert fails and is rolled back. For that reason, I'd
personally prefer that the fd handle conversion so that it can exclude
such files (with a loud complaint in the error log) or munge the file
name into something that _can_ be stored.
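
One possible way to munge, sketched in Python under the assumption that the database is UTF-8. The function and logger names are hypothetical. Invalid bytes are replaced with visible \xNN escapes and the event is logged; note that this is lossy, because a name that already contains the literal text \xe9 can no longer be told apart from a munged one.

```python
import logging

log = logging.getLogger("fd-sketch")  # hypothetical logger name


def storable_name(raw):
    """Return text that a UTF-8 database will accept for this file name."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Each invalid byte becomes a literal backslash escape such as \xe9.
        munged = raw.decode("utf-8", errors="backslashreplace")
        log.warning("invalid UTF-8 in file name %r, storing as %r", raw, munged)
        return munged
```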

Come to think of it, if the fd and database are both on a utf-8
encoding, the fd should *still* validate the utf-8 filenames it reads.
There's no guarantee that just because the system thinks the filename
should be utf-8, it's actually valid utf-8, and it'd be good to catch
this at the fd rather than messing up the batch insert by the director,
thus making it much safer than it presently is to use Bacula with a
utf-8 database.
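
The validation step could be as simple as the following Python sketch (the function name is made up): the fd splits its list of names into those that are valid UTF-8 and those that are not, before anything reaches the director.

```python
def split_valid_utf8(names):
    """Partition byte-string file names into (valid, invalid) UTF-8 lists."""
    valid, invalid = [], []
    for raw in names:
        try:
            raw.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            invalid.append(raw)
        else:
            valid.append(raw)
    return valid, invalid
```

Only the valid list would go into the batch insert. The invalid list can then be reported, skipped, or munged, so that one bad name no longer rolls back the whole batch.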

--
Craig Ringer
