Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4
Craig Ringer wrote:
> Kern Sibbald wrote:
>> Hello,
>>
>> Thanks for all the answers; I am a bit overwhelmed by the number, so I am going to try to answer everyone in one email.
>>
>> The first thing to understand is that it is *impossible* to know what the encoding is on the client machine (FD -- or File daemon). On say a Unix/Linux system, the user could create filenames with non-UTF-8 then switch to UTF-8, or restore files that were tarred on Windows or on Mac, or simply copy a Mac directory. Finally, using system calls to create a file, you can put *any* character into a filename.
>
> While true in theory, in practice it's pretty unusual to have filenames encoded with an encoding other than the system LC_CTYPE on a modern UNIX/Linux/BSD machine.

In my case garbage filenames are all too common. It's the sad *reality* when you're mixing languages (Hebrew and English in my case) and operating systems. Garbage filenames are everywhere: directories and files shared between different operating systems and file systems, mail attachments, mp3 file names based on garbage id3 tags, files in zip archives (which seem to not handle filename encoding at all), etc.

When I first tried Bacula (version 1.38), I expected to have trouble with filenames, since this is what I'm used to. I was rather pleased to find out that it could both backup and restore files, regardless of origin and destination filename encoding.

I like Bacula because, among other things, it can take the punishment and chug along, without me even noticing that there was supposed to be a problem (a recent example: backup/restore files with a negative mtime ...)

My 2c
Avi

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4
> Craig Ringer wrote:
>> Kern Sibbald wrote:
>>> Hello,
>>>
>>> Thanks for all the answers; I am a bit overwhelmed by the number, so I am going to try to answer everyone in one email.
>>>
>>> The first thing to understand is that it is *impossible* to know what the encoding is on the client machine (FD -- or File daemon). On say a Unix/Linux system, the user could create filenames with non-UTF-8 then switch to UTF-8, or restore files that were tarred on Windows or on Mac, or simply copy a Mac directory. Finally, using system calls to create a file, you can put *any* character into a filename.
>>
>> While true in theory, in practice it's pretty unusual to have filenames encoded with an encoding other than the system LC_CTYPE on a modern UNIX/Linux/BSD machine.
>
> In my case garbage filenames are all too common. It's the sad *reality* when you're mixing languages (Hebrew and English in my case) and operating systems. Garbage filenames are everywhere: directories and files shared between different operating systems and file systems, mail attachments, mp3 file names based on garbage id3 tags, files in zip archives (which seem to not handle filename encoding at all), etc.

Yes, that is my experience too. I understand Craig's comments, but I would much prefer that Bacula just back up and restore and leave the checking of filename consistency to other programs. At least for the moment, that seems to work quite well. Obviously if users mix character sets, sometimes the display of filenames in Bacula will be weird, but nevertheless Bacula will back up and restore them so that what was on the system before the backup is what is restored.

> When I first tried Bacula (version 1.38), I expected to have trouble with filenames, since this is what I'm used to. I was rather pleased to find out that it could both backup and restore files, regardless of origin and destination filename encoding.
>
> I like Bacula because, among other things, it can take the punishment and chug along, without me even noticing that there was supposed to be a problem (a recent example: backup/restore files with a negative mtime ...)

Thanks. Thanks also for using Bacula :-)

Best regards,
Kern
Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4
Frank Sweetser wrote:
> Unless, of course, you're at a good sized school with lots of international students, and have fileservers holding filenames created on desktops running in Chinese, Turkish, Russian, and other locales.

What I struggle with here is why they're not using ru_RU.UTF-8, zh_CN.UTF-8, etc as their locales. Why mix charsets?

I don't think that these people should be forced to use a utf-8 database and encoding conversion if they want to do things like mix-and-match charsets for file name chaos on their machines, though. I'd just like to be able to back up systems that _do_ have consistent charsets in ways that permit me to later reliably search for files by name, restore to any host, etc.

Perhaps I'm strange in thinking that all this mix-and-match encodings stuff is bizarre and backward. The Mac OS X and Windows folks seem to agree, though. Let the file system store unicode data, and translate at the file system or libc layer for applications that insist on using other encodings.

I do take Greg Stark's point (a) though. As *nix systems stand, solutions will only ever be mostly-works, not always-works, which I agree isn't good enough. Since there's no sane agreement about encodings on *nix systems and everything is just byte strings that different apps can interpret in different ways under different environmental conditions, we may as well throw up our hands in disgust and give up trying to do anything sensible. The alternative is saying that files the file system considers legal can't be backed up because of file naming, which I do agree isn't ok.

The system shouldn't permit those files to exist, either, but I suspect we'll still have borked encoding-agnostic wackiness as long as we have *nix systems at all since nobody will ever agree on anything for long enough to change it.

Sigh. I think this is about the only time I've ever wished I was using Windows (or Mac OS X).

Also: Greg, your point (c) goes two ways. If I can't trust my backup software to restore my filenames from one host exactly correctly to another host that may have configuration differences not reflected in the backup metadata, a different OS revision, etc, then what good is it for disaster recovery? How do I even know what those byte strings *mean*?

Bacula doesn't even record the default system encoding with backup jobs so there's no way for even the end user to try to fix up the file names for a different encoding. You're faced with some byte strings in wtf-is-this-anyway encoding and guesswork. Even recording lc_ctype in the backup job metadata and offering the _option_ to convert encoding on restore would be a big step (though it wouldn't fix the breakage with searches by filename not matching due to encoding mismatches).

Personally, I'm just going to stick to a utf-8 only policy for all my hosts, working around the limitation that way. It's worked ok thus far, though I don't much like the way that different normalizations of unicode won't match equal under SQL_ASCII so I can't reliably search for file names.

--
Craig Ringer
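The normalization point at the end of the message above can be reproduced without any database. The following is only a small sketch in Python (the file name is made up for illustration): one visible name has two different UTF-8 byte sequences, so a comparison that looks only at the stored bytes treats them as different names, while normalizing both sides first makes them match.

```python
import unicodedata

# Hypothetical file name, written two ways that render identically.
composed = "caf\u00e9.txt"      # "é" as the single code point U+00E9 (NFC form)
decomposed = "cafe\u0301.txt"   # "e" followed by combining acute U+0301 (NFD form)


def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The UTF-8 byte strings differ, so a byte-wise lookup for one misses the other.
bytes_differ = composed.encode("utf-8") != decomposed.encode("utf-8")
```

Whether a catalog should normalize on insert or only at search time is a design choice this sketch does not settle; normalizing on insert would change the stored bytes, which conflicts with restoring names exactly as they were.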
Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4
On Thu, 3 Dec 2009 12:22:50 +0100 (CET) Kern Sibbald k...@sibbald.com wrote:
> Yes, that is my experience too. I understand Craig's comments, but I would much prefer that Bacula just back up and restore and leave the checking of filename consistency to other programs. At least for the moment, that seems to work quite well. Obviously if users mix character sets, sometimes the display of filenames in Bacula will be weird, but nevertheless Bacula will back up and restore them so that what was on the system before the backup is what is restored.

I expect backup software to have predictable, reversible behaviour and to warn me if I'm shooting myself in the foot. It should be the responsibility of the admin to restore files in a proper place, knowing that locales may be a problem. I think Bacula is taking the right approach.

Still, I'd surely appreciate as a feature a tool that will help me to restore files in a system with a different locale than the original one, or warn me if the locale is different or it can't be sure it is the same. That's exactly what PostgreSQL is doing: at least warning you. Even PostgreSQL is taking the right approach.

An additional guessed original locale field and a tool/option to convert/restore with a selected locale could be an interesting feature.

What is Bacula going to do with xattr on different systems?

PostgreSQL seems to offer a good choice of tools to convert between encodings and deal with bytea. Formally I'd prefer bytea, but in real use it may just be an additional pain, and other DBs may not offer the same tools for encoding/bytea conversions.

Is it possible to search for a file in a backup set? What is going to happen if I'm searching from a system that has a different locale from the one the backup was made on? Can I use regexp? Can accents be ignored during searches?

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
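The restore-time conversion option suggested above could look roughly like the following. This is a sketch only, not anything Bacula provides: the function name, the idea of a recorded encoding stored with the job, and the warning texts are all assumptions made for illustration. The name is converted only when the recorded source encoding is known and differs from the target; in every doubtful case the original bytes are kept and a warning is returned.

```python
import codecs


def convert_for_restore(raw_name: bytes, recorded_encoding, target_encoding: str):
    """Return (name_bytes, warning); warning is None when nothing needs attention.

    recorded_encoding is a hypothetical per-job metadata field; None means
    the encoding of the original system was never recorded.
    """
    if recorded_encoding is None:
        return raw_name, "source encoding unknown; name restored byte-for-byte"

    # Normalize aliases such as "UTF8" and "utf-8" to one canonical codec name.
    source = codecs.lookup(recorded_encoding).name
    target = codecs.lookup(target_encoding).name
    if source == target:
        return raw_name, None

    try:
        return raw_name.decode(source).encode(target), None
    except UnicodeError as exc:
        # Covers names that are invalid in the source encoding and
        # characters that the target encoding cannot represent.
        return raw_name, "could not convert from %s to %s (%s); name restored byte-for-byte" % (
            source, target, exc)
```

Returning the untouched bytes on failure keeps the restore reversible, which matches the expectation stated in the message that backup software should be predictable and should warn rather than silently alter names.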
Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4
On 12/03/2009 10:54 AM, Craig Ringer wrote:
> Frank Sweetser wrote:
>> Unless, of course, you're at a good sized school with lots of international students, and have fileservers holding filenames created on desktops running in Chinese, Turkish, Russian, and other locales.
>
> What I struggle with here is why they're not using ru_RU.UTF-8, zh_CN.UTF-8, etc as their locales. Why mix charsets?

The problem isn't so much what they're using on their unmanaged desktops. The problem is that the server, which is the one getting backed up, holds an aggregation of files created by an unknown collection of applications running on a mish-mash of operating systems (every large edu has its horror story of the 15+ year old, unpatched, mission critical machine that no one dares touch) with wildly varying charset configurations, no doubt including horribly broken and pre-UTF ones.

The end result is a fileset full of filenames created on a hacked Chinese copy of XP, a Russian copy of WinME, Romanian Red Hat 4.0, and Mac OS 8.

This kind of junk is, sadly, not uncommon in academic environments, where IT is often required to support stuff that they don't get to manage.

--
Frank Sweetser fs at wpi.edu | For every problem, there is a solution that
WPI Senior Network Engineer | is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4 E8A3 8E39 29E3 E2E8 8CEC
Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4
Hi Avi,

Please have a look at this link; it shows how to install Bacula with a MySQL database with Hebrew support.

Eitan

On Thu, Dec 3, 2009 at 12:35 PM, Avi Rozen avi.ro...@gmail.com wrote:
> Craig Ringer wrote:
>> Kern Sibbald wrote:
>>> Hello,
>>>
>>> Thanks for all the answers; I am a bit overwhelmed by the number, so I am going to try to answer everyone in one email.
>>>
>>> The first thing to understand is that it is *impossible* to know what the encoding is on the client machine (FD -- or File daemon). On say a Unix/Linux system, the user could create filenames with non-UTF-8 then switch to UTF-8, or restore files that were tarred on Windows or on Mac, or simply copy a Mac directory. Finally, using system calls to create a file, you can put *any* character into a filename.
>>
>> While true in theory, in practice it's pretty unusual to have filenames encoded with an encoding other than the system LC_CTYPE on a modern UNIX/Linux/BSD machine.
>
> In my case garbage filenames are all too common. It's the sad *reality* when you're mixing languages (Hebrew and English in my case) and operating systems. Garbage filenames are everywhere: directories and files shared between different operating systems and file systems, mail attachments, mp3 file names based on garbage id3 tags, files in zip archives (which seem to not handle filename encoding at all), etc.
>
> When I first tried Bacula (version 1.38), I expected to have trouble with filenames, since this is what I'm used to. I was rather pleased to find out that it could both backup and restore files, regardless of origin and destination filename encoding.
>
> I like Bacula because, among other things, it can take the punishment and chug along, without me even noticing that there was supposed to be a problem (a recent example: backup/restore files with a negative mtime ...)
>
> My 2c
> Avi

___
Bacula-users mailing list
bacula-us...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4
On 12/3/2009 3:33 AM, Craig Ringer wrote:
> Kern Sibbald wrote:
>> Hello,
>>
>> Thanks for all the answers; I am a bit overwhelmed by the number, so I am going to try to answer everyone in one email.
>>
>> The first thing to understand is that it is *impossible* to know what the encoding is on the client machine (FD -- or File daemon). On say a

Or, even worse, which encoding the user or application was thinking of when it wrote a particular file out. There's no guarantee that any two files on a system were intended to be looked at with the same encoding.

>> Unix/Linux system, the user could create filenames with non-UTF-8 then switch to UTF-8, or restore files that were tarred on Windows or on Mac, or simply copy a Mac directory. Finally, using system calls to create a file, you can put *any* character into a filename.
>
> While true in theory, in practice it's pretty unusual to have filenames encoded with an encoding other than the system LC_CTYPE on a modern UNIX/Linux/BSD machine.

Unless, of course, you're at a good sized school with lots of international students, and have fileservers holding filenames created on desktops running in Chinese, Turkish, Russian, and other locales.

In the end, a filename is (under Linux, at least) just a string of arbitrary bytes containing anything except / and NUL. If Bacula tries to get too clever, and munges or misinterprets those byte strings - or, worse yet, if the database does it behind your back - then stuff _will_ end up breaking.

(A few years back, someone heavily involved in Linux kernel filesystem work was talking about this exact issue, and made the remark that many doing internationalization work secretly feel it would be easier to just teach everyone English. Impossible as this may be, I have since come to understand what they were talking about...)

--
Frank Sweetser fs at wpi.edu | For every problem, there is a solution that
WPI Senior Network Engineer | is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4 E8A3 8E39 29E3 E2E8 8CEC
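The "just a string of arbitrary bytes" point can be illustrated with a short Python sketch. The helper names below are made up for this example. It checks the only rule stated above (no `/` and no NUL in a name) and shows that a name which is not valid UTF-8 can still be carried through a text-based layer and recovered byte-for-byte, provided nothing in between tries to "correct" it.

```python
def is_legal_name(name: bytes) -> bool:
    """A non-empty name may hold any bytes except b'/' and the NUL byte."""
    return len(name) > 0 and b"/" not in name and b"\x00" not in name


def to_text_lossless(name: bytes) -> str:
    """Decode as UTF-8, mapping undecodable bytes to lone surrogates."""
    return name.decode("utf-8", errors="surrogateescape")


def from_text_lossless(text: str) -> bytes:
    """Inverse of to_text_lossless: restores the original bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")


# The same visible name as written by a Latin-1 system and by a UTF-8 system.
latin1_name = "caf\u00e9.txt".encode("latin-1")   # b'caf\xe9.txt', invalid as UTF-8
utf8_name = "caf\u00e9.txt".encode("utf-8")       # b'caf\xc3\xa9.txt'

# Both are legal names, they are different byte strings, and the
# Latin-1 one survives the round trip unchanged.
round_trip = from_text_lossless(to_text_lossless(latin1_name))
```

The `surrogateescape` error handler is a standard Python feature; using it here is an illustration of lossless handling, not a description of how Bacula stores names.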
Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4
Craig Ringer wrote:
> Frank Sweetser wrote:
>> Unless, of course, you're at a good sized school with lots of international students, and have fileservers holding filenames created on desktops running in Chinese, Turkish, Russian, and other locales.
>
> What I struggle with here is why they're not using ru_RU.UTF-8, zh_CN.UTF-8, etc as their locales. Why mix charsets?

On my own desktop computer, I switched from Latin1 to UTF8 some two years ago, and I still have a mixture of file name encodings.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4
On 3/12/2009 11:09 AM, Jerome Alet wrote:
> On Thu, Dec 03, 2009 at 10:54:07AM +0800, Craig Ringer wrote:
>> Anyway, it'd be nice if Bacula would convert file names to utf-8 at the file daemon, using the encoding of the client, for storage in a utf-8 database.
>
> +1 for me.
>
> this is the way to go.
>
> I understand people with an existing backup history won't be very happy with this unless you provide them the appropriate tools or instructions to convert their database's content, though.

I just noticed, while reading src/cats/create_postgresql_database:

    # use SQL_ASCII to be able to put any filename into
    # the database even those created with unusual character sets
    ENCODING="ENCODING 'SQL_ASCII'"
    # use UTF8 if you are using standard Unix/Linux LANG specifications
    # that use UTF8 -- this is normally the default and *should* be
    # your standard. Bacula works correctly *only* with correct UTF8.
    #
    # Note, with this encoding, if you have any weird filenames on
    # your system (names generated from Win32 or Mac OS), you may
    # get Bacula batch insert failures.
    #
    #ENCODING="ENCODING 'UTF8'"

... so it's defaulting to SQL_ASCII, but actually supports utf-8 if your systems are all in a utf-8 locale. Assuming there's some way for the filed to find out the encoding of the director's database, it probably wouldn't be too tricky to convert non-matching file names to the director's encoding in the fd (when the director's encoding isn't SQL_ASCII, of course).

This also makes me wonder how filenames on Mac OS X and Windows are handled. I didn't see any use of the unicode-form APIs or any UTF-16 to UTF-8 conversion in an admittedly _very_ quick glance at the filed/ sources. How does bacula handle file names on those platforms? Read them with the non-unicode APIs and hope they fit into the current non-unicode encoding? Or am I missing something?

--
Craig Ringer
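The conversion proposed in this message (file names converted to UTF-8 on the client, using the client's own encoding) might be sketched as follows. This is an illustration in Python, not Bacula code: the function name and the idea that the client encoding is passed in as a plain string are assumptions. Names that cannot be decoded in the declared encoding are set aside rather than converted.

```python
def names_to_utf8(raw_names, client_encoding: str):
    """Convert raw file names from client_encoding to UTF-8.

    Returns (converted, rejected): converted is a list of
    (original_bytes, utf8_bytes) pairs, rejected is a list of names
    that are not valid in client_encoding.
    """
    converted = []
    rejected = []
    for raw in raw_names:
        try:
            converted.append((raw, raw.decode(client_encoding).encode("utf-8")))
        except UnicodeDecodeError:
            rejected.append(raw)
    return converted, rejected
```

Keeping the original bytes next to the converted name is deliberate: the catalog can then be searched as UTF-8 while the restore still has the exact name that was on disk. One caveat worth stating: single-byte encodings such as Latin-1 accept every possible byte, so a wrongly declared client encoding produces wrong characters without any error being raised.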
Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4
* Craig Ringer (cr...@postnewspapers.com.au) wrote:
> ... so it's defaulting to SQL_ASCII, but actually supports utf-8 if your systems are all in a utf-8 locale. Assuming there's some way for the filed to find out the encoding of the director's database, it probably wouldn't be too tricky to convert non-matching file names to the director's encoding in the fd (when the director's encoding isn't SQL_ASCII, of course).

I'm not sure which piece of bacula connects to PostgreSQL, but whatever it is, it could just send a 'set client_encoding' to the PG backend and all the conversion will be done by PG.

> This also makes me wonder how filenames on Mac OS X and Windows are handled. I didn't see any use of the unicode-form APIs or any UTF-16 to UTF-8 conversion in an admittedly _very_ quick glance at the filed/ sources. How does bacula handle file names on those platforms? Read them with the non-unicode APIs and hope they fit into the current non-unicode encoding? Or am I missing something?

Good question.

Stephen
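The 'set client_encoding' suggestion amounts to wrapping each batch of inserts in two extra statements. The Python sketch below only builds the statement list and does not talk to a database; the helper name and the example INSERT (including its table name) are hypothetical. It also assumes the database itself uses a real encoding such as UTF8, since PostgreSQL performs no conversion when the server encoding is SQL_ASCII.

```python
import re

# PostgreSQL encoding names such as LATIN1, WIN1251 or EUC_JP contain only
# letters, digits and underscores; anything else is refused here.
_ENCODING_NAME = re.compile(r"[A-Za-z0-9_]+")


def wrap_batch(client_encoding: str, batch_sql):
    """Return the statements to send for one batch from one client."""
    if not _ENCODING_NAME.fullmatch(client_encoding):
        raise ValueError("suspicious encoding name: %r" % client_encoding)
    statements = ["SET client_encoding TO '%s'" % client_encoding]
    statements.extend(batch_sql)
    statements.append("RESET client_encoding")
    return statements


# Hypothetical batch; the table and column names are placeholders.
example = wrap_batch("LATIN1", ["INSERT INTO batch (name) VALUES ('example.txt')"])
```

In a real client the RESET would need to run even when an insert fails, for example from a `finally` block, so that a failed batch does not leave the connection in the wrong encoding.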
Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4
On Thu, Dec 03, 2009 at 10:54:07AM +0800, Craig Ringer wrote:
> Anyway, it'd be nice if Bacula would convert file names to utf-8 at the file daemon, using the encoding of the client, for storage in a utf-8 database.

+1 for me.

this is the way to go.

I understand people with an existing backup history won't be very happy with this unless you provide them the appropriate tools or instructions to convert their database's content, though.

bye

--
Jérôme Alet - jerome.a...@univ-nc.nc - Centre de Ressources Informatiques
Université de la Nouvelle-Calédonie - BPR4 - 98851 NOUMEA CEDEX
Tél : +687 266754 Fax : +687 254829
Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4
Hi!

On Thu, Dec 3, 2009 at 10:39 PM, Jerome Alet jerome.a...@univ-nc.nc wrote:
> On Thu, Dec 03, 2009 at 10:54:07AM +0800, Craig Ringer wrote:
>> Anyway, it'd be nice if Bacula would convert file names to utf-8 at the file daemon, using the encoding of the client, for storage in a utf-8 database.
>
> +1 for me.

+1 here: there are, in fact, problems when restoring to a server with a different code page from the original one.

> this is the way to go.
>
> I understand people with an existing backup history won't be very happy with this unless you provide them the appropriate tools or instructions to convert their database's content, though.
>
> bye
>
> --
> Jérôme Alet - jerome.a...@univ-nc.nc - Centre de Ressources Informatiques
> Université de la Nouvelle-Calédonie - BPR4 - 98851 NOUMEA CEDEX
> Tél : +687 266754 Fax : +687 254829
Re: [Bacula-users] [GENERAL] Catastrophic changes to PostgreSQL 8.4
Stephen Frost wrote:
> * Craig Ringer (cr...@postnewspapers.com.au) wrote:
>> ... so it's defaulting to SQL_ASCII, but actually supports utf-8 if your systems are all in a utf-8 locale. Assuming there's some way for the filed to find out the encoding of the director's database, it probably wouldn't be too tricky to convert non-matching file names to the director's encoding in the fd (when the director's encoding isn't SQL_ASCII, of course).
>
> I'm not sure which piece of bacula connects to PostgreSQL, but whatever it is, it could just send a 'set client_encoding' to the PG backend and all the conversion will be done by PG.

The director is responsible for managing all the metadata, and it's the component that connects to Pg. If the fd sent the system charset along with the bundle of filenames etc that it sends to the director, then I don't see why the director couldn't `SET client_encoding' appropriately before inserting data from that fd, then `RESET client_encoding' once the batch insert was done.

The only downside is that if even one file has invalidly encoded data, the whole batch insert fails and is rolled back. For that reason, I'd personally prefer that the fd handle conversion so that it can exclude such files (with a loud complaint in the error log) or munge the file name into something that _can_ be stored.

Come to think of it, if the fd and database are both on a utf-8 encoding, the fd should *still* validate the utf-8 filenames it reads. There's no guarantee that just because the system thinks the filename should be utf-8, it's actually valid utf-8, and it'd be good to catch this at the fd rather than messing up the batch insert by the director, thus making it much safer than it presently is to use Bacula with a utf-8 database.

--
Craig Ringer
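The validate-then-munge idea in this message can be sketched in a few lines of Python. None of this is Bacula code; the function names and the `%XX` escape scheme are assumptions chosen for illustration. Each name is checked for valid UTF-8 before it would be sent on, and an invalid name is replaced by a storable form with the offending bytes escaped, together with a complaint for the log, so that one bad name no longer sinks the whole batch.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """True if raw decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def munge_name(raw: bytes) -> str:
    """Return text that is always storable; undecodable bytes become %XX."""
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # surrogateescape maps an invalid byte 0xNN to U+DCNN.
            out.append("%%%02X" % (cp - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)


def prepare_for_catalog(raw_names):
    """Return (names_to_store, complaints) for one batch of raw names."""
    stored = []
    complaints = []
    for raw in raw_names:
        if is_valid_utf8(raw):
            stored.append(raw.decode("utf-8"))
        else:
            munged = munge_name(raw)
            complaints.append("invalid UTF-8 in %r, stored as %r" % (raw, munged))
            stored.append(munged)
    return stored, complaints
```

The escape scheme here is one-way and can collide with a valid name that already contains the same `%XX` text, so a real implementation would also need to keep the original bytes somewhere for the restore.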