Dirk:
> 1.) different encodings: This one should be solved with the encoding
> attribute, but while playing with this, I still had problems outputting
> characters that are allowed in one codepage but discouraged by the XML
> standard. See
> http://www.w3.org/TR/REC-xml/#charsets where some characters that are
> still allowed in the windows-1252 codepage are discouraged in XML,
> esp. most of the characters in the band [x80-x9f].

Please note that XML discourages or forbids some Unicode codepoints, not bytes in specific codepages. Specifically, windows-1252 does not map any byte to a codepoint in the range [0x80-0x9F] (see http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx). For example, 0x80 in windows-1252 maps to Unicode 0x20AC (Euro sign).

> 2.) real garbaged content: real garbage is hard to detect and therefore
> ssphys does only filter so called control characters determined by the
> "iscntrl" function. This decision is based upon the current locale. A
> few other characters are filtered by vss2svn after streaming in the XML
> output
>
> > $gSysOut =~ s/\x00//g; # remove null bytes
> > $gSysOut =~ s/.\x08//g; # yes, I've seen VSS store backspaces in names!
> > # allow all characters in the windows-1252 codepage: see
> > # http://de.wikipedia.org/wiki/Windows-1252
> > $gSysOut =~ s/[\x00-\x09\x11\x12\x14-\x1F\x81\x8D\x8F\x90\x9D]/_/g; # just to be sure

I would hazard that removing just [\x00-\x09\x11\x12\x14-\x1F] should be safe enough for any windows codepage.

Cheers,
--jonathan
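
To illustrate the distinction between bytes and codepoints made above, here is a minimal sketch in Python. It is not code from vss2svn or ssphys, and the names decode_cp1252, c1_collisions and sanitize_for_xml are hypothetical. It assumes Python's built-in "cp1252" codec as the byte-to-codepoint table and the XML 1.0 Char production as the definition of a forbidden character. Codepoints that XML merely discourages, such as U+0080 to U+009F, are left alone.

```python
import re

# XML 1.0 "Char" production: #x9 | #xA | #xD | [#x20-#xD7FF]
# | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. Anything else is forbidden.
XML10_FORBIDDEN = re.compile(
    "[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def decode_cp1252(raw: bytes) -> str:
    """Decode windows-1252 bytes; bytes the codec leaves undefined become U+FFFD."""
    return raw.decode("cp1252", errors="replace")


def c1_collisions() -> list:
    """Return the bytes in 0x80-0x9F that cp1252 maps into U+0080-U+009F."""
    hits = []
    for byte in range(0x80, 0xA0):
        try:
            char = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # byte is undefined in the codec's table (e.g. 0x81)
        if 0x80 <= ord(char) <= 0x9F:
            hits.append(byte)
    return hits


def sanitize_for_xml(text: str, replacement: str = "_") -> str:
    """Replace every codepoint that XML 1.0 forbids with the replacement string."""
    return XML10_FORBIDDEN.sub(replacement, text)


if __name__ == "__main__":
    # Euro sign, a null byte and a backspace, as they might arrive from VSS.
    text = decode_cp1252(b"5 \x80\x00 file\x08")
    print(sanitize_for_xml(text))
    print(c1_collisions())
```

The point of decoding first is that the filter then operates on codepoints, so the Euro sign at byte 0x80 survives while the null byte and backspace are replaced, regardless of which windows codepage the bytes came from.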