Dirk:
> 1.) different encodings: This one should be solved with the encoding
> attribute, but while playing with this, I still had problems outputting
> characters that are allowed in one codepage but discouraged by the XML
> standard. See
> http://www.w3.org/TR/REC-xml/#charsets where some characters that are
> still allowed in the windows-1252 codepage are discouraged in XML,
> esp. most of the characters in the band [x80-x9f].

Please note that XML discourages or forbids certain Unicode codepoints,
not bytes in specific codepages. Specifically, windows-1252 does not
map any byte to a codepoint in the range [U+0080-U+009F]
(see http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx).
For example, byte 0x80 in windows-1252 maps to Unicode U+20AC (Euro sign).
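
To make the byte-versus-codepoint distinction concrete, here is a minimal
sketch using Perl's core Encode module. The helper name
cp1252_to_codepoint is made up for illustration and is not part of ssphys
or vss2svn. It only looks at a handful of bytes whose windows-1252
mappings are well defined.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from vss2svn): decode a single windows-1252
# byte and return the Unicode codepoint it maps to.
sub cp1252_to_codepoint {
    my ($byte) = @_;
    my $char = decode('cp1252', chr($byte));
    return ord($char);
}

# A few bytes from the 0x80-0x9F band with well-defined mappings.
for my $byte (0x80, 0x85, 0x92, 0x99, 0x9F) {
    printf "byte 0x%02X -> U+%04X\n", $byte, cp1252_to_codepoint($byte);
}
```

Each of these bytes decodes to a codepoint outside [U+0080-U+009F], so
once the data has been decoded from windows-1252, the discouraged range
in the XML spec should not come into play for them.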

> 2.) real garbage content: real garbage is hard to detect and therefore
> ssphys only filters so-called control characters, as determined by the
> "iscntrl" function. This decision is based upon the current locale. A
> few other characters are filtered by vss2svn after streaming in the XML
> output:
>
>  >    $gSysOut =~ s/\x00//g; # remove null bytes
>  >    $gSysOut =~ s/.\x08//g; # yes, I've seen VSS store backspaces in names!
>  >    # allow all characters in the windows-1252 codepage: see
>  >    # http://de.wikipedia.org/wiki/Windows-1252
>  >    $gSysOut =~ s/[\x00-\x09\x11\x12\x14-\x1F\x81\x8D\x8F\x90\x9D]/_/g; # just to be sure

I would hazard that removing just [\x00-\x09\x11\x12\x14-\x1F] should
be safe enough for any windows codepage.
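
For comparison, the XML 1.0 Char production permits tab (0x09), LF (0x0A)
and CR (0x0D) and forbids every other codepoint below 0x20. The class
quoted above differs slightly from that: it replaces tab and leaves 0x0B,
0x0C, 0x0E-0x10 and 0x13 untouched. Below is a rough sketch of a filter
that follows the Char production instead. The name scrub_for_xml is
hypothetical, and the sketch assumes the string has already been decoded
to characters; it is not what vss2svn currently does.

```perl
use strict;
use warnings;

# Hypothetical helper (not from vss2svn): replace every character that
# falls outside the XML 1.0 Char production with '_'.
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
#        | [#x10000-#x10FFFF]
# Assumes $text holds decoded characters, not raw codepage bytes.
sub scrub_for_xml {
    my ($text) = @_;
    $text =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/_/g;
    return $text;
}

# Backspace and vertical tab are replaced; tab and the Euro sign survive.
my $name = "file\x08name\x0B.txt\ttab \x{20AC}";
print scrub_for_xml($name) eq "file_name_.txt\ttab \x{20AC}" ? "ok\n" : "not ok\n";
```

This leaves the merely discouraged codepoints alone, since XML 1.0 still
allows them.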

Cheers,
--jonathan

