Re: [basex-talk] how to pass raw bytes intact?
As Liam indicated (thanks!), XQuery may not be the best choice for processing data at the byte level: XQuery was built to work with Unicode characters as its basic unit, which means that it will never be possible to create illegal UTF-8 sequences with pure XQuery. It also means that the language provides no support for "repairing" invalid input.

I wonder if you have enough control over your input to avoid the UTF-8 shattering?

If there's no choice, and if you still want to try XQuery/BaseX for byte processing, you can play around with the functions of the Conversion Module:

http://docs.basex.org/wiki/Conversion_Module
___
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
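To illustrate why the damage cannot be undone once the input has been decoded into characters, here is a minimal Python sketch (not BaseX code). It assumes the injected markup is a `<wbr/>` tag, as the `//*:wbr` query elsewhere in this thread suggests, and that it lands after the first of the three UTF-8 bytes of 你. The function name `decode_lossy` is made up for the example.

```python
# Assumption: the injected tag is "<wbr/>", sitting after the first of the
# three UTF-8 bytes (E4 BD A0) of U+4F60.
shattered = b"\xe4<wbr/>\xbd\xa0\xe5\xa5\xbd"


def decode_lossy(raw: bytes) -> str:
    """Decode as UTF-8, replacing each undecodable byte with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


text = decode_lossy(shattered)

# Show the code points that survive decoding.
print([hex(ord(ch)) for ch in text])
```

With Python's decoder, each of the three orphaned bytes becomes its own U+FFFD, which is in line with the three =EF=BF=BD sequences in the quoted-printable output reported in this thread. After that step the original byte values are gone, so no character-level replace can bring 你 back.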
Re: [basex-talk] how to pass raw bytes intact?
> "LREQ" == Liam R E Quin writes:

LREQ> Treating the individual UTF-8 octets individually?

Yes.

LREQ> Not in standard XQuery, but that doesn't preclude a BaseX extension...

Well no big deal, I was just curious.

>> I was just curious if there was a way in basex if I could do s!!!g
>> like I can do in perl, to restore the damaged UTF-8 characters.

LREQ> Note that "damaged UTF-8 characters", if by that you mean not
LREQ> well-formed UTF-8, aren't going to come through email reliably, so I
LREQ> might not be seeing what you wrote - s!!!g can be done with

Don't worry. I wouldn't put any illegal chars into mail.

LREQ> replace() but getting at UTF-8-encoded characters one octet at a time is
LREQ> another matter. But, my goal in replying was to tease out enough
LREQ> information from you that someone else could answer :-)

>> http://www.couchsurfing.org/group_read.html?gid=430&post=13998575

LREQ> This says, "this thread has been deleted" at me.

In fact they deleted the entire group it turns out.

Anyway here's what I posted there

#!/usr/bin/perl
# Shows line where we remove couchsurfing.org's UTF-8 shattering effects.
# Must run this before the browser gets its hands on it and turns the
# shattered UTF-8 into U+FFFD REPLACEMENT CHARACTER.
# So that seems to count out greasemonkey, etc. solutions.
# I used wwwoffle -o URL|./this_program after first browsing the page logged in
# in a browser that used wwwoffle as a proxy
# Copyright : http://www.fsf.org/copyleft/gpl.html
# Author : Dan Jacobson -- http://jidanni.org/
# Created On : 12/31/2012
# Last Modified On: Mon Dec 31 13:12:57 2012
# Update Count: 27
use strict;
use warnings FATAL => 'all';
my $N = qr/[^[:ascii:]]/;
while (<>) {
    my $original_line = $_;
    ## needed on e.g., http://www.couchsurfing.org/couchmanager?read=18541584
    s!!!g;
    ## needed on e.g.,
    ## http://www.couchsurfing.org/couchrequest/show_couchoffer_form?city_couchrequest=1223052
    s!($N) ($N)!$1$2!g;
    s!\t\s+!! && chomp;
    m!^\s+... \(more\) ! && next;
    s!\s* !!;
    print "$.: $_" if $_ ne $original_line;
}
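For readers who do not use Perl, the idea behind the first substitution in the script above can be sketched at the byte level in Python. This is an illustration, not a port: it assumes the tag being removed is `<wbr/>` (suggested by the `//*:wbr` query in this thread), and it deliberately removes the tag only where it is followed by a UTF-8 continuation byte, which is a narrower rule than a global substitution. The names `SHATTER` and `repair` are invented for the example.

```python
import re

# Assumption: the injected tag is "<wbr/>". It is removed only when the next
# byte is a UTF-8 continuation byte (0x80-0xBF), i.e. when the tag has split
# a multibyte character. Tags between complete characters are left alone.
SHATTER = re.compile(rb"<wbr/>(?=[\x80-\xbf])")


def repair(raw: bytes) -> bytes:
    """Return raw with character-splitting tags removed."""
    return SHATTER.sub(b"", raw)


damaged = b"\xe4<wbr/>\xbd\xa0\xe5\xa5\xbd"
fixed = repair(damaged)

# Strict decoding raises UnicodeDecodeError if any byte is still orphaned.
print(fixed.decode("utf-8"))
```

As with the Perl script, this has to run on the raw bytes before anything decodes them; once a browser or parser has substituted U+FFFD there is nothing left to repair.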
Re: [basex-talk] how to pass raw bytes intact?
On Tue, 2013-01-01 at 11:47 +0800, jida...@jidanni.org wrote:

> Not exactly after it. 1/3 of the way through it. I.e., shattered UTF-8.

Treating the individual UTF-8 octets individually?

Not in standard XQuery, but that doesn't preclude a BaseX extension...

> I was just curious if there was a way in basex if I could do s!!!g
> like I can do in perl, to restore the damaged UTF-8 characters.

Note that "damaged UTF-8 characters", if by that you mean not well-formed UTF-8, aren't going to come through email reliably, so I might not be seeing what you wrote - s!!!g can be done with replace() but getting at UTF-8-encoded characters one octet at a time is another matter. But, my goal in replying was to tease out enough information from you that someone else could answer :-)

It's probably best not to assume that people on an XQuery-list would be familiar with Unicode handling in other languages, such as Perl, by the way, although some of us are :-)

> http://www.couchsurfing.org/group_read.html?gid=430&post=13998575

This says, "this thread has been deleted" at me.

Best,

Liam

--
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org freenode/#xml
Re: [basex-talk] how to pass raw bytes intact?
LREQ> Your perl substitution is putting after the first non-ascii
LREQ> character on the line, and 你 is for sure not an ascii character,
LREQ> so you get after it.

Not exactly after it. 1/3 of the way through it. I.e., shattered UTF-8.

I was just curious if there was a way in basex if I could do s!!!g
like I can do in perl, to restore the damaged UTF-8 characters.

http://www.couchsurfing.org/group_read.html?gid=430&post=13998575
Re: [basex-talk] how to pass raw bytes intact?
On Tue, 2013-01-01 at 10:52 +0800, jida...@jidanni.org wrote:

> I'm just trying to find a way to remove the injected here,
> $ echo '你好'|perl -pwle 's![^[:ascii:]]!$&!'|qprint -e
> =E4=BD=A0=E5=A5=BD

I don't have a qprint command on my system, so I'm not sure what's going on for you here.

Your perl substitution is putting after the first non-ascii character on the line, and 你 is for sure not an ascii character, so you get after it.

Are you trying to do MIME octet-level encoding of UTF-8 here?

Liam

--
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org freenode/#xml
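For anyone else without a qprint command, the quoted-printable view of the bytes can be approximated with Python's standard `quopri` module. This is only a sketch of what `qprint -e` appears to be used for here, namely making the individual UTF-8 octets visible as =XX; the helper name `qp` and the `<wbr/>` tag in the second call are assumptions made for the example.

```python
import quopri


def qp(raw: bytes) -> str:
    """Quoted-printable encode raw bytes so every non-ASCII octet shows as =XX."""
    return quopri.encodestring(raw).decode("ascii")


# The intact string: two characters, three octets each.
intact = qp("你好".encode("utf-8"))
print(intact)

# Assumption: a "<wbr/>" tag injected after the first octet of the first character.
split = qp(b"\xe4<wbr/>\xbd\xa0\xe5\xa5\xbd")
print(split)
```

The second line of output shows the tag sitting between the first and second octet of 你, which is the "1/3 of the way through it" situation described in this thread.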
Re: [basex-talk] how to pass raw bytes intact?
> "CG" == Christian Grün writes:

CG> Jidanni,
>> echo '你好'|perl -pwle 's![^[:ascii:]]!$&!'|basex -q '
>> declare option db:parser "html";
>> declare option output:method "raw";
>> doc("/dev/stdin")//*:wbr/..'

CG> If you want help, please try to help, too. Your example is not what I
CG> would call very helpful; give us at least:

CG> a) a minimized example,

That's what it is, totally contained. Just run it on your Linux etc. shell command line.

CG> b) the returned output, and

OK, here it is QP encoded:
=EF=BF=BD=EF=BF=BD=EF=BF=BD=E5=A5=BD=

CG> c) the expected result

I'm just trying to find a way to remove the injected here,
$ echo '你好'|perl -pwle 's![^[:ascii:]]!$&!'|qprint -e
=E4=BD=A0=E5=A5=BD
So I can get
=E4=BD=A0=E5=A5=BD
I am guessing that is not possible with Basex, and one needs byte level tools like perl.

>> declare option output:encoding "RAW"; or "BYTES" or "NONE"

CG> I’m not sure if you will need any output declaration for your query at
CG> all; but we first need more details.

>> http://docs.basex.org/wiki/Serialization
>> it just says
>> "all encodings supported by Java"
>> So one is supposed to look at
>> http://www.google.com/search?q=all+encodings+supported+by+Java

CG> I've added a link. Note, however, that the list is also dependent on
CG> the Java VM you are using.

OK, also do make a note of that fact there...

>> Why doesn't basex have a command that would output the current
>> "all encodings supported by Java"
>> that it is using.

CG> Try this:
CG> basex "Q{java.nio.charset.Charset}availableCharsets()"

Gawd!
$ basex "Q{java.nio.charset.Charset}availableCharsets()"|wc
0 167 3593
One big line and everything is repeated twice!
$ basex "Q{java.nio.charset.Charset}availableCharsets()"|
perl -nwle 'print for /([^\s{]+)=/g'|wc
167 167 1713
looks much nicer and has half the bytes. Do make a note of it on the wiki there. Thanks.
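The Perl one-liner above can be written in Python as follows. This is a sketch that assumes the command prints a single line in the `{name=name, name=name, ...}` shape that "everything is repeated twice" and the regular expression imply; the sample string is a made-up excerpt, not real output, and `charset_names` is a hypothetical helper name.

```python
import re
from typing import List


def charset_names(map_text: str) -> List[str]:
    """Extract the keys from text shaped like '{Big5=Big5, UTF-8=UTF-8}'."""
    # Same pattern as the Perl one-liner: a run of characters that are
    # neither whitespace nor '{', immediately followed by '='.
    return re.findall(r"([^\s{]+)=", map_text)


# Hypothetical excerpt in the assumed format.
sample = "{Big5=Big5, UTF-8=UTF-8, x-EUC-TW=x-EUC-TW}"

names = charset_names(sample)
print("\n".join(names))
```

Printing one name per line is what turns the single long line into a list that tools such as wc or grep can work with.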
Re: [basex-talk] how to pass raw bytes intact?
Jidanni,

> echo '你好'|perl -pwle 's![^[:ascii:]]!$&!'|basex -q '
> declare option db:parser "html";
> declare option output:method "raw";
> doc("/dev/stdin")//*:wbr/..'

If you want help, please try to help, too. Your example is not what I would call very helpful; give us at least:

a) a minimized example,
b) the returned output, and
c) the expected result

> declare option output:encoding "RAW"; or "BYTES" or "NONE"

I’m not sure if you will need any output declaration for your query at all; but we first need more details.

> http://docs.basex.org/wiki/Serialization
> it just says
> "all encodings supported by Java"
> So one is supposed to look at
> http://www.google.com/search?q=all+encodings+supported+by+Java

I've added a link. Note, however, that the list is also dependent on the Java VM you are using.

> Why doesn't basex have a command that would output the current
> "all encodings supported by Java"
> that it is using.

Try this:

basex "Q{java.nio.charset.Charset}availableCharsets()"
Re: [basex-talk] how to pass raw bytes intact?
Our mission today is to use Basex to remove tags injected right between the bytes of multibyte UTF-8 characters.
http://www.couchsurfing.org/group_read.html?gid=430&post=13986932

> "CG" == Christian Grün writes:

CG> Have you tried method=raw, as mentioned in our documentation
CG> (http://docs.basex.org/wiki/Serialization)?

Sorry. Try it yourself:

echo '你好'|perl -pwle 's![^[:ascii:]]!$&!'|basex -q '
declare option db:parser "html";
declare option output:method "raw";
doc("/dev/stdin")//*:wbr/..'

There is no way to cleanly restore the shattered UTF-8.

I would also like to try
declare option output:encoding "RAW"; or "BYTES" or "NONE"
but on
http://docs.basex.org/wiki/Serialization
it just says
"all encodings supported by Java"
So one is supposed to look at
http://www.google.com/search?q=all+encodings+supported+by+Java
etc. etc.

Why doesn't basex have a command that would output the current
"all encodings supported by Java"
that it is using.