Re: [basex-talk] how to pass raw bytes intact?

2013-01-02 Thread Christian Grün
As Liam indicated (thanks!), XQuery may not be the best choice for
processing data at the byte level: XQuery was built to work with Unicode
characters as its basic unit, which means that it will never be possible
to create illegal UTF-8 sequences with pure XQuery. This also means
that the language provides no support for "repairing" invalid input.

I wonder if you have enough control over your input to avoid the UTF-8
shattering in the first place? If there's no choice, and if you still
want to try XQuery/BaseX for byte processing, you can play around with
the functions of the Conversion Module:

  http://docs.basex.org/wiki/Conversion_Module
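
A minimal sketch of the byte-level repair discussed in this thread,
written in Python rather than XQuery. It assumes the injected markup is
the literal byte sequence <wbr> (inferred from the *:wbr step in the
query; the exact form on the original page is not shown here), and the
name repair_shattered_utf8 is made up for illustration. The point is
that the tag has to be removed from the bytes before they are decoded;
once a decoder has run, the stray bytes have already become U+FFFD.

```python
# Assumption: the injected tag is the literal bytes b"<wbr>".
INJECTED_TAG = b"<wbr>"


def repair_shattered_utf8(raw: bytes, tag: bytes = INJECTED_TAG) -> str:
    """Remove the tag at byte level, then decode strictly as UTF-8."""
    return raw.replace(tag, b"").decode("utf-8")


# '你' is E4 BD A0 in UTF-8; here the tag sits after its first byte.
shattered = b"\xe4<wbr>\xbd\xa0\xe5\xa5\xbd"

# Decoding first loses the character: each stray byte becomes U+FFFD.
decoded_first = shattered.decode("utf-8", errors="replace")

# Stripping the tag first restores the original text.
repaired = repair_shattered_utf8(shattered)
```

Strict decoding is used in the helper on purpose, so that any damage
left over after the tag is removed raises an error instead of being
silently replaced.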
___
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk


Re: [basex-talk] how to pass raw bytes intact?

2012-12-31 Thread jidanni
> "LREQ" == Liam R E Quin writes:
LREQ> Treating the individual UTF-8 octets individually?
Yes.
LREQ> Not in standard XQuery, but that doesn't preclude a BaseX extension...
Well no big deal, I was just curious.
>> I was just curious if there was a way in basex if I could do s!!!g
>> like I can do in perl, to restore the damaged UTF-8 characters.

LREQ> Note that "damaged UTF-8 characters", if by that you mean not
LREQ> well-formed UTF-8, aren't going to come through email reliably, so I
LREQ> might not be seeing what you wrote - s!!!g can be done with

Don't worry. I wouldn't put any illegal chars into mail.

LREQ> replace() but getting at UTF-8-encoded characters one octet at a time is
LREQ> another matter. But, my goal in replying was to tease out enough
LREQ> information from you that someone else could answer :-)

>> http://www.couchsurfing.org/group_read.html?gid=430&post=13998575
LREQ> This says, "this thread has been deleted" at me.
In fact they deleted the entire group it turns out.

Anyway here's what I posted there
#!/usr/bin/perl
# Shows line where we remove couchsurfing.org's UTF-8 shattering effects.
# Must run this before the browser gets its hands on it and turns the
# shattered UTF-8 into U+FFFD REPLACEMENT CHARACTER.
# So that seems to count out greasemonkey, etc. solutions.
# I used wwwoffle -o URL|./this_program after first browsing the page logged in
# in a browser that used wwwoffle as a proxy
# Copyright   : http://www.fsf.org/copyleft/gpl.html
# Author  : Dan Jacobson -- http://jidanni.org/
# Created On  : 12/31/2012
# Last Modified On: Mon Dec 31 13:12:57 2012
# Update Count: 27
use strict;
use warnings FATAL => 'all';
my $N = qr/[^[:ascii:]]/;
while (<>) {
my $original_line = $_;
## needed on e.g., http://www.couchsurfing.org/couchmanager?read=18541584
s!!!g;
## needed on e.g.,
## http://www.couchsurfing.org/couchrequest/show_couchoffer_form?city_couchrequest=1223052
s!($N) ($N)!$1$2!g;
s!\t\s+!! && chomp;
m!^\s+... \(more\) ! && next;
s!\s* !!;
print "$.: $_" if $_ ne $original_line;
}


Re: [basex-talk] how to pass raw bytes intact?

2012-12-31 Thread Liam R E Quin
On Tue, 2013-01-01 at 11:47 +0800, jida...@jidanni.org wrote:

> Not exactly after it. 1/3 of the way through it. I.e., shattered UTF-8.

Treating the individual UTF-8 octets individually?

Not in standard XQuery, but that doesn't preclude a BaseX extension...

> I was just curious if there was a way in basex if I could do s!!!g
> like I can do in perl, to restore the damaged UTF-8 characters.

Note that "damaged UTF-8 characters", if by that you mean not
well-formed UTF-8, aren't going to come through email reliably, so I
might not be seeing what you wrote - s!!!g can be done with
replace() but getting at UTF-8-encoded characters one octet at a time is
another matter. But, my goal in replying was to tease out enough
information from you that someone else could answer :-)

It's probably best not to assume that people on an XQuery-list would be
familiar with Unicode handling in other languages, such as Perl, by the
way, although some of us are :-)

> http://www.couchsurfing.org/group_read.html?gid=430&post=13998575
This says, "this thread has been deleted" at me.

Best,

Liam


-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org freenode/#xml



Re: [basex-talk] how to pass raw bytes intact?

2012-12-31 Thread jidanni
LREQ> Your perl substitution is putting a wbr tag after the first non-ascii
LREQ> character on the line, and 你 is for sure not an ascii character,
LREQ> so you get the tag after it.

Not exactly after it. 1/3 of the way through it. I.e., shattered UTF-8.
I was just curious if there was a way in basex if I could do s!!!g
like I can do in perl, to restore the damaged UTF-8 characters.

http://www.couchsurfing.org/group_read.html?gid=430&post=13998575



Re: [basex-talk] how to pass raw bytes intact?

2012-12-31 Thread Liam R E Quin
On Tue, 2013-01-01 at 10:52 +0800, jida...@jidanni.org wrote:

> I'm just trying to find a way to remove the wbr tag injected here,
> $ echo '你好'|perl -pwle 's![^[:ascii:]]!$&!'|qprint -e
> =E4=BD=A0=E5=A5=BD

I don't have a qprint command on my system, so I'm not sure what's going
on for you here. Your perl substitution is putting a wbr tag after the
first non-ascii character on the line, and 你 is for sure not an ascii
character, so you get the tag after it.

Are you trying to do MIME octet-level encoding of UTF-8 here?

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org freenode/#xml



Re: [basex-talk] how to pass raw bytes intact?

2012-12-31 Thread jidanni
> "CG" == Christian Grün writes:
CG> Jidanni,

>> echo '你好'|perl -pwle 's![^[:ascii:]]!$&!'|basex -q '
>> declare option db:parser "html";
>> declare option output:method "raw";
>> doc("/dev/stdin")//*:wbr/..'

CG> If you want help, please try to help, too. Your example is not what I
CG> would call very helpful; give us at least:

CG>   a) a minimized example,

That's what it is, totally contained. Just run it on your Linux etc.
shell command line.

CG>   b) the returned output, and

OK, here it is QP encoded:
=EF=BF=BD=EF=BF=BD=EF=BF=BD=E5=A5=BD=

CG>   c) the expected result

I'm just trying to find a way to remove the wbr tag injected here,
$ echo '你好'|perl -pwle 's![^[:ascii:]]!$&!'|qprint -e
=E4=BD=A0=E5=A5=BD

So I can get
=E4=BD=A0=E5=A5=BD

I am guessing that is not possible with Basex, and one needs byte level
tools like perl.
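
The two quoted-printable strings above can be read back with a short
Python sketch. It uses the standard quopri module in place of qprint
(an assumption; the thread itself only uses qprint), and the variable
names are chosen here for illustration. It shows that the returned
output is three U+FFFD replacement characters followed by 好, while
the expected output is the intact 你好.

```python
import quopri

# Output returned by BaseX, as quoted in this message (quoted-printable).
returned = b"=EF=BF=BD=EF=BF=BD=EF=BF=BD=E5=A5=BD="
# Output that was hoped for.
expected = b"=E4=BD=A0=E5=A5=BD"

# The trailing "=" is a soft line break; drop it before decoding.
returned_text = quopri.decodestring(returned.rstrip(b"=")).decode("utf-8")
expected_text = quopri.decodestring(expected).decode("utf-8")

# Escape non-ASCII so the replacement characters are visible.
print(returned_text.encode("unicode_escape").decode("ascii"))
print(expected_text)
```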

>> declare option output:encoding "RAW"; or "BYTES" or "NONE"

CG> I’m not sure if you will need any output declaration for your query at
CG> all; but we first need more details.

>> http://docs.basex.org/wiki/Serialization
>> it just says
>> "all encodings supported by Java"
>> So one is supposed to look at
>> http://www.google.com/search?q=all+encodings+supported+by+Java

CG> I've added a link. Note, however, that the list is also dependent on
CG> the Java VM you are using.

OK, also do make a note of that fact there...

>> Why doesn't basex have a command that would output the current
>> "all encodings supported by Java"
>> that it is using.

CG> Try this:

CG>   basex "Q{java.nio.charset.Charset}availableCharsets()"

Gawd!
$ basex "Q{java.nio.charset.Charset}availableCharsets()"|wc
  0 167 3593
One big line and everything is repeated twice!

$ basex "Q{java.nio.charset.Charset}availableCharsets()"|
  perl -nwle 'print for /([^\s{]+)=/g'|wc
167 167 1713
looks much nicer and has half the bytes.

Do make a note of it on the wiki there. Thanks.
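
For readers without Perl, the same filter can be sketched in Python.
The sample string below is a hypothetical, shortened stand-in for the
real output, assuming it has the "{name=name, ...}" shape that the Perl
pattern implies; the actual list depends on the Java VM in use.

```python
import re

# Hypothetical, shortened sample in the "{name=name, ...}" shape.
sample = "{Big5=Big5, ISO-8859-1=ISO-8859-1, UTF-8=UTF-8}"

# Same pattern as the Perl filter: the run of characters before each "="
# that contains no whitespace and no "{".
names = re.findall(r"([^\s{]+)=", sample)

for name in names:
    print(name)
```

Each name is printed once on its own line, which is why the filtered
output is about half the size of the raw one.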


Re: [basex-talk] how to pass raw bytes intact?

2012-12-31 Thread Christian Grün
Jidanni,

> echo '你好'|perl -pwle 's![^[:ascii:]]!$&!'|basex -q '
>   declare option db:parser "html";
>   declare option output:method "raw";
>   doc("/dev/stdin")//*:wbr/..'

If you want help, please try to help, too. Your example is not what I
would call very helpful; give us at least:

  a) a minimized example,
  b) the returned output, and
  c) the expected result

> declare option output:encoding "RAW"; or "BYTES" or "NONE"

I’m not sure if you will need any output declaration for your query at
all; but we first need more details.

> http://docs.basex.org/wiki/Serialization
> it just says
> "all encodings supported by Java"
> So one is supposed to look at
> http://www.google.com/search?q=all+encodings+supported+by+Java

I've added a link. Note, however, that the list is also dependent on
the Java VM you are using.

> Why doesn't basex have a command that would output the current
> "all encodings supported by Java"
> that it is using.

Try this:

  basex "Q{java.nio.charset.Charset}availableCharsets()"


Re: [basex-talk] how to pass raw bytes intact?

2012-12-30 Thread jidanni
Our mission today is to use Basex to remove tags injected right between
the bytes of multibyte UTF-8 characters.

http://www.couchsurfing.org/group_read.html?gid=430&post=13986932

> "CG" == Christian Grün writes:
CG> Have you tried method=raw, as mentioned in our documentation
CG> (http://docs.basex.org/wiki/Serialization)?

Sorry. Try it yourself:
echo '你好'|perl -pwle 's![^[:ascii:]]!$&!'|basex -q '
  declare option db:parser "html";
  declare option output:method "raw";
  doc("/dev/stdin")//*:wbr/..'

There is no way to cleanly restore the shattered UTF-8.

I would also like to try

  declare option output:encoding "RAW"; or "BYTES" or "NONE"

but on
http://docs.basex.org/wiki/Serialization
it just says
"all encodings supported by Java"
So one is supposed to look at
http://www.google.com/search?q=all+encodings+supported+by+Java
etc. etc.

Why doesn't basex have a command that would output the current
"all encodings supported by Java"
that it is using.