On 28/03/17 21:35, Andrew U Frank wrote:
the problem/bug is not related to the BOM character but seemingly to
many UTF-8 characters.
i get (consistently) a return code of 204 when the fuseki server is
running without -v and 500 when running with -v if any of the literals
contains a "strange" (non-ASCII?) UTF-8 character. the current problem is
the character ä (code point 228 - character a with diaeresis, german umlaut).
if i remove the character, the triples (all of the request) are stored;
if it is in the literal, none is stored.
(can we stick to hex please?)
228 = U+00E4
I suspect that codepoints are not being encoded into UTF-8 correctly.
That is what the java-based decoder that you hit via "-v" is saying.
For example, U+00E4 is 2 bytes : c3 a4 : in UTF-8 on the wire (a
trailing 0a would just be a newline).
What is definitely wrong is sending the codepoint as a byte directly :
xE4 or two bytes 00 E4.
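
To make the difference concrete, here is a small Python sketch (Python
is only my choice for illustration, nothing in Fuseki depends on it)
that prints the bytes for U+00E4 in the correct encoding and in the two
wrong forms mentioned above:

```python
# U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS) in three encodings.
ch = "\u00e4"

utf8_bytes = ch.encode("utf-8")          # what should be on the wire
latin1_bytes = ch.encode("latin-1")      # code point sent as one byte: wrong here
utf16be_bytes = ch.encode("utf-16-be")   # two bytes 00 E4: also wrong here

print(utf8_bytes.hex(" "))     # c3 a4
print(latin1_bytes.hex())      # e4
print(utf16be_bytes.hex(" "))  # 00 e4
```

If your client produces anything other than the first line for the
umlaut, the request body is not valid UTF-8.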
i understand that a request encoded as application/sparql-update must be
coded as UTF-8, which my literal is - or is there some special encoding
necessary for the german a umlaut? i do not think that the triples
should be encoded as latin1 or similar??
Can you confirm that on the wire it is c3 a4?
i tried to POST with curl or wget, but did not succeed (i do not have much
experience with these beyond the simplest cases).
in any case, is it not likely a bug that starting fuseki with or without
-v changes the response?
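
If curl and wget are awkward, a few lines of Python can build the
request and show the exact bytes that would go on the wire. This is
only a sketch: the endpoint URL, the dataset name "ds" and the example
triple are assumptions for illustration, not taken from your setup.

```python
import urllib.request

# Hypothetical endpoint: adjust host, port and dataset name to your setup.
ENDPOINT = "http://localhost:3030/ds/update"

# Hypothetical update with one literal containing U+00E4.
UPDATE = 'INSERT DATA { <http://example.org/s> <http://example.org/p> "M\u00e4rz" }'


def build_update_request(endpoint: str, update: str) -> urllib.request.Request:
    """Encode the update explicitly as UTF-8 and wrap it in a POST request."""
    body = update.encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/sparql-update; charset=utf-8"},
        method="POST",
    )


def send(req: urllib.request.Request) -> int:
    """Send the request; a 4xx/5xx reply raises urllib.error.HTTPError."""
    with urllib.request.urlopen(req) as resp:
        return resp.status


req = build_update_request(ENDPOINT, UPDATE)
# Inspect the body before sending: the umlaut must appear as "c3 a4".
print(req.data.hex(" "))
```

Calling send(req) against a running server then returns the status
code, so the same body can be tried with and without -v.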
The two modes hit different decoders.
Strictly, it is an error and it should be 500. The javacc
bytes-to-character conversion seems to be too lax.
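
As a general illustration of strict versus lax decoding (this is Python,
not the javacc or Java code paths in Fuseki, and the literal is made
up), the same malformed bytes can either be rejected or silently
patched up:

```python
# Hypothetical literal with a bare E4 byte where UTF-8 requires c3 a4.
raw = b'"M\xe4rz"'


def decode_strict(data: bytes) -> str:
    """Strict decoding: malformed UTF-8 raises UnicodeDecodeError."""
    return data.decode("utf-8")


def decode_lax(data: bytes) -> str:
    """Lax decoding: malformed bytes become U+FFFD and decoding continues."""
    return data.decode("utf-8", errors="replace")


try:
    decode_strict(raw)
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(strict_ok)              # False
print(ascii(decode_lax(raw)))  # '"M\ufffdrz"'
```

A strict decoder turns the bad input into an error the client can see;
a lax one lets it through in some altered form, which would fit seeing
different return codes for the same request.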
thank you for the help!
andrew