Richard, thank you for posting your ideas concerning the incorrectly
encoded LGD strings. I put off solving this problem for a long time because
I hoped that a fixed LGD dataset would be released before our service launch.
Looks like I was too optimistic... So here I will share my solution (more
of a workaround than a solution) to this problem. I used C#, but I think it
will be rather easy to port it to any other language.

First of all I needed to find some regularities between the LGD (incorrectly
encoded) literals and the correctly encoded ones. I took the literal for the
city of Tomsk (Томск in Russian) and used the following code:

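         // LGD literal: one character per byte of the original UTF-8 data
         string lgdString = "\u00D0\u00A2\u00D0\u00BE\u00D0\u00BC\u00D1\u0081\u00D0\u00BA";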
         byte[] lgdStringBytes = Encoding.UTF8.GetBytes(lgdString);

         string realString = "Томск";
         byte[] realStringBytes = Encoding.UTF8.GetBytes(realString);

and got very interesting results:
http://img231.imageshack.us/img231/1213/linkedgeodataencoding.png
As you can see in the picture:
1. The LGD string uses twice as many bytes.
2. Every fourth byte of the LGD string is equal to every second byte of the
normal string.
3. The first byte of the normal string is a combination of the first three
bytes of the LGD string (the third byte is a combination of LGD bytes 5, 6
and 7, etc.). The combination 195+*144*+194 maps to 208 and 195+*145*+194
maps to 209 (byte 7 of the normal string).
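
These regularities are what one would expect if, as Richard suggests, every
byte of the original UTF-8 data was turned into a separate character
(U+0080..U+00FF), because each such character takes two bytes in UTF-8. Here
is a small sketch that prints both byte sequences side by side. The class and
method names are only illustrative, and the literals are written with \u
escapes so that they survive copy and paste:

```csharp
using System;
using System.Text;

static class ByteDump
{
    // Returns the UTF-8 bytes of a string as space-separated hex, e.g. "D0 A2".
    public static string Utf8Hex(string s)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s)).Replace("-", " ");
    }

    static void Main()
    {
        // "Томск" as LGD returns it: one character per original UTF-8 byte.
        string lgdString = "\u00D0\u00A2\u00D0\u00BE\u00D0\u00BC\u00D1\u0081\u00D0\u00BA";
        // "Томск" encoded correctly.
        string realString = "\u0422\u043E\u043C\u0441\u043A";

        // C3 90 C2 A2 C3 90 C2 BE C3 90 C2 BC C3 91 C2 81 C3 90 C2 BA
        Console.WriteLine(Utf8Hex(lgdString));
        // D0 A2 D0 BE D0 BC D1 81 D0 BA
        Console.WriteLine(Utf8Hex(realString));
    }
}
```

In decimal, C3 90 C2 is the 195, 144, 194 triple from the picture, and the
fourth byte (A2) is the unchanged second byte of the normal string.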

Based on these regularities I have implemented a very simple conversion
algorithm (the full source code is provided at the end of this message). It
suits my current needs, but it certainly doesn't work for 3- and 4-byte UTF-8
sequences. Here is how it works for the Tomsk city name:
http://img199.imageshack.us/img199/8356/linkedgeodataencodingfi.png
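
A more general variant, which should also cover 3- and 4-byte sequences, could
rely on Richard's diagnosis directly: if every character of the damaged literal
stands for one byte of the original UTF-8 data, it is enough to turn the
characters back into bytes and decode them as UTF-8. This is only a sketch
under that assumption; the names LgdFix and FixDoubleEncoded are made up for
illustration:

```csharp
using System;
using System.Text;

static class LgdFix
{
    // Strict UTF-8 decoder: throws on invalid bytes instead of inserting U+FFFD.
    static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Assumption: a damaged literal holds one char (U+0000..U+00FF)
    // per byte of the original UTF-8 data.
    public static string FixDoubleEncoded(string s)
    {
        byte[] bytes = new byte[s.Length];
        for (int i = 0; i < s.Length; i++)
        {
            if (s[i] > 0xFF)
                return s; // a real non-Latin-1 character: the string is not damaged
            bytes[i] = (byte)s[i];
        }

        try
        {
            return StrictUtf8.GetString(bytes);
        }
        catch (ArgumentException)
        {
            return s; // bytes are not valid UTF-8, e.g. genuine Latin-1 text
        }
    }

    static void Main()
    {
        string lgdString = "\u00D0\u00A2\u00D0\u00BE\u00D0\u00BC\u00D1\u0081\u00D0\u00BA";
        // Prints True: the result equals the correctly encoded "Томск".
        Console.WriteLine(FixDoubleEncoded(lgdString) == "\u0422\u043E\u043C\u0441\u043A");
    }
}
```

Strings that are plain ASCII, that already contain real Cyrillic characters, or
that are not valid UTF-8 after the conversion are returned unchanged. A genuine
Latin-1 string that happens to look like valid UTF-8 would still be converted,
so this remains a workaround too.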

P. S. There was one magic thing concerning this encoding problem. LGD
strings are used in several places on my site, and there was one place where
they were displayed correctly. What is special about this "place" is that its
content is loaded dynamically using Ajax (jQuery). So there should be a more
natural way of fixing this kind of encoding problem automatically.

PP. S. Source code:

using System;
using System.Text;

class Program
   {
      static void Main(string[] args)
      {
         // LGD literal for "Томск": one character per byte of the original UTF-8 data
         string lgdString = "\u00D0\u00A2\u00D0\u00BE\u00D0\u00BC\u00D1\u0081\u00D0\u00BA";
         string fixedLgdString = FixLgdString(lgdString);
      }

      public static string FixLgdString(string lgdString)
      {
         byte[] lgdStringBytes = Encoding.UTF8.GetBytes(lgdString);

         // Pure ASCII strings are not affected
         if (lgdStringBytes.Length == lgdString.Length)
            return lgdString;

         // The loop below expects only damaged 2-byte symbols (4 LGD bytes each)
         if (lgdStringBytes.Length % 4 != 0)
            return lgdString;

         // 195 + 144 + 194 must give 208, 195 + 145 + 194 must give 209
         int firstByteOffset = (194 + 144 + 195) - 208;

         byte[] fixedLgdStringBytes = new byte[lgdStringBytes.Length / 2];
         int k = 0; // fixedLgdStringBytes counter
         for (int i = 0; i < lgdStringBytes.Length; i += 4)
         {
            fixedLgdStringBytes[k++] = (byte)((lgdStringBytes[i] +
               lgdStringBytes[i + 1] + lgdStringBytes[i + 2]) - firstByteOffset);
            fixedLgdStringBytes[k++] = lgdStringBytes[i + 3];
         }

         return Encoding.UTF8.GetString(fixedLgdStringBytes);
      }
   }
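
For comparison, the escaping that Richard describes in the quoted message below
(decode the UTF-8 first, then escape the characters one by one) could look
roughly like this in C#. It is only a sketch and not the DBpedia
RDFliteral::escape() code; the class name is made up, and the rules follow the
\u / \U notation of N-Triples as I understand it:

```csharp
using System;
using System.Text;

static class NTriplesEscape
{
    // Escapes an already decoded string for use inside an N-Triples literal.
    public static string Escape(string s)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < s.Length; i++)
        {
            int cp = char.ConvertToUtf32(s, i);   // full code point (throws on a lone surrogate)
            if (char.IsSurrogatePair(s, i)) i++;  // skip the low surrogate

            switch (cp)
            {
                case '\\': sb.Append("\\\\"); break;
                case '"':  sb.Append("\\\""); break;
                case '\n': sb.Append("\\n"); break;
                case '\r': sb.Append("\\r"); break;
                case '\t': sb.Append("\\t"); break;
                default:
                    if (cp >= 0x20 && cp <= 0x7E)
                        sb.Append((char)cp);                            // printable ASCII
                    else if (cp <= 0xFFFF)
                        sb.Append("\\u").Append(cp.ToString("X4"));     // \uXXXX
                    else
                        sb.Append("\\U").Append(cp.ToString("X8"));     // \UXXXXXXXX
                    break;
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        // The bytes from Richard's message, decoded as UTF-8 first ("Москва").
        byte[] raw = { 0xD0, 0x9C, 0xD0, 0xBE, 0xD1, 0x81, 0xD0, 0xBA, 0xD0, 0xB2, 0xD0, 0xB0 };
        string decoded = Encoding.UTF8.GetString(raw);

        // Prints \u041C\u043E\u0441\u043A\u0432\u0430
        Console.WriteLine(Escape(decoded));
    }
}
```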

2010/3/23 Richard Cyganiak <[email protected]>

> Well, I'm not affiliated with Linked Geo Data, but have already looked at
> way too many RDF-related encoding problems in my life, so why not look at
> one more ...
>
> It is indeed a problem in Linked Geo Data.
>
> The Moscow resource
>  http://linkedgeodata.org/triplify/node/27503927
>
> has the following value for the :name property, in N-Triples:
>
>  "\u00D0\u009C\u00D0\u00BE\u00D1\u0081\u00D0\u00BA\u00D0\u00B2\u00D0\u00B0"
>
> These are characters escaped with the \u notation of N-Triples. If one
> decodes the characters, this is garbage: ÐœÐ¾Ñ ÐºÐ²Ð°
>
>
> I guess the problem is that the Linked Geo Data code messes up an UTF-8
> encoded input stream that comes from the input dataset. It looks like the
> original stream contained bytes (hexadecimal)
>
>  D0 9C D0 BE D1 81 D0 BA D0 B2 D0 B0
>
> If interpreted as a UTF-8 encoded Unicode string, this is: Москва
>
>
> Now apparently in Linked Geo Data this byte sequence was escaped into \u
> notation simply by prepending \u00 to every byte. That doesn't work. One
> actually has to decode the UTF-8 into Unicode characters first, and then
> escape them one by one, resulting in:
>  "\u041C\u043E\u0441\u043A\u0432\u0430"
>
> A well-tested PHP implementation of this string escaping for N-Triples is
> available in DBpedia as RDFliteral::escape():
>
> http://dbpedia.svn.sourceforge.net/viewvc/dbpedia/extraction/core/RDFliteral.php?view=markup
>
> Best,
> Richard
>
>
>
> On 22 Mar 2010, at 10:33, Hugh Williams wrote:
>
>  Hi Mitko/Alexander,
>>
>> Perhaps someone on the Linked Geo Data group I have added to this reply,
>> can comment ?
>>
>> Best Regards
>> Hugh Williams
>> Professional Services
>> OpenLink Software
>> Web: http://www.openlinksw.com
>> Support: http://support.openlinksw.com
>> Forums: http://boards.openlinksw.com/support
>> Twitter: http://twitter.com/OpenLink
>>
>> On 22 Mar 2010, at 10:21, Mitko Iliev wrote:
>>
>>  The problem is in the LinkedGeoData dataset. It can be reproduced with:
>>>
>>> ttlp (http_get ('http://linkedgeodata.org/triplify/node/27503927'), '',
>>> 'http://linkedgeodata.org/triplify/node/27503927');
>>> and the query: select * where {
>>> <http://linkedgeodata.org/triplify/node/27503927#id> ?y ?z . }
>>>
>>> Best Regards,
>>> Mitko
>>>
>>>
>>> On Mar 20, 2010, at 9:33 PM, Alexander Sidorov wrote:
>>>
>>>  Hm... Look at this query results:
>>>>
>>>> SELECT ?s ?p ?o ?name
>>>> WHERE
>>>> {
>>>> ?s ?p ?o .
>>>> ?s a <http://linkedgeodata.org/vocabulary#city> .
>>>> ?o bif:contains '"moscow"' .
>>>> OPTIONAL
>>>> {
>>>>  ?s <http://linkedgeodata.org/vocabulary#name> ?name
>>>> }
>>>> }
>>>>
>>>> Do you see "Москва" as the name? I see some strange symbols, although I
>>>> see correct Cyrillic symbols in your query results. Looks like a
>>>> LinkedGeoData-specific problem.
>>>>
>>>>
>>>> 2010/3/17 Mitko Iliev <[email protected]>
>>>> Hi Alexander,
>>>>
>>>> The SPARQL endpoint returns UTF-8, and our experiments also show proper
>>>> encoding. For example, try to execute:
>>>> SELECT ?o WHERE {<http://dbpedia.org/resource/Moscow> rdfs:label ?o .
>>>> filter (lang(?o) = 'ru' ) }
>>>> or
>>>> SELECT ?o WHERE { ?s ?p ?o  . ?o bif:contains '"Москва"' } limit 100
>>>> against http://lod.openlinksw.com/sparql . Both return readable
>>>> content.
>>>>
>>>> If your query, executed on the endpoint above, returns bad UTF-8, please
>>>> give us the query so we can debug what happens. Otherwise a possible
>>>> problem is the client side re-coding the response or reading it as a
>>>> narrow charset.
>>>>
>>>> Best Regards,
>>>> Mitko
>>>>
>>>>
>>>> On Mar 17, 2010, at 3:54 AM, Alexander Sidorov wrote:
>>>>
>>>>  Hi Hugh,
>>>>>
>>>>> As I remember, the ADO.NET encoding bug was fixed (I haven't checked,
>>>>> because it makes no sense while the other Entity Framework bug you know
>>>>> about is not fixed).
>>>>>
>>>>> But this problem has no relation to ADO.NET. As I haven't yet deployed
>>>>> my application to Amazon EC2, I execute geo queries against the
>>>>> lod.openlinksw.com/sparql endpoint using the SPARQL protocol (not
>>>>> using the database directly). Here are my screenshots:
>>>>> 1. Manchester: http://img171.imageshack.us/img171/5568/manchesterk.png
>>>>> 2. Moscow: http://img204.imageshack.us/img204/7850/moscow.png
>>>>>
>>>>> Regards,
>>>>> Alexander
>>>>>
>>>>> 2010/3/17 Hugh Williams <[email protected]>
>>>>> Hi Alexander,
>>>>>
>>>>> Is this the encoding issue with the ADO.Net Provider you reported
>>>>> previously? That is the only one I am aware of which is still to be
>>>>> resolved.
>>>>>
>>>>> Note, there is a 40K limit on the size of emails to this mailing list,
>>>>> thus your mail with attachments, which exceeded this limit, was
>>>>> initially withheld pending approval. Please place such attachments on a
>>>>> remote server and provide links in your mails in future ...
>>>>>
>>>>> Best Regards
>>>>> Hugh Williams
>>>>> Professional Services
>>>>> OpenLink Software
>>>>> Web: http://www.openlinksw.com
>>>>> Support: http://support.openlinksw.com
>>>>> Forums: http://boards.openlinksw.com/support
>>>>> Twitter: http://twitter.com/OpenLink
>>>>>
>>>>> On 17 Mar 2010, at 00:27, Alexander Sidorov wrote:
>>>>>
>>>>>  Hello!
>>>>>>
>>>>>> I have already asked about LOD encoding problems before, but no
>>>>>> feedback followed. To be more expressive I have attached my application's
>>>>>> screenshots with information about Manchester (English symbols,
>>>>>> everything is okay) and Moscow (Russian symbols are displayed incorrectly).
>>>>>>
>>>>>> Regards,
>>>>>> Alexander
>>>>>>
>>>>>> <Manchester.png><Moscow.png>
>>>>>> _______________________________________________
>>>>>> Virtuoso-users mailing list
>>>>>> [email protected]
>>>>>> https://lists.sourceforge.net/lists/listinfo/virtuoso-users
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Mitko Iliev
>>>> Developer Virtuoso Team
>>>> OpenLink Software
>>>> http://www.openlinksw.com/virtuoso
>>>> Cross Platform Web Services Middleware
>>>>
>>>>
>>>>
>>>
>>> --
>>> Mitko Iliev
>>> Developer Virtuoso Team
>>> OpenLink Software
>>> http://www.openlinksw.com/virtuoso
>>> Cross Platform Web Services Middleware
>>>
>>>
>
