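Fixed version (ASCII characters are now copied through unchanged):

// Requires: using System.Collections.Generic; using System.Text;
public static string FixLgdString(string lgdString)
{
    byte[] lgdStringBytes = Encoding.UTF8.GetBytes(lgdString);

    // A pure ASCII string has one byte per character and needs no fixing.
    if (lgdStringBytes.Length == lgdString.Length)
        return lgdString;

    // 195 + 144 + 194 maps to 208, 195 + 145 + 194 maps to 209, etc.
    int firstByteOffset = (194 + 144 + 195) - 208;

    List<byte> fixedLgdStringBytes = new List<byte>();
    for (int i = 0; i < lgdStringBytes.Length; i++)
    {
        if (lgdStringBytes[i] < 128 || i + 3 >= lgdStringBytes.Length)
        {
            // ASCII byte, or an incomplete trailing group: copy it as is.
            fixedLgdStringBytes.Add(lgdStringBytes[i]);
        }
        else
        {
            // Four LGD bytes collapse into one correct two-byte character.
            fixedLgdStringBytes.Add((byte)((lgdStringBytes[i] +
                lgdStringBytes[i + 1] + lgdStringBytes[i + 2]) - firstByteOffset));
            fixedLgdStringBytes.Add(lgdStringBytes[i + 3]);
            i += 3;
        }
    }

    return Encoding.UTF8.GetString(fixedLgdStringBytes.ToArray());
}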
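
For anyone porting this: a more general alternative, sketched here in Python rather than C#, is to map every character of the garbled string back to a single byte (ISO-8859-1) and then decode those bytes as UTF-8. This assumes the damage is exactly the one-character-per-byte escaping Richard describes in the quoted message below; the function name fix_lgd_string is made up for this example. Unlike the offset arithmetic, this approach should also cover 3- and 4-byte UTF-8 sequences.

```python
def fix_lgd_string(lgd_string: str) -> str:
    """Undo the double encoding by mapping each character back to one byte."""
    try:
        # Each garbled character stands for one original byte (U+0000..U+00FF),
        # so encoding as Latin-1 recovers the original UTF-8 byte sequence.
        raw = lgd_string.encode("latin-1")
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not double-encoded (or damaged in some other way): leave it unchanged.
        return lgd_string


# The Moscow literal from the quoted N-Triples, written with explicit escapes.
garbled = "\u00d0\u009c\u00d0\u00be\u00d1\u0081\u00d0\u00ba\u00d0\u00b2\u00d0\u00b0"

print(fix_lgd_string(garbled))       # Москва
print(fix_lgd_string("Manchester"))  # Manchester
print(fix_lgd_string("Москва"))      # Москва (already correct, returned as is)
```

One caveat: a correctly encoded string that consists only of Latin-1 characters and happens to form valid UTF-8 when encoded would be changed by mistake, so this is best applied only to data known to come from the affected dataset.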
2010/5/21 Alexander Sidorov <[email protected]>
> Richard, thank you for posting your ideas concerning the incorrectly
> encoded LGD strings. I put off solving this problem for a long time because
> I hoped that a fixed LGD dataset would be released before our service launch.
> Looks like I was too optimistic... So here I will share my solution (more
> a workaround than a solution) to this problem. I used C#, but I think it will
> be rather easy to port it to any other language.
>
> First of all I needed to find some patterns relating LGD (incorrectly
> encoded) literals to correctly encoded ones. I took the literal for the city
> of Tomsk (Томск in Russian) and used the following code:
>
> string lgdString = "Ð¢Ð¾Ð¼Ñ Ðº";
>
> byte[] lgdStringBytes = Encoding.UTF8.GetBytes(lgdString);
>
> string realString = "Томск";
>
> byte[] realStringBytes = Encoding.UTF8.GetBytes(realString);
>
> and got very interesting results:
> http://img231.imageshack.us/img231/1213/linkedgeodataencoding.png
> As you can see in the picture:
> 1. The LGD string uses twice as many bytes.
> 2. Every fourth byte of the LGD string is equal to every second byte of the
> normal string.
> 3. The first byte of the normal string is a combination of the first three
> bytes of the LGD string (the third byte is a combination of LGD bytes 5, 6, 7,
> etc.). The combination 195+*144*+194 is mapped to 208 and 195+*145*+194 is
> mapped to 209 (byte 7 of the normal string).
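
The three observations can be reproduced without the screenshot. Here is a short Python sketch (not the C# used in this thread), assuming the damage comes from treating each UTF-8 byte as a separate character, as Richard suggests in the message quoted further down. The variable names are only illustrative.

```python
real = "Томск"
real_bytes = real.encode("utf-8")

# Reproduce the suspected damage: every UTF-8 byte becomes its own character,
# and the result is then encoded as UTF-8 a second time.
lgd = real_bytes.decode("latin-1")
lgd_bytes = lgd.encode("utf-8")

print(list(real_bytes))
print(list(lgd_bytes))

# Observation 1: the LGD string uses twice as many bytes.
print(len(lgd_bytes) == 2 * len(real_bytes))

# Observation 2: every fourth LGD byte equals every second normal byte.
print(lgd_bytes[3::4] == real_bytes[1::2])

# Observation 3: the first three bytes of each group, minus a fixed offset,
# give the first byte of the normal character.
offset = (195 + 144 + 194) - 208  # 325, the firstByteOffset from the C# code
rebuilt = bytearray()
for i in range(0, len(lgd_bytes), 4):
    rebuilt.append(sum(lgd_bytes[i:i + 3]) - offset)
    rebuilt.append(lgd_bytes[i + 3])

print(bytes(rebuilt).decode("utf-8"))  # Томск
```

The fixed offset works for any 2-byte character because a lead byte b in the range 0xC0..0xFF is re-encoded as the pair 195, b - 64, and a continuation byte c as the pair 194, c, so the sum of the first three bytes is always b + 325.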
>
> Based on these patterns I have implemented a very simple conversion
> algorithm (full source code is provided at the end of this message). It
> suits my current needs, but it surely doesn't work for 3- and 4-byte UTF-8
> characters. Here is how it works for the Tomsk city name:
> http://img199.imageshack.us/img199/8356/linkedgeodataencodingfi.png
>
> P. S. There was one magic thing concerning this encoding problem. LGD
> strings are used in several places on my site, and there was one place where
> they were displayed correctly. What is special about this "place" is that its
> content is loaded dynamically using Ajax (jQuery). So there should be a more
> natural way of fixing this kind of encoding problem automatically.
>
> PP. S. Source code:
>
> class Program
> {
>     static void Main(string[] args)
>     {
>         string lgdString = "Ð¢Ð¾Ð¼Ñ Ðº";
>
>         string fixedLgdString = FixLgdString(lgdString);
>     }
>
>     public static string FixLgdString(string lgdString)
>     {
>         byte[] lgdStringBytes = Encoding.UTF8.GetBytes(lgdString);
>
>         if (lgdStringBytes.Length == lgdString.Length)
>             return lgdString;
>
>         int firstByteOffset = (194 + 144 + 195) - 208;
>
>         byte[] fixedLgdStringBytes = new byte[lgdStringBytes.Length / 2];
>         int k = 0; // fixedLgdStringBytes counter
>         for (int i = 0; i < lgdStringBytes.Length; i += 4)
>         {
>             fixedLgdStringBytes[k++] = (byte)((lgdStringBytes[i] +
>                 lgdStringBytes[i + 1] + lgdStringBytes[i + 2]) - firstByteOffset);
>             fixedLgdStringBytes[k++] = lgdStringBytes[i + 3];
>         }
>
>         string fixedLgdString =
>             Encoding.UTF8.GetString(fixedLgdStringBytes);
>
>         return fixedLgdString;
>     }
> }
>
> 2010/3/23 Richard Cyganiak <[email protected]>
>
>> Well, I'm not affiliated with Linked Geo Data, but have already looked at
>> way too many RDF-related encoding problems in my life, so why not look at
>> one more ...
>>
>>
>> It is indeed a problem in Linked Geo Data.
>>
>> The Moscow resource
>> http://linkedgeodata.org/triplify/node/27503927
>>
>> has the following value for the :name property, in N-Triples:
>>
>>
>> "\u00D0\u009C\u00D0\u00BE\u00D1\u0081\u00D0\u00BA\u00D0\u00B2\u00D0\u00B0"
>>
>> These are characters escaped with the \u notation of N-Triples. If one
>> decodes the characters, this is garbage: ÐœÐ¾Ñ ÐºÐ²Ð°
>>
>>
>> I guess the problem is that the Linked Geo Data code messes up a UTF-8
>> encoded input stream that comes from the input dataset. It looks like the
>> original stream contained bytes (hexadecimal)
>>
>> D0 9C D0 BE D1 81 D0 BA D0 B2 D0 B0
>>
>> If interpreted as a UTF-8 encoded Unicode string, this is: Москва
>>
>>
>> Now apparently in Linked Geo Data this byte sequence was escaped into \u
>> notation simply by prepending \u00 to every byte. That doesn't work. One
>> actually has to decode the UTF-8 into Unicode characters first, and then
>> escape them one by one, resulting in:
>> "\u041C\u043E\u0441\u043A\u0432\u0430"
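
To make the difference concrete, here is a minimal Python sketch of the two escaping orders described above. It is only an illustration, not the DBpedia RDFliteral::escape() code, and the function name escape_ntriples is made up for this example. It assumes the classic N-Triples rules: printable ASCII is kept, a few characters get short escapes, and everything else becomes \uXXXX or \UXXXXXXXX.

```python
def escape_ntriples(text: str) -> str:
    """Escape a Unicode string for use inside an N-Triples literal (sketch)."""
    short_escapes = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in short_escapes:
            out.append(short_escapes[ch])
        elif 0x20 <= cp <= 0x7E:
            out.append(ch)                 # printable ASCII stays as is
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)     # one escape per character
        else:
            out.append("\\U%08X" % cp)     # characters beyond the BMP
    return "".join(out)


# The byte sequence from the message above.
utf8_bytes = bytes.fromhex("D09CD0BED181D0BAD0B2D0B0")

# Correct order: decode the UTF-8 bytes into characters first, then escape.
correct = escape_ntriples(utf8_bytes.decode("utf-8"))
print(correct)  # \u041C\u043E\u0441\u043A\u0432\u0430

# The suspected bug: every byte is escaped as if it were a character.
buggy = "".join("\\u%04X" % b for b in utf8_bytes)
print(buggy)    # \u00D0\u009C\u00D0\u00BE\u00D1\u0081\u00D0\u00BA\u00D0\u00B2\u00D0\u00B0
```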
>>
>> A well-tested PHP implementation of this string escaping for N-Triples is
>> available in DBpedia as RDFliteral::escape():
>>
>> http://dbpedia.svn.sourceforge.net/viewvc/dbpedia/extraction/core/RDFliteral.php?view=markup
>>
>> Best,
>> Richard
>>
>>
>>
>> On 22 Mar 2010, at 10:33, Hugh Williams wrote:
>>
>> Hi Mitko/Alexander,
>>>
>>> Perhaps someone on the Linked Geo Data group I have added to this reply,
>>> can comment ?
>>>
>>> Best Regards
>>> Hugh Williams
>>> Professional Services
>>> OpenLink Software
>>> Web: http://www.openlinksw.com
>>> Support: http://support.openlinksw.com
>>> Forums: http://boards.openlinksw.com/support
>>> Twitter: http://twitter.com/OpenLink
>>>
>>> On 22 Mar 2010, at 10:21, Mitko Iliev wrote:
>>>
>>> The problem is in LinkedGeoData dataset. Can be reproduced with :
>>>>
>>>> ttlp (http_get ('http://linkedgeodata.org/triplify/node/27503927'),
>>>> '',
>>>> 'http://linkedgeodata.org/triplify/node/27503927'
>>>> );
>>>> and query : select * where {
>>>> <http://linkedgeodata.org/triplify/node/27503927#id> ?y ?z . }
>>>>
>>>> Best Regards,
>>>> Mitko
>>>>
>>>>
>>>> On Mar 20, 2010, at 9:33 PM, Alexander Sidorov wrote:
>>>>
>>>> Hm... Look at this query results:
>>>>>
>>>>> SELECT ?s ?p ?o ?name
>>>>> WHERE
>>>>> {
>>>>> ?s ?p ?o .
>>>>> ?s a <http://linkedgeodata.org/vocabulary#city> .
>>>>> ?o bif:contains '"moscow"' .
>>>>> OPTIONAL
>>>>> {
>>>>> ?s <http://linkedgeodata.org/vocabulary#name> ?name
>>>>> }
>>>>> }
>>>>>
>>>>> Do you see "Москва" as the name? I see some strange symbols, although I
>>>>> see correct Cyrillic symbols in your query results. Looks like a
>>>>> LinkedGeoData-specific problem.
>>>>>
>>>>>
>>>>> 2010/3/17 Mitko Iliev <[email protected]>
>>>>> Hi Alexander,
>>>>>
>>>>> The SPARQL endpoint returns UTF-8, and our experiments show proper
>>>>> encoding. For example, try to execute:
>>>>> SELECT ?o WHERE {<http://dbpedia.org/resource/Moscow> rdfs:label ?o .
>>>>> filter (lang(?o) = 'ru' ) }
>>>>> or
>>>>> SELECT ?o WHERE { ?s ?p ?o . ?o bif:contains '"Москва"' } limit 100
>>>>> against http://lod.openlinksw.com/sparql . Both return readable
>>>>> content.
>>>>>
>>>>> If your query executed on the endpoint above returns bad UTF-8, please
>>>>> give us the query so we can debug what happens; otherwise a possible
>>>>> problem is the client side re-coding the response or reading it as a
>>>>> narrow charset.
>>>>>
>>>>> Best Regards,
>>>>> Mitko
>>>>>
>>>>>
>>>>> On Mar 17, 2010, at 3:54 AM, Alexander Sidorov wrote:
>>>>>
>>>>> Hi Hugh,
>>>>>>
>>>>>> As I remember, the ADO.NET encoding bug was fixed (I haven't checked
>>>>>> because it makes no sense to while the other Entity Framework bug you
>>>>>> know about is not fixed).
>>>>>>
>>>>>> But this problem has no relation to ADO.NET. As I haven't yet
>>>>>> deployed my application to Amazon EC2, I execute geo queries against the
>>>>>> lod.openlinksw.com/sparql endpoint using the SPARQL protocol (not
>>>>>> using the database directly). Here are my screenshots:
>>>>>> 1. Manchester:
>>>>>> http://img171.imageshack.us/img171/5568/manchesterk.png
>>>>>> 2. Moscow: http://img204.imageshack.us/img204/7850/moscow.png
>>>>>>
>>>>>> Regards,
>>>>>> Alexander
>>>>>>
>>>>>> 2010/3/17 Hugh Williams <[email protected]>
>>>>>> Hi Alexander,
>>>>>>
>>>>>> Is this the encoding issue with the ADO.Net Provider you reported
>>>>>> previously? That is the only one I am aware of that is still to be
>>>>>> resolved.
>>>>>>
>>>>>> Note, there is a 40K limit on the size of emails to this mailing list,
>>>>>> thus your mail with attachments, which exceeded this limit, was initially
>>>>>> withheld pending approval. Please place such attachments on a remote
>>>>>> server and provide links in your mails in future ...
>>>>>>
>>>>>> Best Regards
>>>>>> Hugh Williams
>>>>>>
>>>>>> On 17 Mar 2010, at 00:27, Alexander Sidorov wrote:
>>>>>>
>>>>>> Hello!
>>>>>>>
>>>>>>> I have already asked about LOD encoding problems before, but no
>>>>>>> feedback followed. To be more expressive, I have attached my
>>>>>>> application's screenshots with information about Manchester (English
>>>>>>> symbols - everything is okay) and Moscow (Russian symbols are displayed
>>>>>>> incorrectly).
>>>>>>>
>>>>>>> Regards,
>>>>>>> Alexander
>>>>>>>
>>>>>>> <Manchester.png><Moscow.png>
>>>>>>> ------------------------------------------------------------------------------
>>>>>>> Download Intel® Parallel Studio Eval
>>>>>>> Try the new software tools for yourself. Speed compiling, find bugs
>>>>>>> proactively, and fine-tune applications for parallel performance.
>>>>>>> See why Intel Parallel Studio got high marks during beta.
>>>>>>>
>>>>>>> http://p.sf.net/sfu/intel-sw-dev
>>>>>>> _______________________________________________
>>>>>>> Virtuoso-users mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.sourceforge.net/lists/listinfo/virtuoso-users
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Mitko Iliev
>>>>> Developer Virtuoso Team
>>>>> OpenLink Software
>>>>> http://www.openlinksw.com/virtuoso
>>>>> Cross Platform Web Services Middleware
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>