Re: How to check i

2014-10-17 Thread Uranuz via Digitalmars-d-learn

This is



How to check i

2014-10-16 Thread Uranuz via Digitalmars-d-learn
I have some string *str* of Unicode characters. The question is 
how to check whether I have a valid Unicode code point starting at code 
unit *index*?


I need this because I am trying to write a parser that operates on a 
string by *code unit*. More precisely, I am trying to write a function 
*matchWord* that should extract whole words (which may consist of 
more than just English letters) from text. The word is then compared 
with the word passed as a parameter. I do not want to decode unless it 
is necessary. But it looks like I can't do it without decoding, because 
I need to know whether the current character is a letter of some 
alphabet and not punctuation or whitespace, for example.


Here is how I think it could look. In the real code I have a template 
algorithm that operates on different string types: string, 
wstring, dstring.


struct Lexer
{
    string str;
    size_t index;

    bool matchWord(string word)
    {
        size_t i = index;
        while( !str[i..$].empty )
        {
            if( !str.isValidChar(i) )
            {
                i++;
                continue;
            }

            uint len = str.graphemeStride(i);

            if( !isAlpha(str[i..i+len]) )
            {
                break;
            }
            i++;
        }

        return word == str[index..i];
    }
}

It is just a draft of the idea. Maybe it is too complicated. What I 
want to get as a result is a boolean flag (matched or not), and the 
position should be set just past the word if it matched. And it should 
match whole words only, of course.


How do I implement this correctly, without overhead and without extra 
UTF decoding if possible?


Also, how can I validate a single character of a string starting at 
a given code unit index? I also don't like that graphemeStride can throw 
an exception if I point to the wrong position. Is there a nothrow 
version? I don't want extra allocations for exceptions.


Re: How to check i

2014-10-16 Thread spir via Digitalmars-d-learn

On 16/10/14 20:46, Uranuz via Digitalmars-d-learn wrote:

> I have some string *str* of unicode characters. The question is how to check if
> I have valid unicode code point starting at code unit *index*?
> [...]


You cannot do that without decoding. Checking whether UTF-x is valid and decoding 
are the very same process. IIRC, D has a validation func which is more or less 
just an alias for the decoding func ;-). Moreover, you also need to distinguish 
word-character code points from others (punctuation, spacing, etc.), which 
requires Unicode code points (the Unicode Consortium provides tables for such tasks).


Thus, I would recommend that you just abandon the illusion of working at the level 
of code units for such tasks, and simply operate on strings of code points. (Why 
do you think D has them built in?)


denis
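
A minimal sketch of the decode-as-you-go approach suggested above, 
assuming std.utf.decode and std.uni.isAlpha from Phobos. The Lexer 
layout follows the draft in the question; treating a "word" as a run of 
alphabetic code points is an assumption, and malformed input still makes 
decode throw a UTFException here.

```d
import std.uni : isAlpha;
import std.utf : decode;

struct Lexer
{
    string str;
    size_t index;

    /// Returns true and moves `index` past the word if the run of
    /// alphabetic code points starting at `index` equals `word`.
    bool matchWord(string word)
    {
        size_t i = index;
        while (i < str.length)
        {
            size_t next = i;
            // decode reads one code point and advances `next` past it;
            // it throws UTFException on a malformed sequence.
            immutable dchar c = decode(str, next);
            if (!isAlpha(c))
                break;
            i = next;
        }

        if (str[index .. i] != word)
            return false;

        index = i; // position is set just past the matched word
        return true;
    }
}

unittest
{
    auto lx = Lexer("привет, мир", 0);
    assert(lx.matchWord("привет"));
    assert(lx.index == 12);        // six two-byte code points
    assert(!lx.matchWord("мир"));  // index now points at the comma
    assert(lx.index == 12);        // unchanged on a failed match
}
```

This can be tried with dmd -unittest -main. Each code point is decoded 
exactly once, so the only decoding cost is the one needed to classify 
the character.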


Re: How to check i

2014-10-16 Thread Ali Çehreli via Digitalmars-d-learn

On 10/16/2014 12:43 PM, spir via Digitalmars-d-learn wrote:


denis


spir is back! :)

On 10/16/2014 11:46 AM, Uranuz wrote:

> I have some string *str* of unicode characters. The question is how to
> check if I have valid unicode code point starting at code unit *index*?

It is easy if I understand the question as skipping over invalid UTF-8 
sequences:


import std.stdio;

ubyte upperTwoBits(ubyte b)
{
    return b & 0b1100_0000;
}

bool isUtf8ContinuationByte(char c)
{
    enum utf8ContinuationPrefix = 0b1000_0000;
    return upperTwoBits(c) == utf8ContinuationPrefix;
}

void moveToValid(ref inout(char)[] s)
{
    /* Skip over UTF-8 continuation bytes. */
    while (s.length && isUtf8ContinuationByte(s[0])) {
        s = s[1..$];
    }

    /*
     * The wchar[] overload is too complicated for Ali at this time. :)
     *
     * Please see the following function template in phobos/std/utf.d:
     *
     * private dchar decodeImpl(bool canIndex, S)(...)
     * if (is(S : const wchar[]) ...
     */
}

unittest
{
    auto s = "çde";
    moveToValid(s);
    assert(s == "çde");

    /* Slice into the middle of the two-byte 'ç'. */
    s = s[1 .. $];
    moveToValid(s);
    assert(s == "de", s);
}

void moveToValid(ref const(dchar)[] s)
{
    /* Every code unit is valid; nothing to do. */
}

void main()
{}

Ali
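
On the original question of validating a single position without 
exceptions, one possible sketch uses the useReplacementDchar flag of 
std.utf.decode, which, as far as I understand the Phobos documentation, 
returns U+FFFD for a malformed sequence instead of throwing. The helper 
name isValidAt is hypothetical (it is not part of Phobos), and it has a 
known limitation: a genuine U+FFFD in the input is also reported as 
invalid.

```d
import std.typecons : Yes;
import std.utf : decode;

/// Hypothetical helper: true if a well-formed code point starts at
/// code unit `index`. A literal U+FFFD in the input counts as invalid.
bool isValidAt(string str, size_t index)
{
    if (index >= str.length)
        return false;

    size_t next = index;
    // With this flag, decode yields U+FFFD for malformed input
    // rather than throwing UTFException.
    immutable dchar c = decode!(Yes.useReplacementDchar)(str, next);
    return c != '\uFFFD';
}

unittest
{
    string s = "çde";          // 'ç' takes two code units in UTF-8
    assert(isValidAt(s, 0));   // lead byte of 'ç'
    assert(!isValidAt(s, 1));  // continuation byte of 'ç'
    assert(isValidAt(s, 2));   // 'd'
    assert(!isValidAt(s, 4));  // past the end
}
```

Whether the compiler infers nothrow for this instantiation is worth 
checking against your Phobos version before annotating the helper.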