> OK, I'm still unclear on what's happening after spending time digging
> through the source. The part that poses the biggest problem is that
> lucene_utf8towc appears (to me) to be getting a correct, 32-bit value
> which it stores as an int. However, it then assigns this int directly
> to wchar_t. So if I'm understanding this correctly, then if the
> Unicode value happens to be too big to store in 16-bits, then this
> would be incorrect. It would essentially cause data corruption, yes?
>
> However, there may be more going on here than just this. For instance,
> there's a "repl_wchar.h" file, a "PlatformWin32.h" file, other config
> files all spending a good deal of code on things like _UCS2. So it's
> possible that somehow it works correctly. However, the fact that I got
> different search results on Windows would indicate that there is
> certainly a difference. I got more results, so would that indicate
> that wchar_t is bigger or smaller? I think it would mean smaller.
>
> Matthew
After spending some more time on this, I believe that it is converting to UCS2 on win32 platforms (and UCS4 where wchar_t is 32 bits). Therefore it wouldn't handle Unicode outside of the BMP on Windows. In addition, I don't think the analyzers can handle multi-byte characters, so we shouldn't try to convert it to proper UTF-16.

I have to wonder, though, whether we should be worrying about this particular function and trying to optimize around it. Wouldn't moving to something dynamically allocated affect performance negatively (if there's any difference at all)? In addition, we should (imo) be more worried about passing correct UTF-8 to the function in the first place. There's a comment in the (SWORD) code that casts doubt on whether that's always the case.

And, as I mentioned earlier, there are a few things that cause segfaults, including searching for stop words ("is", "the"). The other bug that we've had reported is that SWORD will index the module without checking whether the location is writable. If it isn't writable, it segfaults after indexing.

I guess my vote is to just give it a big value, add the appropriate call to the writer to increase the field size, then worry about some of these other issues.

Matthew
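
P.S. To make the UCS2 point concrete, here is a minimal sketch of what I think happens when a 32-bit code point is assigned to a 16-bit wchar_t, next to what proper UTF-16 would produce. This is not the CLucene code: the helper names (truncate_to_16_bits, to_surrogate_pair) are made up for illustration, and uint16_t stands in for the Windows wchar_t so it behaves the same on any platform.

```cpp
#include <cstdint>

// Hypothetical stand-in for a 16-bit wchar_t (as on win32).
typedef std::uint16_t wide16_t;

// Hypothetical helper: assigning a 32-bit code point to a 16-bit type
// keeps only the low 16 bits, so anything above U+FFFF is silently
// mapped onto some unrelated BMP character.
inline wide16_t truncate_to_16_bits(std::uint32_t code_point) {
    return static_cast<wide16_t>(code_point & 0xFFFFu);
}

// A UTF-16 surrogate pair: two 16-bit units for one code point.
struct SurrogatePair {
    wide16_t high;
    wide16_t low;
};

// Hypothetical helper: the standard UTF-16 encoding of a code point in
// the range U+10000..U+10FFFF. The caller must check the range first.
inline SurrogatePair to_surrogate_pair(std::uint32_t code_point) {
    const std::uint32_t offset = code_point - 0x10000u;
    SurrogatePair pair;
    pair.high = static_cast<wide16_t>(0xD800u + (offset >> 10));
    pair.low = static_cast<wide16_t>(0xDC00u + (offset & 0x3FFu));
    return pair;
}
```

For example, U+10400 truncates to 0x0400, which is a completely different (Cyrillic) character, whereas real UTF-16 would store it as the two units 0xD801 0xDC00. If distinct characters collapse onto the same 16-bit value like this, that could be one reason for extra search hits on Windows, though I haven't confirmed that.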