[jira] Commented: (LUCENE-826) Language detector
[ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804285#action_12804285 ]

Ken Krugler commented on LUCENE-826:

I think Nutch (and eventually Mahout) plan to use Tika for charset/mime-type/language detection going forward. I've filed an issue [TIKA-369] about improving the current Tika code, which is a simplification of the Nutch code. While using this on lots of docs, there were performance issues, and for small chunks of text the quality isn't very good. It would be interesting if Karl could comment on the approach Ted Dunning took (many years ago - 1994 :)) versus what he did.

Language detector

Key: LUCENE-826
URL: https://issues.apache.org/jira/browse/LUCENE-826
Project: Lucene - Java
Issue Type: New Feature
Reporter: Karl Wettin
Assignee: Karl Wettin
Attachments: ld.tar.gz, ld.tar.gz

A formula 1A token/ngram-based language detector. Requires a paragraph of text to avoid false positive classifications. Depends on contrib/analyzers/ngrams for tokenization, and on Weka for classification (logistic support vector models), feature selection and normalization of token frequencies. Optionally uses Wikipedia and NekoHTML for training data harvesting.
Initialized like this:

{code}
LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));

root.addBranch("uralic");
root.addBranch("fino-ugric", "uralic");
root.addBranch("ugric", "uralic");
root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");

root.addBranch("proto-indo european");
root.addBranch("germanic", "proto-indo european");
root.addBranch("northern germanic", "germanic");
root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");

root.addBranch("west germanic", "germanic");
root.addLanguage("west germanic", "eng", "english", "en", "UK");

root.mkdirs();

LanguageClassifier classifier = new LanguageClassifier(root);
if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
  classifier.compileTrainingData(); // from wikipedia
}
classifier.buildClassifier();
{code}

The training set built from Wikipedia consists of the pages describing the home country of each registered language, written in the language to train. The above example passes this test (testEquals is the same as assertEquals, just not required; only one of them fails, see comment):
{code}
assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
testEquals("swe", classifier.classify(norway_in_swedish).getISO());
testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
testEquals("swe", classifier.classify(finland_in_swedish).getISO());
testEquals("swe", classifier.classify(uk_in_swedish).getISO());

testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
testEquals("nor", classifier.classify(uk_in_norwegian).getISO());

testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
testEquals("fin", classifier.classify(norway_in_finnish).getISO());
testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
testEquals("fin", classifier.classify(uk_in_finnish).getISO());

testEquals("dan", classifier.classify(sweden_in_danish).getISO());
// it is ok that this fails. dan and nor are very similar,
// and the document about norway in danish is very small.
testEquals("dan", classifier.classify(norway_in_danish).getISO());
assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
testEquals("dan", classifier.classify(finland_in_danish).getISO());
testEquals("dan", classifier.classify(uk_in_danish).getISO());

testEquals("eng", classifier.classify(sweden_in_english).getISO());
testEquals("eng", classifier.classify(norway_in_english).getISO());
testEquals("eng", classifier.classify(denmark_in_english).getISO());
testEquals("eng", classifier.classify(finland_in_english).getISO());
assertEquals("eng", classifier.classify(uk_in_english).getISO());
{code}

I don't know how well it works on lots of languages, but this fits my needs for now. I'll try to do more work on considering the language trees when classifying.
It takes a bit of time and RAM to build the training data, so the patch contains a pre-compiled arff-file. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue
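For comparison with the Weka-based approach above, the classic character n-gram technique (in the spirit of the approach Ted Dunning described) can be sketched in a few lines. This is a toy illustration under assumed class and method names, not the attached code: build a trigram count profile per language from sample text, then classify by cosine similarity.

```java
import java.util.HashMap;
import java.util.Map;

public class NgramLanguageGuesser {
    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

    // Count overlapping character trigrams in the text.
    static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String t = text.toLowerCase();
        for (int i = 0; i + 3 <= t.length(); i++) {
            counts.merge(t.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    void train(String language, String sampleText) {
        profiles.put(language, trigrams(sampleText));
    }

    // Pick the trained profile with the highest cosine similarity.
    String classify(String text) {
        Map<String, Integer> q = trigrams(text);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = cosine(q, e.getValue());
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        NgramLanguageGuesser g = new NgramLanguageGuesser();
        // Tiny ASCII-only toy corpora; real profiles need far more text.
        g.train("en", "the quick brown fox jumps over the lazy dog and the cat sat on the mat");
        g.train("fi", "hyvaa paivaa kuinka voit mina olen suomalainen ja asun helsingissa kiitos");
        System.out.println(g.classify("the dog and the cat jumped over the mat")); // en
    }
}
```

As the issue description notes, such profiles only become reliable with a paragraph or so of input; very short chunks share too few n-grams with any profile.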
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786712#action_12786712 ]

Ken Krugler commented on LUCENE-1343:

Just to make sure this point doesn't get lost in the discussion over normalization: the issue of visual normalization is one that I think ISOLatin1AccentFilter was originally trying to address. Specifically, how to fold together forms of letters that a user, when typing, might consider equivalent. This is indeed language specific, and re-implementing support that's already in ICU4J is clearly a Bad Idea. I think there's value in a general normalizer that implements the Unicode Consortium's algorithm/data for normalization of int'l domain names, as this is intended to avoid visual spoofing of domain names. Don't know/haven't tracked if or when this is going into ICU4J. But (similar to ICU generic sorting) it provides a useful locale-agnostic approach that would work well enough for most Lucene use cases.

A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Key: LUCENE-1343
URL: https://issues.apache.org/jira/browse/LUCENE-1343
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Reporter: Robert Haschart
Priority: Minor
Attachments: normalizer.jar, UnicodeCharUtil.java, UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java

The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example é becomes e.
However, another equally valid way of representing an accented character in Unicode is to have the unaccented character followed by a non-spacing combining character (e.g. e followed by U+0301 COMBINING ACUTE ACCENT, rendering as é). The ISOLatin1AccentFilter doesn't handle the accents in decomposed Unicode characters at all. Additionally, there are some instances where a word will contain what looks like an accented character but is actually considered to be a separate unaccented character, such as Ł, which you want to fold onto the Latin-1 lookalike version L to make searching easier. The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed characters or decomposed characters; it can also handle the cases described above where characters that look like they have diacritics (but don't) are folded onto the letter that they look like (Ł -> L). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
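The composed/decomposed distinction described in the issue can be sketched with the JDK's java.text.Normalizer (available since Java 6; the attached filter itself may be implemented differently): NFD decomposition followed by stripping combining marks handles both forms of é, while a lookalike such as Ł survives untouched because it has no decomposition and therefore needs an explicit mapping table.

```java
import java.text.Normalizer;

public class FoldDemo {
    // Decompose (NFD), then strip combining marks (Unicode category M).
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("\u00E9"));  // composed é -> e
        System.out.println(stripAccents("e\u0301")); // decomposed é -> e
        System.out.println(stripAccents("\u0141"));  // Ł has no decomposition: unchanged
    }
}
```

This is why a filter that only strips combining marks is incomplete for the Ł case: the lookalike folding has to come from a separate table, which is exactly what the attached UnicodeCharUtil presumably supplies.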
Re: I wanna contribute a Chinese analyzer to lucene
I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese language. It's called imdict-chinese-analyzer, as it is a subproject of http://www.imdict.net/imdict, which is an intelligent online dictionary. The project on Google Code is here: http://code.google.com/p/imdict-chinese-analyzer/

I took a quick look, but didn't see any code posted there yet.

[snip]

This Analyzer contains two packages: the source code and the lexical dictionary. I want to publish the source code using the Apache license, but the dictionary, which is under an ambiguous license, was not created by me. So, can I only submit the source code to the Lucene contribution repository, and let the users download the dictionary from the Google Code site?

I believe your code can be a contrib, with a reference to the dictionary. So a first step would be to open an issue in Lucene's Jira (http://issues.apache.org/jira/browse/LUCENE), and post your source as a patch. The best way to get the right answer to the legal issue is to post it to the legal-disc...@apache.org list (join it first), as Apache's lawyers can then respond to your specific question.

-- Ken

-- Ken Krugler +1 530-210-6378
Use of Unicode data in Lucene
Hi all,

I've started working on something similar to https://issues.apache.org/jira/browse/LUCENE-1343, which is about creating a better (more universal) normalizer for words that look the same. I'd like to avoid the dependency on ICU4J, which (I think) would otherwise prevent the code from being part of the core - due to license issues, it would have to languish in contrib. I can implement the functionality just using the data tables from the Unicode Consortium, including http://www.unicode.org/reports/tr39, but there's still the issue of the Unicode data license and its compatibility with Apache 2.0.

Does anybody know whether http://www.unicode.org/copyright.html creates an issue? What's the process for vetting a license? Or is this something I should be posting to a different list?

Thanks,

-- Ken

-- Ken Krugler +1 530-210-6378
Re: TestIndexInput test failures on jdk 1.6/linux after r641303
OK, it's not a Java 1.6 thing, it's something else. I also found a box that runs that test OK. From what I can tell, this is the test that's failing:

http://www.krugle.org/kse/entfiles/lucene/apache.org/java/trunk/src/test/org/apache/lucene/index/TestIndexInput.java#89

This is verifying that the modified UTF-8 null byte sequence is handled properly, from line 63 in the same file. I think this is the old, deprecated format for pre-2.4 indexes. So shouldn't there be a call to setModifiedUTF8StringsMode()? And since this is a one-way setting of the preUTF8Strings flag, it feels like this should be in a separate test. Without this call, you'll get the result of calling the String class's default constructor with an ill-formed UTF-8 sequence (for Unicode 3.1 or later), since 0xC0 0x80 isn't the shortest form for the U+0000 code point.

-- Ken

Mark Miller wrote:

Hey Sami, I've been running tests quite a bit recently with Ubuntu 8.10 and OpenJDK 6 on a 64-bit machine, and I have not seen it once. Just tried again with Sun JDK 6 and 5 32-bit as well, and I am still not seeing it. Odd.

- Mark

Sami Siren wrote:

I am constantly seeing the following error when running ant test:

[junit] Testcase: testRead(org.apache.lucene.index.TestIndexInput): FAILED
[junit] expected:[] but was:[??]
[junit] junit.framework.ComparisonFailure: expected:[] but was:[??]
[junit] at org.apache.lucene.index.TestIndexInput.testRead(TestIndexInput.java:89)

on both Intel and AMD architectures running Linux.
java on AMD:

java version "1.6.0_11"
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode)

java on Intel:

java version "1.6.0_0"
IcedTea6 1.4 (fedora-7.b12.fc10-x86_64) Runtime Environment (build 1.6.0_0-b12)
OpenJDK 64-Bit Server VM (build 10.0-b19, mixed mode)

java version "1.6.0_11"
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode)

java version "1.6.0_11"
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) Server VM (build 11.0-b16, mixed mode)

Anyone else seeing this?

-- Sami Siren

-- Ken Krugler Krugle, Inc. +1 530-210-6378 If you can't find it, you can't fix it
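The shortest-form point above is easy to see with nothing but the JDK (a small demo, independent of Lucene's IndexInput): Java's modified UTF-8 (as used by DataOutput.writeUTF) encodes U+0000 as the two-byte sequence 0xC0 0x80, but a standards-conformant UTF-8 decoder treats that sequence as overlong and ill-formed, so decoding it with the String class yields replacement characters rather than a null char.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    public static void main(String[] args) {
        // Standard UTF-8 encodes U+0000 as a single 0x00 byte...
        byte[] standard = "\u0000".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(standard)); // [0]

        // ...but 0xC0 0x80 (Java's modified UTF-8 for U+0000) is an overlong,
        // ill-formed sequence; a conformant decoder substitutes U+FFFD.
        String decoded = new String(new byte[] {(byte) 0xC0, (byte) 0x80},
                                    StandardCharsets.UTF_8);
        System.out.println(decoded.indexOf('\uFFFD') >= 0); // true
    }
}
```

That substitution is exactly the "expected:[] but was:[??]" failure quoted above: the test wrote the old-format null byte sequence but read it back through a conformant decoder.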
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622746#action_12622746 ]

Ken Krugler commented on LUCENE-1343:

Hi Robert,

So given that you and the Unicode Consortium seem to be working on the same problem (normalizing visually similar characters), how similar are your tables to the ones that have been developed to deter spoofing of int'l domain names?

-- Ken
[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622432#action_12622432 ]

Ken Krugler commented on LUCENE-1343:

Hi Robert,

FWIW, the issues being discussed here are very similar to those covered by the [Unicode Security Considerations|http://www.unicode.org/reports/tr36/] technical report #36, and the associated data found in the [Unicode Security Mechanisms|http://www.unicode.org/reports/tr39/] technical report #39. The fundamental issue for int'l domain name spoofing is detecting when two sequences of Unicode code points will render as similar glyphs... which is basically the same issue you're trying to address here, so that when you search for something you'll find all terms that look similar. So for a more complete (though undoubtedly slower/bigger) solution, I'd suggest using ICU4J to do an NFKD normalization, then toss any combining/spacing marks, lower-case the result, and finally apply mappings using the data tables found in technical report #39 referenced above.

-- Ken
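The pipeline suggested in the comment (NFKD, drop combining/spacing marks, lower-case, then apply lookalike mappings) can be sketched with the JDK's Normalizer standing in for ICU4J. The CONFUSABLES map below is a tiny hypothetical stand-in for the TR#39 data, which is far larger:

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Map;

public class VisualFold {
    // Hypothetical one-entry stand-in for the TR#39 confusables table.
    static final Map<Character, Character> CONFUSABLES = Map.of('\u0142', 'l'); // ł -> l

    static String fold(String s) {
        // 1. NFKD: compatibility-decompose (also splits ligatures like ﬁ).
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        // 2. Drop combining/spacing marks (Unicode category M).
        String stripped = decomposed.replaceAll("\\p{M}", "");
        // 3. Lower-case.
        String lowered = stripped.toLowerCase(Locale.ROOT);
        // 4. Apply lookalike mappings.
        StringBuilder out = new StringBuilder(lowered.length());
        for (char c : lowered.toCharArray()) {
            out.append(CONFUSABLES.getOrDefault(c, c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("\u00C9\u0141")); // ÉŁ -> el
    }
}
```

Steps 1-3 need only the JDK; step 4 is where the report #39 data tables come in, since characters like Ł never decompose.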
Re: Hadoop RPC for distributed Lucene
I believe Hadoop RPC was originally built for distributed search for Nutch. Here's some core code I think Nutch still uses:

http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/DistributedSearch.java?revision=619648&view=markup

Hadoop RPC is used for distributed search, but at a layer above Lucene - search requests are sent via RPC to remote searchers, which are Java processes running on multiple boxes. These in turn make Lucene queries and send back results. You might want to look at the Katta project (http://katta.wiki.sourceforge.net/), which uses Hadoop to handle distributed Lucene indexes.

-- Ken

One thing I wanted to add to the original email is that if some of the core query and filter classes implemented java.io.Externalizable, then there would be a speedup in serialization equivalent to using Writable. It would also be backwards compatible with, and enhance, the existing distributed search using RMI. Classes that do not implement Externalizable would simply use the default reflection-based serialization.

On Fri, Jul 11, 2008 at 9:13 AM, Grant Ingersoll mailto:[EMAIL PROTECTED][EMAIL PROTECTED] wrote:

I believe there is a subproject over at Hadoop for doing distributed stuff w/ Lucene, but I am not sure if they are doing the search side, or only indexing. I was always under the impression that it was too slow for the search side, as I don't think Nutch even uses it for the search side of the equation, but I don't know if that is still the case.

On Jul 10, 2008, at 10:16 PM, Jason Rutherglen wrote:

Has anyone taken a look at using Hadoop RPC for enabling distributed Lucene? I am thinking it would implement the Searchable interface and use serialization to be compatible with the current RMI version.
Somewhat defeats the purpose of using Hadoop RPC and serialization; however, Hadoop RPC scales far beyond what RMI can at the networking level. RMI uses a thread per socket and reportedly has latency issues. Hadoop RPC uses NIO and is proven to scale to thousands of servers. Serialization unfortunately must be used with Lucene due to the Weight, Query and Filter classes. There could be an extended version of Searchable that allows passing Weight, Query, and Filter classes that implement Hadoop's Writable interface if a user wants to bypass using serialization.

-- Ken Krugler Krugle, Inc. +1 530-210-6378 If you can't find it, you can't fix it
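The Externalizable idea mentioned above can be sketched with a generic example (a hypothetical query-like class, not an actual Lucene class): implementing java.io.Externalizable lets a class hand-roll its wire format, skipping the reflection-based field discovery of default serialization, while still flowing through ObjectOutputStream and therefore through RMI unchanged.

```java
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

// Hypothetical stand-in for a core query class.
public class TermQueryStub implements Externalizable {
    private String field;
    private String text;
    private float boost;

    public TermQueryStub() {} // Externalizable requires a public no-arg constructor

    public TermQueryStub(String field, String text, float boost) {
        this.field = field; this.text = text; this.boost = boost;
    }

    @Override public void writeExternal(ObjectOutput out) throws IOException {
        // Hand-rolled wire format: no reflection, no per-field metadata.
        out.writeUTF(field);
        out.writeUTF(text);
        out.writeFloat(boost);
    }

    @Override public void readExternal(ObjectInput in) throws IOException {
        field = in.readUTF();
        text = in.readUTF();
        boost = in.readFloat();
    }

    public String getField() { return field; }
    public String getText() { return text; }
    public float getBoost() { return boost; }
}
```

A round trip through ObjectOutputStream/ObjectInputStream works exactly as for a default-serializable class, which is why this is backwards compatible with the RMI-based distributed search: only the bytes on the wire per object get smaller.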
Potential bug in SloppyPhraseScorer
Hi list (and hopefully Doron),

Don't know if Doron saw this Jira issue: https://issues.apache.org/jira/browse/LUCENE-1310

Given that this bug only surfaces with repeating terms in the target phrase, I wonder if it's related to the changes made to fix LUCENE-736? We've looked at the code, and the bug isn't obvious. Plus I worry about the probability of introducing a new bug with any modification. If anybody who's touched this code has time to look at the issue and comment, that would be great!

Thanks,

-- Ken

-- Ken Krugler Krugle, Inc. +1 530-210-6378 If you can't find it, you can't fix it
Re: How to solve the issue Unable to read entire block; 72 bytes read; expected 512 bytes
: Sorry, for RTF it throws the following exception:
: Unable to read entire header; 100 bytes read; expected 512 bytes
: Is it an issue with POI or Lucene? If so, which build of POI contains a fix
: for this problem, and where can I get it? Please tell me asap.

1) java-dev is for discussing development of the Lucene Java API; questions about errors when using the Java API should be sent to the java-user list.

2) That's just a one-line error string. It may be the message of an exception, but it may just be something logged by your application. If it is an exception message, the only way to make sense of it is to see the entire exception stack trace.

3) I can't think of anywhere in the Lucene code base that might write out a string like that (or throw an exception with that message). I suspect it is coming from POI (I'd know for sure if you'd sent the full stack trace), so you should consider contacting the POI user list. Before you do, you might try a simple test of a micro app using POI to parse the same document without Lucene involved at all - if you get the same error, then you know it's POI and not Lucene related at all.

It's there in POI:

http://www.krugle.org/kse/files/svn/svn.apache.org/poi/src/java/org/apache/poi/poifs/storage/HeaderBlockReader.java

On line 83.

-- Ken

-- Ken Krugler Krugle, Inc. +1 530-210-6378 If you can't find it, you can't fix it
Re: Analyzers, perfect hash, ICU
Hi all, I'm working on an analyzer for the Slavic Latin-script languages (cs, sk), w/o stemming at first. I would like to ask you:

The StopWord analyzer often uses a HashSet implementation, but the stopwords are not changed often (if ever) from those shipped in the Java code. Do you think there is a benefit to a perfect hash algorithm?

My guess is that you wouldn't save much time here using a perfect hash.

I will do an ICU analyzer for Latin chars (decomposing and returning the base char). Do you have any experience with ICU (icu.sf.net) - any problems or bottlenecks?

This could be a significant performance hit. Using ICU is a good idea, but typically putting some simple front-end filtering in front can save you a lot of time. E.g. if there are a lot of characters that don't require any decomposition, you could do some quick (and very conservative) checks to skip calls to ICU. But of course, measure then optimize :)

P.S.: I would also like to contribute this stuff to lucene-contrib if it's recognized as useful. Is there any howto for setting up Eclipse for Lucene/Apache related projects?

If you're asking about how to set up Eclipse to do development for Lucene, I found some posts to the mailing list a while back, but nothing definitive. FWIW, my experience w/Eclipse 3.1 was that trying to auto-create Eclipse projects using the Ant build file didn't work very well. So we wound up manually creating the project, setting up the classpath, etc.

-- Ken

-- Ken Krugler Krugle, Inc. +1 530-470-9200
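The front-end filtering suggestion can be sketched like this (a hypothetical helper, not code from the thread; the JDK's Normalizer stands in for an ICU transform): a cheap, conservative pure-ASCII scan lets tokens that can't possibly need decomposition skip the expensive call entirely.

```java
import java.text.Normalizer;

public class FastPath {
    // Conservative check: pure ASCII never needs accent decomposition.
    static boolean isPlainAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) return false;
        }
        return true;
    }

    static String normalize(String token) {
        if (isPlainAscii(token)) {
            return token; // fast path: skip the expensive normalizer
        }
        // Slow path: stand-in for an ICU transform (NFD + strip combining marks).
        return Normalizer.normalize(token, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(FastPath.normalize("prague"));         // prague (fast path)
        System.out.println(FastPath.normalize("\u017Elut\u00FD")); // žlutý -> zluty
    }
}
```

For mostly-ASCII corpora the fast path handles the vast majority of tokens, which is exactly the "measure then optimize" win the reply is hinting at.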
Re: Lucene and UTF-8
Perl development is going very well, by the way. On the indexing side, I've got a new app going which solves both the index compatibility issue and the speed issue, about which I'll make a presentation in this forum after I flesh it out and clean it up. Well, I'm lying a little. The app doesn't quite write a valid Lucene 1.4.3 index, since it writes true UTF-8. If these patches get adopted prior to the release of 1.9, though, it will write valid Lucene 1.9 indexes.

This UTF stuff is not my thing, and I have a hard time following all the discussion here (read: I don't get it)... but it sounds like good changes. Could one of the other Lucene committers following this thread apply the patches and commit the stuff if it looks good? Perhaps this is something we should do between 1.9 and 2.0, since the patch will make the new indices incompatible, and breaking compatibility at version 2.0 would be okay, while 1.9 should remain compatible with 1.4.3 indices and just have a bunch of methods deprecated.

Just to clarify, an incompatibility will occur if:

a. The new code is used to write the index.
b. The text being written contains an embedded null or an extended (not in the BMP) Unicode code point.
c. Old code is then used to read the index.

It may still make sense to defer this change to 2.0, but it's not at the level of changing the format of an index file.

-- Ken

-- Ken Krugler Krugle, Inc. +1 530-470-9200
Re: Lucene does NOT use UTF-8
On Monday 29 August 2005 19:56, Ken Krugler wrote:

Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data.

But wouldn't UTF-16 mean 2 bytes per character?

Yes, UTF-16 means two bytes per code unit. A Unicode character (code point) is encoded as either one or two UTF-16 code units.

That doesn't seem to be the case.

The case where? You mean in what actually gets written out? String.length() is the length in terms of Java chars, which means UTF-16 code units (well, sort of... see below). Looking at the code, IndexOutput.writeString() calls writeVInt() with the string length.

One related note: Java 1.4 supports Unicode 3.0, while Java 5.0 supports Unicode 4.0. It was in Unicode 3.1 that supplementary characters (code points above U+FFFF, i.e. outside of the BMP) were added, and the UTF-16 encoding formalized. So I think the issue of non-BMP characters is currently a bit esoteric for Lucene, since I'm guessing there are other places in the code (e.g. JDK calls used by Lucene) where non-BMP characters won't be properly handled. Though some quick tests indicate that there is some knowledge of surrogate pairs in 1.4 (e.g. converting a String w/surrogate pairs to UTF-8 does the right thing).

-- Ken

-- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-470-9200
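The code unit vs. code point distinction above can be seen directly from the String API (a small demo; Character.toChars and codePointCount have been in the JDK since Java 5): a supplementary character occupies two Java chars (a surrogate pair) but is one code point, and the UTF-8 converter encodes the pair as a single four-byte sequence.

```java
import java.nio.charset.StandardCharsets;

public class SupplementaryDemo {
    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(clef.length());                          // 2 (UTF-16 code units)
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 (Unicode character)
        System.out.println(clef.getBytes(StandardCharsets.UTF_8).length); // 4 (one UTF-8 sequence)
    }
}
```

So a VInt of "string length in Java chars" counts this character as 2, while a count of Unicode characters or of UTF-8 bytes would be 1 or 4 respectively, which is the ambiguity the thread keeps circling.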
RE: Lucene does NOT use UTF-8.
I think the VInt should be the number of bytes to be stored, using the UTF-8 encoding. It is trivial to use the String methods identified before to do the conversion. The String(char[]) allocates a new char array. For performance, you can use the actual CharSet encoding classes - avoiding all of the lookups performed by the String class.

Regardless of what underlying support is used, if you want to write out the VInt value as UTF-8 bytes versus Java chars, the Java String has to either be converted to UTF-8 in memory first, or pre-scanned. The first is a memory hit, and the second is a performance hit. I don't know the extent of either, but it's there. Note that since the VInt is a variable size, you can't write out the bytes first and then fill in the correct value later.

-- Ken

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, August 29, 2005 4:24 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.

Ken Krugler wrote:

The remaining issue is dealing with old-format indexes. I think that revving the version number on the segments file would be a good start. This file must be read before any others. Its current version is -1 and would become -2. (All positive values are version 0, for back-compatibility.) Implementations can be modified to pass the version around if they wish to be back-compatible, or they can simply throw exceptions for old-format indexes.

I would argue that the length written be the number of characters in the string, rather than the number of bytes written, since that can minimize string memory allocations.

I'm going to take this off-list now [ ... ]

Please don't. It's better to have a record of the discussion.

Doug

-- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-470-9200
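The pre-scan mentioned above is cheap to sketch (a hypothetical helper, not Lucene's actual writer): the UTF-8 byte length of a Java string can be computed in one pass over its chars without allocating an intermediate byte array, which is what writing a byte-count VInt ahead of the character data requires.

```java
public class Utf8Length {
    // One pass over the UTF-16 code units; no intermediate byte[] allocated.
    static int utf8ByteLength(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0x7F) {
                bytes += 1;
            } else if (c <= 0x7FF) {
                bytes += 2;
            } else if (Character.isHighSurrogate(c) && i + 1 < s.length()
                       && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4; // supplementary character: one 4-byte sequence
                i++;        // consume the low surrogate too
            } else {
                // BMP char >= U+0800. (An unpaired surrogate is also counted
                // as 3 here; the JDK's encoder would substitute '?' instead.)
                bytes += 3;
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(utf8ByteLength("abc"));    // 3
        System.out.println(utf8ByteLength("\u00E9")); // 2
        System.out.println(utf8ByteLength(new String(Character.toChars(0x1D11E)))); // 4
    }
}
```

The trade-off the post describes is then concrete: this scan is an extra O(n) pass per string, versus the memory hit of materializing the UTF-8 bytes up front.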
Re: Lucene does NOT use UTF-8
Daniel Naber wrote:

On Monday 29 August 2005 19:56, Ken Krugler wrote:

Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data.

But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the case.

UTF-16 is a fixed 2 byte/char representation.

I hate to keep beating this horse, but I want to emphasize that it's 2 bytes per Java char (or UTF-16 code unit), not Unicode character (code point).

But one cannot equate the character count with the byte count. Each Java char is 2 bytes. I think all that is being said is that the VInt is equal to str.length() as Java gives it.

On an unrelated project we are determining whether we should use a denormalized (letter followed by an accent) or a normalized form (letter with accent) of accented characters as we present the text to a GUI. We have found that font support varies, but appears to be better for denormalized. This is not an issue for storage, as it can be transformed before it goes to screen. However, it is useful to know which form it is in. The reason I mention this is that I seem to remember that the length of the Java string varies with the representation.

String.length() is the number of Java chars, which always uses UTF-16. If you normalize text, then yes, that can change the number of code units and thus the length of the string, but so can doing any kind of text munging (e.g. replacement) operation on characters in the string.

So then the count would not be the number of glyphs that the user sees. Please correct me if I am wrong.

All kinds of m×n mappings (both at the layout engine level, and using font tables) are possible when going from Unicode characters to display glyphs. Plus zero-width left-kerning glyphs would also alter the relationship between the number of visual characters and backing store characters.

-- Ken

-- Ken Krugler TransPac Software, Inc.
http://www.transpac.com +1 530-470-9200
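The point that normalization changes String.length() is easy to demonstrate with java.text.Normalizer (Java 6+; the 2005 thread itself predates it): NFC and NFD forms of the same text render identically but contain different numbers of UTF-16 code units.

```java
import java.text.Normalizer;

public class NormalizedLength {
    public static void main(String[] args) {
        String composed = "\u00E9"; // é as one precomposed code point (NFC)
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD); // e + U+0301

        System.out.println(composed.length());   // 1
        System.out.println(decomposed.length()); // 2
        // Same canonical text: normalizing both to NFC makes them equal again.
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                                     .equals(composed)); // true
    }
}
```

So a char-count prefix is a property of one particular representation of the text, not of the text itself, which is the observation the poster is making.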
Re: Lucene does NOT use UTF-8.
Yonik Seeley wrote:

A related problem exists even if the prefix length vInt is changed to represent the number of Unicode chars (as opposed to the number of Java chars), right? The prefix length is no longer the offset into the char[] at which to put the suffix.

Yes, I suppose this is a problem too. Sigh. Another approach might be to convert the target to a UTF-8 byte[] and do all comparisons on byte[]. UTF-8 has some very nice properties, including that the byte[] representation of UTF-8 strings compares the same as UCS-4 would.

I was not aware of that, but I see you are correct:

o The byte-value lexicographic sorting order of UTF-8 strings is the same as if ordered by character numbers.

(From http://www.faqs.org/rfcs/rfc3629.html) That makes the byte representation much more palatable, since Lucene orders terms lexicographically.

Where/how is the Lucene ordering of terms used? I'm asking because people often confuse lexicographic order with dictionary order, whereas in the context of UTF-8 it just means the same order as Unicode code points. And the order of Java chars would be the same as for Unicode code points, other than for non-BMP characters.

Thanks,

-- Ken

-- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-470-9200
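The order equivalence, and the one place UTF-16 diverges from it, can be checked concretely (a small demo assuming nothing beyond the JDK): comparing by unsigned UTF-8 bytes agrees with code point order, while Java's char-by-char String.compareTo puts surrogates (0xD800-0xDFFF) below U+E000..U+FFFF, inverting the order for non-BMP characters.

```java
import java.nio.charset.StandardCharsets;

public class SortOrderDemo {
    // Unsigned lexicographic comparison of byte arrays.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        String bmp = "\uE000";                                // U+E000 (BMP)
        String supp = new String(Character.toChars(0x10000)); // U+10000 (surrogate pair)

        // Code point order: U+E000 < U+10000.
        // UTF-8 byte order agrees (0xEE.. < 0xF0..):
        System.out.println(compareBytes(bmp.getBytes(StandardCharsets.UTF_8),
                                        supp.getBytes(StandardCharsets.UTF_8)) < 0); // true
        // UTF-16 code unit order disagrees: 0xD800 (high surrogate) < 0xE000.
        System.out.println(bmp.compareTo(supp) > 0); // true
    }
}
```

This is exactly the "other than for non-BMP characters" caveat in the reply above: for pure-BMP terms the two orders coincide.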
Re: Lucene does NOT use UTF-8
I'm not familiar enough with UTF-8 to follow the details of this discussion. I hope other Lucene developers are, so we can resolve this issue; anyone raising a hand?

I could, but recent posts make me think this is heading towards a religious debate :)

I think the following statements are all true:

a. Using UTF-8 for strings would make it easier for Lucene indexes to be used by implementations other than the reference Java version.

b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings.

c. The hard(er) part would be backwards compatibility with older indexes. I haven't looked at this enough to really know, but one example is the compound file (xx.cfs) format...I didn't see a version number, and it contains strings.

d. The documentation could be clearer on what is meant by the string length, but this is a trivial change.

What's unclear to me (not being a Perl, Python, etc. jock) is how much easier it would be to get these other implementations working with Lucene following a change to UTF-8. So I can't comment on the return on the time required to change things. I'm also curious about the existing CLucene and PyLucene ports. Would they also need to be similarly modified, given the proposed changes?

One final point: I doubt people have been adding strings with embedded nulls, and text outside the Unicode BMP is also very rare. So _most_ Lucene indexes contain only valid UTF-8 data. It's only the above two edge cases that create an interoperability problem.

-- Ken
Re: Lucene does NOT use UTF-8
[snip] The surrogate pair problem is another matter entirely. First of all, let's see if I understand the problem correctly: some Unicode characters can be represented by one code point outside the BMP (i.e., not within 16 bits) and alternatively by two 16-bit values (a surrogate pair).

A Unicode character has a code point, which is a scalar value in the range U+0000 to U+10FFFF. The code point for every character in the Unicode character set will fall in this range. There are Unicode encoding schemes, which specify how Unicode code point values are serialized. Examples include UTF-8, UTF-16LE, UTF-16BE, UTF-32, UTF-7, etc. The UTF-16 (big or little endian) encoding scheme uses two code units (16-bit values) to encode Unicode characters with code point values above U+FFFF.

According to Marvin's explanations, the Unicode standard requires these characters to be represented as the one code point in UTF-8, resulting in a 4-, 5-, or 6-byte encoding for that character.

Since the Unicode code point range is constrained to U+0000...U+10FFFF, the longest valid UTF-8 sequence is 4 bytes.

But since a Java char _is_ 16 bits, the code points beyond the 16-bit range cannot be represented as single chars. That is, the in-memory representation still requires the use of surrogate pairs. Therefore, writing consists of translating the surrogate pair to the code point of the same character and then algorithmically encoding that. Reading is exactly the reverse process.

Yes. Writing requires that you combine the two surrogate characters into a Unicode code point, then convert that value into the 4-byte UTF-8 sequence.

Adding code to handle the 4- to 6-byte encodings to the readChars/writeChars methods is simple, but how do you do the mapping from surrogate pairs to the chars they represent? Is there an algorithm for doing that other than table lookups or huge switch statements?
It's easy, since U+D800...U+DBFF is defined as the range for the high (most significant) surrogate, and U+DC00...U+DFFF is defined as the range for the low (least significant) surrogate.

-- Ken
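The mapping asked about above is pure arithmetic; no tables or switch statements are needed. A sketch in plain Java (illustrative only, not Lucene's actual readChars/writeChars code):

```java
public class SurrogateDemo {
    // Combine a UTF-16 surrogate pair into a Unicode code point.
    // High surrogates are U+D800..U+DBFF, low surrogates U+DC00..U+DFFF.
    static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Split a supplementary code point (above U+FFFF) back into a pair.
    static char[] toSurrogates(int cp) {
        int v = cp - 0x10000;  // 20 bits: 10 for each surrogate
        return new char[] { (char) (0xD800 + (v >> 10)),
                            (char) (0xDC00 + (v & 0x3FF)) };
    }

    // Encode a supplementary code point as a standard 4-byte UTF-8 sequence.
    static byte[] toUtf8(int cp) {
        return new byte[] { (byte) (0xF0 | (cp >> 18)),
                            (byte) (0x80 | ((cp >> 12) & 0x3F)),
                            (byte) (0x80 | ((cp >> 6) & 0x3F)),
                            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) is stored in Java as D834 DD1E.
        int cp = toCodePoint('\uD834', '\uDD1E');
        System.out.printf("U+%X%n", cp);  // U+1D11E
        for (byte b : toUtf8(cp)) System.out.printf("%02X ", b & 0xFF);  // F0 9D 84 9E
        System.out.println();
        char[] pair = toSurrogates(cp);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);  // D834 DD1E
    }
}
```

The JDK later exposed the same arithmetic as Character.toCodePoint and Character.toChars.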
Re: Lucene does NOT use UTF-8
On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:

I'm not familiar enough with UTF-8 to follow the details of this discussion. I hope other Lucene developers are, so we can resolve this issue; anyone raising a hand?

I could, but recent posts make me think this is heading towards a religious debate :)

Ken - you mentioned taking the discussion off-line in a previous post. Please don't. Let's keep it alive on java-dev until we have a resolution.

I think the following statements are all true: a. Using UTF-8 for strings would make it easier for Lucene indexes to be used by other implementations besides the reference Java version. b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings.

What, if any, performance impact would changing Java Lucene in this regard have? (I realize this is rhetorical at this point, until a solution is at hand.)

Almost zero. A tiny hit when reading/writing surrogate pairs, to properly encode them as a 4-byte UTF-8 sequence versus two 3-byte sequences.

c. The hard(er) part would be backwards compatibility with older indexes. I haven't looked at this enough to really know, but one example is the compound file (xx.cfs) format...I didn't see a version number, and it contains strings.

I don't know the gory details, but we've made compatibility-breaking changes in the past, and the current version of Lucene can open older formats but only write the most current format. I suspect it could be made to be backwards compatible. Worst case, we break compatibility in 2.0.

Ronald is correct in that it would be easy to make the reader handle both Java modified UTF-8 and standard UTF-8, and the writer always output standard UTF-8. So the only problem would be if older versions of Lucene (or maybe CLucene) wound up trying to read strings that contained 4-byte UTF-8 sequences, as they wouldn't know how to convert these into two UTF-16 Java chars.
Since 4-byte UTF-8 sequences are used only for characters outside of the BMP, and these are rare, it seems like an OK thing to do, but that's just my uninformed view.

d. The documentation could be clearer on what is meant by the string length, but this is a trivial change.

That change was made by Daniel soon after this discussion began.

Daniel changed the definition of Chars, but the String section still needs to be clarified. Currently it says: "Lucene writes strings as a VInt representing the length, followed by the character data." It should read: "Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data."

-- Ken
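The gap between Java's modified UTF-8 and standard UTF-8 that such a reader would have to bridge can be seen in a few lines of ordinary Java; DataOutputStream.writeUTF emits the modified form. This is illustrative only, not Lucene code:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws Exception {
        String s = "\uD834\uDD1E";  // U+1D11E as a surrogate pair

        // Standard UTF-8: one 4-byte sequence (F0 9D 84 9E).
        byte[] std = s.getBytes(StandardCharsets.UTF_8);

        // Modified UTF-8 (DataOutputStream.writeUTF): each surrogate char
        // is encoded separately as a 3-byte sequence, 6 data bytes total,
        // preceded by a 2-byte length prefix.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        byte[] modified = bos.toByteArray();

        System.out.println(std.length);       // 4
        System.out.println(modified.length);  // 8 (2 length + 6 data)
    }
}
```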
Re: Lucene and UTF-8
Hi Marvin,

I'm guessing that since I'm the one who cares most about interoperability, I'll have to volunteer to do the heavy lifting. Tomorrow I'll go through and survey how many and which things would need to change to achieve full UTF-8 compliance.

One concern is that I think in order to make that last case work, readChars() may need to return an array. Since readChars() is part of the public API and may be called by something other than readString(), I don't know if that'll fly.

I don't believe such a change would be required, since the ultimate data source/destination on the Java side will look the same (an array of Java chars); the only issue is how it looks when serialized.

It seems clear that you have sufficient expertise to hone my rough contributions into final form. If you have the interest, would that be a good division of labor? I wish I could do this alone and just supply finished, tested patches, but obviously I can't. Or perhaps I'm underestimating your level of interest -- do you want to take the ball and run with it?

I can take a look at the code, sure. The hard part will be coding up the JUnit test cases (see below). I think we could stand to have 2 corpuses of test documents available: one which is predominantly 2-byte and 3-byte UTF-8 (but no 4-byte), and another which has the full range, including non-BMP code points. I can hunt those down, or maybe get somebody from the Plucene community to create them, but perhaps they already exist?

Good test data for the decoder would be the following:

a. A single surrogate pair (two Java chars).
b. A surrogate pair at the beginning, followed by regular data.
c. A surrogate pair at the end, preceded by regular data.
d. Two surrogate pairs in a row.

Then all of the above, but with the second (low-order) surrogate character removed (busted format). Then all of the above, but with the first (high-order) surrogate character replaced. Then all of the above, but with the surrogate pair replaced by a 0xC0 0x80 encoded null byte.
And no, I don't think this test data exists, unfortunately. But it shouldn't be too hard to generate.

-- Ken
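A sketch of how the well-formed and busted cases listed above might be generated in Java (hypothetical class and variable names; not part of any proposed patch):

```java
public class SurrogateTestData {
    public static void main(String[] args) {
        // Built around U+1D11E, stored in Java as the pair D834 DD1E.
        String hi = "\uD834", lo = "\uDD1E", pair = hi + lo;

        String[] wellFormed = {
            pair,            // a. a single surrogate pair
            pair + "abc",    // b. pair at the beginning, regular data after
            "abc" + pair,    // c. pair at the end, regular data before
            pair + pair      // d. two surrogate pairs in a row
        };

        // Busted variants: drop the low surrogate, leaving an unpaired
        // high surrogate that a conformant UTF-8 writer must reject
        // or substitute, never emit as-is.
        String[] busted = { hi, hi + "abc", "abc" + hi, hi + hi };

        for (String s : wellFormed) {
            System.out.println(s.length() + " chars, "
                + s.codePointCount(0, s.length()) + " code points");
        }
        for (String s : busted) {
            System.out.println(s.length() + " chars (malformed)");
        }
    }
}
```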
Re: Lucene does NOT use UTF-8.
Hi Marvin,

Thanks for the detailed response. After spending a bit more time in the code, I think you're right - all strings seem to be funnelled through IndexOutput.

The remaining issue is dealing with old-format indexes. I'm going to take this off-list now, since I'm guessing most list readers aren't too interested in the on-going discussion. If anybody else would like to be copied, send me an email.

-- Ken