[jira] Commented: (LUCENE-826) Language detector

2010-01-24 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804285#action_12804285
 ] 

Ken Krugler commented on LUCENE-826:


I think Nutch (and eventually Mahout) plan to use Tika for 
charset/mime-type/language detection going forward.

I've filed an issue [TIKA-369] about improving the current Tika code, which is 
a simplification of the Nutch code. While using this on lots of docs, there 
were performance issues. And for small chunks of text the quality isn't very 
good.

It would be interesting if Karl could comment on the approach Ted Dunning took 
(many years ago - 1994 :)) versus what he did.

 Language detector
 -

 Key: LUCENE-826
 URL: https://issues.apache.org/jira/browse/LUCENE-826
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Karl Wettin
Assignee: Karl Wettin
 Attachments: ld.tar.gz, ld.tar.gz


 A formula 1A token/ngram-based language detector. Requires a paragraph of 
 text to avoid false positive classifications. 
 Depends on contrib/analyzers/ngrams for tokenization, and on Weka for 
 classification (logistic support vector models), feature selection, and 
 normalization of token frequencies. Optionally uses Wikipedia and NekoHTML 
 for training data harvesting.
 Initialized like this:
 {code}
 LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));
 root.addBranch("uralic");
 root.addBranch("fino-ugric", "uralic");
 root.addBranch("ugric", "uralic");
 root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
 root.addBranch("proto-indo european");
 root.addBranch("germanic", "proto-indo european");
 root.addBranch("northern germanic", "germanic");
 root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
 root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
 root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
 root.addBranch("west germanic", "germanic");
 root.addLanguage("west germanic", "eng", "english", "en", "UK");
 root.mkdirs();
 LanguageClassifier classifier = new LanguageClassifier(root);
 if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
   classifier.compileTrainingData(); // from wikipedia
 }
 classifier.buildClassifier();
 {code}
 The training set built from Wikipedia consists of the pages describing the 
 home country of each registered language, written in the language to be 
 trained. The example above passes this test:
 (testEquals is the same as assertEquals, just not required. Only one of them 
 fails; see the comment.)
 {code}
 assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
 testEquals("swe", classifier.classify(norway_in_swedish).getISO());
 testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
 testEquals("swe", classifier.classify(finland_in_swedish).getISO());
 testEquals("swe", classifier.classify(uk_in_swedish).getISO());
 testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
 assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
 testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
 testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
 testEquals("nor", classifier.classify(uk_in_norwegian).getISO());
 testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
 testEquals("fin", classifier.classify(norway_in_finnish).getISO());
 testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
 assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
 testEquals("fin", classifier.classify(uk_in_finnish).getISO());
 testEquals("dan", classifier.classify(sweden_in_danish).getISO());
 // it is ok that this fails. dan and nor are very similar, and the
 // document about norway in danish is very small.
 testEquals("dan", classifier.classify(norway_in_danish).getISO());
 assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
 testEquals("dan", classifier.classify(finland_in_danish).getISO());
 testEquals("dan", classifier.classify(uk_in_danish).getISO());
 testEquals("eng", classifier.classify(sweden_in_english).getISO());
 testEquals("eng", classifier.classify(norway_in_english).getISO());
 testEquals("eng", classifier.classify(denmark_in_english).getISO());
 testEquals("eng", classifier.classify(finland_in_english).getISO());
 assertEquals("eng", classifier.classify(uk_in_english).getISO());
 {code}
 I don't know how well it works on lots of languages, but this fits my needs 
 for now. I'll try to do more work on considering the language trees when 
 classifying.
 It takes a bit of time and RAM to build the training data, so the patch 
 contains a pre-compiled arff-file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue

[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2009-12-06 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786712#action_12786712
 ] 

Ken Krugler commented on LUCENE-1343:
-

Just to make sure this point doesn't get lost in the discussion over 
normalization - the issue of visual normalization is one that I think 
ISOLatin1AccentFilter was originally trying to address: specifically, how to 
fold together forms of letters that a user, when typing, might consider 
equivalent.

This is indeed language specific, and re-implementing support that's already in 
ICU4J is clearly a Bad Idea.

I think there's value in a general normalizer that implements the Unicode 
Consortium's algorithm/data for normalization of int'l domain names, as this is 
intended to avoid visual spoofing of domain names.

I don't know (and haven't tracked) if or when this is going into ICU4J. But (similar to 
ICU generic sorting) it provides a useful locale-agnostic approach that would 
work well enough for most Lucene use cases.

 A replacement for ISOLatin1AccentFilter that does a more thorough job of 
 removing diacritical marks or non-spacing modifiers.
 -

 Key: LUCENE-1343
 URL: https://issues.apache.org/jira/browse/LUCENE-1343
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Robert Haschart
Priority: Minor
 Attachments: normalizer.jar, UnicodeCharUtil.java, 
 UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java


 The ISOLatin1AccentFilter takes Unicode characters that have diacritical 
 marks and replaces them with a version of that character with the diacritical 
 mark removed. For example é becomes e. However, another equally valid way of 
 representing an accented character in Unicode is to have the unaccented 
 character followed by a non-spacing modifier character (like this: é).
 The ISOLatin1AccentFilter doesn't handle the accents in decomposed Unicode 
 characters at all. Additionally, there are some instances where a word will 
 contain what looks like an accented character but is actually considered to 
 be a separate unaccented character, such as Ł, which to make searching 
 easier you want to fold onto the Latin-1 lookalike version L.
 The UnicodeNormalizationFilter can filter out accents and diacritical marks 
 whether they occur as composed or decomposed characters. It can 
 also handle the cases described above, where characters that look like they 
 have diacritics (but don't) are folded onto the letter that they look 
 like (Ł -> L).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: I wanna contribute a Chinese analyzer to lucene

2009-04-16 Thread Ken Krugler
I wrote an Analyzer for Apache Lucene for analyzing sentences in 
the Chinese language. It's called imdict-chinese-analyzer as it is a 
subproject of http://www.imdict.net/imdict, which is an 
intelligent online dictionary.


The project on Google Code is here: 
http://code.google.com/p/imdict-chinese-analyzer/


I took a quick look, but didn't see any code posted there yet.

[snip]

This Analyzer contains two packages: the source code and the lexical 
dictionary. I want to publish the source code under the Apache license, 
but the dictionary, which is under an ambiguous license, was not created 
by me.
So, can I only submit the source code to the Lucene contrib 
repository, and let the users download the dictionary from the 
Google Code site?


I believe your code can be a contrib, with a reference to the 
dictionary. So a first step would be to open an issue in Lucene's 
Jira (http://issues.apache.org/jira/browse/LUCENE), and post your 
source as a patch.


The best way to get the right answer to the legal issue is to post it 
to the legal-disc...@apache.org list (join it first), as Apache's 
lawyers can then respond to your specific question.


-- Ken
--
Ken Krugler
+1 530-210-6378

Use of Unicode data in Lucene

2009-02-25 Thread Ken Krugler

Hi all,

I've started working on something similar to 
https://issues.apache.org/jira/browse/LUCENE-1343, which is about 
creating a better (more universal) normalizer for words that look 
the same.


I'd like to avoid the dependency on ICU4J, which (I think) would 
otherwise prevent the code from being part of the core - due to 
license issues, it would have to languish in contrib.


I can implement the functionality just using the data tables from the 
Unicode Consortium, including http://www.unicode.org/reports/tr39, 
but there's still the issue of the Unicode data license and its 
compatibility with Apache 2.0.


Does anybody know whether http://www.unicode.org/copyright.html 
creates an issue? What's the process for vetting a license? Or is 
this something I should be posting to a different list?


Thanks,

-- Ken
--
Ken Krugler
+1 530-210-6378

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: TestIndexInput test failures on jdk 1.6/linux after r641303

2009-01-05 Thread Ken Krugler
Ok, it's not a Java 1.6 thing; it's something else. I also found a 
box that runs that test ok.


From what I can tell, this is the test that's failing:

http://www.krugle.org/kse/entfiles/lucene/apache.org/java/trunk/src/test/org/apache/lucene/index/TestIndexInput.java#89

This is verifying that the Modified UTF-8 null byte sequence is 
handled properly, from line 63 in the same file.


I think this is the old, deprecated format for pre-2.4 indexes.

So shouldn't there be a call to setModifiedUTF8StringsMode()? And 
since this is a one-way setting of the preUTF8Strings flag, it feels 
like this should be in a separate test.


Without this call, you'll get the result of calling the String 
class's default constructor with an ill-formed UTF-8 sequence (for 
Unicode 3.1 or later), since 0xC0 0x80 isn't the shortest form for 
the U+0000 (null) code point.
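
For anyone who wants to see the difference concretely, here's a small 
standalone JDK sketch (not Lucene code) showing how the null code point 
serializes under the two encodings: DataOutputStream.writeUTF() produces 
Java's modified UTF-8 (0xC0 0x80), while String.getBytes("UTF-8") produces 
standard UTF-8 (a single 0x00 byte).

{code}
// Standalone illustration of modified UTF-8 vs. standard UTF-8 for U+0000.
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws Exception {
        String s = "\u0000";

        // Standard UTF-8 encodes U+0000 as a single 0x00 byte.
        byte[] standard = s.getBytes("UTF-8");
        System.out.printf("standard UTF-8: %02X%n", standard[0] & 0xFF);

        // DataOutputStream.writeUTF uses modified UTF-8: a 2-byte length
        // prefix followed by the 0xC0 0x80 sequence for U+0000.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DataOutputStream(bytes).writeUTF(s);
        byte[] modified = bytes.toByteArray();
        System.out.printf("modified UTF-8: %02X %02X (after the 2-byte length prefix)%n",
                modified[2] & 0xFF, modified[3] & 0xFF);
    }
}
{code}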


-- Ken



Mark Miller wrote:
Hey Sami, I've been running tests quite a bit recently with Ubuntu 
8.10  and OpenJDK 6 on a 64-bit machine, and I have not seen it 
once.
Just tried again with Sun JDK 6 and 5 32-bit as well, and I am 
still not seeing it.


Odd.

- Mark

Sami Siren wrote:

I am constantly seeing the following error when running ant test:

   [junit] Testcase: 
testRead(org.apache.lucene.index.TestIndexInput):FAILED

   [junit] expected:[] but was:[??]
   [junit] junit.framework.ComparisonFailure: expected:[] but was:[??]
   [junit] at 
org.apache.lucene.index.TestIndexInput.testRead(TestIndexInput.java:89)


on both intel and amd architectures running linux.

java on AMD:
java version 1.6.0_11
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode)

java on Intel:
java version 1.6.0_0
IcedTea6 1.4 (fedora-7.b12.fc10-x86_64) Runtime Environment (build 
1.6.0_0-b12)

OpenJDK 64-Bit Server VM (build 10.0-b19, mixed mode)

java version 1.6.0_11
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode)

java version 1.6.0_11
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) Server VM (build 11.0-b16, mixed mode)

Anyone else seeing this?

--
Sami Siren

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
If you can't find it, you can't fix it

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2008-08-14 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12622746#action_12622746
 ] 

Ken Krugler commented on LUCENE-1343:
-

Hi Robert,

So given that you and the Unicode consortium seem to be working on the same 
problem (normalizing visually similar characters), how similar are your tables 
to the ones that have been developed to deter spoofing of int'l domain names?

-- Ken

 A replacement for ISOLatin1AccentFilter that does a more thorough job of 
 removing diacritical marks or non-spacing modifiers.
 -

 Key: LUCENE-1343
 URL: https://issues.apache.org/jira/browse/LUCENE-1343
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Robert Haschart
Priority: Minor
 Attachments: normalizer.jar, UnicodeCharUtil.java, 
 UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java


 The ISOLatin1AccentFilter takes Unicode characters that have diacritical 
 marks and replaces them with a version of that character with the diacritical 
 mark removed. For example é becomes e. However, another equally valid way of 
 representing an accented character in Unicode is to have the unaccented 
 character followed by a non-spacing modifier character (like this: é).
 The ISOLatin1AccentFilter doesn't handle the accents in decomposed Unicode 
 characters at all. Additionally, there are some instances where a word will 
 contain what looks like an accented character but is actually considered to 
 be a separate unaccented character, such as Ł, which to make searching 
 easier you want to fold onto the Latin-1 lookalike version L.
 The UnicodeNormalizationFilter can filter out accents and diacritical marks 
 whether they occur as composed or decomposed characters. It can 
 also handle the cases described above, where characters that look like they 
 have diacritics (but don't) are folded onto the letter that they look 
 like (Ł -> L).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2008-08-13 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12622432#action_12622432
 ] 

Ken Krugler commented on LUCENE-1343:
-

Hi Robert,

FWIW, the issues being discussed here are very similar to those covered by the 
[Unicode Security Considerations|http://www.unicode.org/reports/tr36/] 
technical report #36, and associated data found in the [Unicode Security 
Mechanisms|http://www.unicode.org/reports/tr39/] technical report #39.

The fundamental issue for int'l domain name spoofing is detecting when two 
sequences of Unicode code points will render as similar glyphs...which is 
basically the same issue you're trying to address here, so that when you search 
for something you'll find all terms that look similar.

So for a more complete (though undoubtedly slower and bigger) solution, I'd 
suggest using ICU4J to do an NFKD normalization, then toss any combining/spacing 
marks, lower-case the result, and finally apply mappings using the data tables 
found in the technical report #39 referenced above.
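
As a rough illustration of that recipe, here's a minimal sketch using the 
JDK's java.text.Normalizer (Java 6) as a stand-in for ICU4J; the confusables 
map is a hypothetical two-entry excerpt standing in for the TR#39 data tables, 
which would need to be loaded separately.

{code}
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

public class VisualFolder {

    // Hypothetical excerpt of a TR#39-style confusables mapping.
    private static final Map<Character, Character> CONFUSABLES = new HashMap<Character, Character>();
    static {
        CONFUSABLES.put('\u0141', 'l'); // Ł -> l
        CONFUSABLES.put('\u0142', 'l'); // ł -> l
    }

    public static String fold(String input) {
        // 1. NFKD decomposition splits base letters from combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            // 2. Toss combining/spacing marks.
            int type = Character.getType(c);
            if (type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK) {
                continue;
            }
            // 3. Lower-case, then 4. apply the lookalike mapping.
            char lower = Character.toLowerCase(c);
            Character mapped = CONFUSABLES.get(lower);
            sb.append(mapped != null ? mapped.charValue() : lower);
        }
        return sb.toString();
    }
}
{code}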

-- Ken

 A replacement for ISOLatin1AccentFilter that does a more thorough job of 
 removing diacritical marks or non-spacing modifiers.
 -

 Key: LUCENE-1343
 URL: https://issues.apache.org/jira/browse/LUCENE-1343
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Robert Haschart
Priority: Minor
 Attachments: normalizer.jar, UnicodeCharUtil.java, 
 UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java


 The ISOLatin1AccentFilter takes Unicode characters that have diacritical 
 marks and replaces them with a version of that character with the diacritical 
 mark removed. For example é becomes e. However, another equally valid way of 
 representing an accented character in Unicode is to have the unaccented 
 character followed by a non-spacing modifier character (like this: é).
 The ISOLatin1AccentFilter doesn't handle the accents in decomposed Unicode 
 characters at all. Additionally, there are some instances where a word will 
 contain what looks like an accented character but is actually considered to 
 be a separate unaccented character, such as Ł, which to make searching 
 easier you want to fold onto the Latin-1 lookalike version L.
 The UnicodeNormalizationFilter can filter out accents and diacritical marks 
 whether they occur as composed or decomposed characters. It can 
 also handle the cases described above, where characters that look like they 
 have diacritics (but don't) are folded onto the letter that they look 
 like (Ł -> L).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Hadoop RPC for distributed Lucene

2008-07-11 Thread Ken Krugler
I believe Hadoop RPC was originally built for distributed search for 
Nutch.  Here's some core code I think Nutch still uses: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/DistributedSearch.java?revision=619648&view=markup


Hadoop RPC is used for distributed search, but at a layer above 
Lucene - search requests are sent via RPC to remote searchers, 
which are Java processes running on multiple boxes. These in turn 
make Lucene queries and send back results.


You might want to look at the Katta project 
(http://katta.wiki.sourceforge.net/), which uses Hadoop to handle 
distributed Lucene indexes.


-- Ken

One thing I wanted to add to the original email is that if some of the 
core query and filter classes implemented java.io.Externalizable 
then there would be a speedup in serialization equivalent to using 
Writable.  It would also be backwards compatible with and enhance 
the existing distributed search using RMI.  Classes that do not 
implement Externalizable would simply use the default reflection 
based serialization.
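
For illustration, here's a minimal sketch of what that would look like on a 
hypothetical query-like class (not an actual Lucene patch): with 
Externalizable, the class writes its own compact wire format instead of 
relying on reflection-based default serialization.

{code}
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class SimpleTermQuery implements Externalizable {
    private String field;
    private String text;
    private float boost = 1.0f;

    public SimpleTermQuery() {} // no-arg constructor required by Externalizable

    public SimpleTermQuery(String field, String text) {
        this.field = field;
        this.text = text;
    }

    public void writeExternal(ObjectOutput out) throws IOException {
        // The class controls exactly what goes over the wire.
        out.writeUTF(field);
        out.writeUTF(text);
        out.writeFloat(boost);
    }

    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        field = in.readUTF();
        text = in.readUTF();
        boost = in.readFloat();
    }
}
{code}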


On Fri, Jul 11, 2008 at 9:13 AM, Grant Ingersoll 
[EMAIL PROTECTED] wrote:


I believe there is a subproject over at Hadoop for doing distributed 
stuff w/ Lucene, but I am not sure if they are doing search side, 
only indexing.  I was always under the impression that it was too 
slow for search side, as I don't think Nutch even uses it for the 
search side of the equation, but I don't know if that is still the 
case.




On Jul 10, 2008, at 10:16 PM, Jason Rutherglen wrote:

Has anyone taken a look at using Hadoop RPC for enabling distributed 
Lucene?  I am thinking it would implement the Searchable interface 
and use serialization to be compatible with the current RMI version. 
Somewhat defeats the purpose of using Hadoop RPC and serialization 
however Hadoop RPC scales far beyond what RMI can at the networking 
level.  RMI uses a thread per socket and reportedly has latency 
issues.  Hadoop RPC uses NIO and is proven to scale to thousands of 
servers.  Serialization unfortunately must be used with Lucene due 
to the Weight, Query and Filter classes.  There could be an extended 
version of Searchable that allows passing Weight, Query, and Filter 
classes that implement Hadoop's Writable interface if a user wants 
to bypass using serialization.





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
If you can't find it, you can't fix it

Potential bug in SloppyPhraseScorer

2008-06-24 Thread Ken Krugler

Hi list (and hopefully Doron),

Don't know if Doron saw this Jira issue:

https://issues.apache.org/jira/browse/LUCENE-1310

Given that this bug only surfaces with repeating terms in the target 
phrase, I wonder if it's related to the changes made to fix 
LUCENE-736?


We've looked at the code, and the bug isn't obvious. Plus I worry 
about the probability of introducing a new bug with any modification.


If anybody who's touched this code has time to look at the issue and 
comment, that would be great!


Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
If you can't find it, you can't fix it

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How to solve the issue Unable to read entire block; 72 bytes read; expected 512 bytes

2007-11-12 Thread Ken Krugler

: Sorry, for rtf it throws the following exception:

: Unable to read entire header; 100 bytes read; expected 512 bytes

: Is it an issue with POI or Lucene? If so, which build of POI contains the fix
: for this problem, and where can I get it? Please tell me asap.

1) java-dev is for discussing development of the Lucene Java API;
questions about errors when using the Java API should be sent to the
java-user list.

2) That's just a one-line error string; it may be the message of an
exception -- but it may just be something logged by your application.  If
it is an exception message, the only way to make sense of it is to see the
entire exception stack trace.

3) I can't think of anywhere in the Lucene code base that might write out
a string like that (or throw an exception with that message). I suspect it
is coming from POI (I'd know for sure if you'd sent the full stack trace),
so you should consider contacting the POI user list ... before you do, you
might try a simple test of a micro app using POI to parse the same
document without Lucene involved at all -- if you get the same error, then
you know it's POI and not Lucene related at all.


It's there in POI:

http://www.krugle.org/kse/files/svn/svn.apache.org/poi/src/java/org/apache/poi/poifs/storage/HeaderBlockReader.java

On line 83.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
If you can't find it, you can't fix it

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Analyzers, perfect hash, ICU

2006-01-11 Thread Ken Krugler

Hi all,
   I'm working on an analyzer for the Slavic Latin-script languages 
(cs, sk), without stemming at first.

I would like to ask you:
The stop word analyzers often use a HashSet implementation, but the 
stopwords are not changed often (if ever) from those shipped in the Java 
code. Do you think there is a benefit to using a perfect hash 
algorithm here?


My guess is that you wouldn't save much time here using a perfect hash.

I will do an ICU analyzer for Latin chars (decomposing and returning the 
base char). Do you have any experience with ICU (icu.sf.net) - any 
problems or bottlenecks?


This could be a significant performance hit. Using ICU is a good 
idea, but typically putting some simple front-end filtering in front 
can save you a lot of time.


E.g. if there are a lot of characters that don't require any 
decomposition, you could do some quick (and very conservative) checks 
to skip calls to ICU.
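
For example, a conservative front-end check might look like the following 
sketch; the normalize() method here is just a placeholder for whatever 
ICU-based decomposition the analyzer ends up calling.

{code}
public final class NormalizeIfNeeded {

    public static String fold(String token) {
        if (isPlainAscii(token)) {
            return token; // nothing to decompose, skip ICU entirely
        }
        return normalize(token);
    }

    // Very conservative: anything outside 7-bit ASCII goes to the normalizer.
    private static boolean isPlainAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    private static String normalize(String token) {
        // Placeholder: call into ICU (or java.text.Normalizer) here.
        return token;
    }
}
{code}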


But of course, measure then optimize :)

P.S.: I would also like to contribute this stuff to lucene-contrib if 
it's recognized as useful. Is there any howto for setting up Eclipse for a 
Lucene/Apache related project?


If you're asking about how to set up Eclipse to do development for 
Lucene, I found some posts to the mailing list a while back, but 
nothing definitive.


FWIW, my experience w/Eclipse 3.1 was that trying to auto-create 
Eclipse projects using the Ant build file didn't work very well. So 
we wound up manually creating the project, setting up the classpath, 
etc.


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene and UTF-8

2005-09-27 Thread Ken Krugler
  Perl development is going very well, by the way.  On the indexing 
 side, I've got a new app going which solves both the index 
 compatibility issue and the speed issue, about which I'll make a 
 presentation in this forum after I flesh it out and clean it up.



  Well, I'm lying a little.  The app doesn't quite write a valid Lucene
  1.4.3 index, since it writes true UTF-8.  If these patches get
  adopted prior to the release of 1.9, though, it will write valid
  Lucene 1.9 indexes.

This UTF stuff is not my thing, and I have a hard time following all
the discussion here (read: I don't get it)... but it sounds like good
changes. 


Could one of the other Lucene committers following this thread apply
the patches and commit the stuff if it looks good?  Perhaps this is
something we should do between 1.9 and 2.0, since the patch will make
the new indices incompatible, and breaking the compatibility at version
2.0 would be okay, while 1.9 should remain compatible with 1.4.3
indices and just have a bunch of methods deprecated.


Just to clarify, an incompatibility will occur if:

a. The new code is used to write the index.
b. The text being written contains an embedded null or an extended 
(not in the BMP) Unicode code point.

c. Old code is then used to read the index.

It may still make sense to defer this change to 2.0, but it's not at 
the level of changing the format of an index file.


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8

2005-08-30 Thread Ken Krugler

On Monday 29 August 2005 19:56, Ken Krugler wrote:


 Lucene writes strings as a VInt representing the length of the
 string in Java chars (UTF-16 code units), followed by the character
 data.


But wouldn't UTF-16 mean 2 bytes per character?


Yes, UTF-16 means two bytes per code unit. A Unicode character (code 
point) is encoded as either one or two UTF-16 code units.



That doesn't seem to be the
case.


The case where? You mean in what actually gets written out?

String.length() is the length in terms of Java chars, which means 
UTF-16 code units (well, sort of...see below). Looking at the code, 
IndexOutput.writeString() calls writeVInt() with the string length.


One related note. Java 1.4 supports Unicode 3.0, while Java 5.0 
supports Unicode 4.0. It was in Unicode 3.1 that supplementary 
characters (code points above U+FFFF, i.e. outside of the BMP) were added, 
and the UTF-16 encoding formalized.


So I think the issue of non-BMP characters is currently a bit 
esoteric for Lucene, since I'm guessing there are other places in the 
code (e.g. JDK calls used by Lucene) where non-BMP characters won't 
be properly handled. Though some quick tests indicate that there is 
some knowledge of surrogate pairs in 1.4 (e.g. converting a String 
w/surrogate pairs to UTF-8 does the right thing).
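
As a small illustration (assuming Java 5, which added the code point APIs), 
here's the difference between Java chars and Unicode code points for a 
character outside the BMP:

{code}
public class SurrogateDemo {
    public static void main(String[] args) throws Exception {
        // U+1D11E (MUSICAL SYMBOL G CLEF) is outside the BMP, so it takes a
        // surrogate pair -- two Java chars -- in a String.
        String s = new String(Character.toChars(0x1D11E));

        System.out.println(s.length());                      // 2 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(s.getBytes("UTF-8").length);      // 4 bytes in UTF-8
    }
}
{code}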


-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Ken Krugler

I think the VInt should be the number of bytes to be stored using the UTF-8
encoding.

It is trivial to use the String methods identified before to do the
conversion. The String(char[]) allocates a new char array.

For performance, you can use the actual CharSet encoding classes - avoiding
all of the lookups performed by the String class.


Regardless of what underlying support is used, if you want to write 
out the VInt value as UTF-8 bytes versus Java chars, the Java String 
has to either be converted to UTF-8 in memory first, or pre-scanned. 
The first is a memory hit, and the second is a performance hit. I 
don't know the extent of either, but it's there.


Note that since the VInt is a variable size, you can't write out the 
bytes first and then fill in the correct value later.
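
For what it's worth, a pre-scan sketch might look like this (assuming Java 
5's surrogate helpers); it counts the UTF-8 bytes a String would need 
without allocating the byte[]:

{code}
public final class Utf8Length {

    public static int utf8ByteLength(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0x7F) {
                bytes += 1;
            } else if (c <= 0x7FF) {
                bytes += 2;
            } else if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4; // a surrogate pair becomes one 4-byte sequence
                i++;        // skip the low surrogate
            } else {
                bytes += 3; // remaining BMP chars (unpaired surrogates would
                            // really need replacement, also 3 bytes)
            }
        }
        return bytes;
    }
}
{code}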


-- Ken



-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, August 29, 2005 4:24 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.


Ken Krugler wrote:

 The remaining issue is dealing with old-format indexes.


I think that revving the version number on the segments file would be a
good start.  This file must be read before any others.  Its current
version is -1 and would become -2.  (All positive values are version 0,
for back-compatibility.)  Implementations can be modified to pass the
version around if they wish to be back-compatible, or they can simply
throw exceptions for old format indexes.

I would argue that the length written be the number of characters in the
string, rather than the number of bytes written, since that can minimize
string memory allocations.


 I'm going to take this off-list now [ ... ]


Please don't.  It's better to have a record of the discussion.

Doug



--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8

2005-08-30 Thread Ken Krugler

Daniel Naber wrote:


On Monday 29 August 2005 19:56, Ken Krugler wrote:


Lucene writes strings as a VInt representing the length of the
string in Java chars (UTF-16 code units), followed by the character
data.
  

But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem 
to be the case.



UTF-16 is a fixed 2 byte/char representation.


I hate to keep beating this horse, but I want to emphasize that it's 
2 bytes per Java char (or UTF-16 code unit), not Unicode character 
(code point).


But one cannot equate the character count with the byte count. Each 
Java char is 2 bytes. I think all that is being said is that the 
VInt is equal to str.length() as java gives it.


On an unrelated project we are determining whether we should use a 
denormalized (letter followed by accents) or a normalized form 
(letter with accents) of accented characters as we present the text 
to a GUI. We have found that font support varies but appears to be 
better for denormalized. This is not an issue for storage, as it can 
be transformed before it goes to screen. However, it is useful to 
know which form it is in.


The reason I mention this is that I seem to remember that the length 
of the java string varies with the representation.


String.length() is the number of Java chars, which always uses 
UTF-16. If you normalize text, then yes that can change the number of 
code units and thus the length of the string, but so can doing any 
kind of text munging (e.g. replacement) operation on characters in 
the string.


So then the count would not be the number of glyphs that the user 
sees. Please correct me if I am wrong.


All kinds of mxn mappings (both at the layout engine level, and using 
font tables) are possible when going from Unicode characters to 
display glyphs. Plus zero-width left-kerning glyphs would also alter 
the relationship between # of visual characters and backing store 
characters.


-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Ken Krugler

Yonik Seeley wrote:
A related problem exists even if the prefix length vInt is changed 
to represent the number of unicode chars (as opposed to number of 
java chars), right? The prefix length is no longer the offset into 
the char[] to put the suffix.


Yes, I suppose this is a problem too.  Sigh.

Another approach might be to convert the target to a UTF-8 byte[] 
and do all comparisons on byte[]. UTF-8 has some very nice 
properties, including that the byte[] representation of UTF-8 
strings compare the same as UCS-4 would.


I was not aware of that, but I see you are correct:

   o  The byte-value lexicographic sorting order of UTF-8 strings is the
  same as if ordered by character numbers.

(From http://www.faqs.org/rfcs/rfc3629.html)

That makes the byte representation much more palatable, since Lucene 
orders terms lexicographically.


Where/how is the Lucene ordering of terms used?

I'm asking because people often confuse lexicographic order with 
dictionary order, whereas in the context of UTF-8 it just means 
the same order as Unicode code points. And the order of Java chars 
would be the same as for Unicode code points, other than non-BMP 
characters.
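
As a quick standalone demonstration (a sketch, not Lucene code): comparing 
UTF-8 bytes matches code point order, while comparing Java chars (UTF-16 
code units) does not once supplementary characters are involved.

{code}
public class SortOrderDemo {
    public static void main(String[] args) throws Exception {
        String bmp = "\uFF61";                                          // U+FF61, in the BMP
        String supplementary = new String(Character.toChars(0x10000));  // U+10000, outside the BMP

        // Code point order: U+FF61 < U+10000.
        // UTF-16 code unit order disagrees, because the high surrogate
        // 0xD800 is numerically smaller than 0xFF61.
        System.out.println(bmp.compareTo(supplementary));   // positive: bmp sorts *after*

        // UTF-8 byte order agrees with code point order: EF BD A1 < F0 90 80 80.
        byte[] a = bmp.getBytes("UTF-8");
        byte[] b = supplementary.getBytes("UTF-8");
        System.out.println((a[0] & 0xFF) - (b[0] & 0xFF));   // negative: bmp sorts *before*
    }
}
{code}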


Thanks,

-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ken Krugler

I'm not familiar enough with UTF-8 to follow the details of this
discussion.  I hope other Lucene developers are, so we can resolve this
issue... anyone raising a hand?


I could, but recent posts makes me think this is heading towards a 
religious debate :)


I think the following statements are all true:

a. Using UTF-8 for strings would make it easier for Lucene indexes to 
be used by other implementations besides the reference Java version.


b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings.

c. The hard(er) part would be backwards compatibility with older 
indexes. I haven't looked at this enough to really know, but one 
example is the compound file (xx.cfs) format...I didn't see a version 
number, and it contains strings.


d. The documentation could be clearer on what is meant by the string 
length, but this is a trivial change.


What's unclear to me (not being a Perl, Python, etc jock) is how much 
easier it would be to get these other implementations working with 
Lucene, following a change to UTF-8. So I can't comment on the return 
on time required to change things.


I'm also curious about the existing CLucene and PyLucene ports. Would 
they also need to be similarly modified, with the proposed changes?


One final point. I doubt people have been adding strings with 
embedded nulls, and text outside of the Unicode BMP is also very 
rare. So _most_ Lucene indexes only contain valid UTF-8 data. It's 
only the above two edge cases that create an interoperability problem.


-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ken Krugler

[snip]

The surrogate pair problem is another matter entirely. First of all, 
lets see if I do understand the problem correctly: Some unicode 
characters can be represented by one codepoint outside the BMP (i. 
e., not with 16 bits) and alternatively with two codepoints, both of 
them in the 16-bit range.


A Unicode character has a code point, which is a scalar value in the 
range U+0000 to U+10FFFF. The code point for every character in the 
Unicode character set will fall in this range.


There are Unicode encoding schemes, which specify how Unicode code 
point values are serialized. Examples include UTF-8, UTF-16LE, 
UTF-16BE, UTF-32, UTF-7, etc.


The UTF-16 (big or little endian) encoding scheme uses two code units 
(16-bit values) to encode Unicode characters with code point values 
above U+FFFF.


According to Marvin's explanations, the Unicode standard requires 
these characters to be represented as the one codepoint in UTF-8, 
resulting in a 4-, 5-, or 6-byte encoding for that character.


Since the Unicode code point range is constrained to 
U+0000...U+10FFFF, the longest valid UTF-8 sequence is 4 bytes.


But since a Java char _is_ 16 bit, the codepoints beyond the 16-bit 
range cannot be represented as chars.  That is, the 
in-memory-representation still requires the use of the surrogate 
pairs.  Therefore, writing consists of translating the surrogate 
pair to the 16bit representation of the same character and then 
algorithmically encoding that.  Reading is exactly the reverse 
process.


Yes. Writing requires that you combine the two surrogate characters 
into a Unicode code point, then convert that value into the 
4-byte UTF-8 sequence.


Adding code to handle the 4 to 6 byte encodings to the 
readChars/writeChars method is simple, but how do you do the mapping 
from surrogate pairs to the chars they represent? Is there an 
algorithm for doing that except for table lookups or huge switch 
statements?


It's easy, since U+D800...U+DBFF is defined as the range for the high 
(most significant) surrogate, and U+DC00...U+DFFF is defined as the 
range for the low (least significant) surrogate.
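
A sketch of that write path, for concreteness (Character.toCodePoint is the 
Java 5 helper; the equivalent arithmetic is shown in the comment):

{code}
public final class SurrogateToUtf8 {

    public static byte[] encode(char high, char low) {
        // Equivalent arithmetic:
        // codePoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        int cp = Character.toCodePoint(high, low);

        // 4-byte UTF-8: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 bits of payload).
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }
}
{code}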


-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ken Krugler

On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:

I'm not familiar enough with UTF-8 to follow the details of this
discussion.  I hope other Lucene developers are, so we can resolve this
issue... anyone raising a hand?


I could, but recent posts makes me think this is heading towards a 
religious debate :)


Ken - you mentioned taking the discussion off-line in a previous 
post.  Please don't.  Let's keep it alive on java-dev until we have 
a resolution to it.



I think the following statements are all true:

a. Using UTF-8 for strings would make it easier for Lucene indexes 
to be used by other implementations besides the reference Java 
version.


b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings.


What, if any, performance impact would changing Java Lucene in this 
regard have?   (I realize this is rhetorical at this point, until a 
solution is at hand)


Almost zero. A tiny hit when reading/writing surrogate pairs, to 
properly encode them as a 4 byte UTF-8 sequence versus two 3-byte 
sequences.


c. The hard(er) part would be backwards compatibility with older 
indexes. I haven't looked at this enough to really know, but one 
example is the compound file (xx.cfs) format...I didn't see a 
version number, and it contains strings.


I don't know the gory details, but we've made compatibility breaking 
changes in the past and the current version of Lucene can open older 
formats, but only write the most current format.  I suspect it could 
be made to be backwards compatible.  Worst case, we break 
compatibility in 2.0.


Ronald is correct in that it would be easy to make the reader handle 
both Java modified UTF-8 and UTF-8, and the writer always output 
UTF-8. So the only problem would be if older versions of Lucene (or 
maybe CLucene) wound up trying to read strings that contained 4-byte 
UTF-8 sequences, as they wouldn't know how to convert this into two 
UTF-16 Java chars.


Since 4-byte UTF-8 sequences are only for characters outside of the 
BMP, and these are rare, it seems like an OK thing to do, but that's 
just my uninformed view.
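
For concreteness, a rough sketch of such a lenient reader (a hypothetical 
helper, not the actual IndexInput code) might decode like this, accepting 
the 0xC0 0x80 form for U+0000 alongside standard UTF-8:

{code}
public final class LenientUtf8 {

    /** Decodes one code point starting at pos, appending chars to out;
     *  returns the number of bytes consumed. */
    public static int decode(byte[] bytes, int pos, StringBuilder out) {
        int b0 = bytes[pos] & 0xFF;
        if (b0 < 0x80) {                       // standard 1-byte form (includes 0x00)
            out.append((char) b0);
            return 1;
        } else if ((b0 & 0xE0) == 0xC0) {      // 2-byte form, also catches 0xC0 0x80
            int cp = ((b0 & 0x1F) << 6) | (bytes[pos + 1] & 0x3F);
            out.append((char) cp);
            return 2;
        } else if ((b0 & 0xF0) == 0xE0) {      // 3-byte form (rest of the BMP)
            int cp = ((b0 & 0x0F) << 12) | ((bytes[pos + 1] & 0x3F) << 6)
                    | (bytes[pos + 2] & 0x3F);
            out.append((char) cp);
            return 3;
        } else {                               // 4-byte form: emit a surrogate pair
            int cp = ((b0 & 0x07) << 18) | ((bytes[pos + 1] & 0x3F) << 12)
                    | ((bytes[pos + 2] & 0x3F) << 6) | (bytes[pos + 3] & 0x3F);
            out.append(Character.toChars(cp));
            return 4;
        }
    }
}
{code}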


d. The documentation could be clearer on what is meant by the 
string length, but this is a trivial change.


That change was made by Daniel soon after this discussion began.


Daniel changed the definition of Chars, but the String section still 
needs to be clarified. Currently it says:


Lucene writes strings as a VInt representing the length, followed by 
the character data.


It should read:

Lucene writes strings as a VInt representing the length of the 
string in Java chars (UTF-16 code units), followed by the character 
data.


-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene and UTF-8

2005-08-29 Thread Ken Krugler

Hi Marvin,

I'm guessing that since I'm the one that cares most about 
interoperability, I'll have to volunteer to do the heavy lifting.
Tomorrow I'll go through and survey how many and which things would 
need to change to achieve full UTF-8 compliance.  One concern is 
that I think in order to make that last case work, readChars() may 
need to return an array.  Since readChars() is part of the public 
API and may be called by something other than readString(), I don't 
know if that'll fly.


I don't believe such a change would be required, since the ultimate 
data source/destination on the Java side will look the same (array of 
Java chars) - the only issue is how it looks when serialized.


It seems clear that you have sufficient expertise to hone my rough 
contributions into final form.  If you have the interest, would that 
be a good division of labor?  I wish I could do this alone and just 
supply finished, tested patches, but obviously I can't.  Or perhaps 
I'm underestimating your level of interest -- do you want to take 
the ball and run with it?


I can take a look at the code, sure. The hard part will be coding up 
the JUnit test cases (see below).


I think we could stand to have 2 corpuses of test documents 
available: one which is predominantly 2-byte and 3-byte UTF-8 (but 
no 4-byte), and another which has the full range including non-BMP 
code points.  I can hunt those down or maybe get somebody from the 
Plucene community to create them, but perhaps they already exist?


Good test data for the decoder would be the following:

a. Single surrogate pair (two Java chars)
b. Surrogate pair at the beginning, followed by regular data.
c. Surrogate pair at the end, preceded by regular data.
d. Two surrogate pairs in a row.

Then all of the above, but remove the second (low-order) surrogate 
character (busted format).


Then all of the above, but replace the first (high-order) surrogate character.

Then all of the above, but replace the surrogate pair with a 0xC0 0x80 
encoded null byte.


And no, I don't think this test data exists, unfortunately. But it 
shouldn't be too hard to generate.
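
For example, a quick sketch of generating those strings in Java (U+1D11E is 
just an arbitrary non-BMP character):

{code}
public class SurrogateTestData {
    public static void main(String[] args) {
        char high = '\uD834';
        char low = '\uDD1E';   // together: U+1D11E

        String pairOnly = "" + high + low;
        String pairAtStart = "" + high + low + "regular text";
        String pairAtEnd = "regular text" + high + low;
        String twoPairs = "" + high + low + high + low;

        // Busted variants: drop the low surrogate, or orphan the high one.
        String missingLow = "" + high + "regular text";
        String loneLow = "" + low + "regular text";

        System.out.println(pairAtStart.length()); // 14: 2 chars for the pair + 12
    }
}
{code}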


-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-28 Thread Ken Krugler

Hi Marvin,

Thanks for the detailed response. After spending a bit more time in 
the code, I think you're right - all strings seem to be funnelled 
through IndexOutput. The remaining issue is dealing with old-format 
indexes.


I'm going to take this off-list now, since I'm guessing most list 
readers aren't too interested in the on-going discussion. If anybody 
else would like to be copied, send me an email.


-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]