codec mismatch

2014-02-14 Thread Jason Wee
Hello,

This is my first question to lucene mailing list, sorry if the question
sounds funny.

I have been experimenting with storing Lucene index files on Cassandra, but
unfortunately I ran into the exception below. Here is the stack trace.

org.apache.lucene.index.CorruptIndexException: codec mismatch: actual
codec=CompoundFileWriterData vs expected codec=Lucene46FieldInfos
(resource: SlicedIndexInput(SlicedIndexInput(_0.fnm in
lucene-cassandra-desc) in lucene-cassandra-desc slice=31:340))
at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:140)
at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:130)
at
org.apache.lucene.codecs.lucene46.Lucene46FieldInfosReader.read(Lucene46FieldInfosReader.java:56)
at
org.apache.lucene.index.SegmentReader.readFieldInfos(SegmentReader.java:214)
at org.apache.lucene.index.SegmentReader.init(SegmentReader.java:94)
at
org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:62)
at
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:843)
at
org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:66)
at org.apache.lucene.store.Search.init(Search.java:41)
at org.apache.lucene.store.Search.main(Search.java:34)

I'm not sure what it means; can anybody help?

When I check the hex representation of _0.fnm in Cassandra and translate it
to ASCII, it looks something like this:
??l??Lucene46FieldInfos??path?Q??PerFieldPostingsFormat.format?Lucene41?PerFieldPostingsFormat.suffix?0?modified?Q??PerFieldPostingsFormat.format?Lucene41?PerFieldPostingsFormat.suffix?0?contentsPerFieldPostingsFormat.format?Lucene41?PerFieldPostingsFormat.suffix?0

It looks to me like the expected codec is present in the _0.fnm file, or am I wrong?

Thank you and please let me know if you need additional information.


Re: codec mismatch

2014-02-14 Thread Michael McCandless
This means Lucene was attempting to open _0.fnm but somehow got the
contents of _0.cfs instead; seems likely that it's a bug in the
Cassandra Directory implementation?  Somehow it's opening the wrong
file name?
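
One quick way to confirm that theory (just a sketch; pass in your Cassandra-backed
Directory instance) is to ask the Directory for _0.fnm directly and check the codec
header yourself:

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

public class CheckFnmHeader {
  // dir is your Cassandra-backed Directory implementation
  public static void check(Directory dir) throws Exception {
    IndexInput in = dir.openInput("_0.fnm", IOContext.READONCE);
    try {
      // Throws CorruptIndexException if the header's codec name is not
      // Lucene46FieldInfos; the wide version range effectively skips the
      // version check.
      CodecUtil.checkHeader(in, "Lucene46FieldInfos", 0, Integer.MAX_VALUE);
      System.out.println("_0.fnm header looks correct");
    } finally {
      in.close();
    }
  }
}

If that check passes, the problem is more likely in how your Directory handles
slices (the resource in the exception is a SlicedIndexInput).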

Mike McCandless

http://blog.mikemccandless.com


On Fri, Feb 14, 2014 at 3:13 AM, Jason Wee peich...@gmail.com wrote:
 Hello,

 This is my first question to lucene mailing list, sorry if the question
 sounds funny.

 I have been experimenting with storing Lucene index files on Cassandra, but
 unfortunately I ran into the exception below. Here is the stack trace.

 org.apache.lucene.index.CorruptIndexException: codec mismatch: actual
 codec=CompoundFileWriterData vs expected codec=Lucene46FieldInfos
 (resource: SlicedIndexInput(SlicedIndexInput(_0.fnm in
 lucene-cassandra-desc) in lucene-cassandra-desc slice=31:340))
 at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:140)
 at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:130)
 at
 org.apache.lucene.codecs.lucene46.Lucene46FieldInfosReader.read(Lucene46FieldInfosReader.java:56)
 at
 org.apache.lucene.index.SegmentReader.readFieldInfos(SegmentReader.java:214)
 at org.apache.lucene.index.SegmentReader.init(SegmentReader.java:94)
 at
 org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:62)
 at
 org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:843)
 at
 org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
 at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:66)
 at org.apache.lucene.store.Search.init(Search.java:41)
 at org.apache.lucene.store.Search.main(Search.java:34)

 I'm not sure what it means; can anybody help?

 When I check the hex representation of _0.fnm in Cassandra and translate it
 to ASCII, it looks something like this:
 ??l??Lucene46FieldInfos??path?Q??PerFieldPostingsFormat.format?Lucene41?PerFieldPostingsFormat.suffix?0?modified?Q??PerFieldPostingsFormat.format?Lucene41?PerFieldPostingsFormat.suffix?0?contentsPerFieldPostingsFormat.format?Lucene41?PerFieldPostingsFormat.suffix?0

 It looks to me like the expected codec is present in the _0.fnm file, or am I wrong?

 Thank you and please let me know if you need additional information.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Tokenization and PrefixQuery

2014-02-14 Thread Yann-Erwan Perio
Hello,

I am designing a system with documents having one field containing
values such as "Ae1 Br2 Cy8 ...", i.e. a sequence of items made of
letters and numbers (max=7 per item), all separated by a space,
possibly 200 items per field, with no limit upon the number of
documents (although I would not expect more than a few million
documents). The order of these values is important, and I want to
search for these, always starting with the first value, and including
as many following values as needed: for instance, "Ae1" and "Ae1 Br2"
would be possible search values.

At first, I indexed these using a space-delimited analyzer, and ran
PrefixQueries. I encountered some performance issues though, so I ended
up building my own tokenizer, which would create tokens for all
starting combinations ("Ae1", "Ae1 Br2"...), up to a certain limit,
called the analysis depth. I would then dynamically create TermQueries
to match these tokens when searching under the analysis depth, and
PrefixQueries when searching over the analysis depth (the whole string
also being indexed as a single token). The performance was great,
because TermQueries are very fast, and PrefixQueries are not bad
either, when the underlying relevant number of documents is small
(which happens to be the case when searching beyond the analysis
depth). I have however two questions: one regarding the PrefixQuery,
and one regarding the general design.
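
Schematically, the query-side part of this is just the following (a simplified
sketch with illustrative field names, not my actual code):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SequencePrefixQueryFactory {
  private final int analysisDepth;             // e.g. 3
  private final String prefixField = "seq";    // holds the generated prefix tokens
  private final String fullField = "seqFull";  // holds the whole sequence as one token

  public SequencePrefixQueryFactory(int analysisDepth) {
    this.analysisDepth = analysisDepth;
  }

  public Query build(String searchValue) {
    int items = searchValue.split(" ").length;
    if (items <= analysisDepth) {
      // this exact token was generated at index time -> cheap TermQuery
      return new TermQuery(new Term(prefixField, searchValue));
    }
    // beyond the analysis depth -> prefix match on the single full-sequence token
    return new PrefixQuery(new Term(fullField, searchValue));
  }
}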

Regarding the PrefixQuery: it seems that it stops matching documents
when the length of the searched string exceeds a certain length. Is
that the expected behavior, and if so, can I / should I manage this
length?

Regarding the general design: I have adopted a hybrid
TermQueries/PrefixQueries approach, letting clients customize the analysis
depth, so as to keep a balance between the performance and the size of
the index. I am however not sure this is a good idea: would it be
better to tokenize the full string (i.e. analysis depth is infinity,
so as to only use TermQueries)? Or could my design be substituted by
an altogether different, more successful analysis approach?

Thank you in advance for your insights.

Kind regards.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Collector is collecting more than the specified hits

2014-02-14 Thread Michael McCandless
This is how Collector works: it is called for every document matching
the query, and then its job is to choose which of those hits to keep.

This is because in general the hits to keep can come at any time, not
just the first N hits you see; e.g. the best scoring hit may be the
very last one.

But if you have prior knowledge, e.g. that your index is already
pre-sorted by the criteria that you sort by at query time, then indeed
after seeing the first N hits you can stop; to do this you must throw
your own exception, and catch it up above.  See Lucene's
TimeLimitingCollector for a similar example ...
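
For example, a minimal sketch of such a collector (the wrapper class and the
exception are illustrative names, not Lucene classes):

import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

public class StopAfterNCollector extends Collector {

  public static class EarlyTerminationException extends RuntimeException {}

  private final Collector delegate;
  private final int maxHits;
  private int hits;

  public StopAfterNCollector(Collector delegate, int maxHits) {
    this.delegate = delegate;
    this.maxHits = maxHits;
  }

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    delegate.setScorer(scorer);
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    delegate.setNextReader(context);
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return delegate.acceptsDocsOutOfOrder();
  }

  @Override
  public void collect(int doc) throws IOException {
    if (hits >= maxHits) {
      // aborts the search; the caller catches this around searcher.search(...)
      throw new EarlyTerminationException();
    }
    delegate.collect(doc);
    hits++;
  }
}

The caller wraps searcher.search(query, collector) in a try/catch for that
exception, just as TimeLimitingCollector's TimeExceededException is caught by
its callers.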

Mike McCandless

http://blog.mikemccandless.com


On Fri, Feb 14, 2014 at 2:47 AM, saisantoshi saisantosh...@gmail.com wrote:
 The problem with the below collector is that the collect method is not stopping
 after the numHits count has been reached. Is there a way to stop the collector from
 collecting docs after it has reached the numHits specified?

 For example:
 TopScoreDocCollector topScore = TopScoreDocCollector.create(numHits, true);
 // TopScoreDocCollector topScore = TopScoreDocCollector.create(30, true);

 I would expect the below collector to pause/exit after it has collected
 the specified numHits (in this case it's 30). But what's happening here is
 that the collector is collecting all the docs and thereby causing delays in
 searches. Can we configure the collect method below to stop collecting after it
 has reached the numHits specified? Please let me know if there is any issue with
 the collector below.

 public class MyCollector extends PositiveScoresOnlyCollector {

     private IndexReader indexReader;

     public MyCollector(IndexReader indexReader, PositiveScoresOnlyCollector topScore) {
         super(topScore);
         this.indexReader = indexReader;
     }

     @Override
     public void collect(int doc) {
         try {
             // Custom logic
             super.collect(doc);
         } catch (Exception e) {
             // ignored
         }
     }
 }



 //Usage:

 MyCollector collector;
 TopScoreDocCollector topScore = TopScoreDocCollector.create(numHits, true);
 IndexSearcher searcher = new IndexSearcher(reader);
 try {
     collector = new MyCollector(indexReader, new PositiveScoresOnlyCollector(topScore));
     searcher.search(query, (Filter) null, collector);
 } finally {
 }

 Thanks,
 Sai.



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Collector-is-collecting-more-than-the-specified-hits-tp4117329.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Tokenization and PrefixQuery

2014-02-14 Thread Michael McCandless
On Fri, Feb 14, 2014 at 6:17 AM, Yann-Erwan Perio ye.pe...@gmail.com wrote:
 Hello,

 I am designing a system with documents having one field containing
 values such as "Ae1 Br2 Cy8 ...", i.e. a sequence of items made of
 letters and numbers (max=7 per item), all separated by a space,
 possibly 200 items per field, with no limit upon the number of
 documents (although I would not expect more than a few million
 documents). The order of these values is important, and I want to
 search for these, always starting with the first value, and including
 as many following values as needed: for instance, "Ae1" and "Ae1 Br2"
 would be possible search values.

 At first, I indexed these using a space-delimited analyzer, and ran
 PrefixQueries. I encountered some performance issues though, so I ended
 up building my own tokenizer, which would create tokens for all
 starting combinations ("Ae1", "Ae1 Br2"...), up to a certain limit,
 called the analysis depth.

This is similar to PathHierarchyTokenizer, I think.
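
For example, with a space delimiter it produces the same kind of "starting
combination" tokens you describe (a small standalone sketch, Lucene 4.x API):

import java.io.StringReader;
import org.apache.lucene.analysis.path.PathHierarchyTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PathHierarchyDemo {
  public static void main(String[] args) throws Exception {
    // arguments: input, delimiter, replacement, skip
    PathHierarchyTokenizer tok =
        new PathHierarchyTokenizer(new StringReader("Ae1 Br2 Cy8"), ' ', ' ', 0);
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term.toString()); // "Ae1", "Ae1 Br2", "Ae1 Br2 Cy8"
    }
    tok.end();
    tok.close();
  }
}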

 I would then dynamically create TermQueries
 to match these tokens when searching under the analysis depth, and
 PrefixQueries when searching over the analysis depth (the whole string
 also being indexed as a single token). The performance was great,
 because TermQueries are very fast, and PrefixQueries are not bad
 either, when the underlying relevant number of documents is small
 (which happens to be the case when searching beyond the analysis
 depth). I have however two questions: one regarding the PrefixQuery,
 and one regarding the general design.

 Regarding the PrefixQuery: it seems that it stops matching documents
 when the length of the searched string exceeds a certain length. Is
 that the expected behavior, and if so, can I / should I manage this
 length?

That should not be the case: it should match all terms with that
prefix regardless of the term's length.  Try to boil it down to a
small test case?

 Regarding the general design: I have adopted a hybrid
 TermQueries/PrefixQueries approach, letting clients customize the analysis
 depth, so as to keep a balance between the performance and the size of
 the index. I am however not sure this is a good idea: would it be
 better to tokenize the full string (i.e. analysis depth is infinity,
 so as to only use TermQueries)? Or could my design be substituted by
 an altogether different, more successful analysis approach?

I think your approach is a typical one (adding more terms to the index
so you get TermQuery instead of MoreCostlyQuery).  E.g.,
ShingleFilter, CommonGrams are examples of the same general idea.
Another example is AnalyzingInfixSuggester, which does the same thing
you are doing under-the-hood but one byte at a time (i.e. all term
prefixes up to a certain depth), and it also makes its analysis depth
controllable.  Maybe expose it to your users as a very expert tunable?

Mike McCandless

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Tokenization and PrefixQuery

2014-02-14 Thread Yann-Erwan Perio
On Fri, Feb 14, 2014 at 12:33 PM, Michael McCandless
luc...@mikemccandless.com wrote:

 This is similar to PathHierarchyTokenizer, I think.

Ah, yes, very much. I'll check it out and see if I can make something
of it. I am not sure to what extent it'll be reusable though, as my
tokenizer also sets payloads (the next coming path part is set on
the current token as a payload, so as to provide a perspective of
what's coming ahead, at search time).

 Regarding the PrefixQuery: it seems that it stops matching documents
 when the length of the searched string exceeds a certain length. Is
 that the expected behavior, and if so, can I / should I manage this
 length?

 That should not be the case: it should match all terms with that
 prefix regardless of the term's length.  Try to boil it down to a
 small test case?

I guess I've been too shallow with my testing, then :( Well, I'll dig
deeper, and if I find something wrong with Lucene, I'll post a small
test case demonstrating the issue - but so far, the errors were always
on my side.

 I think your approach is a typical one (adding more terms to the index
 so you get TermQuery instead of MoreCostlyQuery).  E.g.,
 ShingleFilter, CommonGrams are examples of the same general idea.
 Another example is AnalyzingInfixSuggester, which does the same thing
 you are doing under-the-hood but one byte at a time (i.e. all term
 prefixes up to a certain depth), and it also makes its analysis depth
 controllable.  Maybe expose it to your users as a very expert tunable?

This is what I have done, letting the clients of the framework specify
the analysis depth through their configuration file.

Thanks a lot for your feedback, it's very appreciated.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Tokenization and PrefixQuery

2014-02-14 Thread Yann-Erwan Perio
On Fri, Feb 14, 2014 at 1:11 PM, Yann-Erwan Perio ye.pe...@gmail.com wrote:
 On Fri, Feb 14, 2014 at 12:33 PM, Michael McCandless
 luc...@mikemccandless.com wrote:

Hi again,

 That should not be the case: it should match all terms with that
 prefix regardless of the term's length.  Try to boil it down to a
 small test case?

 I guess I've been too shallow with my testing, then :( Well, I'll dig
 deeper, and if I find something wrong with Lucene, I'll post a small
 test case demonstrating the issue - but so far, the errors were always
 on my side.

I have written a test which demonstrates that the mistake is indeed on
my side. It's probably due to inconsistent rules for
indexing/searching content having special characters (namely the
plus sign).

Sorry for the inconvenience, and thanks again for your answers.

Kind regards.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Tokenization and PrefixQuery

2014-02-14 Thread Michael McCandless
On Fri, Feb 14, 2014 at 8:21 AM, Yann-Erwan Perio ye.pe...@gmail.com wrote:

 I have written a test which demonstrates that the mistake is indeed on
 my side. It's probably due to inconsistent rules for
 indexing/searching content having special characters (namely the
 plus sign).

OK, thanks for bringing closure.

 Sorry for the inconvenience, and thanks again for your answers.

You're welcome!

Mike McCandless

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Collector is collecting more than the specified hits

2014-02-14 Thread saisantoshi
I am not interested in the scores at all. My requirement is simple: I only
need the first 100 hits or the numHits I specify (irrespective of their
scores). The collector should stop after collecting the numHits specified.
Is there a way to tell the collector to stop after collecting the
numHits?

Please correct me if I am wrong. I am trying to do the following.

public void collect(int doc) throws IOException {

    if (collector.getTotalHits() <= maxHits) { // this way, I can stop it from
                                               // collecting once getTotalHits exceeds numHits
        delegate.collect(doc);
    }
}

I have to write a separate collector extending Collector because I am
not able to call getTotalHits() if I am using
PositiveScoresOnlyCollector.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Collector-is-collecting-more-than-the-specified-hits-tp4117329p4117441.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Reverse Matching

2014-02-14 Thread Siraj Haider
Hi There,
Is there a way to do reverse matching by indexing the queries in an index and 
passing a document to see how many queries matched that? I know that I can have 
the queries in memory and have the document parsed in a memory index and then 
loop through trying to match each query. The issue I have is, we could have 
millions of such queries, and looping through them to match each one against the 
document is not feasible for us.

regards
-Siraj
(212) 306-0154





IndexWriter croaks on large file

2014-02-14 Thread John Cecere

I'm having a problem with Lucene 4.5.1. Whenever I attempt to index a file >
2GB in size, it dies with the following exception:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, 
startOffset=-2147483648,endOffset=-2147483647


Essentially, I'm doing this:

Directory directory = new MMapDirectory(indexPath);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
IndexWriter iw = new IndexWriter(directory, iwc);

InputStream is = ...; // my input stream
InputStreamReader reader = new InputStreamReader(is);

Document doc = new Document();
doc.add(new StoredField("fileid", fileid));
doc.add(new StoredField("pathname", pathname));
doc.add(new TextField("content", reader));

iw.addDocument(doc);

It's the IndexWriter addDocument method that throws the exception. In looking at the Lucene source code, it appears that the offsets 
being used internally are int, which makes it somewhat obvious why this is happening.


This issue never happened when I used Lucene 3.6.0. 3.6.0 was perfectly capable of handling a file over 2GB in this manner. What has 
changed and how do I get around this ? Is Lucene no longer capable of handling files this large, or is there some other way I should 
be doing this ?


Here's the full stack trace sans my code:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, 
startOffset=-2147483648,endOffset=-2147483647

at 
org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl.setOffset(OffsetAttributeImpl.java:45)
at 
org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:183)
at 
org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
at 
org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
at 
org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
at 
org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
at 
org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
at 
org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:254)
at 
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:446)
at 
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1551)
at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1221)
at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1202)

Thanks,
John

--
John Cecere
Principal Engineer - Oracle Corporation
732-987-4317 / john.cec...@oracle.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Extending StandardTokenizer Jflex to not split on '/'

2014-02-14 Thread Diego Fernandez
Hi guys, this is my first time posting on the Lucene list, so hello everyone.

I really like the way that the StandardTokenizer works, however I'd like for it 
to not split tokens on / (forward slash).  I've been looking at 
http://unicode.org/reports/tr29/#Default_Word_Boundaries to try to understand 
the rules, but I'm either misunderstanding or missing something.  If I 
understand correctly, the symbols in MidLetter keep it from splitting a token 
as long as there's alpha chars on either side.  I tried adding the forward 
slash to the MidLetter and MidLetterSupp rules (tried different combinations), 
but it still seems like it's splitting on it.

Does anyone have any tips or ideas?

Thanks

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Extending StandardTokenizer Jflex to not split on '/'

2014-02-14 Thread Steve Rowe
Welcome Diego,

I think you’re right about MidLetter - adding a char to it should disable 
splitting on that char, as long as there is a letter on one side or the other.  
(If you’d like that behavior to be extended to numeric digits, you should use 
MidNumLet instead.)

I tested this by adding “/“ to MidLetter in StandardTokenizerImpl.jflex 
(compressed whitespace diff below):

-MidLetter = (\p{WB:MidLetter}| {MidLetterSupp})
+MidLetter = ([/\p{WB:MidLetter}] | {MidLetterSupp})

then running ‘ant jflex’ under lucene/analysis/common/, and the following text 
was split as indicated (I tested by adding the method below to 
TestStandardAnalyzer.java):

  public void testMidLetterSlash() throws Exception {
    BaseTokenStreamTestCase.assertAnalyzesTo(a, "/one/two/three/ four",
      new String[]{ "one/two/three", "four" });
    BaseTokenStreamTestCase.assertAnalyzesTo(a, "1/two/3",
      new String[] { "1", "two", "3" });
  }

So it works for me - are you regenerating the scanner (‘ant jflex’)?

FYI, I found a bug when I was testing the above: “http://example.com” is left 
intact when “/“ is added to MidLetter, but it shouldn’t be; although ‘:’ and 
‘/‘ are in [/\p{WB:MidLetter}], the letter-on-both-sides requirement should 
instead result in “http://example.com” being split into “http” and 
“example.com”.  Further testing indicates that this is a problem for MidLetter, 
MidNumLet and MidNum.  I’ve filed an issue: 
https://issues.apache.org/jira/browse/LUCENE-5447.

Steve

On Feb 14, 2014, at 1:42 PM, Diego Fernandez difer...@redhat.com wrote:

 Hi guys, this is my first time posting on the Lucene list, so hello everyone.
 
 I really like the way that the StandardTokenizer works, however I'd like for 
 it to not split tokens on / (forward slash).  I've been looking at 
 http://unicode.org/reports/tr29/#Default_Word_Boundaries to try to understand 
 the rules, but I'm either misunderstanding or missing something.  If I 
 understand correctly, the symbols in MidLetter keep it from splitting a token 
 as long as there's alpha chars on either side.  I tried adding the forward 
 slash to the MidLetter and MidLetterSupp rules (tried different 
 combinations), but it still seems like it's splitting on it.
 
 Does anyone have any tips or ideas?
 
 Thanks
 
 Diego Fernandez - 爱国
 Software Engineer
 US GSS Supportability - Diagnostics
 
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IndexWriter croaks on large file

2014-02-14 Thread John Cecere
I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At any rate, I don't have control over the size of the 
documents that go into my database. Sometimes my customer's log files end up really big. I'm willing to have huge indexes for these 
things.


Wouldn't just changing from int to long for the offsets solve the problem ? I'm sure it would probably have to be changed in a lot 
of places, but why impose such a limitation ? Especially since it's using an InputStream and only dealing with a block of data at a 
time.


I'll take a look at your suggestion.

Thanks,
John


On 2/14/14 3:20 PM, Michael McCandless wrote:

Hmm, why are you indexing such immense documents?

In 3.x Lucene never sanity checked the offsets, so we would silently
index negative (int overflow'd) offsets into e.g. term vectors.

But in 4.x, we now detect this and throw the exception you're seeing,
because it can lead to index corruption when you index the offsets
into the postings.

If you really must index such enormous documents, maybe you could
create a custom tokenizer  (derived from StandardTokenizer) that
fixes the offset before setting them?  Or maybe just doesn't even
set them.

Note that position can also overflow, if your documents get too large.



Mike McCandless

http://blog.mikemccandless.com


On Fri, Feb 14, 2014 at 1:36 PM, John Cecere john.cec...@oracle.com wrote:

I'm having a problem with Lucene 4.5.1. Whenever I attempt to index a file >
2GB in size, it dies with the following exception:

java.lang.IllegalArgumentException: startOffset must be non-negative, and
endOffset must be >= startOffset,
startOffset=-2147483648,endOffset=-2147483647

Essentially, I'm doing this:

Directory directory = new MMapDirectory(indexPath);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
IndexWriter iw = new IndexWriter(directory, iwc);

InputStream is = ...; // my input stream
InputStreamReader reader = new InputStreamReader(is);

Document doc = new Document();
doc.add(new StoredField("fileid", fileid));
doc.add(new StoredField("pathname", pathname));
doc.add(new TextField("content", reader));

iw.addDocument(doc);

It's the IndexWriter addDocument method that throws the exception. In
looking at the Lucene source code, it appears that the offsets being used
internally are int, which makes it somewhat obvious why this is happening.

This issue never happened when I used Lucene 3.6.0. 3.6.0 was perfectly
capable of handling a file over 2GB in this manner. What has changed and how
do I get around this ? Is Lucene no longer capable of handling files this
large, or is there some other way I should be doing this ?

Here's the full stack trace sans my code:

java.lang.IllegalArgumentException: startOffset must be non-negative, and
endOffset must be >= startOffset,
startOffset=-2147483648,endOffset=-2147483647
 at
org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl.setOffset(OffsetAttributeImpl.java:45)
 at
org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:183)
 at
org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
 at
org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
 at
org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
 at
org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
 at
org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
 at
org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:254)
 at
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:446)
 at
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1551)
 at
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1221)
 at
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1202)

Thanks,
John

--
John Cecere
Principal Engineer - Oracle Corporation
732-987-4317 / john.cec...@oracle.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



--
John Cecere
Principal Engineer - Oracle Corporation
732-987-4317 / john.cec...@oracle.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IndexWriter croaks on large file

2014-02-14 Thread Glen Newton
You should consider making each _line_ of the log file a (Lucene)
document (assuming it is a log-per-line log file)
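
Something along these lines (a rough sketch with illustrative field names):

import java.io.BufferedReader;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class LogLineIndexer {
  public static void indexLines(IndexWriter writer, String pathname, BufferedReader reader)
      throws IOException {
    String line;
    long lineNo = 0;
    while ((line = reader.readLine()) != null) {
      // one small document per log line instead of one multi-GB document
      Document doc = new Document();
      doc.add(new StoredField("pathname", pathname));
      doc.add(new StoredField("lineno", lineNo++));
      doc.add(new TextField("content", line, Field.Store.NO));
      writer.addDocument(doc);
    }
  }
}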

-Glen

On Fri, Feb 14, 2014 at 4:12 PM, John Cecere john.cec...@oracle.com wrote:
 I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At
 any rate, I don't have control over the size of the documents that go into
 my database. Sometimes my customer's log files end up really big. I'm
 willing to have huge indexes for these things.

 Wouldn't just changing from int to long for the offsets solve the problem ?
 I'm sure it would probably have to be changed in a lot of places, but why
 impose such a limitation ? Especially since it's using an InputStream and
 only dealing with a block of data at a time.

 I'll take a look at your suggestion.

 Thanks,
 John


 On 2/14/14 3:20 PM, Michael McCandless wrote:

 Hmm, why are you indexing such immense documents?

 In 3.x Lucene never sanity checked the offsets, so we would silently
 index negative (int overflow'd) offsets into e.g. term vectors.

 But in 4.x, we now detect this and throw the exception you're seeing,
 because it can lead to index corruption when you index the offsets
 into the postings.

 If you really must index such enormous documents, maybe you could
 create a custom tokenizer  (derived from StandardTokenizer) that
 fixes the offset before setting them?  Or maybe just doesn't even
 set them.

 Note that position can also overflow, if your documents get too large.



 Mike McCandless

 http://blog.mikemccandless.com


 On Fri, Feb 14, 2014 at 1:36 PM, John Cecere john.cec...@oracle.com
 wrote:

 I'm having a problem with Lucene 4.5.1. Whenever I attempt to index a
 file >
 2GB in size, it dies with the following exception:

 java.lang.IllegalArgumentException: startOffset must be non-negative, and
 endOffset must be >= startOffset,
 startOffset=-2147483648,endOffset=-2147483647

 Essentially, I'm doing this:

 Directory directory = new MMapDirectory(indexPath);
 Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
 IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45,
 analyzer);
 IndexWriter iw = new IndexWriter(directory, iwc);

 InputStream is = ...; // my input stream
 InputStreamReader reader = new InputStreamReader(is);

 Document doc = new Document();
 doc.add(new StoredField("fileid", fileid));
 doc.add(new StoredField("pathname", pathname));
 doc.add(new TextField("content", reader));

 iw.addDocument(doc);

 It's the IndexWriter addDocument method that throws the exception. In
 looking at the Lucene source code, it appears that the offsets being used
 internally are int, which makes it somewhat obvious why this is
 happening.

 This issue never happened when I used Lucene 3.6.0. 3.6.0 was perfectly
 capable of handling a file over 2GB in this manner. What has changed and
 how
 do I get around this ? Is Lucene no longer capable of handling files this
 large, or is there some other way I should be doing this ?

 Here's the full stack trace sans my code:

 java.lang.IllegalArgumentException: startOffset must be non-negative, and
 endOffset must be >= startOffset,
 startOffset=-2147483648,endOffset=-2147483647
  at

 org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl.setOffset(OffsetAttributeImpl.java:45)
  at

 org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:183)
  at

 org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
  at

 org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
  at

 org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
  at

 org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
  at

 org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
  at

 org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:254)
  at

 org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:446)
  at
 org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1551)
  at
 org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1221)
  at
 org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1202)

 Thanks,
 John

 --
 John Cecere
 Principal Engineer - Oracle Corporation
 732-987-4317 / john.cec...@oracle.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 --
 John Cecere
 Principal Engineer - Oracle Corporation
 732-987-4317 / john.cec...@oracle.com

 

Re: IndexWriter croaks on large file

2014-02-14 Thread Tri Cao
As docIDs are ints too, it's most likely he'll hit the limit of 2B documents per
index with that approach though :)

I do agree that indexing huge documents doesn't seem to have a lot of value; even
when you know a doc is a hit for a certain query, how are you going to display the
results to users?

John, for a huge data set it's usually a good idea to roll your own distributed
indexes and model your data schema very carefully. For example, if you are going
to index log files, one reasonable idea is to make every 5 minutes of logs a
document.

Regards,
Tri

On Feb 14, 2014, at 01:20 PM, Glen Newton glen.new...@gmail.com wrote:

You should consider making each _line_ of the log file a (Lucene)
document (assuming it is a log-per-line log file)

-Glen

Only highlight terms that caused a search hit/match

2014-02-14 Thread Steve Davids
Hello,

I have recently been given a requirement to improve document highlights within 
our system. Unfortunately, the current functionality gives more of a best guess 
at which terms to highlight rather than highlighting only the terms that actually 
produced the match. A couple of examples of issues that were found:

1. A nested boolean clause with a term that doesn’t exist, ANDed with a term that
   does, highlights the ignored term in the query:
   Text: a b c
   Logical Query: a OR (b AND z)
   Result: <b>a</b> <b>b</b> c
   Expected: <b>a</b> b c
2. A nested span query doesn’t maintain the proper positions and offsets:
   Text: y z x y z a
   Logical Query: (“x y z”, a) span near 10
   Result: <b>y</b> <b>z</b> <b>x</b> <b>y</b> <b>z</b> <b>a</b>
   Expected: y z <b>x</b> <b>y</b> <b>z</b> <b>a</b>

I am currently using the Highlighter with a QueryScorer and a 
SimpleSpanFragmenter. While looking through the code it looks like the entire 
query structure is dropped in the WeightedSpanTermExtractor by just grabbing 
any positive TermQuery and flattening them all into a simple Map which is then 
passed on to highlight all of those terms. I believe this over simplification 
of term extraction is the crux of the issue and needs to be modified in order 
to produce more “exact” highlights.

I was brainstorming with a colleague and thought perhaps we can spin up a 
MemoryIndex to index that one document and start performing a depth-first 
search of all queries within the overall Lucene query graph. At that point we 
can start querying the MemoryIndex for leaf queries and start walking back up 
the tree, pruning branches that don’t result in a search hit which results in a 
map of actual matched query terms. This approach seems pretty painful but will 
hopefully produce better matches. I would like to see what the experts on the 
mailing list would have to say about this approach or is there a better way to 
retrieve the query terms & positions that produced the match? Or perhaps there 
is a different Highlighter implementation that should be used, though our user 
queries are extremely complex with a lot of nested queries of various types.
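
Roughly, the pruning step I have in mind would look something like this sketch
(simplified; MUST_NOT clauses, rewritten queries and zero-boost scoring would need
more care):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class MatchedLeafExtractor {

  // Collects the leaf queries that actually hit the document held by the MemoryIndex.
  private static void collectMatchingLeaves(Query query, MemoryIndex doc, List<Query> out) {
    if (doc.search(query) <= 0.0f) {
      return; // this branch did not contribute to the match, so prune it
    }
    if (query instanceof BooleanQuery) {
      for (BooleanClause clause : ((BooleanQuery) query).clauses()) {
        if (!clause.isProhibited()) {
          collectMatchingLeaves(clause.getQuery(), doc, out);
        }
      }
    } else {
      out.add(query); // a matching leaf: its terms/positions are safe to highlight
    }
  }

  public static List<Query> extract(Query query, MemoryIndex doc) {
    List<Query> matched = new ArrayList<Query>();
    collectMatchingLeaves(query, doc, matched);
    return matched;
  }
}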

Thanks,

-Steve

char mapping in lucene-icu

2014-02-14 Thread alxsss

Hello,

I am trying to use the lucene-icu lib in solr-4.6.1. I need to change a char mapping in 
lucene-icu. I have made changes
to 

lucene/analysis/icu/src/data/utr30/DiacriticFolding.txt

and built the jar file using ant, but it did not help.

I took a look at lucene/analysis/icu/build.xml and saw these lines:

  <property name="gennorm2.src.files"
    value="nfc.txt nfkc.txt nfkc_cf.txt BasicFoldings.txt
           DiacriticFolding.txt DingbatFolding.txt HanRadicalFolding.txt
           NativeDigitFolding.txt"/>
  <property name="gennorm2.tmp" value="${build.dir}/gennorm2/utr30.tmp"/>
  <property name="gennorm2.dst"
    value="${resources.dir}/org/apache/lucene/analysis/icu/utr30.nrm"/>
  <target name="gennorm2" depends="gen-utr30-data-files">
    <echo>Note that the gennorm2 and icupkg tools must be on your PATH. These tools
are part of the ICU4C package. See http://site.icu-project.org/ </echo>
    <mkdir dir="${build.dir}/gennorm2"/>
    <exec executable="gennorm2" failonerror="true">
      <arg value="-v"/>
      <arg value="-s"/>
      <arg value="${utr30.data.dir}"/>
      <arg line="${gennorm2.src.files}"/>
      <arg value="-o"/>
      <arg value="${gennorm2.tmp}"/>
    </exec>
    <!-- now convert binary file to big-endian -->
    <exec executable="icupkg" failonerror="true">
      <arg value="-tb"/>
      <arg value="${gennorm2.tmp}"/>
      <arg value="${gennorm2.dst}"/>
    </exec>
    <delete file="${gennorm2.tmp}"/>
  </target>

It looks like ant does not execute gennorm2. If I build the utr30.nrm file using 
gennorm2 manually
and replace utr30.nrm in the jar file, then starting Solr gives the following 
error:
Caused by: java.lang.RuntimeException: java.io.IOException: ICU data file 
error: Header authentication failed, please check if you have a valid ICU data 
file

My questions are:
 1. If the above code in the build file does not get executed, then how is the 
utr30 file generated?
 2. How do I change a character mapping?


Thanks.
Alex.



Re: Reverse Matching

2014-02-14 Thread Ahmet Arslan
Hi Siraj,

MemoryIndex is used for such a use case. Here are a couple of pointers: 

http://www.slideshare.net/jdhok/diy-percolator


http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html
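
The basic MemoryIndex pattern is the sketch below (illustrative field name and
analyzer); the percolator approaches in those links add a first pass that selects
candidate queries from an index of the queries themselves, so you do not have to
loop over all of them:

import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class Percolator {
  public static int countMatches(String documentText, List<Query> storedQueries) {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
    MemoryIndex index = new MemoryIndex();
    index.addField("body", documentText, analyzer);

    int matches = 0;
    for (Query q : storedQueries) {
      if (index.search(q) > 0.0f) { // score > 0 means this query matched the document
        matches++;
      }
    }
    return matches;
  }
}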




On Friday, February 14, 2014 8:21 PM, Siraj Haider si...@jobdiva.com wrote:
Hi There,
Is there a way to do reverse matching by indexing the queries in an index and 
passing a document to see how many queries matched that? I know that I can have 
the queries in memory and have the document parsed in a memory index and then 
loop through trying to match each query. The issue I have is, we could have 
millions of such queries and looping through them to match it against the 
document is not feasible for us.

regards
-Siraj
(212) 306-0154




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: char mapping in lucene-icu

2014-02-14 Thread Jack Krupansky

Do you get the exception if you run ant before changing the data files?

Header authentication failed, please check if you have a valid ICU data 
file


Check with the ICU project as to the proper format for THEIR files. I mean, 
this doesn't sound like a Lucene issue.


Maybe it could be as simple as whether the data file should have DOS or UNIX 
or Mac line endings (CRLF vs. NL vs. CR.) Be sure to use an editor that 
satisfies the requirements of ICU.


To be clear, Lucene itself does not have a published API for modifying the 
mappings of ICU.


-- Jack Krupansky

-Original Message- 
From: alx...@aim.com

Sent: Friday, February 14, 2014 7:48 PM
To: java-user@lucene.apache.org
Subject: char mapping in lucene-icu


Hello,

I am trying to use the lucene-icu lib in solr-4.6.1. I need to change a char mapping 
in lucene-icu. I have made changes

to

lucene/analysis/icu/src/data/utr30/DiacriticFolding.txt

and built the jar file using ant, but it did not help.

I took a look at lucene/analysis/icu/build.xml and saw these lines:

<property name="gennorm2.src.files"
  value="nfc.txt nfkc.txt nfkc_cf.txt BasicFoldings.txt DiacriticFolding.txt
         DingbatFolding.txt HanRadicalFolding.txt NativeDigitFolding.txt"/>
<property name="gennorm2.tmp" value="${build.dir}/gennorm2/utr30.tmp"/>
<property name="gennorm2.dst"
  value="${resources.dir}/org/apache/lucene/analysis/icu/utr30.nrm"/>
<target name="gennorm2" depends="gen-utr30-data-files">
  <echo>Note that the gennorm2 and icupkg tools must be on your PATH. These tools
are part of the ICU4C package. See http://site.icu-project.org/ </echo>
  <mkdir dir="${build.dir}/gennorm2"/>
  <exec executable="gennorm2" failonerror="true">
    <arg value="-v"/>
    <arg value="-s"/>
    <arg value="${utr30.data.dir}"/>
    <arg line="${gennorm2.src.files}"/>
    <arg value="-o"/>
    <arg value="${gennorm2.tmp}"/>
  </exec>
  <!-- now convert binary file to big-endian -->
  <exec executable="icupkg" failonerror="true">
    <arg value="-tb"/>
    <arg value="${gennorm2.tmp}"/>
    <arg value="${gennorm2.dst}"/>
  </exec>
  <delete file="${gennorm2.tmp}"/>
</target>

It looks like ant does not execute gennorm2. If I build the utr30.nrm file using 
gennorm2 manually
and replace utr30.nrm in the jar file, then starting Solr gives the 
following error:
Caused by: java.lang.RuntimeException: java.io.IOException: ICU data file 
error: Header authentication failed, please check if you have a valid ICU 
data file


My questions are:
1. If the above code in the build file does not get executed, then how is the 
utr30 file generated?

2. How do I change a character mapping?


Thanks.
Alex.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Reverse Matching

2014-02-14 Thread Ahmet Arslan
Hi,

Here are two more relevant links:

https://github.com/flaxsearch/luwak


http://www.lucenerevolution.org/2013/Turning-Search-Upside-Down-Using-Lucene-for-Very-Fast-Stored-Queries


Ahmet


On Saturday, February 15, 2014 3:01 AM, Ahmet Arslan iori...@yahoo.com wrote:
Hi Siraj,

MemoryIndex is used for such a use case. Here are a couple of pointers: 

http://www.slideshare.net/jdhok/diy-percolator


http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html





On Friday, February 14, 2014 8:21 PM, Siraj Haider si...@jobdiva.com wrote:
Hi There,
Is there a way to do reverse matching by indexing the queries in an index and 
passing a document to see how many queries matched that? I know that I can have 
the queries in memory and have the document parsed in a memory index and then 
loop through trying to match each query. The issue I have is, we could have 
millions of such queries and looping through them to match it against the 
document is not feasible for us.

regards
-Siraj
(212) 306-0154




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: char mapping in lucene-icu

2014-02-14 Thread alxsss


Hi Jack,

 I do not get the exception before changing the data files, and I do not get the exception 
after changing the data files and creating lucene-icu...jar with ant.
But changing the data files and running ant does not change the output.

So I decided to manually create the .nrm file using the steps outlined in the 
build.xml file:

 <property name="gennorm2.src.files"
   value="nfc.txt nfkc.txt nfkc_cf.txt BasicFoldings.txt
          DiacriticFolding.txt DingbatFolding.txt HanRadicalFolding.txt
          NativeDigitFolding.txt"/>
 <property name="gennorm2.tmp" value="${build.dir}/gennorm2/utr30.tmp"/>
 <property name="gennorm2.dst"
   value="${resources.dir}/org/apache/lucene/analysis/icu/utr30.nrm"/>
 <target name="gennorm2" depends="gen-utr30-data-files">
   <echo>Note that the gennorm2 and icupkg tools must be on your PATH. These tools
are part of the ICU4C package. See http://site.icu-project.org/ </echo>
   <mkdir dir="${build.dir}/gennorm2"/>
   <exec executable="gennorm2" failonerror="true">
     <arg value="-v"/>
     <arg value="-s"/>
     <arg value="${utr30.data.dir}"/>
     <arg line="${gennorm2.src.files}"/>
     <arg value="-o"/>
     <arg value="${gennorm2.tmp}"/>
   </exec>
   <!-- now convert binary file to big-endian -->
   <exec executable="icupkg" failonerror="true">
     <arg value="-tb"/>
     <arg value="${gennorm2.tmp}"/>
     <arg value="${gennorm2.dst}"/>
   </exec>
   <delete file="${gennorm2.tmp}"/>
 </target>


namely


gennorm2 -v -s src/data/utr30 nfc.txt nfkc.txt nfkc_cf.txt BasicFoldings.txt 
DiacriticFolding.txt DingbatFolding.txt HanRadicalFolding.txt 
NativeDigitFolding.txt -o  utr30.tmp

icupkg -tb  utr30.tmp  utr30.nrm
 
Then I unpacked the lucene-icu...jar file, replaced the .nrm file, and created a new jar 
file using jar cf.

Solr gives an error if I use this new .jar file.

What I noticed was that the ant task actually does not run the gennorm2 task.
 
If I delete the gennorm2 entry from the build.xml file, utr30.nrm still gets created by 
the ant task. I have even deleted these lines:


  <target name="compile-core" depends="jar-analyzers-common, common.compile-core"/>

  <property name="utr30.data.dir" location="src/data/utr30"/>
  <target name="gen-utr30-data-files" depends="compile-tools">
    <java
      classname="org.apache.lucene.analysis.icu.GenerateUTR30DataFiles"
      dir="${utr30.data.dir}"
      fork="true"
      failonerror="true">
      <classpath>
        <path refid="icujar"/>
        <pathelement location="${build.dir}/classes/tools"/>
      </classpath>
    </java>
  </target>

It still gets created, so I wonder how ant creates it.


The ICU support team wrote that they do not have any mappings; 
I mean mappings between diacritic letters and Latin letters.

 

 Thanks.
Alex.



 

-Original Message-
From: Jack Krupansky j...@basetechnology.com
To: java-user java-user@lucene.apache.org
Sent: Fri, Feb 14, 2014 5:13 pm
Subject: Re: char mapping in lucene-icu


Do you get the exception if you run ant before changing the data files?

Header authentication failed, please check if you have a valid ICU data 
file

Check with the ICU project as to the proper format for THEIR files. I mean, 
this doesn't sound like a Lucene issue.

Maybe it could be as simple as whether the data file should have DOS or UNIX 
or Mac line endings (CRLF vs. NL vs. CR.) Be sure to use an editor that 
satisfies the requirements of ICU.

To be clear, Lucene itself does not have a published API for modifying the 
mappings of ICU.

-- Jack Krupansky

-Original Message- 
From: alx...@aim.com
Sent: Friday, February 14, 2014 7:48 PM
To: java-user@lucene.apache.org
Subject: char mapping in lucene-icu


Hello,

I am trying to use the lucene-icu lib in solr-4.6.1. I need to change a char mapping 
in lucene-icu. I have made changes
to

lucene/analysis/icu/src/data/utr30/DiacriticFolding.txt

and built the jar file using ant, but it did not help.

I took a look at lucene/analysis/icu/build.xml and saw these lines:

<property name="gennorm2.src.files"
  value="nfc.txt nfkc.txt nfkc_cf.txt BasicFoldings.txt DiacriticFolding.txt
         DingbatFolding.txt HanRadicalFolding.txt NativeDigitFolding.txt"/>
<property name="gennorm2.tmp" value="${build.dir}/gennorm2/utr30.tmp"/>
<property name="gennorm2.dst"
  value="${resources.dir}/org/apache/lucene/analysis/icu/utr30.nrm"/>
<target name="gennorm2" depends="gen-utr30-data-files">
  <echo>Note that the gennorm2 and icupkg tools must be on your PATH. 
These tools
are part of the ICU4C package. See http://site.icu-project.org/ </echo>
  <mkdir dir="${build.dir}/gennorm2"/>
  <exec executable="gennorm2" failonerror="true">
    <arg value="-v"/>
    <arg value="-s"/>
    <arg value="${utr30.data.dir}"/>
    <arg line="${gennorm2.src.files}"/>
    <arg value="-o"/>
    <arg value="${gennorm2.tmp}"/>
  </exec>
  <!-- now convert binary file to big-endian -->
  <exec executable="icupkg" failonerror="true">
    <arg value="-tb"/>
    <arg value="${gennorm2.tmp}"/>
    <arg value="${gennorm2.dst}"/>
  </exec>
  <delete file="${gennorm2.tmp}"/>
</target>

looks like ant does not execute gennorm2. If I build utr30.nrm file using 
gennorm2 manually
and replacing utr30.nrm in the