codec mismatch
Hello, this is my first question to the Lucene mailing list, so apologies if it sounds naive. I have been experimenting with storing Lucene index files on Cassandra, but I am stuck on the exception below.

org.apache.lucene.index.CorruptIndexException: codec mismatch: actual codec=CompoundFileWriterData vs expected codec=Lucene46FieldInfos (resource: SlicedIndexInput(SlicedIndexInput(_0.fnm in lucene-cassandra-desc) in lucene-cassandra-desc slice=31:340))
    at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:140)
    at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:130)
    at org.apache.lucene.codecs.lucene46.Lucene46FieldInfosReader.read(Lucene46FieldInfosReader.java:56)
    at org.apache.lucene.index.SegmentReader.readFieldInfos(SegmentReader.java:214)
    at org.apache.lucene.index.SegmentReader.init(SegmentReader.java:94)
    at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:62)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:843)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:66)
    at org.apache.lucene.store.Search.init(Search.java:41)
    at org.apache.lucene.store.Search.main(Search.java:34)

I'm not sure what it means; can anybody help? When I dump the hex representation of _0.fnm stored in Cassandra and translate it to ASCII, it looks something like this:

??l??Lucene46FieldInfos??path?Q??PerFieldPostingsFormat.format?Lucene41?PerFieldPostingsFormat.suffix?0?modified?Q??PerFieldPostingsFormat.format?Lucene41?PerFieldPostingsFormat.suffix?0?contentsPerFieldPostingsFormat.format?Lucene41?PerFieldPostingsFormat.suffix?0

So the expected codec does appear to be in the _0.fnm file, or am I wrong? Thank you, and please let me know if you need additional information.
Re: codec mismatch
This means Lucene was attempting to open _0.fnm but somehow got the contents of _0.cfs instead; it seems likely that it's a bug in the Cassandra Directory implementation. Somehow it's opening the wrong file name?

Mike McCandless
http://blog.mikemccandless.com

On Fri, Feb 14, 2014 at 3:13 AM, Jason Wee peich...@gmail.com wrote:
[original question quoted in full; see above]

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
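A quick way to verify the suspicion above is to read the codec header of _0.fnm directly through the custom Directory. This is a minimal debugging sketch against the Lucene 4.6 APIs; the method name is made up, and it assumes you can get hold of the Cassandra-backed Directory instance. If it prints codec=CompoundFileWriterData for _0.fnm, the Directory is returning another file's bytes (e.g. _0.cfs) for that name.

    import java.io.IOException;
    import org.apache.lucene.codecs.CodecUtil;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.IndexInput;

    // Fragment to drop into a debug class: print the codec header actually returned
    // for a given file name. CodecUtil.writeHeader writes the magic int followed by
    // the codec name string, so we read those two values back.
    static void dumpCodecHeader(Directory dir, String fileName) throws IOException {
        IndexInput in = dir.openInput(fileName, IOContext.READONCE);
        try {
            int magic = in.readInt();
            String codec = in.readString();
            System.out.println(fileName + ": magic=0x" + Integer.toHexString(magic)
                + " (expected 0x" + Integer.toHexString(CodecUtil.CODEC_MAGIC) + "), codec=" + codec);
        } finally {
            in.close();
        }
    }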
Tokenization and PrefixQuery
Hello,

I am designing a system with documents having one field containing values such as "Ae1 Br2 Cy8 ...", i.e. a sequence of items made of letters and numbers (max 7 characters per item), all separated by a space, with possibly 200 items per field and no limit on the number of documents (although I would not expect more than a few million documents). The order of these values is important, and I always search starting from the first value, including as many following values as needed: for instance, "Ae1" and "Ae1 Br2" would both be possible search values.

At first I indexed these using a space-delimited analyzer and ran PrefixQueries. I ran into some performance issues, though, so I ended up building my own tokenizer, which creates tokens for all leading combinations ("Ae1", "Ae1 Br2", ...) up to a certain limit, which I call the analysis depth. I then dynamically create TermQueries to match these tokens when searching under the analysis depth, and PrefixQueries when searching beyond it (the whole string also being indexed as a single token). The performance was great, because TermQueries are very fast, and PrefixQueries are not bad either when the number of relevant documents is small (which happens to be the case when searching beyond the analysis depth).

I have two questions, however: one regarding PrefixQuery, and one regarding the general design.

Regarding PrefixQuery: it seems to stop matching documents when the length of the searched string exceeds a certain length. Is that the expected behavior, and if so, can I / should I manage this length?

Regarding the general design: I have adopted a hybrid TermQuery/PrefixQuery approach, letting clients customize the analysis depth so as to balance performance against index size. I am not sure this is a good idea, though: would it be better to tokenize the full string (i.e. an infinite analysis depth, so as to only use TermQueries)? Or could my design be replaced by an altogether different, more successful analysis approach?

Thank you in advance for your insights. Kind regards.

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
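For illustration, here is a minimal plain-Java sketch of the "leading combinations" the custom tokenizer described above would emit for one field value, up to a given analysis depth. The method name is made up; it only shows the token-generation idea, not the actual Tokenizer plumbing.

    import java.util.ArrayList;
    import java.util.List;

    // Emit "Ae1", "Ae1 Br2", "Ae1 Br2 Cy8", ... up to analysisDepth items.
    static List<String> prefixTokens(String fieldValue, int analysisDepth) {
        String[] items = fieldValue.split(" ");
        List<String> tokens = new ArrayList<String>();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < items.length && i < analysisDepth; i++) {
            if (i > 0) sb.append(' ');
            sb.append(items[i]);
            tokens.add(sb.toString());
        }
        return tokens;
    }

For example, prefixTokens("Ae1 Br2 Cy8", 2) would yield ["Ae1", "Ae1 Br2"], which are then indexed as single terms and matched with TermQuery.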
Re: Collector is collecting more than the specified hits
This is how Collector works: it is called for every document matching the query, and its job is then to choose which of those hits to keep. This is because, in general, the hits to keep can come at any time, not just among the first N hits you see; e.g. the best-scoring hit may be the very last one.

But if you have prior knowledge, e.g. that your index is already pre-sorted by the criteria you sort by at query time, then indeed you can stop after seeing the first N hits; to do this you must throw your own exception and catch it up above. See Lucene's TimeLimitingCollector for a similar example ...

Mike McCandless
http://blog.mikemccandless.com

On Fri, Feb 14, 2014 at 2:47 AM, saisantoshi saisantosh...@gmail.com wrote:

The problem with the collector below is that the collect method does not stop after the numHits count has been reached. Is there a way to stop the collector from collecting docs after it has reached the numHits specified? For example:

    TopScoreDocCollector topScore = TopScoreDocCollector.create(numHits, true);
    // TopScoreDocCollector topScore = TopScoreDocCollector.create(30, true);

I would expect the collector below to pause/exit after it has collected the specified numHits (in this case 30). But what is happening is that the collector collects all the docs, thereby delaying searches. Can we configure the collect method below to stop after it has reached the numHits specified? Please let me know if there is any issue with the collector below.

    public class MyCollector extends PositiveScoresOnlyCollector {
        private IndexReader indexReader;

        public MyCollector(IndexReader indexReader, PositiveScoresOnlyCollector topScore) {
            super(topScore);
            this.indexReader = indexReader;
        }

        @Override
        public void collect(int doc) {
            try {
                // Custom logic
                super.collect(doc);
            } catch (Exception e) {
            }
        }
    }

    // Usage:
    MyCollector collector;
    TopScoreDocCollector topScore = TopScoreDocCollector.create(numHits, true);
    IndexSearcher searcher = new IndexSearcher(reader);
    try {
        collector = new MyCollector(indexReader, new PositiveScoresOnlyCollector(topScore));
        searcher.search(query, (Filter) null, collector);
    } finally {
    }

Thanks,
Sai.

--
View this message in context: http://lucene.472066.n3.nabble.com/Collector-is-collecting-more-than-the-specified-hits-tp4117329.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
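A minimal sketch of the "throw your own exception" pattern described in the reply above, against the Lucene 4.x Collector API. The class and field names are made up, and it assumes the order in which hits arrive already matches the order you want (otherwise stopping early may discard better hits).

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    // Unchecked exception used purely to abort collection early.
    class StopCollectionException extends RuntimeException {}

    // Delegates to any other Collector, but aborts the search once maxHits docs
    // have been collected.
    class FirstNCollector extends Collector {
        private final Collector delegate;
        private final int maxHits;
        private int count;

        FirstNCollector(Collector delegate, int maxHits) {
            this.delegate = delegate;
            this.maxHits = maxHits;
        }

        @Override public void setScorer(Scorer scorer) throws IOException { delegate.setScorer(scorer); }
        @Override public void setNextReader(AtomicReaderContext context) throws IOException { delegate.setNextReader(context); }
        @Override public boolean acceptsDocsOutOfOrder() { return false; }

        @Override public void collect(int doc) throws IOException {
            delegate.collect(doc);                   // keep this hit
            if (++count >= maxHits) {
                throw new StopCollectionException(); // skip the rest of the index
            }
        }
    }

    // Usage: catch the exception around the search call.
    // try {
    //     searcher.search(query, new FirstNCollector(topScore, numHits));
    // } catch (StopCollectionException expected) {
    //     // stopped early after numHits collected hits
    // }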
Re: Tokenization and PrefixQuery
On Fri, Feb 14, 2014 at 6:17 AM, Yann-Erwan Perio ye.pe...@gmail.com wrote:

[description of the custom tokenizer that indexes all leading combinations up to an analysis depth, quoted; see above]

This is similar to PathHierarchyTokenizer, I think.

[question about PrefixQuery apparently no longer matching beyond a certain string length, quoted]

That should not be the case: it should match all terms with that prefix regardless of the term's length. Try to boil it down to a small test case?

[question about the hybrid TermQuery/PrefixQuery design and the configurable analysis depth, quoted]

I think your approach is a typical one (adding more terms to the index so you get TermQuery instead of MoreCostlyQuery). E.g., ShingleFilter and CommonGrams are examples of the same general idea. Another example is AnalyzingInfixSuggester, which does the same thing you are doing under the hood, but one byte at a time (i.e. all term prefixes up to a certain depth), and it also makes its analysis depth controllable. Maybe expose it to your users as a very expert tunable?

Mike McCandless
http://blog.mikemccandless.com

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
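To illustrate the PathHierarchyTokenizer suggestion, a small sketch using the Lucene 4.x analyzers-common module; the exact constructor signature (delimiter, replacement, skip) should be checked against your version. With a space as the delimiter it emits the same leading combinations as the custom tokenizer described earlier.

    import java.io.StringReader;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.path.PathHierarchyTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // delimiter=' ', replacement=' ', skip=0 (assumed 4.x constructor)
    Tokenizer tok = new PathHierarchyTokenizer(new StringReader("Ae1 Br2 Cy8"), ' ', ' ', 0);
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
        System.out.println(term.toString()); // "Ae1", "Ae1 Br2", "Ae1 Br2 Cy8"
    }
    tok.end();
    tok.close();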
Re: Tokenization and PrefixQuery
On Fri, Feb 14, 2014 at 12:33 PM, Michael McCandless luc...@mikemccandless.com wrote:

This is similar to PathHierarchyTokenizer, I think.

Ah, yes, very much. I'll check it out and see if I can make something of it. I am not sure to what extent it will be reusable, though, as my tokenizer also sets payloads (the next path part is set on the current token as a payload, so as to provide a view of what is coming ahead at search time).

Regarding the PrefixQuery: it seems that it stops matching documents when the length of the searched string exceeds a certain length. Is that the expected behavior, and if so, can I / should I manage this length?

That should not be the case: it should match all terms with that prefix regardless of the term's length. Try to boil it down to a small test case?

I guess I've been too shallow with my testing, then :( Well, I'll dig deeper, and if I find something wrong with Lucene, I'll post a small test case demonstrating the issue - but so far, the errors have always been on my side.

I think your approach is a typical one (adding more terms to the index so you get TermQuery instead of MoreCostlyQuery). E.g., ShingleFilter and CommonGrams are examples of the same general idea. Another example is AnalyzingInfixSuggester, which does the same thing you are doing under the hood, but one byte at a time (i.e. all term prefixes up to a certain depth), and it also makes its analysis depth controllable. Maybe expose it to your users as a very expert tunable?

This is what I have done, letting the clients of the framework specify the analysis depth through their configuration file.

Thanks a lot for your feedback, it's very appreciated.

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Tokenization and PrefixQuery
On Fri, Feb 14, 2014 at 1:11 PM, Yann-Erwan Perio ye.pe...@gmail.com wrote:
On Fri, Feb 14, 2014 at 12:33 PM, Michael McCandless luc...@mikemccandless.com wrote:

Hi again,

That should not be the case: it should match all terms with that prefix regardless of the term's length. Try to boil it down to a small test case?

I guess I've been too shallow with my testing, then :( Well, I'll dig deeper, and if I find something wrong with Lucene, I'll post a small test case demonstrating the issue - but so far, the errors have always been on my side.

I have written a test which demonstrates that the mistake is indeed on my side. It's probably due to inconsistent rules for indexing/searching content having special characters (namely the plus sign).

Sorry for the inconvenience, and thanks again for your answers.

Kind regards.

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Tokenization and PrefixQuery
On Fri, Feb 14, 2014 at 8:21 AM, Yann-Erwan Perio ye.pe...@gmail.com wrote:

I have written a test which demonstrates that the mistake is indeed on my side. It's probably due to inconsistent rules for indexing/searching content having special characters (namely the plus sign).

OK, thanks for bringing closure.

Sorry for the inconvenience, and thanks again for your answers.

You're welcome!

Mike McCandless
http://blog.mikemccandless.com

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Collector is collecting more than the specified hits
I am not interested in the scores at all. My requirement is simple: I only need the first 100 hits, or the numHits I specify (irrespective of their scores). The collector should stop after collecting the numHits specified. Is there a way to tell the collector to stop after collecting numHits?

Please correct me if I am wrong. I am trying to do the following:

    public void collect(int doc) throws IOException {
        if (collector.getTotalHits() <= maxHits) { // this way, it stops collecting once getTotalHits exceeds maxHits
            delegate.collect(doc);
        }
    }

I have to write a separate collector extending Collector, because I am not able to call getTotalHits() if I am using PositiveScoresOnlyCollector.

--
View this message in context: http://lucene.472066.n3.nabble.com/Collector-is-collecting-more-than-the-specified-hits-tp4117329p4117441.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Reverse Matching
Hi There, Is there a way to do reverse matching by indexing the queries in an index and passing a document to see how many queries matched that? I know that I can have the queries in memory and have the document parsed in a memory index and then loop through trying to match each query. The issue I have is, we could have millions of such queries and looping through them to match it against the document is not feasible for us. regards -Siraj (212) 306-0154 This electronic mail message and any attachments may contain information which is privileged, sensitive and/or otherwise exempt from disclosure under applicable law. The information is intended only for the use of the individual or entity named as the addressee above. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution (electronic or otherwise) or forwarding of, or the taking of any action in reliance on, the contents of this transmission is strictly prohibited. If you have received this electronic transmission in error, please notify us by telephone, facsimile, or e-mail as noted above to arrange for the return of any electronic mail or attachments. Thank You.
IndexWriter croaks on large file
I'm having a problem with Lucene 4.5.1. Whenever I attempt to index a file over 2GB in size, it dies with the following exception:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=-2147483648,endOffset=-2147483647

Essentially, I'm doing this:

    Directory directory = new MMapDirectory(indexPath);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
    IndexWriter iw = new IndexWriter(directory, iwc);

    InputStream is = my input stream;
    InputStreamReader reader = new InputStreamReader(is);

    Document doc = new Document();
    doc.add(new StoredField("fileid", fileid));
    doc.add(new StoredField("pathname", pathname));
    doc.add(new TextField("content", reader));
    iw.addDocument(doc);

It's the IndexWriter addDocument method that throws the exception. Looking at the Lucene source code, it appears that the offsets used internally are ints, which makes it somewhat obvious why this is happening.

This issue never happened when I used Lucene 3.6.0, which was perfectly capable of handling a file over 2GB in this manner. What has changed, and how do I get around this? Is Lucene no longer capable of handling files this large, or is there some other way I should be doing this?

Here's the full stack trace sans my code:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=-2147483648,endOffset=-2147483647
    at org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl.setOffset(OffsetAttributeImpl.java:45)
    at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:183)
    at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
    at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
    at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
    at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:254)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:446)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1551)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1221)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1202)

Thanks,
John

--
John Cecere
Principal Engineer - Oracle Corporation
732-987-4317 / john.cec...@oracle.com

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Extending StandardTokenizer Jflex to not split on '/'
Hi guys, this is my first time posting on the Lucene list, so hello everyone. I really like the way that the StandardTokenizer works, however I'd like for it to not split tokens on / (forward slash). I've been looking at http://unicode.org/reports/tr29/#Default_Word_Boundaries to try to understand the rules, but I'm either misunderstanding or missing something. If I understand correctly, the symbols in MidLetter keep it from splitting a token as long as there's alpha chars on either side. I tried adding the forward slash to the MidLetter and MidLetterSupp rules (tried different combinations), but it still seems like it's splitting on it. Does anyone have any tips or ideas? Thanks Diego Fernandez - 爱国 Software Engineer US GSS Supportability - Diagnostics - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Extending StandardTokenizer Jflex to not split on '/'
Welcome Diego,

I think you're right about MidLetter - adding a char to it should disable splitting on that char, as long as there is a letter on one side or the other. (If you'd like that behavior to be extended to numeric digits, you should use MidNumLet instead.)

I tested this by adding '/' to MidLetter in StandardTokenizerImpl.jflex (compressed whitespace diff below):

    -MidLetter = (\p{WB:MidLetter} | {MidLetterSupp})
    +MidLetter = ([/\p{WB:MidLetter}] | {MidLetterSupp})

then running 'ant jflex' under lucene/analysis/common/, and the following text was split as indicated (I tested by adding the method below to TestStandardAnalyzer.java):

    public void testMidLetterSlash() throws Exception {
        BaseTokenStreamTestCase.assertAnalyzesTo(a, "/one/two/three/ four",
            new String[]{"one/two/three", "four"});
        BaseTokenStreamTestCase.assertAnalyzesTo(a, "1/two/3",
            new String[]{"1", "two", "3"});
    }

So it works for me - are you regenerating the scanner ('ant jflex')?

FYI, I found a bug while testing the above: "http://example.com" is left intact when '/' is added to MidLetter, but it shouldn't be; although ':' and '/' are in [/\p{WB:MidLetter}], the letter-on-both-sides requirement should instead result in "http://example.com" being split into "http" and "example.com". Further testing indicates that this is a problem for MidLetter, MidNumLet and MidNum. I've filed an issue: https://issues.apache.org/jira/browse/LUCENE-5447.

Steve

On Feb 14, 2014, at 1:42 PM, Diego Fernandez difer...@redhat.com wrote:
[original question quoted in full; see above]

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: IndexWriter croaks on large file
I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At any rate, I don't have control over the size of the documents that go into my database. Sometimes my customers' log files end up really big. I'm willing to have huge indexes for these things.

Wouldn't just changing the offsets from int to long solve the problem? I'm sure it would have to be changed in a lot of places, but why impose such a limitation? Especially since it's using an InputStream and only dealing with a block of data at a time.

I'll take a look at your suggestion.

Thanks,
John

On 2/14/14 3:20 PM, Michael McCandless wrote:

Hmm, why are you indexing such immense documents?

In 3.x Lucene never sanity-checked the offsets, so we would silently index negative (int-overflowed) offsets into e.g. term vectors. But in 4.x we now detect this and throw the exception you're seeing, because it can lead to index corruption when you index the offsets into the postings.

If you really must index such enormous documents, maybe you could create a custom tokenizer (derived from StandardTokenizer) that fixes the offsets before setting them? Or maybe just doesn't set them at all. Note that position can also overflow if your documents get too large.

Mike McCandless
http://blog.mikemccandless.com

On Fri, Feb 14, 2014 at 1:36 PM, John Cecere john.cec...@oracle.com wrote:
[original question quoted in full; see above]

--
John Cecere
Principal Engineer - Oracle Corporation
732-987-4317 / john.cec...@oracle.com

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: IndexWriter croaks on large file
You should consider making each _line_ of the log file a (Lucene) document (assuming it is a log-per-line log file).

-Glen

On Fri, Feb 14, 2014 at 4:12 PM, John Cecere john.cec...@oracle.com wrote:
[previous message quoted in full; see above]
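A sketch of that per-line approach, reusing the variables from John's earlier snippet (iw, is, fileid); the "lineno" stored field is an addition made up for illustration. Splitting the file keeps each document small enough that no token offset can overflow an int.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;

    // Index each line of the log as its own document.
    BufferedReader reader = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
    String line;
    long lineNo = 0;
    while ((line = reader.readLine()) != null) {
        Document doc = new Document();
        doc.add(new StoredField("fileid", fileid));
        doc.add(new StoredField("lineno", lineNo++));
        doc.add(new TextField("content", line, Field.Store.NO));
        iw.addDocument(doc);
    }
    reader.close();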
Re: IndexWriter croaks on large file
As docIDs are ints too, it's most likely he'll hit the limit of 2B documents per index with that approach, though :)

I do agree that indexing huge documents doesn't seem to have a lot of value; even when you know a doc is a hit for a certain query, how are you going to display the results to users?

John, for huge data sets it's usually a good idea to roll your own distributed indexes and model your data schema very carefully. For example, if you are going to index log files, one reasonable idea is to make every 5 minutes of logs a document.

Regards,
Tri

On Feb 14, 2014, at 01:20 PM, Glen Newton glen.new...@gmail.com wrote:
[previous messages quoted in full; see above]
Only highlight terms that caused a search hit/match
Hello,

I have recently been given a requirement to improve document highlights within our system. Unfortunately, the current functionality gives more of a best guess at which terms to highlight, rather than highlighting the terms that actually produced the match. A couple of examples of issues that were found:

A nested boolean clause with a term that doesn't exist, ANDed with a term that does, highlights the ignored term in the query.
Text: a b c
Logical Query: a OR (b AND z)
Result: <b>a</b> <b>b</b> c
Expected: <b>a</b> b c

A nested span query doesn't maintain the proper positions and offsets.
Text: y z x y z a
Logical Query: ("x y z", a) span near 10
Result: <b>y</b> <b>z</b> <b>x</b> <b>y</b> <b>z</b> <b>a</b>
Expected: y z <b>x</b> <b>y</b> <b>z</b> <b>a</b>

I am currently using the Highlighter with a QueryScorer and a SimpleSpanFragmenter. Looking through the code, it appears the entire query structure is dropped in the WeightedSpanTermExtractor: it just grabs every positive TermQuery and flattens them all into a simple Map, which is then passed on to highlight all of those terms. I believe this oversimplification of term extraction is the crux of the issue and needs to be modified in order to produce more "exact" highlights.

I was brainstorming with a colleague and thought perhaps we could spin up a MemoryIndex for that one document and perform a depth-first search of all queries within the overall Lucene query graph. We would query the MemoryIndex for the leaf queries and walk back up the tree, pruning branches that don't result in a search hit, which yields a map of the query terms that actually matched. This approach seems pretty painful but will hopefully produce better matches.

I would like to hear what the experts on the mailing list have to say about this approach, or is there a better way to retrieve the positions of the query terms that produced the match? Or perhaps there is a different Highlighter implementation that should be used, though our user queries are extremely complex, with a lot of nested queries of various types.

Thanks,
-Steve
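As a rough sketch of the pruning idea described above (Lucene 4.x APIs): test each leaf query against a MemoryIndex holding only the document being highlighted, and drop branches that do not match. The method is hypothetical, ignores MUST_NOT and minimum-should-match semantics, and is not a drop-in replacement for WeightedSpanTermExtractor.

    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;

    // Walk the query tree depth-first; keep only branches that actually hit the
    // single document loaded into the MemoryIndex, so the highlighter only sees
    // terms that contributed to the match. Returns null if nothing matched.
    static Query prune(Query q, MemoryIndex docIndex) {
        if (q instanceof BooleanQuery) {
            BooleanQuery pruned = new BooleanQuery();
            for (BooleanClause clause : ((BooleanQuery) q).clauses()) {
                Query kept = prune(clause.getQuery(), docIndex);
                if (kept != null) {
                    pruned.add(kept, clause.getOccur());
                }
            }
            return pruned.clauses().isEmpty() ? null : pruned;
        }
        // Leaf query: keep it only if it matches this one document.
        return docIndex.search(q) > 0.0f ? q : null;
    }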
char mapping in lucene-icu
Hello,

I am trying to use lucene-icu in solr-4.6.1, and I need to change a character mapping in lucene-icu. I have made changes to lucene/analysis/icu/src/data/utr30/DiacriticFolding.txt and built the jar file using ant, but it did not help. I took a look at lucene/analysis/icu/build.xml and see these lines:

    <property name="gennorm2.src.files"
        value="nfc.txt nfkc.txt nfkc_cf.txt BasicFoldings.txt DiacriticFolding.txt DingbatFolding.txt HanRadicalFolding.txt NativeDigitFolding.txt"/>
    <property name="gennorm2.tmp" value="${build.dir}/gennorm2/utr30.tmp"/>
    <property name="gennorm2.dst" value="${resources.dir}/org/apache/lucene/analysis/icu/utr30.nrm"/>
    <target name="gennorm2" depends="gen-utr30-data-files">
        <echo>Note that the gennorm2 and icupkg tools must be on your PATH. These tools are part of the ICU4C package. See http://site.icu-project.org/ </echo>
        <mkdir dir="${build.dir}/gennorm2"/>
        <exec executable="gennorm2" failonerror="true">
            <arg value="-v"/>
            <arg value="-s"/>
            <arg value="${utr30.data.dir}"/>
            <arg line="${gennorm2.src.files}"/>
            <arg value="-o"/>
            <arg value="${gennorm2.tmp}"/>
        </exec>
        <!-- now convert binary file to big-endian -->
        <exec executable="icupkg" failonerror="true">
            <arg value="-tb"/>
            <arg value="${gennorm2.tmp}"/>
            <arg value="${gennorm2.dst}"/>
        </exec>
        <delete file="${gennorm2.tmp}"/>
    </target>

It looks like ant does not execute the gennorm2 target. If I build the utr30.nrm file using gennorm2 manually and replace utr30.nrm in the jar file, then starting Solr gives the following error:

Caused by: java.lang.RuntimeException: java.io.IOException: ICU data file error: Header authentication failed, please check if you have a valid ICU data file

My questions are:
1. If the above code in the build file does not get executed, then how is the utr30 file generated?
2. How do I change a character mapping?

Thanks.
Alex.
Re: Reverse Matching
Hi Siraj,

MemoryIndex is used for exactly this use case. Here are a couple of pointers:

http://www.slideshare.net/jdhok/diy-percolator
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html

On Friday, February 14, 2014 8:21 PM, Siraj Haider si...@jobdiva.com wrote:
[original question quoted in full; see above]

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
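For reference, the naive MemoryIndex loop that the percolator approaches above build on: one in-memory document, with every stored query run against it. The field name, analyzer choice and the storedQueries map are assumptions made for illustration; luwak and the Elasticsearch percolator add query pre-filtering on top so that not every stored query has to be executed per document.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    // Build a one-document index in memory ...
    MemoryIndex docIndex = new MemoryIndex();
    docIndex.addField("body", documentText, new StandardAnalyzer(Version.LUCENE_46));

    // ... and run every stored query against it, collecting the ids that match.
    List<String> matchedQueryIds = new ArrayList<String>();
    for (Map.Entry<String, Query> entry : storedQueries.entrySet()) { // queryId -> parsed Query
        if (docIndex.search(entry.getValue()) > 0.0f) {
            matchedQueryIds.add(entry.getKey());
        }
    }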
Re: char mapping in lucene-icu
Do you get the exception if you run ant before changing the data files?

"Header authentication failed, please check if you have a valid ICU data file"

Check with the ICU project as to the proper format for THEIR files. I mean, this doesn't sound like a Lucene issue. Maybe it could be as simple as whether the data file should have DOS or UNIX or Mac line endings (CRLF vs. NL vs. CR). Be sure to use an editor that satisfies the requirements of ICU.

To be clear, Lucene itself does not have a published API for modifying the mappings of ICU.

-- Jack Krupansky

-----Original Message-----
From: alx...@aim.com
Sent: Friday, February 14, 2014 7:48 PM
To: java-user@lucene.apache.org
Subject: char mapping in lucene-icu

[original question quoted in full; see above]

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Reverse Matching
Hi,

Here are two more relevant links:

https://github.com/flaxsearch/luwak
http://www.lucenerevolution.org/2013/Turning-Search-Upside-Down-Using-Lucene-for-Very-Fast-Stored-Queries

Ahmet

On Saturday, February 15, 2014 3:01 AM, Ahmet Arslan iori...@yahoo.com wrote:
[previous reply and original question quoted in full; see above]

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: char mapping in lucene-icu
Hi Jack,

I do not get the exception before changing the data files, and I do not get the exception after changing the data files and creating the lucene-icu jar with ant. But changing the data files and running ant does not change the output. So I decided to create the .nrm file manually, using the steps outlined in the build.xml file:

[gennorm2 target from build.xml, quoted; see the excerpt above]

namely:

    gennorm2 -v -s src/data/utr30 nfc.txt nfkc.txt nfkc_cf.txt BasicFoldings.txt DiacriticFolding.txt DingbatFolding.txt HanRadicalFolding.txt NativeDigitFolding.txt -o utr30.tmp
    icupkg -tb utr30.tmp utr30.nrm

Then I unpacked the lucene-icu jar file, replaced the .nrm file, and created a new jar using jar cf. Solr gives an error if I use this new jar file.

What I noticed was that the ant task does not actually run the gennorm2 target. If I delete the gennorm2 entry from the build.xml file, utr30.nrm still gets created by the ant task. I have deleted even these lines:

    <target name="compile-core" depends="jar-analyzers-common, common.compile-core"/>
    <property name="utr30.data.dir" location="src/data/utr30"/>
    <target name="gen-utr30-data-files" depends="compile-tools">
        <java classname="org.apache.lucene.analysis.icu.GenerateUTR30DataFiles"
              dir="${utr30.data.dir}" fork="true" failonerror="true">
            <classpath>
                <path refid="icujar"/>
                <pathelement location="${build.dir}/classes/tools"/>
            </classpath>
        </java>
    </target>

and it still gets created. So I wondered: how does ant create it? The ICU support team wrote that they do not have any such mappings, I mean mappings between diacritic letters and Latin letters.

Thanks.
Alex.

-----Original Message-----
From: Jack Krupansky j...@basetechnology.com
To: java-user java-user@lucene.apache.org
Sent: Fri, Feb 14, 2014 5:13 pm
Subject: Re: char mapping in lucene-icu

[Jack's reply and the original question quoted in full; see above]