Re: char mapping in lucene-icu

2014-02-14 Thread alxsss
Hi Jack, I do not get the exception before changing the data files, and I do not get the exception after changing the data files and creating the lucene-icu...jar with ant. But changing the data files and running ant does not change the output. So I decided to manually create the .nrm file by using the steps outlined in the

Re: Reverse Matching

2014-02-14 Thread Ahmet Arslan
Hi, Here are two more relevant links: https://github.com/flaxsearch/luwak http://www.lucenerevolution.org/2013/Turning-Search-Upside-Down-Using-Lucene-for-Very-Fast-Stored-Queries Ahmet On Saturday, February 15, 2014 3:01 AM, Ahmet Arslan wrote: Hi Siraj, MemoryIndex is used for such use

Re: char mapping in lucene-icu

2014-02-14 Thread Jack Krupansky
Do you get the exception if you run ant before changing the data files? "Header authentication failed, please check if you have a valid ICU data file" Check with the ICU project as to the proper format for THEIR files. I mean, this doesn't sound like a Lucene issue. Maybe it could be as sim

Re: Reverse Matching

2014-02-14 Thread Ahmet Arslan
Hi Siraj, MemoryIndex is used for such a use case. Here are a couple of pointers: http://www.slideshare.net/jdhok/diy-percolator http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html On Friday, February 14, 2014 8:21 PM, Siraj Haider wrote: Hi There, Is
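
A minimal sketch of the MemoryIndex approach (Lucene 4.x; the "body" field name, StandardAnalyzer, and the stored-query loop are illustrative assumptions):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    // Index the single incoming document in memory, then run every stored query against it.
    static List<String> matchingQueries(String docText, List<String> storedQueries) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
        MemoryIndex memIndex = new MemoryIndex();
        memIndex.addField("body", docText, analyzer);

        QueryParser parser = new QueryParser(Version.LUCENE_46, "body", analyzer);
        List<String> matched = new ArrayList<String>();
        for (String stored : storedQueries) {
            Query q = parser.parse(stored);
            if (memIndex.search(q) > 0.0f) {   // non-zero score means the stored query matched
                matched.add(stored);
            }
        }
        return matched;
    }

Luwak, mentioned elsewhere in this thread, packages the same idea and adds query pre-filtering so that only likely-matching stored queries are run against each document.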

char mapping in lucene-icu

2014-02-14 Thread alxsss
Hello, I am trying to use the lucene-icu lib in solr-4.6.1. I need to change a char mapping in lucene-icu. I have made changes to lucene/analysis/icu/src/data/utr30/DiacriticFolding.txt and built the jar file using ant, but it did not help. I took a look at lucene/analysis/icu/build.xml and see these l

Only highlight terms that caused a search hit/match

2014-02-14 Thread Steve Davids
Hello, I have recently been given a requirement to improve document highlights within our system. Unfortunately, the current functionality gives more of a best guess at which terms to highlight, rather than highlighting the terms that actually produced the match. A couple of examples of issues that

Re: IndexWriter croaks on large file

2014-02-14 Thread Tri Cao
As docIDs are ints too, it's most likely he'll hit the limit of 2B documents per index with that approach though :) I do agree that indexing huge documents doesn't seem to have a lot of value; even when you know a doc is a hit for a certain query, how are you going to display the results to use

Re: IndexWriter croaks on large file

2014-02-14 Thread Glen Newton
You should consider making each _line_ of the log file a (Lucene) document (assuming it is a one-log-entry-per-line file) -Glen On Fri, Feb 14, 2014 at 4:12 PM, John Cecere wrote: > I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At > any rate, I don't have control over the siz
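
A rough sketch of the line-per-document approach Glen suggests (Lucene 4.x; the field names, analyzer, and stored metadata are assumptions):

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    static void indexLogByLine(File logFile, File indexDir) throws IOException {
        IndexWriterConfig cfg =
            new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));
        IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), cfg);
        BufferedReader reader = new BufferedReader(new FileReader(logFile));
        try {
            String line;
            long lineNo = 0;
            while ((line = reader.readLine()) != null) {
                Document doc = new Document();
                doc.add(new TextField("line", line, Field.Store.NO)); // searchable text
                doc.add(new StoredField("file", logFile.getName()));  // which log it came from
                doc.add(new StoredField("lineNo", ++lineNo));         // position within the log
                writer.addDocument(doc);
            }
        } finally {
            reader.close();
            writer.close();
        }
    }

Each line stays far below the int offset limit, and the stored file/lineNo fields make it possible to point back into the original log when displaying hits.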

Re: IndexWriter croaks on large file

2014-02-14 Thread John Cecere
I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At any rate, I don't have control over the size of the documents that go into my database. Sometimes my customer's log files end up really big. I'm willing to have huge indexes for these things. Wouldn't just changing from

Re: Extending StandardTokenizer Jflex to not split on '/'

2014-02-14 Thread Steve Rowe
Welcome Diego, I think you’re right about MidLetter - adding a char to it should disable splitting on that char, as long as there is a letter on one side or the other. (If you’d like that behavior to be extended to numeric digits, you should use MidNumLet instead.) I tested this by adding “/”
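
A small check of the splitting behavior, useful before and after regenerating the tokenizer from the modified grammar (Lucene 4.x API; the sample string is arbitrary):

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    // With the stock grammar "foo/bar" comes out as two tokens, foo and bar;
    // after adding '/' to MidLetter and regenerating with JFlex it should stay one token.
    static void printTokens(String text) throws Exception {
        StandardTokenizer tok = new StandardTokenizer(Version.LUCENE_46, new StringReader(text));
        CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
        tok.reset();
        while (tok.incrementToken()) {
            System.out.println(term.toString());
        }
        tok.end();
        tok.close();
    }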

Re: Actual min and max-value of NumericField during codec flush

2014-02-14 Thread Michael McCandless
On Fri, Feb 14, 2014 at 12:14 AM, Ravikumar Govindarajan wrote: > Early-Query termination quits by throwing an Exception, right? Is it ok to > individually search using SegmentReader and then break off, instead of > using a MultiReader, especially when the order is known before search > begins?
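
One way to read the per-segment suggestion, as a sketch (Lucene 4.x; the query, the hit budget, and relying on reader.leaves() being in index order are assumptions):

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;

    // Search one segment at a time and stop once enough hits have been gathered,
    // without throwing any exception.
    static void searchPerSegment(Directory dir, Query query, int wanted) throws IOException {
        DirectoryReader reader = DirectoryReader.open(dir);
        try {
            int found = 0;
            for (AtomicReaderContext leaf : reader.leaves()) {
                IndexSearcher segmentSearcher = new IndexSearcher(leaf.reader());
                TopDocs hits = segmentSearcher.search(query, wanted - found);
                found += hits.scoreDocs.length;
                // consume hits.scoreDocs here; doc ids are local to this segment
                if (found >= wanted) {
                    break;   // early termination
                }
            }
        } finally {
            reader.close();
        }
    }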

Re: IndexWriter croaks on large file

2014-02-14 Thread Michael McCandless
Hmm, why are you indexing such immense documents? In 3.x Lucene never sanity checked the offsets, so we would silently index negative (int overflow'd) offsets into e.g. term vectors. But in 4.x, we now detect this and throw the exception you're seeing, because it can lead to index corruption when
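
For reference, the negative offsets in the reported exception line up exactly with 32-bit int overflow:

    // Character offsets are Java ints. Once a single document passes
    // Integer.MAX_VALUE (2,147,483,647) characters, the next offsets wrap around:
    int startOffset = Integer.MAX_VALUE + 1;   // -2147483648, as in the exception
    int endOffset   = Integer.MAX_VALUE + 2;   // -2147483647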

Re: Adding custom weights to individual terms

2014-02-14 Thread Rune Stilling
Hi Lukai, that was a great help. Thank you. I’m continuing to read about payloads: http://searchhub.org/2009/08/05/getting-started-with-payloads/ Didn’t know that concept at all. Regards, Rune On 13/02/2014 at 23.12, lukai wrote: > Hi, Rune: > Per your requirement, you can generate a separ
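
A short sketch of the analysis side of payloads (Lucene 4.x; the delimiter syntax and field contents are illustrative): indexing text like "apple|2.0 banana|0.5" attaches a float payload to each term.

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
    import org.apache.lucene.analysis.payloads.FloatEncoder;
    import org.apache.lucene.util.Version;

    // Splits on whitespace, then strips the "|weight" suffix from each token and
    // stores the weight as a per-term payload in the index.
    Analyzer payloadAnalyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            WhitespaceTokenizer source = new WhitespaceTokenizer(Version.LUCENE_46, reader);
            TokenStream filtered =
                new DelimitedPayloadTokenFilter(source, '|', new FloatEncoder());
            return new TokenStreamComponents(source, filtered);
        }
    };

At query time the payloads can then be folded into scoring, e.g. via PayloadTermQuery together with a Similarity that turns the payload bytes into a score factor.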

Re: Collector is collecting more than the specified hits

2014-02-14 Thread Tri Cao
If I understand correctly, you'd like to shortcut the execution when you reach the desired number of hits. Unfortunately, I don't think there's a graceful way to do that right now in Collector. To stop further collecting, you need to throw an IOException (or a subtype of it) and catch the exception la
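
A rough sketch of the throw-and-catch approach Tri describes (Lucene 4.x Collector API; the class and field names are mine):

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    class EnoughHitsException extends IOException {}

    // Records the first maxHits matching doc ids, then aborts the search by throwing.
    class FirstNCollector extends Collector {
        final int[] hits;          // global doc ids of the collected hits
        int count;
        private final int maxHits;
        private int docBase;

        FirstNCollector(int maxHits) {
            this.maxHits = maxHits;
            this.hits = new int[maxHits];
        }

        @Override public void setScorer(Scorer scorer) {}               // scores are not needed
        @Override public boolean acceptsDocsOutOfOrder() { return true; }
        @Override public void setNextReader(AtomicReaderContext context) {
            docBase = context.docBase;
        }

        @Override public void collect(int doc) throws IOException {
            hits[count++] = docBase + doc;
            if (count >= maxHits) {
                throw new EnoughHitsException();   // stop the rest of the search
            }
        }
    }

    // Usage: the caller treats the exception as the normal "done" signal.
    //   FirstNCollector c = new FirstNCollector(100);
    //   try { searcher.search(query, c); } catch (EnoughHitsException done) {}
    //   // c.count docs collected, ids in c.hits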

Extending StandardTokenizer Jflex to not split on '/'

2014-02-14 Thread Diego Fernandez
Hi guys, this is my first time posting on the Lucene list, so hello everyone. I really like the way that the StandardTokenizer works, however I'd like for it to not split tokens on / (forward slash). I've been looking at http://unicode.org/reports/tr29/#Default_Word_Boundaries to try to underst

IndexWriter croaks on large file

2014-02-14 Thread John Cecere
I'm having a problem with Lucene 4.5.1. Whenever I attempt to index a file > 2GB in size, it dies with the following exception: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=-2147483648,endOffset=-2147483647 Essentially

Reverse Matching

2014-02-14 Thread Siraj Haider
Hi There, Is there a way to do reverse matching by indexing the queries in an index and passing a document to see how many queries matched that? I know that I can have the queries in memory and have the document parsed in a memory index and then loop through trying to match each query. The issue

Re: Collector is collecting more than the specified hits

2014-02-14 Thread saisantoshi
I am not interested in the scores at all. My requirement is simple, I only need the first 100 hits or the numHits I specify (irrespective of their scores). The collector should stop after collecting the numHits specified. Is there a way to tell the collector to stop after collecting the numHits

Re: Tokenization and PrefixQuery

2014-02-14 Thread Michael McCandless
On Fri, Feb 14, 2014 at 8:21 AM, Yann-Erwan Perio wrote: > I have written a test which demonstrates that the mistake is indeed on > my side. It's probably due to inconsistent rules for > indexing/searching content having special characters (namely the > "plus" sign). OK, thanks for bringing clos

Re: Tokenization and PrefixQuery

2014-02-14 Thread Yann-Erwan Perio
On Fri, Feb 14, 2014 at 1:11 PM, Yann-Erwan Perio wrote: > On Fri, Feb 14, 2014 at 12:33 PM, Michael McCandless > wrote: Hi again, >> That should not be the case: it should match all terms with that >> prefix regardless of the term's length. Try to boil it down to a >> small test case? > > I g

Re: Tokenization and PrefixQuery

2014-02-14 Thread Yann-Erwan Perio
On Fri, Feb 14, 2014 at 12:33 PM, Michael McCandless wrote: > This is similar to PathHierarchyTokenizer, I think. Ah, yes, very much. I'll check it out and see if I can make something of it. I am not sure to what extent it'll be reusable though, as my tokenizer also sets payloads (the next comin
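
For reference, a quick sketch of what PathHierarchyTokenizer emits (Lucene 4.x; the sample path is arbitrary):

    import java.io.StringReader;
    import org.apache.lucene.analysis.path.PathHierarchyTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // For the input "/a/b/c" this prints the prefix paths "/a", "/a/b", "/a/b/c",
    // so a plain TermQuery on a prefix path matches every document below it.
    PathHierarchyTokenizer tok = new PathHierarchyTokenizer(new StringReader("/a/b/c"));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
        System.out.println(term.toString());
    }
    tok.end();
    tok.close();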

Re: Tokenization and PrefixQuery

2014-02-14 Thread Michael McCandless
On Fri, Feb 14, 2014 at 6:17 AM, Yann-Erwan Perio wrote: > Hello, > > I am designing a system with documents having one field containing > values such as "Ae1 Br2 Cy8 ...", i.e. a sequence of items made of > letters and numbers (max=7 per item), all separated by a space, > possibly 200 items per f

Re: Collector is collecting more than the specified hits

2014-02-14 Thread Michael McCandless
This is how Collector works: it is called for every document matching the query, and then its job is to choose which of those hits to keep. This is because in general the hits to keep can come at any time, not just the first N hits you see; e.g. the best scoring hit may be the very last one. But

Tokenization and PrefixQuery

2014-02-14 Thread Yann-Erwan Perio
Hello, I am designing a system with documents having one field containing values such as "Ae1 Br2 Cy8 ...", i.e. a sequence of items made of letters and numbers (max=7 per item), all separated by a space, possibly 200 items per field, with no limit upon the number of documents (although I would no
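
For the field layout described here, a whitespace-tokenized field plus a PrefixQuery is the usual starting point (a sketch; the "items" field name and an existing IndexSearcher are assumed):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.TopDocs;

    // Each item ("Ae1", "Br2", "Cy8", ...) is indexed as its own term, so a
    // PrefixQuery matches every item starting with the prefix, whatever its length.
    PrefixQuery query = new PrefixQuery(new Term("items", "Ae"));
    TopDocs hits = searcher.search(query, 10);

Note that the prefix is matched against the indexed terms verbatim, so if the analyzer lowercases at index time the prefix has to be lowercased too.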

Re: codec mismatch

2014-02-14 Thread Michael McCandless
This means Lucene was attempting to open _0.fnm but somehow got the contents of _0.cfs instead; seems likely that it's a bug in the Cassandra Directory implementation? Somehow it's opening the wrong file name? Mike McCandless http://blog.mikemccandless.com On Fri, Feb 14, 2014 at 3:13 AM, Jason
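
One way to sanity-check a custom Directory outside of Lucene's own file formats (a sketch; the file names are just examples): write two files with distinct contents and verify each read returns its own bytes rather than another file's.

    import java.io.IOException;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.IndexInput;
    import org.apache.lucene.store.IndexOutput;

    static void checkDirectory(Directory dir) throws IOException {
        String[] names = { "_0.fnm", "_0.cfs" };
        for (String name : names) {
            IndexOutput out = dir.createOutput(name, IOContext.DEFAULT);
            out.writeString("contents of " + name);   // distinct marker per file
            out.close();
        }
        for (String name : names) {
            IndexInput in = dir.openInput(name, IOContext.DEFAULT);
            String got = in.readString();
            in.close();
            if (!got.equals("contents of " + name)) {
                throw new IllegalStateException(name + " returned the wrong contents: " + got);
            }
        }
    }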

codec mismatch

2014-02-14 Thread Jason Wee
Hello, This is my first question to the lucene mailing list, sorry if the question sounds funny. I have been experimenting with storing lucene index files on cassandra, but unfortunately I keep running into an exception. Below is the stacktrace. org.apache.lucene.index.CorruptIndexException: codec mismatch: a