Re: Use multiple lucene indices

2011-12-06 Thread Danil ŢORIN
10B documents is a lot of data. One index per file won't scale: you will not be able to open all the indexes at the same time (file handle limits, memory limits, etc.), and if you search through them sequentially, it will take a lot of time. Unless in your use case you always know the file you are sear

Re: Lucene 4.0 Index Format Finalization Timetable

2011-12-06 Thread Jamie Johnson
I suppose that's fair enough. Some quick googling shows that this has been asked many times with pretty much the same response. Sorry to add to the noise. On Tue, Dec 6, 2011 at 9:34 PM, Darren Govoni wrote: > I asked here[1] and it said "Ask again later." > > [1] http://8ball.tridelphia.net/ >

Re: Lucene 4.0 Index Format Finalization Timetable

2011-12-06 Thread Darren Govoni
I asked here[1] and it said "Ask again later." [1] http://8ball.tridelphia.net/ On 12/06/2011 08:46 PM, Jamie Johnson wrote: Thanks Robert. Is there a timetable for that? I'm trying to gauge whether it is appropriate to push for my organization to move to the current lucene 4.0 implementation

Re: Lucene 4.0 Index Format Finalization Timetable

2011-12-06 Thread Jamie Johnson
Thanks Robert. Is there a timetable for that? I'm trying to gauge whether it is appropriate to push for my organization to move to the current lucene 4.0 implementation (we're using solr cloud which is built against trunk) or if it's expected there will be changes to what is currently on trunk.

Re: Lucene 4.0 Index Format Finalization Timetable

2011-12-06 Thread Robert Muir
On Tue, Dec 6, 2011 at 6:41 PM, Jamie Johnson wrote: > Is there a timetable for when it is expected to be finalized? It will be finalized when Lucene 4.0 is released. -- lucidimagination.com

Lucene 4.0 Index Format Finalization Timetable

2011-12-06 Thread Jamie Johnson
Is there a timetable for when it is expected to be finalized? I'm not looking for an exact date, just an approximation (next month, 2 months, 6 months, etc.) - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For a

tokenizing text using language analyzer but preserving stopwords if possible

2011-12-06 Thread Ilya Zavorin
I need to implement a "quick and dirty" or "poor man's" translation of a foreign language document by looking up each word in a dictionary and replacing it with the English translation. So what I need is to tokenize the original foreign text into words and then access each word, look it up and g
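The "poor man's translation" idea above can be sketched in plain Java. This is a minimal illustration only: the whitespace split stands in for a real Lucene language Analyzer (which would also handle punctuation, stemming, and per-language tokenization), and the dictionary entries are made-up placeholders.

```java
import java.util.HashMap;
import java.util.Map;

public class PoorMansTranslator {

    // Hypothetical foreign-to-English dictionary; a real one
    // would be loaded from a file.
    static final Map<String, String> DICT = new HashMap<>();
    static {
        DICT.put("hola", "hello");
        DICT.put("mundo", "world");
    }

    // Replace each token with its English translation; unknown tokens
    // (including stopwords absent from the dictionary) pass through
    // unchanged, which "preserves stopwords" as asked.
    static String translate(String text) {
        StringBuilder out = new StringBuilder();
        // Whitespace split stands in for a real language Analyzer.
        for (String token : text.split("\\s+")) {
            if (out.length() > 0) out.append(' ');
            out.append(DICT.getOrDefault(token, token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("hola y mundo"));
    }
}
```

Swapping the whitespace split for a Lucene Analyzer's TokenStream would give proper per-language tokenization while keeping the same lookup-and-pass-through loop.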

Re: Spell check on a subset of an index ( 'namespace' aware spell checker)

2011-12-06 Thread Ian Lea
There are utilities floating around for getting output from analyzers - would that help? I think there are some in LIA, probably others elsewhere. The idea being that you grab the stored fields from the index, pass them through your analyzer, grab the output and use that. Or can you do something

Re: Spell check on a subset of an index ( 'namespace' aware spell checker)

2011-12-06 Thread E. van Chastelet
I'm still struggling with this. I've tried to implement the solution mentioned in the previous reply, but unfortunately there is a blocking issue with it: I cannot find a way to create a new index from the source index such that the new index contains the field values. The only way to copy

Re: lucene-core-3.3.0 not optimizing

2011-12-06 Thread Erick Erickson
Try taking a look at the patch, but on a quick glance it doesn't look like the underlying code has changed much. But note that the whole point of this change is that optimize was overused, given its former name; why do you want to keep using it? Best Erick On Tue, Dec 6, 2011 at 1:04 AM, KARTHIK SHIVAKUMAR w

Re: Don't get results whereas Luke does...

2011-12-06 Thread Felipe Carvalho
I had a similar problem. The problem was the "-" char, which is a special character for Lucene. You can try indexing the data in lowercase and use WhitespaceAnalyzer for both indexing and searching over the field. Another option is to replace "-" with "_" when indexing and searching. This way, your data
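The key point in the suggestion above is that the exact same normalization must run at index time and at query time. A minimal sketch of that idea, using a made-up `normalize` helper (not a Lucene API):

```java
public class TermNormalizer {

    // Apply identical normalization on both the indexing and the
    // query side, so WhitespaceAnalyzer sees matching terms:
    // lowercase everything and replace the special '-' with '_'.
    static String normalize(String term) {
        return term.toLowerCase().replace('-', '_');
    }

    public static void main(String[] args) {
        String indexedTerm = normalize("NB-ARC"); // what goes into the index
        String queryTerm   = normalize("NB-ARC"); // what goes into the query
        System.out.println(indexedTerm + " matches query term: "
                + indexedTerm.equals(queryTerm));
    }
}
```

Because both sides go through the same function, case differences and the special hyphen can never cause a mismatch.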

Re: Don't get results whereas Luke does...

2011-12-06 Thread Ian Lea
Try QueryParser.setLowercaseExpandedTerms(false). QueryParser will lowercase terms in prefix etc. queries by default. If that doesn't work, and if it was my problem, I'd just lowercase everything, everywhere. Life's too short to mess around with case issues. -- Ian. On Tue, Dec 6, 2011 at 8:12 A

Re: Use multiple lucene indices

2011-12-06 Thread Rui Wang
Hi Danil, Thank you for your suggestions. We will have approximately half a million documents per file, so using your calculation, 20,000 files * 500,000 = 10,000,000,000. And we are likely to get more files in the future, so a scalable solution is most desirable. The document IDs are not uniq
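Danil's back-of-the-envelope estimate, spelled out with Rui's numbers. Note the file count of 20,000 is inferred from the stated 10B total and half a million docs per file, as the archived message garbled the figures:

```java
public class DocCountEstimate {

    // Danil's formula: total documents = number of files * avg(docs/file).
    static long totalDocs(long files, long avgDocsPerFile) {
        // Use long arithmetic: the product overflows int.
        return files * avgDocsPerFile;
    }

    public static void main(String[] args) {
        // 20,000 files * 500,000 docs/file = 10,000,000,000 documents.
        System.out.println(totalDocs(20_000L, 500_000L));
    }
}
```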

Re: Use multiple lucene indices

2011-12-06 Thread Danil ŢORIN
How many documents are there in the system? Approximate it by: number of files * avg(docs/file). From my understanding your queries will be just a lookup for a document ID (Q: are those IDs unique between files? or do you need to filter by filename?) If that will be the only use case, then maybe you should

Re: Use multiple lucene indices

2011-12-06 Thread Rui Wang
Hi Guys, Thank you very much for your answers. I will do some profiling on memory usage, but is there any documentation on how Lucene uses/allocates the memory? Best wishes, Rui Wang On 6 Dec 2011, at 06:11, KARTHIK SHIVAKUMAR wrote: > hi > >>> would the memory usage go through the roof?

Don't get results whereas Luke does...

2011-12-06 Thread ejblom
Dear Lucene-users, I am a bit puzzled over this. I have a query which should return some documents: if I use Luke, I obtain hits using the org.apache.lucene.analysis.KeywordAnalyzer. This is the query: domain:NB-AR* (I have data indexed using: doc.add(new Field("domain", "NB-ARC", Field.Store.YE
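A likely cause of the symptom above, as Ian's earlier reply suggests, is that QueryParser lowercases wildcard/prefix terms by default while KeywordAnalyzer indexed the value verbatim, so "NB-AR*" becomes "nb-ar*" and no longer matches. A plain-Java illustration of the case mismatch (no Lucene involved, just the string comparison at the heart of it):

```java
public class PrefixCaseMismatch {

    public static void main(String[] args) {
        // KeywordAnalyzer indexes the field value verbatim:
        String indexedTerm = "NB-ARC";

        // QueryParser lowercases expanded (wildcard/prefix) terms by default:
        String parsedPrefix = "NB-AR".toLowerCase();

        // The lowercased prefix "nb-ar" no longer matches the verbatim term:
        System.out.println(indexedTerm.startsWith(parsedPrefix)); // false

        // Keeping the original case (what setLowercaseExpandedTerms(false)
        // achieves) makes the prefix match again:
        System.out.println(indexedTerm.startsWith("NB-AR")); // true
    }
}
```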