Re: Using RangeFilter

2008-01-24 Thread vivek sar
I have a field set to NO_NORMS; does it have to be untokenized to be able to sort on it? On Jan 21, 2008 12:47 PM, Antony Bowesman [EMAIL PROTECTED] wrote: vivek sar wrote: I need to be able to sort on optime as well, thus need to store it. Lucene's default sorting does not need the field to be

Re: Is Fair Similarity working with lucene 2.2 ?

2008-01-24 Thread Fabrice Robini
Is there anything I can do to pass my Unit-Test? Or is it impossible? Thanks a lot, Fabrice Fabrice Robini wrote: Hi Srikant, I really thank you for your reply, it's very interesting. I have to say I am confused with that now... I do not know what I can do to pass this Unit

Re: Multiple searchers (Was: CachingWrapperFilter: why cache per IndexReader?)

2008-01-24 Thread Toke Eskildsen
On Thu, 2008-01-24 at 08:18 +1100, Antony Bowesman wrote: These are odd. The last case in both of the above shows a slowdown compared to 2.1 index and version and in the first 50K queries, the 2.3 index and version is even slower than 2.3 with 2.1 index. It catches up in the longer

Lucene search strings two

2008-01-24 Thread Prathiba Paka
Hi all. I need to check two conditions in a search. First, I need to find a bank name; next, within those results, I need the documents containing a particular city. Finally, I need the documents which satisfy both conditions, i.e., documents with bank+city. Please, can anyone help me? Thanks, prathiba.P
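
A minimal sketch of the AND query the poster describes, using the Lucene 2.3-era API. The field names `bank` and `city` are assumptions for illustration; they are not given in the thread:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class BankCitySketch {
    // Both clauses are MUST, so only documents matching the bank name
    // AND the city are returned.
    static BooleanQuery build(String bank, String city) {
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("bank", bank)), BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("city", city)), BooleanClause.Occur.MUST);
        return query;
    }
}
```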

Re: Using RangeFilter

2008-01-24 Thread Antony Bowesman
vivek sar wrote: I have a field set to NO_NORMS; does it have to be untokenized to be able to sort on it? NO_NORMS is the same as UNTOKENIZED + omitNorms, so you can sort on it. Antony - To unsubscribe, e-mail: [EMAIL PROTECTED]
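
A sketch of Antony's point, with the Lucene 2.3-era API. The field name `optime` comes from the thread; the value format and STRING sort type are illustrative assumptions:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class NoNormsSortSketch {
    static Document makeDoc(String optime) {
        Document doc = new Document();
        // NO_NORMS indexes the value as a single token with norms
        // omitted -- i.e. UNTOKENIZED + omitNorms -- so there is one
        // term per document and the field remains sortable.
        doc.add(new Field("optime", optime, Field.Store.YES, Field.Index.NO_NORMS));
        return doc;
    }

    static Sort byOptime() {
        return new Sort(new SortField("optime", SortField.STRING));
    }
}
```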

Full Text Searching a Relational Model

2008-01-24 Thread yarong
Hi, (Warning, not for the weak-hearted) I'm currently working on a project where we have a large and complex data model, related to Genomics. We are trying to build a search engine that provides full text and field-based text searches for our customer base (mostly academic research), and are

LogMergePolicy

2008-01-24 Thread Koji Sekiguchi
Hello, I'm curious, why is LogMergePolicy named *Log*MergePolicy? (Why not ExpMergePolicy? :-) Thank you, Koji

RE: LogMergePolicy

2008-01-24 Thread Steven Parkes
I'm curious, why is LogMergePolicy named *Log*MergePolicy? (Why not ExpMergePolicy? :-) Well, I guess it's a matter of perspective. When you look at the way the algorithm works, the merge decisions are based on a concept of level and levels are assigned based on the log of the

Re: LogMergePolicy

2008-01-24 Thread Yonik Seeley
On Jan 24, 2008 8:40 AM, Steven Parkes [EMAIL PROTECTED] wrote: I'm curious, why is LogMergePolicy named *Log*MergePolicy? (Why not ExpMergePolicy? :-) Well, I guess it's a matter of perspective. When you look at the way the algorithm works, the merge decisions are based on a

Re: LogMergePolicy

2008-01-24 Thread Koji Sekiguchi
Thank you Steven and Yonik, I think I got it. And I can see that LogMergePolicy uses Math.log() to find merges. :-) Thank you again, Koji
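
The idea Steven describes can be sketched in a few lines. This is not Lucene's actual implementation, just the concept behind the name: a segment's "level" is the log, base mergeFactor, of its size, and a merge is triggered once mergeFactor segments pile up on the same level:

```java
public class LogLevelSketch {
    // Segments whose sizes differ by less than a factor of mergeFactor
    // land on the same level; growth between levels is exponential,
    // which is why the *log* of the size is the natural index.
    static int level(long size, int mergeFactor) {
        return (int) Math.floor(Math.log(size) / Math.log(mergeFactor));
    }
}
```

With the default mergeFactor of 10, segments of size 9 sit on level 0, size 500 on level 2, and size 123456 on level 5.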

Re: Full Text Searching a Relational Model

2008-01-24 Thread Chris Lu
In general, you just need to denormalize the data into a list of Genes, and add each Gene's related information via SQL. Ranking can be easily adjusted via each field's weight; not a big deal. This seems an ideal case for using DBSight. It can also do incremental indexing, which you may also need.

Creating search query

2008-01-24 Thread spring
Hi, I have an index with some fields which are indexed and UN_TOKENIZED (keywords) and one field which is indexed and TOKENIZED (content). Now I want to create a Query object: TermQuery k1 = new TermQuery(new Term("foo", "some foo")); TermQuery k2 = new TermQuery(new Term("bar", "some

RE: Compass

2008-01-24 Thread spring
Thank you. -Original Message- From: Lukas Vlcek [mailto:[EMAIL PROTECTED] Sent: Mittwoch, 23. Januar 2008 08:23 To: java-user@lucene.apache.org Subject: Re: Compass Hi, I am using Compass with Spring and JPA. It works pretty nice. I don't store index into database, I use

Re: Creating search query

2008-01-24 Thread Erick Erickson
That should work fine, assuming that foo and bar are the untokenized fields and content is the tokenized content. Erick On Jan 24, 2008 1:18 PM, [EMAIL PROTECTED] wrote: Hi, I have an index with some fields which are indexed and un_tokenized (keywords) and one field which is indexed and

RE: Creating search query

2008-01-24 Thread spring
Yes, sorry, that's the case. Thank you! -Original Message- From: Erick Erickson [mailto:[EMAIL PROTECTED] Sent: Donnerstag, 24. Januar 2008 19:49 To: java-user@lucene.apache.org Subject: Re: Creating search query That should work fine, assuming that foo and bar are the
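
A sketch of the combined query the thread settles on, with the Lucene 2.3-era API: exact TermQuery clauses for the untokenized keyword fields, and the tokenized content text run through QueryParser with the index-time analyzer. StandardAnalyzer and the content text's handling are assumptions; use whatever analyzer the index was built with:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class MixedFieldQuerySketch {
    static BooleanQuery build(String fooValue, String contentText) throws ParseException {
        BooleanQuery query = new BooleanQuery();
        // Untokenized keyword field: the term must match exactly as indexed.
        query.add(new TermQuery(new Term("foo", fooValue)), BooleanClause.Occur.MUST);
        // Tokenized content field: analyze the query text the same way
        // the field was analyzed at index time.
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        query.add(parser.parse(contentText), BooleanClause.Occur.MUST);
        return query;
    }
}
```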

RE: Design questions

2008-01-24 Thread spring
-Original Message- From: Erick Erickson [mailto:[EMAIL PROTECTED] Sent: Freitag, 11. Januar 2008 16:16 To: java-user@lucene.apache.org Subject: Re: Design questions But you could also vary this scheme by simply storing in your document the offsets for the beginning of each page.

Re: Design questions

2008-01-24 Thread Erick Erickson
I think you'll have to implement your own Analyzer and count. That is, every call to next() that returns a token will have to also increment some counter by 1. To use this, you must have some way of knowing when a page ends, and at that point you call your instance of your custom analyzer to see
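
Erick's counting idea might look like the following with the Lucene 2.3-era TokenStream API (where next() returns a Token or null). This is a sketch, not a tested filter; the class name and accessor are invented for illustration:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Wraps any TokenStream and counts tokens as they pass through, so the
// caller can record the token offset at which each page ends.
public class CountingTokenFilter extends TokenFilter {
    private int count = 0;

    public CountingTokenFilter(TokenStream in) {
        super(in);
    }

    public Token next() throws IOException {
        Token token = input.next();
        if (token != null) {
            count++;
        }
        return token;
    }

    public int getCount() {
        return count;
    }
}
```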

RE: Lucene, HTML and Hebrew

2008-01-24 Thread Itamar Syn-Hershko
Steve and all, I didn't know whether to send a detailed description of my case to aid with seeing the whole picture, or to send a list of short questions which will require loads of follow-up. I guess I know what is better now, thanks Lucene does not store proximity relations between data

FYI: parallel corpus in 22 languages

2008-01-24 Thread Andrzej Bialecki
Hi all, Just FYI, perhaps this is old news for you ... This large corpus is freely available and it is pairwise sentence-aligned for all language combinations. This looks like a good resource for linguistic information, such as frequent words and phrases, n-gram profiles, etc.

RE: Lucene, HTML and Hebrew

2008-01-24 Thread Steven A Rowe
Hi Itamar, On 01/24/2008 at 2:55 PM, Itamar Syn-Hershko wrote: Lucene does not store proximity relations between data in different fields, only within individual fields So are two calls to doc.add() with the same field name but different texts considered as one field (the latter call being
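
On the question of repeated add() calls: in the 2.3-era API, two add() calls with the same field name are indexed as one logical field, and Analyzer.getPositionIncrementGap controls the positional gap between the parts. A sketch (field name, texts, and the gap of 100 are illustrative assumptions):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class PositionGapSketch {
    // A large gap means phrase and proximity (slop) queries will not
    // match across the boundary between the two added texts.
    static final StandardAnalyzer ANALYZER = new StandardAnalyzer() {
        public int getPositionIncrementGap(String fieldName) {
            return 100;
        }
    };

    static Document makeDoc() {
        Document doc = new Document();
        // Both add() calls contribute to the single logical field "body",
        // separated by the position increment gap above.
        doc.add(new Field("body", "first part", Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("body", "second part", Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
    }
}
```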

Re: strange exception while indexing

2008-01-24 Thread Michael McCandless
That means that one of the merges, which run in the background by default with 2.3, hit an unhandled exception. Did you see another exception logged / printed to stderr before this one? Mike Cam Bazz wrote: Does anyone have any idea about the error I got while indexing? Best Regards,

Re: strange exception while indexing

2008-01-24 Thread Cam Bazz
No, only after that there was a GC error. I am also not using the compound index file format, in order to increase indexing speed. Could it be because of that? I will run the test case again tomorrow. What can I do to increase logging? Best, -C.B. On Jan 24, 2008 11:52 PM, Michael McCandless

Re: strange exception while indexing

2008-01-24 Thread Michael McCandless
Hmm, you should have seen an exception before that one from optimize. Can you post the GC error? Was it an OutOfMemoryError situation? Mike On Jan 24, 2008, at 5:32 PM, Cam Bazz wrote: no. only after that there was a gc error. I am also not using the compound index file format in order to

Re: strange exception while indexing

2008-01-24 Thread Michael McCandless
Oh, also, I don't think not using CFS would lead to this, unless it's somehow triggering too many file descriptors... Mike Cam Bazz wrote: no. only after that there was a gc error. I am also not using the compound index file format in order to increase indexing speed. could it be
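
On Cam's question about increasing logging: in Lucene 2.3 IndexWriter can write its merge and flush diagnostics to a PrintStream via setInfoStream. A sketch (the path argument and use of StandardAnalyzer are assumptions):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class WriterLoggingSketch {
    static IndexWriter openVerbose(String path) throws java.io.IOException {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
        writer.setInfoStream(System.err);   // log merge/flush activity to stderr
        writer.setUseCompoundFile(false);   // non-compound format, as in the thread
        return writer;
    }
}
```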

RE: Design questions

2008-01-24 Thread spring
Or, you could just do things twice. That is, send your text through a TokenStream, then call next() and count. Then send it all through doc.add(). Hm. This means reading the content twice, no matter whether I use my own analyzer or override/wrap the main analyzer. Is there anywhere a hook

Lucene to index OCR text

2008-01-24 Thread Renaud Waldura
I've been poking around the list archives and didn't really come up against anything interesting. Anyone using Lucene to index OCR text? Any strategies/algorithms/packages you recommend? I have a large collection (10^7 docs) that's mostly the result of OCR. We index/search/etc. with Lucene

MapReduce usage with Lucene Indexing

2008-01-24 Thread roger dimitri
Hi, I am very new to Lucene and Hadoop, and I have a project where I need to use Lucene to index some input given either as a huge collection of Java objects or as one huge Java object. I read about Hadoop's MapReduce utilities and I want to leverage that feature in the case described above.
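
The map/reduce shape of parallel indexing can be sketched without Hadoop itself: each partition ("map" task) builds its own shard, and a final step merges the shards with addIndexes. This is a plain-Java illustration of the pattern, not Hadoop code; the RAMDirectory shards and StandardAnalyzer are assumptions:

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ShardedIndexSketch {
    // "Map" step: each partition is indexed into its own shard.
    // Partitions can run in parallel threads or separate map tasks.
    static Directory indexShard(List docs) throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        for (Object o : docs) {
            writer.addDocument((Document) o);
        }
        writer.close();
        return dir;
    }

    // "Reduce" step: merge all shards into the final index.
    static void merge(Directory target, Directory[] shards) throws IOException {
        IndexWriter writer = new IndexWriter(target, new StandardAnalyzer(), true);
        writer.addIndexes(shards);
        writer.close();
    }
}
```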

Re: Lucene to index OCR text

2008-01-24 Thread Erick Erickson
Lots of luck to you, because I haven't a clue. My company deals with OCR data and we haven't had a single workable idea. Of course, our data sets are minuscule compared to what you're talking about, so we haven't tried to heuristically clean up the data. But given that Google is scanning the

Re: Lucene to index OCR text

2008-01-24 Thread Kyle Maxwell
I've been poking around the list archives and didn't really come up against anything interesting. Anyone using Lucene to index OCR text? Any strategies/algorithms/packages you recommend? I have a large collection (10^7 docs) that's mostly the result of OCR. We index/search/etc. with Lucene
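
One commonly suggested strategy for noisy OCR text (not from this thread, offered as an assumption) is fuzzy matching, which tolerates small character-level recognition errors:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;

public class OcrFuzzySketch {
    // minimumSimilarity 0.7 and prefixLength 2 are illustrative values:
    // the required common prefix keeps the term expansion affordable on
    // a 10^7-document index.
    static FuzzyQuery build(String word) {
        return new FuzzyQuery(new Term("content", word), 0.7f, 2);
    }
}
```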

[ANNOUNCE] Lucene Java 2.3.0 release available

2008-01-24 Thread Michael Busch
Release 2.3.0 of Lucene Java is now available! Many new features, optimizations, and bug fixes have been added since 2.2, including: * significantly improved indexing performance * segment merging in background threads * refreshable IndexReaders * faster StandardAnalyzer and improved

Threads blocking on isDeleted when swapping indices for a very long time...

2008-01-24 Thread Michael Stoppelman
Hi all, I've been tracking down a problem happening in our production environment. When we switch an index after doing deletes and adds, running some searches, and finally changing the pointer from the old index to the new, all the threads start stacking up, waiting on isDeleted(). The threads seem to
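
One common shape for the swap (a sketch, not the poster's code): warm the new searcher before publishing it through a volatile reference, so request threads never hit a cold reader. The warming query and field name are invented for illustration:

```java
import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class SearcherSwapSketch {
    private volatile IndexSearcher current;

    void swap(String newIndexPath) throws IOException {
        IndexSearcher fresh = new IndexSearcher(newIndexPath);
        // Warm the new reader before it becomes visible, so request
        // threads never pile up on a cold index.
        fresh.search(new TermQuery(new Term("id", "warmup")));
        IndexSearcher old = current;
        current = fresh;
        // Assumes no in-flight search still holds 'old'; use reference
        // counting if searches may outlive the swap.
        if (old != null) {
            old.close();
        }
    }
}
```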

Re: Archiving Index using partitions

2008-01-24 Thread vivek sar
Thanks Otis for your response. I have a few more questions: 1) Is it recommended to do index partitioning for large indexes? - We index around 35 fields (storing only two of them - simple ids) - Each document is around 200 bytes - Our index grows to around 50G a week 2) The