Improving Lucene Search Performance
Hi,

Is there anything to take care of when creating an index to improve Lucene text search speed?

Thanks and regards,
Dilshad K.P.
Re: Spell check on a subset of an index ( 'namespace' aware spell checker)
Ian, thank you for your suggestions. I have looked at TermEnum and TermDocs, but they don't offer a combination of terms and frequencies (used by our autocompleter class) from a filtered set of docs. Eventually I implemented the following solution:

- In the source index, get all terms for the namespace field
- For each namespace ns:
  * copy the source index to a new location
  * remove all documents that match (*:*) -(namespace:ns)
  * construct the spellcheck/autocompletion index from that copy

I still need to look into other possibilities where I have one spellcheck and one autocompletion index for all namespaces, with support for namespace filtering. For the autocompleter this will be more difficult, because it should sort the completions on a frequency field that represents the frequency scoped to one namespace. But this has lower priority at the moment. The main goal was to have spellchecking and autocompletion scoped to namespaces, where there is one source index containing all namespaces.

Regards, Elmer

On 12/06/2011 03:40 PM, Ian Lea wrote:

There are utilities floating around for getting output from analyzers - would that help? I think there are some in LIA, probably others elsewhere. The idea being that you grab the stored fields from the index, pass them through your analyzer, grab the output and use that. Or can you do something with TermEnum and/or TermDocs? Not sure exactly what or how though ...

-- Ian.

On Tue, Dec 6, 2011 at 2:20 PM, E. van Chastelet wrote:

I'm still struggling with this. I've tried to implement the solution mentioned in my previous reply, but unfortunately there is a blocking issue: I cannot find a way to create another index from the source index such that the new index has the field values in it. The only way to copy a document's field values from one index to another is to have stored fields. But stored fields hold "the original String in its entirety", not the analyzed String, which I need. Is there another way to copy documents (with at least the spellcheck field) from one index to another?

Recap: I have a source index holding documents for different namespaces. These documents hold one field (analyzed) that should be used for spell checking. I want to construct a spellchecker index for each namespace separately. To accomplish this, I first get the list of namespaces (each document has a namespace field in the original index). Then, for each namespace, I get the list of documents that match this namespace. Then I'd like to use this subset to construct a spellchecker index.

Regards, Elmer

On 11/23/2011 03:28 PM, E. van Chastelet wrote:

I currently have an idea to get it done, but it's not a nice solution. If we have an index Q with all documents for all namespaces, we first extract the list of all terms that appear for the field namespace in Q (this field indicates the namespace of the document). Then, for each namespace n in the terms list:

- Get all docs from Q that match +namespace:n
- Construct a temporary index from these docs
- Use this temporary index to construct the dictionary, which the SpellChecker can use as input
- Call indexDictionary on the SpellChecker to create a spellcheck index for the current namespace
- Delete the temporary index

We now have separate spellcheck indexes for each namespace. Any suggestions for a cleaner solution?

Regards, Elmer van Chastelet
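A minimal sketch of the copy-and-prune step from Elmer's solution at the top of this message, against the Lucene 3.1 API used in this thread; the field name, analyzer, and index location are assumptions, not code from the thread:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class PruneToNamespace {
        // Run against a fresh copy of the source index: deletes every
        // document NOT in namespace ns, i.e. (*:*) -(namespace:ns).
        public static void prune(File copiedIndex, String ns) throws Exception {
            IndexWriter writer = new IndexWriter(FSDirectory.open(copiedIndex),
                    new IndexWriterConfig(Version.LUCENE_31,
                            new StandardAnalyzer(Version.LUCENE_31)));
            BooleanQuery q = new BooleanQuery();
            q.add(new MatchAllDocsQuery(), Occur.MUST);
            q.add(new TermQuery(new Term("namespace", ns)), Occur.MUST_NOT);
            writer.deleteDocuments(q);
            writer.close();  // commit; the copy now holds only ns documents
        }
    }

The pruned copy can then be fed to SpellChecker.indexDictionary, as described above.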
On 11/10/2011 01:16 PM, E. van Chastelet wrote:

Hi all,

In our project we would like the ability to get search results scoped to one 'namespace' (as we call it). This can easily be achieved by using a filter or just an additional must-clause. For the spellchecker (and our autocompletion, which is a modified spellchecker), the story seems different. The spellchecker index is created using a LuceneDictionary, which has an IndexReader as its source. We would like to get (spellcheck/autocomplete) suggestions scoped to one namespace (i.e. the field 'namespace' should have a particular value). With a single source index containing docs for all namespaces, it does not seem possible to create a spellcheck index for each namespace the ordinary way.

Q1: Is there a way to construct a LuceneDictionary from a subset of a single source index (all terms where namespace = %value%)?

Another, maybe better solution is to customize the spellchecker by adding an additional namespace field to the spellchecker index. At query time, an additional must-clause is added, scoping the suggestions to one (or more) namespace(s). The advantage of this is having a singleton spellchecker (or at least a single index reader) for all namespaces. This also means fewer open files for our application (imagine if there are over 1000 namespaces).

Q2: Will there be a significant penalty (say more than 50% slower) for the additional must-clause at query time?

Q3: Or can you think of a better solution for this problem? :)

How we currently do it: we currently use Lucene 3.1 with Hibernate Search and we
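For the single-index alternative raised in Q2, the query-time must-clause might look roughly like this; the spellchecker searcher, the namespace field, and the suggestion query itself are assumptions in this sketch:

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class ScopedSuggest {
        // Wrap the spellchecker's own (fuzzy/n-gram) query with an extra
        // must-clause so suggestions come only from the given namespace.
        public static TopDocs suggest(IndexSearcher spellSearcher,
                                      Query suggestionQuery,
                                      String ns, int n) throws IOException {
            BooleanQuery q = new BooleanQuery();
            q.add(suggestionQuery, Occur.MUST);
            q.add(new TermQuery(new Term("namespace", ns)), Occur.MUST);
            return spellSearcher.search(q, n);
        }
    }

On Q2: a single TermQuery must-clause is usually cheap relative to the fuzzy clauses a spellchecker runs, so a slowdown of more than 50% would be surprising, but that is a guess and worth benchmarking.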
Re: Improving Lucene Search Performance
See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed. Some of the tips relate to indexing, but most are search-time advice.

-- Ian.

On Thu, Dec 8, 2011 at 10:45 AM, Dilshad K. P. wrote:
> Hi,
> Is there anything to take care of when creating an index to improve Lucene
> text search speed?
>
> Thanks and regards,
> Dilshad K.P.
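For reference, two of the search-time tips on that wiki page - open the IndexReader read-only and share a single IndexSearcher across threads - might look like the following sketch against the 3.x API. The index path is an assumption.

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class SharedSearcher {
        // One searcher shared across all threads; IndexSearcher is thread-safe.
        private static IndexSearcher searcher;

        public static synchronized IndexSearcher get() throws Exception {
            if (searcher == null) {
                // true = read-only reader, which avoids synchronization
                // overhead on deleted-docs checks.
                IndexReader reader = IndexReader.open(
                        FSDirectory.open(new File("/path/to/index")), true);
                searcher = new IndexSearcher(reader);
            }
            return searcher;
        }
    }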
RE: Score per position
Hi,

A few days ago I asked a similar question:

1) In the coming Lucene 4.0 there is a feature somewhat like a payload at the document level:

> lucene 4 has a feature called IndexDocValues which is essentially a
> payload per document per field.
>
> you can read about it here:
> http://www.searchworkings.org/blog/-/blogs/introducing-lucene-index-doc-values
> http://www.searchworkings.org/blog/-/blogs/apache-lucene-flexiblescoring-with-indexdocvalues
> http://www.searchworkings.org/blog/-/blogs/indexdocvalues-their-applications

2) You may consider using the FieldCache along with CustomScoreQuery (my case was a timestamp field, but we can put whatever logic we like into a customized field at indexing time):

>>> you can simply index your timestamp (untokenized) and wrap your query
>>> in a CustomScoreQuery. This query accepts your user query and a
>>> ValueSource. During search CustomScoreQuery calls your valuesource for
>>> each document that the user query scores and multiplies the result of
>>> the ValueSource into the score. Inside your valuesource you can simply
>>> get the timestamps from the FieldCache and calculate your custom
>>> boost...

Best regards, Lisheng

-----Original Message-----
From: arnon ma [mailto:arnon...@yahoo.com]
Sent: Wednesday, December 07, 2011 4:26 AM
To: java-user@lucene.apache.org
Subject: Score per position

We have an application where every term position in a document is associated with an "engine score". A term query should then be scored according to the sum of the "engine scores" of the term in a document, rather than on the term frequency. For example, a term frequency of 5 with an average engine score of 100 should be equivalent to a term frequency of 1 with an engine score of 500.

I understand that if I keep the engine score per position in the payload, I will be able to use scorePayload in combination with a summing version of PayloadFunction to get the sum of engine scores of a term in a document, and so will be able to achieve my goal. There are two issues with this solution:

1. Even the simplest term query would have to scan the positions file in order to get the payloads, which could be a performance issue. We would prefer to index the sum of engine scores in advance per document, in addition to the term frequency. This is some sort of payload at the document level. Does Lucene support that, or have any other solution for this issue?

2. The "engine score" of a phrase occurrence is defined as the product of the engine scores of the terms that compose the phrase. So in scorePayload I need the payloads of all the terms in the phrase in order to appropriately score the phrase occurrence. As far as I understand, the current interface of scorePayload does not provide this information. Is there another way this can be achieved in Lucene?

Thanks in advance, Arnon.
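The quoted advice could be sketched roughly as follows against the Lucene 3.x function package. The "boost" field name is an assumption; by default, CustomScoreQuery multiplies the value source's score into the user query's score.

    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.function.CustomScoreQuery;
    import org.apache.lucene.search.function.FloatFieldSource;
    import org.apache.lucene.search.function.ValueSourceQuery;

    public class BoostedQuery {
        public static Query wrap(Query userQuery) {
            // FloatFieldSource reads the untokenized numeric field from the
            // FieldCache for each document the user query scores.
            ValueSourceQuery boost =
                    new ValueSourceQuery(new FloatFieldSource("boost"));
            // Default behavior: score = user query score * value source value.
            return new CustomScoreQuery(userQuery, boost);
        }
    }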
Re: Split mutable logical document into two Lucene documents
It is conceivable that nested documents might help: https://issues.apache.org/jira/browse/LUCENE-2454. I don't know anything about that, so I might be way off target.

-- Ian.

On Wed, Dec 7, 2011 at 8:46 PM, Brandon Mintern wrote:
> We have a document tagging system where documents are composed of two
> types of data:
>
> Rarely changed (hereafter: "immutable") data - document text and
> metadata that we upload and almost never change. The text can be
> hundreds of pages.
>
> User created (hereafter: "mutable") data - document properties that
> are set by users of our system. In total, a document's properties are
> generally several dozen bytes at most. Even viewing a document changes
> the data (e.g., the document's "viewed" property).
>
> At present, all data is part of a single Lucene document. The problem
> is that when any piece of mutable data is updated (this happens
> relatively frequently), we have to reindex the entire document. We'd
> like to have two separate indexed Lucene documents per logical
> document, one containing the immutable data and the other containing
> the much smaller and more transient mutable data. When the mutable
> data changes, we can delete that document's mutable Lucene document
> and index a new one very quickly.
>
> There are two major difficulties when actually performing a search, though:
>
> 1. We are providing complex queries to retrieve logical documents
> based on information in either of its Lucene documents. It seems
> non-trivial to fetch a logical document in a BooleanQuery with
> Occur.MUST clauses referring to fields in both of the Lucene
> documents.
>
> 2. We need to sort results (logical document IDs) based on fields in
> either of its Lucene documents.
>
> Has anyone done anything like this before? Is there functionality I'm
> overlooking that could make this easier?
Re: SpanNearQuery and matching spans inside the first span
Have you read http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/? Might help explain some of the behaviour you are seeing.

-- Ian.

On Tue, Dec 6, 2011 at 4:42 AM, Trejkaz wrote:
> Supposing I have a document with just "hi there" as the text.
>
> If I do a span query like this:
>
>     near(near(term('hi'), term('there'), slop=0, forwards),
>          term('hi'), slop=1, any-direction)
>
> that returns no hits. However, if I do a span query like this:
>
>     near(near(term('hi'), term('there'), slop=0, forwards),
>          term('there'), slop=1, any-direction)
>
> that returns the document.
>
> It seems that the rule is that if the two spans *start* at the same
> position, then they are not considered "near" each other. But from
> the POV of a user (and from this developer) this is lop-sided, because
> in both situations the second span was inside the first span. It
> seems like they should either both be considered hits, or both be
> considered non-hits.
>
> I am wondering what others think about this, and whether there is any
> way to manipulate/rewrite the query to get a more balanced-looking
> result.
>
> (I'm sure it gets particularly hairy, though, when your two spans
> overlap only partially... is that "near" or not?)
>
> TX
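For concreteness, the two pseudo-queries above might translate to the span API like this (3.x signatures; the "text" field name is an assumption):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class OverlapSpans {
        public static SpanNearQuery build(String outerTerm) {
            // near(term('hi'), term('there'), slop=0, forwards)
            SpanQuery inner = new SpanNearQuery(new SpanQuery[] {
                    new SpanTermQuery(new Term("text", "hi")),
                    new SpanTermQuery(new Term("text", "there"))
            }, 0, true);   // slop = 0, inOrder = true
            // near(inner, term(outerTerm), slop=1, any-direction)
            return new SpanNearQuery(new SpanQuery[] {
                    inner,
                    new SpanTermQuery(new Term("text", outerTerm))
            }, 1, false);  // slop = 1, inOrder = false
        }
    }

Against the document "hi there", build("there") matches while build("hi") does not, which is the asymmetry described above.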
Lucene 4.0 MemoryIndex Bug?
I've been playing around with Lucene's MemoryIndex, and anytime I try to call index.addField(String, String, Analyzer), I receive:

    java.lang.NoSuchMethodError: org.apache.lucene.util.BytesRef.deepCopyOf(Lorg/apache/lucene/util/BytesRef;)Lorg/apache/lucene/util/BytesRef;

I've tried inserting string literals and string objects, and it's not accepting anything. Digging around in the source code, I narrowed it down to this call in MemoryIndex:

    terms.put(BytesRef.deepCopyOf(ref), positions);

and ref is not null. I've also debugged the BytesRef.deepCopyOf method and it works fine. Any thoughts?

Thanks!
Stephen
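For context, a minimal self-contained use of the call being described might look like this (trunk-era 4.0 API; the analyzer, field name, and text are assumptions):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.util.Version;

    public class MemoryIndexDemo {
        public static void main(String[] args) {
            MemoryIndex index = new MemoryIndex();
            // The overload in question: field name, text, analyzer.
            index.addField("content", "some text to match",
                    new StandardAnalyzer(Version.LUCENE_40));
            // search() returns a relevance score; > 0 means a match.
            float score = index.search(new TermQuery(new Term("content", "text")));
            System.out.println(score);
        }
    }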
Re: Lucene 4.0 MemoryIndex Bug?
Hi,

You mixed incompatible jar file versions of Lucene 4.0 modules. Try recompiling everything from source.

Uwe

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de

Stephen Howe schrieb:
> I've been playing around with Lucene's MemoryIndex, and anytime I try to call
> index.addField(String, String, Analyzer) I receive a java.lang.NoSuchMethodError
> for org.apache.lucene.util.BytesRef.deepCopyOf. [...]
Re: Lucene 4.0 MemoryIndex Bug?
Anytime you see NoSuchMethodError, it means there is a bug in your configuration (wrong or out-of-date classes/jar files).

On Thu, Dec 8, 2011 at 3:55 PM, Stephen Howe wrote:
> I've been playing around with Lucene's MemoryIndex, and anytime I try to call
> index.addField(String, String, Analyzer) I receive a java.lang.NoSuchMethodError
> for org.apache.lucene.util.BytesRef.deepCopyOf. [...]

--
lucidimagination.com
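One way to confirm a mixed-jar problem like this is to ask the JVM which jar it actually loaded the class from. A small hypothetical diagnostic:

    import org.apache.lucene.util.BytesRef;

    public class WhichJar {
        public static void main(String[] args) {
            // Prints the jar (or directory) that supplied BytesRef; if this
            // differs from the jar MemoryIndex came from, the classpath mixes
            // module versions. Note: getCodeSource() can be null for classes
            // loaded by the bootstrap class loader.
            System.out.println(BytesRef.class.getProtectionDomain()
                    .getCodeSource().getLocation());
        }
    }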
Re: Split mutable logical document into two Lucene documents
Thank you for the pointer. I looked into nested documents, but it appears that the implementation relies on each parent document being indexed immediately before all of its children. Unfortunately, this presents two problems:

1. Any optimize operation will break the nesting.
2. Deleting and reindexing a child would break the parent-child hierarchy unless the parent was reindexed as well. Since this is the problem we're trying to solve in the first place, this doesn't get us where we need to be.

We also looked at ParallelReader, but that requires that the immutable/mutable pair be added at the exact same position in separate indexes. This is very brittle for our use, and it would require rebuilding the entire mutable index just to change a single value, or reindexing both the mutable and immutable information. Neither option is better than just keeping the mutable and immutable data together.

I think there are some things we could do with filters, but I think it will be easier and more flexible for us to have simple Lucene queries return a sorted list of document IDs (our full document identifier) and then perform the set-union, set-intersection, and set-inversion ourselves (see the sketch after this message).

Thanks for your time,
Brandon

On Thu, Dec 8, 2011 at 9:57 AM, Ian Lea wrote:
> It is conceivable that nested documents might help.
> https://issues.apache.org/jira/browse/LUCENE-2454. I don't know
> anything about that so might be way off target.
>
> -- Ian.
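A sketch of the sorted-ID combination mentioned above, in plain Java; intersection is shown, and union and inversion follow the same two-pointer pattern:

    import java.util.ArrayList;
    import java.util.List;

    public class SortedIdSets {
        // Linear-time intersection of two ascending-sorted ID lists.
        public static List<Long> intersect(List<Long> a, List<Long> b) {
            List<Long> out = new ArrayList<Long>();
            int i = 0, j = 0;
            while (i < a.size() && j < b.size()) {
                int cmp = a.get(i).compareTo(b.get(j));
                if (cmp == 0) { out.add(a.get(i)); i++; j++; }
                else if (cmp < 0) i++;  // advance the list with the smaller ID
                else j++;
            }
            return out;
        }
    }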
Re: Lucene 4.0 Index Format Finalization Timetable
While we are in constant sync due to the merge, Lucene would still be updated multiple times before a Solr 4 release, and that could happen at any time - so it's really not any different.

On Wednesday, December 7, 2011, Jamie Johnson wrote:
> Yeah, the biggest issue for us is that we're using the SolrCloud features.
> While I see some good things related to the Lucene and Solr code bases
> being merged, this is certainly a frustrating aspect of it, as I don't
> require some of the changes that are in Lucene 4.0 (notwithstanding
> anything that SolrCloud requires, that is).
>
> I think the best solution (assuming it works) is to try to lock a
> version of Lucene 4.0 while upgrading Solr. I'll have to test to see
> if this works or not, but at least it's something.
>
> On Wed, Dec 7, 2011 at 9:02 AM, Mike Sokolov wrote:
>> My personal view, as a bystander with no more information than you, is that
>> one has to assume there will be further index format changes before a 4.0
>> release. This is based on the number of changes in the last 9 months, and
>> the amount of activity on the dev list.
>>
>> For us the implication is we need to stick w/3.x for now. You might be in a
>> different situation if you really need the 4.0 changes. Maybe you can just
>> stick w/the current trunk and take responsibility for patching critical
>> bugfixes, hoping you won't have to recreate your index too many times...
>>
>> -Mike
>>
>> On 12/06/2011 09:48 PM, Jamie Johnson wrote:
>>> I suppose that's fair enough. Some quick googling shows that this has
>>> been asked many times with pretty much the same response. Sorry to
>>> add to the noise.
>>>
>>> On Tue, Dec 6, 2011 at 9:34 PM, Darren Govoni wrote:
>>>> I asked here[1] and it said "Ask again later."
>>>>
>>>> [1] http://8ball.tridelphia.net/
>>>>
>>>> On 12/06/2011 08:46 PM, Jamie Johnson wrote:
>>>>> Thanks Robert. Is there a timetable for that? I'm trying to gauge
>>>>> whether it is appropriate to push for my organization to move to the
>>>>> current Lucene 4.0 implementation (we're using SolrCloud, which is
>>>>> built against trunk), or if it's expected there will be changes to what
>>>>> is currently on trunk. I'm not looking for anything hard, just trying
>>>>> to plan as much as possible, understanding that this is one of the
>>>>> implications of using trunk.
>>>>>
>>>>> On Tue, Dec 6, 2011 at 6:48 PM, Robert Muir wrote:
>>>>>> On Tue, Dec 6, 2011 at 6:41 PM, Jamie Johnson wrote:
>>>>>>> Is there a timetable for when it is expected to be finalized?
>>>>>>
>>>>>> it will be finalized when Lucene 4.0 is released.
>>>>>>
>>>>>> --
>>>>>> lucidimagination.com

--
- Mark
http://www.lucidimagination.com
How to do remote debugging on a benchmark test (or any other test)?
Which file should I edit to set the following JVM arguments?

    -Xdebug -Xrunjdwp:transport=dt_socket,address=8886,server=y,suspend=y

Thanks!
hao
Getting RuntimeException: after flush: fdx size mismatch while Indexing
I am using Lucene 3.5 and want to create around 30 million documents. While indexing, I get the following exception:

    Caused by: java.lang.RuntimeException: after flush: fdx size mismatch: 7442 docs vs 32768 length in bytes of _ct.fdx file exists?=true
        at org.apache.lucene.index.StoredFieldsWriter.flush(StoredFieldsWriter.java:58)
        at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:59)
        at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:581)
        at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3623)
        at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3588)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2073)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2040)
        at com.cisco.ason.document.AbstractLuceneDocWriter.write(AbstractLuceneDocWriter.java:88)
        ... 10 more

I don't get this exception consistently. I am not setting maxBufferedDocs, so it keeps its default value of -1, meaning flushes are triggered by ramBufferSizeMB, which is at its default of 16.0 MB. Has anyone faced this problem? I would appreciate any suggestions.

--
Thanks for your time,
Jamir...
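For reference, the flush configuration described above might be expressed like this with the Lucene 3.5 API; the directory path and analyzer are assumptions:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class FlushConfig {
        public static IndexWriter open(File path) throws Exception {
            Directory dir = FSDirectory.open(path);
            IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_35,
                    new StandardAnalyzer(Version.LUCENE_35));
            // -1: do not flush by document count ...
            cfg.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
            // ... instead flush whenever buffered docs use ~16 MB of RAM.
            cfg.setRAMBufferSizeMB(16.0);
            return new IndexWriter(dir, cfg);
        }
    }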