Improving Lucene Search Performance

2011-12-08 Thread Dilshad K. P.
Hi,
Is there anything to take care of while creating an index to improve Lucene text
search speed?

Thanks And Regards
Dilshad K.P
* Confidentiality Statement/Disclaimer *

This message and any attachments are intended for the sole use of the intended 
recipient. It may contain confidential information. Any unauthorized use, 
dissemination or modification is strictly prohibited. If you are not the 
intended recipient, please notify the sender immediately then delete it from 
all your systems, and do not copy, use or print. Internet communications are 
not secure and it is the responsibility of the recipient to make sure that it 
is virus/malicious code exempt.
The company/sender cannot be responsible for any unauthorized alterations or 
modifications made to the contents. If you require any form of confirmation of 
the contents, please contact the company/sender. The company/sender is not 
liable for any errors or omissions in the content of this message.


Re: Spell check on a subset of an index ('namespace'-aware spell checker)

2011-12-08 Thread E. van Chastelet

Ian, thank you for your suggestions.

I have looked at TermEnum and TermDocs, but they don't offer a way to get 
terms together with their frequencies (used by our autocompleter class) 
from a filtered set of docs.


Eventually I implemented the following solution:
- In the source index, get all terms for namespace field
- For each namespace ns:
 * copy the source index to a new location
 * remove all documents that match (*:*) -(namespace:ns)
 * construct spellcheck/autocompletion index from index
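The steps above can be sketched roughly as follows in Lucene 3.x terms. This is a sketch, not the poster's actual code: `sourceDir`, `analyzer`, `namespaces`, `spellcheckDirFor` and the field names "namespace" and "spellField" are illustrative assumptions. `addIndexes(IndexReader...)` copies the index in its already-analyzed form, which is why the analyzed terms survive the copy.

```java
// For each namespace: copy the source index, delete everything outside the
// namespace, then build the spellcheck index from the surviving terms.
IndexReader source = IndexReader.open(sourceDir);
for (String ns : namespaces) {
    Directory work = new RAMDirectory();
    IndexWriter writer = new IndexWriter(work,
        new IndexWriterConfig(Version.LUCENE_31, analyzer));
    writer.addIndexes(source);                  // copy of the source index
    BooleanQuery others = new BooleanQuery();   // (*:*) -(namespace:ns)
    others.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
    others.add(new TermQuery(new Term("namespace", ns)),
        BooleanClause.Occur.MUST_NOT);
    writer.deleteDocuments(others);             // drop the other namespaces
    writer.close();

    IndexReader scoped = IndexReader.open(work);
    SpellChecker spell = new SpellChecker(spellcheckDirFor(ns));
    spell.indexDictionary(new LuceneDictionary(scoped, "spellField"));
    spell.close();
    scoped.close();
}
source.close();
```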

I still need to look into alternatives where I have one spellcheck 
and one autocompletion index for all namespaces, with support for namespace 
filtering. For the autocompleter this will be more difficult, 
because it should sort the completions on a frequency field that 
represents the frequency scoped to one namespace. But this has lower 
priority at the moment.
The main goal was to have spellchecking and autocompletion scoped to 
namespaces, where there is one source index containing all namespaces.


Regards,
Elmer

On 12/06/2011 03:40 PM, Ian Lea wrote:

There are utilities floating around for getting output from analyzers
- would that help?  I think there are some in LIA, probably others
elsewhere.  The idea being that you grab the stored fields from the
index, pass them through your analyzer, grab the output and use that.

Or can you do something with TermEnum and/or TermDocs.  Not sure
exactly what or how though ...


--
Ian.

On Tue, Dec 6, 2011 at 2:20 PM, E. van Chastelet wrote:

I'm still struggling with this.

I've tried to implement the solution mentioned in previous reply, but
unfortunately there is a blocking issue with this:
I cannot find a way to create another index from the source index such that
the new index has the field values in it. The only way to copy a
document's field values from one index to another is to have stored fields.
But stored fields hold "the original String in its entirety", and not the
analyzed String, which I need. Is there another way to copy documents (with
at least the spellcheck field) from one index to another?

Recap:
I have a source index holding documents for different namespaces. These
documents hold one field (analyzed) that should be used for spell checking.
I want to construct a spellchecker index for each namespace separately. To
accomplish this, I first get the list of namespaces (each document has a
namespace field in the original index). Then, for each namespace, I get the
list of documents that match this namespace. Then I'd like to use this
subset to construct a spellchecker index.

Regards,
Elmer


On 11/23/2011 03:28 PM, E. van Chastelet wrote:

I currently have an idea to get it done, but it's not a nice solution.

If we have an index Q with all documents for all namespaces, we first
extract the list of all terms that appear for the field namespace in Q (this
field indicates the namespace of the document).

Then, for each namespace n in the terms list:
  - Get all docs from Q that match +namespace:n
  - Construct a temporary index from these docs
  - Use this temporary index to construct the dictionary, which the
SpellChecker can use as input.
  - Call indexDictionary on SpellChecker to create spellcheck index for
current namespace.
  - Delete temporary index

We now have separate spell check indexes for each namespace.

Any suggestions for a cleaner solution?

Regards,
Elmer van Chastelet



On 11/10/2011 01:16 PM, E. van Chastelet wrote:

Hi all,

In our project we like to have the ability to get search results scoped
to one 'namespace' (as we call it). This can easily be achieved by using a
filter or just an additional must-clause.
For the spellchecker (and our autocompletion, which is a modified
spellchecker), the story seems different. The spell checker index is created
using a LuceneDictionary, which has an IndexReader as source. We would like
to get (spellcheck/autocomplete) suggestions that are scoped to one
namespace (i.e. field 'namespace' should have a particular value).
With a single source index containing docs for all namespaces, it does not
seem possible to create a spellcheck index for each namespace the ordinary
way.
Q1: Is there a way to construct a LuceneDictionary from a subset of a
single source index (all terms where namespace = %value%) ?

Another, maybe better solution is to customize the spellchecker by adding
an additional namespace field to the spellchecker index. At query-time, an
additional must-clause is added, scoping the suggestions to one (or more)
namespace(s). The advantage of this is to have a singleton spellchecker (or
at least the index reader) for all namespaces. This also means less open
files by our application (imagine if there are over 1000 namespaces).
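The query-time side of this idea might look roughly as follows. This is only a sketch under the assumption of a customized suggestion index where every suggestion document carries a "namespace" field; nothing here is stock SpellChecker API, and `ngramQuery`, `suggestionSearcher` and `ns` are placeholders.

```java
// Scope suggestion lookups to one namespace by ANDing a namespace clause
// onto the usual n-gram query against the (customized) spellcheck index.
BooleanQuery query = new BooleanQuery();
query.add(ngramQuery, BooleanClause.Occur.MUST);   // the usual suggestion clauses
query.add(new TermQuery(new Term("namespace", ns)),
    BooleanClause.Occur.MUST);
TopDocs hits = suggestionSearcher.search(query, 10);
```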
Q2: Will there be a significant penalty (say more than 50% slower) for
the additional must-clause at query time?

Q3: Or can you think of a better solution for this problem? :)

How we currently do it: we currently use Lucene 3.1 with Hibernate Search
and we 

Re: Improving Lucene Search Performance

2011-12-08 Thread Ian Lea
See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed.  Some of
the tips relate to indexing but most to search time stuff.


--
Ian.


On Thu, Dec 8, 2011 at 10:45 AM, Dilshad K. P.  wrote:
> Hi,
> Is there any thing to take care while creating index for improving lucene 
> text search speed.
>
> Thanks And Regards
> Dilshad K.P

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Score per position

2011-12-08 Thread Zhang, Lisheng
Hi,

A few days ago I asked a similar question:

1) in the upcoming Lucene 4.0, there is a feature sort of like a payload at the document level:

>lucene 4 has a feature called IndexDocValues which is essentially a
> payload per document per field.
>
> you can read about it here:
> http://www.searchworkings.org/blog/-/blogs/introducing-lucene-index-doc-values
> http://www.searchworkings.org/blog/-/blogs/apache-lucene-flexiblescoring-with-indexdocvalues
> http://www.searchworkings.org/blog/-/blogs/indexdocvalues-their-applications

2) you may consider using FieldCache along with CustomScoreQuery (my case is a 
timestamp field, but we can put whatever logic into the custom field at indexing 
time).

>>> you can simply index your timestamp (untokenzied) and wrap your query
>>> in a CustomScoreQuery. This query accepts your user query and a
>>> ValueSource. During search CustomScoreQuery calls your valuesource for
>>> each document that the user query scores and multiplies the result of
>>> the ValueSource into the score. Inside your valuesource you can simply
>>> get the timestamps from the FieldCache and calculate your custom
>>> boost...
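In Lucene 3.x function-query terms, the quoted advice might be sketched like this. The field name "timestamp" and `queryParser`/`searcher` are illustrative; the default CustomScoreQuery multiplies the ValueSource value into the user query's score.

```java
// Multiply a per-document value (read via the FieldCache) into the user
// query's score. "timestamp" must be indexed as a single untokenized value.
Query userQuery = queryParser.parse(userInput);
ValueSourceQuery boost =
    new ValueSourceQuery(new IntFieldSource("timestamp"));
Query scored = new CustomScoreQuery(userQuery, boost);
TopDocs hits = searcher.search(scored, 10);
```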

Best regards, Lisheng


-Original Message-
From: arnon ma [mailto:arnon...@yahoo.com]
Sent: Wednesday, December 07, 2011 4:26 AM
To: java-user@lucene.apache.org
Subject: Score per position


We have an application where every term position in a document is associated 
with an "engine score".
A term query should then be scored according to the sum of "engine scores" of 
the term in a document, rather than on the term frequency.
For example, term frequency of 5 with an average engine score of 100 should be 
equivalent to term frequency of 1 with engine score 500.
 
I understood that if I keep the engine score per position in the payload, I 
will be able to use scorePayload in combination with a summing version of 
PayloadFunction to get the sum of engine scores of a term in a document, and so 
will be able to achieve my goal.
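A summing PayloadFunction along these lines might look as follows. This is a sketch against the Lucene 3.x payloads API: the class name is made up, and `currentPayloadScore` is assumed to be the per-occurrence value already decoded by the Similarity's scorePayload.

```java
// Sum the decoded payload ("engine score") of every occurrence of the term
// in a document, instead of averaging or taking min/max.
public class SumEngineScoreFunction extends PayloadFunction {
    @Override
    public float currentScore(int docId, String field, int start, int end,
            int numPayloadsSeen, float currentScore, float currentPayloadScore) {
        return currentScore + currentPayloadScore;   // running sum
    }
    @Override
    public float docScore(int docId, String field,
            int numPayloadsSeen, float payloadScore) {
        return payloadScore;                         // the sum itself
    }
    @Override public int hashCode() { return getClass().hashCode(); }
    @Override public boolean equals(Object o) {
        return o instanceof SumEngineScoreFunction;
    }
}
```

It would then be plugged in with something like `new PayloadTermQuery(term, new SumEngineScoreFunction(), false)`, where `false` replaces the span score with the payload score rather than multiplying the two.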
 
There are two issues with this solution:
1. Even the simplest term query would have to scan the positions file in order 
to get the payloads, which could be a performance issue.
We would prefer to index the sum of engine scores in advance per document, in 
addition to the term frequency. This is some sort of payload at the document 
level. Does Lucene support that, or is there any other solution for this issue?
 
2. The "engine score" of a phrase occurrence is defined as the product of the 
engine scores of the terms that compose the phrase.
So in scorePayload I need the payloads of all the terms in the phrase in order 
to be able to appropriately score the phrase occurrence.
As far as I understand, the current interface of scorePayload does not provide 
this information.
Is there another way this can be achieved in Lucene?
 
Thanks in advance,
Arnon.




Re: Split mutable logical document into two Lucene documents

2011-12-08 Thread Ian Lea
It is conceivable that nested documents might help.
https://issues.apache.org/jira/browse/LUCENE-2454.  I don't know
anything about that so might be way off target.


--
Ian.


On Wed, Dec 7, 2011 at 8:46 PM, Brandon Mintern  wrote:
> We have a document tagging system where documents are composed of two
> types of data:
>
> Rarely changed (hereafter: "immutable") data - document text and
> metadata that we upload and almost never change. The text can be
> hundreds of pages.
>
> User created (hereafter: "mutable") data - document properties that
> are set by users of our system. In total a document's properties are
> generally several dozen bytes at most. Even viewing a document changes
> the data (e.g. the document's "viewed" property).
>
>
> At present, all data is part of a single Lucene document. The problem
> is that when any piece of mutable data is updated (this happens
> relatively frequently), we have to reindex the entire document. We'd
> like to have two separate indexed Lucene documents per logical
> document, one containing the immutable data and the other containing
> the much smaller and more transient mutable data. When the mutable
> data changes, we can delete that document's mutable Lucene document
> and index a new one very quickly.
>
> There are two major difficulties when actually performing a search, though:
>
> 1. We are providing complex queries to retrieve logical documents
> based on information in either of its Lucene documents. It seems
> non-trivial to fetch a logical document in a BooleanQuery with
> Occur.MUST clauses referring to fields in both of the Lucene
> documents.
>
> 2. We need to sort results (logical document IDs) based on fields in
> either of its Lucene documents.
>
> Has anyone done anything like this before? Is there functionality I'm
> overlooking that could make this easier?
>




Re: SpanNearQuery and matching spans inside the first span

2011-12-08 Thread Ian Lea
Have you read http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/ ?
It might help explain some of the behaviour you are seeing.


--
Ian.


On Tue, Dec 6, 2011 at 4:42 AM, Trejkaz  wrote:
> Supposing I have a document with just "hi there" as the text.
>
> If I do a span query like this:
>
>    near(near(term('hi'), term('there'), slop=0, forwards),
> term('hi'), slop=1, any-direction)
>
> that returns no hits.  However, if I do a span query like this:
>
>    near(near(term('hi'), term('there'), slop=0, forwards),
> term('there'), slop=1, any-direction)
>
> that returns the document.
>
> It seems that the rule is that if the two spans *start* at the same
> position, then they are not considered "near" each other.  But from
> the POV of a user (and from this developer) this is lop-sided because
> in both situations, the second span was inside the first span.  It
> seems like they should either both be considered hits, or both be
> considered non-hits.
>
> I am wondering what others think about this and whether there is any
> way to manipulate/rewrite the query to get a more balanced-looking
> result.
>
> (I'm sure it gets particularly hairy, though, when your two spans
> overlap only partially... is that "near" or not?)
>
> TX
>




Lucene 4.0 MemoryIndex Bug?

2011-12-08 Thread Stephen Howe
I've been playing around with Lucene's MemoryIndex and anytime I try to use
index.addField(String, String, Analyzer), I
receive: java.lang.NoSuchMethodError:
org.apache.lucene.util.BytesRef.deepCopyOf(Lorg/apache/lucene/util/BytesRef;)Lorg/apache/lucene/util/BytesRef;
anytime I make a call to it. I've tried inserting string literals and
string objects and it's not taking anything.

Digging around in the source code, I narrowed it down to this call in
MemoryIndex: terms.put(BytesRef.deepCopyOf(ref), positions); and ref is not
null. I've also debugged the BytesRef.deepCopyOf method and it is working
fine.

Any thoughts?

Thanks!
Stephen


Re: Lucene 4.0 MemoryIndex Bug?

2011-12-08 Thread Uwe Schindler
Hi,

You mixed incompatible jar file versions of Lucene 4.0 modules. Try to 
recompile everything from source.

Uwe
--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de



Stephen Howe  schrieb:

I've been playing around with Lucene's MemoryIndex and anytime I try to use
index.addField(String, String, Analyzer), I
receive: java.lang.NoSuchMethodError:
org.apache.lucene.util.BytesRef.deepCopyOf(Lorg/apache/lucene/util/BytesRef;)Lorg/apache/lucene/util/BytesRef;
anytime I make a call to it. I've tried inserting string literals and
string objects and it's not taking anything.

Digging around in the source code, I narrowed it down to this call in
MemoryIndex: terms.put(BytesRef.deepCopyOf(ref), positions); and ref is not
null. I've also debugged BytesRef.deepCopyOf command and it is working
fine.

Any thoughts?

Thanks!
Stephen



Re: Lucene 4.0 MemoryIndex Bug?

2011-12-08 Thread Robert Muir
Any time you see a NoSuchMethodError, it means there is a bug in your
configuration (wrong or out-of-date classes/jar files).

On Thu, Dec 8, 2011 at 3:55 PM, Stephen Howe  wrote:
> I've been playing around with Lucene's MemoryIndex and anytime I try to use
> index.addField(String, String, Analyzer), I
> receive: java.lang.NoSuchMethodError:
> org.apache.lucene.util.BytesRef.deepCopyOf(Lorg/apache/lucene/util/BytesRef;)Lorg/apache/lucene/util/BytesRef;
> anytime I make a call to it. I've tried inserting string literals and
> string objects and it's not taking anything.
>
> Digging around in the source code, I narrowed it down to this call in
> MemoryIndex: terms.put(BytesRef.deepCopyOf(ref), positions); and ref is not
> null. I've also debugged BytesRef.deepCopyOf command and it is working
> fine.
>
> Any thoughts?
>
> Thanks!
> Stephen



-- 
lucidimagination.com




Re: Split mutable logical document into two Lucene documents

2011-12-08 Thread Brandon Mintern
Thank you for the pointer. I looked into nested documents, but it
appears that the implementation relies on each parent document being
indexed immediately before all of its children. Unfortunately, this
presents two problems:

1. Any optimize operation will break nesting
2. Deleting and reindexing a child would break the parent-child
hierarchy unless the parent was reindexed as well. Since this is the
problem we're trying to solve in the first place, this doesn't seem to
get us where we need to be.

We also looked at ParallelReader, but that requires that the
immutable/mutable pair be added at exactly the same position in
separate indexes. This is very brittle for our use, and it would
require rebuilding the entire mutable index just to change a single
value, or reindexing both the mutable and immutable information.
Neither solution is better than just keeping the mutable and immutable
data together.

I think there are some things we could do with filters, but I think it
will be easier and more flexible for us to have simple Lucene queries
return a sorted list of document IDs (our full document identifier)
and then perform set-union, set-intersection, and set-inversion
ourselves.
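The linear-pass set operations over sorted ID lists mentioned above are straightforward; a minimal sketch (plain Java, independent of Lucene, class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Union and intersection of two sorted lists of logical document IDs
// (one list per Lucene query), each in a single linear merge pass.
public class IdSets {
    public static List<String> union(List<String> a, List<String> b) {
        List<String> out = new ArrayList<String>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int c = a.get(i).compareTo(b.get(j));
            if (c < 0) out.add(a.get(i++));
            else if (c > 0) out.add(b.get(j++));
            else { out.add(a.get(i)); i++; j++; }   // present in both: add once
        }
        while (i < a.size()) out.add(a.get(i++));   // drain the leftovers
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    public static List<String> intersection(List<String> a, List<String> b) {
        List<String> out = new ArrayList<String>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int c = a.get(i).compareTo(b.get(j));
            if (c < 0) i++;
            else if (c > 0) j++;
            else { out.add(a.get(i)); i++; j++; }   // keep only shared IDs
        }
        return out;
    }
}
```

Set-inversion would be the analogous pass against the sorted list of all logical document IDs.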

Thanks for your time,
Brandon

On Thu, Dec 8, 2011 at 9:57 AM, Ian Lea  wrote:
> It is conceivable that nested documents might help.
> https://issues.apache.org/jira/browse/LUCENE-2454.  I don't know
> anything about that so might be way off target.
>
>
> --
> Ian.
>
>
> On Wed, Dec 7, 2011 at 8:46 PM, Brandon Mintern  wrote:
>> We have a document tagging system where documents are composed of two
>> types of data:
>>
>> Rarely changed (hereafter: "immutable") data - document text and
>> metadata that we upload and almost never change. The text can be
>> hundreds of pages.
>>
>> User created (hereafter: "mutable") data - document properties that
>> are set by users of our system. In total a document's properties are
>> generally several dozen bytes at most. Even viewing a document changes
>> the data (e.g. the document's "viewed" property).
>>
>>
>> At present, all data is part of a single Lucene document. The problem
>> is that when any piece of mutable data is updated (this happens
>> relatively frequently), we have to reindex the entire document. We'd
>> like to have two separate indexed Lucene documents per logical
>> document, one containing the immutable data and the other containing
>> the much smaller and more transient mutable data. When the mutable
>> data changes, we can delete that document's mutable Lucene document
>> and index a new one very quickly.
>>
>> There are two major difficulties when actually performing a search, though:
>>
>> 1. We are providing complex queries to retrieve logical documents
>> based on information in either of its Lucene documents. It seems
>> non-trivial to fetch a logical document in a BooleanQuery with
>> Occur.MUST clauses referring to fields in both of the Lucene
>> documents.
>>
>> 2. We need to sort results (logical document IDs) based on fields in
>> either of its Lucene documents.
>>
>> Has anyone done anything like this before? Is there functionality I'm
>> overlooking that could make this easier?
>>
>




Re: Lucene 4.0 Index Format Finalization Timetable

2011-12-08 Thread Mark Miller
While we are in constant sync due to the merge, Lucene would still be
updated multiple times before a Solr 4 release, and that could happen
at any time - so it's really not any different.

On Wednesday, December 7, 2011, Jamie Johnson  wrote:
> Yeah, biggest issue for us is we're using the SolrCloud features.
> While I see some good things related to the Lucene and Solr code bases
> being merged, this is certainly a frustrating aspect of it as I don't
> require some of the changes that are in Lucene 4.0 (notwithstanding
> anything that SolrCloud requires, that is).
>
> I think the best solution (assuming it works) is to try to lock a
> version of Lucene 4.0 while upgrading Solr.  I'll have to test to see
> if this works or not, but at least it's something.
>
> On Wed, Dec 7, 2011 at 9:02 AM, Mike Sokolov  wrote:
>> My personal view, as a bystander with no more information than you, is
that
>> one has to assume there will be further index format changes before a 4.0
>> release.  This is based on the number of changes in the last 9 months,
and
>> the amount of activity on the dev list.
>>
>> For us the implication is we need to stick w/3.x for now.  You might be
in a
>> different situation if you really need the 4.0 changes.  Maybe you can
just
>> stick w/the current trunk and take responsibility for patching critical
>> bugfixes, hoping you won't have to recreate your index too many times...
>>
>> -Mike
>>
>>
>> On 12/06/2011 09:48 PM, Jamie Johnson wrote:
>>>
>>> I suppose that's fair enough.  Some quick googling seems that this has
>>> been asked many times with pretty much the same response.  Sorry to
>>> add to the noise.
>>>
>>> On Tue, Dec 6, 2011 at 9:34 PM, Darren Govoni
 wrote:
>>>

 I asked here[1] and it said "Ask again later."

 [1] http://8ball.tridelphia.net/


 On 12/06/2011 08:46 PM, Jamie Johnson wrote:

>
> Thanks Robert.  Is there a timetable for that?  I'm trying to gauge
> whether it is appropriate to push for my organization to move to the
> current lucene 4.0 implementation (we're using solr cloud which is
> built against trunk) or if it's expected there will be changes to what
> is currently on trunk.  I'm not looking for anything hard, just trying
> to plan as much as possible understanding that this is one of the
> implications of using trunk.
>
> On Tue, Dec 6, 2011 at 6:48 PM, Robert Muir
 wrote:
>
>>
>> On Tue, Dec 6, 2011 at 6:41 PM, Jamie Johnson
>>  wrote:
>>
>>>
>>> Is there a timetable for when it is expected to be finalized?
>>>
>>
>> it will be finalized when Lucene 4.0 is released.
>>
>> --
>> lucidimagination.com
>>
>




>>>

-- 
- Mark

http://www.lucidimagination.com


how to do remote debug on benchmark test or whatever test?

2011-12-08 Thread hao yan
Which file should I edit to set:

-Xdebug -Xrunjdwp:transport=dt_socket,address=8886,server=y,suspend=y ?

thanks!

hao




Getting RuntimeException: after flush: fdx size mismatch while Indexing

2011-12-08 Thread Jamir Shaikh
I am using Lucene 3.5 and want to index around 30 million documents.
While indexing I am getting the following exception:

Caused by: java.lang.RuntimeException: after flush: fdx size mismatch: 7442
docs vs 32768 length in bytes of _ct.fdx file exists?=true
    at org.apache.lucene.index.StoredFieldsWriter.flush(StoredFieldsWriter.java:58)
    at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:59)
    at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:581)
    at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3623)
    at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3588)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2073)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2040)
    at com.cisco.ason.document.AbstractLuceneDocWriter.write(AbstractLuceneDocWriter.java:88)
    ... 10 more


I don't get this exception consistently.

I am not setting maxBufferedDocs (the default value is -1), so flushing is
governed by ramBufferSizeMB, which is set to the default of 16.0 MB.

Has anyone faced this problem?

I would appreciate any suggestions.


-- 
Thanks for your time,
Jamir...