RE: Query regarding usage of Lucene (Filtering folder)

2008-02-28 Thread Daan de Wit
This sure is possible with Lucene. What you need to do is index the path
along with your documents, so you get a field like this: `path:
/subfolder/subsubfolder`. Now you can restrict your search to a specific
path. Including subfolders in the search can be done by adding a '*' to
the path used in the query.
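
(Purely as illustration, a minimal sketch against the Lucene 2.x API; the
field name "path" and the assumption that it was indexed UN_TOKENIZED are
mine, not Daan's:)

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

// Combine the user's content query with a path restriction. A
// PrefixQuery on an untokenized "path" field matches the folder and
// everything below it -- the trailing-'*' behaviour described above.
public static Query restrictToFolder(Query contentQuery, String folderPrefix) {
    BooleanQuery q = new BooleanQuery();
    q.add(contentQuery, BooleanClause.Occur.MUST);
    q.add(new PrefixQuery(new Term("path", folderPrefix)), BooleanClause.Occur.MUST);
    return q;
}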

Kind regards,
Daan

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 28, 2008 8:00
To: java-user@lucene.apache.org
Subject: Query regarding usage of Lucene(Filtering folder)

Hi All,

 

I have a query regarding the usage of Lucene.

I have indexed the files kept in a root folder -> subfolder ->
subfolder structure.

When I search for a particular word, it returns the list of matching
files from across the whole folder structure, right from the root to the
last subfolder.

I want to restrict the search to a specific folder only.

Is this possible with Lucene?

If yes, please suggest the steps to follow.

 

Thanks,

Mohammad

 

 







Re: Vector Space Model: New Similarity Implementation Issues

2008-02-28 Thread Grant Ingersoll
Not sure I am understanding what you are asking, but I will give it a
shot. See below.



On Feb 26, 2008, at 3:45 PM, Dharmalingam wrote:

> Hi List,
>
> I am pretty new to Lucene. Certainly, it is very exciting. I need to
> implement a new Similarity class based on the Term Vector Space Model
> given in http://www.miislita.com/term-vector/term-vector-3.html
>
> Although that model is similar to Lucene’s model
> (http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html),
> I am having a hard time extending the Similarity class to calculate
> that model.
>
> In that model, “tf” is multiplied with idf for all terms in the index,
> but in Lucene “tf” is calculated only for terms in the given Query.
> Because of that effect, the norm calculation should also include “idf”
> for all terms. Lucene calculates the norm, during indexing, by “just”
> counting the number of terms per document. In the web formula (on
> miislita.com), a document norm is calculated after multiplying “tf”
> and “idf”.

Are you wondering if there is a way to score all documents regardless
of whether the document has the term or not? I don't quite get your
statement: "In that model, “tf” is multiplied with idf for all terms
in the index, but in Lucene “tf” is calculated only for terms in the
given Query."

Isn't the result for those documents that don't have query terms just
going to be 0, or am I not fully understanding? I briefly skimmed the
paper you cite and it doesn't seem that different; it's just
describing Salton's VSM, right?





> FYI: I could implement “idf” according to the miislita.com formula,
> but not the “tf” and “norm”.
>
> Could you please advise me on how I can implement a new Similarity
> class that will fit into Lucene’s architecture, but still implement
> the vector space model given on miislita.com?


In the end, you may need to implement some lower level Query classes,  
but I still don't fully understand what you are trying to do, so I  
wouldn't head down that path just yet.


--
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ









RE: How do i get a text summary

2008-02-28 Thread spring
> If you want something from an index it has to be IN the 
> index. So, store a
> summary field in each document and make sure that field is part of the
> query.

And how could one automatically create such a summary?
Taking the first 2 lines of a document does not always make much sense.
How does Google do this?

Thank you.





Re: How do i get a text summary

2008-02-28 Thread Mathieu Lecarme

[EMAIL PROTECTED] wrote:
> If you want something from an index it has to be IN the index. So,
> store a summary field in each document and make sure that field is
> part of the query.
>
> And how could one automatically create such a summary?

Have a look at http://alias-i.com/lingpipe/index.html or
http://www.nzdl.org/Kea/

Summarizing is a data-mining task.

> Taking the first 2 lines of a document does not always make much sense.
> How does Google do this?

The simplest way is to show the text in context: n words before and n
words after the match.
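
(A naive sketch of that context-window idea; the whitespace tokenization
and single-term match are simplifications of mine:)

// Return up to n words of context on each side of the first match.
static String contextSnippet(String text, String term, int n) {
    String[] words = text.split("\\s+");
    for (int i = 0; i < words.length; i++) {
        if (words[i].equalsIgnoreCase(term)) {
            int from = Math.max(0, i - n);
            int to = Math.min(words.length, i + n + 1);
            StringBuilder sb = new StringBuilder();
            for (int j = from; j < to; j++) {
                if (j > from) sb.append(' ');
                sb.append(words[j]);
            }
            return sb.toString();
        }
    }
    return "";
}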

M.




RE: How do i get a text summary

2008-02-28 Thread Donna L Gresh
I think you may want to look into the Highlighter. It allows you to show 
the "relevant" bits of the document which contributed to the document 
being matched to the query. It does a pretty good job. Of course it does 
not create a "summary" but it does give you a good idea of why the 
document was hit.

http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/highlight/Highlighter.html
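
(A minimal sketch of using the contrib Highlighter; "contents" is an
example field name, and query/analyzer/text come from your own search
code:)

import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

// Build the best-matching fragment for one document's text; the
// default formatter wraps matching terms in <B>...</B>.
Highlighter highlighter = new Highlighter(new QueryScorer(query));
String fragment = highlighter.getBestFragment(analyzer, "contents", text);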

Donna Gresh


<[EMAIL PROTECTED]> wrote on 02/28/2008 07:42:40 AM:

> > If you want something from an index it has to be IN the 
> > index. So, store a
> > summary field in each document and make sure that field is part of the
> > query.
> 
> And how could one automatically create such a summary?
> Taking the first 2 lines of a document does not always make much sense.
> How does Google do this?
> 
> Thank you.
> 
> 
> 


Re: Vector Space Model: New Similarity Implementation Issues

2008-02-28 Thread Dharmalingam

Thanks for the reply. Sorry if my explanation is not clear. Yes, you are
correct: the model is based on Salton's VSM. However, the calculation of the
term weight and the doc norm is, in my opinion, different from Lucene. If
you look at the table given in
http://www.miislita.com/term-vector/term-vector-3.html, they calculate the
document norm based on the weight wi=tfi*idfi. I looked at the interfaces of
the Similarity and DefaultSimilarity classes. I paste it below:

public float lengthNorm(String fieldName, int numTerms) {
    return (float)(1.0 / Math.sqrt(numTerms));
}

You can see that this lengthNorm for a doc is quite different from that
website's norm calculation.
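
(For reference, assuming the usual reading of the miislita table, the norm
it computes is |d_j| = sqrt( sum_i (tf_ij * idf_i)^2 ), i.e. the Euclidean
length of the tf*idf-weighted document vector, whereas Lucene's lengthNorm
above is simply 1/sqrt(numTerms).)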

Similarly, the queryNorm method of the DefaultSimilarity class is:

/** Implemented as 1/sqrt(sumOfSquaredWeights). */
public float queryNorm(float sumOfSquaredWeights) {
    return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
}

This is again different from the website model.

I also have difficulties with the tf method of DefaultSimilarity:

/** Implemented as sqrt(freq). */
public float tf(float freq) {
    return (float)Math.sqrt(freq);
}

In that website model, a tf refers to the frequency of a term within a doc.

I hope I have explained it better; please let me know if it is unclear. I am
looking for an easy way to implement that table, and of course I want to
integrate it with my Lucene setup (i.e., myIndexWriter.setSimilarity(new
MySimilarity());). Will this be possible by just inheriting from the base
classes of Lucene?

Thanks for your advice.

Grant Ingersoll-6 wrote:
> 
> Not sure I am understanding what you are asking, but I will give it a  
> shot.   See below
> 
> 
> On Feb 26, 2008, at 3:45 PM, Dharmalingam wrote:
> 
>>
>> Hi List,
>>
>> I am pretty new to Lucene. Certainly, it is very exciting. I need to
>> implement a new Similarity class based on the Term Vector Space  
>> Model given
>> in http://www.miislita.com/term-vector/term-vector-3.html
>>
>> Although that model is similar to Lucene’s model
>> (http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html
>>  
>> ),
>> I am having hard time to extend the Similarity class to calculate that
>> model.
>>
>> In that model, “tf” is multiplied with Idf for all terms in the  
>> index, but
>> in Lucene “tf” is calculated only for terms in the given Query.  
>> Because of
>> that effect, the norm calculation should also include “idf” for all  
>> terms.
>> Lucene calculates the norm, during indexing, by “just” counting the  
>> number
>> of terms per document. In the web formula (in miislita.com), a  
>> document norm
>> is calculated after multiplying “tf” and “idf”.
> 
> Are you wondering if there is a way to score all documents regardless  
> of whether the document has the term or not?  I don't quite get your  
> statement: "In that model, “tf” is multiplied with Idf for all terms  
> in the index, but in Lucene “tf” is calculated only for terms in the  
> given Query."
> 
> Isn't the result for those documents that don't have query terms just  
> going to be 0 or am I not fully understanding?  I briefly skimmed the  
> paper you cite and it doesn't seem that different, it's just  
> describing the Salton's VSM right?
> 
>>
>>
>> FYI: I could implement “idf” according to miisliat.com formula, but  
>> not the
>> “tf” and “norm”
>>
>> Could you please comment me how I can implement a new Similarity  
>> class that
>> will fit in the Lucene’s architecture, but still implement the  
>> vector space
>> model given in miislita.com
> 
> In the end, you may need to implement some lower level Query classes,  
> but I still don't fully understand what you are trying to do, so I  
> wouldn't head down that path just yet.
> 
> --
> Grant Ingersoll
> http://www.lucenebootcamp.com
> Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
> 
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
> 
> 
> 
> 
> 
> 
> 
> 
> 






Indexing source code files

2008-02-28 Thread Dharmalingam

I am working on some sort of search mechanism to link a requirement (i.e., a
query) to source code files (i.e., documents). For that purpose, I indexed
the source code files using Lucene. Contrary to the traditional
natural-language search scenario, we search for code files that are relevant
to a given requirement. One problem here is that the source files usually
contain a lot of abbreviations, words joined by _, or combinations of words
and/or abbreviations (e.g., getAccountBalanceTbl). I am wondering whether
any of you have already indexed (source) files or documents which contain
that kind of word.





Lucene-Highlight words in a searched docs

2008-02-28 Thread Ravinder.Teepiredddy
Hi All,

 

How do we highlight words in searched docs? Please give inputs on using the
"rewritten query as the input for the highlighter, i.e. call rewrite()
on the query".

 

Thanks,

Ravinder 





Re: Indexing source code files

2008-02-28 Thread Mathieu Lecarme

Dharmalingam wrote:

> I am working on some sort of search mechanism to link a requirement
> (i.e., a query) to source code files (i.e., documents). For that
> purpose, I indexed the source code files using Lucene. Contrary to the
> traditional natural-language search scenario, we search for code files
> that are relevant to a given requirement. One problem here is that the
> source files usually contain a lot of abbreviations, words joined by _,
> or combinations of words and/or abbreviations (e.g.,
> getAccountBalanceTbl). I am wondering whether any of you have already
> indexed (source) files or documents which contain that kind of word.

You need a specific Tokenizer.
You will use several Fields: class, method, comments, code, javadoc.
Some fields can use a casual tokenizer (comments); others need a specific
one that splits oneJavaWord into several words.
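
(A rough sketch of that field layout against the Lucene 2.x API; the field
names and the two analyzers -- commentsAnalyzer and codeAnalyzer -- are
placeholders of mine, not a fixed scheme:)

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// One document per source file, with separate fields per region.
Document doc = new Document();
doc.add(new Field("class",    className, Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("comments", comments,  Field.Store.NO,  Field.Index.TOKENIZED));
doc.add(new Field("code",     codeText,  Field.Store.NO,  Field.Index.TOKENIZED));

// Wire a casual analyzer to "comments" and an identifier-splitting
// one to "code".
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(commentsAnalyzer);
analyzer.addAnalyzer("code", codeAnalyzer);
writer.addDocument(doc, analyzer);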


M.




Re: Indexing source code files

2008-02-28 Thread Ken Krugler

> I am working on some sort of search mechanism to link a requirement
> (i.e., a query) to source code files (i.e., documents). For that
> purpose, I indexed the source code files using Lucene. Contrary to the
> traditional natural-language search scenario, we search for code files
> that are relevant to a given requirement. One problem here is that the
> source files usually contain a lot of abbreviations, words joined by _,
> or combinations of words and/or abbreviations (e.g.,
> getAccountBalanceTbl). I am wondering whether any of you have already
> indexed (source) files or documents which contain that kind of word.


Yes, that's been something we've spent a fair amount of time on...see 
http://www.krugle.org (public code search).


As Mathieu noted, the first thing you really want to do is split the 
file up into at least comments vs. code. Then you can use a regular 
analyzer (or perhaps something more human language-specific, e.g. 
with stemming support) on the comment text, and your own custom 
tokenizer on the code.


In the code, you might further want to treat literals (strings, etc) 
differently than other terms.


And in "real" code terms, then you want to do essentially synonym 
processing, where you turn a single term into multiple terms based on 
the automatic splitting of the term using '_', '-', camelCasing, 
letter/digit transitions, etc.
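
(Purely illustrative -- one way such splitting might look in plain Java; a
real tokenizer would also emit the original term as a synonym at the same
position:)

import java.util.ArrayList;
import java.util.List;

// getAccountBalanceTbl -> [get, account, balance, tbl]
static List<String> splitIdentifier(String id) {
    List<String> parts = new ArrayList<String>();
    // Break on '_'/'-', then on lower->upper and letter<->digit transitions.
    for (String chunk : id.split("[_\\-]+")) {
        for (String part : chunk.split(
                "(?<=[a-z])(?=[A-Z])|(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])")) {
            if (part.length() > 0) parts.add(part.toLowerCase());
        }
    }
    return parts;
}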


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"




How to obtain the freq term vector of a field from a remote index ?

2008-02-28 Thread Ariel
Hi folks:

I need to know how to get the term frequency vector of a field from a remote
index on another host.
I know that the IndexSearcher class has a method,
getIndexReader().getTermFreqVector(idDoc, fieldName), to get the term
frequency vector of a certain field, but I am using RemoteSearchable, which
is a Searcher, because my search functionality is in an RMI server. I access
the RemoteSearchable from another host to obtain the hits, but so far I
haven't found a way to also obtain the term frequency vector of a certain
field.
Do you know if it is possible to do that? How can I do it?
Any help is appreciated.
Greetings,
Ariel


Re: How do i get a text summary

2008-02-28 Thread Karl Wettin

[EMAIL PROTECTED] wrote:
> If you want something from an index it has to be IN the index. So,
> store a summary field in each document and make sure that field is
> part of the query.
>
> And how could one automatically create such a summary?
> Taking the first 2 lines of a document does not always make much sense.
> How does Google do this?


Google doesn't summarize; it highlights parts that match the query. See 
previous responses.


If you really want to summarize, there are a number of more and less 
scientific ways to figure out what's important and what's not.


Very simple algorithmic solutions usually involve ranking the top sentences 
by looking at the distribution of terms in sentences, paragraphs, and the 
whole document. I implemented something like this a couple of years back 
that worked fairly well.
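
(A very rough sketch of that kind of ranking, with naive tokenization and
sentence splitting, and no length normalization -- all simplifications of
mine:)

import java.util.HashMap;
import java.util.Map;

// Score each sentence by the document-wide frequency of its terms and
// return the top-scoring one. A real version would normalize by
// sentence length and drop stopwords.
static String topSentence(String document) {
    String[] sentences = document.split("(?<=[.!?])\\s+");
    Map<String, Integer> freq = new HashMap<String, Integer>();
    for (String w : document.toLowerCase().split("\\W+")) {
        Integer c = freq.get(w);
        freq.put(w, c == null ? 1 : c + 1);
    }
    String best = "";
    int bestScore = -1;
    for (String s : sentences) {
        int score = 0;
        for (String w : s.toLowerCase().split("\\W+")) {
            Integer c = freq.get(w);
            if (c != null) score += c;
        }
        if (score > bestScore) { bestScore = score; best = s; }
    }
    return best;
}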


CiteSeer is a great source for papers on pretty much any IR-related 
subject: 



   karl




RE: Lucene-Highlight words in a searched docs

2008-02-28 Thread Mitchell, Erica
Hi Ravinder

Check out HighlighterTest in the
lucene-2.3.1\contrib\highlighter\src\test\org\apache\lucene\search\highlight\
folder.
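
(The rewrite() step the question refers to looks roughly like this; query
and reader come from your own search code, and this is a sketch, not the
full highlighting loop:)

import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

// Expand wildcard/prefix/fuzzy queries into primitive term queries so
// the highlighter can see the actual terms being matched.
Query rewritten = query.rewrite(reader);
Highlighter highlighter = new Highlighter(new QueryScorer(rewritten));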

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] 
Sent: 28 February 2008 11:03
To: java-user@lucene.apache.org
Subject: Lucene-Highlight words in a searched docs

Hi All,

 

How do we highlight words in searched docs? Please give inputs on using the
"rewritten query as the input for the highlighter, i.e. call rewrite()
on the query".

 

Thanks,

Ravinder 









Re: Vector Space Model: New Similarity Implementation Issues

2008-02-28 Thread Grant Ingersoll


On Feb 28, 2008, at 9:00 AM, Dharmalingam wrote:



> Thanks for the reply. Sorry if my explanation is not clear. Yes, you
> are correct: the model is based on Salton's VSM. However, the
> calculation of the term weight and the doc norm is, in my opinion,
> different from Lucene. If you look at the table given in
> http://www.miislita.com/term-vector/term-vector-3.html, they calculate
> the document norm based on the weight wi=tfi*idfi. I looked at the
> interfaces of the Similarity and DefaultSimilarity classes. I paste it
> below:
>
> public float lengthNorm(String fieldName, int numTerms) {
>     return (float)(1.0 / Math.sqrt(numTerms));
> }
>
> You can see that this lengthNorm for a doc is quite different from
> that website's norm calculation.


The lengthNorm method is different from the IDF calculation.  In the  
Similarity class, that is handled by the idf() method.  Length norm is  
an attempt to address one of the limitations listed further down in  
that paper:
"Long Documents: Very long documents make similarity measures  
difficult (vectors with small dot products and high dimensionality)"







> Similarly, the queryNorm method of the DefaultSimilarity class is:
>
> /** Implemented as 1/sqrt(sumOfSquaredWeights). */
> public float queryNorm(float sumOfSquaredWeights) {
>     return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
> }
>
> This is again different from the website model.


Query norm is an attempt to allow for comparison of scores across  
queries, but I don't think one should do that anyway.






> I also have difficulties with the tf method of DefaultSimilarity:
>
> /** Implemented as sqrt(freq). */
> public float tf(float freq) {
>     return (float)Math.sqrt(freq);
> }



These are all callback methods from within the Scorer classes that  
each Query uses.  Have a look at TermScorer for how these things get  
called.



Try this as an example:

Set up a really simple index with 1 or 2 docs, each with a few words.
Set up a simple Similarity class where you override all of these
methods to return 1 (or some simple default), and then index your
documents and do a few queries.

Then, have a look at Searcher.explain() to see why a document scores
the way it does. From there, you can work to modify it.
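
(A sketch of that flat Similarity; with every factor neutralized, explain()
makes it obvious where each remaining number comes from. The class name and
the choice of overridden methods are mine:)

import org.apache.lucene.search.DefaultSimilarity;

public class FlatSimilarity extends DefaultSimilarity {
    public float lengthNorm(String fieldName, int numTerms) { return 1.0f; }
    public float queryNorm(float sumOfSquaredWeights)       { return 1.0f; }
    public float tf(float freq)                             { return 1.0f; }
    public float idf(int docFreq, int numDocs)              { return 1.0f; }
    public float coord(int overlap, int maxOverlap)         { return 1.0f; }
}

// writer.setSimilarity(new FlatSimilarity());
// searcher.setSimilarity(new FlatSimilarity());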


Here's the bigger question:  what is your ultimate goal here?  Are you  
just trying to understand Lucene at an academic/programming level or  
do you have something you are trying to achieve in terms of relevance?


-Grant




Re: Vector Space Model: New Similarity Implementation Issues

2008-02-28 Thread Dharmalingam

Thanks for your tips. My overall goal is to quickly implement 7 variants of
the vector space model using Lucene. You can find these variants in the
uploaded file.

I am doing all of this for a much broader goal: I am trying to recover
traceability links from requirements to source code files. I treat every
requirement as a query. For this problem, I would like to compare this
collection of algorithms for their relevance.




Grant Ingersoll-6 wrote:
> 
> 
> On Feb 28, 2008, at 9:00 AM, Dharmalingam wrote:
> 
>>
>> Thanks for the reply. Sorry if my explanation is not clear. Yes, you  
>> are
>> correct the model is based on  Salton's VSM. However, the  
>> calculation of the
>> term weight and the doc norm is, in my opinion, different from  
>> Lucene. If
>> you look at the table given in
>> http://www.miislita.com/term-vector/term-vector-3.html, they  
>> calcuate the
>> document norm based on the weight wi=tfi*idfi. I looked at the  
>> interfaces of
>> Similarity and DefaultSimilairty class. I place it below:
>>
>> public float lengthNorm(String fieldName, int numTerms) {
>>return (float)(1.0 / Math.sqrt(numTerms));
>> }
>>
>> You can see that this lengthNorm for a doc is quite different from  
>> that
>> website norm calculation.
> 
> The lengthNorm method is different from the IDF calculation.  In the  
> Similarity class, that is handled by the idf() method.  Length norm is  
> an attempt to address one of the limitations listed further down in  
> that paper:
> "Long Documents: Very long documents make similarity measures  
> difficult (vectors with small dot products and high dimensionality)"
> 
> 
> 
>>
>>
>> Similarly, the querynorm interface of DefaultSimilarity class is:
>>
>> /** Implemented as 1/sqrt(sumOfSquaredWeights). */
>>  public float queryNorm(float sumOfSquaredWeights) {
>>return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
>>  }
>>
>> This is again different the website model.
> 
> Query norm is an attempt to allow for comparison of scores across  
> queries, but I don't think one should do that anyway.
> 
> 
>>
>>
>> I also have difficulities with tf interface of DefaultSimilarity:
>> /** Implemented as sqrt(freq). */
>>  public float tf(float freq) {
>>return (float)Math.sqrt(freq);
>>  }
>>
> 
> These are all callback methods from within the Scorer classes that  
> each Query uses.  Have a look at TermScorer for how these things get  
> called.
> 
> 
> Try this as an example:
> 
> Setup a really simple index with 1 or 2 docs each with a few words.   
> Setup a simple Similarity class where you override all of these  
> methods to return 1 (or some simple default)
> and then index your documents and do a few queries.
> 
> Then, have a look at Searcher.explain() to see why a document scores  
> the way it does.  Then, you can work to modify from there.
> 
> Here's the bigger question:  what is your ultimate goal here?  Are you  
> just trying to understand Lucene at an academic/programming level or  
> do you have something you are trying to achieve in terms of relevance?
> 
> -Grant
> 
> 
> 
> 
http://www.nabble.com/file/p15745822/ieee-sw-rank.pdf ieee-sw-rank.pdf 





Re: Query regarding usage of Lucene - Filtering folders

2008-02-28 Thread Erick Erickson
Sure, but you have to make it happen. The most straightforward thing I
can think of is to index (probably UN_TOKENIZED) the path to the file
in a new field when you index the contents. Then you can easily
restrict things however you want by including an AND clause with the
path fragment you wish to restrict your search to.

Best
Erick

On Thu, Feb 28, 2008 at 1:58 AM, <[EMAIL PROTECTED]> wrote:

> Hi
>
> I would like to join java-user mailing list.
>
> I had a query regarding usage of lucene.
>
>
>
>
>
> I have done the indexing for the files kept in root folder -> subfolder
>
> -> subfolder structure.
>
>
>
>
>
> When I make the search with particular word it returns me the list of
> matching files across the folder structure right from root to the last
> subfolder.
>
>
>
>
>
> I want to restrict search to specific folder only.
>
>
>
> Is it possible with lucene?
>
>
>
> If yes, please suggest me the steps to follow.
>
>
>
>
>
> Thanks,
>
>
>
> Mohammad
>
>
>
>
>
>
>
>


Re: DefaultIndexAccessor

2008-02-28 Thread vivek sar
Mark,

  We deployed our indexer (using defaultIndexAccessor) on one of the
production sites and are getting this error:

Caused by: java.util.concurrent.RejectedExecutionException
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(Unknown
Source)
at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
at 
org.apache.lucene.indexaccessor.DefaultIndexAccessor.release(DefaultIndexAccessor.java:514)


This is happening repeatedly every time the indexer runs.

This is running your latest IndexAccessor-021508 code.  Any ideas
(it's kind of urgent for us)?

Thanks,
-vivek


On Fri, Feb 15, 2008 at 6:50 PM, vivek sar <[EMAIL PROTECTED]> wrote:
> Mark,
>
>  Thanks for the quick fix. Actually, it is possible that there might
>  had been simultaneous queries using the MultiSearcher. I assumed it
>  was thread-safe, thus was re-using the same instance. I'll update my
>  application code as well.
>
>  Thanks,
>  -vivek
>
>
>
>  On Feb 15, 2008 5:56 PM, Mark Miller <[EMAIL PROTECTED]> wrote:
>  > Here is the fix: https://issues.apache.org/jira/browse/LUCENE-1026
>  >
>  >
>  > vivek sar wrote:
>  > > Mark,
>  > >
>  > >There seems to be some issue with DefaultMultiIndexAccessor.java. I
>  > > got following NPE exception,
>  > >
>  > >  2008-02-13 07:10:28,021 ERROR [http-7501-Processor6] 
> ReportServiceImpl -
>  > > java.lang.NullPointerException
>  > > at 
> org.apache.lucene.indexaccessor.DefaultMultiIndexAccessor.release(DefaultMultiIndexAccessor.java:89)
>  > >
>  > > Looks like the IndexAccessor for one of the Searcher in the
>  > > MultiSearcher returned null. Not sure how is that possible, any ideas
>  > > how is that possible?
>  > >
>  > > In my case it caused a critical error as the writer thread was stuck
>  > > forever (we found out after couple of days) because of this,
>  > >
>  > > "PS thread 9" prio=1 tid=0x2aac70eb95d0 nid=0x6ba in Object.wait()
>  > > [0x47533000..0x47533b80]
>  > > at java.lang.Object.wait(Native Method)
>  > > - waiting on <0x2aab3e5c7700> (a
>  > > org.apache.lucene.indexaccessor.DefaultIndexAccessor)
>  > > at java.lang.Object.wait(Unknown Source)
>  > > at 
> org.apache.lucene.indexaccessor.DefaultIndexAccessor.waitForReadersAndCloseCached(DefaultIndexAccessor.java:593)
>  > > at 
> org.apache.lucene.indexaccessor.DefaultIndexAccessor.release(DefaultIndexAccessor.java:510)
>  > > - locked <0x2aab3e5c7700> (a
>  > > org.apache.lucene.indexaccessor.DefaultIndexAccessor)
>  > >
>  > > The only way to recover was to re-start the application.
>  > >
>  > > I use both MultiSearcher and IndexSearcher in my application, I've
>  > > looked at your code but not able to pinpoint how can it go wrong? Of
>  > > course, you do have to check for null in the
>  > > MultiIndexAccessor.release, but how could you get null index accessor
>  > > at first place?
>  > >
>  > > I do call IndexAccessor.close during partitioning of indexes, but the
>  > > close should wait for all Searchers to close before doing anything.
>  > >
>  > > Do you have any updates to your code since 02/04/2008?
>  > >
>  > > Thanks,
>  > > -vivek
>  > >
>  > > On Feb 6, 2008 8:37 AM, Jay <[EMAIL PROTECTED]> wrote:
>  > >
>  > >> Thanks for your clarifications, Mark!
>  > >>
>  > >>
>  > >> Jay
>  > >>
>  > >>
>  > >> Mark Miller wrote:
>  > >>
>  >  5. Although currently IndexSearcher.close() does almost nothing except
>  >  to close the internal index reader, it might be a safer to close
>  >  searcher itself as well in closeCachedSearcher(), just in case, the
>  >  searcher may have other resources to release in the future version of
>  >  Lucene.
>  > 
>  > >>> Didn't catch that "as well". You are right, great idea Jay, thanks.
>  > >>>
>  > >>
>  > >>
>  > >>
>  > >
>  > >
>  > >
>  > >
>  >
>  >
>  >
>




Re: Vector Space Model: New Similarity Implementation Issues

2008-02-28 Thread Grant Ingersoll

FYI: The mailing list handler strips attachments.

At any rate, sounds like an interesting project.  I don't know how  
easy it will be for you to implement 7 variants of VSM in Lucene given  
the nature of the APIs, but if you do, it might be handy to see your  
changes as a patch.  :-)  Also not quite sure what all those variants  
will help with when it comes to your broader goal, but that isn't for  
me to decide :-)  Seems like your goal is to find the traceability  
stuff, not see if you can figure out how to change Lucene's  
similarity!  To that end, my two cents would be to focus on creating  
the right kinds of queries, analyzers, etc.



-Grant

On Feb 28, 2008, at 3:55 PM, Dharmalingam wrote:



Thanks for your tips. My overall goal is to quickly implement 7  
variants of

vector space model using Lucene. You can find these variants in the
updloaded file.

I am doing all these stuffs for a much broader goal: I am trying to  
recover
traceability links from requirements to source code files. I treat  
every

requirement as a query. In this problem, I would like to compare these
collection of algorithms for their relevance.




Grant Ingersoll-6 wrote:



On Feb 28, 2008, at 9:00 AM, Dharmalingam wrote:



Thanks for the reply. Sorry if my explanation is not clear. Yes, you
are
correct the model is based on  Salton's VSM. However, the
calculation of the
term weight and the doc norm is, in my opinion, different from
Lucene. If
you look at the table given in
http://www.miislita.com/term-vector/term-vector-3.html, they
calcuate the
document norm based on the weight wi=tfi*idfi. I looked at the
interfaces of
Similarity and DefaultSimilairty class. I place it below:

public float lengthNorm(String fieldName, int numTerms) {
  return (float)(1.0 / Math.sqrt(numTerms));
}

You can see that this lengthNorm for a doc is quite different from
that
website norm calculation.


The lengthNorm method is different from the IDF calculation.  In the
Similarity class, that is handled by the idf() method.  Length norm  
is

an attempt to address one of the limitations listed further down in
that paper:
"Long Documents: Very long documents make similarity measures
difficult (vectors with small dot products and high dimensionality)"






Similarly, the querynorm interface of DefaultSimilarity class is:

/** Implemented as 1/sqrt(sumOfSquaredWeights). */
public float queryNorm(float sumOfSquaredWeights) {
  return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
}

This is again different the website model.


Query norm is an attempt to allow for comparison of scores across
queries, but I don't think one should do that anyway.





I also have difficulities with tf interface of DefaultSimilarity:
/** Implemented as sqrt(freq). */
public float tf(float freq) {
  return (float)Math.sqrt(freq);
}



These are all callback methods from within the Scorer classes that
each Query uses.  Have a look at TermScorer for how these things get
called.


Try this as an example:

Setup a really simple index with 1 or 2 docs each with a few words.
Setup a simple Similarity class where you override all of these
methods to return 1 (or some simple default)
and then index your documents and do a few queries.

Then, have a look at Searcher.explain() to see why a document scores
the way it does.  Then, you can work to modify from there.

Here's the bigger question:  what is your ultimate goal here?  Are  
you

just trying to understand Lucene at an academic/programming level or
do you have something you are trying to achieve in terms of  
relevance?


-Grant





http://www.nabble.com/file/p15745822/ieee-sw-rank.pdf ieee-sw-rank.pdf





--
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ









Re: Vector Space Model: New Similarity Implementation Issues

2008-02-28 Thread Dharmalingam

You can find those variants of the vector space model in this interesting
article:
http://ieeexplore.ieee.org/iel1/52/12658/00582976.pdf?tp=&isnumber=&arnumber=582976

Now I have confirmed with you that, given the current nature of the
Similarity APIs, it will not be easy to quickly realize these variants.

Actually, I implemented the earlier website model as a separate Java
program which uses Lucene classes, but not by inheriting from the
Similarity class. It appears that inheriting from the Similarity class
will not solve my problem of realizing these variants.


Grant Ingersoll-6 wrote:
> 
> FYI: The mailing list handler strips attachments.
> 
> At any rate, sounds like an interesting project.  I don't know how  
> easy it will be for you to implement 7 variants of VSM in Lucene given  
> the nature of the APIs, but if you do, it might be handy to see your  
> changes as a patch.  :-)  Also not quite sure what all those variants  
> will help with when it comes to your broader goal, but that isn't for  
> me to decide :-)  Seems like your goal is to find the traceability  
> stuff, not see if you can figure out how to change Lucene's  
> similarity!  To that end, my two cents would be to focus on creating  
> the right kinds of queries, analyzers, etc.
> 
> 
> -Grant
> 
> On Feb 28, 2008, at 3:55 PM, Dharmalingam wrote:
> 
>>
>> Thanks for your tips. My overall goal is to quickly implement 7  
>> variants of
>> vector space model using Lucene. You can find these variants in the
>> updloaded file.
>>
>> I am doing all these stuffs for a much broader goal: I am trying to  
>> recover
>> traceability links from requirements to source code files. I treat  
>> every
>> requirement as a query. In this problem, I would like to compare these
>> collection of algorithms for their relevance.
>>
>>
>>
>>
>> Grant Ingersoll-6 wrote:
>>>
>>>
>>> On Feb 28, 2008, at 9:00 AM, Dharmalingam wrote:
>>>

 Thanks for the reply. Sorry if my explanation is not clear. Yes, you
 are
 correct the model is based on  Salton's VSM. However, the
 calculation of the
 term weight and the doc norm is, in my opinion, different from
 Lucene. If
 you look at the table given in
 http://www.miislita.com/term-vector/term-vector-3.html, they
 calcuate the
 document norm based on the weight wi=tfi*idfi. I looked at the
 interfaces of
 Similarity and DefaultSimilairty class. I place it below:

 public float lengthNorm(String fieldName, int numTerms) {
   return (float)(1.0 / Math.sqrt(numTerms));
 }

 You can see that this lengthNorm for a doc is quite different from
 that
 website norm calculation.
>>>
>>> The lengthNorm method is different from the IDF calculation.  In the
>>> Similarity class, that is handled by the idf() method.  Length norm  
>>> is
>>> an attempt to address one of the limitations listed further down in
>>> that paper:
>>> "Long Documents: Very long documents make similarity measures
>>> difficult (vectors with small dot products and high dimensionality)"
>>>
>>>
>>>


 Similarly, the querynorm interface of DefaultSimilarity class is:

 /** Implemented as 1/sqrt(sumOfSquaredWeights). */
 public float queryNorm(float sumOfSquaredWeights) {
   return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
 }

 This is again different the website model.
>>>
>>> Query norm is an attempt to allow for comparison of scores across
>>> queries, but I don't think one should do that anyway.
>>>
>>>


 I also have difficulities with tf interface of DefaultSimilarity:
 /** Implemented as sqrt(freq). */
 public float tf(float freq) {
   return (float)Math.sqrt(freq);
 }

>>>
>>> These are all callback methods from within the Scorer classes that
>>> each Query uses.  Have a look at TermScorer for how these things get
>>> called.
>>>
>>>
>>> Try this as an example:
>>>
>>> Setup a really simple index with 1 or 2 docs each with a few words.
>>> Setup a simple Similarity class where you override all of these
>>> methods to return 1 (or some simple default)
>>> and then index your documents and do a few queries.
>>>
>>> Then, have a look at Searcher.explain() to see why a document scores
>>> the way it does.  Then, you can work to modify from there.
>>>
>>> Here's the bigger question:  what is your ultimate goal here?  Are  
>>> you
>>> just trying to understand Lucene at an academic/programming level or
>>> do you have something you are trying to achieve in terms of  
>>> relevance?
>>>
>>> -Grant
>>>
>>>
>>>
>>>
>> http://www.nabble.com/file/p15745822/ieee-sw-rank.pdf ieee-sw-rank.pdf

Re: DefaultIndexAccessor

2008-02-28 Thread vivek sar
Mark,

 Some more information,

  1) I run the IndexWriter every 5 mins.
  2) After every cycle I check if I need to partition (based on the
     index size).
  3) In the partition interface,
     a) I first call close on the index accessor (so all the searchers
        can close before I move that index):
            accessor = IndexAccessorFactory.getInstance().getAccessor(dir.getFile());
            accessor.close();
     b) Then I re-open the index accessor:
            accessor = indexFactory.getAccessor(dir.getFile());
            accessor.open();
     c) I optimize my indexes using the IndexWriter that I get from the
        accessor:
            masterWriter = this.indexAccessor.getWriter(false);
            masterWriter.optimize(optimizeSegment);
     d) Once the optimization is done, I release the masterWriter:
            this.indexAccessor.release(masterWriter);

Now, here is where I get the "RejectedExecutionException". Reading up a
little more on this exception
(http://pveentjer.wordpress.com/2008/02/06/are-you-dealing-with-the-rejectedexecutionexception/),
I see this might be happening because something got stuck during the
close cycle, so the ExecutorService is not accepting any new tasks.
I'm not sure how this would happen.
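
(The failure mode in miniature, using nothing but java.util.concurrent; this
is just an illustration of the exception, not the IndexAccessor code:)

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

ExecutorService pool = Executors.newSingleThreadExecutor();
pool.shutdown();
// Any execute()/submit() after shutdown() is rejected:
pool.execute(new Runnable() { public void run() { } });
// -> java.util.concurrent.RejectedExecutionException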

The critical problem is that once I get this exception, every release call
throws the same exception (it looks like the shutdown never completes).
Because of this, my readers are never refreshed and I cannot read any
new indexes.

Maybe I have to check whether the accessor is completely closed before
re-opening? Could you, in your release(), check whether the pool
(ExecutorService) is in a shutdown state? Anything else I can check?

Thanks,
-vivek

On Thu, Feb 28, 2008 at 1:26 PM, vivek sar <[EMAIL PROTECTED]> wrote:
> Mark,
>
>   We deployed our indexer (using defaultIndexAccessor) on one of the
>  production site and getting this error,
>
>  Caused by: java.util.concurrent.RejectedExecutionException
> at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(Unknown
>  Source)
> at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
> at 
> org.apache.lucene.indexaccessor.DefaultIndexAccessor.release(DefaultIndexAccessor.java:514)
>
>
>  This is happening repeatedly every time the indexer runs.
>
>  This is running your latest IndexAccessor-021508 code.  Any ideas
>  (it's kind of urgent for us)?
>
>  Thanks,
>  -vivek
>
>
>
>
>  On Fri, Feb 15, 2008 at 6:50 PM, vivek sar <[EMAIL PROTECTED]> wrote:
>  > Mark,
>  >
>  >  Thanks for the quick fix. Actually, it is possible that there might
>  >  had been simultaneous queries using the MultiSearcher. I assumed it
>  >  was thread-safe, thus was re-using the same instance. I'll update my
>  >  application code as well.
>  >
>  >  Thanks,
>  >  -vivek
>  >
>  >
>  >
>  >  On Feb 15, 2008 5:56 PM, Mark Miller <[EMAIL PROTECTED]> wrote:
>  >  > Here is the fix: https://issues.apache.org/jira/browse/LUCENE-1026
>  >  >
>  >  >
>  >  > vivek sar wrote:
>  >  > > Mark,
>  >  > >
>  >  > >There seems to be some issue with DefaultMultiIndexAccessor.java. I
>  >  > > got following NPE exception,
>  >  > >
>  >  > >  2008-02-13 07:10:28,021 ERROR [http-7501-Processor6] 
> ReportServiceImpl -
>  >  > > java.lang.NullPointerException
>  >  > > at 
> org.apache.lucene.indexaccessor.DefaultMultiIndexAccessor.release(DefaultMultiIndexAccessor.java:89)
>  >  > >
>  >  > > Looks like the IndexAccessor for one of the Searcher in the
>  >  > > MultiSearcher returned null. Not sure how is that possible, any ideas
>  >  > > how is that possible?
>  >  > >
>  >  > > In my case it caused a critical error as the writer thread was stuck
>  >  > > forever (we found out after couple of days) because of this,
>  >  > >
>  >  > > "PS thread 9" prio=1 tid=0x2aac70eb95d0 nid=0x6ba in Object.wait()
>  >  > > [0x47533000..0x47533b80]
>  >  > > at java.lang.Object.wait(Native Method)
>  >  > > - waiting on <0x2aab3e5c7700> (a
>  >  > > org.apache.lucene.indexaccessor.DefaultIndexAccessor)
>  >  > > at java.lang.Object.wait(Unknown Source)
>  >  > > at 
> org.apache.lucene.indexaccessor.DefaultIndexAccessor.waitForReadersAndCloseCached(DefaultIndexAccessor.java:593)
>  >  > > at 
> org.apache.lucene.indexaccessor.DefaultIndexAccessor.release(DefaultIndexAccessor.java:510)
>  >  > > - locked <0x2aab3e5c7700> (a
>  >  > > org.apache.lucene.indexaccessor.DefaultIndexAccessor)
>  >  > >
>  >  > > The only way to recover was to re-start the application.
>  >  > >
>  >  > > I use both MultiSearcher and IndexSearcher in my application, I've
>  >  > > looked at your code but not able to pinpoint how can it go wrong? Of
>  >  > > course,

SOC: Lulu, a Lua implementation of Lucene

2008-02-28 Thread Petite Abeille

A proposal for a Lua entry for the "Google Summer of Code" '08:

lu·lu (lū'lū) n. Slang.

A remarkable person, object, or idea.

A very attractive or seductive looking woman.

A Lua implementation of Lucene.


Skimpy details below:

http://svr225.stepx.com:3388/lulu
http://lua-users.org/wiki/GoogleSummerOfCodeIdeas


Feel free to drop by the Lua mailing list if you are interested.

http://www.lua.org/lua-l.html


Cheers,

PA.

 



Re: DefaultIndexAccessor

2008-02-28 Thread Mark Miller

Hey vivek,

Sorry you ran into this. I believe the problem is that I had just not 
foreseen the use case of closing and then reopening the Accessor. The 
only time I ever close the Accessors is when I am shutting down the JVM.


What do you do about all of the IndexAccessor requests while it is in a 
closed state? Could there be a better way of accomplishing this without 
closing the Accessor? Would a new method that just stalled everything be 
better? Then you wouldn't have to recreate any resources, possibly?


In any case, the problem is that after the Executor gets shut down, it is 
not reopened in the open method. I can certainly change this, but I need 
to look for any other issues as well. I will add an open-after-shutdown 
test to investigate. I am going to think about the issue further and I 
will get back to you soon.


Thanks for all of the details.

- Mark

vivek sar wrote:

Mark,

 Some more information,

  1) I run indexwriter every 5 mins
  2) After every cycle I check if I need to partition (based on
the index size)
  3) In the partition interface,
a)  I first call close on the index accessor (so all the
searchers can close before I move that index)
  accessor =
IndexAccessorFactory.getInstance().getAccessor(dir.getFile());
  accessor.close();
b) Then I re-open the index accessor,
   accessor = indexFactory.getAccessor(dir.getFile());
   accessor.open();
c) I optimized the my indexes using the Index Writer (that
I get from the accessor).
   masterWriter = this.indexAccessor.getWriter(false);
   masterWriter.optimize(optimizeSegment);
d) Once the optimization is done I release the masterWriter,
this.indexAccessor.release(masterWriter);

 Now here is where I get the "RejectedExecutionException".
Reading up little more on this exception,
http://pveentjer.wordpress.com/2008/02/06/are-you-dealing-with-the-rejectedexecutionexception/,
I see this might be happening because something got stuck during the
close cycle, so the ExecutorSerivce is not accepting any new tasks.
I'm not sure how would this happen.

The critical problem is once I get this exception, every release call
throws the same exception (looks like shutdown never gets done).
Because of this my readers are never refreshed and I can not read any
new indexes.

May be I've to check whether the accessor is completely closed before
re-opening?  Could you in your release check whether the pool
(ExecutorService) is in shutdown state? Any thing else I can check?

Thanks,
-vivek

On Thu, Feb 28, 2008 at 1:26 PM, vivek sar <[EMAIL PROTECTED]> wrote:
  

Mark,

  We deployed our indexer (using defaultIndexAccessor) on one of the
 production site and getting this error,

 Caused by: java.util.concurrent.RejectedExecutionException
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(Unknown
 Source)
at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
at 
org.apache.lucene.indexaccessor.DefaultIndexAccessor.release(DefaultIndexAccessor.java:514)


 This is happening repeatedly every time the indexer runs.

 This is running your latest IndexAccessor-021508 code.  Any ideas
 (it's kind of urgent for us)?

 Thanks,
 -vivek




 On Fri, Feb 15, 2008 at 6:50 PM, vivek sar <[EMAIL PROTECTED]> wrote:
 > Mark,
 >
 >  Thanks for the quick fix. Actually, it is possible that there might
 >  had been simultaneous queries using the MultiSearcher. I assumed it
 >  was thread-safe, thus was re-using the same instance. I'll update my
 >  application code as well.
 >
 >  Thanks,
 >  -vivek
 >
 >
 >
 >  On Feb 15, 2008 5:56 PM, Mark Miller <[EMAIL PROTECTED]> wrote:
 >  > Here is the fix: https://issues.apache.org/jira/browse/LUCENE-1026
 >  >
 >  >
 >  > vivek sar wrote:
 >  > > Mark,
 >  > >
 >  > >There seems to be some issue with DefaultMultiIndexAccessor.java. I
 >  > > got following NPE exception,
 >  > >
 >  > >  2008-02-13 07:10:28,021 ERROR [http-7501-Processor6] 
ReportServiceImpl -
 >  > > java.lang.NullPointerException
 >  > > at 
org.apache.lucene.indexaccessor.DefaultMultiIndexAccessor.release(DefaultMultiIndexAccessor.java:89)
 >  > >
 >  > > Looks like the IndexAccessor for one of the Searcher in the
 >  > > MultiSearcher returned null. Not sure how is that possible, any ideas
 >  > > how is that possible?
 >  > >
 >  > > In my case it caused a critical error as the writer thread was stuck
 >  > > forever (we found out after couple of days) because of this,
 >  > >
 >  > > "PS thread 9" prio=1 tid=0x2aac70eb95d0 nid=0x6ba in Object.wait()
 >  > > [0x47533000..0x47533b80]
 >  > > at java.lang.Object.wait(Native Method)
 >  > > - wait

Re: DefaultIndexAccessor

2008-02-28 Thread vivek sar
Mark,

Yes, I think that's precisely what's happening. I call accessor.close,
which shuts down the ExecutorService. I was assuming accessor.open would
re-open it (I think that's how it worked in an older version of your
IndexAccessor).

Basically, I need a way to stop (or close) all the IndexSearchers for
a specific IndexAccessor and not allow them to re-open until I flag
the IndexAccessor that it's safe to give out new index searchers, so I
am able to optimize the index, rename it, and move it somewhere else
during partitioning. Right now, without closing the searchers, I cannot
rename the index, as it wouldn't be allowed while some other thread has
a file handle on that index.

I don't know if there is a way to get an exclusive writer thread to an
index using IndexAccessor. I would think a better way for me would be
to:

1) Call a method on IndexAccessor, let's say stopIndex() - this
would clear all the caches (stop all the open searchers, readers and
writers) and flag the index accessor so no other reader or writer
thread can be taken from this index accessor.
2) Use my own IndexWriter (not from IndexAccessor) to do the
optimization on the index that needs to be partitioned, and release it.
3) Once done with the partition, call another method on
IndexAccessor, let's say startIndex() -> this would simply set a flag
so the IndexAccessor again allows getting searchers, readers, and
writers. The start would have to reopen all the searchers and readers.

Not sure if this is a good design for what I am trying to do. This
would require two new methods on IndexAccessor - stopIndex() and
startIndex(). Any thoughts?

Thanks,
-vivek


On Thu, Feb 28, 2008 at 3:55 PM, Mark Miller <[EMAIL PROTECTED]> wrote:
> Hey vivek,
>
>  Sorry you ran into this. I believe the problem is that I had just not
>  foreseen the use case of closing and then reopening the Accessor. The
>  only time I ever close the Accessors is when I am shutting down the JVM.
>
>  What do you do about all of the IndexAccessor requests while it is in a
>  closed state? Could their be a better way of accomplishing this without
>  closing the Accessor? Would a new method that just stalled everything be
>  better? Then you wouldn't have to recreate any resources possibly?
>
>  In any case, the problem is that after the Executor gets shutdown it is
>  not reopened in the open method. I can certainly change this, but I need
>  to look for any other issues as well. I will add an open after a
>  shutdown test to investigate. I am going to think about the issue
>  further and I will get back to you soon.
>
>  Thanks for all of the details.
>
>  - Mark
>
>
>
>  vivek sar wrote:
>  > Mark,
>  >
>  >  Some more information,
>  >
>  >   1) I run indexwriter every 5 mins
>  >   2) After every cycle I check if I need to partition (based on
>  > the index size)
>  >   3) In the partition interface,
>  > a)  I first call close on the index accessor (so all the
>  > searchers can close before I move that index)
>  >   accessor =
>  > IndexAccessorFactory.getInstance().getAccessor(dir.getFile());
>  >   accessor.close();
>  > b) Then I re-open the index accessor,
>  >accessor = 
> indexFactory.getAccessor(dir.getFile());
>  >accessor.open();
>  > c) I optimized the my indexes using the Index Writer (that
>  > I get from the accessor).
>  >masterWriter = 
> this.indexAccessor.getWriter(false);
>  >masterWriter.optimize(optimizeSegment);
>  > d) Once the optimization is done I release the masterWriter,
>  > this.indexAccessor.release(masterWriter);
>  >
>  >  Now here is where I get the "RejectedExecutionException".
>  > Reading up little more on this exception,
>  > 
> http://pveentjer.wordpress.com/2008/02/06/are-you-dealing-with-the-rejectedexecutionexception/,
>  > I see this might be happening because something got stuck during the
>  > close cycle, so the ExecutorSerivce is not accepting any new tasks.
>  > I'm not sure how would this happen.
>  >
>  > The critical problem is once I get this exception, every release call
>  > throws the same exception (looks like shutdown never gets done).
>  > Because of this my readers are never refreshed and I can not read any
>  > new indexes.
>  >
>  > May be I've to check whether the accessor is completely closed before
>  > re-opening?  Could you in your release check whether the pool
>  > (ExecutorService) is in shutdown state? Any thing else I can check?
>  >
>  > Thanks,
>  > -vivek
>  >
>  > On Thu, Feb 28, 2008 at 1:26 PM, vivek sar <[EMAIL PROTECTED]> wrote:
>  >
>  >> Mark,
>  >>
>  >>   We deployed our indexer (using defaultIndexAccessor) on one of the
>  >>  production site and getting this error,
>  >>
>  >>  Caused by: java.util.concurrent.RejectedExec

Re: DefaultIndexAccessor

2008-02-28 Thread Mark Miller
I added the Thread Pool recently, so things did probably work before 
that. I am certainly willing to put the Thread Pool init in the open 
call instead of the constructor.


As for the best method to use, I was thinking of something along the 
same lines as what you suggest.


One of the decisions will be how to handle method calls on the Accessor 
while it is shutting down. Throw an Exception or block?


In any case, I will put up code that makes the above change and your 
code should work as it did. I'll be sure to add this to the test cases.



Just as a personal-interest question, what has led you to set up your 
index this way? Adding partitions as it grows, that is.


- Mark

vivek sar wrote:

Mark,

Yes, I think that's what precisely is happening. I call
accessor.close, which shuts down all the ExecutorService. I was
assuming the accessor.open would re-open it (I think that's how it
worked in an older version of your IndexAccessor).

Basically, I need a way to stop (or close) all the IndexSearchers for
a specific IndexAccessor and do not allow them to re-open until I flag
the indexAccessor that it's safe to give out new index searchers. So I
am able to optimize the index, rename it and move it to somewhere else
during partitioning. Right now, without closing the searchers, I cannot
rename the index, as that isn't allowed while some other thread has a
file handle to that index.

I don't know if there is a way to get an exclusive writer thread to an
index using IndexAccessor. I would think a better way for me would be
to,

1) Call a method on IndexAccessor, let's say stopIndex() - This
would clear all the caches (stop all the open searchers, readers and
writers) and flag the index accessor so no other reader or writer
thread can be taken from this index accessor
2) I use my own (not using IndexAccessor) IndexWriter to do
optimization on the index that needs to be partitioned and release it
3) Once done with partition, I call another method on
IndexAccessor, let's say startIndex() -> This will simply clear the flag so
the IndexAccessor again allows getting searchers, readers and
writers. The start would have to reopen all the searchers and readers.

Not sure if this is a good design for what I am trying to do. This
would require two new methods on IndexAccessor - stopIndex() and
startIndex(). Any thoughts?

Thanks,
-vivek
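
(A rough sketch of what vivek's stopIndex()/startIndex() idea could look like if
one picks the "block" option; the class and helper names below are invented and
are not the real IndexAccessor API:)

    import org.apache.lucene.index.IndexWriter;

    public abstract class StoppableAccessor {

        private boolean stopped = false;

        // Close cached searchers/readers/writers and refuse new leases.
        public synchronized void stopIndex() {
            stopped = true;
            closeCachedResources(); // assumed cleanup hook
        }

        // Allow leases again; caches would refill lazily on the next request.
        public synchronized void startIndex() {
            stopped = false;
            notifyAll(); // wake any callers blocked in getWriter()
        }

        // Block, rather than throw, while the index is stopped.
        public synchronized IndexWriter getWriter() throws InterruptedException {
            while (stopped) {
                wait();
            }
            return leaseWriter(); // assumed lease mechanism
        }

        protected abstract void closeCachedResources();

        protected abstract IndexWriter leaseWriter();
    }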



Re: DefaultIndexAccessor

2008-02-28 Thread vivek sar
Mark,

 Just for my clarification,

1) Would you have indexStop and indexStart methods? If that's the case
then I don't have to call close() at all. These new methods would
serve as just cleaning up the caches and not closing the thread pool.

I would prefer not to call close() and init() again if possible.

The reason we have to partition is that our index grows by over 50G
a week, and optimization then takes hours. I had a thread going
on this topic on the mailing list:
http://www.gossamer-threads.com/lists/lucene/java-user/57366?search_string=partition;#57366.

Thanks,
-vivek






Re: Vector Space Model: New Similarity Implementation Issues

2008-02-28 Thread h t
Compared with the classical VSM, Lucene just ignores the denominator (|Q|*|D|) of
the similarity formula,
but it adds norm(t,d) and coord(q,d) to account for the fraction of terms shared by
the Query and the Doc,
so in practice it's a modified implementation of the VSM.
Do you just want to use Lucene to verify which implementation of VSM in "ieee-sw-rank" is
more precise in practice?
If so, it's a useful experiment.
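
(To make the contrast concrete, here are the two forms side by side: the classical
cosine similarity from the miislita article, and the practical scoring function
documented in Lucene's Similarity javadocs, with boost(t) the per-term query boost:)

    sim(q,d) = \frac{\sum_t w_{t,q} \, w_{t,d}}{|q| \, |d|},
    \qquad w_{t,x} = tf_{t,x} \cdot idf_t

    score(q,d) = coord(q,d) \cdot queryNorm(q) \cdot
    \sum_{t \in q} tf(t,d) \cdot idf(t)^2 \cdot boost(t) \cdot norm(t,d)

where |q| and |d| are the Euclidean norms of the tf-idf weighted vectors.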



Re: How do i get a text summary

2008-02-28 Thread h t
Hi Karl,
Where can I find an introduction to the algorithm below? Thanks.
"Very simple algorithmic solutions usually involve ranking top senstances
by looking at distribution of terms in sentances, paragraphs and the
whole document. I implemented something like this a couple of years back
that worked fairly well."



2008/2/29, Karl Wettin <[EMAIL PROTECTED]>:
>
> [EMAIL PROTECTED] wrote:
>
> >> If you want something from an index it has to be IN the
> >> index. So, store a
> >> summary field in each document and make sure that field is part of the
> >> query.
> >
> > And how could one automatically create such a summary?
> > Taking the first 2 lines of a document does not always make much sense.
> > How does Google do this?
>
>
> Google doesn't summarize; it highlights the parts that match the query. See
> previous responses.
>
> If you really want to summarize there are a number of more and less
> scientific ways to figure out what's important and what's not.
>
> Very simple algorithmic solutions usually involve ranking top sentences
> by looking at the distribution of terms in sentences, paragraphs and the
> whole document. I implemented something like this a couple of years back
> that worked fairly well.
>
> Citeseer is a great source for papers on pretty much any IR related
> subject: 
>
>
>
> karl
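
(As a concrete toy version of the ranking Karl describes above - score each
sentence by the document-wide frequency of its terms and keep the best - the
following sketch is invented for illustration and does no stop-wording,
stemming, or paragraph weighting:)

    import java.util.*;

    public class NaiveSummarizer {

        // Rank sentences by the document-wide frequency of their terms,
        // normalised by sentence length, and return the top 'max' of them.
        public static List<String> summarize(String text, int max) {
            String[] sentences = text.split("(?<=[.!?])\\s+");

            // Pass 1: term counts over the whole document.
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (String term : text.toLowerCase().split("\\s+")) {
                Integer c = counts.get(term);
                counts.put(term, c == null ? 1 : c + 1);
            }

            // Pass 2: score each sentence by its terms' document counts.
            final Map<String, Double> score = new HashMap<String, Double>();
            for (String s : sentences) {
                String[] terms = s.toLowerCase().split("\\s+");
                double sum = 0;
                for (String t : terms) {
                    Integer c = counts.get(t);
                    sum += (c == null ? 0 : c.intValue());
                }
                score.put(s, sum / terms.length);
            }

            List<String> ranked = new ArrayList<String>(Arrays.asList(sentences));
            Collections.sort(ranked, new Comparator<String>() {
                public int compare(String a, String b) {
                    return score.get(b).compareTo(score.get(a));
                }
            });
            return ranked.subList(0, Math.min(max, ranked.size()));
        }
    }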


Re: DefaultIndexAccessor

2008-02-28 Thread Mark Miller



vivek sar wrote:

Mark,

 Just for my clarification,

1) Would you have indexStop and indexStart methods? If that's the case
then I don't have to call close() at all. These new methods would
serve as just cleaning up the caches and not closing the thread pool.
  
Yes. This is the approach I agree with. I am putting up new code that 
allows the close(), open() calls anyway, though. There is nothing keeping 
it from working, and it used to work, so it's a good idea to make it work 
again. It is also a quick fix for you.


https://issues.apache.org/jira/browse/LUCENE-1026

I will be adding the new stop/start calls quickly, but I don't want to 
rush them out.

I would prefer not to call close() and init() again if possible.

The reason we have to partition is that our index grows by over 50G
a week, and optimization then takes hours. I had a thread going
on this topic on the mailing list:
http://www.gossamer-threads.com/lists/lucene/java-user/57366?search_string=partition;#57366.
  
Gotcha. A comment I have on that: you might try keeping the 
merge factor really low as well. This will keep searches faster and make 
optimization much faster (it's amortized), and it won't slow down writes 
much in my experience (since IndexAccessor drops the writes off (spawns a 
new thread) anyway, slightly longer writes shouldn't be a big deal at 
all). I'd try even as low as 2 or 3. I run some fairly large 
interactive indexes and the writes, even when blocking until the write 
is done, are pretty darn responsive.
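
(For illustration, dialing the merge factor down is one call against the 2.x-era
IndexWriter API; the index path below is made up:)

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class LowMergeFactor {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/indexes/master", new StandardAnalyzer(), false);
            writer.setMergeFactor(3); // fewer segments: faster searches, much cheaper optimize()
            writer.close();
        }
    }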


- Mark

Thanks,
-vivek






Re: DefaultIndexAccessor

2008-02-28 Thread vivek sar
Thanks Mark. I'll wait for the new methods in your IndexAccessor
enhancements.

I use mergeFactor = 100. I've read about the merge factor, and it's
hard to balance read and write performance. What number do
you use?

Thanks again.
-vivek


Re: When does QueryParser creates PhraseQueries

2008-02-28 Thread Daniel Noll
On Wednesday 27 February 2008 00:50:04 [EMAIL PROTECTED] wrote:
> Looks like this is really hard-coded behaviour, and not Analyzer-specific.

The whitespace part is coded into QueryParser.jj, yes.  So are the quotes 
and : and other query-specific things.

> I want to search for directories without tokenizing them, e.g.
> /home/reuschling - this seems not to be possible with the current
> query parser.

That's possible by changing the analyser.  For instance StandardAnalyzer will 
tokenise that as two terms, but WhitespaceAnalyzer will tokenise it as one.
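
(A quick way to see the difference, using the 2.x analysis API; the field name
and path are just examples:)

    import java.io.StringReader;
    import org.apache.lucene.analysis.*;

    public class TokenDemo {
        public static void main(String[] args) throws Exception {
            // Swap in new StandardAnalyzer() to see the path split into "home" and "reuschling".
            Analyzer a = new WhitespaceAnalyzer();
            TokenStream ts = a.tokenStream("path", new StringReader("/home/reuschling"));
            for (Token t = ts.next(); t != null; t = ts.next()) {
                System.out.println(t.termText()); // one token: "/home/reuschling"
            }
        }
    }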

> | If you subclass QueryParser then you can override this method and modify
> | it to do whatever evil trick you want to do.
>
> Overriding getFieldQuery() will not work because I can't distinguish between
> "home/reuschling", which should trigger a PhraseQuery, and home/reuschling
> without quotes, which should trigger a BooleanQuery... I will search
> whether I can find a better place for this:)

That much is true.  Likewise, there is no difference between quoting "cat" and 
typing cat without quotes.

You could possibly override the parse(String) method and mangle the string in 
some way so that you can tell.  So if the user enters /a/b it could pass 
down /a/b, but if they enter "/a/b" it could pass down "SOMETHING/a/b", and 
you then detect the SOMETHING in getFieldQuery.  You just have to make sure the 
SOMETHING isn't tokenised out by the analyser.
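
(A rough sketch of that marker trick against the 2.x QueryParser; the MARKER
value and the quote-rewriting regex are invented here and ignore escaped
quotes entirely:)

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class QuoteAwareQueryParser extends QueryParser {

        private static final String MARKER = "qqquotedqq";

        public QuoteAwareQueryParser(String field, Analyzer analyzer) {
            super(field, analyzer);
        }

        public Query parse(String query) throws ParseException {
            // Tag the contents of every quoted span before the grammar sees it.
            return super.parse(query.replaceAll("\"([^\"]*)\"", "\"" + MARKER + "$1\""));
        }

        protected Query getFieldQuery(String field, String queryText) throws ParseException {
            if (queryText.startsWith(MARKER)) {
                // The user really typed quotes: strip the marker (before analysis,
                // so it cannot be tokenised away) and keep the phrase behaviour.
                return super.getFieldQuery(field, queryText.substring(MARKER.length()));
            }
            // Unquoted input: this is the place to build a BooleanQuery over the
            // analysed tokens instead of accepting the default PhraseQuery.
            return super.getFieldQuery(field, queryText);
        }
    }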

Or you could clone QueryParser.jj itself and modify it to call different 
methods for the two situations.

Daniel




Re: Rebuilding Document from index?

2008-02-28 Thread Daniel Noll
On Wednesday 27 February 2008 03:33:53 Itamar Syn-Hershko wrote:
> I'm still trying to engineer the best possible solution for Lucene with
> Hebrew, right now my path is NOT using a stemmer by default, only by
> explicit request of the user. MoreLikeThis would only return relevant
> results if I use non-stemmed scoring and lookup.

This appears to be the case for all languages too: stemming will skew 
similarity and can result in unrelated documents scoring higher than they 
should.

Some people seem to be working around this by having two fields where one is 
stemmed and the other isn't.  You could then use the stemmed field when doing 
queries but use the non-stemmed field for MoreLikeThis.
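
(A sketch of that two-field setup; the field names are invented, and the
stemming analyzer assumes the contrib Snowball package:)

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class TwoFieldDoc {

        // Index the same text twice: query against "body", point MoreLikeThis at "body_exact".
        public static Document build(String text) {
            Document doc = new Document();
            doc.add(new Field("body", text, Field.Store.NO, Field.Index.TOKENIZED));
            doc.add(new Field("body_exact", text, Field.Store.NO, Field.Index.TOKENIZED));
            return doc;
        }

        // Stem by default, but leave "body_exact" unstemmed.
        public static PerFieldAnalyzerWrapper analyzer() {
            PerFieldAnalyzerWrapper a = new PerFieldAnalyzerWrapper(new SnowballAnalyzer("English"));
            a.addAnalyzer("body_exact", new StandardAnalyzer());
            return a;
        }
    }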

Daniel




Re: Inconsistent Search Speed

2008-02-28 Thread Daniel Noll
On Thursday 28 February 2008 01:52:27 Erick Erickson wrote:
> And don't iterate through the Hits object for more than 100 or so hits.
> Like Mark said. Really. Really don't ...

Is there a good trick for avoiding this?

Say you have a situation like this...
  - User searches
  - User sees first N hits, perhaps scrolls
  - User chooses to save results to a file

Clearly for the first two, using Hits is normal.  For the third step you would 
be iterating over potentially a larger number of results, so Hits is not 
recommended.  But implementing a HitCollector from scratch to get the same 
results as Hits seems silly, so what is the usual way out of this?  Do you 
re-execute the query using TopDocs?  Or do you call hits.doc(hits.length() - 1) to 
force Hits itself to load the remainder, and then go back to the start and 
iterate through?
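
(For what it's worth, the from-scratch collector is at least short when score
order doesn't matter for the export; a sketch against the 2.x API, with the
names invented:)

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;

    public class SaveAllResults {

        // Gather every matching doc id, in index order rather than score order.
        public static List<Integer> collectIds(Searcher searcher, Query query) throws Exception {
            final List<Integer> ids = new ArrayList<Integer>();
            searcher.search(query, new HitCollector() {
                public void collect(int doc, float score) {
                    ids.add(new Integer(doc)); // resolve to Documents later, in batches
                }
            });
            return ids;
        }
    }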

Using TopDocs up-front would be desirable but it turns out it tries to 
allocate the maximum you pass in, up-front...

Daniel
