Re: IndexReader.reopen memory leak

2008-06-01 Thread Doron Cohen
Hi John,

> IndexReader newInner = in.reopen();
> if (in != newInner)
> {
>   in.close();
>   this.in = newInner;
>
>   // code to clean up my data
>   _cache.clear();
>   _indexData.load(this, true);
>   init(_fieldConfig);
> }
>

Just to be sure on this, could you confirm the two appearances above:
- in
- this.in
refer to exactly the same variable?

Assuming they are, could you provide some more code:
- entire method containing the above code
- method reopen() of your FilteredIndexReader.
- method newReader()
- constructor of FilteredIndexReader if it is invoked from newReader()

Regards,
Doron
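
For reference, the reopen-and-swap pattern being asked about can be modelled without Lucene roughly as below. This is only a sketch under stated assumptions: SketchReader and ReaderHolder are hypothetical stand-ins, and reopen(int) takes a version argument purely to simulate an index change (the reopen() call quoted above takes no arguments). It shows why the field, not a shadowing local, has to receive the new reader, and that the old reader is closed exactly once.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a reader; not a Lucene class. */
final class SketchReader {
    /** Number of readers currently open, to make leaks visible. */
    static final AtomicInteger OPEN = new AtomicInteger();

    private final int version;
    private boolean closed;

    SketchReader(int version) {
        this.version = version;
        OPEN.incrementAndGet();
    }

    /** Returns this when nothing changed, otherwise a new reader. */
    SketchReader reopen(int latestVersion) {
        return latestVersion == version ? this : new SketchReader(latestVersion);
    }

    void close() {
        if (!closed) {
            closed = true;
            OPEN.decrementAndGet();
        }
    }

    boolean isClosed() {
        return closed;
    }
}

/** Hypothetical wrapper that owns the current reader in a single field. */
final class ReaderHolder {
    private SketchReader in;

    ReaderHolder(SketchReader initial) {
        this.in = initial;
    }

    /** Swaps in the reopened reader and closes the old one exactly once. */
    synchronized boolean refresh(int latestVersion) {
        SketchReader old = this.in;
        SketchReader newInner = old.reopen(latestVersion);
        if (newInner == old) {
            return false; // nothing changed, keep the current reader
        }
        this.in = newInner; // assign the field itself, never a shadowing local
        old.close();
        return true;
    }

    SketchReader current() {
        return in;
    }
}
```

If the open-reader count keeps growing after repeated refreshes in a model like this, the old reader is not the one being closed, which is the situation the question about "in" versus "this.in" is trying to rule out.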


Re: IndexReader.reopen memory leak

2008-06-01 Thread Mark Miller


 
Yes...I constantly index with 8 threads on one writer while searching 
with many more threads. Then I let it run for like an hour and watch. 
The index is tiny to start and then grows to a moderate size...nothing 
crazy.


I am also reopening a lot on a real index of 3.5 million +  docs 
though...and I see no leak evidence there either.

A couple interesting limitations with these results:

In the reopen test I was only using one field. I'll try a lot more.

On the 3.5+ million index there are loads of fields, but field norms are 
off.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



ANN: New release Lucene-Oracle integration

2008-06-01 Thread Marcelo Ochoa
Hi All:
  I am releasing a new binary distribution of the Oracle-Lucene
integration, built on the Lucene-OJVM Data Cartridge.
  Here is the change log:
* Compiled against Lucene 2.3.2 production release
* Used latest API for merging based on RAM usage
* Use Writer for deleting during Sync
* Confirm 4x improvement during indexing reported by Lucene dev. group
* Fix workaround which changes order of the rowids in ODCRIDList
* Added a Spanish Wikipedia Analyzer for testing
* Reports IOException instead of RuntimeException to signal EOF or
File Not Found
* Decouple Flush functionality from TableIndexer
  I would like to say thanks a lot to Michael McCandless for helping
to solve a nice glitch with the Oracle JIT compiler which causes the
DocumentsWriter class not to work on the 11.1.0.6 release.
  The 11g binary version has a workaround for this problem.
  The Oracle OJVM dev. team told me that this problem is not
reproducible on the 11.2 and 11.1.0.7 versions.
  Latest binary dist. can be downloaded at:
http://sourceforge.net/project/showfiles.php?group_id=56183&package_id=255524&release_id=603580
  I have also posted a new entry on my blog with some performance
results against a Spanish Wikipedia dump uploaded to XMLDB:
http://marceloochoa.blogspot.com/2008/06/new-binary-release-of-lucene-oracle.html
  Latest documentation is at:
http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg
  Best regards, Marcelo.

-- 
Marcelo F. Ochoa
http://marceloochoa.blogspot.com/
http://marcelo.ochoa.googlepages.com/home
__
Do you Know DBPrism? Look @ DB Prism's Web Site
http://www.dbprism.com.ar/index.html
More info?
Chapter 17 of the book "Programming the Oracle Database using Java &
Web Services"
http://www.amazon.com/gp/product/183296/
Chapter 21 of the book "Professional XML Databases" - Wrox Press
http://www.amazon.com/gp/product/1861003587/
Chapter 8 of the book "Oracle & Open Source" - O'Reilly
http://www.oreilly.com/catalog/oracleopen/




Re: Displaying and highlighting results from a Wild Card and Fuzzy search using Lucene in Java

2008-06-01 Thread Daniel Naber
On Sunday, 1 June 2008, syedfa wrote:

> I am trying to display my results from doing a search of an xml document
> (some quotes from shakespeare's "Hamlet") using a WildCard and Fuzzy
> search, and then I'm trying to highlight the keyword(s) in the results,
> but unfortunately I am having problems.

Please see
http://wiki.apache.org/lucene-java/LuceneFAQ#head-75566820ee94a425c7e2950ac61d24e405fbd914

regards
 Daniel

-- 
http://www.danielnaber.de




RE: how to unsubscribe?

2008-06-01 Thread Chris Hostetter

: I've already tried this but the subject line is fixed and I wrote a roman to
: convince the mail daemon that I'm not interested in spamming.. but it didn't
: care :)

Silly question, but you were sending your email to 
"[EMAIL PROTECTED]" and not "[EMAIL PROTECTED]" correct?

Are you still having a problem, or were you able to unsubscribe?

If it is still a problem, my suggestion is to file an "INFRA" bug in 
Jira using the "Mailing List" component, and attach copies of the email 
you sent (with full headers) and the response you got (with full 
headers)...

https://issues.apache.org/jira/browse/INFRA


-Hoss




Re: Opening an index directory inside a jar

2008-06-01 Thread Chris Hostetter

: The crux of the issue seems to be that lucene cannot open segments file that
: is inside the jar (under luceneFiles/index directory)

i'm not entirely sure why it would have problems finding the segments 
file, but a larger problem is that Lucene needs random access which (last 
time i checked) isn't available when accessing files in jars...

http://www.nabble.com/Accessing-Lucene-Index-stored-in-a-jar-file-to3009604.html

...you can always include the index in a jar, and then extract it before 
using it.

: unit/integration/functional tests depend on index files to be created. The
: manual step of creating the index files breaks the automated CI builds or
: some reliance on building the index in some tmp directory. Unfortunately
: that approach has issues if we run tests concurrently. Also, building the
: index takes a couple of minutes, so generating them on the fly for tests is
: expensive and increases the build time.

there's no inherent reason why concurrent tests need to collide if you use 
temp directories -- just have each test create its own private tmp 
directory and copy the index (or extract the index from the jar) to 
that private directory in the "setUp" method.
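
A minimal sketch of that setUp step, using only the JDK. The class name IndexExtractor and the idea of passing an entry prefix such as "luceneFiles/index/" are assumptions for illustration; each call creates a fresh temp directory, so concurrent tests never share files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;

/** Hypothetical helper; not part of Lucene. */
final class IndexExtractor {
    private IndexExtractor() {}

    /**
     * Copies every file entry whose name starts with prefix from the jar
     * stream into a newly created private temp directory and returns it.
     */
    static Path extractToTempDir(InputStream jarStream, String prefix) throws IOException {
        Path target = Files.createTempDirectory("lucene-test-index");
        try (JarInputStream jar = new JarInputStream(jarStream)) {
            JarEntry entry;
            while ((entry = jar.getNextJarEntry()) != null) {
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith(prefix)) {
                    continue;
                }
                // Assumes the index directory is flat, so only the file name is kept.
                String fileName = name.substring(name.lastIndexOf('/') + 1);
                // Files.copy reads to the end of the current entry and leaves the jar open.
                Files.copy(jar, target.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
            }
        }
        return target;
    }
}
```

The returned directory can then be opened as an ordinary file-system index, and removed again in tearDown.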




-Hoss





Re: date filter filtering out non-dated items?

2008-06-01 Thread Chris Hostetter

: While I could add a future date to these documents, this kind of feels 
: hackish and I would be interested in other ideas on how to filter out 
: expired documents.

this just came up on the solr list, the answer is equally applicable 
but note that you'll need to combine it with some other query class 
(MatchAllDocs perhaps)...

>> you have to invert your logic.  docs that "have not yet expired, or will 
>> never expire" match the negated query for "docs expired in the past" 

http://www.nabble.com/expression-in-an-fq-parameter-fails-to17353677.html#a17375261
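
The inversion can be illustrated with plain Java, independent of any query class. The Doc type and its nullable expiresAt field are assumptions made up for this sketch: a positive "expires after now" test drops undated documents, while "everything except what expired in the past" keeps them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the filtering logic; not Lucene code. */
final class ExpiryFilterSketch {
    /** expiresAt is null when the document never expires. */
    static final class Doc {
        final String id;
        final Long expiresAt;

        Doc(String id, Long expiresAt) {
            this.id = id;
            this.expiresAt = expiresAt;
        }
    }

    /** Positive range test: silently drops documents that have no date. */
    static List<String> notYetExpired(List<Doc> docs, long now) {
        List<String> ids = new ArrayList<>();
        for (Doc d : docs) {
            if (d.expiresAt != null && d.expiresAt > now) {
                ids.add(d.id);
            }
        }
        return ids;
    }

    /** Inverted logic: all documents minus those that expired in the past. */
    static List<String> allMinusExpired(List<Doc> docs, long now) {
        List<String> ids = new ArrayList<>();
        for (Doc d : docs) {
            boolean expiredInPast = d.expiresAt != null && d.expiresAt <= now;
            if (!expiredInPast) {
                ids.add(d.id);
            }
        }
        return ids;
    }
}
```

In query terms, the "all documents" part is what the extra match-all clause mentioned above supplies, and the "expired in the past" part is the negated range.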




-Hoss





Re: Frequencies sorted by frequencies

2008-06-01 Thread Grant Ingersoll
I don't know of a way, sorry.  Most of the Similarity methods do not  
take a field name.



On May 29, 2008, at 9:20 AM, Hider, Sandy wrote:


Thanks for taking the time to answer.  I see what you mean.  The thing
is I also plan on using the standard score.  Would there be a way to use
both the standard score and the TF-only Score in a single index?

Sandy


-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 28, 2008 2:34 PM
To: java-user@lucene.apache.org
Subject: Re: Frequencies sorted by frequencies

I think you could override all the Similarity factors except tf() with
1, such that the term frequency is the only factor in the scoring.
Then you just submit the term as a query.  Note, I think you will need
to override the similarity during indexing, too, so that norm length is
turned off, too.  Note, I haven't tried it :-).  Use the explain()
functionality to double check.  At any rate, it should be quick to test.


See
http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/Similarity.html

-Grant


On May 28, 2008, at 10:48 AM, Hider, Sandy wrote:


Hi All,
I am trying to figure out a quick way to find the top N documents
sorted by frequency of a term.

I found:

IndexReader.termDocs()

which provides an enumeration of doc() and freq() but it returns an
enumeration sorted by doc number.   Is there a way to get the results
sorted by freq?  Or is there another query I can run to find these
results?

Thanks in advance,

Sandy
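
One alternative to changing the scoring is to walk the doc-ordered enumeration once and keep only the N highest frequencies in a small heap. The sketch below is plain Java and does not call Lucene: the docs and freqs arrays are assumed to hold the values read from doc() and freq(), and the names TopFreqSketch and Posting are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper that selects the N postings with the highest freq. */
final class TopFreqSketch {
    static final class Posting {
        final int doc;
        final int freq;

        Posting(int doc, int freq) {
            this.doc = doc;
            this.freq = freq;
        }
    }

    /** docs and freqs are parallel arrays in doc-number order. */
    static List<Posting> topN(int[] docs, int[] freqs, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Weakest posting first: lowest freq, and on ties the higher doc number.
        Comparator<Posting> weakestFirst = (a, b) -> a.freq != b.freq
                ? Integer.compare(a.freq, b.freq)
                : Integer.compare(b.doc, a.doc);
        PriorityQueue<Posting> heap = new PriorityQueue<>(n, weakestFirst);
        for (int i = 0; i < docs.length; i++) {
            Posting p = new Posting(docs[i], freqs[i]);
            if (heap.size() < n) {
                heap.add(p);
            } else if (weakestFirst.compare(p, heap.peek()) > 0) {
                heap.poll(); // evict the current weakest
                heap.add(p);
            }
        }
        List<Posting> result = new ArrayList<>(heap);
        result.sort(weakestFirst.reversed()); // highest freq first
        return result;
    }
}
```

The pass is linear in the number of postings for the term and only ever holds N entries, so it stays cheap even when the term occurs in many documents.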



--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











Re: Lucene search time in real production use?

2008-06-01 Thread Grant Ingersoll

Those benchmarks are pretty old, I think.

-Grant

On May 31, 2008, at 12:28 PM, Karl Wettin wrote:



On 31 May 2008, at 14:25, lucene user wrote:
What are some average search and retrieval times for Lucene queries in real
production use? Would people include relevant stuff like the number of
documents in your index, etc.?

Thanks for your help!


http://lucene.apache.org/java/docs/benchmarks.html

How well it works depends on many factors: what your corpus looks
like, the load on the index, what sort of queries are executed, hardware,
etc. You can estimate how your application will perform by using and
extending the benchmark contrib tool.



karl




--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











RE: how to unsubscribe?

2008-06-01 Thread Daniel Freudenberger
As you can see I'm still part of this list.

I'll submit a bug report.

Thanks in advance,
Daniel

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Sunday, June 01, 2008 9:16 PM
To: Lucene Users
Cc: Daniel Freudenberger
Subject: RE: how to unsubscribe?


: I've already tried this but the subject line is fixed and I wrote a roman to
: convince the mail daemon that I'm not interested in spamming.. but it didn't
: care :)

Silly question, but you were sending your email to 
"[EMAIL PROTECTED]" and not "[EMAIL PROTECTED]" correct?

Are you still having a problem, or were you able to unsubscribe?

If it is still a problem, my suggestion is to file an "INFRA" bug in 
Jira using the "Mailing List" component, and attach copies of the email 
you sent (with full headers) and the response you got (with full 
headers)...

https://issues.apache.org/jira/browse/INFRA


-Hoss





Re: Opening an index directory inside a jar

2008-06-01 Thread Doron Cohen
>
> : The crux of the issue seems to be that lucene cannot open segments file that
> : is inside the jar (under luceneFiles/index directory)
>
> i'm not entirely sure why it would have problems finding the segments
> file, but a larger problem is that Lucene needs random access which (last
> time i checked) isn't available when accessing files in jars...
>
>
> http://www.nabble.com/Accessing-Lucene-Index-stored-in-a-jar-file-to3009604.html
>

I bumped into (though never used)
http://commons.apache.org/vfs/apidocs/org/apache/commons/vfs/provider/jar/JarFileSystem.html
There, FileContent has this method
getRandomAccessContent(org.apache.commons.vfs.util.RandomAccessMode)
so it seems worth exploring.

HTH,
Doron


Re: Boolean Query Issue

2008-06-01 Thread Sonu Sudhakar
Hi,

I have done some more analysis on this issue. I think it is related to
Lucene's default operator.
I get the expected results when I set the default operator to 'OR', but
I run into the problem when I set the default operator to 'AND'.

The following are the Lucene QueryParser outputs for both cases.

Query: TTL:store AND TTL:data OR TTL:variable

1. When the Lucene default operator is 'OR'

QueryParser output using the toString method:
+TTL:store +TTL:data TTL:variable


2. When the Lucene default operator is 'AND'

QueryParser output using the toString method:
+TTL:store TTL:data TTL:variable

The output of the second case is confusing me.

Could anybody please give me an explanation for this behavior?

Thanks,
Sonu
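
One way to read the two outputs above is that the parser does not build a boolean expression tree; it walks the clauses left to right and only adjusts the required/optional flag of the current and the previous clause. The plain-Java model below is a simplified reconstruction chosen to reproduce exactly those two toString outputs. It is an assumption, not the QueryParser source, and it ignores NOT, +/- prefixes and parentheses; the names ClauseFlagSketch and parse are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of left-to-right clause flagging; not Lucene code. */
final class ClauseFlagSketch {
    enum Occur { MUST, SHOULD }

    /**
     * tokens alternates terms and the conjunctions "AND" / "OR", e.g.
     * {"TTL:store", "AND", "TTL:data", "OR", "TTL:variable"}.
     */
    static String parse(String[] tokens, boolean defaultIsAnd) {
        List<String> terms = new ArrayList<>();
        List<Occur> occurs = new ArrayList<>();
        String conj = null; // conjunction that introduces the current term
        for (String token : tokens) {
            if (token.equals("AND") || token.equals("OR")) {
                conj = token;
                continue;
            }
            int last = terms.size() - 1;
            if (last >= 0 && "AND".equals(conj)) {
                occurs.set(last, Occur.MUST); // AND makes the previous clause required
            }
            if (last >= 0 && defaultIsAnd && "OR".equals(conj)) {
                occurs.set(last, Occur.SHOULD); // OR makes the previous clause optional
            }
            // With default OR a term is required only after AND;
            // with default AND it is required unless it follows OR.
            boolean required = defaultIsAnd ? !"OR".equals(conj) : "AND".equals(conj);
            terms.add(token);
            occurs.add(required ? Occur.MUST : Occur.SHOULD);
            conj = null;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < terms.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            if (occurs.get(i) == Occur.MUST) {
                sb.append('+');
            }
            sb.append(terms.get(i));
        }
        return sb.toString();
    }
}
```

Under this reading, the 'AND' case leaves only TTL:store required, which is consistent with the 76 results reported for that query in the quoted message below (the same count as TTL:store alone). Explicit parentheses, as suggested earlier in the thread, avoid relying on this left-to-right behavior.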

On Thu, May 29, 2008 at 3:49 PM, Sonu Sudhakar <[EMAIL PROTECTED]> wrote:

> Erick,
>
> Thanks for your reply.
>
> I am working with approximately 1 million documents. They are indexed in 4
> servers. Each document has multiple fields. I am using ParallelMultiSearcher
> for searching purpose.
>
> I tried a few queries in the title(TTL) field.
>
> i started with a simple query without boolean operators.
>
> *1. TTL:data => 3733 results (all matches had "data" in title)*
>
> Then I tried a second one with AND operator.
>
> *2. TTL:data AND TTL:store => 19 results*
>
> I analyzed the results. the results had both "data" and "store" in the
> title.
>
> *I then tried OR operator*
>
> *3. TTL:data AND TTL:store OR TTL:variable*
>
> I got 3,733 results., same as the query TTL:data.
>
> I even tried giving a meaningless query
>
> TTL:data AND TTL:storet OR TTL:variablet => 3,733 results (The
> results were same as that of TTL:data.)
>
> TTL:data AND TTL:computer OR TTL:device => 3,733 results (this also showed
> the same results as above)
>
> The same thing repeats for other cases too. The queries below also behaved
> the same way.
> i.e. -
>
> 1. TTL:store AND TTL:data OR TTL:variable => 76 results
> 2. TTL:store AND TTL:data OR TTL:variable => 76 results
> 3. TTL:store AND TTL:computer OR TTL:device => 76 results
>
>
> 1. TTL:variable AND TTL:data OR  TTL:store => 1,496 results
> 2. TTL:variable AND TTL:data OR  TTL:store => 1,496 results
> 3. TTL:variable AND TTL:computer OR  TTL:device => 1,496 results
>
> I hope you have a clearer picture of my issue now.
>
> Thanks,
> Sonu
>
>
> On Wed, May 28, 2008 at 7:09 PM, Erick Erickson <[EMAIL PROTECTED]>
> wrote:
>
>> It's unclear what you *should* expect. How is your data
>> distributed?
>>
>> In other words, how many documents do you have? In this example,
>> for instance,
>> 1. TTL:data AND TTL:store OR TTL:variable => 3,733 results
>> it considered the TTL:data part only.
>>
>> it's perfectly reasonable if every document that had "variable" in the
>> field *also* has "data" and "store" in the field. So your numbers
>> don't give us much to work with.
>>
>> Remember, though, that Lucene syntax isn't a pure boolean syntax. See
>>
>> http://wiki.apache.org/lucene-java/BooleanQuerySyntax
>>
>> And when in doubt parenthesize ...
>>
>> Best
>> Erick
>>
>> On Wed, May 28, 2008 at 7:44 AM, Sonu Sudhakar <[EMAIL PROTECTED]> wrote:
>>
>> > Hi,
>> >
>> > I have some issue with boolean queries.
>> >
>> > I am using Lucene-core-2.3.1.
>> >
>> > I have done test on boolean query with 3 terms (data, store, variable) in my
>> > TTL field. The TTL field is indexed and searched using StandardAnalyzer.
>> >
>> > The three terms when searched individually gave the following result
>> >
>> > 1. TTL:data  => 3,733 results
>> > 2. TTL:store  => 76 results
>> > 3. TTL:variable  => 1,496 results
>> >
>> > But found issue when combining these terms with boolean operators.
>> >
>> > e.g.
>> > 1. TTL:data AND TTL:store OR TTL:variable => 3,733 results
>> > it considered the TTL:data part only.
>> >
>> > 2. TTL:store AND TTL:data OR TTL:variable => 76 results
>> > it considered  the TTL:store part only.
>> >
>> > 3. TTL:variable AND TTL:data OR  TTL:store => 1,496 results
>> > it considered  the TTL:variable part only.
>> >
>> > But I am getting correct result when combining terms with 'AND' operator. I
>> > think the issue is with 'OR' operator.
>> >
>> >
>> > Could anybody give an explanation for this behavior of lucene?
>> > Could you give suggestions to rectify this?
>> >
>> > Thanks,
>> > Sonu
>> >
>>
>
>


Re: How to add PageRank score with lucene's relevant score in sorting

2008-06-01 Thread Doron Cohen
Hi Jarvis,


> I have a problem: how to "combine" two scores to sort the search
> result documents.
> For example, I have 10 million pages in a Lucene index, and I know their
> pagerank scores. I give it a query, and every doc returned has a
> lucene-score, call it R (relevance score), and I also have its
> pagerank score, call it P. What I need is to sort the search
> results based on the value "P+R". If I store the pagerank score in the
> index and fetch it on every search, then compute P+R, then sort, this
> way is too slow. In my system, when the search hits 50 results, the
> sort may cost about 20s.
>

Check CustomScoreQuery in
http://lucene.apache.org/java/2_3_2/api/core/org/apache/lucene/search/function/package-summary.html

Probably something like this:
- implement a ValueSource on top of the pagerank values,
- create a ValueSourceQuery on top of it,
- create a CustomScoreQuery on top of the original query and the
ValueSourceQuery.
Note that by default, CustomScoreQuery multiplies the scores, but you can
override this.

Doron
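
The effect of multiplying versus adding can be seen in a small stand-alone model. This is plain Java, not the function-query API: CombinedScoreSketch, Combiner and rank are hypothetical names, and the pagerank values are assumed to sit in an in-memory array indexed by document number rather than being fetched from the index for every hit.

```java
import java.util.Arrays;

/** Hypothetical model of combining a relevance score with a per-doc value. */
final class CombinedScoreSketch {
    interface Combiner {
        float combine(float relevance, float pageRank);
    }

    /** The default behavior described above: multiply the two scores. */
    static final Combiner MULTIPLY = (r, p) -> r * p;

    /** The "P+R" combination asked for in the question. */
    static final Combiner ADD = (r, p) -> r + p;

    /**
     * hitDocs and relevance are parallel arrays for the matching documents;
     * pageRankByDoc is indexed by document number. Returns the hit document
     * numbers ordered by combined score, best first.
     */
    static int[] rank(int[] hitDocs, float[] relevance, float[] pageRankByDoc, Combiner combiner) {
        Integer[] order = new Integer[hitDocs.length];
        float[] combined = new float[hitDocs.length];
        for (int i = 0; i < hitDocs.length; i++) {
            order[i] = i;
            combined[i] = combiner.combine(relevance[i], pageRankByDoc[hitDocs[i]]);
        }
        // Sort hit positions by descending combined score.
        Arrays.sort(order, (a, b) -> Float.compare(combined[b], combined[a]));
        int[] ranked = new int[hitDocs.length];
        for (int i = 0; i < order.length; i++) {
            ranked[i] = hitDocs[order[i]];
        }
        return ranked;
    }
}
```

The two combiners can order the same hits differently, so it is worth deciding which one is wanted before overriding the default.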