Re: IndexReader.reopen memory leak
Hi John,

> IndexReader newInner=in.reopen();
> if (in!=newInner)
> {
>   in.close();
>   this.in=newInner;
>
>   // code to clean up my data
>   _cache.clear();
>   _indexData.load(this, true);
>   init(_fieldConfig);
> }

Just to be sure on this, could you confirm the two appearances above:
- in
- this.in
refer to exactly the same variable?

Assuming they are, could you provide some more code:
- entire method containing the above code
- method reopen() of your FilteredIndexReader.
- method newReader()
- constructor of FilteredIndexReader if it is invoked from newReader()

Regards,
Doron
Re: IndexReader.reopen memory leak
Yes...I constantly index with 8 threads on one writer while searching with many more threads. Then I let it run for like an hour and watch. The index is tiny to start and then grows to a moderate size...nothing crazy. I am also reopening a lot on a real index of 3.5 million + docs though...and I see no leak evidence there either. A couple interesting limitations with these results: In the reopen test I was only using one field. I'll try a lot more. On the 3.5+ million index there are loads of fields, but field norms are off. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
ANN: New release Lucene-Oracle integration
Hi All:

I am releasing a new binary distribution of the Oracle-Lucene integration, built on the Lucene-OJVM Data Cartridge. Here is the change log:

* Compiled against the Lucene 2.3.2 production release
* Uses the latest API for merging based on RAM usage
* Uses the Writer for deleting during Sync
* Confirms the 4x indexing improvement reported by the Lucene dev. group
* Fixes a workaround that changed the order of the rowids in ODCRIDList
* Adds a Spanish Wikipedia Analyzer for testing
* Reports IOException instead of RuntimeException to signal EOF or File Not Found
* Decouples Flush functionality from TableIndexer

Many thanks to Michael McCandless for helping to solve a glitch with the Oracle JIT compiler that stops the DocumentsWriter class from working on the 11.1.0.6 release. The 11g binary version includes a workaround for this problem. The Oracle OJVM dev. team told me that the problem is not reproducible on versions 11.2 and 11.1.0.7.

The latest binary dist. can be downloaded at:
http://sourceforge.net/project/showfiles.php?group_id=56183&package_id=255524&release_id=603580

I have also posted a new entry on my blog with some performance results against a Spanish Wikipedia dump uploaded to XMLDB:
http://marceloochoa.blogspot.com/2008/06/new-binary-release-of-lucene-oracle.html

The latest documentation is at:
http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg

Best regards, Marcelo.

--
Marcelo F. Ochoa
http://marceloochoa.blogspot.com/
http://marcelo.ochoa.googlepages.com/home
__
Do you Know DBPrism? Look @ DB Prism's Web Site
http://www.dbprism.com.ar/index.html
More info?
Chapter 17 of the book "Programming the Oracle Database using Java & Web Services"
http://www.amazon.com/gp/product/183296/
Chapter 21 of the book "Professional XML Databases" - Wrox Press
http://www.amazon.com/gp/product/1861003587/
Chapter 8 of the book "Oracle & Open Source" - O'Reilly
http://www.oreilly.com/catalog/oracleopen/
Re: Displaying and highlighting results from a Wild Card and Fuzzy search using Lucene in Java
On Sunday, 1 June 2008, syedfa wrote:

> I am trying to display my results from doing a search of an xml document
> (some quotes from shakespeare's "Hamlet") using a WildCard and Fuzzy
> search, and then I'm trying to highlight the keyword(s) in the results,
> but unfortunately I am having problems.

Please see
http://wiki.apache.org/lucene-java/LuceneFAQ#head-75566820ee94a425c7e2950ac61d24e405fbd914

regards
Daniel

--
http://www.danielnaber.de
RE: how to unsubscribe?
: I've already tried this but the subject line is fixed and I wrote a novel to
: convince the mail daemon that I'm not interested in spamming.. but it didn't
: care :)

Silly question, but you were sending your email to "[EMAIL PROTECTED]" and not "[EMAIL PROTECTED]", correct?

Are you still having a problem, or were you able to unsubscribe? If it is still a problem, my suggestion is to file an "INFRA" bug in Jira using the "Mailing List" component, and attach copies of the email you sent (with full headers) and the response you got (with full headers)...

https://issues.apache.org/jira/browse/INFRA

-Hoss
Re: Opening an index directory inside a jar
: The crux of the issue seems to be that lucene cannot open segments file that
: is inside the jar (under luceneFiles/index directory)

i'm not entirely sure why it would have problems finding the segments file, but a larger problem is that Lucene needs random access which (last time i checked) isn't available when accessing files in jars...

http://www.nabble.com/Accessing-Lucene-Index-stored-in-a-jar-file-to3009604.html

...you can always include the index in a jar, and then extract it before using it.

: unit/integration/functional tests depend on index files to be created. The
: manual step of creating the index files breaks the automated CI builds or
: some reliance on building the index in some tmp directory. Unfortunately
: that approach has issues if we run tests concurrently. Also, building the
: index takes a couple of minutes, so generating them on the fly for tests is
: expensive and increases the build time.

there's no inherent reason why concurrent tests need to collide if you use temp directories -- just have each test create its own private tmp directory and copy the index (or extract the index from the jar) to that private directory in the "setUp" method.

-Hoss
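
The extract-to-a-private-directory idea can be sketched roughly as follows. This is only an illustration, not code from the thread: the class and method names are hypothetical, it uses java.nio.file (so a newer JDK than was current in 2008), and it assumes the index files sit flat under one jar prefix such as "luceneFiles/index/".

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

/** Hypothetical test helper: unpacks a jarred index into a fresh private directory. */
final class JarIndexExtractor {

    /**
     * Copies every file stored under entryPrefix (for example "luceneFiles/index/")
     * from the jar into a newly created temp directory and returns that directory.
     */
    static Path extractToPrivateTempDir(Path jar, String entryPrefix) {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            // createTempDirectory generates a unique name on every call,
            // so tests running concurrently never share a directory.
            Path target = Files.createTempDirectory("lucene-index-");
            Enumeration<JarEntry> entries = jarFile.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith(entryPrefix)) {
                    continue; // not part of the index
                }
                // Assumption: index files are flat, so only the file name is kept.
                String fileName = name.substring(name.lastIndexOf('/') + 1);
                try (InputStream in = jarFile.getInputStream(entry)) {
                    Files.copy(in, target.resolve(fileName));
                }
            }
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A test's setUp method would call extractToPrivateTempDir once and open the index from the returned directory; tearDown would delete that directory again.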
Re: date filter filtering out non-dated items?
: While I could add a future date to these documents, this kind of feels
: hackish and I would be interested in other ideas on how to filter out
: expired documents.

this just came up on the solr list, the answer is equally applicable but note that you'll need to combine it with some other query class (MatchAllDocs perhaps)...

>> you have to invert your logic. docs that "have not yet expired, or will
>> never expire" match the negated query for "docs expired in the past"

http://www.nabble.com/expression-in-an-fq-parameter-fails-to17353677.html#a17375261

-Hoss
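
A minimal sketch of the inverted logic, assuming Lucene 2.3.x on the classpath. The field name "expires" and the day-resolution DateTools encoding are assumptions made for illustration; neither appears in the thread.

```java
import java.util.Date;

import org.apache.lucene.document.DateTools;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreRangeQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;

final class NotExpiredQuery {

    /** Matches documents that have not expired yet or carry no expiry date at all. */
    static Query notExpired(Date now) {
        // Assumption: "expires" was indexed untokenized with the same DateTools resolution.
        String today = DateTools.dateToString(now, DateTools.Resolution.DAY);

        // "Docs expired in the past": open lower bound (null), upper bound today, inclusive.
        Query expired = new ConstantScoreRangeQuery("expires", null, today, false, true);

        // A BooleanQuery made only of prohibited clauses matches nothing,
        // so a match-all clause supplies the set to subtract from.
        BooleanQuery query = new BooleanQuery();
        query.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
        query.add(expired, BooleanClause.Occur.MUST_NOT);
        return query;
    }
}
```

Documents without an "expires" term never match the range query, so the MUST_NOT clause leaves them in the result, which is the behaviour asked for.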
Re: Frequencies sorted by frequencies
I don't know of a way, sorry. Most of the Similarity methods do not take a field name.

On May 29, 2008, at 9:20 AM, Hider, Sandy wrote:

> Thanks for taking the time to answer. I see what you mean. The thing is I also plan on using the standard score. Would there be a way to use both the standard score and the TF-only score in a single index?
>
> Sandy
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, May 28, 2008 2:34 PM
> To: java-user@lucene.apache.org
> Subject: Re: Frequencies sorted by frequencies
>
> I think you could override all the Similarity factors except tf() with 1, such that the term frequency is the only factor in the scoring. Then you just submit the term as a query. Note, I think you will need to override the similarity during indexing, too, so that norm length is turned off, too.
>
> Note, I haven't tried it :-). Use the explain() functionality to double check. At any rate, it should be quick to test.
>
> See http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/Similarity.html
>
> -Grant
>
> On May 28, 2008, at 10:48 AM, Hider, Sandy wrote:
>
>> Hi All,
>>
>> I am trying to figure out a quick way to find the top N documents sorted by frequency of a term. I found IndexReader.termDocs(), which provides an enumeration of doc() and freq(), but it returns an enumeration sorted by doc number. Is there a way to get the results sorted by freq? Or is there another query I can run to find these results?
>>
>> Thanks in advance,
>> Sandy

--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
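
For the original question (top N documents by frequency when termDocs() enumerates in doc-number order), one option that nobody in the thread proposed is to keep a bounded min-heap while walking the enumeration. The sketch below is plain Java with hypothetical names; two int arrays stand in for the doc() and freq() values that a real loop over termDocs would read.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopDocsByFreq {

    /** One (document number, term frequency) pair, as reported by doc() and freq(). */
    static final class DocFreq {
        final int doc;
        final int freq;

        DocFreq(int doc, int freq) {
            this.doc = doc;
            this.freq = freq;
        }
    }

    private static final Comparator<DocFreq> BY_FREQ = Comparator.comparingInt(d -> d.freq);

    /** Returns the n pairs with the highest frequency, highest first. */
    static List<DocFreq> topN(int[] docs, int[] freqs, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Min-heap on freq: the head is always the weakest of the current top n.
        PriorityQueue<DocFreq> heap = new PriorityQueue<>(n, BY_FREQ);
        for (int i = 0; i < docs.length; i++) {
            if (heap.size() < n) {
                heap.add(new DocFreq(docs[i], freqs[i]));
            } else if (freqs[i] > heap.peek().freq) {
                heap.poll(); // evict the weakest entry
                heap.add(new DocFreq(docs[i], freqs[i]));
            }
        }
        List<DocFreq> result = new ArrayList<>(heap);
        result.sort(BY_FREQ.reversed());
        return result;
    }
}
```

This visits every posting for the term once and holds at most n entries in memory. It ranks by raw frequency only, so unlike the Similarity approach it does not go through Lucene scoring at all.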
Re: Lucene search time in real production use?
Those benchmarks are pretty old, I think.

-Grant

On May 31, 2008, at 12:28 PM, Karl Wettin wrote:

> On 31 May 2008, at 14:25, lucene user wrote:
>
>> What are some average search and retrieval times for Lucene queries in real production use? Would people include relevant stuff like the number of documents in your index, etc.? Thanks for your help!
>
> http://lucene.apache.org/java/docs/benchmarks.html
>
> How well it works depends on many factors: what your corpus looks like, load on the index, what sort of queries are executed, hardware, etc. You can estimate how your application will work by using and extending the benchmarker contrib tool.
>
> karl

--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
RE: how to unsubscribe?
As you can see I'm still part of this list. I'll submit a bug report.

Thanks in advance,
Daniel

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Sunday, June 01, 2008 9:16 PM
To: Lucene Users
Cc: Daniel Freudenberger
Subject: RE: how to unsubscribe?

: I've already tried this but the subject line is fixed and I wrote a novel to
: convince the mail daemon that I'm not interested in spamming.. but it didn't
: care :)

Silly question, but you were sending your email to "[EMAIL PROTECTED]" and not "[EMAIL PROTECTED]", correct?

Are you still having a problem, or were you able to unsubscribe? If it is still a problem, my suggestion is to file an "INFRA" bug in Jira using the "Mailing List" component, and attach copies of the email you sent (with full headers) and the response you got (with full headers)...

https://issues.apache.org/jira/browse/INFRA

-Hoss
Re: Opening an index directory inside a jar
> : The crux of the issue seems to be that lucene cannot open segments file that
> : is inside the jar (under luceneFiles/index directory)
>
> i'm not entirely sure why it would have problems finding the segments
> file, but a larger problem is that Lucene needs random access which (last
> time i checked) isn't available when accessing files in jars...
>
> http://www.nabble.com/Accessing-Lucene-Index-stored-in-a-jar-file-to3009604.html

I bumped into (though never used)
http://commons.apache.org/vfs/apidocs/org/apache/commons/vfs/provider/jar/JarFileSystem.html

There, FileContent has this method:
getRandomAccessContent(org.apache.commons.vfs.util.RandomAccessMode)
so it seems worth exploring.

HTH,
Doron
Re: Boolean Query Issue
Hi,

I have done some more analysis on this issue. I think it is related to Lucene's default operator. I get exact results when I set the default operator to 'OR', but I run into the problem when I set the default operator to 'AND'.

The following are the Lucene QueryParser outputs for both cases.

Query: TTL:store AND TTL:data OR TTL:variable

1. When the Lucene default operator is 'OR', the QueryParser output using the toString method is:
   +TTL:store +TTL:data TTL:variable

2. When the Lucene default operator is 'AND', the QueryParser output using the toString method is:
   +TTL:store TTL:data TTL:variable

The output of the second case is confusing me. Could anybody please give me an explanation for this behavior?

Thanks,
Sonu

On Thu, May 29, 2008 at 3:49 PM, Sonu Sudhakar <[EMAIL PROTECTED]> wrote:

> Erick,
>
> Thanks for your reply.
>
> I am working with approximately 1 million documents. They are indexed on 4
> servers. Each document has multiple fields. I am using ParallelMultiSearcher
> for searching.
>
> I tried a few queries on the title (TTL) field.
>
> I started with a simple query without boolean operators.
>
> 1. TTL:data => 3733 results (all matches had "data" in title)
>
> Then I tried a second one with the AND operator.
>
> 2. TTL:data AND TTL:store => 19 results
>
> I analyzed the results. The results had both "data" and "store" in the
> title.
>
> I then tried the OR operator.
>
> 3. TTL:data AND TTL:store OR TTL:variable
>
> I got 3,733 results, the same as for the query TTL:data.
>
> I even tried giving a meaningless query:
>
> TTL:data AND TTL:storet OR TTL:variablet => 3,733 results (the
> results were the same as those of TTL:data)
>
> TTL:data AND TTL:computer OR TTL:device => 3,733 results (this also showed
> the same results as above)
>
> The same thing repeats for other cases too. The queries below also behaved
> the same way, i.e.:
>
> 1. TTL:store AND TTL:data OR TTL:variable => 76 results
> 2. TTL:store AND TTL:data OR TTL:variable => 76 results
> 3. TTL:store AND TTL:computer OR TTL:device => 76 results
>
> 1. TTL:variable AND TTL:data OR TTL:store => 1,496 results
> 2. TTL:variable AND TTL:data OR TTL:store => 1,496 results
> 3. TTL:variable AND TTL:computer OR TTL:device => 1,496 results
>
> I hope you have a clearer picture of my issue now.
>
> Thanks,
> Sonu
>
> On Wed, May 28, 2008 at 7:09 PM, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
>> It's unclear what you *should* expect. How is your data
>> distributed?
>>
>> In other words, how many documents do you have? In this example,
>> for instance,
>> 1. TTL:data AND TTL:store OR TTL:variable => 3,733 results
>> it considered the TTL:data part only.
>>
>> it's perfectly reasonable if every document that had "variable" in the
>> field *also* has "data" and "store" in the field. So your numbers
>> don't give us much to work with.
>>
>> Remember, though, that Lucene syntax isn't a pure boolean syntax. See
>>
>> http://wiki.apache.org/lucene-java/BooleanQuerySyntax
>>
>> And when in doubt parenthesize ...
>>
>> Best
>> Erick
>>
>> On Wed, May 28, 2008 at 7:44 AM, Sonu Sudhakar <[EMAIL PROTECTED]> wrote:
>>
>> > Hi,
>> >
>> > I have some issue with boolean queries.
>> >
>> > I am using Lucene-core-2.3.1.
>> >
>> > I have done a test on a boolean query with 3 terms (data, store, variable)
>> > in my TTL field. The TTL field is indexed and searched using
>> > StandardAnalyzer.
>> >
>> > The three terms when searched individually gave the following result
>> >
>> > 1. TTL:data => 3,733 results
>> > 2. TTL:store => 76 results
>> > 3. TTL:variable => 1,496 results
>> >
>> > But I found an issue when combining these terms with boolean operators.
>> >
>> > e.g.
>> > 1. TTL:data AND TTL:store OR TTL:variable => 3,733 results
>> > it considered the TTL:data part only.
>> >
>> > 2. TTL:store AND TTL:data OR TTL:variable => 76 results
>> > it considered the TTL:store part only.
>> >
>> > 3. TTL:variable AND TTL:data OR TTL:store => 1,496 results
>> > it considered the TTL:variable part only.
>> >
>> > But I am getting the correct result when combining terms with the 'AND'
>> > operator. I think the issue is with the 'OR' operator.
>> >
>> > Could anybody give an explanation for this behavior of Lucene?
>> > Could you give suggestions to rectify this?
>> >
>> > Thanks,
>> > Sonu
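
Erick's "when in doubt parenthesize" advice can also be followed by building the query in code, which sidesteps the parser's operator handling and the default-operator setting altogether. The sketch below assumes Lucene 2.3.x and assumes the intended meaning is (store AND data) OR variable; the thread never states which grouping Sonu wants. The class and method names are hypothetical.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

final class ExplicitBooleanQuery {

    /** Builds (TTL:store AND TTL:data) OR TTL:variable with explicit grouping. */
    static BooleanQuery storeAndDataOrVariable() {
        // Inner group: both terms are required.
        BooleanQuery both = new BooleanQuery();
        both.add(new TermQuery(new Term("TTL", "store")), BooleanClause.Occur.MUST);
        both.add(new TermQuery(new Term("TTL", "data")), BooleanClause.Occur.MUST);

        // Outer group: either the inner group or the third term may match.
        BooleanQuery either = new BooleanQuery();
        either.add(both, BooleanClause.Occur.SHOULD);
        either.add(new TermQuery(new Term("TTL", "variable")), BooleanClause.Occur.SHOULD);
        return either;
    }
}
```

The terms are written in lower case because TermQuery bypasses the analyzer, and StandardAnalyzer lower-cased the text at index time. The query-string form of the same grouping would be (TTL:store AND TTL:data) OR TTL:variable.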
Re: How to add PageRank score with lucene's relevant score in sorting
Hi Jarvis,

> I have a problem that how to "combine" two score to sort the search
> result documents.
> for example I have 10 million pages in lucene index , and i know their
> pagerank scores. i give a query to it , every docs returned have a
> lucene-score, mark it as R (relevant score), and i also have its
> pagerank score, mark it as P, what i need is i want to sort the search
> result base on the value "P+R". You know if i store the pagerank score in
> index and get it every search time , then compute P+R , then sort it , this
> way is too slow. in my system , when the search hits 50 result , the
> sort may cost about 20s.

Check CustomScoreQuery in
http://lucene.apache.org/java/2_3_2/api/core/org/apache/lucene/search/function/package-summary.html

Probably something like this:
- implement ValueSource on top of the pagerank values,
- create a valueSourceQuery on top of it,
- create a customScoreQuery on top of the original query and the valueSourceQuery.

Note that by default, customScoreQuery multiplies the scores, but you can override this.

Doron
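
The three steps above might look roughly like this with the Lucene 2.3.x function package. As a shortcut, the sketch uses FieldScoreQuery (a ready-made ValueSourceQuery that reads a stored field) in place of a hand-written ValueSource. The field name "pagerank" and the class name are hypothetical, and the field is assumed to be indexed untokenized with one float-parseable value per document.

```java
import org.apache.lucene.search.Query;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.FieldScoreQuery;

final class PageRankPlusRelevance {

    /** Wraps textQuery so that each hit scores R + P instead of the default R * P. */
    static Query combine(Query textQuery) {
        // P: the per-document value read from the (assumed) "pagerank" field.
        FieldScoreQuery pageRank =
            new FieldScoreQuery("pagerank", FieldScoreQuery.Type.FLOAT);

        return new CustomScoreQuery(textQuery, pageRank) {
            // subQueryScore is the relevance score R, valSrcScore is P.
            public float customScore(int doc, float subQueryScore, float valSrcScore) {
                return subQueryScore + valSrcScore;
            }
        };
    }
}
```

Searching with the wrapped query returns hits already ordered by the combined score, so no separate sort pass over the results is needed. Since R and P can sit on very different scales, some weighting of the two terms may be worth experimenting with.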