Re: Memory question
Another option to consider is to *decrease* the JVM maximum heap size. This in effect leaves more memory for swapped-in mmapped pages and decreases the GC effort, which might increase system performance and stability.

Regards,
Christoph

On 15.05.2012 21:38, Chris Bamford wrote:
> Thanks Uwe.
>
> What I'd like to understand is the implications of this on a server which opens a large number of indexes over a long period. Will this non-heap memory continue to grow? Will GC be effective at spotting it and releasing it via references in the heap?
>
> I had an instance yesterday where a server swapped itself to a standstill and had to be restarted. The load average was through the roof and I am trying to understand why. One of my recent changes is updating from 2.3 to 3.6, so naturally I am keen to know the impact of the mmap stuff which is now standard under the covers.
>
> My server caches IndexSearchers and then closes them based on how full the heap is getting. My worry is that if the bulk of the memory is being allocated outside the JVM, how can I make sensible decisions?
>
> Thanks for any pointers / info.
>
> Chris
>
> -----Original Message-----
> From: u...@thetaphi.de
> To: java-user@lucene.apache.org
> Sent: Tue, 15 May 2012 18:10
> Subject: RE: Memory question
>
> It mmaps the files into virtual memory if it runs on a 64-bit JVM. Because of that you see the mmapped CFS files. This is outside the Java heap and is all *virtual*; no RAM is explicitly occupied except the O/S cache.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>> -----Original Message-----
>> From: Chris Bamford [mailto:chris.bamf...@talktalk.net]
>> Sent: Tuesday, May 15, 2012 4:47 PM
>> To: java-user@lucene.apache.org
>> Subject: Memory question
>>
>> Hi
>>
>> Can anyone tell me what happens to the memory when Lucene opens an index? Is it loaded into the JVM's heap or is it mapped into virtual memory outside of it?
>> I am running on Linux and if I use pmap on the PID of my JVM, I can see lots of entries for index CFS files. Does this mean that indexes are mapped into non-heap memory? If so, how can I monitor the space my process is using if I cache open IndexSearchers?
>>
>> The details are: Sun 64-bit JVM on Linux. Lucene 3.6 running in 2.3 compatibility mode (as we are in the process of a migration to 3.6)
>>
>> Thanks,
>> - Chris
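As an illustration of Christoph's point (the sizes and jar name below are invented for the example, not taken from the thread): on a box with, say, 8 GB of RAM, capping the heap well below physical memory leaves the remainder to the OS page cache that backs the mmapped index files.

    java -Xms2g -Xmx2g -jar your-search-server.jar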
Re: Memory question
Another good link is http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html, which also includes details on iCMS, which is the Incremental Mode for CMS.

On 5/15/12 6:32 PM, "Lutz Fechner" wrote:
> CMS is the concurrent mark sweep garbage collector. Instead of waiting for the memory to fill up and being collected for memory to be freed up again, it runs concurrently while the app threads are running. Usually the JVM would call a full stop-the-world collection. All threads would be on hold until GC is finished. CMS would prevent lengthy stops from happening in trade for constant micro stops.
>
> This usually is a good option for apps that are sensitive (from a user-experience point of view) to hanging during GC time.
>
> See http://docs.oracle.com/javase/6/docs/technotes/guides/vm/cms-6.html
>
> Best Regards
>
> Lutz
>
> -----Original Message-----
> From: Chris Bamford [mailto:chris.bamf...@talktalk.net]
> Sent: Tuesday, 15 May 2012 16:38
> To: java-user@lucene.apache.org
> Subject: Re: Memory question
>
> Hi John,
>
> Very interesting, thanks for the detailed explanation. It certainly sounds like the same symptoms!
> Can I please clarify a couple of things?
>
> - I have googled CMS / iCMS as I wasn't familiar with those acronyms (apart from 'code management system' and that didn't sound right!) and am I right in thinking that it is some sort of monitoring code pulled into your server via a jar? (I'm confused why it would have its own GC cycle...)
> - So are you suggesting I play with my own JVM's (Sun/Oracle) parameters to achieve a similar effect?
>
> Thanks again,
>
> - Chris
>
> -----Original Message-----
> From: Nader, John P
> To: java-user@lucene.apache.org
> Sent: Tue, 15 May 2012 21:12
> Subject: Re: Memory question
>
> We've encountered this issue and came up with a fairly good approach to address it.
>
> We are on Lucene 3.0.2 with Java 1.6.0_29. Our indices are about 35GB in size. Our JVM runs at 20GB of heap, with about 12GB of steady usage. Our server has 32GB total.
>
> What would happen in our case is that Linux would page in more and more of the memory-mapped index files into memory, forcing idle portions of the JVM heap to be swapped out. This was not an issue until our CMS GC kicked in. This would force swapping in of all JVM memory to collect unused references. I/O wait would shoot up and performance would suffer. Yes, even CMS can kill performance if you are swapping. The tell-tale sign was a spike in inbound swap at the start of CMS.
>
> In our case, we addressed the situation using iCMS, which is Incremental CMS. This takes the mark phase (and sweep too?) and does it continuously with a configurable duty cycle. The result was that swapping was smoothed out to be a small steady drag on the system instead of a hard spike. There was a small loss in performance, but a big gain in stability.
>
> This tuning may be an option for you. BTW, pmap will give you statistics on total file size and how much is resident. The Java heap shows up in pmap as well on Linux, so you can determine how much of that is in memory as well.
>
> John
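For reference, incremental CMS on the Sun/Oracle Java 6 JVM is switched on with startup flags along these lines; this is only a sketch, with the heap size matching the 20 GB John mentions and a placeholder jar name, not the exact options his team used:

    java -Xmx20g -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing -jar your-search-server.jar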
Re: Memory question
Thanks everyone. Looks like I have lots of reading to do :-)

-----Original Message-----
From: Nader, John P
To: java-user@lucene.apache.org
Sent: Wed, 16 May 2012 16:27
Subject: Re: Memory question

Another good link is http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html, which also includes details on iCMS, which is the Incremental Mode for CMS.
Search Ranking
Hi,

I am quite new to Lucene. I am trying to use it to index listings of local businesses. The index has only one field, which stores the attributes of a listing as well as email addresses of users who have rated that business.

For example,

Listing 1: "XYZ Takeaway London f...@company.com bar...@company.com f...@company.com"
Listing 2: "ABC Takeaway London f...@company.com bar...@company.com"

Now when the user does a search with "Takeaway f...@company.com", how do I get listing 1 to always come before listing 2, because it has the term f...@company.com appear twice whereas listing 2 has it only once?

Regards
Meeraj
Re: Search Ranking
Use the explain function to understand why the query is producing the results you see.

http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Searcher.html#explain(org.apache.lucene.search.Query, int)

Does your current query return Listing 2 first? That might be because of term frequencies. Which analyzers are you using?

http://www.lucidimagination.com/content/scaling-lucene-and-solr#d0e63

Cheers,

Ivan

On Wed, May 16, 2012 at 12:41 PM, Meeraj Kunnumpurath wrote:
> Hi,
>
> I am quite new to Lucene. I am trying to use it to index listings of local businesses. The index has only one field, which stores the attributes of a listing as well as email addresses of users who have rated that business.
>
> For example,
>
> Listing 1: "XYZ Takeaway London f...@company.com bar...@company.com f...@company.com"
> Listing 2: "ABC Takeaway London f...@company.com bar...@company.com"
>
> Now when the user does a search with "Takeaway f...@company.com", how do I get listing 1 to always come before listing 2, because it has the term f...@company.com appear twice whereas listing 2 has it only once?
>
> Regards
> Meeraj
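As a concrete illustration of the explain() call Ivan points to, here is a sketch against the Lucene 3.x API; the searcher, query and hits variables are assumed to come from code like the test program later in this thread:

import java.io.IOException;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

public final class ExplainHits {
    // Prints the scoring breakdown (tf, idf, queryNorm, fieldNorm) for each hit.
    // Assumes an already-open IndexSearcher, the parsed Query, and the
    // ScoreDoc[] obtained from collector.topDocs().scoreDocs.
    public static void print(IndexSearcher searcher, Query query, ScoreDoc[] hits) throws IOException {
        for (ScoreDoc hit : hits) {
            Explanation explanation = searcher.explain(query, hit.doc);
            System.out.println("doc " + hit.doc + ": " + explanation.toString());
        }
    }
}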
Re: Search Ranking
Thanks Ivan.

I don't use Lucene directly, it is used behind the scenes by the Neo4J graph database for full-text indexing. According to their documentation, full-text indexes use a whitespace tokenizer in the analyser. Yes, I do get Listing 2 first now. Though if I exclude the term "Takeaway" from the search string, and just put "f...@company.com", I get Listing 1 first.

Regards
Meeraj

On Wed, May 16, 2012 at 8:49 PM, Ivan Brusic wrote:
> Use the explain function to understand why the query is producing the results you see.
Re: Search Ranking
I have tried the same using Lucene directly with the following code,

import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.ScoreDoc;

public class LuceneTest {

    public static void main(String[] args) throws Exception {

        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
        RAMDirectory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
        IndexWriter indexWriter = new IndexWriter(index, config);

        Document doc1 = new Document();
        doc1.add(new Field("searchText", "ABC Takeaway f...@company.com f...@company.com", Field.Store.YES, Field.Index.ANALYZED));
        Document doc2 = new Document();
        doc2.add(new Field("searchText", "XYZ Takeaway f...@company.com", Field.Store.YES, Field.Index.ANALYZED));

        indexWriter.addDocument(doc1);
        indexWriter.addDocument(doc2);
        indexWriter.close();

        Query q = new QueryParser(Version.LUCENE_35, "searchText", analyzer).parse("Takeaway");

        int hitsPerPage = 10;
        IndexReader reader = IndexReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; i++) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("searchText"));
        }
    }
}

The output is ..

Found 2 hits.
1. XYZ Takeaway f...@company.com
2. ABC Takeaway f...@company.com f...@company.com

On Wed, May 16, 2012 at 9:21 PM, Meeraj Kunnumpurath <meeraj.kunnumpur...@asyska.com> wrote:
> Thanks Ivan.
>
> I don't use Lucene directly, it is used behind the scenes by the Neo4J graph database for full-text indexing. According to their documentation, full-text indexes use a whitespace tokenizer in the analyser. Yes, I do get Listing 2 first now. Though if I exclude the term "Takeaway" from the search string, and just put "f...@company.com", I get Listing 1 first.
>
> Regards
> Meeraj
Re: Search Ranking
The actual query is

Query q = new QueryParser(Version.LUCENE_35, "searchText", analyzer).parse("Takeaway f...@company.com");

If I use

Query q = new QueryParser(Version.LUCENE_35, "searchText", analyzer).parse("f...@company.com");

I get them in the reverse order.

Regards
Meeraj
Re: Search Ranking
This is the output I get from explaining the plan ..

Found 2 hits.
1. XYZ Takeaway f...@company.com
0.5148823 = (MATCH) sum of:
  0.17162743 = (MATCH) weight(searchText:takeaway in 1), product of:
    0.57735026 = queryWeight(searchText:takeaway), product of:
      0.5945349 = idf(docFreq=2, maxDocs=2)
      0.97109574 = queryNorm
    0.29726744 = (MATCH) fieldWeight(searchText:takeaway in 1), product of:
      1.0 = tf(termFreq(searchText:takeaway)=1)
      0.5945349 = idf(docFreq=2, maxDocs=2)
      0.5 = fieldNorm(field=searchText, doc=1)
  0.34325486 = (MATCH) sum of:
    0.17162743 = (MATCH) weight(searchText:fred in 1), product of:
      0.57735026 = queryWeight(searchText:fred), product of:
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.97109574 = queryNorm
      0.29726744 = (MATCH) fieldWeight(searchText:fred in 1), product of:
        1.0 = tf(termFreq(searchText:fred)=1)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5 = fieldNorm(field=searchText, doc=1)
    0.17162743 = (MATCH) weight(searchText:company.com in 1), product of:
      0.57735026 = queryWeight(searchText:company.com), product of:
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.97109574 = queryNorm
      0.29726744 = (MATCH) fieldWeight(searchText:company.com in 1), product of:
        1.0 = tf(termFreq(searchText:company.com)=1)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5 = fieldNorm(field=searchText, doc=1)

2. ABC Takeaway f...@company.com f...@company.com
0.49279732 = (MATCH) sum of:
  0.12872057 = (MATCH) weight(searchText:takeaway in 0), product of:
    0.57735026 = queryWeight(searchText:takeaway), product of:
      0.5945349 = idf(docFreq=2, maxDocs=2)
      0.97109574 = queryNorm
    0.22295058 = (MATCH) fieldWeight(searchText:takeaway in 0), product of:
      1.0 = tf(termFreq(searchText:takeaway)=1)
      0.5945349 = idf(docFreq=2, maxDocs=2)
      0.375 = fieldNorm(field=searchText, doc=0)
  0.36407676 = (MATCH) sum of:
    0.18203838 = (MATCH) weight(searchText:fred in 0), product of:
      0.57735026 = queryWeight(searchText:fred), product of:
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.97109574 = queryNorm
      0.31529972 = (MATCH) fieldWeight(searchText:fred in 0), product of:
        1.4142135 = tf(termFreq(searchText:fred)=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.375 = fieldNorm(field=searchText, doc=0)
    0.18203838 = (MATCH) weight(searchText:company.com in 0), product of:
      0.57735026 = queryWeight(searchText:company.com), product of:
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.97109574 = queryNorm
      0.31529972 = (MATCH) fieldWeight(searchText:company.com in 0), product of:
        1.4142135 = tf(termFreq(searchText:company.com)=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.375 = fieldNorm(field=searchText, doc=0)

On Wed, May 16, 2012 at 9:50 PM, Meeraj Kunnumpurath <meeraj.kunnumpur...@asyska.com> wrote:
> The actual query is
>
> Query q = new QueryParser(Version.LUCENE_35, "searchText", analyzer).parse("Takeaway f...@company.com");
>
> If I use
>
> Query q = new QueryParser(Version.LUCENE_35, "searchText", analyzer).parse("f...@company.com");
>
> I get them in the reverse order.
>
> Regards
> Meeraj
Re: Search Ranking
Also, if I do the below

Query q = new QueryParser(Version.LUCENE_35, "searchText", analyzer).parse("Takeaway f...@company.com^100")

I get them in reverse order. Do I need to boost the term, even if it appears more than once in the document?

Regards
Meeraj
Optional Terms
Hi,

I have the following documents

Document doc1 = new Document();
doc1.add(new Field("searchText", "ABC Takeaway f...@company.com f...@company.com", Field.Store.YES, Field.Index.ANALYZED));
Document doc2 = new Document();
doc2.add(new Field("searchText", "XYZ Takeaway f...@company.com", Field.Store.YES, Field.Index.ANALYZED));
Document doc3 = new Document();
doc2.add(new Field("searchText", "LMN Takeaway", Field.Store.YES, Field.Index.ANALYZED));

My query is

Query q = new QueryParser(Version.LUCENE_35, "searchText", analyzer).parse("+Takeaway f...@company.com^100");

This returns only doc1 and doc2. How do I need to modify the query, so that the first term (Takeaway) is mandatory and the second one (f...@company.com) is optional? Also, I would like to boost those documents based on the number of occurrences of the second term.

Regards
Meeraj
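For what it's worth, one way to express "first term required, second term optional, with repeats raising the score" is to build the BooleanQuery programmatically rather than through the QueryParser. This is only a sketch against the Lucene 3.x API; the field name matches the example above, but the term "fred" is a stand-in for however the analyzer actually tokenizes the e-mail address:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public final class MandatoryPlusOptional {
    // MUST: the document has to contain "takeaway".
    // SHOULD: the e-mail term is optional, but when present it contributes
    // to the score, and a higher term frequency yields a higher score.
    public static Query build() {
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("searchText", "takeaway")), BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("searchText", "fred")), BooleanClause.Occur.SHOULD);
        return query;
    }
}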
Approaches/semantics for arbitrarily combining boolean and proximity search operators?
I'm working on a product for librarians and similar people, who apparently expect to be able to combine classic boolean operators (i.e. AND, OR, NOT) with proximity operators (especially w/n and pre/n -- which basically map to unordered and ordered SpanQueries with slop n, respectively) in unrestricted fashion. For example, users appear to believe that not only are relatively easy-to-grasp queries like these legitimate:

    medical w/5 agreement
    (medical w/5 agreement) and (doctor w/10 rights)

but also crazier ones, perhaps like

    agreement w/5 (medical and companion)
    (dog or dragon) w/5 (cat and cow)
    (daisy and (dog or dragon)) w/25 (cat not cow)

What I've noticed is that it's not always obvious how to interpret such queries; it's not always obvious what the user had in mind, nor how you might construct a Lucene query to carry out the user's intent. Therefore, to broaden my understanding of the problem, I'd like to learn more about schemes people may have used/proposed to assign meanings to arbitrary combinations/nestings of, say, the AND, OR, NOT, W, and PRE operators. Ultimately I'm interested in parsing queries into Lucene Query objects (perhaps SpanQuery objects). But I'm also interested in a more abstract/mathematical discussion. (After all, I'm open to the possibility that the best implementation requires new Query classes to be written.)

So far I've found two main sources of inspiration, but I'm not 100% satisfied with either:

1. Mark Miller's no-longer-maintained qsol query parser (https://github.com/markrmiller/qsol)

The premise of qsol seems to be that you can simplify any arbitrary combination of boolean and proximity operators into an equivalent expression where boolean operators can contain proximity operators, but where no boolean operator appears nested inside a proximity operator. (With the exception of OR, which maps nicely to a SpanOrQuery...) This is a convenient transformation, because then you can readily express the query in terms of existing SpanQuery and BooleanQuery classes. The main principle that qsol uses for these transformations might be sloppily summarized as, "You can distribute any boolean operator over any proximity operator". Thus, for example, input query

    agreement w/5 (medical and companion)

gets morphed into something along the lines of

    (agreement w/5 medical) and (medical w/5 companion)

In closer-to-Lucene syntax, that's

    +(spanNear(slop=5, inOrder=false, agreement, medical))
    +(spanNear(slop=5, inOrder=false, medical, companion))

I like how qsol is able to provide at least *some* Lucene-executable Query for every input query string, and I like how it seems to take a single principle of distributing booleans over proximities and see it through pretty systematically. Unfortunately, the Query objects returned by qsol don't always align perfectly with what I imagine user intent to be. For example, when my users try queries like

    agreement w/5 (medical and companion)

I believe they are seeking spans where a single occurrence of "agreement" is near both "medical" and "companion". qsol will find documents like that, but it will also return what I consider to be spurious matches, namely docs where there's an "agreement" near "medical" and an "agreement" near "companion", but no "agreement" that's near both.
Also non-intuitive is that qsol generates rather different Query objects for these two queries:

    (dog or dragon) w/5 (cat and cow)
    (cat and cow) w/5 (dog or dragon)

The former maps to something like

    ((dog w/5 cat) AND (dog w/5 cow)) OR ((dragon w/5 cat) AND (dragon w/5 cow))

while the latter maps to something like

    ((cat w/5 dog) OR (cat w/5 dragon)) AND ((cow w/5 dog) OR (cow w/5 dragon))

I'm not sure there's an obviously better thing for qsol to do with these, but it would be nice if w/5 were a symmetric operation. Note: The above does not reflect qsol's actual syntax, only its semantics.

2. Minimum interval semantics

This approach is reflected in the Lucene "positions branch" (PositionIntervalIterator stuff). It's also nicely described in the paper "An Algebra for Structured Text Search and A Framework for its Implementation" (Clarke, Cormack, Burkowski). I don't know of a Lucene query parser built around this stuff yet, but it seems possible to construct one. The basic idea is that each subquery of a query string corresponds to a set of matching spans/intervals/extents, and operators (whether "boolean" or "proximity") are a means of filtering and combining spans/intervals/extents. For example:

* Query [ medical ] would correspond to every span where the term "medical" appears
* Query [ medical and companion ] would correspond to all the minimum-match spans containing both "medical" and "companion"
* Query [ agreement w/5 (medical and companion) ] would correspond to all the minimum-match spans where "agreement" was within 5 words of a minimum-match span for [medical and companion]

I like how minimum inte
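For readers following along, the w/n and pre/n mapping mentioned at the top of this post looks roughly like this with the stock Lucene 3.x span queries; this is only a sketch, and the field name "text" is invented for the example:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public final class ProximityExamples {
    // "medical w/5 agreement": within 5 positions of each other, either order
    public static SpanQuery medicalWithin5OfAgreement() {
        SpanQuery medical = new SpanTermQuery(new Term("text", "medical"));
        SpanQuery agreement = new SpanTermQuery(new Term("text", "agreement"));
        return new SpanNearQuery(new SpanQuery[] { medical, agreement }, 5, false);
    }

    // "medical pre/5 agreement": same slop, but "medical" must come first
    public static SpanQuery medicalPre5Agreement() {
        SpanQuery medical = new SpanTermQuery(new Term("text", "medical"));
        SpanQuery agreement = new SpanTermQuery(new Term("text", "agreement"));
        return new SpanNearQuery(new SpanQuery[] { medical, agreement }, 5, true);
    }
}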
Re: Approaches/semantics for arbitrarily combining boolean and proximity search operators?
> medical w/5 agreement
> (medical w/5 agreement) and (doctor w/10 rights)
>
> but also crazier ones, perhaps like
>
> agreement w/5 (medical and companion)
> (dog or dragon) w/5 (cat and cow)
> (daisy and (dog or dragon)) w/25 (cat not cow)

This syntax reminds me of Surround.

http://wiki.apache.org/solr/SurroundQueryParser
http://www.lucidimagination.com/blog/2009/02/22/exploring-query-parsers/
Re: Approaches/semantics for arbitrarily combining boolean and proximity search operators?
On Thu, May 17, 2012 at 7:11 AM, Chris Harris wrote:
> but also crazier ones, perhaps like
>
> agreement w/5 (medical and companion)
> (dog or dragon) w/5 (cat and cow)
> (daisy and (dog or dragon)) w/25 (cat not cow)
[skip]

Everything in your post matches our experience. We ended up writing something which transforms the query as well but had to give up on certain crazy things people tried, such as this form:

    (A and B) w/5 (C and D)

For this one:

    A w/5 (B and C)

We found the user expected the same A to be within 5 terms of both a B and a C, and rewrote it to match that but also match more than they asked for. So far, there have been no complaints about the overmatches (it's documented.)

There is probably an extremely accurate way to rewrite it, but it couldn't be figured out at the time. Maybe start with spans for A and then remove spans not-near a B and spans not-near a C, which would leave you with only spans near an A. The problem is that if you expand the query to something like this, it gets quite a bit more complex, so a user query which is already complex could turn into a really hard-to-understand mess...

TX
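A sketch of the rewrite described above -- distributing the AND over the proximity operator so that A w/5 (B and C) becomes (A w/5 B) AND (A w/5 C) -- using the Lucene 3.x span classes. The field name is invented and the terms are taken from Chris's example; this is an illustration of the technique, not the code TX's product actually uses:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public final class DistributedProximityRewrite {
    // agreement w/5 (medical and companion), rewritten as
    // (agreement w/5 medical) AND (agreement w/5 companion).
    // Known overmatch: the two clauses may be satisfied by two *different*
    // occurrences of "agreement", which is looser than the user intended.
    public static Query build() {
        SpanQuery a = new SpanTermQuery(new Term("text", "agreement"));
        SpanQuery b = new SpanTermQuery(new Term("text", "medical"));
        SpanQuery c = new SpanTermQuery(new Term("text", "companion"));

        SpanQuery aNearB = new SpanNearQuery(new SpanQuery[] { a, b }, 5, false);
        SpanQuery aNearC = new SpanNearQuery(new SpanQuery[] { a, c }, 5, false);

        BooleanQuery rewritten = new BooleanQuery();
        rewritten.add(aNearB, BooleanClause.Occur.MUST);
        rewritten.add(aNearC, BooleanClause.Occur.MUST);
        return rewritten;
    }
}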
Re: Approaches/semantics for arbitrarily combining boolean and proximity search operators?
It sounds to me as if there could be a market for a new kind of query that would implement:

    A w/5 (B and C)

in the way that people understand it to mean - the same A near both B and C, not just any A. Maybe it's too hard to implement using rewrites into existing SpanQueries?

In terms of the PositionIterator work - instead of A being within 5 in a "minimum" distance sense, what we want is that its "maximum" distance to all the terms in the other query (B and C) is 5. I'm not sure if any query in that branch covers this case though, either, but if I recall, there was a way to implement extensions to it that were fairly natural.

-Mike