Re: Memory question

2012-05-16 Thread Christoph Kaser
Another option to consider is to *decrease* the JVM maximum heap size. 
This in effect leaves more memory for paged-in memory-mapped file pages and 
reduces the GC effort, which might improve system performance and 
stability.


Regards,
Christoph

On 15.05.2012 21:38, Chris Bamford wrote:

Thanks Uwe.

What I'd like to understand are the implications of this for a server which opens 
a large number of indexes over a long period. Will this non-heap memory 
continue to grow? Will GC be effective at spotting it and releasing it via 
references in the heap?

I had an instance yesterday where a server swapped itself to a standstill and 
had to be restarted. The load average was through the roof and I am trying to 
understand why. One of my recent changes is updating from Lucene 2.3 to 3.6, so 
naturally I am keen to know the impact of the mmap stuff which is now standard 
under the covers.

My server caches IndexSearchers and then closes them based on how full the heap 
is getting. My worry is that if the bulk of the memory is being allocated 
outside the JVM, how can I make sensible decisions?

Thanks for any pointers / info.

Chris



-Original Message-
From: u...@thetaphi.de
To: java-user@lucene.apache.org
Sent: Tue, 15 May 2012 18:10
Subject: RE: Memory question



It mmaps the files into virtual memory if it runs on a 64 bit JVM. Because
of that you see the mmapped CFS files. This is outside the Java heap and is all
*virtual*; no RAM is explicitly occupied except the O/S cache.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
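
A minimal sketch of how the mapped memory Uwe describes could be observed from inside the JVM, assuming Java 7 or later (the BufferPoolMXBean API does not exist on the Java 6 JVMs discussed in this thread). It maps a scratch file with FileChannel.map and then reads the JVM's "mapped" buffer pool. The class name MappedMemoryProbe and the scratch file are illustrative only.

```java
import java.io.IOException;
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedMemoryProbe {

    /** Bytes currently mapped by this JVM, as reported by the "mapped" buffer pool. */
    static long mappedBytes() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("mapped".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L; // pool not exposed by this JVM
    }

    /** Creates a scratch file of the given size and maps it read-only. */
    static MappedByteBuffer mapScratchFile(int sizeBytes) throws IOException {
        Path file = Files.createTempFile("mmap-probe", ".bin");
        file.toFile().deleteOnExit();
        Files.write(file, new byte[sizeBytes]);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, sizeBytes);
        }
    }

    public static void main(String[] args) throws IOException {
        long before = mappedBytes();
        MappedByteBuffer buf = mapScratchFile(1 << 20);
        long after = mappedBytes();
        long heapUsed = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
        System.out.println("mapped before: " + before + " bytes, after: " + after + " bytes");
        System.out.println("heap used: " + heapUsed + " bytes, buffer capacity: " + buf.capacity());
    }
}
```

Note that the pool reports the size of the mappings, not how much of them is resident in RAM; for resident size an OS-level tool such as pmap is still needed.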


-Original Message-
From: Chris Bamford [mailto:chris.bamf...@talktalk.net]
Sent: Tuesday, May 15, 2012 4:47 PM
To: java-user@lucene.apache.org
Subject: Memory question

Hi

Can anyone tell me what happens to the memory when Lucene opens an index?
Is it loaded into the JVM's heap or is it mapped into virtual memory
outside of it?
I am running on Linux and if I use pmap on the PID of my JVM, I can see
lots of entries for index cfs files.

Does this mean that indexes are mapped into non-heap memory?  If so, how
can I monitor the space my process is using if I cache open
IndexSearchers?

The details are:

Sun 64-bit JVM on Linux.
Lucene 3.6 running in 2.3 compatibility mode (as we are in the
process of a migration to 3.6)

Thanks,

- Chris

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org








Re: Memory question

2012-05-16 Thread Nader, John P
Another good link is
http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html,
which also includes details on iCMS, which is the Incremental Mode for CMS.

On 5/15/12 6:32 PM, "Lutz Fechner"  wrote:

>CMS is the concurrent mark sweep garbage collector. Instead of waiting
>for the heap to fill up before collecting and freeing memory, it runs
>concurrently while the app threads are running. Usually the JVM would
>trigger a full stop-the-world collection, with all threads on hold
>until GC is finished. CMS avoids those lengthy stops in exchange for
>frequent very short pauses.
>
>This is usually a good option for apps whose user experience suffers
>from hanging during GC.
>
>See http://docs.oracle.com/javase/6/docs/technotes/guides/vm/cms-6.html
>
>Best Regards
>
>Lutz
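
To check which collectors a JVM is actually running after its flags have been changed, the standard GarbageCollectorMXBean interface can be queried. The sketch below is illustrative: the class name GcReport is made up, and the collector names it prints depend on the JVM and its flags. On the Java 6 HotSpot JVMs discussed here, CMS was enabled with -XX:+UseConcMarkSweepGC and its incremental mode with -XX:+CMSIncrementalMode; later JDK releases removed CMS.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class GcReport {

    /** One line per collector: name, number of collections, accumulated time in ms. */
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeCollectors()) {
            System.out.println(line);
        }
    }
}
```

Logging these counters periodically gives a rough picture of how often each collector runs and how much time it takes.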
>
>-Original Message-
>From: Chris Bamford [mailto:chris.bamf...@talktalk.net]
>Sent: Dienstag, 15. Mai 2012 16:38
>To: java-user@lucene.apache.org
>Subject: Re: Memory question
>
>
> Hi John,
>
>Very interesting, thanks for the detailed explanation.   It certainly
>sounds like the same symptoms!
>Can I please clarify a couple of things ?
>
>- I have googled CMS / iCMS as I wasn't familiar with those acronyms
>  (apart from 'code management system' and that didn't sound right!)
>  and am I right in thinking that it is some sort of monitoring code
>  pulled into your server via a jar? (I'm confused why it would have its
>  own GC cycle...)
>- So are you suggesting I play with my own JVM's (Sun/Oracle) parameters
>  to achieve a similar effect?
>
>Thanks again,
>
>- Chris
>
>
> 
>
> 
>
>-Original Message-
>From: Nader, John P 
>To: java-user@lucene.apache.org 
>Sent: Tue, 15 May 2012 21:12
>Subject: Re: Memory question
>
>
>We've encountered this issue and came up with a fairly good approach to
>address it.
>
>We are on Lucene 3.0.2 with Java 1.6.0_29.  Our indices are about 35GB in
>size.  Our JVM runs at 20GB of heap, with about 12GB of steady usage.  Our
>server has 32GB total.
>
>What would happen in our case is that Linux would page in more and
>more of the memory-mapped index files, forcing idle portions
>of the JVM heap to be swapped out.  This was not an issue until our CMS GC
>kicked in.  This would force swapping in of all JVM memory to collect
>unused references.  I/O wait would shoot up and performance would suffer.
>Yes, even CMS can kill performance if you are swapping.  The tell-tale
>sign was a spike in inbound swap at the start of CMS.
>
>In our case, we addressed the situation using iCMS, which is Incremental
>CMS.  This takes the mark phase (and sweep too?) and does it continuously
>with a configurable duty cycle.  The result was that swapping was smoothed
>out to be a small steady drag on the system instead of a hard spike.  There
>was a small loss in performance, but a big gain in stability.
>
>This tuning may be an option for you.  BTW, pmap will give you statistics
>on total file size and how much is resident.  The Java heap shows up in
>pmap as well on Linux, so you can determine how much of that is in memory
>as well.
>
>John
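
The pmap check John describes can be scripted. The sketch below sums the mapped and resident kilobytes of .cfs mappings from the text output of `pmap -x <pid>`, assuming the usual procps column layout (Address, Kbytes, RSS, Dirty, Mode, Mapping). The class name, the .cfs filter, and the sample lines are hypothetical and not taken from the thread.

```java
import java.util.Arrays;
import java.util.List;

final class PmapCfsSummary {

    /** Returns {totalKb, residentKb} summed over mappings whose name ends with the suffix. */
    static long[] sumMappings(List<String> pmapLines, String suffix) {
        long totalKb = 0;
        long residentKb = 0;
        for (String line : pmapLines) {
            String[] cols = line.trim().split("\\s+");
            // Expect at least: Address Kbytes RSS Dirty Mode Mapping
            if (cols.length < 6 || !cols[cols.length - 1].endsWith(suffix)) {
                continue;
            }
            try {
                long kb = Long.parseLong(cols[1]);
                long rss = Long.parseLong(cols[2]);
                totalKb += kb;
                residentKb += rss;
            } catch (NumberFormatException e) {
                // Non-numeric row (for example a header): ignore it.
            }
        }
        return new long[] {totalKb, residentKb};
    }

    public static void main(String[] args) {
        // Hypothetical sample in `pmap -x` layout.
        List<String> sample = Arrays.asList(
                "Address           Kbytes     RSS   Dirty Mode  Mapping",
                "00007f3a10000000  204800   51200       0 r--s- _a.cfs",
                "00007f3a20000000  102400   10240       0 r--s- _b.cfs",
                "00007f3a30000000   65536   65536   65536 rw---   [ anon ]",
                "total kB          372736  126976   65536");
        long[] sums = sumMappings(sample, ".cfs");
        System.out.println("cfs mapped KB: " + sums[0] + ", resident KB: " + sums[1]);
    }
}
```

Comparing the resident figure for the index files with the resident figure for the heap mapping over time would show whether index pages are crowding the heap out of RAM.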
>
>
>

Re: Memory question

2012-05-16 Thread Chris Bamford

 Thanks everyone. Looks like I have lots of reading to do :-)

 

 


Search Ranking

2012-05-16 Thread Meeraj Kunnumpurath
Hi,

I am quite new to Lucene. I am trying to use it to index listings of local
businesses. The index has only one field, that stores the attributes of a
listing as well as email addresses of users who have rated that business.

For example,

Listing 1: "XYZ Takeaway London f...@company.com bar...@company.com
f...@company.com"
Listing 2: "ABC Takeaway London f...@company.com bar...@company.com"

Now when the user does a search with "Takeaway f...@company.com", how do I
get listing 1 to always come before listing 2, because the term
f...@company.com appears twice in it whereas listing 2 has it only once?

Regards
Meeraj


Re: Search Ranking

2012-05-16 Thread Ivan Brusic
Use the explain function to understand why the query is producing the
results you see.

http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Searcher.html#explain(org.apache.lucene.search.Query, int)

Does your current query return Listing 2 first? That might be because
of term frequencies. Which analyzers are you using?

http://www.lucidimagination.com/content/scaling-lucene-and-solr#d0e63

Cheers,

Ivan




Re: Search Ranking

2012-05-16 Thread Meeraj Kunnumpurath
Thanks Ivan.

I don't use Lucene directly; it is used behind the scenes by the Neo4J graph
database for full-text indexing. According to their documentation, full-text
indexes use a whitespace tokenizer in the analyzer. Yes, I do get
Listing 2 first now. Though if I exclude the term "Takeaway" from the
search string, and just put "f...@company.com", I get Listing 1 first.

Regards
Meeraj



Re: Search Ranking

2012-05-16 Thread Meeraj Kunnumpurath
I have tried the same using Lucene directly with the following code,

import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.ScoreDoc;

public class LuceneTest {

    public static void main(String[] args) throws Exception {

        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
        RAMDirectory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
        IndexWriter indexWriter = new IndexWriter(index, config);

        // doc1 contains the email address twice, doc2 only once
        Document doc1 = new Document();
        doc1.add(new Field("searchText", "ABC Takeaway f...@company.com f...@company.com",
                Field.Store.YES, Field.Index.ANALYZED));
        Document doc2 = new Document();
        doc2.add(new Field("searchText", "XYZ Takeaway f...@company.com",
                Field.Store.YES, Field.Index.ANALYZED));

        indexWriter.addDocument(doc1);
        indexWriter.addDocument(doc2);
        indexWriter.close();

        Query q = new QueryParser(Version.LUCENE_35, "searchText",
                analyzer).parse("Takeaway");

        // Collect the top hits ordered by score
        int hitsPerPage = 10;
        IndexReader reader = IndexReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector =
                TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("searchText"));
        }

    }

}

The output is ..

Found 2 hits.
1. XYZ Takeaway f...@company.com
2. ABC Takeaway f...@company.com f...@company.com



Re: Search Ranking

2012-05-16 Thread Meeraj Kunnumpurath
The actual query is

Query q = new QueryParser(Version.LUCENE_35, "searchText",
analyzer).parse("Takeaway f...@company.com");

If I use

Query q = new QueryParser(Version.LUCENE_35, "searchText",
analyzer).parse("f...@company.com");

I get them in the reverse order.

Regards
Meeraj

On Wed, May 16, 2012 at 9:48 PM, Meeraj Kunnumpurath <
meeraj.kunnumpur...@asyska.com> wrote:

> I have tried the same using Lucene directly with the following code,
>
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexWriterConfig;
> import org.apache.lucene.util.Version;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.TopScoreDocCollector;
> import org.apache.lucene.search.ScoreDoc;
>
> public class LuceneTest {
>
> public static void main(String[] args) throws Exception {
>
> StandardAnalyzer analyzer = new
> StandardAnalyzer(Version.LUCENE_35);
> RAMDirectory index = new RAMDirectory();
> IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35,
> analyzer);
> IndexWriter indexWriter = new IndexWriter(index, config);
>
> Document doc1 = new Document();
> doc1.add(new Field("searchText", "ABC Takeaway f...@company.com
> f...@company.com", Field.Store.YES, Field.Index.ANALYZED));
> Document doc2 = new Document();
> doc2.add(new Field("searchText", "XYZ Takeaway f...@company.com",
> Field.Store.YES, Field.Index.ANALYZED));
>
> indexWriter.addDocument(doc1);
> indexWriter.addDocument(doc2);
> indexWriter.close();
>
> Query q = new QueryParser(Version.LUCENE_35, "searchText",
> analyzer).parse("Takeaway");
>
> int hitsPerPage = 10;
> IndexReader reader = IndexReader.open(index);
> IndexSearcher searcher = new IndexSearcher(reader);
> TopScoreDocCollector collector =
> TopScoreDocCollector.create(hitsPerPage, true);
> searcher.search(q, collector);
> ScoreDoc[] hits = collector.topDocs().scoreDocs;
>
> System.out.println("Found " + hits.length + " hits.");
> for(int i=0;i<hits.length;++i) {
> int docId = hits[i].doc;
> Document d = searcher.doc(docId);
> System.out.println((i + 1) + ". " + d.get("searchText"));
> }
>
> }
>
> }
>
> The output is ..
>
> Found 2 hits.
> 1. XYZ Takeaway f...@company.com
> 2. ABC Takeaway f...@company.com f...@company.com
>
>


Re: Search Ranking

2012-05-16 Thread Meeraj Kunnumpurath
This is the output I get from explaining the plan ..

Found 2 hits.
1. XYZ Takeaway f...@company.com
0.5148823 = (MATCH) sum of:
  0.17162743 = (MATCH) weight(searchText:takeaway in 1), product of:
    0.57735026 = queryWeight(searchText:takeaway), product of:
      0.5945349 = idf(docFreq=2, maxDocs=2)
      0.97109574 = queryNorm
    0.29726744 = (MATCH) fieldWeight(searchText:takeaway in 1), product of:
      1.0 = tf(termFreq(searchText:takeaway)=1)
      0.5945349 = idf(docFreq=2, maxDocs=2)
      0.5 = fieldNorm(field=searchText, doc=1)
  0.34325486 = (MATCH) sum of:
    0.17162743 = (MATCH) weight(searchText:fred in 1), product of:
      0.57735026 = queryWeight(searchText:fred), product of:
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.97109574 = queryNorm
      0.29726744 = (MATCH) fieldWeight(searchText:fred in 1), product of:
        1.0 = tf(termFreq(searchText:fred)=1)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5 = fieldNorm(field=searchText, doc=1)
    0.17162743 = (MATCH) weight(searchText:company.com in 1), product of:
      0.57735026 = queryWeight(searchText:company.com), product of:
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.97109574 = queryNorm
      0.29726744 = (MATCH) fieldWeight(searchText:company.com in 1), product of:
        1.0 = tf(termFreq(searchText:company.com)=1)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5 = fieldNorm(field=searchText, doc=1)

2. ABC Takeaway f...@company.com f...@company.com
0.49279732 = (MATCH) sum of:
  0.12872057 = (MATCH) weight(searchText:takeaway in 0), product of:
    0.57735026 = queryWeight(searchText:takeaway), product of:
      0.5945349 = idf(docFreq=2, maxDocs=2)
      0.97109574 = queryNorm
    0.22295058 = (MATCH) fieldWeight(searchText:takeaway in 0), product of:
      1.0 = tf(termFreq(searchText:takeaway)=1)
      0.5945349 = idf(docFreq=2, maxDocs=2)
      0.375 = fieldNorm(field=searchText, doc=0)
  0.36407676 = (MATCH) sum of:
    0.18203838 = (MATCH) weight(searchText:fred in 0), product of:
      0.57735026 = queryWeight(searchText:fred), product of:
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.97109574 = queryNorm
      0.31529972 = (MATCH) fieldWeight(searchText:fred in 0), product of:
        1.4142135 = tf(termFreq(searchText:fred)=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.375 = fieldNorm(field=searchText, doc=0)
    0.18203838 = (MATCH) weight(searchText:company.com in 0), product of:
      0.57735026 = queryWeight(searchText:company.com), product of:
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.97109574 = queryNorm
      0.31529972 = (MATCH) fieldWeight(searchText:company.com in 0), product of:
        1.4142135 = tf(termFreq(searchText:company.com)=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.375 = fieldNorm(field=searchText, doc=0)
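
The explain output can be reproduced by hand, which shows where the ordering comes from. The sketch below assumes the classic TF-IDF formulas of Lucene 3.x's default similarity (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), queryNorm = 1 / sqrt(sum of squared query term weights)) and takes the fieldNorm values 0.5 and 0.375 directly from the output above. The class and method names are made up for illustration.

```java
final class ExplainArithmetic {

    /** Classic idf: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /**
     * Score of one document for a query whose terms all share the same idf.
     * termFreqs holds the frequency of each query term in the document.
     */
    static double score(int[] termFreqs, double fieldNorm, double idf) {
        // queryNorm = 1 / sqrt(sum of idf^2 over the query terms)
        double queryNorm = 1.0 / Math.sqrt(termFreqs.length * idf * idf);
        double queryWeight = idf * queryNorm;
        double sum = 0.0;
        for (int freq : termFreqs) {
            double fieldWeight = Math.sqrt(freq) * idf * fieldNorm;
            sum += queryWeight * fieldWeight;
        }
        return sum;
    }

    public static void main(String[] args) {
        double idf = idf(2, 2); // every query term occurs in both documents
        // Term order: takeaway, fred, company.com
        double xyz = score(new int[] {1, 1, 1}, 0.5, idf);
        double abc = score(new int[] {1, 2, 2}, 0.375, idf);
        System.out.println("XYZ: " + xyz);
        System.out.println("ABC: " + abc);
    }
}
```

The doubled terms raise tf only from 1.0 to about 1.414, while the longer field lowers fieldNorm from 0.5 to 0.375 for every term, including takeaway, so the shorter document ends up slightly ahead.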


Re: Search Ranking

2012-05-16 Thread Meeraj Kunnumpurath
Also, if I do the below

Query q = new QueryParser(Version.LUCENE_35, "searchText",
analyzer).parse("Takeaway f...@company.com^100")

I get them in reverse order. Do I need to boost the term, even if it
appears more than once in the document?

Regards
Meeraj
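
On the boost question, a rough check suggests that length normalisation, rather than a missing boost, is what pushes the two-occurrence document down. The sketch below scores both documents once with the fieldNorm values from the explain output (0.5 and 0.375) and once with fieldNorm fixed at 1.0, the value one would expect when norms are omitted for the field (in Lucene 3.x, for example, by indexing with Field.Index.ANALYZED_NO_NORMS). It assumes the classic formulas tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)); the class name is illustrative.

```java
final class NoNormsCheck {

    /** Sum over query terms of queryWeight * sqrt(freq) * idf * fieldNorm. */
    static double score(int[] termFreqs, double fieldNorm) {
        double idf = 1.0 + Math.log(2.0 / 3.0); // docFreq=2, numDocs=2 for every term
        double queryWeight = idf / Math.sqrt(termFreqs.length * idf * idf);
        double sum = 0.0;
        for (int freq : termFreqs) {
            sum += queryWeight * Math.sqrt(freq) * idf * fieldNorm;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] xyzFreqs = {1, 1, 1}; // takeaway, fred, company.com
        int[] abcFreqs = {1, 2, 2};
        System.out.println("with norms:    XYZ=" + score(xyzFreqs, 0.5)
                + " ABC=" + score(abcFreqs, 0.375));
        System.out.println("without norms: XYZ=" + score(xyzFreqs, 1.0)
                + " ABC=" + score(abcFreqs, 1.0));
    }
}
```

With norms the shorter XYZ document scores higher; with fieldNorm at 1.0 the ABC document, which has the address twice, comes out ahead without any boost.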

On Wed, May 16, 2012 at 9:52 PM, Meeraj Kunnumpurath <
meeraj.kunnumpur...@asyska.com> wrote:

> This is the output I get from explaining the plan ..
>
>
> Found 2 hits.
> 1. XYZ Takeaway f...@company.com
> 0.5148823 = (MATCH) sum of:
>   0.17162743 = (MATCH) weight(searchText:takeaway in 1), product of:
> 0.57735026 = queryWeight(searchText:takeaway), product of:
>   0.5945349 = idf(docFreq=2, maxDocs=2)
>   0.97109574 = queryNorm
> 0.29726744 = (MATCH) fieldWeight(searchText:takeaway in 1), product of:
>   1.0 = tf(termFreq(searchText:takeaway)=1)
>   0.5945349 = idf(docFreq=2, maxDocs=2)
>   0.5 = fieldNorm(field=searchText, doc=1)
>   0.34325486 = (MATCH) sum of:
> 0.17162743 = (MATCH) weight(searchText:fred in 1), product of:
>   0.57735026 = queryWeight(searchText:fred), product of:
> 0.5945349 = idf(docFreq=2, maxDocs=2)
> 0.97109574 = queryNorm
>   0.29726744 = (MATCH) fieldWeight(searchText:fred in 1), product of:
> 1.0 = tf(termFreq(searchText:fred)=1)
> 0.5945349 = idf(docFreq=2, maxDocs=2)
> 0.5 = fieldNorm(field=searchText, doc=1)
> 0.17162743 = (MATCH) weight(searchText:company.com in 1), product of:
>   0.57735026 = queryWeight(searchText:company.com), product of:
> 0.5945349 = idf(docFreq=2, maxDocs=2)
> 0.97109574 = queryNorm
>   0.29726744 = (MATCH) fieldWeight(searchText:company.com in 1),
> product of:
> 1.0 = tf(termFreq(searchText:company.com)=1)
> 0.5945349 = idf(docFreq=2, maxDocs=2)
> 0.5 = fieldNorm(field=searchText, doc=1)
>
>
> 2. ABC Takeaway f...@company.com f...@company.com
> 0.49279732 = (MATCH) sum of:
>   0.12872057 = (MATCH) weight(searchText:takeaway in 0), product of:
>     0.57735026 = queryWeight(searchText:takeaway), product of:
>       0.5945349 = idf(docFreq=2, maxDocs=2)
>       0.97109574 = queryNorm
>     0.22295058 = (MATCH) fieldWeight(searchText:takeaway in 0), product of:
>       1.0 = tf(termFreq(searchText:takeaway)=1)
>       0.5945349 = idf(docFreq=2, maxDocs=2)
>       0.375 = fieldNorm(field=searchText, doc=0)
>   0.36407676 = (MATCH) sum of:
>     0.18203838 = (MATCH) weight(searchText:fred in 0), product of:
>       0.57735026 = queryWeight(searchText:fred), product of:
>         0.5945349 = idf(docFreq=2, maxDocs=2)
>         0.97109574 = queryNorm
>       0.31529972 = (MATCH) fieldWeight(searchText:fred in 0), product of:
>         1.4142135 = tf(termFreq(searchText:fred)=2)
>         0.5945349 = idf(docFreq=2, maxDocs=2)
>         0.375 = fieldNorm(field=searchText, doc=0)
>     0.18203838 = (MATCH) weight(searchText:company.com in 0), product of:
>       0.57735026 = queryWeight(searchText:company.com), product of:
>         0.5945349 = idf(docFreq=2, maxDocs=2)
>         0.97109574 = queryNorm
>       0.31529972 = (MATCH) fieldWeight(searchText:company.com in 0), product of:
>         1.4142135 = tf(termFreq(searchText:company.com)=2)
>         0.5945349 = idf(docFreq=2, maxDocs=2)
>         0.375 = fieldNorm(field=searchText, doc=0)
>
>
> On Wed, May 16, 2012 at 9:50 PM, Meeraj Kunnumpurath <
> meeraj.kunnumpur...@asyska.com> wrote:
>
>> The actual query is
>>
>> Query q = new QueryParser(Version.LUCENE_35, "searchText",
>> analyzer).parse("Takeaway f...@company.com");
>>
>> If I use
>>
>> Query q = new QueryParser(Version.LUCENE_35, "searchText",
>> analyzer).parse("f...@company.com");
>>
>> I get them in the reverse order.
>>
>> Regards
>> Meeraj
>>
>>
>> On Wed, May 16, 2012 at 9:48 PM, Meeraj Kunnumpurath <
>> meeraj.kunnumpur...@asyska.com> wrote:
>>
>>> I have tried the same using Lucene directly with the following code,
>>>
>>> import org.apache.lucene.store.RAMDirectory;
>>> import org.apache.lucene.document.Document;
>>> import org.apache.lucene.document.Field;
>>> import org.apache.lucene.index.IndexWriterConfig;
>>> import org.apache.lucene.util.Version;
>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>> import org.apache.lucene.index.IndexWriter;
>>> import org.apache.lucene.queryParser.QueryParser;
>>> import org.apache.lucene.index.IndexReader;
>>> import org.apache.lucene.search.IndexSearcher;
>>> import org.apache.lucene.search.Query;
>>> import org.apache.lucene.search.TopScoreDocCollector;
>>> import org.apache.lucene.search.ScoreDoc;
>>>
>>> public class LuceneTest {
>>>
>>>     public static void main(String[] args) throws Exception {
>>>
>>>         StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
>>>         RAMDirectory index = new RAMDirectory();
>>>         IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
>>>         IndexWriter indexWrite

Optional Terms

2012-05-16 Thread Meeraj Kunnumpurath
Hi,

I have the following documents

Document doc1 = new Document();
doc1.add(new Field("searchText",
    "ABC Takeaway f...@company.com f...@company.com",
    Field.Store.YES, Field.Index.ANALYZED));
Document doc2 = new Document();
doc2.add(new Field("searchText", "XYZ Takeaway f...@company.com",
Field.Store.YES, Field.Index.ANALYZED));
Document doc3 = new Document();
doc2.add(new Field("searchText", "LMN Takeaway", Field.Store.YES,
Field.Index.ANALYZED));

My query is

Query q = new QueryParser(Version.LUCENE_35, "searchText",
analyzer).parse("+Takeaway f...@company.com^100");

This returns only doc1 and doc2. How should I modify the query so that
the first term (Takeaway) is mandatory and the second one (f...@company.com)
is optional? Also, I would like to boost those documents based on the
number of occurrences of the second term.

Regards
Meeraj
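
For reference, in the classic query parser syntax a leading + already marks a clause as required and an unmarked clause as optional, so the query above has the intended shape. Note that the snippet as posted adds the "LMN Takeaway" field to doc2 rather than doc3; if the real code does the same, doc3 would have no searchText field at all, which could explain why it is never returned.

The toy model below is not Lucene code. It only illustrates the required/optional behaviour being asked for: a document without the required term does not match, and each extra occurrence of the optional term raises the score. The whitespace tokenizer, the class name and the scoring formula are all assumptions made for the sketch (a real analyzer would tokenize the address differently).

```java
import java.util.Arrays;
import java.util.List;

// Toy model of one required term plus one optional, boosted term.
// Assumption: whitespace tokens and a sqrt(freq) score; not Lucene's full formula.
class OptionalTermSketch {

    /** Lower-cased whitespace tokens; a stand-in for a real analyzer. */
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }

    static int count(List<String> tokens, String term) {
        int n = 0;
        for (String t : tokens) {
            if (t.equals(term)) n++;
        }
        return n;
    }

    /**
     * Returns 0 when the required term is missing (no match). Otherwise the
     * required term contributes sqrt(freq) and the optional term contributes
     * boost * sqrt(freq), so more optional occurrences rank higher.
     */
    static double score(String doc, String required, String optional, double boost) {
        List<String> toks = tokens(doc);
        int req = count(toks, required.toLowerCase());
        if (req == 0) return 0.0;
        int opt = count(toks, optional.toLowerCase());
        return Math.sqrt(req) + boost * Math.sqrt(opt);
    }

    public static void main(String[] args) {
        String[] docs = {
            "ABC Takeaway f...@company.com f...@company.com",
            "XYZ Takeaway f...@company.com",
            "LMN Takeaway"
        };
        for (String d : docs) {
            System.out.println(score(d, "Takeaway", "f...@company.com", 100.0) + "  " + d);
        }
    }
}
```

In this model all three documents match, ordered by how often the optional address occurs.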


Approaches/semantics for arbitrarily combining boolean and proximity search operators?

2012-05-16 Thread Chris Harris
I'm working on a product for librarians and similar people, who
apparently expect to be able to combine classic boolean operators
(i.e. AND, OR, NOT) with proximity operators (especially w/n and pre/n
-- which basically map to unordered and ordered SpanQueries with slop
n, respectively) in unrestricted fashion. For example, users appear to
believe that not only are relatively easy-to-grasp queries like these
legitimate:

medical w/5 agreement
(medical w/5 agreement) and (doctor w/10 rights)

but also crazier ones, perhaps like

agreement w/5 (medical and companion)
(dog or dragon) w/5 (cat and cow)
(daisy and (dog or dragon)) w/25 (cat not cow)

What I've noticed is that it's not always obvious how to interpret
such queries; it's not always obvious what the user had in mind, nor
how you might construct a Lucene query to carry out the user's intent.

Therefore, to broaden my understanding of the problem, I'd like to
learn more about schemes people may have used/proposed to assign
meanings to arbitrary combinations/nestings of, say, the AND, OR, NOT,
W, and PRE operators. Ultimately I'm interested in parsing queries
into Lucene Query objects (perhaps SpanQuery objects). But I'm also
interested in a more abstract/mathematical discussion. (After all, I'm
open to the possibility that the best implementation requires new
Query classes to be written.)

So far I've found two main sources of inspiration, but I'm not 100%
satisfied with either:

1. Mark Miller's no-longer-maintained qsol query parser
(https://github.com/markrmiller/qsol)

The premise of qsol seems to be that you can simplify any arbitrary
combination of boolean and proximity operators into an equivalent
expression where boolean operators can contain proximity operators,
but where no boolean operator appears nested inside a proximity
operator. (With the exception of OR, which maps nicely to a
SpanOrQuery...) This is a convenient transformation, because then you
can readily express the query in terms of existing SpanQuery and
BooleanQuery classes.

The main principle that qsol uses for these transformations might be
sloppily summarized as, "You can distribute any boolean operator over
any proximity operator".

Thus, for example, input query

agreement w/5 (medical and companion)

gets morphed into something along the lines of

(agreement w/5 medical) and (agreement w/5 companion)

In closer-to-Lucene syntax, that's

+(spanNear(slop=5, inOrder=false, agreement, medical))
+(spanNear(slop=5, inOrder=false, agreement, companion))

I like how qsol is able to provide at least *some* Lucene-executable
Query for every input query string, and I like how it seems to take a
single principle of distributing booleans over proximities and see it
through pretty systematically.

Unfortunately, the Query objects returned by qsol don't always align
perfectly with what I imagine user intent to be.

For example, when my users try queries like

agreement w/5 (medical and companion)

I believe they are seeking spans where a single occurrence of
"agreement" is near both "medical" and "companion". qsol will find
documents like that, but it will also return what I consider
to be spurious matches, namely docs where there's an "agreement" near
"medical" and an "agreement" near "companion", but no "agreement"
that's near both.
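
To make the spurious-match case concrete, here is a small sketch over raw term positions. It is not qsol or Lucene code; the class and method names are hypothetical, and "w/n" is assumed to mean "positions differ by at most n".

```java
// Contrasts the distributed rewrite with the "same occurrence" reading.
// Assumption: w/n means |posA - posB| <= n on token positions.
class SameAnchorCheck {

    /** True if some position in a is within n of some position in b. */
    static boolean near(int[] a, int[] b, int n) {
        for (int x : a) {
            for (int y : b) {
                if (Math.abs(x - y) <= n) return true;
            }
        }
        return false;
    }

    /** Distributed rewrite: (a w/n b) and (a w/n c), each pair checked on its own. */
    static boolean distributed(int[] a, int[] b, int[] c, int n) {
        return near(a, b, n) && near(a, c, n);
    }

    /** Stricter reading: a single occurrence of a is within n of both a b and a c. */
    static boolean sameAnchor(int[] a, int[] b, int[] c, int n) {
        for (int x : a) {
            int[] one = {x};
            if (near(one, b, n) && near(one, c, n)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        int[] agreement = {0, 42};
        int[] medical = {2};
        int[] companion = {40};
        // One "agreement" is near "medical", a different one is near "companion".
        System.out.println("distributed: " + distributed(agreement, medical, companion, 5));
        System.out.println("sameAnchor:  " + sameAnchor(agreement, medical, companion, 5));
    }
}
```

On that sample document the distributed form matches while the same-occurrence form does not, which is exactly the kind of hit I'd call spurious.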

Also non-intuitive is that qsol generates rather different Query
objects for these two queries:

(dog or dragon) w/5 (cat and cow)
(cat and cow) w/5 (dog or dragon)

The former maps to something like

((dog w/5 cat) AND (dog w/5 cow)) OR ((dragon w/5 cat) AND (dragon w/5 cow))

while the latter maps to something like

((cat w/5 dog) OR (cat w/5 dragon)) AND ((cow w/5 dog) OR (cow w/5 dragon))

I'm not sure there's an obviously better thing for qsol to do with
these, but it would be nice if w/5 were a symmetric operation.
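
A quick way to see that the two expansions really differ is to evaluate both on one document. The sketch below does that with hypothetical names and the same assumed reading of w/n as a position difference of at most n; it only mirrors the two boolean shapes shown above, not qsol's implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Evaluates the two expansions of "(dog or dragon) w/5 (cat and cow)" on one document.
// Assumption: w/n means |posA - posB| <= n on token positions.
class ExpansionAsymmetry {

    /** True if some occurrence of a is within n positions of some occurrence of b. */
    static boolean near(Map<String, int[]> doc, String a, String b, int n) {
        int[] none = new int[0];
        for (int x : doc.getOrDefault(a, none)) {
            for (int y : doc.getOrDefault(b, none)) {
                if (Math.abs(x - y) <= n) return true;
            }
        }
        return false;
    }

    /** Expansion produced for (dog or dragon) w/n (cat and cow). */
    static boolean orOnLeft(Map<String, int[]> d, int n) {
        return (near(d, "dog", "cat", n) && near(d, "dog", "cow", n))
            || (near(d, "dragon", "cat", n) && near(d, "dragon", "cow", n));
    }

    /** Expansion produced for (cat and cow) w/n (dog or dragon). */
    static boolean andOnLeft(Map<String, int[]> d, int n) {
        return (near(d, "cat", "dog", n) || near(d, "cat", "dragon", n))
            && (near(d, "cow", "dog", n) || near(d, "cow", "dragon", n));
    }

    /** "dog cat ... dragon cow": dog is near cat only, dragon is near cow only. */
    static Map<String, int[]> sampleDoc() {
        Map<String, int[]> d = new HashMap<>();
        d.put("dog", new int[] {0});
        d.put("cat", new int[] {1});
        d.put("dragon", new int[] {20});
        d.put("cow", new int[] {21});
        return d;
    }

    public static void main(String[] args) {
        Map<String, int[]> d = sampleDoc();
        System.out.println("(dog or dragon) w/5 (cat and cow): " + orOnLeft(d, 5));
        System.out.println("(cat and cow) w/5 (dog or dragon): " + andOnLeft(d, 5));
    }
}
```

Under these assumptions every document matched by the first form is also matched by the second, but not the other way round, so swapping the operands silently widens the result set.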

Note: The above does not reflect qsol's actual syntax, only its semantics.

2. Minimum interval semantics

This approach is reflected in the Lucene "positions branch"
(PositionIntervalIterator stuff). It's also nicely described in the
paper "An Algebra for Structured Text Search and A Framework for its
Implementation" (Clarke, Cormack, Burkowski).

I don't know of a Lucene query parser built around this stuff yet, but
it seems possible to construct one. The basic idea is that each
subquery of a query string corresponds to a set of matching
spans/intervals/extents, and operators (whether "boolean" or
"proximity") are a means of filtering and combining
spans/intervals/extents. For example:

* Query [ medical ] would correspond to every span where the term
"medical" appears
* Query [ medical and companion ] would correspond to all the
minimum-match spans containing both "medical" and "companion"
* Query [ agreement w/5 (medical and companion) ] would correspond to
all the minimum-match spans where "agreement" was within 5 words of a
minimum-match span for [medical and companion]
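
As a rough illustration of the bullets above, here is a sketch for the two-term case. It is not taken from the positions branch or from the paper; the names are hypothetical, position lists are assumed sorted, and "within n words of a span" is read as "the gap between the position and the span is at most n", which is only one possible interpretation.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal intervals for [a and b], then a proximity check against them.
// Assumptions: sorted position lists; "within n of a span" = gap to the span <= n.
class MinimalIntervals {

    /** Minimal intervals that contain at least one position from each list. */
    static List<int[]> both(int[] a, int[] b) {
        List<int[]> out = new ArrayList<>();
        int i = 0, j = 0;
        int prevPos = 0;
        int prevList = -1; // -1: nothing seen yet, 0: last came from a, 1: from b
        // Walk the merged positions; adjacent positions from different lists
        // bound an interval with no smaller match inside it.
        while (i < a.length || j < b.length) {
            boolean takeA = j >= b.length || (i < a.length && a[i] <= b[j]);
            int pos = takeA ? a[i++] : b[j++];
            int list = takeA ? 0 : 1;
            if (prevList != -1 && prevList != list) {
                out.add(new int[] {prevPos, pos});
            }
            prevPos = pos;
            prevList = list;
        }
        return out;
    }

    /** Gap between position p and interval [start, end]; 0 if p lies inside. */
    static int gap(int p, int[] interval) {
        if (p < interval[0]) return interval[0] - p;
        if (p > interval[1]) return p - interval[1];
        return 0;
    }

    /** True if some anchor position is within n of some interval. */
    static boolean within(int[] anchors, List<int[]> intervals, int n) {
        for (int p : anchors) {
            for (int[] iv : intervals) {
                if (gap(p, iv) <= n) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<int[]> spans = both(new int[] {3, 20}, new int[] {5, 50});
        for (int[] s : spans) {
            System.out.println("[" + s[0] + ", " + s[1] + "]");
        }
        System.out.println(within(new int[] {7}, spans, 5));
    }
}
```

Note that under this reading a position can be close to a long span while still being far from one of the terms inside it, so the choice of distance definition matters.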

I like how minimum inte

Re: Approaches/semantics for arbitrarily combining boolean and proximity search operators?

2012-05-16 Thread Ahmet Arslan
> medical w/5 agreement
> (medical w/5 agreement) and (doctor w/10 rights)
> 
> but also crazier ones, perhaps like
> 
> agreement w/5 (medical and companion)
> (dog or dragon) w/5 (cat and cow)
> (daisy and (dog or dragon)) w/25 (cat not cow)

This syntax reminds me of Surround.

http://wiki.apache.org/solr/SurroundQueryParser
http://www.lucidimagination.com/blog/2009/02/22/exploring-query-parsers/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Approaches/semantics for arbitrarily combining boolean and proximity search operators?

2012-05-16 Thread Trejkaz
On Thu, May 17, 2012 at 7:11 AM, Chris Harris wrote:
> but also crazier ones, perhaps like
>
> agreement w/5 (medical and companion)
> (dog or dragon) w/5 (cat and cow)
> (daisy and (dog or dragon)) w/25 (cat not cow)
[skip]

Everything in your post matches our experience. We ended up writing
something which transforms the query as well but had to give up on
certain crazy things people tried, such as this form:

   (A and B) w/5 (C and D)

For this one:

  A w/5 (B and C)

We found the user expected the same A to be within 5 terms of both a B
and a C, and rewrote it to match that but also match more than they
asked for. So far, there have been no complaints about the overmatches
(it's documented.)

There is probably an extremely accurate way to rewrite it, but it
couldn't be figured out at the time. Maybe start with spans for A and
then remove spans not-near a B and spans not-near a C, which would
leave you with only the A spans near both. The problem is that if you expand
the query to something like this, it gets quite a bit more complex, so
a user query which is already complex could turn into a really hard to
understand mess...

TX




Re: Approaches/semantics for arbitrarily combining boolean and proximity search operators?

2012-05-16 Thread Mike Sokolov
It sounds to me as if there could be a market for a new kind of query that 
would implement:


A w/5 (B and C)

in the way that people understand it to mean - the same A near both B 
and C, not just any A.


Maybe it's too hard to implement using rewrites into existing SpanQueries?

In terms of the PositionIterator work - instead of A being within 5 in a 
"minimum" distance sense, what we want is that its "maximum" distance to 
all the terms in the other query (B and C) is at most 5. I'm not sure any 
query in that branch covers this case either, but if I recall, 
there was a way to implement extensions to it that were fairly natural.
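
To spell out the minimum/maximum distinction, here is a small sketch on plain position arrays. It has nothing to do with the actual classes in that branch; the names are made up, and distance is assumed to be the absolute difference of token positions.

```java
// For one occurrence of A, compare the "minimum" and "maximum" distance
// to a set of other terms. Assumption: distance = |posA - posTerm|.
class MinMaxDistance {

    /** Distance from p to the nearest occurrence in positions. */
    static int nearest(int p, int[] positions) {
        int best = Integer.MAX_VALUE; // stays MAX_VALUE when the term is absent
        for (int q : positions) {
            best = Math.min(best, Math.abs(p - q));
        }
        return best;
    }

    /** Smallest nearest-occurrence distance over all terms: A is near at least one. */
    static int minDistance(int p, int[][] terms) {
        int min = Integer.MAX_VALUE;
        for (int[] t : terms) {
            min = Math.min(min, nearest(p, t));
        }
        return min;
    }

    /** Largest nearest-occurrence distance over all terms: A is near every one. */
    static int maxDistance(int p, int[][] terms) {
        int max = 0;
        for (int[] t : terms) {
            max = Math.max(max, nearest(p, t));
        }
        return max;
    }

    public static void main(String[] args) {
        int a = 10;
        int[][] bAndC = {{12}, {30}}; // B at 12, C at 30
        System.out.println("min distance: " + minDistance(a, bAndC));
        System.out.println("max distance: " + maxDistance(a, bAndC));
    }
}
```

With a window of 5, that A passes the minimum test (it is near B) but fails the maximum test (it is too far from C), which is the behaviour people seem to expect from A w/5 (B and C).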


-Mike
