Re: [Suggestions Required] 110 concurrent users indexing on Lucene don't finish in 200 ms.

2014-02-13 Thread sree
Thanks for your reply. We are using 100 threads and each indexes 100
documents. We have now created a standalone project that uses Lucene to
index 100 documents per thread across 100 concurrent threads, and we can
see that each thread takes an average of more than 1 second.

lucene-group.zip


Please find attached the source files, Excel sheet, and profiler image for
more information.

thanks
Sreedeep



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Suggestions-Required-110-Concurrency-users-indexing-on-Lucene-dont-finish-in-200-ms-tp4116625p4117133.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Which is better: search through query and whole text document, or search through query with document field?

2014-02-13 Thread Rajendra Rao
Hello,

I have a query and a document. The text is unstructured, natural language.
I use Lucene to search documents against the query. If I separate the
document into fields and then search, what will the difference be?
I can't test this now because I don't have field-separated data, but in
the future we will.

Thanks.


Re: [Suggestions Required] 110 concurrent users indexing on Lucene don't finish in 200 ms.

2014-02-13 Thread Michael McCandless
For better performance, you should not send 100 threads to
IndexWriter, but rather a number of threads in proportion to how many
CPUs the machine has.  E.g. if your CPU has 8 cores then use at most
12 (=8 * 1.5) indexing threads. It's fine to have 100 client threads
sending documents, but drop these documents into a queue and have the
12 indexing threads pull from there.

If you will have more than 8 threads in IndexWriter at once, then you
should call IndexWriterConfig.setMaxThreadStates to increase the
default (8).
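
For example, a minimal sketch of that setup (just a sketch: `writer` is a
final reference to an already-configured IndexWriter, and the queue
capacity and thread count are illustrative):

    import java.io.IOException;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import org.apache.lucene.document.Document;

    final BlockingQueue<Document> queue = new ArrayBlockingQueue<Document>(1000);
    int numIndexThreads = 12;  // ~1.5 x the number of CPU cores
    for (int i = 0; i < numIndexThreads; i++) {
      new Thread(new Runnable() {
        public void run() {
          try {
            while (true) {
              // block until one of the 100 client threads enqueues a doc
              writer.addDocument(queue.take());
            }
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // normal shutdown path
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
        }
      }).start();
    }
    // client threads call queue.put(doc) instead of writer.addDocument(doc)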

Also, your benchmark does not allow for JVM warmup, so you are
measuring e.g. HotSpot compilation time.  It's better to run a
long-running test and then measure the indexing throughput at steady
state, once the JVM has warmed up.

In Lucene's nightly benchmark
(https://people.apache.org/~mikemccand/lucenebench/indexing.html), we
index ~1 KB docs at around 44.9 K docs/sec, or ~145 GB/hour, but
that's a very different test from what you are running (e.g., it uses
.addDocument, not the more costly .updateDocument)...

Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 13, 2014 at 5:08 AM, sree  wrote:
> Thanks for your reply. We are using 100 threads and each indexes 100
> documents. We have now created a standalone project that uses Lucene to
> index 100 documents per thread across 100 concurrent threads, and we can
> see that each thread takes an average of more than 1 second.
>
> lucene-group.zip
>
>
> Please find attached the source files, Excel sheet, and profiler image for
> more information.
>
> thanks
> Sreedeep
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Suggestions-Required-110-Concurrency-users-indexing-on-Lucene-dont-finish-in-200-ms-tp4116625p4117133.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Adding custom weights to individual terms

2014-02-13 Thread Michael McCandless
You could stuff your custom weights into a payload, and index that,
but payloads are per term, per document, per position, while it sounds
like you just want one float for each term regardless of which
documents/positions that term occurred in?

Doing your own custom attribute would be a challenge: not only must
you create & set this attribute during indexing, but you then must
change the indexing process (custom chain, custom codec) to get the
new attribute into the index, and then make a custom query that can
pull this attribute at search time.

What are these term weights?  Are you sure you can't compute these
weights at search time with a custom similarity using the stats that
are already stored (docFreq, totalTermFreq, maxDoc, etc.)?

Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 13, 2014 at 2:40 AM, Rune Stilling  wrote:
> Hi list
>
> I'm trying to figure out how customizable scoring and weighting is in the
> Lucene API. I read about the APIs but still can't figure out if the
> following is possible.
>
> I would like to do normal document text indexing, but I would like to
> control the weight added to tokens myself; I would also like to control
> the weighting of query tokens and how things are added together.
>
> When indexing a word I would like to attach my own weights to the word,
> and use these weights when querying for documents. For example:
>
> Doc 1
> Lucene(0.7) is(0) a(0) powerful(0.9) indexing(0.62) and(0) search(0.99)
> API(0.3)
>
> Doc 2
> Lucene(0.5) is(0) used by(0) a(0) lot of(0) smart(0) people(0.1)
>
> The floats in parentheses are ones I would like to add in the indexing
> process, not something coming from Lucene's tf/idf, for example.
>
> When querying I would like to repeat this and also create the weights for
> each term "myself" and control how the final doc score is calculated.
>
> I have read that it's possible to attach your own custom attributes to
> tokens. Is this the way to go? I.e., should I add my custom weight as
> attributes to tokens, and then access these attributes when calculating
> document score in the search process (described here
> https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/analysis/package-summary.html
> under "adding a custom attribute")?
>
> The reason why I'm asking is that I can't find any examples of this being
> done anywhere. But I found someone stating "With Lucene, it is impossible
> to increase or decrease the weight of individual terms in a document".
>
> With regards
> Rune

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Which is better: search through query and whole text document, or search through query with document field?

2014-02-13 Thread Ian Lea
The one that meets your requirements most easily will be the best.

If people will want to search for words in particular fields, you'll
need to split it; but if they only ever want to search across all
fields, there's no point.

A common requirement is to want both, in which case you can split it
and also store everything in a common field called something like
"contents".  Or look at MultiFieldQueryParser.


--
Ian.


On Thu, Feb 13, 2014 at 10:16 AM, Rajendra Rao
 wrote:
> Hello,
>
> I have query and document.Its unstructured & natural  text.I used lucene
> for searching document on query.If I  separate Document into field and then
> search.what will be difference?
> I can't check it because now i don't have field separated data .But in
> future we will have.
>
> Thanks.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Adding custom weights to individual terms

2014-02-13 Thread Shai Erera
I often prefer to manage such weights outside the index. Managing them
inside the index usually leads to problems later, e.g. when the weights
change: if they are encoded in the index, changing them means re-indexing,
and if a weight changes, some segments will carry a different weight than
others. I think that if you manage the weights e.g. in a simple FST (which
is very compact), it will give you the best flexibility, and it's very easy
to use.
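
For example, a minimal sketch of building and probing such a term -> weight
FST (assuming Lucene 4.x's FST API; scaling the float weight into a
positive long is just one possible encoding):

    import java.util.Map;
    import java.util.TreeMap;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.IntsRef;
    import org.apache.lucene.util.fst.Builder;
    import org.apache.lucene.util.fst.FST;
    import org.apache.lucene.util.fst.PositiveIntOutputs;
    import org.apache.lucene.util.fst.Util;

    // FST inputs must be added in sorted order, hence the TreeMap
    TreeMap<String, Long> weights = new TreeMap<String, Long>();
    weights.put("lucene", 70L);  // 0.7, scaled by 100
    weights.put("search", 99L);  // 0.99

    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
    BytesRef scratchBytes = new BytesRef();
    IntsRef scratchInts = new IntsRef();
    for (Map.Entry<String, Long> e : weights.entrySet()) {
      scratchBytes.copyChars(e.getKey());
      builder.add(Util.toIntsRef(scratchBytes, scratchInts), e.getValue());
    }
    FST<Long> fst = builder.finish();

    Long w = Util.get(fst, new BytesRef("lucene"));  // 70, i.e. weight 0.7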

Shai


On Thu, Feb 13, 2014 at 1:36 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> You could stuff your custom weights into a payload, and index that,
> but payloads are per term, per document, per position, while it sounds
> like you just want one float for each term regardless of which
> documents/positions that term occurred in?
>
> Doing your own custom attribute would be a challenge: not only must
> you create & set this attribute during indexing, but you then must
> change the indexing process (custom chain, custom codec) to get the
> new attribute into the index, and then make a custom query that can
> pull this attribute at search time.
>
> What are these term weights?  Are you sure you can't compute these
> weights at search time with a custom similarity using the stats that
> are already stored (docFreq, totalTermFreq, maxDoc, etc.)?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Feb 13, 2014 at 2:40 AM, Rune Stilling  wrote:
> > Hi list
> >
> > I'm trying to figure out how customizable scoring and weighting is in
> > the Lucene API. I read about the APIs but still can't figure out if the
> > following is possible.
> >
> > I would like to do normal document text indexing, but I would like to
> > control the weight added to tokens myself; I would also like to control
> > the weighting of query tokens and how things are added together.
> >
> > When indexing a word I would like to attach my own weights to the word,
> > and use these weights when querying for documents. For example:
> >
> > Doc 1
> > Lucene(0.7) is(0) a(0) powerful(0.9) indexing(0.62) and(0) search(0.99)
> > API(0.3)
> >
> > Doc 2
> > Lucene(0.5) is(0) used by(0) a(0) lot of(0) smart(0) people(0.1)
> >
> > The floats in parentheses are ones I would like to add in the indexing
> > process, not something coming from Lucene's tf/idf, for example.
> >
> > When querying I would like to repeat this and also create the weights for
> > each term "myself" and control how the final doc score is calculated.
> >
> > I have read that it's possible to attach your own custom attributes to
> > tokens. Is this the way to go? I.e., should I add my custom weight as
> > attributes to tokens, and then access these attributes when calculating
> > document score in the search process (described here
> > https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/analysis/package-summary.html
> > under "adding a custom attribute")?
> >
> > The reason why I'm asking is that I can't find any examples of this
> > being done anywhere. But I found someone stating "With Lucene, it is
> > impossible to increase or decrease the weight of individual terms in a
> > document".
> >
> > With regards
> > Rune
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


MAX_TERM_LENGTH

2014-02-13 Thread Marcio Napoli
Hi All,

I need to work with big terms, so the 32 KB limit is not enough. How can I
increase the maximum size of a term? I found the IndexWriter
MAX_TERM_LENGTH constant, which refers to FieldCache and
DocumentsWriterPerThread (BYTE_BLOCK_SIZE - 2).

Thanks,
Marcio Napoli

Go beyond Lucene(tm) features with Numere(R)
http://numere.stela.org.br


simple question about index reader

2014-02-13 Thread Yonghui Zhao
Hi,

I am new to Lucene and have a simple question about index readers.

If I open a DirectoryReader, say reader1, on a disk directory and the
Lucene index directory is then changed, I need to open a new
DirectoryReader to see the new results.

Suppose reader1 keeps returning the pre-change results forever.

I am wondering how Lucene can guarantee that reader1's results do not
change.

If I delete all docs after reader1 is opened, then after an optimize the
directory should be empty, so how can reader1 still return the old
results?


Re: MAX_TERM_LENGTH

2014-02-13 Thread Michael McCandless
Why do you index such immense terms?  What's the end user use case?
Do they really need to be inverted?  Maybe use binary doc values
instead?

Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 13, 2014 at 8:36 AM, Marcio Napoli  wrote:
> Hi All,
>
> I need to work with big terms, so the 32 KB limit is not enough. How can I
> increase the maximum size of a term? I found the IndexWriter
> MAX_TERM_LENGTH constant, which refers to FieldCache and
> DocumentsWriterPerThread (BYTE_BLOCK_SIZE - 2).
>
> Thanks,
> Marcio Napoli
>
> Go beyond Lucene(tm) features with Numere(R)
> http://numere.stela.org.br

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: simple question about index reader

2014-02-13 Thread Michael McCandless
The reader holds all the underlying files still open, and relies on
the filesystem to "protect" still-open files that are deleted.

Windows does this by refusing to allow deletion.  Unix does it by
keeping the file bytes available on disk but removing the directory
entry ("delete on last close").

Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 13, 2014 at 10:14 AM, Yonghui Zhao  wrote:
> Hi,
>
> I am new to Lucene and have a simple question about index readers.
>
> If I open a DirectoryReader, say reader1, on a disk directory and the
> Lucene index directory is then changed, I need to open a new
> DirectoryReader to see the new results.
>
> Suppose reader1 keeps returning the pre-change results forever.
>
> I am wondering how Lucene can guarantee that reader1's results do not
> change.
>
> If I delete all docs after reader1 is opened, then after an optimize the
> directory should be empty, so how can reader1 still return the old
> results?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



IndexWriter and IndexReader

2014-02-13 Thread Cemo
I am quite new to Lucene. I am trying to prepare an application where:

   1. ~100K documents exist.
   2. ~4 search servers will be utilized.
   3. Documents are not frequently updated, and I want to check for
   deletions or additions every minute.
   4. I am ready to sacrifice some system resources to keep my setup as
   simple as possible.

Here are my questions:


   1. I would like to create a single index folder and map this folder to
   each server. Is this good practice?
   2. Instead of updating my indexes
   (org.apache.lucene.index.IndexWriterConfig.OpenMode#CREATE_OR_APPEND), I
   want to overwrite them. Opening my indexes with
   IndexWriterConfig.OpenMode.CREATE seems enough. I am assuming that
   SearcherFactory can warm and prepare my IndexReader for the live system.
   Is this a good way, or am I headed in totally the wrong direction?
   3. During IndexWriter operations such as overwriting indexes, what are
   the consequences of "searcherManager.acquire();"? I am afraid of hitting
   concurrency errors on the live system under heavy load.

Thanks


RE: Getting term ords during collect

2014-02-13 Thread Kyle Judson
The SortedSetDocValuesField worked great.

Thanks.
Kyle

> From: luc...@mikemccandless.com
> Date: Wed, 12 Feb 2014 05:39:24 -0500
> Subject: Re: Getting term ords during collect
> To: java-user@lucene.apache.org
> 
> It sounds like you are just indexing a TextField and then calling
> getDocTermOrds?  This then requires a slow "uninvert" step... Hmm, how
> are you adding this field to your documents?
> 
> Instead, you should use SortedSetDocValuesField, which will store the
> doc values directly in the index, and loading them at search time
> should be fast.  But note that you cannot search on the field if you
> do that; if you also need to search then you should still index the
> TextField as well.
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> 
> On Tue, Feb 11, 2014 at 10:38 PM, Kyle Judson  wrote:
> > Too long is always relative, but one of the fields in a 24 GB index with
> > 3.9M terms takes 2.5 min to load from SSD.
> >
> > I'm getting the SortedSetDocValues from FieldCache.DEFAULT.getDocTermOrds.
> >
> > What are the other DV formats? I'll look them up and try them.
> >
> > Thanks
> > Kyle
> >
> >> From: luc...@mikemccandless.com
> >> Date: Tue, 11 Feb 2014 19:59:03 -0500
> >> Subject: Re: Getting term ords during collect
> >> To: java-user@lucene.apache.org
> >>
> >> SortedSetDV is probably the best way to do so.  You could also encode
> >> the ords yourself into a byte[] and use binary DV.
> >>
> >> But why are you seeing it take so long to load?  You can switch to
> >> different DV formats to trade off RAM usage and lookup speed...
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Tue, Feb 11, 2014 at 6:57 PM, Kyle Judson  wrote:
> >> > Hi All,
> >> >
> >> > What are the ways I can get the ords for the terms of a particular field 
> >> > in the collect method of a Collector?
> >> >
> >> > I'm currently using a SortedSetDocValues that I obtained before the 
> >> > query but it's taking longer to load than I would like.
> >> >
> >> > Thanks
> >> > Kyle
> >> >
> >> >
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 

Re: MAX_TERM_LENGTH

2014-02-13 Thread Marcio Napoli
Hey Mike,

I need quick access to values per document. Is access to binary values
possible via FieldCache -> FieldCacheSource.getValues()?

Thanks,
Marcio Napoli

Go beyond Lucene(tm) features with Numere(R)
http://numere.stela.org.br


2014-02-13 13:16 GMT-02:00 Michael McCandless :

> Why do you index such immense terms?  What's the end user use case?
> Do they really need to be inverted?  Maybe use binary doc values
> instead?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Feb 13, 2014 at 8:36 AM, Marcio Napoli 
> wrote:
> > Hi All,
> >
> > I need to work with big terms, so the 32 KB limit is not enough. How can I
> > increase the maximum size of a term? I found the IndexWriter
> > MAX_TERM_LENGTH constant, which refers to FieldCache and
> > DocumentsWriterPerThread (BYTE_BLOCK_SIZE - 2).
> >
> > Thanks,
> > Marcio Napoli
> >
> > Go beyond Lucene(tm) features with Numere(R)
> > http://numere.stela.org.br
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: MAX_TERM_LENGTH

2014-02-13 Thread Michael McCandless
You can use IndexReader.getBinaryDocValues(field).
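
For example (a sketch assuming Lucene 4.x; the field name is illustrative,
and per-segment access through the reader's leaves is the usual pattern):

    import org.apache.lucene.document.BinaryDocValuesField;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.BinaryDocValues;
    import org.apache.lucene.util.BytesRef;

    // index time: store the huge value as binary doc values, not as a term
    doc.add(new BinaryDocValuesField("bigvalue", new BytesRef(bigBytes)));

    // search time, per segment (docID is segment-local):
    for (AtomicReaderContext ctx : reader.leaves()) {
      BinaryDocValues dv = ctx.reader().getBinaryDocValues("bigvalue");
      BytesRef scratch = new BytesRef();
      dv.get(docID, scratch);  // scratch now points at the stored bytes
    }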

BTW your site should reference *Apache* Lucene, not just Lucene.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 13, 2014 at 11:51 AM, Marcio Napoli  wrote:
> Hey Mike,
>
> I need quick access to values per document. Is access to binary values
> possible via FieldCache -> FieldCacheSource.getValues()?
>
> Thanks,
> Marcio Napoli
>
> Go beyond Lucene(tm) features with Numere(R)
> http://numere.stela.org.br
>
>
> 2014-02-13 13:16 GMT-02:00 Michael McCandless :
>
>> Why do you index such immense terms?  What's the end user use case?
>> Do they really need to be inverted?  Maybe use binary doc values
>> instead?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, Feb 13, 2014 at 8:36 AM, Marcio Napoli 
>> wrote:
>> > Hi All,
>> >
>> > I need to work with big terms, so the 32 KB limit is not enough. How can I
>> > increase the maximum size of a term? I found the IndexWriter
>> > MAX_TERM_LENGTH constant, which refers to FieldCache and
>> > DocumentsWriterPerThread (BYTE_BLOCK_SIZE - 2).
>> >
>> > Thanks,
>> > Marcio Napoli
>> >
>> > Go beyond Lucene(tm) features with Numere(R)
>> > http://numere.stela.org.br
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: MAX_TERM_LENGTH

2014-02-13 Thread Marcio Napoli
Thanks for the note,

Marcio Napoli

Go beyond Apache Lucene(tm) features with Numere(R)
http://numere.stela.org.br



2014-02-13 14:56 GMT-02:00 Michael McCandless :

> You can use IndexReader.getBinaryDocValues(field).
>
> BTW your site should reference *Apache* Lucene, not just Lucene.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Feb 13, 2014 at 11:51 AM, Marcio Napoli 
> wrote:
> > Hey Mike,
> >
> > I need quick access to values per document. Is access to binary values
> > possible via FieldCache -> FieldCacheSource.getValues()?
> >
> > Thanks,
> > Marcio Napoli
> >
> > Go beyond Lucene(tm) features with Numere(R)
> > http://numere.stela.org.br
> >
> >
> > 2014-02-13 13:16 GMT-02:00 Michael McCandless  >:
> >
> >> Why do you index such immense terms?  What's the end user use case?
> >> Do they really need to be inverted?  Maybe use binary doc values
> >> instead?
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Thu, Feb 13, 2014 at 8:36 AM, Marcio Napoli  >
> >> wrote:
> >> > Hi All,
> >> >
> >> > I need to work with big terms, so the 32 KB limit is not enough. How
> >> > can I increase the maximum size of a term? I found the IndexWriter
> >> > MAX_TERM_LENGTH constant, which refers to FieldCache and
> >> > DocumentsWriterPerThread (BYTE_BLOCK_SIZE - 2).
> >> >
> >> > Thanks,
> >> > Marcio Napoli
> >> >
> >> > Go beyond Lucene(tm) features with Numere(R)
> >> > http://numere.stela.org.br
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: IndexWriter and IndexReader

2014-02-13 Thread Michael McCandless
Overwriting an index in-place while open IndexReaders are actively
searching works fine.

You can either open a new IW with OpenMode.CREATE, or, you can call
IW.deleteAll() if you have an existing IW already open.
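
For example, the OpenMode.CREATE route is just a config flag (a sketch;
`analyzer` and `dir` are assumed to exist):

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.util.Version;

    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
    iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);  // drops any existing index
    IndexWriter writer = new IndexWriter(dir, iwc);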

Writing to a shared index directory mapped to N machines is not
generally done, because performance is often poor, though it should
work fine; usually apps replicate the index out to the N machines.
Lucene has a replication module that does this... but really if you
want to distribute load to N machines, you may want to just use Solr
or ElasticSearch, since they handle the replication / query load
balancing for you.
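
On the searcherManager.acquire() question: acquire() essentially just
increments a reference count on the current point-in-time searcher, so it
should be cheap even under heavy load. A minimal sketch of the usual
pattern (`writer` and `query` are assumed to exist):

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.SearcherFactory;
    import org.apache.lucene.search.SearcherManager;
    import org.apache.lucene.search.TopDocs;

    SearcherManager mgr = new SearcherManager(writer, true, new SearcherFactory());

    // every search:
    IndexSearcher s = mgr.acquire();
    try {
      TopDocs hits = s.search(query, 10);
    } finally {
      mgr.release(s);  // never use s after release
    }

    // once a minute, after your add/delete pass:
    mgr.maybeRefresh();  // new acquires see the changes; in-flight searches don't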

Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 13, 2014 at 10:54 AM, Cemo  wrote:
> I am quite new to Lucene. I am trying to prepare an application where:
>
>    1. ~100K documents exist.
>    2. ~4 search servers will be utilized.
>    3. Documents are not frequently updated, and I want to check for
>    deletions or additions every minute.
>    4. I am ready to sacrifice some system resources to keep my setup as
>    simple as possible.
>
> Here are my questions:
>
>
>    1. I would like to create a single index folder and map this folder to
>    each server. Is this good practice?
>    2. Instead of updating my indexes
>    (org.apache.lucene.index.IndexWriterConfig.OpenMode#CREATE_OR_APPEND), I
>    want to overwrite them. Opening my indexes with
>    IndexWriterConfig.OpenMode.CREATE seems enough. I am assuming that
>    SearcherFactory can warm and prepare my IndexReader for the live system.
>    Is this a good way, or am I headed in totally the wrong direction?
>    3. During IndexWriter operations such as overwriting indexes, what are
>    the consequences of "searcherManager.acquire();"? I am afraid of hitting
>    concurrency errors on the live system under heavy load.
>
> Thanks

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Adding custom weights to individual terms

2014-02-13 Thread Rune Stilling
On 13/02/2014 at 12.36, Michael McCandless  wrote:

> You could stuff your custom weights into a payload, and index that,
> but payloads are per term, per document, per position, while it sounds
> like you just want one float for each term regardless of which
> documents/positions that term occurred in?

No, I want to store a weight per term per document. The point is that my
custom term weight is semantically dependent on the document context,
exactly the same way the standard term weights are.

It doesn’t make sense to also have a separate weight per position.

> Doing your own custom attribute would be a challenge: not only must
> you create & set this attribute during indexing, but you then must
> change the indexing process (custom chain, custom codec) to get the
> new attribute into the index, and then make a custom query that can
> pull this attribute at search time.

Hmm, well. But will it solve my problem then?

> What are these term weights?  Are you sure you can't compute these
> weights at search time with a custom similarity using the stats that
> are already stored (docFreq, totalTermFreq, maxDoc, etc.)?

Yes, I'm sure. I'm doing a semantic analysis of the documents before they
are indexed, and it's the result of this that I want to store as a custom
weight on a term-per-document basis. docFreq etc. reflect a quite simple
approach to term weighting (i.e. tf/idf), which just isn't precise enough
in my case.

So it seems I might as well build my own term lists and code the indexing and 
searching process manually?

With regards,
Rune

> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> 
> On Thu, Feb 13, 2014 at 2:40 AM, Rune Stilling  wrote:
>> Hi list
>> 
>> I'm trying to figure out how customizable scoring and weighting is in the
>> Lucene API. I read about the APIs but still can't figure out if the
>> following is possible.
>> 
>> I would like to do normal document text indexing, but I would like to
>> control the weight added to tokens myself; I would also like to control
>> the weighting of query tokens and how things are added together.
>> 
>> When indexing a word I would like to attach my own weights to the word,
>> and use these weights when querying for documents. For example:
>> 
>> Doc 1
>> Lucene(0.7) is(0) a(0) powerful(0.9) indexing(0.62) and(0) search(0.99)
>> API(0.3)
>> 
>> Doc 2
>> Lucene(0.5) is(0) used by(0) a(0) lot of(0) smart(0) people(0.1)
>> 
>> The floats in parentheses are ones I would like to add in the indexing
>> process, not something coming from Lucene's tf/idf, for example.
>> 
>> When querying I would like to repeat this and also create the weights for
>> each term "myself" and control how the final doc score is calculated.
>> 
>> I have read that it's possible to attach your own custom attributes to
>> tokens. Is this the way to go? I.e., should I add my custom weight as
>> attributes to tokens, and then access these attributes when calculating
>> document score in the search process (described here
>> https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/analysis/package-summary.html
>> under "adding a custom attribute")?
>> 
>> The reason why I'm asking is that I can't find any examples of this being
>> done anywhere. But I found someone stating "With Lucene, it is impossible
>> to increase or decrease the weight of individual terms in a document".
>> 
>> With regards
>> Rune
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Adding custom weights to individual terms

2014-02-13 Thread Rune Stilling
I'm not sure how I would do that, since Lucene is meant to use my custom
weights when calculating document scores while executing a search query.

Doc 1
Lucene(0.7) is(0) a(0) powerful(0.9) indexing(0.62) and(0) search(0.99) API(0.3)

Doc 2
Lucene(0.5) is(0) used by(0) a(0) lot of(0) smart(0) people(0.1)

Query
Lucene

0.7 and 0.5 are my custom weights and should be used to return Doc 1 with
score 0.7 and Doc 2 with score 0.5 as the answer to my query.

/Rune

On 13/02/2014 at 13.27, Shai Erera  wrote:

> I often prefer to manage such weights outside the index. Managing them
> inside the index usually leads to problems later, e.g. when the weights
> change: if they are encoded in the index, changing them means re-indexing,
> and if a weight changes, some segments will carry a different weight than
> others. I think that if you manage the weights e.g. in a simple FST (which
> is very compact), it will give you the best flexibility, and it's very easy
> to use.
> 
> Shai
> 
> 
> On Thu, Feb 13, 2014 at 1:36 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
> 
>> You could stuff your custom weights into a payload, and index that,
>> but payloads are per term, per document, per position, while it sounds
>> like you just want one float for each term regardless of which
>> documents/positions that term occurred in?
>> 
>> Doing your own custom attribute would be a challenge: not only must
>> you create & set this attribute during indexing, but you then must
>> change the indexing process (custom chain, custom codec) to get the
>> new attribute into the index, and then make a custom query that can
>> pull this attribute at search time.
>> 
>> What are these term weights?  Are you sure you can't compute these
>> weights at search time with a custom similarity using the stats that
>> are already stored (docFreq, totalTermFreq, maxDoc, etc.)?
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>> 
>> 
>> On Thu, Feb 13, 2014 at 2:40 AM, Rune Stilling  wrote:
>>> Hi list
>>> 
>>> I'm trying to figure out how customizable scoring and weighting is in
>>> the Lucene API. I read about the APIs but still can't figure out if the
>>> following is possible.
>>> 
>>> I would like to do normal document text indexing, but I would like to
>>> control the weight added to tokens myself; I would also like to control
>>> the weighting of query tokens and how things are added together.
>>> 
>>> When indexing a word I would like to attach my own weights to the word,
>>> and use these weights when querying for documents. For example:
>>> 
>>> Doc 1
>>> Lucene(0.7) is(0) a(0) powerful(0.9) indexing(0.62) and(0) search(0.99)
>>> API(0.3)
>>> 
>>> Doc 2
>>> Lucene(0.5) is(0) used by(0) a(0) lot of(0) smart(0) people(0.1)
>>> 
>>> The floats in parentheses are ones I would like to add in the indexing
>>> process, not something coming from Lucene's tf/idf, for example.
>>> 
>>> When querying I would like to repeat this and also create the weights for
>>> each term "myself" and control how the final doc score is calculated.
>>> 
>>> I have read that it's possible to attach your own custom attributes to
>>> tokens. Is this the way to go? I.e., should I add my custom weight as
>>> attributes to tokens, and then access these attributes when calculating
>>> document score in the search process (described here
>>> https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/analysis/package-summary.html
>>> under "adding a custom attribute")?
>>> 
>>> The reason why I'm asking is that I can't find any examples of this
>>> being done anywhere. But I found someone stating "With Lucene, it is
>>> impossible to increase or decrease the weight of individual terms in a
>>> document".
>>> 
>>> With regards
>>> Rune
>> 
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
>> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Adding custom weights to individual terms

2014-02-13 Thread lukai
Hi, Rune:
  For your requirement, you can generate a separate field for each document
before sending it to Lucene. Let's say the name is score_field. The content
of this field would look like this:

 Doc 1#score_field:
  Lucene:0.7 is:0 ...
Doc 2#score_field:
  Lucene:0.5 is:0 ...

 Index score_field, and mark the other fields as "stored". Store the weight
value as a payload on each term (wrap your analyzer to consume the weight
value; basically you can combine DelimitedPayloadTokenFilter and
WhitespaceTokenizer to form a basic analyzer that accepts this input
format). Make sure each term within a document's score_field is unique
(according to your description this is already fulfilled). You can also
disable indexing position information for this field, since you don't need
it.
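
A minimal sketch of such a payload-consuming analyzer (the ':' delimiter
and Version constant are illustrative; this assumes Lucene 4.x's analysis
API):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
    import org.apache.lucene.analysis.payloads.FloatEncoder;
    import org.apache.lucene.util.Version;

    Analyzer payloadAnalyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String field, Reader in) {
        // split "Lucene:0.7"-style tokens on whitespace...
        Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_46, in);
        // ...then strip the float after ':' and store it as the term's payload
        TokenStream sink =
            new DelimitedPayloadTokenFilter(source, ':', new FloatEncoder());
        return new TokenStreamComponents(source, sink);
      }
    };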

Then, when you query:
1. If you want a score like cosine similarity between the query and the
document, you should implement a query parser that parses the weights you
assigned to the different terms in the query phrase.
2. Create a new query type, customize your score function, and tell Lucene
to use your scorer.

  Here is a small snippet of a query type I created before; basically you
can follow this logic to manipulate your score value:

    final Terms terms = fields.terms(fieldName);
    if (terms != null) {
      final TermsEnum termsEnum = terms.iterator(null);
      BytesRef bytes = new BytesRef(wandTerm.queryTerm);
      if (termsEnum.seekExact(new BytesRef(wandTerm.queryTerm))) {
        float ub = termsEnum.maxFeatureValue();
        int docFreq = termsEnum.docFreq();
        // logger.warn("term:" + wandTerm.queryTerm + "   :" + ub);
        DocsAndPositionsEnum docsPositionEnum =
            termsEnum.docsAndPositions(acceptDocs, null);
        tts.add(new WandPosting(fieldName, bytes, docsPositionEnum, ub,
            wandTerm.featureValue, (totalDocNum + 1) * 1.0f / docFreq));
      }
    }



On Thu, Feb 13, 2014 at 10:49 AM, Rune Stilling  wrote:

> I'm not sure how I would do that, since Lucene is meant to use my custom
> weights when calculating document scores while executing a search query.
>
> Doc 1
> Lucene(0.7) is(0) a(0) powerful(0.9) indexing(0.62) and(0) search(0.99)
> API(0.3)
>
> Doc 2
> Lucene(0.5) is(0) used by(0) a(0) lot of(0) smart(0) people(0.1)
>
> Query
> Lucene
>
> 0.7 and 0.5 are my custom weights and should be used to return Doc 1 with
> score 0.7 and Doc 2 with score 0.5 as the answer to my query.
>
> /Rune
>
> On 13/02/2014 at 13.27, Shai Erera  wrote:
>
> > I often prefer to manage such weights outside the index. Managing them
> > inside the index usually leads to problems later, e.g. when the weights
> > change: if they are encoded in the index, changing them means re-indexing,
> > and if a weight changes, some segments will carry a different weight than
> > others. I think that if you manage the weights e.g. in a simple FST (which
> > is very compact), it will give you the best flexibility, and it's very
> > easy to use.
> >
> > Shai
> >
> >
> > On Thu, Feb 13, 2014 at 1:36 PM, Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> >> You could stuff your custom weights into a payload, and index that,
> >> but payloads are per term, per document, per position, while it sounds
> >> like you just want one float for each term regardless of which
> >> documents/positions that term occurred in?
> >>
> >> Doing your own custom attribute would be a challenge: not only must
> >> you create & set this attribute during indexing, but you then must
> >> change the indexing process (custom chain, custom codec) to get the
> >> new attribute into the index, and then make a custom query that can
> >> pull this attribute at search time.
> >>
> >> What are these term weights?  Are you sure you can't compute these
> >> weights at search time with a custom similarity using the stats that
> >> are already stored (docFreq, totalTermFreq, maxDoc, etc.)?
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Thu, Feb 13, 2014 at 2:40 AM, Rune Stilling  wrote:
> >>> Hi list
> >>>
> >>> I'm trying to figure out how customizable scoring and weighting is in
> >>> the Lucene API. I read about the APIs but still can't figure out if the
> >>> following is possible.
> >>>
> >>> I would like to do normal document text indexing, but I would like to
> >>> control the weight added to tokens myself; I would also like to control
> >>> the weighting of query tokens and how things are added together.
> >>>
> >>> When indexing a word I would like to attach my own weights to the word,
> >>> and use these weights when querying for documents. For example:
> >>>
> >>> Doc 1
> >>> Lucene(0.7) is(0) a(0) powerful(0.9) indexing(0.62) and(0) search(0.99)
> >>> API(0.3)
> >>>
> >>> Doc 2
> >>> Lucene(0.5) is(0) used by(0) a(0) lot of(0) smart(0) people(0.1)
> >>>
> >>> The floats in parentheses are ones I would like to add in the indexing
> >>> process, not something coming from Lucene's tf/idf, for example.

Re: Actual min and max-value of NumericField during codec flush

2014-02-13 Thread Ravikumar Govindarajan
Yeah, now I understand it a little better.

Since LogMP always merges adjacent segments, that should pretty much serve
my use-case when used with a SortingMP.
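
A minimal sketch of that combination (assuming the Lucene 4.5+ API, where
SortingMergePolicy takes a Sort; the "timestamp" field and `iwc` are
illustrative):

    import org.apache.lucene.index.LogByteSizeMergePolicy;
    import org.apache.lucene.index.sorter.SortingMergePolicy;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;

    // keep merged segments sorted by timestamp, newest first
    Sort sort = new Sort(new SortField("timestamp", SortField.Type.LONG, true));
    // LogByteSizeMergePolicy picks adjacent segments; SortingMergePolicy
    // re-sorts each merged segment so it stays in timestamp order
    iwc.setMergePolicy(new SortingMergePolicy(new LogByteSizeMergePolicy(), sort));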

Early query termination quits by throwing an exception, right? Is it OK to
search each SegmentReader individually and then break off, instead of using
a MultiReader, especially when the order is known before the search begins?

The reason I insisted on timestamp-based merging is that there is a
possibility of an out-of-order segment being added via an addIndexes(...)
call. That segment can carry any older timestamp [a month ago, a year ago,
etc.], albeit extremely rarely. Should I worry about that during merges, or
just handle overlaps during search?

--
Ravi



On Thu, Feb 13, 2014 at 1:21 PM, Shai Erera  wrote:

> Hi
>
> LogMP *always* picks adjacent segments together. Therefore, if you have
> segments S1, S2, S3, S4 where the date-wise sort order is S4>S3>S2>S1, then
> LogMP will pick either S1-S4, S2-S4, S2-S3 and so on. But always adjacent
> segments, and in a row (i.e. it doesn't skip segments).
>
> I guess what both Mike and I don't understand is why you insist on merging
> based on the timestamp of each segment. I.e. if the order, timestamp-wise,
> of the segments isn't as I described above, then merging them like so won't
> hurt - i.e. they will still be unsorted. No harm is done.
>
> Maybe MergePolicy isn't what you need here. If you can record somewhere the
> min/max timestamp of each segment, you can use a MultiReader to wrap the
> sorted list of IndexReaders (actually SegmentReaders). Then your "reader",
> always traverses segments from new to old.
>
> If this approach won't address your issue, then you can merge based on
> timestamps - there's nothing wrong about it. What Mike suggested is that
> you benchmark your application with this merge policy, for a long period of
> time (few hours/days, depending on your indexing rate), because what might
> happen is that your merges are always unbalanced and your indexing
> performance will degrade because of unbalanced amount of IO that happens
> during the merge.
>
> Shai
>
>
> On Thu, Feb 13, 2014 at 7:25 AM, Ravikumar Govindarajan <
> ravikumar.govindara...@gmail.com> wrote:
>
> > @Mike,
> >
> > I had suggested the same approach in one of my previous mails, whereby
> > each segment records min/max timestamps in seg-info diagnostics and uses
> > them for merging adjacent segments.
> >
> > "Then, I define a TimeMergePolicy extends LogMergePolicy and define the
> > segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag]. "
> >
> > But you have expressed reservations
> >
> > "This seems somewhat dangerous...
> >
> > Not taking into account the "true" segment size can lead to very very
> > poor merge decisions ... you should turn on IndexWriter's infoStream
> > and do a long running test to convince yourself the merging is being
> > sane."
> >
> > Will merging be disastrous if I choose a TimeMergePolicy? I will also
> > test and verify, but it's always great to hear finer points from experts.
> >
> > @Shai,
> >
> > LogByteSizeMP categorizes "adjacency" by "size", whereas it would be
> better
> > if "timestamp" is used in my case
> >
> > Sure, I need to wrap this in an SMP to make sure that the newly-created
> > segment is also in sorted-order
> >
> > --
> > Ravi
> >
> >
> >
> > On Wed, Feb 12, 2014 at 8:29 PM, Shai Erera  wrote:
> >
> > > Why not use LogByteSizeMP in conjunction w/ SortingMP? LogMP picks
> > adjacent
> > > segments and SortingMP ensures the merged segment is also sorted.
> > >
> > > Shai
> > >
> > >
> > > On Wed, Feb 12, 2014 at 3:16 PM, Ravikumar Govindarajan <
> > > ravikumar.govindara...@gmail.com> wrote:
> > >
> > > > Yes exactly as you have described.
> > > >
> > > > Ex: Consider Segment[S1,S2,S3 & S4] are in reverse-chronological
> order
> > > and
> > > > goes for a merge
> > > >
> > > > While SortingMergePolicy will correctly solve the merge-part, it does
> > not
> > > > however play any role in picking segments to merge right?
> > > >
> > > > SMP internally delegates to TieredMergePolicy, which might pick S1&S4
> > to
> > > > merge disturbing the global-order. Ideally only "adjacent" segments
> > > should
> > > > be picked up for merge. Ex: {S1,S2} or {S2,S3,S4} etc...
> > > >
> > > > Can there be a better selection of segments to merge in this case, so
> > as
> > > to
> > > > maintain a semblance of global-ordering?
> > > >
> > > > --
> > > > Ravi
> > > >
> > > >
> > > >
> > > > On Wed, Feb 12, 2014 at 6:21 PM, Michael McCandless <
> > > > luc...@mikemccandless.com> wrote:
> > > >
> > > > > OK, I see (early termination).
> > > > >
> > > > > That's a challenge, because you really want the docs sorted
> backwards
> > > > > from how they were added right?  And, e.g., merged and then
> searched
> > > > > in "reverse segment order"?
> > > > >
> > > > > I think you should be able to do this w/ SortingMergePolicy?  And
> > then
> > > > > use a custom collector that stops after you've g

Collector is collecting more than the specified hits

2014-02-13 Thread saisantoshi
The problem with the collector below is that the collect method does not
stop after the numHits count has been reached. Is there a way to stop the
collector from collecting docs once it has reached the numHits specified?

For example:
TopScoreDocCollector topScore = TopScoreDocCollector.create(numHits, true);
// TopScoreDocCollector topScore = TopScoreDocCollector.create(30, true);

I would expect the collector below to pause/exit after it has collected the
specified numHits (in this case, 30). But what is happening here is that
the collector collects all the docs, thereby slowing down searches. Can we
configure the collect method below to stop after it has reached the numHits
specified? Please let me know if there is any issue with the collector
below.

public class MyCollector extends PositiveScoresOnlyCollector {

    private final IndexReader indexReader;

    public MyCollector(IndexReader indexReader,
                       PositiveScoresOnlyCollector topScore) {
        super(topScore);
        this.indexReader = indexReader;
    }

    @Override
    public void collect(int doc) {
        try {
            // Custom logic
            super.collect(doc);
        } catch (Exception e) {
            // swallowed; consider at least logging this
        }
    }
}


// Usage:

MyCollector collector;
TopScoreDocCollector topScore =
        TopScoreDocCollector.create(numHits, true);
IndexSearcher searcher = new IndexSearcher(reader);
try {
    collector = new MyCollector(indexReader,
            new PositiveScoresOnlyCollector(topScore));
    searcher.search(query, (Filter) null, collector);
} finally {
    // release resources here if needed
}
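
One possible way to stop early (a sketch, assuming a hypothetical
`collected` counter field added to MyCollector): IndexSearcher catches
org.apache.lucene.search.CollectionTerminatedException thrown from
collect() and moves on to the next segment, so throwing it ends collection
for the current segment only:

    @Override
    public void collect(int doc) throws IOException {
        if (++collected > numHits) {
            // IndexSearcher swallows this and advances to the next segment
            throw new CollectionTerminatedException();
        }
        super.collect(doc);
    }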

Thanks,
Sai.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Collector-is-collecting-more-than-the-specified-hits-tp4117329.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org