Grouping field using pruned terms?

2013-07-24 Thread Ravikumar Govindarajan
TermFirstPassGroupingCollector loads all terms for a given group-by field,
through FieldCache.

Is it possible to instruct the class to group only a pruned subset of a field's
terms, based on a user-supplied query (RangeQuery, TermQuery, etc.)?

This way, only the pruned terms are grouped and all others are ignored.

Is such a pre-processing step possible in a group-by query?
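One possible approach (a sketch, not a built-in option of the grouping collector; field names and the pruning query below are illustrative, assuming Lucene 4.x APIs): apply the user-supplied query as a filter on the search, so only documents that match it contribute group values to the first pass.

```java
// Sketch: restrict grouping to documents matching a user-supplied
// "pruning" query by passing it as a Filter to the search.
TermFirstPassGroupingCollector firstPass =
    new TermFirstPassGroupingCollector("groupField", Sort.RELEVANCE, 10);

Query mainQuery = new MatchAllDocsQuery();
Filter prune = new QueryWrapperFilter(
    new TermRangeQuery("groupField",
        new BytesRef("a"), new BytesRef("m"), true, true));

// Only documents passing the filter are collected, so only their
// group terms show up in the top groups.
searcher.search(mainQuery, prune, firstPass);
```

Note this prunes documents rather than terms directly; terms that occur only in filtered-out documents never form groups, which may be close enough to the intent.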

--
Ravi


Matched words from document - Stemmed and Synonyms

2013-07-24 Thread venkatesham.gu...@igate.com
I am looking for a feature in Solr that will give me all the matched words in a
document when I search with a word.
My field uses both stemming and synonym filters.

For example I have documents and part of the text goes like below
1.We were very careful about my surgery
2.are still needing shunts, or Decompression Surgeries, Taps work for som
3.prior to surgeries and Berinert as a rescue drug.
4.identified the disease as we were in the same area of operations as
outcome from dioxins

When I search with "surgery" it returns all 4 documents, which is fine, but I
need to know which words in each doc matched my search query "surgery" - here:
1. surgery, 2. Surgeries, 3. surgeries, and 4. operations.

I have tried using Highlighter, but it not only gives the matched words, it
also gives unwanted words, and it does not highlight the synonym-matched
words - in this example, "operations".
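One way to approximate this (a sketch using Lucene's analysis API; the analyzer and field name are placeholders, and synonym matches require the synonym filter to also run on the query side): re-analyze the stored field text and collect the original words whose analyzed form equals the analyzed query term.

```java
// Sketch: find which original words in "text" analyze to the same
// term as the (already analyzed) query word. Uses the same Analyzer
// as at index time; offsets map tokens back to the original words.
Set<String> matchedWords(Analyzer analyzer, String field,
                         String text, String analyzedQueryTerm) throws IOException {
  Set<String> matches = new LinkedHashSet<>();
  TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
  CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
  OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    if (term.toString().equals(analyzedQueryTerm)) {
      // recover the surface form, e.g. "Surgeries" for term "surgeri"
      matches.add(text.substring(offsets.startOffset(), offsets.endOffset()));
    }
  }
  ts.end();
  ts.close();
  return matches;
}
```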

Thanks
Venkatesham Gundu






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Search a Part of the Sentence/Complete sentence in lucene 4.3

2013-07-24 Thread Ankit Murarka

Dear All,

Suppose I have 3 documents. The sample text is:

/*File 1 : */

Mr X David is a manager of the company. He is the senior most manager. I 
also want to become manager of the company.


/*File 2 :*/

Mr X David manager of the company is also very senior. He happens to be 
the senior most manager. I wish even I could reach that place.


/*File 3:*/

Mr X David is working for a company. He happens to be the manager of the 
company. In fact he is the senior most manager. I don't want to become like 
him.


/*String I wish to search :* X David is a manager of the company./

Ideally I should get only file1 in the hit result.

I have no clue how to achieve this. Basically I am trying to match part 
of a sentence or a complete sentence. What would be the best methodology?
I presume "is" and "a" are stop words and will be skipped during indexing by 
the StandardAnalyzer.


What puzzles me is how I then search for part of a sentence or a 
complete sentence if the sentence contains some/many stop words.


I am using StandardAnalyzer. Please guide.

--
Regards

Ankit



Re: Search a Part of the Sentence/Complete sentence in lucene 4.3

2013-07-24 Thread Michael McCandless
PhraseQuery?

You can skip the holes created by stopwords ... e.g. QueryParser does
this.  Ie, the PhraseQuery becomes "X David _ _ manager _ _ company"
if is/a/of/the are stop words, which isn't perfect (could return false
matches) but should work well in practice ...

Mike McCandless

http://blog.mikemccandless.com





Re: Search a Part of the Sentence/Complete sentence in lucene 4.3

2013-07-24 Thread Ankit Murarka
I tried using PhraseQuery with slops. But since I am specifying the 
slop, I also need to specify a 2nd term.


In my case there is no 2nd term. The whole string to be searched 
is still one single term.


How do I skip the holes created by stop words? I do not know beforehand 
how many stop words are skipped or what string the user is going to enter.


Is there a definite way to skip the holes created by stop words?

I was now looking at MultiPhraseQuery, splitting the user-provided 
string on spaces and providing each word as a term to the MultiPhraseQuery.


Will it help? Is there any alternative?




--
Regards

Ankit Murarka

"Peace is found not in what surrounds us, but in what we hold within."





Re: Question on Lucene hot-backup functionality.

2013-07-24 Thread Michael McCandless
This is unfortunately very trappy ... this happened with LUCENE-4876,
where we added cloning of IndexDeletionPolicy on IW construction.
It's very confusing that the IDP you set on your IWC is not in fact
the one that IW uses...

Mike McCandless

http://blog.mikemccandless.com
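A sketch of the workaround, assuming Lucene 4.3's String-keyed snapshot API and that `indexWriter` is the writer in question: ask the writer's live config, not your original IndexWriterConfig, for the deletion policy actually in use.

```java
// Sketch: IndexWriter clones the IndexWriterConfig, so fetch the
// SnapshotDeletionPolicy actually in use from the writer's live config.
SnapshotDeletionPolicy snapshotter =
    (SnapshotDeletionPolicy) indexWriter.getConfig().getIndexDeletionPolicy();

IndexCommit commit = snapshotter.snapshot("backup-1");  // id arbitrary in 4.3
try {
  for (String fileName : commit.getFileNames()) {
    // copy fileName to the backup location...
  }
} finally {
  snapshotter.release("backup-1");
  indexWriter.deleteUnusedFiles();
}
```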


On Wed, Jul 24, 2013 at 2:35 AM, Shai Erera  wrote:
> Hi
>
> In Lucene 4.4 we've improved the snapshotting process so that you don't
> need to specify an ID.
> Also, there's a new Replicator module which can be used for just that
> purpose - taking hot backups of the index.
> It pretty much hides most of the snapshotting from you. You can read about
> it here: http://shaierera.blogspot.com/2013/05/the-replicator.html
>
> As for your problem, I think it's related to the fact IndexWriter clones
> the given IndexWriterConfig, including the SnapshotDeletionPolicy.
> So you should obtain it from IW.getConfig().getIndexDeletionPolicy(),
> rather than IndexWriterConfig.getIndexDeletionPolicy(). I'm not sure what
> Indexer.getSnapshotter() does, but I'd make sure that it uses IW.
>
> Shai
>
>
> On Wed, Jul 24, 2013 at 7:34 AM, Marcos Juarez Lopez wrote:
>
>> I'm trying to get Lucene's hot backup functionality to work.  I posted the
>> question in detail over at StackOverflow, but it seems there's very little
>> Lucene knowledge over there.
>>
>> Basically, I think I have setup everything correctly, but I can't get a
>> valid snapshot when trying to do a backup.  I'm following both the Lucene
>> book's instructions, as well as the latest Lucene Javadocs, to no avail.
>>  Original question at the link, but I'll copy the relevant bits below:
>>
>> http://stackoverflow.com/questions/17753226/lucene-4-3-1-backup-process
>>
>> This is the code I have up to now:
>>
>> public Indexer(Directory indexDir, PrintStream printStream) throws
>> IOException {
>> IndexWriterConfig writerConfig = new
>> IndexWriterConfig(Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43));
>> snapshotter = new SnapshotDeletionPolicy(new
>> KeepOnlyLastCommitDeletionPolicy());
>> writerConfig.setIndexDeletionPolicy(snapshotter);
>> indexWriter = new IndexWriter(indexDir, writerConfig);
>> }
>>
>> And when starting the backup, you can't just do snapshotter.snapshot(). You
>> now have to specify an arbitrary commitIdentifier id, and use that after
>> you're done to release the snapshot.
>>
>> SnapshotDeletionPolicy snapshotter = indexer.getSnapshotter();
>> String commitIdentifier = generateCommitIdentifier();
>> try {
>> IndexCommit commit = snapshotter.snapshot(commitIdentifier);
>> for (String fileName : commit.getFileNames()) {
>> backupFile(fileName);
>> }
>> } catch (Exception e) {
>> logger.error("Exception", e);
>> } finally {
>> snapshotter.release(commitIdentifier);
>> indexer.deleteUnusedFiles();
>> }
>>
>> However, this doesn't seem to be working. Regardless of whether there have
>> been docs indexed or not, and regardless of whether I have committed or
>> not, my call to snapshotter.snapshot(commitIdentifier) always throws an
>> IllegalStateException saying "No index commit to snapshot". Looking at the
>> code, the SnapshotDeletionPolicy seems to think there have been no commits,
>> even though I'm committing to disk every 5 seconds or so. I've verified,
>> and there are docs being written and committed to indexes all the time, but
>> snapshotter always thinks there have been zero commits.
>>
>> Any idea of what I'm doing wrong?
>>
>> Thanks!
>>
>> Marcos Juarez
>>




Re: Search a Part of the Sentence/Complete sentence in lucene 4.3

2013-07-24 Thread Dawn Zoë Raison

Did you consider using shingles?
It solves the "to be or not to be" problem quite nicely.

Dawn


--

Rgds.
*Dawn Raison*
Technical Director, Digitorial Ltd.

E:d...@digitorial.co.uk W:http://www.digitorial.co.uk
M: 07956 609 618T: 01428 729 431
Reg: 04644583, England & Wales
Church Villas Ecchinswell, Newbury, RG20  4TT





Re: Search a Part of the Sentence/Complete sentence in lucene 4.3

2013-07-24 Thread Michael McCandless
With PhraseQuery you can specify where each term must occur in the phrase.

So X must occur in position 0, David in position 1, and then manager
in position 4 (skipping 2 holes).

QueryParser does this for you: when it analyzes the users phrase, if
the resulting tokens have holes, then it sets the positions
accordingly.
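In code, that looks roughly like this (a sketch assuming Lucene 4.x's PhraseQuery.add(Term, int); the field name "body" is illustrative, and positions follow the "X David _ _ manager _ _ company" layout above):

```java
// "x david is a manager of the company" with is/a/of/the removed
// leaves tokens at positions 0, 1, 4, 7.
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("body", "x"), 0);
pq.add(new Term("body", "david"), 1);
pq.add(new Term("body", "manager"), 4);  // holes at 2, 3 ("is", "a")
pq.add(new Term("body", "company"), 7);  // holes at 5, 6 ("of", "the")
```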

And I agree: shingles are a good solution here too, but they make your
index larger.  CommonGramsFilter lets you shingle only specific words,
e.g. you could pass your stop words to it.
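A sketch of such an analyzer chain, assuming Lucene 4.3's CommonGramsFilter from the analysis-common module (at query time the companion CommonGramsQueryFilter keeps only the grams):

```java
// Sketch: shingle only across stop words, so "manager of the company"
// also indexes grams like "manager_of", "of_the", "the_company".
Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(Version.LUCENE_43, reader);
    TokenStream result = new LowerCaseFilter(Version.LUCENE_43, source);
    result = new CommonGramsFilter(Version.LUCENE_43, result,
        StandardAnalyzer.STOP_WORDS_SET);
    return new TokenStreamComponents(source, result);
  }
};
```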

Mike McCandless

http://blog.mikemccandless.com





Performance measurements

2013-07-24 Thread Sriram Sankar
I did some performance tests on a real index using a query having the
following pattern:

termA AND (termB1 OR termB2 OR ... OR termBn)

The results were not good and I was wondering if I may be doing something
wrong (and what I would need to do to improve performance), or is it just
that the OR is very inefficient.

The format of the data below is illustrated by example:

5|10
time: 0.092728962; scored: 18

Here, n=5, and we measure the time to retrieve 10 results, which is
0.0927 ms. Had we not terminated early, we would have obtained 18 results.

As you will see in the data below, the performance for n=0 is very good,
but goes down drastically as n is increased.

Sriram.


0|10
time: 0.007941587; scored: 10887

0|1000
time: 0.018967384; scored: 10887

0|5000
time: 0.061943552; scored: 10887

0|1
time: 0.115327001; scored: 10887

1|10
time: 0.053950965; scored: 0

5|20
time: 0.274681853; scored: 18

10|10
time: 0.14251254; scored: 22

10|20
time: 0.282503313; scored: 22

20|10
time: 0.251964067; scored: 32

20|30
time: 0.52860957; scored: 32

50|10
time: 0.888969702; scored: 57

50|30
time: 1.078579956; scored: 57

50|50
time: 1.601169195; scored: 57

100|10
time: 1.396391061; scored: 79

100|40
time: 1.8083494; scored: 79

100|80
time: 2.921094513; scored: 79

200|10
time: 2.848105701; scored: 119

200|50
time: 3.472198462; scored: 119

200|100
time: 4.722673648; scored: 119

400|10
time: 4.463727049; scored: 235

400|100
time: 6.554119665; scored: 235

400|200
time: 9.591892527; scored: 235


Re: Performance measurements

2013-07-24 Thread Sriram Sankar
Clarification - I used an MMap'd index and warmed it up with similar
queries, as well as running the identical query many times before starting
measurements.  I had ample heap space.

Sriram.




Re: Performance measurements

2013-07-24 Thread Jack Krupansky
Thanks for the detailed numbers. Nothing seems unexpected to me. Increasing 
query complexity or term count is simply going to increase query execution 
time.


I think I'll add a new rule to my informal performance guidance - query 
complexity of no more than ten to twenty terms is a "slam dunk", but more 
than that is "uncharted territory" that risks queries taking more than half 
a second or even multiple seconds, and requires a proof-of-concept 
implementation to validate reasonable query times.


-- Jack Krupansky




Re: Performance measurements

2013-07-24 Thread Adrien Grand
Hi,

On Wed, Jul 24, 2013 at 6:11 PM, Sriram Sankar  wrote:
> termA AND (termB1 OR termB2 OR ... OR termBn)

Maybe this comment is not appropriate for your use-case, but if you
don't actually need scoring from the disjunction on the right of the
query, a TermsFilter will be faster when n gets large.
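A minimal sketch of that, assuming the TermsFilter class from the lucene-queries module (field and term names are illustrative):

```java
// Sketch: match termA as a scored query, but apply the large
// disjunction as an unscored, cacheable filter.
List<Term> orTerms = new ArrayList<>();
for (String value : termBValues) {          // termB1 ... termBn
  orTerms.add(new Term("field", value));
}
Filter orFilter = new TermsFilter(orTerms);
Query q = new FilteredQuery(new TermQuery(new Term("field", "termA")), orFilter);
TopDocs hits = searcher.search(q, 10);
```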

-- 
Adrien




Re: Performance measurements

2013-07-24 Thread Sriram Sankar
No, I do not need scoring.  This is a pure retrieval query - which matches
what we used to do with Unicorn at Facebook - something like:

(name:sriram AND (friend:1 OR friend:2 ...))

This automatically gives us second degree.

With Unicorn, we would always get sub-millisecond performance even for
n>500.

Should I assume that Lucene is that much worse - or is it that this use
case has not been optimized?

Sriram.





Re: Performance measurements

2013-07-24 Thread Jack Krupansky
Unicorn sounds like it was optimized for graph search. Specialized search 
engines can in fact beat out generalized search engines for specific use 
cases.


Scoring has been a major focus of Lucene. Non-scored filters are also 
available, but the query parsers are focused (exclusively) on scored-search.


As Adrien indicates, try using raw Lucene filters and you should get much 
better results. Whether even that will compete with a use-case-specific 
(graph) search engine remains to be seen.


-- Jack Krupansky




Re: Performance measurements

2013-07-24 Thread Sriram Sankar
On Wed, Jul 24, 2013 at 10:24 AM, Jack Krupansky wrote:

> Unicorn sounds like it was optimized for graph search. Specialized search
> engines can in fact beat out generalized search engines for specific use
> cases.
>

Yes and no (I worked on it).  Yes, there are many aspects of Unicorn that
have been optimized for graph search.  But the tests I am running have very
little to do with those optimizations.  I am still learning about Lucene
and suspect that the scoring framework (which has to be very general)
may be contributing to the performance issues.  With Unicorn, we made a
decision to do all scoring after retrieval, not during retrieval.


>
> Scoring has been a major focus of Lucene. Non-scored filters are also
> available, but the query parsers are focused (exclusively) on scored-search.
>

When you say "filter" do you mean a step performed after retrieval?  Or is
it yet another retrieval operation?


>
> As Adrien indicates, try using raw Lucene filters and you should get much
> better results. Whether even that will compete with a use-case-specific
> (graph) search engine remains to be seen.


Thanks (I will study this more).

Sriram.



>
>
> -- Jack Krupansky
>
> -Original Message- From: Sriram Sankar
> Sent: Wednesday, July 24, 2013 1:03 PM
> To: java-user@lucene.apache.org
> Subject: Re: Performance measurements
>
>
> No I do not need scoring.  This is a pure retrieval query - which matches
> what we used to do with Unicorn in Facebook - something like:
>
> (name:sriram AND (friend:1 OR friend:2 ...))
>
> This automatically gives us second degree.
>
> With Unicorn, we would always get sub-millisecond performance even for
> n>500.
>
> Should I assume that Lucene is that much worse - or is it that this use
> case has not been optimized?
>
> Sriram.
>
>
>
> On Wed, Jul 24, 2013 at 9:59 AM, Adrien Grand  wrote:
>
>  Hi,
>>
>> On Wed, Jul 24, 2013 at 6:11 PM, Sriram Sankar  wrote:
>> > termA AND (termB1 OR termB2 OR ... OR termBn)
>>
>> Maybe this comment is not appropriate for your use-case, but if you
>> don't actually need scoring from the disjunction on the right of the
>> query, a TermsFilter will be faster when n gets large.
>>
>> --
>> Adrien
>>
>> --**--**-
>> To unsubscribe, e-mail: 
>> java-user-unsubscribe@lucene.**apache.org
>> For additional commands, e-mail: 
>> java-user-help@lucene.apache.**org
>>
>>
>>
>
> --**--**-
> To unsubscribe, e-mail: 
> java-user-unsubscribe@lucene.**apache.org
> For additional commands, e-mail: 
> java-user-help@lucene.apache.**org
>
>


Re: Performance measurements

2013-07-24 Thread Jack Krupansky
I think I've exhausted my expertise in Lucene filters, but I think you can 
wrap a query with a filter and also wrap a filter with a query. So, for 
IndexSearcher.search, you could take a filter and wrap it with 
ConstantScoreQuery. So, if a BooleanQuery got wrapped as a filter, it could 
be wrapped as a CSQ for search so that no scoring would be done.
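A sketch of that wrapping, assuming Lucene 4.x's QueryWrapperFilter and the ConstantScoreQuery(Filter) constructor (the disjunction query variable is illustrative):

```java
// Sketch: wrap the disjunction as a filter, then wrap the filter in a
// ConstantScoreQuery so no per-document scoring work is done.
Filter asFilter = new QueryWrapperFilter(bigDisjunctionQuery);
Query noScoring = new ConstantScoreQuery(asFilter);
TopDocs hits = searcher.search(noScoring, 10);
```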


-- Jack Krupansky

-Original Message- 
From: Sriram Sankar

Sent: Wednesday, July 24, 2013 3:58 PM
To: java-user@lucene.apache.org
Subject: Re: Performance measurements

On Wed, Jul 24, 2013 at 10:24 AM, Jack Krupansky 
wrote:



Unicorn sounds like it was optimized for graph search. Specialized search
engines can in fact beat out generalized search engines for specific use
cases.



Yes and no (I worked on it).  Yes, there are many aspect of Unicorn that
have been optimized for graph search.  But the tests I am running have very
little to do with those optimizations.  I am still learning about Lucene
and have suspected that the scoring framework (that has to be very general)
may be contributing to the performance issues.  With Unicorn, we made a
decision to do all scoring after retrieval and not during retrieval.




Scoring has been a major focus of Lucene. Non-scored filters are also
available, but the query parsers are focused (exclusively) on 
scored-search.




When you say "filter" do you mean a step performed after retrieval?  Or is
it yet another retrieval operation?




As Adrien indicates, try using raw Lucene filters and you should get much
better results. Whether even that will compete with a use-case-specific
(graph) search engine remains to be seen.



Thanks (I will study this more).

Sriram.






-- Jack Krupansky

-Original Message- From: Sriram Sankar
Sent: Wednesday, July 24, 2013 1:03 PM
To: java-user@lucene.apache.org
Subject: Re: Performance measurements


No I do not need scoring.  This is a pure retrieval query - which matches
what we used to do with Unicorn in Facebook - something like:

(name:sriram AND (friend:1 OR friend:2 ...))

This automatically gives us second degree.

With Unicorn, we would always get sub-millisecond performance even for
n>500.

Should I assume that Lucene is that much worse - or is it that this use
case has not been optimized?

Sriram.



On Wed, Jul 24, 2013 at 9:59 AM, Adrien Grand  wrote:

 Hi,


On Wed, Jul 24, 2013 at 6:11 PM, Sriram Sankar  wrote:
> termA AND (termB1 OR termB2 OR ... OR termBn)

Maybe this comment is not appropriate for your use-case, but if you
don't actually need scoring from the disjunction on the right of the
query, a TermsFilter will be faster when n gets large.

--
Adrien













-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
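The filter-based approach Adrien and Jack describe above can be sketched against the Lucene 4.3 API. This is a minimal, hedged example (the toy documents and field names are invented for illustration, not taken from the thread): a TermsFilter carries the unscored friend-id disjunction, FilteredQuery applies it around the scored name clause, and ConstantScoreQuery shows the wrapping Jack mentions for the fully unscored case.

```java
import java.io.IOException;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.TermsFilter;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.FilteredQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public final class TermsFilterSketch {

  /** Indexes three toy documents and returns {filtered hits, constant-score hits}. */
  static int[] run() throws IOException {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(
        Version.LUCENE_43, new WhitespaceAnalyzer(Version.LUCENE_43)));
    String[][] rows = { { "sriram", "1" }, { "sriram", "7" }, { "adrien", "2" } };
    for (String[] row : rows) {
      Document doc = new Document();
      doc.add(new StringField("name", row[0], Field.Store.NO));
      doc.add(new StringField("friend", row[1], Field.Store.NO));
      w.addDocument(doc);
    }
    w.close();

    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));

    // The unscored friend-id disjunction lives in a filter rather than in a
    // scored BooleanQuery, so no score is computed for the friend clauses.
    TermsFilter friends = new TermsFilter(
        new Term("friend", "1"), new Term("friend", "2"));
    int filteredHits = searcher.search(
        new FilteredQuery(new TermQuery(new Term("name", "sriram")), friends),
        10).totalHits;

    // If nothing needs scoring at all, wrap the filter as a ConstantScoreQuery.
    int csqHits = searcher.search(new ConstantScoreQuery(friends), 10).totalHits;

    return new int[] { filteredHits, csqHits };
  }

  public static void main(String[] args) throws IOException {
    int[] hits = run();
    System.out.println(hits[0] + " " + hits[1]); // 1 2
  }
}
```

Only the first document matches both name:sriram and the friend filter, while the constant-score query alone matches two documents; whether this closes the gap to a graph-specialized engine is, as discussed above, still to be measured.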



lucene indexwriter crash

2013-07-24 Thread ash nix
Hi,

I am using Lucene 4 to index a very large data set.
The indexer crashed after three days (147 GB of index so far). I find
the stack trace weird.
Any ideas on this will be helpful.

Exception in thread "main" java.io.FileNotFoundException:
/ir/data/data/collections/KBA/2013/index/1322092800-1326499200/_9cx_Lucene40_0.tim
(Input/output error)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
at
org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:512)
at
org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:289)
at
org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:62)
at
org.apache.lucene.codecs.BlockTreeTermsWriter.<init>(BlockTreeTermsWriter.java:160)
at
org.apache.lucene.codecs.lucene40.Lucene40PostingsFormat.fieldsConsumer(Lucene40PostingsFormat.java:304)
at
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.addField(PerFieldPostingsFormat.java:130)
at
org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:335)
at
org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85)
at org.apache.lucene.index.TermsHash.flush(TermsHash.java:117)
at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
at
org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:82)
at
org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:482)
at
org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:419)
at
org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:313)
at
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:386)
at
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1445)
at
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1124)
at
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1105)
at kba.NewCorpusReader.createLuceneIndex(NewCorpusReader.java:747)
at kba.NewCorpusReader.main(NewCorpusReader.java:825)

-- 
Thanks,
A
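The "Input/output error" in that FileNotFoundException is the operating system reporting a low-level I/O failure (for example a failing or full disk) while Lucene flushes a segment, so it is not something the indexing code itself caused. One hedged mitigation sketch for long indexing runs: commit periodically so a crash loses at most one uncommitted batch, and roll back on failure. The `docs` iterable and batch size are stand-ins; the original `NewCorpusReader` code is not shown in the thread.

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public final class CheckpointedIndexing {

  /** Adds all documents, committing every batchSize additions so that an
   *  I/O failure loses at most one uncommitted batch. */
  static void indexAll(IndexWriter writer, Iterable<Document> docs, int batchSize)
      throws IOException {
    int count = 0;
    try {
      for (Document doc : docs) {
        writer.addDocument(doc);
        if (++count % batchSize == 0) {
          writer.commit(); // durable checkpoint on disk
        }
      }
      writer.commit();
      writer.close();
    } catch (IOException e) {
      // Discard whatever was buffered since the last commit; the index on
      // disk stays at the previous consistent commit point. rollback() also
      // closes the writer.
      writer.rollback();
      throw e;
    }
  }
}
```

Frequent commits cost throughput, so the batch size is a trade-off between indexing speed and how much work a crash can lose.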


Re: Tokenize String using Operators(Logical Operator, : operator etc)

2013-07-24 Thread dheerajjoshim
Greetings,

I have written a custom tokenizer class which extends the Lucene Tokenizer class.

Thanks for all replies

Regards
DJ



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Tokenize-String-using-Operators-Logical-Operator-operator-etc-tp4079673p4080225.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
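For readers looking for a concrete starting point, here is a minimal sketch of such a tokenizer against the Lucene 4.3 API, extending CharTokenizer rather than Tokenizer directly. The operator set (':', parentheses, '&', '|', '!') is an assumption for illustration; DJ's actual class is not shown in the thread.

```java
import java.io.Reader;
import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.util.Version;

/** Splits input on whitespace and on a small set of operator characters. */
public final class OperatorTokenizer extends CharTokenizer {
  // Hypothetical operator set; adjust to the grammar actually being parsed.
  private static final String OPERATORS = ":()&|!";

  public OperatorTokenizer(Reader in) {
    super(Version.LUCENE_43, in);
  }

  @Override
  protected boolean isTokenChar(int c) {
    // Any character that is neither whitespace nor an operator belongs
    // to the current token; everything else ends the token.
    return !Character.isWhitespace(c) && OPERATORS.indexOf(c) < 0;
  }
}
```

Tokenizing "name:sriram AND friend:1" with this sketch yields the tokens name, sriram, AND, friend, 1; note that CharTokenizer discards the operator characters themselves, so a grammar that needs the operators as tokens would have to extend Tokenizer directly instead.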



2 exceptions in IndexWriter

2013-07-24 Thread Yonghui Zhao
Recently I have found that my unit tests fail sometimes, but not always.  I use
Lucene 4.3.0.

After investigation, I found that it happens when I try to open an
IndexWriter on a disk directory.

Sometimes it throws this exception:

org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out:
NativeFSLock@/tmp/test-idx/write.lock

The default timeout is 1000 ms; when I set it to 3000 ms, the exception
seems to disappear.
But I think 1000 ms should be enough. How can this happen? What is the
recommended value?


Sometimes it throws another exception, which I think is more serious.

read past EOF: SimpleFSIndexInput

Each test purges the index folder; however, this exception still happens
sometimes.

I don't know the reason. How can I fix this exception?
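A common cause of both symptoms is a writer (or reader) left open by a previous test: it still holds write.lock when the next test starts, and purging the index folder underneath open index files can produce "read past EOF". This is a hedged guess, not a diagnosis of Yonghui's code, which is not shown. A sketch of defensive writer setup against the Lucene 4.3 API (the analyzer choice is a placeholder):

```java
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public final class WriterSetup {

  /** Opens a writer with a longer lock timeout, clearing a stale lock first. */
  static IndexWriter openWriter(File path) throws IOException {
    Directory dir = FSDirectory.open(path);
    IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_43,
        new WhitespaceAnalyzer(Version.LUCENE_43));
    cfg.setWriteLockTimeout(3000); // ms; the default is 1000

    if (IndexWriter.isLocked(dir)) {
      // Another writer is still open, or a crashed process left the lock
      // behind. Only unlock forcibly when certain no other writer is running.
      IndexWriter.unlock(dir);
    }
    return new IndexWriter(dir, cfg);
  }
}
```

The more robust fix is in the tests themselves: close every IndexWriter and IndexReader before deleting the index folder, so the next test never races against files the previous one still holds open.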