Re: Filtering a SpanQuery

2008-05-12 Thread Eran Sevi
Thanks Paul,

I'll give your code sample a try.
I still think that calling getSpans (the first line of the code), which
returns millions of results, is going to be much slower than calling a
getSpans that returns only a few thousand results. Since the filtering is
only performed after calling this method, it can't help in this case.

I guess your suggested solution is my best option without changing the way
getSpans works (which I'm not going to change any time soon).
Eran.
On Wed, May 7, 2008 at 7:22 PM, Paul Elschot <[EMAIL PROTECTED]> wrote:

> Op Wednesday 07 May 2008 10:18:38 schreef Eran Sevi:
> > Thanks Paul for your reply,
> >
> > Since my index contains a couple of millions documents and the filter
> > is supposed to limit the search space to a few thousands I was hoping
> > I won't have to do the filtering myself after running the query on
> > all the index.
>
> The code I gave earlier effectively does a filtered query search
> on the index. It visits the resulting Spans, and does not provide
> a score value per document as SpanScorer would do.
> Please make sure to test that code thoroughly for reliable results.
>
> >
> > Maybe this is the case anyway and behind the scenes the filter does
> > exactly what you suggested.
>
> Yes, a filtered query search would use skipTo() on the Spans via
> SpanScorer. But the difference between the normal case
> and your case is that you don't need SpanScorer.
>
> > From what I tested the number of results of the SpanQuery greatly
> > affects the running speed so if I'm going to use about 0.1% of the
> > results I'm losing a lot of time and memory for gathering and
> > storing the spans I'm not going to use.
> >
> > I don't know how SpanQuery works internally but I guess that if the
> > filter is known beforehand,
>
> A Filter needs to make a BitSet available before the query search.
>
> > it could speed things up quite a bit.
>
> I would expect a substantial speedup from using skipTo() on the
> Spans when only 0.1% of the results passes the filter.
>
> Regards,
> Paul Elschot
>
> > Eran.
> >
> >
> > On Wed, May 7, 2008 at 10:34 AM, Paul Elschot
> > <[EMAIL PROTECTED]>
> >
> > wrote:
> > > Op Tuesday 06 May 2008 17:39:38 schreef Paul Elschot:
> > > > Eran,
> > > >
> > > > Op Tuesday 06 May 2008 10:15:10 schreef Eran Sevi:
> > > > > Hi,
> > > > >
> > > > > I am looking for a way to filter a SpanQuery according to some
> > > > > other query (on another field from the one used for the
> > > > > SpanQuery). I need to get access to the spans themselves of
> > > > > course. I don't care about the scoring of the filter results
> > > > > and just need the positions of hits found in the documents that
> > > > > matches the filter.
> > > >
> > > > I think you'll have to implement the filtering on the Spans
> > > > yourself. That's not really difficult, just use Spans.skipTo().
> > > > The code to do that could look something like this (untested):
> > > >
> > > > Spans spans = yourSpanQuery.getSpans(reader);
> > > > BitSet bits = yourFilter.bits(reader);
> > > > int filterDoc = bits.nextSetBit(0);
> > > > while ((filterDoc >= 0) && spans.skipTo(filterDoc)) {
> > > >   boolean more = true;
> > > >   while (more && (spans.doc() == filterDoc)) {
> > > >  // use spans.start() and spans.end() here
> > > >  // ...
> > > >  more = spans.next();
> > > >   }
> > > >   if (! more) {
> > > > break;
> > > >   }
> > > >   filterDoc = bits.nextSetBit(spans.doc());
> > >
> > > At this point, no skipping on the spans should be done when
> > > filterDoc equals spans.doc(), so this code still needs some work.
> > > But I think you get the idea.
> > >
> > > Regards,
> > > Paul Elschot
> > >
> > > > }
> > > >
> > > > Please check the javadocs of java.util.BitSet, there may
> > > > be an off-by-one error in the arguments to nextSetBit().
> > > >
> > > > Regards,
> > > > Paul Elschot
> > > >
> > > > > I tried looking through the archives and found some reference
> > > > > to a SpanQueryFilter patch, however I don't see how it can help
> > > > > me achieve what I want to do. This class receives only one
> > > > > query parameter (which I guess is the actual query) and not a
> > > > > query and a filter for example.
> > > > >
> > > > > Any help about how I can achieve this will be appreciated.
> > > > >
> > > > > Thanks,
> > > > > Eran.
> > > >
> > > > -
> > > > To unsubscribe, e-mail:
> > > > [EMAIL PROTECTED] For additional commands,
> > > > e-mail: [EMAIL PROTECTED]
> > >
>
>
>
>
>
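Paul's skipTo() loop can be exercised outside Lucene. The sketch below simulates a Spans as a sorted array of doc ids (a doc may hold several spans) and applies a java.util.BitSet filter with the same skip-based control flow, including the case flagged above: no re-seek of the filter is attempted when the filter and span cursors already agree. All names here are illustrative, not Lucene API.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class SpanFilterSketch {
    // Collect the doc ids from a sorted "spans" doc list that pass a BitSet
    // filter, advancing with a skipTo-style cursor instead of visiting
    // every candidate doc.
    static List<Integer> filteredDocs(int[] spanDocs, BitSet filter) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int filterDoc = filter.nextSetBit(0);
        while (filterDoc >= 0 && i < spanDocs.length) {
            // skipTo: advance the span cursor to the first doc >= filterDoc
            while (i < spanDocs.length && spanDocs[i] < filterDoc) i++;
            if (i == spanDocs.length) break;
            if (spanDocs[i] == filterDoc) {
                out.add(spanDocs[i]);
                // consume all spans for this doc before moving on
                while (i < spanDocs.length && spanDocs[i] == filterDoc) i++;
            }
            if (i == spanDocs.length) break;
            // re-seek the filter from the span cursor's current doc;
            // nextSetBit(n) includes n itself, so equal docs are kept
            filterDoc = filter.nextSetBit(spanDocs[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] spanDocs = {1, 1, 3, 5, 5, 8, 12};
        BitSet filter = new BitSet();
        filter.set(1); filter.set(5); filter.set(9); filter.set(12);
        System.out.println(filteredDocs(spanDocs, filter)); // [1, 5, 12]
    }
}
```

Only the docs set in the filter are visited in the inner loop, which is the speedup Paul predicts when 0.1% of the results pass the filter.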


[ANNOUNCE] Lucene Java 2.3.2 release available

2008-05-12 Thread Michael Busch


Release 2.3.2 of Lucene Java is now available!

This release contains fixes for bugs found in 2.3.1. It does not contain
any new features, API or file format changes, which makes it fully
compatible with 2.3.0 and 2.3.1.

The detailed change log is at:
http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_3_2/CHANGES.txt

Binary and source distributions are available at
http://www.apache.org/dyn/closer.cgi/lucene/java/

Lucene artifacts are also available in the Maven2 repository at
http://repo1.maven.org/maven2/org/apache/lucene/

-Michael (on behalf of the Lucene team)





Search and retrieve the line data from the File

2008-05-12 Thread Madan Narra
Hi All,

I am very much new to Lucene and want to extend my skills with this tool.

However, I need to complete a quick assignment soon, so I haven't had much
time to read through the docs/books on the net.

So please suggest how I can achieve the task below, and I can work out the
rest.

I have an inbound file of around 10k records; each line is a new record.

I need to search for a word or phrase in the inbound file and retrieve every
line that contains the word.

Ex:

102652;ABN AMRO  Monthly Income Plan-Regular Plan-Growth Option;13.3054
102653;ABN AMRO  Monthly Income Plan-Regular Plan-Monthly Dividend
Option;10.0011
102654;ABN AMRO  Monthly Income Plan-Regular Plan-Quarterly Dividend
Option;10.0498
102645;ABN AMRO  Short Term Income Fund-Institutional Plan-Daily Dividend
Option;10.0104
106524;AIG India Treasury Plus Fund-Institutional Plan-Daily Dividend
Option;10.0109
100601;Canara Robeco Cigo-Growth Plan;21.88
101620;Franklin India International Fund;10.3648
100948;Franklin India Monthly Income Plan-Growth;23.1588

As shown above, if I search for "Franklin" I need to show the following:

101620;*Franklin* India International Fund;10.3648
100948;*Franklin* India Monthly Income Plan-Growth;23.1588

If I search for "ABN", then the result should be as below:

102652;*ABN* AMRO  Monthly Income Plan-Regular Plan-Growth Option;13.3054
102653;*ABN* AMRO  Monthly Income Plan-Regular Plan-Monthly Dividend Option;10.0011
102654;*ABN* AMRO  Monthly Income Plan-Regular Plan-Quarterly Dividend Option;10.0498
102645;*ABN* AMRO  Short Term Income Fund-Institutional Plan-Daily Dividend Option;10.0104

Hope my question is clear and understandable.

Please help me with how I can achieve the above: search for a word in the
file and display the results as discussed.

Thanks,
Madan N
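For a file of only 10k lines, a plain linear scan with no index at all may already be enough. A minimal sketch of the search-and-highlight step (the class and method names are made up for illustration; the asterisk markup matches the examples above):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineSearch {
    // Return every line containing the word (case-insensitive),
    // with each occurrence wrapped in asterisks as shown above.
    static List<String> search(List<String> lines, String word) {
        Pattern p = Pattern.compile(Pattern.quote(word), Pattern.CASE_INSENSITIVE);
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            Matcher m = p.matcher(line);
            if (m.find()) {
                hits.add(m.replaceAll("*$0*")); // $0 is the matched text
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "101620;Franklin India International Fund;10.3648",
            "100601;Canara Robeco Cigo-Growth Plan;21.88");
        System.out.println(search(lines, "Franklin"));
        // [101620;*Franklin* India International Fund;10.3648]
    }
}
```

Lucene becomes worthwhile when the file grows much larger or when you need phrase/fuzzy queries; then each line would be indexed as its own Document and the stored line returned per hit.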


Re: confused about an entry in the FAQ

2008-05-12 Thread Stephane Nicoll
I tried all this and I am confused about the result. I am trying to
implement a hybrid query handler where I fetch the IDs from a database
criteria and the IDs from a full-text Lucene query, and I intersect them
to return the result to the user. The database query and the intersection
work fine even under high load. However, the Lucene query gets much slower
as the number of concurrent users grows.

Here is what I am doing on the lucene side

final QueryParser queryParser = new QueryParser(criteria.getDefaultField(), analyzer);
final Query q = queryParser.parse(criteria.getFullTextQuery());
// The IndexSearcher is shared by all threads and is not reopened during the load test
final IndexSearcher indexSearcher = getIndexSearcher();
final Set result = new TreeSet();
indexSearcher.search(q, new HitCollector() {
    public void collect(int i, float v) {
        try {
            final Document d = indexSearcher.getIndexReader().document(i,
                    new FieldSelector() {
                public FieldSelectorResult accept(String s) {
                    if (s.equals(CatalogItem.ATTR_ID)) {
                        return FieldSelectorResult.LOAD;
                    } else {
                        return FieldSelectorResult.NO_LOAD;
                    }
                }
            });
            result.add(Long.parseLong(d.get(CatalogItem.ATTR_ID)));
        } catch (IOException e) {
            throw new RuntimeException("Could not collect lucene IDs", e);
        }
    }
});
return result;


When running with one thread, I have the following figures per test:

Database query is done in[125 msecs] (size=598]
Lucene query is done in[80 msecs (size=15204]
Intersect is done in[4 msecs] (size=103]
Hybrid query is done in[97 msecs]

-> 327 msec / user

When running with ten threads, I have the following figures per user per test:

Database query is done in[222 msecs] (size=94]
Lucene query is done in[2364 msecs (size=15367]
Intersect is done in[0 msecs] (size=12]
Hybrid query is done in[18 msecs]

-> 2.5 sec / user !!

I am just wondering how I can improve this. Clearly there is something
wrong in my code, since it's much slower with multiple threads running
concurrently on the same index. The size of the index is 5 MB; I only
store:

* an "id" field (which is the primary key of the related object in the db)
* a "class" field, which is the class name of the related object
(Hibernate Search does that for me)

The "keywords" field is indexed but not stored as it is a
representation of other data stored in the db. The searches are
performed on the keywords field only ("foo AND bar" is a typical
query)

Any help is appreciated. If you also know a Spring bean that could
take care of opening/closing the index readers properly, let me know.
Hibernate Search introduces deadlocks with multiple threads, and the
Lucene integration in Spring Modules does not seem to do what I want.

Thanks,
Stéphane


On Sat, May 10, 2008 at 8:05 PM, Patrick Turcotte <[EMAIL PROTECTED]> wrote:
> Did you try the IndexSearcher.doc(int i, FieldSelector fieldSelector)  method?
>
>  Could be faster because Lucene doesn't have to "prepare" the whole document.
>
>  Patrick
>
>
>  On Sat, May 10, 2008 at 9:35 AM, Stephane Nicoll
>  <[EMAIL PROTECTED]> wrote:
>
>
> > From the FAQ:
>  >
>  > "Don't iterate over more hits than needed.
>  > Iterating over all hits is slow for two reasons. Firstly, the search()
>  > method that returns a Hits object re-executes the search internally
>  > when you need more than 100 hits. Solution: use the search method that
>  > takes a HitCollector instead."
>  >
>  > I had a look to HitCollector but it returns the documentId and the
>  > javadoc recommends not fetching the original query there.
>  >
>  > I have to return *one* indexed field from the query result and
>  > currently I am iterating on all results and it's slow. Can you explain
>  > a bit more how I could improve this?
>  >
>  > Thanks,
>  > Stéphane
>  >
>  >
>  > --
>  > "Large Systems Suck: This rule is 100% transitive. If you build one,
>  > you suck" -- S.Yegge
>  >
>
>  >
>  >
>
>
>
>



-- 
"Large Systems Suck: This rule is 100% transitive. If you build one,
you suck" -- S.Yegge




posting lists of index are sorted?

2008-05-12 Thread Miguel Costa
Hi all,
 
I have two questions related to the Lucene ranking.
 
1) Does anyone know how the posting lists (term -> doc1 doc2 doc3) in the
index are sorted?
Are the documents (doc1 doc2 doc3) sorted by a TFxIDF value, by the boost
value, or not at all? Does Lucene compute the ranking for all the documents
in the posting lists or only for part of them?
 
2) Does anyone know how to add more ranking features to Lucene's ranking
function (e.g. PageRank, BM25)?
Extending Lucene's DefaultSimilarity class is insufficient to achieve this;
it only allows changing the TFxIDF function.
 
Thanks in advance.
 

--

Miguel Costa 

http://xldb.fc.ul.pt/~mcosta/

 

FCCN-Fundação para a Computação Científica Nacional Av. do Brasil, n.º 101

1700-066 Lisboa

Tel.: +351 21 8440190

Fax: +351 218472167

www.fccn.pt


This message is intended exclusively for its addressee. It may contain
CONFIDENTIAL information protected by law. If this message has been received
in error, please notify us via e-mail or by telephone +351 218440100 and
delete it immediately.

 

 


Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

2008-05-12 Thread Nick Burch

On Mon, 12 May 2008, Lukas Vlcek wrote:
I need to find a reliable way to extract content from Word, Excel
and PowerPoint formats prior to indexing, and I am not sure if POI is the
best way to go. Can anybody share experience with POI and/or other
[commercial] Java libraries for text extraction from MS formats?


We use POI for text extraction, and it works just fine for us. POI 3.1
should offer a few improvements in text extraction, and POI 3.5 will give
you OOXML text extraction too.


You might also like to take a look at Apache Tika. It wraps up POI (and a
few other document extractor libraries), giving you a simple, common
interface for text extraction.


Nick




Question about startOffset and endOffset

2008-05-12 Thread Brendan Grainger

Hi,

I have a TokenStream that inserts synonym tokens into the stream when a
synonym is matched. One thing I am wondering about is the effect of the
startOffset and endOffset. I have something like this:


Token synonymToken = new Token(originalToken.startOffset(),  
originalToken.endOffset(), "SYNONYM");

synonymToken.setPositionIncrement(0);

What I am wondering is: if I set the startOffset to 0 and the endOffset to
the length of the synonym string, what effect will this have?


eg
Token synonymToken = new Token(0, repTok.endOffset(), "SYNONYM");
synonymToken.setPositionIncrement(0);

Thanks



Re: Question about startOffset and endOffset

2008-05-12 Thread Erick Erickson
Is this a theoretical question or is there a use-case you're trying
to support? If the latter, a statement of the problem you're trying
to solve would be helpful.

If the former, setting all your start offsets to 0 seems wrong. You're
essentially saying that all tokens are at the beginning of the document
and each one is some ever-increasing distance from the start. Some
of your later words will look really, really long (like the length
of your document).

Offhand, I expect this will affect span queries, phrase
queries, and who knows what else? Maybe scoring?
Or maybe not, maybe some (or all) of these work off of
termpositions rather than offsets. So unless you're trying to
accomplish something specific, I sure wouldn't do this
given the problems it *could* create.

That said, I'm not much of an expert on how the offsets are
used, but I'd be really leery of changing them "just for the fun of
it" ..

Best
Erick

On Mon, May 12, 2008 at 12:06 PM, Brendan Grainger <
[EMAIL PROTECTED]> wrote:

> Hi,
>
> I have a TokenStream that inserts synonym tokens into the stream when
> matched. One thing I am wondering about is what is the effect of the
> startOffset and endOffset. I have something like this:
>
> Token synonymToken = new Token(originalToken.startOffset(),
> originalToken.endOffset(), "SYNONYM");
> synonymToken.setPositionIncrement(0);
>
> What I am wondering is: if I set the startOffset to 0 and the endOffset
> to the length of the synonym string, what effect will this have?
>
> eg
> Token synonymToken = new Token(0, repTok.endOffset(), "SYNONYM");
> synonymToken.setPositionIncrement(0);
>
> Thanks
>
>


Re: Question about startOffset and endOffset

2008-05-12 Thread Karl Wettin

Erick Erickson skrev:

Offhand, I expect this will affect span queries, phrase
queries, and who knows what else? Maybe scoring?


I believe that the offsets are just metadata stored with the term
vectors, used by the highlighter etc. Phrase and span queries use the term
position in the stream (positionIncrement) for sloppy matching and scoring.


So if you don't store the term vectors with offsets, you really don't have
to bother about them.


  karl
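Karl's distinction between positions and offsets can be illustrated with plain Java: positions are accumulated from positionIncrement, so a zero-increment synonym token stacks on the same position as the token it expands, and offsets never enter into phrase or span matching. This is a toy model, not Lucene code.

```java
import java.util.Arrays;

public class PositionDemo {
    // Accumulate token positions from position increments, the way an
    // indexer does: position += increment for each token in the stream.
    static int[] positions(int[] increments) {
        int[] pos = new int[increments.length];
        int p = -1; // before the first token
        for (int i = 0; i < increments.length; i++) {
            p += increments[i];
            pos[i] = p;
        }
        return pos;
    }

    public static void main(String[] args) {
        // Stream: "ac" (inc 1), its synonym "air" (inc 0), then a
        // following word (inc 1). The synonym shares position 0 with
        // "ac", so a query on either one matches at the same place.
        System.out.println(Arrays.toString(positions(new int[]{1, 0, 1})));
        // [0, 0, 1]
    }
}
```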




Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

2008-05-12 Thread Karl Wettin

Lukas Vlcek skrev:

Hi,

I need to find a reliable way to extract content from Word, Excel and
PowerPoint formats prior to indexing, and I am not sure if POI is the best
way to go. Can anybody share experience with POI and/or other [commercial]
Java libraries for text extraction from MS formats?


I like Antiword for .doc files.

http://www.winfield.demon.nl/


   karl




Numerical Range Query

2008-05-12 Thread Dan Hardiker

Hi,

I've got an application which stores ratings for content in a Lucene
index. It works a treat for the most part, apart from the use case I
have of filtering out ratings that have fewer than a given number of
rates. It kind of works, but it seems to use alphabetical ranges rather
than numeric ranges.


Here is the Java code I am using:

luceneQuery.add( new RangeQuery( new Term(RateUtils.SF_FILTERED_CNT, 
minRatesString), null, true), BooleanClause.Occur.MUST );


For context:

* luceneQuery is a org.apache.lucene.search.BooleanQuery
* RateUtils.SF_FILTERED_CNT is the String containing the appropriate 
field name "rating-filtered-count"

* minRatesString is an integer as a String

Here is where the field is added into the index:

document.add( new Field(RateUtils.SF_FILTERED_CNT, String.valueOf( 
filteredCount ), Field.Store.YES, Field.Index.UN_TOKENIZED) );


For context:

* document is a org.apache.lucene.document.Document
* filteredCount is an int (counting the number of rates that have occurred)

Unfortunately it doesn't work quite as I expected. If I have 5
documents in the index:


# 5 ratings
# 9 ratings
# 1 rating
# 0 ratings
# 11 ratings

If minRatesString is "5" then only the first document is returned; if
it's "1" then the 3rd and 5th are returned; if it's "6" then none are
returned. It appears to be filtering alphabetically (starting with the
first digit/character and matching on that) rather than numerically.
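This is plain string ordering, and it is reproducible without Lucene. Fixed-width zero-padding (the idea behind NumberTools, shown here with an arbitrary illustrative width rather than the padding NumberTools actually uses) makes lexicographic order agree with numeric order:

```java
public class PadDemo {
    // Left-pad a non-negative number to a fixed width so that
    // string (lexicographic) order matches numeric order.
    static String pad(long n, int width) {
        return String.format("%0" + width + "d", n);
    }

    public static void main(String[] args) {
        // Raw digit strings sort alphabetically: "11" comes before "5",
        // which is why a string range query starting at "5" misses 11.
        System.out.println("11".compareTo("5") < 0);              // true
        // Padded values restore numeric order: "0011" > "0005".
        System.out.println(pad(11, 4).compareTo(pad(5, 4)) > 0);  // true
    }
}
```

The same padded form must be used at both index and query time, otherwise the comparisons mix padded and unpadded terms and nothing matches.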


Oddly enough, if I sort on that field ... it works as I expect.

Am I missing something?


--
Dan Hardiker

PS: I've been googling for well over an hour, if I'm not searching with 
the right terms - please advise me! I tried to find a way to search the 
archives specifically, but I could only browse them month by month.





Re: Numerical Range Query

2008-05-12 Thread Erick Erickson
Yep, Lucene works with strings, not numbers, so the fact that you're
not getting what you expect is expected.

Although I'm a bit puzzled by what you're actually getting back.
You might try using Luke to look at your index to see what's
there.

See the NumberTools class for some help here...

BTW, at least in Lucene 2.1, the preferred way to go about this
would be ConstantScoreRangeQuery...

Best
Erick

On Mon, May 12, 2008 at 1:39 PM, Dan Hardiker <[EMAIL PROTECTED]>
wrote:

> Hi,
>
> I've got an application which stores ratings for content in a Lucene
> index. It works a treat for the most part, apart from the use-case I have
> for being able to filter out ratings that have less than a given number of
> rates. It kinda works, but seems to use Alpha ranging rather than Numeric
> ranging.
>
> Here is the Java code I am using:
>
> luceneQuery.add( new RangeQuery( new Term(RateUtils.SF_FILTERED_CNT,
> minRatesString), null, true), BooleanClause.Occur.MUST );
>
> For context:
>
> * luceneQuery is a org.apache.lucene.search.BooleanQuery
> * RateUtils.SF_FILTERED_CNT is the String containing the appropriate field
> name "rating-filtered-count"
> * minRatesString is an integer as a String
>
> Here is where the field is added into the index:
>
> document.add( new Field(RateUtils.SF_FILTERED_CNT, String.valueOf(
> filteredCount ), Field.Store.YES, Field.Index.UN_TOKENIZED) );
>
> For context:
>
> * document is a org.apache.lucene.document.Document
> * filteredCount is an int (counting the number of rates that have
> occurred)
>
> Unfortunately it doesn't work quite as I expected as if I have 5 documents
> in the index:
>
> # 5 ratings
> # 9 ratings
> # 1 rating
> # 0 ratings
> # 11 ratings
>
> If minRatesString is "5" then only the first document is returned, if it's
> "1" then the 3rd and 5th are returned, if its "6" then none are returned. It
> appears to be filtering alphabetically (starting with the first
> digit/character and matching on that) rather than numerically.
>
> Oddly enough, if I sort on that field ... it works as I expect.
>
> Am I missing something?
>
>
> --
> Dan Hardiker
>
> PS: I've been googling for well over an hour, if I'm not searching with
> the right terms - please advise me! I tried to find a way to search the
> archives specifically, but I could only browse them month by month.
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: Numerical Range Query

2008-05-12 Thread Dan Hardiker

Erick Erickson wrote:

Although I'm a bit puzzled by what you're actually getting back.
You might try using Luke to look at your index to see what's
there.


I've looked through with Luke and it doesn't look like much has changed 
between using NumberTools and not. NumberTools definitely does some 
padding which makes sense, however even though I'm using that, Lucene or 
Luke seems to be boiling it down to just the number. I'm not sure which.



See the NumberTools class for some help here...

BTW, at least in Lucene 2.1, the preferred way to go about this
would be ConstantScoreRangeQuery...


Taking your advice I'm now indexing using:

document.add( new Field(RateUtils.SF_FILTERED_CNT, 
NumberTools.longToString( filteredCount ), Field.Store.YES, 
Field.Index.UN_TOKENIZED) );


and searching using:

int minRates = Long.valueOf( minRatesString ).intValue();
luceneQuery.add( new ConstantScoreRangeQuery( RateUtils.SF_FILTERED_CNT,
    NumberTools.longToString(minRates), "", true, false ),
    BooleanClause.Occur.MUST );


I get very odd results back now, but they seem to work similarly. The
documentation for ConstantScoreRangeQuery is rather thin; however, I did
find this example, which suggests I'm doing the right thing:


http://github.com/we4tech/semantic-repository/tree/master/development/idea-repository-core/src/main/java/com/ideabase/repository/core/index/ExtendedQueryParser.java

The code _looks_ like it should work, and it makes sense logically, but it
still doesn't do what I'm expecting.


I've tried changing the indexing over to Field.Index.NO_NORMS and it 
makes the field value "0b" instead of "11", and 
"02" instead of "2" ... but that meant that the searching 
didn't pick up on that field _at all_.


Surely "find me results where numeric field x is higher than y" can't be
an uncommon request? I can think of many areas where you want to do that
(age filtering, for example).


Any other suggestions of what I should be looking for, or where I can 
look to find out the next step to take?



--
Dan Hardiker




Re: Numerical Range Query

2008-05-12 Thread Erick Erickson
Are you using NumberTools at both index and query time? Because
this works exactly as I expect:

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumberTools;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.ConstantScoreRangeQuery;

import java.io.IOException;

/**
 * Created by: eoericks
 * Date: May 12, 2008
 * History: $Log$
 */
public class Test {
    public static void main(String args[]) {
        try {
            Test test = new Test();
            test.doIndex();
            test.doSearch();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private void doIndex() throws IOException {
        IndexWriter w = new IndexWriter(FSDirectory.getDirectory("C:/lucidx"),
                new StandardAnalyzer(), true);

        Document doc = new Document();
        doc.add(new Field("num", NumberTools.longToString(1), Field.Store.NO,
                Field.Index.UN_TOKENIZED));
        doc.add(new Field("name", "doc 1", Field.Store.YES,
                Field.Index.UN_TOKENIZED));
        w.addDocument(doc);

        doc = new Document();
        doc.add(new Field("num", NumberTools.longToString(11), Field.Store.NO,
                Field.Index.UN_TOKENIZED));
        doc.add(new Field("name", "doc 11", Field.Store.YES,
                Field.Index.UN_TOKENIZED));
        w.addDocument(doc);

        doc = new Document();
        doc.add(new Field("num", NumberTools.longToString(5), Field.Store.NO,
                Field.Index.UN_TOKENIZED));
        doc.add(new Field("name", "doc 5", Field.Store.YES,
                Field.Index.UN_TOKENIZED));
        w.addDocument(doc);

        doc = new Document();
        doc.add(new Field("num", NumberTools.longToString(9), Field.Store.NO,
                Field.Index.UN_TOKENIZED));
        doc.add(new Field("name", "doc 9", Field.Store.YES,
                Field.Index.UN_TOKENIZED));
        w.addDocument(doc);

        w.close();
    }

    private void doSearch() throws IOException {
        IndexSearcher r = new IndexSearcher(FSDirectory.getDirectory("c:/lucidx"));
        oneSearch(r, 1L);
        oneSearch(r, 2L);
        oneSearch(r, 5L);
        oneSearch(r, 9L);
        oneSearch(r, 0L);
    }

    private void oneSearch(IndexSearcher r, Long lower) throws IOException {
        System.out.println("\n\nSearching for greater than " + Long.toString(lower));
        Hits hits = r.search(new ConstantScoreRangeQuery("num",
                NumberTools.longToString(lower), null, false, true));
        for (int idx = 0; idx < hits.length(); ++idx) {
            System.out.println(hits.doc(idx).get("name"));
        }
    }
}


***output***

Searching for greater than 1
doc 11
doc 5
doc 9


Searching for greater than 2
doc 11
doc 5
doc 9


Searching for greater than 5
doc 11
doc 9


Searching for greater than 9
doc 11


Searching for greater than 0
doc 1
doc 11
doc 5
doc 9


On Mon, May 12, 2008 at 3:21 PM, Dan Hardiker <[EMAIL PROTECTED]>
wrote:

> Erick Erickson wrote:
>
> > Although I'm a bit puzzled by what you're actually getting back.
> > You might try using Luke to look at your index to see what's
> > there.
> >
>
> I've looked through with Luke and it doesn't look like much has changed
> between using NumberTools and not. NumberTools definitely does some padding
> which makes sense, however even though I'm using that, Lucene or Luke seems
> to be boiling it down to just the number. I'm not sure which.
>
>  See the NumberTools class for some help here...
> >
> > BTW, at least in Lucene 2.1, the preferred way to go about this
> > would be ConstantScoreRangeQuery...
> >
>
> Taking your advice I'm now indexing using:
>
> document.add( new Field(RateUtils.SF_FILTERED_CNT,
> NumberTools.longToString( filteredCount ), Field.Store.YES,
> Field.Index.UN_TOKENIZED) );
>
> and searching using:
>
> I'm now
> int minRates = Long.valueOf( minRatesString ).intValue();
> luceneQuery.add( new ConstantScoreRangeQuery( RateUtils.SF_FILTERED_CNT,
> NumberTools.longToString(minRates), "", true, false ),
> BooleanClause.Occur.MUST );
>
> I get very odd results back now, but they seem to work similarly. The
> documentation for ConstantScoreRangeQuery is rather thin however I did find
> this example which suggests I'm doing the right thing:
>
>
> http://github.com/we4tech/semantic-repository/tree/master/development/idea-repository-core/src/main/java/com/ideabase/repository/core/index/ExtendedQueryParser.java
>
> The code _looks_ like it should work, it makes sense logically but it
> still doesn't do what I'm expecting.
>
> I've tried changing the indexing over to Field.Index.NO_NORMS and it makes
> the field value "0b" instead of "11", and "02"
> instead of "2" ... but that meant that the searching didn't pick up on that
> field _at all_.
>
> Surely "find me results where numeric field x is higher

Re: Question about startOffset and endOffset

2008-05-12 Thread Brendan Grainger

Hi Erick,

Thanks for the reply. The use case I have is this:

Say you have a synonym expansion like this:

ac -> air conditioning

And, to keep it simple, a document whose first term is ac. When
analyzing the document, I currently create a token stream that looks
something like this for the 'ac' term:


ac -> Token(positionIncrement = 1, startOffset = 0, endOffset = 2,  
type = 'word')
air -> Token(positionIncrement = 0, startOffset = 0, endOffset = 2,  
type = 'synonym')
conditioning -> Token(positionIncrement = 0, startOffset = 0,  
endOffset = 2, type = 'synonym')


Now what I'd like to be able to do is display to the user the fact that
this expansion took place when they query. However, now I don't know how
to reconstruct the phrase 'air conditioning'. If, however, I do this:


ac -> Token(positionIncrement = 1, startOffset = 0, endOffset = 2,  
type = 'word')
air -> Token(positionIncrement = 0, startOffset = 0, endOffset = 3,  
type = 'synonym')
conditioning -> Token(positionIncrement = 0, startOffset = 3,  
endOffset = 15, type = 'synonym')


I can reconstruct the fact that ac maps to air conditioning.

Thanks
Brendan

On May 12, 2008, at 12:23 PM, Erick Erickson wrote:


Is this a theoretical question or is there a use-case you're trying
to support? If the latter, a statement of the problem you're trying
to solve would be helpful.

If the former, setting all your start offsets to 0 seems wrong. You're
essentially saying that all tokens are at the beginning of the  
document

and each one is some ever-increasing distance from the start. Some
of your later words will look really, really long (like the length
of your document).

Offhand, I expect this will affect span queries, phrase
queries, and who knows what else? Maybe scoring?
Or maybe not, maybe some (or all) of these work off of
termpositions rather than offsets. So unless you're trying to
accomplish something specific, I sure wouldn't do this
given the problems it *could* create.

That said, I'm not much of an expert on how the offsets are
used, but I'd be really leery of changing them "just for the fun of
it" ..

Best
Erick

On Mon, May 12, 2008 at 12:06 PM, Brendan Grainger <
[EMAIL PROTECTED]> wrote:


Hi,

I have a TokenStream that inserts synonym tokens into the stream when
matched. One thing I am wondering about is what is the effect of the
startOffset and endOffset. I have something like this:

Token synonymToken = new Token(originalToken.startOffset(),
originalToken.endOffset(), "SYNONYM");
synonymToken.setPositionIncrement(0);

What I am wondering is: if I set the startOffset to 0 and the
endOffset to the length of the synonym string, what effect will this
have?


eg
Token synonymToken = new Token(0, repTok.endOffset(), "SYNONYM");
synonymToken.setPositionIncrement(0);

Thanks





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Filtering a SpanQuery

2008-05-12 Thread Paul Elschot
Op Monday 12 May 2008 09:06:36 schreef Eran Sevi:
> Thanks Paul,
>
> I'll give your code sample a try.
> I still think that calling getSpans (the first line of code) that
> returns millions of results is going to be much slower than calling
> getSpans that's going to return only a few thousands of results.
> Since the filtering is only performed after calling this method it
> can't help in this case.

The given untested code (apart from possible bugs) would process
exactly as many documents and their Spans as SpanScorer would
in the explicitly filtered case.

> I guess your suggested solution is my best option without changing
> the way getSpans works (which I'm not going to change any time soon )

Before doing that, have a look at the code of SpanWeight/SpanScorer,
ConjunctionScorer, and the filtering code in IndexSearcher.

Regards,
Paul Elschot.



> Eran.
>
> On Wed, May 7, 2008 at 7:22 PM, Paul Elschot <[EMAIL PROTECTED]> 
wrote:
> > Op Wednesday 07 May 2008 10:18:38 schreef Eran Sevi:
> > > Thanks Paul for your reply,
> > >
> > > Since my index contains a couple of millions documents and the
> > > filter is supposed to limit the search space to a few thousands I
> > > was hoping I won't have to do the filtering myself after running
> > > the query on all the index.
> >
> > The code I gave earlier effectively does a filtered query search
> > on the index. It visits the resulting Spans, and does not provide
> > a score value per document as SpanScorer would do.
> > Please make sure to test that code thoroughly for reliable results.
> >
> > > Maybe this is the case anyway and behind the scenes the filter
> > > does exactly what you suggested.
> >
> > Yes, a filtered query search would use skipTo() on the Spans via
> > SpanScorer. But the difference between the normal case
> > and your case is that you don't need SpanScorer.
> >
> > > From what I tested the number of results of the SpanQuery greatly
> > > affects the running speed so if I'm going to use about 0.1% of
> > > the results I'm losing a lot of time and memory for gathering
> > > and storing the spans I'm not going to use.
> > >
> > > I don't know how SpanQuery works internally but I guess that if
> > > the filter is known beforehand,
> >
> > A Filter needs to make a BitSet available before the query search.
> >
> > > it could speed things up quite a bit.
> >
> > I would expect a substantial speedup from using skipTo() on the
> > Spans when only 0.1% of the results passes the filter.
> >
> > Regards,
> > Paul Elschot
> >
> > > Eran.
> > >
> > >
> > > On Wed, May 7, 2008 at 10:34 AM, Paul Elschot
> > > <[EMAIL PROTECTED]>
> > >
> > > wrote:
> > > > Op Tuesday 06 May 2008 17:39:38 schreef Paul Elschot:
> > > > > Eran,
> > > > >
> > > > > Op Tuesday 06 May 2008 10:15:10 schreef Eran Sevi:
> > > > > > Hi,
> > > > > >
> > > > > > I am looking for a way to filter a SpanQuery according to
> > > > > > some other query (on another field from the one used for
> > > > > > the SpanQuery). I need to get access to the spans
> > > > > > themselves of course. I don't care about the scoring of the
> > > > > > filter results and just need the positions of hits found in
> > > > > > the documents that matches the filter.
> > > > >
> > > > > I think you'll have to implement the filtering on the Spans
> > > > > yourself. That's not really difficult, just use
> > > > > Spans.skipTo(). The code to do that could look something like this
> > > > > (untested):
> > > > >
> > > > > Spans spans = yourSpanQuery.getSpans(reader);
> > > > > BitSet bits = yourFilter.bits(reader);
> > > > > int filterDoc = bits.nextSetBit(0);
> > > > > while ((filterDoc >= 0) && spans.skipTo(filterDoc)) {
> > > > >   boolean more = true;
> > > > >   while (more && (spans.doc() == filterDoc)) {
> > > > >  // use spans.start() and spans.end() here
> > > > >  // ...
> > > > >  more = spans.next();
> > > > >   }
> > > > >   if (! more) {
> > > > > break;
> > > > >   }
> > > > >   filterDoc = bits.nextSetBit(spans.doc());
> > > >
> > > > At this point, no skipping on the spans should be done when
> > > > filterDoc equals spans.doc(), so this code still needs some
> > > > work. But I think you get the idea.
> > > >
> > > > Regards,
> > > > Paul Elschot
> > > >
> > > > > }
> > > > >
> > > > > Please check the javadocs of java.util.BitSet: there may
> > > > > be an off-by-one error in the arguments to nextSetBit().
> > > > >
> > > > > Regards,
> > > > > Paul Elschot
> > > > >
> > > > > > I tried looking through the archives and found some
> > > > > > reference to a SpanQueryFilter patch, however I don't see
> > > > > > how it can help me achieve what I want to do. This class
> > > > > > receives only one query parameter (which I guess is the
> > > > > > actual query) and not a query and a filter for example.
> > > > > >
> > > > > > Any help about how I can achieve this will be appreciated.
> > > > > >
> > > > > > Thanks,
> > > > > > Eran.
> > > > >
> > > > > --
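
Paul's caveat above (no skipping when filterDoc already equals spans.doc()) can be addressed by restructuring the loop. The following is an untested sketch only: SimpleSpans is a toy stand-in backed by arrays so the loop can run without Lucene, not Lucene's real Spans class, and the method names mirror the discussion above:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class FilteredSpansDemo {
    // Toy stand-in for Lucene's Spans, backed by parallel arrays of
    // (doc, start, end) triples in non-decreasing doc order.
    static final class SimpleSpans {
        final int[] docs, starts, ends;
        int i = -1;
        SimpleSpans(int[] docs, int[] starts, int[] ends) {
            this.docs = docs; this.starts = starts; this.ends = ends;
        }
        boolean next() { return ++i < docs.length; }
        boolean skipTo(int target) { // advance to the first doc >= target
            do { if (!next()) return false; } while (docs[i] < target);
            return true;
        }
        int doc() { return docs[i]; }
        int start() { return starts[i]; }
        int end() { return ends[i]; }
    }

    // Paul's loop, restructured so skipTo() is only called when the
    // spans are strictly behind the next filtered document.
    static List<String> filteredSpans(SimpleSpans spans, BitSet bits) {
        List<String> hits = new ArrayList<>();
        boolean more = spans.next();
        while (more) {
            int filterDoc = bits.nextSetBit(spans.doc());
            if (filterDoc < 0) break;           // no filtered docs remain
            if (spans.doc() < filterDoc) {      // behind: safe to skip
                more = spans.skipTo(filterDoc);
                continue;
            }
            // here spans.doc() == filterDoc: this span passes the filter
            hits.add(spans.doc() + ":" + spans.start() + "-" + spans.end());
            more = spans.next();
        }
        return hits;
    }

    public static void main(String[] args) {
        SimpleSpans spans = new SimpleSpans(
            new int[]{1, 1, 3, 5, 8},   // doc ids
            new int[]{0, 4, 2, 1, 0},   // span starts
            new int[]{2, 6, 3, 4, 1});  // span ends
        BitSet bits = new BitSet();
        bits.set(1); bits.set(8);       // filter accepts docs 1 and 8
        System.out.println(filteredSpans(spans, bits)); // prints [1:0-2, 1:4-6, 8:0-1]
    }
}
```

Note that because nextSetBit(spans.doc()) never returns a doc smaller than spans.doc(), the only two cases to handle are "behind the filter" (skip) and "on a filtered doc" (collect and advance), which is the distinction the original loop blurred.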