Re: Index and Field.Text

2003-12-05 Thread Tatu Saloranta
On Friday 05 December 2003 10:45, Doug Cutting wrote:
> Tatu Saloranta wrote:
> > Also, shouldn't there be at least 3 methods that take Readers; one for
> > Text-like handling, another for UnStored, and last for UnIndexed.
>
> How do you store the contents of a Reader?  You'd have to double-buffer
> it, first reading it into a String to store, and then tokenizing the
> StringReader.  A key feature of Reader values is that they're streamed:

Not really; you can pass a Reader to the tokenizer, which then reads and tokenizes 
directly (I think that's how the code works, too). This is because internally a 
String is read using a StringReader, so passing a String looks more like a 
convenience feature?

> the entire value is never in RAM.  Storing a Reader value would remove
> that advantage.  The current API makes this explicit: when you want
> something streamed, you pass in a Reader, when you're willing to have
> the entire value in memory, pass in a String.

I guess for things that are both tokenized and stored, passing a Reader can't 
really help a lot; if one wants to reduce memory usage, the text needs to be read 
twice, or the analyzer needs to help in writing the output; or the text needs to be 
read into memory, much like what happens now. It'd simplify application code a bit, 
but wouldn't do much more.

So I guess I need to downgrade my suggestion to require just 2 
Reader-taking factory methods? :-)
I still think that index-only and store-only versions would both make sense. In the 
latter case, storing could be done in a fully streaming fashion; in the former, 
tokenization can be done in the same fashion?
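
To make the trade-off discussed above concrete, here is a minimal sketch in plain Java (no Lucene classes). The class and method names are hypothetical and only illustrate the two cases: an index-only value can be tokenized straight off the Reader, while a value that is both stored and tokenized has to be buffered first.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ReaderFieldSketch {

    /**
     * Splits on whitespace while reading one character at a time, so the
     * input is never held as one String. (A real indexer would hand each
     * token on instead of collecting them in a list.)
     */
    static List<String> tokenizeStreaming(Reader in) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isWhitespace((char) c)) {
                    if (current.length() > 0) {
                        tokens.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    current.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Reads the whole value into memory, as storing it would require. */
    static String readFully(Reader in) {
        StringBuilder all = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                all.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return all.toString();
    }

    public static void main(String[] args) {
        // Index-only: the Reader is consumed once, in streaming fashion.
        System.out.println(tokenizeStreaming(new StringReader("streamed field value")));

        // Stored and tokenized: buffer the value, then tokenize the copy.
        String stored = readFully(new StringReader("stored field value"));
        System.out.println(stored + " -> " + tokenizeStreaming(new StringReader(stored)));
    }
}
```

The second case is the double-buffering Doug describes: once the value must be stored, the streaming advantage is gone regardless of which argument type the factory method takes.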

> Yes, it is a bit confusing that Text(String, String) stores its value,
> while Text(String, Reader) does not, but it is at least well documented.
>   And we cannot change it: that would break too many applications.  But
> we can put this on the list for Lucene 2.0 cleanups.

Yes, I understand that. It would not be reasonable to make such a change. But how 
about adding a more intuitive factory method (UnStored(String, Reader))?

> When I first wrote these static methods I meant for them to be
> constructor-like.  I wanted to have multiple Field(String, String)
> constructors, but that's not possible, so I used capitalized static
> methods instead.  I've never seen anyone else do this (capitalize any
> method but a real constructor) so I guess I didn't start a fad!  This

:-)

> should someday too be cleaned up.  Lucene was the first Java program
> that I ever wrote, and thus its style is in places non-standard.  Sorry.

Best standards are created by people doing things others use, follow or 
imitate... so it was worth a try! :-)

-+ Tatu +-


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



write.lock

2003-12-05 Thread Aaron Galea
Hi

I have started getting an error about a write.lock in Lucene when creating an index in 
an empty directory. It used to work fine, but now the error occurs, and as far 
as I know I didn't change anything. Printing the stack trace from the exception 
thrown, I get the following:

java.io.IOException: couldn't delete write.lock
    at org.apache.lucene.store.FSDirectory.create(Unknown Source)
    at org.apache.lucene.store.FSDirectory.getDirectory(Unknown Source)
    at org.apache.lucene.store.FSDirectory.getDirectory(Unknown Source)
    at org.apache.lucene.index.IndexWriter.<init>(Unknown Source)
    at qa.answerextraction.AnswerExtractionImpl.processDocument(Unknown Source)
    at qa.answerextraction.AnswerExtractionServerPOA._invoke(Unknown Source)
    at org.jacorb.poa.RequestProcessor.invokeOperation(Unknown Source)
    at org.jacorb.poa.RequestProcessor.process(Unknown Source)
    at org.jacorb.poa.RequestProcessor.run(Unknown Source)

The code creating this problem is:

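IndexWriter writer;

try {
    // First try to open an existing index (create == false).
    writer = new IndexWriter(indexLocation, sa, false);
} catch (java.io.IOException e) {
    // Opening failed: fall back to creating a new index (create == true).
    writer = new IndexWriter(indexLocation, sa, true);
}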

This problem only happens when indexing the very first file. After that it works fine. 
All it seems to need in the directory is a "segments" file. 

Could anyone explain the problem, or what I am doing wrong?
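
One possible way to avoid using the exception for control flow is to decide the create flag up front. The sketch below is plain Java and rests on the observation above that an existing index directory contains a "segments" file; the class and method names are hypothetical, and this is not claimed to be the cause of, or the fix for, the write.lock error.

```java
import java.io.File;

public class IndexCreateCheck {

    /**
     * Returns true when indexDir holds no "segments" file, i.e. when a new
     * index would have to be created rather than an existing one opened.
     * Assumption: the presence of "segments" marks an existing index.
     */
    static boolean needsCreate(File indexDir) {
        return !new File(indexDir, "segments").exists();
    }

    public static void main(String[] args) {
        File indexDir = new File(args.length > 0 ? args[0] : "index");
        boolean create = needsCreate(indexDir);
        System.out.println("create = " + create);
        // Hypothetical use with the code from the message:
        // IndexWriter writer = new IndexWriter(indexLocation, sa, create);
    }
}
```

With this check, the first IndexWriter attempt with create == false (and whatever lock it may leave behind when it fails) is skipped for an empty directory.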

regards 
Aaron 








Re: Returning one result

2003-12-05 Thread Dror Matalon
On Fri, Dec 05, 2003 at 05:28:54PM -0500, Erik Hatcher wrote:
> On Friday, December 5, 2003, at 04:28  PM, Dror Matalon wrote:
> >Then I'm out of ideas.  The next thing is for you to post your search
> >code so we can see why it's not searching the field.
> 
> Giving up so easily, Dror?!  :))

You're right :-). What I should have said instead is:
Please show the output of query.toString()
which will tell us what the query really is.

But in the end, your approach is much better. Get the education from
these excellent articles, check out the javadocs, and things should fall
into place.

Dror

> 
> The problem is, when using any type of QueryParser with a Keyword 
> field, you have to then be careful about analysis.  My guess is that at 
> query parsing time, the analyzer is stripping numbers or in some 
> way mangling the "id".
> 
> Look back in the e-mail archives for my AnalyzerUtils, run a string 
> containing just a sample id through it using the analyzer you are using 
> in your real code and see what comes out.
> 
> Again, Tracy, please read the articles at java.net on Lucene - and 
> there is one on QueryParser too.  You are definitely having a learning 
> curve situation here and aren't quite in the zone of Lucene 
> understanding yet, that is why folks here are getting frustrated with 
> your questions.  We are hanging in there with you though and will get 
> you through this.  I'll give you some pointers here - in the latest 
> Lucene 1.3 versions, there is a PerFieldAnalyzerWrapper that might come 
> in handy here - otherwise you might consider using a different analyzer.
> 
> A good first pass is to experiment with the WhitespaceAnalyzer and be 
> sure to phrase your test queries with the same case you indexed with.  
> I believe you'll find that it will work.  If it works then, you will 
> have a very good clue that the analyzer is the problem.  At that point, 
> go and read those java.net articles I wrote, especially the first one 
> having to do with analyzers.
> 
>   Erik
> 
> 
> 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com




Re: Returning one result

2003-12-05 Thread Erik Hatcher
On Friday, December 5, 2003, at 04:28  PM, Dror Matalon wrote:
> Then I'm out of ideas.  The next thing is for you to post your search
> code so we can see why it's not searching the field.

Giving up so easily, Dror?!  :))

The problem is, when using any type of QueryParser with a Keyword 
field, you have to then be careful about analysis.  My guess is that at 
query parsing time, the analyzer is stripping numbers or in some 
way mangling the "id".

Look back in the e-mail archives for my AnalyzerUtils, run a string 
containing just a sample id through it using the analyzer you are using 
in your real code and see what comes out.

Again, Tracy, please read the articles at java.net on Lucene - and 
there is one on QueryParser too.  You are definitely having a learning 
curve situation here and aren't quite in the zone of Lucene 
understanding yet, that is why folks here are getting frustrated with 
your questions.  We are hanging in there with you though and will get 
you through this.  I'll give you some pointers here - in the latest 
Lucene 1.3 versions, there is a PerFieldAnalyzerWrapper that might come 
in handy here - otherwise you might consider using a different analyzer.

A good first pass is to experiment with the WhitespaceAnalyzer and be 
sure to phrase your test queries with the same case you indexed with.  
I believe you'll find that it will work.  If it works then, you will 
have a very good clue that the analyzer is the problem.  At that point, 
go and read those java.net articles I wrote, especially the first one 
having to do with analyzers.
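
To illustrate the kind of mismatch guessed at above, here is a small sketch in plain Java. The two methods are toy stand-ins written for this example, not Lucene's analyzers: one splits on whitespace only, the other keeps runs of letters and lowercases them. Which analyzer Tracy's code actually uses is not known from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AnalyzerEffectSketch {

    /** Toy analyzer: splits on whitespace only, keeping case and digits. */
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Toy analyzer: keeps runs of letters only and lowercases them. */
    static List<String> lettersOnlyLowercaseTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String id = "AR345";
        // The keyword was indexed untokenized, i.e. as the exact term "AR345".
        System.out.println("whitespace:   " + whitespaceTokens(id));
        System.out.println("letters-only: " + lettersOnlyLowercaseTokens(id));
    }
}
```

If the query side produces a term that differs from the untokenized value that was indexed, a search on the keyword field cannot match, which is why running a sample id through the real analyzer is a useful first check.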

	Erik



Re: Returning one result

2003-12-05 Thread Dror Matalon
Then I'm out of ideas.  The next thing is for you to post your search
code so we can see why it's not searching the field.

On Fri, Dec 05, 2003 at 03:34:38PM -0500, Pleasant, Tracy wrote:
> Yes it is in the list of arrays that I want searched.
> 
> -Original Message-
> From: Dror Matalon [mailto:[EMAIL PROTECTED]
> Sent: Friday, December 05, 2003 3:32 PM
> To: Lucene Users List
> Subject: Re: Returning one result
> 
> 
> 
> On Fri, Dec 05, 2003 at 03:14:08PM -0500, Pleasant, Tracy wrote:
> > What do you mean 'add' in MultiFieldQueryParser?  I am using all the
> > fields 
> 
> Sorry, that was wrong. What I meant to say is are you adding the field
> to the array of fields that need to be searched? 
> 
> You need to use a MultiFieldQueryParser and pass it the array of fields
> that you want searched.
> 
> Dror
> 
> > 
> > When I index it does 
> > 
> >  add (Field.Keyword(..,..))
> > 
> > 
> > But I don't want the user to have to type ID: It would be
> > nice to just type ID Number. On your site if you just put: 11183 in
> the
> > search box there are no results. 
> > 
> > well, right now I'll just do it as text and query that field for the
> id
> > # to display the document.  It can't hurt, right? :)  Unless the
> Keyword
> > is a better way
> > 
> > 
> > 
> > -Original Message-
> > From: Dror Matalon [mailto:[EMAIL PROTECTED]
> > Sent: Friday, December 05, 2003 3:06 PM
> > To: Lucene Users List
> > Subject: Re: Returning one result
> > 
> > 
> > On Fri, Dec 05, 2003 at 02:45:34PM -0500, Pleasant, Tracy wrote:
> > > Maybe we are having some communication issues. 
> > > 
> > > At any rate, I did index it as a KEYWORD and when displaying used
> the
> > > TermQuery.
> > > 
> > > The only problem with this though is by storing the ID (i.e. AR345)
> as
> > a
> > > Keyword, if I search for AR345 no results are returned when I use
> the
> > > MultiFieldQueryParser .
> > > 
> > > *sigh* *arg*
> > 
> > OK. 
> > 
> > Go to http://www.fastbuzz.com/search/index.jsp and type "lucene"
> without
> > the quotes  and hit search. You get results from different
> channels/rss
> > feeds.
> > 
> > Now type "lucene channel:11183" without the quotes and hit search. You
> > get results only from Java-Channel. 
> > 
> > We're inserting the field channel as a keyword, and it does what I
> > understand you want to use AR345.
> > 
> > I would guess that in MultiFieldQueryParser you are not doing an add()
> > of the field for AR345 which is why the search fails. 
> > 
> > Regards,
> > 
> > Dror
> > 
> > 
> > > 
> > > 
> > > 
> > > -Original Message-
> > > From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> > > Sent: Friday, December 05, 2003 2:13 PM
> > > To: Lucene Users List
> > > Subject: Re: Returning one result
> > > 
> > > 
> > > On Friday, December 5, 2003, at 01:25  PM, Pleasant, Tracy wrote:
> > > > Say ID is Ar3453 .. well the user may want to search for Ar3453,
> so
> > in
> > > > order for it to be searchable then it would have to be indexed and
> > not
> > > 
> > > > a
> > > > keyword.
> > > 
> > > *arg* - we're having a serious communication issue here.  My advice
> to
> > 
> > > you is to actually write some simple tests (test-driven learning
> using
> > 
> > > > JUnit is a wonderful way to experiment with Lucene, especially
> thanks
> > 
> > > to the RAMDirectory).  Please refer to my articles at java.net as
> well
> > 
> > > as the other great Lucene articles out there.
> > > 
> > > Let me try again a Field.Keyword *IS* indexed!  Even Lucene's 
> > > javadocs say this for this method:
> > > 
> > >/** Constructs a String-valued Field that is not tokenized, but
> is 
> > >  >>>indexed<<<
> > >  and stored.  Useful for non-text fields, e.g. date or url.  */
> > > 
> > > [I added the emphasis there]
> > > 
> > > 
> > > > So after using
> > > > TermQuery query = new TermQuery(new Term("id", term));
> > > >
> > > > How would I return the other fields in the document?
> > > >
> > > > For instance to display a record it would get the record with the
> id
> > #
> > > > and then display the title, contents, etc.
> > > 
> > > Umm you'd use *exactly* the same way as if you had used 
> > > QueryParser.  QueryParser would create a TermQuery for you, in fact,
> 
> > > except it would analyze your text first, which is what you want to 
> > > avoid, right?
> > > 
> > > Hits.doc(n) gives you back a Document.  And then 
> > > Document.get("fieldName") gives you back the fields (as long as you
> > >>> 
> > > stored <<< them in the index too).
> > > 
> > > Again, please attempt some of these things in code.  It is a trivial
> 
> > > matter to index and search using RAMDirectory and experiment with 
> > > TermQuery, QueryParser, Analyzers, etc.
> > > 
> > >   Erik
> > > 
> > > 
> > >
> > > 
> > > 
> > >
> 

RE: Probabilistic Model in Lucene - possible?

2003-12-05 Thread ambiesense
I guess you mean "Modern Information Retrieval" ... I would be a little
careful, since this book looks at things through theoretical glasses. It might turn out
more difficult than expected. However, I would like to discuss this further. How could
one obtain the values you are writing about? Any first ideas?
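
As a first idea, here is a minimal sketch in plain Java of what one could compute from the two counts mentioned in the quoted message below (the total number of documents and the number of documents containing a term). In Lucene these counts are available from IndexReader via numDocs() and docFreq(Term); they are passed in as plain integers here. The two formulas are standard textbook weights chosen for illustration (plain idf and the Robertson/Sparck Jones weight without relevance information), not something proposed in this thread, and the class name is hypothetical.

```java
public class TermWeightSketch {

    /** Plain inverse document frequency: ln(N / n). */
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    /**
     * Robertson/Sparck Jones weight without relevance information:
     * ln((N - n + 0.5) / (n + 0.5)).
     */
    static double rsjWeight(int numDocs, int docFreq) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        int numDocs = 1000;                      // N: documents in the index
        int[] docFreqs = {1, 10, 100, 500, 900}; // n: documents containing the term
        for (int n : docFreqs) {
            System.out.println("n=" + n
                    + " idf=" + idf(numDocs, n)
                    + " rsj=" + rsjWeight(numDocs, n));
        }
    }
}
```

Note that the second weight becomes negative for terms occurring in more than half of the documents, so how to define the weights in documents and queries remains the open design question.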

Cheers,
Ralf

> Dear all,
> 
> I am interested in implementing a probabilistic model in Lucene as well.
> I checked the book titled "model information retrieval" authored by
> Ricardo Baeza-Yates and Berthier Ribeiro-Neto, and it seems to me that the 
> implementation is not very complicated when we use Lucene's IndexReader 
> class; almost all the parameters needed are there: the total number of 
> documents in the index (collection) and the number of documents having a 
> particular term, that is it. We probably need to find a satisfactory
> method of defining the weights of terms in the documents as well as in
> the query.
> 
> Cheers,
> 
> Shengli
> 
> 
> 
> Adam Saltiel <[EMAIL PROTECTED]> said:
> 
> > Herb,
> > Any one game ... ?
> > No takers? I would be very interested, but maybe beyond what can be
> > posted in a mail list. I'd be equally interested in any references you
> > may have.
> > As we are on this subject how does LSI and the similar CNG (context
> > network graph) fit into the model used by lucene. Could lucene be
> > massaged to implement different mathematical models of search and
> > retrieval, if so how modular are the core functions?
> > 
> > Adam Saltiel
> > 
> > 
> > > -Original Message-
> > > From: Chong, Herb [mailto:[EMAIL PROTECTED]
> > > Sent: Thursday, December 04, 2003 1:53 PM
> > > To: Lucene Users List
> > > Subject: RE: Probabilistic Model in Lucene - possible?
> > >
> > > > not all tf/idf variants are probabilistic models, but a great many
> > > > are if the term weights are probabilities. if we just take straight,
> > > > unmodified Term Frequency in a document, Inverse Document Frequency
> > > > in the corpus, and the Term Frequency in the query as 1, you are in
> > > > fact comparing the statistical properties of the query against the
> > > > statistical properties of the query. they are probabilities you are
> > > > comparing. i can't think of many papers that come right out and say
> > > > it, but if you look at an individual term weight and can interpret
> > > > it as a genuine probability, the vector space model based on the
> > > > weights is a probabilistic model. the derivation is relatively
> > > > straightforward to show it, if you have the right general model to
> > > > start with. once you start throwing in ad hoc normalizations, then
> > > > things get out of whack and it's no longer a probabilistic model.
> > > >
> > > > the implementations that i have done are with a former company and
> > > > that means secret and protected by various intellectual property
> > > > rights. however, i can sketch here the general approach one has to
> > > > take and an outline of the derivation that unifies probabilistic
> > > > models with vector space models and at the same time incorporates
> > > > pairwise interterm correlation. in fact, the pairwise interterm
> > > > correlations are a fundamental assumption. once you do all this, you
> > > > can show that the traditional vector space model is a special case
> > > > of a pairwise interterm correlation model. for those that are
> > > > interested in advanced matrix algebra and some basic statistics, it
> > > > should be very interesting. if only i had a published paper, i would
> > > > post it. unfortunately, what i have is very obtuse because it's
> > > > protected. the only paper that started out was submitted to SIGIR
> > > > but rejected by all but one referee. that one thought this was a
> > > > tremendous unification of the two methods, but academic journals
> > > > being what they are, when 4 out of 5 referees can't understand the
> > > > paper, it doesn't get published. i may brush it off and enlarge into
> > > > a much longer paper for the Journal of IR, but once again, unless
> > > > you are comfortable with probability theory and matrix theory, you
> > > > are not going to follow it.
> > >
> > > so, who is game for a tutorial on the derivation?
> > >
> > > Herb...
> > >
> > > -Original Message-
> > > From: Karsten Konrad [mailto:[EMAIL PROTECTED]
> > > Sent: Thursday, December 04, 2003 5:09 AM
> > > To: Lucene Users List
> > > Subject: AW: Probabilistic Model in Lucene - possible?
> > >
> > >
> > >
> > > Hi Herb,
> > >
> > > thank you for your insights.
> > >
> > > >>
> > > but by most accepted definitions, the tf/idf model in Lucene is a
> > > probabilistic model.
> > > >>
> > >
> > > Can you send some pointers to help me understand that? Are all TF/IDF-
> > > variants
> > > probabilistic models? If so, what makes any model a non-probabilistic
> > one?
> > > If you claim that TF/IDF is probabilistic, then the plain cosine (an
> > > extreme
> > > form of TF/IDF, with IDF for all terms being consi

RE: Returning one result

2003-12-05 Thread Pleasant, Tracy
Yes, it is in the array of fields that I want searched.

-Original Message-
From: Dror Matalon [mailto:[EMAIL PROTECTED]
Sent: Friday, December 05, 2003 3:32 PM
To: Lucene Users List
Subject: Re: Returning one result



On Fri, Dec 05, 2003 at 03:14:08PM -0500, Pleasant, Tracy wrote:
> What do you mean 'add' in MultiFieldQueryParser?  I am using all the
> fields 

Sorry, that was wrong. What I meant to say is are you adding the field
to the array of fields that need to be searched? 

You need to use a MultiFieldQueryParser and pass it the array of fields
that you want searched.

Dror

> 
> When I index it does 
> 
>  add (Field.Keyword(..,..))
> 
> 
> But I don't want the user to have to type ID: It would be
> nice to just type ID Number. On your site if you just put: 11183 in
the
> search box there are no results. 
> 
> well, right now I'll just do it as text and query that field for the
id
> # to display the document.  It can't hurt, right? :)  Unless the
Keyword
> is a better way
> 
> 
> 
> -Original Message-
> From: Dror Matalon [mailto:[EMAIL PROTECTED]
> Sent: Friday, December 05, 2003 3:06 PM
> To: Lucene Users List
> Subject: Re: Returning one result
> 
> 
> On Fri, Dec 05, 2003 at 02:45:34PM -0500, Pleasant, Tracy wrote:
> > Maybe we are having some communication issues. 
> > 
> > At any rate, I did index it as a KEYWORD and when displaying used
the
> > TermQuery.
> > 
> > The only problem with this though is by storing the ID (i.e. AR345)
as
> a
> > Keyword, if I search for AR345 no results are returned when I use
the
> > MultiFieldQueryParser .
> > 
> > *sigh* *arg*
> 
> OK. 
> 
> Go to http://www.fastbuzz.com/search/index.jsp and type "lucene"
without
> the quotes  and hit search. You get results from different
channels/rss
> feeds.
> 
> Now type "lucene channel:11183" without the quotes and hit search. You
> get results only from Java-Channel. 
> 
> We're inserting the field channel as a keyword, and it does what I
> understand you want to use AR345.
> 
> I would guess that in MultiFieldQueryParser you are not doing an add()
> of the field for AR345 which is why the search fails. 
> 
> Regards,
> 
> Dror
> 
> 
> > 
> > 
> > 
> > -Original Message-
> > From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> > Sent: Friday, December 05, 2003 2:13 PM
> > To: Lucene Users List
> > Subject: Re: Returning one result
> > 
> > 
> > On Friday, December 5, 2003, at 01:25  PM, Pleasant, Tracy wrote:
> > > Say ID is Ar3453 .. well the user may want to search for Ar3453,
so
> in
> > > order for it to be searchable then it would have to be indexed and
> not
> > 
> > > a
> > > keyword.
> > 
> > *arg* - we're having a serious communication issue here.  My advice
to
> 
> > you is to actually write some simple tests (test-driven learning
using
> 
> > > JUnit is a wonderful way to experiment with Lucene, especially
thanks
> 
> > to the RAMDirectory).  Please refer to my articles at java.net as
well
> 
> > as the other great Lucene articles out there.
> > 
> > Let me try again a Field.Keyword *IS* indexed!  Even Lucene's 
> > javadocs say this for this method:
> > 
> >/** Constructs a String-valued Field that is not tokenized, but
is 
> >  >>>indexed<<<
> >  and stored.  Useful for non-text fields, e.g. date or url.  */
> > 
> > [I added the emphasis there]
> > 
> > 
> > > So after using
> > > TermQuery query = new TermQuery(new Term("id", term));
> > >
> > > How would I return the other fields in the document?
> > >
> > > For instance to display a record it would get the record with the
id
> #
> > > and then display the title, contents, etc.
> > 
> > Umm you'd use *exactly* the same way as if you had used 
> > QueryParser.  QueryParser would create a TermQuery for you, in fact,

> > except it would analyze your text first, which is what you want to 
> > avoid, right?
> > 
> > Hits.doc(n) gives you back a Document.  And then 
> > Document.get("fieldName") gives you back the fields (as long as you
> >>> 
> > stored <<< them in the index too).
> > 
> > Again, please attempt some of these things in code.  It is a trivial

> > matter to index and search using RAMDirectory and experiment with 
> > TermQuery, QueryParser, Analyzers, etc.
> > 
> > Erik
> > 
> > 
> >
-
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> > 
> > 
> >
-
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> > 
> 
> -- 
> Dror Matalon
> Zapatec Inc 
> 1700 MLK Way
> Berkeley, CA 94709
> http://www.fastbuzz.com
> http://www.zapatec.com
> 

Re: Returning one result

2003-12-05 Thread Dror Matalon

Mike,

Boy, I said it so badly and yet you understood :-).

Dror

On Fri, Dec 05, 2003 at 03:31:15PM -0500, Michael Giles wrote:
> Tracy,
> 
> I believe what Dror was referring to was the call to 
> MultiFieldQueryParser.parse(). The second argument to that call is a 
> String[] of field names on which to execute the query.  If the field that 
> contains "AR345" isn't listed in that array, you will not get any results.
> 
> -Mike
> 
> At 03:14 PM 12/5/2003, you wrote:
> >What do you mean 'add' in MultiFieldQueryParser?  I am using all the
> >fields
> >
> >When I index it does
> >
> > add (Field.Keyword(..,..))
> >
> >
> >But I don't want the user to have to type ID: It would be
> >nice to just type ID Number. On your site if you just put: 11183 in the
> >search box there are no results.
> >
> >well, right now I'll just do it as text and query that field for the id
> ># to display the document.  It can't hurt, right? :)  Unless the Keyword
> >is a better way
> >
> >
> >
> 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com




Re: Returning one result

2003-12-05 Thread Dror Matalon

On Fri, Dec 05, 2003 at 03:14:08PM -0500, Pleasant, Tracy wrote:
> What do you mean 'add' in MultiFieldQueryParser?  I am using all the
> fields 

Sorry, that was wrong. What I meant to say is are you adding the field
to the array of fields that need to be searched? 

You need to use a MultiFieldQueryParser and pass it the array of fields
that you want searched.

Dror

> 
> When I index it does 
> 
>  add (Field.Keyword(..,..))
> 
> 
> But I don't want the user to have to type ID: It would be
> nice to just type ID Number. On your site if you just put: 11183 in the
> search box there are no results. 
> 
> well, right now I'll just do it as text and query that field for the id
> # to display the document.  It can't hurt, right? :)  Unless the Keyword
> is a better way
> 
> 
> 
> -Original Message-
> From: Dror Matalon [mailto:[EMAIL PROTECTED]
> Sent: Friday, December 05, 2003 3:06 PM
> To: Lucene Users List
> Subject: Re: Returning one result
> 
> 
> On Fri, Dec 05, 2003 at 02:45:34PM -0500, Pleasant, Tracy wrote:
> > Maybe we are having some communication issues. 
> > 
> > At any rate, I did index it as a KEYWORD and when displaying used the
> > TermQuery.
> > 
> > The only problem with this though is by storing the ID (i.e. AR345) as
> a
> > Keyword, if I search for AR345 no results are returned when I use the
> > MultiFieldQueryParser .
> > 
> > *sigh* *arg*
> 
> OK. 
> 
> Go to http://www.fastbuzz.com/search/index.jsp and type "lucene" without
> the quotes  and hit search. You get results from different channels/rss
> feeds.
> 
> Now type "lucene channel:11183" without the quotes and hit search. You
> get results only from Java-Channel. 
> 
> We're inserting the field channel as a keyword, and it does what I
> understand you want to use AR345.
> 
> I would guess that in MultiFieldQueryParser you are not doing an add()
> of the field for AR345 which is why the search fails. 
> 
> Regards,
> 
> Dror
> 
> 
> > 
> > 
> > 
> > -Original Message-
> > From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> > Sent: Friday, December 05, 2003 2:13 PM
> > To: Lucene Users List
> > Subject: Re: Returning one result
> > 
> > 
> > On Friday, December 5, 2003, at 01:25  PM, Pleasant, Tracy wrote:
> > > Say ID is Ar3453 .. well the user may want to search for Ar3453, so
> in
> > > order for it to be searchable then it would have to be indexed and
> not
> > 
> > > a
> > > keyword.
> > 
> > *arg* - we're having a serious communication issue here.  My advice to
> 
> > you is to actually write some simple tests (test-driven learning using
> 
> > > JUnit is a wonderful way to experiment with Lucene, especially thanks
> 
> > to the RAMDirectory).  Please refer to my articles at java.net as well
> 
> > as the other great Lucene articles out there.
> > 
> > Let me try again a Field.Keyword *IS* indexed!  Even Lucene's 
> > javadocs say this for this method:
> > 
> >/** Constructs a String-valued Field that is not tokenized, but is 
> >  >>>indexed<<<
> >  and stored.  Useful for non-text fields, e.g. date or url.  */
> > 
> > [I added the emphasis there]
> > 
> > 
> > > So after using
> > > TermQuery query = new TermQuery(new Term("id", term));
> > >
> > > How would I return the other fields in the document?
> > >
> > > For instance to display a record it would get the record with the id
> #
> > > and then display the title, contents, etc.
> > 
> > Umm you'd use *exactly* the same way as if you had used 
> > QueryParser.  QueryParser would create a TermQuery for you, in fact, 
> > except it would analyze your text first, which is what you want to 
> > avoid, right?
> > 
> > Hits.doc(n) gives you back a Document.  And then 
> > Document.get("fieldName") gives you back the fields (as long as you
> >>> 
> > stored <<< them in the index too).
> > 
> > Again, please attempt some of these things in code.  It is a trivial 
> > matter to index and search using RAMDirectory and experiment with 
> > TermQuery, QueryParser, Analyzers, etc.
> > 
> > Erik
> > 
> > 
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> > 
> > 
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> > 
> 
> -- 
> Dror Matalon
> Zapatec Inc 
> 1700 MLK Way
> Berkeley, CA 94709
> http://www.fastbuzz.com
> http://www.zapatec.com
> 
> 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com


RE: Returning one result

2003-12-05 Thread Michael Giles
Tracy,

I believe what Dror was referring to was the call to 
MultiFieldQueryParser.parse(). The second argument to that call is a 
String[] of field names on which to execute the query.  If the field that 
contains "AR345" isn't listed in that array, you will not get any results.
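
A toy model of why the field list matters, in plain Java: the same user input is applied to every field named in the array, so a field that is missing from the array is never queried. This only sketches the idea and is not how MultiFieldQueryParser is implemented; the class, the method, and the field names "title", "contents" and "id" are hypothetical.

```java
public class MultiFieldExpansionSketch {

    /** Applies one term to each listed field, e.g. "title:AR345 contents:AR345". */
    static String expand(String term, String[] fields) {
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(field).append(':').append(term);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Without "id" in the array, the id field is never searched.
        System.out.println(expand("AR345", new String[] {"title", "contents"}));
        // With "id" listed, the user can type just the id number.
        System.out.println(expand("AR345", new String[] {"title", "contents", "id"}));
    }
}
```

Printing the parsed query with query.toString(), as suggested elsewhere in this thread, shows which fields the real parser actually covered.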

-Mike

At 03:14 PM 12/5/2003, you wrote:
> What do you mean 'add' in MultiFieldQueryParser?  I am using all the
> fields
>
> When I index it does
>
>  add (Field.Keyword(..,..))
>
> But I don't want the user to have to type ID: It would be
> nice to just type ID Number. On your site if you just put: 11183 in the
> search box there are no results.
>
> well, right now I'll just do it as text and query that field for the id
> # to display the document.  It can't hurt, right? :)  Unless the Keyword
> is a better way




RE: Returning one result

2003-12-05 Thread Pleasant, Tracy
What do you mean 'add' in MultiFieldQueryParser?  I am using all the
fields 

When I index it does 

 add (Field.Keyword(..,..))


But I don't want the user to have to type ID: It would be
nice to just type ID Number. On your site if you just put: 11183 in the
search box there are no results. 

well, right now I'll just do it as text and query that field for the id
# to display the document.  It can't hurt, right? :)  Unless the Keyword
is a better way





Re: Returning one result

2003-12-05 Thread Dror Matalon
On Fri, Dec 05, 2003 at 02:45:34PM -0500, Pleasant, Tracy wrote:
> Maybe we are having some communication issues. 
> 
> At any rate, I did index it as a KEYWORD and when displaying used the
> TermQuery.
> 
> The only problem with this though is by storing the ID (i.e. AR345) as a
> Keyword, if I search for AR345 no results are returned when I use the
> MultiFieldQueryParser .
> 
> *sigh* *arg*

OK. 

Go to http://www.fastbuzz.com/search/index.jsp and type "lucene" without
the quotes  and hit search. You get results from different channels/rss
feeds.

Now type "lucene channel:11183" without the quotes and hit search. You
get results only from Java-Channel. 

We're inserting the field channel as a keyword, and it does what I
understand you want to do with AR345.

I would guess that in MultiFieldQueryParser you are not doing an add()
of the field for AR345 which is why the search fails. 

Regards,

Dror



-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com




RE: Returning one result

2003-12-05 Thread Pleasant, Tracy
Thanks, but using it as a Keyword, it will not get returned with my
search results when I use MultiFieldQueryParser.

If I could I would use just parse(query) but that is not a static
method, only parse(query,field,analyzer) is... So when I do that and use
an analyzer, the keyword field isn't searched.






RE: Returning one result

2003-12-05 Thread Pleasant, Tracy
Maybe we are having some communication issues. 

At any rate, I did index it as a KEYWORD and when displaying used the
TermQuery.

The only problem with this though is by storing the ID (i.e. AR345) as a
Keyword, if I search for AR345 no results are returned when I use the
MultiFieldQueryParser .

*sigh* *arg*






Re: Returning one result

2003-12-05 Thread Dror Matalon
On Fri, Dec 05, 2003 at 01:25:23PM -0500, Pleasant, Tracy wrote:
> What I meant is.
> 
> Say ID is Ar3453 .. well the user may want to search for Ar3453, so in
> order for it to be searchable then it would have to be indexed and not a
> keyword.

No. You should store it as a keyword. 

From the javadocs:
Keyword(String name, String value)
  Constructs a String-valued Field that is not tokenized, but is
  indexed and stored.



-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com




Re: Returning one result

2003-12-05 Thread Erik Hatcher
On Friday, December 5, 2003, at 01:25  PM, Pleasant, Tracy wrote:
> Say ID is Ar3453 .. well the user may want to search for Ar3453, so in
> order for it to be searchable then it would have to be indexed and not
> a keyword.

*arg* - we're having a serious communication issue here.  My advice to 
you is to actually write some simple tests (test-driven learning using 
JUnit is a wonderful way to experiment with Lucene, especially thanks 
to the RAMDirectory).  Please refer to my articles at java.net as well 
as the other great Lucene articles out there.

Let me try again: a Field.Keyword *IS* indexed!  Even Lucene's 
javadocs say this for this method:

  /** Constructs a String-valued Field that is not tokenized, but is
      >>>indexed<<< and stored.  Useful for non-text fields, e.g. date
      or url.  */

[I added the emphasis there]

> So after using
> TermQuery query = new TermQuery(new Term("id", term));
>
> How would I return the other fields in the document?
>
> For instance to display a record it would get the record with the id #
> and then display the title, contents, etc.

Umm, you'd use it *exactly* the same way as if you had used 
QueryParser.  QueryParser would create a TermQuery for you, in fact, 
except it would analyze your text first, which is what you want to 
avoid, right?

Hits.doc(n) gives you back a Document.  And then 
Document.get("fieldName") gives you back the fields (as long as you
also >>>stored<<< them in the index).

Again, please attempt some of these things in code.  It is a trivial 
matter to index and search using RAMDirectory and experiment with 
TermQuery, QueryParser, Analyzers, etc.
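
The point about analysis can be sketched without Lucene at all. What
follows is a toy model, not the Lucene API: ToyIndex, addKeyword, addText,
termQuery and analyzedQuery are made-up names, and the "analyzer" simply
splits on non-alphanumerics and lowercases (an assumption standing in for
what an analyzer such as StandardAnalyzer does to AR345). Under those
assumptions, an exact term lookup finds the un-tokenized id, while the
analyzed lookup searches for "ar345" and misses it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy inverted index; NOT the Lucene API, just a model of indexed terms. */
class ToyIndex {
    // Key "field:term" -> ids of the documents containing that term.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Hypothetical analyzer: split on non-alphanumerics, then lowercase. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^A-Za-z0-9]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Keyword-style field: the whole value becomes one term, unchanged. */
    void addKeyword(int docId, String field, String value) {
        post(field, value, docId);
    }

    /** Text-style field: the value is analyzed into separate terms. */
    void addText(int docId, String field, String value) {
        for (String token : analyze(value)) {
            post(field, token, docId);
        }
    }

    private void post(String field, String term, int docId) {
        postings.computeIfAbsent(field + ":" + term, k -> new ArrayList<>()).add(docId);
    }

    /** Exact term lookup, the toy counterpart of a TermQuery. */
    List<Integer> termQuery(String field, String term) {
        return postings.getOrDefault(field + ":" + term, new ArrayList<>());
    }

    /** Analyzed lookup, the toy counterpart of running user input through an analyzer. */
    List<Integer> analyzedQuery(String field, String userInput) {
        List<Integer> hits = new ArrayList<>();
        for (String token : analyze(userInput)) {
            hits.addAll(termQuery(field, token));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addKeyword(0, "id", "AR345");
        index.addText(0, "title", "See Spot run");

        System.out.println(index.termQuery("id", "AR345"));       // exact term lookup
        System.out.println(index.analyzedQuery("id", "AR345"));   // looks up "ar345" instead
        System.out.println(index.analyzedQuery("title", "spot")); // tokenized field, analyzed query
    }
}
```

If this model matches what is happening in the thread, it would explain
why the id is found with a TermQuery but not through a parser that applies
an analyzer to every field.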

	Erik



RE: How would you delete an entry that was indexed like this

2003-12-05 Thread Aviran
This is kind of a problem: in order to delete documents using terms, you need
to have a keyword field which contains a unique value; otherwise you might
end up deleting more than you want.
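
A small Lucene-free sketch of why the term should be unique. ToyStore and
its methods are hypothetical names, and the "category" field and the ids
are invented for illustration; the only idea taken from the thread is that
deleting by term removes every document holding that term.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy document store; NOT the Lucene API, just a model of delete-by-term. */
class ToyStore {
    // Each document is a map from field name to a single un-tokenized term.
    private final List<Map<String, String>> docs = new ArrayList<>();

    void add(String id, String category) {
        Map<String, String> doc = new HashMap<>();
        doc.put("id", id);             // unique per document
        doc.put("category", category); // shared by many documents
        docs.add(doc);
    }

    /** Deletes every document whose field holds exactly this term; returns how many. */
    int deleteByTerm(String field, String term) {
        int before = docs.size();
        docs.removeIf(doc -> term.equals(doc.get(field)));
        return before - docs.size();
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore();
        store.add("c-1", "widget");
        store.add("c-2", "widget");
        store.add("c-3", "gadget");

        System.out.println(store.deleteByTerm("id", "c-3"));          // unique term
        System.out.println(store.deleteByTerm("category", "widget")); // shared term
        System.out.println(store.size());
    }
}
```

The same reasoning suggests indexing the id un-tokenized, so that the term
used for deletion is the id exactly as written rather than whatever an
analyzer turned it into.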



Re: implementing a TokenFilter for aliases

2003-12-05 Thread Doug Cutting
Position increments are for relative token positions.  A position 
increment of zero means that a token is logically at the same position 
as the previous token.  A position increment of one means that a token 
immediately follows the preceding token in the stream, it's the next 
token to the right (in a left-to-right language).  A position increment 
of two means that it is two tokens past the previous token, that there's 
a "phantom" token between them, inhibiting exact phrase matches.

You're setting the position increment to things based on the number of 
characters in the token's text.  That makes no sense.  Token positions 
are not character positions.  I think what you want to do is use a 
positionIncrement of zero, so the tokens lie at the same position.
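
The suggestion above can be sketched with a toy token stream. This is not
the Lucene API: ToyAliasFilter, Tok, withAliases and absolutePositions are
made-up names, and the word "valve" is invented to give the alias a
neighbour. The sketch assumes the alias is emitted in addition to the
original token, with an increment of zero, so both end up at the same
position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy token stream; NOT the Lucene API, just a model of position increments. */
class ToyAliasFilter {
    /** A token: its text and its position increment relative to the previous token. */
    static final class Tok {
        final String text;
        final int positionIncrement;

        Tok(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Emits each word, followed by its alias (if any) at the same position. */
    static List<Tok> withAliases(List<String> words, Map<String, String> aliases) {
        List<Tok> out = new ArrayList<>();
        for (String word : words) {
            out.add(new Tok(word, 1));      // increment 1: next position to the right
            String alias = aliases.get(word);
            if (alias != null) {
                out.add(new Tok(alias, 0)); // increment 0: same position as the original
            }
        }
        return out;
    }

    /** Turns relative increments into absolute positions; the first token lands at 0. */
    static List<Integer> absolutePositions(List<Tok> tokens) {
        List<Integer> positions = new ArrayList<>();
        int position = -1;
        for (Tok tok : tokens) {
            position += tok.positionIncrement;
            positions.add(position);
        }
        return positions;
    }

    public static void main(String[] args) {
        Map<String, String> aliases = new HashMap<>();
        aliases.put("spitline", "splitline");

        List<Tok> tokens = withAliases(Arrays.asList("bypass", "spitline", "valve"), aliases);
        List<Integer> positions = absolutePositions(tokens);
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(positions.get(i) + ": " + tokens.get(i).text);
        }
    }
}
```

Note that the increment depends only on whether a token is an alias, never
on the length of its text.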

Doug

Allen Atamer wrote:
> The FAQ describes implementing a TokenFilter for applying aliases. I am
> having trouble accomplishing this.
>
> This is the code that I have so far for the next() method within AliasFilter.
> After reading some posts, I also got the idea to call
> setPositionIncrement(). Neither way works, because when I search for the
> alias, no search results come back.
>
> Thank you for your help,
>
> Allen Atamer
>
>   public Token next() throws java.io.IOException {
>     Token token = tokenStream.next();
>
>     if (aliasMap == null || token == null) {
>       return token;
>     }
>
>     TermData t = (TermData)aliasMap.get(token.termText());
>
>     if (t == null) {
>       return token;
>     }
>
>     String tokenText = AliasManager.replaceIgnoreCase(
>         token.termText(), t.getTerm(), t.getTeach());
>
>     int increment = tokenText.length() - token.termText().length();
>     if (increment > 0) {
>       token.setPositionIncrement(increment);
>     }
>
>     return new Token(tokenText, token.startOffset(), token.endOffset());
>   }




RE: Returning one result

2003-12-05 Thread Pleasant, Tracy
Also what I am indexing is not a bunch of separate documents - or then
it would be easy to simply have a field called "url" and then the link
would go directly to that document. 

However, there is a text URL with many records.
During indexing, a function parses each record and puts each into a
document with appropriate fields. 

When I go to display a particular Document (Lucene Document) I just
query the index for that unique ID rather than go through and parse
through the URL with all the records. 

Wouldn't querying the index for that unique ID be better than going
through that entire page and parsing through it - there is more room for
error that way.  

It's a long story why there isn't a database but it can't be done (don't
ask ... long story). 




RE: implementing a TokenFilter for aliases

2003-12-05 Thread Allen Atamer
173 is the ID field from a database (which we use as a primary key). For
Lucene's purpose, it only stores the field, and does not index it.

The place where I put the print statements is before the actual filtering.
The goal of the AliasFilter is to replace spitline. The debug line is in the
Tokenizer, and the filters are run afterwards, so I am not sure what is
happening inside Lucene.

I can't put the util line into the analyzer after the AliasFilter is run
because it will call recursively into tokenStream() and cause a stack
overflow. I will try to work on seeing what is happening after the
AliasFilter is run.

Allen





RE: Returning one result

2003-12-05 Thread Pleasant, Tracy
What I meant is.

Say ID is Ar3453 .. well the user may want to search for Ar3453, so in
order for it to be searchable then it would have to be indexed and not a
keyword.

So after using
TermQuery query = new TermQuery(new Term("id", term));

How would I return the other fields in the document?

For instance to display a record it would get the record with the id #
and then display the title, contents, etc.







Re: Testing for Optimization

2003-12-05 Thread Doug Cutting
jt oob wrote:
> Can I safely delete those files which do not have the prefix listed in
> the segments file?

Have a look at the index file format documentation:

  http://jakarta.apache.org/lucene/docs/fileformats.html

The only file besides segments that should exist is the "deleteable" 
file, and the files named in the "deleteable" file.  These are files 
which couldn't be deleted, typically on Win32, where you can't delete an 
open file.  Lucene will try to delete them again later, but it shouldn't 
hurt for you to delete them first.

Doug



How would you delete an entry that was indexed like this

2003-12-05 Thread Mike Hogan
Hi,

If I index a document like this:

IndexWriter writer = createWriter();
Document document = new Document();
document.add(Field.Text(ID_FIELD_NAME, componentId));
document.add(Field.Text(CONTENTS_FIELD_NAME, componentDescription));
writer.addDocument(document);
writer.optimize();
writer.close();

What code must I execute to later delete the document? (I tried following the
docs and what's done in the code and test cases.  I saw Terms being used to
ID the document to delete.  But I am not clear what value to put in the
Term, as I do not know how Terms relate to Fields.)

Many thanks,
Mike.





Re: Index and Field.Text

2003-12-05 Thread Doug Cutting
Tatu Saloranta wrote:
> Also, shouldn't there be at least 3 methods that take Readers; one for 
> Text-like handling, another for UnStored, and last for UnIndexed.

How do you store the contents of a Reader?  You'd have to double-buffer 
it, first reading it into a String to store, and then tokenizing the 
StringReader.  A key feature of Reader values is that they're streamed: 
the entire value is never in RAM.  Storing a Reader value would remove 
that advantage.  The current API makes this explicit: when you want 
something streamed, you pass in a Reader, when you're willing to have 
the entire value in memory, pass in a String.
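
The trade-off described above can be illustrated with plain java.io, with
no Lucene involved. StreamingCount, countTokens and readFully are
hypothetical names, and whitespace splitting stands in for real
tokenization. Streaming consumes the Reader one character at a time and
keeps nothing; storing the value needs the whole text in memory first,
after which tokenizing has to start again from a StringReader over that
copy.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Toy streaming tokenizer; NOT the Lucene API. */
class StreamingCount {
    /** Counts whitespace-separated tokens while holding only one character at a time. */
    static int countTokens(Reader reader) {
        try {
            int count = 0;
            boolean inToken = false;
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    inToken = false;
                } else if (!inToken) {
                    inToken = true;
                    count++;
                }
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Storing a value needs all of it: this drains the Reader into one String. */
    static String readFully(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Streamed: tokens are consumed as they are read and nothing is retained.
        System.out.println(countTokens(new StringReader("see spot run")));

        // Stored and tokenized: buffer the value first, then tokenize the copy.
        String stored = readFully(new StringReader("see spot run"));
        System.out.println(stored.length() + " chars kept, "
                + countTokens(new StringReader(stored)) + " tokens");
    }
}
```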

Yes, it is a bit confusing that Text(String, String) stores its value, 
while Text(String, Reader) does not, but it is at least well documented. 
 And we cannot change it: that would break too many applications.  But 
we can put this on the list for Lucene 2.0 cleanups.

When I first wrote these static methods I meant for them to be 
constructor-like.  I wanted to have multiple Field(String, String) 
constructors, but that's not possible, so I used capitalized static 
methods instead.  I've never seen anyone else do this (capitalize any 
method but a real constructor) so I guess I didn't start a fad!  This 
should someday too be cleaned up.  Lucene was the first Java program 
that I ever wrote, and thus its style is in places non-standard.  Sorry.
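
For readers unfamiliar with the idiom, here is a minimal sketch of named
static factories. ToyField is a made-up class, not Lucene's Field, and the
lower-case method names follow ordinary Java convention rather than the
capitalized ones discussed here. The flag combinations are the ones stated
in this thread: Keyword is stored and indexed but not tokenized, Text with
a String is stored, indexed and tokenized, and UnStored is indexed and
tokenized but not stored.

```java
/** Toy field class; NOT Lucene's Field, just the named-factory idea. */
final class ToyField {
    final String name;
    final String value;
    final boolean stored;
    final boolean indexed;
    final boolean tokenized;

    private ToyField(String name, String value,
                     boolean stored, boolean indexed, boolean tokenized) {
        this.name = name;
        this.value = value;
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    // Three factories with identical (String, String) parameters. Constructors
    // could not be overloaded this way, because their signatures would clash.

    /** Stored and indexed, not tokenized. */
    static ToyField keyword(String name, String value) {
        return new ToyField(name, value, true, true, false);
    }

    /** Stored, indexed and tokenized. */
    static ToyField text(String name, String value) {
        return new ToyField(name, value, true, true, true);
    }

    /** Indexed and tokenized, not stored. */
    static ToyField unStored(String name, String value) {
        return new ToyField(name, value, false, true, true);
    }

    public static void main(String[] args) {
        ToyField id = ToyField.keyword("id", "AR345");
        ToyField body = ToyField.unStored("contents", "see spot run");
        System.out.println(id.name + " tokenized=" + id.tokenized);
        System.out.println(body.name + " stored=" + body.stored);
    }
}
```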

Doug



Re: implementing a TokenFilter for aliases

2003-12-05 Thread Erik Hatcher
On Friday, December 5, 2003, at 11:59  AM, Allen Atamer wrote:
> Below are the results of a debug run on the piece of text that I want
> aliased. The token "spitline" must be recognized as "splitline" i.e.
> when I do a search for "splitline", this record will come up.
>
> 1: [173] , start:1, end:2
> 1: [missing] , start:1, end:6
> 2: [hardware] , start:9, end:7
> 3: [for] , start:18, end:2
> 4: [bypass] , start:22, end:5
> 5: [spitline] , start:29, end:37
>
> I also added extra debug info after the token text, which are the
> startOffset, and the endOffset. Lucene has the first token "173" only
> stored, it is not indexed. The remaining terms are tokenized, indexed
> and stored. Does this make a difference?

I don't understand what you mean by "173" - is that output from a 
different string being analyzed?

Well, it's obvious from this output that you cannot find "spitline" 
when "splitline" is used in a search.  Your analyzer isn't working as 
you expect, I'm guessing.

	Erik



RE: implementing a TokenFilter for aliases

2003-12-05 Thread Allen Atamer
Erik,

Below are the results of a debug run on the piece of text that I want
aliased. The token "spitline" must be recognized as "splitline" i.e. when I
do a search for "splitline", this record will come up.

1: [173] , start:1, end:2
1: [missing] , start:1, end:6
2: [hardware] , start:9, end:7
3: [for] , start:18, end:2
4: [bypass] , start:22, end:5
5: [spitline] , start:29, end:37

I also added extra debug info after the token text, which are the
startOffset, and the endOffset. Lucene has the first token "173" only
stored, it is not indexed. The remaining terms are tokenized, indexed and
stored. Does this make a difference?

Allen





Re: Returning one result

2003-12-05 Thread Erik Hatcher
On Friday, December 5, 2003, at 10:41  AM, Pleasant, Tracy wrote:
Maybe I should have been more clear.

static Field Keyword(String name, String value)
  Constructs a String-valued Field that is not tokenized, but 
is
indexed and stored.

I need to have it tokenized because people will search for that also 
and
it needs to be searchable.
Search for *what* also?  Tokenized means that it is broken into pieces 
which will be separate terms.  For example: "see spot" is tokenized 
into "see" and "spot", and searching for either of those terms will 
match.

Just try it and see, please!  :)

> Should I have two fields - one as a keyword and one as text?

Depends on what you're doing... but an "id" field indicates 
Field.Keyword to me, only.

> How would I do that when I want to return search results..
>
>  Searcher searcher = new IndexSearcher("index");
>  String term = request.getParameter("id");
>
>  Query query = QueryParser.parse(term, "id", new StandardAnalyzer());
>  Hits hits  = searcher.search(query);
>
> Would it have to be something like:
>  TermQuery query = ???

Yes.  TermQuery query = new TermQuery(new Term("id", term));

Use searcher.search exactly as you did before.  Just don't use 
QueryParser to construct a query.

	Erik



Re: Index and Field.Text

2003-12-05 Thread Tatu Saloranta
On Friday 05 December 2003 08:22, Erik Hatcher wrote:
> On Friday, December 5, 2003, at 09:48  AM, Grant Ingersoll wrote:
...
> > Field.Text(String, String) instead of the Field.Text(String, Reader)
> > version, which means I am storing the contents in the index.
>
> So use Field.UnStored(String, String) then.  It is the same as
> Field.Text(String, Reader).
>
> The static "factory" methods on Field are merely for convenience.  You
> can control all the flags yourself using the constructor:

I think it's almost a bug that they act differently despite having the same 
method name. I don't think the method should be called Text() if it behaves 
like UnStored(). Additionally, the implementation of the non-public 
constructor relies on default values for isIndexed, isStored and isTokenized; 
it should probably take those from the static method for clarity.

Also, shouldn't there be at least 3 methods that take Readers: one for 
Text-like handling, another for UnStored, and a last one for UnIndexed? It's 
probably ok not to have one for keywords. For other types, though, it's often 
more convenient to just pass in a Reader.
(Internally, the difference between passing in a Reader or a String is not 
huge, as a String will be accessed via StringReader.)

-+ Tatu +-






Lucene 1.2 "Hit Highlighting"

2003-12-05 Thread Kenneth Campbell
Can someone point me in the right direction with regards to "Hit Highlighting"?

I have seen what Mark Harwood has done and I like it, however I am using Lucene 1.2. 
Are there any compatibility issues?

If not, any suggestions about implementation would be helpful.

If so, are there any suggestions for "Hit Highlighting" with Lucene 1.2?


Thanks 

Ken C.







Re: Returning one result

2003-12-05 Thread Erik Hatcher
On Friday, December 5, 2003, at 10:31  AM, Pleasant, Tracy wrote:
> Ok thanks, but still I can't use the Simple analyzer since it won't even
> index that whole thing. I'll give TermQuery a try. Thanks.


Yes, certainly the analyzer is important for "analyzed" fields, but it 
is not used for Field.Keyword.  Please provide more details on the 
issue you encountered using Field.Keyword.
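
A rough picture of why an analyzed "id" field can behave as reported in the quoted messages below (one hit expected, 200 returned): if analysis keeps only letters, ids that differ only in their digits collapse to the same term. The sketch is an approximation written in plain Java, not the SimpleAnalyzer source, and the class name LetterAnalysisSketch is made up. The id "AR334" is the sample record id mentioned elsewhere in this thread; "AR335" is an invented neighbour.

```java
import java.util.ArrayList;
import java.util.List;

// Approximation of letter-only, lower-casing analysis; a sketch, not Lucene's SimpleAnalyzer.
class LetterAnalysisSketch {
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A digit, space, or punctuation mark ends the current term and is dropped.
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("AR334")); // [ar]
        System.out.println(analyze("AR335")); // [ar]
    }
}
```

Under that assumption both ids index and query as the single term "ar", which is consistent with many documents matching. An untokenized field queried with an exact term avoids the analyzer altogether.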



> -Original Message-
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Thursday, December 04, 2003 6:18 PM
> To: Lucene Users List
> Subject: Re: Returning one result
>
> You really should use a TermQuery in this case anyway, rather than
> using QueryParser.  You wouldn't have to worry about the analyzer at
> that point anyway (and I assume you're using Field.Keyword during
> indexing).
>
> 	Erik
>
> On Thursday, December 4, 2003, at 05:01  PM, Pleasant, Tracy wrote:
>
> > Ok I realized the Simple Analyzer does not index numbers, so I switched
> > back to Standard.
> >
> > -Original Message-
> > From: Pleasant, Tracy
> > Sent: Thursday, December 04, 2003 4:53 PM
> > To: Lucene Users List
> > Subject: Returning one result
> >
> >  I am indexing a group of items and one field, id, is unique.  When the
> >  user clicks on a result I want just that one result to show.
> >
> >  I index and search using SimpleAnalyzer.
> >
> >  Query query_es = QueryParser.parse(query, "id", new SimpleAnalyzer());
> >
> >  It should return only one result but returns 200.







RE: Returning one result

2003-12-05 Thread Pleasant, Tracy
Maybe I should have been more clear.

static Field Keyword(String name, String value) 
  Constructs a String-valued Field that is not tokenized, but is
indexed and stored. 

I need to have it tokenized because people will search for that also and
it needs to be searchable. 

Should I have two fields - one as a keyword and one as text? 


How would I do that when I want to return search results..

Right now, in the results page it will have something like
Record AR334 

Then in display_record.jsp:
 Searcher searcher = new IndexSearcher("index");
 String term = request.getParameter("id");

 Query query = QueryParser.parse(term, "id", new StandardAnalyzer());

 Hits hits  = searcher.search(query);

Would it have to be something like:
 TermQuery query = ???

or 
 Query query = QueryParser.Term("id");

? ? ? 

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 04, 2003 6:18 PM
To: Lucene Users List
Subject: Re: Returning one result


You really should use a TermQuery in this case anyway, rather than 
using QueryParser.  You wouldn't have to worry about the analyzer at 
that point anyway (and I assume you're using Field.Keyword during 
indexing).

Erik


On Thursday, December 4, 2003, at 05:01  PM, Pleasant, Tracy wrote:

> Ok I realized the Simple Analyzer does not index numbers, so I switched
> back to Standard.
>
> -Original Message-
> From: Pleasant, Tracy
> Sent: Thursday, December 04, 2003 4:53 PM
> To: Lucene Users List
> Subject: Returning one result
>
>
>  I am indexing a group of items and one field, id, is unique.  When the
>  user clicks on a result I want just that one result to show.
>
>  I index and search using SimpleAnalyzer.
>
>
>  Query query_es = QueryParser.parse(query, "id", new SimpleAnalyzer());
>
>  It should return only one result but returns 200.



RE: Returning one result

2003-12-05 Thread Pleasant, Tracy
Actually Erik, no, I'm using Field.Text.
When I used Field.Keyword and tried to get the value to return with the
search results, it would not display correctly... 

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 04, 2003 6:18 PM
To: Lucene Users List
Subject: Re: Returning one result


You really should use a TermQuery in this case anyway, rather than 
using QueryParser.  You wouldn't have to worry about the analyzer at 
that point anyway (and I assume you're using Field.Keyword during 
indexing).

Erik


On Thursday, December 4, 2003, at 05:01  PM, Pleasant, Tracy wrote:

> Ok I realized the Simple Analyzer does not index numbers, so I switched
> back to Standard.
>
> -Original Message-
> From: Pleasant, Tracy
> Sent: Thursday, December 04, 2003 4:53 PM
> To: Lucene Users List
> Subject: Returning one result
>
>
>  I am indexing a group of items and one field, id, is unique.  When the
>  user clicks on a result I want just that one result to show.
>
>  I index and search using SimpleAnalyzer.
>
>
>  Query query_es = QueryParser.parse(query, "id", new SimpleAnalyzer());
>
>  It should return only one result but returns 200.



RE: Returning one result

2003-12-05 Thread Pleasant, Tracy
Ok thanks, but still I can't use the Simple analyzer since it won't even
index that whole thing. I'll give TermQuery a try. Thanks.

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 04, 2003 6:18 PM
To: Lucene Users List
Subject: Re: Returning one result


You really should use a TermQuery in this case anyway, rather than 
using QueryParser.  You wouldn't have to worry about the analyzer at 
that point anyway (and I assume you're using Field.Keyword during 
indexing).

Erik


On Thursday, December 4, 2003, at 05:01  PM, Pleasant, Tracy wrote:

> Ok I realized the Simple Analyzer does not index numbers, so I switched
> back to Standard.
>
> -Original Message-
> From: Pleasant, Tracy
> Sent: Thursday, December 04, 2003 4:53 PM
> To: Lucene Users List
> Subject: Returning one result
>
>
>  I am indexing a group of items and one field, id, is unique.  When the
>  user clicks on a result I want just that one result to show.
>
>  I index and search using SimpleAnalyzer.
>
>
>  Query query_es = QueryParser.parse(query, "id", new SimpleAnalyzer());
>
>  It should return only one result but returns 200.



Re: Index and Field.Text

2003-12-05 Thread Erik Hatcher
On Friday, December 5, 2003, at 09:48  AM, Grant Ingersoll wrote:
> I have seen the example SAX based XML processing in the Lucene sandbox 
> (thanks to the authors for contributing!) and have successfully 
> adapted this approach for my application.  The one thing that does not 
> sit well with me is the fact that I am using the method 
> Field.Text(String, String) instead of the Field.Text(String, Reader) 
> version, which means I am storing the contents in the index.

So use Field.UnStored(String, String) then.  It is the same as 
Field.Text(String, Reader).

The static "factory" methods on Field are merely for convenience.  You 
can control all the flags yourself using the constructor:

public Field(String name, String string,
   boolean store, boolean index, boolean token)
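
To make the flag combinations concrete, here is a small stand-in written for this note. FieldFlagsSketch is not Lucene's Field class and its method names are illustrative; it only records which of the three booleans each factory method in this thread is described as setting (Keyword: stored and indexed but not tokenized; Text(String, String): stored, indexed and tokenized; UnStored and Text(String, Reader): indexed and tokenized but not stored).

```java
// Stand-in that mirrors the three flags of the constructor quoted above.
// It is not Lucene's Field class; names are illustrative.
class FieldFlagsSketch {
    final String name;
    final boolean store;
    final boolean index;
    final boolean token;

    FieldFlagsSketch(String name, boolean store, boolean index, boolean token) {
        this.name = name;
        this.store = store;
        this.index = index;
        this.token = token;
    }

    // Stored and indexed as a single term (the Keyword case).
    static FieldFlagsSketch keyword(String name) {
        return new FieldFlagsSketch(name, true, true, false);
    }

    // Stored, indexed and tokenized (the Text(String, String) case).
    static FieldFlagsSketch text(String name) {
        return new FieldFlagsSketch(name, true, true, true);
    }

    // Indexed and tokenized but not stored (UnStored, and Text(String, Reader)).
    static FieldFlagsSketch unStored(String name) {
        return new FieldFlagsSketch(name, false, true, true);
    }

    @Override
    public String toString() {
        return name + " store=" + store + " index=" + index + " token=" + token;
    }

    public static void main(String[] args) {
        System.out.println(keyword("id"));
        System.out.println(text("contents"));
        System.out.println(unStored("contents"));
    }
}
```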

> 2. If storing the content is going to adversely affect searching, has 
> anyone written an XMLReader that extends java.io.Reader.

You could always use a StringReader wrapper :))

	Erik



class definition used in Lucene

2003-12-05 Thread Shengli.Wu

hi, 

I have trouble understanding some class definitions in Lucene
(see the end of this e-mail for the source code).

A class "FilterIndexReader" is defined at 1. 
Then "FilterTermDocs" is defined as a nested static class at 2.
 
At 3, 

public FilterTermDocs(TermDocs in) 

is a constructor. What I do not understand is the following: 

1. Given that FilterTermDocs is a static class, why does it have a 
constructor at 3?

2. Why can we use (TermDocs in) for the constructor at 3? Here "TermDocs" is 
an interface; does that mean "in" is an object of "TermDocs"?

Thanks in advance for your help!

Best,

Shengli


1 public class FilterIndexReader extends IndexReader {

  /** Base class for filtering {@link TermDocs} implementations. */
2  public static class FilterTermDocs implements TermDocs {
     protected TermDocs in;

3    public FilterTermDocs(TermDocs in) { this.in = in; }
...
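
Both questions are about plain Java rather than Lucene, so a self-contained sketch may help. The names NumberSource, OuterSketch and FilterSource are invented for this note; only the shape (a static nested class that implements an interface and wraps another object of that interface type) follows the quoted code. On a nested class, "static" means that instances carry no hidden reference to an instance of the enclosing class; the class can still be instantiated and can have constructors. A parameter whose type is an interface accepts any object whose class implements that interface.

```java
// Hypothetical names throughout; this mirrors the shape of the quoted code, not Lucene itself.
interface NumberSource {
    int next();
}

class OuterSketch {
    // "static" here only means: no hidden reference to an OuterSketch instance.
    // The nested class is still an ordinary class and can have constructors.
    static class FilterSource implements NumberSource {
        protected NumberSource in;

        // "in" may be any object whose class implements NumberSource.
        FilterSource(NumberSource in) { this.in = in; }

        // Delegate to the wrapped object; a subclass can override this to change behaviour.
        public int next() { return in.next(); }
    }

    public static void main(String[] args) {
        NumberSource constant = new NumberSource() {
            public int next() { return 42; }
        };
        // No OuterSketch instance is needed to create the nested class.
        NumberSource wrapped = new OuterSketch.FilterSource(constant);
        System.out.println(wrapped.next()); // 42
    }
}
```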







RE: Index and Field.Text

2003-12-05 Thread Chong, Herb
you are storing the same information both ways. the string gets analyzed and 
discarded, just like with the Reader.

Herb...

-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Friday, December 05, 2003 9:49 AM
To: [EMAIL PROTECTED]
Subject: Index and Field.Text


Hi,

I have seen the example SAX based XML processing in the Lucene sandbox (thanks to the 
authors for contributing!) and have successfully adapted this approach for my 
application.  The one thing that does not sit well with me is the fact that I am using 
the method Field.Text(String, String) instead of the Field.Text(String, Reader) 
version, which means I am storing the contents in the index.




Index and Field.Text

2003-12-05 Thread Grant Ingersoll
Hi,

I have seen the example SAX based XML processing in the Lucene sandbox (thanks to the 
authors for contributing!) and have successfully adapted this approach for my 
application.  The one thing that does not sit well with me is the fact that I am using 
the method Field.Text(String, String) instead of the Field.Text(String, Reader) 
version, which means I am storing the contents in the index.

Some questions:

1. Should I care?  What is the cost of storing the contents of these files versus 
using the Reader based method?  Presumably, the index size is going to be larger, but 
will it adversely affect search time?  If yes, how much so (relatively speaking)?

2. If storing the content is going to adversely affect searching, has anyone written 
an XMLReader that extends java.io.Reader?  I guess it would need to take in the name 
of the tag(s) that you want the reader to retrieve and then extend all of the 
java.io.Reader results to return values based on just the tag values that I am 
interested in.  Has anyone taken this approach?  If not, does it at least seem like a 
valid approach?

Thanks for your help!

-Grant Ingersoll






Re: Range Query

2003-12-05 Thread Erik Hatcher
On Friday, December 5, 2003, at 06:54  AM, Ramrakhiani, Vikas wrote:
> Hi,
> When I do range query like id:[0* to 9*] the result set exclude documents
> having id 0, 90 ... i.e boundary values are excluded.
> Is it expected or am I going wrong some where.

It is expected.  You're thinking that wildcards work on range queries.  
They do not.  You are literally starting the range at "0*", which is 
greater than "0" lexicographically.  If you are doing number ranges, 
though, you probably want to do some padding with leading zeros so all 
numbers have the same string size.
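
The string ordering at work can be checked with plain Java. The sketch below does not use Lucene; RangeSketch, inRange and pad are names made up here, and inRange simply applies String.compareTo the way an inclusive range over string terms would. It shows why "0" and "90" fall outside the literal bounds "0*" and "9*", and how zero padding makes string order agree with numeric order.

```java
import java.util.Locale;

// Plain string comparisons only; a sketch of the ordering, not Lucene's range query.
class RangeSketch {
    // Inclusive range test using plain string (lexicographic) order.
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    // Left-pad a non-negative number with zeros to a fixed width.
    static String pad(int n, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", n);
    }

    public static void main(String[] args) {
        System.out.println(inRange("0", "0*", "9*"));   // false: "0" sorts before "0*"
        System.out.println(inRange("90", "0*", "9*"));  // false: "90" sorts after "9*"
        System.out.println(inRange("5", "0*", "9*"));   // true
        System.out.println(inRange("90", "0", "100"));  // false: unpadded "90" sorts after "100"
        System.out.println(inRange(pad(90, 3), pad(0, 3), pad(100, 3))); // true: "090" is between "000" and "100"
    }
}
```

The padding width has to be chosen once, large enough for the biggest id, and applied both when indexing and when building the range bounds.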

	Erik



RE: Probabilistic Model in Lucene - possible?

2003-12-05 Thread Chong, Herb
anyone interested, contact me offline. whoever contacts me by the end of next week, 
i'll email an outline of the derivation and we can discuss it in private emails. i 
guarantee, you will learn something interesting about search engines.

Herb

-Original Message-
From: Adam Saltiel [mailto:[EMAIL PROTECTED]
Sent: Friday, December 05, 2003 3:46 AM
To: 'Lucene Users List'
Subject: RE: Probabilistic Model in Lucene - possible?


Herb,
Any one game ... ?
No takers? I would be very interested, but maybe beyond what can be
posted in a mail list. I'd be equally interested in any references you
may have.




RE: Probabilistic Model in Lucene - possible?

2003-12-05 Thread Shengli.Wu
Dear all,

I am interested in implementing a probabilistic model in Lucene as well.
I checked the book titled "Modern Information Retrieval" authored by Ricardo 
Baeza-Yates and Berthier Ribeiro-Neto, and it seems to me that the 
implementation is not very complicated when we use Lucene's IndexReader 
class; almost all the parameters needed are there: the total number of 
documents in the index (collection) and the number of documents having a 
particular term. Probably we also need to find a satisfactory method
of defining the weights of terms in the documents as well as in the query.
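
As a rough sketch of how far those two counts go: in the classic probabilistic model, if one assumes no relevance information is available (every term equally likely to appear or not in a relevant document, and the non-relevant documents approximated by the whole collection), the weight of a term reduces to log((N - n) / n), with N the number of documents and n the number of documents containing the term. The Java below only computes that quantity; the class and method names are made up, and nothing here is Lucene code or a replacement for its scoring.

```java
// Term weight under the "no relevance information" assumptions described above.
// Names are illustrative; this is not Lucene's Similarity.
class ProbabilisticWeightSketch {
    // numDocs = N, documents in the collection; docFreq = n, documents containing the term.
    static double termWeight(int numDocs, int docFreq) {
        if (docFreq <= 0 || docFreq >= numDocs) {
            throw new IllegalArgumentException("need 0 < docFreq < numDocs");
        }
        return Math.log((double) (numDocs - docFreq) / docFreq);
    }

    public static void main(String[] args) {
        System.out.println(termWeight(1000, 10));  // rare term: large positive weight
        System.out.println(termWeight(1000, 500)); // term in half the documents: 0.0
        System.out.println(termWeight(1000, 900)); // very common term: negative weight
    }
}
```

A document's score for a query would then be the sum of these weights over the query terms it contains. The negative weights for terms found in more than half of the documents are one reason practical variants adjust the formula.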

Cheers,

Shengli



Adam Saltiel <[EMAIL PROTECTED]> said:

> Herb,
> Any one game ... ?
> No takers? I would be very interested, but maybe beyond what can be
> posted in a mail list. I'd be equally interested in any references you
> may have.
> As we are on this subject how does LSI and the similar CNG (context
> network graph) fit into the model used by lucene. Could lucene be
> massaged to implement different mathematical models of search and
> retrieval, if so how modular are the core functions?
> 
> Adam Saltiel
> 
> 
> > -Original Message-
> > From: Chong, Herb [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, December 04, 2003 1:53 PM
> > To: Lucene Users List
> > Subject: RE: Probabilistic Model in Lucene - possible?
> >
> > not all tf/idf variants are probabilistic models, but a great many
> > are if the term weights are probabilities. if we just take straight,
> > unmodified Term Frequency in a document, Inverse Document Frequency
> > in the corpus, and the Term Frequency in the query as 1, you are in
> > fact comparing the statistical properties of the query against the
> > statistical properties of the query. they are probabilities you are
> > comparing. i can't think of many papers that come right out and say
> > it, but if you look at an individual term weight and can interpret
> > it as a genuine probability, the vector space model based on the
> > weights is a probabilistic model. the derivation is relatively
> > straight forward to show it, if you have the right general model to
> > start with. once you start throwing in ad hoc normalizations, then
> > things get out of whack and it's no longer a probabilistic model.
> >
> > the implementations that i have done are with a former company and
> > that means secret and protected by various intellectual property
> > rights. however, i can sketch here the general approach one has to
> > take and an outline of the derivation that unifies probabilistic
> > models with vector space models and at the same time incorporate
> > pairwise interterm correlation. in fact, the pairwise interterm
> > correlations are a fundamental assumption. once you do all this, you
> > can show that the traditional vector space model is a special case
> > of a pairwise interterm correlation model. for those that are
> > interested in advanced matrix algebra and some basic statistics, it
> > should be very interesting. if only i had a published paper, i would
> > post it. unfortunately, what i have is very obtuse because it's
> > protected. the only paper that started out was submitted to SIGIR
> > but rejected by all but one referee. that one thought this was a
> > tremendous unification of the two methods, but academic journals
> > being what they are, when 4 out of 5 referees can't understand the
> > paper, it doesn't get published. i may brush it off and enlarge into
> > a much longer paper for the Journal of IR, but once again, unless
> > you are comfortable with probability theory and matrix theory, you
> > are not going to follow it.
> >
> > so, who is game for a tutorial on the derivation?
> >
> > Herb...
> >
> > -Original Message-
> > From: Karsten Konrad [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, December 04, 2003 5:09 AM
> > To: Lucene Users List
> > Subject: AW: Probabilistic Model in Lucene - possible?
> >
> >
> >
> > Hi Herb,
> >
> > thank you for your insights.
> >
> > >>
> > but by most accepted definitions, the tf/idf model in Lucene is a
> > probabilistic model.
> > >>
> >
> > Can you send some pointers to help me understand that? Are all
> > TF/IDF-variants probabilistic models? If so, what makes any model a
> > non-probabilistic one? If you claim that TF/IDF is probabilistic,
> > then the plain cosine (an extreme form of TF/IDF, with IDF for all
> > terms being considered constant) of VSM would also be a
> > probabilistic model.
> >
> > >>
> > it's got strange normalizations though that doesn't allow
> > comparisons of rank values across queries.
> > >>
> >
> > Lucene's internal ranking sometimes returns values > 1.0, these are
> > then normalized to 1.0, adjusting other rankings accordingly. While
> > I have nothing to say against this - it's a hack, but useful - it
> > makes comparing the rank values across queries really difficult.
> > It's like using different scales whenever you measure somet

Range Query

2003-12-05 Thread Ramrakhiani, Vikas
Hi,
When I do a range query like id:[0* to 9*] the result set excludes documents
having id 0, 90 ... i.e. boundary values are excluded.
Is this expected or am I going wrong somewhere?
thanks,
vikas. 






Re: Testing for Optimization

2003-12-05 Thread jt oob
 --- Dror Matalon <[EMAIL PROTECTED]> wrote:
> I believe that indexes that are optimized have only one segment. So in
> theory you could check and see that you only have one file with a
> ".fdt", ".fdx", etc. 

If I run `cat /index_dir/segments` on an optimized index there is only 
one string in there. It matches up with the prefix of files in the index
directory.

If I run the same on an un-optimized index dir then I get back several
strings.

There are more files in the optimized index dir than just the ones with
the prefix listed in the segments file.
Have I corrupted my indexes?
Can I safely delete those files which do not have the prefix listed in
the segments file?

Thanks,
jt





RE: Probabilistic Model in Lucene - possible?

2003-12-05 Thread Adam Saltiel
Herb,
Any one game ... ?
No takers? I would be very interested, but maybe beyond what can be
posted in a mail list. I'd be equally interested in any references you
may have.
As we are on this subject, how do LSI and the similar CNG (context
network graph) fit into the model used by Lucene? Could Lucene be
massaged to implement different mathematical models of search and
retrieval, and if so, how modular are the core functions?

Adam Saltiel


> -Original Message-
> From: Chong, Herb [mailto:[EMAIL PROTECTED]
> Sent: Thursday, December 04, 2003 1:53 PM
> To: Lucene Users List
> Subject: RE: Probabilistic Model in Lucene - possible?
>
> not all tf/idf variants are probabilistic models, but a great many are
> if the term weights are probabilities. if we just take straight,
> unmodified Term Frequency in a document, Inverse Document Frequency in
> the corpus, and the Term Frequency in the query as 1, you are in fact
> comparing the statistical properties of the query against the
> statistical properties of the query. they are probabilities you are
> comparing. i can't think of many papers that come right out and say it,
> but if you look at an individual term weight and can interpret it as a
> genuine probability, the vector space model based on the weights is a
> probabilistic model. the derivation is relatively straight forward to
> show it, if you have the right general model to start with. once you
> start throwing in ad hoc normalizations, then things get out of whack
> and it's no longer a probabilistic model.
>
> the implementations that i have done are with a former company and that
> means secret and protected by various intellectual property rights.
> however, i can sketch here the general approach one has to take and an
> outline of the derivation that unifies probabilistic models with vector
> space models and at the same time incorporate pairwise interterm
> correlation. in fact, the pairwise interterm correlations are a
> fundamental assumption. once you do all this, you can show that the
> traditional vector space model is a special case of a pairwise
> interterm correlation model. for those that are interested in advanced
> matrix algebra and some basic statistics, it should be very
> interesting. if only i had a published paper, i would post it.
> unfortunately, what i have is very obtuse because it's protected. the
> only paper that started out was submitted to SIGIR but rejected by all
> but one referee. that one thought this was a tremendous unification of
> the two methods, but academic journals being what they are, when 4 out
> of 5 referees can't understand the paper, it doesn't get published. i
> may brush it off and enlarge into a much longer paper for the Journal
> of IR, but once again, unless you are comfortable with probability
> theory and matrix theory, you are not going to follow it.
>
> so, who is game for a tutorial on the derivation?
>
> Herb...
>
> -Original Message-
> From: Karsten Konrad [mailto:[EMAIL PROTECTED]
> Sent: Thursday, December 04, 2003 5:09 AM
> To: Lucene Users List
> Subject: AW: Probabilistic Model in Lucene - possible?
>
>
>
> Hi Herb,
>
> thank you for your insights.
>
> >>
> but by most accepted definitions, the tf/idf model in Lucene is a
> probabilistic model.
> >>
>
> Can you send some pointers to help me understand that? Are all
> TF/IDF-variants probabilistic models? If so, what makes any model a
> non-probabilistic one? If you claim that TF/IDF is probabilistic, then
> the plain cosine (an extreme form of TF/IDF, with IDF for all terms
> being considered constant) of VSM would also be a probabilistic model.
>
> >>
> it's got strange normalizations though that doesn't allow comparisons
> of rank values across queries.
> >>
>
> Lucene's internal ranking sometimes returns values > 1.0, these are
> then normalized to 1.0, adjusting other rankings accordingly. While I
> have nothing to say against this - it's a hack, but useful - it makes
> comparing the rank values across queries really difficult. It's like
> using different scales whenever you measure something different, and
> then you do not tell anyone about it.
>
> >>
> it isn't terribly hard to make a normalized probabilistic model that
> allows comparing of document scores across queries and assign a meaning
> to the score. i've done it.
> >>
>
> Stop bragging, send us your Similarity implementation :)
>
> Regards,
>
> Karsten
>
>
> -Ursprüngliche Nachricht-
> Von: Chong, Herb [mailto:[EMAIL PROTECTED]
> Gesendet: Mittwoch, 3. Dezember 2003 23:01
> An: Lucene Users List
> Betreff: RE: Probabilistic Model in Lucene - possible?
>
>
> i think i am missing the original question, but by most accepted
> definitions, the tf/idf model in Lucene is a probabilistic model. it's
> got strange normalizations though that doesn't allow comparisons of
> rank values across queries.
>
> it isn't terribly hard to make a normalized probabilistic model that
> allows comparing of docum