Re: sorting by score and an additional field

2004-11-04 Thread Daniel Naber
On Thursday 04 November 2004 03:52, Chris Fraschetti wrote:

 I can only get it to sort by one or the other... but when it does one,
 it does sort correctly, but together in {score, custom_field} only the
 first sort seems to apply.

Do you use real documents for that test? The score is a float value and 
it's hardly ever the same for two documents (unless you use very short 
test documents), so that's why the second field may not be used for 
sorting.

regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Faster highlighting with TermPositionVectors

2004-11-04 Thread mark harwood
Hi Aviran,
The code you are calling assumes that you have indexed with TermVector
support for offsets (and optionally positions), i.e. code like this:

doc.add(new Field("contents", content,
    Field.Store.COMPRESS, Field.Index.TOKENIZED,
    Field.TermVector.WITH_POSITIONS_OFFSETS));
 
If you haven't stored offsets then the getTermFreqVector method returns a
TermFreqVector rather than the TermPositionVector subclass, hence the
ClassCastException. I should tighten up that section of code to check for
this situation and throw an exception with a suitable message.
 
By the way, the getAnyTokenStream method is coded a little more
defensively and will silently drop back to re-analyzing (parsing) the
original content if it is asked to get a TokenStream for a field that
doesn't have offset data stored. This is probably the safest way to code
your app, and the cost of the logic which checks the field storage type
is minimal.
 
Cheers 
Mark



Re: Faster highlighting with TermPositionVectors

2004-11-04 Thread Erik Hatcher
Mark,
This is great stuff!
One quick comment just from my look at the code (I haven't tried it yet):
shouldn't the tpv variable be used in this method?

public static TokenStream getAnyTokenStream(IndexReader reader, int docId,
        String field, Analyzer analyzer) throws IOException
{
    TokenStream ts = null;

    TermFreqVector tfv = (TermFreqVector) reader.getTermFreqVector(docId, field);
    if (tfv != null)
    {
        if (tfv instanceof TermPositionVector)
        {
            //the most efficient choice..
            TermPositionVector tpv = (TermPositionVector)
                    reader.getTermFreqVector(docId, field);
            ts = getTokenStream(reader, docId, field);
        }
    }
    //No token info stored so fall back to analyzing raw content
    if (ts == null)
    {
        ts = getTokenStream(reader, docId, field, analyzer);
    }
    return ts;
}

Erik
On Oct 28, 2004, at 7:16 PM, [EMAIL PROTECTED] wrote:
Thanks to the recent changes (see CVS) in TermFreqVector support we 
can now make use of term offset information held
in the Lucene index rather than incurring the cost of re-analyzing 
text to highlight it.

I have created a class (see
http://www.inperspective.com/lucene/TokenSources.java) which handles
creating a TokenStream from the TermPositionVector stored in the index,
which can then be passed to the highlighter.
This approach is significantly faster than re-parsing the original
text.
If people are happy with this class I'll add it to the Highlighter
sandbox, but it may sit better elsewhere in the Lucene code base
as a more general-purpose utility.

BTW, as part of putting this together I found that the TermFreq code
throws a NullPointerException when indexing fields that produce no
tokens (i.e. empty or all stopwords). Otherwise things work very well.

Cheers
Mark



A TokenFilter to split words and numbers

2004-11-04 Thread william.sporrong
Hi,

I'm trying to implement a TokenFilter that splits words that contain
numbers into a phrase with separate words and numbers. For example I want
to turn v70 into the phrase "v 70". I've implemented a filter that does
the actual split with a regular expression. Then I use this filter in my
analyzer, which is passed to the QueryParser. The resulting Query looks
fine, +(+words:"v 70"), but it does not return any Hits. If I instead pass
in the input string "v 70" (ignored by the filter) the resulting query
looks the same, but I get Hits. Why is this? Does it have something to do
with the QueryParser guessing what kind of query it is by examining the
string, and thus presuming that the first string should not be parsed into
a PhraseQuery?

 

Anyway, if there is a correct way to accomplish what I want, could anyone
please give me a hint? One way I thought about is pre-parsing the query,
constructing several subqueries, i.e. PhraseQuerys and so on, and then
combining them in a BooleanQuery, but I guess there is a nicer solution?

 

I have a similar problem with another Filter I'm trying to implement that
should remove certain suffixes and replace them with a wildcard
(bilar -> bil*).

 

/William



Re: A TokenFilter to split words and numbers

2004-11-04 Thread Morus Walter
william.sporrong writes:

 Does it have something to do with the
 QueryParser guessing what kind of query it is by examining the string and
 thus presumes that the first string should not be parsed into a PhraseQuery?

QueryParser creates a PhraseQuery for words that are tokenized to more
than one token.
You should see that in the serialized query.
  
 
 Anyways if there is a correct way to accomplish what I want could anyone
 please give me a hint? One way I thought about is preparsining the query and
 construct several subqueries i.e PhraseQuerys and so on and then combine
 them in a BooleanQuery but I guess there is a nicer solution?
 
I guess you could override the getFieldQuery method of QueryParser
and change the way queries are generated.
  
 
 I have a similar problem with another Filter I'm trying to implement
 that should remove certain suffixes and replace them with a wildcard
 (bilar -> bil*).
 
If you expect bil* to be executed as a wildcard/prefix query, this
cannot work. The query parser parses the query, not the analyzer output.
Again you might introduce such behaviour in getFieldQuery.
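For illustration, such a getFieldQuery override might look roughly like
the sketch below. This is hypothetical, untested code against the Lucene
1.4 API (the protected getFieldQuery signature has varied between Lucene
versions), and the suffix rule is just a placeholder for the real
stemming logic:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

// Hypothetical subclass: terms ending in a known suffix become prefix queries,
// e.g. "bilar" -> PrefixQuery on "bil".
public class SuffixStrippingQueryParser extends QueryParser {

    public SuffixStrippingQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
    }

    protected Query getFieldQuery(String field, String queryText)
            throws ParseException {
        // Placeholder rule: strip a trailing "ar" and search by prefix instead.
        if (queryText.endsWith("ar") && queryText.length() > 2) {
            String stem = queryText.substring(0, queryText.length() - 2);
            return new PrefixQuery(new Term(field, stem));
        }
        // Everything else goes through the normal analysis path.
        return super.getFieldQuery(field, queryText);
    }
}
```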

Morus




RE: Efficient search on lucene mailing archives

2004-11-04 Thread Sreedhar, Dantam
When I want to search for anything I use the following URL.

http://marc.theaimsgroup.com/

-Sreedhar

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Friday, October 15, 2004 2:18 AM
To: Lucene Users List
Subject: Re: Efficient search on lucene mailing archives



On Oct 14, 2004, at 4:27 PM, David Spencer wrote:

 sam s wrote:

 Hi Folks,
 Is there any place where I can do a better search on lucene mailing 
 archives?
 I tried JGuru and it looks like their search is paid.
 The Apache-maintained archives lack efficient searching.

 Of course one of the ironies is, shouldn't we be able to use Lucene to

 search the mailing list archives and even apache.org?

Eyebrowse uses Lucene and is set up for the Apache e-mail lists:

http://nagoya.apache.org/eyebrowse/SummarizeList?listId=30

It seems clunky to navigate though and would be nice to have more 
recent e-mails ranked higher than older mails.

Erik





one huge index or many small ones?

2004-11-04 Thread javier muguruza
Hi,

We are going to move from a just-in-time Perl-based search to using
Lucene in our project. I have to index emails (bodies and also
attachments). I keep all the bodies and attachments in the filesystem
for a long period of time. I have to find emails that fulfill certain
conditions; some of the conditions are taken care of at a different
level, so in the end I have a SUBSET of emails I have to run through
Lucene.

I was assuming that the best way would be to create an index for each
email. Having a unique index for a group of emails (say a day's worth
of email) seems too coarse-grained: imagine a day has 10,000 emails,
and some queries will only want to look in a handful of the emails...
But the problem with having one index per email is the massive number
of indexes: 10,000 new ones every day...

Anyway, any idea about that? I just wanted to check whether someone
feels I am wrong.

Thanks




Re: one huge index or many small ones?

2004-11-04 Thread Erik Hatcher
One index per e-mail is way overkill and probably not even feasible
resource-wise.  Take advantage of fields in Lucene documents and use
BooleanQuery to AND in other criteria for filtering, or use a Filter if
the filtering criteria are relatively static.
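For illustration, the two options just mentioned might be sketched like
this (untested code against the Lucene 1.4 API; the field names and
terms are made up):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;

public class FilteredSearchSketch {

    public static Hits search(Searcher searcher) throws java.io.IOException {
        // Option 1: AND the extra criterion straight into the query
        // (add(query, required, prohibited) is the 1.4 signature).
        BooleanQuery q = new BooleanQuery();
        q.add(new TermQuery(new Term("body", "money")), true, false);

        // Option 2: express a relatively static criterion as a reusable Filter.
        Filter johnOnly = new QueryFilter(new TermQuery(new Term("from", "john")));

        return searcher.search(q, johnOnly);
    }
}
```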

Erik
On Nov 4, 2004, at 11:00 AM, javier muguruza wrote:
[quoted message snipped]



Re: one huge index or many small ones?

2004-11-04 Thread Giulio Cesare Solaroli
Hi Javier,

I suggest you build a single index, with all the information you
need to find the right mail you are looking for. You can then use
Lucene alone to find your messages.

Giulio Cesare


On Thu, 4 Nov 2004 17:00:35 +0100, javier muguruza [EMAIL PROTECTED] wrote:
 [quoted message snipped]
 





Re: one huge index or many small ones?

2004-11-04 Thread javier muguruza
Thanks Erik and Giulio for the fast reply.

I am just starting to look at Lucene, so forgive me if I have some ideas
wrong. I understand your concerns about one index per email. But having
only one index is also (I guess) out of the question.

I am building an email archive. Email will be kept indefinitely
available for search, adding new email every day. Imagine a company
with millions of emails per day (been there), keep it growing for
years, adding stuff to the index while using it for searches
continuously...

That's why my idea is to decide on a time frame (a day, a month... an
extreme would be an instant, that is, a single email, my original idea)
and build the index for all the email in that timeframe. After the
timeframe is finished no more stuff will ever be added.

Before the Lucene search, emails are selected based on other conditions
(we store the from, to, date etc. in a database as well, and these
conditions are enforced with a SQL query first, so I would not need to
enforce them in the Lucene search again; also that query can be quite
sophisticated and I guess would not be easily possible in Lucene by
itself). That first db step gives me a group of emails that maybe I
have to further narrow down based on a Lucene search (of body and
attachment contents). Having an index for more than one email means
that after the search I would have to keep only the overlapping emails
from the two searches... Maybe this is better than keeping the same
info I have in the db in Lucene fields as well.

An example: I want all the email from [EMAIL PROTECTED] from Jan
to Dec containing the word 'money'. I run the db query, which returns a
list with john's email for that period of time; then (let's assume I
have one index per day) I iterate over every day, looking for emails
that contain 'money', and from the results returned by Lucene I keep
only those that are also in the first list.

Does that sound better?
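The final intersection step described above (keep only the Lucene hits
whose ids also came back from the db query) is cheap with hash sets; a
small plain-Java illustration, with made-up message ids and a
hypothetical class name:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HitIntersection {

    // Keep only the ids that appear in BOTH result lists:
    // the db query result and the Lucene hit list.
    public static Set intersect(List dbIds, List luceneIds) {
        Set keep = new HashSet(dbIds);          // ids matching the SQL conditions
        keep.retainAll(new HashSet(luceneIds)); // drop hits outside the db subset
        return keep;
    }

    public static void main(String[] args) {
        List fromDb = Arrays.asList(new String[] {"m1", "m2", "m5"});
        List fromLucene = Arrays.asList(new String[] {"m2", "m3", "m5"});
        System.out.println(intersect(fromDb, fromLucene)); // m2 and m5 survive
    }
}
```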


On Thu, 4 Nov 2004 17:26:21 +0100, Giulio Cesare Solaroli
[EMAIL PROTECTED] wrote:
 Hi Javier,
 
 I suggest you to build a single index, with all the information you
 need to find the right mail you are looking for. You than can use
 Lucene alone to find you messages.
 
 Giulio Cesare
 
 
 
 
  [quoted original message snipped]
 
 





Re: one huge index or many small ones?

2004-11-04 Thread Justin Swanhart
First off, I think you should make a decision about what you want to
store in your index and how you go about searching it.

The less information you store in your index, the better, for
performance reasons.  If you can store the messages in an external
database you probably should.  I would create a table that contains a
clob and an associated id that can be used to get the message at any
time.

Assuming mail is in SMTP RFC format:

I would suggest:
Unstored: Subject
Keyword: From
Keyword: To
Stored,Unindexed: ID  -- this would be the ID to the message in your database
Unstored: Body 
Keyword: Month
Keyword: Day
Keyword: Year
(and any other keywords you might use)

Your lucene query would then look something like:
+From:[EMAIL PROTECTED] +(Subject:money Body:money) +Year:2004

Use the stored ID field to get the message contents from your database.

If you want to break your index down into multiple indexes, based on
some criteria such as time frame you could do that too.  You would
then use a MultiSearcher or ParallelMultiSearcher to process the
multiple indexes.
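A sketch of that field layout using the Lucene 1.4 Field factory
methods (untested; the parameter names are placeholders for however you
pull the values out of the SMTP message):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class MailDocFactory {

    // Build a Lucene Document matching the field layout above.
    public static Document build(String subject, String from, String to,
                                 String dbId, String body,
                                 String month, String day, String year) {
        Document doc = new Document();
        doc.add(Field.UnStored("Subject", subject)); // indexed+tokenized, not stored
        doc.add(Field.Keyword("From", from));        // one untokenized term
        doc.add(Field.Keyword("To", to));
        doc.add(Field.UnIndexed("ID", dbId));        // stored only: key to the db clob
        doc.add(Field.UnStored("Body", body));
        doc.add(Field.Keyword("Month", month));
        doc.add(Field.Keyword("Day", day));
        doc.add(Field.Keyword("Year", year));
        return doc;
    }
}
```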


On Thu, 4 Nov 2004 18:03:49 +0100, javier muguruza [EMAIL PROTECTED] wrote:
 [quoted messages snipped]
 





prefix wildcard matching options (*blah)

2004-11-04 Thread Justin Swanhart
I'm thinking about making a separate field in my index for prefix
wildcard searches.
I would chop x characters off the front to create subtokens for
the prefix matches.

For the term: republican
terms created: republican epublican publican ublican blican

My query parser would then intelligently decide if there is a term
that has a wildcard as the first character of the term.  Instead of
searching the normal field, it would then remove the wildcard from the
start of the term and search on the prefix field instead.

A search for *pub* would be converted to pub* in the prefix field.  
A search for *blican would be converted to blican

Does this sound like an intelligent way to create fast prefix querying ability?

Can I index the prefix field with a separate analyzer that makes the
prefix tokens, or should I just do the index-time expansion manually?
I wouldn't need to search with this analyzer, just index with it,
because the searching doesn't have to expand all those terms.

If using a separate analyzer for the prefix field makes more sense, how
do I make a tokenizer that returns multiple tokens for one word?
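The subtoken generation itself is simple string chopping; a plain-Java
sketch of the index-time expansion (the TokenFilter wiring is left out,
and the minimum-length cutoff is an assumption):

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixSubtokens {

    // "republican" with minLength 6 ->
    //   [republican, epublican, publican, ublican, blican]
    // i.e. every suffix of the term that is at least minLength chars long.
    public static List subtokens(String term, int minLength) {
        List out = new ArrayList();
        for (int i = 0; i + minLength <= term.length(); i++) {
            out.add(term.substring(i));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(subtokens("republican", 6));
    }
}
```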




Re: one huge index or many small ones?

2004-11-04 Thread Sergiu Gordea
javier muguruza wrote:
Hi Javier,
I think your optimization should target the response time of search
queries; I assume that this is the variable you need to optimize. It is
probably a good idea to read the Lucene benchmarks first:
http://jakarta.apache.org/lucene/docs/benchmarks.html

If you have a mandatory date constraint in each of your searches you
can split the index on a time basis. I assume one index per month will
be enough ... with 10,000 emails I think it will be fast enough if you
search in only one index afterwards. But maybe this is not such a good
idea?

What about creating one index per user? If your searches require a user
or a sender, and you can get its name from the database, you could
apply only the other constraints on an index dedicated to that user ...
I think the Lucene search would be much faster.

The database search will also be fast ... I don't think you will have
more than 1,000-10,000 user names.

Or maybe 1 index/user/year,
or 1 index/receiver/year + 1 index/sender/year.
Would such a solution be feasible for your system?
All the best,
 Sergiu
[quoted messages snipped]
 




updating documents in the index

2004-11-04 Thread Chris Fraschetti
So I've read that the only way to change a field in an already indexed
document is to simply remove it and re-add it... but that can be costly
if I need to go back to where the data originally came from and
reparse and reindex it all.

Is there a way to keep the document around after the delete call to
the IndexReader so that I can modify a field and add it again with a
writer?

I would simply rip out all the fields and then create a new document,
but the 'content' field isn't stored, due to the fact that my index
would be much larger if I kept the content around.

Anyone have any good solutions, short of keeping the content in the
index or going back to the original document source?

Does 'luke' rebuild a document so that it can be updated? If so, how
does it go about it?

Thanks in advance everyone!

-Chris Fraschetti




Re: sorting by score and an additional field

2004-11-04 Thread Chris Fraschetti
Erik:

doc.add(Field.Keyword(rank_field, rank_value));

is what I use to build my customized rank field.


Considering the rank_value is an integer, should it be zero-padded?
Currently I have it padded because the rest of Lucene needs it that
way; should it be the same here?

If I specify INT or STRING, the sort on rank works just fine... but
it's when I combine the two that I have issues. I'm using 1.4.2... but
I'll see how my code differs from yours and give it a try. Can you
tell me how you indexed your secondary rank field? As a keyword, or
what have you?
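(For reference: a string rank field only sorts in numeric order if every
value is left-padded to the same fixed width. A tiny plain-Java
illustration, where the width of 10 is an arbitrary choice:)

```java
public class ZeroPad {

    // Pad a non-negative int to a fixed width so that lexicographic
    // (string) order agrees with numeric order.
    public static String pad(int value, int width) {
        StringBuffer sb = new StringBuffer(Integer.toString(value));
        while (sb.length() < width) {
            sb.insert(0, '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(pad(42, 10)); // 0000000042
    }
}
```

With SortField.INT the padding is unnecessary, since the values are
parsed as integers before comparison.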

Thanks,
Chris Fraschetti


On Thu, 4 Nov 2004 04:33:12 -0500, Erik Hatcher
[EMAIL PROTECTED] wrote:
 On Nov 3, 2004, at 9:52 PM, Chris Fraschetti wrote:
  Has anyone had any luck using lucene's built in sort functions to sort
  first by the lucene hit score and secondarily by a Field in each
  document indexed as Keyword and in integer form?
 
 I get multiple sort fields to work, here's two examples:
 
 new Sort(new SortField[]{
   new SortField("category"),
   SortField.FIELD_SCORE,
   new SortField("pubmonth", SortField.INT, true)
 });

 new Sort(new SortField[] {SortField.FIELD_SCORE, new
 SortField("category")})
 
 Both of these, on a tiny dataset of only 10 documents, work exactly as
 expected.
 
  I can only get it to sort by one or the other... but when it does one,
  it does sort correctly, but together in {score, custom_field} only the
  first sort seems to apply.
 
  Any ideas?
 
 Are you using Lucene 1.4.2?  How did you index your integer field?  Are
 you simply using the .toString() of an Integer?  Or zero padding the
 field somehow?  You can use the .toString method, but you have to be
 sure that the sorting code does the right parsing of it - so you might
 need to specify SortField.INT as its type.  It will do automatic
 detection if the type is not specified, but that assumes that the first
 document it encounters parses properly, otherwise it will fall back to
 using a String sort.
 
Erik
 
 
 
 
 
 


-- 
___
Chris Fraschetti, Student CompSci System Admin
University of San Francisco
e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu




Re: updating documents in the index

2004-11-04 Thread Andrzej Bialecki
Chris Fraschetti wrote:
So I've read that the only way to change a field in an already indexed
document is to simple remove it and readd it... but that can be costly
if I need to go back to where the data origionally came from and
reparse and reindex it all.
Yes.
Is there a way to keep the document around after the delete call to
the indexreader so that I can modify a field and add it again with a
writer?
Lucene does not provide this functionality (yet?). If you try to read 
from index a document that contains unstored fields, you will get 
nulls instead of their values. In other words, you cannot read the 
Document instance, modify it, and then add it - because you will lose 
all information from unstored fields. Also, when you re-add the 
document all fields need to be analyzed once again.

I would simple rip out all the fields and then create a new document,
but the 'content' field isn't stored due to the fact that my index
would be much larger if i kept the content around.
Anyone have any good solutions to do this short of keeping around the
content in the index or going back to the origional document source?
Does 'luke' rebuild a document so that it can be updated? If so, how
do they go about it.
They (me and Luke :) do it the hard way - we iterate over all terms in 
the index, and then iterate over all documents which contain that term. 
If the enumeration contains the selected doc number, terms and their 
positions are put in the target term array. After going through the 
whole index, we end up with an array containing all terms and every 
position of each term in the document. This array is then concatenated 
using spaces. That's it - not really a solution, rather a hack.

This could be sped up using term vectors (Lucene 1.4.x), but you first 
need to build your index with term vectors.
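A rough, untested sketch of that "hard way" iteration against the
Lucene 1.4 API (the class and method names are made up; positions the
analyzer dropped, e.g. stopwords, simply leave gaps, which is why this
is a hack rather than a faithful reconstruction):

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.TreeMap;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermPositions;

public class DocReconstructor {

    // Walk every term of the field, collect the positions it occupies in
    // the target doc, then join the terms back together in position order.
    public static String reconstruct(IndexReader reader, int targetDoc,
                                     String field) throws IOException {
        TreeMap termsByPosition = new TreeMap(); // Integer position -> term text
        TermEnum te = reader.terms(new Term(field, ""));
        try {
            while (te.term() != null && field.equals(te.term().field())) {
                TermPositions tp = reader.termPositions(te.term());
                while (tp.next()) {
                    if (tp.doc() == targetDoc) {
                        for (int i = 0; i < tp.freq(); i++) {
                            termsByPosition.put(new Integer(tp.nextPosition()),
                                    te.term().text());
                        }
                    }
                }
                tp.close();
                if (!te.next()) break;
            }
        } finally {
            te.close();
        }
        // Concatenate the terms in position order, separated by spaces.
        StringBuffer sb = new StringBuffer();
        for (Iterator it = termsByPosition.values().iterator(); it.hasNext();) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(it.next());
        }
        return sb.toString();
    }
}
```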

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)


Re: one huge index or many small ones?

2004-11-04 Thread javier muguruza
Sergiu,

A month could have tens of millions of emails in the worst case, but
maybe I can discard such a bad assumption for our current project.
Let's say 10,000 emails per day max; that makes 300k emails a month.
I would choose either one index per day or per month (or week or
whatever).

Your suggestion of an index per user is unfortunately not valid; my
searches do not require a user. They can, say, ask for 'all email from
department C from last week', etc. So if I choose one index per day (or
month) I already know that I will have to search in many indexes
depending on the timeframe (the timeframe is the only required value
for the search).

thanks for the suggestions!


On Thu, 04 Nov 2004 19:01:53 +0100, Sergiu Gordea
[EMAIL PROTECTED] wrote:
 javier muguruza wrote:
 
 Hi Javier,
 
 I think your optimization should target the response time of search
 queries; I assume that is the variable you need to optimize. It's
 probably a good idea to read the Lucene benchmarks first:
 http://jakarta.apache.org/lucene/docs/benchmarks.html
 
 If you have a mandatory date constraint for each of your searches, you
 can split the index on a time basis; I assume one index per month will
 be enough. Searching a single index of ~10,000 emails afterwards should
 be fast enough. But maybe this is not such a good idea?
 
 What about creating one index per user? If your searches require a
 user or a sender, you can get its name from the database and apply
 only the other constraints to an index dedicated to that user. I
 think the Lucene search will then be much faster.
 
 The database search will also be fast; I don't think you will have
 more than 1,000-10,000 user names.
 
 or maybe 1 index/user/year
 
 or 1 index/receiver/year + 1index/sender/year
 
 What about this solution is it feasible for your system?
 
 All the best,
 
  Sergiu
 
 
 
 Thanks Erik and Giulio for the fast reply.
 
 I am just starting to look at lucene so forgive me if I got some ideas
 wrong. I understand your concerns about one index per email. But
 having one index only is also (I guess) out of question.
 
 I am building an email archive. Email will be kept indefinitely
 available for search, adding new email every day. Imagine a company
 with millions of emails per day (been there), keep it growing for
 years, adding stuff to the index while using it for searches
 continuously...
 
 That's why my idea is to decide on a time frame (a day, a month...an
 extreme would be an instant, that is a single email, my original idea)
 and build the index for all the email in that timeframe. After the
 timeframe is finished no more stuff will be ever added.
 
 Before the Lucene search, emails are selected based on other conditions
 (we store the from, to, date etc. in a database as well, and these
 conditions are enforced with a SQL query first, so I would not need to
 enforce them in the Lucene search again; also, that query can be quite
 sophisticated and I guess would not easily be possible in Lucene by
 itself). That first DB step gives me a group of emails that I may have
 to narrow down further based on a Lucene search (of body and attachment
 contents). Having an index for more than one email means that after the
 search I would have to keep only the overlapping emails from the two
 searches... Maybe this is better than keeping the same info I have in
 the DB in Lucene fields as well.
 
 An example: I want all the email from [EMAIL PROTECTED] from Jan
 to Dec containing the word 'money'. I run the DB query that returns a
 list of John's email for that period of time, then (let's assume I
 have one index per day) I iterate over every day, looking for emails
 that contain 'money'; from the results returned by Lucene I keep only
 those that are also in the first list.
 
 Does that sound better?
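The "keep only the overlapping emails" step described above can be sketched as self-contained code: the DB query yields one list of message ids, the Lucene search another, and the result is Lucene's hits filtered against the DB set. The ids and class name are illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Keep only Lucene's hits whose ids also came back from the DB,
// preserving Lucene's ranking order.
public class OverlapFilter {

    public static List<String> overlap(List<String> luceneHits, List<String> dbIds) {
        Set<String> allowed = new HashSet<>(dbIds);
        List<String> out = new ArrayList<>();
        for (String id : luceneHits) {   // iterate in Lucene's hit order
            if (allowed.contains(id)) {
                out.add(id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lucene = Arrays.asList("m3", "m1", "m9");
        List<String> db = Arrays.asList("m1", "m2", "m3");
        System.out.println(overlap(lucene, db));  // [m3, m1]
    }
}
```

Using a hash set for the DB ids keeps the intersection linear in the number of Lucene hits, which matters when the per-day indexes return many matches.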
 
 
 On Thu, 4 Nov 2004 17:26:21 +0100, Giulio Cesare Solaroli
 [EMAIL PROTECTED] wrote:
 
 
 Hi Javier,
 
 I suggest you build a single index, with all the information you
 need to find the right mail you are looking for. You can then use
 Lucene alone to find your messages.
 
 Giulio Cesare
 
 
 
 
 On Thu, 4 Nov 2004 17:00:35 +0100, javier muguruza [EMAIL PROTECTED] wrote:
 
 
 Hi,
 
 We are going to move from a just-in-time perl based search to using
 lucene in our project. I have to index emails (bodies and also
 attachements). I keep in the filesystem all the bodies and attachments
 for a long period of time. I have to find emails that fulfil certain
 conditions; some of the conditions are taken care of at a different
 level, so in the end I have a SUBSET of emails I have to run through
 Lucene.
 
 I was assuming that the best way would be to create an index for each
 email. Having a unique index for a group of emails (say a day's worth
 of email) seems too coarse-grained; imagine a day has 10,000 emails,
 and some queries will likely look in only a handful of the emails...

Re: one huge index or many small ones?

2004-11-04 Thread javier muguruza
Justin, 

Yes, I wanted as little info as possible in the index. The body and
attachments will be stored outside Lucene. As I mentioned, I only
need to deal with the body/attachment contents in Lucene; from, to,
subject, dates etc. are dealt with before. My idea was:

Unstored: Body + attachment (after extracting text)

I don't need to know in which attachment the words I am looking for
are; it's enough to know they are in the email. I will have a look at
MultiSearcher and ParallelMultiSearcher,

thanks!


On Thu, 4 Nov 2004 10:28:18 -0700, Justin Swanhart [EMAIL PROTECTED] wrote:
 First off, I think you should make a decision about what you want to
 store in your index and how you go about searching it.
 
 The less information you store in your index, the better, for
 performance reasons.  If you can store the messages in an external
 database you probably should.  I would create a table that contains a
 clob and an associated id that can be used to get the message at any
 time.
 
 Assuming mail is in SMTP RFC format:
 
 I would suggest:
 Unstored: Subject
 Keyword: From
 Keyword: To
 Stored,Unindexed: ID  -- this would be the ID to the message in your database
 Unstored: Body
 Keyword: Month
 Keyword: Day
 Keyword: Year
 (and any other keywords you might use)
 
 Your lucene query would then look something like:
 +From:[EMAIL PROTECTED] +(Subject:money Body:money) +Year:2004
 
 Use the stored ID field to get the message contents from your database.
 
 If you want to break your index down into multiple indexes, based on
 some criteria such as time frame you could do that too.  You would
 then use a MultiSearcher or ParallelMultiSearcher to process the
 multiple indexes.
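The example query Justin gives can be assembled mechanically from the three inputs. The following is a hypothetical helper — the class, method, and example address are mine; only the field names (From, Subject, Body, Year) and the query shape come from the mail:

```java
// Assemble a Lucene-style query string of the form
// +From:<addr> +(Subject:<word> Body:<word>) +Year:<year>
public class MailQueryBuilder {

    public static String build(String from, String word, int year) {
        StringBuilder q = new StringBuilder();
        q.append("+From:").append(from);                 // required sender
        q.append(" +(Subject:").append(word)             // word in subject
         .append(" Body:").append(word).append(')');     // ... or in body
        q.append(" +Year:").append(year);                // required year
        return q.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("john@example.com", "money", 2004));
        // +From:john@example.com +(Subject:money Body:money) +Year:2004
    }
}
```

A real application would pass the resulting string to Lucene's QueryParser (or build the BooleanQuery programmatically), then use the stored ID field of each hit to fetch the message body from the database, as the mail suggests.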
 
 
 
 
 On Thu, 4 Nov 2004 18:03:49 +0100, javier muguruza [EMAIL PROTECTED] wrote:
  Thanks Erik and Giulio for the fast reply.
 
  I am just starting to look at lucene so forgive me if I got some ideas
  wrong. I understand your concerns about one index per email. But
  having one index only is also (I guess) out of question.
 
  I am building an email archive. Email will be kept indefinitely
  available for search, adding new email every day. Imagine a company
  with millions of emails per day (been there), keep it growing for
  years, adding stuff to the index while using it for searches
  continuously...
 
  That's why my idea is to decide on a time frame (a day, a month...an
  extreme would be an instant, that is a single email, my original idea)
  and build the index for all the email in that timeframe. After the
  timeframe is finished no more stuff will be ever added.
 
  Before the lucene search emails are selected based on other conditions
  (we store the from, to, date etc in database as well, and these
  conditions are enforced with a sql query first, so I would not need to
  enforce them in the lucene search again, also that query can be quite
  sophisticated and I guess would not be easyly possible to do it in
  lucene by itself). That first db step gives me a group of emails that
  maybe I have to further narrow down based on a lucene search (of body
  and attachment contents). Having an index for more than one emails
  means that after the search I would have to get only the overlaping
  emails from the two searches...Maybe this is better than keeping the
  same info I have in the db in lucene fields as well.
 
  An example: I want all the email from [EMAIL PROTECTED] from Jan
  to Dec containing the word 'money'. I run the db query that returns a
  list with john's email for that period of time, then (lets assume I
  have one index per day) I iterate on every day, looking for emails
  that contain 'money', from the results returned by lucene I keep only
  these that are also in the first list.
 
  Does that sound better?
 
  On Thu, 4 Nov 2004 17:26:21 +0100, Giulio Cesare Solaroli
 
 
  [EMAIL PROTECTED] wrote:
   Hi Javier,
  
   I suggest you to build a single index, with all the information you
   need to find the right mail you are looking for. You than can use
   Lucene alone to find you messages.
  
   Giulio Cesare
  
  
  
  
   On Thu, 4 Nov 2004 17:00:35 +0100, javier muguruza [EMAIL PROTECTED] wrote:
Hi,
   
We are going to move from a just-in-time perl based search to using
lucene in our project. I have to index emails (bodies and also
attachements). I keep in the filesystem all the bodies and attachments
for a long period of time. I have to find emails that fullfil certain
conditions, some of the conditions are take care of at a different
level, so in the end I have a SUBSET of emails I have to run through
lucene.
   
I was assuming that the best way would be to create an index for each
email. Having an unique index for a group of emails (say a day worth
of email) seems too coarse grained, imagine a day has 1 emails,
and some queries will like to look in only a handful of the
emails...But the problem with having one index per emails is the
massive 

Re: one huge index or many small ones?

2004-11-04 Thread Giulio Cesare Solaroli
Hi Javier,


On Thu, 4 Nov 2004 20:08:15 +0100, javier muguruza [EMAIL PROTECTED] wrote:
 Justin,
 
 Yes, I wanted as less info as possible in the index. The body and
 atachemntes will be stored outside lucene. As I mentioned,  I only
 need to deal with the body/attachments contents with lucene, from, to,
 subject, dates etc are deal with before.

You can probably get away with this solution as well, but I would
suggest testing Lucene's performance before starting to optimize.

Unless your queries on the text of the body/attachments are huge (my
users end up with rewritten queries whose lengths reach up to
600 KBytes!!), Lucene will probably be able to return the right
result much faster than looking in different places for the same
query.

Don't be afraid of the number of documents either, not before testing
on some real data. You could easily find that a simpler architecture
performs fast enough and is much easier to set up and tune.

[...]


Giulio Cesare




Highlighting in Lucene

2004-11-04 Thread Ramon Aseniero
Hi All,

 

I would like to know if Lucene supports highlighting of the searched text.

 

Thanks in advance.

 

Thanks,

Ramon Aseniero



RE: Highlighting in Lucene

2004-11-04 Thread Will Allen
There is a highlighting tool in the sandbox (3/4 of the way down):

http://jakarta.apache.org/lucene/docs/lucene-sandbox/

-Original Message-
From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 04, 2004 3:40 PM
To: 'Lucene Users List'
Subject: Highlighting in Lucene


Hi All,

 

I would like to know if Lucene support highlighting on the searched text?

 

Thanks in advance.

 

Thanks,

Ramon Aseniero





RE: Highlighting in Lucene

2004-11-04 Thread Ramon Aseniero
Hi Will,

Thanks a lot that really helps.

Thanks,
Ramon

-Original Message-
From: Will Allen [mailto:[EMAIL PROTECTED] 
Sent: Thursday, November 04, 2004 12:45 PM
To: Lucene Users List
Subject: RE: Highlighting in Lucene

There is a highlighting tool in the sandbox (3/4 of the way down):

http://jakarta.apache.org/lucene/docs/lucene-sandbox/

-Original Message-
From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 04, 2004 3:40 PM
To: 'Lucene Users List'
Subject: Highlighting in Lucene


Hi All,

 

I would like to know if Lucene support highlighting on the searched text?

 

Thanks in advance.

 

Thanks,

Ramon Aseniero









Re: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory?

2004-11-04 Thread Otis Gospodnetic
Hm, as far as I know, a CVS sub-directory in an index directory should
not bother Lucene.  As a matter of fact, I tested this (I used a file,
not a directory) for Lucene in Action.  What error are you getting?

I know there is a -I CVS option for ignoring files; perhaps it works
with directories, too.

Otis


--- Chuck Williams [EMAIL PROTECTED] wrote:

 I have a Tomcat web module being developed with Netbeans 4.0 ide
 using
 CVS.  One CVS repository holds the sources of my various web files in
 a
 directory structure that directly parallels the standard Tomcat
 webapp
 directory structure.  This is well supported in a fully automated way
 within Netbeans.  I have my search index directory as a subdirectory
 of
 WEB-INF, which seemed the natural place to put it.  The index files
 themselves are not in the repository.  I want to be able to do CVS
 Update for the web module directory tree as a whole.  However, this
 places a CVS subdirectory within the index directory, which in turn
 causes Lucene indexing to blow up the next time I run it since this
 is
 an unexpected entry in the index directory.  To work around the
 problem I need both to delete the CVS subdirectory and to find and
 delete the pointers to it in the Entries file and Netbeans cache
 file within the CVS subdirectory of the parent directory.  This is
 annoying, to say the least.
 
  
 
 I've asked the Netbeans users if there is a way to avoid creation of
 the index's CVS subdirectory, but the same thing happened using
 WinCVS, so I expect this is not a Netbeans issue.  It could be my
 relative ignorance of CVS.
 
  
 
 How do others avoid this problem?
 
  
 
 Any advice or suggestions would be appreciated.
 
  
 
 Thanks,
 
  
 
 Chuck
 
  
 
 





RE: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory?

2004-11-04 Thread Chuck Williams
Otis, thanks for looking at this.  The stack trace of the exception is
below.  I looked at the code.  It wants to delete every file in the
index directory, but fails to delete the CVS subdirectory entry
(presumably because it is marked read-only; the specific exception is
swallowed).  Even if it could delete the CVS subdirectory, this would
just cause another problem with Netbeans/CVS, since it wouldn't know how
to fix up the pointers in the parent CVS subdirectory.  Is there a
change I could make that would cause it to safely leave this alone?

This problem only arises on a full index (incremental == false, i.e.
create == true).  Incremental indexing works fine in my app.

Chuck

java.io.IOException: Cannot delete CVS
at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:144)
at org.apache.lucene.store.FSDirectory.&lt;init&gt;(FSDirectory.java:128)
at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:102)
at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:83)
at org.apache.lucene.index.IndexWriter.&lt;init&gt;(IndexWriter.java:173)
at [my app]...

   -Original Message-
   From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
   Sent: Thursday, November 04, 2004 1:54 PM
   To: Lucene Users List
   Subject: Re: Is there an easy way to have indexing ignore a CVS
   subdirectory in the index directory?
   
   Hm, as far as I know, a CVS sub-directory in an index directory
should
   not bother Lucene.  As a matter of fact, I tested this (I used a
file,
   not a directory) for Lucene in Action.  What error are you getting?
   
   I know there is -I CVS option for ignoring files; perhaps it works
with
   directories, too.
   
   Otis
   
   
   --- Chuck Williams [EMAIL PROTECTED] wrote:
   
I have a Tomcat web module being developed with Netbeans 4.0 ide
using
CVS.  One CVS repository holds the sources of my various web files
in
a
directory structure that directly parallels the standard Tomcat
webapp
directory structure.  This is well supported in a fully automated
way
within Netbeans.  I have my search index directory as a
subdirectory
of
WEB-INF, which seemed the natural place to put it.  The index
files
themselves are not in the repository.  I want to be able to do CVS
Update for the web module directory tree as a whole.  However,
this
places a CVS subdirectory within the index directory, which in
turn
causes Lucene indexing to blow up the next time I run it since
this
is
an unexpected entry in the index directory.  To make things works,
to
work around the problem I both need to delete the CVS subdirectory
and
find and delete the pointers to it in the Entries file and
Netbeans
cache file within the CVS subdirectory of the parent directory.
This
is
annoying to say the least.
   
   
   
I've asked the Netbeans users if there is a way to avoid creation
of
the
index's CVS subdirectory, but the same thing happened using WinCVS
and I
so I expect this is not a Netbeans issue.  It could be my relative
ignorance of CVS.
   
   
   
How do others avoid this problem?
   
   
   
Any advice or suggestions would be appreciated.
   
   
   
Thanks,
   
   
   
Chuck
   
   
   
   
   
   
  





PorterStemmer / Levenshtein Distance

2004-11-04 Thread Yousef Ourabi
Hey,
The site says Lucene uses the Levenshtein distance
algorithm for fuzzy matching; where is this in the
source code? Also, I would like to use the Porter
stemming algorithm for something else. Are there any
documents on the Lucene implementation of the Porter
stemmer?

Best,
Yousef




Sorting in Lucene.

2004-11-04 Thread Ramon Aseniero
Hi All,

 

Does Lucene support sorting of the search results?

 

Thanks in advance.

Ramon



RE: Sorting in Lucene.

2004-11-04 Thread Chuck Williams
Yes, by one or multiple criteria.

Chuck

   -Original Message-
   From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
   Sent: Thursday, November 04, 2004 6:21 PM
   To: 'Lucene Users List'
   Subject: Sorting in Lucene.
   
   Hi All,
   
   
   
   Does Lucene supports sorting on the search results?
   
   
   
   Thanks in advance.
   
   Ramon





RE: Sorting in Lucene.

2004-11-04 Thread Ramon Aseniero
Hi Chuck,

Can you please point me to some articles or FAQ about Sorting in Lucene?

Thanks a lot for your reply.

Thanks,
Ramon

-Original Message-
From: Chuck Williams [mailto:[EMAIL PROTECTED] 
Sent: Thursday, November 04, 2004 9:44 PM
To: Lucene Users List
Subject: RE: Sorting in Lucene.

Yes, by one or multiple criteria.

Chuck

   -Original Message-
   From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
   Sent: Thursday, November 04, 2004 6:21 PM
   To: 'Lucene Users List'
   Subject: Sorting in Lucene.
   
   Hi All,
   
   
   
   Does Lucene supports sorting on the search results?
   
   
   
   Thanks in advance.
   
   Ramon









RE: Sorting in Lucene.

2004-11-04 Thread Chuck Williams
Ramon,

I'm not sure where a guide or tutorial might be, but you should be able
to see how to do it from the javadoc.  Look at classes Sort, SortField,
SortComparator.  I've also included a recent message from this group
below concerning sorting with multiple fields.  FYI, a number of people
have wanted to first sort by score and secondarily by another field.
This is tricky since scores are frequently different in low-order
decimal positions.
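Chuck's caveat is worth unpacking: raw float scores almost never tie exactly, so a secondary key rarely gets a chance to fire. One common workaround is to bucket the score to a fixed number of decimals before comparing. The sketch below is outside Lucene's Sort/SortComparator API entirely — the class and field names are made up for illustration:

```java
import java.util.Arrays;
import java.util.Comparator;

// Bucket the score to 2 decimals so near-equal scores actually tie
// and fall through to the secondary key.
public class ScoreThenField {

    public static class Hit {
        public final float score;
        public final int pubmonth;
        public Hit(float score, int pubmonth) {
            this.score = score;
            this.pubmonth = pubmonth;
        }
    }

    // primary: score bucketed to 2 decimals, descending;
    // secondary: pubmonth, descending
    static final Comparator<Hit> BUCKETED =
            Comparator.<Hit>comparingDouble(h -> -Math.round(h.score * 100f))
                      .thenComparingInt(h -> -h.pubmonth);

    public static int[] sortedMonths(Hit[] hits) {
        Arrays.sort(hits, BUCKETED);
        int[] out = new int[hits.length];
        for (int i = 0; i < hits.length; i++) {
            out[i] = hits[i].pubmonth;
        }
        return out;
    }

    public static void main(String[] args) {
        Hit[] hits = {
            new Hit(0.8712f, 3),  // differs from the next score only in
            new Hit(0.8705f, 7),  // the 3rd decimal: bucketed they tie,
            new Hit(0.65f, 12)    // so pubmonth decides between them
        };
        System.out.println(Arrays.toString(sortedMonths(hits)));  // [7, 3, 12]
    }
}
```

The bucket width is a trade-off: too coarse and genuinely different scores collapse, too fine and the secondary key still never applies.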

Good luck,

Chuck

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Thursday, November 04, 2004 1:33 AM
To: Lucene Users List
Subject: Re: sorting by score and an additional field

On Nov 3, 2004, at 9:52 PM, Chris Fraschetti wrote:
 Has anyone had any luck using lucene's built in sort functions to sort
 first by the lucene hit score and secondarily by a Field in each
 document indexed as Keyword and in integer form?

I get multiple sort fields to work; here are two examples:

 new Sort(new SortField[]{
   new SortField("category"),
   SortField.FIELD_SCORE,
   new SortField("pubmonth", SortField.INT, true)
 });

new Sort(new SortField[] {SortField.FIELD_SCORE, new 
SortField("category")})

Both of these, on a tiny dataset of only 10 documents, works exactly as 
expected.

 I can only get it to sort by one or the other... but when it does one,
 it does sort correctly, but together in {score, custom_field} only the
 first sort seems to apply.

 Any ideas?

Are you using Lucene 1.4.2?  How did you index your integer field?  Are 
you simply using the .toString() of an Integer?  Or zero padding the 
field somehow?  You can use the .toString method, but you have to be 
sure that the sorting code does the right parsing of it - so you might 
need to specify SortField.INT as its type.  It will do automatic 
detection if the type is not specified, but that assumes that the first 
document it encounters parses properly, otherwise it will fall back to 
using a String sort.
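Erik's zero-padding alternative, in isolation: pad integers to a fixed width so that a plain lexicographic String sort orders them numerically, sidestepping the auto-detection issue. The width of 5 is an arbitrary choice for the example:

```java
// Pad integers to a fixed width so a String sort orders them numerically.
public class ZeroPad {

    public static String pad(int value, int width) {
        String s = Integer.toString(value);
        StringBuilder sb = new StringBuilder();
        // prepend zeros until the total length reaches the target width
        while (sb.length() + s.length() < width) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        // unpadded, "10" sorts before "9" as a String; padded it doesn't
        System.out.println(pad(9, 5));   // 00009
        System.out.println(pad(10, 5));  // 00010
    }
}
```

Pick a width large enough for the biggest value you will ever index; values wider than the chosen width would break the ordering, which is why specifying SortField.INT explicitly is usually the cleaner fix.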

Erik








   -Original Message-
   From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
   Sent: Thursday, November 04, 2004 9:53 PM
   To: 'Lucene Users List'
   Subject: RE: Sorting in Lucene.
   
   Hi Chuck,
   
   Can you please point me to some articles or FAQ about Sorting in
Lucene?
   
   Thanks a lot for your reply.
   
   Thanks,
   Ramon
   
   -Original Message-
   From: Chuck Williams [mailto:[EMAIL PROTECTED]
   Sent: Thursday, November 04, 2004 9:44 PM
   To: Lucene Users List
   Subject: RE: Sorting in Lucene.
   
   Yes, by one or multiple criteria.
   
   Chuck
   
  -Original Message-
  From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
  Sent: Thursday, November 04, 2004 6:21 PM
  To: 'Lucene Users List'
  Subject: Sorting in Lucene.
 
  Hi All,
 
 
 
  Does Lucene supports sorting on the search results?
 
 
 
  Thanks in advance.
 
  Ramon
   
   
  
   
   
   
   
  





RE: Sorting in Lucene.

2004-11-04 Thread Ramon Aseniero
Hi chuck,

Thanks a lot this is really helpful.

Thanks,
Ramon

-Original Message-
From: Chuck Williams [mailto:[EMAIL PROTECTED] 
Sent: Thursday, November 04, 2004 10:05 PM
To: Lucene Users List
Subject: RE: Sorting in Lucene.

Ramon,

I'm not sure where a guide or tutorial might be, but you should be able
to see how to do it from the javadoc.  Look at classes Sort, SortField,
SortComparator.  I've also included a recent message from this group
below concerning sorting with multiple fields.  FYI, a number of people
have wanted to first sort by score and secondarily by another field.
This is tricky since scores are frequently different in low-order
decimal positions.

Good luck,

Chuck

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Thursday, November 04, 2004 1:33 AM
To: Lucene Users List
Subject: Re: sorting by score and an additional field

On Nov 3, 2004, at 9:52 PM, Chris Fraschetti wrote:
 Has anyone had any luck using lucene's built in sort functions to sort
 first by the lucene hit score and secondarily by a Field in each
 document indexed as Keyword and in integer form?

I get multiple sort fields to work; here are two examples:

 new Sort(new SortField[]{
   new SortField("category"),
   SortField.FIELD_SCORE,
   new SortField("pubmonth", SortField.INT, true)
 });

new Sort(new SortField[] {SortField.FIELD_SCORE, new 
SortField("category")})

Both of these, on a tiny dataset of only 10 documents, works exactly as 
expected.

 I can only get it to sort by one or the other... but when it does one,
 it does sort correctly, but together in {score, custom_field} only the
 first sort seems to apply.

 Any ideas?

Are you using Lucene 1.4.2?  How did you index your integer field?  Are 
you simply using the .toString() of an Integer?  Or zero padding the 
field somehow?  You can use the .toString method, but you have to be 
sure that the sorting code does the right parsing of it - so you might 
need to specify SortField.INT as its type.  It will do automatic 
detection if the type is not specified, but that assumes that the first 
document it encounters parses properly, otherwise it will fall back to 
using a String sort.

Erik








   -Original Message-
   From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
   Sent: Thursday, November 04, 2004 9:53 PM
   To: 'Lucene Users List'
   Subject: RE: Sorting in Lucene.
   
   Hi Chuck,
   
   Can you please point me to some articles or FAQ about Sorting in
Lucene?
   
   Thanks a lot for your reply.
   
   Thanks,
   Ramon
   
   -Original Message-
   From: Chuck Williams [mailto:[EMAIL PROTECTED]
   Sent: Thursday, November 04, 2004 9:44 PM
   To: Lucene Users List
   Subject: RE: Sorting in Lucene.
   
   Yes, by one or multiple criteria.
   
   Chuck
   
  -Original Message-
  From: Ramon Aseniero [mailto:[EMAIL PROTECTED]
  Sent: Thursday, November 04, 2004 6:21 PM
  To: 'Lucene Users List'
  Subject: Sorting in Lucene.
 
  Hi All,
 
 
 
  Does Lucene supports sorting on the search results?
 
 
 
  Thanks in advance.
 
  Ramon
   
   
  
   
   
   
   
  









INDEXREADER + DELETE + LUCENE1.4.1

2004-11-04 Thread Karthik N S



Hi Guys,

Apologies



There seems to be an unresolved bug [or maybe I am doing something
wrong] in IndexReader.delete(int docNum).

Here is the Code

indexSearcher = null;
indexDirectory = null;
indexReader = null;
indexDirectory =
FSDirectory.getDirectory("/root/MERGEDINDEX/MERGER_1", false);
indexReader = IndexReader.open(indexDirectory);

IndexReader.unlock(indexDirectory);
indexSearcher = new IndexSearcher(indexReader);
query = new TermQuery(new Term(fieldName, FiledValue));
hits = indexSearcher.search(query);


if (hits.length() > 0) {

for (int k = 0; k < hits.length(); k++) {
PRINTDBG_.append("QUERY : " + query.toString() + "\n" +
"FIELD NAME : " + fieldName + "\n" +
"FIELD VALUE: " + FiledValue + "\n" +
"TOTAL HITS : " + hits.length() + "\n" +
"DELETING : " + k);

indexReader.delete(k);

}
}

indexReader.close();
indexSearcher.close();
indexDirectory.close();

System.out.println("Debugger : " + PRINTDBG_);
indexReader = null;
indexSearcher = null;
indexDirectory = null;

//optimization
indexDirectory = FSDirectory.getDirectory(pathMergeIndex,false);
IndexWriter writer = new IndexWriter(indexDirectory, analyzer, false);
writer.mergeFactor = mergeFactorVal_;
writer.maxMergeDocs = maxMergeDocsVal_;
writer.optimize();
writer.close();

indexDirectory = null;
writer = null;

In spite of using a new IndexReader for every deletion of documents and
every optimization, the 'indexReader.delete(k)' call does not seem to
work.

Configuration History

a) 1 MergerIndex = 1000 subIndexes [ fieldName = KeyWord Field Type]

b) O/s Windows

c) Amd Processor

e) Lucene 1.4.1

f) Jdk 1.4.2

Can somebody please suggest alternatives?



  WITH WARM REGARDS
  HAVE A NICE DAY
  [ N.S.KARTHIK]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]