Re: PerFieldAnalyzerWrapper use? Analyzer's not being used as expected....

2006-07-30 Thread Erik Hatcher
Sorry, Otis is right.  I just couldn't see anything else in your code  
that could have been wrong.


Erik


On Jul 29, 2006, at 11:42 PM, Otis Gospodnetic wrote:

> I think you can reuse them.  Fields should be handled/analyzed
> sequentially.  I reuse them for some stuff on Simpy.com.
>
> But you may want to clean up that try/catch.  Instead of catching
> the IOException, you may want to use !IndexReader.indexExists(...)
> in place of that boolean param to IndexWriter ctor.
>
> Otis
>
> ----- Original Message -----
> From: Michael J. Prichard <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Saturday, July 29, 2006 4:04:23 PM
> Subject: Re: PerFieldAnalyzerWrapper use?  Analyzer's not being
> used as expected
>
> Hey Erik,
>
> Will do.  May I ask why?  Out of curiosity.
>
> Thanks,
> Michael
>
> Erik Hatcher wrote:
>
>> I think you should use a new instance of each analyzer for each
>> field, not reuse instances.  Other than that, your usage is fine.
>>
>> Erik
>>
>> On Jul 29, 2006, at 3:49 PM, Michael J. Prichard wrote:
>>
>>> So I have the following code...
>>>
>>> // let's get our SynonymAnalyzer
>>> SynonymAnalyzer synAnalyzer = getSynonymAnalyzer();
>>> // let's get our EmailAnalyzer
>>> EmailAnalyzer emailAnalyzer = getEmailAnalyzer();
>>>
>>> // set up perfieldanalyzer
>>> PerFieldAnalyzerWrapper aWrapper =
>>>    new PerFieldAnalyzerWrapper(new StandardAnalyzer());
>>> aWrapper.addAnalyzer("subject", synAnalyzer);
>>> aWrapper.addAnalyzer("content", synAnalyzer);
>>> aWrapper.addAnalyzer("from", emailAnalyzer);
>>> aWrapper.addAnalyzer("to", emailAnalyzer);
>>> aWrapper.addAnalyzer("cc", emailAnalyzer);
>>> aWrapper.addAnalyzer("bcc", emailAnalyzer);
>>>
>>> // create the writer
>>> try {
>>>    wr = new IndexWriter(indexDir, aWrapper, false);
>>>    wr.setUseCompoundFile(false);
>>> } catch (IOException iox) {
>>>    // means it ain't there
>>>    wr = new IndexWriter(indexDir, aWrapper, true);
>>>    wr.setUseCompoundFile(false);
>>> }
>>>
>>> -----
>>>
>>> When I add a Document to the IndexWriter it does not seem to use
>>> the analyzers I want it to.  Just uses StandardAnalyzer for
>>> everything!  Is this the correct way to use PerFieldAnalyzerWrapper?
>>>
>>> Thanks,
>>> Michael
>>>
>>> P.S.  I am using Lucene 2 libs.
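
For readers following the question above: PerFieldAnalyzerWrapper picks an analyzer by field name and falls back to the analyzer given to its constructor. The sketch below is not Lucene code. It is a minimal stand-in for that lookup, written with plain strings in place of analyzers; the names PerFieldLookup, add and forField are hypothetical. It only illustrates that the match on the field name is exact, so a field added under a slightly different name (for example "Subject" instead of "subject") would silently get the default. Whether that is what happened here is not confirmed anywhere in the thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a per-field lookup with a default; not Lucene code. */
final class PerFieldLookup<A> {
    private final A defaultValue;
    private final Map<String, A> byField = new HashMap<>();

    PerFieldLookup(A defaultValue) {
        this.defaultValue = defaultValue;
    }

    /** Registers the value to use for exactly this field name. */
    void add(String fieldName, A value) {
        byField.put(fieldName, value);
    }

    /** Returns the registered value, or the default when the name is unknown. */
    A forField(String fieldName) {
        A value = byField.get(fieldName);
        return value != null ? value : defaultValue;
    }
}
```

With "standard" as the default and "subject" mapped to "synonym", forField("subject") gives "synonym", while forField("Subject") and forField("body") both give "standard".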

 
-

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





Re: email libraries

2006-07-30 Thread John Haxby

Andrzej Bialecki wrote:
> Just for the record - I've been using javamail POP and IMAP providers
> in the past, and they were prone to hanging with some servers, and
> resource intensive. I've been also using Outlook (proper, not Outlook
> Express - this is AFAIK impossible to work with) via a Java-COM bridge
> such as Jawin or JNIWrapper plus Redemption. This also tends to be
> rather unstable, and requires a lot of fine-tuning ...

We use javamail a *lot* with the Scalix IMAP server (the web access part 
uses IMAP underneath).   We have had performance problems with the way 
that javamail works, although for just scanning a message store to index 
messages it's OK.   We have tuned the web access code somewhat to make 
it behave better but we've also re-engineered the IMAP server somewhat, 
partly with javamail in mind, and performance and resource usage on the 
server are now somewhat under control.

> So, be prepared to suffer quite a bit. ;)

If you're doing complicated things, yes, but if it's simple access for 
the purposes of indexing then you probably don't need to worry too much.


jch





java.lang.IllegalAccessError: tried to access method org.apache.lucene.search.HitDoc.

2006-07-30 Thread Alan Ezust

I'm having difficulty getting Lucene to work for me, and it keeps
coming back to this HitDoc class.

At the moment, whenever I call the IndexBuilder.search method,
this is what I get:

[error] WorkThread: java.lang.IllegalAccessError: tried to access method org.apache.lucene.search.HitDoc.<init>(FI)V from class org.apache.lucene.search.Hits
[error] WorkThread:  at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:94)
[error] WorkThread:  at org.apache.lucene.search.Hits.<init>(Hits.java:53)
[error] WorkThread:  at org.apache.lucene.search.Searcher.search(Searcher.java:44)
[error] WorkThread:  at org.apache.lucene.search.Searcher.search(Searcher.java:36)
[error] WorkThread:  at infoviewer.lucene.IndexBuilder.search(IndexBuilder.java:118)
[error] WorkThread:  at infoviewer.lucene.SearchPanel$ActionHandler$1.run(SearchPanel.java:190)
[error] WorkThread:  at org.gjt.sp.util.WorkThread.doRequest(WorkThread.java:194)
[error] WorkThread:  at org.gjt.sp.util.WorkThread.doRequests(WorkThread.java:161)


I tried moving class HitDoc out of Hits.java and into its own
HitDoc.java file, and making the class and ctor public, but I still
get this error... So now I'm really confused.
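
No resolution is recorded in this thread. One frequent cause of an IllegalAccessError on a package-private member, offered here as an assumption and not as a diagnosis of this setup, is that two copies of the library are visible at runtime (say, one on the application classpath and one bundled with a plugin). The two classes are then defined by different class loaders and no longer share a runtime package, and an old copy can keep being picked up even after the source is changed. A minimal sketch for checking where a class came from; ClassOrigin, describe and sameLoader are hypothetical names, and in the failing program the arguments would be the Lucene classes from the stack trace.

```java
import java.security.CodeSource;

final class ClassOrigin {
    private ClassOrigin() {}

    /** Returns "<class name> loaded from <location> by <class loader>". */
    static String describe(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        // Classes that ship with the JDK usually have no code source.
        String location = (source == null || source.getLocation() == null)
                ? "(no code source)"
                : source.getLocation().toString();
        // getClassLoader() returns null for the bootstrap class loader.
        ClassLoader loader = type.getClassLoader();
        String loaderName = (loader == null) ? "bootstrap" : loader.toString();
        return type.getName() + " loaded from " + location + " by " + loaderName;
    }

    /** True when both classes were defined by the same class loader. */
    static boolean sameLoader(Class<?> a, Class<?> b) {
        return a.getClassLoader() == b.getClassLoader();
    }
}
```

Printing describe(...) for the two classes involved and comparing the locations shows whether they come from the same jar; sameLoader(...) returning false would point at the class loader explanation.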




Re: Consult some information about adding index while searching

2006-07-30 Thread hu andy

Thank you


Re: Span Query NLE

2006-07-30 Thread Paul Elschot
On Tuesday 25 July 2006 03:26, Charlie wrote:
...
> 
> can "surround" be nested
> 
> 3w(4n(a?a AND bb?) AND cc+)

Yes, but iirc the "arguments" need to be separated by commas:
3w( 4n( ... , ...) , ...)
instead of by AND.

Regards,
Paul Elschot




Re: PerFieldAnalyzerWrapper use? Analyzer's not being used as expected....

2006-07-30 Thread Michael J. Prichard

This look better?

   // Check to see if index exists. 
   // If it doesn't, then set createIndex boolean to true

   boolean createIndex = false;
   if (!IndexReader.indexExists(indexDir)) {
      createIndex = true;
   }

   // let's set up the index writer
   wr = new IndexWriter(indexDir, aWrapper, createIndex);
   wr.setUseCompoundFile(false);






Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)

2006-07-30 Thread Michael J. Prichard

Kewl :)

I updated the Filter (for anyone interested).  Actually... if anyone
wants, I can zip it up and send it to them. Let me know.


 EmailFilter

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Token;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Stack;

public class EmailFilter extends TokenFilter {
   public static final String TOKEN_TYPE_EMAIL = "EMAILPART";

   // parts of the current token that still have to be returned
   private Stack emailTokenStack;

   public EmailFilter(TokenStream in) {
      super(in);
      emailTokenStack = new Stack();
   }

   public Token next() throws IOException {
      // hand out pending parts before reading more input
      if (emailTokenStack.size() > 0) {
         return (Token) emailTokenStack.pop();
      }

      Token token = input.next();
      if (token == null) {
         return null;
      }

      addEmailPartsToStack(token);

      return token;
   }

   private void addEmailPartsToStack(Token token) throws IOException {
      String[] parts = getEmailParts(token.termText());

      if (parts == null) return;

      for (int i = 0; i < parts.length; i++) {
         Token synToken = new Token(parts[i],
                                    token.startOffset(),
                                    token.endOffset(),
                                    TOKEN_TYPE_EMAIL);
         // keep the part at the same position as the original token
         synToken.setPositionIncrement(0);

         emailTokenStack.push(synToken);
      }
   }

   /*
    * Parses emails into its parts for tokenization.
    * For example [EMAIL PROTECTED] would be broken into
    *
    *    [EMAIL PROTECTED]
    *    [john]
    *    [foo.com]
    *    [foo]
    *    [com]
    */
   private String[] getEmailParts(String email) {
      // so i can add them before calling toArray
      ArrayList partsList = new ArrayList();

      // split on the @
      String[] splitOnAt = email.split("@");

      // add the username
      if (splitOnAt.length > 0) {
         partsList.add(splitOnAt[0]);
      }

      if (splitOnAt.length > 1) {
         // add the full host name
         partsList.add(splitOnAt[1]);

         // split the host name into pieces and add all of them
         String[] splitOnDot = splitOnAt[1].split("\\.");
         for (int i = 0; i < splitOnDot.length; i++) {
            partsList.add(splitOnDot[i]);
         }

         /*
          * if this is greater than 2 then we need to add the domain
          * name, which should be the last two pieces
          */
         if (splitOnDot.length > 2) {
            String domain = splitOnDot[splitOnDot.length - 2] + "."
                          + splitOnDot[splitOnDot.length - 1];
            // add domain
            partsList.add(domain);
         }
      }

      return (String[]) partsList.toArray(new String[0]);
   }
}

 end EmailFilter
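
For anyone who wants to try the splitting rule without Lucene on the classpath, here is a minimal standalone sketch of it. It is not the posted filter: the name EmailParts is hypothetical, it uses generics, and it makes two choices of its own that differ from the code above. A token without an @ yields no extra parts (the filter already emits the original token), and a single-label host such as "localhost" is not added twice. The address john@foo.com is assumed as the example, based on the parts listed in the comment above. The sketch does not validate addresses.

```java
import java.util.ArrayList;
import java.util.List;

final class EmailParts {
    private EmailParts() {}

    /** Returns the extra index terms for an address, without the address itself. */
    static List<String> split(String email) {
        List<String> parts = new ArrayList<>();
        int at = email.indexOf('@');
        if (at < 0) {
            return parts; // not an address: nothing extra to index
        }
        String user = email.substring(0, at);
        String host = email.substring(at + 1);
        if (!user.isEmpty()) {
            parts.add(user);
        }
        if (host.isEmpty()) {
            return parts;
        }
        parts.add(host);

        // Add each label of the host name, but only when there is more than one.
        String[] labels = host.split("\\.");
        if (labels.length > 1) {
            for (String label : labels) {
                if (!label.isEmpty()) {
                    parts.add(label);
                }
            }
        }
        // With three or more labels, also add the last two joined as the domain.
        if (labels.length > 2) {
            parts.add(labels[labels.length - 2] + "." + labels[labels.length - 1]);
        }
        return parts;
    }
}
```

EmailParts.split("john@foo.com") gives [john, foo.com, foo, com], and EmailParts.split("john@mail.foo.com") gives [john, mail.foo.com, mail, foo, com, foo.com].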




Otis Gospodnetic wrote:

> No, you're not missing anything. :)
> That JavaMail API is good for getting the whole email, but you then need to
> chop it up with your EmailAnalyzer, so you're doing the right thing.
>
> Otis
>
> ----- Original Message -----
> From: Michael J. Prichard <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Saturday, July 29, 2006 2:51:59 PM
> Subject: Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)
>
> Hasan Diwan wrote:
>
>> Michael:
>>
>> On 7/28/06, Michael J. Prichard <[EMAIL PROTECTED]> wrote:
>>
>>> Howdy... not sure if anyone else wants this but here is my first attempt
>>> at writing an analyzer for an email address...modifications, updates,
>>> fixes welcome.
>>
>> Why reinvent the wheel? See
>> http://java.sun.com/products/javamail/javadocs/javax/mail/internet/InternetAddress.html#parse(java.lang.String)
>>
>> and use as:
>>
>> InternetAddress valid = InternetAddress.parse(string)[0]; // far simpler than rewriting it
>
> i dont see where i can break an email address into simpler pieces for
> tokens.  i use javamail when parsing the message and then pulling the
> email using InternetAddress.  I don't see where I can break an email
> address like [EMAIL PROTECTED] into "[EMAIL PROTECTED]", "john", "foo.com", "foo"
> and "com" without splitting it.  Am I missing something?
>
> Thanks!
> Michael


Re: PerFieldAnalyzerWrapper use? Analyzer's not being used as expected....

2006-07-30 Thread Otis Gospodnetic
Or simpler:
wr = new IndexWriter(indexDir, aWrapper, !IndexReader.indexExists(indexDir));



Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)

2006-07-30 Thread Otis Gospodnetic
A good place for that is JIRA.  Could you put it there?  We have a bunch of
analyzers in Lucene's contrib, so if you are okay with putting the Apache license
on top of the source code, we can include it there.  Same for EmailAnalyzer.

Otis



Re: Sorting

2006-07-30 Thread Rob Staveley (Tom)
The limit is much less than Integer.MAX_VALUE (2,147,483,647), unless
you have a VM which can run in more than 1G of heap. 1G limits you to a
theoretical number of 256M (268,435,456) documents with 4 bytes per
array element. In practice it will be somewhat less, because there
are other things which need heap too.

We're going to need to maintain a set of sort indexes for documents in a
large index too, and I'm interested in suggestions for the best/easiest
way to maintain a non-RAM-based (or not entirely RAM-based) sort index
which is external to Lucene. Would using MySQL for sort indexing be "a
sledgehammer to crack a nut", I wonder? I've not really thought through
the RAMifications (sorry!) of this approach. I wonder if anyone else
here has attempted to integrate an external sort using a database?
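
To make the arithmetic above concrete, and to show the int[] order array idea from the quoted message below in runnable form, here is a minimal sketch in plain Java. It does not use Lucene's own sorting; SortOrderSketch and its method names are hypothetical, and the array header and every other heap user are ignored, so the figures are upper bounds only.

```java
import java.util.Arrays;
import java.util.Comparator;

final class SortOrderSketch {
    private SortOrderSketch() {}

    /** Heap bytes for an int[] with one 4-byte element per document. */
    static long bytesForDocs(long docCount) {
        return docCount * Integer.BYTES;
    }

    /** Upper bound on documents if the whole heap budget went to the int[]. */
    static long maxDocsForHeap(long heapBytes) {
        return heapBytes / Integer.BYTES;
    }

    /**
     * Orders document ids by their global rank, where order[doc] is the
     * position of that document in the desired sort order.
     */
    static int[] sortByGlobalOrder(int[] docIds, int[] order) {
        return Arrays.stream(docIds)
                .boxed()
                .sorted(Comparator.comparingInt((Integer doc) -> order[doc]))
                .mapToInt(Integer::intValue)
                .toArray();
    }
}
```

maxDocsForHeap(1L << 30) is 268,435,456, matching the 256M figure above, and bytesForDocs(Integer.MAX_VALUE) is 8,589,934,588 bytes, roughly 8 GB. With order = {2, 0, 1}, sorting the ids {0, 1, 2} gives {1, 2, 0}.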

On Sat, 2006-07-29 at 22:42 +0200, karl wettin wrote:
> On Sat, 2006-07-29 at 12:39 -0700, Jason Calabrese wrote:
> > One fast way to make an alphabetic sort very fast is to presort your
> > docs before adding them to the index.  If you do this you can then
> > just sort by index order.  We are using this for a large index (1
> > million+ docs) and it works very good, and seems even slightly faster
> > than relevance sorting.
> > 
> > Using this approach may create some maintenance issues since you
> > can't add a new doc to the index at a specified position.  Instead you
> > will need to re-index everything. 
> 
> Instead of above I would probably choose an int[index size] where each
> position in the array represents the global order of that document. It's
> much easier to re-order that than re-indexing the whole corpus every
> time you want to insert something.
> 
> It limits your corpus to 2 billion items (Integer.MAX_VALUE). And it
> will consume 32 bits of RAM per document.
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]