Re: PerFieldAnalyzerWrapper use? Analyzer's not being used as expected....
Sorry, Otis is right. I just couldn't see anything else in your code that could have been wrong.

Erik

On Jul 29, 2006, at 11:42 PM, Otis Gospodnetic wrote:

> I think you can reuse them. Fields should be handled/analyzed sequentially.
> I reuse them for some stuff on Simpy.com.
>
> But you may want to clean up that try/catch. Instead of catching the
> IOException, you may want to use !IndexReader.indexExists(...) in place of
> that boolean param to the IndexWriter ctor.
>
> Otis
>
> - Original Message -
> From: Michael J. Prichard <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Saturday, July 29, 2006 4:04:23 PM
> Subject: Re: PerFieldAnalyzerWrapper use? Analyzer's not being used as expected
>
>> Hey Erik,
>>
>> Will do. May I ask why? Out of curiosity.
>>
>> Thanks,
>> Michael
>>
>> Erik Hatcher wrote:
>>
>>> I think you should use a new instance of each analyzer for each
>>> field, not reuse instances. Other than that, your usage is fine.
>>>
>>> Erik
>>>
>>> On Jul 29, 2006, at 3:49 PM, Michael J. Prichard wrote:
>>>
>>>> So I have the following code...
>>>>
>>>> // let's get our SynonymAnalyzer
>>>> SynonymAnalyzer synAnalyzer = getSynonymAnalyzer();
>>>> // let's get our EmailAnalyzer
>>>> EmailAnalyzer emailAnalyzer = getEmailAnalyzer();
>>>>
>>>> // set up the PerFieldAnalyzerWrapper
>>>> PerFieldAnalyzerWrapper aWrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
>>>> aWrapper.addAnalyzer("subject", synAnalyzer);
>>>> aWrapper.addAnalyzer("content", synAnalyzer);
>>>> aWrapper.addAnalyzer("from", emailAnalyzer);
>>>> aWrapper.addAnalyzer("to", emailAnalyzer);
>>>> aWrapper.addAnalyzer("cc", emailAnalyzer);
>>>> aWrapper.addAnalyzer("bcc", emailAnalyzer);
>>>>
>>>> // create the writer
>>>> try {
>>>>     wr = new IndexWriter(indexDir, aWrapper, false);
>>>>     wr.setUseCompoundFile(false);
>>>> } catch (IOException iox) {
>>>>     // means it ain't there
>>>>     wr = new IndexWriter(indexDir, aWrapper, true);
>>>>     wr.setUseCompoundFile(false);
>>>> }
>>>>
>>>> When I add a Document to the IndexWriter it does not seem to use the
>>>> analyzers I want it to. It just uses StandardAnalyzer for everything!
>>>> Is this the correct way to use PerFieldAnalyzerWrapper?
>>>>
>>>> Thanks,
>>>> Michael
>>>>
>>>> P.S. I am using the Lucene 2 libs.
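For anyone hitting the same symptom: a quick way to verify that PerFieldAnalyzerWrapper is really dispatching to the per-field analyzers is to dump the tokens it produces for each field. This is a minimal sketch against the Lucene 2.0 analysis API; it uses stock analyzers (WhitespaceAnalyzer vs. the default StandardAnalyzer) in place of the poster's SynonymAnalyzer/EmailAnalyzer so that it runs standalone.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerCheck {
    public static void main(String[] args) throws IOException {
        // Same wiring pattern as in the thread, with stock analyzers standing in.
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        wrapper.addAnalyzer("from", new WhitespaceAnalyzer());

        dump(wrapper, "from", "John@Foo.COM says Hello");  // handled by WhitespaceAnalyzer
        dump(wrapper, "body", "John@Foo.COM says Hello");  // falls back to StandardAnalyzer
    }

    static void dump(Analyzer a, String field, String text) throws IOException {
        // The wrapper picks the analyzer by field name, so the field passed here matters.
        TokenStream ts = a.tokenStream(field, new StringReader(text));
        System.out.print(field + ":");
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.print(" [" + t.termText() + "]");
        }
        System.out.println();
        ts.close();
    }
}

If the "from" field comes back as raw whitespace-split tokens while "body" comes back lowercased by StandardAnalyzer, the wrapper is dispatching correctly and the problem lies elsewhere (for example, a different analyzer instance being passed to the IndexWriter that actually does the indexing).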
Re: email libraries
Andrzej Bialecki wrote:

> Just for the record - I've been using the javamail POP and IMAP providers in
> the past, and they were prone to hanging with some servers, and resource
> intensive. I've also been using Outlook (proper, not Outlook Express - that
> is AFAIK impossible to work with) via a Java-COM bridge such as Jawin or
> JNIWrapper plus Redemption. This also tends to be rather unstable, and
> requires a lot of fine-tuning ...

We use javamail a *lot* with the Scalix IMAP server (the web access part uses
IMAP underneath). We have had performance problems with the way that javamail
works, although for just scanning a message store to index messages it's OK.
We have tuned the web access code somewhat to make it behave better, but we've
also re-engineered the IMAP server somewhat, partly with javamail in mind, and
performance and resource usage on the server are now somewhat under control.

> So, be prepared to suffer quite a bit. ;)

If you're doing complicated things, yes, but if it's simple access for the
purposes of indexing then you probably don't need to worry too much.

jch
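For the simple "scan a message store so you can index it" case mentioned above, the JavaMail side is fairly small. A minimal sketch, assuming an IMAP server at the hypothetical host and credentials shown; the real code would feed the fetched fields into the Lucene indexing pipeline:

import java.util.Properties;
import javax.mail.Folder;
import javax.mail.Message;
import javax.mail.Session;
import javax.mail.Store;

public class ImapScan {
    public static void main(String[] args) throws Exception {
        Session session = Session.getInstance(new Properties());

        // Hypothetical host and credentials - replace with real values.
        Store store = session.getStore("imap");
        store.connect("mail.example.com", "user", "secret");

        Folder inbox = store.getFolder("INBOX");
        inbox.open(Folder.READ_ONLY);

        // Iterate over the messages; headers are cheap, bodies are fetched lazily.
        Message[] messages = inbox.getMessages();
        for (int i = 0; i < messages.length; i++) {
            System.out.println(messages[i].getSubject());
            // ... build a Lucene Document from subject/from/to/body here ...
        }

        inbox.close(false);
        store.close();
    }
}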
java.lang.IllegalAccessError: tried to access method org.apache.lucene.search.HitDoc.
I'm having difficulty getting Lucene to work for me, and it keeps coming back
to this HitDoc class. At the moment, whenever I call the IndexBuilder.search
method, this is what I get:

[error] WorkThread: java.lang.IllegalAccessError: tried to access method org.apache.lucene.search.HitDoc.<init>(FI)V from class org.apache.lucene.search.Hits
[error] WorkThread:   at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:94)
[error] WorkThread:   at org.apache.lucene.search.Hits.<init>(Hits.java:53)
[error] WorkThread:   at org.apache.lucene.search.Searcher.search(Searcher.java:44)
[error] WorkThread:   at org.apache.lucene.search.Searcher.search(Searcher.java:36)
[error] WorkThread:   at infoviewer.lucene.IndexBuilder.search(IndexBuilder.java:118)
[error] WorkThread:   at infoviewer.lucene.SearchPanel$ActionHandler$1.run(SearchPanel.java:190)
[error] WorkThread:   at org.gjt.sp.util.WorkThread.doRequest(WorkThread.java:194)
[error] WorkThread:   at org.gjt.sp.util.WorkThread.doRequests(WorkThread.java:161)

I tried moving the HitDoc class out of Hits.java and into its own HitDoc.java
file, and making the class and its ctor public, but I still get this error...
So now I'm really confused.
Re: Consult some information about adding index while searching
Thank you
Re: Span Query NLE
On Tuesday 25 July 2006 03:26, Charlie wrote:
...
> > can "surround" be nested
> > 3w(4n(a?a AND bb?) AND cc+)

Yes, but IIRC the "arguments" need to be separated by commas:

3w( 4n( ... , ...) , ...)

instead of by AND.

Regards,
Paul Elschot
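With the commas in place, Charlie's example would read 3w(4n(a?a, bb?), cc+). The snippet below shows how such a query could be parsed and turned into an ordinary Lucene Query; the package, class and method names are written from memory of the contrib "surround" parser and the "content" field is just an example, so treat them as assumptions and check against the version you actually have.

import org.apache.lucene.queryParser.surround.parser.QueryParser;
import org.apache.lucene.queryParser.surround.query.BasicQueryFactory;
import org.apache.lucene.queryParser.surround.query.SrndQuery;
import org.apache.lucene.search.Query;

public class SurroundExample {
    public static void main(String[] args) throws Exception {
        // Nested distance query: cc+ within 3 of the group (a?a within 4 of bb?),
        // with the operands separated by commas rather than AND.
        SrndQuery srnd = QueryParser.parse("3w(4n(a?a, bb?), cc+)");

        // Map the surround query onto a concrete field before searching.
        Query query = srnd.makeLuceneQueryField("content", new BasicQueryFactory());
        System.out.println(query);
    }
}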
Re: PerFieldAnalyzerWrapper use? Analyzer's not being used as expected....
This look better?

// Check to see if the index exists.
// If it doesn't, then set the createIndex boolean to true.
boolean createIndex = false;
if (!IndexReader.indexExists(indexDir)) {
    createIndex = true;
}

// let's set up the index writer
wr = new IndexWriter(indexDir, aWrapper, createIndex);
wr.setUseCompoundFile(false);

Otis Gospodnetic wrote:

> I think you can reuse them. Fields should be handled/analyzed sequentially.
> I reuse them for some stuff on Simpy.com.
>
> But you may want to clean up that try/catch. Instead of catching the
> IOException, you may want to use !IndexReader.indexExists(...) in place of
> that boolean param to the IndexWriter ctor.
>
> Otis
Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)
Kewl :) I updated the filter (for anyone interested). Actually... if anyone
wants it, I can zip it up and send it to them... let me know.

EmailFilter:

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Token;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Stack;

public class EmailFilter extends TokenFilter {

    public static final String TOKEN_TYPE_EMAIL = "EMAILPART";

    private Stack emailTokenStack;

    public EmailFilter(TokenStream in) {
        super(in);
        emailTokenStack = new Stack();
    }

    public Token next() throws IOException {
        // Drain any sub-tokens queued up from the previous email token first.
        if (emailTokenStack.size() > 0) {
            return (Token) emailTokenStack.pop();
        }

        Token token = input.next();
        if (token == null) {
            return null;
        }

        addEmailPartsToStack(token);
        return token;
    }

    private void addEmailPartsToStack(Token token) throws IOException {
        String[] parts = getEmailParts(token.termText());
        if (parts == null) return;

        for (int i = 0; i < parts.length; i++) {
            // Position increment 0 stacks the parts on the original token's position.
            Token synToken = new Token(parts[i], token.startOffset(), token.endOffset(), TOKEN_TYPE_EMAIL);
            synToken.setPositionIncrement(0);
            emailTokenStack.push(synToken);
        }
    }

    /*
     * Parses an email address into its parts for tokenization.
     * For example john@foo.com would be broken into:
     *   [john@foo.com]
     *   [john]
     *   [foo.com]
     *   [foo]
     *   [com]
     */
    private String[] getEmailParts(String email) {
        // collect the parts before calling toArray
        ArrayList partsList = new ArrayList();

        // split on the @
        String[] splitOnAmpersand = email.split("@");

        // add the username
        try {
            partsList.add(splitOnAmpersand[0]);
        } catch (ArrayIndexOutOfBoundsException ae) {
            // ignore
        }

        // add the full host name
        try {
            partsList.add(splitOnAmpersand[1]);
        } catch (ArrayIndexOutOfBoundsException ae) {
            // ignore
        }

        // split the host name into pieces
        if (splitOnAmpersand.length > 1) {
            String[] splitOnDot = splitOnAmpersand[1].split("\\.");

            // add all pieces from splitOnDot
            for (int i = 0; i < splitOnDot.length; i++) {
                partsList.add(splitOnDot[i]);
            }

            /*
             * If this is greater than 2 then we need to add the domain name,
             * which should be the last two pieces.
             */
            if (splitOnDot.length > 2) {
                String domain = splitOnDot[splitOnDot.length - 2] + "." + splitOnDot[splitOnDot.length - 1];
                // add domain
                partsList.add(domain);
            }
        }

        return (String[]) partsList.toArray(new String[0]);
    }
}

end EmailFilter

Otis Gospodnetic wrote:

> No, you're not missing anything. :)
> That JavaMail API is good for getting the whole email, but you then need to
> chop it up with your EmailAnalyzer, so you're doing the right thing.
>
> Otis
>
> - Original Message -
> From: Michael J. Prichard <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Saturday, July 29, 2006 2:51:59 PM
> Subject: Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)
>
> Hasan Diwan wrote:
>
>> Michael:
>>
>> On 7/28/06, Michael J. Prichard <[EMAIL PROTECTED]> wrote:
>>
>>> Howdy... not sure if anyone else wants this but here is my first attempt
>>> at writing an analyzer for an email address... modifications, updates,
>>> fixes welcome.
>>
>> Why reinvent the wheel? See
>> http://java.sun.com/products/javamail/javadocs/javax/mail/internet/InternetAddress.html#parse(java.lang.String)
>> and use as:
>>
>> InternetAddress valid = InternetAddress.parse(string)[0]; // far simpler than rewriting it
>
> I don't see where I can break an email address into simpler pieces for
> tokens. I use javamail when parsing the message and then pull the email
> address using InternetAddress. I don't see where I can break an email
> address like john@foo.com into "john@foo.com", "john", "foo.com", "foo"
> and "com" without splitting it myself. Am I missing something?
>
> Thanks!
> Michael
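The thread never shows the EmailAnalyzer that wraps this filter, so here is a minimal sketch of what it could look like against the Lucene 2.0 analysis API. The exact tokenizer/filter chain is an assumption, not Michael's actual class; the point is just that EmailFilter sits at the end of the chain.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class EmailAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // StandardTokenizer keeps john@foo.com together as a single token,
        // lowercase it, then let EmailFilter push the sub-parts onto the stream.
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new EmailFilter(result);
        return result;
    }
}

With an analyzer like this registered for the "from"/"to"/"cc"/"bcc" fields via PerFieldAnalyzerWrapper, a search on any of the sub-parts (the local part, the full host, or just the domain) matches the original address.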
Re: PerFieldAnalyzerWrapper use? Analyzer's not being used as expected....
Or simpler:

wr = new IndexWriter(indexDir, aWrapper, !IndexReader.indexExists(indexDir));

- Original Message -
From: Michael J. Prichard <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Sunday, July 30, 2006 1:35:29 PM
Subject: Re: PerFieldAnalyzerWrapper use? Analyzer's not being used as expected

> This look better?
>
> // Check to see if the index exists.
> // If it doesn't, then set the createIndex boolean to true.
> boolean createIndex = false;
> if (!IndexReader.indexExists(indexDir)) {
>     createIndex = true;
> }
>
> // let's set up the index writer
> wr = new IndexWriter(indexDir, aWrapper, createIndex);
> wr.setUseCompoundFile(false);
>
> [rest of the quoted thread snipped - see the earlier messages in this thread]
Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)
A good place for that is JIRA - could you put it there? We have a bunch of
analyzers in Lucene's contrib, so if you are okay with putting the Apache
license on top of the source code, we can include it there. Same for
EmailAnalyzer.

Otis

- Original Message -
From: Michael J. Prichard <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Sunday, July 30, 2006 1:37:57 PM
Subject: Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)

> Kewl :) I updated the filter (for anyone interested). Actually... if anyone
> wants it, I can zip it up and send it to them... let me know.
>
> [EmailFilter source and the rest of the quoted thread snipped - see the
> earlier message in this thread]
Re: Sorting
The limit is much less than Integer.MAX_VALUE (2,147,483,647), unless you have
a VM which can run with more than 1 GB of heap. 1 GB limits you to a
theoretical maximum of 256M (268,435,456) documents at 4 bytes per array
element. In practice it will be somewhat less, because there are other things
which need heap too.

We're going to need to maintain a set of sort indexes for documents in a large
index too, and I'm interested in suggestions for the best/easiest way to
maintain a non-RAM-based (or not entirely RAM-based) sort index which is
external to Lucene. Would using MySQL for sort indexing be "a sledgehammer to
crack a nut", I wonder? I've not really thought through the RAMifications
(sorry!) of this approach. I wonder if anyone else here has attempted to
integrate an external sort using a database?

On Sat, 2006-07-29 at 22:42 +0200, karl wettin wrote:

> On Sat, 2006-07-29 at 12:39 -0700, Jason Calabrese wrote:
>
>> One way to make an alphabetic sort very fast is to presort your
>> docs before adding them to the index. If you do this you can then
>> just sort by index order. We are using this for a large index (1
>> million+ docs) and it works very well, and seems even slightly faster
>> than relevance sorting.
>>
>> Using this approach may create some maintenance issues since you
>> can't add a new doc to the index at a specified position. Instead you
>> will need to re-index everything.
>
> Instead of the above I would probably choose an int[index size] where each
> position in the array represents the global order of that document. It's
> much easier to re-order that than to re-index the whole corpus every
> time you want to insert something.
>
> It limits your corpus to 2 billion items (Integer.MAX_VALUE). And it
> will consume 32 bits of RAM per document.
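To make karl's int[] suggestion concrete, here is a rough sketch of how a precomputed global ordering could be plugged into the custom sorting hooks of the Lucene 2.0 API (SortComparatorSource / ScoreDocComparator). The field name and the way the order array is loaded are placeholders; how the array is built and kept in sync with the index is exactly the maintenance problem discussed above.

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.ScoreDocComparator;
import org.apache.lucene.search.SortComparatorSource;
import org.apache.lucene.search.SortField;

public class PrecomputedOrderSort implements SortComparatorSource {

    // order[docId] = global alphabetic rank of that document; built offline.
    private final int[] order;

    public PrecomputedOrderSort(int[] order) {
        this.order = order;
    }

    public ScoreDocComparator newComparator(IndexReader reader, String fieldname)
            throws IOException {
        return new ScoreDocComparator() {
            public int compare(ScoreDoc i, ScoreDoc j) {
                return order[i.doc] - order[j.doc];
            }
            public Comparable sortValue(ScoreDoc i) {
                return new Integer(order[i.doc]);
            }
            public int sortType() {
                return SortField.INT;
            }
        };
    }
}

// Usage (hypothetical): the "title" field name is only a label here, since the
// comparator ignores the field and uses the precomputed array.
//
// int[] order = loadOrderArrayFromDisk();   // placeholder for however it is maintained
// Sort sort = new Sort(new SortField("title", new PrecomputedOrderSort(order)));
// Hits hits = searcher.search(query, sort);

This keeps only 4 bytes per document in RAM, as karl notes; a variant that pages parts of the array from disk (or a database) would trade that RAM for I/O, which is the external-sort question raised above.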