One problem using Lucene
Hi, I have a problem using Lucene. I wrote a SynonymFilter that adds synonyms from WordNet, and I use the SnowballFilter for term stemming. However, I run into a problem when combining the two filters. For instance, I have 17 documents containing the term "support", and the following is the SynonymAnalyzer I wrote:

public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    if (stopword != null) {
        result = new StopFilter(result, stopword);
    }
    result = new SnowballFilter(result, "Lovins");
    result = new SynonymFilter(result, engine);
    return result;
}

If I use only the SnowballFilter, I can find "support" in all 17 documents. After adding the SynonymFilter, however, "support" is found in only 10 documents; it seems the term cannot be found in the remaining 7. I don't know what is wrong.

regards,
Jiang Xing
Locked files after updating Lucene to 1.4.3
Hi, I've run into an issue after updating the Lucene libs from 1.3-final to 1.4.3. We have a batch job on our web server that recreates the Lucene search index every night; it deletes the whole index and creates a new one. The index is used by the Lucene-powered search feature of the web site (IS + Resin-2.1.11). The search itself still works, but once I do a search on the web site, some files in the index become locked, and the index updater then fails because it tries to delete those locked files. The error is something like:

[ERROR][2006-01-15 08:15:01 - main - de.bcg.web.search.BcgSiteSearch] Error while building index.
java.io.IOException: couldn't delete _e3.tis
    at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
    at org.apache.lucene.store.FSDirectory.<init>(FSDirectory.java:151)
    at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:132)
    at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:151)
    at de.bcg.web.search.BcgSiteSearch.buildIndex(BcgSiteSearch.java:99)
    at de.bcg.web.search.BcgSiteSearch.main(BcgSiteSearch.java:71)

The developer of the search code is no longer here and I have to maintain it. Why does this locking happen? It never happened with 1.3, so I probably need to update something in the code. Any hints about what causes the lock and how to fix it are very welcome :)

thanks,
Jens Ansorg

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: One problem using Lucene
Could you share the details of your SynonymFilter? Is it adding tokens into the same position as the original tokens (position increment of 0)? Are you using QueryParser for searching? If so, try TermQuery to take the parser's analysis out of the picture while troubleshooting. If you are using QueryParser, are you using the same analyzer? If so, what is the .toString() of the generated Query?

Erik

On Jan 16, 2006, at 3:54 AM, jason wrote:
> Hi, I got a problem of using the lucene. I write a SynonymFilter which can
> add synonyms from the WordNet. Meanwhile, i used the SnowballFilter for
> term stemming. However, i got a problem when combining the two fiters. [...]
Re: Part-Of Match
Hi Hoss,

thanks for the answer, and yes, you have described the problem perfectly. I think you are right that Lucene is in fact not the best way of solving it. I decided to simply build a letter trie of all concepts and then search the document against the trie. On the one hand this yields exact matches only (and that's exactly what I need), and furthermore it yields matches even for concepts that appear in plural form in the query document, so "von Willebrands" will yield "von Willebrand".

Thanks for your efforts,
Sven

--- Original Message ---
Date: 15.01.2006 22:14
From: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: AW: Part-Of Match

> : >>von Willebrand<< is not the query but a document in the index. The task
> : is to detect exact matches of phrases inside a query (large document) with
> : these phrases stored in the index.
>
> Lemme see if I can restate your problem...
>
> You want to build a data repository into which you insert a large magnitude
> of "concepts", where a concept is a short phrase consisting of a few words
> (possibly just one word). The words in any given concept phrase may
> overlap (or be a superset of) the words in other concepts.
>
> Once this concept repository is built, you want to build a black box
> around it, such that people can hand your black box a "document"
> (i.e. a research paper, a newspaper article, a short story ...
> some text consisting of many, many sentences), and you want your black box
> to then return the list of concepts that match the input document, such
> that the concepts with the highest score are concepts whose phrase appears
> exactly in the input document. Concepts whose phrase doesn't appear
> exactly in the document should still be returned, but with a lower score
> based on how many words of the concept's phrase are found in the input
> document.
>
> (Have I adequately described your problem?)
>
> It's an interesting idea. Can it be done with Lucene? ... I can think of
> one kludgy mechanism for doing it, but I'd be very surprised if there isn't
> a better way (or if there is some other software library out there that
> would be more suited).
>
> Build a permanent index in which each concept is a Lucene Document.
> These documents really only need one stored/tokenized/indexed field
> containing the phrase (if you want other payload fields, that's up to you).
>
> Each time you are asked to analyze a text sample and return matching
> phrases, run the text through your analyzer to get back a token stream, and
> for each of those tokens use a TermDocs iterator to find out whether any
> phrase in your concept index contains that term, and if so which ones.
> (You could also do this by building a boolean OR query out of all the
> words in your input document -- but that may run into performance
> limitations if your input docs are too big, and it will try to score each
> concept, which isn't necessary, so even for short input text it's less
> efficient.)
>
> Now you have an (unordered) list of concepts that have something to do
> with your input text.
>
> Next, build a RAMDirectory-based index consisting of exactly one document,
> which you build from the input text. Loop over the list of concepts you
> got and build a boolean query out of each one along the lines that
> Daniel described: a phrase query on the whole concept phrase along with
> term queries for each individual word -- all optional. Run each of these
> boolean queries against your one-document RAMDirectory. The higher the
> score, the better that concept applies to your input text.
>
> -Hoss
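The letter trie Sven settles on can be sketched in self-contained Java. This is only an illustration of the idea, not his actual code; the class and method names are made up, and matching is done on lowercased characters so that a plural like "von Willebrands" still contains the concept "von Willebrand" as a prefix:

```java
import java.util.*;

// Minimal character trie: insert every concept phrase, then scan the input
// document and report each concept that occurs verbatim (case-insensitively).
public class ConceptTrie {
    private final Map<Character, ConceptTrie> children = new HashMap<>();
    private boolean terminal; // true if a concept ends at this node

    public void insert(String concept) {
        ConceptTrie node = this;
        for (char c : concept.toLowerCase().toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new ConceptTrie());
        }
        node.terminal = true;
    }

    /** Returns every concept in the trie that appears verbatim in the text. */
    public Set<String> matches(String text) {
        String lower = text.toLowerCase();
        Set<String> found = new TreeSet<>();
        for (int start = 0; start < lower.length(); start++) {
            ConceptTrie node = this;
            for (int i = start; i < lower.length(); i++) {
                node = node.children.get(lower.charAt(i));
                if (node == null) break;
                if (node.terminal) found.add(lower.substring(start, i + 1));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        ConceptTrie trie = new ConceptTrie();
        trie.insert("von Willebrand");
        trie.insert("factor VIII");
        // prints [von willebrand]
        System.out.println(trie.matches("A case of von Willebrands disease."));
    }
}
```

Scanning from every start position is quadratic in the document length; for very large documents an Aho-Corasick automaton would do the same job in one pass.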
Re: Finding similar documents
Grant Ingersoll wrote:
> I believe there is a MoreLikeThis class floating around somewhere (I think
> it is in the contrib/similarity package). The Lucene book also has a good
> example, and I have some examples at http://www.cnlp.org/apachecon2005 that
> demonstrate using term vectors to do this.

Klaus wrote:
> Hi, is there a built-in method for finding similar documents to one given
> document? Thx, Klaus

I've implemented a simple relevance-feedback algorithm which extracts terms from all interesting documents and builds a new query from those terms. It is pretty simple, but it works in most cases.
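The term-extraction step described above can be sketched in plain Java, with no Lucene involved. Everything here is illustrative: strings stand in for the "interesting" documents, the length filter is a crude stand-in for stopword removal, and the output is an OR-joined query string of the most frequent terms:

```java
import java.util.*;
import java.util.stream.*;

// Relevance-feedback sketch: pool the terms of the interesting documents,
// rank them by frequency, and join the top N into a new OR query string.
public class FeedbackQuery {
    public static String buildQuery(List<String> interestingDocs, int topN) {
        Map<String, Long> freq = interestingDocs.stream()
            .flatMap(d -> Arrays.stream(d.toLowerCase().split("\\W+")))
            .filter(t -> t.length() > 2)           // crude stopword-ish filter
            .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        return freq.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(topN)
            .map(Map.Entry::getKey)
            .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "lucene index search search",
            "search index performance");
        System.out.println(buildQuery(docs, 2)); // prints search OR index
    }
}
```

In a real system one would weight terms by tf-idf rather than raw frequency (this is roughly what MoreLikeThis does), but the shape of the algorithm is the same.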
Memory
Hi all,

Is anyone experiencing possible memory problems on Linux with Lucene search? Here is our scenario: we have a service on Linux that takes all incoming requests through a port and does the search. Only one IndexSearcher is instantiated by the service. When I run ps and grep for java, it shows only one Java process running. However, when 4 users log into our program and start to search at the same time, 4 java processes show up in top (and I can't see their parent PID from the top command), but still only one java process in ps. My company fears that each process is being allocated 128 MB of memory and is running the box out of memory (when the service is started, we allocate 10-128 MB in the java call). I am still testing with our system guys and having the data analyzed by a 3rd party, but I was curious about your findings.

Thanks ahead of time,
Tom
Re: Memory
If you look at the man page for 'ps' you'll see a switch that shows all the threads too (it's different on different Unix flavours, so best to look in the man page). Once you've shown the threads in 'ps', you'll see the processes that are appearing in top, and I'll bet their parent is your original java process. I wouldn't panic: each thread is almost certainly sharing the same memory pool, so while top reports that the thread has X MB of memory, it's really the same physical block as all the others. You see this all the time on a Tomcat app server box, where each HTTP connector is a thread and appears as its own process.

cheers,
Paul Smith

On 17/01/2006, at 7:11 AM, Aigner, Thomas wrote:
> Hi all, Is anyone experiencing possible memory problems on LINUX with
> Lucene search? [...]
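A concrete example of what Paul describes, using the Linux procps switches (other Unix flavours differ, so check your own man page):

```shell
# One line per thread: the LWP column is the thread id, NLWP the number of
# threads in the owning process. All threads of one JVM share one PID.
ps -eLf | head -5

# Per-process summary: thread count (nlwp) and resident memory (rss).
# The rss figure is shared by all of a process's threads, not multiplied
# per thread, which is why top's per-thread numbers look alarming.
ps -eo pid,nlwp,rss,comm | head -5
```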
RE: Memory
Thanks Paul, I did a man on top, and sure enough there is a PPID field on Linux (f then B) for the parent process. And yes, they always have the same parent. Thanks for your help; I'm obviously still a noob on Unix.

Tom

-----Original Message-----
From: Paul Smith [mailto:[EMAIL PROTECTED]]
Sent: Monday, January 16, 2006 3:18 PM
To: java-user@lucene.apache.org
Subject: Re: Memory

If you look at the man page for 'ps' you'll see a switch that shows all
the threads too (it's different on different unix flavours, so best to
do look in the man page). [...]
Re: One problem using Lucene
Hi,

the following is the SynonymFilter I wrote:

import org.apache.lucene.analysis.*;
import java.io.*;
import java.util.*;

/**
 * @author JIANG XING
 *
 * Jan 15, 2006
 */
public class SynonymFilter extends TokenFilter {

    public static final String TOKEN_TYPE_SYNONYM = "SYNONYM";

    private Stack synonymStack;
    private WordNetSynonymEngine engine;

    public SynonymFilter(TokenStream in, WordNetSynonymEngine engine) {
        super(in);
        synonymStack = new Stack();
        this.engine = engine;
    }

    public Token next() throws IOException {
        if (synonymStack.size() > 0) {
            return (Token) synonymStack.pop();
        }
        Token token = input.next();
        if (token == null) {
            return null;
        }
        addAliasesToStack(token);
        return token;
    }

    private void addAliasesToStack(Token token) throws IOException {
        String[] synonyms = engine.getSynonyms(token.termText());
        if (synonyms == null) return;
        for (int i = 0; i < synonyms.length; i++) {
            Token synToken = new Token(synonyms[i],
                    token.startOffset(), token.endOffset(),
                    TOKEN_TYPE_SYNONYM);
            synToken.setPositionIncrement(0);
            synonymStack.push(synToken);
        }
    }
}

It adds tokens into the same position as the original token. I then use QueryParser for searching, with the Snowball analyzer for parsing. The following is the SynonymAnalyzer I wrote:

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.analysis.snowball.*;
import java.io.*;
import java.util.*;

/**
 * @author JIANG XING
 *
 * Jan 15, 2006
 */
public class SynonymAnalyzer extends Analyzer {

    private WordNetSynonymEngine engine;
    private Set stopword;

    public SynonymAnalyzer(String[] word) {
        try {
            engine = new WordNetSynonymEngine(
                    new File("C:\\PDF2Text\\SearchEngine\\WordNetIndex"));
            stopword = StopFilter.makeStopSet(word);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        if (stopword != null) {
            result = new StopFilter(result, stopword);
        }
        result = new SnowballFilter(result, "Lovins");
        result = new SynonymFilter(result, engine);
        return result;
    }
}

I added some debugging code to the SnowballFilter (lines 75-79 in my copy; the check on "support" below). If I use only the SnowballFilter, the term "support" can be found in all 17 documents. However, once the line "result = new SynonymFilter(result, engine);" is added, the term "support" cannot be found in some documents.

public class SnowballFilter extends TokenFilter {

    private static final Object[] EMPTY_ARGS = new Object[0];
    private SnowballProgram stemmer;
    private Method stemMethod;

    /** Construct the named stemming filter.
     *
     * @param in the input tokens to stem
     * @param name the name of a stemmer
     */
    public SnowballFilter(TokenStream in, String name) {
        super(in);
        try {
            Class stemClass = Class.forName("net.sf.snowball.ext." + name + "Stemmer");
            stemmer = (SnowballProgram) stemClass.newInstance();
            // why doesn't the SnowballProgram class have an (abstract?) stem method?
            stemMethod = stemClass.getMethod("stem", new Class[0]);
        } catch (Exception e) {
            throw new RuntimeException(e.toString());
        }
    }

    /** Returns the next input Token, after being stemmed. */
    public final Token next() throws IOException {
        Token token = input.next();
        if (token == null)
            return null;
        stemmer.setCurrent(token.termText());
        try {
            stemMethod.invoke(stemmer, EMPTY_ARGS);
        } catch (Exception e) {
            throw new RuntimeException(e.toString());
        }
        Token newToken = new Token(stemmer.getCurrent(),
                token.startOffset(), token.endOffset(), token.type());
        // check the tokens
        if (newToken.termText().equals("support")) {
            System.out.println("the term support is found");
        }
        newToken.setPositionIncrement(token.getPositionIncrement());
        return newToken;
    }
}

On 1/16/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> Could you share the details of your SynonymFilter? Is it adding
> tokens into the same position as the original tokens (position
> increment of 0)? Are you using QueryParser for searching? If so,
> try TermQuery to eliminate the parser's analysis from the picture for
> the time being while trouble shooting. [...]
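The buffering pattern in the filter's next() method (emit pending synonyms from the stack before pulling the next input token) can be illustrated without any Lucene classes. Everything here is illustrative: a string list stands in for the token stream and the synonym map is made up:

```java
import java.util.*;

// Sketch of the SynonymFilter.next() pattern: if synonyms are pending on
// the stack, emit them first; otherwise pull the next input token, emit it,
// and push its synonyms for the following calls.
public class SynonymStreamSketch {
    public static List<String> expand(List<String> tokens,
                                      Map<String, List<String>> synonyms) {
        List<String> out = new ArrayList<>();
        Deque<String> stack = new ArrayDeque<>();
        Iterator<String> input = tokens.iterator();
        while (!stack.isEmpty() || input.hasNext()) {
            if (!stack.isEmpty()) {          // pending synonyms win
                out.add(stack.pop());
                continue;
            }
            String token = input.next();
            out.add(token);                  // original token comes first
            for (String syn : synonyms.getOrDefault(token,
                    Collections.emptyList()))
                stack.push(syn);             // queued for the next calls
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> syn = new HashMap<>();
        syn.put("support", Arrays.asList("help", "aid"));
        // prints [we, support, aid, help, lucene]
        System.out.println(expand(Arrays.asList("we", "support", "lucene"), syn));
    }
}
```

In the real filter the injected tokens additionally carry a position increment of 0, so they occupy the same position as the original token rather than following it.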
How do I get a count of all search results inside of my content?
I am trying to find a quick way to get a complete count of all search results found in all of my Documents. Let me back up... I have split the content that I am searching into many Documents and then indexed this content. Each Document represents about one "paragraph" of data. Now I search all of my Documents for a word or phrase. If I understand correctly, the Hits that are returned tell me which Documents contain the information I am searching for, and Hits.length() tells me how many documents contain my information.

I would like to know how many total results were found for my search. In other words, if a Document contains the word or phrase more than once, I would like to know this, so that I can return a "true" count of search results found across all of my Documents. It seems that Lucene must already know this information, since it searched the Document when it scored it and added it to my Hits. What is the best way to get this information quickly?

Thanks,
Gary
Re: How do I get a count of all search results inside of my content?
1) There's no need to send the same message twice just because you didn't get a rapid response to the first one... in most parts of the US this has been a three-day weekend, so it's not that surprising that no one had replied since you first asked this question Friday night.

2) You need to be careful about your terminology...

: I would like to know how many total results were found for my search. In
: other words, if a Document contains the word or phrase more than once, I
: would like to know this information so that I can return a "true" count of
: search results that were found across all of my Documents. It seems that

The total number of results of your search is Hits.length(): 1 result is 1 matching document. What you are asking for is information about the frequency of a word or phrase. The TermEnum class makes it easy to find out the frequency of a term in your entire index. The frequency of a phrase is more complicated. I would suggest you start by looking at the documentation on Similarity and the way scores are calculated. I believe it is possible to write an implementation of Similarity such that the raw score of a PhraseQuery on any document is the number of times that phrase appears in the document. You will then need to use a HitCollector to sum the raw scores so they don't get normalized for you.

-Hoss
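The distinction Hoss draws (number of matching documents vs. total occurrences) can be shown in self-contained Java, with strings standing in for Gary's paragraph-sized Documents. This illustrates only the counting, not the Lucene API:

```java
import java.util.*;

public class OccurrenceCount {
    // Number of documents containing the term at least once -- the figure
    // Hits.length() reports.
    public static int matchingDocs(List<String> docs, String term) {
        int n = 0;
        for (String d : docs) if (termFreq(d, term) > 0) n++;
        return n;
    }

    // Total occurrences across all documents -- the "true" count Gary wants,
    // i.e. the sum of the per-document term frequencies.
    public static int totalOccurrences(List<String> docs, String term) {
        int n = 0;
        for (String d : docs) n += termFreq(d, term);
        return n;
    }

    private static int termFreq(String doc, String term) {
        int n = 0;
        for (String token : doc.toLowerCase().split("\\W+"))
            if (token.equals(term.toLowerCase())) n++;
        return n;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "lucene search", "search the search index", "no match here");
        System.out.println(matchingDocs(docs, "search"));     // prints 2
        System.out.println(totalOccurrences(docs, "search")); // prints 3
    }
}
```

Inside Lucene the per-document counts already exist in the index as term frequencies, which is why Hoss points at Similarity and raw scores rather than at re-scanning the stored text.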