One problem of using the lucene

2006-01-16 Thread jason
Hi,

I have a problem using Lucene.

I wrote a SynonymFilter that adds synonyms from WordNet. Meanwhile,
I used the SnowballFilter for term stemming. However, I ran into a problem
when combining the two filters.

For instance, I have 17 documents containing the term "support", and the
following is the SynonymAnalyzer I wrote.

public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    if (stopword != null) {
        result = new StopFilter(result, stopword);
    }
    result = new SnowballFilter(result, "Lovins");
    result = new SynonymFilter(result, engine);
    return result;
}

If I use only the SnowballFilter, I can find "support" in all 17
documents. However, after adding the SynonymFilter, "support" can only
be found in 10 documents; the term cannot be found in the remaining 7.
I don't know what is wrong.
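An editorial aside on the pipeline above: because the SnowballFilter runs before the SynonymFilter, synonyms injected from WordNet are never stemmed themselves, so they can end up in a different form than the stemmed terms elsewhere in the index. This may or may not be the cause of the missing documents, but it is worth checking. Below is a minimal plain-Java sketch (not Lucene code; the suffix-stripping stemmer and the synonym table are hypothetical toys) illustrating why the order of the two filters matters:

```java
import java.util.*;

// Plain-Java sketch: a toy stemmer and synonym table showing that
// synonyms injected AFTER stemming stay unstemmed, so they may never
// match the stemmed terms the rest of the index contains.
public class FilterOrderSketch {

    // hypothetical toy stemmer: crude suffix stripping, illustration only
    static String stem(String w) {
        if (w.endsWith("ing")) return w.substring(0, w.length() - 3);
        if (w.endsWith("s")) return w.substring(0, w.length() - 1);
        return w;
    }

    // hypothetical synonym table standing in for WordNet
    static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("support", Arrays.asList("supporting", "backing"));
    }

    // stem first, then expand: synonyms are emitted raw, unstemmed
    static List<String> stemThenExpand(String token) {
        List<String> out = new ArrayList<>();
        String stemmed = stem(token);
        out.add(stemmed);
        out.addAll(SYNONYMS.getOrDefault(stemmed, Collections.<String>emptyList()));
        return out;
    }

    // expand first, then stem: synonyms pass through the stemmer too
    static List<String> expandThenStem(String token) {
        List<String> out = new ArrayList<>();
        out.add(stem(token));
        for (String syn : SYNONYMS.getOrDefault(token, Collections.<String>emptyList()))
            out.add(stem(syn));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(stemThenExpand("supports")); // "supporting" stays unstemmed
        System.out.println(expandThenStem("support"));
    }
}
```

Against the real chain, the equivalent check is to print the analyzer's output for the same text with the two filters in each order.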

regards

jiang xing


locked files after updating lucene to 1.4.3

2006-01-16 Thread Jens Ansorg

hi,

I ran into an issue after updating the Lucene libs from 1.3-final to 1.4.3.

We have a batch job on our web server that recreates the Lucene search
index every night. This job deletes the whole index and creates a new one.


This search index is used by the Lucene-powered search feature of the
web site (IIS + Resin-2.1.11).


The search itself still works, but once I do a search on the web site,
some files in the index become locked, and then the index updater fails
because it tries to delete those locked files ... The error is
something like:


[ERROR][2006-01-15 08:15:01 - main - de.bcg.web.search.BcgSiteSearch] 
Error while building index.

java.io.IOException: couldn't delete _e3.tis
at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
at org.apache.lucene.store.FSDirectory.<init>(FSDirectory.java:151)
at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:132)
at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:151)
at de.bcg.web.search.BcgSiteSearch.buildIndex(BcgSiteSearch.java:99)
at de.bcg.web.search.BcgSiteSearch.main(BcgSiteSearch.java:71)


The developer of the search code is no longer here and I have to
maintain it. Now, why does this locking happen? It never happened
with 1.3. So I probably need to update something in the code.


any hints about what causes the lock and how to fix this are very welcome :)
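[Editorial note: a common cause of this symptom is an IndexSearcher (and its underlying IndexReader) still holding the old segment files open while the nightly job deletes them. A minimal sketch of a pattern that avoids it, assuming the search feature and the rebuild job run in the same JVM; the class name and index path are made up for illustration:]

```java
import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;

// Hypothetical holder for the single searcher the web search uses.
public class SearcherHolder {
    private static IndexSearcher searcher;

    public static synchronized IndexSearcher get() throws IOException {
        if (searcher == null) {
            searcher = new IndexSearcher("/path/to/index"); // assumed path
        }
        return searcher;
    }

    // Call this before the nightly job deletes and rebuilds the index,
    // so no open file handles keep the old segment files locked.
    public static synchronized void close() throws IOException {
        if (searcher != null) {
            searcher.close();
            searcher = null;
        }
    }
}
```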

thanks
Jens Ansorg

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: One problem of using the lucene

2006-01-16 Thread Erik Hatcher
Could you share the details of your SynonymFilter?  Is it adding
tokens into the same position as the original tokens (position
increment of 0)?  Are you using QueryParser for searching?  If so,
try TermQuery to eliminate the parser's analysis from the picture
for the time being while troubleshooting.


If you are using QueryParser, are you using the same analyzer?  If
this is the case, what is the .toString() of the generated Query?
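[Editorial note: the TermQuery suggestion above might look roughly like this, as a sketch against the Lucene 1.x API. The field name and index path are assumptions; also note that with a stemming analyzer the indexed term is the stemmed form, so the probe term may need to be the stemmer's output rather than the surface word.]

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// Search for a literal indexed term, bypassing query-time analysis.
// If this finds all 17 documents while the parsed query finds only 10,
// the problem is in query-time analysis rather than in the index.
public class TermQueryProbe {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index"); // assumed
        Hits hits = searcher.search(new TermQuery(new Term("contents", "support")));
        System.out.println("matching documents: " + hits.length());
        searcher.close();
    }
}
```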


Erik


On Jan 16, 2006, at 3:54 AM, jason wrote:


Hi,

I got a problem of using the lucene. [...]






AW: Part-Of Match

2006-01-16 Thread sven
Hi Hoss,

Thanks for the answer, and yes, you have described the problem perfectly.
I think you are right: Lucene is in fact not the best way of solving it.
I decided to simply build a letter trie consisting of all concepts and
then run the query document against the trie.
On the one hand this yields exact matches only (and that's exactly what
I need), and furthermore it yields matches even for concepts that are in
plural form in the query document.
So "von Willebrands" will yield "von Willebrand".

Thanks for your efforts,
Sven


 --- Original message ---
Date: 15.01.2006 22:14
From: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: AW: Part-Of Match

>
> : >>von Willebrand<< is not the query but a document in the index.
> : The task is to detect exact matches of phrases inside a query (large
> : document) with these phrases stored in the index.
>
> Lemme see if I can restate your problem...
>
> You want to build a data repository into which you insert a large
> magnitude of "concepts", where a concept is a short phrase consisting
> of a few words (possibly just one word).  The words in any given concept
> phrase may overlap (or be a superset of) the words in other concepts.
>
> Once this concept repository is built, you want to build a black box
> around it, such that people can hand your black box a "document"
> (i.e. a research paper, a newspaper article, a short story, ...
> some text consisting of many, many sentences) and you want your black
> box to then return the list of concepts that match the input document,
> such that the concepts with the highest score are concepts whose phrase
> appears exactly in the input document.  Concepts whose phrase doesn't
> appear exactly in the document should still be returned, but with a
> lower score based on how many words in the concept's phrase are found
> in the input document.
>
>   (Have I adequately described your problem?)
>
> It's an interesting idea.  Can it be done with Lucene? ... I can think
> of one kludgy mechanism for doing it, but I'd be very surprised if there
> isn't a better way (or if there is some other software library out there
> that would be more suited).
>
> Build a permanent index in which each concept is a Lucene Document.
> These documents really only need one stored/tokenized/indexed field
> containing the phrase (if you want other payload fields, that's up to
> you).
>
> Each time you are asked to analyze a text sample and return matching
> phrases, run the text through your analyzer to get back a token stream,
> and for each of those tokens use a TermDocs iterator to find out if any
> phrase in your concept index contains that term, and if so which ones.
> (You could also do this by building a boolean OR query out of all the
> words in your input document -- but that may run into performance
> limitations if your input docs are too big, and it will try to score
> each concept, which isn't necessary, so even for short input text it's
> less efficient.)
>
> Now you have an (unordered) list of concepts that have something to do
> with your input text.
>
> Next build a RAMDirectory-based index consisting of exactly one document
> which you build from the input text.  Loop over that list of concepts
> you got, and build a boolean query out of each one along the lines that
> Daniel described: a phrase query on the whole concept phrase along with
> term queries for each individual word -- all optional.  Run each of
> these boolean queries against your one-document RAMDirectory.  The
> higher the score, the better that concept applies to your input text.
>
>
> -Hoss
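[Editorial note: the two-phase recipe quoted above can be sketched roughly like this against the Lucene 1.x API of the time. The paths, field names, and analyzer are assumptions; treat it as a sketch, not a tested implementation.]

```java
import java.io.StringReader;
import java.util.*;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;

public class ConceptMatcher {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        String inputText = "...";                        // the document to analyze

        // Phase 1: collect candidate concepts sharing at least one term
        IndexReader concepts = IndexReader.open("/path/to/concepts"); // assumed
        Set candidates = new HashSet();
        TokenStream ts = analyzer.tokenStream("phrase", new StringReader(inputText));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            TermDocs td = concepts.termDocs(new Term("phrase", t.termText()));
            while (td.next()) candidates.add(new Integer(td.doc()));
        }

        // Phase 2: one-document RAMDirectory built from the input text
        RAMDirectory ram = new RAMDirectory();
        IndexWriter w = new IndexWriter(ram, analyzer, true);
        Document d = new Document();
        d.add(Field.Text("body", inputText));
        w.addDocument(d);
        w.close();

        // Score each candidate: phrase query plus per-word term queries,
        // all optional, run against the single-document index
        IndexSearcher one = new IndexSearcher(ram);
        for (Iterator it = candidates.iterator(); it.hasNext();) {
            int id = ((Integer) it.next()).intValue();
            String phrase = concepts.document(id).get("phrase");
            BooleanQuery q = new BooleanQuery();
            PhraseQuery pq = new PhraseQuery();
            TokenStream pts = analyzer.tokenStream("body", new StringReader(phrase));
            for (Token t = pts.next(); t != null; t = pts.next()) {
                pq.add(new Term("body", t.termText()));
                q.add(new TermQuery(new Term("body", t.termText())), false, false);
            }
            q.add(pq, false, false);
            Hits hits = one.search(q);
            float score = hits.length() > 0 ? hits.score(0) : 0.0f;
            System.out.println(phrase + " -> " + score);
        }
    }
}
```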





Re: Finding similar documents

2006-01-16 Thread Stefan Gusenbauer

Grant Ingersoll wrote:

I believe there is a MoreLikeThis class floating around somewhere (I
think it is in the contrib/similarity package).  The Lucene book also
has a good example, and I have some examples at
http://www.cnlp.org/apachecon2005 that demonstrate using term vectors
to do this.


Klaus wrote:


Hi,

Is there a built-in method for finding documents similar to a given
document?

Thx,

Klaus





I've implemented a simple relevance-feedback algorithm which extracts
terms from all interesting documents and builds up a new query with these
terms. It is pretty simple, but it works in most cases.






Memory

2006-01-16 Thread Aigner, Thomas

Hi all,
Is anyone experiencing possible memory problems on Linux with
Lucene search?  Here is our scenario: we have a service that lives on
Linux, takes all incoming requests through a port, and does the
search.  Only one IndexSearcher is instantiated to do this in our
service.  When I run ps and grep for java, it shows only one java
process running.  However, when 4 users log into our program and start
to search at the same time, 4 java processes show up in top (and I can't
see their parent PID from the top command), but still only one java in ps.
My company fears that each process is being allocated 128M of
memory and is running the box out of memory (when the service is started
we allocate 10 - 128M via the java call).  I am still in the process
of testing with our system guys and having the data analyzed by a third
party, but I was curious about your findings.

Thanks ahead of time,
Tom  




Re: Memory

2006-01-16 Thread Paul Smith
If you look at the man page for 'ps' you'll see a switch that shows
all the threads too (it's different on different Unix flavours, so it's
best to look in the man page).

Once you've shown the threads in 'ps', you'll see the processes that
are appearing in top, and I'll bet their parent is your original java
process.

I wouldn't panic: each thread is almost certainly sharing the same
memory pool, so while top reports that the thread has X MB of memory,
it's really the same physical block as all the others.

You see this all the time on a Tomcat app server box, where each HTTP
connector is a thread and appears as its own process.


cheers,

Paul Smith

On 17/01/2006, at 7:11 AM, Aigner, Thomas wrote:



Hi all,
Is anyone experiencing possible memory problems on LINUX with
Lucene search? [...]









RE: Memory

2006-01-16 Thread Aigner, Thomas
Thanks Paul,
I did a man on top and sure enough there was a PPID command on
Linux (f then B) for parent process.  And yes, they always have the same
parent command.  Thanks for your help as I'm obviously still a noob on
Unix.

Tom

-Original Message-
From: Paul Smith [mailto:[EMAIL PROTECTED] 
Sent: Monday, January 16, 2006 3:18 PM
To: java-user@lucene.apache.org
Subject: Re: Memory

If you look at the man page for 'ps' you'll see a switch that shows
all the threads too. [...]

cheers,

Paul Smith

On 17/01/2006, at 7:11 AM, Aigner, Thomas wrote:

> Hi all,
> Is anyone experiencing possible memory problems on LINUX with
> Lucene search? [...]





Re: One problem of using the lucene

2006-01-16 Thread jason
Hi,

The following is the SynonymFilter I wrote.


import org.apache.lucene.analysis.*;

import java.io.*;
import java.util.*;

/**
 * @author JIANG XING
 *
 * Jan 15, 2006
 */
public class SynonymFilter extends TokenFilter {

    public static final String TOKEN_TYPE_SYNONYM = "SYNONYM";

    private Stack synonymStack;
    private WordNetSynonymEngine engine;

    public SynonymFilter(TokenStream in, WordNetSynonymEngine engine) {
        super(in);
        synonymStack = new Stack();
        this.engine = engine;
    }

    public Token next() throws IOException {
        // emit any pending synonyms before pulling the next input token
        if (synonymStack.size() > 0) {
            return (Token) synonymStack.pop();
        }

        Token token = input.next();
        if (token == null) {
            return null;
        }

        addAliasesToStack(token);
        return token;
    }

    private void addAliasesToStack(Token token) throws IOException {
        String[] synonyms = engine.getSynonyms(token.termText());
        if (synonyms == null) return;

        for (int i = 0; i < synonyms.length; i++) {
            Token synToken = new Token(synonyms[i], token.startOffset(),
                    token.endOffset(), TOKEN_TYPE_SYNONYM);
            // synonym occupies the same position as the original token
            synToken.setPositionIncrement(0);
            synonymStack.push(synToken);
        }
    }
}
It adds tokens at the same position as the original token. I then used
QueryParser for searching, with the Snowball analyzer for parsing the
query.

The following is the SynonymAnalyzer I wrote.

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.analysis.snowball.*;

import java.io.*;
import java.util.*;

/**
 * @author JIANG XING
 *
 * Jan 15, 2006
 */
public class SynonymAnalyzer extends Analyzer {
    private WordNetSynonymEngine engine;
    private Set stopword;

    public SynonymAnalyzer(String[] word) {
        try {
            engine = new WordNetSynonymEngine(new
                    File("C:\\PDF2Text\\SearchEngine\\WordNetIndex"));
            stopword = StopFilter.makeStopSet(word);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        if (stopword != null) {
            result = new StopFilter(result, stopword);
        }
        result = new SnowballFilter(result, "Lovins");
        result = new SynonymFilter(result, engine);
        return result;
    }
}
I added some debugging code to the SnowballFilter (the check in next()
below that prints when the stemmed term equals "support"). If I use only
the SnowballFilter, the term "support" can be found in all 17 documents.
However, once "result = new SynonymFilter(result, engine);" is added,
the term "support" cannot be found in some of them.


public class SnowballFilter extends TokenFilter {
    private static final Object[] EMPTY_ARGS = new Object[0];

    private SnowballProgram stemmer;
    private Method stemMethod;

    /** Construct the named stemming filter.
     *
     * @param in the input tokens to stem
     * @param name the name of a stemmer
     */
    public SnowballFilter(TokenStream in, String name) {
        super(in);
        try {
            Class stemClass =
                    Class.forName("net.sf.snowball.ext." + name + "Stemmer");
            stemmer = (SnowballProgram) stemClass.newInstance();
            // why doesn't the SnowballProgram class have an (abstract?) stem method?
            stemMethod = stemClass.getMethod("stem", new Class[0]);
        } catch (Exception e) {
            throw new RuntimeException(e.toString());
        }
    }

    /** Returns the next input Token, after being stemmed */
    public final Token next() throws IOException {
        Token token = input.next();
        if (token == null)
            return null;
        stemmer.setCurrent(token.termText());
        try {
            stemMethod.invoke(stemmer, EMPTY_ARGS);
        } catch (Exception e) {
            throw new RuntimeException(e.toString());
        }

        Token newToken = new Token(stemmer.getCurrent(),
                token.startOffset(), token.endOffset(), token.type());
        // check the tokens
        if (newToken.termText().equals("support")) {
            System.out.println("the term support is found");
        }

        newToken.setPositionIncrement(token.getPositionIncrement());
        return newToken;
    }
}
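[Editorial note: a quick way to compare what the two analyzer configurations actually emit is to print every token with its type and position increment, alongside the .toString() of the query QueryParser builds, as Erik suggests. A sketch against the Lucene 1.x token API; the field name is arbitrary:]

```java
import java.io.StringReader;

import org.apache.lucene.analysis.*;

// Print each token an analyzer produces, with its type and position
// increment. Comparing this output with and without the SynonymFilter
// usually shows where index terms and query terms stop lining up.
public class AnalyzerDebugger {
    public static void debug(Analyzer analyzer, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream("contents", new StringReader(text));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t.termText()
                    + " [type=" + t.type()
                    + ", posIncr=" + t.getPositionIncrement() + "]");
        }
    }
}
```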



On 1/16/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>
> Could you share the details of your SynonymFilter?  Is it adding
> tokens into the same position as the original tokens (position
> increment of 0)? [...]

How do I get a count of all search results inside of my content?

2006-01-16 Thread Gary Mangum
I am trying to find out a quick way to get a complete count of all search
results found in all of my Documents.

Let me back up...

I have split the content that I am searching into many Documents and then
indexed this content.  Each Document represents about one "paragraph" of
data.

Now I search all of my Documents for a word or phrase.

If I understand correctly, the Hits that are returned tell me which
Documents contain the information that I am searching for, and
Hits.length() tells me how many documents contain it.

I would like to know how many total results were found for my search.  In
other words, if a Document contains the word or phrase more than once, I
would like to know this so that I can return a "true" count of search
results found across all of my Documents.  It seems that Lucene must
already know this information, since it searched the Document when it
scored it and added it to my Hits.

What is the best way to get this information quickly?

Thanks,


Gary


Re: How do I get a count of all search results inside of my content?

2006-01-16 Thread Chris Hostetter

1) There's no need to send the same message twice just because you didn't
get a rapid response to the first one ... In most parts of the US this has
been a three-day weekend, so it's not that surprising that no one has
replied yet since you first asked this question Friday night.

2) You need to be careful about your terminology...

: I would like to know how many total results were found for my search.  In
: other words, if a Document contains the word or phrase more than once, I
: would like to know this information so that I can return a "true" count of
: search results that were found across all of my Documents.  It seems that

The total number of results of your search is Hits.length():
1 result is 1 matching document.  What you are asking for is information
about the frequency of a word or phrase.

The TermEnum class makes it easy to find out the frequency of a term in
your entire index.

The frequency of a phrase is more complicated.  I would suggest you start
by looking at the documentation on Similarity and the way scores are
calculated.  I believe it is possible to write an implementation of
Similarity such that the raw score of a PhraseQuery on any document is
the number of times that phrase appears in the document.  You will then
need to use a HitCollector to sum the raw scores so they don't get
normalized for you.
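[Editorial note: for the single-term case, the total number of occurrences (as opposed to Hits.length(), which counts matching documents) can be summed from a TermDocs iterator. A sketch against the Lucene 1.x API, with the field name and index path assumed:]

```java
import org.apache.lucene.index.*;

// Sum the within-document frequencies of one term across the whole index.
// Hits.length() counts matching documents; this counts occurrences.
public class TermOccurrences {
    public static int count(String indexPath, String field, String word)
            throws java.io.IOException {
        IndexReader reader = IndexReader.open(indexPath);
        TermDocs td = reader.termDocs(new Term(field, word));
        int total = 0;
        while (td.next()) {
            total += td.freq();   // occurrences within this document
        }
        reader.close();
        return total;
    }
}
```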




-Hoss

