Re: full text as input ?

2005-01-13 Thread David Spencer
Hunter Peress wrote:
> Is it efficient and feasible to use Lucene to do full-text
> comparisons? E.g., take an entire text that's reasonably large (e.g.
> more than 10 words) and find the result set within the Lucene search
> index that is statistically similar to all of the text.

I do this kind of stuff all the time, no problem.
I think this came up a month ago - it probably comes up monthly.
For another variation, search for "MoreLikeThis" in the list - it's code
I mailed in that I haven't yet checked in.

Anyway, if you want to search for docs that are similar to a source
document, you can call this method to generate a similarity query.

'srch' is the source doc
'a' is your analyzer
'field' is the field that stores the body e.g. "contents"
'stop' is an optional Set of stop words to ignore as an optimization - it's
not needed if the Analyzer already removes stop words, but if you keep stop
words you might still want to ignore them in this kind of query, as they
probably won't help

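import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

	public static Query formSimilarQuery( String srch, Analyzer a, String field, Set stop)
		throws org.apache.lucene.queryParser.ParseException, IOException
	{
		TokenStream ts = a.tokenStream( field, new StringReader( srch));
		org.apache.lucene.analysis.Token t;
		BooleanQuery tmp = new BooleanQuery();
		Set already = new HashSet(); // terms already added to the query
		while ( (t = ts.next()) != null)
		{
			String word = t.termText();
			// skip stop words
			if ( stop != null && stop.contains( word)) continue;
			// add each distinct term only once
			if ( ! already.add( word)) continue;
			TermQuery tq = new TermQuery( new Term( field, word));
			// optional clause: neither required nor prohibited
			tmp.add( tq, false, false);
		}

		// tbd, from lucene in action book
		// https://secure.manning.com/catalog/view.php?book=hatcher2&item=source
		// exclude myself
		//likeThisQuery.add(new TermQuery(
		//new Term("isbn", doc.get("isbn"))), false, true);
		return tmp;
	}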
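
For trying the term-selection step without a Lucene index at hand, here is a minimal sketch in plain Java. It assumes lower-casing and whitespace splitting as a stand-in for the Analyzer, and the names SimilarTerms and distinctTerms are made up for illustration. Each returned term corresponds to one optional TermQuery clause in the method above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: mirrors the loop in formSimilarQuery
// but uses a trivial tokenizer instead of an Analyzer.
class SimilarTerms {

    // Returns the distinct, non-stop-word terms of 'text' in first-seen order.
    static List<String> distinctTerms(String text, Set<String> stop) {
        List<String> terms = new ArrayList<String>();
        Set<String> already = new HashSet<String>();
        // Assumption: lower-casing and splitting on whitespace stands in for the Analyzer.
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (word.isEmpty()) continue;                       // blank input
            if (stop != null && stop.contains(word)) continue;  // ignore stop words
            if (!already.add(word)) continue;                   // keep each term once
            terms.add(word);
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<String>(Arrays.asList("the", "a", "of"));
        System.out.println(distinctTerms("The quick fox and the lazy dog of the fox", stop));
    }
}
```

With that stop set, the main method prints [quick, fox, and, lazy, dog].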
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: stop words and index size

2005-01-13 Thread Chris Hostetter


: The corpus is the English Wikipedia, and I indexed the title and body of
: the articles. I used a list of 525 stop words.
:
: With stopwords removed the index is 227MB.
: With stopwords kept the index is 331MB.

That doesn't seem horribly surprising.

Consider that for every Term in the index, Lucene is keeping track of the
list of (document, frequency) pairs for every document that contains that term.

Assume that something has to be in at least 25% of the docs before you
decide it's worth making it a stop word.  Your URL indicates you are
dealing with 400k docs, which means that for each stop word, the space
needed to store the int pairs for its documents is...

(4B + 4B) * 100,000 =~ 780KB  (per stop word Term, minimum)

...not counting any indexing structures that may be used internally to
improve the lookup of a Term.  Assuming some of those words are in more or
less than 25% of your documents, that could easily account for a
difference of 100MB.
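
As a rough check of the arithmetic above, the following sketch redoes the estimate in Java. The class and method names are hypothetical, and the model (two uncompressed 4-byte ints per matching document, no position data, no compression) is the simplification used in this message, not a description of Lucene's actual file format.

```java
// Back-of-the-envelope estimate only; it ignores compression and position data.
class StopWordCost {

    // Bytes for one term's postings, assuming two 4-byte ints per matching document.
    static long bytesPerTerm(long numDocs, double fractionOfDocs) {
        long matchingDocs = Math.round(numDocs * fractionOfDocs);
        return (4L + 4L) * matchingDocs;
    }

    public static void main(String[] args) {
        long perTerm = bytesPerTerm(400000L, 0.25);
        System.out.println(perTerm + " bytes per term");        // 800000 bytes per term
        System.out.println(perTerm / 1024.0 + " KB per term");  // 781.25 KB per term
        // How many such terms would it take to explain the observed 331MB - 227MB gap?
        long gapBytes = (331L - 227L) * 1024L * 1024L;
        System.out.println(gapBytes / perTerm + " terms");       // 136 terms
    }
}
```

Under these assumptions, roughly 136 terms at the 25% level would cover the 104MB gap, so a list of 525 stop words accounting for it is plausible.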

I suspect that an interesting exercise would be to use some of the code
I've seen tossed around on this list that lets you iterate over all Terms
and find the most common ones to help you determine your stopword list
programmatically.  Then remove/reindex any documents that have each word as
you add it to your stoplist (one word at a time) and watch your index
shrink.
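
One way to pick candidates, sketched in plain Java under stated assumptions: the map of term to document frequency is a stand-in for numbers collected from the index (in Lucene of this era that could be done by walking IndexReader.terms() and reading TermEnum.docFreq(), which is not shown here), and StopWordCandidates.pick is a hypothetical helper name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: the Map stands in for term -> document frequency
// numbers gathered by walking the index's terms.
class StopWordCandidates {

    // Terms whose document frequency is at least minFraction of numDocs, most common first.
    static List<String> pick(final Map<String, Integer> docFreq, int numDocs, double minFraction) {
        List<String> out = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() >= minFraction * numDocs) out.add(e.getKey());
        }
        Collections.sort(out, new Comparator<String>() {
            public int compare(String x, String y) {
                int byFreq = docFreq.get(y).compareTo(docFreq.get(x)); // descending frequency
                return byFreq != 0 ? byFreq : x.compareTo(y);           // ties: alphabetical
            }
        });
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new TreeMap<String, Integer>();
        df.put("the", 390000);
        df.put("of", 350000);
        df.put("lucene", 1200);
        df.put("which", 120000);
        System.out.println(pick(df, 400000, 0.25)); // [the, of, which]
    }
}
```

The 25% threshold is the one assumed in this message; it is a tuning knob, not a recommendation.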




-Hoss





full text as input ?

2005-01-13 Thread Hunter Peress
Is it efficient and feasible to use Lucene to do full-text
comparisons? E.g., take an entire text that's reasonably large (e.g.
more than 10 words) and find the result set within the Lucene search
index that is statistically similar to all of the text.




stop words and index size

2005-01-13 Thread David Spencer
Does anyone know how much stop words are supposed to affect the index size?
I did an experiment of building an index once with, and once without, 
stop words.

The corpus is the English Wikipedia, and I indexed the title and body of 
the articles. I used a list of 525 stop words.

With stopwords removed the index is 227MB.
With stopwords kept the index is 331MB.
Thus, the index grows by 45% in this case, which I found surprising, as I
expected it to not grow as much. I haven't dug into the details of the
Lucene file formats but thought compression (field/term vector/sparse
lists/vints) would negate the effect of stopwords to a large extent.

Some more details + a link to my stopword list are here:
http://www.searchmorph.com/weblog/index.php?id=36
-- Dave


RE: Multi-threading problem: couldn't delete segments

2005-01-13 Thread Luke Francl
On Thu, 2005-01-13 at 12:33, David Townsend wrote:
> Just read your old post. I'm not quite sure whether I've read this correctly. 
>  Is the search worker thread also doing deletes from the index 
> 
> "a test script is going that is hitting the search
> part of our application (I think the script also updates and deletes
> Documents, but I am not sure."
> 
> Deleting also locks the index, so maybe the indexwriter is waiting for the 
> search thread to release the lock.

I checked with my co-worker, and his script is doing a search, modifying
assets (which deletes and re-inserts) and then deleting them. This is
going on while new Documents are being added to the index from another
thread. (Due to some weirdness in our application, it is also trying to
delete Documents that don't exist before inserting them -- should be
harmless, though.)

I control access to the index with a lock object during all write
accesses to the index, including deletes.
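
The pattern described above, one lock object guarding every write, looks roughly like this. It is only an illustration, not the code from the link below: GuardedIndex is a made-up class, and a plain list stands in for the index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an index: every mutation goes through one shared lock object.
class GuardedIndex {
    private final Object writeLock = new Object();
    private final List<String> docs = new ArrayList<String>();

    void add(String doc) {
        synchronized (writeLock) { docs.add(doc); }
    }

    // Delete-then-insert, held under the lock so no other writer can interleave.
    void update(String oldDoc, String newDoc) {
        synchronized (writeLock) {
            docs.remove(oldDoc);   // harmless if oldDoc is absent
            docs.add(newDoc);
        }
    }

    int size() {
        synchronized (writeLock) { return docs.size(); }
    }

    public static void main(String[] args) throws InterruptedException {
        final GuardedIndex index = new GuardedIndex();
        Thread[] writers = new Thread[4];
        for (int t = 0; t < writers.length; t++) {
            final int id = t;
            writers[t] = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < 1000; i++) {
                        index.add("doc-" + id + "-" + i);
                        index.update("doc-" + id + "-" + i, "doc-" + id + "-" + i + "-v2");
                    }
                }
            });
            writers[t].start();
        }
        for (Thread w : writers) w.join();
        System.out.println(index.size()); // 4000
    }
}
```

Note that this only serializes writers against each other; it says nothing about files a concurrent reader may still have open.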

You can see the code here:
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=2068605&attachId=1

Luke





RE: Multi-threading problem: couldn't delete segments

2005-01-13 Thread Luke Francl
On Thu, 2005-01-13 at 12:25, David Townsend wrote:
> The problem could be you're writing to an index with multiple processes. This 
> can happen if you're using a shared file system (NFS?).  We saw this problem 
> when we had two IndexWriters getting access to a single index at the same 
> time.  Usually if you're working on a single machine the file locks prevent 
> this from happening.

No, there is a single process with multiple threads (synchronized). The
filesystem is NTFS.

Luke





RE: Multi-threading problem: couldn't delete segments

2005-01-13 Thread David Townsend
Just read your old post. I'm not quite sure whether I've read this correctly.
Is the search worker thread also doing deletes from the index?

"a test script is going that is hitting the search
part of our application (I think the script also updates and deletes
Documents, but I am not sure."

Deleting also locks the index, so maybe the indexwriter is waiting for the 
search thread to release the lock.







RE: Multi-threading problem: couldn't delete segments

2005-01-13 Thread David Townsend
The problem could be you're writing to an index with multiple processes. This 
can happen if you're using a shared file system (NFS?).  We saw this problem 
when we had two IndexWriters getting access to a single index at the same time. 
 Usually if you're working on a single machine the file locks prevent this from 
happening.









Re: Multi-threading problem: couldn't delete segments

2005-01-13 Thread Luke Francl
I didn't get any response to this post so I wanted to follow up (you can
read the full description of my problem in the archives:
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=11986).

Here's an additional piece of information: 

I wrote a small program to confirm that on Windows, you can't rename a
file while another thread has it open.

If I am performing a search, is it possible that the IndexReader is
holding open the "segments" file when there is an attempt by my indexing
code to overwrite it with File.renameTo()?
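
A check along these lines can be written in a few lines of Java. This is a sketch, not the original test program: it holds the file open in the same thread rather than a second one, uses a temporary file instead of a real "segments" file, and the class name is invented. What renameTo returns depends on the platform and file system, so no particular output is claimed.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Sketch of a rename-while-open check; not the original test program.
class RenameWhileOpen {

    // Tries to rename 'from' to 'to' while an input stream on 'from' is still open.
    static boolean renameWhileOpen(File from, File to) throws IOException {
        FileInputStream in = new FileInputStream(from);
        try {
            return from.renameTo(to);   // result is platform dependent
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        File from = File.createTempFile("segments", ".tmp");
        File to = new File(from.getParentFile(), from.getName() + ".renamed");
        boolean renamed = renameWhileOpen(from, to);
        System.out.println("renamed while open: " + renamed);
        // Clean up whichever file exists now.
        from.delete();
        to.delete();
    }
}
```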

Thanks,
Luke Francl

On Thu, 2005-01-06 at 17:43, Luke Francl wrote:
> We are having a problem with Lucene in a high concurrency
> create/delete/search situation. I thought I fixed all these problems,
> but I guess not.






Search failed with a "File not found" error

2005-01-13 Thread Jim Lynch
I was indexing at the time and I was under the impression that was safe, 
but it looks like the indexer may have removed a file that the search 
was trying to access.  Is there something I should be doing to lock the 
index?

Thanks,
Jim.

java.io.FileNotFoundException: /db/lucene/oasis/Clarify_Closed/_2meu.fnm (No such file or directory)
   at java.io.RandomAccessFile.open(Native Method)
   at java.io.RandomAccessFile.<init>(RandomAccessFile.java:200)
   at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:376)
   at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:405)
   at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
   at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:53)
   at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:109)
   at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:94)
   at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:122)
   at org.apache.lucene.store.Lock$With.run(Lock.java:109)
   at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
   at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)



Réf. : Re: Réf. : Re: IndexSearcher and number of occurence

2005-01-13 Thread Bertrand VENZAL



Great, thanks for your help, I understand things quickly but I need lots of
explanation .. ;-)

For anyone who is interested: I was using

int id = hits.doc(i);
instead of:
int id = hits.id(i);

Tchõ
Bertrand










calculate score

2005-01-13 Thread Michael Scholz
Hello,

How does Lucene calculate the score of a given document? The class
DefaultSimilarity contains some parts of the formula (e.g. tf, idf), but how do
these parts work together?

Thanks,
Michael Scholz

Re: Réf. : Re: IndexSearcher and number of occurence

2005-01-13 Thread Erik Hatcher
On Jan 13, 2005, at 10:17 AM, Bertrand VENZAL wrote:

> Hi,
> Thanks for your quick answer. I understood what you meant by using the
> indexSearcher to get the termFreqVector. But you use an int as an id to
> find the termFrequency, so I suppose that it is the position number in
> the IndexReader vector.
> My problem is: during the indexing phase, I can store the id, but if a
> document is deleted and recreated later on (like in an update), this
> will change my vector and all the ids previously set will no longer be
> correct.
> Am I right on this point, or am I missing something?

Yes, the Document id (the one Lucene uses) is not to be relied on
long-term.  But in the example you'd get it from Hits immediately
after a search, and thus it would be accurate and usable.  You do not
need to store the id during indexing - Lucene maintains it and
gives it to you from Hits.

Erik


Réf. : Re: IndexSearcher and number of occurence

2005-01-13 Thread Bertrand VENZAL



Hi,

Thanks for your quick answer. I understood what you meant by using the
indexSearcher to get the termFreqVector. But you use an int as an id to
find the termFrequency, so I suppose that it is the position number in the
IndexReader vector.
My problem is: during the indexing phase, I can store the id, but if a
document is deleted and recreated later on (like in an update), this will
change my vector and all the ids previously set will no longer be correct.
Am I right on this point, or am I missing something?

thanks ...





Re: MySql and Lucene

2005-01-13 Thread Miles Barr
On Thu, 2005-01-13 at 12:36 +0100, Daniel Cortes wrote:
> I what to know your opinion about this:
> 
> I've a new portal, and Lucene is the serach engine. This portal is an 
> integration of a lot of opensource software.
> phpBB(MySql) is our election for the forum, and I have to do that 
> searches with the search engine include search in the forum.
> I think that I have 2 options:
> -Every new post in the forum, it was been  indexed in the Mysql and 
> Lucene Index ( storing fields that I want to show in the results for 
> exemaple author, title date,...)
> It means that I've almost a total copy of the MySQL in my Lucene Index.
> - Or  Do the search with lucene and after do a SQL query in the 
> servlett, but how I show the results.I can't show first the Lucene's 
> results and after the phorum's results.
> Any Idea?
> thks

If space wasn't an issue I would just duplicate the data in Lucene
because that makes things easiest. 

If space is a concern you could store the post's primary key in Lucene
as the only stored field. Then do a search on Lucene, get the list of
matching posts and pull out the rest of the information from MySQL.
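
A minimal sketch of that second approach, assuming the Lucene search has already produced the matching primary keys in rank order and the rows have been fetched from MySQL into a map (for instance with a single WHERE id IN (...) query). PostLookup and its method are hypothetical names; the point is only that the Lucene order, not the database order, drives the display.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the search result is a ranked list of primary keys,
// and the Map stands in for the rows returned by the database.
class PostLookup {

    // Returns post titles in search-rank order, skipping keys with no database row.
    static List<String> titlesInRankOrder(List<Integer> rankedIds, Map<Integer, String> rowsById) {
        List<String> titles = new ArrayList<String>();
        for (Integer id : rankedIds) {
            String title = rowsById.get(id);
            if (title != null) titles.add(title);   // row may have been deleted since indexing
        }
        return titles;
    }

    public static void main(String[] args) {
        List<Integer> rankedIds = Arrays.asList(42, 7, 19);   // best match first
        Map<Integer, String> rows = new HashMap<Integer, String>();
        rows.put(7, "Install help");
        rows.put(19, "Upgrade notes");
        rows.put(42, "Search is slow");
        System.out.println(titlesInRankOrder(rankedIds, rows));
        // [Search is slow, Install help, Upgrade notes]
    }
}
```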

-- 
Miles Barr <[EMAIL PROTECTED]>
Runtime Collective Ltd.




MySql and Lucene

2005-01-13 Thread Daniel Cortes
I want to know your opinion about this:
I have a new portal, and Lucene is the search engine. This portal is an
integration of a lot of open-source software.
phpBB (MySQL) is our choice for the forum, and searches made with the
search engine have to include the forum.
I think that I have 2 options:
- Every new post in the forum is indexed in both MySQL and the
Lucene index (storing the fields that I want to show in the results, for
example author, title, date, ...).
This means that I have almost a total copy of the MySQL data in my Lucene index.
- Or do the search with Lucene and afterwards do a SQL query in the
servlet, but then how do I show the results? I can't show Lucene's
results first and the forum's results after them.
Any ideas?
Thanks




Re: IndexSearcher and number of occurence

2005-01-13 Thread Morus Walter
Bertrand VENZAL writes:
> 
> Im quite new in this mailing list. I ve many difficulties to find the
> number of a word (occurence) in a document, I need to use indexSearcher
> because of the query but the score returning is not wot i m looking for.
> I found in the mailing List the class TermDoc but it seems to work only
> with indexReader.
> 
The use of a searcher does not prevent the use of a reader (in fact
the searcher relies on a reader).
So I'd use the searcher to find the document and a reader to get the
frequency using IndexReader.termDocs.
Depending on how many frequencies you're interested in, the term vector
support might be of interest.

HTH
Morus




Re: IndexSearcher and number of occurence

2005-01-13 Thread Erik Hatcher
On Jan 13, 2005, at 5:03 AM, Bertrand VENZAL wrote:

> Hi all,
> I'm quite new to this mailing list. I am having difficulty finding the
> number of occurrences of a word in a document. I need to use indexSearcher
> because of the query, but the score returned is not what I'm looking
> for.
> I found the class TermDoc in the mailing list, but it seems to work only
> with indexReader.
>
> If anyone can give me a hand with this one, I would appreciate it ...

Perhaps this technique is what you're looking for: set the field(s)
you're interested in capturing frequency on to be vectored.  You'll see
that flag as additional overloaded methods on the Field.  You'll still
need to use an IndexReader, but that is no problem.  Construct an
IndexReader and use it to construct the IndexSearcher that you'll also
use.  Here are some snippets of code:

    // During indexing, "subject" field was added like this:
    doc.add(Field.UnStored("subject", subject, true));

    ... // now during searching...

    IndexReader reader = IndexReader.open(directory);

    ...
    // from your Hits, get the document id
    // (hits.id(i) returns the int id; hits.doc(i) returns the Document itself)
    int id = hits.id(i);

    TermFreqVector vector =
        reader.getTermFreqVector(id, "subject");

Now read up on the TermFreqVector API to get at the frequency of a 
specific term.
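
To give an idea of what that lookup involves, here is a sketch that assumes the vector exposes two parallel arrays, sorted term texts and their counts, in the way TermFreqVector's getTerms() and getTermFrequencies() do. Plain arrays stand in for the real vector, and TermVectorLookup is a made-up helper.

```java
import java.util.Arrays;

// Hypothetical stand-in for a term vector: parallel arrays of sorted terms and their counts.
class TermVectorLookup {

    // Frequency of 'term', or 0 if it does not occur; 'terms' must be sorted ascending.
    static int frequencyOf(String[] terms, int[] freqs, String term) {
        int pos = Arrays.binarySearch(terms, term);
        return pos >= 0 ? freqs[pos] : 0;
    }

    public static void main(String[] args) {
        String[] terms = { "index", "lucene", "search" };
        int[] freqs = { 2, 5, 1 };
        System.out.println(frequencyOf(terms, freqs, "lucene"));  // 5
        System.out.println(frequencyOf(terms, freqs, "query"));   // 0
    }
}
```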

Erik


IndexSearcher and number of occurence

2005-01-13 Thread Bertrand VENZAL


Hi all,

I'm quite new to this mailing list. I am having difficulty finding the
number of occurrences of a word in a document. I need to use indexSearcher
because of the query, but the score returned is not what I'm looking for.
I found the class TermDoc in the mailing list, but it seems to work only
with indexReader.

If anyone can give me a hand with this one, I would appreciate it ...

Tchõ
Bertrand


