Re: WordNet code updated, now with query expansion -- Re: SYNONYM + GOOGLE

2005-01-12 Thread Daniel Naber
On Wednesday 12 January 2005 01:47, David Spencer wrote:

> Amusingly then, documents with the terms liberal wienerwurst match
> big dog! :)

WordNet includes something like frequency information; it could probably
be used to ignore the uncommon meanings.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Searching a document for a keyword

2005-01-12 Thread Swati Singhal
Hi,

I'm new to Lucene and also to this forum.

I have a txt file that contains the paths to jpg
files. These jpg files are organized into folders.
My search is limited to this txt file.

When I search for a folder name, a match is found
in the txt file, but I want the search to return the
entire matching line as the result, not the document
name (which is the txt file).

How can I do that using Lucene?
I have already built the index by giving the txt file
as the input.

If this is not possible, please tell me a way to parse
jpg files to form an index file.

Thanks,
Swati




Re: Searching a document for a keyword

2005-01-12 Thread Erik Hatcher
On Jan 12, 2005, at 4:13 AM, Swati Singhal wrote:
> I have a txt file, which contains the path to jpg
> files. These jpg files are organized into folders.
> My search is limited to searching only this txt file.
> So when i search based on a folder name, a match is
> found in the txt file, but i want it to return me the
> entire line as a search result and not the document
> name. (which is the txt file)
> How can I do that using Lucene?
> I have already built the index by giving the txt file
> as an input to build the index.
> If this is not possible, please tell me a way to parse
> jpg files to form an index file.

First let me re-phrase what I think you want.  You want to be able to 
search on a folder name and retrieve back JPG filenames that are in 
that folder.  Correct?

You're using the text file as simply a way to get text into Lucene?  
Does this text file have any other relevance here?

If you have a folder of JPG images and all you want is for each
search result to be a JPG file name, write a simple file system
crawler that recurses your directory tree and indexes a single
document for each JPG, with a field for the filename.  What type of
field should the filename field be?  That depends on how you want to
search.  You could make it a Field.Keyword(), which would require
exact-match (TermQuery) or PrefixQuery searches to work.

The Indexer example from Lucene in Action makes a great starting place 
for this crawler - you'd have to adapt it to recognize .jpg extensions 
and adjust it to only index the filename, not the contents (though the 
contents may contain text and be worth indexing also).
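
A minimal sketch of such a crawler, using only the JDK. The class and
field names (JpgCrawler, JpgEntry) are made up for illustration, and the
Lucene indexing step is left as a comment because the exact field calls
depend on the Lucene version in use:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class JpgCrawler {
    /** Hypothetical value type: one would-be index document per JPG. */
    static final class JpgEntry {
        final String folder;    // name of the directory holding the image
        final String filename;  // the image's own file name

        JpgEntry(String folder, String filename) {
            this.folder = folder;
            this.filename = filename;
        }
    }

    /** Recurses the tree under root and returns one entry per .jpg file. */
    static List<JpgEntry> crawl(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString()
                              .toLowerCase(Locale.ROOT).endsWith(".jpg"))
                .map(JpgCrawler::toEntry)
                .collect(Collectors.toList());
        }
    }

    private static JpgEntry toEntry(Path file) {
        Path parent = file.getParent();
        Path folderName = (parent == null) ? null : parent.getFileName();
        String folder = (folderName == null) ? "" : folderName.toString();
        // A real indexer would build a Lucene Document here, with one field
        // for the folder and one for the filename, and add it to the index.
        return new JpgEntry(folder, file.getFileName().toString());
    }
}
```

Each JpgEntry would become one document, so a search on the folder name
returns individual JPGs rather than the single txt file.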

Erik


RE: QUERYPARSIN BOOSTING

2005-01-12 Thread Karthik N S
Hi Guys

Apologies...

If anybody has been closely watching Google: it boosts websites for
paid category sites based on the search words.

Can this (boosting the full website) be achieved in Lucene's search, based
on the search word?

If so, please explain or give examples.

with regards
karthik



-Original Message-
From: Chuck Williams [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 11, 2005 2:00 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: RE: QUERYPARSIN  BOOSTING


Karthik,

I don't think the boost in your example does much since you are using an
AND query, i.e. all hits will have to contain both vendor:nike and
contents:shoes.  If you used an OR, then the boost would put nike
products above (non-nike) shoes, unless there was some other factor that
causes score of contents:shoes to be 10x greater than that of
vendor:nike.  It's a good idea to look at the results of explain() when
analyzing what's happening with scoring, tuning your boosts and your
Similarity.
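
The AND versus OR point can be sketched with a toy scorer. This is not
Lucene's Similarity: it ignores tf, idf and length normalization and
simply adds 1 for a contents match and the boost for a vendor match. The
class name, document names and sample data are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class BoostDemo {
    static final class Doc {
        final String name, vendor, contents;
        Doc(String name, String vendor, String contents) {
            this.name = name;
            this.vendor = vendor;
            this.contents = contents;
        }
    }

    static boolean hasWord(String text, String word) {
        return Arrays.asList(text.split("\\s+")).contains(word);
    }

    /** Toy score: vendorBoost for a vendor match plus 1 for a contents match. */
    static double score(Doc d, String vendor, String term, double vendorBoost) {
        double s = 0.0;
        if (d.vendor.equals(vendor)) s += vendorBoost;
        if (hasWord(d.contents, term)) s += 1.0;
        return s;
    }

    /** requireBoth = true models "+contents:term +vendor:v"; false models an OR. */
    static List<String> search(List<Doc> docs, String vendor, String term,
                               double vendorBoost, boolean requireBoth) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs) {
            boolean v = d.vendor.equals(vendor);
            boolean c = hasWord(d.contents, term);
            if (requireBoth ? (v && c) : (v || c)) hits.add(d);
        }
        // Highest score first; the sort is stable, so ties keep input order.
        hits.sort(Comparator.comparingDouble(
            (Doc d) -> score(d, vendor, term, vendorBoost)).reversed());
        List<String> names = new ArrayList<>();
        for (Doc d : hits) names.add(d.name);
        return names;
    }

    static List<Doc> sampleDocs() {
        return Arrays.asList(
            new Doc("adidas-shoe", "adidas", "leather shoes"),
            new Doc("nike-runner", "nike", "running shoes"),
            new Doc("nike-cap", "nike", "cotton cap"),
            new Doc("nike-boot", "nike", "leather shoes"));
    }
}
```

With requireBoth set, every hit receives the same boost, so the ordering
is the same for any boost value. With OR and a boost of 10, the nike cap
moves above the adidas shoe.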

Chuck

   -Original Message-
   From: Nader Henein [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, January 11, 2005 12:21 AM
   To: Lucene Users List
   Subject: Re: QUERYPARSIN  BOOSTING
  
   From the text on the Lucene Jakarta Site :
   http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

   Lucene provides the relevance level of matching documents based on the
   terms found. To boost a term use the caret, ^, symbol with a boost
   factor (a number) at the end of the term you are searching. The higher
   the boost factor, the more relevant the term will be.

   Boosting allows you to control the relevance of a document by
   boosting its term. For example, if you are searching for

       jakarta apache

   and you want the term jakarta to be more relevant boost it using
   the ^ symbol along with the boost factor next to the term. You would
   type:

       jakarta^4 apache

   This will make documents with the term jakarta appear more relevant.
   You can also boost Phrase Terms as in the example:

       "jakarta apache"^4 "jakarta lucene"

   By default, the boost factor is 1. Although the boost factor must be
   positive, it can be less than 1 (e.g. 0.2)
  
   Regards.
  
   Nader Henein
  
  
   Karthik N S wrote:
  
   Hi Guys
   
   
   
   Apologies...
   
   This question may have been asked a million times on this forum; I need
   some clarifications.

   1) FieldType = keyword, name = vendor

   2) FieldType = text, name = contents

   Questions:

   1) How do I construct a query that makes hits for the VENDOR
   appear first?

   2) If boosting is to be applied, how?

   3) Is the query constructed below correct?
   
   +Contents:shoes +((vendor:nike)^10)
   
   
   
   Please Advise.
   Thx in advance.
   
   
   WITH WARM REGARDS
   HAVE A NICE DAY
   [ N.S.KARTHIK]
   
   
   
  



Re: QUERYPARSIN BOOSTING

2005-01-12 Thread Erik Hatcher
On Jan 12, 2005, at 5:30 AM, Karthik N S wrote:
> If somebody's is  been closely watching GOOGLE, It boost's WEBSITES for
> payed category sites based on search words.

Do you have an example of this?  My understanding is Google *separates* 
the display of sponsored sites and ad links (like the one a friend of 
mine registered for me on my name).  Separating is different than 
boosting.

Erik


HELP! Directory is NOT getting closed!

2005-01-12 Thread Joseph Ottinger
*sigh* Yet again, I apologize. I'm generating altogether too much traffic
here lately!

I'm stuck. I have a custom Directory, and I *need* a callback point so I
can clean up. There's a method for this: Directory.close(), which I've
overridden.

It never gets called!

According to IndexWriter.java, line 246 (in 1.4.3's codebase), if closeDir
is set, it's supposed to close the directory. That's fine - but that leads
me to believe that for some reason, closeDir is *not* set.

Why? Under what circumstances would this not be true, and under what
circumstances would you NOT want to close the Directory?

This is absolutely slaughtering my attempt at a Directory, because I need
a single unit-of-work, and I need a place to commit it, when it's done. If
I commit it inside the directory's innards, then the UOW gets corrupted
(and looks like it's more than one atomic action, which is EXACTLY what I
don't need.)

---
Joseph B. Ottinger http://enigmastation.com
IT Consultant[EMAIL PROTECTED]





Re: HELP! Directory is NOT getting closed!

2005-01-12 Thread Morus Walter
Joseph Ottinger writes:
 
> According to IndexWriter.java, line 246 (in 1.4.3's codebase), if closeDir
> is set, it's supposed to close the directory. That's fine - but that leads
> me to believe that for some reason, closeDir is *not* set.
>
> Why? Under what circumstances would this not be true, and under what
> circumstances would you NOT want to close the Directory?

From the sources, you can see that it is true only if the directory
is created by the IndexWriter itself. If you provide a directory to
the IndexWriter, you have to close it yourself.
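
The ownership rule can be sketched with stand-in classes. ToyDirectory
and ToyWriter are hypothetical and are not Lucene's Directory or
IndexWriter; they only mirror the closeDir behaviour described in this
thread:

```java
import java.io.Closeable;

class OwnershipDemo {
    /** Hypothetical stand-in for a custom Directory with a cleanup hook. */
    static final class ToyDirectory implements Closeable {
        boolean closed = false;

        @Override
        public void close() {
            // A custom directory could commit its unit of work here.
            closed = true;
        }
    }

    /** Hypothetical stand-in for a writer that closes only what it created. */
    static final class ToyWriter implements Closeable {
        private final ToyDirectory dir;
        private final boolean closeDir;

        /** The writer creates the directory itself, so it owns it. */
        ToyWriter(String path) {
            this.dir = new ToyDirectory();
            this.closeDir = true;
        }

        /** The caller supplies the directory, so the caller must close it. */
        ToyWriter(ToyDirectory supplied) {
            this.dir = supplied;
            this.closeDir = false;
        }

        ToyDirectory directory() {
            return dir;
        }

        @Override
        public void close() {
            if (closeDir) {
                dir.close();
            }
        }
    }

    /** Caller-side pattern: close the writer, then the supplied directory. */
    static void writeAndClose(ToyDirectory dir) {
        ToyWriter writer = new ToyWriter(dir);
        try {
            // ... add documents here ...
        } finally {
            writer.close(); // does not touch dir, because closeDir is false
            dir.close();    // the single cleanup point for the unit of work
        }
    }
}
```

With a supplied directory, the explicit dir.close() in the finally block
is the callback point for committing the unit of work.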

HTH
Morus




Re: HELP! Directory is NOT getting closed!

2005-01-12 Thread Joseph Ottinger
On Wed, 12 Jan 2005, Morus Walter wrote:

> Joseph Ottinger writes:
>
> > According to IndexWriter.java, line 246 (in 1.4.3's codebase), if closeDir
> > is set, it's supposed to close the directory. That's fine - but that leads
> > me to believe that for some reason, closeDir is *not* set.
> >
> > Why? Under what circumstances would this not be true, and under what
> > circumstances would you NOT want to close the Directory?
>
> From the sources, you can see, that is is true only, if the directory
> is created by the IndexWriter itself. If you provide a directory to
> the IndexWriter you have to close it yourself.


ARGH! (I've been saying that a lot lately!)

Okay, I was looking at the sources but missed that. Thank you very much.
*sigh*

---
Joseph B. Ottinger http://enigmastation.com
IT Consultant[EMAIL PROTECTED]





Performance hits using MultiSearcher?

2005-01-12 Thread asteigerwalt
I am pretty new to Lucene.

In my situation, there will most likely be one fairly large index, and over
time a trickle of smaller indexes being created that could eventually number
into the hundreds. Does using MultiSearcher to search against all these
separate indexes impose a performance hit compared to merging the smaller
indexes into the original larger one? Roughly how long could a typical index
merge take?

Thanks,
Ashley




RE: QUERYPARSIN BOOSTING

2005-01-12 Thread Chuck Williams
Google has natural results on the left and sponsored results on the
right.  I do not believe the natural results are affected by paid
keywords at all.  What you seem to be describing is the behavior of the
sponsored results, which I believe are explicitly attached to certain
keywords.

The same approach would work in Lucene.  Create a field to hold
purchased keywords (any keywords you want to associate with the
result).  Then you can include this field in your search with a high
boost (see DistributingMultiFieldQueryParser,
http://issues.apache.org/bugzilla/show_bug.cgi?id=32674).
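
As a rough illustration of searching a purchased-keywords field with a
high boost, the sketch below builds a query string in the query parser
syntax. The field names (contents, keywords), the class name and the
boost value are assumptions for illustration. It is not the
DistributingMultiFieldQueryParser from the Bugzilla entry, and it assumes
a single term with no special characters and the parser's default OR
operator:

```java
import java.util.Map;
import java.util.StringJoiner;

class FieldBoostQuery {
    /**
     * Expands one term into a group of per-field clauses, e.g.
     * (contents:shoes keywords:shoes^10.0). Fields with a boost of 1
     * are written without a ^ suffix, since 1 is the default.
     */
    static String expand(String term, Map<String, Float> fieldBoosts) {
        StringJoiner clauses = new StringJoiner(" ", "(", ")");
        for (Map.Entry<String, Float> e : fieldBoosts.entrySet()) {
            String clause = e.getKey() + ":" + term;
            if (e.getValue() != 1.0f) {
                clause += "^" + e.getValue();
            }
            clauses.add(clause);
        }
        return clauses.toString();
    }
}
```

The resulting string could be handed to the query parser, so that a
document whose keywords field contains the purchased term scores higher
than one that only mentions it in its contents.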

Google prefers certain results over others for certain keywords based on
various factors of the keyword purchase and the site (amount paid for
the keyword, Page Rank of the site, tenure of the listing, popularity of
the listing, etc.).  You could emulate this in various ways, using a
combination of document/field boosting and perhaps replication of the
term in the field (to increase its tf), or even perhaps multiple fields
that are boosted at different levels.  I'm not sure of the best approach
to this part -- you could experiment a little.

Chuck
