Re: Searching multiple fields in one Index of Documents

2002-02-13 Thread Kelvin Tan

Charles,

See http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00176.html
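
Independently of the linked message, here is a rough sketch of what a
two-field query such as the one in the question means. In Lucene's query
syntax a field is selected with a "field:term" prefix and "+" marks a clause
as required, so the example would look roughly like "+CATEGORY:bar +BODY:foo"
(how the terms are actually matched depends on the analyzer in use). The
plain-Java code below does not use Lucene at all; the class FieldedAndDemo,
its Clause type, and the sample documents are hypothetical and only
illustrate the boolean AND across fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: not Lucene code and not Lucene's API.
final class FieldedAndDemo {

    // One required clause: the named field must contain the given term.
    static final class Clause {
        final String field;
        final String term;

        Clause(String field, String term) {
            this.field = field;
            this.term = term;
        }
    }

    // Builds a document as a field-name -> field-text map from name/value pairs.
    static Map<String, String> doc(String... nameValuePairs) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i + 1 < nameValuePairs.length; i += 2) {
            d.put(nameValuePairs[i], nameValuePairs[i + 1]);
        }
        return d;
    }

    // True if the field text contains the term as a whole word (case-insensitive).
    static boolean fieldContains(Map<String, String> d, Clause c) {
        String text = d.get(c.field);
        if (text == null) {
            return false;
        }
        List<String> words = Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+"));
        return words.contains(c.term.toLowerCase(Locale.ROOT));
    }

    // Returns the positions of the documents that satisfy every clause (boolean AND).
    static List<Integer> search(List<Map<String, String>> docs, List<Clause> required) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            boolean all = true;
            for (Clause c : required) {
                if (!fieldContains(docs.get(i), c)) {
                    all = false;
                    break;
                }
            }
            if (all) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = Arrays.asList(
            doc("CATEGORY", "bar", "BODY", "some foo text"),
            doc("CATEGORY", "baz", "BODY", "more foo text"),
            doc("CATEGORY", "bar", "BODY", "nothing relevant"));
        List<Clause> query = Arrays.asList(
            new Clause("CATEGORY", "bar"), new Clause("BODY", "foo"));
        // Only the first document satisfies both clauses.
        System.out.println(search(docs, query));
    }
}
```

The point of the sketch is that each clause names its own field, so there is
no need to concatenate columns into a single searchable field.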

Regards,
K

- Original Message -
From: Charles Harvey [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, February 12, 2002 8:39 AM
Subject: Searching multiple fields in one Index of Documents


 I have a working installation of Lucene running against indexes created by
 a database query.
 Each Document in the Index contains fifteen or twenty fields. I am
 currently searching only one field (that contains concatenated database
 columns) because I cannot figure out how to search multiple fields. So:

 How can I use Lucene to search more than one field in an Index of
Documents?

 eg:
 field CATEGORY is(or contains) 'bar'
 AND
 field BODY contains 'foo'




 _

 The trouble with the rat-race is that even if you win you're still a
rat.
 --Lily Tomlin
 _
 Charles Harvey
 Developer
 http://www.philly.com
 Wk: 215 789 6057
 Cell: 215 588 0851


 --
 To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
 For additional commands, e-mail:
mailto:[EMAIL PROTECTED]








How does Lucene handle phrases containing words that are not indexed?

2002-02-13 Thread hugo burm


How does Lucene handle phrases (literals) containing words that are not
indexed (e.g. stopwords, one-letter words, numbers)? I did some tests
(Lucene demo, my own 12 XML documents, Cocoon search) and in all cases
it looks as if a search for the phrase "a specification" also finds
documents that contain "the specification" (or "D. Washington" instead
of "G. Washington").

Of course you can change the indexing behaviour and make sure there are no
stopwords, and that all one-letter words and numbers are indexed. But that
seems a bad approach. A better approach: 1) find all indexed words in the
phrase and from these find all documents containing them; 2) check the
occurrence of the phrase by opening the original document. I am wondering:
does Lucene perform step 2)? Of course this step burns some CPU cycles.
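
The effect described above can be reproduced with a small plain-Java sketch.
It assumes that the same stopword-removing analysis is applied to both the
documents and the query phrase, and that positions of removed words are not
recorded; that is an assumption about the setup, not a description of
Lucene's implementation. The class StopwordPhraseDemo and its tiny stopword
list are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: not Lucene code.
final class StopwordPhraseDemo {

    // Tiny assumed stopword list; real analyzers ship longer ones.
    static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    // Lowercases, splits on anything that is not a letter or digit, and drops stopwords.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty() && !STOPWORDS.contains(w)) {
                tokens.add(w);
            }
        }
        return tokens;
    }

    // True if the analyzed phrase occurs as a contiguous run in the analyzed document.
    static boolean phraseMatches(String document, String phrase) {
        List<String> docTokens = analyze(document);
        List<String> phraseTokens = analyze(phrase);
        if (phraseTokens.isEmpty()) {
            return false; // nothing searchable is left in the phrase
        }
        return Collections.indexOfSubList(docTokens, phraseTokens) >= 0;
    }

    public static void main(String[] args) {
        // "a" and "the" are both dropped, so both phrases reduce to the same single token.
        System.out.println(analyze("a specification"));
        System.out.println(analyze("the specification"));
        System.out.println(phraseMatches("We read the specification today", "a specification"));
    }
}
```

Under these assumptions the index simply has no information left that could
tell "a specification" from "the specification"; only a second pass over the
original text (step 2 above) could.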

Hugo

[EMAIL PROTECTED]






PrefixQuery Scoring

2002-02-13 Thread Jonathan Franzone


Whenever I add a PrefixQuery to my search, the scores get really small. For
example, if I run a query like +java, the scores start around 0.866... and
go down from there. But if I run a query like +java*, the scores start
around 0.00034... Is there a specific reason for this, or is it a bug?

Thanks,
Jonathan Franzone









RE: PrefixQuery Scoring

2002-02-13 Thread Doug Cutting

 From: Jonathan Franzone [mailto:[EMAIL PROTECTED]]
 
 Whenever I add a PrefixQuery to my search the scoring gets really small.
 For example if I do a query like this: +java then the scoring starts
 around 0.866... and so forth. But if I do a query like this: +java* then
 the scoring start like 0.00034... Is there a specific reason for this?

A PrefixQuery is equivalent to a query containing all the terms matching the
prefix, and hence usually contains a lot of terms.  With such a big
query, matching documents are likely to contain a smaller fraction of the
query terms and the match is thus weaker.  For example, the top-scoring
document in a prefix query might contain only one or two of 100 or more
query terms.  That's not a very strong match.  But the top-scoring document
in a single-term non-prefix query is guaranteed to contain all of the query
terms, and is thus a much stronger match.

There are of course other factors involved in scoring (e.g., document length
and term frequency).  I call the factor in question here "coordination
matching": documents which contain more of the query terms score higher.
This is to make the top hits of a boolean OR query look like those of a
boolean AND of the same terms, with the remaining OR results following.
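
To make the effect concrete, here is a minimal plain-Java sketch of a
coordination factor, computed as the fraction of query terms that a document
contains. This is a simplification for illustration, not Lucene's actual
scoring code; the class CoordDemo, the sample document, and the list of
terms that "java*" is assumed to expand to are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: a simplified coordination factor, not Lucene's scoring code.
final class CoordDemo {

    // Fraction of the query terms that occur in the document (0.0 to 1.0).
    static double coord(Set<String> docTerms, List<String> queryTerms) {
        if (queryTerms.isEmpty()) {
            return 0.0;
        }
        int matched = 0;
        for (String t : queryTerms) {
            if (docTerms.contains(t)) {
                matched++;
            }
        }
        return (double) matched / queryTerms.size();
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("java", "javascript", "lucene"));

        // Single-term query: the document contains every query term.
        List<String> single = Arrays.asList("java");

        // Assumed expansion of the prefix "java*" into ten matching index terms.
        List<String> expanded = Arrays.asList(
            "java", "javabeans", "javac", "javadoc", "javascript",
            "javaone", "javamail", "javasoft", "javaworld", "javax");

        System.out.println(coord(doc, single));   // 1.0
        System.out.println(coord(doc, expanded)); // 0.2
    }
}
```

With a real index the expansion can easily run to hundreds of terms, so the
fraction, and with it the overall score, shrinks accordingly. The ranking
among the hits of the prefix query is still meaningful; only the absolute
numbers are small.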

Doug





Re: indexing and searching different file formats

2002-02-13 Thread Peter Carlson

Hi Pradeep,

The Lucene Document is not document-type specific. It is a Lucene class
which is made up of fields (which have different options).
Data in a document is parsed and put into one or more of these fields.

So Lucene can really handle any kind of document; there just needs to be a
document parser that puts the document into the Lucene Document format.
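
As a rough sketch of that idea, the plain-Java code below maps raw file
content to named fields, with one parser per file type. None of these types
are part of Lucene: ParserDemo, DocumentParser, PlainTextParser,
NaiveMarkupParser, and the field names "title" and "body" are hypothetical,
and the markup handling is deliberately naive. In a real application the
resulting name/value pairs would then be copied into fields of a Lucene
Document.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: these types are not part of Lucene.
final class ParserDemo {

    // Turns raw file content into named fields, ready to be copied into an index document.
    interface DocumentParser {
        Map<String, String> parse(String rawContent);
    }

    // Plain text: the whole content becomes a single "body" field.
    static final class PlainTextParser implements DocumentParser {
        @Override
        public Map<String, String> parse(String rawContent) {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("body", rawContent.trim());
            return fields;
        }
    }

    // Very naive markup handling: pulls out <title> and strips all other tags.
    static final class NaiveMarkupParser implements DocumentParser {
        @Override
        public Map<String, String> parse(String rawContent) {
            Map<String, String> fields = new LinkedHashMap<>();
            int start = rawContent.indexOf("<title>");
            int end = rawContent.indexOf("</title>");
            if (start >= 0 && end > start) {
                fields.put("title",
                    rawContent.substring(start + "<title>".length(), end).trim());
            }
            String text = rawContent
                .replaceAll("(?s)<title>.*?</title>", " ") // title is stored separately
                .replaceAll("<[^>]*>", " ")                 // drop remaining tags
                .replaceAll("\\s+", " ")                    // collapse whitespace
                .trim();
            fields.put("body", text);
            return fields;
        }
    }

    // Picks a parser by file extension; unknown types fall back to plain text.
    static DocumentParser parserFor(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".html") || lower.endsWith(".htm") || lower.endsWith(".xml")) {
            return new NaiveMarkupParser();
        }
        return new PlainTextParser();
    }

    public static void main(String[] args) {
        String html =
            "<html><title>Hello</title><body><p>Some <b>bold</b> text</p></body></html>";
        System.out.println(parserFor("page.html").parse(html));
    }
}
```

Binary formats such as PDF or word-processor files would need a real
extraction library behind the same kind of interface; the indexing side
stays unchanged.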


I hope this helps.

--Peter

On 2/13/02 7:54 AM, Pradeep Kumar K [EMAIL PROTECTED] wrote:

 Hi Lucene friends!
 
  How can files of different formats be indexed and searched? (As far as I
 know, Lucene has an HTML indexer and searcher, which come along with
 it, and also an XML indexer, but is there any way to index files
 irrespective of the file type?)
 Any suggestions will be greatly appreciated.
 
 Thanks in advance.
 Pradeep
 
 
 --
 Robosoft Technologies, Mangalore, India
 
 
 
 
 






My own steammer (brazilian)

2002-02-13 Thread Bizu de Anúncio

My Brazilian stemmer has the same structure as the German stemmer, except
for the inner logic.

I created it, tested it, and now I'm trying to compile it, with no success.
The problem is the 'StandardTokenizer.java' class! I can't find it in the
package org.apache.lucene.analysis.standard.

The only file that exists there is one named 'StandardTokenizer.jj'.
What is this file for?

I have lucene-1.2-rc2. Can someone help me?

thanks,

jk







Re: My own steammer (brazilian)

2002-02-13 Thread Otis Gospodnetic

StandardTokenizer.java is created during the build process: it is generated
from StandardTokenizer.jj, which is a JavaCC grammar file.
Try building Lucene by typing 'ant compile'.

Otis

--- Bizu_de_Anúncio [EMAIL PROTECTED] wrote:
 My Brazilian stemmer has the same structure as the German stemmer,
 except for the inner logic.
 
 I created it, tested it, and now I'm trying to compile it, with no
 success. The problem is the 'StandardTokenizer.java' class! I can't
 find it in the package org.apache.lucene.analysis.standard.
 
 The only file that exists there is one named 'StandardTokenizer.jj'.
 What is this file for?
 
 I have lucene-1.2-rc2. Can someone help me?
 
 thanks,
 
   jk
 
 
 
 






Re: using lucene with a very large index

2002-02-13 Thread Otis Gospodnetic


--- tal blum [EMAIL PROTECTED] wrote:
 Hi, I'm building a very large index that contains several
 categories.
 I have several questions I hope you can answer.
 1) Is there a way to use Lucene with several indexes without merging
 them?

Look at the MultiSearcher class.

 2) Does the Document id change after merging indexes, or after adding or
 deleting documents?

Not sure.

 3) Has anyone implemented a GUI for the Lucene index that
 enables deletions by id or SQL-like queries?

I haven't seen anything like it.

 4) Assuming I have a term query that has a large number of hits, say
 10 million, is there a way to get, say, the top 10 results
 without going through all the hits?

See the Javadocs for Searcher and IndexSearcher; I think you'll find
the answer there.
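
For question 4, one common general technique is a bounded priority queue:
every matching document is still scored once, but only the k best hits are
kept, so low-scoring hits are discarded immediately instead of being
collected and sorted. The plain-Java sketch below assumes the scores are
already available in an array indexed by document id; TopKDemo and Hit are
hypothetical names, and this is not a description of how Searcher or
IndexSearcher are implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only: not Lucene code.
final class TopKDemo {

    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Orders hits from worst to best: lower score first; on ties, higher doc id first.
    static final Comparator<Hit> WORST_FIRST = new Comparator<Hit>() {
        @Override
        public int compare(Hit a, Hit b) {
            int byScore = Float.compare(a.score, b.score);
            return byScore != 0 ? byScore : Integer.compare(b.docId, a.docId);
        }
    };

    // Returns the k best hits, best first, keeping at most k hits in memory.
    static List<Hit> topK(float[] scores, int k) {
        List<Hit> best = new ArrayList<>();
        if (k <= 0) {
            return best;
        }
        int capacity = Math.max(1, Math.min(k, scores.length));
        PriorityQueue<Hit> heap = new PriorityQueue<>(capacity, WORST_FIRST);
        for (int docId = 0; docId < scores.length; docId++) {
            Hit h = new Hit(docId, scores[docId]);
            if (heap.size() < k) {
                heap.add(h);
            } else if (WORST_FIRST.compare(h, heap.peek()) > 0) {
                heap.poll(); // evict the current worst of the kept hits
                heap.add(h);
            }
        }
        best.addAll(heap);
        Collections.sort(best, Collections.reverseOrder(WORST_FIRST));
        return best;
    }

    public static void main(String[] args) {
        float[] scores = {0.1f, 0.9f, 0.4f, 0.9f, 0.7f};
        for (Hit h : topK(scores, 3)) {
            System.out.println(h.docId + " " + h.score);
        }
    }
}
```

Note that this still visits every hit once; it saves memory and sorting
work, not the scoring pass itself.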

Otis






Re: indexing and searching different file formats

2002-02-13 Thread Andrew Libby


Pradeep,
Currently Lucene does not provide the ability to convert documents
to text for indexing.  There is talk of adding this kind of thing to the
goals of the project, along with providing crawlers to traverse web,
local disk, FTP, and RDBMS sources of data.

The problem with indexing irrespective of file type is that each document
format contains embedded information that must be stripped out (or ignored)
before the text can be retrieved for indexing.  An extreme example is
PDF, which has a considerably complicated document format.

On the contributions page there are some pointers that may provide information
about processing the types of documents you're interested in.

http://jakarta.apache.org/lucene/docs/contributions.html

If you've not taken the time to do so, look at the FAQs; they are very
informative:

http://www.lucene.com/cgi-bin/faq/faqmanager.cgi
http://jakarta.apache.org/lucene/docs/gettingstarted.html
http://www.jguru.com/faq/Lucene

Good luck!

Andy

On Wed, Feb 13, 2002 at 09:24:33PM +0530, Pradeep Kumar K wrote:
 Hi Lucene friends!
 
 How can files of different formats be indexed and searched? (As far as I
 know, Lucene has an HTML indexer and searcher, which come along with
 it, and also an XML indexer, but is there any way to index files
 irrespective of the file type?)
 Any suggestions will be greatly appreciated.
 
 Thanks in advance.
 Pradeep
 
 
 --
 Robosoft Technologies, Mangalore, India
 
 
 
 

-- 
--
Andrew Libby
CommNav, Inc
[EMAIL PROTECTED]






RE: using lucene with a very large index

2002-02-13 Thread Hayes, Mark

 -Original Message-
 From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
 --- tal blum [EMAIL PROTECTED] wrote:
[...]
  4) Assuming I have a term query that has a large number of hits, say
  10 million, is there a way to get, say, the top 10 results
  without going through all the hits?
 
 See the Javadocs for Searcher and IndexSearcher, I think you'll find
 the answer there.

I have the same question but I can't see the answer in the Javadocs.  Do you
mean this statement?

  "The high-level search API (search(Query)) is usually more efficient, as it
  skips non-high-scoring hits."

It is not clear to me what "non-high-scoring hits" means -- do you know?

thanks,
mark
