Re: Searching multiple fields in one Index of Documents
Charles,

See http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00176.html

Regards,
K

----- Original Message -----
From: Charles Harvey [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, February 12, 2002 8:39 AM
Subject: Searching multiple fields in one Index of Documents

I have a working installation of Lucene running against indexes created by a database query. Each Document in the Index contains fifteen or twenty fields. I am currently searching only one field (that contains concatenated database columns) because I cannot figure out how to search multiple fields.

So: How can I use Lucene to search more than one field in an Index of Documents?
e.g.: field CATEGORY is (or contains) 'bar' AND field BODY contains 'foo'

Charles Harvey
Developer
http://www.philly.com

--
To unsubscribe, e-mail: mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]
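For readers following the thread: in Lucene's query parser syntax, a condition like the one in the question is usually written with field prefixes, e.g. `CATEGORY:bar AND BODY:foo`. The sketch below is not Lucene code. It is a plain-Java toy model of what such a query means, where a document is a map from field name to text. The names `FieldSearchSketch`, `doc` and `matchesAll` are hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy model only: a "document" is a map from field name to text.
// FieldSearchSketch, doc and matchesAll are hypothetical names, not Lucene API.
final class FieldSearchSketch {

    // Builds a two-field document like the ones described in the question.
    static Map<String, String> doc(String category, String body) {
        Map<String, String> fields = new HashMap<>();
        fields.put("CATEGORY", category);
        fields.put("BODY", body);
        return fields;
    }

    // True only if every required field contains its required word
    // (case-insensitive, whole-word match). This models an AND of field clauses.
    static boolean matchesAll(Map<String, String> doc, Map<String, String> required) {
        for (Map.Entry<String, String> clause : required.entrySet()) {
            String text = doc.get(clause.getKey());
            if (text == null) {
                return false; // the document has no such field
            }
            String wanted = clause.getValue().toLowerCase(Locale.ROOT);
            boolean found = false;
            for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (word.equals(wanted)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false; // one failed clause rejects the document
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> query = new HashMap<>();
        query.put("CATEGORY", "bar");
        query.put("BODY", "foo");

        System.out.println(matchesAll(doc("bar", "some foo text"), query));
        System.out.println(matchesAll(doc("baz", "some foo text"), query));
    }
}
```

The point of the model is only that each clause names its own field, so a document must satisfy the CATEGORY clause and the BODY clause independently.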
How does Lucene handle phrases containing words that are not indexed?
How does Lucene handle phrases (literals) containing words that are not indexed (e.g. stopwords, one-letter words, numbers)?

I did some tests (Lucene demo, my own 12 XML documents, Cocoon search), and in all cases it looks like a search for the phrase "a specification" also finds documents that contain "the specification" (or "D. Washington" instead of "G. Washington").

Of course you can change the indexing behaviour and make sure there are no stopwords, and that all one-letter words and numbers are indexed. But that seems a bad approach. A better approach:

1) Find all indexed words in the phrase, and from these words find all documents containing them.
2) Check the occurrence of the phrase by opening the original document.

I am wondering: does Lucene perform step 2)? Of course this step burns some CPU cycles.

Hugo
[EMAIL PROTECTED]
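The behaviour described above can be reproduced with a toy analyzer. The sketch below is not Lucene's analyzer: the stop list, the rule that drops one-letter tokens, and the names `StopWordPhraseSketch` and `indexedTerms` are all assumptions chosen to mirror the description in the message. It shows why two different phrases can become indistinguishable once the unindexed words are removed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer, not Lucene code. The stop list and the one-letter rule
// are assumptions made to mirror the behaviour described in the message.
final class StopWordPhraseSketch {

    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    // Lower-cases the text, splits on non-alphanumerics, and drops
    // stop words and one-letter tokens. What remains is what gets indexed.
    static List<String> indexedTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.length() <= 1 || STOP_WORDS.contains(token)) {
                continue; // never reaches the index
            }
            terms.add(token);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Both phrases reduce to the same single indexed term.
        System.out.println(indexedTerms("a specification"));
        System.out.println(indexedTerms("the specification"));
        // The same happens when the differing word is a single letter.
        System.out.println(indexedTerms("D. Washington"));
        System.out.println(indexedTerms("G. Washington"));
    }
}
```

Under these assumptions, a phrase match computed purely from indexed terms cannot tell the two phrases apart, which is why step 2) in the message would require going back to the original text.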
PrefixQuery Scoring
Whenever I add a PrefixQuery to my search the scoring gets really small. For example, if I do a query like this:

+java

then the scoring starts around 0.866... and so forth. But if I do a query like this:

+java*

then the scoring starts at around 0.00034...

Is there a specific reason for this, or is it a bug?

Thanks,
Jonathan Franzone
RE: PrefixQuery Scoring
From: Jonathan Franzone [mailto:[EMAIL PROTECTED]]
> Whenever I add a PrefixQuery to my search the scoring gets really small. For example, if I do a query like this: +java then the scoring starts around 0.866... and so forth. But if I do a query like this: +java* then the scoring starts at around 0.00034... Is there a specific reason for this?

A PrefixQuery is equivalent to a query containing all the terms matching the prefix, and hence usually contains a lot of terms. With such a big query, matching documents are likely to contain fewer of the query terms, and the match is thus weaker.

For example, the top-scoring document in a prefix query might contain only one or two of 100 or more query terms. That's not a very strong match. But the top-scoring document in a single-term non-prefix query is guaranteed to contain all of the query terms, and is thus a much stronger match.

There are of course other factors involved in scoring (e.g., document length, term frequency). I call the factor in question here "coordination matching": documents which contain more of the query terms score higher. This is to make the top hits of boolean OR queries look like those of a boolean AND of the same terms, with the OR results following.

Doug
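To make the arithmetic of the coordination factor concrete, here is a minimal sketch. It assumes the factor is simply the fraction of query terms a document contains and that it multiplies an otherwise unchanged raw score. The class and method names (`CoordSketch`, `coord`, `score`) are hypothetical, and real Lucene scoring combines this with the other factors mentioned above.

```java
// Simplified model of coordination matching, not Lucene's actual scorer.
// CoordSketch, coord and score are hypothetical names for illustration.
final class CoordSketch {

    // Fraction of the query's terms that the document contains.
    static double coord(int matchedTerms, int queryTerms) {
        if (queryTerms <= 0 || matchedTerms < 0 || matchedTerms > queryTerms) {
            throw new IllegalArgumentException("need 0 <= matchedTerms <= queryTerms, queryTerms > 0");
        }
        return (double) matchedTerms / queryTerms;
    }

    // Assumption: the raw score is simply scaled by the coordination factor.
    static double score(double rawScore, int matchedTerms, int queryTerms) {
        return rawScore * coord(matchedTerms, queryTerms);
    }

    public static void main(String[] args) {
        // Single-term query "+java": the hit contains 1 of 1 query terms.
        System.out.println(score(0.866, 1, 1));
        // Prefix query "+java*" expanded to, say, 100 terms: the same
        // document contains only 2 of them, so its score shrinks sharply.
        System.out.println(score(0.866, 2, 100));
    }
}
```

The figure of 100 expanded terms is taken from the example in the message; the raw score of 0.866 is reused from the question purely as an input, not as a prediction of what Lucene would report.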
Re: indexing and searching different file formats
Hi Pradeep,

The Lucene Document is not document-type specific. It is a Lucene class which is made up of fields (which have different options). Data in a document is parsed and put into one or more of these fields. So Lucene can really handle any kind of document; there just needs to be a document parser that puts the document into the Lucene Document format.

I hope this helps.

--Peter

On 2/13/02 7:54 AM, Pradeep Kumar K [EMAIL PROTECTED] wrote:

> Hi Lucene friends!
> How can files of different formats be indexed and searched? (As far as I know, Lucene comes with an HTML indexer and searcher, and also an XML indexer, but is there any way to index files irrespective of the file type?)
> Any suggestions will be greatly appreciated.
> Thanks in advance.
> Pradeep
> --
> Robosoft Technologies, Mangalore, India
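The idea of a per-format parser feeding a common set of fields can be sketched without any Lucene classes. In the example below, a plain map of field name to text stands in for the Lucene Document, and the `DocumentParser` interface, the field names `title` and `contents`, and the regex-based HTML handling are all assumptions for illustration. A real HTML or PDF parser would need to be far more robust.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: a Map of field name -> text stands in for a Lucene Document.
// DocumentParser and the field names are hypothetical, not Lucene API.
final class ParserSketch {

    // One implementation per file format; all produce the same field layout.
    interface DocumentParser {
        Map<String, String> parse(String raw);
    }

    // Plain text: everything goes into a single "contents" field.
    static final DocumentParser PLAIN_TEXT = raw -> {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("contents", raw.trim());
        return fields;
    };

    private static final Pattern TITLE =
            Pattern.compile("(?is)<title>(.*?)</title>");

    // Naive HTML: pull out <title>, then strip every tag from the text.
    // Regex tag stripping is a deliberate simplification.
    static final DocumentParser NAIVE_HTML = raw -> {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = TITLE.matcher(raw);
        fields.put("title", m.find() ? m.group(1).trim() : "");
        String text = raw.replaceAll("(?s)<[^>]*>", " ")
                         .replaceAll("\\s+", " ")
                         .trim();
        fields.put("contents", text);
        return fields;
    };

    public static void main(String[] args) {
        String html = "<html><head><title>Hello</title></head>"
                + "<body><p>Some <b>bold</b> text</p></body></html>";
        System.out.println(NAIVE_HTML.parse(html));
        System.out.println(PLAIN_TEXT.parse("  just some text  "));
    }
}
```

Once every format is reduced to the same fields, the indexing side no longer needs to know which parser produced them.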
My own stemmer (Brazilian)
My Brazilian stemmer has the same structure as the German stemmer, except for the inner logic. I created it and tested it, and now I'm trying to compile it, with no success.

The problem is the 'StandardTokenizer.java' class! I can't find it in the package org.apache.lucene.analysis.standard. The only file that exists there is a file named 'StandardTokenizer.jj'. What is this file for?

I have lucene-1.2-rc2.

Can someone help me?

thanks,
jk
Re: My own stemmer (Brazilian)
That file is created during the build process. Try building Lucene by typing 'ant compile'.

Otis

--- Bizu_de_Anúncio [EMAIL PROTECTED] wrote:

> My Brazilian stemmer has the same structure as the German stemmer, except for the inner logic. I created it and tested it, and now I'm trying to compile it, with no success.
> The problem is the 'StandardTokenizer.java' class! I can't find it in the package org.apache.lucene.analysis.standard. The only file that exists there is a file named 'StandardTokenizer.jj'. What is this file for?
> I have lucene-1.2-rc2.
> Can someone help me?
> thanks,
> jk
Re: using lucene with a very large index
--- tal blum [EMAIL PROTECTED] wrote:

> Hi,
> I'm building a very large index that contains several categories. I have several questions I hope you can answer.
>
> 1) Is there a way to use Lucene with several indexes without merging them?

Look at the MultiSearcher class.

> 2) Does the Document id change after merging indexes, or after adding or deleting documents?

Not sure.

> 3) Has anyone implemented a GUI for the Lucene index that enables deletions by id or SQL-like queries?

I haven't seen anything like it.

> 4) Assuming I have a term query that has a large number of hits, say 10 million, is there a way to get, say, the top 10 results without going through all the hits?

See the Javadocs for Searcher and IndexSearcher; I think you'll find the answer there.

Otis
Re: indexing and searching different file formats
Pradeep,

Currently Lucene does not provide the ability to convert documents to text for indexing. There is talk of adding this kind of thing to the goals of the project, along with providing crawlers to traverse web, local disk, FTP, and RDBMS sources of data.

The problem with indexing irrespective of file type is that each document format contains embedded information that must be stripped out (or ignored), and the text needs to be retrieved for indexing. An extreme example is PDF, which has a considerably complicated document format.

On the contributions page there are some pointers that may provide information about processing the types of documents you're interested in:

http://jakarta.apache.org/lucene/docs/contributions.html

If you've not taken the time to do so, look at the FAQs; they are very informative:

http://www.lucene.com/cgi-bin/faq/faqmanager.cgi
http://jakarta.apache.org/lucene/docs/gettingstarted.html
http://www.jguru.com/faq/Lucene

Good luck!

Andy

On Wed, Feb 13, 2002 at 09:24:33PM +0530, Pradeep Kumar K wrote:

> Hi Lucene friends!
> How can files of different formats be indexed and searched? (As far as I know, Lucene comes with an HTML indexer and searcher, and also an XML indexer, but is there any way to index files irrespective of the file type?)
> Any suggestions will be greatly appreciated.
> Thanks in advance.
> Pradeep
> --
> Robosoft Technologies, Mangalore, India

--
Andrew Libby
CommNav, Inc
[EMAIL PROTECTED]
RE: using lucene with a very large index
> -----Original Message-----
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
>
> --- tal blum [EMAIL PROTECTED] wrote:
> > [...]
> > 4) Assuming I have a term query that has a large number of hits, say 10 million, is there a way to get, say, the top 10 results without going through all the hits?
>
> See the Javadocs for Searcher and IndexSearcher; I think you'll find the answer there.

I have the same question, but I can't see the answer in the Javadocs. Do you mean this statement?

    The high-level search API (search(Query)) is usually more efficient, as it skips non-high-scoring hits.

It is not clear to me what "non-high-scoring hits" means -- do you know?

thanks,
mark
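One common way to keep only the best k hits while discarding the rest is a bounded min-heap: every hit's score is still looked at once, but a hit is dropped immediately unless it beats the weakest score currently kept. Whether this is what the Javadoc remark refers to is an assumption; the sketch below is generic Java, not Lucene's implementation, and `TopHitsSketch` and `topScores` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Generic top-k selection with a bounded min-heap. Not Lucene code;
// TopHitsSketch and topScores are hypothetical names for illustration.
final class TopHitsSketch {

    // Returns the k highest scores in descending order, using O(k) memory.
    static List<Float> topScores(float[] scores, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Min-heap: the head is always the weakest score currently kept.
        PriorityQueue<Float> heap = new PriorityQueue<>(k);
        for (float score : scores) {
            if (heap.size() < k) {
                heap.add(score);
            } else if (score > heap.peek()) {
                heap.poll();      // evict the current weakest hit
                heap.add(score);
            }
            // Otherwise the hit is "non-high-scoring" and is skipped.
        }
        List<Float> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        float[] scores = {0.1f, 0.9f, 0.5f, 0.7f, 0.3f};
        System.out.println(topScores(scores, 2));
    }
}
```

With 10 million hits and k = 10, this keeps at most 10 entries in memory at any time, although it does not avoid computing a score for each matching document.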