Configurable indexing of an RDBMS, has it been done before?
Many times I've written ad-hoc code that pulls in data from an RDBMS and builds a Lucene index. The use case is a typical database-driven dynamic website that would be a hassle to spider (say, due to tricky authentication). I had a feeling this had been done in a general manner, but didn't see any code in the sandbox, nor did any searches turn it up. I've spent a few minutes thinking this through - what I'd expect to be able to configure is:

1. JDBC driver + connection params
2. Query to do a one-time full index
3. Query to show new records
4. Query to show changed records
5. Query to show deleted records
6. Query columns to Lucene Field name mapping
7. Type of each field name (e.g. the equivalent of the args to the Field ctor)

So a simple example, taking item 2:

query: select url, name, body from foo

(now the column to field mapping)
col 1 = url
col 2 = title
col 3 = contents

(now the field types for each named field)
url = Field( ...store=true, index=false)
title = Field( ...store=true, index=true)
contents = Field( ...store=false, index=true)

And voila - nice, elegant, data-driven indexing. Does it exist? Should it? :)

PS I know that in the more general form, "query" needs to be replaced by the queries above, the "changed records" query may need some timestamp variable expansion, and the queries may need paging to deal with DBs like MySQL that don't have cursors for large result sets...

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
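As a sketch of what items 6 and 7 above might look like in code - every name here is invented for illustration, there is no such sandbox class - the column-to-field mapping piece could be as simple as a table of per-column specs applied to each result-set row. A real version would emit org.apache.lucene.document.Field objects and sit inside the JDBC/IndexWriter loop:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the column-to-field mapping (items 6 and 7): each column of
// "select url, name, body from foo" gets a field name plus store/index flags
// (the Field constructor args). All class and field names are hypothetical.
public class RowMapperSketch {
    static class FieldSpec {
        final String name;
        final boolean store;
        final boolean index;
        FieldSpec(String name, boolean store, boolean index) {
            this.name = name;
            this.store = store;
            this.index = index;
        }
    }

    // one spec per column, in select-list order
    static final FieldSpec[] MAPPING = {
        new FieldSpec("url",      true,  false),
        new FieldSpec("title",    true,  true),
        new FieldSpec("contents", false, true),
    };

    // turn one result-set row into name=value triples with their flags;
    // a real implementation would build Lucene Field objects instead
    static List<String> mapRow(String[] row) {
        List<String> fields = new ArrayList<String>();
        for (int col = 0; col < row.length; col++) {
            FieldSpec spec = MAPPING[col];
            fields.add(spec.name + "=" + row[col]
                       + " store=" + spec.store + " index=" + spec.index);
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] row = {"http://example.com/1", "Hello", "Hello world body"};
        for (String f : mapRow(row)) {
            System.out.println(f);
        }
    }
}
```

The point of the data-driven approach is that MAPPING would be read from a config file rather than hard-coded, so adding a new indexed table means editing configuration, not Java.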
Re: PHP-Lucene Integration
Hi! Can you please explain how you implemented the Java and PHP parts to let them communicate through this bridge? The bridge's project summary talks about a Java application server or a dedicated Java process, and I'm not into Java that much. Currently I'm using a self-written command-line search program that outputs its results to standard output. I guess your solution must be better ;) If the communication parts of your code aren't top secret, can you please share them with me/us? Regards, Sanyi
Re: Configurable indexing of an RDBMS, has it been done before?
Yep, this is how we do it. We have a search.xml that maps database fields to search fields, and a parameter part that describes the 'click for detailed result' url and the parameter names (based on the search fields). In this xml we also describe how the different fields should be stored; for instance, for a number of large text fields we use the unstored option. The framework that we have built around it has an element that we call a detailer. This detailer creates a Lucene Document with the fields as specified in the search.xml. To illustrate, here is the code that specifies the detailer for a forum.

-- XML -
<documenttype id="FORUM" index="general" defaultfield="body">
  <fields>
    <field property="messageid" searchfield="messageid" type="unindexed" key="true"/>
    <field property="instanceid" searchfield="instanceid" type="unindexed"/>
    <field property="subject" searchfield="title" type="split" maxwords="8"/>
    <field property="body" searchfield="default" type="split" maxwords="20"/>
    <field property="aka_username" searchfield="username" type="keyword"/>
    <field property="modifiedDateAsDate" searchfield="modifieddate" type="keyword"/>
  </fields>
  <action uri="/forum/viewMessage.do" image="/htmlarea/images/cops_insert_threadlink.gif">
    <parameter property="messageid" name="messageid"/>
    <parameter property="instanceid" name="instanceid"/>
  </action>
  <analyzer classname="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
</documenttype>
--- END XML ---

Please note: messageid is the key field here. When we search the index we use a combined TYPE + KEY id to filter out double hits on the same document (not unusual in, for instance, a long forum thread). Per type of document we also specify what picture to show in the result (image), and we specify in what index the result should be written and what the general search field is (if the user submits a query without "search all" and without a field specified). We have added the 'split' keyword, which makes it possible to search a long text but only store a bit of it in the resulting hit.
The reindex is pretty straightforward: we build a series of detailers for all possible document types, and we run through the database and call the right detailer from a HashMap. We have not included the JDBC stuff, since the application is always running in Tomcat-Struts and since we cache most of the database reads (a completely different story). Queries on new and changed records seem to only make sense if asked in a context of time (right?). We have not needed it yet. The mapping can be queried from a singleton Java class (SearchConfiguration). We are currently adding functionality to store 'user structured data', best imagined as user-defined input forms that are described in XML and are then stored as XML in the database. We query these documents using Lucene. These documents end up in the same index, but this is quite manageable by using specialized detailers. For these documents the type is more important than for the 'normally' stored documents. For this latter situation the search logic assumes that the query is appropriately configured by the application. I am not sure if this is the kind of solution that you are looking for, but everything we produce is 100% open source. Cheers, Aad
Retrieve all documents - possible?
Hi, is it possible to retrieve ALL documents from a Lucene index? This should then actually not be a search... Karl
Re: Starts With x and Ends With x Queries
On Feb 7, 2005, at 2:07 AM, sergiu gordea wrote:

Hi Erik,

"In order to prevent extremely slow WildcardQueries, a Wildcard term must not start with one of the wildcards * or ?."

"I don't read that as saying you cannot use an initial wildcard character, but rather that if you use a leading wildcard character you risk performance issues. I'm going to change must to should."

Will this change be available in the next release of Lucene? How do you plan to implement this? Will this be available as an attribute of QueryParser?

I'm not changing any functionality. WildcardQuery will still support leading wildcard characters, and QueryParser will still disallow them. All I'm going to change is the javadoc that makes it sound like WildcardQuery does not support leading wildcard characters.

Erik
Re: Retrieve all documents - possible?
you could use something like:

int maxDoc = reader.maxDoc();
for (int i = 0; i < maxDoc; i++) {
    Document doc = reader.document(i);
}

Bernhard
Re: Retrieve all documents - possible?
Don't forget to test if a document is deleted with reader.isDeleted(i).
Re: Retrieve all documents - possible?
Karl Koch wrote: Hi, is it possible to retrieve ALL documents from a Lucene index? This should then actually not be a search... You are right. Just use the IndexReader.document(int). -- Best regards, Andrzej Bialecki (Information Retrieval, Semantic Web, Embedded Unix, System Integration) http://www.sigram.com Contact: info at sigram dot com
Similarity coord,lengthNorm
I have varying-length text fields which I am searching on. I would like relevancy to be dictated predominantly by the number of terms in my query that match. Right now I am seeing a high relevancy for a single word matching in a small document, even though all the terms in my query don't match. Does anyone have an example of a custom Similarity subclass which overrides the coord and lengthNorm methods? Thanks.. Michael
RE: Similarity coord,lengthNorm
Would fixing the lengthNorm to 1 fix this problem? Michael
Re: Similarity coord,lengthNorm
On Feb 7, 2005, at 8:53 AM, Michael Celona wrote: Would fixing the lengthNorm to 1 fix this problem? Yes, it would eliminate the length of a field as a factor. Your best bet is to set up a test harness where you can try out various tweaks to Similarity, but setting the length normalization factor to 1.0 may be all you need to do, as the coord() takes care of the other factor you're after. Erik
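To make the trade-off concrete, here is a toy arithmetic sketch in plain Java - deliberately not Lucene's full scoring formula (term weights are fixed at 1.0, no tf/idf) - showing how flattening lengthNorm to 1.0 while keeping coord() changes the ranking. The two factors used do follow DefaultSimilarity: lengthNorm = 1/sqrt(numTerms) and coord = overlap/maxOverlap.

```java
// Toy illustration of coord() vs. lengthNorm(): a short document matching
// 1 of 3 query terms can outscore a long document matching all 3 under the
// default length normalization, but not once lengthNorm is flattened to 1.0.
public class LengthNormDemo {
    // DefaultSimilarity's length normalization
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // simplified score: coord * (one unit of weight per matched term) * norm
    static float score(int matched, int queryTerms, int docTerms, boolean flatten) {
        float coord = (float) matched / queryTerms;
        float norm = flatten ? 1.0f : defaultLengthNorm(docTerms);
        return coord * matched * norm;
    }

    public static void main(String[] args) {
        // short doc (4 terms) matching 1 of 3 query terms,
        // vs. long doc (400 terms) matching all 3
        System.out.println(score(1, 3, 4, false));   // ~0.167 - short doc wins
        System.out.println(score(3, 3, 400, false)); // ~0.15
        System.out.println(score(1, 3, 4, true));    // ~0.333
        System.out.println(score(3, 3, 400, true));  // 3.0  - full match wins
    }
}
```

With the default norm the partial match on the tiny document ranks first; with lengthNorm flattened, coord() dominates and the document matching all query terms wins, which is the behavior Michael is after.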
RE: Similarity coord,lengthNorm
Hi Michael, I'd suggest first using the explain() mechanism to figure out what's going on. Besides lengthNorm(), another factor that is likely skewing your results in my experience is idf(), which Lucene typically makes very large by squaring the intrinsic value. I've found it helpful to flatten lengthNorm(), tf() and idf() relative to what is used in DefaultSimilarity. There is a comparative evaluation of Similarities going on now. You might consider looking at these: Bug 32674 has a WikipediaSimilarity posted that you might want to try. You might want to flatten lengthNorm() even further (e.g. all the way to 1.0), but I'd suggest trying it as-is first. If you try it, please post your assessment. Here's the link: http://issues.apache.org/bugzilla/show_bug.cgi?id=32674 You also might find it interesting to read the thread entitled "RE: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?" on lucene-dev, as this contains a discussion of many of the issues. Good luck, Chuck
Query Analyzer
How do I set the analyzer when I build the query in my code instead of using a query parser? Thanks in advance, Ravi.
Re: Query Analyzer
On Feb 7, 2005, at 11:29 AM, Ravi wrote: How do I set the analyzer when I build the query in my code instead of using a query parser? You don't. All terms you use for any Query subclasses you instantiate must match exactly the terms in the index. If you need an analyzer to do this then you're responsible for doing it yourself, just as QueryParser does underneath. I do this myself in my current application like this:

private Query createPhraseQuery(String fieldName, String string, boolean lowercase) {
    RossettiAnalyzer analyzer = new RossettiAnalyzer(lowercase);
    TokenStream stream = analyzer.tokenStream(fieldName, new StringReader(string));
    PhraseQuery pq = new PhraseQuery();
    Token token;
    try {
        while ((token = stream.next()) != null) {
            pq.add(new Term(fieldName, token.termText()));
        }
    } catch (IOException ignored) {
        // ignore - shouldn't get an IOException on a StringReader
    }
    if (pq.getTerms().length == 1) {
        // optimize single term phrase to TermQuery
        return new TermQuery(pq.getTerms()[0]);
    }
    return pq;
}

Hope that helps. Erik
Re: PHP-Lucene Integration
Howdy, For starters, compile and install the Java bridge (and if necessary recompile PHP and Apache2) and make sure it works (there's a test php file supplied). Then, here's a simplified part of my code, just to give you an example of how it works. This is the part that does the searching; indexing is done in a similar way.

PHP:

...some code here for HTML page setup etc...

$lucene_dir = $GLOBALS["lucene_dir"];
java_set_library_path("/path/to/your/custom/lucene-classes.jar");
$obj = new Java("searcher"); // searcher is the custom written class that does the actual searching and data output
$writer = new Java("java.io.StringWriter");
$obj->setWriter($writer);
$obj->initSearch($lucene_dir);
$obj->getQuery($query); // $query is the user-supplied query from the HTML form, not visible here
// get the last exception
$e = java_last_exception_get();
if ($e) {
    // print error
    echo $e->toString();
} else {
    echo $writer->toString();
    $writer->flush();
    $writer->close();
}
// clear the exception
java_last_exception_clear();

- JAVA (custom written class located in the /path/to/your/custom/lucene-classes.jar):

import ...whatever is needed here for the class...

public class searcher {
    IndexReader reader = null;
    IndexSearcher s = null;   // the searcher used to open/search the index
    Query q = null;           // the Query created by the QueryParser
    BooleanQuery query = new BooleanQuery();
    Hits hits = null;         // the search results
    public Writer out;

    public void setWriter(Writer out) {
        this.out = out;
    }

    public void initSearch(String indexName) throws Exception {
        try {
            File indexFile = new File(indexName);
            Directory activeDir = FSDirectory.getDirectory(indexFile, false);
            if (IndexReader.isLocked(activeDir)) {
                // out.write("Lucene index is locked, waiting 5 sec.");
                Thread.sleep(5000);
            }
            reader = IndexReader.open(indexName);
            s = new IndexSearcher(reader);
            // out.write("Index opened");
        } catch (Exception e) {
            throw new Exception(e.getMessage());
        }
    }

    public void getQuery(String queryString) throws Exception {
        int totalhits = 0;
        Analyzer analyzer = new StandardAnalyzer();
        String[] queryFields = {"field1", "field2", "field3", "field4", "field5"};
        float[] boostFields = {10, 6, 2, 1, 1};
        try {
            for (int i = 0; i < queryFields.length; i++) {
                q = QueryParser.parse(queryString, queryFields[i], analyzer);
                if (boostFields[i] > 1)
                    q.setBoost(boostFields[i]);
                query.add(q, false, false);
            }
        } catch (ParseException e) {
            throw new Exception(e.getMessage());
        }
        try {
            hits = s.search(query);
        } catch (Exception e) {
            throw new Exception(e.getMessage());
        }
        totalhits = hits.length();
        if (totalhits == 0) {
            // if we find no hits, tell the user
            out.write("<br>I'm sorry, I couldn't find your query: " + queryString);
        } else {
            for (int i = 0; i < totalhits; i++) {
                Document doc = hits.doc(i);
                String field1 = doc.get("field1");
                String field2 = doc.get("field2");
                String field3 = doc.get("field3");
                String field4 = doc.get("field4");
                String field5 = doc.get("field5");
                out.write("Field1: " + field1 + ", Field2: " + field2 + ", Field3: " + field3
                          + ", Field4: " + field4 + ", Field5: " + field5 + "<br>");
            }
        }
    }
}
RE: Query Analyzer
That worked. Thanks a lot.
Re: PHP-Lucene Integration
Wow, thanks all for the great spectrum of possibilities. We'll be doing a design review in a week or two with the client and we'll find out what way would be best for their site. I'll report back then. Thanks again, what a group! Owen
Re: Reconstruct segments file?
Doug Cutting [EMAIL PROTECTED] writes: Ian Soboroff wrote: I've looked over the file formats web page, and poked at a known-good segments file from a separate, similar index using od(1) and such. I guess what I'm not sure how to do is to recover the SegSize from the segment I have. The SegSize should be the same as the length in bytes of any of the .f[0-9]+ files in the segment. If your segment is in compound format then you can use IndexReader.main() in the current SVN version to list the files and sizes in the .cfs file, including its contained .f[0-9]+ files. Thanks, Doug, that is a huge help. BTW, the fileformats.html page on the Lucene web site is incorrect with regard to the segments file. The description should read:

Segments --> Format, Version, Counter, SegCount, <SegName, SegSize>^SegCount

That is, the Counter field is missing. The Counter field is a UInt32. Counter is used to generate the next segment name (see IndexWriter.newSegmentName()). Speaking of Counter, I have a dumb question. If the segments are named using an integer counter which is incremented, what is the point in converting that counter into a string for the segment filename? Why not just name the segments e.g. 1.frq, etc.? Ian
Re: Reconstruct segments file?
Ian Soboroff wrote: Speaking of Counter, I have a dumb question. If the segments are named using an integer counter which is incremented, what is the point in converting that counter into a string for the segment filename? Why not just name the segments e.g. 1.frq, etc.? The names are prefixed with an underscore, since it turns out that some filesystems have trouble (DOS?) with certain all-digit names. Other than that, they are integers, just with a large radix. Doug
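The scheme Doug describes can be sketched in a couple of lines of plain Java. This assumes the counter is encoded with Integer.toString at Character.MAX_RADIX (base 36) and prefixed with an underscore, which matches the description above but is a sketch, not code copied from Lucene's source:

```java
// Sketch of the segment naming scheme: an incrementing integer counter,
// rendered in a large radix (base 36) and prefixed with "_" so the name
// is never all digits. Assumed encoding, not Lucene's actual source.
public class SegmentNameDemo {
    static String segmentName(int counter) {
        return "_" + Integer.toString(counter, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        System.out.println(segmentName(9));   // _9
        System.out.println(segmentName(10));  // _a
        System.out.println(segmentName(36));  // _10
    }
}
```

So successive segments come out as _1, _2, ... _9, _a, _b, and so on - still just the counter, only written compactly and with the leading underscore sidestepping the all-digit filename issue.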
Re: Configurable indexing of an RDBMS, has it been done before?
Nice, very similar to what I was thinking of, where the most significant difference is probably just that I was thinking of a batch indexer, not one embedded in a web container. Probably a worthwhile contribution to the sandbox.
Fwd: SearchBean?
I want to double-check with the user community now that I've run this past the lucene-dev list. Anyone using SearchBean from the Sandbox? If so, please speak up and let me know what it offers that the sort feature does not. If this is now essentially deprecated, I'd like to remove it. Thanks, Erik Begin forwarded message: From: Erik Hatcher [EMAIL PROTECTED] Date: February 6, 2005 10:02:37 AM EST To: Lucene List lucene-dev@jakarta.apache.org Subject: SearchBean? Reply-To: Lucene Developers List lucene-dev@jakarta.apache.org Is the SearchBean code in the Sandbox still useful now that we have sorting in Lucene 1.4? If so, what does it offer that the core does not provide now? As I'm cleaning up the sandbox and migrating it to a contrib area, I'm evaluating the pieces and making sure it makes sense to keep or if it is no longer useful or should be reorganized in some way. Erik
Storage Cost of Indexed, Untokenized Fields
Is there an additional storage cost to flagging an untokenized, indexed field as stored? Is the flag just for indicating that it be returned in result sets? I assume storage for tokenized fields is managed separately, but am curious if untokenized fields are resolved from the native index structure? Thanks, Todd VanderVeen
RangeQuery With Date
Hi; I am working on a set of queries that allow you to find modification dates before, after and equal to a given date. Here are some of the "before" queries I have been playing with. I want a query that pulls up dates modified before Nov 11 2004:

Query query = new RangeQuery(null, new Term("modified", "11/11/04"), false);

This one doesn't work. It turns up all the documents in the index.

Query query = QueryParser.parse("modified:[1/1/00 TO 11/11/04]", "subject", new StandardAnalyzer());

This works, but I don't like having to specify the begin date like this.

Query query = QueryParser.parse("modified:[null TO 11/11/04]", "subject", new StandardAnalyzer());

This throws an exception. How are others doing a query like this? Thanks, Luke
Re: Similarity coord,lengthNorm
Erik Hatcher wrote: On Feb 7, 2005, at 8:53 AM, Michael Celona wrote: Would fixing the lengthNorm to 1 fix this problem? Yes, it would eliminate the length of a field as a factor. Your best bet is to set up a test harness where you can try out various tweaks to Similarity, but setting the length normalization factor to 1.0 may be all you need to do, as the coord() takes care of the other factor you're after. I'm releasing next week a new version of Luke, which includes a custom Similarity designer (using the Rhino JavaScript engine) - it makes experimenting with Similarity super-easy. -- Best regards, Andrzej Bialecki (Information Retrieval, Semantic Web, Embedded Unix, System Integration) http://www.sigram.com Contact: info at sigram dot com
Re: RangeQuery With Date
Your dates need to be stored in lexicographical order for the RangeQuery to work. Index them using this date format: YYYYMMDD. Also, I'm not sure if the QueryParser can handle range queries with only one end point. You may need to create this query programmatically. Regards, Luke Francl
Re: RangeQuery With Date
: Your dates need to be stored in lexicographical order for the RangeQuery : to work. : : Index them using this date format: YYYYMMDD. : : Also, I'm not sure if the QueryParser can handle range queries with only : one end point. You may need to create this query programmatically. And when creating them programmatically, you need to use the exact same format they were indexed in. Assuming I've correctly guessed what your indexing code looks like, you probably want... Query query = new RangeQuery(null, new Term("modified", "20041111"), false); -Hoss
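A minimal, self-contained sketch (plain Java, no Lucene; the class and method names are made up for illustration) of why the YYYYMMDD advice works: lexicographic order on the formatted strings matches chronological order on the dates, whereas the original 11/11/04 style does not.

```java
// Hypothetical helper showing why YYYYMMDD-formatted date keys sort
// correctly for Lucene range queries: String.compareTo (lexicographic
// order) agrees with chronological order for zero-padded year-first keys.
public class DateKeyDemo {
    // Format a date as a zero-padded, year-first sortable key.
    static String key(int year, int month, int day) {
        return String.format("%04d%02d%02d", year, month, day);
    }
}
```

Note that the ambiguous MM/DD/YY style breaks this property: "1/1/05" compares *before* "11/11/04" lexicographically, even though 2005 is the later year, which is why the RangeQuery on "11/11/04"-style terms matched everything.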
Re: RangeQuery With Date
Bingo. Thanks! Luke - Original Message - From: Chris Hostetter [EMAIL PROTECTED] To: Lucene Users List lucene-user@jakarta.apache.org Sent: Monday, February 07, 2005 5:10 PM Subject: Re: RangeQuery With Date
Re: When are deletions permanent?
When you close the IndexReader you know the delete is committed to disk. I believe calling the commit method will also guarantee that all changes are written to disk. --- Lucene Users List lucene-user@jakarta.apache.org wrote: Hi everyone, I need to update a document in Lucene. I already know that for that I need to do a delete (IndexReader) and then an add (IndexWriter). I also know that the deletion means being marked as deleted, until optimize(). My question is, when am I SURE that the mark is committed to disk? I mean, suppose there is a crash while I'm doing a deletion. Could it be that when I recover and check Lucene, the item is still there? At which point am I 100% sure the deletion is permanent? What about adds? Thanks, Christian
Re: Starts With x and Ends With x Queries
I implemented this concept for my ends-with query. It works very well! - Original Message - From: Chris Hostetter [EMAIL PROTECTED] To: Lucene Users List lucene-user@jakarta.apache.org Sent: Friday, February 04, 2005 9:37 PM Subject: Re: Starts With x and Ends With x Queries : Also keep in mind that QueryParser only allows a trailing asterisk, : creating a PrefixQuery. However, if you use a WildcardQuery directly, : you can use an asterisk as the starting character (at the risk of : performance). On the issue of "ends with" wildcard queries, I wanted to throw out an idea that I've seen used to deal with matches like this in other systems. I've never actually tried this with Lucene, but I've seen it used effectively with other systems where the goal is to sort strings by the least significant (ie: right-most) characters first. I think it could apply nicely to people who have compelling needs for efficient 'ends with' queries. Imagine you have a field called name, which you can already do efficient prefix matching on using the PrefixQuery class. Your docs and query may look something like this... D1 name:Adam Smith age:13 state:CA ... D2 name:Joe Bob age:42 state:WA ... D3 name:John Adams age:35 state:NV ... D4 name:Sue Smith age:33 state:CA ... ...and your queries may look something like... Query q1 = new PrefixQuery(new Term("name", "J*")); Query q2 = new PrefixQuery(new Term("name", "Sue*")); If you want to start doing suffix queries (ie: all names ending with "s", or all names ending with "Smith"), one approach would be to use WildcardQuery, which, as Erik mentioned, will allow you to use a query Term that starts with a *. ie... Query q3 = new WildcardQuery(new Term("name", "*s")); Query q4 = new WildcardQuery(new Term("name", "*Smith")); (NOTE: Erik says you can do this, but the docs for WildcardQuery say you can't. I'll assume the docs are wrong and Erik is correct.) The problem is that this is horrendously inefficient.
In order to find the docs that contain Terms which match your suffix, WildcardQuery must first identify what all of those Terms are, by iterating over every Term in your index to see if it matches the suffix. This is much slower than a PrefixQuery, or even a WildcardQuery that has just one initial character before a * (ie: "s*foobar"), because those can seek directly to the first Term that starts with that character, and also stop iterating as soon as they encounter a Term that no longer begins with that character. Which leads me to my point: if you denormalize your data so that you store both the Term you want and the *reverse* of the Term you want, then a suffix query is just a prefix query on a reversed field -- by sacrificing space, you can get all the speed efficiencies of a PrefixQuery when doing a SuffixQuery... D1 name:Adam Smith rname:htimS madA age:13 state:CA ... D2 name:Joe Bob rname:boB oeJ age:42 state:WA ... D3 name:John Adams rname:smadA nhoJ age:35 state:NV ... D4 name:Sue Smith rname:htimS euS age:33 state:CA ... Query q1 = new PrefixQuery(new Term("name", "J*")); Query q2 = new PrefixQuery(new Term("name", "Sue*")); Query q3 = new PrefixQuery(new Term("rname", "s*")); Query q4 = new PrefixQuery(new Term("rname", "htimS*")); (If anyone sees a flaw in my theory, please chime in) -Hoss
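The correctness of the reversed-field trick above is easy to check outside Lucene. A self-contained sketch (plain Java; the class and helper names are invented for illustration, and the linear scan stands in for the index's prefix seek) showing that suffix matching on the original strings is exactly prefix matching on the reversed strings:

```java
// Hypothetical illustration of the reversed-field ("rname") trick: a
// suffix match against the original values is equivalent to a prefix
// match of the reversed suffix against the reversed values. In a real
// index the reversed value would be written to a second field at index
// time, so a PrefixQuery can do its cheap seek-and-scan; here a simple
// loop demonstrates only the equivalence.
import java.util.ArrayList;
import java.util.List;

public class SuffixDemo {
    // Reverse a string, e.g. "Smith" -> "htimS".
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Find all names ending with the given suffix by prefix-matching the
    // reversed suffix against each reversed name.
    static List<String> endsWith(List<String> names, String suffix) {
        String prefix = reverse(suffix);
        List<String> hits = new ArrayList<>();
        for (String name : names) {
            if (reverse(name).startsWith(prefix)) {
                hits.add(name);
            }
        }
        return hits;
    }
}
```

The space cost is one extra indexed field per document, which is the trade-off Hoss describes: storage for speed.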
Re: Document Clustering
I would like to be able to analyze my document collection (~1200 documents) and discover good buckets of categories for them. I'm pretty sure this is termed Document Clustering: finding the emergent clumps the documents fall naturally into, judging from their term vectors. Looking at the discussion that flared roughly a year ago (last message 2003-11-12) with the subject "Document Clustering", it seems Lucene should be able to help with this. Has anyone had success with this recently? Last year it was suggested Carrot2 could help, and that it would even produce good labels for the clusters. Has this proven to be true? Our goal is to use clustering to build a nifty graphic interface, probably using Flash. Thanks for any pointers. Owen
Re: Configurable indexing of an RDBMS, has it been done before?
If that is the general thought, then I will plan some time to put this in action. Cheers, Aad David Spencer wrote: Nice, very similar to what I was thinking of, where the most significant difference is probably just that I was thinking of a batch indexer, not one embedded in a web container. Probably a worthwhile contribution to the sandbox.
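The column-to-field mapping at the heart of this proposal can be sketched in a few lines. This is a hypothetical illustration (the class names are invented, and plain maps stand in for JDBC result sets and Lucene Documents), showing how the indexer stays generic while the mapping carries all the per-site configuration:

```java
// Hypothetical sketch of the data-driven indexing config discussed in
// this thread: each result-set column is mapped to a Lucene field name
// plus store/index flags. In real code the flags would be passed to the
// Field constructor; here they just carry the configuration, and a plain
// map stands in for a Lucene Document.
import java.util.LinkedHashMap;
import java.util.Map;

public class ColumnFieldMapping {
    final String fieldName;
    final boolean store;  // would become Field's "store" argument
    final boolean index;  // would become Field's "index" argument

    ColumnFieldMapping(String fieldName, boolean store, boolean index) {
        this.fieldName = fieldName;
        this.store = store;
        this.index = index;
    }

    // Build a "document" (field name -> value) from one row of query
    // results, applying the configured column-to-field mapping in order.
    static Map<String, String> toDocument(String[] row, ColumnFieldMapping[] mappings) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (int i = 0; i < mappings.length; i++) {
            doc.put(mappings[i].fieldName, row[i]);
        }
        return doc;
    }
}
```

With the example config from the original post ("select url, name, body from foo"), the mappings would be url (store, don't index), title (store, index), and contents (index, don't store); the batch indexer loops over rows and calls toDocument on each.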