Re: Custom filters document numbers
I'm also interested in knowing what can change the doc numbers. Does this happen frequently? Like Stanislav has been asking: what sort of operations on the index cause the document number to change for any given document? If the document numbers change frequently, is there a straightforward way to modify Lucene to keep the document numbers the same for the life of the document? I'd like to have mappings in my SQL database that point to the document numbers that Lucene search returns in its Hits objects. Thanks, -Tom- --- Stanislav Jordanov [EMAIL PROTECTED] wrote: The first statement is clear to me: I know that an IndexReader sees a 'snapshot' of the document set, taken at the moment of the Reader's creation. What I don't know is whether this 'snapshot' also has its doc numbers fixed, or whether they may change asynchronously. Another thing I don't know is which index operations may cause the (doc -> doc number) mapping to change. Is it only after a delete, are there other occasions, or had I better not count on this at all? StJ - Original Message - From: Vanlerberghe, Luc [EMAIL PROTECTED] To: Lucene Users List lucene-user@jakarta.apache.org Sent: Thursday, February 24, 2005 4:07 PM Subject: RE: Custom filters document numbers An IndexReader will always see the same set of documents. Even if another process deletes some documents, adds new ones, or optimizes the complete index, your IndexReader instance will not see those changes. If you detect that the Lucene index has changed (e.g. by calling IndexReader.getCurrentVersion(...) once in a while), you should close and reopen your 'current' IndexReader and recalculate any data that relies on the Lucene document numbers. Regards, Luc.
-Original Message- From: Stanislav Jordanov [mailto:[EMAIL PROTECTED] Sent: Thursday, 24 February 2005 14:18 To: Lucene Users List Subject: Custom filters document numbers Given an IndexReader, a custom filter is supposed to create a bit set that maps each document number to {'visible', 'invisible'}. On the other hand, it is stated that Lucene is allowed to change document numbers. Is it guaranteed that this BitSet's view of document numbers won't change while the BitSet is still in use (or perhaps while the corresponding IndexReader is still open)? And another (more low-level) question: when may Lucene change document numbers? Is it only when the index is optimized after there has been a delete operation? Regards: StJ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
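Luc's close-and-reopen advice can be sketched roughly like this, against the Lucene 1.x API. The index path and the polling policy are placeholders, and the key point from this thread applies: after every reopen, any BitSets, filters, or external (e.g. SQL) mappings keyed on document numbers must be rebuilt.

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherHolder {
    private final String indexPath;
    private IndexReader reader;
    private IndexSearcher searcher;
    private long version;

    public SearcherHolder(String indexPath) throws IOException {
        this.indexPath = indexPath;
        open();
    }

    // Call this once in a while; if the index version has moved on,
    // swap in a fresh reader/searcher pair. Cached doc-number data
    // becomes stale at exactly this point.
    public synchronized void refreshIfChanged() throws IOException {
        if (IndexReader.getCurrentVersion(indexPath) != version) {
            reader.close();
            open();
        }
    }

    private void open() throws IOException {
        reader = IndexReader.open(indexPath);
        searcher = new IndexSearcher(reader);
        version = IndexReader.getCurrentVersion(indexPath);
    }

    public synchronized IndexSearcher getSearcher() {
        return searcher;
    }
}
```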
Re: search question
Erik, They both use the StandardAnalyzer... however, looking at the toString() makes everything clearer. In the case where a string has the following email address: [EMAIL PROTECTED], it gets split like so: first.last domain.com. However, in 1.4 it does not get split. So now we just check to see whether an index was built using 1.2 or 1.4 and have some checks thrown in. Thanks for the guidance. Roy. On Wed, 22 Dec 2004 18:41:44 -0500, Erik Hatcher wrote: What does toString() return for each of those queries? Are you using the same analyzer in both cases? Erik
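For anyone else hitting this, the version check can be done per query rather than per index: print the parsed query and see whether the address was split. A sketch (the address shown is a made-up example, since the original post's addresses were redacted):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// Parse the same query string the application uses and dump it.
// Per the thread above: under 1.2 the address tokenizes into several
// terms, while 1.4's StandardAnalyzer keeps it as one token, so the
// printed form differs between the two versions.
Query q = QueryParser.parse("Field1:first.last@domain.com", "Field1",
        new StandardAnalyzer());
System.out.println(q.toString("Field1"));
```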
search question
Hi guys, We have an index with some fields containing email addresses. Doing a search for an email address with this format: [EMAIL PROTECTED] does not bring up any results with Lucene 1.4. The query: Field1:[EMAIL PROTECTED]. However, it returns results with 1.2. Any ideas? Roy.
lock file paths
Hey guys, Quick question... is there a way to get the file paths to the lock files, or do I have to modify the src? Currently I can't find any methods that will return a lock's file path. Roy.
Re: Lucene : avoiding locking
I'm working on a similar project... Make sure that only one call to the index method is occurring at a time. Synchronizing that method should do it. --- Luke Shannon [EMAIL PROTECTED] wrote: Hi All; I have hit a snag in my Lucene integration and don't know what to do. My company has a content management product. Each time someone changes the directory structure, or a file within it, that portion of the site needs to be re-indexed so the changes are reflected in future searches (indexing must happen during run time). I have written an Indexer class with a static Index() method. The idea is to call the method every time something changes and the index needs to be re-examined. I am hoping the logic put in by Doug Cutting surrounding the UID will make indexing efficient enough to be called so frequently. This class works great when I tested it on my own little site (I have about 2000 files). But when I drop the functionality into the QA environment I get a locking error. I can't access the stack trace; all I can get at is a log file the application writes to. Here is the section my class wrote. It was right in the middle of indexing and, bang, lock issue. I don't know if the problem is in my code or something in the existing application. Error Message: ENTER|SearchEventProcessor.visit(ContentNodeDeleteEvent) |INFO|INDEXING INFO: Start Indexing new content. |INFO|INDEXING INFO: Index Folder Did Not Exist.
Start Creation Of New Index |INFO|INDEXING INFO: Beginnging Incremental update comparisions (the previous line repeated 17 times) |INFO|INDEXING ERROR: Unable to index new content Lock obtain timed out: Lock@/usr/tomcat/jakarta-tomcat-5.0.19/temp/lucene-398fbd170a5457d05e2f4d43210f7fe8-write.lock |ENTER|UpdateCacheEventProcessor.visit(ContentNodeDeleteEvent) Here is my code. You will recognize it pretty much as the IndexHTML class from the Lucene demo written by Doug Cutting. I have put a ton of comments in an attempt to understand what is going on. Any help would be appreciated. Luke package com.fbhm.bolt.search; /* * Created on Nov 11, 2004 * * This class will create a single index file for the Content * Management System (CMS). It contains logic to ensure * indexing is done intelligently.
* Based on IndexHTML.java from the demo folder that ships with Lucene */

import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.Date;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;
import org.apache.lucene.demo.HTMLDocument;
import com.alaia.common.debug.Trace;
import com.alaia.common.util.AppProperties;

/**
 * @author lshannon
 * Description: This class is used to index a content folder. It contains logic to
 * ensure that only new documents, or documents modified since the last
 * index pass, are indexed.
 * Based on code written by Doug Cutting in the IndexHTML class found in
 * the Lucene demo.
 */
public class Indexer {
    // true during the deletion pass, i.e. when the index already exists
    private static boolean deleting = false;
    // object to read existing indexes
    private static IndexReader reader;
    // object to write to the index folder
    private static IndexWriter writer;
    // this will be used to iterate over the uid terms in the index
    private static TermEnum uidIter;

    /*
     * This static method does all the work; the end result is an up-to-date index folder
     */
    public static void Index() {
        // we will assume to start that the index has been created
        boolean create = true;
        //set
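The one-call-at-a-time advice at the top of this thread amounts to very little code. A stripped-down sketch (the class name is invented, and the Runnable stands in for the actual Lucene indexing work):

```java
public class SerializedIndexer {
    // All index updates funnel through this method; 'synchronized' on a
    // static method locks the class object, so at most one thread can be
    // inside it (and thus at most one IndexWriter can exist) at a time.
    public static synchronized void index(Runnable indexingWork) {
        indexingWork.run(); // e.g. open IndexWriter, add/delete docs, close
    }
}
```

Calling Indexer.Index() only through such a choke point prevents two concurrent writers from fighting over the write lock.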
Re: lucene file locking question
Disabling locking is only recommended for read-only indexes that aren't being modified. I think there is a comment in the code about a good example of this being an index you read off of a CD-ROM. --- John Wang [EMAIL PROTECTED] wrote: Hi folks: My application builds a super-index around the Lucene index, e.g. stores some additional information outside of Lucene. I am using my own locking outside of the Lucene index via the FileLock object in the JDK 1.4 nio package. My code does the following:

FileLock lock = null;
try {
    lock = myLockFileChannel.lock();
    // indexing into lucene
    // indexing additional information
} finally {
    try {
        // commit lucene index by closing the IndexWriter instance
    } finally {
        if (lock != null) {
            lock.release();
        }
    }
}

Now here is the weird thing: say I terminate the process in the middle of indexing, and run the program again. I get a Lock obtain timed out exception, but as long as I delete the stale lock file, the index remains uncorrupted. However, if I turn Lucene file locking off, since I have a lock outside it anyway (by doing:

static {
    System.setProperty("disableLuceneLocks", "true");
}

) and do the same thing, I instead get an unrecoverably corrupted index. Does the Lucene lock really guarantee index integrity under this kind of abuse, or am I just getting lucky? If so, can someone shine some light on how? Thanks in advance -John
Re: Locking issue
Whoops! Looks like my attachment didn't make it through. I'm re-attaching my simple test app. Thanks. --- Erik Hatcher [EMAIL PROTECTED] wrote: On Nov 10, 2004, at 5:48 PM, [EMAIL PROTECTED] wrote: Hi, With the information provided, I have no idea what the issue may be. Is there some information that I should post that will help determine why Lucene is giving me this error? You mentioned posting code - though I don't recall getting an attachment. If you could post it as a Bugzilla issue with your code attached, it would be preserved outside of our mailboxes. If the code is self-contained enough for me to try it, I will at some point in the near future. Erik
Re: Locking issue
Yes, I tried that too and it worked. The issue is that our Operations folks plan to install this on a pretty busy box, and I was hoping that Lucene wouldn't cause issues if it only had a small slice of the CPU. Guess I'll tell them to buy a bigger box! Unless you have any other ideas. I'm running some tests with a larger timeout to see if that helps. --- Erik Hatcher [EMAIL PROTECTED] wrote: I just added a Thread.sleep(1000) in the writer thread and it has run for quite some time, and is still running as I send this. Erik On Nov 10, 2004, at 8:02 PM, [EMAIL PROTECTED] wrote: I added it to Bugzilla like you suggested: http://issues.apache.org/bugzilla/show_bug.cgi?id=32171 Let me know if you see any way to get around this issue.
Re: Search scalability
Does it take 800MB of RAM to load that index into a RAMDirectory? Or are only some of the files loaded into RAM? --- Otis Gospodnetic [EMAIL PROTECTED] wrote: Hello, 100 parallel searches going against a single index on a single disk means a lot of disk seeks all happening at once. One simple way of working around this is to load your FSDirectory into a RAMDirectory. This should be faster (could you report your observations/comparisons?). You can also try using ramfs if you are using Linux. Otis --- Ravi [EMAIL PROTECTED] wrote: We have one large index for a document repository of 800,000 documents. The size of the index is 800MB. When we do searches against the index, it takes 300-500ms for a single search. We wanted to test the scalability and tried 100 parallel searches against the index with the same query, and the average response time was 13 seconds. We used a simple IndexSearcher. The same searcher object was shared by all the searches. I'm sure people have had success in configuring Lucene for better scalability. Can somebody share their approach? Thanks Ravi.
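Roughly, yes: RAMDirectory copies every file of the index into the heap, so an 800MB index needs on the order of 800MB of heap (plus object overhead) and a -Xmx to match. A minimal sketch against the 1.4 API, with a placeholder path:

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

// Copies all index files from disk into memory once, up front;
// subsequent searches then do no disk seeks at all. One searcher
// instance can be shared across the concurrent search threads.
RAMDirectory ramDir = new RAMDirectory("/path/to/index");
IndexSearcher searcher = new IndexSearcher(ramDir);
```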
Re: Lucene1.4.1 + OutOf Memory
There is a memory leak in the sorting code of Lucene 1.4.1. 1.4.2 has the fix! --- Karthik N S [EMAIL PROTECTED] wrote: Hi Guys, Apologies.. History: 1st type: 4 subindexes + MultiSearcher + search on Content field only, for 2000 hits = Exception [Too many Files Open]. 2nd type: 40 merged indexes [1000 subindexes each] + MultiSearcher/ParallelSearcher + search on Content field only, for 2 hits = Exception [Out Of Memory]. System config [same for both types]: AMD processor [high-end, single], RAM 1GB, OS Linux ( jantoo type ), Appserver Tomcat 5.05, JDK [IBM Blackdown-1.4.1-01 (== JDK 1.4.1)]. The index contains 15 fields; search is done on only 1 field, retrieving 11 corresponding fields; 3 fields are for debug details. Switched from 1st type to 2nd type. Can somebody suggest why this is happening? Thx in advance WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK]
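For the "Too many Files Open" case, two things usually help: raising the OS file-descriptor limit (ulimit -n), and the compound file format introduced in Lucene 1.4, which packs each segment's many files into one. A hedged sketch with a placeholder path (verify setUseCompoundFile exists in your exact version):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

// Open each existing subindex (create=false) and rewrite it so that
// every segment becomes a single compound (.cfs) file, drastically
// reducing the number of open files a MultiSearcher needs.
IndexWriter writer = new IndexWriter("/path/to/subindex", new StandardAnalyzer(), false);
writer.setUseCompoundFile(true); // applies to segments written from now on
writer.optimize();               // rewrites the whole index with the setting
writer.close();
```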
Re: demo HTML parser question
Hi Fred, We were originally attempting to use the demo HTML parser (Lucene 1.2), but as you know, it's for a demo. I think it's threaded to optimize on time: to allow the calling thread to grab the title or top message even though it's not done parsing the entire HTML document. That's just a guess; I would love to hear from others about this. Anyway, since it is a separate thread, a token error can kill it and there is no way for the calling thread to know about it. We had to create our own HTML parser, since we only cared about grabbing the entire text from the HTML document, and we also wanted to avoid the extra thread. We also do a lot of SKIPping for minimal EOF errors (HTML documents in email almost never follow standards). For your HTML needs, you might want to check out other JavaCC HTML parsers from the JavaCC web site. Roy. On Wed, 22 Sep 2004 22:42:55 -0400, Fred Toth wrote: Hi, I've been working with the HTML parser demo that comes with Lucene and I'm trying to understand why it's multi-threaded and, more importantly, how to exit gracefully on errors. I've discovered that if I throw an exception in the front-end static code (main(), etc.), the JVM hangs instead of exiting. Presumably this is because there are threads hanging around doing something. But I'm not sure what! Any pointers? I just want to exit gracefully on an error, such as a required meta tag being missing or similar. Thanks, Fred
compiling 1.4 source
Hi guys, So we started upgrading to 1.4 and we need to add some of our own custom code. After compiling with ant, I noticed that the 1.4 ant script builds a jar called lucene-1.5-rc1-dev.jar, not lucene-1.4-final.jar. I'm pretty sure I did not download the wrong source. Is this just a wrong name in the build properties, or does the source actually contain Lucene 1.5-rc1 code? Roy.
Hits.doc(x) and range queries
Hi guys! I've posted previously that Hits.doc(x) was taking a long time. It turns out it has to do with a date range in our query. We usually do date ranges like this: Date:[(lucene date field) - (lucene date field)]. Sometimes the begin date is 0, which is what we get from DateField.dateToString( new Date( 0 ) ). This is when getting our search results from the Hits object takes an absurd amount of time. It's usually each time the Hits object attempts to get more results from an IndexSearcher (i.e., every 100?). It also takes up more memory... I was wondering why it affects the search so much even though we're only returning 350 or so results. Does the QueryParser do something similar to the DateFilter on range queries? Would it be better to use a DateFilter? We're using Lucene 1.2 (with plans to upgrade). Do newer versions of Lucene have this problem? Roy.
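A likely explanation, offered as a guess: the query parser expands a range into one clause per matching indexed term, so a range whose lower bound is new Date(0) can pull in thousands of distinct date terms and build an enormous query, which fits the slowness and memory use described above. A DateFilter sidesteps this by enumerating the matching terms once into a BitSet instead of scoring them. A sketch against the 1.x API (the field name "Date" is taken from the post; verify DateFilter's exact signatures in your version):

```java
import java.util.Date;
import org.apache.lucene.search.DateFilter;
import org.apache.lucene.search.Hits;

// Everything after the epoch, i.e. the degenerate "from 0" lower bound,
// expressed as a filter rather than as part of the query string:
DateFilter after = DateFilter.After("Date", new Date(0));
Hits hits = searcher.search(query, after); // query no longer contains the range
```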
Re: Custom filter
On Fri, 20 Aug 2004 20:01:36 -0400, Erik Hatcher wrote: On Aug 20, 2004, at 6:48 PM, [EMAIL PROTECTED] wrote: We're currently on Lucene 1.2... haven't moved to 1.3 yet. Skip 1.3 and go straight to 1.4.1 :) Upgrade - why not? Well, we have some MASSIVE indexes, so updating needs to be planned out. In the meantime we continue with 1.2. So, just for curiosity's sake... any clue on the filter? Or perhaps someone could clue me in on what kind of terms the query parser creates (and what the searcher class does with them) when it has something like (From:(blah OR blah2) OR To:(blah OR blah2)). Tried to look at the QueryParser.jj file, but javacc makes my head hurt... Roy.
Custom filter
Hi guys! I was hoping someone here could help me out with a custom filter. We have an index of emails and do some searches on the text of an email message, and also searches based on the email addresses in a To, From or CC. Since we also do searches on a bunch of emails, we created a custom filter that searches an array of fields for an array of values. [code included below] The problem we're having is that creating a query string like so: Message:viagra AND (From:(email1 OR email2) OR To:(email1 OR email2) OR CC:(email1 OR email2)) would return results, but our filter combined with a query string of Message:viagra sometimes wouldn't. One thing I noticed is that when the results do return with the filter, the email has the format of [EMAIL PROTECTED], but the one that doesn't has something like [EMAIL PROTECTED]. Also it might have something to do with the storage of the From or To or CC. We don't parse out the email addresses before storing them. So sometimes the value of a From/To/CC field might be [EMAIL PROTECTED] or local [EMAIL PROTECTED] or even [EMAIL PROTECTED]. Could the angle brackets be throwing off my filter? I also wouldn't mind any suggestions for doing this filter better. Here is the bits method from our custom filter: -

final public BitSet bits( IndexReader reader ) throws IOException {
    BitSet bits = new BitSet( reader.maxDoc() );
    for ( int x = 0; x < fields.length; x++ ) {
        for ( int y = 0; y < values.length; y++ ) {
            TermDocs termDocs = reader.termDocs( new Term( fields[x], values[y] ) );
            try {
                while ( termDocs.next() ) {
                    bits.set( termDocs.doc() );
                }
            } finally {
                termDocs.close();
            }
        }
    }
    return bits;
}

- Thanks in advance, Roy.
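One likely cause, offered as a guess: TermDocs does exact term lookups, so each value passed to the filter must match a token exactly as the analyzer indexed it (lowercased, split, brackets handled or not). The query-string version goes through the analyzer; the filter version does not, which would explain why addresses stored in different shapes behave differently. Running the values through the same analyzer before building the Terms, and using the filter like this, might be worth trying ('FieldValuesFilter' is a hypothetical name for the custom filter whose bits() method is shown above):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;

// Field names are from the post; the values must already be in the exact
// token form the analyzer produced at index time.
String[] fields = { "From", "To", "CC" };
String[] values = { "email1", "email2" };
FieldValuesFilter filter = new FieldValuesFilter(fields, values);
Query q = QueryParser.parse("viagra", "Message", new StandardAnalyzer());
Hits hits = searcher.search(q, filter);
```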
Proximity searching and phrase
Hi, I was wondering if there is a way to do proximity searches with phrases, e.g. "very good" NEAR sometimes. Any help on this would be welcome. Many thanks, Roy
Re: addIndexes vs addDocument
Otis, Okay, got it... however, we weren't creating new Document objects... just grabbing a Document through an IndexReader and calling addDocument on another index. Would that still work with unstored fields? (Well, it's working for us, since we don't have any unstored fields.) Thanks a lot! Roy. On Tue, 6 Jul 2004 19:46:30 -0700 (PDT), Otis Gospodnetic wrote: Quick example. Index A has fields 'title' and 'contents'. Field 'contents' is stored in A as Field.UnStored. This means that you cannot retrieve the original content of the 'contents' field, since that value was not stored verbatim in the index. Therefore, you cannot create a new Document instance, pull out the String value of the 'contents' field from A, use it to create another field, add it to the new Document instance, and add that Document to a new index B using the addDocument method. The addIndexes method does not need to pull out the original String field values from Documents, so it will work.
addIndexes and optimize
Hey y'all again, Just wondering why the IndexWriter.addIndexes method calls optimize before and after it merges the segments together. We would like to create an addIndexes method that doesn't optimize, and call optimize on the IndexWriter later. Roy.
moving 1.2 index to 1.4
Hey guys, We have a couple of giant indexes that were built with Lucene 1.2. We would like to move to Lucene 1.4 at some point. We have heard that we would probably need to re-index to take advantage of certain new features/optimizations of Lucene 1.3/1.4. We were wondering: is it possible to open our old 1.2 index with an IndexReader, get each Document object, and add it to a new 1.4 index? Would it be the same as re-building the index from scratch? Thanks! Roy.
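One way to migrate without touching Document objects at all (and therefore without the unstored-field problem that copying documents through an IndexReader has, per the addIndexes vs addDocument thread above) is to let a 1.4 IndexWriter merge the old index: merging rewrites the segments in the current file format. A sketch with placeholder paths:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Create a fresh index and merge the 1.2-era index into it; the merge
// rewrites all segments in the 1.4 format.
IndexWriter writer = new IndexWriter("/path/to/new-index", new StandardAnalyzer(), true);
writer.addIndexes(new Directory[] { FSDirectory.getDirectory("/path/to/old-index", false) });
writer.optimize();
writer.close();
```

Note the caveat: this does not re-tokenize anything, so fields keep their original analysis; to benefit from analyzer changes you still need a true re-index from the source documents.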
stop words in index
Hi! How come stop words show up in the index (HighFreqTerms)? Yes, I do use the same analyzer for indexing and searching.

class SearchFacade {
    private final static String[] GERMAN_STOP_WORDS = new String[] { "foo", "bar" };
    private final static Analyzer GERMAN_ANALYZER = new SnowballAnalyzer( "German2", GERMAN_STOP_WORDS );

    public void index() {
        writer = new IndexWriter( Configuration.Lucene.INDEX, GERMAN_ANALYZER, true );
        ...
    }

    public void search(String query) {
        final Query q = MultiFieldQueryParser.parse( query, new String[] { "blah", "foo", "bar" }, GERMAN_ANALYZER );
        ...
    }
}
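A quick way to check what the analyzer actually emits: if a stop word still comes out of this utility, it is not in the stop set the analyzer was built with; if it does not come out, the terms in the index most likely date from documents indexed before the stop list was added. Sketch against the 1.x TokenStream API (the field name "contents" is arbitrary):

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Prints one token per line for the given text.
public static void dumpTokens(Analyzer analyzer, String text) throws IOException {
    TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
    for (Token t = stream.next(); t != null; t = stream.next()) {
        System.out.println(t.termText());
    }
}
```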
Author or SearchBean
Hi! Where can I get the e-mail address of the author of SearchBean (sandbox)? Timo
Re: pagable results
On Tuesday 11 May 2004 15:58, Ryan Sonnek wrote: When performing a search with Lucene, is it possible to only return a subset of the results? I need to be able to page through results, and it Yes, http://www.nitwit.de/vlh2/ :-)
Re: ValueListHandler pattern with Lucene
On Monday 12 April 2004 20:54, [EMAIL PROTECTED] wrote: On Sunday 11 April 2004 17:46, Erik Hatcher wrote: In other words, you need to invent your own pattern here?! :) I just experimented a bit and came up with the ValueListSupplier which replaces the ValueList in the VLH. Seems to work so far... :-) Comments are greatly appreciated! FYI http://www.nitwit.de/vlh2/
Searcher not aware of index changes
Hi! My Searcher instance is not aware of changes to the index. I even create a new instance, but it seems only a complete restart helps(?): indexSearcher = new IndexSearcher(IndexReader.open(index)); Timo
Re: Searcher not aware of index changes
On Wednesday 21 April 2004 19:20, Stephane James Vaucher wrote: This is not normal behaviour. Normally using a new IndexSearcher should reflect the modified state of your index. Could you post a more informative bit of code? BTW, why can't Lucene take care of this itself? Well, according to my logging it does create a new instance. I use only one instance of SearchFacade:

public class SearchFacade extends Observable {

    protected class IndexObserver implements Observer {
        private final Log log = LogFactory.getLog(getClass());
        public Searcher indexSearcher;

        public IndexObserver() {
            newSearcher(); // init
        }

        public void update(Observable o, Object arg) {
            log.debug("Index has changed, creating new Searcher");
            newSearcher();
        }

        private void newSearcher() {
            try {
                indexSearcher = new IndexSearcher(IndexReader.open(Configuration.LuceneIndex.MAIN));
            } catch (IOException e) {
                log.error("Could not instantiate searcher: " + e);
            }
        }

        public Searcher getIndexSearcher() {
            return indexSearcher;
        }
    }

    private IndexObserver indexObserver;

    public SearchFacade() {
        addObserver(indexObserver = new IndexObserver());
    }

    public void createIndex() {
        ...
        setChanged(); // index has changed
        notifyObservers();
    }

    public Hits search(String query) {
        Searcher searcher = indexObserver.getIndexSearcher();
    }
}
Re: ValueListHandler pattern with Lucene
On Sunday 11 April 2004 17:46, Erik Hatcher wrote: In other words, you need to invent your own pattern here?! :) I just experimented a bit and came up with the ValueListSupplier, which replaces the ValueList in the VLH. Seems to work so far... :-) Comments are greatly appreciated! Timo

public class ValueListSupplier implements IValueListIterator {
    private final Log log = LogFactory.getLog(this.getClass());
    // TODO junit test case
    private Hits hits;
    protected BitSet fetched;
    protected List list;
    protected int index;

    public ValueListSupplier(Hits hits) {
        int size = hits.length();
        this.list = new ArrayList(size); // stupid idiots at SUN
        for (int i = 0; i < size; i++)
            list.add(null);
        this.fetched = new BitSet();
        this.hits = hits;
        this.index = 0;
    }

    public List getList() { return list; }

    public int size() { return list.size(); }

    public boolean hasPrevious() { return index > 0; }

    public boolean hasNext() { return index < size(); }

    /** @param index */
    public synchronized void move(int index) { this.index = index; }

    public void reset() { move(0); }

    public Object current() {
        validate(index, index + 1);
        return list.get(index);
    }

    public List previous(int count) {
        int from = Math.max(0, index - count);
        int to = index;
        validate(from, to);
        move(from);
        return list.subList(from, to);
    }

    public List next(int count) {
        int from = index;
        int to = Math.min(Math.max(0, size() - 1), index + count);
        validate(from, to);
        move(to);
        return list.subList(from, to);
    }

    /**
     * @param from starting index (inclusive)
     * @param to   ending index (exclusive)
     */
    private void validate(int from, int to) {
        while ((from = fetched.nextClearBit(from)) < to) {
            log.debug("fetching #" + from);
            try {
                list.set(from, SearchResultAdapter.wrap(hits.doc(from)));
                fetched.set(from);
            } catch (IOException e) {
                // TODO potentially a bug
                e.printStackTrace();
            }
        }
    }
}
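The fetch-at-most-once bookkeeping in validate() can be isolated from Lucene for testing. A stripped-down equivalent with the loader abstracted (all names here are invented for the sketch):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class LazyList {
    public interface Loader { Object load(int index); }

    private final List<Object> cache;
    private final BitSet fetched = new BitSet();
    private final Loader loader;
    private int loads = 0; // counts real fetches, for verification

    public LazyList(int size, Loader loader) {
        this.cache = new ArrayList<Object>(size);
        for (int i = 0; i < size; i++) cache.add(null);
        this.loader = loader;
    }

    // Fetches each slot at most once, like validate() above; repeated
    // reads of the same index hit the cache.
    public Object get(int index) {
        if (!fetched.get(index)) {
            cache.set(index, loader.load(index));
            fetched.set(index);
            loads++;
        }
        return cache.get(index);
    }

    public int loadCount() { return loads; }
}
```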
Re: ValueListHandler pattern with Lucene
On Saturday 10 April 2004 20:40, Erik Hatcher wrote: That's the beauty: it is up to you to load the doc iff you want it. As I want all of them, I don't see why this should be faster at all...
Re: ValueListHandler pattern with Lucene
On Sunday 11 April 2004 13:40, Erik Hatcher wrote: using a HitCollector you are bypassing those mechanisms. Whether it is measurably faster would depend on several other factors. Well, it is hardly faster, so this is no real solution :-\
Re: ValueListHandler pattern with Lucene
On Saturday 10 April 2004 20:40, Erik Hatcher wrote: That's the beauty: it is up to you to load the doc iff you want it. Well, there's another problem with HitCollector: the list I build is not sorted by score :-(
Re: ValueListHandler pattern with Lucene
On Sunday 11 April 2004 17:16, Erik Hatcher wrote: Well, yes, the one we already discussed. Let your presentation tier talk directly to Hits, so you are as efficient as possible with access to documents, and only fetch what you need. Again, don't let patterns get in your way. Well, the point of tiers and (BTW: language-independent) patterns is to modularize software and make things exchangeable. This way, neither the presentation tier nor the search engine is exchangeable. The problem actually is that the VLH is designed to have a static list of VOs. The VLH needs to evolve to support something like a data provider that may add data dynamically. The difficulty is that an Iterator must throw a ConcurrentModificationException if the backing data is modified; but since data in a VLH is never removed, only added, this should be possible to implement. Timo
Re: ValueListHandler pattern with Lucene
On Friday 09 April 2004 23:59, Ype Kingma wrote: When you need 3000 hits and their stored fields, you might consider using the lower-level search API with your own HitCollector. I apologize for the stupid question, but... where's the actual result in the HitCollector? :-) collect(int doc, float score) gives me the document number and its score; but where's the Document? Timo
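The Document is not handed to collect() at all; you look it up yourself via the searcher (or the underlying IndexReader). Note also that collect() is called in document-number order, not score order, which matches the sorting complaint elsewhere in this thread. A sketch (searcher and query are assumed to exist):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;

final IndexSearcher s = searcher;
final List docs = new ArrayList();
s.search(query, new HitCollector() {
    public void collect(int doc, float score) {
        try {
            docs.add(s.doc(doc)); // the stored-field fetch; this is the slow part
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
});
```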
ValueListHandler pattern with Lucene
Hi! I implemented a VLH pattern around Lucene's search hits, but noticed that hits.doc() is quite slow (3000+ hits took about 500ms). So I want to ask people here for a solution. I thought about something like a wrapper for the VO (value/transfer object), i.e. the VO does not actually contain the value but a reference to Lucene's Hits instance. But this is somewhat of a hack... Any ideas? Timo
Re: Zero hits for queries ending with a number
On Friday 02 April 2004 23:48, Erik Hatcher wrote: On Apr 2, 2004, at 10:00 AM, [EMAIL PROTECTED] wrote: On Saturday 13 March 2004 11:06, Otis Gospodnetic wrote: Field.Keyword is suitable for storing data like Url. Give that a try. I just tried this a minute ago and found that I cannot use wildcards with Keywords: url:www.yahoo.* You *can* use wildcards with keywords (in fact, a keyword really has no meaning once indexed - everything is a term at that point). Well, I just tried. I also was surprised actually - but it just didn't work. I can use wildcards for doc.add(Field.Text("url", row.getString("url"))); but I cannot for doc.add(Field.Keyword("url", row.getString("url"))); - create a utility (I've posted one on the list in the past) that shows what your analyzer is doing graphically. Interesting. Can you give me subject/date of that posting? Timo
Re: Simple date/range question
On Friday 02 April 2004 17:03, [EMAIL PROTECTED] wrote: date:[20030101 TO 20030202] [java] 11:05:53,735 ERROR [view.SearchAction] org.apache.lucene.queryParser.ParseException: Encountered 20030202 at line 1, column 18. [java] Was expecting: [java] ] ... Why is this?
Re: Simple date/range question
On Saturday 03 April 2004 11:53, Erik Hatcher wrote: I didn't catch in your first message that it was throwing a ParseException - this is odd. Are you certain that date:[20030101 TO 20030202] is the complete string you're passing to QueryParser? Did Yes. you subclass QueryParser? If so, what is that code? (what is the No. I use a MultiFieldQueryParser: Query qQuery = MultiFieldQueryParser.parse(query, new String[] { "id", "title", "summary", "contents", "date" }, GERMAN_ANALYZER); Hits hits = searcher.search(qQuery); complete stack trace?) [java] 12:38:03,109 ERROR [view.SearchAction] org.apache.lucene.queryParser.ParseException: Encountered 20030404 at line 1, column 18. [java] Was expecting: [java] ] ... [java] org.apache.lucene.queryParser.ParseException: Encountered 20030404 at line 1, column 18. [java] Was expecting: [java] ] ... [java] at org.apache.lucene.queryParser.QueryParser.generateParseException(QueryParser.java:994) [java] at org.apache.lucene.queryParser.QueryParser.jj_consume_token(QueryParser.java:874) [java] at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:657) [java] at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:521) [java] at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:464) [java] at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:108) [java] at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:87) [java] at org.apache.lucene.queryParser.MultiFieldQueryParser.parse(MultiFieldQueryParser.java:115)
Re: Zero hits for queries ending with a number
On Saturday 03 April 2004 11:48, Erik Hatcher wrote: Provide us the results of running your url through that, using the same SnowballAnalyzer("German2"): Analyzing "http://www.yahoo.com/foo/bar.html": org.apache.lucene.analysis.WhitespaceAnalyzer: [http://www.yahoo.com/foo/bar.html] org.apache.lucene.analysis.SimpleAnalyzer: [http] [www] [yahoo] [com] [foo] [bar] [html] org.apache.lucene.analysis.StopAnalyzer: [http] [www] [yahoo] [com] [foo] [bar] [html] org.apache.lucene.analysis.standard.StandardAnalyzer: [http] [www.yahoo.com] [foo] [bar.html] org.apache.lucene.analysis.snowball.SnowballAnalyzer: [http] [www.yahoo.com] [foo] [bar.html] analyzer you are using, and also do the same on .toString of the query you parsed. Those two pieces of info will tell all. url:www.yahoo* url:www.yahoo* url:www.yahoo* url:www.yahoo* url:www.yahoo* url:www.yahoo* Well, I actually use a MultiFieldQueryParser, that's probably why the term appears so often. Strange parser, it should be clear that an explicit url:xyz should only look in the url field, shouldn't it? Timo
Re: Zero hits for queries ending with a number
On Saturday 03 April 2004 15:19, Erik Hatcher wrote: date:[20030101 TO 20030202] I found the/my bug. Since Lucene is case-sensitive, I do lower-case all queries for the user's convenience. The ParseException is thrown because the TO becomes to. Well, I really think Lucene should sweep such stumbling blocks aside...
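One hedged way around the bug described above (a sketch, not from the thread; the class and method names are made up): lower-case the user's input token by token, but leave QueryParser's keywords (TO, AND, OR, NOT) untouched so that range queries still parse.

```java
// Hypothetical pre-processing helper: lower-case a query string for the
// user's convenience without destroying QueryParser keywords, so
// "Date:[20030101 TO 20030202]" stays a valid range query.
public class QueryCase {
    public static String lowerCaseQuery(String query) {
        String[] tokens = query.split(" ");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            String t = tokens[i];
            boolean keyword = t.equals("TO") || t.equals("AND")
                    || t.equals("OR") || t.equals("NOT");
            if (i > 0) sb.append(' ');
            sb.append(keyword ? t : t.toLowerCase());
        }
        return sb.toString();
    }
}
```

This simple split-on-spaces approach assumes keywords are surrounded by spaces, which holds for QueryParser's range and boolean syntax.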
Re: Zero hits for queries ending with a number
On Saturday 03 April 2004 17:11, Erik Hatcher wrote: No objections that error messages and such could be made clearer. Patches welcome! Care to submit better error message handling in this case? Or perhaps allow lower-case to? I think the best would be if Lucene simply had a setCaseSensitive(boolean). IMHO it's in any case a bad idea to make searches case-sensitive (per default). But, also, folks need to really step back and practice basic troubleshooting skills. I asked you if that string was what you passed to the QueryParser and you said yes, when in fact it was not. And you I forgot that I did lower-case it. In fact I even output it in its original state but lower-case it just before I pass it to Lucene. That lower-casing is what I would call a hack, and hence it's no surprise that I forgot it :-) Timo
Re: Zero hits for queries ending with a number
On Saturday 13 March 2004 11:06, Otis Gospodnetic wrote: Field.Keyword is suitable for storing data like Url. Give that a try. I just tried this a minute ago and found that I cannot use wildcards with Keywords: url:www.yahoo.*
Simple date/range question
Hi! I do have some problems with dates and the QueryParser range syntax. code: java.sql.Timestamp time = row.getTimestamp("timestamp"); if (time != null) doc.add(Field.Keyword("date", new Date(time.getTime()))); query: date:[20030101 TO 20030202] date:20030101 The first query throws a ParseException, the second doesn't return any hits. Hmm... there must be something simple I misunderstood :) BTW, what about custom date formats in QueryParser (...and are the last two digits actually the day or the month)? TIA Timo
Re: Simple date/range question
On Friday 02 April 2004 18:59, Otis Gospodnetic wrote: Your Timestamp contains HH, mm, and ss, that's likely why your second My timestamp contains date and time. query doesn't match anything. Drop everything other than YYYYMMDD from the index, and things should work. What's wrong with new Date(timestamp)?
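The suggested fix can be sketched like this (class and method names are illustrative): format only the year/month/day part of the timestamp as a plain string, which is exactly what a range query like date:[20030101 TO 20030202] compares against lexicographically.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch: reduce a millisecond timestamp to its YYYYMMDD form before
// indexing; the resulting string would be handed to
// Field.Keyword("date", ...) instead of a full Date object.
public class DateField {
    private static final SimpleDateFormat FMT = new SimpleDateFormat("yyyyMMdd");

    public static String toIndexString(long millis) {
        return FMT.format(new Date(millis));
    }
}
```

Because the string is fixed-width and ordered year-month-day, term order matches chronological order, which is what makes the range syntax work.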
Re: Storing numbers
On Tuesday 09 March 2004 20:51, Timothy Stone wrote: Michael Giles wrote: Tim, Looks like you can only access it with a subscription. :( Sounds good, though. Really? I don't have a subscription. Got to it via the archives actually, now that I think about it: Try Volume 7, Issue 12. I also need a subscription for: http://www.sys-con.com/story/search.cfm?pub=1ss=lucene
Re: Storing numbers
On Fri, 5 Mar 2004 19:18:04 -0500, Erik Hatcher [EMAIL PROTECTED] wrote: Thanks for the idea for a good example for the upcoming Lucene in Action book... it's been added! Thanks for mentioning me in the book ;) What about boolean fields? It's certainly not a good idea to use "true" or "false" strings... BTW, isn't it slow to treat everything as strings? Timo
Storing numbers
Hi! I want to store numbers (ids) in my index: long id = 1069421083284; doc.add(Field.UnStored("in", String.valueOf(id))); But searching for id:1069421083284 doesn't return any hits. Well, did I misunderstand something? UnStored means the number is stored but not indexed (analyzed), doesn't it? Anyway, Field.Text doesn't work either. TIA Timo
Re: Did you mean for multiple terms
On Thursday 04 March 2004 17:55, [EMAIL PROTECTED] wrote: Consider the query +michael +jackson not to return any hits because there's no michael in index, but there is jackson (e.g. janet...). Is there any reasonable approach how to determine whether one or multiple terms of a query - and which - do let the query fail? In order to illustrate, google for george buhs - it will suggest george bush.
Re: Storing numbers
On Friday 05 March 2004 12:27, Morus Walter wrote: doc.add(Field.UnStored("in", String.valueOf(id))); But searching for id:1069421083284 doesn't return any hits. If your field is named 'in' you shouldn't search in 'id'. Right? Well, indexing and analyzing are different things. UnStored means the number is not stored (as the name says) but indexed. And IIRC it's analyzed before indexing. Shouldn't make a difference for a single number. What I'd use in this case is an unstored keyword (given that you really don't want to have the id returned from lucene, which is the consequence of not storing). Sorry, typo :-) I do have several docs in the index and each doc does have an id. And I just want to find a particular doc by its id. doc.add(Field.UnIndexed("id", String.valueOf(id))); doesn't work either. And as I mentioned not even Field.Text does work
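A sketch of the keyword-field approach Morus suggests (the helper class is hypothetical; note that Field.Keyword also stores the value, which is the one difference from the "unstored keyword" he describes): index the id untokenized so the long number survives as a single term, then look it up with a TermQuery, bypassing QueryParser and the analyzer entirely.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// Sketch: keyword fields are not run through the analyzer, so the id
// stays one exact term and an exact-match TermQuery finds it reliably.
public class IdLookup {
    public static void addId(Document doc, long id) {
        doc.add(Field.Keyword("id", String.valueOf(id)));
    }

    public static Hits findById(IndexSearcher searcher, long id)
            throws java.io.IOException {
        return searcher.search(new TermQuery(new Term("id", String.valueOf(id))));
    }
}
```

This also explains why Field.Text "doesn't work": a number run through a tokenizing analyzer may be split or dropped before it ever reaches the index.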
Re: Storing numbers
On Friday 05 March 2004 18:01, Erik Hatcher wrote: "0001" for example. Be sure all numbers have the same width and are zero padded. And what about a range like 100 TO 1000?
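Zero padding is precisely what makes such ranges work: with a fixed width, lexicographic term order matches numeric order. A small illustration (helper names are made up):

```java
import java.text.DecimalFormat;

// Pad numbers to a fixed width before indexing them as terms. Lucene
// compares terms as strings, so "9" would sort after "10", but
// "0000000009" sorts before "0000000010" - which is what a range query
// like [0000000100 TO 0000001000] needs.
public class PaddedNumber {
    private static final DecimalFormat FMT = new DecimalFormat("0000000000");

    public static String pad(long value) {
        return FMT.format(value);
    }
}
```

The query side must of course pad its bounds with the same width, i.e. search for [0000000100 TO 0000001000] rather than [100 TO 1000].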
Did you mean for multiple terms
Hi! Consider the query +michael +jackson not to return any hits because there's no michael in index, but there is jackson (e.g. janet...). Is there any reasonable approach how to determine whether one or multiple terms of a query - and which - do let the query fail? Kind Regards Timo
Re: Lucene scalability/clustering
On Saturday 21 February 2004 20:24, Otis Gospodnetic wrote: http://jakarta.apache.org/lucene/docs/benchmarks.html BTW, where can I get Peter Halacsy's IndexSearcherCache?
Lucene scalability/clustering
Hi! How well does Lucene scale? Is it able to handle 100.000 (more or less complex) queries a day (i.e. 9 to 5) on an index with half a million docs? What hardware is recommended for that demand? What to do if it cannot handle it quickly enough? Regards, Timo
Multiple equal Fields?
Hi! What happens if I do this: doc.add(Field.Text("foo", "bar")); doc.add(Field.Text("foo", "blah")); Is there a field foo with value blah, or are there two foos (actually not possible), or is there one foo with the values bar and blah? And what does happen in this case: doc.add(Field.Text("foo", "bar")); doc.add(Field.Text("foo", "bar")); doc.add(Field.Text("foo", "bar")); Does lucene store this only once? Timo
Re: Did you mean...
On Monday 16 February 2004 20:56, Erik Hatcher wrote: On Feb 16, 2004, at 9:50 AM, [EMAIL PROTECTED] wrote: TokenStream in = new WhitespaceAnalyzer().tokenStream("contents", new StringReader(doc.getField("contents").stringValue())); The field is the field name. No built-in analyzers use it, but custom analyzers could key off of it to do field-specific analysis. Look at If I want to tokenize all Fields I would have to get a tokenStream of each Field separately and process them separately? Or can I get one master stream that compounds all Fields?
Re: Did you mean...
On Tuesday 17 February 2004 15:18, Erik Hatcher wrote: You would do them separately. I'm not clear on what you are trying to do. The Analyzer does all this during indexing automatically for you, but it sounds like you are just trying to emulate what an Analyzer already does to extract words from text? I am still doing this: TokenStream in = analyzer.tokenStream("contents", new StringReader(reader.document(i).getField("contents").stringValue())); And I want to extract all words from all Fields. Timo
Re: Did you mean...
On Tuesday 17 February 2004 16:13, Erik Hatcher wrote: The words (or terms) are already in the index ready to be read very rapidly and accurately. IndexReader is what you want to investigate if your fields are indexed. Look into IndexReader and pull the terms directly rather than re-analyzing the text. Provided contents was an indexed field, you Well, but my index was created using a GermanAnalyzer. I have to re-analyze it with WhitespaceAnalyzer if I don't want the words to be truncated... What you do is what I did at the beginning of the thread :-) Timo
Re: Did you mean...
On Tuesday 17 February 2004 18:05, Erik Hatcher wrote: *arg* I feel like we are going in circles here. Me, too :-) Why use the GermanAnalyzer at all if it is not what you want? Re-index! I want to use the GermanAnalyzer. But not for the did you mean functionality... That's what this thread is all about :) Timo
Re: Incrementally updating and monitoring the index
On Friday 13 February 2004 19:10, Stephane James Vaucher wrote: Very possible, before adding a document, you can check (with the judicious use of an id) if it has already been added. If it hasn't, do your notification, but this requires programming. So you mean adding the new documents to a temporary index first, running all queries against it and then writing the temp index to the final index? RAMDirectory ram = new RAMDirectory(); for (docs...) ram.addDocument(doc); IndexSearcher searcher = new IndexSearcher(ram); for (queries...) if (searcher.search(query).length() > 0) notify(); finalIndex.addIndexes(ram); finalIndex.optimize(); ?
Re: Did you mean...
On Thursday 12 February 2004 18:35, Viparthi, Kiran (AFIS) wrote: As mentioned the only way I can see is to get the output of the analyzer directly as a TokenStream iterate through it and insert it into a Map. Could you provide or point me to some example code on how to get and use TokenStream. The API docs are somewhat unclear to me...
Re: Did you mean...
On Monday 16 February 2004 12:02, Viparthi, Kiran (AFIS) wrote: As mentioned I didn't use any information from the index so I didn't use any TokenStream, but let me check it out. deprecated: String description = doc.getField("contents").stringValue(); final java.io.Reader r = new StringReader(description); final TokenStream in = analyzer.tokenStream(r); for (Token token; (token = in.next()) != null; ) { System.out.println(token.termText()); } But the result is the same, the words are actually truncated (instead of has, had, have, etc. only ha) :-(
Re: Did you mean...
On Monday 16 February 2004 12:40, Erik Hatcher wrote: On Feb 16, 2004, at 6:12 AM, [EMAIL PROTECTED] wrote: String description = doc.getField("contents").stringValue(); What is the value of description here? ? The value of the field "contents" :-) Long, plain text.. final java.io.Reader r = new StringReader(description); final TokenStream in = analyzer.tokenStream(r); And what analyzer are you using here? GermanAnalyzer (yes, has, had, etc. below is fictional but most people here probably don't speak German... e.g. automobile may become automob or something like this). But the result is the same, the words are actually truncated (instead of has, had, have, etc. only ha) :-(
Re: Did you mean...
On Monday 16 February 2004 15:16, Erik Hatcher wrote: And thus the nature of the problem. Try using the WhitespaceAnalyzer instead to see what you get. Much better! :-) But sometimes it still returns multiple words as a single term...:-\ And it does not care for punctuation, but that's probably something I'll have to do on my own...
Re: Did you mean...
On Monday 16 February 2004 15:27, [EMAIL PROTECTED] wrote: But sometimes it still returns multiple words as a single term...:-\ Sorry, silly mistake of mine.
Re: Did you mean...
On Monday 16 February 2004 12:12, [EMAIL PROTECTED] wrote: deprecated: String description = doc.getField("contents").stringValue(); final java.io.Reader r = new StringReader(description); final TokenStream in = analyzer.tokenStream(r); for (Token token; (token = in.next()) != null; ) { System.out.println(token.termText()); } Can somebody explain tokenStream() to me? This is not deprecated: TokenStream in = new WhitespaceAnalyzer().tokenStream("contents", new StringReader(doc.getField("contents").stringValue())); But what is the first argument (field) for tokenStream() good for? Actually I can type whatever I want...? I don't understand the short description in the API docs... Timo
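As Erik notes elsewhere in the thread, the field argument exists so a custom analyzer can key off the field name. A small illustration (this class is invented for the example, not from the thread): treat a "url" field differently from everything else.

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;

// Illustration of what the field argument of tokenStream() is for:
// dispatch on the field name, e.g. leave a "url" field untouched while
// stemming every other field with the GermanAnalyzer.
public class FieldAwareAnalyzer extends Analyzer {
    private final Analyzer urlAnalyzer = new WhitespaceAnalyzer();
    private final Analyzer defaultAnalyzer = new GermanAnalyzer();

    public TokenStream tokenStream(String fieldName, Reader reader) {
        if ("url".equals(fieldName)) {
            return urlAnalyzer.tokenStream(fieldName, reader);
        }
        return defaultAnalyzer.tokenStream(fieldName, reader);
    }
}
```

Since the built-in analyzers ignore the argument, typing "whatever you want" indeed makes no difference with them; it only matters once an analyzer like this one inspects it.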
Re: Did you mean...
On Monday 16 February 2004 15:16, Erik Hatcher wrote: And thus the nature of the problem. Try using the WhitespaceAnalyzer instead to see what you get. Can I chain multiple analyzers in order to filter common stop words? Timo
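Chaining is possible, though it is token filters rather than whole analyzers that get chained: an analyzer is a tokenizer plus a pipeline of filters. A minimal sketch of whitespace tokenization combined with lower-casing and stop-word removal, without any stemming (the class name is made up, and the English stop list stands in for whatever list you would actually use):

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Sketch: build the filter chain by hand - tokenize on whitespace,
// lower-case each token, then drop stop words.
public class WhitespaceStopAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new WhitespaceTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
        return stream;
    }
}
```

This gives unstemmed words for the "did you mean" list while still filtering the noise words.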
Word not in index
Hi! I do build a list of all unique words in all my docs from WhitespaceAnalyzer.tokenStream(). I also index all my docs using a GermanAnalyzer in another index. There are plenty of words in the word list that don't return any hits when searching the doc index built using the GermanAnalyzer - and these are not stop words. Why is this? Thanks a lot! Timo
Re: Word not in index
On Monday 16 February 2004 19:20, [EMAIL PROTECTED] wrote: Why is this? Another curiosity is that apparently the case does matter: albert (Einstein :) does return hits, but Albert does not - even though the docs contain Albert and not albert. Can somebody explain? Thanks! Timo
Re: Word not in index
On Monday 16 February 2004 19:57, Otis Gospodnetic wrote: Searches ARE case sensitive, it is just that some Analyzers lowercase all tokens. If you are using WhitespaceAnalyzer, then tokens will not GermanAnalyzer apparently is one of them. Too bad :-( Is there a case-sensitive alternative out there?
Re: Word not in index
On Monday 16 February 2004 19:45, Markus Spath wrote: Analyzers preprocess the text to be indexed; different Analyzers will generate different text-tokens that are indexed. only you can know which Analyzer fits your needs, but you need to apply this one consistently for indexing, searching and generating lists of unique words, if you want to get expectable results. Well, not sure whether I understood. GermanAnalyzer - just as any other analyzer - does index all words except stop words, right? What's actually the sense of a search engine if I cannot search for words in the text? :-)
Re: Word not in index
On Monday 16 February 2004 20:56, Otis Gospodnetic wrote: Timo, by the nature of your questions it seems like you didn't see the Articles section of Lucene's site. There are links to several articles --- [EMAIL PROTECTED] wrote: Well, not sure whether I understood. Well, was actually a case problem, too... :)
Re: Limiting hit count
On Friday 13 February 2004 15:02, Erik Hatcher wrote: Use a HitCollector and grab the first one that comes in, then bail out. That should do the trick for getting the first hit only. According to the API docs I ought to use HitCollector only if I need all hits :-) And there's certainly a reason for it - I don't think that this will speed up the search ;)
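Erik's "bail out" suggestion can be sketched like this (the class is hypothetical; note two caveats stated as assumptions: the low-level API offers no clean way to abort, so an unchecked exception is used, and hits arrive in document-id order, not by score):

```java
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Sketch: remember the first document collected, then abort the search
// by throwing an unchecked exception, since collect() cannot otherwise
// stop the search loop.
public class FirstHitCollector extends HitCollector {
    public int firstDoc = -1;

    public void collect(int doc, float score) {
        firstDoc = doc;
        throw new RuntimeException("stop"); // abort after the first hit
    }

    public static int firstHit(IndexSearcher searcher, Query query)
            throws java.io.IOException {
        FirstHitCollector collector = new FirstHitCollector();
        try {
            searcher.search(query, collector);
        } catch (RuntimeException expected) {
            // thrown on purpose once a hit was collected
        }
        return collector.firstDoc; // -1 if nothing matched
    }
}
```

If the best-scoring hit is wanted rather than merely the first match, Hits with hits.doc(0) remains the simpler and correct tool.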
Incrementally updating and monitoring the index
Hi! Can Lucene incrementally update its index (i.e. reconciling it with a list of docs and removing those that are no longer found)? I'd like to monitor the index for certain queries/terms, i.e. I want to be notified if there are (new) hits for a list of terms each time after I add a document to the index - continuously. Is this possible? The index will contain several hundreds of thousands of documents and will be frequently accessed concurrently. TIA Timo
Re: Did you mean...
Hi Ronnie! On Thursday 12 February 2004 09:50, [EMAIL PROTECTED] wrote: There is no built-in way in Lucene to achieve this. I have done a simple implementation with a patched FuzzyQuery for each term. A new method (bestOrderRewrite) returns an ordered list of all fuzzy terms that indeed exist in index. There is no guarantee that the suggested term is spelled Could you please post your FuzzyQuery (did you patch the class or extend it?) or send it via email? Thanks a lot Timo
Re: Did you mean...
On Thursday 12 February 2004 09:43, Viparthi, Kiran (AFIS) wrote: We achieved this by creating a separate index of words, extracting the complete list of words. How were you extracting the words? Timo
Re: Did you mean...
On Thursday 12 February 2004 18:03, [EMAIL PROTECTED] wrote: On Thursday 12 February 2004 17:53, [EMAIL PROTECTED] wrote: How were you extracting the words? Oops, sorry for the stupid question :) Got it. Hm, seems the question wasn't so stupid anyway: IndexReader reader = IndexReader.open(ram); TermEnum te = reader.terms(); while(te.next()) { ... But this includes obviously parts of words, too :-\ Timo
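Completing the snippet above into a self-contained form (the helper class is invented for illustration): walk the TermEnum and collect the unique terms of one field. The "parts of words" observation is expected, since these are analyzed terms, and a stemming analyzer writes stems, not whole words.

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

// Sketch: enumerate all indexed terms of a given field. With a stemming
// analyzer the returned strings are stems (e.g. "automob"), which is
// exactly why the thread's word list contains word fragments.
public class TermLister {
    public static java.util.List listTerms(IndexReader reader, String field)
            throws java.io.IOException {
        java.util.List terms = new java.util.ArrayList();
        TermEnum te = reader.terms();
        try {
            while (te.next()) {
                if (field.equals(te.term().field())) {
                    terms.add(te.term().text());
                }
            }
        } finally {
            te.close();
        }
        return terms;
    }
}
```

To get unstemmed words this way, the index would have to be built with a non-stemming analyzer, which is the conclusion the thread eventually reaches.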
Re: ANNOUNCE: Plucene
Hi! Somewhat off-topic: is there a PHP port of Lucene? Warm regards Timo
Re: Did you mean...
On Thursday 12 February 2004 00:15, Matt Tucker wrote: We implemented that type of system using a spelling engine by Wintertree: http://www.wintertree-software.com There are some free Java spelling packages out there too that you could likely use. But this does not ensure that the word really exists in the index. The words Google proposes, however, do exist. Regards Timo
Re: HTMLDocument
On Monday 02 February 2004 10:41, John Moylan wrote: Another easy HTML parser is HTMLparser.sf.net This one doesn't seem to be a SAX parser...:-\
[newbie] Hit quality rating
Hi! Is there a hit quality rating in Lucene or are there only hits and non-hits? Timo
Re: [newbie] Hit quality rating
On Wednesday 04 February 2004 14:48, Otis Gospodnetic wrote: There is score. Oops, you are right Hits.score(). But it seems I have to implement a sorting iterator on my own :-\
Re: SQLDirectory
On Monday 02 February 2004 21:08, Jochen wrote: RE: Lucene Optimized Query Broken? Thanks for the hint. Alas, I also didn't find it there :-( Anyway, I need something that does work on any (Postgres) SQL db.
Re: HTMLDocument
On Sunday 01 February 2004 15:27, Felix Huber wrote: Of course it's there: http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/ Thanks. But didn't find that contribution/ant directory there anyway...:-(
Re: SQLDirectory
On Monday 02 February 2004 22:00, Philippe Laflamme wrote: I'll look into making the implementation available if you're interested. I'd be very interested! Please :) Timo
SQLDirectory
Hi! There was some third-party SQLDirectory for lucene 1.2 which was abandoned for a matter of performance. Well, why not load the index into RAM? Is there some (official) SQLDirectory for 1.3? searcher = new IndexSearcher(IndexReader.open(new RAMDirectory(new SQLDirectory()))); I'd really like to have the index where I do have all the data - in the database.
HTMLDocument
Hi! Is there any HTMLDocument out there? The one in the demo package of Lucene does not handle non-wellformed HTML files (what about NekoHTML?) and seems to have some other inabilities and bugs as well (and why isn't it part of the distro but in a demo package?!). TIA
Re: HTMLDocument
On Sunday 01 February 2004 13:21, Erik Hatcher wrote: On Feb 1, 2004, at 6:19 AM, [EMAIL PROTECTED] wrote: Nutch uses NekoHTML, so you can browse around that codebase and borrow Nutch(.org)? No code there...
storing index in database
Hi! Somebody wrote a SQLDirectory for lucene 1.2 (only) but discontinued it for a matter of performance issues. Well, I really would like to store that index at the same place as the data itself - in the database and not somewhere in the filesystem. I don't quite understand the performance problem at all, but in any event, if an index is only a few MBytes in size, why not select the whole index once out of the DB and keep it in memory? So, I'd like to ask people here whether there is a way - and which is the best one - to store the index reasonably in the db. Timo
Re: storing index in database
On Friday 03 October 2003 16:29, Guilherme Barile wrote: Why not just use a RAMDirectory ? Yes, that was my idea: store the index in the database and load it into memory. I'm just asking people here whether this is a good idea or if there are better (more standard) ways (where I have to do less on my own). ...since I'm a lucene newbie :-)
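The "load it into memory" half can be sketched without any database code (the class is hypothetical; a filesystem path stands in here for whatever a database-backed Directory would provide, since none ships with Lucene): copy the existing index into a RAMDirectory via addIndexes and search the in-memory copy.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

// Sketch: materialize an on-disk (or, with a custom Directory,
// database-stored) index into RAM once at startup, then serve all
// searches from the in-memory copy.
public class RamIndexLoader {
    public static IndexSearcher loadIntoRam(String path) throws java.io.IOException {
        Directory source = FSDirectory.getDirectory(path, false);
        RAMDirectory ram = new RAMDirectory();
        IndexWriter writer = new IndexWriter(ram, new StandardAnalyzer(), true);
        writer.addIndexes(new Directory[] { source });
        writer.close();
        return new IndexSearcher(ram);
    }
}
```

The trade-off: the RAM copy is a snapshot, so any updates written to the database-side index require reloading.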
Re: Redefine the wildcards
On Friday 26 September 2003 15:47, [EMAIL PROTECTED] wrote: because of too many hits. So I wonder if it is possible to redefine the wildcards in lucene to make them replace only numbers and not characters. What about regular expressions?
wildcards in fields?
Hi! I search in a field called url. url:www.blah.com does return hits while url:blah.com does not. So I tried url:*blah.com but this even throws a ParseException. What am I doing wrong? Timo
stop words in index
Hi! I use a GermanAnalyzer for indexing and searching; a search for der (the) does not return any hits. But examining the index with Luke does show up der as the top ranked word. Other words which are probably stop words as well (zum) return hits. Bug? Timo
Re: NLucene up to date ?
Yes, given the lack of updating of the c# version I thought users would be maintaining their own version in line with current developments. I too had to add those items you mentioned. What I would like to see is all these 'implementations' consolidated and maintained regularly as per java. I am not sure how widely known Lucene is in the .NET community - my guess is it isn't. A tried and tested Lucene .NET version will definitely help it reach other audiences. I am pleased to hear Pasha (re)taking up the reins. Brendon [EMAIL PROTECTED] wrote: I talked to one of the maintainers of NLucene and he said that he was planning on releasing a 1.2 version (not beta apparently) in two months. That was back in June and I haven't heard or seen anything since then so I can't really say if it is still being actively developed. Sounds like you are doing the same thing I am doing which is adding functionality that you need on your own. I've also added a few things to NLucene like multifield queries and the default boolean operator setting. Brian Hi all, http://sourceforge.net/projects/nlucene/ has a version numbered 1.2b2. Does anyone know if this source is still being maintained to be closer to the java developments ? Was this an external project to Apache Jakarta ? I (we) have just successfully released a search engine using a c# implementation of Lucene. Code had to be brought up to date in line with recent java builds, and enhanced with additional features (eg field sorting, term position score factoring, etc). Any other c# users who would like to see NLucene kept in line with the java version ? Maybe I'm just being lazy with having to maintain my own version of Lucene =). Surely there are others out there who are c# users and follow the mailing lists (I remember a Brian somewhere !) but seldom post.
Brendon
RE: NLucene up to date ? Lucene.Net is up to date.
Excellent news. Will you be keeping the source up to date with the java developments ? Can't wait to get my hands on the source, yes that damn bit shift operator (unsigned ?) always worried me =) Just by the way, would the .NET version have a similar style sandbox area where users can submit small add-on type functionality ? For example, field sorting. Would love to share the code for use and comment as this seems to be a common request. Big thanks Pasha, Brendon [EMAIL PROTECTED] wrote: Hi, I talked to one of the maintainers of NLucene and he said that he was planning on releasing a 1.2 version (not beta apparently) in two months. That was back in June and I haven't heard or seen anything since then so I can't really say if it is still being actively developed. Sounds like you are doing the same thing I am doing which is adding functionality that you need on your own. I've also added a few things to NLucene like multifield queries and the default boolean operator setting. By the way, I hope that Lucene.Net 1.3rc1 will be available from http://sourceforge.net/ in this week. Lucene.Net is ready, but sourceforge is not :) Lucene.Net is a complete up to date port of Lucene 1.3rc1 including samples and demos (web demo also). A few differences between nLucene and Lucene.Net are: 1. version of Lucene: Lucene.Net is a 1.3rc1, nLucene is a 1.2 2. java code compatible: Lucene.Net only changes naming notation, like IndexWriter; nLucene implements some methods as attributes, and others 3. demos: Lucene.Net contains all of the Lucene demos and tests including web demos. nLucene does not. 4. .NET Framework 1.1 and VS 2003 compatible 5. (for internal developers only): correct implementation of the java operator :) Pasha
RE: NLucene up to date ?
No additional classes have been created. The functionality was simply implemented via new properties and method overloading, so the original signatures remain intact. As far as supporting future versions, I cannot say, as I will no longer be using it at work. Keeping the C# version in line with Java would have to be done in my own time, so no guarantees. Taking the 1.2b2 source, I only brought in the fixes, enhancements, etc. that affected how I was using Lucene. I keep up with the nightly builds on a regular basis and update the C# source where appropriate, so any bugs should have been rectified. Brendon [EMAIL PROTECTED] wrote: Hi, From: [EMAIL PROTECTED] I (we) have just successfully released a search engine using a C# implementation of Lucene. Code had to be brought up to date in line with recent Java builds, and enhanced with additional features (e.g. field sorting, term position score factoring, etc). Did you hard-code these as additional or new classes? Are you going to support new versions of Lucene? Pasha P.S. NLucene is based on Lucene 1.2, with old bugs, and is not supported.
NLucene up to date ?
Hi all, http://sourceforge.net/projects/nlucene/ has a version numbered 1.2b2. Does anyone know if this source is still being maintained to stay closer to the Java developments? Was this an external project to Apache Jakarta? I (we) have just successfully released a search engine using a C# implementation of Lucene. The code had to be brought up to date in line with recent Java builds, and enhanced with additional features (e.g. field sorting, term position score factoring, etc). Any other C# users who would like to see NLucene kept in line with the Java version? Maybe I'm just being lazy with having to maintain my own version of Lucene =). Surely there are others out there who are C# users and follow the mailing lists (I remember a Brian somewhere!) but seldom post. Brendon
Re: NLucene up to date ?
Replies to Erik and Scott inline. [EMAIL PROTECTED] wrote: Do these implementations maintain file compatibility with the Java version? Scott Yes and no; an explanation will help. The field ordering functionality required additional files to be created at index time if the Document.Field property indicates so. At search time, the entire contents of the 'field sorting' files are read in. As the IndexReader is shared for all client calls (for a pre-defined period of time, as the index has been implemented 'incremental' style), this cost is only incurred once. Code-wise, the technique follows the pattern for the normalisation byte writing and reading, the difference being that an int is written. Yes, there is a memory usage hit, but the performance and functionality offered offset this. All other file formats remain identical. I have coded LuceNET (!) so that it gracefully continues if the index segments do not have these additional 'sorting' files (named by a convention like the normalisation files). Erik Hatcher wrote: I'd love to see quality implementations of the Lucene API in other languages that are up to date with the latest Java codebase. I'm embarking on a Ruby port, which I'm hosting at rubyforge.org. There is a Python version called Lupy. A related question I have is: what about performance comparisons between the different language implementations? Will Java be the fastest? Is there a test suite already available that can demonstrate the performance characteristics of a particular implementation? I'd love to see the numbers and see if even the Java version can be beat. Erik Performance-wise, queries typically run in hundredths of seconds. Including term position in the scoring impacted the timings, as expected. Indexing takes time, but then this wasn't really part of the design goals. As far as comparing to the Java implementation in terms of performance, I haven't tried, as this workplace is an MS shop. Java vs. C# all over again?
Just kidding =) On Thursday, July 31, 2003, at 08:43 AM, [EMAIL PROTECTED] wrote: [Brendon's original message, quoted in full above] Brendon
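The field-sorting storage Brendon describes (one fixed-width value per document, written at index time and read fully into memory at search time, mirroring how Lucene handles its norms files) can be sketched with plain java.io; the class and method names here are illustrative, not from his port:

```java
import java.io.*;
import java.util.Arrays;

public class SortFieldFile {
    // Write one int per document, in document-number order,
    // the way Lucene writes one norm byte per document.
    static byte[] writeSortValues(int[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (int v : values) out.writeInt(v);
        out.close();
        return bytes.toByteArray();
    }

    // At search time, read the whole file back into an int[] indexed by doc number.
    static int[] readSortValues(byte[] file, int numDocs) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
        int[] values = new int[numDocs];
        for (int i = 0; i < numDocs; i++) values[i] = in.readInt();
        return values;
    }

    public static void main(String[] args) throws IOException {
        int[] original = {7, 3, 42};
        int[] roundTrip = readSortValues(writeSortValues(original), original.length);
        System.out.println(Arrays.toString(roundTrip)); // [7, 3, 42]
    }
}
```

The memory hit he mentions follows directly: the whole array lives in RAM for the life of the shared IndexReader, which is why paying the read cost once per reader matters.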
Distribution of junit.jar with Lucene Binaries
For what reason is the JUnit lib packed within the binary distribution of Lucene? Greetings, Manfred
RE: Stress/scalability testing Lucene
Ah, for some reason I thought none of the Lucene methods were thread safe, or is this only in the case of reading and writing at the same time? I thought I read this in the FAQ. Roy. -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED]] Sent: Wednesday, November 20, 2002 5:04 PM To: Lucene Users List Subject: Re: Stress/scalability testing Lucene Justin Greene wrote: We created a thread pool to read and parse the email messages. 10 threads seems to be the magic number here for us. We then created a queue of messages to be indexed, onto which we push the parsed messages, and have a single thread adding messages to the index. IndexWriter.addDocument(Document) is thread safe, so you don't need a separate indexing thread. So long as your analyzer is thread safe, you can index each message in the thread that parses it, for even greater parallelism. Doug This email and any attachments are confidential and may be legally privileged. No confidentiality or privilege is waived or lost by any transmission in error. If you are not the intended recipient you are hereby notified that any use, printing, copying or disclosure is strictly prohibited. Please delete this email and any attachments, without printing, copying, forwarding or saving them and notify the sender immediately by reply e-mail. Zurich Capital Markets and its affiliates reserve the right to monitor all e-mail communications through its networks. Unless otherwise stated, any pricing information in this e-mail is indicative only, is subject to change and does not constitute an offer to enter into any transaction at such price and any terms in relation to any proposed transaction are indicative only and subject to express final confirmation.
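Doug's suggestion, indexing from each parsing thread directly instead of funnelling through a single indexer thread, can be sketched in plain Java. ThreadSafeIndex here is a hypothetical stand-in for IndexWriter, whose addDocument call handles its own synchronization:

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelIndexing {
    // Stand-in for IndexWriter: addDocument is safe to call from many
    // threads at once, so no hand-off queue or dedicated indexer is needed.
    static class ThreadSafeIndex {
        private final List<String> docs = Collections.synchronizedList(new ArrayList<>());
        void addDocument(String doc) { docs.add(doc); }
        int size() { return docs.size(); }
    }

    public static void main(String[] args) throws Exception {
        ThreadSafeIndex index = new ThreadSafeIndex();
        ExecutorService pool = Executors.newFixedThreadPool(10); // Justin's "magic number"
        for (int i = 0; i < 100; i++) {
            final int n = i;
            pool.submit(() -> {
                String parsed = "message-" + n; // parse the mail message...
                index.addDocument(parsed);      // ...and index it in the same thread
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(index.size()); // 100
    }
}
```

The caveat from the thread still applies: this only works if the analyzer used during addDocument is itself thread safe.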
the order of fields in Document.fields()
Quick question about Document.fields(). Lucene provides a method to retrieve the value of a field, or to grab all fields as an Enumeration. It does not, however, allow you to grab all values of one field for a document; it will only return the last value added for that field. For example, I am indexing email messages that might have multiple To/CC/BCC fields in the message header. Currently, to grab all the values when I display an email that has been indexed, I must use the fields() method to grab an Enumeration of all fields in a document. I then separate them into different arrays based on the field names. However, I am concerned about the order of the fields, since I consider the first To, CC, or BCC to be the main value for each field. Are the fields returned in the order in which they were added? Or is there no order? If there is no order, can someone suggest a solution? Thanks! Roy.
RE: the order of fields in Document.fields()
Shouldn't there be at least one method that returns an array of fields in the correct order? Roy. -Original Message- The order is preserved (or reversed, actually), so it's not random. It is the reverse of the order in which the fields were added to the document. This would be easy to test...
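The grouping Roy describes, walking the full field enumeration and bucketing values by name while keeping their relative order, can be sketched in plain Java. FieldPair is a hypothetical stand-in for Lucene's Field class, used so the example needs no Lucene dependency:

```java
import java.util.*;

public class FieldGrouping {
    record FieldPair(String name, String value) {} // stand-in for Lucene's Field

    // Group field values by name, preserving the order in which they appear.
    static Map<String, List<String>> groupFields(List<FieldPair> fields) {
        Map<String, List<String>> byName = new LinkedHashMap<>();
        for (FieldPair f : fields) {
            byName.computeIfAbsent(f.name(), k -> new ArrayList<>()).add(f.value());
        }
        return byName;
    }

    public static void main(String[] args) {
        List<FieldPair> fields = List.of(
            new FieldPair("to", "alice@example.com"),
            new FieldPair("cc", "bob@example.com"),
            new FieldPair("to", "carol@example.com"));
        System.out.println(groupFields(fields).get("to"));
        // [alice@example.com, carol@example.com]
    }
}
```

If the enumeration really comes back reversed, as the reply above suggests, one Collections.reverse per per-name list restores insertion order, so the first To/CC/BCC can still be treated as the main value.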
Deleting a document found in a search
I am just getting started with Lucene and I think I have a problem understanding some basic concepts. I am using two-part identifiers to uniquely identify a document in the index. So whenever I want to index a document, I first want to find and delete the old form. To find it, I intend to use:

BooleanQuery findOurs = new BooleanQuery();
findOurs.add(new TermQuery(new Term("Id", id)), true, false);
findOurs.add(new TermQuery(new Term("Domain", domain)), true, false);
System.out.println("Deleting document matching: '" + findOurs.toString() + "'");
Searcher searcher = new IndexSearcher(directory);
Hits hits = searcher.search(findOurs);
// Assert: hits.length() == 1
for (int i = 0; i < hits.length() && i < 10; i++) {
    Document d = hits.doc(i);
    // Now what can I do to find the document id?
    int id = ??
    searcher.delete(id);
}

But I can't discover how to convert a search result into a document id. It is recorded in the private HitDoc class, but since it is not publicly accessible, there must be a reason why it would not work to add a public getter for it. Is there an alternative way that I can do this? My first thought is to define a Field.Keyword(composite-key, domain + \u + id). This would allow me to use the delete(Term) interface to delete the key. -- Thanks, Adrian.
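Adrian's second idea, a single keyword field holding a composite key so the old copy can be deleted by Term rather than by internal document number, is the usual pattern. It can be sketched without a Lucene dependency; MiniIndex is a hypothetical stand-in (in Lucene itself the delete step would be IndexReader's delete-by-Term call on the composite-key field), and the "\u0000" separator is an assumed choice of a character that cannot occur in either key part:

```java
import java.util.*;

public class CompositeKeyDelete {
    // Build one unambiguous key from the two-part identifier.
    // "\u0000" is assumed to appear in neither part, so keys never collide.
    static String compositeKey(String domain, String id) {
        return domain + "\u0000" + id;
    }

    // Minimal stand-in for an index that supports delete-by-key.
    static class MiniIndex {
        private final Map<String, String> docs = new HashMap<>();
        void add(String key, String doc) { docs.put(key, doc); }
        int deleteByKey(String key) { return docs.remove(key) != null ? 1 : 0; }
        int size() { return docs.size(); }
    }

    public static void main(String[] args) {
        MiniIndex index = new MiniIndex();
        index.add(compositeKey("example.com", "42"), "old version");
        // Re-indexing: delete the old copy by composite key, then add the new one.
        index.deleteByKey(compositeKey("example.com", "42"));
        index.add(compositeKey("example.com", "42"), "new version");
        System.out.println(index.size()); // 1
    }
}
```

This sidesteps the search-then-delete round trip entirely, and it never depends on internal document numbers, which Lucene is free to change.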
Enumerating all Terms
Is there a way of getting a list of all terms that have been indexed? I guess it would approximate a wildcard query of the form *:*, if that were valid, except that instead of returning matching documents it would just return the fields and values. -- Thanks, Adrian.
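Lucene does expose this: IndexReader's terms() method returns a TermEnum that walks every (field, text) pair in the index in sorted order. A plain-Java stand-in shows the iteration pattern without a Lucene dependency; the term dictionary contents here are invented for illustration:

```java
import java.util.*;

public class TermDump {
    public static void main(String[] args) {
        // Stand-in for the term dictionary: sorted (field, text) pairs,
        // in the order a TermEnum would walk them.
        SortedMap<String, SortedSet<String>> termDict = new TreeMap<>();
        termDict.computeIfAbsent("body", k -> new TreeSet<>()).addAll(List.of("hello", "world"));
        termDict.computeIfAbsent("subject", k -> new TreeSet<>()).add("lucene");

        // Analogous to: TermEnum e = reader.terms(); while (e.next()) { ...e.term()... }
        for (Map.Entry<String, SortedSet<String>> field : termDict.entrySet()) {
            for (String text : field.getValue()) {
                System.out.println(field.getKey() + ":" + text);
            }
        }
    }
}
```

Because the enumeration is sorted by field and then by term text, restricting the dump to a single field is just a matter of stopping once the field name changes.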