Re: Efficient document spooling and indexing
Yes, I think you are correct: I see my index directory growing as I add documents to the index, even though I don't call close() until I have finished adding all documents. Hm, I wonder what exactly gets written to disk between add and close. I shall rewrite my code to use RAMDirectory, then. I like efficient code and hate wasting resources of any kind, not just computing ones.

Thanks,
Otis

--- Ian Lea [EMAIL PROTECTED] wrote:

Data may not be committed to disk, buffers flushed, files closed, etc. until IndexWriter.close() is called, but file IO does happen before then. So I would expect the answer to your question to be no.

--
Ian.
[EMAIL PROTECTED]

Otis Gospodnetic wrote:

Hello,

This is from a thread from about 2 weeks ago. What is the answer to this question? If data is written to disk only when IndexWriter's close() is called, wouldn't the sample code below be as efficient as the sample code that uses RAMDirectory, further down?

Thanks,
Otis

When using the FSWriter, the actual file IO doesn't occur until I close the writer, right? So wouldn't it be just as efficient to do the following:

    IndexWriter fsWriter = new IndexWriter(new File(...), analyzer, false);
    while (... more docs to index ...) {
        ... add 100,000 docs to fsWriter ...
    }
    fsWriter.optimize();
    fsWriter.close();

-----Original Message-----
From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
Sent: Friday, November 02, 2001 10:47 AM
To: 'Lucene Users List'
Subject: RE: Indexing problem

Well, I don't know if there's an archive of the list, so this is what Doug wrote:

A more efficient and slightly more complex approach would be to build large indexes in RAM, and copy them to disk with IndexWriter.addIndexes:

    IndexWriter fsWriter = new IndexWriter(new File(...), analyzer, true);
    while (... more docs to index ...) {
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
        ... add 100,000 docs to ramWriter ...
        ramWriter.optimize();
        ramWriter.close();
        fsWriter.addIndexes(new Directory[] { ramDir });
    }
    fsWriter.optimize();
    fsWriter.close();

Scott

--
To unsubscribe, e-mail: mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: Order of Package Compilation
Why not just use Ant to build Lucene?

Otis

--- srinivasa v [EMAIL PROTECTED] wrote:

Hi all,

I got the Lucene source files. When I started to compile all the packages again in some order, I got an error saying that some class was not found. The order in which I compiled is given below:

    com\lucene\store\*.java
    com\lucene\util\*.java
    com\lucene\document\*.java
    com\lucene\analysis\standard\*.java
    com\lucene\analysis\*.java
    com\lucene\index\*.java
    com\lucene\search\*.java
    com\lucene\queryParser\*.java

I suspect the order may be wrong. If so, in what order do I have to compile? Please help me.

Thanks in advance,
Srini
Re: Indexing other documents type than html and txt
You'd have to write a parser for each of those document types that converts the document to text, and then index that text. Sure, you can feed it something like XML, but then you may want to consider something like xmldb.org instead.

Otis

--- Antonio Vazquez [EMAIL PROTECTED] wrote:

Hi all,

I have a question. I know that Lucene can index HTML and text documents, but can it index other types of documents, such as PDF, DOC, and XLS files? If it can, how can I implement that? Can it perhaps be implemented like the HTML and text indexing?

regards
Antonio
Re: Industry Use of Lucene?
It looks like a person at Overture (formerly Goto.com) is using it. I know ScreamingMedia.com used it at one point.

Otis

--- Jeff Kunkle [EMAIL PROTECTED] wrote:

Does anyone know of any companies or agencies using Lucene for their products/projects? I am attempting to make a marketing pitch for Lucene to my manager, and I know one of the first questions will be, "Who else is using it?" I know Lucene is a very powerful, fast, and flexible full-text search engine, but my manager will need a little more coercing. Any help on this topic is greatly appreciated.

Thanks,
Jeff
Re: FW: Installation notes
You need to download and install JavaCC. Try this: http://marc.theaimsgroup.com/?l=lucene-user&w=2&r=1&s=javacc&q=b

Otis

--- Patrick Codere [EMAIL PROTECTED] wrote:

Dear All,

I just downloaded the latest version of Lucene and, not being too familiar with Java, I would like to get some help installing it. I downloaded it, and when I ran Ant I got the following message:

    could not create task of type : javacc

What does this mean? Please help.

Thanks.
RE: existing or not existing
Yes, I would use this, especially the IndexReader methods that you suggested.

Otis

--- Doug Cutting [EMAIL PROTECTED] wrote:

From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]

> You could try looking for a segments file in the index directory. If it exists, the index exists, else it does not. Is there a better way?

I think that's currently the best way. But it's not great, because it requires applications to know something about the internal structure of the index.

Going forward, I'm hesitant to change the semantics of the 'create' flag. I'm also hesitant to add another flag or constructor method. Perhaps the addition of the following IndexReader methods would suffice:

    /** Returns true iff an index exists in the named directory. */
    public static boolean indexExists(String directory);
    public static boolean indexExists(File directory);
    public static boolean indexExists(Directory directory);

These are analogous to the 'lastModified' methods. Internally these would just check for the existence of the segments file. Does that sound like a good plan?

Another place that currently requires application knowledge of index structure is failure recovery. Currently, if an indexing application crashes, it may leave .lock files in the directory, which must be removed before the index can be altered again. Perhaps this can be resolved similarly by adding methods like:

    /** Returns true iff the index in the named directory is currently locked. */
    public static boolean isLocked(Directory directory);

    /** Forcibly unlocks the index in the named directory.
     * Caution: this should only be used by failure recovery code,
     * when it is known that no other process or thread is in fact
     * currently accessing this index.
     */
    public static void unlock(Directory directory);

We could also have String and File versions for convenience.

Would folks use something like this? If so, more fodder for the TODO list!

Doug
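Until methods like the ones proposed above exist, the segments-file check mentioned in this thread can be done with plain java.io. The following is only a minimal sketch, assuming the index lives in an ordinary file-system directory. The class name IndexExistsCheck and the demo() helper are made up for illustration and are not part of Lucene.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

final class IndexExistsCheck {

    /** Returns true if the given directory contains a file named "segments". */
    static boolean indexExists(File directory) {
        return new File(directory, "segments").isFile();
    }

    /**
     * Hypothetical demo helper: checks an empty temp directory, then creates
     * an empty "segments" file in it and checks again.
     * Returns { resultBefore, resultAfter }.
     */
    static boolean[] demo() {
        try {
            File dir = Files.createTempDirectory("index-check").toFile();
            boolean before = indexExists(dir);
            boolean created = new File(dir, "segments").createNewFile();
            boolean after = created && indexExists(dir);
            return new boolean[] { before, after };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("empty directory:   " + result[0]);
        System.out.println("with segments file: " + result[1]);
    }
}
```

As Doug notes, the drawback is that the application has to know the name of an internal index file, which is exactly what a library-level indexExists method would hide.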
Re: WildcardQuery
If I understand you correctly, you tried to search for '*new*'. I believe you can't use an asterisk (*) as the first character of a query term. So new* is valid, while *new and *new* are not.

Otis

--- Serge A. Redchuk [EMAIL PROTECTED] wrote:

Hello sampreet,

Tuesday, December 11, 2001, 6:44:29 AM, you wrote:

> Hi All,
> This must be simple enough, but can anyone please explain to me when a WildcardQuery is created in QueryParser, i.e. what special characters in the query string are required to build a WildcardQuery within QueryParser?

Moreover, when I built a complex search like this:

    path:*new* comp*

by combining WildcardQueries in a BooleanQuery (NOT with QueryParser), and then obtained the query string using boolq.toString(...), the QueryParser COULD NOT parse this string!!! Isn't that strange?

    QueryParser.parse( bquery.toString( ... ) )

does not work :-(

--
Best regards,
Serge mailto:[EMAIL PROTECTED]
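The rule described in the reply (a term may contain an asterisk, but may not start with one) can be checked before a query string is ever handed to the parser. This is a small sketch of that rule only. The class and method names are invented for illustration, and it does not reproduce what QueryParser does internally.

```java
final class WildcardTermCheck {

    /** True if the term contains an asterisk anywhere. */
    static boolean hasWildcard(String term) {
        return term.indexOf('*') >= 0;
    }

    /** True unless the term is empty or begins with an asterisk. */
    static boolean isAllowed(String term) {
        return !term.isEmpty() && term.charAt(0) != '*';
    }

    public static void main(String[] args) {
        for (String term : new String[] { "new*", "*new", "*new*", "comp*" }) {
            System.out.println(term + " -> allowed: " + isAllowed(term));
        }
    }
}
```

A check like this would let an application reject path:*new* with a clear message instead of passing it on and getting a parse failure back.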
Re: continue ideo-logic error in QueryParser and in BooleanQuery !
Actually, I do not think this is a bug. You cannot run a search with a query that has only a NOT part. You cannot ask Lucene to match all documents that do not contain a certain term. For instance, issuing 'NOT pretty' will not return doc1, doc3, doc4. You have to use that 'NOT pretty' in combination with something else (AND). For instance, 'love AND NOT pretty' should return doc1 and doc4, since both contain 'love' and neither contains 'pretty'.

I was about to say that you can check what other search engines do when you give them just the negation, so I tried av.com and google.com. AltaVista does return a bunch of matches, but Google doesn't let you enter such a query.

Otis

--- Serge A. Redchuk [EMAIL PROTECTED] wrote:

...

Say we have 4 docs:

    doc1: Love is life
    doc2: Java is pretty nice language
    doc3: C++ is powerful, but unsafe
    doc4: Onion and love sometimes are not compatible

So, if we search for

    love OR NOT onion

Here I was wrong (nevertheless, this does not resolve the described bug). I wrote:

    result must be: doc1, doc2, doc3.

It must be:

    result must be: doc1, doc2, doc3, doc4. (ALL)

...

Certainly I understand that people will not compose such complex queries to search for ALL, but Lucene still does not find all of them.
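To make the behaviour discussed in this thread concrete, here is a toy model in plain Java, using the four example documents from Serge's message. It is only a sketch of the idea that a prohibited term can narrow a result set but cannot produce one by itself. It is not Lucene's implementation, and every name in it is made up. Note that in this model 'love AND NOT pretty' matches doc1 and doc4, because doc4 also contains 'love'.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class NotQueryModel {

    static final Map<String, String> DOCS = new LinkedHashMap<>();
    static {
        DOCS.put("doc1", "Love is life");
        DOCS.put("doc2", "Java is pretty nice language");
        DOCS.put("doc3", "C++ is powerful, but unsafe");
        DOCS.put("doc4", "Onion and love sometimes are not compatible");
    }

    /** Lowercases the text and splits it into a set of terms. */
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z+]+")));
    }

    /**
     * Returns ids of documents that contain the required term and do not
     * contain the prohibited term. Either argument may be null. With no
     * required term there is nothing to select from, so the result is empty.
     */
    static List<String> search(String required, String prohibited) {
        List<String> hits = new ArrayList<>();
        if (required == null) {
            return hits; // a pure negation selects nothing in this model
        }
        for (Map.Entry<String, String> doc : DOCS.entrySet()) {
            Set<String> terms = tokens(doc.getValue());
            boolean excluded = prohibited != null && terms.contains(prohibited);
            if (terms.contains(required) && !excluded) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println("love AND NOT pretty: " + search("love", "pretty"));
        System.out.println("NOT pretty:          " + search(null, "pretty"));
    }
}
```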
Re: DateFilter and NullPointerException
Hm, do you know which line in DateFilter.java this NPE comes from? Could you try compiling Lucene with the -g switch so that we can see the line numbers in the exception stack trace? If you want, you can also submit a bug report at http://nagoya.apache.org/bugzilla/

Thanks,
Otis

--- Uroš Jurglič [EMAIL PROTECTED] wrote:

I'm having a problem when using a Query and a DateFilter for a search. If I create the DateFilter with DateFilter.After and the current date and time as the parameter, I get a NullPointerException when executing the search (Searcher.search(Query, DateFilter)). Has anyone experienced something like that? If I set the time just a bit in the past, it returns empty hits, which is how it should behave all the time.

Code snippet:

    // I have java files as documents, consisting of content (Field.Text())
    // and modified (Field.Keyword())
    Query q = new WildcardQuery(new Term("content", "packag*"));
    DateFilter df = DateFilter.After("modified", Calendar.getInstance().getTime());
    Searcher searcher = new IndexSearcher(path);
    Hits hits = searcher.search(q, df); // line 66

Exception:

    Exception in thread "main" java.lang.NullPointerException
        at org.apache.lucene.search.DateFilter.bits(Unknown Source)
        at org.apache.lucene.search.IndexSearcher.search(Unknown Source)
        at org.apache.lucene.search.Hits.getMoreDocs(Unknown Source)
        at org.apache.lucene.search.Hits.<init>(Unknown Source)
        at org.apache.lucene.search.Searcher.search(Unknown Source)
        at Search.main(Search.java:66)

Regards,
Uros.
Re: Using a DateFilter without a query
Hello,

--- Jan Stövesand [EMAIL PROTECTED] wrote:
> Hi,
>
> is it possible to use a DateFilter without a query? I would like to get all Documents from within a certain period of time WITHOUT specifying any query except the range of dates.

I don't know, but I'd like to know. Have you tried it?

> Is there something like a query that will always return all documents from an index?

This has been asked in the past. It can't be done, but you could work around it by adding a field with a known, constant value to each document. Then searching for that value will give you all documents in the index. Is there a better way?

Otis
Re: IndexReader/IndexSearcher
Uh, I don't repeat myself, but I'll repeat others' words :) It is the analyzer (StandardAnalyzer, I believe) that lowercases text before indexing it. If you use the same analyzer to search, it will lowercase the text before performing the search, so you'll find the document with bo23 in it even if you use BO23 in the search.

Otis

--- Mike Baroukh [EMAIL PROTECTED] wrote:

I reply to myself: it seems that when using IndexReader, keywords must be lower case. So, I indexed BO23, and I can search for BO23 with IndexSearcher, but I must use bo23 to search with IndexReader. Am I right?

Mike

----- Original Message -----
From: Mike Baroukh [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, December 19, 2001 12:57 PM
Subject: IndexReader/IndexSearcher

Hi all.

Can somebody tell me where my error is? There is something I don't understand. If I search for something with

    IndexReader indexReader = IndexReader.open("/myindex");
    TermDocs docs = indexReader.termDocs(new Term("codman", "BO23"));
    while ((docs != null) && (docs.next())) {
        nbis++;
    }
    if (docs != null) docs.close();
    indexReader.close();

I see that nbis = 0, so termDocs returned nothing. But if I use

    SimpleAnalyzer analyzer = new SimpleAnalyzer();
    IndexSearcher indexSearcher = new IndexSearcher("/myindex");
    Query query = QueryParser.parse("BO23", "codman", analyzer);
    Hits hits = indexSearcher.search(query);
    nbis = hits.length();

it's exactly the same query and the same index, but this time it returns 1 document. I don't understand where this difference comes from.

I know that the first way is not the right way to search, but what I want is to delete from the index the document returned by search #2.

Thanks in advance for any help.

Mike
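The difference between the two lookups in this thread comes down to whether the search text passes through the same lowercasing step that was applied at indexing time. The sketch below models that with a plain map. It is an illustration of the explanation given in the reply, not Lucene code, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class LowercaseLookupModel {

    /** Term -> ids of documents containing it. Terms are stored lowercased. */
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Stand-in for the lowercasing an analyzer applies. */
    static String analyze(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    /** Indexes a single-term value after analysis. */
    void index(int docId, String value) {
        postings.computeIfAbsent(analyze(value), k -> new ArrayList<>()).add(docId);
    }

    /** Looks the term up exactly as given, with no analysis. */
    int rawLookup(String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptyList()).size();
    }

    /** Analyzes the text first, then looks it up. */
    int analyzedLookup(String text) {
        return rawLookup(analyze(text));
    }

    public static void main(String[] args) {
        LowercaseLookupModel model = new LowercaseLookupModel();
        model.index(1, "BO23");
        System.out.println("raw BO23:      " + model.rawLookup("BO23"));
        System.out.println("raw bo23:      " + model.rawLookup("bo23"));
        System.out.println("analyzed BO23: " + model.analyzedLookup("BO23"));
    }
}
```

In this model the raw lookup only succeeds when the caller supplies the term in the form it was stored in, which matches Mike's observation that bo23 works with the reader while BO23 does not.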
Re: About indexing
Parag,

I'm not sure I understood your question correctly, but it seems you want to create a Field that holds the path information (e.g. TEST/subdir1 or TEST/subdir2, and so on), and then include that field in the query based on which path(s) you want to search. You could use TEST to search just TEST, TEST/subdir1 to search just TEST/subdir1, or TEST* to search everything under TEST.

Otis

--- Parag Dharmadhikari [EMAIL PROTECTED] wrote:

Hi all,

If I create the index of files in a different thread (which may be invoked at any time), is it possible to index the files starting from the root directory and then selectively search particular paths in the created index? For example, first I will index from a root directory, say TEST. Then, depending on the selected directory path (which will reside inside the root directory TEST), I will search the created index.

Thanks in advance,
regards
parag
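The three kinds of path query mentioned in the reply (exact directory, exact subdirectory, and a trailing-asterisk prefix) behave roughly as in the sketch below. This is a plain-Java model of the matching rule only, assuming each document stores its directory in a single untokenized path field. The names are made up and no Lucene classes are involved.

```java
final class PathFieldModel {

    /**
     * Returns true if a stored path matches the query. A query ending in '*'
     * is treated as a prefix match, anything else as an exact match.
     */
    static boolean matches(String storedPath, String query) {
        if (query.endsWith("*")) {
            String prefix = query.substring(0, query.length() - 1);
            return storedPath.startsWith(prefix);
        }
        return storedPath.equals(query);
    }

    public static void main(String[] args) {
        String[] paths = { "TEST", "TEST/subdir1", "TEST/subdir2" };
        for (String query : new String[] { "TEST", "TEST/subdir1", "TEST*" }) {
            for (String path : paths) {
                System.out.println(query + " vs " + path + " -> " + matches(path, query));
            }
        }
    }
}
```

One thing this makes visible: a prefix such as TEST* would also match a sibling directory named TEST2, so a prefix like TEST/* may be the safer choice if only subdirectories are wanted.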
Re: I want to search on BOTH -- (1) XML data and (2) Text data.
Hello,

You could write an XML parser (see http://xml.apache.org/ for some XML tools) and store XML elements as Fields in Lucene Documents. To search for 'Hello' and 'Hello Mr. President!' you can store the whole article body as a Text (or maybe UnStored) Field.

You can also look on www.mail-archive.com and search this list's archive for some related discussions. Try searching for Philip Ogren (I think I got the name right); he sent some code that lets you go from XML -> Lucene Document quickly, I think.

Otis

--- Harun Altay [EMAIL PROTECTED] wrote:

Hello Friends,

I want to search on BOTH -- (1) XML data and (2) Text data.

(1). Text Data -- mostly consists of HTML pages residing on the server, for example hundreds of HTML files, TXT files, etc.

(2). XML Data -- for example, articles that are stored in XML format, let's say like this:

    <article>
      <article code> </article code>
      <article title> </article title>
      <author> </author>
      <date> ... </date>
      <etc> ... </etc>
      <body of the TEXT> ... the article body, TEXT ... </body of the TEXT>
    </article>

In this type of search, we need to search this XML-based article file in two different ways:

2.a. First way of searching: searching the XML file through its KEYWORDS, like: date = Jan-01-2002 and author = George Washington

2.b. Second way of searching: free search on the article body. For example: all the articles whose body has the word 'Hello', or the sentence 'Hello Mr. President!'

Note-1: The XML file may reside either at the operating system level or in an XML-supporting DATABASE.

Note-2: If I need to, I can write extra Java classes to support some more functionality, if possible.

Thank you very much,
Harun.
Re: Anyone run Linux JVM 1.4 Beta 3 with lucene ?
Oui :)

Otis

--- Winton Davies [EMAIL PROTECTED] wrote:

Hi guys,

I'm getting stung by JVM 1.3.1_01 on Linux: the max allocation of heap is about 1.9 GB. Anyway, I'm thinking of going to 1.4. Has anyone run Lucene under this beta?

Cheers,
Winton

--
Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/
Re: My own steammer (brazilian)
That file is created during the build process. Try building Lucene by typing 'ant compile'.

Otis

--- Bizu de Anúncio [EMAIL PROTECTED] wrote:

My Brazilian stemmer has the same structure as the German stemmer, except for the inner logic. I created it and tested it, and now I'm trying to compile it, with no success. The problem is the 'StandardTokenizer.java' class! I can't find it in the package org.apache.lucene.analysis.standard. The only file that exists there is a file named 'StandardTokenizer.jj'. What is this file for? I have lucene-1.2-rc2.

Can someone help me?

thanks,
jk
Re: using lucene with a very large index
--- tal blum [EMAIL PROTECTED] wrote:
> Hi,
>
> I'm building a very large index that contains several categories. I have several questions I hope you can answer.
>
> 1) Is there a way to use Lucene with several indexes without merging them?

Look at the MultiSearcher class.

> 2) Does the Document id change after merging indexes, or after adding or deleting documents?

Not sure.

> 3) Has anyone implemented a GUI for the Lucene index that enables deletions by id or SQL-like queries?

I haven't seen anything like it.

> 4) Assuming I have a term query that has a large number of hits, say 10 million, is there a way to get, say, the top 10 results without going through all the hits?

See the Javadocs for Searcher and IndexSearcher; I think you'll find the answer there.

Otis
RE: Index Locked For Write
--- Howk, Michael [EMAIL PROTECTED] wrote:
> Out of curiosity, why didn't we need to close the writer in rc2 or rc3? When you suggest a synchronized keyword, are you suggesting that the writer is not inherently thread-safe? Do we need to write our own thread management on top of Lucene?

Sorry, that might have been a wrong suggestion. IndexWriter (at least the add method) is supposed to be thread-safe.

Otis

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 21, 2002 4:07 PM
To: Lucene Users List
Subject: RE: Index Locked For Write

You could use the synchronized keyword and use IndexReader.isLocked() or something like that, no?

Otis

--- Howk, Michael [EMAIL PROTECTED] wrote:

Thank you for your quick responses. But in our application, we're working in a transactional environment where multiple threads are accessing a single writer using the recommended singleton pattern. Since no thread has exclusive access to the writer, how can we have one thread arbitrarily decide to close the writer?

Michael

-----Original Message-----
From: Mark Tucker [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 21, 2002 3:51 PM
To: Lucene Users List
Subject: RE: Index Locked For Write

You forgot to close your writer after the call to optimize.

-----Original Message-----
From: Howk, Michael [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 21, 2002 2:49 PM
To: Lucene Mailing List (E-mail)
Subject: Index Locked For Write

We just got the newest daily build (to try to fix some NullPointer errors with ? and _ characters), and we're getting the same problem that Daniel Calvo mentioned: Index Locked for Write. Here's basically what our code is doing:

    IndexWriter writer = new IndexWriter(path, analyzer, create);
    try {
        Document doc = new Document();
        doc.add(Field.Keyword(DOC_ID, "14"));
        doc.add(Field.UnStored(ANY, "mushu"));
        writer.addDocument(doc);
        writer.optimize();

        // Search the document for our keyword
        {
            IndexReader reader = IndexReader.open(path);
            IndexSearcher searcher = new IndexSearcher(reader);
            Vector returnStuff = searcher.search("mushu");
        }

        // Verify that we got one record back
        assertNotNull(returnStuff);
        assertEquals(1, returnStuff.size());
    } finally {
        // Clean up after ourselves
        IndexReader reader = IndexReader.open(path);
        reader.delete(new Term(DOC_ID, "14"));
        reader.close();
    }

And the exception we're getting on the reader.delete line in the finally clause:

    java.io.IOException: Index locked for write: Lock@C:\devtools\JBossTomcat\jboss\indexes\marc\write.lock
        at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:245)
        at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:220)
        at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:122)
        at org.jboss.ejb.plugins.jrmp.server.JRMPContainerInvoker_Stub.invoke(Unknown Source)
        at org.jboss.ejb.plugins.jrmp.interfaces.GenericProxy.invokeContainer(GenericProxy.java:357)
        at org.jboss.ejb.plugins.jrmp.interfaces.StatelessSessionProxy.invoke(StatelessSessionProxy.java:123)
        at $Proxy5.deleteDocument(Unknown Source)

Are we using the right approach? Any suggestions? Thank you.

Michael Howk
Re: Performance Tuning
You could try playing with the merge factor...

Otis

--- Aruna Raghavan [EMAIL PROTECTED] wrote:

Hi,

Are there any ways to fine-tune CPU performance with Lucene? I know about using optimize() calls, but I am wondering if there are any other ways to improve CPU time/disk space performance.

Thanks!
Re: Boolean Query Parsing with IN keyword
Jonathan,

That's most likely caused by StandardAnalyzer, which you are probably using. 'in' is listed as one of the stop words:

    public static final String[] STOP_WORDS = {
        "a", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
        "into", "is", "it", "no", "not", "of", "on", "or", "s", "such", "t",
        "that", "the", "their", "then", "there", "these", "they", "this",
        "to", "was", "will", "with"
    };

Try searching for state:or. It should yield no matches.

But StandardAnalyzer is no longer final (get the latest build), so you can write a class that subclasses it and calls this StandardAnalyzer constructor:

    /** Builds an analyzer with the given stop words. */
    public StandardAnalyzer(String[] stopWords) {
        stopTable = StopFilter.makeStopTable(stopWords);
    }

Pass it your own list of stop words and you are done. If you've already indexed some data, you have to be careful which words you choose as stop words. I suggest sticking with the above list (minus 'in', 'or', etc.) for now. Once you have your class, use it instead of StandardAnalyzer.

Otis

--- Jonathan Franzone [EMAIL PROTECTED] wrote:

I'm trying to search on a US State field. The Lucene field name is state, so I'm building a query like:

    +(state:fl state:al state:in)

to search for documents in Florida, Alabama, or Indiana. But whenever I pass "in" or "IN" to the QueryParser, it strips it out. Passing the above query to the QueryParser yields +(state:fl state:al). Is there a way to escape the "in" keyword? I've tried enclosing it in double and single quotes, neither of which worked.

Thanks,
Jonathan Franzone
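The effect described in this thread, and the suggested fix of supplying a stop list without 'in' and 'or', can be modelled in a few lines of plain Java. This is only a sketch of what a stop-word filter does to the query terms. The stop list is copied from the message above, but the class and its methods are invented for illustration and are not Lucene's StopFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordModel {

    /** The stop list quoted in the reply above. */
    static final List<String> DEFAULT_STOP_WORDS = Arrays.asList(
        "a", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
        "into", "is", "it", "no", "not", "of", "on", "or", "s", "such", "t",
        "that", "the", "their", "then", "there", "these", "they", "this",
        "to", "was", "will", "with");

    /** Returns the terms that are not in the stop list (case-insensitive). */
    static List<String> filter(List<String> terms, Collection<String> stopWords) {
        Set<String> stop = new HashSet<>(stopWords);
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            if (!stop.contains(term.toLowerCase(Locale.ROOT))) {
                kept.add(term);
            }
        }
        return kept;
    }

    /** Returns a copy of the list with the given words removed. */
    static List<String> without(List<String> words, String... remove) {
        List<String> result = new ArrayList<>(words);
        result.removeAll(Arrays.asList(remove));
        return result;
    }

    public static void main(String[] args) {
        List<String> states = Arrays.asList("fl", "al", "in");
        System.out.println("default list: " + filter(states, DEFAULT_STOP_WORDS));
        System.out.println("custom list:  "
            + filter(states, without(DEFAULT_STOP_WORDS, "in", "or")));
    }
}
```

With the default list the 'in' term is dropped, and with the reduced list it survives. The same custom list would have to be used at indexing time too, which is the caution the reply raises about data that has already been indexed.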
Re: Software License
Actually, I think the ASL doesn't require this, although it is nice when even commercial entities give credit in some way. I could be wrong about the ASL.

Otis

--- Rafael Luque [EMAIL PROTECTED] wrote:

Hi all,

I know Lucene is a free project; however, I think its use is subject to the Apache Software License (ASL) terms, so someone using Lucene should reference the project, use the 'powered by Lucene' logo, ...

I have suspicions about a company releasing a commercial search engine based on Lucene without mentioning Lucene at all. What kind of action can we take to protect Open Source projects like Lucene from this kind of malicious use?

Thanks,
Re: SegmentTermPositions throwing nullpointer
Have you got the latest Lucene? The nightly build? Try that; this looks like an old bug that has been fixed.

Otis

--- Charles Harvey [EMAIL PROTECTED] wrote:

We are having some bizarre instances where SegmentTermPositions is throwing NullPointerExceptions. It only happens on certain queries, but it happens across different indexes using the same query terms, always quoted. It seems to be obscure multi-word terms in quotes that make this happen.

"randi cohen" and "wacky tobaccy" threw on a Sun box but did not throw on a PC.
"java.lang.null exception pointer" threw on a PC.

Any ideas, anyone? I looked at the class and noticed that no NullPointerExceptions were thrown on purpose. I'm not familiar with the Lucene code, so I'm not too sure what is happening in this process, and the lovely Unknown Source doesn't help out too much...

    java.lang.NullPointerException
        at org.apache.lucene.index.SegmentTermPositions.seek(Unknown Source)
        at org.apache.lucene.index.SegmentTermDocs.seek(Unknown Source)
        at org.apache.lucene.index.IndexReader.termPositions(Unknown Source)
        at org.apache.lucene.search.PhraseQuery.scorer(Unknown Source)
        at org.apache.lucene.search.Query.scorer(Unknown Source)
        at org.apache.lucene.search.IndexSearcher.search(Unknown Source)
        at org.apache.lucene.search.Hits.getMoreDocs(Unknown Source)
        at org.apache.lucene.search.Hits.<init>(Unknown Source)
        at org.apache.lucene.search.Searcher.search(Unknown Source)
        at org.apache.lucene.search.Searcher.search(Unknown Source)

_
The trouble with the rat-race is that even if you win you're still a rat. --Lily Tomlin
_
Charles Harvey
Developer
http://www.philly.com
Wk: 215 789 6057
Cell: 215 588 0851
Re: TimeOut Exception when Indexing with EJB (Please Help)
Hello,

I think you should just try your two suggestions and see. The answer depends on how exactly you do it, the OS configuration, etc. Does this happen on an optimized index, too?

Otis

--- Tihon One [EMAIL PROTECTED] wrote:

Hi all;

I've tried to index a 100K text file into an empty index folder (0 MB of indexed files) and it took 0.77 seconds. However, when my index folder gets larger (~20MB of indexed files), the same 100K text file takes up to 30 seconds. I'm using EJB to do the index processing, and my SessionBean will get a TimeOutException if it takes longer than 30 seconds. I prefer not to set the transaction timeout to a longer time. What will happen when the index folder gets larger (~1GB)?

I understand that the indexing process can be slow, but is there a way that I can speed up the process no matter what the size of my index folder is?

* If I set IndexWriter.mergeFactor = 1000, will it cause a FileNotFoundException (too many open files)? Is there a solution for this error?

* If I use RAMDirectory, will it cause an out-of-memory error? Is there a solution for this error?

Environment:
WebLogic Server 6.1
Java 1.3.1
Documents with 8 Keyword Fields and 10 Text Fields. The files range from 10KB to 3000KB.

Thanks
TiHon
Re: how to parse XHTML
Terry,

These are not really Lucene questions. Lucene will let you index text, but you need to figure out how to parse your XHTML files yourself. Take a look at JTidy on sf.net; I think JTidy can help you with parsing XHTML, or perhaps Xerces from xml.apache.org can.

Otis

--- Terry McGregor [EMAIL PROTECTED] wrote:
> Hi,
>
> I'm new to Lucene, and I was wondering how I should parse XHTML files. Should I name them with the .HTML file extension and use org.apache.lucene.demo.IndexHTML, or name them with the .XML file extension and use an XML parser?
>
> Also, I would like to keep my XHTML files with a .XHTML file extension, if possible, but that's not so important.
>
> Thanks,
> Terry.
Re: phrase query and slop factor
Wouldn't that depend on how far from each other you want to allow them to be? If you have a document with 100 words indexed and you are searching for "first second", wouldn't you have to set the slop to about 100, just in case the word 'first' is the very first word in the document and 'second' is the very last word in your document?

I haven't used the slop factor, so this is only theory :)

Otis

--- Norbert Pabiś [EMAIL PROTECTED] wrote:
> What must the slop factor be to allow any combination of words in a phrase?
>
> --
> Norbert Pabiś
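
For the archive, a rough sketch of the arithmetic behind that guess. It assumes slop is counted as the number of position moves needed to line the terms up in query order, which is how sloppy phrase matching is usually described; it is not Lucene's own code, and `SlopSketch` and `requiredSlop` are made-up names for illustration.

```java
// Sketch only: models slop for a two-term phrase as the number of position
// moves needed to bring the two terms into query order (an assumption about
// how slop is counted, not code taken from Lucene).
final class SlopSketch {
    private SlopSketch() {}

    /**
     * @param firstPos  position of the first query term in the document
     * @param secondPos position of the second query term in the document
     * @return smallest slop under which the two-term phrase would match
     */
    static int requiredSlop(int firstPos, int secondPos) {
        // In the query the second term sits one position after the first,
        // so shift it back by one before measuring the distance.
        return Math.abs((secondPos - 1) - firstPos);
    }

    public static void main(String[] args) {
        System.out.println(requiredSlop(0, 1));   // adjacent, in order
        System.out.println(requiredSlop(0, 99));  // first and last word of 100
        System.out.println(requiredSlop(1, 0));   // adjacent, reversed
    }
}
```

Under that assumption, a 100-word document with the two terms at opposite ends needs a slop of 98, so "about 100" is in the right range, and a reversed pair costs more than an in-order pair at the same distance.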
Re: Virtual Index
If you prefer the old way (multiple indices), you can do that with Lucene, too. Look at the MultiSearcher class. Lucene also supports range queries, which may be helpful. I haven't used them, but they sound like the thing to look at.

Otis

--- Paul Dlug [EMAIL PROTECTED] wrote:
> We have a relatively large (300,000+ documents) set of XML files to index. The files themselves are articles broken up by journal and decade, so that users can restrict their search to specific journals and year ranges. Under our old search engine this was done by creating a separate index for each journal/decade and then creating a virtual index which would search the smaller indexes and put the results together (with scoring preserved).
>
> In Lucene it looks like I would have to build one large index and do something like this:
>
>     title:test (journal:myjournal (year:1990 || year:1991 || year:1992 || year:1993 || year:1994 || year:1995 || year:1996 || year:1997 || year:1998 || year:1999))
>
> Is there a better way to do this?
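
If the single-index route is taken, the long year clause in Paul's query does not have to be typed by hand. The sketch below only builds the query string in the same `year:NNNN || ...` form he shows; `YearClause` and `forRange` are hypothetical names, and whether a range query would serve better is left open, as in the reply above.

```java
// Sketch: builds the "(year:1990 || year:1991 || ...)" clause from Paul's
// example for an inclusive range of years. Plain string handling only.
final class YearClause {
    private YearClause() {}

    static String forRange(int from, int to) {
        StringBuilder sb = new StringBuilder("(");
        for (int year = from; year <= to; year++) {
            if (year > from) {
                sb.append(" || ");
            }
            sb.append("year:").append(year);
        }
        return sb.append(")").toString();
    }
}
```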
Re: Lucene throws an ArrayIndexOutOfBoundsException() if the first te rm in my query string is a stopWord
Hm, I've got the latest Lucene (from CVS) and don't have this issue. The query I tried on our index is:

    +title:of +title:someotherwordthatDOESgetmeresults

Otis

--- Biswas, Goutam_Kumar [EMAIL PROTECTED] wrote:
> Dear Lucene Users
>
> Lucene throws an ArrayIndexOutOfBoundsException if the first term in my query string is a stop word. Why is that?
>
> I'm making AND the default mode of search, so I'm adding an AND operator between each term of my query. That is, if my query is 'cats dogs' I'm rephrasing it as 'cats AND dogs'. But if the first term is a stop word (example: 'of cats ...') I get the ArrayIndexOutOfBoundsException.
>
> I tried something like the following to get around this:
>
>     // 'of', 'by', 'for' are stop words
>     String queryStr = "of AND by AND for AND cats AND dogs";
>     Query query = null;
>     Analyzer myAnalyzer = new MyAnalyzer(stopWords);
>     try {
>         // "content" is the default field to search
>         query = QueryParser.parse(queryStr, "content", myAnalyzer);
>     } catch (ArrayIndexOutOfBoundsException e) {
>         queryStr = queryStr.substring(queryStr.indexOf("AND") + 3);
>     }
>     // so my final queryStr becomes 'cats AND dogs', which works fine!
>
> Is there a better way to handle this situation? Or can someone give me a pointer on why this error is occurring in the first place?
>
> Thanks in advance
> -Goutam
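
One alternative to catching the exception, offered as a sketch rather than a fix for the underlying bug: drop the stop words before joining the terms with AND, so the parser never sees them. This is plain string handling and does not touch Lucene; `StopWordQuery` and `andQuery` are hypothetical names, and it assumes the stop list matches the one the analyzer uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Sketch: builds the "a AND b AND c" query string from raw user input,
// skipping any term that appears in the (lowercase) stop-word set.
final class StopWordQuery {
    private StopWordQuery() {}

    static String andQuery(String userInput, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String term : userInput.trim().split("\\s+")) {
            if (!term.isEmpty() && !stopWords.contains(term.toLowerCase(Locale.ROOT))) {
                kept.add(term);
            }
        }
        // An empty result means the input held nothing but stop words;
        // the caller should skip the search in that case.
        return String.join(" AND ", kept);
    }
}
```

With the stop words 'of', 'by' and 'for', the input 'of by for cats dogs' becomes 'cats AND dogs' in one pass, without relying on an exception for control flow.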
2 exceptions
Hello,

Do these 2 exceptions look familiar to anyone?

java.lang.ArrayIndexOutOfBoundsException: 111
    at java.util.Vector.elementAt(Vector.java(Compiled Code))
    at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:136)
    at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:132)
    at org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:134)
    at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:114)
    at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:166)
    at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:156)
    at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:205)
    at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:91)
    at org.apache.lucene.search.Similarity.idf(Similarity.java:104)
    at org.apache.lucene.search.TermQuery.sumOfSquaredWeights(TermQuery.java:76)
    at org.apache.lucene.search.BooleanQuery.sumOfSquaredWeights(BooleanQuery.java:105)
    at org.apache.lucene.search.Query.scorer(Query.java:91)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:105)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:91)
    at org.apache.lucene.search.Hits.<init>(Hits.java:81)
    at org.apache.lucene.search.Searcher.search(Searcher.java:75)
    at org.apache.lucene.search.Searcher.search(Searcher.java:69)

The second exception that I am getting is this:

java.io.IOException: Interrupted system call
    at java.io.RandomAccessFile.seek(Native Method)
    at org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:271)
    at org.apache.lucene.store.InputStream.refill(InputStream.java:166)
    at org.apache.lucene.store.InputStream.readVInt(InputStream.java(Compiled Code))
    at org.apache.lucene.store.InputStream.readVInt(InputStream.java(Compiled Code))
    at org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:127)
    at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:114)
    at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:166)
    at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:161)
    at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:205)
    at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:91)
    at org.apache.lucene.search.Similarity.idf(Similarity.java:104)
    at org.apache.lucene.search.TermQuery.sumOfSquaredWeights(TermQuery.java:76)
    at org.apache.lucene.search.BooleanQuery.sumOfSquaredWeights(BooleanQuery.java:105)
    at org.apache.lucene.search.Query.scorer(Query.java:91)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:105)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:91)
    at org.apache.lucene.search.Hits.<init>(Hits.java:81)
    at org.apache.lucene.search.Searcher.search(Searcher.java:75)
    at org.apache.lucene.search.Searcher.search(Searcher.java:69)

Any search I make in a multi-threaded environment seems to fail with one of these exceptions. The search code in use looks like this:

    try {
        // if the index has been modified since opened, re-open it.
        if (IndexReader.lastModified(_paIndexDir) >= _paIndexLastMod) {
            _paIndexLastMod = new Date().getTime();
            if (_paIndexSearcher != null)
                _paIndexSearcher.close();
            _paIndexLastMod = new Date().getTime();
        }
        if (_paIndexSearcher == null)
            _paIndexSearcher = new IndexSearcher(_paIndexDir);
    } catch (IOException e) {
        _log.error("Could not open/close IndexSearcher: " + e.getMessage());
        return;
    }

    Query query = null;
    Hits hits = null;
    try {
        query = MultiFieldQueryParser.parse(queryString, new String[] {"title", "description"}, _analyzer);
        hits = _paIndexSearcher.search(query);
    } catch (ParseException e) {
        _log.warn("QueryParser threw ParseException while parsing: " + queryString, e);
    } catch (TokenMgrError e) {
        _log.warn("QueryParser threw TokenMgrException while parsing: " + queryString, e);
    } catch (IOException e) {
        _log.error("IndexSearcher threw IOException while searching for: " + queryString, e);
    }

I'm about to look at the source, but if any of these exceptions look familiar to anyone, or if you see a flaw in the code above, please let me know.

Thanks,
Otis
Re: optimize(), delete() calls on IndexWriter
No, they don't. Note that delete() is in IndexReader, not IndexWriter.

Otis

--- Aruna Raghavan [EMAIL PROTECTED] wrote:
> Hi,
> Do calls like optimize() and delete() on the IndexWriter cause a separate thread to be kicked off?
> Thanks!
> Aruna.
Re: 1.02 download on jakarta.apache.org?
I don't think you are blind. You could get the latest source from CVS, or wait a few weeks, when I hope we will get the new release out...

Otis

--- Shannon Booher [EMAIL PROTECTED] wrote:
> Maybe I'm just blind, but Lucene v1.02 does not appear to be available through jakarta.apache.org.
>
> There is no listing for Lucene under Release Builds, only Milestone and Nightly...
>
> thanks,
> sjb
Re: 2 exceptions
Just for the list/knowledge archive: I found the source of one of the exceptions in my code:

java.io.IOException: Interrupted system call
    at java.io.RandomAccessFile.seek(Native Method)
    at org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:271)
    at ...

The code was:

    // if the index has been modified since opened, re-open it.
    if (IndexReader.lastModified(_paIndexDir) >= _paIndexLastMod) {
        _paIndexLastMod = new Date().getTime();
        if (_paIndexSearcher != null)
            _paIndexSearcher.close();
        _paIndexLastMod = new Date().getTime();
    }
    if (_paIndexSearcher == null)
        _paIndexSearcher = new IndexSearcher(_paIndexDir);

BUG: the last check. What if _paIndexSearcher is != null? It has already been close()d above, so no new searcher is opened and the closed one keeps being used.

The other exception might have been a side effect of the above bug.

Otis
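
For anyone hitting the same thing, here is a minimal sketch of the corrected pattern: after closing the stale searcher, forget the reference so the "open if null" step actually runs. `ClosableSearcher` is a hypothetical stand-in for IndexSearcher, and `SearcherHolder` and its method names are invented for illustration; only the control flow is the point.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for IndexSearcher; only close() matters here. */
interface ClosableSearcher {
    void close();
}

// Sketch: hands out one shared searcher and replaces it when the index
// has been modified since that searcher was opened.
final class SearcherHolder {
    private final Supplier<ClosableSearcher> opener; // e.g. opens a new searcher on the index dir
    private ClosableSearcher current;
    private long openedForVersion = Long.MIN_VALUE;

    SearcherHolder(Supplier<ClosableSearcher> opener) {
        this.opener = opener;
    }

    /** Returns a searcher opened no earlier than the given last-modified stamp. */
    synchronized ClosableSearcher get(long indexLastModified) {
        if (current != null && indexLastModified > openedForVersion) {
            current.close();
            current = null; // the step missing from the original code
        }
        if (current == null) {
            current = opener.get();
            openedForVersion = indexLastModified;
        }
        return current;
    }
}
```

The method is synchronized so that two threads cannot both close and reopen at once. It does not protect a search that is still running on the old searcher when it gets closed; a multi-threaded application would need to handle that separately.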
Re: QueryParser and Double Quotes
I think there is no way to do that, since a double quote is a special character for the query parser. There was some discussion about introducing an escape character to allow things like this, but nothing has come of it yet.

Otis

--- Tony Biag [EMAIL PROTECTED] wrote:
> Is there a way I can search for a phrase containing a double quote? For example, the search string is: 6" nail.
>
> Thanks for any answers.
Re: Maximum indexable data
I haven't heard of any such limit. There is a 'limit' of 10,000 terms on the length of a field, but that is a limit only because that number is hard-coded in the source. However, shouldn't this be very simple for you to test? Index something over and over and see if you ever hit the wall :)

Otis

--- Herman Chen [EMAIL PROTECTED] wrote:
> Hi,
>
> Is there a limit on the amount of data indexable by a segment? If so, is there a limit for searching? I.e., can I give MultiSearcher several indices that are all close to the maximum size?
>
> Thanks.
>
> --
> Herman
Re: Indexing across multiple servers
This is becoming a FAQ...

Not by itself. You have to write an application that collects the data to be indexed, and then feed that data to Lucene.

Otis

--- Ryan Ogaard [EMAIL PROTECTED] wrote:
> Does Lucene support the indexing/searching of multiple servers across the network (file servers, web servers, databases, ...)?
>
> Thank you,
> Ryan
Re: special character handling
It depends on the Analyzer used.

Otis

--- Aruna Raghavan [EMAIL PROTECTED] wrote:
> Hi,
> Does Lucene replace all special characters with spaces when it adds the document to the index?
> Thanks!
RE: special character handling
This is answered in the FAQ: http://jguru.com/faq/view.jsp?EID=538308

--- Aruna Raghavan [EMAIL PROTECTED] wrote:
> Otis,
> I am using StandardAnalyzer.
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, March 12, 2002 3:37 PM
> To: Lucene Users List
> Subject: Re: special character handling
>
> It depends on the Analyzer used.
>
> Otis
>
> --- Aruna Raghavan [EMAIL PROTECTED] wrote:
> > Hi,
> > Does Lucene replace all special characters with spaces when it adds the document to the index?
> > Thanks!
Re: size and nos of documents in the index
Parag,

Indexing time and index size should be proportional to the size of the documents being indexed. Also, I believe a document containing more distinct terms will result in a larger index size increase than a document containing more duplicates. For instance, "I am going to bed in a few moments because I am tired" will result in more unique terms than "Good night".

As for the maximum number of documents that can be indexed, I think there is virtually no limit, other than your hardware and things like that.

Otis

--- Parag Dharmadhikari [EMAIL PROTECTED] wrote:
> Hi all,
> How is indexing affected by the size of the documents, and what is the maximum number of documents which can be indexed?
>
> regards
> parag
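
To make the "unique terms" point concrete, here is a small sketch that counts distinct lowercase words in the two example sentences. The tokenization (split on non-word characters) is a simplification chosen for the example and is not what any particular Lucene analyzer does; `UniqueTerms` is a made-up name.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Sketch: counts distinct terms using a naive tokenizer
// (lowercase, split on runs of non-word characters).
final class UniqueTerms {
    private UniqueTerms() {}

    static int count(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms.size();
    }

    public static void main(String[] args) {
        // 13 words, but "I" and "am" repeat, so 11 distinct terms.
        System.out.println(count("I am going to bed in a few moments because I am tired"));
        // 2 distinct terms.
        System.out.println(count("Good night"));
    }
}
```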
Re: Wildcard Searching
Hello,

This was initially a thread on lucene-user, but I'm copying lucene-dev as well. Sorry about the duplicates.

--- Stefan Bergstrand [EMAIL PROTECTED] wrote:
> Doug Cutting [EMAIL PROTECTED] writes:
>
> Just noticed this problem in my program. It seems as if the analyzer passed to QueryParser.parse() is never passed to PrefixQuery (which is what my test case is parsed to). A quick look in QueryParser.jj confirms this:
>
>     q = new PrefixQuery(new Term(field, term.image.substring(0, term.image.length()-1)));

I thought that queries such as 'rou?d' are considered wildcard queries by QueryParser.jj, and not prefix queries, no? In the default definition of tokens in QueryParser.jj I see this:

    | <PREFIXTERM: <_TERM_START_CHAR> (<_TERM_CHAR>)* "*" >
    | <WILDTERM: <_TERM_START_CHAR> (<_TERM_CHAR> | ( [ "*", "?" ] ))* >

Then further down in QueryParser.jj we have this:

    if (wildcard)
        q = new WildcardQuery(new Term(field, term.image));

So a WildcardQuery is being constructed, not a PrefixQuery, I think.

What I don't understand is why the definition of _TERM_START_CHAR looks like this:

    | <#_TERM_START_CHAR: ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^", "[", "]", "\"", "{", "}", "~", "*" ] >

Maybe the name is misleading, but it seems like _TERM_START_CHAR lists the characters that a TERM can start with, because later in QueryParser.jj we have TERM defined as:

    | <TERM: <_TERM_START_CHAR> (<_TERM_CHAR>)* >

and _TERM_CHAR has this definition:

    | <#_TERM_CHAR: <_TERM_START_CHAR> >

So how can we have a * in _TERM_START_CHAR when terms are not allowed to start with a *? And if we do have *, how come we do not have ? as well?

Can somebody correct me in every place where I made false statements, assumptions, and conclusions?

Thanks,
Otis

> > From: Howk, Michael [mailto:[EMAIL PROTECTED]]
> >
> > Also, Lucene returns the parsed version of each of our searches. When we search by rou*d, Lucene parses it as rou*d (which is what we would expect). But when we search by rou?d, Lucene parses it as "rou d". It seems to wrap the term in quotes and replace the question mark with a space. Any ideas? Or can someone give us an idea of how to understand WildcardQuery or WildcardTermEnum?
>
> It sounds like the problem is in the query parser. Brian?
>
> Doug
>
> --
> Stefan Bergstrand
> Polopoly - Cultivating the information garden
> Ph: +46 8 506 782 67
> Cell: +46 704 47 82 67
> Fax: +46 8 506 782 51
> [EMAIL PROTECTED], http://www.polopoly.com
Re: corrupted index
Oh, I just thought of something (wine does body good). Perhaps one could use Runtime (the class) to catch the JVM shutdown and do whatever is needed to prevent index corruption. I believe there are some shutdown hook methods in there that may let you do that. I'm too lazy to look up the API docs now, but I remember reading about that once, and perhaps it was even mentioned on one of the 2 Lucene mailing lists.

On the other hand, it would be great to have a tool that can verify an existing index. I don't know enough about the actual file structure yet to write something like that, but maybe somebody else has done that already or would like to contribute.

Otis

--- Steven J. Owens [EMAIL PROTECTED] wrote:
> Otis,
>
> > You can remove the .lock file and try re-indexing or continuing indexing where you left off. I am not sure about the corrupt index. I have never seen it happen, and I believe I recall reading some messages from Doug Cutting saying that the index should never be left in an inconsistent state.
>
> Obviously "never should be", but if something's pulling the rug out from under his JRE, changes could be only partially written, right? Or is the writing format in some sense transactionally safe? I've never worked directly on something like this, but I worked at a database software company where they used transaction semantics and a journaling scheme to fake a bulletproof file system. Is this how the index-writing code is implemented?
>
> In general, I can guess Doug's response: just torch the old index directory and rebuild it; Lucene's indexing is fast enough that you don't need to get clever. This seems to be Doug's stance in general (i.e. don't get fancy, I already put all the fanciness you'll need into extremely fast indexing and searching). So far, it seems to work :-).
>
> > I could be making this up, though, so I suggest you search through the lucene-user and lucene-dev archives on www.mail-archive.com. A search for "corrupt" should do it. Once you figure things out, maybe you can post a summary here.
>
> I got a little curious, so I went and did the searches. There is exactly one message in each list archive (dev and users) with the keyword "corrupt" in it.
>
> The lucene-users instance is irrelevant:
> http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00557.html
>
> The lucene-dev instance is more useful:
> http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg00157.html
>
> It's a post from Doug, dated Sept 27, 2001, about adding not just thread-safety but process-safety:
>
>     It should be impossible to corrupt an index through the Lucene API. However if a Lucene process exits unexpectedly it can leave the index locked. The remedy is simply to, at a time when it is certain that no processes are accessing the index, remove all lock files.
>
> So it sounds like it's worth trying just removing the lock files.
>
> Hm, is there a way to come up with a sanity check you can run on an index to make sure it's not corrupted? This might be an excellent thing to reassure yourself with: something went wrong? Run a sanity check; if it fails, just reindex.
>
> Steven J. Owens
> [EMAIL PROTECTED]
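
The shutdown hook idea mentioned at the top of this message corresponds to `Runtime.addShutdownHook(Thread)`, available since Java 1.3. Below is a minimal sketch of registering a hook that closes a resource on JVM exit. `CloseOnExit` and its methods are made-up names, and wrapping an index writer's close() in a `Closeable` is an assumption of the example, not something discussed in the thread.

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch: closes a resource when the JVM shuts down normally
// (System.exit, last non-daemon thread ending, Ctrl-C).
final class CloseOnExit {
    private CloseOnExit() {}

    /** Builds, but does not register, a hook thread that closes the resource. */
    static Thread hookFor(final Closeable resource) {
        return new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    resource.close();
                } catch (IOException e) {
                    System.err.println("close on shutdown failed: " + e.getMessage());
                }
            }
        }, "close-on-exit");
    }

    /** Registers the hook with the JVM and returns it so it can be removed later. */
    static Thread register(Closeable resource) {
        Thread hook = hookFor(resource);
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

A hook of this kind does not run if the process is killed outright or the JVM crashes, so it narrows the window for a leftover lock file rather than closing it; removing stale lock files, as described in Doug's quoted post, would still be needed in those cases.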
Re: Lucene Bugs
Hola,

I don't have years of search engine writing experience either, but I did look at your reports on Sourceforge earlier, and I will try to look at the source to see if they are the right fixes. I haven't used DateFilter, which, I think, you said contains the bug, so no promises, but I'll look. That part of the code might have changed since your reports, and I may have trouble locating the lines you mentioned, so I may ask you to point me to the right lines in the new source. Tomorrow or Monday. Right now I have to go kill some crapes and go to bed.

Otis

--- David Smiley [EMAIL PROTECTED] wrote:
> Oh, I *have* downloaded the CVS source, and I actually did *fix* (maybe) two of these three bugs, and I did *submit* exactly what I did to fix them to the Sourceforge tracker / mailing list for public review (but not in diff/patch format, since they were one-liners). The problem is that much of Lucene is very complicated (understandably so), and I never got someone more familiar with Lucene's more complicated parts (like Doug, or perhaps some others here) to respond and say whether my fix was correct and completely addresses the issue. Not one person responded, except for some other guy who said he experienced the same bug and that nobody responded to his bug report either :-(. For the 3rd bug, the one that I didn't fix, I took the time to write a test program that showed the bug.
>
> What's needed now for these bugs to be squashed is someone who really knows Lucene's complicated parts to verify that my 2 fixes are sufficient and to at least investigate the 3rd bug. I'm not the one with years of search-engine writing experience ;-).
>
> I really appreciate your response, by the way; it's a welcome change... and an initial step.
>
> ~ Dave Smiley
>
> On Saturday, March 16, 2002, at 08:59 PM, Andrew C. Oliver wrote:
> > You need not be asked, help is always wanted. How about, instead of submitting bugs, submit patches. Simply get the sources via CVS (click on CVS Repository on the Jakarta front page), fix the bugs, and then do cvs diff -u to create patches. Post those into Bugzilla and put [PATCH] on the summary line, and I think you'll find them applied rather quickly.
> >
> > -Andy
Re: Multiple field searching
I'm using MultiFieldQueryParser and it works for me.

Otis

--- Tate Jones [EMAIL PROTECTED] wrote:
> hi,
>
> I am trying to search across multiple fields using the following query:
>
>     +keyword:computers +subject:News content:xml
>
> or
>
>     +(keyword:{computers}) +(subject:{News}) content:xml
>
> I have added the fields to the document correctly. I have also tried using the MultiFieldQueryParser, without success.
>
> The only query that works is the following, which is not correct, as the clauses are ORs:
>
>     keyword:computers subject:IT content:xml
>
> Is anyone having the same problems?
>
> Thanks in advance
> Tate
RE: Question Deleting/Reindexing Files
The standard answer is: try deleting/adding in batches instead of individually. That seems more efficient, too, if you can write your application that way. It is essentially what you are doing by writing to a separate index and then doing a bunch of deletions, followed by re-additions. I know I'm stating the obvious, but I wanted to get this out of the way :)

Otis

--- Spencer, Dave [EMAIL PROTECTED] wrote:
> [1] There's no update, so delete and then add is what you want.
>
> [2] I have had the same problems with using an IndexWriter and an IndexReader at the same time and getting a locking problem when deleting. I think I sent mail to the list with a test case a week ago [disclaimer: this is not a complaint!] and I think the issue is still open. Maybe I should turn this into a bug report? I know fixing bugs is encouraged, but I don't have enough context about the right solution, or about how the locking apparently changed to foul this up, though I did look through things.
>
> My workaround was to write new entries to a new index and then run a separate merge utility that first does a delete pass, and then reopens and does adds, based on a primary key (the URL of each doc in my case).
>
> -----Original Message-----
> From: Joe Hajek [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, March 20, 2002 12:28 AM
> To: [EMAIL PROTECTED]
> Subject: Question Deleting/Reindexing Files
>
> > Hi,
> >
> > I am using Lucene for indexing a relatively large article-based system where articles change from time to time, so I have to reindex them. Reindexing had the effect that a query would return the hit for a file multiple times (according to the number of updates).
> >
> > The only solution to that problem I found was to delete the file to be updated before indexing it again. Is there another possibility?
> >
> > As the system is large, I am collecting the articles that have to be updated together, opening a writer, and adding the documents to the index.
> >
> > This solution worked fine for me using rc1. In rc4 it seems that it is no longer possible to delete a file from an index while the index is opened for writing.
> >
> > Do you know any solutions to that problem?
> >
> > Thanks a lot in advance,
> > regards
> > joe
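
The two-pass workaround described in this thread (a delete pass keyed on a primary key, then an add pass) can be sketched without any Lucene classes. Here the "index" is just a list of entries and the key is the document URL; `BatchUpdate` and `Doc` are hypothetical stand-ins. In a real application the first pass would go through the reader's delete and the second through the writer, with only one of them open at a time.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Sketch: applies a batch of updates to a toy index in two passes,
// so no key ever ends up with more than one live entry.
final class BatchUpdate {
    private BatchUpdate() {}

    /** Toy index entry: a primary key (e.g. the document URL) plus its text. */
    static final class Doc {
        final String key;
        final String text;

        Doc(String key, String text) {
            this.key = key;
            this.text = text;
        }
    }

    static void apply(List<Doc> index, Map<String, String> updates) {
        // Pass 1: delete every existing entry whose key is being updated.
        for (Iterator<Doc> it = index.iterator(); it.hasNext();) {
            if (updates.containsKey(it.next().key)) {
                it.remove();
            }
        }
        // Pass 2: add the new versions.
        for (Map.Entry<String, String> update : updates.entrySet()) {
            index.add(new Doc(update.getKey(), update.getValue()));
        }
    }
}
```

Skipping the first pass reproduces the symptom Joe describes: each re-added document leaves its older copies behind, so a query returns one hit per update.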
Re: Multiple field searching
--- Kelvin Tan [EMAIL PROTECTED] wrote:
> hmmm...really? My impression was that the ANDs are treated equivalently with +s by the parser, so they're redundant.

Correct.

> The { and }s aren't part of the syntax, are they?

I was wondering where those came from. I don't think I've seen them in QueryParser.jj.

Otis

> ----- Original Message -----
> From: Mehran Mehr [EMAIL PROTECTED]
> To: Lucene Users List [EMAIL PROTECTED]; Kelvin Tan [EMAIL PROTECTED]
> Sent: Thursday, March 21, 2002 8:11 PM
> Subject: Re: Multiple field searching
>
> > this is the right syntax:
> >
> >     +(keyword:{computers}) AND +(subject:{News}) AND content:xml
Re: Older versions of Lucene?
Maybe you can find something in Lucene's old repository on Sourceforge.net.

Otis

--- Robert A. Decker [EMAIL PROTECTED] wrote:
> I'm on Java 1.1.8, and can't upgrade beyond that for quite some time due to testing requirements.
>
> I've managed to compile in and use the 1.2 StringBuffer class that is required by Lucene. However, I'm getting tons of 'Integer constant out of range' errors when building. For example:
>
>     0xfffeL, 0xL, 0xL, 0xL
>
> are all out of range...
>
> Did the size of a long change from 1.1.8 to 1.2? If so, is there a way to use 1.1.8 and Lucene? If not, is it possible to use an older version?
>
> thanks,
> rob
Re: TokenManager's longs too long
Sorry, I have no experience with JDK 1.1.8 together with Lucene or JavaCC. Sounds like a question for the WebGain folks.

Otis

--- Robert A. Decker [EMAIL PROTECTED] wrote:
> I'm stuck on JDK 1.1.8 and can't upgrade for some time. I'm using JavaCC to create some Java code from a .jj file provided by the Lucene project at lucene.jakarta.org.
>
> I'm running into a problem where the long data types found in the XXXTokenManager.java files are too long for my version of Java. For example, these are all too long:
>
>     static final long[] jjbitVec0 = { 0xfffeL, 0xL, 0xL, 0xL };
>
> Is this a familiar problem? I just joined the mailing list. I've been looking around the documentation at WebGain, but can't find a mention of this.
>
> Is there a solution to this?
>
> thanks,
> rob
Re: Lucene Bugs
Hello,

--- David Smiley [EMAIL PROTECTED] wrote:
> I have reported bugs about Lucene in the fall of 2001 but no Lucene developer has responded. I am sending this summary as a reminder.
>
> My original message to the mailing list is here:
> [Lucene-dev] More bugs
> http://www.geocrawler.com/archives/3/2626/2001/8/0/6409669/
>
> The bugs at SourceForge are here:
>
> DateFilter: call enum.next() first

DateFilter.java has changed since the report, but I think I found the piece of code that you were referring to. After looking at DateFilter, TermEnum, and FilteredTermEnum, it seems to me that next() does not need to be called first. This is not a java.util.Enumeration enum, it is a TermEnum enum. Also, if you look at the methods next() and term() in FilteredTermEnum, you'll see that term() does need to be called first, otherwise the first term would get skipped.

I'm not very familiar with this code, but this is what it seems like from looking at it for 7:32 minutes.

Otis
Re: Lucene Bugs
Hello, SegmentTermEnum.clone(), term == null http://sourceforge.net/tracker/index.php?func=detail&aid=451315&group_id=3922&atid=103922 Aha, this was a bug, indeed, but it looks like it was fixed about 6 months ago: revision 1.2 date: 2001/10/11 15:14:14; author: scottganyo; state: Exp; lines: +1 -1 Fix NullPointerException in clone() method when the Term is null. Otis
Re: Lucene Bugs
Hello, Has anyone else observed this behaviour? Wrong ordering from Document.fields() http://sourceforge.net/tracker/index.php?func=detail&aid=451317&group_id=3922&atid=103922 It looks like java.util.Enumeration is used to store the fields, so if Enumeration guarantees order then this should, too. Could you please provide a self-contained test case that I can just put somewhere, compile, and run? I can't compile the snippet in the above bug report. No software is bug free; I just want to help make Lucene better. If I can be of any help, please ask. Thanks! Otis
Re: Field search matching exact and partial occurence
Aero* Look at Wildcard and Prefix queries. Otis --- RAYMOND Romain [EMAIL PROTECTED] wrote: Hello, Is there a way to do a query where I search on a field XX and retrieve the exact or partially matching fields ... for example a query on aero will return aeronef, aerosol, aero-finder ... Thanks.
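Editor's sketch: a prefix query such as aero* can be thought of as a walk over the sorted list of indexed terms, starting at the prefix and stopping at the first term that no longer begins with it. The plain-Java example below illustrates only that idea. It is not Lucene's PrefixQuery code, and the class name PrefixTermsSketch and the sample terms are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class PrefixTermsSketch {

    /** Returns every term that starts with the given prefix, in sorted order. */
    public static List<String> termsWithPrefix(TreeSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<String>();
        // tailSet(prefix) begins at the first term that sorts at or after the prefix.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so no later term can match either
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical term dictionary for one field.
        TreeSet<String> terms = new TreeSet<String>();
        terms.add("aeronef");
        terms.add("aerosol");
        terms.add("aero");
        terms.add("air");
        terms.add("zero");
        System.out.println(termsWithPrefix(terms, "aero"));
    }
}
```

Whether a value like aero-finder is found this way depends on how the analyzer tokenized it at index time, which this sketch does not model.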
Re: TokenManager's longs too long
www.webgain.com --- Robert A. Decker [EMAIL PROTECTED] wrote: Aren't the webgain people on this mailing list? If not, how do I contact them? I've been looking around the javacc pages, but can only find the email address for this mailing list... thanks, rob On Thu, 21 Mar 2002, Otis Gospodnetic wrote: Sorry, no experience with JDK 1.1.8 and Lucene nor JavaCC. Sounds like a question for WebGain folks. Otis --- Robert A. Decker [EMAIL PROTECTED] wrote: I'm stuck on jdk 1.1.8 and can't upgrade for some time. I'm using javacc to create some java code from a .jj file provided by the Lucene project at lucene.jakarta.org. I'm running into a problem where the long data types found in the XXXTokenManager.java files are too long for my version of java. For example, these are all too long: static final long[] jjbitVec0 = { 0xfffeL, 0xL, 0xL, 0xL }; Is this a familiar problem? I just joined the mailing list. I've been looking around the documentation at webgain, but can't find a mention of this. Is there a solution to this? thanks, rob
Re: StopFilter-troubles
--- [EMAIL PROTECTED] wrote: Dear Lucene-users, does someone have an answer to the following question: If I add a StopFilter to my Analyzer, the stopwords I gave it will be left out of the query. So far, so good. But when my query is like this one: (field1 : x) AND (field2 : stopword) AND (field 1 : y) the StopFilter will do its work, but the resulting query is a big mess : (field1 : x) AND ( ) AND (field 1 : y), and because of that the search results are no good. I hoped it would search for (field1 : x) AND (field 1 : y). I think the StopFilter does a poor job here. Is anyone familiar with this problem and has an answer for me? Puk Witte. I tried something like this on one Lucene index: description:travel AND description:a The results were the same as this query: description:travel This seems right to me. Otis
RE: StopFilter-troubles
I don't know enough about the query parser to be able to answer that question, but why do you really need those parentheses? It would also be great if you could submit this as a bug at http://jakarta.apache.org/lucene/ Thanks, Otis --- [EMAIL PROTECTED] wrote: Dear all, especially Otis Gospodnetic (thanks for your answer), without ( )'s the StopFilter is doing a good job indeed, but if I put them around parts of the query, then the searchResult is wrong. For example: (field1 : x) AND (field2 : stopword) AND (field 1 : y) So I'm afraid my problem is not solved yet. But maybe someone can try it with the ()'s with his own tool and tell me if they've got the same problem. Then I know whether I made a mistake. Puk Witte
Re: What do reader-valued Fields do?
This means that you can make searches against that field, but cannot retrieve its original value. Otis --- Robert A. Decker [EMAIL PROTECTED] wrote: What should I use to store and add to my Document a long String? (thousands of characters) I'm still having difficulty understanding what it means to create a field with a reader value: String aString = fieldName; StringReader aStringReader = new StringReader(someLongText); Field field = Field.Text(aString, aStringReader); The documentation says that this will be tokenized and indexed, but is not stored in the index verbatim. I'm using this to store a long text field - an entire document. However, in my case, nothing appears to be stored in the index! What do they mean by not being stored verbatim? I assumed this to mean that it would run the text through my analyzer, at the least, and perhaps, further, store it as a serialized form. thanks rob
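Editor's sketch: the distinction between "indexed" and "stored" can be shown with a toy inverted index in plain Java. Tokens always go into a postings map so the document can be found, but the original text is kept only when the caller asks for it. This is a conceptual illustration under that assumption, not Lucene's implementation, and the class and method names (IndexedNotStoredSketch, add, search, storedValue) are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class IndexedNotStoredSketch {

    // token -> ids of the documents containing it (the "indexed" part)
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();
    // doc id -> original text, kept only on request (the "stored" part)
    private final Map<Integer, String> stored = new HashMap<Integer, String>();

    /** Tokenizes and indexes the text; keeps the original only if store is true. */
    public void add(int docId, String text, boolean store) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            Set<Integer> docs = postings.get(token);
            if (docs == null) {
                docs = new HashSet<Integer>();
                postings.put(token, docs);
            }
            docs.add(docId);
        }
        if (store) {
            stored.put(docId, text);
        }
    }

    /** Returns the ids of documents containing the token (empty set if none). */
    public Set<Integer> search(String token) {
        Set<Integer> docs = postings.get(token.toLowerCase());
        return docs == null ? new HashSet<Integer>() : docs;
    }

    /** Returns the original text, or null if it was indexed without being stored. */
    public String storedValue(int docId) {
        return stored.get(docId);
    }

    public static void main(String[] args) {
        IndexedNotStoredSketch index = new IndexedNotStoredSketch();
        index.add(1, "An entire long document body", false); // searchable only
        index.add(2, "A short title", true);                 // searchable and retrievable
        System.out.println(index.search("document"));
        System.out.println(index.storedValue(1));
        System.out.println(index.storedValue(2));
    }
}
```

In this model, document 1 corresponds to the reader-valued field in the question: a search finds it, but asking for its text returns nothing.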
Re: corrupted index
Hello, Nobody has contributed a tool that verifies index integrity yet. Is this the latest version of Lucene? Are you hitting the 2GB/file limit? Just some ideas. Otis --- H S [EMAIL PROTECTED] wrote: Dear All, We are experiencing a problem with index updates. We have a fairly large index (10 gigabytes). There are no problems searching it. But when we add a single file and then try to optimize, optimization fails with a null pointer exception in RandomAccessFile.seek. Has anybody come across this problem? Is there a way to tell whether an index is corrupted? Thanks very much - Hinrich Schuetze
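Editor's sketch: one quick way to check the "2GB/file limit" idea is to list the files in the index directory that are at or near that size. The plain-Java helper below does only that. It says nothing about whether an index is corrupted, and the class name LargeIndexFileCheck, the 90% warning threshold, and the default directory are assumptions for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class LargeIndexFileCheck {

    /** 2 GB in bytes; written with a long literal so the product does not overflow int. */
    public static final long TWO_GB = 2L * 1024 * 1024 * 1024;

    /** Returns the regular files directly inside dir whose size is at least thresholdBytes. */
    public static List<File> filesAtLeast(File dir, long thresholdBytes) {
        List<File> result = new ArrayList<File>();
        File[] children = dir.listFiles();
        if (children == null) {
            return result; // dir does not exist, is not a directory, or cannot be read
        }
        for (File child : children) {
            if (child.isFile() && child.length() >= thresholdBytes) {
                result.add(child);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The index directory is taken from the command line; "." is only a fallback.
        File indexDir = new File(args.length > 0 ? args[0] : ".");
        long warnAt = TWO_GB / 10 * 9; // warn at roughly 90% of 2 GB
        for (File f : filesAtLeast(indexDir, warnAt)) {
            System.out.println(f.getName() + " is " + f.length() + " bytes");
        }
    }
}
```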
Re: compiling lucene
JavaCC 2.1 works, too. This is how I have it set up:
[otis@linux2 otis]$ ls -al /usr/local/.version/javacc2.1/
total 44
drwxrwxr-x    6 otis otis 4096 Jan 28 06:50 .
drwxr-xr-x   20 otis otis 4096 Apr  2 23:32 ..
drwxrwxr-x    3 otis otis 4096 Jan 28 06:50 bin
-rw-rw-r--    1 otis otis 8518 Jan 28 06:50 COPYRIGHT
drwxrwxr-x    2 otis otis 4096 Jan 28 06:50 doc
drwxrwxr-x   21 otis otis 4096 Jan 28 06:50 examples
-rw-rw-r--    1 otis otis 5599 Jan 28 06:50 README
drwxrwxr-x    5 otis otis 4096 Jan 28 06:50 src
[otis@linux2 otis]$ ls -al ~/cvs-repositories/jakarta/jakarta-lucene/lib/
total 132
drwxrwxr-x    3 otis otis   4096 Jan 28 15:28 .
drwxrwxr-x    9 otis otis   4096 Mar 27 23:28 ..
drwxrwxr-x    2 otis otis   4096 Jan 28 15:29 CVS
lrwxrwxrwx    1 otis otis     36 Jan 28 06:55 JavaCC.zip -> /usr/local/javacc/bin/lib/JavaCC.zip
-rw-rw-r--    1 otis otis 117522 Jan 28 15:23 junit_37.jar
Otis --- Victor Hadianto [EMAIL PROTECTED] wrote: Hi list, I'm having a problem compiling lucene from scratch. I checked out lucene 1.2 rc4 from cvs and I am missing one vital component: JavaCC 2.0. The latest javaCC that I can get from webgain is 2.1 and just dropping the thing to lucene/lib directory does not work quite well, I had a look and the class name expected by lucene build file is quite different from JavaCC 2.1 Is there someplace where I can get JavaCC 2.0 that works with lucene? Thanks, -- Victor Hadianto
Re: storing index in third party database.
If you want to store indices in a database, search the mailing list archives for SqlDirectory. Once I considered using it for one application at work, so I asked its author about performance. The answer was that it doesn't perform all that well when the index grows, if I recall correctly. Consequently, we chose to use file-based indices instead. Otis --- [EMAIL PROTECTED] wrote: Hi all I want to index the data which I already stored in a thirdparty database table and develop a search facility using lucene. I am thinking of storing these indexes back to the database in another table. I know for this we have to create a 'directory' which does all the indexing operations, for example IndexWriter indwriter = new IndexWriter(dirStore, null, create); where dirStore is the directory, create is boolean. But I don't know the format to be followed for the directory (dirStore). Please help me if anybody has done a similar thing. TIA Amith
Re: Custom queries
name != 'pradeep' can be written as -name:pradeep I think there is also support for the date query below, but I haven't used it yet, so I don't want to give you any wrong information. Otis --- Pradeep Kumar K [EMAIL PROTECTED] wrote: Hi lucene friends! Is there any way to create custom queries? Just for example I want to create a query like name != 'pradeep' creationDatedateVar. TIA Pradeep -- Robosoft Technologies, Mangalore, India
Re: JavaCC error when installing with Ant
Do you have Ant's optional.jar in Ant's lib directory? --- David Black [EMAIL PROTECTED] wrote: Ant returns the following error. Any ideas? ... lucene-1.2-rc4-src/build.xml:92: Could not create task of type: javacc. Common solutions are to use taskdef to declare your task, or, if this is an optional task, to put the optional.jar in the lib directory of your ant installation (ANT_HOME). ... I altered the build.properties file to reflect my version of javacc # Home directory of JavaCC javacc.home = /usr/local/java/javacc2.1 javacc.zip.dir = ${javacc.home}/lib javacc.zip = ${javacc.zip.dir}/JavaCC.zip
HTML parser
Hello, I need to select an HTML parser for the application that I'm writing and I'm not sure what to choose. The HTML parser included with Lucene looks flimsy, JTidy looks like a hack and an overkill, using classes written for Swing (javax.swing.text.html.parser) seems wrong, and I haven't tried David McNicol's parser (included with Spindle). Somebody on this list must have done some research on this subject. Can anyone share some experiences? Have you found a better HTML parser than any of those I listed above? If your application deals with HTML, what do you use for parsing it? Thanks, Otis
Re: HTML parser
Hello Terrence, Ah, you got me. I guess I need a bit of both. I need to just strip HTML and get raw body text so that I can stick it in Lucene's index. I would also like something that can extract at least the <title>...</title> stuff, so that I can stick that in a separate field in Lucene index. While doing that I, like you, need to be able to handle poorly formatted web pages. In the future I may need something that has the ability to extract HREFs, but I'll stick to one of the XP principles and just look for something that meets current needs :) I looked for an ANTLR-based HTML parser a few days ago, but must have missed the one you pointed out. I'll take a look at it now. Can you share or describe your stripHTML method? Simple java that looks for <s and >s or something smarter? Thanks, Otis P.S. This type of thing makes me wish I could use Perl or Python :) --- Terence Parr [EMAIL PROTECTED] wrote: On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote: Hello, I need to select an HTML parser for the application that I'm writing and I'm not sure what to choose. The HTML parser included with Lucene looks flimsy, JTidy looks like a hack and an overkill, using classes written for Swing (javax.swing.text.html.parser) seems wrong, and I haven't tried David McNicol's parser (included with Spindle). Somebody on this list must have done some research on this subject. Can anyone share some experiences? Have you found a better HTML parser than any of those I listed above? If your application deals with HTML, what do you use for parsing it? Hi Otis, I have an HTML parser built for ANTLR, but it's pretty strict in what it accepts. Not sure how useful it will be for you, but here it is: http://www.antlr.org/grammars/HTML I am not sure what your goal is, but I personally have to scarf all sorts of HTML from various websites to suck them into the jGuru search engine. I use a simple stripHTML() method I wrote to handle it. Works great. Kills everything but the text. Is that the kind of thing you are looking for or do you really want to parse not filter? Terence -- Co-founder, http://www.jguru.com Creator, ANTLR Parser Generator: http://www.antlr.org
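Editor's sketch: the two needs described above (raw body text for the index, plus the title for a separate field) can be met for simple pages with a few regular expressions. The example below is a guess at what a filter-style approach might look like. It is not Terence's stripHTML() method, which was not posted in this thread, and the class name HtmlTextSketch is an assumption. It does not decode entities such as &amp;amp; and will be fooled by a literal > inside an attribute value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlTextSketch {

    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Script and style bodies are not page text, so they are dropped whole.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)[^>]*>.*?</\\1\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Returns the trimmed contents of the first title element, or "" if there is none. */
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    /** Removes scripts, styles, comments and tags, then collapses whitespace. */
    public static String stripHtml(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Deliberately sloppy markup: unclosed <p>, mixed-case tags.
        String page = "<HTML><head><TITLE> Round Window </TITLE></head>"
                + "<body><p>Hello <b>world</b><!-- hidden --></body>";
        System.out.println(extractTitle(page));
        System.out.println(stripHtml(page));
    }
}
```

Because the tag pattern only looks for < and >, unclosed or badly nested elements do not break it, which is the main attraction of filtering over strict parsing for messy pages. Note that the title text also appears in the stripped output unless the head is removed first.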
Re: Wildcard query problem with ?
Hm, I just went through all the diffs after RC2 (QueryParser.jj revision 1.3) and I can't see where '?' was dropped. However, one user reported this on February 27th: We just tried adding the ? character to QueryParser.jj under #_TERM_START_CHAR. We noticed that the * was in that list, so we figured we'd just give it a try. It seems to have worked. Now when we search on rou?d, we get hits on the word round. We're going to try searching for some other variations to make sure that we've done the right thing. We'd still be interested to know exactly why this worked (assuming it continues to solve our problem). What is a TERM_START_CHAR and how is it used? Obviously it does something important. :-) So I'll try your code, and if wildcards really don't work I'll try this person's suggestion, and if that works I'll commit it. Otis --- Ralf Hettesheimer [EMAIL PROTECTED] wrote: Hello, I had been using RC2 until yesterday, when I tried the latest nightly build. Now it seems that I can no longer search for wildcard queries with a question mark. For example in my index there are two documents, one containing the word meier and another one containing the word maier. With RC2 I could search for m?ier and got two hits. With anything later (I tried RC3, RC4 and the nightly builds from 1704 and 1804) I get 0 hits. When searching for mei?r the same, 1 hit with RC2 and 0 hits with RC4. The QueryParser from RC2 generated a BooleanQuery and the QueryParser from RC4 generates a PhraseQuery. I have attached the source code of a little test program and output from the debugger. Could somebody explain this behaviour? Thanks Ralf Hettesheimer Attachments: TestQueryParser.java, debugrc2.gif, debugrc4.gif
RE: HTML parser
Such classes are not included with Lucene. This was _just_ mentioned on this list earlier today. Look at the archives and search for crawler, URL, lucene sandbox, etc. Otis --- Ian Forsyth [EMAIL PROTECTED] wrote: Are there core classes in lucene that allow one to feed lucene links, and 'it' will capture the contents of those urls into the index.. or does one write a file capture class to seek out the url, store the file in a directory, then index the local directory.. Ian -Original Message- From: Terence Parr [mailto:[EMAIL PROTECTED]] Sent: Friday, April 19, 2002 1:38 AM To: Lucene Users List Subject: Re: HTML parser On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote: :snip Hi Otis, I have an HTML parser built for ANTLR, but it's pretty strict in what it accepts. Not sure how useful it will be for you, but here it is: http://www.antlr.org/grammars/HTML I am not sure what your goal is, but I personally have to scarf all sorts of HTML from various websites to suck them into the jGuru search engine. I use a simple stripHTML() method I wrote to handle it. Works great. Kills everything but the text. Is that the kind of thing you are looking for or do you really want to parse not filter? Terence -- Co-founder, http://www.jguru.com Creator, ANTLR Parser Generator: http://www.antlr.org
RE: Wildcard Searching
Did the change that you mentioned below really work for you? I wrote this class: http://nagoya.apache.org/bugzilla/showattachment.cgi?attach_id=1638 and it looks like the bug is not in QueryParser, but in some other Java class (could it be WildcardTermEnum?), since my test class does not use QueryParser and still demonstrates that WildcardQuery doesn't work properly. Thanks, Otis Attachment: WildcardQuestionmarkTest.java --- Howk, Michael [EMAIL PROTECTED] wrote: We just tried adding the ? character to QueryParser.jj under #_TERM_START_CHAR. We noticed that the * was in that list, so we figured we'd just give it a try. It seems to have worked. Now when we search on rou?d, we get hits on the word round. We're going to try searching for some other variations to make sure that we've done the right thing. We'd still be interested to know exactly why this worked (assuming it continues to solve our problem). What is a TERM_START_CHAR and how is it used? Obviously it does something important. :-) -Original Message- From: Howk, Michael [mailto:[EMAIL PROTECTED]] Sent: Wednesday, February 27, 2002 11:14 AM To: 'Lucene Users List' Subject: RE: Wildcard Searching The StandardAnalyzer uses a lowercase filter, but we tried indexing the round hat, just to make sure. The * still worked, but the ? still failed. We noticed that the ? character is listed in the QueryParser as a WILDTERM. But after that, the code heads into the WildcardQuery class, and we get lost amidst setEnum() and wildcardEquals() stuff. :-) Seriously though, we're using the StandardAnalyzer directly from Lucene. I suppose it's possible that the ? is a special character that's getting stripped out. But we need help to find out exactly where the special characters are defined or filtered. Michael -Original Message- From: Aruna Raghavan [mailto:[EMAIL PROTECTED]] Sent: Wednesday, February 27, 2002 11:00 AM To: 'Lucene Users List' Subject: RE: Wildcard Searching From my experience with wildcards: 1. They are case sensitive while the regular queries aren't. 2. Only one wildcard is allowed in a word. If you are using this with a bool query, you can use something like the following: (asas*) AND (fhg*fd). This is acceptable. 3. There is a requirement of using at least one character before the wildcard in a query (*fhhd is not valid). 4. Special characters are not supported (? may be a special character). Hope this helps! -Original Message- From: Howk, Michael [mailto:[EMAIL PROTECTED]] Sent: Wednesday, February 27, 2002 10:56 AM To: Lucene Mailing List (E-mail) Subject: Wildcard Searching We're really struggling with trying to understand why the WildcardQuery seems to strip out the question mark by replacing it with a space. We're using the daily build, and a StandardAnalyzer. We've got the text The Round Window in our index. If we search on roun* the Lucene QueryParser returns a hit. When we search on roun?, we don't get any hits. We don't even know how to make heads or tails of the WildcardQuery or WildcardTermEnum classes. Also, Lucene returns the parsed version of each of our searches. When we search by rou*d, Lucene parses it as rou*d (which is what we would expect). But when we search by rou?d, Lucene parses it as "rou d". It seems to wrap the term in quotes and replace the question mark with a space. Any ideas? Or can someone give us an idea of how to understand WildcardQuery or WildcardTermEnum? Michael
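Editor's sketch: for readers trying to follow what a wildcard term is supposed to match, the intended semantics are that * stands for zero or more characters and ? stands for exactly one. The plain-Java matcher below implements just those semantics against a single term. It is not the code in Lucene's WildcardTermEnum or its wildcardEquals() method, and the class name WildcardMatchSketch is an assumption.

```java
public class WildcardMatchSketch {

    /**
     * Returns true if text matches pattern, where '*' matches any run of
     * characters (including none) and '?' matches exactly one character.
     */
    public static boolean matches(String pattern, String text) {
        int p = 0;          // position in pattern
        int t = 0;          // position in text
        int starP = -1;     // position of the most recent '*' in pattern
        int starT = -1;     // text position that '*' is currently assumed to cover up to
        while (t < text.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                // Remember the star and first try letting it match nothing.
                starP = p;
                starT = t;
                p++;
            } else if (p < pattern.length()
                    && (pattern.charAt(p) == '?' || pattern.charAt(p) == text.charAt(t))) {
                p++;
                t++;
            } else if (starP >= 0) {
                // Mismatch: back up and let the last star absorb one more character.
                p = starP + 1;
                starT++;
                t = starT;
            } else {
                return false;
            }
        }
        // Any pattern characters left over must all be stars.
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;
        }
        return p == pattern.length();
    }

    public static void main(String[] args) {
        // Terms are lowercase here because the thread notes that StandardAnalyzer lowercases them.
        System.out.println(matches("roun*", "round"));
        System.out.println(matches("rou?d", "round"));
        System.out.println(matches("roun?", "round"));
        System.out.println(matches("m?ier", "maier"));
        System.out.println(matches("rou?d", "roud"));
    }
}
```

Under these semantics roun?, rou?d and m?ier should all match the terms from the reports in this thread, which is why a result of zero hits points at parsing or term enumeration rather than at the queries themselves.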
Re: HTML parser
Laura, http://marc.theaimsgroup.com/?l=lucene-user&w=2&r=1&s=Spindle&q=b Oops, it's JoBo, not MoJo :) http://www.matuschek.net/software/jobo/ Otis --- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi Otis, thanks for your reply. I have been looking for Spindle and Mojo for 2 hours but I haven't found anything. Can you help me? Where can I find something? Thanks for your help and time Laura Laura, Search the lucene-user and lucene-dev archives for things like: crawler, spider, spindle, lucene sandbox. Spindle is something you may want to look at, as is MoJo (not mentioned on lucene lists, use Google). Otis Did someone solve the problem of recursively spidering web pages? While trying to research the same thing, I found the following... here's a good example of link extraction. Try http://www.quiotix.com/opensource/html-parser It's easy to write a Visitor which extracts the links; should take about ten lines of code.
Re: Error with StandardTokenizer.java and Token.java
Hello, Get the latest version, try again, paste the error if you get it, and use the lucene-user list instead; more eyeballs and brains will see your problem on that list. Thanks, Otis --- Jacob Gutierrez [EMAIL PROTECTED] wrote: Hi there... Using the latest version of StandardTokenizer.jj and using JavaCC (ver 2.1) I get 7 java files, among them StandardTokenizer.java and Token.java The Token class has these attributes:
public final class Token {
  String termText;        // the text of the term
  int startOffset;        // start in source text
  int endOffset;          // end in source text
  String type = "word";   // lexical type
}
And the StandardTokenizer in its next() function has this code:
new org.apache.lucene.analysis.Token(token.image, token.beginColumn, token.endColumn, tokenImage[token.kind]);
Giving an error of Variable not found. Why is this error happening?? Do I have to manually modify the file created by JavaCC??? Any help will be appreciated. Jacob Gutiérrez R. Cochabamba - Bolivia
Re: Cannot compile Lucene
Just curious, what exactly do people need to do to 'fix up the exceptions'? Which files need editing, and what should be changed to what? I'd just like to document that somewhere, that's why I'm asking... Otis --- Robert A. Decker [EMAIL PROTECTED] wrote: I got it working under Project Builder. You just have to fix up the exceptions yourself. Also, you'll get some warnings (121 warnings to be exact) during the linking stage stating that an Integer Constant is too large - just ignore these - they're wrong. thanks, rob http://www.robdecker.com/ http://www.planetside.com/ On Wed, 24 Apr 2002, Avi Drissman wrote: I'm using Lucene rc4 and JavaCC 2.1. I'm trying to compile Lucene without Ant, by tossing the files into Project Builder (Mac OS X). I ran JavaCC on StandardTokenizer.jj with the standard options, tossed the resulting files into the project, and now I'm running into a few errors: 1. StandardTokenizer.jj:173 is org.apache.lucene.analysis.Token next() throws IOException which is JavaCC'd into StandardTokenizer.java:26 as final public org.apache.lucene.analysis.Token next() throws ParseException, IOException which isn't a valid override. javac says next() in org.apache.lucene.analysis.standard.StandardTokenizer cannot override next() in org.apache.lucene.analysis.TokenStream; overridden method does not throw org.apache.lucene.analysis.standard.ParseException 2. StandardTokenizer.java:26 says token.beginColumn,token.endColumn and there are no such member variables. Am I totally missing something here? Avi -- Avi Drissman [EMAIL PROTECTED] Argh! This darn mailserver is trunca
Re: Lucene index integrity... or lack of :-(
Morning, I'm starting to wonder how bullet proof are Lucene indexes? Do they get corrupted easily? If so is there a way to rebuild them? There is no tool for detecting index corruption, fixing indexes, or rebuilding them. The last one anyone can/has to do on their own. I've started to get the following exception left and right... 04/25 18:34:39 (Warning) Indexer.indexObjectWithValues: java.io.IOException: _91.fnm already exists I've seen people asking about this on the list, but I never encountered this particular exception. I built a little app (http://homepage.mac.com/zoe_info/) that uses Lucene quite extensively, and I would like to keep it that way. However, I'm starting to have second thoughts about Lucene's reliability... :-( I'm sure I'm doing something wrong somewhere, but I really cannot see what... Maybe it's not a Lucene issue then, although I've seen this mentioned so often, which means that documentation could be improved to prevent people from making the same mistakes that others have already made. Otis
Re: Lucene index integrity... or lack of :-(
Hello, There is no tool for detecting index corruption, fixing indexes, or rebuilding them. The last one anyone can/has to do on their own. :-( Well, that's *very* sad to say the least... How do I know if my indexes are not corrupted even if everything seems to be working fine? Don't tell me I'm the first one to run into this kind of issue?!? How can I trust an index if there is *no* way of checking its integrity? And even if you happen to notice that something is fishy, there is no way to rebuild the index, short of re-indexing everything from scratch? That does not sound like a very healthy situation to me. Fragile would be a kind way of describing it... Yes, that's all unfortunate. If you come up with anything, please share it. Or, you can use Lucene Sandbox and develop stuff there. I've seen people asking about this on the list, but I never encountered this particular exception. Lucky you... :) Maybe it's not a Lucene issue then, although I've seen this mentioned so often, which means that documentation could be improved to prevent people from making the same mistakes that others have already made. Maybe, maybe not. And most likely I'm doing something odd. In any case, could you point me to the mistakes that others have already made? Or did I miss something obvious here? Nah, the only thing I can suggest is to check the lists' archives; that is where the mistakes of others would be recorded. Otis
Re: rc4 and FileNotFoundException: an update
--- petite_abeille [EMAIL PROTECTED] wrote: I don't know what environment you're using Lucene in. The problem seems to be especially bad on osx (10.1.4 + JRE 1.3.1 + latest updates). Does this mean you tried it on other OSs and it worked? Which ones? What JDK did those have, what was their ulimit, and what is the ulimit on your OSX machine? Just curious. Otis
Re: FileNotFoundException: code example
Hello,

I'll put my comments inline...

--- petite_abeille [EMAIL PROTECTED] wrote:

Hello again,

Attached is the source code of the only class interacting directly with Lucene in my app. Sorry for not providing a complete test case, as it's hard for me to come up with something self-contained. Maybe there is something that's obviously wrong in what I'm doing. Thanks for any help.

PA

//
// ===
//
//  Title:          SZIndex.java
//  Description:    [Description]
//  Author:         Raphael Szwarc [EMAIL PROTECTED]
//  Creation Date:  Wed Sep 12 2001
//  Legal:          Copyright (C) 2001 Raphael Szwarc. All Rights Reserved.
//
// ---
//

package alt.dev.szobject;

import com.lucene.store.Directory;
import com.lucene.store.FSDirectory;
import com.lucene.store.RAMDirectory;
import com.lucene.document.Field;
import com.lucene.document.DateField;
import com.lucene.document.Document;
import com.lucene.analysis.Analyzer;
import com.lucene.analysis.standard.StandardAnalyzer;
import com.lucene.index.IndexWriter;
import com.lucene.index.IndexReader;
import com.lucene.index.Term;
import com.lucene.search.IndexSearcher;
import com.lucene.search.MultiSearcher;
import com.lucene.search.Searcher;
import com.lucene.search.Query;
import com.lucene.search.Hits;

import java.io.FilenameFilter;
import java.io.File;
import java.io.IOException;
import java.util.Map;
import java.util.Collection;
import java.util.Date;
import java.util.Iterator;

import alt.dev.szfoundation.SZHexCoder;
import alt.dev.szfoundation.SZDate;
import alt.dev.szfoundation.SZSystem;
import alt.dev.szfoundation.SZLog;

final class SZIndex extends Object
{

// ===
//  Constant(s)
// ---

    private static final String Extension = ".index";

// ===
//  Class variable(s)
// ---

    private static final Filter _filter = new Filter();

// ===
//  Instance variable(s)
// ---

    private String                _path = null;
    private transient File        _directory = null;
    private transient Directory   _indexDirectory = null;
    private transient IndexWriter _writer = null;
    private transient IndexReader _reader = null;
    private transient Searcher    _searcher = null;
    private transient Directory   _ramDirectory = null;
    private transient IndexWriter _ramWriter = null;
    private transient int         _counter = 0;

// ===
//  Constructor method(s)
// ---

    private SZIndex()
    {
        super();
    }

// ===
//  Class method(s)
// ---

    static FilenameFilter filter()
    {
        return _filter;
    }

    static String stringByDeletingPathExtension(String aPath)
    {
        if ( aPath != null )
        {
            int anIndex = aPath.lastIndexOf( SZIndex.Extension );

            if ( anIndex > 0 )
            {
                aPath = aPath.substring( 0, anIndex );
            }

            return aPath;
        }

        throw new IllegalArgumentException( "SZIndex.stringByDeletingPathExtension: null path." );
    }

    static SZIndex indexWithNameInDirectory(String aName, File aDirectory)
    {
        if ( aName != null )
        {
            if ( aDirectory != null )
            {
                String anEncodedName = SZHexCoder.encode( aName.getBytes() );
                //String aPath = aDirectory.getPath() + File.separator + anEncodedName + SZIndex.Extension + File.separator;
                String aPath = aDirectory.getPath() + File.separator + aName + SZIndex.Extension + File.separator;
                SZIndex anIndex = new SZIndex();

                anIndex.setPath( aPath );
Re: rc4 and FileNotFoundException: an update
Hello,

and what was their ulimit and what is the ulimit on your OSX machine? Just curious.

I don't know. Does it matter?

Of course it does: a low (u)limit is perhaps a part of your problem.

Otis

P.S. I don't know how Winblows deals with file descriptors. Try your application on some other flavour of Unix, if possible.
Re: Options for sorting on an integer or date
Hello,

--- Joel Bernstein [EMAIL PROTECTED] wrote:

At my company we are trying to decide on a new search engine. I am very impressed with what I see in Lucene and am thinking very seriously of not going with AltaVista, FAST etc...

:)

One of the things that is very important to us is sorting by an integer or by a date, which Lucene currently cannot do. So I am thinking about some options I might have here. I would welcome comments from the Lucene developers on the options below:

1) We could wait for sorting to be added to Lucene. Is there an idea of when this will be added?

There was not much, if any, discussion about this functionality, so one can easily draw a conclusion from that :)

2) Have my company commission a project from the Lucene team to add this functionality soon. Does the Lucene team do commissioned work?

Commission in what sense? The $en$e? I think payment is out of the question, but I would encourage you to take the current Lucene snapshot, or maybe the next release, which is imminent, and add this functionality to Lucene. It sounds like if Lucene doesn't have this functionality you'll have to spend a good amount of dollars anyway. Damn, I'm not a very good salesman :)

3) Add the sorting code with guidance from the Lucene team and from a search engine expert who works with our company.

I can't help with that, but maybe somebody else can.

4) Re-sort the results in the application that is using Lucene. This is the least attractive option because our result sets can be very large and I think we will have performance problems.

That's the simplest and the 'hackiest' solution. :(

Otis
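Option 4 above, re-sorting in the application, can be sketched in plain Java. This is only an illustration under assumptions: `ScoredHit` and `HitSorter` are hypothetical names, not Lucene classes, and the integer and date values are assumed to have been copied out of stored fields after the search. As Joel notes, the whole result set has to be held in memory, which is the cost of this approach.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical holder for one search result; not a Lucene class.
final class ScoredHit {
    final String id;
    final int price;        // integer value copied out of a stored field
    final long timestamp;   // date as milliseconds since the epoch

    ScoredHit(String id, int price, long timestamp) {
        this.id = id;
        this.price = price;
        this.timestamp = timestamp;
    }
}

final class HitSorter {
    // Returns a new list ordered by price, lowest first; ties keep their original order.
    static List<ScoredHit> byPrice(List<ScoredHit> hits) {
        List<ScoredHit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingInt((ScoredHit h) -> h.price));
        return sorted;
    }

    // Returns a new list ordered by date, newest first.
    static List<ScoredHit> byDateDescending(List<ScoredHit> hits) {
        List<ScoredHit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingLong((ScoredHit h) -> h.timestamp).reversed());
        return sorted;
    }
}
```

Both methods copy the list first, so the relevance-ordered original stays available.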
Re: indexing PDF files
Hm, this should be a FAQ.

Maybe it should... ;-)

It is now.

Check the Lucene contributions page, there are some starting points there,

Well, this seems to be a very popular request... In fact I need something like that also. Unfortunately, there seems to be no authoritative answer as far as converting PDF files to text in a pure Java environment goes... Maybe I'm missing something here, as usual? Also, on a related note, what would be a good approach to converting any random document into PDF? I was thinking of a two-step process for document indexing in Lucene:

- First, convert everything to PDF (with Acrobat or something)
- Second, convert PDF to text and index it.

Any practical suggestions about how to do that in a pure Java environment are very welcome.

Wouldn't you want to convert to XML instead and use XSLT to transform the XML representation to any desired format by just applying a style sheet? Sounds like less work with bigger document type coverage.

Otis
Re: term search speeds
Caching? Operating systems usually cache recently opened files...

Otis

--- a person [EMAIL PROTECTED] wrote:

Does anyone know exactly why, when searching for a term, the engine is much slower on the first search for that term than on subsequent searches for the same term?

Thanks
Re: 3 Times Isn't a Charm for me and Lucene
Uh, this is a very broad question. A number of things could be wrong. Look at your Tomcat log files. Write a class that you can run from the command line, not as a servlet; that may be easier to debug. You can use one of the demo ones to get started. Log things, don't catch exceptions and ignore them, etc. Check that your index directory exists, that it is readable by the user doing the searches, etc. etc.

Otis

--- James Rozee [EMAIL PROTECTED] wrote:

I've just recently recoded my entire website and search engine to use Tomcat 4.0.3, Velocity, MySQL and Lucene 1.2-rc4. I have been using MySQL and servlets for a few years now. However, I only recently started using Lucene. I've built a Lucene index from my document collection and now I need to be able to search it from a servlet. My first attempt to do this causes Tomcat to return a page that is empty. Can anyone give me some advice on how to track down my problem? My hardware is an SS1000E with 5 SM81s and 1.2GB RAM.

Thanks.

James

The Game Development Search Engine and DQuest E-zine http://www.gdse.com/
A Member of the Future Games Network http://www.fgn.com/
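The directory checks Otis lists can be run from the command line before involving Tomcat at all. The following is a minimal sketch using only `java.io.File`; the class name `IndexDirCheck` and the default path `index` are assumptions made for the example. It should be run as the same user that runs the servlet container, since readability depends on the user.

```java
import java.io.File;

final class IndexDirCheck {
    // Returns null when the directory looks usable, otherwise a description of the problem.
    static String problemWith(File dir) {
        if (!dir.exists()) return "does not exist: " + dir;
        if (!dir.isDirectory()) return "not a directory: " + dir;
        if (!dir.canRead()) return "not readable by this user: " + dir;
        String[] names = dir.list();
        if (names == null || names.length == 0) return "contains no files: " + dir;
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical default path; pass the real index directory as the first argument.
        File dir = new File(args.length > 0 ? args[0] : "index");
        String problem = problemWith(dir);
        System.out.println(problem == null ? "looks usable: " + dir : problem);
    }
}
```

A clean result here only rules out the filesystem; it says nothing about whether the index contents are valid.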
Re: Stemming
You could have a single index with both stemmed and non-stemmed terms, using different field names for each and searching a different set of fields depending on the type of search. You'd also have to use two types of analyzers/filters, I think. Roughly :)

Otis

--- Joel Bernstein [EMAIL PROTECTED] wrote:

In our search application the user can turn stemming off and on. With Lucene, will I have to maintain two sets of indexes to create this functionality, one stemming and one non-stemming index? Or is there a way to query a stemming index so that it does not return stems?

Thanks, Joel
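The two-field idea can be sketched without any Lucene classes. Everything here is an assumption for illustration: the field names `contents` and `contents_stemmed` are made up, and `toyStem` is a deliberately crude suffix stripper standing in for a real stemming filter. The point is only that one document carries both variants and the user's switch selects which field the query targets.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DualFieldSketch {
    // Toy suffix stripper for illustration only; a real index would use a proper stemmer.
    static String toyStem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);
        if (w.endsWith("s") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }

    // One logical document carries both variants under hypothetical field names.
    static Map<String, String> fieldsFor(String word) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("contents", word.toLowerCase());
        fields.put("contents_stemmed", toyStem(word));
        return fields;
    }

    // The query term and the target field both depend on the user's stemming switch.
    static String queryFor(String userTerm, boolean stemming) {
        return stemming
            ? "contents_stemmed:" + toyStem(userTerm)
            : "contents:" + userTerm.toLowerCase();
    }
}
```

The same stemming step has to be applied at index time and at query time for the stemmed field, which is why Otis expects two analyzers to be involved.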
Re: Lucene Book
I don't think there are any on the market. A perfect opportunity for somebody :)

Otis

--- William W [EMAIL PROTECTED] wrote:

Hi All,

Do you know of any book about Lucene?

Thanks, William.
Re: Any one used websearch - Need Help Please
Hello,

The host that you are trying to crawl cannot be looked up:

bash-2.04$ nslookup www.violet-arcana.com
Server: localhost.apache.org
Address: 127.0.0.1

*** localhost.apache.org can't find www.violet-arcana.com: Non-existent host/domain

This is not a Lucene issue, but more of a networking issue, so I suggest you talk to some network/system administrators about this. They'll have an answer for you.

Good luck,
Otis

--- Moturu,Praveen [EMAIL PROTECTED] wrote:

Hi All,

Has anyone used websearch? If so, can you please help me. I am trying to use the demo files. When I index the demo site I get the following message, and when I try the example search form and enter rock or red as described, I do not get any search results...

START CRAWLING
index exists, delete all files
deleting 0 records
SCANNING : http://localhost/websearch/bot.jsp *status: bad
SCANNING : http://www.violet-arcana.com/ *status: java.net.UnknownHostException: www.violet-arcana.com
DONE CRAWLING
links crawled
http://localhost/websearch/bot.jsp
http://www.violet-arcana.com/

Any help is highly appreciated.

Thanks
Praveen Moturu
Re: WildcardQuery
Yes, me too. I just tried it on some Lucene index (the search at blink.com) and it doesn't seem to work (try searching for travel and then *vel). I'm assuming the original poster confused something...

Otis

--- Joel Bernstein [EMAIL PROTECTED] wrote:

I thought Lucene didn't support left wildcards like the following: *ucene

----- Original Message -----
From: Christian Schrader [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, May 06, 2002 7:14 PM
Subject: WildcardQuery

I am pretty happy with the results of WildcardQueries like *ucen* that matches lucene, but *lucene* doesn't match lucene. Is there a reason for this? And what would be the patch? It should be in WildcardTermEnum. I am wondering if somebody already patched it?

Thanks, Chris
Re: Searching greater than/less than
Hello,

I believe that is not possible with Lucene, although there is something called a RangeQuery, which may be helpful.

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/RangeQuery.html

Otis

--- Victor Hadianto [EMAIL PROTECTED] wrote:

Can I use Lucene to search for greater than / less than a value in a field? I have a field in the document that functions as a score. I would need to be able to search the index + the option having to say a field 50

Regards,
--
Victor Hadianto
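One thing to keep in mind if a RangeQuery is tried for a numeric score: assuming terms are compared as strings, which is how Lucene's term ordering works as far as I know, "9" sorts after "50". A common workaround is to index numbers zero-padded to a fixed width. The sketch below is plain Java; `PaddedNumbers` and the width of 6 are choices made for the example, and negative numbers are deliberately not handled.

```java
final class PaddedNumbers {
    // Left-pads a non-negative value with zeros so that string order matches numeric order.
    static String pad(int value, int width) {
        if (value < 0) throw new IllegalArgumentException("negative values need a different encoding");
        String digits = Integer.toString(value);
        if (digits.length() > width) throw new IllegalArgumentException("value wider than " + width);
        StringBuilder sb = new StringBuilder();
        for (int i = digits.length(); i < width; i++) sb.append('0');
        return sb.append(digits).toString();
    }

    public static void main(String[] args) {
        System.out.println("9".compareTo("50") > 0);               // true: unpadded order is wrong
        System.out.println(pad(9, 6).compareTo(pad(50, 6)) < 0);   // true: padded order is right
    }
}
```

The same padding has to be applied to the bounds of the range at query time, otherwise the comparison mixes padded and unpadded strings.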
Re: Merging (adding) indices
The source code looks like this:

  public final synchronized void addIndexes(Directory[] dirs)
      throws IOException {
    optimize();                                  // start with zero or 1 seg
    for (int i = 0; i < dirs.length; i++) {
      SegmentInfos sis = new SegmentInfos();     // read infos from dir
      sis.read(dirs[i]);
      for (int j = 0; j < sis.size(); j++) {
        segmentInfos.addElement(sis.info(j));    // add each info
      }
    }
    optimize();                                  // final cleanup
  }

So I think the original directories/indices should not be modified in any way. Are you sure your application is not deleting them?

Otis

--- Lex Lawrence [EMAIL PROTECTED] wrote:

Hello-

I am using org.apache.lucene.index.IndexWriter.addIndexes(Directory[] dirs) to merge several indices into one. The resulting index appears to work fine, but afterward the original indices seem to have been completely emptied. I can deal with that, but I just wanted to check: is this method supposed to alter the indices in the 'dirs' parameter? It's not mentioned in the javadoc.

Thanks-
Lex
Re: Partial word search with unicode contents
Hello,

A query for india should not be returning southindia (one word). It sounds like something else is happening in your application.

Otis

--- Harpreet S Walia [EMAIL PROTECTED] wrote:

Hi,

We are using Lucene to index and search Unicode (UTF-8) content in the Devanagari (Hindi) script. What we have observed is that our query fetches results which have a partial word match, i.e. if it were English, a query for india would return words like indian, southindia and so on. Is there a way by which we can instruct Lucene to search only for complete words and not word parts?

TIA

Regards
harpreet
Re: Opening and index as ready only
I believe what you are referring to is on Lucene's TODO list, possibly for the next release. One or two people have already contributed some code for using Lucene on read-only media such as CD-ROM, so you may want to check the mailing list archives for the code if this is urgent for you.

Otis

--- Paul Dlug [EMAIL PROTECTED] wrote:

Is there any way to open an index as read-only? I get an IOException with Permission Denied when I change the index to a set of read-only file permissions. I have a cluster of search servers with the index on an NFS mount. I'd like to be able to have them all open and search the index at the same time. A single IndexWriter would be used to add new documents. Is there any way to do this?

Thanks,
Paul
Re: searching with wild cards ignoers analyzer?
Good morning, Dario,

Maybe this answers your question: http://www.jguru.com/faq/view.jsp?EID=538312

Otis

--- Dario Novakovic [EMAIL PROTECTED] wrote:

I index and search with an analyzer which converts all characters to lowercase. It works correctly until I use *; then I must use query strings with exact capitalization. Why is that? Am I doing something wrong?

Thanks for any answer,
dario
Re: lucene and java naming conventions
Dario,

Yes, we may improve coding style over time, but there are no plans for doing that in the immediate future. I know, it's not ideal, so we all have to get used to those few exceptions.

Otis

--- Dario Novakovic [EMAIL PROTECTED] wrote:

I noticed that some method names in Lucene start with uppercase, and it is really confusing for me because I always think they are some inner classes. The Java naming convention suggests that method names start with lowercase, and Lucene is my first source code experience that opposes the naming conventions. I don't want to teach developers how to code; I just want to ask whether there are any reasons for that, and to suggest that they consider changing the source code to comply with the conventions.

thanks
dario
Re: lucen compared to other open source solutions
I haven't used Swish-e, but I remember looking at it years ago, and from what I remember it wasn't nearly as scalable as Lucene, and it did not support various types of queries that Lucene supports. Maybe things have changed since then. You can look at http://www.searchtools.com/ for some additional information.

Otis

--- degetel [EMAIL PROTECTED] wrote:

Hi,

I have a small question. I am quite new to this field of indexing and searching content. I have already used Lucene in a project and it was successful! Now I have to consider other solutions. Do you know where I can find some arguments for choosing Lucene over the Swish-e solution? Functional differences? Scalability? Performance? Are there any benchmarks somewhere?

thanks
roland
Re: Document Object
As far as I know there is no generic way to do that. You can parse the String in your application, form Fields, add them to a Document, and there you go, but there is nothing generic. Besides field names and values, your String would also have to contain metadata about each field: whether it is to be indexed or unindexed, tokenized or not tokenized, etc., e.g.

field1:value1Keyword, field2:value2UnStored

Maybe there are better approaches. This is just the first thing that came to mind. Good luck, and if you implement something generic, please contribute it to the project.

Thanks,
Otis

--- Pradeep Kumar K [EMAIL PROTECTED] wrote:

Hi all,

Is there any way to type cast a String object to a Document object? That is, a Document object can be converted to its String form by using the method 'toString()'. How can we convert it back to a Document object? Any help will be greatly appreciated.

Regards
Pradeep
--
Robosoft Technologies, Mangalore, India
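The parsing step Otis describes could look roughly like the sketch below. The text format is an assumption invented for this example (`name:value:type`, entries separated by commas, so names and values may not contain `:` or `,`); it is not a format Lucene defines, and `FieldSpec` and `FieldSpecParser` are hypothetical names. The application would then map each `type` string to the matching Field factory and add the result to a Document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical description of one field: its name, value, and how it should be indexed.
final class FieldSpec {
    final String name;
    final String value;
    final String type;   // e.g. "Keyword" or "UnStored", interpreted by the application

    FieldSpec(String name, String value, String type) {
        this.name = name;
        this.value = value;
        this.type = type;
    }
}

final class FieldSpecParser {
    // Parses "name:value:type, name:value:type" (an assumed format) into field descriptions.
    static List<FieldSpec> parse(String text) {
        List<FieldSpec> specs = new ArrayList<>();
        for (String part : text.split(",")) {
            String[] pieces = part.trim().split(":", 3);
            if (pieces.length != 3) {
                throw new IllegalArgumentException("expected name:value:type, got: " + part);
            }
            specs.add(new FieldSpec(pieces[0], pieces[1], pieces[2]));
        }
        return specs;
    }
}
```

A format with proper escaping, or XML, would be more robust for values that contain the separator characters.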
Re: status of ? wildcard queries in rc5
David,

As far as I can tell the '?' character works as it should with WildcardQuery. See src/test/org/apache/lucene/search/TestWildcard.java. The tests there use SimpleAnalyzer and WildcardQuery directly (i.e. not QueryParser). All tests pass. Try comparing your code with the code in the above test class.

Otis

--- [EMAIL PROTECTED] wrote:

I've searched the mail archive and I'm still a bit confused as to the current status of ? wildcard queries. My experience, using lucene-1.2-RC5, is that ? wildcard queries are unsupported using the StandardAnalyzer or SimpleAnalyzer. For example, the following search on two fields (go_id and go_desc), using StandardAnalyzer for indexing and searching:

%java Search ./index +go_id:5737 +go_desc:biosynthesis
Result:
go_id:4853, 6783, 5737
go_desc:uroporphyrinogen decarboxylase, heme biosynthesis, cytoplasm
Score: 1.0

using * wildcard:

%java Search ./index +go_id:5737 +go_desc:biosynth*sis
Result:
go_id:4853, 6783, 5737
go_desc:uroporphyrinogen decarboxylase, heme biosynthesis, cytoplasm
Score: 1.0

using ? wildcard:

%java Search ./index +go_id:5737 +go_desc:biosynth?sis
No results

Is this the expected behavior for RC5, a reported bug, or an unreported bug?

thanks,
--David M. Goodstein
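For reference, the matching rules under discussion ('?' for exactly one character, '*' for any run of characters including none) can be written down as a small stand-alone matcher. This is a sketch of the semantics only, not the code in WildcardTermEnum, and the class name `WildcardSketch` is made up. Under these rules biosynth?sis should match the term biosynthesis, which is the behaviour David expected.

```java
final class WildcardSketch {
    // True if text matches pattern, where '?' stands for exactly one character
    // and '*' for any run of characters, including none.
    static boolean matches(String pattern, String text) {
        return matches(pattern, 0, text, 0);
    }

    private static boolean matches(String p, int pi, String t, int ti) {
        if (pi == p.length()) return ti == t.length();   // pattern used up: text must be too
        char c = p.charAt(pi);
        if (c == '*') {
            // Try every possible length for the run that '*' absorbs.
            for (int skip = ti; skip <= t.length(); skip++) {
                if (matches(p, pi + 1, t, skip)) return true;
            }
            return false;
        }
        if (ti == t.length()) return false;              // pattern needs a character, text has none
        return (c == '?' || c == t.charAt(ti)) && matches(p, pi + 1, t, ti + 1);
    }
}
```

The recursion is simple rather than fast; patterns with many '*' characters can take exponential time on long inputs.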
Re: Problem in unicode field value retrival
Hello,

That was the problem, thanks :-). Still, I am struggling to get Lucene to search non-English Unicode content. It works partially with the simple analyser but doesn't return any results with the standard analyser. Is there a way by which I can output the exact contents that are going into the index?

Perhaps something like this will help. This is a very recent post from the searchable mailing list archives at http://nagoya.apache.org/:

http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]msgId=352570

Otis
Re: Within Search
Hello,

I'm sending this to the lucene-user list, as that seems more appropriate. I haven't used Lucene's slop feature, but it looks like both QueryParser and PhraseQuery have support for slop. I am not sure what the syntax for it is, but if nothing else you should be able to call the setSlop(int) method on an instance of PhraseQuery. Oh, it looks like you missed it in the Query Parser Syntax document: http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

Otis

--- none none [EMAIL PROTECTED] wrote:

hi,

I asked for some help with this feature some time ago, but got no answer. What I need to do is a WithinPhraseSearch. An example can be: search for car w/10 rent. This means: look for documents that contain 'car' and, within 10 words, 'rent'. So, what I think I need is:

1. Change QueryParser.jj to recognize the operator w/xx as the within operator.
2. The QueryParser should return a PhraseQuery with a slop factor equal to '10' for the example above. It should also ignore w/xx if xx is not numeric.

Another question: what should I do if I want the query operators (AND, OR, NOT, etc.) to be case insensitive? What should I change inside QueryParser.jj?

PLEASE HELP, because I really don't know how to use the JavaCC utility.

Thanks, bye.
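To make the "w/10" requirement concrete, here is a simplified proximity check in plain Java. It is a sketch of the intent only, not how PhraseQuery computes slop, and `ProximitySketch` is a made-up name; words are split on whitespace and compared in lowercase. As far as I recall, the Query Parser Syntax document linked above writes this kind of proximity search as a quoted phrase followed by a tilde and a number, e.g. "car rent"~10, so changing QueryParser.jj may not be needed.

```java
final class ProximitySketch {
    // True if some occurrence of a and some occurrence of b are at most maxGap
    // word positions apart, in either order. Both words are expected in lowercase.
    static boolean within(String text, String a, String b, int maxGap) {
        String[] words = text.toLowerCase().split("\\s+");
        int lastA = -1;
        int lastB = -1;
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(a)) lastA = i;
            if (words[i].equals(b)) lastB = i;
            // Only the most recent position of each word can give the smallest gap so far.
            if (lastA >= 0 && lastB >= 0 && Math.abs(lastA - lastB) <= maxGap) return true;
        }
        return false;
    }
}
```

For example, `within("rent a cheap car today", "car", "rent", 10)` is true because the two words are three positions apart.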
Re: How does simple analyser work
--- Harpreet S Walia [EMAIL PROTECTED] wrote:

Hi,

Are there any resources available which explain how the simple analyser processes the data given to it? What I want to know is, given a set of words, exactly which rules are applied to tokenize and index these words, and how I can customize them. My requirement is that the words be broken only at spaces and not at any other character. I understand that this can be done by writing a parser in JavaCC, but is there any simpler way of achieving this?

Actually, this can be done by writing your own custom Analyzer. Check these:

./org/apache/lucene/analysis/standard/StandardAnalyzer.java
./org/apache/lucene/analysis/Analyzer.java
./org/apache/lucene/analysis/de/GermanAnalyzer.java
./org/apache/lucene/analysis/SimpleAnalyzer.java
./org/apache/lucene/analysis/StopAnalyzer.java
./org/apache/lucene/analysis/WhitespaceAnalyzer.java

Maybe this last one is what you are looking for.

Otis
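The rule Harpreet asks for, breaking only at whitespace and keeping every other character, is small enough to spell out. The sketch below is plain Java written for illustration; it is not the source of WhitespaceAnalyzer, and `WhitespaceSplit` is a made-up name. Note that it does no lowercasing, so "India" and "india" stay different tokens.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceSplit {
    // Splits text into tokens at whitespace only; punctuation stays inside the tokens.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                if (current.length() > 0) {      // end of a token
                    out.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) out.add(current.toString());   // last token
        return out;
    }
}
```

With this rule, "south-india e-mail@x.org" yields exactly two tokens, whereas a tokenizer that breaks at non-letters would split both of them further.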
RE: Question about RangeQuery and strings...
James,

I haven't used RangeQueries, but what you describe does sound confusing to me. I'll enter it as a bug, just so this information doesn't get lost, because I am not certain that this is really a bug, even though it sounds like one to me.

Thanks,
Otis

--- James Ricci [EMAIL PROTECTED] wrote:

I'm replying to my own message because I think I now understand the problem, and part of it is, in my opinion, a bad implementation of RangeQuery.

When you create a range query and omit the lower term, my expectation would be that I would find everything less than the upper term. Now if I pass false for the inclusive flag, then I would expect to find all terms less than the upper term, excluding the upper term itself.

What is happening in the case of lower_term=null, upper_term=x, inclusive=false is that empty strings are being excluded because inclusive is set to false, and the implementation of RangeQuery creates a default lower term of Term(fieldName, ""). Since it's not inclusive, it excludes "". This isn't what I intended, and I don't think it's what most people would imagine RangeQuery would do in the case I've mentioned.

I equate lower=null, upper=x, inclusive=false to Field < x. lower=null, upper=x, inclusive=true would be Field <= x. In both cases, the only difference should be whether or not Field = x is true for the query.

I'm still quite new to Lucene, so maybe I'm wrong about all this because I just don't understand it well enough. If so, could someone tell me where I've gone astray?

Thanks much,
James

PS: The rest of the problems I had below I was able to fix by changing how the fields were tokenized and indexed.

-----Original Message-----
From: James Ricci
Sent: Thursday, June 06, 2002 11:16 AM
To: '[EMAIL PROTECTED]'
Subject: Question about RangeQuery and strings...

Hi all,

I've been having some problems using RangeQuery. I have a simple Query which is essentially document.field < AB. Field values are:

// Empty string
A SPACE
A123456
ABC

Now I expected to find the first three of the four values (and I do with another commercial search engine product I've worked with). With Lucene I get nothing. Part of the problem, I think, is that there are some issues with case here. Changing my query to document.field < ab returns:

A123456

Now I would have expected A SPACE to be returned, and I was really surprised that it wasn't. I'm guessing that it wasn't returned because no term in the field passed the query criteria, and the empty string is not considered a term. How should I go about getting what I expect? What is going on here?

Thanks much,
James
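The semantics James expects can be stated as a small predicate in which a null bound means "unbounded on that side" rather than "the empty string". This is a plain-Java sketch of that expectation, not RangeQuery's implementation; `RangeSketch` and `inRange` are names made up for the example, and terms are compared with `String.compareTo`.

```java
final class RangeSketch {
    // True if term lies between lower and upper. A null bound means "no limit on
    // that side"; inclusive decides whether a term equal to a bound is accepted.
    static boolean inRange(String term, String lower, String upper, boolean inclusive) {
        if (lower != null) {
            int c = term.compareTo(lower);
            if (c < 0 || (c == 0 && !inclusive)) return false;
        }
        if (upper != null) {
            int c = term.compareTo(upper);
            if (c > 0 || (c == 0 && !inclusive)) return false;
        }
        return true;
    }
}
```

With `lower == null`, the empty string passes an exclusive range up to "AB". Passing `""` as the lower bound instead reproduces the behaviour James describes, where the empty string is excluded because it equals the default lower term.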
Re: Are IndexReader objects always up to date?
Hm, this sounds an awful lot like a FAQ, yet I don't see it in Lucene's FAQ at jGuru.com. You need to close and reopen the index (reader) if you want to see the latest changes. There is a method that you can use to figure out whether the index has been modified since you opened it.

Otis

--- James Ricci [EMAIL PROTECTED] wrote:

Hi,

If I have an IndexReader object open, and someone else is using an IndexWriter to update the contents of an index, will my IndexReader automatically reflect the current contents of the index? If not, what must I do to refresh it?

James
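The close-and-reopen advice can be wrapped in a small helper that compares a modification stamp before handing out the reader. The sketch below is generic Java and makes several assumptions: `RefreshingHolder` is a made-up class, the `lastModified` supplier stands in for whatever method reports the index's modification time, and the `opener` stands in for opening a new reader. Closing the old reader is left out, because that is only safe once no search is still using it.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class RefreshingHolder<T> {
    private final LongSupplier lastModified;   // assumed: reports the index's modification stamp
    private final Supplier<T> opener;          // assumed: opens a fresh reader or searcher
    private T current;
    private long openedAt;

    RefreshingHolder(LongSupplier lastModified, Supplier<T> opener) {
        this.lastModified = lastModified;
        this.opener = opener;
        // Read the stamp before opening, so a change during opening triggers a later reopen.
        this.openedAt = lastModified.getAsLong();
        this.current = opener.get();
    }

    // Returns the current instance, reopening first if the stamp has changed.
    synchronized T get() {
        long stamp = lastModified.getAsLong();
        if (stamp != openedAt) {
            current = opener.get();   // a real version would also retire the old instance
            openedAt = stamp;
        }
        return current;
    }
}
```

The check costs one stamp lookup per call, so for an index that changes rarely the reader is reused almost every time.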
RE: Are IndexReader objects always up to date?
I don't think there is anything else. That is how I wrote applications that used Lucene at my previous job. It worked, but those indices changed only hourly. Otis --- James Ricci [EMAIL PROTECTED] wrote: Otis, Thanks. This seems to agree with what I've seen myself. The system I'm working on is extremely dynamic, so this will be an issue for me. The method I think you're talking about is IndexReader.lastModified. I'm not sure this actually tells me if the IndexReader I have is up to date, but it would tell me if there has been a change since I opened it (assuming I have saved off the open time). Is there something a little more direct? James -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]] Sent: Tuesday, June 11, 2002 2:23 PM To: Lucene Users List Subject: Re: Are IndexReader objects always up to date? Hm, this sounds an awful lot like a FAQ, yet I don't see it in Lucene's FAQ at jGuru.com. You need to close and reopen the index(reader) if you want to see the latest changes. There is a method that you can use to figure out if the index has been modified since you opened it. Otis --- James Ricci [EMAIL PROTECTED] wrote: Hi, If I have an IndexReader object open, and someone else is using an IndexWriter to update the contents of an index, will my IndexReader automatically reflect the current contents of the index? If not, what must I do to refresh it? James
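The "save the open time, compare, then close and reopen" approach discussed above could be wrapped roughly as follows. This is only a sketch: ReopenOnChange is a hypothetical helper, not a Lucene class, and the three callbacks are assumptions standing in for whatever reads the index's modification stamp (for example the IndexReader.lastModified call James mentions), opens a reader, and closes one.

```java
import java.util.function.Consumer;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper, not a Lucene class.
class ReopenOnChange<T> {
    private final LongSupplier lastModified; // assumed: returns the index's modification stamp
    private final Supplier<T> opener;        // assumed: opens a fresh reader/searcher
    private final Consumer<T> closer;        // assumed: closes a stale one
    private T current;
    private long seenStamp;

    ReopenOnChange(LongSupplier lastModified, Supplier<T> opener, Consumer<T> closer) {
        this.lastModified = lastModified;
        this.opener = opener;
        this.closer = closer;
        // Read the stamp before opening, so a change made in between is caught on the next get().
        this.seenStamp = lastModified.getAsLong();
        this.current = opener.get();
    }

    // Returns the current instance, reopening it first if the stamp has moved.
    synchronized T get() {
        long stamp = lastModified.getAsLong();
        if (stamp != seenStamp) {
            closer.accept(current);
            current = opener.get();
            seenStamp = stamp;
        }
        return current;
    }
}
```

Two caveats apply to this sketch: it closes the old instance immediately, so callers that may still be using it would need extra coordination, and a timestamp with coarse resolution can miss two changes made in quick succession.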
Re: Thread safety
Thanks for this table. It's part of the Lucene FAQ at jGuru now: http://www.jguru.com/forums/view.jsp?EID=910778 Otis --- Mark Harwood [EMAIL PROTECTED] wrote: I've been trying to understand the multithreaded behaviour of Lucene too. I have a test rig and the observed results are available here: http://home.clara.net/markharwood/lucene/threads.htm I would be interested in having these observations verified. Cheers Mark
Re: Memory-based indexing
Yes, there are a few things one can do. See http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]msgId=117057 Otis --- James Ricci [EMAIL PROTECTED] wrote: I've been doing a few tests, and I'm finding creating an index in Lucene to be somewhat slower than other engines I've worked with. Is there a way to cache, batch, or otherwise speed up indexing of a large number of documents? This is mainly a problem when creating the index for the first time. James
Re: Thread safety
Yeah, I think you are right, that matrix isn't 100% correct. I'll have to change it... thanks for checking. Otis --- David Smiley [EMAIL PROTECTED] wrote: Maybe I'm just not with it right now... but that matrix doesn't seem to make sense to me. From my understanding, two write requests cannot happen concurrently, yet there's a Y in that box on the matrix. Also, /shouldn't/ the matrix be symmetric? It isn't. If it is intended to be, I think only half of the matrix should be there so as not to be confusing. ~ Dave Smiley On Tuesday, June 11, 2002, at 10:12 PM, Otis Gospodnetic wrote: Thanks for this table. It's part of the Lucene FAQ at jGuru now: http://www.jguru.com/forums/view.jsp?EID=910778 Otis
RE: Boolean Query + Memory Monster
I don't know about Resin, but Tomcat allows one to set the CATALINA_OPTS (or some other _OPTS) environment variable, whose value is then used to invoke Java. I would imagine Resin to have something similar. This then becomes a Resin question. Otis --- Nader S. Henein [EMAIL PROTECTED] wrote: I'm all ears... I'm running the search from a servlet on a Resin web server. Any suggestions as to increasing the heap size in this case? -Original Message- From: Scott Ganyo [mailto:[EMAIL PROTECTED]] Sent: Thursday, June 13, 2002 9:47 PM To: 'Lucene Users List' Subject: RE: Boolean Query + Memory Monster Use the java -Xmx option to increase your heap size. Scott -Original Message- From: Nader S. Henein [mailto:[EMAIL PROTECTED]] Sent: Thursday, June 13, 2002 12:20 PM To: [EMAIL PROTECTED] Subject: Boolean Query + Memory Monster I have 1 GB of memory on the machine with the application. When I use a normal query it goes well, but when I use a range query it sucks the memory out of the machine and throws a servlet out-of-memory error. I have 80 000 records in the index and it's 43 MB large. Any ideas, people? Nader S. Henein Bayt.com, Dubai Internet City www.bayt.com
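Since the advice above depends on the -Xmx value actually reaching the JVM that runs the servlet, one quick way to check is to ask the runtime for its heap ceiling. The sketch below uses only the standard Runtime.maxMemory() call; the class name HeapCheck is hypothetical, and where to log the value inside a servlet is left open.

```java
// Hypothetical helper class; only Runtime.maxMemory() is a real JDK call here.
class HeapCheck {

    // Maximum heap the JVM will attempt to use, in whole megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run with e.g. "java -Xmx512m HeapCheck" and compare the printed value.
        System.out.println("Max heap (MB): " + maxHeapMegabytes());
    }
}
```

If the number printed from inside the container does not change after setting the option, the option is probably not being passed to the JVM that hosts the servlet.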
Re: Deleting document from index
Hello, First of all, the machine from which you sent this email has the date set incorrectly - it thinks it's 22. 6. 2000. --- [EMAIL PROTECTED] wrote: I had searched the archive of this list for more info on how to delete a document from the Lucene index, but most of the postings talk about IndexReader.delete(docNum). When we tried to delete a single document entry from the index, what we found is that the whole index got deleted. You must be doing something wrong. Send the relevant piece of code. 1) Can anyone help us on how we can handle this? http://www.jguru.com/faq/view.jsp?EID=492423 public int delete(final String fieldName, final String fieldValue) throws IOException { final IndexReader reader = IndexReader.open(mIndexDir); try { /* returns the number of documents deleted for this term */ return reader.delete(new Term(fieldName, fieldValue)); } finally { /* close the reader even if delete() throws */ reader.close(); } } 2) When will the search results reflect that the particular document which I had deleted is not there? Do I need to optimize the index for this? You don't need to optimize the index, but I believe you need to close the IndexReader and re-open IndexSearcher when you detect that the index has changed. 3) After adding a few more documents to an existing index, what effect will it have on search if I don't optimize the index immediately? Will these new documents be searchable before optimization? Yes. Otis
Re: Retrieve documents from index by document number
Check the Hits class API: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Hits.html Otis --- Chris Sibert [EMAIL PROTECTED] wrote: Anybody know how to retrieve a stored document from an index by its document number? I have a list of search hits, and when the user clicks on one, I want to pull the stored document up out of the index.
Re: IndexReader Pool
I don't think Lucene contains anything to help you create this pool. However, if you look at the Jakarta Commons project you will find a subproject there that allows you to create pools of any kind of Java object. You can probably use that to save yourself development and debugging time. Otis --- Nader S. Henein [EMAIL PROTECTED] wrote: I was going through the lucene-user posts on the web and I came across a posting by Scott Oshima http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00693.html which talks about creating an IndexReader pool to speed up the search. I've looked into that, but I can't figure out what to use for a DataSource, as when creating a pool for DB connections. Is there an equivalent in the Lucene architecture, or should one just take the initiative? Nader S. Henein Bayt.com, Dubai Internet City www.bayt.com
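To make the pooling idea concrete, here is a small hand-rolled sketch of a fixed-size object pool. It is not the Jakarta Commons pooling API and not anything shipped with Lucene: SimplePool and its method names are hypothetical, and the factory passed in is assumed to be whatever opens an IndexReader or IndexSearcher in the real application.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Hypothetical fixed-size pool; a stand-in for a real pooling library.
class SimplePool<T> {
    private final BlockingQueue<T> idle;

    // Pre-creates 'size' instances using the supplied factory.
    SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    // Blocks until an instance is free, then hands it to the caller.
    T borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            throw new IllegalStateException("interrupted while waiting for a pooled object", e);
        }
    }

    // Returns a borrowed instance so another caller can use it.
    void giveBack(T item) {
        idle.offer(item);
    }

    // Number of instances currently available.
    int idleCount() {
        return idle.size();
    }
}
```

A caller would wrap each search in borrow() and giveBack() inside try/finally. Pooled readers opened before an index update would still need to be replaced to see the changes, which this sketch does not handle.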