Re: Not entire document being indexed?
Thanks Andrzej and Pasha for your prompt replies and suggestions. I will try everything you have suggested and report back on the findings! regards -pedja

Pasha Bizhan said the following on 2/25/2005 6:32 PM:
Hi, Luke can help you check whether the whole document was indexed or not, and answer the question: does my index contain the correct data? Do the following steps:
- run Luke
- open the index
- find the document in question (Documents tab)
- click the "reconstruct and edit" button
- select the field and look at the original stored content of this field, reconstructed from the index
Does this reconstructed content contain your last 2-3 paragraphs?
Also, 230 KB is not the same as 20,000 terms. Try setting writer.maxFieldLength to 250,000.
Pasha Bizhan
http://lucenedotnet.com
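Pasha's point that 230 KB is far more than 20,000 terms can be sanity-checked with a rough sketch (the 6-bytes-per-word figure is my own back-of-the-envelope assumption, not from the thread):

```java
public class TermCountEstimate {
    // Rough estimate of how many terms a plain-text file contains,
    // assuming avgBytesPerTerm bytes per word including the separator.
    static int estimateTerms(int fileSizeBytes, int avgBytesPerTerm) {
        return fileSizeBytes / avgBytesPerTerm;
    }

    public static void main(String[] args) {
        int estimated = estimateTerms(230 * 1024, 6); // ~230 KB, ~6 bytes/word
        System.out.println("Estimated terms: " + estimated);               // ~39,000
        System.out.println("Over the 20,000 cap: " + (estimated > 20000)); // true
        // Anything past writer.maxFieldLength terms is silently dropped,
        // which is consistent with only the tail of the document being missing.
    }
}
```

So even a doubled cap of 20,000 would still truncate roughly the last half of the document, which fits the "last few paragraphs missing" symptom.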
Re: Not entire document being indexed?
Hi Otis
Thanks for the reply. What exactly should I be looking for with Luke? What would setting the max value to the max Integer do? Is this some arbitrary value or...?
-pedja

Otis Gospodnetic said the following on 2/24/2005 2:24 PM:
Use Luke to peek in your index and find out what really got indexed. You could also try the extreme case and set that max value to the max Integer. Otis

--- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
Hi everyone
I'm having a bizarre problem with a few of the documents here that do not seem to get indexed entirely. I use the textmining WordExtractor to convert M$ Word to plain text and then index that text. For example, one document which is about 230KB in size when converted to plain text, when indexed and later searched for a phrase in the last 2-3 paragraphs, returns no hits, yet searching for anything above those paragraphs works just fine. WordExtractor does convert the entire document to text, I've checked that. I've tried increasing the number of terms per field from the default 10,000 to 20,000 with writer.maxFieldLength, but that didn't make any difference; I still can't find phrases from the last 2-3 paragraphs. Any ideas as to why this could be happening and how I could rectify it?
thanks, -pedja

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: PHP-Lucene Integration
Howdy,
For starters, compile and install the java bridge (and if necessary recompile PHP and Apache2) and make sure it works (there's a test php file supplied). Then, here's a simplified part of my code, just to give you an example of how it works. This is the part that does the searching; indexing is done in a similar way.

PHP:

    ...some code here for HTML page setup etc...
    $lucene_dir = $GLOBALS['lucene_dir'];
    java_set_library_path('/path/to/your/custom/lucene-classes.jar');
    // searcher is the custom-written class that does the actual searching and data output
    $obj = new Java('searcher');
    $writer = new Java('java.io.StringWriter');
    $obj->setWriter($writer);
    $obj->initSearch($lucene_dir);
    // $query is the user-supplied query from the HTML form, not visible here
    $obj->getQuery($query);
    // get the last exception
    $e = java_last_exception_get();
    if ($e) {
        // print error
        echo $e->toString();
    } else {
        echo $writer->toString();
        $writer->flush();
        $writer->close();
    }
    // clear the exception
    java_last_exception_clear();

JAVA (custom-written class located in /path/to/your/custom/lucene-classes.jar):

import ...whatever is needed here for the class...
public class searcher {
    IndexReader reader = null;
    IndexSearcher s = null;   // the searcher used to open/search the index
    Query q = null;           // the Query created by the QueryParser
    BooleanQuery query = new BooleanQuery();
    Hits hits = null;         // the search results
    public Writer out;

    public void setWriter(Writer out) {
        this.out = out;
    }

    public void initSearch(String indexName) throws Exception {
        try {
            File indexFile = new File(indexName);
            Directory activeDir = FSDirectory.getDirectory(indexFile, false);
            if (IndexReader.isLocked(activeDir)) {
                //out.write("Lucene index is locked, waiting 5 sec.");
                Thread.sleep(5000);
            }
            reader = IndexReader.open(indexName);
            s = new IndexSearcher(reader);
            //out.write("Index opened");
        } catch (Exception e) {
            throw new Exception(e.getMessage());
        }
    }

    public void getQuery(String queryString) throws Exception {
        int totalhits = 0;
        Analyzer analyzer = new StandardAnalyzer();
        String[] queryFields = {"field1", "field2", "field3", "field4", "field5"};
        float[] boostFields = {10, 6, 2, 1, 1};
        try {
            for (int i = 0; i < queryFields.length; i++) {
                q = QueryParser.parse(queryString, queryFields[i], analyzer);
                if (boostFields[i] > 1)
                    q.setBoost(boostFields[i]);
                query.add(q, false, false);
            }
        } catch (ParseException e) {
            throw new Exception(e.getMessage());
        }
        try {
            hits = s.search(query);
        } catch (Exception e) {
            throw new Exception(e.getMessage());
        }
        totalhits = hits.length();
        if (totalhits == 0) {
            // if we find no hits, tell the user
            out.write("<br>I'm sorry, I couldn't find your query: " + queryString);
        } else {
            for (int i = 0; i < totalhits; i++) {
                Document doc = hits.doc(i);
                String field1 = doc.get("field1");
                String field2 = doc.get("field2");
                String field3 = doc.get("field3");
                String field4 = doc.get("field4");
                String field5 = doc.get("field5");
                out.write("Field1: " + field1 + ", Field2: " + field2 + ", Field3: " + field3
                        + ", Field4: " + field4 + ", Field5: " + field5 + "<br>");
            }
        }
    }
}

Sanyi said the following on 2/7/2005 3:54 AM: Hi!
Can you please explain how you implemented the java and php parts to let them communicate through this bridge? The bridge's project summary talks about a java application-server or a dedicated java process, and I'm not into Java that much. Currently I'm using a self-written command-line search program and it outputs its results to the standard output. I guess your solution must be better ;) If the communication parts of your code aren't top secret, can you please share them with me/us?
Regards, Sanyi
Re: PHP-Lucene Integration
Hi Owen
I am using Lucene with PHP. In previous replies it was suggested to run Tomcat on an alternate port, but for me that was not a solution. I did not want to run too many tasks or too many servers for various reasons (maintenance, security etc) and also needed to have control over PHP sessions and what not. The original PHP extension for Java is broken and is far from being usable in production. Instead I have been using PHP and Lucene with a PHP-Java-Bridge for the past 6 months or so. It does the job very well and I can call classes and methods right out of PHP, just like you would expect with a PHP extension. The bridge is available here: http://sourceforge.net/projects/php-java-bridge
Hope this helps, -pedja

Owen Densmore said the following on 2/6/2005 12:10 PM: I'm building a lucene project for a client who uses php for their dynamic web pages. It would be possible to add servlets to their environment easily enough (they use apache) but I'd like to have minimal impact on their IT group. There appears to be a php java extension that lets php call back and forth to java classes, but I thought I'd ask here if anyone has had success using lucene from php. Note: I looked in the Lucene In Action search page, and yup, I bought the book and love it! No examples there tho. The list archives mention that using java lucene from php is the way to go, without saying how. There's mention of a lucene server and a php interface to that. And some similar comments. But I'm a bit surprised there's not a bit more in terms of use of the official java extension to php. Thanks for the great package! Owen
Re: English and French documents together / analysis, indexing, searching
Morus Walter said the following on 1/21/2005 2:14 AM: No. You could do a ( ( french-query ) OR ( english-query ) ) construct using one query. So query construction would be a bit more complex, but querying itself wouldn't change. The first thing I'd do in your case would be to look at the differences in the output of the English and French snowball stemmers. I don't speak any French, but you might even be able to use both stemmers on all texts. Morus

I've done some thinking afterwards, and instead of messing with complex queries, would it make sense to replace all special characters such as é, è with e during indexing (I suppose by writing a custom analyzer) and then, during searching, parse the query and replace all occurrences of special characters (if any) with their plain latin equivalents? This should produce the required results, no? Since the index would not contain any French characters, searching for French words would still return them, since they were indexed as plain words.
-pedja
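The accent-stripping idea can be prototyped outside of an analyzer first. A minimal sketch (the method name foldAccents is mine; java.text.Normalizer is a later JDK API than the Lucene 1.4 era of this thread, and inside Lucene you would apply the same mapping in a custom TokenFilter):

```java
import java.text.Normalizer;

public class AccentFolder {
    // Decompose accented characters (é -> e + combining acute accent),
    // then strip the combining marks, leaving plain latin letters.
    public static String foldAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(foldAccents("économie générale")); // economie generale
    }
}
```

The key point from the thread still applies: the same folding must be run over both the indexed tokens and the parsed query terms, so that accented and unaccented spellings meet in the same indexed form.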
English and French documents together / analysis, indexing, searching
Greetings everyone
I wonder, is there a solution for analyzing both English and French documents using the same analyzer? The reason being is that we have predominantly English documents but there are some French, yet it all has to go into the same index and be searchable from the same location during any particular search. Is there a way to analyze both types of documents with the same analyzer (and which one)? I've looked around and I see there's a SnowBall analyzer, but you have to specify the language of analysis, and I do not know that ahead of time during indexing, nor do I know it most of the time during searching (users would like to search in both document types). There's also the issue of letter accents in French words and searching for the same (how are they indexed in the first place, even)? Has anyone dealt with this before and how did you solve the problem?
thanks -pedja
Re: English and French documents together / analysis, indexing, searching
Right now I am using StandardAnalyzer but the results are not what I'd hoped for. Also, since my understanding is that we should use the same analyzer for searching that was used for indexing, even if I can manage to guess the language during indexing and apply it to the SnowBall analyzer, I wouldn't be able to use SnowBall for searching, because users want to search through both English and French, and I suppose I would not get the same results if searched with StandardAnalyzer? Another problem with StandardAnalyzer is that it breaks up some words that should not be broken (in our case document identifiers such as ABC-1234 etc), but that's a secondary issue...
thanks -pedja

Bernhard Messer said the following on 1/20/2005 1:05 PM: i think the easiest way is to use Lucene's StandardAnalyzer. If you want to use the snowball stemmers, you have to add a language guesser to get the language for the particular document before creating the analyzer. regards Bernhard

[EMAIL PROTECTED] schrieb: Greetings everyone I wonder, is there a solution for analyzing both English and French documents using the same analyzer? The reason being is that we have predominantly English documents but there are some French, yet it all has to go into the same index and be searchable from the same location during any particular search. Is there a way to analyze both types of documents with the same analyzer (and which one)? I've looked around and I see there's a SnowBall analyzer, but you have to specify the language of analysis, and I do not know that ahead of time during indexing, nor do I know it most of the time during searching (users would like to search in both document types). There's also the issue of letter accents in French words and searching for the same (how are they indexed in the first place, even)? Has anyone dealt with this before and how did you solve the problem?
thanks -pedja
Re: English and French documents together / analysis, indexing, searching
you could try to create a more complex query and expand it into both languages using different analyzers. Would this solve your problem ?

Would that mean I would have to actually conduct two searches (one in English and one in French), then merge the results and display them to the user? It sounds to me like the long way around, so actually writing an analyzer that has the language guesser might be a better solution in the long run?

This behaviour is implemented in the StandardTokenizer used by StandardAnalyzer. Look at the documentation of StandardTokenizer: Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

Hmm, I feel this is beyond my abilities at the moment, writing my own tokenizer, without more in-depth knowledge of everything else. Perhaps I'll try taking the StandardTokenizer and expanding or changing it based on other tokenizers available in Lucene, such as WhitespaceTokenizer.
thanks -pedja
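The language guesser Bernhard mentioned doesn't have to be fancy. A toy sketch of the idea, based on counting common stopwords (the class name, the abbreviated word lists, and the two-language limit are all my own simplifications; production guessers usually use character n-gram statistics instead):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LanguageGuesser {
    // Tiny, illustrative stopword lists -- real lists would be much longer.
    static final Set<String> EN = new HashSet<>(Arrays.asList(
            "the", "and", "of", "to", "in", "is", "that", "with"));
    static final Set<String> FR = new HashSet<>(Arrays.asList(
            "le", "la", "les", "et", "de", "un", "une", "est", "avec"));

    // Count stopword hits for each language; the higher count wins (ties go to English).
    static String guess(String text) {
        int en = 0, fr = 0;
        for (String w : text.toLowerCase().split("\\W+")) {
            if (EN.contains(w)) en++;
            if (FR.contains(w)) fr++;
        }
        return en >= fr ? "en" : "fr";
    }

    public static void main(String[] args) {
        System.out.println(guess("the quick brown fox is in the barn"));       // en
        System.out.println(guess("le chat est dans la maison avec un chien")); // fr
    }
}
```

With a guess per document you could pick the matching Snowball analyzer at index time, and at search time expand the query into both languages as Morus suggested.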
Re: Lucene Book in UK
Have you checked Manning's site (http://www.manning.com), where you can order the book directly from them (the publisher)? They will also provide you with a copy of the eBook in the meantime, until your paperback arrives in the mail. -pedja
P.S. two cubes of sugar with that tea, please :)

David Townsend said the following on 1/6/2005 1:23 PM: Sorry if this is the wrong forum, but I wondered what's happened to 'Lucene In Action' in the UK. Looking forward to reading it, but amazon.co.uk reports it as a 'hard to find' item and is now quoting a 4-6 week delivery time and tacking on a rare book charge. Amazon.com is quoting shipping in 24hrs. Is this a new 'Boston Tea Party'? cheers David
Re: Lucene working with a DB
Hello
I'll just paste the relevant MySQL code; you add the calls to it per your needs. It has no checking of anything, so better add that as well... It's possible I didn't copy/paste everything, but you should get the idea where this is going... -pedja
--

import java.sql.*;
import lucene stuff...

public class sqlTest {
    public static void main(String[] args) throws Exception {
        String sTable = args[0];
        String sThing = args[1];
        String indexDir = "/path/to/lucene/index";
        try {
            Analyzer analyzer = new StandardAnalyzer();
            IndexWriter fsWriter = new IndexWriter(indexDir, analyzer, false);
            addSQLDoc(fsWriter, sTable, sThing);
            fsWriter.close();
        } catch (Exception e) {
            throw new Exception("caught a " + e.getClass()
                    + "\n with message: " + e.getMessage());
        }
    }

    private static void addSQLDoc(IndexWriter writer, String sqlTable, String somethingElse)
            throws Exception {
        String cs = "jdbc:mysql://HOST/DATABASE?user=SQLUSER&password=SQLPASSWORD";
        String sql = "SELECT * FROM " + sqlTable + " WHERE something=\"" + somethingElse + "\"";
        // establish a connection to the MySQL database
        try {
            Class.forName("com.mysql.jdbc.Driver").newInstance();
        } catch (Exception e) {
            System.out.println("Lucene: ERROR: Unable to load driver");
            e.printStackTrace();
        }
        // get the record data...
        try {
            Connection conn = DriverManager.getConnection(cs);
            Statement Stmt = conn.createStatement();
            ResultSet RS = Stmt.executeQuery(sql);
            while (RS.next()) {
                // make a new, empty document
                Document doc = new Document();
                // get the database fields
                String field1 = RS.getString(1);
                String field2 = RS.getString(2);
                String field3 = RS.getString(3);
                String field4 = RS.getString(4);
                String field5 = RS.getString(5);
                // add the fields
                doc.add(Field.Keyword("FIELD1", field1));
                doc.add(Field.Keyword("FIELD2", field2));
                doc.add(Field.Keyword("FIELD3", field3));
                doc.add(Field.Keyword("FIELD4", field4));
                doc.add(Field.Text("FIELD5", field5));
                // add the document
                writer.addDocument(doc);
            } // close while(..)
            RS.close();
            Stmt.close();
            conn.close();
        } catch (SQLException e) {
            e.printStackTrace();
            throw new Exception();
        }
    }
}
--

Daniel Cortes said the following on 12/21/2004 10:39 AM: I have read in a lot of messages that Lucene can index a DB because it uses that InputStream type, but I don't understand how to do this. For example, if I have a forum with MySQL and a lot of files on my web site, for every search I have to select the index that I want to use in my search, true? But I don't know how to make Lucene write an index from the information in the forum's DB (for example MySQL).
Re: How to index Windows' Compiled HTML Help (CHM) Format
I suggest you look at chmlib at: http://66.93.236.84/~jedwin/projects/chmlib/ -pedja

Tom said the following on 12/11/2004 11:20 AM: Hi, Does anybody know how to index chm files? A possible solution I know of is to convert chm files to pdf files (there are converters available for this job) and then use the known tools (e.g. PDFBox) to index the content of the pdf files (which contain the content of the chm files). Are there any tools which can directly grab the textual content out of the (binary) chm files? I think chm file indexing support is really a big missing piece in the currently supported indexable filetype collection (XML, HTML, PDF, MS Word DOC, RTF, plain text). WBR, Tom.
Re: No of docs using IndexSearcher
numDocs() http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#numDocs() Ravi said the following on 12/10/2004 2:42 PM: How do I get the number of docs in an index if I just have access to a searcher on that index? Thanks in advance Ravi.
Re: No of docs using IndexSearcher
If your index is open, shouldn't there be an instance of IndexReader already there?

Ravi said the following on 12/10/2004 3:13 PM: I already have a field with a constant value in my index. How about using IndexSearcher.docFreq(new Term(field,value))? Then I don't have to instantiate IndexReader.
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Friday, December 10, 2004 2:59 PM
To: Lucene Users List
Subject: Re: No of docs using IndexSearcher
numDocs() http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#numDocs() Ravi said the following on 12/10/2004 2:42 PM: How do I get the number of docs in an index if I just have access to a searcher on that index? Thanks in advance Ravi.
Re: Empty/non-empty field indexing question
Hi Otis
What kind of implications does that have on the search? If I understand correctly, that record would not be found if the field is not there, correct? But then is there any point in putting an empty value in it, if an application will never search for empty values?
thanks -pedja

Otis Gospodnetic said the following on 12/8/2004 1:31 AM: Empty fields won't add any value, you can skip them. Documents in an index don't have to be uniform. Each Document could have a different set of fields. Of course, that has some obvious implications for search, but is perfectly fine technically. Otis

--- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Here's probably a silly question, very newbish, but I had to ask. Since I have mysql documents that contain over 30 fields each, and most of them are added to the index, is it common practice to add fields to the index with empty values for that particular record, or should the field be omitted entirely? What I mean is, if let's say a Title field is empty on a specific record (in mysql), should I still add that field to the Lucene index with an empty value, or just skip it and only add the fields that contain non-empty values? thanks -pedja
Empty/non-empty field indexing question
Here's probably a silly question, very newbish, but I had to ask. Since I have mysql documents that contain over 30 fields each, and most of them are added to the index, is it common practice to add fields to the index with empty values for that particular record, or should the field be omitted entirely? What I mean is, if let's say a Title field is empty on a specific record (in mysql), should I still add that field to the Lucene index with an empty value, or just skip it and only add the fields that contain non-empty values? thanks -pedja
Problem with indexing/merging indices - documents not indexed.
(file.separator) + index;
        try {
            Date start = new Date();
            Analyzer analyzer = new StandardAnalyzer();
            String doctype = args[0];
            String crofileno = args[1];
            /*
            IndexReader reader = IndexReader.open(indexDir);
            int deleted = reader.delete(new Term("crofileno", crofileno));
            System.out.println("Lucene deleted records: " + deleted + "<br>");
            reader.close();
            */
            // let's make two writers, RAM and FS, so that we index to RAM first then merge at the end..
            RAMDirectory ramDir = new RAMDirectory();
            IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
            addDoc(ramWriter, doctype, crofileno);
            System.out.println("Docs in the RAM index: " + ramWriter.docCount());
            IndexWriter fsWriter = new IndexWriter(indexDir, analyzer, true);
            //fsWriter.setUseCompoundFile(false);
            //fsWriter.mergeFactor = 1000;
            //fsWriter.maxMergeDocs = 10;
            fsWriter.addIndexes(new Directory[] { ramDir });
            //fsWriter.optimize();
            System.out.println("Docs in the FS index: " + fsWriter.docCount());
            ramWriter.close();
            fsWriter.close();
            Date end = new Date();
            System.out.println("Lucene Added OK: "
                    + Long.toString(end.getTime() - start.getTime()) + " total milliseconds<br>");
        } catch (IOException e) {
            throw new Exception("Something bad happened: " + e.getClass()
                    + " with message: " + e.getMessage());
        } catch (Exception e) {
            throw new Exception("caught a " + e.getClass()
                    + "\n with message: " + e.getMessage());
        }
    }
}
Re: Problem with indexing/merging indices - documents not indexed.
Hi Chris
Actually, for merging indices that's how Otis did it in the article I quoted:

    // if -r argument was specified, use RAMDirectory
    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
    addDocs(ramWriter, docsInIndex);
    IndexWriter fsWriter = new IndexWriter(indexDir, analyzer, true);
    fsWriter.addIndexes(new Directory[] { ramDir });
    ramWriter.close();
    fsWriter.close();

..which works great, and all I've done is replace the addDocs with my MySQL version of the function. I do know about having to close and re-open to find the document count, which I've also tried, but it didn't yield any difference; a simple look in the index directory shows no files there except segments, even though it should've merged the RAM index into the FS one.
thanks -pedja

Chris Hostetter said the following on 12/6/2004 6:09 PM:
: I would appreciate any feedback on my code and whether I'm doing
: something in a wrong way, because I'm at a total loss right now
: as to why documents are not being indexed at all.
I didn't try running your code (because i don't have a DB to test it with), but a quick read gives me a good guess as to your problem: I believe you need to call...
    ramWriter.close();
...before you call...
    fsWriter.addIndexes(new Directory[] { ramDir });
(I've never played with merging indexes, so i could be completely wrong.) Everything I've ever read/seen/tried has indicated that until you close your IndexWriter, nothing you do will be visible to anybody else who opens that Directory. I'm also guessing that when you were trying to add the docs to fsWriter directly, you were using an IndexReader you had opened prior to calling fsWriter.close() to check the number of docs ... that won't work for the same reason. -Hoss
Is this a bug or a feature with addIndexes?
Greetings,
Ok, so maybe this is common knowledge to most of you, but I'm a layman when it comes to Lucene and I couldn't find any details about this after some searching. When you merge two indexes via addIndexes, does it only work in batches (10 or more documents)? Because I've been banging my head off the wall wondering why my code does not want to index 1 (one) document, and then I went to run Otis's MemoryVsDisk class from http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html?page=last but I didn't use 10,000 documents as suggested; I used 5 and 15 instead. And what do you know: with fewer than 10 it doesn't merge at all, while with more than 10 it will merge only the first 10 documents and gently forget about the other 5. My project requires me to index/update one single document as required and make it immediately available for searching. How do I accomplish this if index merging will not merge fewer than 10 documents and only works in increments of 10, and single indexing doesn't seem to do it at all (please see my other post http://marc.theaimsgroup.com/?l=lucene-userm=110237364203877w=2)
thanks -pedja
Re: Is this a bug or a feature with addIndexes?
Hi Otis
I did try; here's what I get:

    [EMAIL PROTECTED] tmp]# time java MemoryVsDisk 1 1 10 -r
    Docs in the RAM index: 1
    Docs in the FS index: 0
    Total time: 142 ms

    real    0m0.322s
    user    0m0.268s
    sys     0m0.033s

I tried other combinations but they don't seem to affect the outcome either :(
thanks -pedja

Otis Gospodnetic said the following on 12/6/2004 8:11 PM: Hello, Try changing IndexWriter's mergeFactor variable. It's 10 by default. Change it to 1, for instance. Otis

--- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Greetings, Ok, so maybe this is common knowledge to most of you, but I'm a layman when it comes to Lucene and I couldn't find any details about this after some searching. When you merge two indexes via addIndexes, does it only work in batches (10 or more documents)? Because I've been banging my head off the wall wondering why my code does not want to index 1 (one) document, and then I went to run Otis's MemoryVsDisk class from http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html?page=last but I didn't use 10,000 documents as suggested; I used 5 and 15 instead. And what do you know: with fewer than 10 it doesn't merge at all, while with more than 10 it will merge only the first 10 documents and gently forget about the other 5. My project requires me to index/update one single document as required and make it immediately available for searching. How do I accomplish this if index merging will not merge fewer than 10 documents and only works in increments of 10, and single indexing doesn't seem to do it at all (please see my other post http://marc.theaimsgroup.com/?l=lucene-userm=110237364203877w=2) thanks -pedja
problems search number range
(excuse me for my english)
hi people:
i am trying to do a search between two numbers. at the very beginning it was all right; for example, when i had the number 20 and i searched between 10 and 30, query = 'number:[10 TO 30]', then lucene found it. but if i change the range numbers to 5 and 130, i start to have problems: lucene doesn't find the number 20 anymore! i solved this by changing the format of the numbers and zero-padding them:
number to look for: 020
range: 005, 130
query = 'number:[005 TO 130]'
up to this point all correct. but then another problem starts: i need to use negative numbers, and then it all becomes crazy for me... i need to solve this search:
number: -10
range: -50 TO 5
i need help.. i don't find anything using google..
thanks
d2clon
Re: problems search number range
hi morus and company;

On Thursday 18 November 2004 12:49, Morus Walter wrote:
[EMAIL PROTECTED] writes: i need to solve this search: number: -10 range: -50 TO 5 i need help.. i don't find anything using google..

If your numbers are in the interval MIN/MAX and MIN < 0, you can shift that to a positive interval 0 ... (MAX-MIN) by subtracting MIN from each number.

thx, this is just what i have done..

Alternatively you have to find a string representation providing the correct order for signed integers. E.g. -0010 < -0001 < 0 < 1 < 00020 should work (in the range -9999..99999), since '0' has a higher ascii (unicode) code than '-'. Of course the analyzer has to preserve the '-', and the '-' should not be eaten by the query parser in case you use it. I don't know if there are problems with that, but I suspect there are, at least for the query parser.

this solution was the first that i tried.. but it does not work correctly, because when we sort these numbers in alphanumeric order we find that the string -0010 is higher than -0001 (while numerically -10 is lower than -1). so the final solution is what you suggested at the beginning of your post.
thx a lot
d2clon
Morus
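Morus's shift-and-pad scheme can be sketched as a small helper (the class name, the MIN bound, and the pad width are my own illustrative choices): add an offset so the smallest expected value maps to zero, then left-pad with zeros to a fixed width, so that lexicographic string order equals numeric order and range queries like number:[... TO ...] work even for negatives.

```java
public class RangeEncoder {
    static final long MIN = -10000; // smallest value you expect to index (assumption)
    static final int WIDTH = 10;    // enough digits to hold (MAX - MIN)

    // Shift into the non-negative range, then zero-pad to a fixed width.
    public static String encode(long n) {
        String s = Long.toString(n - MIN);
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < WIDTH; i++) sb.append('0');
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        // String order now matches numeric order, including negatives:
        System.out.println(encode(-50)); // 0000009950
        System.out.println(encode(-10)); // 0000009990
        System.out.println(encode(5));   // 0000010005
    }
}
```

Both the indexed values and the range endpoints in the query must go through the same encode() so they compare in the same padded form.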
hibernate and the algorithm for assigning scores to hits
(First, excuse my English.) Hi people. We are writing a small battery of tests to see how Lucene computes scores for searches over very simple, controllable documents, but we don't understand why the results look wrong to us. Let me explain; the details of the tests are at the bottom.

If you look at the tests you will see:

In test 1, for the query info:house, Lucene returns a score of 50.0 for this document:

  id      info
  ------  ----
  house1  house noise noise

and the same score for this document:

  id      info
  ------  ----
  house2  house noise noise noise

In test 2, for the same query info:house, Lucene returns a higher score for this document:

  id      info
  ------  ----
  house3  noise noiseso noise noiseso noise noiseso house house house house

than for this one:

  id      info
  ------  ----
  house5  noise noiseso noise noiseso noise noiseso house house house house house house

I have looked at Lucene's algorithm for computing these scores. I don't understand all of it, but the results of these searches still look wrong to me. I need comments, or URLs to study, so I can improve my search method. Thanks for all. d2clon

--- TESTS ---

We always execute this query, using the StandardAnalyzer:

  query: info:house

Test 1, document list:

  id       info
  -------  ----
  house0   house noise
  house1   house noise noise
  house2   house noise noise noise
  house3   house noise noise noise noise
  house4   house noise noise noise noise noise
  nohouse  noise noiseso noise noiseso noise noiseso

scores:

  id      score
  ------  -----
  house0  62.5
  house1  50.0
  house2  50.0
  house3  43.75
  house4  37.5

Test 2, document list:

  id       info
  -------  ----
  house0   noise noiseso noise noiseso noise noiseso house
  house1   noise noiseso noise noiseso noise noiseso house house
  house2   noise noiseso noise noiseso noise noiseso house house house
  house3   noise noiseso noise noiseso noise noiseso house house house house
  house4   noise noiseso noise noiseso noise noiseso house house house house house
  house5   noise noiseso noise noiseso noise noiseso house house house house house house
  house6   noise noiseso noise noiseso noise noiseso house house house house house house house
  house7   noise noiseso noise noiseso noise noiseso house house house house house house house house
  house8   noise noiseso noise noiseso noise noiseso house house house house house house house house house
  house9   noise noiseso noise noiseso noise noiseso house house house house house house house house house house
  nohouse  noise noiseso noise noiseso noise noiseso noise noiseso noise noiseso noise noiseso nohouse

scores:

  id      score
  ------  -------
  house9  79.0569
  house8  75.0
  house7  70.7107
  house6  66.1438
  house3  62.5
  house5  61.2372
  house4  55.9017
  house2  54.1266
  house1  44.1942
  house0  37.5

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
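A possible explanation for the flat spots in test 1 (why house1 and house2 tie at 50.0, and why the other scores land on values like 62.5 and 43.75): Lucene stores the document length norm, 1/sqrt(numTerms), in a single byte with roughly a 3-bit mantissa, so nearby norms collapse onto the same value. The sketch below illustrates that truncation in plain Java; it is not Lucene's actual encodeNorm/decodeNorm code, just a model of the precision loss:

```java
// Illustration only: truncate a value in (0, 1] to a 3-bit mantissa,
// modeling the precision lost by Lucene's one-byte norm encoding.
class NormDemo {
    static double quantize(double f) {
        int e = 0;
        // normalize f into [0.5, 1.0), tracking the power of two
        while (f < 0.5)  { f *= 2; e--; }
        while (f >= 1.0) { f /= 2; e++; }
        // keep only 3 mantissa bits, truncating toward zero
        return Math.floor(f * 8) / 8.0 * Math.pow(2, e);
    }

    public static void main(String[] args) {
        // lengthNorm(n) = 1/sqrt(n); the test-1 docs have 2..6 terms
        for (int n = 2; n <= 6; n++) {
            System.out.printf("%d terms: raw=%.4f quantized=%.4f%n",
                    n, 1 / Math.sqrt(n), quantize(1 / Math.sqrt(n)));
        }
        // quantized values: 0.625, 0.5, 0.5, 0.4375, 0.375
    }
}
```

Truncating 1/sqrt(n) for n = 2..6 this way gives 0.625, 0.5, 0.5, 0.4375 and 0.375, which matches the 62.5 / 50.0 / 50.0 / 43.75 / 37.5 pattern in the reported test-1 scores, including the house1/house2 tie.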
Re: LUCENE and the algorithm for assigning scores to hits
I'm sorry, friends: I put the title incorrectly twice.
Implement custom score
Hi, I know this is probably a common question, and I've found a couple of posts about it in the archive, but none with a complete answer. If there is one, please point me to it! The question is that I want to discard the default scoring and implement my own: I want all the hits to be sorted by popularity (a field) and by nothing else. How can I do this? Which classes and methods do I have to change? Thanks, William
Re: Implement custom score
Yes, thanks. I implemented my own Similarity class that returns 1.0f from lengthNorm() and idf(), and then I use setBoost() when writing the document. However, I get some small rounding errors: when I boost with 0.7, that document gets the score 0.625. I've found that this has to do with the encode/decode norm in Similarity. Should I do anything about it, or doesn't it matter? /William

Otis Gospodnetic wrote: You need your own Similarity implementation, and you need to set it as shown in this javadoc: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html Otis
Re: Implement custom score
Thanks for the reply. I've looked into the search method that takes a Sort object as an argument. As I understand it, the sorting is only done on the best matches (100 by default)? I don't want the default score to have any impact at all: I want to sort all hits by popularity, not just the best matches. /William

Erik Hatcher wrote: Actually, what William should use is the new Sort facility to order results by a field. Doing this with a Similarity would be much trickier. Look at the IndexSearcher.search() methods which take a Sort and follow the Javadocs from there. Let us know if you have any questions on sorting. It would be best if you represent your 'popularity' field as an integer (or at least numeric), since sorting by String uses more memory. Erik

On Sep 22, 2004, at 4:52 AM, Otis Gospodnetic wrote: You need your own Similarity implementation, and you need to set it as shown in this javadoc: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html Otis
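Erik's point about storing 'popularity' as a number rather than a String matters for correctness as well as memory: lexicographic order puts "9" ahead of "10". A tiny self-contained illustration with plain Java collections standing in for Lucene hits (the field values here are made up):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stored "popularity" values, kept as Strings the way a
// stored field holds them. Sorting the raw Strings mis-orders "9" above
// "10"; parsing to int first gives the order a popularity sort wants.
class PopularitySortDemo {
    public static void main(String[] args) {
        List<String> popularity = Arrays.asList("9", "10", "2");

        List<String> asStrings = new ArrayList<>(popularity);
        asStrings.sort(Comparator.reverseOrder());   // lexicographic
        System.out.println(asStrings);               // [9, 2, 10] -- wrong

        List<String> asInts = new ArrayList<>(popularity);
        asInts.sort(Comparator.<String>comparingInt(Integer::parseInt).reversed());
        System.out.println(asInts);                  // [10, 9, 2]
    }
}
```

The explicit `<String>` type witness on `comparingInt` keeps the compiler happy when `.reversed()` is chained onto it.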
Re: pdf in Chinese
It is not about the analyzer; I need to read the text from the PDF file first.

----- Original Message ----- From: Chandan Tamrakar To: Lucene Users List Sent: Wednesday, September 08, 2004 4:15 PM Subject: Re: pdf in Chinese

Which analyzer are you using to index Chinese PDF documents? I think you should use CJKAnalyzer.

----- Original Message ----- From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, September 08, 2004 11:27 AM Subject: pdf in Chinese

Hi all, I use PDFBox to parse PDF files into Lucene documents. When I parse Chinese PDF files, PDFBox does not always succeed. Does anyone have some advice?
pdf in Chinese
Hi all, I use PDFBox to parse PDF files into Lucene documents. When I parse Chinese PDF files, PDFBox does not always succeed. Does anyone have some advice?
problem in IndexSearcher
java.io.IOException: Lock obtain timed out

I was trying to create two instances of IndexSearcher on different index files. Is there something I've missed? tia, buics
Re: memory leak in lucene?
I also have problems regarding my application: what would be the ideal memory allocation for Lucene, considering my application will serve at least 20 transactions per second? tia --buics

On Fri, 3 Sep 2004 15:20:45 +0200, [EMAIL PROTECTED] wrote:

Terence, I still had no time to prepare a test case, but I worked around it. The idea is to replace the score with a timestamp when populating hits (in case you are not too interested in the real score), where field_sort is in MMddHHmm etc. format. Works fine; at least no OutOfMemory crash until now.

    public TopDocs search(Query query, Filter filter, final int nDocs,
                          final String field_sort) throws IOException {
      Scorer scorer = query.weight(this).scorer(reader);
      if (scorer == null)
        return new TopDocs(0, new ScoreDoc[0]);

      final BitSet bits = filter != null ? filter.bits(reader) : null;
      final HitQueue hq = new HitQueue(nDocs);
      final int[] totalHits = new int[1];
      scorer.score(new HitCollector() {
        public final void collect(int doc, float score) {
          // this bloody piece of code fakes the scorer into delivering results
          // sorted by date, because the valid way runs into the OutOfMemory
          // problem :( JGO 2004/08/31
          // note: modules touched --
          //   Searcher, Searchable, Hits, ParallelMultiSearcher, MultiSearcher,
          //   RemoteSearchable (these for the new field_sort parameter)
          //   ScoreDoc, FieldSortedHitQueue, Hits, FieldScore (float -> double)
          double new_score = score;
          if (field_sort != null) {  // if null, just sort as usual by real score
            try {
              new_score = new Double("0." + reader.document(doc).get(field_sort) + "d").doubleValue();
            } catch (IOException e) {
              e.printStackTrace();
            }
          }
          if (score > 0.0f                          // ignore zeroed buckets
              && (bits == null || bits.get(doc))) { // skip docs not in bits
            totalHits[0]++;
            hq.insert(new ScoreDoc(doc, new_score));
          }
        }
      });

      ScoreDoc[] scoreDocs = new ScoreDoc[hq.size()];
      for (int i = hq.size() - 1; i >= 0; i--)      // put docs in array
        scoreDocs[i] = (ScoreDoc) hq.pop();
      return new TopDocs(totalHits[0], scoreDocs);
    }

Hi Iouli, sorry, I am having a very tight schedule at work right now, so I don't have time to come up with the test case. The problem is really related to the search(query, sort) method call. If you can come up with the test case, that would be great. Thanks, Terence

Back to biz. Terence, probably you prepared it already, or should I do it? Otis, actually it's just a common way to execute a query. If the code is like

    hits = ms.search(query);

or

    sort = new Sort(SortField.FIELD_DOC);
    hits = ms.search(query, sort);

or even

    filter = new DateFilter("published", stamp_from, stamp_to);
    sort = new Sort(SortField.FIELD_DOC);
    hits = ms.search(query, filter, sort);

everything is ok; the memory gets freed (you can see it with top -p pid). The problem starts only in this case:

    sort = new Sort(new SortField("published_short", SortField.FLOAT, true));
    hits = ms.search(query, sort);

The memory never comes back and grows with every iteration, even if you start the garbage collector explicitly and the code somehow runs into finalize(). Regards, J. Iouli

Terence, could you create a self-sufficient test case that demonstrates the memory leak? If you can do that, please open a new bug entry in Bugzilla (the link to it is on Lucene's home page), and then attach your test case to it. Thanks! Otis

Yes Terence, it's exactly what I do.

Terence Lai wrote (21.08.2004): Are you calling ParallelMultiSearcher.search(Query query, Sort sort) to do your search? If so, I am currently having a similar problem. Terence

Doing queries against Lucene I run into a memory problem; it looks like the memory is not given back after the query has been executed. I use ParallelMultiSearcher and call the close method after the results are displayed:

    hits = null;         // Hits
    if (ms != null)
      ms.close();        // ParallelMultiSearcher

That doesn't help; the memory does not get freed. On queries like No* I get an incremental memory consumption of c. 20-70MB per query. Imagine what happens with my web server... I tried it from the command line too and got a similar result. Am I doing something wrong or missing something? Please help. I use 1.4.1 on a Linux box.
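Taken on its own, the timestamp-as-score trick in the workaround above boils down to this: an equal-width numeric timestamp string becomes a double in (0, 1) that preserves chronological order, so it can be smuggled into the hit queue as a "score". A self-contained sketch of just that conversion:

```java
// The "0." + timestamp + "d" trick in isolation. Java's Double parser
// accepts a trailing 'd' suffix, so "0.200409081530d" is a valid literal.
class TimestampScoreDemo {
    static double toScore(String stamp) {        // e.g. MMddHHmm-style digits
        return Double.parseDouble("0." + stamp + "d");
    }

    public static void main(String[] args) {
        double older = toScore("200408311200");
        double newer = toScore("200409081530");
        System.out.println(older < newer);       // true: order survives
        // caveat: only safe while all stamps have the same width;
        // "99" maps to 0.99 and would outrank "100", which maps to 0.100
    }
}
```

The width caveat is the reason the original code fixes the timestamp format up front.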
thanks for your mail
Received your mail; we will get back to you shortly.
Re: Italian web sites
The first one. Bye, Laura

What does it mean? An Italian website can be: - a site that uses the Italian language - a site owned by an Italian organization - a site hosted in an Italian geographical location. Every definition has a different solution.

Date sent: Wed, 24 Apr 2002 11:02:32 +0200 From: [EMAIL PROTECTED] Subject: Italian web sites To: [EMAIL PROTECTED] Send reply to: Lucene Users List

Hi all, I'm using JoBo for spidering web sites and Lucene for indexing. The problem is that I'd like to spider only Italian web sites. How can I discover the country of a web site? Do you know a method you can suggest? Thanks, Laura

-- Marco Ferrante ([EMAIL PROTECTED]) CSITA (Centro Servizi Informatici e Telematici d'Ateneo) Università degli Studi di Genova - Italy Via Brigata Salerno, ponte - 16147 Genova tel (+39) 0103532621 (interno tel. 2621)
Re: Italian web sites
Hi all, I have found a very interesting library, which is written in Perl. The problem now is how I can use this library. Anyway, the library is TextCat and you can find it at: http://odur.let.rug.nl/~vannoord/TextCat/ Bye, Laura

Combined with that, you could use an Italian stop-word list to run statistics on a page :-) ?!?

On Wednesday 24 April 2002 11:02, [EMAIL PROTECTED] wrote: Hi all, I'm using JoBo for spidering web sites and Lucene for indexing. The problem is that I'd like to spider only Italian web sites. How can I discover the country of a web site? Do you know a method you can suggest? Thanks, Laura
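For the stop-word statistics idea from the thread (as opposed to TextCat's character n-gram approach), here is a toy sketch. The word list and the 10% threshold are made-up illustrations, not tuned values:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy language guesser: count hits against a tiny Italian stop-word list
// and call the text Italian if enough of its words are on the list.
class ItalianGuesser {
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "il", "la", "di", "che", "e", "un", "una", "per", "non", "sono"));

    static boolean looksItalian(String text) {
        String[] words = text.toLowerCase().split("\\W+");
        int hits = 0;
        for (String w : words)
            if (STOPWORDS.contains(w)) hits++;
        return words.length > 0 && hits * 10 >= words.length;  // >= 10% stop words
    }

    public static void main(String[] args) {
        System.out.println(looksItalian("il gatto e sul tavolo della cucina")); // true
        System.out.println(looksItalian("the cat is on the kitchen table"));    // false
    }
}
```

A real spider would run this on the extracted page text before deciding whether to index the page.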
Re: Lucene in action at www.mil.fi
Hi Jari, where do you build your index? On the filesystem? Do you use a database? Laura

Hello, I'm glad to inform you that I've built a complete Lucene-based web search solution for the Finnish Defence Forces web site, and that it's online as of this moment. You can see it in action at: http://www2.mil.fi:8080/haku/haku?q=hornet The user interface is in Finnish, but I hope you can get a general grasp of what's going on there anyway.

As for the technical facts, I basically built a web crawler for indexing the www.mil.fi sites, and a servlet/xml/xsl-based frontend that delivers the results to your screen. The crawler is capable of indexing HTML (I used the Swing parser), PDF (I used xpdf, which is kinda bubble-gum-ish, but it works ;) and images (they're searched for by filename only). For the front end, I have a servlet that does the searching and prints out XML (raw XML output: http://www2.mil.fi:8080/haku/raw?q=hornet), which is then transformed to HTML via XSL (I wrote a neat little servlet filter for this).

The search servlet also has a simple query parser: the incoming query is parsed so that the default operand is AND instead of OR. So basically, if you type 'hornet picture', the actual search sent to Lucene will be '+hornet +picture'. I wanted it to be Google-like.

Anyway, check it out and feel free to ask me if you'd like to know something more about the implementation. Also, feel free to mention the Finnish Defence Forces in the Powered by section of the Lucene web site. Thanks go to all the Lucene developers, it's great stuff :D

Jari Aarniala -- Jari Aarniala [EMAIL PROTECTED] Vantaa, .fi -- death is the last dance eternal
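The AND-by-default rewrite Jari describes can be sketched as a plain string transformation applied before the query reaches Lucene's parser. This is a deliberately naive illustration (no phrase, field, or quoting support), not his actual code:

```java
// Sketch of an AND-by-default rewrite: prefix every bare term with '+'
// so Lucene's query parser treats all terms as required.
class AndDefaultQuery {
    static String rewrite(String userQuery) {
        StringBuilder sb = new StringBuilder();
        for (String term : userQuery.trim().split("\\s+")) {
            if (sb.length() > 0) sb.append(' ');
            // leave terms alone if the user already gave an operator prefix
            if (term.startsWith("+") || term.startsWith("-")) sb.append(term);
            else sb.append('+').append(term);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("hornet picture"));   // +hornet +picture
    }
}
```

The rewritten string can then be handed to the normal query parser unchanged.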
Re:_HTML_parser
Hi all, did someone try JoBo? It seems to be good software which can be extended. Does anyone have some experience with it? Laura

Laura, http://marc.theaimsgroup.com/?l=lucene-user&w=2&r=1&s=Spindle&q=b Oops, it's JoBo, not MoJo :) http://www.matuschek.net/software/jobo/ Otis

Hi Otis, thanks for your reply. I have been looking for Spindle and MoJo for 2 hours but I haven't found anything. Can you help me? Where can I find something? Thanks for your help and time. Laura

Laura, search the lucene-user and lucene-dev archives for things like: crawler, spider, spindle, lucene sandbox. Spindle is something you may want to look at, as is MoJo (not mentioned on the Lucene lists; use Google). Otis

Did someone solve the problem of spidering web pages recursively?

While trying to research the same thing, I found the following... here's a good example of link extraction. Try http://www.quiotix.com/opensource/html-parser It's easy to write a Visitor which extracts the links; it should take about ten lines of code.
Re: HTML parser
Hi all, I'm very interested in this thread. I also have to solve the problem of spidering web sites, creating the index (well, about this there is the BIG problem that Lucene can't be integrated easily with a DB), extracting the links from each page, and repeating the whole process.

For extracting links from a page I'm thinking of using JTidy. I think that with this library you can also parse a non-well-formed page (which you can fetch from the web with URLConnection) by setting the property to clean the page. Tidy returns an org.w3c.dom.Document that you can use for analyzing the whole document: for example, you can use doc.getElementsByTagName("a") to get all the "a" elements. You can parse it as XML.

Did someone solve the problem of spidering web pages recursively? Laura

While trying to research the same thing, I found the following... here's a good example of link extraction. Try http://www.quiotix.com/opensource/html-parser It's easy to write a Visitor which extracts the links; it should take about ten lines of code. -- Brian Goetz, Quiotix Corporation, [EMAIL PROTECTED], Tel: 650-843-1300, Fax: 650-324-8032, http://www.quiotix.com
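Once JTidy has turned a page into an org.w3c.dom.Document, the link extraction itself is ordinary DOM code. The sketch below uses the JDK's own XML parser on an already well-formed page so it stays self-contained; with JTidy you would obtain the Document from Tidy instead, which is the point of using it on messy real-world HTML:

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Collect the href attribute of every <a> element in a (well-formed) page.
class LinkExtractor {
    static List<String> extractLinks(String xhtml) {
        List<String> links = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));
            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++)
                links.add(((Element) anchors.item(i)).getAttribute("href"));
        } catch (Exception e) {
            // a real crawler would log the parse failure and skip the page
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"a.html\">A</a><a href=\"b.html\">B</a></body></html>";
        System.out.println(extractLinks(page));  // [a.html, b.html]
    }
}
```

With these links in hand, the spider can queue them and repeat the fetch/parse/index cycle.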
Some questions
Hi all, my name is Laura and I'm a new member of this list. I'm a long-time user of Tomcat and I'm also a member of the tomcat-user list. Yesterday, looking at the Jakarta menu, I saw Lucene and said: what is this? Reading the Lucene home page I understood that Lucene is a very interesting and big project. I'm looking for a product like Lucene!!! Because I'm a new member of this list and a new user of Lucene, I have some questions that you can probably answer easily.

Well, I saw that Lucene creates the index on the filesystem: I think this is a problem for a production environment. I usually use a database, for example Oracle. Is it possible to integrate Lucene with Oracle or some other DB (MySQL)?

I think there isn't any Italian Analyzer, is there? How can I write one?

The last question is: I suppose my search engine should be able to spider web sites. Is it possible to spider URLs? For example, is it possible that, starting from a page, I spider that page, then extract the links of the page, and then spider those links as well? How can I do this?

Well, I hope to be able to use Lucene. Thanks for your help, Laura
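Laura's last question (fetch a page, extract its links, then spider those in turn) is essentially a breadth-first traversal with a visited set. A sketch over an in-memory site map, with the real fetch/parse/index step reduced to a comment:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Breadth-first crawl skeleton: a frontier queue plus a visited set so
// each page is fetched exactly once, however many pages link to it.
class CrawlDemo {
    static List<String> crawl(String start, Map<String, List<String>> links) {
        List<String> visited = new ArrayList<>();
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        frontier.add(start);
        seen.add(start);
        while (!frontier.isEmpty()) {
            String page = frontier.poll();
            visited.add(page);                    // here: fetch, parse, index
            for (String next : links.getOrDefault(page, Collections.emptyList()))
                if (seen.add(next))               // never queue a page twice
                    frontier.add(next);
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = new HashMap<>();
        site.put("/", Arrays.asList("/a", "/b"));
        site.put("/a", Arrays.asList("/b", "/"));  // cycle back to the start
        System.out.println(crawl("/", site));      // [/, /a, /b]
    }
}
```

In a real spider the map lookup becomes an HTTP fetch plus link extraction, and you would add politeness delays and a same-host filter, but the frontier/visited-set shape stays the same.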