Zilverline Search Engine version 1.0-final released
All,

I've just released Zilverline version 1.0. New features include incremental indexing and scheduling of the indexing process, as well as a few minor updates. The source will be made available very soon. Zilverline is protected by a Collaborative Source License. You can read more about this type of licensing at http://www.zilverline.org

Zilverline is a search engine based on Lucene that's ready to roll and can simply be dropped into a servlet engine. It runs out of the box on both Windows and Linux, supports PDF, Word, HTM, TXT, RTF and CHM, and can index zip, rar, and many other formats.

Zilverline supports plugins. You can create your own extractors for various file formats. I've provided extractors for RTF, text, PDF, Word, and HTML.

Zilverline supports collections. A collection is a set of files and directories in a directory. A collection can be indexed and searched. The results of a search can be retrieved from local disk or remotely, if you run a webserver on your machine. Files inside zip, rar and chm files are extracted and indexed, and can be cached. The cache can be mapped to sit behind your webserver as well.

It's also possible to specify your own handlers for archives. Say you have a RAR archive and a program on your system that can extract its content; you can then specify that Zilverline should use this program.

Please take a look at http://www.zilverline.org, and have a swing at it.

cheers,
Michael Franken

- To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Search PDF ???
Hi Eric,

Try Zilverline: http://www.zilverline.org

Michael

Eric Chow wrote:
> Hello,
> 1. Is it possible to use Lucene to search PDF contents?
> 2. Can it search PDF files with Chinese contents?
> Eric
Zilverline release candidate 1.0-rc7 available
All,

I've just released a new candidate (*1.0-rc7*). New features include highlighting and 'on-the-fly' extraction of archives.

Zilverline is a search engine based on Lucene that's ready to roll and can simply be dropped into a servlet engine. It runs out of the box on both Windows and Linux, supports PDF, Word, HTM, TXT, RTF and CHM, and can index zip, rar, and many other formats.

Zilverline supports plugins. You can create your own extractors for various file formats. I've provided extractors for RTF, text, PDF, Word, and HTML.

Zilverline supports collections. A collection is a set of files and directories in a directory. A collection can be indexed and searched. The results of a search can be retrieved from local disk or remotely, if you run a webserver on your machine. Files inside zip, rar and chm files are extracted and indexed, and can be cached. The cache can be mapped to sit behind your webserver as well.

It's also possible to specify your own handlers for archives. Say you have a RAR archive and a program on your system that can extract its content; you can then specify that Zilverline should use this program.

Please take a look at http://www.zilverline.org, and have a swing at it.

cheers,
Michael Franken
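The archive-handler idea described above (hand an archive to an external program that knows how to unpack it) could be sketched roughly as follows. This is not Zilverline's code: the class `ArchiveHandlers`, its methods, and the `{archive}` / `{target}` placeholders are all hypothetical names chosen for illustration, and the `unrar x` command line in the usage note is only an assumption about what is installed on the machine.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: maps an archive extension to an external command template.
final class ArchiveHandlers {
    private final Map<String, List<String>> templates = new HashMap<>();

    // Registers a command template, e.g. ["unrar", "x", "{archive}", "{target}"].
    void register(String extension, List<String> commandTemplate) {
        templates.put(extension.toLowerCase(Locale.ROOT), new ArrayList<>(commandTemplate));
    }

    // Builds the concrete command line for one archive by filling in the placeholders.
    List<String> commandFor(String archivePath, String targetDir) {
        int dot = archivePath.lastIndexOf('.');
        String extension = dot < 0 ? "" : archivePath.substring(dot + 1).toLowerCase(Locale.ROOT);
        List<String> template = templates.get(extension);
        if (template == null) {
            throw new IllegalArgumentException("No handler registered for: " + archivePath);
        }
        List<String> command = new ArrayList<>();
        for (String part : template) {
            command.add(part.replace("{archive}", archivePath).replace("{target}", targetDir));
        }
        return command;
    }

    // Runs the external program and returns its exit code.
    int extract(String archivePath, String targetDir) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(commandFor(archivePath, targetDir)).inheritIO().start();
        return process.waitFor();
    }
}
```

With a handler registered as `register("rar", Arrays.asList("unrar", "x", "{archive}", "{target}"))`, a call to `extract("/data/docs/manuals.rar", "/tmp/cache/")` would start the external program on that file. Passing the command as a list rather than one string avoids quoting problems with paths that contain spaces.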
Zilverline release candidate 1.0-rc6 available
All,

I've just released a new candidate (*1.0-rc6*). New features include a command line indexer and support for Chinese and Cyrillic.

Zilverline is a free search engine based on Lucene that's ready to roll and can simply be dropped into a servlet engine. It runs out of the box on both Windows and Linux, supports PDF, Word, HTM, TXT, RTF and CHM, and can index zip, rar, and many other formats.

Zilverline supports plugins. You can create your own extractors for various file formats. I've provided extractors for RTF, text, PDF, Word, and HTML.

Zilverline supports collections. A collection is a set of files and directories in a directory. A collection can be indexed and searched. The results of a search can be retrieved from local disk or remotely, if you run a webserver on your machine. Files inside zip, rar and chm files are extracted and indexed, and can be cached. The cache can be mapped to sit behind your webserver as well.

It's also possible to specify your own handlers for archives. Say you have a RAR archive and a program on your system that can extract its content; you can then specify that Zilverline should use this program.

Please take a look at http://www.zilverline.org, and have a swing at it.

cheers,
Michael Franken
lucene 1.4 in maven repository
Hi,

Can anyone tell me why there is no Lucene 1.4 jar in the Maven repository at http://www.ibiblio.org/maven/lucene/jars/ ? Who makes them available? It would be very convenient to be able to get the latest version from there (or anywhere else).

regards,
Michael Franken
Re: searchhelp
The PDF and Word stuff has been done too: have a look at http://www.zilverline.org.

Michael Franken

Chandan Tamrakar wrote:
> For PDF you need to extract the text from the PDF files using the PDFBox library, and for Word documents you can use the Apache POI APIs. There are messages posted on the Lucene list related to your queries. About the database, I guess someone must have done it. :)
>
> - Original Message -
> From: Santosh [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> Sent: Thursday, August 19, 2004 3:58 PM
> Subject: searchhelp
>
> Hi,
> I am using the Lucene search engine for my application. I am able to search through text files and HTML files as specified by Lucene. Can you please clarify my doubts?
> 1. Can Lucene search through PDFs and Word documents? If yes, then how?
> 2. Can Lucene search through a database? If yes, then how?
> Thank you,
> Santosh
Re: Weighted queries
Hi Eric,

I have implemented this in Zilverline. What I do is the following: subclass QueryParser and override getFieldQuery:

    protected Query getFieldQuery(String field, Analyzer analyzer, String queryText) throws ParseException {
        // for the field that contains 'contents', add boost factors
        // for the other fields specified in BoostFactor
        if (defaultField.equals(field)) {
            TokenStream source = analyzer.tokenStream(field, new StringReader(queryText));
            Vector v = new Vector();
            org.apache.lucene.analysis.Token t;
            while (true) {
                try {
                    t = source.next();
                } catch (IOException e) {
                    t = null;
                }
                if (t == null) break;
                v.addElement(t.termText());
                log.debug(field + ", " + t.termText());
            }
            try {
                source.close();
            } catch (IOException e) {
                // ignore
            }
            if (v.size() == 0) {
                return null;
            } else {
                // create a new composed query
                BooleanQuery bq = new BooleanQuery();
                // get the static BoostFactors through a non-static getter
                BoostFactor bf = new BoostFactor();
                // for every boost factor create a new PhraseQuery
                Iterator iter = bf.getFactors().entrySet().iterator();
                while (iter.hasNext()) {
                    Map.Entry element = (Map.Entry) iter.next();
                    String thisField = ((String) element.getKey()).toLowerCase();
                    Float boost = (Float) element.getValue();
                    PhraseQuery q = new PhraseQuery();
                    // and add all the terms of the query
                    for (int i = 0; i < v.size(); i++) {
                        q.add(new Term(thisField, (String) v.elementAt(i)));
                    }
                    // boost the query
                    q.setBoost(boost.floatValue());
                    // and add it to the composed query (neither required nor prohibited)
                    bq.add(q, false, false);
                }
                log.debug("Query: " + bq);
                return bq;
            }
        } else {
            return super.getFieldQuery(field, analyzer, queryText);
        }
    }

Read the boost factors from an external source. I'm using an object with a HashMap; see BoostFactors at www.zilverline.org

Cheers,
Michael Franken

Eric Jain wrote:
> Is it possible to expand a query such as
>
>   foo bar
>
> into
>
>   (title:foo^4 OR abstract:foo^2 OR content:foo) AND (title:bar^4 OR abstract:bar^2 OR content:bar)
>
> ?
>
> I can assign weights to individual fields when indexing, and could use the MultiFieldQueryParser, but it seems this parser can't be configured to use AND as default!
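For comparison, the expansion Eric asks about can also be done at the string level, before the text is handed to any query parser. The following is a minimal sketch in plain Java with no Lucene dependency; the class name `WeightedQueryExpander` is made up for illustration, and it assumes the user query is a whitespace-separated list of plain terms with no query syntax of its own.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: rewrites "foo bar" into one boosted OR-group per term, joined by AND.
final class WeightedQueryExpander {
    private final Map<String, Float> boosts;

    // The map preserves insertion order, so fields appear in the order they were added.
    WeightedQueryExpander(Map<String, Float> boosts) {
        this.boosts = new LinkedHashMap<>(boosts);
    }

    String expand(String userQuery) {
        List<String> clauses = new ArrayList<>();
        for (String term : userQuery.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // an empty or blank query yields no clauses
            }
            List<String> alternatives = new ArrayList<>();
            for (Map.Entry<String, Float> entry : boosts.entrySet()) {
                String clause = entry.getKey() + ":" + term;
                float boost = entry.getValue();
                if (boost != 1.0f) {
                    clause += "^" + formatBoost(boost); // a boost of 1 is the default, so omit it
                }
                alternatives.add(clause);
            }
            clauses.add("(" + String.join(" OR ", alternatives) + ")");
        }
        return String.join(" AND ", clauses);
    }

    // Prints whole-number boosts without a trailing ".0".
    private static String formatBoost(float boost) {
        return boost == Math.rint(boost) ? String.valueOf((long) boost) : String.valueOf(boost);
    }
}
```

Given boosts of 4 for `title`, 2 for `abstract` and 1 for `content`, `expand("foo bar")` produces exactly the string from Eric's question. Unlike the getFieldQuery override above, this approach does not run the analyzer over the terms first, so it is only suitable when the resulting string is parsed afterwards.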
Zilverline release candidate 1.0-rc4 available
All,

I've just released a new candidate (*1.0-rc4*). New features include a Spanish GUI, RTF support, searching on date range, customizable boosting factors, and configurable analyzers per collection. Zilverline now generates an MD5 hash per file, which prevents duplicate files from being added more than once.

Zilverline supports plugins. You can create your own extractors for various file formats. I've provided extractors for RTF, text, PDF, Word, and HTML.

Zilverline supports collections. A collection is a set of files and directories in a directory. A collection can be indexed and searched. The results of a search can be retrieved from local disk or remotely, if you run a webserver on your machine. Files inside zip, rar and chm files are extracted and indexed, and can be cached. The cache can be mapped to sit behind your webserver as well.

It's also possible to specify your own handlers for archives. Say you have a RAR archive and a program on your system that can extract its content; you can then specify that Zilverline should use this program.

Zilverline is a free search engine based on Lucene that's ready to roll and can simply be dropped into a servlet engine. It runs out of the box on both Windows and Linux, supports PDF, Word, HTM, TXT, and CHM, and can index zip, rar, and many other formats.

Please take a look at http://www.zilverline.org, and have a swing at it.

cheers,
Michael Franken
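The duplicate check mentioned above (one MD5 hash per file, skip files whose hash was already seen) might look something like the sketch below. This only illustrates the general technique with the standard `java.security.MessageDigest` class; the class name `DuplicateFilter` and its methods are hypothetical and are not taken from Zilverline.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: remembers the MD5 hash of every file's content seen so far.
final class DuplicateFilter {
    private final Set<String> seenHashes = new HashSet<>();

    // Returns the MD5 digest of the content as a 32-character lowercase hex string.
    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform, so this is unexpected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns true if this content has not been seen before (and should be indexed).
    boolean addIfNew(byte[] content) {
        return seenHashes.add(md5Hex(content));
    }
}
```

An indexer would call `addIfNew` with each file's bytes and skip the file when it returns false. MD5 is fine for spotting accidental duplicates, but it is not collision-resistant against deliberately crafted files.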
Re: PDFBox problem.
Natarajan.T wrote:
> FYI, I am using PDFBox.jar to convert PDF to text. The problem is that at runtime it prints a lot of object messages. How can I avoid this?
>
> import java.io.InputStream;
> import java.io.BufferedWriter;
> import java.io.IOException;
> import org.pdfbox.util.PDFTextStripper;
> import org.pdfbox.pdfparser.PDFParser;
> import org.pdfbox.pdmodel.PDDocument;
> import org.pdfbox.pdmodel.PDDocumentInformation;
>
> /**
>  * @author natarajant
>  *
>  * TODO To change the template for this generated type comment go to
>  * Window - Preferences - Java - Code Generation - Code and Comments
>  */
> public class PDFConverter extends DocumentConverter {
>     public PDFConverter() { }
>
>     /**
>      * This method will construct the Lucene document object from the
>      * given information by extracting the text from PDF file.
>      *
>      * @param reader and writer - InputStream and BufferedWriter
>      * @return true or false i.e. extract the text or not
>      */
>     public boolean extractText(InputStream reader, BufferedWriter writer) throws IOException {
>         PDFParser parser = null;
>         PDDocument pdDoc = null;
>         PDFTextStripper stripper = null;
>         String pdftext = "";
>         String pdftitle = "";
>         try {
>             parser = new PDFParser(reader);
>             parser.parse();
>             pdDoc = parser.getPDDocument();
>             stripper = new PDFTextStripper();
>             pdftext = stripper.getText(pdDoc);
>             writer.write(pdftext + "");
>             PDDocumentInformation info = pdDoc.getDocumentInformation();
>             pdftitle = info.getTitle();
>         } catch(Exception err) {
>             System.out.println(err.getMessage());

change this to

>             return false;
>         }
>         writer.close();
>         return true;
>     }

finally { // close all open resources }

> }
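The try/catch/finally shape hinted at in the reply (report failure from the catch block, release resources in a finally block) can be shown in isolation. The sketch below deliberately uses only standard `java.io` classes and stands in a plain `Reader` for the PDF parsing step, so it makes no claims about the PDFBox API; the class name `SafeExtraction` is hypothetical.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

// Hypothetical sketch of "return false on error, always close resources".
final class SafeExtraction {

    // Copies all text from in to out. Returns true on success, false on failure.
    // Both streams are closed in either case.
    static boolean extract(Reader in, Writer out) {
        try {
            char[] buffer = new char[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            out.flush();
            return true;
        } catch (IOException e) {
            System.err.println("Extraction failed: " + e.getMessage());
            return false;
        } finally {
            // Runs on both the success and the failure path.
            closeQuietly(in);
            closeQuietly(out);
        }
    }

    private static void closeQuietly(Closeable resource) {
        try {
            resource.close();
        } catch (IOException ignored) {
            // Nothing useful can be done if closing fails.
        }
    }
}
```

Because the finally block runs even when the catch block returns, the caller never ends up with a half-written, still-open writer after a failed extraction. On current Java versions a try-with-resources statement would achieve the same with less code.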
Re: Extracting Lucene onto Tomcat
Hi Ian,

Depending on what you want to do, you could also follow the installation instructions on http://www.zilverline.org. They describe how to install Zilverline, but the same goes for the Lucene war.

Hope this helps,
Michael Franken

Ian McDonnell wrote:
> Also another silly question, do I need to set up a war on the server?
>
> --- Ian McDonnell [EMAIL PROTECTED] wrote:
>> Well, when I extracted it, it created the org/apache/lucene directories in the public_html directory. When I try to compile any of the source it just throws numerous errors. I've got the classpath set to web-inf/classes. Have I extracted it to the wrong directory?
>>
>> --- Erik Hatcher [EMAIL PROTECTED] wrote:
>>> On Jul 21, 2004, at 8:10 AM, Ian McDonnell wrote:
>>>> Is the package information and import paths ready to deploy on Tomcat server? I tried extracting Lucene on the server, but when I compile files, it just throws numerous no class definition errors and errors relating to the package.
>>>
>>> Huh? Lucene certainly deploys just fine in Tomcat web applications (in a WAR under WEB-INF/lib). Could you elaborate on what you mean here?
>>>
>>> Erik
Re: Extracting Lucene onto Tomcat
Hi Ian,

You don't extract war files or jar files. To deploy a web application that comes as a war file, you just have to drop it into the webserver/servlet engine. So just:

copy lucene.war tomcatserver/webapps

That's it. I advise you to read some of the documentation on the Tomcat website on deploying web applications, or, if you're really serious, buy this book: http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471446629.html

regards,
Michael

Ian McDonnell wrote:
> I was looking at your instructions there, but couldn't really figure out what you mean. Can I manually add the extracted directories onto the Tomcat server, and if so, what should my root directory be? Say, for example, the extracted directories org/apache/lucene/. Should I have that as public_html/WEB-INF/org/apache/lucene?
>
> Ian
Re: Anyone use MultiSearcher class
Hi Don,

Yes, I'm using the MultiSearcher (in Zilverline), and have seen no serious performance issues with it. The app performs well with multiple indexes; it responds so quickly (with 100k+ documents) that I haven't even taken the time to measure the difference from a single-index search.

Michael Franken

Don Vaillancourt wrote:
> Hello,
>
> Has anyone used the MultiSearcher class? I have noticed that searching two indexes using this MultiSearcher class takes 8 times longer than searching only one index. I could understand if it took 3 to 4 times longer to search due to sorting the two search results and stuff, but why 8 times longer? Is there some optimization that can be done to hasten the search? Or should I just write my own MultiSearcher? The problem though is that there is no way for me to create my own Hits object (no methods are available and the class is final). Anyone have any clue?
>
> Thanks
>
> Don Vaillancourt
> Director of Software Development
> WEB IMPACT INC.
Re: upgrade from Lucene 1.3 final to 1.4rc3 problem
This is a bug (see the posting 'Lockfile Problem Solved'). Upgrade to 1.4-final and you'll be fine.

Alex Aw Seat Kiong wrote:
> Hi!
>
> I'm currently using Lucene 1.3 final, and everything was working fine. But after I upgraded from Lucene 1.3 final to 1.4rc3 (simply replacing lucene-1.3-final.jar with lucene-1.4-rc3.jar and recompiling), it recompiles successfully, but when we try to index a document it gives the error below:
>
> java.lang.NullPointerException
>     at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:146)
>     at org.apache.lucene.store.FSDirectory.<init>(FSDirectory.java:126)
>     at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:102)
>     at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:83)
>     at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:173)
>
> What is wrong? Please help. Thanks.
>
> Regards,
> Alex
Zilverline release candidate 1.0-rc3 available
All,

I've just released a new candidate (*1.0-rc3*) that now supports plugins. You can create your own extractors for various file formats. I've provided extractors for text, PDF, Word, and HTML.

It's also possible to specify your own handlers for archives. Say you have a RAR archive and a program on your system that can extract its content; you can then specify that Zilverline should use this program.

Zilverline is a free search engine based on Lucene that's ready to roll and can simply be dropped into a servlet engine. It runs out of the box on both Windows and Linux, supports PDF, Word, HTM, TXT, and CHM, and can index zip, rar, and many other formats.

Please take a look at http://www.zilverline.org, and have a swing at it.

cheers,
Michael Franken
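To give an idea of what an extractor plugin mechanism of this kind can look like, here is a small sketch. It is not Zilverline's actual plugin API: the `Extractor` interface, `PlainTextExtractor` and `ExtractorRegistry` are hypothetical names, and the sketch assumes extractors are selected by file extension and that text files are UTF-8 encoded.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical plugin contract: turn a file's bytes into indexable text.
interface Extractor {
    String extractText(InputStream in) throws IOException;
}

// Simplest possible plugin: reads the whole stream as UTF-8 text.
final class PlainTextExtractor implements Extractor {
    @Override
    public String extractText(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}

// Looks up the extractor responsible for a file, based on its extension.
final class ExtractorRegistry {
    private final Map<String, Extractor> byExtension = new HashMap<>();

    void register(String extension, Extractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    // Returns the registered extractor, or null if the file type is unknown.
    Extractor forFile(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null;
        }
        return byExtension.get(fileName.substring(dot + 1).toLowerCase(Locale.ROOT));
    }
}
```

A PDF or Word extractor would implement the same interface and delegate to a parsing library, and the indexer would only ever talk to `Extractor`. That keeps format-specific code out of the indexing loop, which is the main point of a plugin design.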
Re: Tool for analyzing analyzers
Hi Erik,

Erik Hatcher wrote:
> [snip] But I'd love to build a Lucene demo application that is powerful enough to be used as a foundation for folks to use out-of-the-box.

That's just what I thought. Here's one: http://www.zilverline.org

> Erik

Cheers,
Michael Franken