escaping characters
Hi all, I am using the following query to find the document exactly matching the title: Abramovich says Chelsea win showed Russian character. In both cases, with and without escaping characters (and with all the other forms from http://jakarta.apache.org/lucene/docs/queryparsersyntax.html), I receive these exceptions:

CASE 1 - without escaping:
Query == ((TITLE:([Abramovich says Chelsea win showed Russian character TO Abramovich says Chelsea win showed Russian character]))
org.apache.lucene.queryParser.ParseException: Encountered character\\ Was expecting: ] ...

CASE 2 - with escaping:
Query == (TITLE:([Abramovich says Chelsea win showed \"Russian character\" TO Abramovich says Chelsea win showed \"Russian character\"]))
org.apache.lucene.queryParser.ParseException: Encountered character Was expecting: ] ...

Question 1: Where is the problem? (I am using SimpleAnalyzer; is that correct?)
Question 2: Is there a more clever way to get the document exactly matching this title? (For info: my titles are unique.)
Please answer both questions. Best regards! Rosen
Re: Does a RAMDirectory ever need to merge segments... (performance issue)
I've always wondered about this too. To put it another way, how does mergeFactor affect an IndexWriter backed by a RAMDirectory? Can I set mergeFactor to the highest possible value (given the machine's RAM) in order to avoid merging segments?

Kevin A. Burton [EMAIL PROTECTED] 04/20/04 04:40AM: I've been benchmarking our indexer to find out if I can squeeze any more performance out of it. I noticed one problem with RAMDirectory... I'm storing documents in memory and then writing them to disk every once in a while. ... IndexWriter.maybeMergeSegments is taking up 5% of total runtime. DocumentWriter.addDocument is taking up another 17% of total runtime. Note that these don't add up to 100% because there are other tasks taking up CPU before and after Lucene is called. Anyway... I don't see why RAMDirectory is trying to merge segments. Is there any way to prevent this? I could just store the documents in a big ArrayList until I'm ready to write them to a disk index, but I'm not sure how efficient that would be. Anyone run into this before? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
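The "big ArrayList" idea mentioned above can be sketched generically: accumulate items in memory and hand them off in one batch once a threshold is reached. This is an illustrative sketch, independent of Lucene's API; the flush callback (which in practice would add the batch to a disk-backed IndexWriter) is a stand-in:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffers items in memory and hands them off in one batch once the
// buffer reaches a fixed size -- the "big ArrayList" approach, with
// Lucene's Document stood in for by a type parameter.
class BatchBuffer<T> {
    private final List<T> buffer = new ArrayList<>();
    private final int flushSize;
    private final Consumer<List<T>> flusher; // e.g. "add all to a disk IndexWriter"

    BatchBuffer(int flushSize, Consumer<List<T>> flusher) {
        this.flushSize = flushSize;
        this.flusher = flusher;
    }

    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= flushSize) {
            flush();
        }
    }

    // Flush any remaining items; call once at the end of indexing.
    void flush() {
        if (!buffer.isEmpty()) {
            flusher.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

This sidesteps segment merging entirely until the batch is written, at the cost of holding the raw documents (rather than a compact index) in memory.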
Searcher not aware of index changes
Hi! My Searcher instance is not aware of changes to the index. I even create a new instance, but it seems only a complete restart helps(?): indexSearcher = new IndexSearcher(IndexReader.open(index)); Timo
Re: query
On Apr 21, 2004, at 10:01 AM, Rosen Marinov wrote: Does this query work: my name is \"Rosen\"? http://wiki.apache.org/jakarta-lucene/AnalysisParalysis Short answer: it depends. Questions for you to answer: What field type and analyzer did you use during indexing? What analyzer is used with QueryParser? What does the generated Query.toString return?
Re: query
Short answer: it depends. Questions for you to answer: What field type and analyzer did you use during indexing? What analyzer is used with QueryParser? What does the generated Query.toString return? In both cases SimpleAnalyzer. QueryParser.parse(\"abc\") throws an exception, so I can't see what Query.toString returns in this case. What analyzer should I use if I want to execute the following queries: simple keyword search (+bush -president, etc.) and range queries including characters in the search values? Thank you
Re: Searcher not aware of index changes
This is not normal behaviour. Normally using a new IndexSearcher should reflect the modified state of your index. Could you post a more informative bit of code? sv On Wed, 21 Apr 2004 [EMAIL PROTECTED] wrote: Hi! My Searcher instance is not aware of changes to the index. I even create a new instance, but it seems only a complete restart helps(?): indexSearcher = new IndexSearcher(IndexReader.open(index)); Timo
Re: Searcher not aware of index changes
On Wednesday 21 April 2004 19:20, Stephane James Vaucher wrote: This is not normal behaviour. Normally using a new IndexSearcher should reflect the modified state of your index. Could you post a more informative bit of code?

BTW, why can't Lucene take care of this itself? Well, according to my logging it does create a new instance. I use only one instance of SearchFacade:

public class SearchFacade extends Observable {
    protected class IndexObserver implements Observer {
        private final Log log = LogFactory.getLog(getClass());
        public Searcher indexSearcher;

        public IndexObserver() {
            newSearcher(); // init
        }

        public void update(Observable o, Object arg) {
            log.debug("Index has changed, creating new Searcher");
            newSearcher();
        }

        private void newSearcher() {
            try {
                indexSearcher = new IndexSearcher(IndexReader.open(Configuration.LuceneIndex.MAIN));
            } catch (IOException e) {
                log.error("Could not instantiate searcher: " + e);
            }
        }

        public Searcher getIndexSearcher() {
            return indexSearcher;
        }
    }

    private IndexObserver indexObserver;

    public SearchFacade() {
        addObserver(indexObserver = new IndexObserver());
    }

    public void createIndex() {
        ...
        setChanged(); // index has changed
        notifyObservers();
    }

    public Hits search(String query) {
        Searcher searcher = indexObserver.getIndexSearcher();
        ...
    }
}
Re: Searcher not aware of index changes
Normally the code should work, if you don't keep references to the old Searcher (and don't try caching it). Make sure you aren't doing this by mistake. For the design of your facade, you could always implement Searchable and delegate to the up-to-date instance of IndexSearcher. Quick comment: you should call close() on your searcher before removing the reference. If this causes exceptions in future searches, it would indicate incorrect caching. HTH, sv On Wed, 21 Apr 2004 [EMAIL PROTECTED] wrote: [quoted SearchFacade code snipped]
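The swap-and-close pattern Stephane describes can be sketched generically. This is an illustrative holder class, not Lucene API; the searcher is modeled with java.io.Closeable since Lucene's Searcher exposes a similar close() method:

```java
import java.io.Closeable;
import java.io.IOException;

// Holds the current searcher-like resource; swapping in a fresh one
// closes the old instance so no stale reference lingers.
class CurrentHolder<T extends Closeable> {
    private T current;

    synchronized void swap(T fresh) {
        T old = current;
        current = fresh;
        if (old != null) {
            try {
                old.close(); // release the outdated instance
            } catch (IOException e) {
                // log and continue; the fresh instance is already installed
            }
        }
    }

    synchronized T get() {
        return current;
    }
}
```

In a multithreaded application you would additionally need to ensure no search is still in flight against the old instance before closing it, which is why closing too eagerly can surface as exceptions in concurrent searches.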
Re: Does a RAMDirectory ever need to merge segments... (performance issue)
Gerard Sychay wrote: I've always wondered about this too. To put it another way, how does mergeFactor affect an IndexWriter backed by a RAMDirectory? Can I set mergeFactor to the highest possible value (given the machine's RAM) in order to avoid merging segments?

Yes... actually I was thinking of increasing these vars on the RAMDirectory in the hope of avoiding this CPU overhead. Also, I think the var you want is minMergeDocs, not mergeFactor. The only problem is that the source to maybeMergeSegments says:

private final void maybeMergeSegments() throws IOException {
    long targetMergeDocs = minMergeDocs;
    while (targetMergeDocs <= maxMergeDocs) {

So I guess to prevent this we would have to set minMergeDocs to maxMergeDocs+1... which makes no sense. Also, by default maxMergeDocs is Integer.MAX_VALUE, so that would have to be changed too. Anyway... I'm still playing with this myself. It might be easier to just use an ArrayList of N documents if you know for sure how big your RAM dir will grow.

Kevin

-- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
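The loop condition quoted above is the crux: a merge pass is considered whenever the starting target (minMergeDocs) does not exceed the ceiling (maxMergeDocs), so merging can only be skipped by making the target exceed the cap. A simplified stand-alone model of just that guard (not the actual Lucene source):

```java
// Simplified model of the guard in IndexWriter.maybeMergeSegments():
// the merge loop is entered only while the current merge target does
// not exceed the configured ceiling. With Lucene's defaults
// (minMergeDocs = 10, maxMergeDocs = Integer.MAX_VALUE) the loop
// body is always entered at least once.
class MergePolicyModel {
    static boolean wouldAttemptMerge(long minMergeDocs, long maxMergeDocs) {
        long targetMergeDocs = minMergeDocs;
        return targetMergeDocs <= maxMergeDocs;
    }
}
```

This illustrates why, as the post notes, disabling merging entirely would require the awkward setting minMergeDocs = maxMergeDocs + 1.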
what web crawler work best with Lucene?
Has anyone implemented an open source web crawler with Lucene? I have a dynamic website and am looking at putting in a search tool. Your advice is very much appreciated. Thank you. IMPORTANT - This email and any attachments are confidential and may be privileged in which case neither is intended to be waived. If you have received this message in error, please notify us and remove it from your system. It is your responsibility to check any attachments for viruses and defects before opening or sending them on. Where applicable, liability is limited by the Solicitors Scheme approved under the Professional Standards Act 1994 (NSW). Minter Ellison collects personal information to provide and market our services. For more information about use, disclosure and access, see our privacy policy at www.minterellison.com.
Re: query
On Apr 21, 2004, at 12:17 PM, Rosen Marinov wrote: In both cases SimpleAnalyzer. QueryParser.parse(\"abc\") throws an exception and I can't see what Query.toString returns in this case.

This is clean and green for me:

public void testAbc() throws ParseException {
    Query query = QueryParser.parse("\"abc\"", "field", new SimpleAnalyzer());
    assertEquals("abc", query.toString("field"));
}

Either you're using an old version of Lucene that is broken in this regard (I'm at CVS HEAD) or something else is fishy. Note that a single term in quotes is optimized into a TermQuery, not a quoted PhraseQuery, in the assert above.

What analyzer should I use if I want to execute the following queries: simple keyword search (+bush -president, etc.) and range queries including characters in the search values?

Ranges with spaces in them don't work. Ranges are for single terms, not tokenized phrases. If you indexed the entire phrase as a single term (Field.Keyword), then you could do an API RangeQuery, but QueryParser won't be happy. QueryParser syntax is documented on the Lucene website if you need assistance with the type of syntax it supports. Erik
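A sketch of the Field.Keyword approach Erik mentions, assuming the Lucene 1.x API: the title is indexed as one untokenized term and matched with a TermQuery built directly through the API, bypassing QueryParser and its escaping problems. The field name "title" and the sample title are illustrative:

```java
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class ExactTitleMatch {
    // Index the title as a single untokenized term, then look it up
    // with a TermQuery built via the API -- no QueryParser involved,
    // so spaces and special characters need no escaping.
    static int hitsForExactTitle(String title) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Keyword("title", title)); // stored, indexed, NOT tokenized
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.search(new TermQuery(new Term("title", title)));
        int n = hits.length();
        searcher.close();
        return n;
    }
}
```

Since titles here are unique, the query should find exactly the one document whose title matches character for character.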
Re: what web crawler work best with Lucene?
How big is the site? I mostly use an in-house solution, but I've used HttpUnit for web scraping small sites (because of its high-level API). Here is a hello world example: http://wiki.apache.org/jakarta-lucene/HttpUnitExample For a small/simple site, small modifications to this class could suffice. IT WILL NOT function on large sites because of memory problems. For larger sites, there are questions like:

- memory: For example, spidering all links on every page can lead to visiting too many links. Keeping all visited links in memory can be problematic.

- noise: If you get every page on your web site, you might be adding noise to the search engine. Spider navigation rules can help out, like saying that you should only follow links/index documents of a specific form like www.mysite.com/news/article.jsp?articleid=xxx

- speed: Too much speed can be bad: doing 100 hits/sec on a site could hurt it (especially if you are not the webmaster). Too little speed can be bad if you want to make sure you quickly pick up new pages.

- categorisation: You might want to separate information in your index. For example, you might want a user to search only the documentation section or only the press release section. This categorisation can be done by specifying sections of the site, or by a subsequent analysis of the available docs.

- up-to-date information: You'll want to think about your update schedule, so that if you add a new page it gets indexed quickly. The same applies when you modify an existing page: you might want the modification to be detected rapidly.

HTH, sv On Thu, 22 Apr 2004, Tuan Jean Tee wrote: Has anyone implemented an open source web crawler with Lucene? I have a dynamic website and am looking at putting in a search tool. Your advice is very much appreciated. Thank you.
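The navigation-rule idea above (follow only links of a known article form, and each URL only once, to control both noise and memory) can be sketched as a simple URL filter. This is a generic illustration, not part of any crawler library; the article-URL pattern echoes the hypothetical example from the post:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Decides which discovered links a crawler should follow: only URLs
// matching the configured article pattern, and each URL only once
// (the visited set bounds memory to one entry per accepted URL).
class LinkFilter {
    private final Pattern allowed;
    private final Set<String> visited = new HashSet<>();

    LinkFilter(String allowedRegex) {
        this.allowed = Pattern.compile(allowedRegex);
    }

    boolean shouldFollow(String url) {
        if (!allowed.matcher(url).matches()) {
            return false;          // off-pattern link: skip as noise
        }
        return visited.add(url);   // false if already crawled
    }
}
```

Restricting the visited set to accepted URLs only (rather than every link seen) is one way to keep the memory footprint proportional to the pages you actually care about.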