Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!
Morus Walter wrote:
> Owen Densmore writes:
>> 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) apparently produces non-word stems, i.e. not really human readable. (Example: generate, generates, generated, generating -> generat.) Although in typical queries this is not important, because the result of the search is a document list, it *would* be important if we used the stems within a graphical navigation interface. So the question is: is there a way to have the stemmer produce English base forms of the words being stemmed?
> Rule-based stemmers such as Porter/Snowball cannot do that. But there are (commercial) dictionary-based tools that can, e.g. the Canoo lemmatizer. You might also have a look at Egothor's stemmers, which are word-list based.

Egothor stemmers are algorithmic; they only use word lists for training. The stems they produce are usually closer to lemmas than e.g. Porter's, but there is still a significant number of stems like the one in the example above.

Best regards, Andrzej Bialecki -- Information Retrieval, Semantic Web; Embedded Unix, System Integration. http://www.sigram.com Contact: info at sigram dot com

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Search Chinese in Unicode !!!
How do I create an index from Chinese (UTF-8 encoded) HTML and search it with Lucene?
Re: Search Chinese in Unicode !!!
On Jan 21, 2005, at 4:49 AM, Eric Chow wrote:
> How to create index with chinese (in utf-8 encoding) HTML and search with Lucene?

Indexing and searching Chinese is basically no different from using English with Lucene. We covered a bit about it in Lucene in Action: http://www.lucenebook.com/search?query=chinese And there is a screenshot here: http://www.blogscene.org/erik/LuceneInAction/i18n.html The main issues with Chinese, and of course other languages, are encoding concerns when reading in the text, in both indexing and querying, and analysis (as you can see from the screenshot). Lucene itself works fine with Unicode, and you're free to index anything. Erik
Re: How works *
On Fri, 2005-01-21 at 10:58 +0100, Bertrand VENZAL wrote:
> I wondered how Lucene implements the * character. I know it works, but when I look at the Query object it doesn't seem to appear anywhere. Does someone know how it is implemented?

Take a look at PrefixQuery and WildcardQuery. PrefixQuery works by finding all terms beginning with the prefix and then constructing a BooleanQuery over them. I assume WildcardQuery works in a similar way. If you have many matching terms or a short prefix (e.g. a*), you might need to increase the maximum number of clauses allowed in a BooleanQuery, because the number of terms might exceed the default of 1024. -- Miles Barr, Runtime Collective Ltd.
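A minimal sketch of the suggestion above, against the Lucene 1.4-era API (the field name "contents", the limit 4096, and the pre-existing `searcher` are my own illustrative assumptions, not from the thread):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

// A short prefix like "a*" can expand to thousands of terms, so raise
// the clause limit above the default of 1024 before searching.
BooleanQuery.setMaxClauseCount(4096);
Query query = new PrefixQuery(new Term("contents", "a"));
Hits hits = searcher.search(query); // searcher: an existing IndexSearcher
```

Without the raised limit, a prefix matching more than 1024 terms throws BooleanQuery.TooManyClauses at search time.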
Re: Search Chinese in Unicode !!!
Search is not really correct with UTF-8!

The following is the search result when I used the SearchFiles class from the Lucene demo:

d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src> java org.apache.lucene.demo.SearchFiles c:\temp\myindex
Usage: java SearchFiles idnex
Query:
Searching for: g strange ??
3 total matching documents
0. ../docs/ChineseDemo.html - this files contains the
1. ../docs/luceneplan.html - Jakarta Lucene - Plan for enhancements to Lucene
2. ../docs/api/index-all.html - Index (Lucene 1.4.3 API)
Query:

From the above result, only ChineseDemo.html includes the character that I want to search for!

The modified code in SearchFiles.java:

BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!
> 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) apparently produces non-word stems, i.e. not really human readable.

It is possible to derive the human-readable form of a stemmed term using either re-analysis of the indexed content or a TermPositionVector. Either of these techniques should give you the position data required to recover the original form. The highlighter package is one example of where this technique is used. Cheers, Mark
Re: Search Chinese in Unicode !!!
On Jan 21, 2005, at 11:42, Eric Chow wrote:
> Search not really correct with UTF-8 !!!

Lucene works just fine with any flavor of Unicode, as long as _your_ application knows how to consistently deal with Unicode as well. Remember: the world is not just one Big5 pile. As far as the Analyzer goes, you may or may not be better off using something more tailored to your linguistic needs. That said, even the default Analyzer does a fairly decent job of handling non-roman languages. YMMV. Cheers -- PA http://alt.textdrive.com/
RE: Filtering w/ Multiple Terms
OK. But isn't there a limit on the number of clauses that can be combined with AND / OR / etc. in a BooleanQuery? Jerry Jalenak, Senior Programmer / Analyst, Web Publishing, LabOne, Inc.

-----Original Message-----
From: Erik Hatcher
Sent: Thursday, January 20, 2005 5:05 PM
To: Lucene Users List
Subject: Re: Filtering w/ Multiple Terms

On Jan 20, 2005, at 5:02 PM, Jerry Jalenak wrote:
> In looking at the examples for filtering of hits, it looks like I can only specify a single term, i.e.
>
> Filter f = new QueryFilter(new TermQuery(new Term("acct", "acct1")));
>
> I need to specify more than one term in my filter. Short of using something like ChainFilter, how are others handling this?

You can make as complex a Query as you want for QueryFilter. If you want to filter on multiple terms, construct a BooleanQuery with nested TermQuerys, either in an AND or OR fashion. Erik
Stemming
I want to understand how Lucene uses stemming but can't find any documentation on the Lucene site. I'll continue to google but hope that this list can help narrow my search. I have several questions on the subject currently but hesitate to list them here since finding a good document on the subject may answer most of them. Thanks in advance for any pointers, Kevin
Re: Stemming
Hi Kevin, Stemming is an optional operation and is done in the analysis step. Lucene comes with a Porter stemmer and a filter that you can use in an Analyzer:

./src/java/org/apache/lucene/analysis/PorterStemFilter.java
./src/java/org/apache/lucene/analysis/PorterStemmer.java

You can find more about it here: http://www.lucenebook.com/search?query=stemming You will also see mentions of SnowballAnalyzer in those search results, and you can find an adapter for Snowball analyzers in the Lucene Sandbox. Otis
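To connect the pieces listed above: a stemming analyzer is just a token-stream chain ending in PorterStemFilter. A minimal sketch against the Lucene 1.4-era API (the class name PorterAnalyzer is my own, not part of Lucene):

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;

// Tokenize on letters, lower-case, then reduce each token to its Porter stem.
public class PorterAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}
```

Pass an instance of this analyzer to both IndexWriter and QueryParser so that indexed terms and query terms are stemmed the same way.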
RE: Filtering w/ Multiple Terms
This: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.TooManyClauses.html ? You can control that limit via http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.html#maxClauseCount Otis

--- Jerry Jalenak wrote:
> OK. But isn't there a limit on the number of BooleanQueries that can be combined with AND / OR / etc? [...]
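A sketch of the BooleanQuery-as-filter approach discussed in this thread, using the Lucene 1.4-era add(query, required, prohibited) signature (the second account term "acct2" and the pre-existing `searcher`/`userQuery` are my own assumptions; the "acct"/"acct1" names come from the original example):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

// OR two account terms together (required=false, prohibited=false)
// and wrap the combined query in a QueryFilter.
BooleanQuery acctQuery = new BooleanQuery();
acctQuery.add(new TermQuery(new Term("acct", "acct1")), false, false);
acctQuery.add(new TermQuery(new Term("acct", "acct2")), false, false);
Filter filter = new QueryFilter(acctQuery);
Hits hits = searcher.search(userQuery, filter);
```

For an AND-style filter, pass required=true for each clause instead.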
Re: Suggestion needed for extranet search
Hi Ranjan, It sounds like you should look at and use Nutch: http://www.nutch.org Otis

--- Ranjan K. Baisak wrote:
> I am planning to move to Lucene but do not have much knowledge of it. The search engine which I developed searches some extranet URLs, e.g. codeguru.com/index.html. Is it possible to get the same functionality using Lucene? So basically, can I make Lucene a search engine to search extranets? regards, Ranjan
Search on heterogenous index
Hello all. I'm new to Lucene and am thinking about using it in my project. I have price lists with dynamic structure, containing wares: about 10K price lists with 500K wares in total. Each price list has about 5 text fields. I'll do searches on wares. The difficult part is that I'll search across all wares; the search is not bound to a particular price-list structure. My question is, how should I organize my indices? Can Lucene handle data effectively if I have one index containing Documents with different Fields? Or should I create a separate index for each price list, with the same Field structure across its Documents?
Re: Suggestion needed for extranet search
Otis, Thanks for your help. Is Nutch a freeware tool? regards, Ranjan

--- Otis Gospodnetic wrote:
> Hi Ranjan, It sounds like you should look at and use Nutch: http://www.nutch.org Otis [...]
RE: Stemming
OK, OK ... I'll buy the book. I guess it's about time, since I am deeply and forever in love with Lucene. Might as well take the final plunge.

-----Original Message-----
From: Otis Gospodnetic
Sent: Friday, January 21, 2005 9:12 AM
To: Lucene Users List
Subject: Re: Stemming

> Hi Kevin, Stemming is an optional operation and is done in the analysis step. Lucene comes with a Porter stemmer and a Filter that you can use in an Analyzer [...] Otis
Concurrent read and write
I am a little fuzzy on the thread-safety of Lucene, or maybe just Java. From what I understand, and correct me if I'm wrong, Lucene takes care of concurrency issues, and it is OK to run a query while writing to an index. My question is, does this still hold true if the reader and writer are being executed as separate programs? I have a cron job that will update the index periodically. I also have a search application on a web form. Is this going to cause trouble if someone runs a query while the indexer is updating? Ashley
Re: Closed IndexWriter reuse
--- Otis Gospodnetic wrote:
> No, you can't add documents to an index once you close the IndexWriter. You can re-open the IndexWriter and add more documents, of course. Otis

That's what I expected at first, but:

1 - It's a disappointment, because such a 'feature' would have made IndexWriter management much easier. You would open an IndexWriter at startup and reuse it for the whole life of the application, just flushing on a regular basis using the close() method, without worrying whether other objects are currently using the writer.

2 - When you say you can't add, do you mean it's impossible, or that you shouldn't because, for example, it could corrupt the index? Maybe I'm wrong, but I think it's possible. Look at the following code:

public static void main(String[] args) throws IOException {
    final IndexWriter writer1 = new IndexWriter("/tmp/test-reuse", new StandardAnalyzer(), true);

    // First write with the writer
    Document doc = new Document();
    doc.add(new Field("name", "John", Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer1.addDocument(doc);

    System.out.println("1 After first write, before closing the writer ---");
    Searcher searcher = new IndexSearcher("/tmp/test-reuse");
    Query query = new TermQuery(new Term("name", "John"));
    Hits hits = searcher.search(query);
    System.out.println("=== hits: " + hits.length());
    System.out.println();

    // CLOSING THE WRITER ONCE
    writer1.close();

    System.out.println("2 After first write, after closing the writer ---");
    searcher = new IndexSearcher("/tmp/test-reuse");
    hits = searcher.search(query);
    System.out.println("=== hits: " + hits.length());
    System.out.println();

    // Second write, THE WRITER HAS ALREADY BEEN CLOSED ONCE
    writer1.addDocument(doc);

    System.out.println("3 After second write, the writer has been closed once ---");
    hits = searcher.search(query);
    System.out.println("=== hits: " + hits.length());
    System.out.println();

    // Closing the writer again
    writer1.close();

    System.out.println("4 After second write, the writer has been closed twice ---");
    searcher = new IndexSearcher("/tmp/test-reuse");
    hits = searcher.search(query);
    System.out.println("=== hits: " + hits.length());
}

== Results ==

1 After first write, before closing the writer ---
=== hits: 0

2 After first write, after closing the writer ---
=== hits: 1

3 After second write, the writer has been closed once ---
=== hits: 1

4 After second write, the writer has been closed twice ---
=== hits: 2

As you can see, not only does the code above execute without complaint, it also gives the right results. Thanks for your comments.
Re: Concurrent read and write
Hello Ashley, You can read/search while modifying the index, but you have to ensure that only one thread or only one process is modifying the index at any given time. Both IndexReader and IndexWriter can be used to modify an index: the former to delete Documents and the latter to add them. You have to ensure these two operations don't overlap. cf. http://www.lucenebook.com/search?query=concurrent Otis
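The rule above amounts to a safe ordering of operations: finish deletes through the IndexReader and release its lock before opening an IndexWriter. A sketch against the Lucene 1.4-era API (`dir`, `analyzer`, `newDoc`, and the "id" field are my own assumptions):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Deletes go through an IndexReader...
IndexReader reader = IndexReader.open(dir);
reader.delete(new Term("id", "42")); // delete documents matching a term
reader.close();                      // releases the write lock

// ...adds go through an IndexWriter. Never have both open for
// modification at the same time, whether in one process or two.
IndexWriter writer = new IndexWriter(dir, analyzer, false);
writer.addDocument(newDoc);
writer.close();
```

Across separate programs (e.g. a cron-driven indexer and a web search app), the same rule holds; Lucene's lock files enforce it, but searches through an already-open IndexSearcher remain safe throughout.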
RE: Search Chinese in Unicode !!!
I've written a Chinese Analyzer for Lucene that uses a segmenter written by Erik Peterson. However, as the author of the segmenter does not want his code released under the Apache open source license (although his code _is_ open source), I cannot place my work in the Lucene Sandbox. This is unfortunate, because I believe the analyzer works quite well in indexing and searching Chinese docs in GB2312 and UTF-8 encoding, and I would like more people to test, use, and confirm this. So anyone who wants it can have it. Just shoot me an email.

BTW, I also have written an Arabic analyzer, which is collecting dust for similar reasons.

Good luck,

Ali Safarnejad

-----Original Message-----
From: Eric Chow
Sent: 21 January 2005 11:42
To: Lucene Users List
Subject: Re: Search Chinese in Unicode !!!

> Search not really correct with UTF-8 !!! [...]
RE: Search Chinese in Unicode !!!
If you are hosting the code somewhere (e.g. your site, SF, java.net, etc.), we should link to it from one of the Lucene pages where we link to related external tools, apps, and such. Otis

--- Safarnejad, Ali (AFIS) wrote:
> I've written a Chinese Analyzer for Lucene that uses a segmenter written by Erik Peterson. However, as the author of the segmenter does not want his code released under the Apache open source license (although his code _is_ open source), I cannot place my work in the Lucene Sandbox. [...]
Re: Search Chinese in Unicode !!!
I would love to give it a try. Please email me at aurora00 at gmail.com. Thanks!

Also, what is the opinion on the CJKAnalyzer and ChineseAnalyzer? Some people actually say the StandardAnalyzer works better. I wonder what the pros and cons are.

> I've written a Chinese Analyzer for Lucene that uses a segmenter written by Erik Peterson. However, as the author of the segmenter does not want his code released under the Apache open source license (although his code _is_ open source), I cannot place my work in the Lucene Sandbox. [...]
Re: FOP Generated PDF and PDFBox
Are you indexing the FOP PDFs differently than other PDF documents? Can I assume that you are using PDFBox's LucenePDFDocument.getDocument() method? Ben

On Fri, 21 Jan 2005, Luke Shannon wrote:
> Hello; Our CMS now allows users to create PDF documents (using FOP) and then search them. I seem to be able to index these documents OK. But when I am generating the results to display, I get a NullPointerException while trying to use a variable that should contain the url keyword for one of these documents in the index:
>
> Document doc = hits.doc(i);
> String path = doc.get("url");
>
> path contains null. The interesting thing is this only happens with PDFs that are generated with FOP. Other PDFs are fine. What I find weird is, shouldn't the url field just contain the path of the file? Has anyone else seen this before? Any ideas? Thanks, Luke
Re: Stemming
Also, if you can't wait, see page 2 of http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html or the LIA e-book ;)

On Fri, 21 Jan 2005 09:27:42 -0500, Kevin L. Cobb wrote:
> OK, OK ... I'll buy the book. I guess its about time since I am deeply and forever in love with Lucene. Might as well take the final plunge. [...]
Opening up one large index takes 940M or memory?
We have one large index right now... it's about 60G. When I open it, the Java VM uses 940M of memory. The VM does nothing else besides open this index. Here's the code:

System.out.println("opening...");
long before = System.currentTimeMillis();
Directory dir = FSDirectory.getDirectory("/var/ksa/index-1078106952160/", false);
IndexReader ir = IndexReader.open(dir);
System.out.println(ir.getClass());
long after = System.currentTimeMillis();
System.out.println("opening...done - duration: " + (after - before));
System.out.println("totalMemory: " + Runtime.getRuntime().totalMemory());
System.out.println("freeMemory: " + Runtime.getRuntime().freeMemory());

Is there any way to reduce this footprint? The index is fully optimized... I'm willing to take a performance hit if necessary. Is this documented anywhere?

Kevin A. Burton, San Francisco, CA. Web - http://peerfear.org/
Re: Opening up one large index takes 940M or memory?
Kevin A. Burton wrote:
> We have one large index right now... its about 60G ... When I open it the Java VM used 940M of memory. The VM does nothing else besides open this index.

After thinking about it, I guess 1.5% of memory per index really isn't THAT bad. What would be nice is a way to do this from disk and then use a buffer (either via the filesystem or in-VM memory) to access these structures. This would be similar to the way the MySQL index cache works... Kevin
Re: Opening up one large index takes 940M or memory?
: We have one large index right now... its about 60G ... When I open it
: the Java VM used 940M of memory. The VM does nothing else besides open

Just out of curiosity, have you tried turning on the verbose GC log and putting in some thread sleeps after you open the reader, to see if the memory footprint settles down after a little while? You're currently checking the memory usage immediately after opening the index, and some of that memory may be held by transient data that will get freed up after some GC iterations.

: IndexReader ir = IndexReader.open( dir );
: System.out.println( ir.getClass() );
: long after = System.currentTimeMillis();
: System.out.println( opening...done - duration: + (after-before) );
: System.out.println( totalMemory: + Runtime.getRuntime().totalMemory() );
: System.out.println( freeMemory: + Runtime.getRuntime().freeMemory() );

-Hoss
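The measurement idea above can be sketched with plain java.lang calls, no Lucene required to try it: run a few GC passes with pauses, then sample the heap (the class name MemWatch is my own; System.gc() is only a hint to the VM, hence the repetition):

```java
// Watch the heap settle after an expensive operation: request a few
// collections with pauses in between, then report used memory.
public class MemWatch {
    public static long usedAfterGc() throws InterruptedException {
        Runtime rt = Runtime.getRuntime();
        for (int i = 0; i < 3; i++) {
            System.gc();
            Thread.sleep(500); // give the collector time to run
        }
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) throws InterruptedException {
        // IndexReader ir = IndexReader.open(dir); // open the index here
        System.out.println("used after GC: " + usedAfterGc() + " bytes");
    }
}
```

Comparing this number with the figure taken immediately after open would show how much of the 940M is transient allocation rather than live index data.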
Document 'Context' Relation to each other
As a log4j developer, I've been toying with the idea of what Lucene could do for me, maybe as an excuse to play around with Lucene. I've started creating a LoggingEvent-to-Document converter and thinking through how I'd like this utility to work, when I came across a question I wasn't sure about. When scanning/searching through logging events, one is usually looking for a particular matching event, which Lucene finds excellently, but what a person usually also needs is the context around that matching logging event. With grep, one can use the -CcontextSize argument to provide X # of lines around the matching entry. I'd like to be able to do the same thing with Lucene. Now, I could give the LoggingEvent Document a Field with a sequence #, and once a user has chosen a matching event, do another search for the documents with a sequence # within +/- the context size. My question is, is that going to be an efficient way to do this? The sequence # would be treated as text, wouldn't it? Would a range search on an int be the most efficient way to do this? I know from the Hits documentation that one can retrieve the Document ID of a matching entry. What is the contract on this Document ID? Is each Document added to the index given an increasing number? Can one search an index by Document ID? Could one search for Document IDs within a range? (Hope you can see where I'm going here.) If you have any other recommendations about context searching, I would appreciate any thoughts. Many thanks for an excellent API, and kudos to Erik and Otis for a great eBook btw. regards, Paul Smith
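On the sequence-number idea: since Lucene compares terms lexicographically, the numbers would need zero-padding to a fixed width at index time, after which a range query covers the context window. A sketch against the Lucene 1.4-era RangeQuery (the field name "seq", the padding width, and the sample values are my own assumptions):

```java
import java.text.DecimalFormat;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.RangeQuery;

// Zero-pad so lexicographic order matches numeric order;
// otherwise "10" would sort before "9".
DecimalFormat seq = new DecimalFormat("0000000000");
long matchSeq = 12345L;  // sequence # of the chosen matching event
int contextSize = 3;     // grep-style -C window

RangeQuery context = new RangeQuery(
        new Term("seq", seq.format(matchSeq - contextSize)),
        new Term("seq", seq.format(matchSeq + contextSize)),
        true); // inclusive bounds
```

The same padded field would be written at index time, e.g. new Field("seq", seq.format(n), ...), so that the range terms and the indexed terms line up.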
Re: Search Chinese in Unicode !!!
I want that Chinese Analyzer!! On Fri, 21 Jan 2005 17:36:17 +0100, Safarnejad, Ali (AFIS) wrote:
> I've written a Chinese Analyzer for Lucene that uses a segmenter written by Erik Peterson. However, as the author of the segmenter does not want his code released under the Apache open source license (although his code _is_ open source), I cannot place my work in the Lucene Sandbox. [...]