Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-21 Thread Andrzej Bialecki
Morus Walter wrote: Owen Densmore writes: 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) apparently produces non-word stems, i.e. not really human readable. (Example: generate, generates, generated, generating -> generat) Although in typical queries this is not important

Search Chinese in Unicode !!!

2005-01-21 Thread Eric Chow
How to create an index with Chinese (in UTF-8 encoding) HTML and search with Lucene?

Re: Search Chinese in Unicode !!!

2005-01-21 Thread Erik Hatcher
On Jan 21, 2005, at 4:49 AM, Eric Chow wrote: How to create index with chinese (in utf-8 encoding ) HTML and search with Lucene ? Indexing and searching Chinese basically is no different than using English with Lucene. We covered a bit about it in Lucene in Action:

Re: How works *

2005-01-21 Thread Miles Barr
On Fri, 2005-01-21 at 10:58 +0100, Bertrand VENZAL wrote: I wondered how Lucene implements the * character. I know that it is working, but when I look at the Query object, it doesn't seem to appear anywhere. Does someone know how it is implemented? Take a look at the PrefixQuery and
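A rough sketch of the idea behind prefix matching, offered as an illustration rather than as Lucene's actual implementation: a trailing `*` can be answered by walking a sorted term dictionary and collecting every term that starts with the prefix, then searching for any of those terms. The class `PrefixExpansion`, its `expand` method, and the sample terms are hypothetical, and a `TreeSet` stands in for the index's term dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixExpansion {
    /** Returns every term in the sorted dictionary that starts with the given prefix. */
    public static List<String> expand(SortedSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet starts at the first term >= prefix; terms sharing the prefix are contiguous.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // we are past the last term that shares the prefix
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical term dictionary, kept in sorted order by the TreeSet.
        SortedSet<String> terms = new TreeSet<>(Arrays.asList(
                "index", "lucene", "luke", "query", "search", "searcher", "searching"));
        System.out.println(expand(terms, "search")); // [search, searcher, searching]
    }
}
```

Because the dictionary is sorted, the scan touches only the matching terms plus one extra, which is why a prefix wildcard is cheap compared with a leading wildcard.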

Re: Search Chinese in Unicode !!!

2005-01-21 Thread Eric Chow
Search is not really correct with UTF-8 !!! The following is the search result I got using SearchFiles from the Lucene demo. d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java org.apache.lucene.demo.SearchFiles c:\temp\myindex Usage: java SearchFiles idnex Query: Searching for: g

Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-21 Thread mark harwood
1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) apparently produces non-word stems .. i.e. not really human readable. It is possible to derive the human-readable form of a stemmed term using either re-analysis of indexed content or TermPositionVector. Either of these
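The re-analysis approach mentioned above might look roughly like this: run the original text through the same stemmer again, remember which surface words produced each stem, and display the most frequent one. This is only a sketch under stated assumptions. `toyStem` is a toy suffix stripper written for this example (it is NOT the Porter or Snowball algorithm), and `ReadableStems` and `readableForms` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

public class ReadableStems {
    /** Toy stand-in for a real stemmer (NOT Porter/Snowball): strips a few suffixes. */
    static String toyStem(String word) {
        for (String suffix : new String[] {"ing", "es", "ed", "e", "s"}) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    /** Re-analyzes the text and maps each stem to its most frequent surface form. */
    public static Map<String, String> readableForms(String text) {
        // stem -> (surface word -> number of occurrences)
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) {
                continue; // split can yield a leading empty token
            }
            counts.computeIfAbsent(toyStem(word), k -> new HashMap<>())
                  .merge(word, 1, Integer::sum);
        }
        Map<String, String> best = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            String top = null;
            int topCount = 0;
            for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
                if (c.getValue() > topCount) {
                    top = c.getKey();
                    topCount = c.getValue();
                }
            }
            best.put(e.getKey(), top);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(readableForms("generates generated generating generates"));
        // {generat=generates}
    }
}
```

The index still stores and matches on `generat`; the map is only consulted when a stem has to be shown to a person.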

Re: Search Chinese in Unicode !!!

2005-01-21 Thread PA
On Jan 21, 2005, at 11:42, Eric Chow wrote: Search not really correct with UTF-8 !!! Lucene works just fine with any flavor of Unicode as long as _your_ application knows how to consistently deal with Unicode as well. Remember: the world is not just one Big5 pile. As far as Analyzer goes, you
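One way to read "consistently deal with Unicode" is that the application should decode file bytes with an explicit charset before handing text to the indexer, instead of relying on the platform default. The sketch below uses only the Java standard library. `DecodeDemo` and `decodeUtf8` are hypothetical names and the sample string is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    /** Decodes raw bytes with an explicit charset instead of the platform default. */
    public static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters
        byte[] raw = original.getBytes(StandardCharsets.UTF_8); // 6 bytes, as stored on disk

        String right = decodeUtf8(raw);
        String wrong = new String(raw, StandardCharsets.ISO_8859_1); // wrong charset on purpose

        System.out.println(right.equals(original)); // true
        System.out.println(right.length());         // 2
        System.out.println(wrong.length());         // 6: one bogus character per byte
    }
}
```

If the indexing side and the query side decode differently, the terms in the index and the terms in the query will not match, which would produce the kind of empty or wrong results described in this thread.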

RE: Filtering w/ Multiple Terms

2005-01-21 Thread Jerry Jalenak
OK. But isn't there a limit on the number of BooleanQueries that can be combined with AND / OR / etc? Jerry Jalenak Senior Programmer / Analyst, Web Publishing LabOne, Inc. 10101 Renner Blvd. Lenexa, KS 66219 (913) 577-1496 [EMAIL PROTECTED] -Original Message- From: Erik Hatcher

Stemming

2005-01-21 Thread Kevin L. Cobb
I want to understand how Lucene uses stemming but can't find any documentation on the Lucene site. I'll continue to google but hope that this list can help narrow my search. I have several questions on the subject currently but hesitate to list them here since finding a good document on the

Re: Stemming

2005-01-21 Thread Otis Gospodnetic
Hi Kevin, Stemming is an optional operation and is done in the analysis step. Lucene comes with a Porter stemmer and a Filter that you can use in an Analyzer: ./src/java/org/apache/lucene/analysis/PorterStemFilter.java ./src/java/org/apache/lucene/analysis/PorterStemmer.java You can find more

RE: Filtering w/ Multiple Terms

2005-01-21 Thread Otis Gospodnetic
This: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.TooManyClauses.html ? You can control that limit via http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.html#maxClauseCount Otis --- Jerry Jalenak [EMAIL PROTECTED] wrote: OK.

Re: Suggestion needed for extranet search

2005-01-21 Thread Otis Gospodnetic
Hi Ranjan, It sounds like you should look at and use Nutch: http://www.nutch.org Otis --- Ranjan K. Baisak [EMAIL PROTECTED] wrote: I am planning to move to Lucene but not have much knowledge on the same. The search engine which I had developed is searching some extranet URLs e.g.

Search on heterogenous index

2005-01-21 Thread Simeon Koptelov
Hello all. I'm new to lucene and think about using it in my project. I have prices with dynamic structure, containing wares there, about 10K prices with total 500K wares. Each price has about 5 text fields. I'll do searches on wares. The difficult part is that I'll do searches for all wares,

Re: Suggestion needed for extranet search

2005-01-21 Thread Ranjan K. Baisak
Otis, Thanks for your help. Is Nutch a freeware tool? regards, Ranjan --- Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi Ranjan, It sounds like you should look at and use Nutch: http://www.nutch.org Otis --- Ranjan K. Baisak [EMAIL PROTECTED] wrote: I am planning to move to

RE: Stemming

2005-01-21 Thread Kevin L. Cobb
OK, OK ... I'll buy the book. I guess it's about time since I am deeply and forever in love with Lucene. Might as well take the final plunge. -----Original Message----- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Friday, January 21, 2005 9:12 AM To: Lucene Users List Subject: Re:

Concurrent read and write

2005-01-21 Thread Ashley Steigerwalt
I am a little fuzzy on the thread safety of Lucene, or maybe just Java. From what I understand, and correct me if I'm wrong, Lucene takes care of concurrency issues and it is OK to run a query while writing to an index. My question is, does this still hold true if the reader and writer are

Re: Closed IndexWriter reuse

2005-01-21 Thread Oscar Picasso
--- Otis Gospodnetic [EMAIL PROTECTED] wrote: No, you can't add documents to an index once you close the IndexWriter. You can re-open the IndexWriter and add more documents, of course. Otis That's what I expected at first, but: 1- It's a disappointment, because such a 'feature' would have

Re: Concurrent read and write

2005-01-21 Thread Otis Gospodnetic
Hello Ashley, You can read/search while modifying the index, but you have to ensure only one thread or only one process is modifying an index at any given time. Both IndexReader and IndexWriter can be used to modify an index. The former to delete Documents and the latter to add them. You have
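The "only one modifier at a time" rule can be pictured with a small in-process sketch. It is an illustration of the rule, not a description of how Lucene enforces it, and it says nothing about coordinating separate processes. `WriteLockSketch`, `tryAcquire`, and `release` are hypothetical names, and an `AtomicBoolean` stands in for whatever locking mechanism the application uses.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class WriteLockSketch {
    // True while some modifier owns the (simulated) index.
    private final AtomicBoolean held = new AtomicBoolean(false);

    /** Tries to become the single modifier; returns false if another one is active. */
    public boolean tryAcquire() {
        return held.compareAndSet(false, true);
    }

    /** Gives up the write lock so another modifier may proceed. */
    public void release() {
        held.set(false);
    }

    public static void main(String[] args) {
        WriteLockSketch lock = new WriteLockSketch();
        System.out.println(lock.tryAcquire()); // true: the first modifier gets in
        System.out.println(lock.tryAcquire()); // false: a second modifier is refused
        lock.release();
        System.out.println(lock.tryAcquire()); // true: free again after release
    }
}
```

Searches never call `tryAcquire` in this picture, which matches the statement that reading can continue while one thread or process modifies the index.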

RE: Search Chinese in Unicode !!!

2005-01-21 Thread Safarnejad, Ali (AFIS)
I've written a Chinese Analyzer for Lucene that uses a segmenter written by Erik Peterson. However, as the author of the segmenter does not want his code released under apache open source license (although his code _is_ opensource), I cannot place my work in the Lucene Sandbox. This is

RE: Search Chinese in Unicode !!!

2005-01-21 Thread Otis Gospodnetic
If you are hosting the code somewhere (e.g. your site, SF, java.net, etc.), we should link to them from one of the Lucene pages where we link to related external tools, apps, and such. Otis --- Safarnejad, Ali (AFIS) [EMAIL PROTECTED] wrote: I've written a Chinese Analyzer for Lucene that

Re: Search Chinese in Unicode !!!

2005-01-21 Thread aurora
I would love to give it a try. Please email me at aurora00 at gmail.com. Thanks! Also what is the opinion on the CJKAnalyzer and ChineseAnalyzer? Some people actually said the StandardAnalyzer works better. I wonder what the pros and cons are. I've written a Chinese Analyzer for Lucene that

Re: FOP Generated PDF and PDFBox

2005-01-21 Thread Ben Litchfield
Are you indexing the FOP PDFs differently than other PDF documents? Can I assume that you are using PDFBox's LucenePDFDocument.getDocument() method? Ben On Fri, 21 Jan 2005, Luke Shannon wrote: Hello; Our CMS now allows users to create PDF documents (uses FOP) and then search them. I

Re: Stemming

2005-01-21 Thread Chris Lamprecht
Also if you can't wait, see page 2 of http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html or the LIA e-book ;) On Fri, 21 Jan 2005 09:27:42 -0500, Kevin L. Cobb [EMAIL PROTECTED] wrote: OK, OK ... I'll buy the book. I guess it's about time since I am deeply and forever in love with

Opening up one large index takes 940M or memory?

2005-01-21 Thread Kevin A. Burton
We have one large index right now... it's about 60G ... When I open it the Java VM used 940M of memory. The VM does nothing else besides open this index. Here's the code: System.out.println( "opening..." ); long before = System.currentTimeMillis(); Directory dir =

Re: Opening up one large index takes 940M or memory?

2005-01-21 Thread Kevin A. Burton
Kevin A. Burton wrote: We have one large index right now... it's about 60G ... When I open it the Java VM used 940M of memory. The VM does nothing else besides open this index. After thinking about it I guess 1.5% of memory per index really isn't THAT bad. What would be nice is if there was a way

Re: Opening up one large index takes 940M or memory?

2005-01-21 Thread Chris Hostetter
: We have one large index right now... it's about 60G ... When I open it : the Java VM used 940M of memory. The VM does nothing else besides open Just out of curiosity, have you tried turning on the verbose gc log, and putting in some thread sleeps after you open the reader, to see if the memory

Document 'Context' Relation to each other

2005-01-21 Thread Paul Smith
As a log4j developer, I've been toying with the idea of what Lucene could do for me, maybe as an excuse to play around with Lucene. I've started creating a LoggingEvent-to-Document converter, and was thinking through how I'd like this utility to work when I came across a question I wasn't sure about.

Re: Search Chinese in Unicode !!!

2005-01-21 Thread Eric Chow
I want that Chinese Analyzer !! On Fri, 21 Jan 2005 17:36:17 +0100, Safarnejad, Ali (AFIS) [EMAIL PROTECTED] wrote: I've written a Chinese Analyzer for Lucene that uses a segmenter written by Erik Peterson. However, as the author of the segmenter does not want his code released under apache