Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-21 Thread Andrzej Bialecki
Morus Walter wrote:
 Owen Densmore writes:

  1 - I'm a bit concerned that reasonable stemming (Porter/Snowball)
  apparently produces non-word stems, i.e. not really human readable.
  (Example: generate, generates, generated, generating -> generat)
  Although in typical queries this is not important because the result of
  the search is a document list, it *would* be important if we use the
  stems within a graphical navigation interface.
  So the question is: Is there a way to have the stemmer produce English
  base forms of the words being stemmed?

 Rule-based stemmers such as Porter/Snowball cannot do that.
 But there are (commercial) dictionary-based tools that can, e.g. the
 Canoo lemmatizer.
 You might also have a look at Egothor's stemmers, which are word-list based.

Egothor stemmers are algorithmic; they only use word lists for training.
Stems produced by them are usually closer to lemmas than with, e.g.,
Porter's stemmer, but a significant number of stems still look like the
one in the example above.


--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Search Chinese in Unicode !!!

2005-01-21 Thread Eric Chow
How do I create an index from Chinese HTML (in UTF-8 encoding) and search
it with Lucene?




Re: Search Chinese in Unicode !!!

2005-01-21 Thread Erik Hatcher
On Jan 21, 2005, at 4:49 AM, Eric Chow wrote:
 How to create index with chinese (in utf-8 encoding ) HTML and search
 with Lucene ?

Indexing and searching Chinese is basically no different from using
English with Lucene.  We covered a bit about it in Lucene in Action:

http://www.lucenebook.com/search?query=chinese
And a screenshot here:
http://www.blogscene.org/erik/LuceneInAction/i18n.html
The main issues in dealing with Chinese, and of course other languages,
are encoding concerns, both when reading in the text for indexing and
when querying, and analysis (as you can see from the screenshot).

Lucene itself works with Unicode fine and you're free to index anything.
Erik


Re: How works *

2005-01-21 Thread Miles Barr
On Fri, 2005-01-21 at 10:58 +0100, Bertrand VENZAL wrote:
 I wondered how Lucene implements the * character. I know that it works,
 but when I look at the Query object, it doesn't seem to appear anywhere.
 Does anyone know how it is implemented?

Take a look at the PrefixQuery and WildcardQuery. 

PrefixQuery works by finding all terms beginning with the query then
constructing a boolean query of them. I assume WildcardQuery works in a
similar way.

If you have several terms or a short prefix (e.g. a*) you might need to
increase the maximum number of clauses allowed in a boolean query
because the number of terms might exceed the default (i.e. 1024).
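
To make the expansion step concrete, here is a minimal plain-Java sketch of the idea, not Lucene's actual code. The class, the method names and the sample terms are all made up for illustration; only the 1024 default comes from the message above. It walks a sorted term dictionary, collects every term that starts with the prefix, and gives up once the clause limit is hit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Plain-Java model of prefix expansion; none of these names are Lucene API.
class PrefixExpansionSketch {

    // Mirrors the default clause limit of 1024 mentioned above.
    static final int MAX_CLAUSES = 1024;

    // Collects every indexed term that starts with the prefix, as the
    // clauses of one big OR query; fails once the limit would be exceeded.
    static List<String> expand(SortedSet<String> terms, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        // tailSet jumps to the first term >= prefix in the sorted dictionary.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can match
            }
            if (clauses.size() == maxClauses) {
                throw new IllegalStateException("too many clauses for prefix " + prefix);
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(Arrays.asList(
                "apple", "generate", "generated", "generating", "lucene"));
        // prints [generate, generated, generating]
        System.out.println(expand(terms, "gener", MAX_CLAUSES));
    }
}
```

A very short prefix such as a* matches a large slice of the dictionary, which is why it is the case most likely to run into the limit.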
 
-- 
Miles Barr [EMAIL PROTECTED]
Runtime Collective Ltd.




Re: Search Chinese in Unicode !!!

2005-01-21 Thread Eric Chow
Search is not really correct with UTF-8 !!!


The following is the search result I got using SearchFiles from the
Lucene demo.

d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java
org.apache.lucene.demo.SearchFiles c:\temp\myindex
Usage: java SearchFiles idnex
Query: 
Searching for: g  strange ??
3 total matching documents
0. ../docs/ChineseDemo.htmlthis files contains the 

   -
1. ../docs/luceneplan.html
   - Jakarta Lucene - Plan for enhancements to Lucene
2. ../docs/api/index-all.html
   - Index (Lucene 1.4.3 API)
Query: 



Of the results above, only ChineseDemo.html contains the character
that I wanted to search for!




The modified code in SearchFiles.java:


BufferedReader in = new BufferedReader(new
InputStreamReader(System.in, "UTF-8"));




Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-21 Thread mark harwood
 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball)
 apparently produces non-word stems, i.e. not really human readable.

It is possible to derive the human-readable form of a
stemmed term using either re-analysis of indexed
content or TermPositionVector. Either of these
techniques should give you the position data required
to discover the original form. 
The highlighter package is one example of where this
technique is used.
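
As a rough illustration of the re-analysis route, here is a plain-Java sketch that runs over the original words again and remembers, for each stem, the surface form seen most often. Everything in it is an assumption for the sake of the example: toyStem is a crude stand-in for a real stemmer, and the class and method names are invented. It does not use Lucene or TermPositionVector.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch; toyStem is a hypothetical stand-in for a real stemmer.
class ReadableStemSketch {

    // Crude suffix stripper used only so that the example runs.
    static String toyStem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ing", "es", "ed", "s", "e"}) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // For each stem, picks the original word seen most often in the text
    // (ties go to the word that was seen first).
    static Map<String, String> displayForms(String[] words) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        Map<String, String> best = new HashMap<>();
        for (String word : words) {
            String stem = toyStem(word);
            Map<String, Integer> forms = counts.computeIfAbsent(stem, k -> new HashMap<>());
            int seen = forms.merge(word, 1, Integer::sum);
            String current = best.get(stem);
            if (current == null || seen > forms.get(current)) {
                best.put(stem, word);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] text = {"generate", "generates", "generated", "generating", "generated"};
        // prints generated
        System.out.println(displayForms(text).get("generat"));
    }
}
```

A navigation interface could then show the chosen word in place of the bare stem while still searching on the stem.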

Cheers
Mark








Re: Search Chinese in Unicode !!!

2005-01-21 Thread PA
On Jan 21, 2005, at 11:42, Eric Chow wrote:
Search not really correct with UTF-8 !!!
Lucene works just fine with any flavor of Unicode as long as _your_ 
application knows how to consistently deal with Unicode as well. 
Remember: the world is not just one Big5 pile.

As far as Analyzer goes, you may or may not be better off using 
something more tailored to your linguistic needs. That said, even the 
default Analyzer does a fairly decent job at handling non-roman 
languages. YMMV.
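
To show what "consistently" means in practice, here is a small plain-Java sketch with no Lucene involved. The class and method names are made up. It writes a string as UTF-8 bytes and reads it back once with the matching charset and once with a mismatched one, which is roughly what happens when the indexing side and the query side disagree about the encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Plain-Java sketch of an encoding mismatch; no Lucene API is used.
class EncodingRoundTripSketch {

    // Encodes text with one charset and decodes the bytes with another,
    // which is what happens when the reading side assumes the wrong encoding.
    static String roundTrip(String text, Charset written, Charset readAs) {
        return new String(text.getBytes(written), readAs);
    }

    public static void main(String[] args) {
        String query = "\u4e2d\u6587"; // two Chinese characters
        String ok = roundTrip(query, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String bad = roundTrip(query, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(query.equals(ok));  // prints true
        System.out.println(query.equals(bad)); // prints false
        System.out.println(bad.length());      // prints 6: one char per UTF-8 byte
    }
}
```

A term mangled like the second case will simply not match what was indexed, even though Lucene itself did nothing wrong.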

Cheers
--
PA
http://alt.textdrive.com/


RE: Filtering w/ Multiple Terms

2005-01-21 Thread Jerry Jalenak
OK.  But isn't there a limit on the number of BooleanQueries that can be
combined with AND / OR / etc?



Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

[EMAIL PROTECTED]


 -Original Message-
 From: Erik Hatcher [mailto:[EMAIL PROTECTED]
 Sent: Thursday, January 20, 2005 5:05 PM
 To: Lucene Users List
 Subject: Re: Filtering w/ Multiple Terms
 
 
 
 On Jan 20, 2005, at 5:02 PM, Jerry Jalenak wrote:
 
  In looking at the examples for filtering of hits, it looks like I can
  only specify a single term; i.e.
 
  Filter f = new QueryFilter(new TermQuery(new Term("acct", "acct1")));
 
  I need to specify more than one term in my filter.  Short of using
  something like ChainFilter, how are others handling this?
 
 You can make as complex a Query as you want for QueryFilter.  If you
 want to filter on multiple terms, construct a BooleanQuery with nested
 TermQuery clauses, combined in either an AND or an OR fashion.
 
   Erik
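
For readers who want to picture what an OR-style filter over several terms amounts to, here is a plain-Java model using java.util.BitSet. It is only a sketch under assumptions: it does not call Lucene, the per-document acct values are invented sample data, and the helper names are hypothetical. The idea is one bit per document, set when the document matches any of the requested terms.

```java
import java.util.BitSet;

// Plain-Java model of a multi-term filter; the data and names are made up.
class MultiTermFilterSketch {

    // One bit per document: set when the document's acct value equals the term.
    static BitSet matches(String[] acctByDoc, String term) {
        BitSet bits = new BitSet(acctByDoc.length);
        for (int doc = 0; doc < acctByDoc.length; doc++) {
            if (acctByDoc[doc].equals(term)) {
                bits.set(doc);
            }
        }
        return bits;
    }

    // OR combination: a document passes if it matches any of the terms.
    static BitSet anyOf(String[] acctByDoc, String... terms) {
        BitSet result = new BitSet(acctByDoc.length);
        for (String term : terms) {
            result.or(matches(acctByDoc, term));
        }
        return result;
    }

    public static void main(String[] args) {
        String[] acctByDoc = {"acct1", "acct2", "acct3", "acct1"};
        // prints {0, 2, 3}
        System.out.println(anyOf(acctByDoc, "acct1", "acct3"));
    }
}
```

An AND-style filter would use BitSet.and in the same loop, starting from a set with all bits on.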
 
 



Stemming

2005-01-21 Thread Kevin L. Cobb
I want to understand how Lucene uses stemming but can't find any
documentation on the Lucene site. I'll continue to google but hope that
this list can help narrow my search. I have several questions on the
subject currently but hesitate to list them here since finding a good
document on the subject may answer most of them. 

 

Thanks in advance for any pointers,

 

Kevin

 

 



Re: Stemming

2005-01-21 Thread Otis Gospodnetic
Hi Kevin,

Stemming is an optional operation and is done in the analysis step. 
Lucene comes with a Porter stemmer and a Filter that you can use in an
Analyzer:

./src/java/org/apache/lucene/analysis/PorterStemFilter.java
./src/java/org/apache/lucene/analysis/PorterStemmer.java

You can find more about it here:
http://www.lucenebook.com/search?query=stemming
You can also see mentions of SnowballAnalyzer in those search results,
and you can find an adapter for SnowballAnalyzers in Lucene Sandbox.

Otis




RE: Filtering w/ Multiple Terms

2005-01-21 Thread Otis Gospodnetic
This:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.TooManyClauses.html
?

You can control that limit via
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.html#maxClauseCount

Otis





Re: Suggestion needed for extranet search

2005-01-21 Thread Otis Gospodnetic
Hi Ranjan,

It sounds like you should look at and use Nutch:
http://www.nutch.org

Otis

--- Ranjan K. Baisak [EMAIL PROTECTED] wrote:

 I am planning to move to Lucene but do not have much
 knowledge of it. The search engine which I had
 developed searches some extranet URLs, e.g.
 codeguru.com/index.html. Is it possible to get the
 same functionality using Lucene? So basically, can I
 make Lucene work as a search engine to search extranets?
 
 regards,
 Ranjan
 



Search on heterogenous index

2005-01-21 Thread Simeon Koptelov
Hello all. I'm new to lucene and think about using it in my project.

I have price lists with a dynamic structure, each containing wares: about 10K
price lists with 500K wares in total. Each price list has about 5 text fields.

I'll do searches on wares. The difficult part is that I'll search across all
wares; the search is not bound to a particular price list structure.

My question is, how should I organize my indices? Can Lucene handle the data
effectively if I have one index containing Documents with different Fields?
Or should I create a separate index for each price list, with the same Field
structure across its Documents?




Re: Suggestion needed for extranet search

2005-01-21 Thread Ranjan K. Baisak
Otis,
Thanks for your help. Is nutch a freeware tool?

regards,
Ranjan



RE: Stemming

2005-01-21 Thread Kevin L. Cobb
OK, OK ... I'll buy the book. I guess it's about time since I am deeply
and forever in love with Lucene. Might as well take the final plunge.






Concurrent read and write

2005-01-21 Thread Ashley Steigerwalt
I am a little fuzzy on the thread-safeness of Lucene, or maybe just java.  
From what I understand, and correct me if I'm wrong, Lucene takes care of 
concurrency issues and it is ok to run a query while writing to an index.

My question is, does this still hold true if the reader and writer are being 
executed as separate programs?  I have a cron job that will update the index 
periodically.  I also have a search application on a web form.  Is this going 
to cause trouble if someone runs a query while the indexer is updating?

Ashley




Re: Closed IndexWriter reuse

2005-01-21 Thread Oscar Picasso
--- Otis Gospodnetic [EMAIL PROTECTED] wrote:

 No, you can't add documents to an index once you close the IndexWriter.
 You can re-open the IndexWriter and add more documents, of course.
 
 Otis

That's what I expected at first, but:
1- It's a disappointment, because such a 'feature' would have made IndexWriter
management much easier. You would open an IndexWriter at startup and reuse it
for the whole life of the application, just flushing on a regular basis using
the close() method and without worrying whether other objects are currently
using the writer.

2- When you say you can't add, do you mean it's impossible or that you
shouldn't because, for example, it could corrupt the index?
Maybe I'm wrong, but I think it's possible. Let's look at the following code:



public static void main(String[] args) throws IOException
{
    final IndexWriter writer1 = new IndexWriter("/tmp/test-reuse",
            new StandardAnalyzer(), true);

    // First write with the writer
    Document doc = new Document();
    doc.add(new Field("name", "John", Field.Store.YES,
            Field.Index.UN_TOKENIZED));
    writer1.addDocument(doc);
    System.out.println("1  After first write, before closing the writer ---");
    Searcher searcher = new IndexSearcher("/tmp/test-reuse");
    Query query = new TermQuery(new Term("name", "John"));
    Hits hits = searcher.search(query);
    System.out.println("=== hits: " + hits.length());
    System.out.println();

    // CLOSING THE WRITER ONCE
    writer1.close();
    System.out.println("2  After first write, after closing the writer ---");
    searcher = new IndexSearcher("/tmp/test-reuse");
    hits = searcher.search(query);
    System.out.println("=== hits: " + hits.length());
    System.out.println();

    // Second write, THE WRITER HAS ALREADY BEEN CLOSED ONCE
    writer1.addDocument(doc);
    System.out.println("3  After second write, the writer has been closed once ---");
    hits = searcher.search(query);
    System.out.println("=== hits: " + hits.length());
    System.out.println();

    // Closing the writer again
    writer1.close();
    System.out.println("4  After second write, the writer has been closed twice ---");
    searcher = new IndexSearcher("/tmp/test-reuse");
    hits = searcher.search(query);
    System.out.println("=== hits: " + hits.length());
}

== Results ==
1  After first write, before closing the writer ---
=== hits: 0

2  After first write, after closing the writer ---
=== hits: 1

3  After second write, the writer has been closed once ---
=== hits: 1

4  After second write, the writer has been closed twice ---
=== hits: 2


As you can see, not only does the code above execute without complaint, but it
also gives the right results.

Thanks for your comments.






Re: Concurrent read and write

2005-01-21 Thread Otis Gospodnetic
Hello Ashley,

You can read/search while modifying the index, but you have to ensure
only one thread or only one process is modifying an index at any given
time.  Both IndexReader and IndexWriter can be used to modify an index.
 The former to delete Documents and the latter to add them.  You have
to ensure these two operations don't overlap.
c.f. http://www.lucenebook.com/search?query=concurrent
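
For the single-JVM case, the "one modifier at a time" rule can be sketched in plain Java as below. This is an illustration under assumptions, not Lucene code: the class, the lock and the log of operations are all invented, and the delete and add steps are just stand-ins for whatever an IndexReader and IndexWriter would do. A lock like this only covers threads in one process; a cron job in a separate process needs some cross-process arrangement instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Plain-Java sketch: serialize index modifications inside one JVM.
class SingleModifierSketch {
    private final ReentrantLock modifyLock = new ReentrantLock();
    private final List<String> log = new ArrayList<>();

    // Delete-then-add as one unit; the lock keeps other updates from interleaving.
    void update(String docId) {
        modifyLock.lock();
        try {
            log.add("delete " + docId); // stand-in for the reader-side delete
            log.add("add " + docId);    // stand-in for the writer-side add
        } finally {
            modifyLock.unlock();
        }
    }

    // Runs several updater threads and reports whether every delete was
    // immediately followed by its matching add.
    static boolean runDemo(int threads, int updatesPerThread) {
        SingleModifierSketch index = new SingleModifierSketch();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int worker = t;
            Thread thread = new Thread(() -> {
                for (int i = 0; i < updatesPerThread; i++) {
                    index.update("doc-" + worker + "-" + i);
                }
            });
            workers.add(thread);
            thread.start();
        }
        for (Thread thread : workers) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        List<String> log = index.log;
        if (log.size() != threads * updatesPerThread * 2) {
            return false;
        }
        for (int i = 0; i < log.size(); i += 2) {
            String first = log.get(i);
            if (!first.startsWith("delete ")) {
                return false;
            }
            String docId = first.substring("delete ".length());
            if (!log.get(i + 1).equals("add " + docId)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 100)); // prints true
    }
}
```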

Otis




RE: Search Chinese in Unicode !!!

2005-01-21 Thread Safarnejad, Ali (AFIS)
I've written a Chinese Analyzer for Lucene that uses a segmenter written by
Erik Peterson. However, as the author of the segmenter does not want his code
released under the Apache open source license (although his code _is_
open source), I cannot place my work in the Lucene Sandbox.  This is
unfortunate, because I believe the analyzer works quite well in indexing and
searching Chinese docs in GB2312 and UTF-8 encoding, and I'd like more people
to test, use, and confirm this.  So anyone who wants it can have it. Just
shoot me an email.
BTW, I have also written an Arabic analyzer, which is collecting dust for
similar reasons.
Good luck,

Ali Safarnejad

RE: Search Chinese in Unicode !!!

2005-01-21 Thread Otis Gospodnetic
If you are hosting the code somewhere (e.g. your site, SF, java.net,
etc.), we should link to them from one of the Lucene pages where we
link to related external tools, apps, and such.

Otis




Re: Search Chinese in Unicode !!!

2005-01-21 Thread aurora
I would love to give it a try. Please email me at aurora00 at gmail.com.  
Thanks!

Also, what is the opinion on the CJKAnalyzer and ChineseAnalyzer? Some
people actually said the StandardAnalyzer works better. I wonder what
the pros and cons are.




Re: FOP Generated PDF and PDFBox

2005-01-21 Thread Ben Litchfield


Are you indexing the FOP PDF's differently than other PDF documents?

Can I assume that you are using PDFBox's LucenePDFDocument.getDocument()
method?

Ben

On Fri, 21 Jan 2005, Luke Shannon wrote:

 Hello;

 Our CMS now allows users to create PDF documents (using FOP) and then search
 them.

 I seem to be able to index these documents OK. But when I am generating the
 results to display, I get a NullPointerException while trying to use a
 variable that should contain the url keyword for one of these documents in
 the index:

 Document doc = hits.doc(i);
 String path = doc.get("url");

 path contains null.

 The interesting thing is that this only happens with PDFs that are generated
 with FOP. Other PDFs are fine.

 What I find weird is: shouldn't the url field just contain the path of the
 file?

 Anyone else seen this before?

 Any ideas?

 Thanks,

 Luke






Re: Stemming

2005-01-21 Thread Chris Lamprecht
Also if you can't wait, see page 2 of
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html

or the LIA e-book ;)




Opening up one large index takes 940M or memory?

2005-01-21 Thread Kevin A. Burton
We have one large index right now... it's about 60G ... When I open it 
the Java VM used 940M of memory.  The VM does nothing else besides open 
this index.

Here's the code:
   System.out.println( "opening..." );
   long before = System.currentTimeMillis();
   Directory dir = FSDirectory.getDirectory( "/var/ksa/index-1078106952160/", false );
   IndexReader ir = IndexReader.open( dir );
   System.out.println( ir.getClass() );
   long after = System.currentTimeMillis();
   System.out.println( "opening...done - duration: " + (after-before) );

   System.out.println( "totalMemory: " + Runtime.getRuntime().totalMemory() );
   System.out.println( "freeMemory: " + Runtime.getRuntime().freeMemory() );

Is there any way to reduce this footprint?  The index is fully 
optimized... I'm willing to take a performance hit if necessary.  Is 
this documented anywhere?

Kevin
--
Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an 
invite!  Also see irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
If you're interested in RSS, Weblogs, Social Networking, etc... then you 
should work for Rojo!  If you recommend someone and we hire them you'll 
get a free iPod!
   
Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412



Re: Opening up one large index takes 940M or memory?

2005-01-21 Thread Kevin A. Burton
Kevin A. Burton wrote:
We have one large index right now... its about 60G ... When I open it 
the Java VM used 940M of memory.  The VM does nothing else besides 
open this index.
After thinking about it, I guess using memory equal to about 1.5% of the
index size really isn't THAT bad.  What would be nice is a way to keep this
data on disk and use a buffer (either via the filesystem or in-VM memory) to
access these variables.

This would be similar to the way the MySQL index cache works...
Kevin



Re: Opening up one large index takes 940M of memory?

2005-01-21 Thread Chris Hostetter
: We have one large index right now... its about 60G ... When I open it
: the Java VM used 940M of memory.  The VM does nothing else besides open

Just out of curiosity, have you tried turning on the verbose gc log, and
putting in some thread sleeps after you open the reader, to see if the
memory footprint settles down after a little while?  You're currently
checking the memory usage immediately after opening the index, and some
of that memory may be used holding transient data that will get freed up
after some GC iterations.


: IndexReader ir = IndexReader.open( dir );
: System.out.println( ir.getClass() );
: long after = System.currentTimeMillis();
: System.out.println( "opening...done - duration: " +
: (after-before) );
:
: System.out.println( "totalMemory: " +
: Runtime.getRuntime().totalMemory() );
: System.out.println( "freeMemory: " +
: Runtime.getRuntime().freeMemory() );
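
The suggestion above (wait, let the collector run, then measure again) could be sketched roughly as follows. This is only an illustration, not code from the thread: the class name HeapSettle and both helper methods are made up, a throwaway allocation stands in for IndexReader.open( dir ), and System.gc() is merely a hint that the VM is free to ignore. Used heap is taken as totalMemory() minus freeMemory(), since the two values printed separately above do not show usage directly.

```java
public class HeapSettle {
    /** Heap in use right now: what the VM has reserved minus what is still free inside it. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Requests a GC, pauses, and samples used heap each round; returns the last sample. */
    static long settledHeapBytes(int rounds, long pauseMillis) {
        long used = usedHeapBytes();
        for (int i = 0; i < rounds; i++) {
            System.gc();                       // only a hint; the VM may ignore it
            try {
                Thread.sleep(pauseMillis);     // give the collector time to run
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            used = usedHeapBytes();
            System.out.println("round " + i + ": used=" + (used >> 20) + "M");
        }
        return used;
    }

    public static void main(String[] args) {
        // Stand-in for IndexReader.open( dir ): allocate 16M of transient data, then drop it.
        byte[][] transientData = new byte[16][];
        for (int i = 0; i < transientData.length; i++) {
            transientData[i] = new byte[1 << 20];
        }
        System.out.println("right after: used=" + (usedHeapBytes() >> 20) + "M");
        transientData = null;
        settledHeapBytes(3, 200);
    }
}
```

Running the real test with java -verbose:gc would additionally log each collection, which helps tell transient garbage from memory the reader actually holds on to.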





-Hoss





Document 'Context' Relation to each other

2005-01-21 Thread Paul Smith
As a log4j developer, I've been toying with the idea of what Lucene 
could do for me, maybe as an excuse to play around with Lucene.

I've started creating a LoggingEvent-to-Document converter, and was 
thinking through how I'd like this utility to work when I came across a 
question I wasn't sure about.

When scanning/searching through logging events, one is usually looking 
for a particular matching event, which Lucene does excellently, but what 
a person usually also needs is the context surrounding that matching 
logging event.

With grep, one can use the -C <contextSize> argument to show that many 
lines around each matching entry. I'd like to be able to do the same 
thing with Lucene.

Now, I could add a Field to the LoggingEvent Document that holds a 
sequence #, and once a user has chosen an appropriate matching event, do 
another search for the documents whose sequence # lies within +/- the 
context size of the match.

My question is, is that going to be an efficient way to do this? The 
sequence # would be treated as text, wouldn't it?  Would the range 
search on an int be the most efficient way to do this?
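
As a rough illustration of the "treated as text" concern: if the sequence # is indexed as a term, range bounds compare as strings, so one common workaround is to zero-pad the numbers to a fixed width. The sketch below is hypothetical and not from the thread: the class SeqRange, the field name "seq", and the 10-digit width are all assumptions, and it only builds the padded bounds in Lucene's inclusive range query syntax (field:[low TO high]) rather than running a search. It also assumes the field is indexed as a single untokenized term.

```java
public class SeqRange {
    static final int WIDTH = 10; // enough digits for any non-negative int

    /** Left-pads with zeros so that string order matches numeric order. */
    static String pad(int seq) {
        if (seq < 0) {
            throw new IllegalArgumentException("sequence must be non-negative");
        }
        return String.format("%0" + WIDTH + "d", seq);
    }

    /** Inclusive range query text covering seq +/- context (context assumed >= 0). */
    static String contextQuery(String field, int seq, int context) {
        int lo = Math.max(0, seq - context);
        // Clamp the upper bound instead of overflowing past Integer.MAX_VALUE.
        int hi = (seq > Integer.MAX_VALUE - context) ? Integer.MAX_VALUE : seq + context;
        return field + ":[" + pad(lo) + " TO " + pad(hi) + "]";
    }

    public static void main(String[] args) {
        // Unpadded, "9" sorts after "10" as text; padded, the order is numeric.
        System.out.println("9".compareTo("10") > 0);
        System.out.println(pad(9).compareTo(pad(10)) < 0);
        System.out.println(contextQuery("seq", 100, 5));
    }
}
```

Whether such a range search is the most efficient option is a separate question that this sketch does not answer.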

I know from the Hits documentation that one can retrieve the Document ID 
of a matching entry.  What is the contract on this Document ID?  Is each 
Document added to the Index given an increasing number?  Can one search 
an index by Document ID?  Could one search for Document ID's between a 
range?   (Hope you can see where I'm going here).

If you have any other recommendations about Context searching I would 
appreciate any thoughts.

Many thanks for an excellent API, and kudos to Erik & Otis for a great 
eBook btw.

regards,
Paul Smith


Re: Search Chinese in Unicode !!!

2005-01-21 Thread Eric Chow
I want that Chinese Analyzer !!


On Fri, 21 Jan 2005 17:36:17 +0100, Safarnejad, Ali (AFIS)
[EMAIL PROTECTED] wrote:
 I've written a Chinese Analyzer for Lucene that uses a segmenter written by
 Erik Peterson. However, as the author of the segmenter does not want his code
 released under the Apache open source license (although his code _is_
 open source), I cannot place my work in the Lucene Sandbox.  This is
 unfortunate, because I believe the analyzer works quite well in indexing and
 searching Chinese docs in GB2312 and UTF-8 encoding, and I'd like more people
 to test, use, and confirm this.  So anyone who wants it can have it. Just
 shoot me an email.
 BTW, I have also written an Arabic analyzer, which is collecting dust for
 similar reasons.
 Good luck,
 
 Ali Safarnejad
 
 
 -Original Message-
 From: Eric Chow [mailto:[EMAIL PROTECTED]
 Sent: 21 January 2005 11:42
 To: Lucene Users List
 Subject: Re: Search Chinese in Unicode !!!
 
 Search not really correct with UTF-8 !!!
 
 The following is the search result I got using SearchFiles from the Lucene
 demo.
 
 d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java
 org.apache.lucene.demo.SearchFiles c:\temp\myindex
 Usage: java SearchFiles idnex
 Query: 
 Searching for: g  strange ??
 3 total matching documents
 0. ../docs/ChineseDemo.htmlthis files contains
 the 
   -
 1. ../docs/luceneplan.html
   - Jakarta Lucene - Plan for enhancements to Lucene
 2. ../docs/api/index-all.html
   - Index (Lucene 1.4.3 API)
 Query:
 
 Of the results above, only ChineseDemo.html contains the character that I
 searched for!
 
 The modified code in SearchFiles.java:
 
 BufferedReader in = new BufferedReader(new InputStreamReader(System.in,
 "UTF-8"));
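
One possible cause, not confirmed anywhere in this thread, is that the bytes arriving on System.in are not actually UTF-8 (a console may use a different code page), in which case the reader above would decode them into the wrong characters. The sketch below only illustrates that effect: the class ConsoleCharset and its helper are hypothetical, and a byte array stands in for the console so the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConsoleCharset {
    /** Decodes the raw input bytes with the given charset and returns the first line. */
    static String firstLine(byte[] rawInput, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(rawInput), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two Chinese characters, encoded as UTF-8 (three bytes each) plus a newline.
        byte[] raw = "\u4e2d\u6587\n".getBytes(StandardCharsets.UTF_8);

        String right = firstLine(raw, StandardCharsets.UTF_8);
        String wrong = firstLine(raw, StandardCharsets.ISO_8859_1);

        // Matching charsets recover the two characters; a mismatch yields one char per byte.
        System.out.println(right.length());
        System.out.println(wrong.length());
        System.out.println(right.equals(wrong));
    }
}
```

If the query term is decoded with a charset other than the one the console really uses, the term handed to the searcher would not be the character that was typed.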
 
 

