lucene-db

2004-12-22 Thread Daniel Cortes
I've found some websites that use lucene-db, but I have never seen this .jar. Can someone tell me where to find information about it? Could this API provide me with some elements to index a MySQL DB of a forum or wiki? Thanks

(Offtopic) The unicode name for a character

2004-12-22 Thread Peter Pimley
Hi everyone, The Question: In Java generally, Is there an easy way to get the unicode name of a character? (e.g. LATIN SMALL LETTER A from 'a') The Reasoning (for those who are interested): The documents I'm indexing have quite a lot of characters that are basically variations on the basic A-Z

Re: (Offtopic) The unicode name for a character

2004-12-22 Thread Morus Walter
Hi Peter, The Question: In Java generally, Is there an easy way to get the unicode name of a character? (e.g. LATIN SMALL LETTER A from 'a') ... I'm considering taking the unicode name for each character I encounter and regexping it against something like: ^LATIN .* LETTER (.) WITH
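A minimal sketch of the idea quoted above, assuming Java 7 or later, where `Character.getName(int)` is part of the JDK (it was not available in the Java versions current when this thread was written). The class name `UnicodeNameDemo` and the helper `baseLetter` are hypothetical; the regular expression is the one from the message, completed with a trailing `.*` so that it matches the whole name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeNameDemo {
    // Pattern from the thread: capture the single base letter that precedes " WITH".
    private static final Pattern ACCENTED =
            Pattern.compile("^LATIN .* LETTER (.) WITH .*");

    /** Returns the unaccented base letter, or the character itself if its name does not match. */
    static char baseLetter(char c) {
        String name = Character.getName(c); // null for unassigned code points
        if (name == null) {
            return c;
        }
        Matcher m = ACCENTED.matcher(name);
        if (!m.matches()) {
            return c;
        }
        char base = m.group(1).charAt(0);
        // Unicode names are upper case; restore lower case for SMALL letters.
        return name.contains(" SMALL ") ? Character.toLowerCase(base) : base;
    }

    public static void main(String[] args) {
        System.out.println(Character.getName('a'));
        System.out.println(baseLetter('\u00e9')); // e with acute
        System.out.println(baseLetter('\u00dc')); // capital U with diaeresis
    }
}
```

Names with a multi-letter base, such as ligatures, do not match the single-character group and are returned unchanged, so a real mapping would probably need extra cases.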

Re: (Offtopic) The unicode name for a character

2004-12-22 Thread Pierrick Brihaye
Hi, Morus Walter wrote: If you cannot find that list somewhere I can mail you a copy. ICU4J's copy is here: http://oss.software.ibm.com/cvs/icu4j/icu4j/src/com/ibm/icu/dev/data/unicode/UnicodeData.txt?rev=1.7&content-type=text/x-cvsweb-markup See also Unicode's one:

retrieve tokens

2004-12-22 Thread M. Smit
Hello list, I'm not sure if this subject will cover my question, but here goes: consider the following snippet: is = new IndexSearcher((String) envContext.lookup(search_index_dir)); StopAnalyzer analyzer = new StopAnalyzer(ArticleIndexer.SEARCH_STOP_WORDS_NL); parser = new

Indexing terms only

2004-12-22 Thread DES
Hi, I need to index my text so that the index contains only tokenized, stemmed words without stopwords etc. The text is German, so I tried to use GermanAnalyzer, but it stores the whole text, not terms. Please give me a tip on how to index terms only. Thanks! DES

Re: Indexing terms only

2004-12-22 Thread Mike Snare
Whether or not the text is stored in the index is a different concern than how it is analyzed. If you want the text to be indexed, and not stored, then use the Field.Text(String, String) method or the appropriate constructor when adding a field to the Document. You'll need to also store a

Re: retrieve tokens

2004-12-22 Thread Otis Gospodnetic
Martijn, have you seen the Highlighter in the Lucene Sandbox? If you've stored your text in the Lucene index, there is no need to go back to DB to pull out the blog, parse it, and highlight it - the Highlighter in the Sandbox will do this for you. Otis --- M. Smit [EMAIL PROTECTED] wrote:

Re: Indexing terms only

2004-12-22 Thread DES
I actually use Field.Text(String,String) to add documents to my index. Maybe I do not understand the way an analyzer works, but I thought that all German articles (der, die, das etc) should be filtered out. However if I use Luke to view my index, the original text is completely stored in a

Re: (Offtopic) The unicode name for a character

2004-12-22 Thread Otis Gospodnetic
If you are not tied to Java, see 'unac' at http://www.senga.org/. It's old, but if nothing else you could see how it works and rewrite it in Java. And if you can, you can donate it to Lucene Sandbox. Otis --- Peter Pimley [EMAIL PROTECTED] wrote: Hi everyone, The Question: In Java

Re: Indexing terms only

2004-12-22 Thread Erik Hatcher
On Dec 22, 2004, at 11:36 AM, Mike Snare wrote: Whether or not the text is stored in the index is a different concern than how it is analyzed. If you want the text to be indexed, and not stored, then use the Field.Text(String, String) method Correction: Field.Text(String, String) is a stored
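The distinction debated in this thread (analysis decides which terms go into the index, while the stored flag decides whether the original text is kept verbatim) can be sketched without Lucene. This is a plain-Java toy model, not Lucene's API: `ToyField`, `analyze`, and the three-word stop list are hypothetical names and values chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StoredVsIndexedDemo {
    // Hypothetical stop list; a real German analyzer would ship a longer one.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("der", "die", "das"));

    /** Toy analyzer: lower-case, split on non-letters, drop stop words. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    /** Toy field: indexed terms are always analyzed; the stored value is optional. */
    static final class ToyField {
        final List<String> indexedTerms;
        final String storedValue; // null when the field is not stored

        ToyField(String text, boolean stored) {
            this.indexedTerms = analyze(text);
            this.storedValue = stored ? text : null;
        }
    }

    public static void main(String[] args) {
        ToyField stored = new ToyField("Der Hund und die Katze", true);
        ToyField unstored = new ToyField("Der Hund und die Katze", false);
        System.out.println(stored.indexedTerms);  // analyzed terms, stop words removed
        System.out.println(stored.storedValue);   // original text, untouched by analysis
        System.out.println(unstored.storedValue); // nothing kept for display
    }
}
```

In this model, seeing the full original text in a stored field says nothing about whether stop words were filtered from the indexed terms, which may be the source of the confusion described earlier in the thread.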

Re: Indexing terms only

2004-12-22 Thread Mike Snare
I've never used the german analyzer, so I don't know what stop words it defines/uses. Someone else will have to answer that. Sorry On Wed, 22 Dec 2004 17:45:17 +0100, DES [EMAIL PROTECTED] wrote: I actually use Field.Text(String,String) to add documents to my index. Maybe I do not understand

Re: Indexing terms only

2004-12-22 Thread Mike Snare
Thanks for correcting me. I use the reader version -- hence my confusion. -Mike On Wed, 22 Dec 2004 11:53:31 -0500, Erik Hatcher [EMAIL PROTECTED] wrote: On Dec 22, 2004, at 11:36 AM, Mike Snare wrote: Whether or not the text is stored in the index is a different concern than how it is

Re: retrieve tokens

2004-12-22 Thread M. Smit
Otis, the problem is that I'm a little reluctant to store the data as Field.Text instead of Field.UnStored, because of the sheer size of the documents and the number I would like to index (say some 100-page * 2k documents). But then again, it's size versus

Re: retrieve tokens

2004-12-22 Thread Erik Hatcher
On Dec 22, 2004, at 12:04 PM, M. Smit wrote: The problem is that I'm a little reluctant to store the data as Field.Text instead of Field.UnStored, because of the sheer size of the documents and the number I would like to index (say some 100-page * 2k documents). But then again, it's size

Re: retrieve tokens

2004-12-22 Thread M. Smit
Erik Hatcher wrote: Highlighter does not mandate you store your text in the index. It is just a convenient way to do it. You're free to pull the text from anywhere and highlight it based on the query. Furthermore, you are saying that the highlighter takes care of the corresponding

Re: retrieve tokens

2004-12-22 Thread Mike Snare
But for the other issue on 'store lucene' vs 'store db': can anyone provide me with some field experience on size? The system I'm developing will provide searching through some 2000 PDFs, say some 200 pages each. I feed the plain text into Lucene on a Field.UnStored basis. I also store

Re: retrieve tokens

2004-12-22 Thread Otis Gospodnetic
I suspect Martijn really wants that snippet dynamically generated, with KWIC, as on the lucenebook.com screen shot. Thus, he can't generate and store the snippet at index time, and has to construct it at search time. Otis --- Mike Snare [EMAIL PROTECTED] wrote: But for the other issue on

Re: retrieve tokens

2004-12-22 Thread Otis Gospodnetic
For simpy.com I store the full text of web pages in Lucene, in order to provide full-text web searches. Nutch (nutch.org) does the same. You can set the maximal number of tokens you want indexed via IndexWriter. You can also compress fields in the newest version of Lucene (or maybe just the one

Re: retrieve tokens

2004-12-22 Thread Erik Hatcher
On Dec 22, 2004, at 12:43 PM, M. Smit wrote: Erik Hatcher wrote: But for the other issue on 'store lucene' vs 'store db': can anyone provide me with some field experience on size? The system I'm developing will provide searching through some 2000 PDFs, say some 200 pages each. I feed the

Re: retrieve tokens

2004-12-22 Thread Martijn
Erik Hatcher wrote: On Dec 22, 2004, at 12:43 PM, M. Smit wrote: Consider that you're only highlighting 20 or so entries at one time. Getting the text from a Lucene index you're already navigating will be quite quick. But it shouldn't be too bad to pull 20 records from a database either.

Re: retrieve tokens

2004-12-22 Thread Martijn
Otis Gospodnetic wrote: I suspect Martijn really wants that snippet dynamically generated, with KWIC, as on the lucenebook.com screen shot. Thus, he can't generate and store the snippet at index time, and has to construct it at search time. Otis That is correct. I won't be having a lot of

CFS file?

2004-12-22 Thread Steve Rajavuori
Can someone tell me the purpose of the .CFS files? The Index File Formats page does not mention this type of file. Steve Rajavuori

Search Result Text

2004-12-22 Thread Hetan Shah
All, This might have been asked earlier; if so, please point me to the earlier post. Any pointers would be appreciated. I have a bunch of HTML pages which I index using IndexHTML. My dilemma is that when I search the pages and then display the results, the text that I use for the results snippet does not get the

Re: CFS file?

2004-12-22 Thread Bernhard Messer
Steve Rajavuori wrote: Can someone tell me the purpose of the .CFS files? The Index File Formats page does not mention this type of file. uuuh, you're right, it is not documented at fileformats.html. Since Lucene 1.4, the individual index files are stored by default within one single

RE: CFS file and file formats

2004-12-22 Thread Steve Rajavuori
Thanks. I am trying to repair a corrupted 'segments' file. I am attempting to manually edit the file to add some missing segment names, but I need to add the correct segment size for each. Can anyone tell me how to determine the correct segment size (number of documents in the segment) by looking

Re: CFS file and file formats

2004-12-22 Thread Daniel Naber
On Wednesday 22 December 2004 23:41, Steve Rajavuori wrote: Thanks. I am trying to repair a corrupted 'segments' file. Why are you sure it's corrupted? Are the *.cfs files and the other file types mixed in one directory? Then that's the problem: if you have *.cfs, segments, and deletable,

search question

2004-12-22 Thread roy-lucene-user
Hi guys, We have an index with some fields containing email addresses. Doing a search for an email address with this format: [EMAIL PROTECTED], does not bring up any results with lucene 1.4. The query: Field1:[EMAIL PROTECTED] However it returns results with 1.2. Any ideas? Roy.

Re: search question

2004-12-22 Thread Erik Hatcher
What does toString() return for each of those queries? Are you using the same analyzer in both cases? Erik On Dec 22, 2004, at 5:44 PM, [EMAIL PROTECTED] wrote: Hi guys, We have an index with some fields containing email addresses. Doing a search for an email address with this format:

addIndexes() Question

2004-12-22 Thread Ryan Aslett
Hi there, I'm about to embark on a Lucene project of massive scale (between 500 million and 2 billion documents). I am currently working on parallelizing the construction of the index(es). Rough summary of my plan: I have many, many physical machines, each with multiple processors that I wish

Re: Search Result Text

2004-12-22 Thread Erik Hatcher
The demo IndexHTML does not store the contents field - it is indexed using a Reader and thus not stored. You will have to modify the code to get the complete contents available at search time. Erik On Dec 22, 2004, at 5:01 PM, Hetan Shah wrote: All, This might be asked earlier please

Re: addIndexes() Question

2004-12-22 Thread Otis Gospodnetic
I _think_ you'd be better off doing it all at once, but I wouldn't trust myself on this and would instead construct a small 3-index set and test, looking at a) maximal disk usage, b) time, and c) RAM usage. :) Otis --- Ryan Aslett [EMAIL PROTECTED] wrote: Hi there, Im about to embark on a

Exception: cannot determine sort type

2004-12-22 Thread Kauler, Leto S
We have been implementing Lucene as the datasource for our website--Lucene is exposed through a java web service which our ASP pages query and process. So far things have been going very well and in general tests everything has been fine. Interestingly though, under a small server stress test

SNOWBALL STEMMER + BOOSTING

2004-12-22 Thread Karthik N S
Hi Guys Apologies.. Using Analysis Paralysis on SnowBall Stemmer [ using StandardAnalyzer.ENGLISH_STOP_WORDS and StopAnalyzer.ENGLISH_STOP_WORDS ] from http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html?page=last#thread for the word 'jakarta^4 apache' both the cases

Word co-occurrences counts

2004-12-22 Thread Andrew.Cunningham
Hi all, I have a curious problem, and initial poking around with Lucene looks like it may only be able to half-handle the problem. The problem requires two abilities: 1. To be able to return the number of times the word appears in all the documents (which it looks like lucene can do

RE: Relevance percentage

2004-12-22 Thread Gururaja H
Hi Chuck Williams, Thanks much for the reply. If your queries are all BooleanQuery's of TermQuery's, then this is very simple. Iterate down the list of BooleanClause's and count the number whose score is 0, then divide this by the total number of clauses. Take a look at
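The counting step quoted above can be sketched in plain Java. This is only an illustration under stated assumptions: the per-clause scores are taken as an already computed `float[]` (how to obtain them from Lucene is not shown in the truncated message), and `ClauseCoverageDemo` and `zeroScoreFraction` are hypothetical names.

```java
class ClauseCoverageDemo {
    /** Fraction of clauses whose score is exactly 0, as described in the quoted reply. */
    static double zeroScoreFraction(float[] clauseScores) {
        if (clauseScores.length == 0) {
            return 0.0; // assumption: a query with no clauses has no zero-score clauses
        }
        int zero = 0;
        for (float score : clauseScores) {
            if (score == 0.0f) {
                zero++;
            }
        }
        return (double) zero / clauseScores.length;
    }

    public static void main(String[] args) {
        float[] scores = {0.7f, 0.0f, 0.2f, 0.0f}; // made-up scores for four clauses
        double zeroFraction = zeroScoreFraction(scores);
        System.out.println(zeroFraction);       // share of clauses that scored 0
        System.out.println(1.0 - zeroFraction); // complement: share with a nonzero score
    }
}
```

The complement is printed as well because a relevance percentage would more naturally be read as the share of clauses that did contribute to the score.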

Re: Word co-occurrences counts

2004-12-22 Thread Paul Elschot
On Thursday 23 December 2004 07:50, [EMAIL PROTECTED] wrote: Hi all, I have a curious problem, and initial poking around with Lucene looks like it may only be able to half-handle the problem. The problem requires two abilities: 1.To be able to return the number of times the

Re: Relevance percentage

2004-12-22 Thread Paul Elschot
On Thursday 23 December 2004 08:13, Gururaja H wrote: Hi Chuck Williams, Thanks much for the reply. If your queries are all BooleanQuery's of TermQuery's, then this is very simple. Iterate down the list of BooleanClause's and count the number whose score is 0, then divide this by the