I've found some websites that use lucene-db, but I've never seen this .jar.
Can someone tell me where to find information about it?
Can this API provide me with what I need to index the MySQL DB of a forum or
wiki?
Thanks
Hi everyone,
The Question:
In Java generally, is there an easy way to get the Unicode name of a
character? (e.g. LATIN SMALL LETTER A from 'a')
The Reasoning (for those who are interested):
The documents I'm indexing have quite a lot of characters that are
basically variations on the basic A-Z
Hi Peter,
The Question:
In Java generally, is there an easy way to get the Unicode name of a
character? (e.g. LATIN SMALL LETTER A from 'a')
...
I'm considering taking the unicode name for each character I encounter
and regexping it against something like:
^LATIN .* LETTER (.) WITH
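That regex idea is workable. In modern Java (7 and later, so well after this thread) Character.getName returns exactly these Unicode names; a sketch of the approach, with the pattern tightened to SMALL/CAPITAL:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BaseLetter {
    // Matches names like "LATIN SMALL LETTER A WITH GRAVE", capturing the base letter.
    private static final Pattern BASE =
        Pattern.compile("^LATIN (?:SMALL|CAPITAL) LETTER (.) WITH ");

    // Returns the base A-Z letter for an accented Latin character,
    // or the character itself if its name doesn't match the pattern.
    static char baseLetter(char c) {
        String name = Character.getName(c);  // Java 7+; null for unassigned code points
        if (name != null) {
            Matcher m = BASE.matcher(name);
            if (m.find()) {
                char base = m.group(1).charAt(0);  // names use the capital form
                return Character.isLowerCase(c) ? Character.toLowerCase(base) : base;
            }
        }
        return c;
    }
}
```

In the Java of 2004 this method doesn't exist, so you would look the names up in UnicodeData.txt or via ICU4J, as discussed below.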
Hi,
Morus Walter wrote:
If you cannot find that list somewhere I can mail you a copy.
ICU4J's one is here :
http://oss.software.ibm.com/cvs/icu4j/icu4j/src/com/ibm/icu/dev/data/unicode/UnicodeData.txt?rev=1.7&content-type=text/x-cvsweb-markup
See also Unicode's one:
Hello list,
I'm not sure if this subject will cover my question, but here goes:
consider the following snippet:
is = new IndexSearcher((String) envContext.lookup(search_index_dir));
StopAnalyzer analyzer = new StopAnalyzer(ArticleIndexer.SEARCH_STOP_WORDS_NL);
parser = new
hi
I need to index my text so that the index contains only tokenized, stemmed
words, without stopwords etc. The text is German, so I tried to use
GermanAnalyzer, but it stores the whole text, not terms. Please give me a tip
on how to index terms only. Thanks!
DES
Whether or not the text is stored in the index is a different concern
than how it is analyzed. If you want the text to be indexed, and not
stored, then use the Field.Text(String, String) method or the
appropriate constructor when adding a field to the Document. You'll
need to also store a
Martijn, have you seen the Highlighter in the Lucene Sandbox?
If you've stored your text in the Lucene index, there is no need to go
back to DB to pull out the blog, parse it, and highlight it - the
Highlighter in the Sandbox will do this for you.
Otis
--- M. Smit [EMAIL PROTECTED] wrote:
I actually use Field.Text(String, String) to add documents to my index. Maybe
I do not understand the way an analyzer works, but I thought that all German
articles (der, die, das etc.) should be filtered out. However, if I use Luke
to view my index, the original text is completely stored in a
If you are not tied to Java, see 'unac' at http://www.senga.org/.
It's old, but if nothing else you could see how it works and rewrite it
in Java. And if you do, you could donate it to the Lucene Sandbox.
Otis
--- Peter Pimley [EMAIL PROTECTED] wrote:
Hi everyone,
The Question:
In Java
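(A Java rewrite of what unac does is now a few lines with java.text.Normalizer, which arrived in Java 6, after this thread: decompose to NFD and strip the combining marks. A sketch, not a drop-in port of unac:)

```java
import java.text.Normalizer;

public class Unaccent {
    // Removes accents by decomposing characters (NFD) so that each accented
    // letter becomes a base letter plus combining marks, then deleting the
    // combining marks (Unicode category M).
    static String unaccent(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }
}
```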
On Dec 22, 2004, at 11:36 AM, Mike Snare wrote:
Whether or not the text is stored in the index is a different concern
than how it is analyzed. If you want the text to be indexed, and not
stored, then use the Field.Text(String, String) method
Correction: Field.Text(String, String) is a stored
I've never used the german analyzer, so I don't know what stop words
it defines/uses. Someone else will have to answer that. Sorry
On Wed, 22 Dec 2004 17:45:17 +0100, DES [EMAIL PROTECTED] wrote:
I actually use Field.Text(String,String) to add documents to my index. Maybe
I do not understand
Thanks for correcting me. I use the reader version -- hence my confusion.
-Mike
On Wed, 22 Dec 2004 11:53:31 -0500, Erik Hatcher
[EMAIL PROTECTED] wrote:
On Dec 22, 2004, at 11:36 AM, Mike Snare wrote:
Whether or not the text is stored in the index is a different concern
than how it is
Otis,
Problem is, though, that I'm a little reluctant to store the data as
Field.Text instead of Field.UnStored, because of the sheer size of the
documents and the multitude I would like to index (say some 100-page *
2k documents). But then again, it's size versus
On Dec 22, 2004, at 12:04 PM, M. Smit wrote:
Problem is, though, that I'm a little reluctant to store the data as
Field.Text instead of Field.UnStored, because of the sheer size of the
documents and the multitude I would like to index (say some 100-page *
2k documents). But then again, it's size
Erik Hatcher wrote:
Highlighter does not mandate you store your text in the index. It is
just a convenient way to do it. You're free to pull the text from
anywhere and highlight it based on the query.
Furthermore, you are saying that the highlighter takes care of the
corresponding
But on the other issue of 'store in Lucene' vs. 'store in DB': can anyone
provide me with some field experience on size?
The system I'm developing will provide searching through some 2000
PDFs, say some 200 pages each. I feed the plain text into Lucene on a
Field.UnStored basis. I also store
I suspect Martijn really wants that snippet dynamically generated, with
KWIC, as on the lucenebook.com screen shot. Thus, he can't generate
and store the snippet at index time, and has to construct it at search
time.
Otis
--- Mike Snare [EMAIL PROTECTED] wrote:
But for the other issue on
For simpy.com I store the full text of web pages in Lucene, in order to
provide full-text web searches. Nutch (nutch.org) does the same. You
can set the maximum number of tokens you want indexed via IndexWriter.
You can also compress fields in the newest version of Lucene (or maybe
just the one
On Dec 22, 2004, at 12:43 PM, M. Smit wrote:
Erik Hatcher wrote:
But on the other issue of 'store in Lucene' vs. 'store in DB': can anyone
provide me with some field experience on size?
The system I'm developing will provide searching through some 2000
PDFs, say some 200 pages each. I feed the
Erik Hatcher wrote:
On Dec 22, 2004, at 12:43 PM, M. Smit wrote:
Consider that you're only highlighting 20 or so entries at one time.
Getting the text from a Lucene index you're already navigating will be
quite quick. But it shouldn't be too bad to pull 20 records from a
database either.
Otis Gospodnetic wrote:
I suspect Martijn really wants that snippet dynamically generated, with
KWIC, as on the lucenebook.com screen shot. Thus, he can't generate
and store the snippet at index time, and has to construct it at search
time.
Otis
That is correct. I won't be having a lot of
Can someone tell me the purpose of the .CFS files? The Index File Formats
page does not mention this type of file.
Steve Rajavuori
All,
This might have been asked earlier; if so, a pointer to the earlier post
would be appreciated.
I have a bunch of HTML pages which I index using IndexHTML. My dilemma is
that when I want to search the pages and then display the results, the text
that I use for the results snippet does not get the
Steve Rajavuori schrieb:
Can someone tell me the purpose of the .CFS files? The Index File Formats
page does not mention this type of file.
uuuh, you're right, it is not documented at fileformats.html.
Since Lucene 1.4, the individual index files are by default stored
within one single
Thanks. I am trying to repair a corrupted 'segments' file. I am attempting
to manually edit the file to add some missing segment names, but I need to
add the correct segment size for each. Can anyone tell me how to determine
the correct segment size (number of documents in the segment) by looking
On Wednesday 22 December 2004 23:41, Steve Rajavuori wrote:
Thanks. I am trying to repair a corrupted 'segments' file.
Why are you sure it's corrupted? Are the *.cfs files and the other file
types mixed in one directory? Then that's the problem: if you have *.cfs,
segments, and deletable,
Hi guys,
We have an index with some fields containing email addresses. Doing a search
for an email address with this format: [EMAIL PROTECTED], does not bring up any
results with Lucene 1.4.
The query: Field1:[EMAIL PROTECTED]
However, it returns results with 1.2. Any ideas?
Roy.
What does toString() return for each of those queries? Are you using
the same analyzer in both cases?
Erik
On Dec 22, 2004, at 5:44 PM, [EMAIL PROTECTED] wrote:
Hi guys,
We have an index with some fields containing email addresses. Doing a
search for an email address with this format:
Hi there, I'm about to embark on a Lucene project of massive scale
(between 500 million and 2 billion documents). I am currently working
on parallelizing the construction of the index(es).
Rough summary of my plan:
I have many, many physical machines, each with multiple processors that
I wish
The demo IndexHTML does not store the contents field - it is indexed
using a Reader and thus not stored. You will have to modify the code
to get the complete contents available at search time.
Erik
On Dec 22, 2004, at 5:01 PM, Hetan Shah wrote:
All,
This might be asked earlier please
I _think_ you'd be better off doing it all at once, but I wouldn't
trust myself on this and would instead construct a small 3-index set
and test, looking at a) maximal disk usage, b) time, and c) RAM usage.
:)
Otis
--- Ryan Aslett [EMAIL PROTECTED] wrote:
Hi there, I'm about to embark on a
We have been implementing Lucene as the datasource for our
website--Lucene is exposed through a java web service which our ASP
pages query and process. So far things have been going very well and in
general tests everything has been fine.
Interestingly though, under a small server stress test
Hi Guys
Apologies..
Using Analysis Paralysis on SnowBall Stemmer [ using
StandardAnalyzer.ENGLISH_STOP_WORDS and StopAnalyzer.ENGLISH_STOP_WORDS ] from
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html?page=last#thread
for the word 'jakarta^4 apache'
both the cases
Hi all,
I have a curious problem, and initial poking around with Lucene looks
like it may only be able to half-handle the problem.
The problem requires two abilities:
1. To be able to return the number of times the word appears in all
the documents (which it looks like lucene can do
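For the first requirement, here's a toy illustration in plain Java of what "number of times the word appears in all the documents" means (this is not the Lucene API — in Lucene you would sum the per-document frequencies of the term instead):

```java
import java.util.List;

public class TotalCount {
    // Total occurrences of a word across a collection of documents,
    // using a naive lowercase split on non-word characters as the tokenizer.
    static int totalOccurrences(List<String> docs, String word) {
        int total = 0;
        for (String doc : docs) {
            for (String token : doc.toLowerCase().split("\\W+")) {
                if (token.equals(word)) total++;
            }
        }
        return total;
    }
}
```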
Hi Chuck Williams,
Thanks much for the reply.
If your queries are all BooleanQuery's of
TermQuery's, then this is very simple. Iterate down the list of
BooleanClause's and count the number whose score is 0, then divide
this by the total number of clauses. Take a look at
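The computation the quoted reply describes can be sketched in a few lines. This is a hypothetical helper, assuming you have already extracted a score for each BooleanClause into an array; note that counting the zero-score clauses, as suggested, gives the fraction of clauses that did *not* match:

```java
public class ClauseMatch {
    // Fraction of clauses with a zero score, i.e. the clauses that did not
    // match, per the suggestion: count zero-score clauses, divide by total.
    static double zeroScoreFraction(float[] clauseScores) {
        int zeros = 0;
        for (float s : clauseScores) {
            if (s == 0f) zeros++;
        }
        return (double) zeros / clauseScores.length;
    }
}
```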
On Thursday 23 December 2004 07:50, [EMAIL PROTECTED] wrote:
Hi all,
I have a curious problem, and initial poking around with Lucene looks
like it may only be able to half-handle the problem.
The problem requires two abilities:
1. To be able to return the number of times the
On Thursday 23 December 2004 08:13, Gururaja H wrote:
Hi Chuck Williams,
Thanks much for the reply.
If your queries are all BooleanQuery's of
TermQuery's, then this is very simple. Iterate down the list of
BooleanClause's and count the number whose score is 0, then divide
this by the