Hi,
I have noticed that FuzzyQueries and WildcardQueries don't do stemming.
Since all terms in the index are in stemmed forms, this causes some
problems:
Etagenwohnung gets stemmed to nwohnung. So a search for
Etagenwohnung will find Etagenwohnung and nwohnung.
Fuzzy search for Etagenwohnung~
I made a small mistake in my example. My application converted all
characters to lowercase while indexing. When I comment this out,
Etagenwohnung remains unchanged after stemming. So, my example is bad.
However, the basic problem remains (at least for all words that do not
start with a capital
Hello,
I'm trying to index HTML files with Lucene.
Do you know which HTML parser is the best one in Java?
The most powerful?
I need to extract meta tags and many other different text fields...
Thanks for your help ;)
It's not possible to generalize like that.
I like NekoHTML.
Otis
My guess is that your 2 getDocument calls are the source, that is, that
those PDF and TXT classes don't return a proper Document.
I also don't see the output created by log("doc: " + doc);
Otis
if (path.matches("\\d+_\\d{4}_[a-z]{2,3}\\.pdf")) {
doc =
log("doc: " + doc); is handled by Tomcat and directed into special log files,
so you can't see them.
System.err.println("hallo1 " + doc);
ex.printStackTrace();
System.err.println("hallo2");
this is printing the relevant output.
doc is never null,
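As a side note on the file-name test quoted above: String.matches() only succeeds when the pattern covers the whole string, so if path holds a full path rather than a bare file name, the branch would be skipped. Here is a minimal sketch of that behavior; the class name and the sample file names are made up for illustration and are not from the original code.

```java
import java.util.regex.Pattern;

class PdfNameCheck {
    // Same pattern as in the quoted snippet, precompiled once.
    private static final Pattern PDF_NAME =
            Pattern.compile("\\d+_\\d{4}_[a-z]{2,3}\\.pdf");

    // matches() must consume the entire input, so a leading directory
    // or a different extension makes the check fail.
    static boolean isPdfName(String fileName) {
        return PDF_NAME.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample names, not taken from the thread.
        System.out.println(isPdfName("12_2003_de.pdf"));       // true
        System.out.println(isPdfName("/data/12_2003_de.pdf")); // false
        System.out.println(isPdfName("12_2003_de.txt"));       // false
    }
}
```

If path can contain directories, one option is to test new File(path).getName() instead of the raw string.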
Hello,
Is there any simple way to get the information from the search results
on which of the query terms contributed the most to the document's
score? I'm working on an application which could use this sort of
information to give a hint to the user why particular document scores
the way it
If I were you I would make things simpler for myself by converting the
code to something that I could run from the command line instead of
having to go through Tomcat.
You really need to capture your exception stack trace with line numbers,
and then we can try helping.
Otis
--- Günter_Kukies
Heya,
Is it possible to sort the Hits on a given field?
(For example, a field containing the date.)
Thanks for your help!
Check out the new Explanation API in the latest CVS sources. It permits
one to get a detailed explanation of how a query was scored against a
document. Note that these explanations are designed for user perusal,
not for further computation, and are as expensive to construct as
re-running the
I switched off the -server switch from the java commandline options and
everything works fine now.
I changed nothing in my code.
So is it possible in principle to throw an Exception with no stack trace?
Any comments about this phenomenon?
Günter
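
On the general Java question: yes, an exception can carry an empty stack trace. The sketch below shows one way this can happen in plain Java, using a hypothetical exception class that overrides fillInStackTrace(). It is only an illustration, not a claim about what happened inside Tomcat. Separately, later HotSpot server VMs are known to drop stack traces for some frequently thrown built-in exceptions in optimized code (the -XX:-OmitStackTraceInFastThrow option turns that off); whether that applies to the JVM used here is a guess.

```java
class StacklessDemo {
    // Hypothetical exception type that skips stack trace capture.
    static class StacklessException extends RuntimeException {
        StacklessException(String message) {
            super(message);
        }

        @Override
        public synchronized Throwable fillInStackTrace() {
            return this; // do not record the call stack
        }
    }

    // Number of stack frames recorded in the throwable.
    static int traceDepth(Throwable t) {
        return t.getStackTrace().length;
    }

    public static void main(String[] args) {
        System.out.println(traceDepth(new StacklessException("no trace")));
        System.out.println(traceDepth(new RuntimeException("normal")) > 0);
    }
}
```

Calling printStackTrace() on such an exception prints only the class name and message, with no "at ..." lines.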
- Original Message -
From: Otis
My first question: I tried to write a phrase query, and my attempt is below. When I do
a search, the content I search for is in the index, but the query does not work. Any idea
what is wrong? I'm using the index created by the Lucene demo.
PhraseQuery query = new PhraseQuery();
BooleanQuery bQuery = new BooleanQuery();
All,
I'm in need of some pointers, hints or tips on indexing large collections
of data. I know I saw some tips on this list before, but when I tried
searching the list, I came up blank.
I have a large collection of XML files (336,000 files, around 5K apiece)
that I'm indexing and it's taking
Mike,
By way of comparison, I've got a collection of about 50,000 XML files, each
of which averages about 8K. It takes about 1.25 hours to index (on a 1.8 GHz
machine). I use basically the standard configuration (mergeFactor, etc.)
and I've got about 30 fields per document. I add about 200 new
Hello,
Since you are trying this anyway, and looking for ways to improve
indexing times... Could you perhaps try to replace use of
java.io.RandomAccessFile in FSDirectory implementation, with the
attached implementation? It supposedly increases I/O throughput by
orders of magnitude, by using
Things to consider:
- disk speed and whether it is busy satisfying other processes'
requests
- CPU speed
- amount of free RAM in the machine and amount of RAM given to your JVM
- the bottleneck: it could be a slow XML parser, for instance, so profile it
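
A profiler is the proper tool, but a rough first check of where the time goes can be sketched as follows. This assumes the indexing loop can be split into separately timed stages; the StageTimer class and the stage names are hypothetical, and the sleeps only stand in for real parsing and IndexWriter.addDocument() calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class StageTimer {
    // Accumulated wall-clock nanoseconds per named stage.
    private final Map<String, Long> nanosByStage = new LinkedHashMap<>();

    // Runs the work and adds its elapsed time to the stage total.
    void time(String stage, Runnable work) {
        long start = System.nanoTime();
        work.run();
        nanosByStage.merge(stage, System.nanoTime() - start, Long::sum);
    }

    // Total time spent in a stage, in milliseconds (0 if never timed).
    long millis(String stage) {
        return nanosByStage.getOrDefault(stage, 0L) / 1_000_000;
    }

    private static void pause(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        for (int i = 0; i < 3; i++) {
            timer.time("parse", () -> pause(20)); // stand-in for XML parsing
            timer.time("index", () -> pause(5));  // stand-in for addDocument
        }
        System.out.println("parse ms: " + timer.millis("parse"));
        System.out.println("index ms: " + timer.millis("index"));
    }
}
```

If the "parse" total dominates, the XML parser is the place to look; if "index" dominates, disk speed and the writer settings are more likely candidates.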
I'm about to submit another Lucene article to
Hi Andrzej,
Thanks for the code. I'll try it as soon as I have time. If you had a copy
of the modified FSDirectory implementation you could also share, that would
make testing it a bit quicker and easier. BTW, when you said it supposedly
increases I/O, I gather that you are not the author?
I prefer JTidy http://lempinen.net/sami/jtidy/.
Michael
-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Monday, 24 February 2003 15:03
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: Best HTML Parser !!
It's not possible to generalize like that.
Yes it is possible. Instead of catching an Exception you can do anything
else, e.g.
try {
    ...
} catch (MyException e) {
    // Print only the exception's class name instead of a stack trace.
    System.err.println(e.getClass().getName());
}
But this is off-topic here; it's a general question about Java.
Michael
-Original Message-
From: Günter Kukies