I needed this myself not long ago.
Here is a piece of code to get an Analyzer that uses a tokenizer and an
English stemmer (for "bears" it will also match "bear", and vice versa):

private static Analyzer createEnglishAnalyzer() {
    return new Analyzer() {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Completed from a truncated post: tokenize, lower-case, then stem.
            // A SnowballFilter for English would work here as well.
            return new PorterStemFilter(new LowerCaseFilter(new StandardTokenizer(reader)));
        }
    };
}
Hi Ashwin,
Well, in that case you might need to use IFilters some other way instead
of through SeekAFile. I don't know how, since I haven't used it myself.
Perhaps someone else here has.
Sachin
-Original Message-
From: ashwin kumar [mailto:[EMAIL PROTECTED]
Sent: 09 March 2007 02:48
To:
Chris Hostetter wrote:
I'm no crypto expert, but I imagine it would probably take the same
amount of statistical guesswork to reconstruct meaningful info from
either approach (hashing the individual words compared to eliminating the
positions), so I would think the trade-off of supporting phrase
Yes, I strongly support John.
I know when I reopen indexes, so what is the reason to wait for the
garbage collector?
On top of that, FieldCache uses WeakHashMap, and that may lead to memory
leaks.
Jelda
> -Original Message-
> From: John Wang [mailto:[EMAIL PROTECTED]
> Sent: Friday, March
Hi,
We use Lucene here to index our e-mails, currently 500,000 documents.
The problem arises when searching the body with a WildcardQuery.
I did some profiling with JProfiler: the more BooleanClause instances
are used, the more memory is required during the search.
Most memory is used by instances
Hi all, I have tried indexing .txt files using Lucene and it is working fine.
Now I want to index .doc, .pdf, .xls and .ppt files with Lucene.
Can someone help me with doing that?
Thanks and regards,
Ashwin
For indexing e-mail, I recommend that you tokenise the e-mail addresses into
fragments and query on the fragments as whole terms rather than using
wildcards.
Rather than looking for fischauto333* in (say) smtp-from, look for
fischauto333 in (say) an additional field called smtp-from-fragments to
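The fragment idea above can be sketched in plain Java. The splitting rule (any run of non-alphanumeric characters) and the example domain are assumptions; the point is that each fragment becomes a whole term you can match with a cheap TermQuery instead of a wildcard.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split an e-mail address into fragments that can be indexed
// as individual terms (e.g. in a hypothetical "smtp-from-fragments"
// field) and then queried as whole terms rather than with wildcards.
public class EmailFragments {
    public static List<String> fragments(String address) {
        List<String> out = new ArrayList<String>();
        // Split on anything that is not a letter or digit: the local
        // part and each domain label become separate searchable terms.
        for (String f : address.toLowerCase().split("[^a-z0-9]+")) {
            if (f.length() > 0) out.add(f);
        }
        return out;
    }

    public static void main(String[] args) {
        // example.com is a made-up address for illustration
        System.out.println(fragments("fischauto333@example.com"));
        // [fischauto333, example, com]
    }
}
```

Indexing these fragments costs a little extra space in exchange for replacing a term-enumerating wildcard query with exact term lookups.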
Hi Rob,
> For indexing e-mail, I recommend that you tokenise the e-mail addresses
> into fragments and query on the fragments as whole terms rather than
> using wildcards.
> [example]
Hm, for e-mail addresses this isn't a big problem here.
The real problem is the query on the body part of an e-mail, wh
You can also use a filter. The basic idea is to construct a Lucene
Filter, probably using something like RegexTermEnum/TermDocs.
It's faster than you think. This, in combination with
ConstantScoreQuery, should fix you right up. Several things:
1> you lose scoring with the filter part of a query w
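In that era of Lucene, a Filter boils down to a BitSet with one bit per document id, and ConstantScoreQuery just reads those bits instead of scoring. Here is the concept in plain JDK Java against a toy in-memory "corpus" (the documents and regex are made up); a real implementation would walk the index terms with something like RegexTermEnum and mark matching docs via TermDocs.

```java
import java.util.BitSet;
import java.util.regex.Pattern;

// Concept sketch: a filter is a BitSet over document ids. We scan each
// toy document's terms and set the doc's bit when a term matches the
// regex, which mirrors what a RegexTermEnum/TermDocs-based Lucene
// Filter does against a real index.
public class RegexDocFilter {
    public static BitSet bits(String[] docs, String regex) {
        Pattern p = Pattern.compile(regex);
        BitSet bits = new BitSet(docs.length);
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId].split("\\s+")) {
                if (p.matcher(term).matches()) {
                    bits.set(docId); // doc passes the filter
                    break;
                }
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        String[] docs = { "pagefile sys", "boot ini", "pagefile tmp" };
        System.out.println(bits(docs, "page.*")); // {0, 2}
    }
}
```

Because the result is just a bit per document, no per-hit score objects are allocated, which is why the filter approach keeps memory flat where a huge rewritten BooleanQuery does not.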
Search the archive, read the FAQ (see link in my signature).
On Mar 9, 2007, at 7:20 AM, ashwin kumar wrote:
> hi all i have tried indexing .txt using lucene and its working fine.
> now i want to index .doc , .pdf , .xls , .ppt with lucene
> can some one help in doing that
> thanks
> regards
> ashwin
I'm having problems with queries not returning a hit when a document
does in fact have those terms. (I'm not worried about the ranking, just
whether or not it's a hit.)
Is anything wrong with the query syntax? (see below) Also, words in the
document's index (not the Lucene index) seemed less lik
Are you perhaps exceeding this:
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
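For context, setMaxFieldLength caps how many terms of a field IndexWriter will index (the default was 10,000 in that era); terms past the cap are silently dropped, so queries on them return no hits even though the text is in the original document. A plain-Java sketch of the effect (the term list is made up):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: IndexWriter.setMaxFieldLength(int) limits how many terms of a
// field actually reach the index. Anything past the limit is dropped,
// which shows up later as a query that mysteriously finds no hits.
public class MaxFieldLengthDemo {
    public static Set<String> indexedTerms(List<String> terms, int maxFieldLength) {
        // Only the first maxFieldLength terms are kept; the rest vanish.
        return new HashSet<String>(terms.subList(0, Math.min(maxFieldLength, terms.size())));
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("alpha", "beta", "gamma", "delta");
        Set<String> indexed = indexedTerms(doc, 2);
        System.out.println(indexed.contains("alpha")); // true
        System.out.println(indexed.contains("delta")); // false: past the cap
    }
}
```

Raising the limit (e.g. to Integer.MAX_VALUE) before indexing large documents avoids the surprise, at the cost of a bigger index.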
: Date: Fri, 09 Mar 2007 12:14:38 -0500
: From: "Walker, Keith 1" <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
Hello Chris,
this is incredible!
I'm new to Lucene and just subscribed to the list for this very
phenomenon. Keith's problem was also my problem.
Your mail was the first one I read and is _exactly_ what I needed to
know.
Thanks a lot!
Cheers, Steffen
On 9 Mar 2007 at 9:25, Chris Hostetter
(Lucene 1.9.1)
I have a "filename" field in Lucene that holds a value, like this:
pagefile.sys
If I run searches through QueryParser and do a search for any of:
pagefile.sys
pagefile
pagefile.
they all work, because the query goes through getFieldQuery, which
tokenizes the string and generat
On 9 Mar 2007 at 15:10, McGuigan, Colin wrote:
> I have a "filename" field in Lucene that holds a value, like this:
> pagefile.sys
>
Hi Colin,
I'm still _very_ new to Lucene, but isn't that what un-tokenized
indexing is for?
Like in 1.9.1:
doc.add(Field.Keyword("filename", "pagefile.sys"));
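The underlying issue: an analyzer splits "pagefile.sys" at the dot, so a tokenized field holds the terms "pagefile" and "sys" but never the single term "pagefile.sys", while an untokenized (Keyword) field keeps the whole value as one term. A plain-Java illustration of the two term sets (StandardAnalyzer's real rules are more involved; this just splits on punctuation):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of why an exact match on "pagefile.sys" misses in a tokenized
// field: the analyzer breaks the value at punctuation, so the single
// term "pagefile.sys" never exists in the index. An untokenized
// (Keyword) field stores the whole value verbatim as one term.
public class FilenameTerms {
    public static List<String> tokenized(String value) {
        // crude stand-in for an analyzer: split at non-alphanumerics
        return Arrays.asList(value.toLowerCase().split("[^a-z0-9]+"));
    }

    public static List<String> untokenized(String value) {
        return Arrays.asList(value); // one term, exactly as given
    }

    public static void main(String[] args) {
        System.out.println(tokenized("pagefile.sys"));   // [pagefile, sys]
        System.out.println(untokenized("pagefile.sys")); // [pagefile.sys]
    }
}
```

With the Keyword field, an exact-match query must be built as a TermQuery (or the query string must survive analysis unchanged), since QueryParser would otherwise tokenize the user's input the same way.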
Hello,
I run Nutch and get a whole slew of articles, and when I display search
results there may be 5-6 articles that have different titles while most
of the body text is the same, but I want to group them all under one
result. These are usually AP articles that all newspapers repurpose.
When usi
Agree it's totally hackable, particularly with an md5 hashcode. If you
used a 16-bit hash, e.g. md5 mod 65536, it becomes more difficult to
reconstruct the original document but less precise in querying. It might
be nice to store the individual words contained in each document as just
a sorted l
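The 16-bit bucket idea can be sketched as hashing a normalized body and keeping only the low 16 bits of the digest (i.e. the value mod 65536). The normalization rules and sample text below are assumptions; the point is that repurposed copies that normalize to the same text land in the same bucket, while 16 bits is far too coarse to reconstruct the text from.

```java
import java.security.MessageDigest;

// Sketch: reduce each document body to a 16-bit bucket (md5 mod 65536).
// Identical normalized bodies share a bucket, so duplicate wire stories
// can be grouped under one search result, and the bucket value reveals
// almost nothing about the original text.
public class DedupBucket {
    public static int bucket(String body) throws Exception {
        // Crude normalization (an assumption): lower-case, collapse whitespace.
        String norm = body.toLowerCase().replaceAll("\\s+", " ").trim();
        byte[] md5 = MessageDigest.getInstance("MD5").digest(norm.getBytes("UTF-8"));
        // Low 16 bits of the digest == digest mod 65536.
        return ((md5[md5.length - 2] & 0xFF) << 8) | (md5[md5.length - 1] & 0xFF);
    }

    public static void main(String[] args) throws Exception {
        int a = bucket("AP  Newswire story text.");
        int b = bucket("ap newswire Story text.");
        System.out.println(a == b); // true: same body after normalization
    }
}
```

An exact hash like this only groups byte-identical normalized bodies; catching near-duplicates with small edits would need a fuzzier fingerprint, which is where storing a sorted word list per document starts to help.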
-Original Message-
From: Steffen Heinrich [mailto:[EMAIL PROTECTED]
Sent: Fri 3/9/2007 4:31 PM
To: java-user@lucene.apache.org
Subject: Re: Wildcard query with untokenized punctuation
On 9 Mar 2007 at 15:10, McGuigan, Colin wrote:
>> I have a "filename" field in Lucene that holds a value