sort is interesting. We must "retain" the granularity
> of the "original" timestamp for Index maintenance purposes,
> but we could add another field, with a granularity of
> "date" instead of "date+time", which would be used for
> sorting only.
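The day-granularity idea above can be sketched in plain Java: collapse the millisecond timestamp to a per-day key before indexing it in the sort-only field. This is a minimal sketch; the class and method names are illustrative, not from the thread.

```java
import java.time.Instant;
import java.time.ZoneOffset;

public class DayKey {
    // Collapse a millisecond timestamp to a day-granularity string,
    // suitable for indexing in a separate sort-only field.
    static String dayKey(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis)
                .atZone(ZoneOffset.UTC)
                .toLocalDate()
                .toString();
    }

    public static void main(String[] args) {
        // Two timestamps on the same UTC day map to a single sort term,
        // so the sort field has at most one unique term per day.
        System.out.println(dayKey(1313518440123L)); // 2011-08-16
        System.out.println(dayKey(1313529999999L)); // 2011-08-16
    }
}
```

With millions of documents this reduces the unique-term count for sorting from one per document to one per day.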
About your OOM. Grant asked a question that's pretty important,
how many unique terms in the field(s) you sorted on? At a guess,
you tried sorting on your timestamp and your timestamp has
millisecond or less granularity, so there are 625M of them.
Memory requirements for sorting grow as the number of unique terms grows.
It's certainly possible as others have said, but don't be surprised
if it's not performant. At root, you still have a disk out there that's
being used for fetching the data. Simply moving it from fetching
individual files to fetching that data from the index doesn't change
that fundamental fact.
Best
Index files should not be disappearing unless you're using the form
of opening an IndexWriter that creates a new index. We'd need to see
the code you use to open the IW to provide more help.
If all you're doing is looking at the index directory, segments will disappear
as they are merged, so that's expected.
You can ignore the warning.
But you haven't told us a thing about *how* the failure occurs or
what gets reported. What exactly are you doing? What exactly
fails (i.e. do you just not find files? Get a stack trace? Get a
"class not found" error?)
We really cannot help at all without more information.
Simply breaking up your index into separate pieces on the same machine
buys you nothing, in fact it costs you considerably. Have you put
a profiler on the system to see what's happening? I expect you're swapping
all over the place and are memory-constrained.
Have you considered sharding your index across separate machines?
> ReadOnlyDirectoryReader entries from fieldCache.
> Only sorting now produces an entry.
>
> So action that starts a new searcher and closes the old one (like
> replication)
> should release cache from fieldCache through garbage collection?
>
> Regards
> Bernd
>
> Am 21.06.
Hmmm, I'm not going to even try to talk about the code itself, but I will add
a couple of clarifications:
Jetty has nothing to do with it. It's in Lucene, and it's used for sorting and
sometimes faceting. The cache is associated with a reader on a machine
used to search. When replication happens, a new reader is opened and,
once the old reader is closed, its cache entries can be garbage collected.
Does this help?
http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/util/IndexableBinaryStringTools.html
If not, here's a note from Ryan McKinley on another thread (googling
lucene storing binary data brought it up)...
**
You can store binary data using a binary field type -- then you can get
the original bytes back from the stored field.
See PerFieldAnalyzerWrapper, then form your query like
field1:word1 OR field2:word1
Best
Erick
On Mon, Jun 20, 2011 at 10:40 AM, G.Long wrote:
> Hi :)
>
> I know it is possible to create a query on different fields with different
> analyzers with PerFieldAnalyzer class but is it possible to also
a plain indexing library for those typical rdbms indexing use-cases that
> you have.
>
> Dean
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Monday, June 20, 2011 6:15 AM
> To: java-user@lucene.apache.org
> Subject: Re: loo
re: 20020101 to the end of time... Use a clause like [2002-01-01 TO *]
About paging... Yes, you have to start all over again for each search. The basic
problem is that you have to score every document for each search, since the last
document scored might be the highest-scoring one.
But let's back up a second...
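The reason every document must be scored on each search can be seen with a tiny top-N collector sketch in plain Java (a simplified model, not Lucene's actual collector API):

```java
import java.util.PriorityQueue;

public class TopN {
    // Keep only the best n scores using a min-heap: every score must be
    // examined, because the final document might beat everything before it.
    static float kthBest(float[] scores, int n) {
        PriorityQueue<Float> heap = new PriorityQueue<>();
        for (float s : scores) {
            heap.offer(s);
            if (heap.size() > n) heap.poll(); // drop the current worst
        }
        return heap.peek(); // the n-th best score overall
    }

    public static void main(String[] args) {
        float[] scores = {0.2f, 0.9f, 0.1f, 0.95f}; // last doc scores highest
        System.out.println(TopN.kthBest(scores, 2)); // 0.9
    }
}
```

Since the best hit can arrive last, there is no way to stop early, which is why each "page" re-runs the whole collection pass.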
Please review:
http://wiki.apache.org/solr/UsingMailingLists
You've given us no information to go on here, what are you
trying to do when this happens? What have you tried? What
is the query you're running when this happens? How much
memory are you allocating to the JVM?
You're apparently sorting
How did you do this? Did you execute the "ant eclipse" target
first? See the instructions at:
http://wiki.apache.org/solr/HowToContribute#Eclipse_.28Galileo.2C_J2EE_version_1.2.2.20100217-2310.2C_but_any_relatively_recent_Eclipse_should_do.29:
Best
Erick
On Sat, Jun 11, 2011 at 2:16 AM, dyzc2010
My first question is "what are you trying to do at a higher level"?
Because asking people to check your code without telling us
what you're trying to accomplish makes it difficult to know what
to look at. You might review:
http://wiki.apache.org/solr/UsingMailingLists
That said, at a guess, your
<<>>
Hmmm, then it's pretty hopeless I think. Problem is that
anything you say about running on a machine with
2G available memory on a single processor is completely
incomparable to running on a machine with 64G of
memory available for Lucene and 16 processors.
There's really no such thing as an
Well, taking the code all together, what I expect is
that you'll have a document after all is done that
only has a "DocId" in it. Nowhere do you fetch
the document from the index.
What is your evidence that you haven't deleted
the document? If you haven't reopened your reader
after the above, you'll still see the old state of the index.
I take it from this that you want documents with values #outside# 20-30
to still be found? In that case you can do something like add a clause like:
OR field:[20 TO 30]^10
or similar.
Best
Erick
BTW, is there a reason you decided not to use Solr? In many ways it's
easier than straight Lucene...
to MFQP? Worth a try.
>
>
> --
> Ian.
>
>
> On Wed, Jun 8, 2011 at 2:38 PM, Erick Erickson
> wrote:
>> Could you just construct a BooleanQuery with the
>> terms against different fields instead of using MFQP?
>> e.g.
>>
>> bq.add(qp.parse("title
Could you just construct a BooleanQuery with the
terms against different fields instead of using MFQP?
e.g.
bq.add(qp.parse("title:(the AND project)", SHOULD))
bq.add(qp.parse("desc:(the AND project)", SHOULD))
etc...? If your QueryParser was created with a
PerFieldAnalyzerWrapper, I think you might get the per-field analysis you need.
>
> But thanks for the reply.
>
> On Wed, Jun 8, 2011 at 6:14 PM, Erick Erickson wrote:
>
>> hard to say. You should get a copy of Luke and inspect your index to
>> see if what you
>> think you put there is actually there. When you added data to your
>> index,
Hard to say. You should get a copy of Luke and inspect your index to
see if what you think you put there is actually there. When you added
data to your index, did you perform a commit?
Best
Erick
On Wed, Jun 8, 2011 at 2:45 AM, Pranav goyal wrote:
> There is one field DocId which I am storing as
>> In addition I am going to switch to another collector as well. ATM I
>> collect the results and then sort them using the std. Collections.sort
>> approach... I have to look what Lucene offers and switch to something
>> else.
>>
>> Thanks,
>> Alex
>>
>> O
> We have a fairly
> complex system, and adding anything Hadoop-related feels like it might
> push us over a tipping point into the realm of unwieldy overcomplexity.
> But, this is a hard problem after all, so some amount of complexity is
> inevitable.
>
> On 06/02/2011 07:05 PM, Eric
As you've found out, raw scores certainly aren't comparable across
different indexes
#unless# the documents are fairly distributed. You're talking large
indexes here,
so if the documents are balanced across all your indexes, the results should be
pretty comparable. This pre-supposes that the indexes are evenly balanced.
Have you tried using the explain method on a Searcher and examining the results?
Best
Erick
On Thu, Jun 2, 2011 at 3:51 PM, Clemens Wyss wrote:
> I have a minimal unit test in which I add three documents to an index. The
> documents have two fields "year" and "description".
> doc1(year = "2007"
me.
>
> Thanks,
> Alex
>
> On 02.06.2011 13:04, Erick Erickson wrote:
>>
>> At this size, really consider going to a single index. The lack of
>> administrative headaches alone is probably well worth the effort
>>
>> I almost guarantee that the time yo
> Multi-threaded searching will be next and if that hasn't helped, I will
> switch to one big index.
> All indexes together are rather small, ~200MB and 50.000 documents.
>
> -Alex
>
> On 01.06.2011 23:26, Erick Erickson wrote:
>>
>> I'd start by putting them
I'd start by putting them all in one index. There's no penalty
in Lucene for having empty fields in a document, unlike an
RDBMS.
Alternately, if you're opening then closing searchers each
time, that's very expensive. Could you open the searchers
once and keep them open (all 90 of them)? That alone
ed case, too.
>
> This probably leaves me with a single option which is not to use
> stopwords at all, allowing me to get the best of the both worlds. Does
> anyone have any experience on how much of increased index size
> (roughly) can I expect?
>
> Regards,
> Mindau
Hmmm, somehow I missed this days ago
Anyway, the Lucene query parsing process isn't quite Boolean logic.
I encourage you to think in terms of "required", "optional", and
"prohibited".
Both queries are equivalent; to see this, try attaching &debugQuery=on
to your URL and look at the "parsed query" section of the output.
Actually, there are no results in the range [l220-2 TO l220-10]
This is basically a string comparison, and l220-2 > l220-10 so
this range would never match.
Best
Erick
On Tue, May 17, 2011 at 1:51 PM, G.Long wrote:
> I set the field article to NOT_ANALYZED and I didn't quote the article
> value
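The empty range above comes from plain lexicographic string ordering; a pure-Java check makes it visible, and zero-padding the numeric suffix is one common workaround (the padded form is my suggestion, not from the thread):

```java
public class LexRange {
    public static void main(String[] args) {
        // At the first differing character, '2' > '1', so "l220-2" sorts
        // AFTER "l220-10" and the range [l220-2 TO l220-10] matches nothing.
        System.out.println("l220-2".compareTo("l220-10") > 0);  // true

        // Zero-padding the trailing number restores numeric ordering,
        // so [l220-02 TO l220-10] behaves as expected.
        System.out.println("l220-02".compareTo("l220-10") < 0); // true
    }
}
```

The padding has to be applied consistently at both index and query time.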
Ahhh, I probably should have read more carefully!
At any rate, I think all you need to do is specify
the reverse boolean in the SortField c'tor???
Best
Erick
On Mon, May 16, 2011 at 8:12 AM, shrinath.m wrote:
>
> Erick Erickson wrote:
>>
>> Why do you want to do this? t
Why do you want to do this? The internal doc ids are
transient. If you update a document by delete/add, the
internal id will now be different. What I'm getting at is
that I'd like to be sure the use case here does what
you think it will because this smells like an XY problem,
see:
http://people.apache.org/~hossman/#xyproblem
bq: Just curious. How would this version be published if there are
missing jars and there are compiling errors?
Well, the fact that it has been published probably means that you've
missed a step somewhere. There'd have been howls of outrage if
something as egregious as this were the case
That
I'm a bit confused by this:
***
With my query, I
would like to only return 'patrol' items and nothing else. Is there a way
to do this?? My current querying code is below. This returns all items with
'patrol' in it.
**
Are you saying that if you're searching on "p
Well, Solr officially uses Lucene, but you'll do disappointingly
little Java coding. Which some people think is a plus :).
The biggest issue will be making really, really sure that your
schema.xml file in Solr reflects your use in the Lucene code
Actually, I'd swallow the blue pill and just make the switch...
moon
> The moon is bright
> This is a moon
>
> i.e. the "leftmost hit" of my search term should be rated highest/best...
>
> How should I analyze/search my documents to get this search/rating behavior?
>
>> -Original Message-
>> From: Erick
What is the problem you're trying to solve? I'm wondering if
this is an XY problem. See:
http://people.apache.org/~hossman/#xyproblem
Best
Erick
On Wed, May 4, 2011 at 3:16 AM, Clemens Wyss wrote:
> Given the I have 3 documents with exactly one field and the fields have the
> following contents
Shingles won't do that either, so I suspect you'll have to write a custom
tokenizer.
Best
Erick
On Wed, May 4, 2011 at 2:07 AM, Clemens Wyss wrote:
> I know this is just an example.
> But even the WhitespaceAnalyzer takes the words apart, which I don't want. I
> would like the phrases as they a
Why do you want to do this? I'm wondering if this is an XY problem...
See: http://people.apache.org/~hossman/#xyproblem
Best
Erick
On Tue, May 3, 2011 at 7:55 AM, harsh srivastava wrote:
> Hi All,
>
>
> I want to know any inbuilt method in lucene that can help me to fix the
> number of searched
Are you sure you need to? They may simply have moved. Which ones
are you using? If you tell us maybe we can suggest where to find them.
Best
Erick
On Thu, Apr 28, 2011 at 9:14 AM, Tanuj Jain wrote:
> Hi,
> Can anyone please tell where I could download *packages
> org.apache.lucene.analysis.* *so
You can also specify a large slop in your phrase (e.g.
"arcos biosciences"~500 which will take distance into
account when scoring, although it may not be enough
to rank the document where you want. Sujit's comment
is probably a better place to start.
Best
Erick
On Tue, Apr 26, 2011 at 2:59 PM, Su
What do you mean by "access"? Are you trying to write to the
common index with more than one of your machines?
Best
Erick
On Wed, Apr 20, 2011 at 8:32 AM, Yogesh Dabhi wrote:
>
>
> Three Instance of My application & they access common lucene directory
>
>
>
> Instance1 jdk64 ,64 os
>
> Instance2
ersion isn't exactly
>> the fastest and easiest thing so I'd like to strike out all other
>> possibilities before :)
>>
>> Best regards,
>>
>> Erik
>>
>>
>> Am 20.04.2011 01:07, schrieb Lance Norskog:
>>>
>>> Lo
Hmmm, I don't see the problem either. It *sounds* like you don't really
have the default search field defined the way you think you do. Did you restart
Solr after making that change?
I'm assuming that when you say "not created by Solr" you mean that it's created
by Lucene. What version of Lucene are you using?
You can easily string together your own tokenizer and any number of filters
to create an analyzer that does exactly what you need. Lucene In Action
shows an example for creating your own analyzer by assembling
the standard parts
Best
Erick
On Mon, Apr 18, 2011 at 3:08 AM, Clemens Wyss wrote:
I would not go there first. There are examples out there to, for instance,
index Wikipedia but that is, IMO, too complex for just starting to get
your feet wet.
I think you'd be better off looking at the Lucene demo code and
trying to understand/modify that as a starting point, see:
http://lucene.
What information do you need? Could you just ping the stats component
and parse the results (basically the info on the admin/stats page).
Best
Erick
On Thu, Apr 14, 2011 at 11:56 AM, jm wrote:
> Hi,
>
> I need to collect some diagnostic info from customer sites, so I would like
> to get info on
You haven't given us anything to go on here, except "it doesn't work". You might
review this page:
http://wiki.apache.org/solr/UsingMailingLists
Best
Erick
On Tue, Apr 12, 2011 at 9:05 AM, Ranjit Kumar wrote:
> Hi,
>
> I am creating index with help of StandardAnalyzer for *.docx file it's
> fine. Bu
I don't quite get why the German analyzer would do this, but
all the Filters I see are stemmers and I expect they'd
reduce the words as you indicate.
What version of Lucene are you using?
Best
Erick
On Tue, Apr 12, 2011 at 8:46 AM, Clemens Wyss wrote:
> I try to apply German*Filter and or Anal
A TermQuery is really dumb. It doesn't do anything at all to the
input, it assumes you've done all that up front. Try parsing
a query rather than using TermQuery
And I suspect you'll have problems with casing, but that's another
story
Best
Erick
On Wed, Apr 6, 2011 at 6:33 AM, Mark Wilts
I suspect you're already aware of this, but I've
overlooked the obvious so many times I thought
I'd mention it...
A classic mistake is to assign a reader with reopen
and not close the old reader, see:
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/index/IndexReader.html#reopen()
FSDirectory will, indeed, store the index on disk. However,
when *using* that index, lots of stuff happens. Specifically:
When indexing, there is a buffer that accumulates documents
until it's flushed to disk. Are you indexing?
When searching (and this is the more important part), various
caches are populated in memory.
You might consider a multiValued field and a positionIncrementGap longer
than the longest tuple.
At that point, you can search for phrase queries where the slop is less than
the positionIncrementGap.
I'm a bit rushed, so if you need more details we can talk later
Best
Erick
2011/4/1 袁武 [GMa
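The arithmetic behind the positionIncrementGap trick can be sketched in plain Java. The position bookkeeping here is simulated for illustration; it is not Lucene's actual API:

```java
public class GapDemo {
    // Rough model of a phrase query with slop: can two token positions
    // be matched? (Simplified; not Lucene's exact matching logic.)
    static boolean withinSlop(int pos1, int pos2, int slop) {
        return Math.abs(pos2 - pos1) <= slop;
    }

    public static void main(String[] args) {
        int gap = 100; // positionIncrementGap, longer than the longest tuple

        // Tokens inside one value of the multiValued field sit at
        // consecutive positions, so a modest slop matches them...
        System.out.println(withinSlop(0, 1, 50));       // true

        // ...but the first token of the NEXT value starts gap positions
        // later, so any slop smaller than the gap can never match a
        // phrase spanning two values.
        System.out.println(withinSlop(1, 1 + gap, 50)); // false
    }
}
```

Choosing the gap larger than any slop you ever issue is what keeps matches confined to a single value.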
5-10 G indexes are pretty small by Lucene/Solr standards, so
given reasonable hardware resources this should be no problem.
That said, only measurement will nail this down. But an
often-used rule of thumb is that you need to consider some
better strategies in the 40G range.
CAUTION: you haven't sp
Uhhhm, doesn't "term1 term2"~5 work? If not, why not?
You might get some use from
http://lucene.apache.org/java/2_4_0/queryparsersyntax.html
Or if that's not germane, perhaps you can explain your use case.
Best
Erick
On Wed, Mar 30, 2011 at 5:49 PM, Andy Yang wrote:
> Is there a minimum string
You get this in response to doing what? Are you sure you've unpackaged
the nightly build and aren't inadvertently getting older jars?
Best
Erick
On Tue, Mar 29, 2011 at 7:21 AM, Patrick Diviacco
wrote:
> I've downloaded the nightly build of Lucene (TRUNK) and I'm referring to the
> following doc
I'm always skeptical of storing the doc IDs since they can
change out from underneath you (just delete even a single
document and optimize). What is it you're doing with
the doc ID that you couldn't do with the guid? If your "guid list"
were ordered, I can imagine building filters quite quickly from it.
ly because it
> slows down a lot computations. Is it true ?
>
> On 22 March 2011 14:29, Erick Erickson wrote:
>
>> Try Searcher.explain.
>>
>> Best
>> Erick
>>
>> On Tue, Mar 22, 2011 at 4:34 AM, Patrick Diviacco
>> wrote:
>> > Is
Try Searcher.explain.
Best
Erick
On Tue, Mar 22, 2011 at 4:34 AM, Patrick Diviacco
wrote:
> Is there a way to display Lucene scores per field instead of the global one
> ?
> Both my query and my docs have 3 fields.
>
> I would like to see the scores for each field in the results. Can I ?
>
> Or
A good habit to develop is to print out the toString() of the assembled
queries, that'll get you going pretty quickly understanding what the
query assembly is all about without having to wait for people to respond.
But the short form is that phrase queries require all the terms to be
adjacent, while a BooleanQuery just requires that they be present.
Don't do that. Let's back up a second and
ask why in the world you want to do this, what's the
use-case you're satisfying? Because spinning through
all the results and getting information from the underlying
documents is inherently expensive since, as Sanne
says, you're doing disk seeks. Most Lucene applications only need the top few hits.
The easiest way to figure out this kind of thing is to print out the
toString() on the queries after they're assembled. I believe you'll
find that the difference is that the PhraseQuery would find text like
"Term1 Term2 Term3" but not text like "Term1 some stuff Term2 more
stuff Term3", whereas the BooleanQuery only requires that all the terms
appear somewhere in the document.
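The phrase-versus-boolean distinction can be illustrated with plain string checks (the sample texts are hypothetical; real matching runs on analyzed terms, not raw substrings):

```java
public class PhraseVsBoolean {
    public static void main(String[] args) {
        String a = "Term1 Term2 Term3";
        String b = "Term1 some stuff Term2 more stuff Term3";

        // A phrase query (no slop) needs the terms adjacent and in order.
        System.out.println(a.contains("Term1 Term2 Term3")); // true
        System.out.println(b.contains("Term1 Term2 Term3")); // false

        // A BooleanQuery of three required terms only needs each term
        // to occur somewhere in the text.
        boolean allInB = b.contains("Term1") && b.contains("Term2")
                && b.contains("Term3");
        System.out.println(allInB); // true
    }
}
```

Printing toString() on the two assembled queries, as suggested above, shows the same structural difference directly.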
You can't. If by "normalize" you mean compare the scores
between two different queries, it's meaningless. The scores
from one query to another are not comparable.
If by "normalize" you mean make into a value between 0 and 1,
anywhere you have access to raw scores I believe you also have
access to the maximum score, so you can divide by it.
This sounds like you're not closing your index searchers and the file system
is keeping them around. On the Unix box, does your index space reappear
just by restarting the process?
Not using reopen correctly is sometimes the culprit, you need something like
this (taken from the javadocs):
IndexReader newReader = reader.reopen();
if (newReader != reader) {
  reader.close();
  reader = newReader;
}
passed an english stop words set. My
> question was if I have to call any other function of the german analyzer for
> it to be correct.
>
> Thank you.
>
>
> Quoting Erick Erickson :
>
>> I don't understand what you're saying here. If you put a stemmer in the
4, 2011 at 8:21 AM, Vasiliki Gkouta wrote:
> Thanks a lot for your help Erick! About the fields you mentioned: If I don't
> use stemmers, except for the constructor argument related to the stop words,
> is there anything else that I have to modify?
>
> Thanks,
StandardAnalyzer works well for most European languages. The problem will
be stemming. Applying stemming via English rules to non-English languages
produces...er...interesting results.
You can go ahead and create language-specific fields for each language and
use StandardAnalyzer with the appropriate language-specific stemmers.
This looks like just a phrase query, perhaps with no slop.
Term query definitely won't work if you've tokenized the field,
because your terms would be "A" and "B", but not "A B".
SpanQueries should also work if you want, there's no reason to
subclass anything, just use SpanNearQuery... You can
Solr doesn't do it. There exist various tokenizers/filters that just strip
the HTML tags, but there's nothing built into Solr that I know of that
understands HTML, HTML-aware operations are outside Solr's purview.
Best
Erick
On Fri, Mar 11, 2011 at 6:50 AM, shrinath.m wrote:
> On Fri, Mar 11, 20
It's not so much a matter of problems with indexing/searching
as it is with search behavior. The reason these strategies
are implemented is that using English stemming, say, on
other languages will produce "interesting" results.
There's no a-priori reason you can't index multiple languages
in the same index.
What mail client are you using? I also had this problem and it's
solved in Gmail by sending the mail as "plain text" rather than
"Rich formatting".
Best
Erick
On Fri, Mar 11, 2011 at 4:35 AM, Li Li wrote:
> hi
> it seems my mail is judged as spam.
> Technical details of permanent failure:
No, Lucene itself shouldn't be doing this, the recommendation is for multiple
threads to share a single searcher. I'd first look upstream, are your requests
being processed serially? I.e. is there a single thread that's
handling requests?
Best
Erick
On Thu, Mar 10, 2011 at 4:25 PM, RobM wrote:
>
If you're loading 100,000 documents, you can expect it to be slow. If
you're loading 10 documents, it should be quite fast... So how big is
hits.length?
And what version of Lucene are you using? The Hits object has been
deprecated for quite some time I believe.
The problem here is that you're
How large is (large)? What machines are you intending to run this on?
In general, though, don't worry about index size until you actually have some
numbers to deal with. Solr generally has resource issues based on the number
of #unique# terms in an index. So repeating the same thing in a bunch of
documents doesn't add to that count.
You have to describe in detail what "taking a huge performance hit"
means, there's not much to go on here...
But in general, adding N elements to a multi-valued field isn't a
problem at all.
This bit of code:
Document D = searcher.doc(hits[i].doc);
is very suspicious. Does your CLucene version have the same semantics?
Sure, just use a field that is not analyzed. Perhaps you want to
define a new field in your documents like "nameKey" that is
analyzed with something like KeywordAnalyzer. See:
http://lucene.apache.org/java/3_0_3/api/all/index.html
PerFieldAnalyzerWrapper will let you use different
analyzers for different fields.
What does "can't post" mean? Bounced as spam? Rejected for other
reasons? This question came through, so obviously you can post
something
I found that sending mail as "plain text" kept the spam filter
from kicking in.
Best
Erick
On Tue, Feb 15, 2011 at 7:29 AM, Li Li wrote:
> hi all
> is
This is usually something you should not do. Is there any possibility you can
combine these indexes into one? Maybe sharded? Because this approach
is almost guaranteed to scale poorly.
This smells like an XY problem, perhaps you can back up and explain
what the higher-level problem you're trying to solve is.
d content:brown)
> fox (content:fox content:trick content:throw content:slyboots content:fuddle
> content:fob content:dodger content:discombobulate content:confuse
> content:confound content:befuddle content:bedevil)
>
> 2011/2/13 Erick Erickson
>
>> At a guess make is a synon
Be aware that when you do a doc.get(), the fields are the
*stored* fields in their original, unanalyzed form. Is that really
what you want? Or do you want the tokenized form of the fields?
If the latter, you might get the Luke code; it reconstructs all the fields
in the document from the indexed terms.
At a guess, "make" is a synonym for one of your search terms. doc.get
returns the original content, not synonyms.
So what are your synonyms that might be a factor here?
Best
Erick
On Sat, Feb 12, 2011 at 6:04 AM, Gong Li wrote:
> Hi,
>
> I am tying WordNet synonyms into an SynonymAnalyzer. But I
I wonder if you can define the problem away? It sounds like
you have essentially random input here. That is, the users
can put in whatever they want so whatever you do will be wrong
sometime. Could you sidestep the problem with auto-complete
and prefix queries (essentially adding * to the user's input)?
It is, I think, a legitimate question to ask whether scoring is worthwhile
on wildcards. That is,
does it really improve the user experience? Because the MaxBooleanClause
gets tripped
pretty quickly if you add the terms back in, so you'd have to deal with
that.
Would your users be satisfied with s
Yes. You're confusing an *engine* with a full-blown application.
The user here is a Java programmer. I argue that guessing, which
is what you're asking for, is emphatically NOT in the domain of the
search *engine*, which is what Lucene is. Imagine the poor programmer
trying to understand why certain results appear.
I think all you need to do is index the keywords in one field and weights in
another.
Then just search on keywords and sort on weight.
Note: the field you sort on should NOT be tokenized.
Best
Erick
On Mon, Jan 24, 2011 at 4:02 PM, Chris Schilling wrote:
> Hello,
>
> I have a bunch of text doc
<<>>
yes
<<>>
Unknown. The devs are trying mightily to keep this kind of thing out of
the 3_x branch, but this was a fairly nasty bug rather than an
enhancement which made it important enough to put in the 3x branch.
This is NOT the same sort of issue you've seen in messages about
rebuilding tru
If you're not going to return the docs anyway,
why filter them later?
Best
Erick
> -amg
>
> On Sat, Jan 22, 2011 at 12:32 PM, Erick Erickson
> wrote:
> > I guess I don't see what the problem is. These look to me like
> > standard Lucene query syntax options. If I'm o
I guess I don't see what the problem is. These look to me like
standard Lucene query syntax options. If I'm off base here,
let me know.
If you're building your own BooleanQuery,
you can add these as sub-clauses
Here's the Lucene query syntax:
http://lucene.apache.org/java/2_9_1/queryparsersyntax.html
That's certainly valid. You could also consider n-grams here as another
approach.
Its also useful to restrict the number of leading (or trailing) characters
you allow. For
instance, requiring at least 3 non-wildcard leading characters makes a big
difference.
It's also a legitimate question how well relevance scoring serves users
on wildcard queries.
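The "at least 3 non-wildcard leading characters" rule is easy to enforce before a query ever reaches the parser. A minimal sketch; the class and method names are mine, not from any Lucene API:

```java
public class WildcardGuard {
    // Require at least minPrefix literal characters before the first
    // wildcard ('*' or '?'). Returns true if the term is acceptable.
    static boolean hasEnoughPrefix(String term, int minPrefix) {
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '*' || c == '?') {
                return i >= minPrefix;
            }
        }
        return true; // no wildcard at all, always fine
    }

    public static void main(String[] args) {
        System.out.println(hasEnoughPrefix("bio*", 3)); // true: 3 literal chars
        System.out.println(hasEnoughPrefix("b*", 3));   // false: too broad
        System.out.println(hasEnoughPrefix("term", 3)); // true: not a wildcard
    }
}
```

Rejecting overly broad wildcards up front avoids both the term-enumeration cost and the MaxBooleanClause blowups mentioned earlier.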
See below:
On Sun, Jan 16, 2011 at 10:15 AM, sol myr wrote:
> Hi,
>
> I'm trying to understand the behavior of file merging / optimization.
> I see that whenever my IndexWriter calls 'commit()', it creates a new file
> (or fileS).
> I also see these files merged when calling 'optimize()' , as mu
That is certainly one way of approaching it. Another is to add a
ClientId field to each document and add a mandatory "AND
ClientId=thisclient"
to each query.
This will have some effect on relevance since the statistics are
gathered over the whole corpus rather than just the individual client.
Als
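The mandatory per-client clause can be sketched as simple query composition. "ClientId" is the field name suggested above; the helper and example values are hypothetical:

```java
public class ClientScope {
    // Wrap a user query so results are restricted to one client's documents.
    static String scopedQuery(String userQuery, String clientId) {
        return "(" + userQuery + ") AND ClientId:" + clientId;
    }

    public static void main(String[] args) {
        // Every query the application issues gets the client restriction.
        System.out.println(scopedQuery("title:lucene", "client42"));
        // (title:lucene) AND ClientId:client42
    }
}
```

In practice you would build this as a required BooleanQuery clause or a filter rather than string concatenation, which also avoids parsing user input twice.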
You need to create (and it's pretty easy) your own analysis chain that
returns a larger position increment gap, which is intended for this
very situation using the proximity (or SpanNear) as Jokin suggested.
Best
Erick
On Tue, Jan 11, 2011 at 1:24 AM, Jokin Cuadrado wrote:
> you should use a pr
Lucene In Action has an example of creating a synonymanalyzer that
you can adapt. The general idea is to subclass from Analyzer and
implement the required functions, perhaps wrapping a Tokenizer
in a bunch of Filters.
You might be able to crib some ideas from
solr.analysis.WordDelimiterFilter
Best
Some days I just can't read...
First question: Why do you require StandardAnalyzer? Are you really making
use of the special processing? Take a look at other analyzer options:
PatternAnalyzer, SimpleAnalyzer, etc.
If you really require StandardAnalyzer, consider using two fields.
field_original
a
<<>>
No, that is not the case. Storing a field stores an exact copy of the
input, without any analysis. The intent of storing a field is to return
something to display in the results list that reflects the original
document. What use would it be to store something that had gone
through the analysis chain?
It's not a problem, but it's best to share the underlying reader. You
could open your short-lived searcher by getting a reader
via getIndexReader() on your long-lived searcher
What's the underlying use-case you're trying to make happen?
Best
Erick
On Fri, Dec 31, 2010 at 8:01 PM, Paul Libbre
Have you looked at:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
Best
Erick
On Fri, Dec 31, 2010 at 6:12 AM, Benzion G wrote:
> Hi,
>
> I need to parse the Java log files with Lucene 3.0.3. The StandardAnalyzer
> is
> OK, except its handling of dots.
>
> E.g. it handles "java.la