Nalini,
Right now the best you can do is to use copyField to combine everything into
a catch-all for spellchecking purposes. While this seems wasteful, this often
has to be done anyhow because typically you'll need less/different analysis for
spellchecking than for searching. But rather than
again for your lengthy and informative response. I updated
from SVN trunk again today and was successfully able to run 'ant
test'. So I proceeded with trying your suggestions (for question 1 so
far):
On 17/01/2012 5:32 AM, Dyer, James wrote:
David,
The spellchecker normally won't give the terms with plus-signs. For
example:
q=mispelt werds&defType=edismax&mm=0&spellcheck.q=mispelt AND werds ... {other
spellcheck parameters}
James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311
-Original Message-
From: Dyer, James [mailto:james.d...@ingrambook.com]
Sent: Thursday
@lucene.apache.org
Cc: Dyer, James
Subject: Re: replication, disk space
Okay, I do have an index.properties file too, and THAT one does contain
the name of an index directory.
But it's got the name of the timestamped index directory! Not sure how
that happened, could have been Solr trying to recover
I'm not sure there is a good way to do this currently. I think you'd just
have to issue a second query with mm=100 to get additional spelling suggestions,
as maxCollationTries is designed to replicate the original query when trying
collations for hits. It might be a worthy enhancement to
You need to find apache-solr-solrj-4.0.jar from your distribution and put it
in the classpath somewhere. Perhaps the easiest thing is to include it in your
core's lib directory.
-Original Message-
From: Rob
-Original Message-
From: O. Klein [mailto:kl...@octoweb.nl]
Sent: Wednesday, January 18, 2012 7:22 AM
To: solr-user@lucene.apache.org
Subject: RE: Improving Solr Spell Checker Results
Dyer, James wrote
David
I've seen this happen when the configuration files change on the master and
replication deems it necessary to do a core-reload on the slave. In this case,
replication copies the entire index to the new directory then does a core
re-load to make the new config files and new index directory go
David,
The spellchecker normally won't give suggestions for any term in your index.
So even if wever is misspelled in context, if it exists in the index the
spell checker will not try correcting it. There are 3 workarounds:
1. Use the patch included with SOLR-2585 (this is for Trunk/4.x
Three things to check:
1. Use a higher spellcheck.count than 1. Try 10. IndexBasedSpellChecker
pre-filters the possibilities in a first pass of a 2-pass process. If
spellcheck.count is too low, all the good suggestions might get filtered on the
first pass and then it won't find anything on
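The spellcheck.count advice above can also be set as a handler default. A
minimal sketch of such a handler config (the handler name, component name, and
count value are illustrative, not taken from the original message):

```xml
<!-- sketch: /spell handler with a higher spellcheck.count default -->
<requestHandler name="/spell" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <!-- ask for more raw suggestions so the second pass has enough to rank -->
    <str name="spellcheck.count">10</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```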
We do this in production and haven't had any issues. This is a 1.4.1
installation, back when there was no threads option in DIH. We divide the
index into 8 parts and then run 8 DIH handlers at the same time, indexing
simultaneously. While Lucene itself is a bottleneck, we have a lot of data
Just be sure to download the correct binary for your version of Windows. Then
unzip the file somewhere and add curl.exe to your PATH. It should then just
work from the command line like the examples. If you need more curl help, you
might need to ask elsewhere.
With curl you can upload
I've successfully used the windows distribution of curl from
http://curl.haxx.se/download.html for this purpose. This at least works when
you put your xml in a text file and then use curl http://host:port/solr/update
-F stream.file=c:\filename.xml I do not think I was ever able to get -F
The SolrParams class is in the solrj.jar file so you should verify that this is
in the classpath. Also see if it is listed in the manifest.mf file in the
war's META-INF dir. If you're running this on a server within Eclipse and
letting Eclipse do the deploy, my experience is it can be
Pravin,
When using the file-based spell checking option, it will try to give you
suggestions for every query term regardless of whether or not they are in your
spelling dictionary. Getting the behavior you want would seem to be a worthy
enhancement, but I don't think it is currently supported.
Just committed to Solr 3.5 is a numberToKeep parameter which lets you tell it
to automatically delete the older backups. See
http://wiki.apache.org/solr/SolrReplication#HTTP_API , under which is a short
quip about this.
Just FYI that the final piece of SOLR-2382 has not been committed, and instead
has been spun off to SOLR-2943. So if you're using Trunk and you need the
ability to persist a cache on disk and then read it back again later as a DIH
entity, you'll need both SOLR-2943 and also a cache
Writing your own spellchecker to do what you propose might be difficult. At
issue is the fact that both the index-based and file-based spellcheckers
are designed to work off a Lucene index and use the document frequency reported
by Lucene to base their decisions. Both spell checkers build a
12:30 PM
To: solr-user@lucene.apache.org
Subject: Re: File based wordlists for spellchecker
On 15 November 2011 15:55, Dyer, James james.d...@ingrambook.com wrote:
Writing your own spellchecker to do what you propose might be difficult. At
issue is the fact that both the index-based and file
Dali,
You might want to try to increase spellcheck.count to something higher, maybe
10 or 20. The default spell checker pre-filters suggestions in such a way that
you often need to ask for more results than you actually want to get the right
ones. The other thing you might want to see is to
and have a nice day!
Simo
http://people.apache.org/~simonetripodi/
http://simonetripodi.livejournal.com/
http://twitter.com/simonetripodi
http://www.99soft.org/
On Tue, Oct 18, 2011 at 5:16 PM, Dyer, James james.d...@ingrambook.com
wrote:
Simone,
You can set up a master dictionary
Mark,
The bug you describe looks the same as SOLR-2726
(https://issues.apache.org/jira/browse/SOLR-2726) which doesn't seem to be part
of 3.4. You might want to try applying the patch, or better yet, just use a
fresh check-out on the 3.x branch as the current state (for un-released 3.5)
While Solr/Lucene can't support true document updates, there are 2 ways you
might be able to work around this in your situation.
1. If you store all of the fields, you can write something that will read back
everything already indexed to the document, append whatever data you want, then
write
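The read-back, append, and rewrite idea above can be sketched with plain maps
standing in for SolrJ's document classes (field names here are invented for
illustration; this is not the actual Solr API):

```java
import java.util.*;

public class ReindexSketch {
    // Lucene/Solr replaces whole documents, so an "update" means rebuilding
    // the document from its stored fields plus the new data, then re-adding it.
    static Map<String, List<String>> appendField(Map<String, List<String>> storedDoc,
                                                 String field, String value) {
        Map<String, List<String>> updated = new HashMap<>();
        for (Map.Entry<String, List<String>> e : storedDoc.entrySet()) {
            updated.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        updated.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
        return updated; // re-index this full document in place of the old one
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new HashMap<>();
        doc.put("id", new ArrayList<>(List.of("123")));
        doc.put("tag", new ArrayList<>(List.of("solr")));
        System.out.println(appendField(doc, "tag", "lucene").get("tag"));
    }
}
```

Note this only works if every field is stored, as the original message says.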
Simone,
You can set up a master dictionary but with a few caveats. What you'll need
to do is copyfield all of the fields you want to include in your master
dictionary into one field and base your IndexBasedSpellChecker dictionary on
that. In addition, I would recommend you use the collate
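The copyField arrangement described above might look like this in schema.xml
(the field, type, and source names are made up for illustration):

```xml
<!-- catch-all spelling field; analysis is usually lighter than for search -->
<field name="spell" type="textSpell" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="title" dest="spell"/>
<copyField source="author" dest="spell"/>
```

The IndexBasedSpellChecker dictionary would then be built on the `spell` field.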
If using SolrJ,
use QueryResponse.getSpellCheckResponse().getCollatedResults(). This returns
a List<Collation>. On each Collation object, getCollationQueryString() will
return the corrected queries.
Note that unless you specify spellcheck.maxCollationTries, the collations
might not return
Try adding spellcheck.collateExtendedResults=true to your query (without
maxCollationTries) to see if solrj correctly returns all 4 collations in that
case. In any case, if solrj is returning the last collation 4 times, this is
likely a bug.
The likely reason why
You will need to get the source, apply the patch and build Solr for yourself.
There are some instructions at http://wiki.apache.org/solr/HowToContribute .
It would be great if you could try this out and provide feedback.
The changes to DirectSpellChecker are included in SOLR-2585 patch, which I
sync'ed to the current Trunk today. So all you have to do is apply the patch,
build and then add the 1-2 new parameters to your query:
- spellcheck.alternativeTermCount - the # of suggestions you want to generate
on
Shawn,
I do not know of an easy or a good way to do this. It would be nice if there
were a robust, programmatic way to get back DIH status but I don't think
there is one. I have a (monstrous) program that polls a running DIH handler
every so often to get its status. The crux is
If you use spellcheck.q, you also still need to specify q for the
query handler, otherwise you'll get an NPE. Not sure that's your problem but
it's one thing to check.
-Original Message-
From: Valentin
Valentin,
There is currently an open issue about this:
https://issues.apache.org/jira/browse/SOLR-2649 . I ran into this also and
ended up telling all of the application developers to always insert AND
between every user keywords. I was using an older version of edismax but the
person who
Daniel,
This looks like a good usecase for FieldCollapsing (see
http://wiki.apache.org/solr/FieldCollapsing). Perhaps try something like:
group=true&group.field=documentId&group.limit=1&group.sort=version desc
-Original
, 2011 1:32 PM
To: solr-user@lucene.apache.org
Subject: Re: Return records based on aggregate functions?
Woah. That looks like exactly what I need. Thank you very much. Is there
any documentation for how to do that using the SolrJ API?
On Wed, Aug 17, 2011 at 2:26 PM, Dyer, James james.d
@lucene.apache.org
Subject: Re: Return records based on aggregate functions?
For response option 1, would I add the group.main=true and
group.format=simple parameters to the SolrQuery object?
On Wed, Aug 17, 2011 at 3:09 PM, Dyer, James james.d...@ingrambook.comwrote:
For the request end, you can just use
Herman,
- Specify spellcheck.maxCollations with something higher than one to get more
than 1 collation.
- If you also want the spellchecker to test whether or not a particular
collation will return hits, also specify spellcheck.maxCollationTries
- If you also want to know how many hits each
bool collateExtendedResults = true;
static float accuracy = 0.7f;
-Original Message-
From: Dyer, James [mailto:james.d...@ingrambook.com]
Sent: Wednesday, August 17, 2011 5:48 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr spellcheck and multiple collations
Herman,
- Specify
Christian,
It looks like you should probably write a Transformer for your DIH script. I
assume you have a child entity set up for PriceTable. Add a Transformer to
this entity that will look at the value of currency and price, remove these
from the row, then add them back in with currency as
If you want to try MMapDirectory with Solr 1.4, then copy the class
org.apache.solr.core.MMapDirectoryFactory from 3.x or Trunk, and either add it
to the .war file (you can just add it under src/java and re-package the war),
or you can put it in its own .jar file in the lib directory under
The most likely problem is forgetting to specify spellcheck.build=true on the
first query since the last restart. This builds the spell check dictionary
used by the IndexBasedSpellChecker. You should put this in a warming query or
alternatively, specify build-on-commit or build-on-optimize.
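A sketch of a spellchecker configuration with build-on-commit enabled (the
dictionary name and field name are assumptions, not from the original message):

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <!-- rebuild the dictionary automatically instead of relying on
         spellcheck.build=true after every restart -->
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
```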
I would imagine if you're doing updates all day the commit might take a long
time. You could try it though and see if it works for you. Another option,
which will use more disk space, is to replicate all your data to another core
just after midnight. Then update the data all day long as you
-Original Message-
From: Dyer, James
Sent: Friday, July 29, 2011 8:58 AM
To: solr-user@lucene.apache.org
Subject: RE: Updating opinion
I would imagine if you're doing updates all day the commit might take a long
time. You could try it though and see
Would it be possible to just run two separate deltas, one that updates records
that changed in ds1 and another that updates records that changed in ds2? Of
course this would be inefficient if a lot of records typically change in both
places at the same time.
With this approach, you might
You need to index the field you want to facet on.
-Original Message-
From: Mark juszczec [mailto:mark.juszc...@gmail.com]
Sent: Thursday, July 28, 2011 3:50 PM
To: solr-user@lucene.apache.org
Subject: field with repeated
: Re: field with repeated data in index
James
Wow. That was fast. Thanks!
But I thought you couldn't index a field that has duplicate values?
Mark
On Thu, Jul 28, 2011 at 4:53 PM, Dyer, James james.d...@ingrambook.comwrote:
You need to index the field you want to facet on.
I could not reproduce the problem even with the two parameters you show below
added to the Default handler. I tried using this default handler with
different queries with correct and incorrect terms. I made sure it would
sometimes successfully create collations and other times try to create
If you're getting OOM's, double-check that you're on 3.3. There was a nasty
bug in 3.0 - 3.2 that would cause OOM in conjunction with spellcheck collations
in some cases. Ditto if Solr hangs as you might be in a Garbage Collection
loop. If you have your jvm running with verbose gc's you'll
It sounds like that could be a bug. Could you provide some details on how
you're building your dictionary (config snippets), and what parameters you're
using to query, etc. ? Your jvm settings and a rough estimate of how big your
index is would be helpful too. It would be nice to try and
I'm afraid there currently isn't much support for correcting misplaced
whitespace. Solr is going to look at each word individually and won't even try
to combine adjacent words (or split a word into 2 or more). So there is no good
way to get these kinds of suggestions.
One thing that might
...
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, July 25, 2011 10:13 AM
To: solr-user@lucene.apache.org
Cc: Dyer, James
Subject: Re: Spellcheck compounded words
This will work
Helmut,
I recently submitted SOLR-2549
(https://issues.apache.org/jira/browse/SOLR-2549) to handle both fixed-width
and delimited flat files. To be honest, I only needed fixed-width support for
my app so this might not support everything you mention for delimited files,
but it should be a
Demian,
If you omit spellcheckIndexDir from the configuration, it will create an
in-memory spelling dictionary.
-Original Message-
From: Demian Katz [mailto:demian.k...@villanova.edu]
Sent: Tuesday, June 07, 2011
-user@lucene.apache.org
Subject: Re: Spellcheck Phrases
are there any updates on this? any third party apps that can make this work
as expected?
On Wed, Feb 23, 2011 at 12:38 PM, Dyer, James james.d...@ingrambook.comwrote:
Tanner,
Currently Solr will only make suggestions for words that are not in the
dictionary, unless you specify spellcheck.onlyMorePopular=true. However,
if you do
You're up against a couple of real limitations with Solr's spell checking. The
first limitation is that you can only use 1 dictionary per query.
The second limitation is that if a word is in the dictionary it never tries to
correct it. This will happen even if you *don't* combine your two
Are you trying to do something like this:
defType=dismax&qf=what where&q=(spellchek me with both diktionaries fur what
and where)
??
If so, then I believe your only option is to create a third dictionary that
combines what and where into one big uber-dictionary. Create a new field
and
This is a limitation of Lucene/Solr in that there is no way to tell it to not
match across multi-valued field occurrences.
A workaround is to convert your query to a phrase and add a slop factor less
than your positionIncrementGap. ex: q="alice trudy"~99 ... This example
assumes that your
The failure to commit bug with $deleteDocById can be fixed by applying patch
SOLR-2492. This patch also partially fixes the no updated stats bug in that
it increments 1 for every call to $deleteDocById and $deleteDocByQuery. Note
that this might result in inaccurate counts if the id given
if only delete requests were generated and then do not
call DIH.
Previously, I found another open issue, created from Ephraim:
https://issues.apache.org/jira/browse/SOLR-2104
It's the same issue, but it hasn't had any updates yet.
Regards,
Alexandre
On Wed, May 25, 2011 at 3:17 PM, Dyer, James james.d
Richard,
To enable the guarantee you need to specify spellcheck.maxCollationTries with
a value other than zero (which is default). There is cost involved with
verifying beforehand if the collations will return hits so this feature is
off by default. Also, you may want to enable extended
I recently set up a solrj application that uses Solr Trunk and grouping. I
didn't see where there was any explicit support in solrj for grouping (in
Trunk...Maybe there is in the old SOLR-236 version). But you can set any
parameters on the request like this:
SolrQuery query = new
There is still a functionality gap in Solr's spellchecker even with Solr-2010
applied. If a user enters a word that is in the dictionary, solr will never
try to correct it. The only way around this is to use
spellcheck.onlyMorePopular. The problem with this approach is
onlyMorePopular
I just did a clean check out on the 1.4.1 branch and then applied the latest
(10/22/2010) version of SOLR-2010_141.patch and it applied cleanly.
I noticed from the listing you sent that for new files it removes trailing
CRs from the text. Maybe it's not doing this for you on files that need
I also should mention that solr-2010 is incorporated in Solr 3.1, so if you can
upgrade you won't need a patch. Note, however, that you will still want to
apply the fix in solr-2462 regardless of the version as this fix hasn't been
committed anywhere.
Siva,
If you specify spellcheck.collate=true, then the spell checker will return
you a corrected query. Your client has to re-run this query as there is no way
to get Solr to automatically redirect the response to the correction. The new
query will return documents that have the corrected
We're on the final stretch in getting our product database in Production with
Solr. We have 13m wide-ish records with quite a few stored fields in a
single index (no shards). We sort on at least a dozen fields and facet on
20-30. One thing that came up in QA testing is we were getting full
bits can be spared and the performance
benefit of this approach. In some cases, it might just be easier to
use full length 64-bit pointers.
2011/3/18 Dyer, James james.d...@ingrambook.com:
We're on the final stretch in getting our product database in Production with
Solr. We have 13m wide-ish
Jonathan,
When I was first setting up replication a couple weeks ago, I had this working,
as described here:
http://wiki.apache.org/solr/SolrReplication#Replicating_solrconfig.xml
I created the slave's solrconfig.xml and saved it on the master in the conf
dir as solrconfig_slave.xml, then
matches what's in that wiki
page anyway.
On 2/28/2011 3:14 PM, Dyer, James wrote:
Jonathan,
When I was first setting up replication a couple weeks ago, I had this
working, as described here:
http://wiki.apache.org/solr/SolrReplication#Replicating_solrconfig.xml
I created the slave's
Where do you get your Lucene/Solr downloads from?
[X] ASF Mirrors (linked in our release announcements or via the Lucene website)
[] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
[X] I/we build them from source via an SVN/Git checkout.
[] Other (someone in your company
Camden,
You may also want to be aware that there is a new feature added to Spell
Check's collate functionality that will guarantee the collations will return
hits. It also is able to return more than one collation and tell you how many
hits each one would result in if re-queried. This might
Add spellcheck.onlyMorePopular=true to your query and I think it'll do what you
want. See
http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular for
more info.
One caveat is if you use spellcheck.collate, this will likely result in
useless, nonsensical collations most of
, Jan 17, 2011 at 2:19 PM, Dyer, James james.d...@ingrambook.comwrote:
Camden,
You may also want to be aware that there is a new feature added to Spell
Check's collate functionality that will guarantee the collations will
return hits. It also is able to return more than one collation and tell
: Dyer, James
Subject: Re: StopFilterFactory and qf containing some fields that use it and
some that do not
It's a known 'issue' in dismax, (really an inherent part of dismax's
design with no clear way to do anything about it), that qf over fields
with different stop word definitions will produce
I'm running into a problem with StopFilterFactory in conjunction with (e)dismax
queries that have a mix of fields, only some of which use StopFilterFactory.
It seems that if even 1 field on the qf parameter does not use
StopFilterFactory, then stop words are not removed when searching any
On Wed, Jan 12, 2011 at 5:19 PM, Dyer, James
james.d...@ingrambook.comwrote:
I'm running into a problem with StopFilterFactory in conjunction with
(e)dismax queries that have a mix of fields, only some of which use
StopFilterFactory. It seems that if even 1 field on the qf parameter
The SpellCheckComponent in v1.4 does not use fq. All it does is take the
keywords out of the q (or spellcheck.q) parameter and check them against the
entire dictionary. If any keyword is not in the dictionary, it gives you a
list of alternatives. The collate function then takes the query and
The phrase solution works, as does escaping the space with a backslash:
fq=Product:Electric\ Guitar ... actually a lot of characters need to be escaped
like this (ampersands and parentheses come to mind)...
I assume you already have this indexed as string, not text...
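For clients building fq values programmatically, SolrJ ships a helper for this
(ClientUtils.escapeQueryChars). As a rough standalone sketch of the same idea,
note that the character set below is an approximation I chose for illustration,
not the authoritative list:

```java
public class QueryEscape {
    // Backslash-escape characters that are special to the Solr/Lucene query
    // parser, plus the space, mirroring fq=Product:Electric\ Guitar.
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (" \\+-!():^[]\"{}~*?|&;".indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("fq=Product:" + escape("Electric Guitar"));
        // fq=Product:Electric\ Guitar
    }
}
```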
I'm using SOLR 1.4.1 with SOLR-1553 applied (edismax query parser). I'm
experiencing inconsistent behavior with terms grouped in parentheses.
Sometimes they are AND'ed and sometimes OR'ed together.
1. q=Title:(life)&defType=edismax -- 285 results
2. q=Title:(hope)&defType=edismax -- 34 results
I have two request handlers set up something like this:
<requestHandler name="Keyword_SI" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">edismax</str>
<float name="tie">0.01</float>
<str name="qf">Title^130 Features^110 Edition^100 CTBR_SEARCH^90
: Wednesday, December 22, 2010 4:08 PM
To: solr-user@lucene.apache.org
Subject: Re: edismax inconsistency -- AND/OR
On 12/22/2010 8:25 AM, Dyer, James wrote:
I'm using SOLR 1.4.1 with SOLR-1553 applied (edismax query parser). I'm
experiencing inconsistent behavior with terms grouped in parenthesis
Dennis,
If you need to search a key/value pair, you'll have to put them both in the
same field, somehow. One way is to re-index them using the key in the
fieldname. For instance, suppose you have:
contributor: dyer, james
contributor: smith, sam
role: author
role: editor
...but you want
, we had to do something like
this:
contrib: dyer, james|author|123
contrib: smith, sam|editor|456
But Lucene/Solr will guarantee that multivalued fields return in exactly the
same order you put them in. So with SOLR we can do this:
contrib_name: dyer, james
contrib_name: smith, sam
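Because values within each multivalued field come back in insertion order, the
parallel fields can be zipped back into records client-side. A minimal sketch
(the field contents and output format are illustrative):

```java
import java.util.*;

public class ParallelFields {
    // Zip parallel multivalued fields: index i of each list belongs to the
    // same contributor, since Solr preserves insertion order per field.
    static List<String> zip(List<String> names, List<String> roles) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            out.add(names.get(i) + " (" + roles.get(i) + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(zip(List.of("dyer, james", "smith, sam"),
                               List.of("author", "editor")));
    }
}
```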
-Original Message-
From: Dyer, James
Sent: Friday, December 17, 2010 10:59 AM
To: solr-user@lucene.apache.org
Subject: RE: A schema inside a Solr Schema (Schema in a can)
Dennis,
I may be misunderstanding your question, but I think I've just worked through
something similar
We have ~50 long-running SQL queries that need to be joined and denormalized.
Not all of the queries are to the same db, and some data comes from fixed-width
data feeds. Our current search engine (that we are converting to SOLR) has a
fast disk-caching mechanism that lets you cache all of
We're working on budgeting for an environment to begin using SOLR in
Production and the question came up about whether or not we should pay for
commercial support on the container that SOLR runs under. We've pretty much
decided to run on JBOSS simply because that's what we use company-wide.
I'm not sure, but SOLR-1499 might have what you want.
https://issues.apache.org/jira/browse/SOLR-1499
-Original Message-
From: Chantal Ackermann [mailto:chantal.ackerm...@btelligent.de]
Sent: Wednesday, November 10,
Mark,
I have the same question so I did a little research on this. Not a complete
answer but here is what I've found:
- threads was aded with SOLR-1352
(https://issues.apache.org/jira/browse/SOLR-1352).
- Also see
You should be building your index on a field that creates tokens on whitespace.
So your dictionary would have "iphone" and "case" as separate terms instead of
"iphone case" as one term. And if you query on something like "iphole case",
it will give suggestions for "iphole" but not for "case" because the
In your standard Search Handler, you have the last-components array inside
<lst name="defaults">. However, it should be outside, as in the /spell Search
Handler. Try this:
<requestHandler name="standard" class="solr.SearchHandler" default="true">
<!-- default values for query parameters -->
lst
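A sketch of the corrected placement (the spellcheck component name is an
assumption; the point is only that last-components sits outside the defaults
list, as a direct child of the handler):

```xml
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <!-- default values for query parameters go here -->
  </lst>
  <!-- last-components belongs here, outside <lst name="defaults"> -->
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```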
Look at SOLR-2010 which has patches for 1.4.1 and trunk. It works with the
spellcheck collate functionality and ensures that collations are returned
only if they can result in hits if requeried (it tests each collation with any
fq you put on the original query). This would effectively prevent
You might want to look at SOLR-2010. This patch works with the collation
feature, having it test the collations it returns to ensure they'll return
hits. So if a user types san jos it will know that the combination san
jose is in the index and san ojos is not.
A couple of people have asked about getting SOLR-2010 to work on v1.4.1. I
uploaded a backported patch to JIRA today. See
https://issues.apache.org/jira/browse/SOLR-2010 and file SOLR-2010_141.patch.
SOLR-2010 improves the SpellCheckComponent Collate functionality. Specifically,
1. Only
This possibly might be a bug. See
http://lucene.472066.n3.nabble.com/Spellcheck-help-td951059.html#a990476
-Original Message-
From: fabritw [mailto:fabr...@gmail.com]
Sent: Thursday, August 19, 2010 12:51 PM
To:
If you could, let me know how your testing goes with this change. I too am
interested in having the Collate work as well as it can. It looks like the
code would be better with this change but then again I don't know what the
original author was thinking when this was put in.
Mark,
I'd like to see your code if you open a JIRA for this. I recently
opened SOLR-2010 with a patch that does something similar to the second
part only of what you describe (find combinations that actually return a
match). But I'm not sure if my approach is the best one so I would like
to see
In org.apache.solr.spelling.SpellingQueryConverter, find the line (#84):
final static String PATTERN = "(?:(?!(" + NMTOKEN + ":|\\d+)))[\\p{L}_\\-0-9]+";
and remove the |\\d+ to make it:
final static String PATTERN = "(?:(?!" + NMTOKEN + ":))[\\p{L}_\\-0-9]+";
My testing shows this solves your
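As a rough illustration of the effect of this change, the sketch below uses a
simplified stand-in for Solr's NMTOKEN constant (the real one is more
elaborate); it only demonstrates that dropping |\\d+ from the negative
lookahead lets purely numeric terms through:

```java
import java.util.regex.Pattern;

public class PatternSketch {
    // Simplified stand-in for the NMTOKEN constant in SpellingQueryConverter.
    static final String NMTOKEN = "[a-zA-Z_][a-zA-Z0-9_]*";

    // The original line (#84) and the suggested fix:
    static final String ORIGINAL =
        "(?:(?!(" + NMTOKEN + ":|\\d+)))[\\p{L}_\\-0-9]+";
    static final String FIXED =
        "(?:(?!" + NMTOKEN + ":))[\\p{L}_\\-0-9]+";

    static boolean isTerm(String pattern, String s) {
        return Pattern.matches(pattern, s);
    }

    public static void main(String[] args) {
        System.out.println(isTerm(ORIGINAL, "42"));  // false: digits excluded
        System.out.println(isTerm(FIXED, "42"));     // true after the fix
        System.out.println(isTerm(FIXED, "word"));   // plain terms still match
    }
}
```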