How many synonym sets do you have? I'm using about 600 sets with
no problem. --wunder
On 11/19/07 8:23 PM, climbingrose [EMAIL PROTECTED] wrote:
Correction for last message: you need to modify or extend
SynonymFilterFactory instead of SynonymFilter. SynonymFilterFactory is
responsible for
1000 qps is a lot of load, at least 30M queries/day.
We are running dual CPU Power P5 machines and getting about 80 qps
with worst case response times of 5 seconds. 90% of responses are
under 70 msec.
Our expected peak load is 300 qps on our back-end Solr farm.
We execute multiple back-end
This can be useful, but it is limited. At Infoseek, we used this
for demoting porn and spam in the index in 1996, but replaced it
with more precise approaches.
wunder
On 11/22/07 6:49 AM, Ryan McKinley [EMAIL PROTECTED] wrote:
Jörg Kiegeland wrote:
Yes, SOLR-139 will eventually do what you
AM, Walter Underwood [EMAIL PROTECTED]
wrote:
OpenSearch was a pretty poor design and is dead now, so I wouldn't
expect any new implementations. Google's GData (based on Atom)
reuses the few useful OpenSearch elements needed for things
like number of hits. Solr's Atom support really should
implementers. Heck, Doug Cutting was there.
http://infolab.stanford.edu/~gravano/workshop_participants.html
wunder
On 11/26/07 6:28 PM, Ed Summers [EMAIL PROTECTED] wrote:
On Nov 26, 2007 5:35 PM, Walter Underwood [EMAIL PROTECTED] wrote:
GData is really pretty useful. OpenSearch was just
Dictionaries are surprisingly expensive to build and maintain and
bi-gram is surprisingly effective for Chinese. See this paper:
http://citeseer.ist.psu.edu/kwok97comparing.html
I expect that n-gram indexing would be less effective for Japanese
because it is an inflected language. Korean is
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Walter Underwood [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Tuesday, November 27, 2007 2:41:38 PM
Subject: Re: CJK Analyzers for Solr
Dictionaries are surprisingly expensive to build
Since they all use the same schema, can you add a client ID to each document
when it is indexed? Filter by clientid:4 and you get a subset of the
index.
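For example, with a hypothetical client id of 4:

    http://localhost:8983/solr/select?q=widgets&fq=clientid:4

Putting the clause in fq instead of q keeps it out of relevance scoring
and lets the filterCache reuse it across queries.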
wunder
On 12/11/07 1:01 PM, Owens, Martin [EMAIL PROTECTED] wrote:
Hello everyone,
The system we're moving from (dtSearch) allows each of
Fetch your 70,000 results in 70 chunks of 1000 results. Parse each chunk
and add it to your internal list.
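A rough SolrJ sketch of that loop (URL and query are placeholders;
SolrJ of that era used CommonsHttpSolrServer):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class ChunkedFetch {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery("*:*");   // placeholder query
            query.setRows(1000);                      // chunk size
            List<SolrDocument> all = new ArrayList<SolrDocument>();
            for (int start = 0; start < 70000; start += 1000) {
                query.setStart(start);
                QueryResponse rsp = solr.query(query);
                all.addAll(rsp.getResults());         // parse and accumulate
            }
        }
    }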
If you are allowed to parse Python results, why can't you use a different
XML parser?
What sort of more work are you doing? I've implemented lots of stuff
on top of a paged model, including
That is not a very useful load test, since it doesn't match what
you'll see in production. About half our requests are served
from cache. Cache hits are all CPU, cache misses are heavy
on IO. Testing with all cache misses will underestimate CPU
by a huge amount.
It is very hard to simulate a
I recommend the opencsv library for Java or the csv package for Python.
Either one can write legal CSV files.
There are lots of corner cases in CSV and some differences between
applications, like whether newlines are allowed inside a quoted field.
It is best to use a library for this instead of
Yes, they are reputable. They've been doing consulting with Verity,
Ultraseek, and other platforms for many years. --wunder
On 1/12/08 1:22 AM, Chris Hostetter [EMAIL PROTECTED] wrote:
It is pretty cool to see a reputable
Search company (is ideaeng.com a reputable search consulting company?
This error means that the JVM has run out of heap space. Increase the
heap space. That is an option on the java command. I set my heap to
600 MB and do it this way with Tomcat 6:
JAVA_OPTS=-Xmx600M tomcat/bin/startup.sh
wunder
On 1/16/08 8:33 AM, David Thibault [EMAIL PROTECTED] wrote:
Solr filters already provide a restricted view of results, so the
code that calls Solr can choose the appropriate handler for each
class of users. Make sure that end users cannot directly access the
Solr server, or at least not the search URL (/solr/select).
Building authentication and
How often does the index change? Can you use an HTTP cache and do this
once for each new index?
wunder
On 1/31/08 9:09 AM, Andy Blower [EMAIL PROTECTED] wrote:
Actually I do need all facets for a field, although I've just realised that
the tests are limited to only 100. Ooops. So it should
Our users can blow up the parser without special characters.
AND THE BAND PLAYED ON
TO HAVE AND HAVE NOT
Lower-casing in the front end avoids that.
We have auto-complete on titles, so there are plenty
of chances to inadvertently use special characters:
Romeo + Juliet
Airplane!
How about the query parser respecting backslash escaping? I need
free-text input, no syntax at all. Right now, I'm escaping every
Lucene special character in the front end. I just figured out that
it breaks for colon; I can't search for 12:01 with 12\:01.
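For reference, that front-end escaping amounts to something like this
(a sketch, not the actual code):

    // Backslash-escape Lucene query syntax characters. As noted above,
    // the parser does not honor the escape for every one of them
    // (colon is the problem case here).
    static String escapeLucene(String q) {
        StringBuilder sb = new StringBuilder(q.length());
        for (char c : q.toCharArray()) {
            if ("+-!(){}[]^\"~*?:\\&|".indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }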
wunder
On 2/7/08 11:06 AM, Chris Hostetter
We have a movie with this title: 6'2
I can get that string indexed, but I can't get it through the query
parser and into DisMax. It goes through the analyzers fine. I can
run the analysis tool in the admin interface and get a match with
that exact string.
These variants don't work:
6'2
6'2\
On 2/11/08 8:42 PM, Chris Hostetter [EMAIL PROTECTED] wrote:
if you want to worry about smart load balancing, try to load balance based
on the nature of the URL query string ... make you load balancer pick
a slave by hashing on the q param for example.
This is very effective. We used this at
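A minimal sketch of that routing (slave list and hash choice are
illustrative):

    // Identical q strings always land on the same slave, so its
    // query caches see far more repeats.
    static String pickSlave(String q, String[] slaves) {
        return slaves[(q.hashCode() & 0x7fffffff) % slaves.length];
    }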
On 2/12/08 7:40 AM, Ken Krugler [EMAIL PROTECTED] wrote:
In general immediate updating of an index with a continuous stream of
new content, and fast search results, work in opposition. The
searcher's various caches are getting continuously flushed to avoid
stale content, which can easily kill
That does seem really slow. Is the index on NFS-mounted storage?
wunder
On 2/12/08 7:04 AM, Erick Erickson [EMAIL PROTECTED] wrote:
Well, the *first* sort to the underlying Lucene engine is expensive since
it builds up the terms to sort. I wonder if you're closing and opening the
underlying
Python marshal format is worth a try. It is binary and can represent
the same data as JSON. It should be a good fit to Solr.
We benchmarked that against XML several years ago and it was 2X faster.
Of course, XML parsers are a lot faster now.
wunder
On 2/21/08 10:50 AM, Grant Ingersoll [EMAIL
I saw a 100X slowdown running with indexes on NFS.
I don't understand going through a lot of effort with unsupported
configurations just to share an index. Local disk is cheap, the
snapshot stuff works well, and local discs avoid a single point
of failure.
The testing time to make a shared index
is not done successfully, so I need to do something manually.
If you have only one index, there is a risk of messing up the index.
Thanks,
Jae
-Original Message-
From: Walter Underwood [mailto:[EMAIL PROTECTED]
Sent: Tue 2/26/2008 1:27 PM
To: solr-user@lucene.apache.org
Have you timed how long it takes to copy the index files? Optimizing
can never be faster than that, since it must read every byte and write
a whole new set. Disc speed may be your bottleneck.
You could also look at disc access rates in a monitoring tool.
Is there read contention between the
, and that optimise time is going to be at
least O(n)
James
On 28 Feb 2008, at 09:07, Walter Underwood wrote:
Have you timed how long it takes to copy the index files? Optimizing
can never be faster than that, since it must read every byte and write
a whole new set. Disc speed may be your bottleneck
Please answer with the size of your index (post-optimize) and how long
an optimize takes. I'll collect the data and see if I can draw a line
through it.
190 MB, 55 seconds
$ du -sk /apps/wss/solr_home/data/index
191592 /apps/wss/solr_home/data/index
$ grep commit
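(Back of the envelope: that optimize read about 190 MB and wrote about
190 MB in 55 seconds, roughly 7 MB/s of combined I/O, well under raw
sequential disc speed, so the time probably went to merge work rather
than pure copying.)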
You have no cache at all when you stop and restart Solr. I recommend
using the provided scripts for index distribution. Run snappuller
and snapinstaller every two hours.
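A typical way to schedule that (paths are examples; the scripts take
their settings from conf/scripts.conf):

    # crontab on each query slave: new snapshot every two hours
    0 */2 * * * /apps/solr/bin/snappuller && /apps/solr/bin/snapinstaller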
The scripts already do the right thing. A snapshot is created after
a commit on the indexer. Snappuller only copies over an
Good point. My numbers are from a full rebuild. Let's collect maximum
times, to keep it simple. --wunder
On 2/28/08 7:28 PM, Alex Benjamen [EMAIL PROTECTED] wrote:
It mostly depends on whether the index is completely new or incremental:
4Gb, 28MM docs, ~30min (new index)
4Gb, 28MM
28, 2008, at 1:15 PM, Walter Underwood wrote:
Please answer with the size of your index (post-optimize) and how long
an optimize takes. I'll collect the data and see if I can draw a line
through it.
190 MB, 55 seconds
$ du -sk /apps/wss/solr_home/data/index
191592 /apps/wss/solr_home
In solrconfig.xml, configure a listener for postOptimize but not for
postCommit. That listener runs snapshooter. You will only create
snapshots after an optimize. That's what I do.
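In solrconfig.xml that looks like this (dir depends on where the
scripts live in your install):

    <listener event="postOptimize" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">solr/bin</str>
      <bool name="wait">true</bool>
    </listener>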
wunder
On 2/29/08 11:38 AM, Alex Benjamen [EMAIL PROTECTED] wrote:
OK, I'll give it a shot... Couple of issues I
Section 2.2 of the XML spec. Three characters from the 0x00-0x19 block
are allowed: 0x09, 0x0A, 0x0D.
Annotated version: http://www.xml.com/axml/testaxml.htm
Section 2.2 in current official spec: http://www.w3.org/TR/REC-xml/#charsets
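If your source data can contain the disallowed control characters,
strip them before building the XML. A sketch in Java:

    // Keep only code points legal in XML 1.0: 0x09, 0x0A, 0x0D,
    // 0x20-0xD7FF, 0xE000-0xFFFD, and 0x10000-0x10FFFF.
    static String stripIllegalXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            boolean ok = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (ok) sb.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        return sb.toString();
    }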
wunder
On 3/2/08 6:44 AM, Brian Whitman [EMAIL PROTECTED]
Ultraseek has recent and relevant as an option. We used the document age
in days (now - document_date) and took the log of that. You need to adjust
the boost to have the desired amount of influence.
The most conservative approach is to use it as a tiebreaker, so that
you can distinguish between
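A hypothetical boost in that spirit (the constants are invented; tune
them so recency only breaks ties, per the advice above):

    // Log-of-age demotion: a week-old and a month-old document differ
    // much less than a day-old and a week-old one.
    static double recencyBoost(long docDateMillis) {
        double ageDays =
            Math.max(0L, System.currentTimeMillis() - docDateMillis) / 86400000.0;
        return 1.0 / (1.0 + 0.1 * Math.log(1.0 + ageDays));
    }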
Generally, the accented version will have a higher IDF, so it
will score higher.
wunder
On 3/11/08 8:44 AM, Renaud Waldura [EMAIL PROTECTED]
wrote:
Peter:
Very interesting. To take care of the issue you mention, could you add
multiple synonyms with progressively less accents?
E.g. you'd
Golly, let me think. I can use the out-of-the-box, tested Solr
stuff for syncing indexes or I can invent some command line kludge
that does the same thing, except I will need to write it and test
it myself. Which one is easier?
Seriously, the existing Solr index distribution is great stuff.
I
Getting 10,000 records will be slow.
What are you doing with 10,000 records?
wunder
On 3/19/08 10:07 PM, 李银松 [EMAIL PROTECTED] wrote:
I want to get the top 1-10010 records from two different servers, so I have
to get the top 10010 scores from each server and merge them to get
the results.
The data to transport is about 500k (1 docs' scores)
and the QTime is about 100ms, but the total time I used is about 10+
seconds. I want to know whether it really costs that much time or whether
something else is wrong.
2008/3/20, Walter Underwood [EMAIL PROTECTED]:
Getting 10,000
records will be slow.
What
the same
language. We didn't do it in Ultraseek because it would have been an
incompatible index change and the benefit didn't justify that.
wunder
==
Walter Underwood
Former Ultraseek Architect
Current Entire Netflix Search Department
On 3/20/08 9:45 AM, Benson Margulies [EMAIL PROTECTED] wrote:
Token
We do a similar thing with a no stopword, no stemming field.
There are a surprising number of movie titles that are entirely
stopwords. Being There was the first one I noticed, but
To be and to have wins the prize for being all-stopwords
in two languages.
See my list, here:
We use two fields, one with and one without stopwords. The exact
field has a higher boost than the other. That works pretty well.
It helps to have an automated relevance test when tuning the boost
(and other things). I extracted queries and clicks from the logs
for a couple of months. Not
I've started implementing something to use fuzzy queries for selected fields
in dismax. The request handler spec looks like this:
exact~0.7^4.0 stemmed^2.0
If anyone has already done this, I'd be glad to use it.
I'm working with an older version of Solr, so I won't have a 1.2 patch
right
In order to do that I have to change to a 64-bit OS so I can have more than
4 GB of RAM. Is there any way to see how long it takes Solr to warm up
the searcher?
On Wed, Apr 16, 2008 at 11:40 AM, Walter Underwood [EMAIL PROTECTED]
wrote:
A commit every two minutes means that the Solr caches
memory for the
index? I was under the impression that Solr did not support a RAMIndex.
Walter Underwood wrote:
Do it. 32-bit OS's went out of style five years ago in server-land.
I would start with 8GB of RAM. 4GB for your index, 2 for Solr, 1 for
the OS and 1 for other processes
It should help to weight the terms with their frequency in the
original document. That will distinguish between two documents
with the same terms, but different focus.
wunder
On 4/22/08 7:46 AM, Erik Hatcher [EMAIL PROTECTED] wrote:
No, the MLT feature does not have that kind of field-specific
DisMax preserves a fair amount of syntax. It isn't a pure text
query.
We have a small client library (written before solrj) that
escapes all the stuff that Solr doesn't. If you are already
lowercasing queries, then you can fix AND, OR, and NOT by
replacing them with their lowercase equivalents.
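That replacement is a one-liner per operator, something like:

    // Turn operator tokens into plain words so AND/OR/NOT lose
    // their special meaning in the query parser.
    static String neutralizeOperators(String q) {
        return q.replaceAll("\\bAND\\b", "and")
                .replaceAll("\\bOR\\b", "or")
                .replaceAll("\\bNOT\\b", "not");
    }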
Status pages should be sent with Pragma: no-cache. That is a bug.
wunder
On 4/24/08 6:29 PM, Erik Hatcher [EMAIL PROTECTED] wrote:
The issue is the HTTP caching feature of Solr, for better or worse in
this case. It confuses me often when I hit this myself. Try hitting
that URL with curl
In our setup, snapshooter is triggered on optimize, not commit.
We can commit all we want on the master without making a
snapshot. That only happens when we optimize.
The new Searcher is the biggest performance impact for us.
We don't have that many documents (~250K), so copying an
entire index
Custom trickery is pretty standard for access controls in search.
A couple of the high points from deploying Ultraseek: three incompatible
single sign-on systems in one company, and a system that controlled
which links were shown instead of access to the docs themselves.
The latter amazed me. If
On 4/28/08 10:20 AM, Chris Hostetter [EMAIL PROTECTED] wrote:
the recursive mapping was something i put in the DismaxQueryParser because
it was easy. The param syntax of the DismaxRequestHandler has never
supported it, but it's possible someone out there has a subclass that
takes advantage
I've been doing it with synonyms and I have several hundred of them.
Concatenating bi-word groups is pretty useful for English. We have a
habit of gluing words together: "database" used to be two words, and
dictionaries still think "webserver" should be "web server".
wunder
On 5/1/08 10:47 AM, Geoffrey Young
ghost world => ghost world, ghostworld
ghostbusters => ghostbusters, ghost busters
I don't see as many in personal names. Mostly, things like De Niro
and DiCaprio.
wunder
On 5/1/08 11:13 AM, Geoffrey Young [EMAIL PROTECTED] wrote:
Walter Underwood wrote:
I've been doing it with synonyms and I have
I wrote a prefix map (ternary search tree) in Java and load it with
queries to Solr every two hours. That keeps the autocomplete and
search index in sync.
Our autocomplete gets over 25M hits per day, so we don't really
want to send all that traffic to Solr.
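Not the actual ternary search tree, but a simpler in-memory stand-in
that shows the idea of answering prefixes without touching Solr:

    import java.util.SortedMap;
    import java.util.TreeMap;

    public class Autocomplete {
        // title -> popularity, reloaded from Solr queries every two hours
        private final TreeMap<String, Integer> titles =
            new TreeMap<String, Integer>();

        public SortedMap<String, Integer> complete(String prefix) {
            // every key that starts with the prefix, in sorted order
            return titles.subMap(prefix, prefix + '\uffff');
        }
    }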
wunder
On 5/6/08 2:37 AM, Nishant
to match the max cached
request in our middle tier HTTP server. We have over twenty front
end webapps and five back end Solr servers.
wunder
On 5/6/08 9:50 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote:
Hi Wunder,
- Original Message
From: Walter Underwood [EMAIL PROTECTED]
To: solr
And use a log of real queries, captured from your website or one
like it. Query statistics are not uniform.
wunder
On 5/9/08 6:20 AM, Erick Erickson [EMAIL PROTECTED] wrote:
This still isn't very helpful. How big are the docs? How many fields do you
expect to index? What is your expected
ASAP means As Soon As Possible, not As Soon As Convenient.
Please don't say that if you don't mean it. --wunder
On 5/12/08 6:48 AM, Ricky [EMAIL PROTECTED] wrote:
Hi Mike,
Thanx for your reply. I have got the answer to the question posted.
I know people are donating time here. ASAP doesn't
There is one huge advantage of talking to Solr with SolrJ (or any
other client that uses the REST API), and that is that you can
put an HTTP cache between that and Solr. We get a 75% hit rate
on that cache. SOAP is not cacheable in any useful sense.
I designed and implemented the SOAP interface
We have some useful single character terms in the rating field,
like G and R, alongside PG and others.
wunder
On 5/12/08 1:33 PM, Yonik Seeley [EMAIL PROTECTED] wrote:
On Mon, May 12, 2008 at 4:13 PM, Naomi Dushay [EMAIL PROTECTED] wrote:
So I'm now asking: why would SOLR want single
Try creating a separate field that does not remove stopwords,
populating that with copyfield and configuring the phrase
queries to go against that field instead.
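In schema.xml terms (field and type names made up):

    <field name="title_exact" type="text_nostop" indexed="true" stored="false"/>
    <copyField source="title" dest="title_exact"/>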
I do something similar. For both regular and phrase queries,
we have a stemmed and stopped field and another field with
neither. The
N-gram works pretty well for Chinese, there are even studies to
back that up.
Do not use the N-gram matches for highlighting. They look really
stupid to native speakers.
wunder
On 5/14/08 2:03 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote:
There are no free morphological analyzers for Chinese
I've worked with the Basis products. Solid, good support.
Last time I talked to them, they were working on hooking
them into Lucene.
For really good quality results from any of these, you need
to add terms to the user dictionary of the segmenter. These
may be local jargon, product names, personal
The people working on Lucene are pretty smart, and this sort of
query optimization is a well-known trick, so I would not worry
about it.
A dozen years ago at Infoseek, we checked the count of matches
for each term in an AND, and evaluated the smallest one first.
If any of them had zero matches,
Do you need all the results? I have never seen a search UI that showed
all results at once.
Fetching all the results will be slow. Most sites fetch just the
results needed to display one page.
wunder
On 6/5/08 12:46 AM, khirb7 [EMAIL PROTECTED] wrote:
hello everybody,
I want to improve
I recommend using the OpenCSV package. Works fine, Apache 2.0 license.
http://opencsv.sourceforge.net/
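A minimal writing example with that library (file name invented; the
package was au.com.bytecode.opencsv in the versions of that era):

    import java.io.FileWriter;
    import au.com.bytecode.opencsv.CSVWriter;

    public class CsvDump {
        public static void main(String[] args) throws Exception {
            CSVWriter writer = new CSVWriter(new FileWriter("results.csv"));
            writer.writeNext(new String[] { "id", "title" });
            writer.writeNext(new String[] { "1", "To Have and Have Not" });
            writer.close();  // quoting and embedded newlines handled for you
        }
    }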
wunder
On 6/11/08 10:00 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote:
Hi Marshall,
I don't think there is a CSV Writer, but here are some pointers for writing
one:
$ ff \*Writer\*java
We use it out of the box. Our extensions are new filters or new
request handlers, all configured through the XML files.
wunder
On 6/13/08 11:15 AM, Chris Hostetter [EMAIL PROTECTED] wrote:
The Solr Developers would like some feedback from the user community
regarding some changes that have
The spider was given an admin login so it could access all
content. Reasonable decision if the pages had been designed well.
Even with a confirmation, never delete with a GET. Use POST.
If the spider ever discovers the URL that the confirmation
uses, it will still delete the content.
Luckily,
Send multiple deletes, with a commit after the last one. --wunder
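Concretely, that is a series of update messages with the commit last
(ids are examples):

    <delete><id>101</id></delete>
    <delete><id>102</id></delete>
    <delete><id>103</id></delete>
    <commit/>

If the ids share something filterable, the query form
<delete><query>...</query></delete> avoids sending thousands of ids.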
On 7/4/08 8:40 AM, Jonathan Ariel [EMAIL PROTECTED] wrote:
yeah I know. the problem with a query is that there is a maximum number of
query terms that I can add, which is reasonable. The problem is that I have
thousands of Ids.
be sufficient.
-Mike
On 4-Jul-08, at 9:06 AM, Jonathan Ariel wrote:
Yes, I just wanted to avoid N requests and do just 2.
On Fri, Jul 4, 2008 at 12:48 PM, Walter Underwood [EMAIL PROTECTED]
wrote:
Send multiple deletes, with a commit after the last one. --wunder
On 7/4/08 8:40 AM
Why do you want random hits? If we know more about the bigger
problem, we can probably make better suggestions.
Fundamentally, Lucene is designed to quickly return the best
hits for a query. Returning random hits from the entire
matched set is likely to be very slow. It just isn't what
Lucene is
starting at a given random number .. would that work? Sounds a bit
kludgy to me even as I say it.
Sean
--
From: Walter Underwood [EMAIL PROTECTED]
Sent: Monday, July 07, 2008 5:06 PM
To: solr-user@lucene.apache.org
Subject: Re
For capacity planning, our autocomplete gets more than 10X as many
requests as our search. Solr can handle our search just fine, but
I wrote an in-memory prefix match to handle the 25-30M autocomplete
matches each day. I load that by doing Solr queries, so the two
stay in sync.
wunder
On 7/9/08
On 7/12/08 7:00 PM, Chris Harris [EMAIL PROTECTED] wrote:
Mike, your idea of indexing bigrams is also interesting. Do you know
if any text search platforms do this behind the scenes as their
default way of handling phrase queries?
Infoseek indexed biwords with their Ultra engine, which lives
You might be able to split the ranking into a common score and
a dynamic score. Return the results nearly the right order, then
do a minimal reordering after. If you plan to move a result by
a maximum of five positions, then you could fetch 15 results to
show 10 results. That is far, far cheaper
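A sketch of that two-stage ranking (Result, its score fields, and
solrTop() are hypothetical stand-ins):

    List<Result> firstPage(String query) {
        List<Result> candidates = solrTop(query, 15);  // static ranking
        Collections.sort(candidates, new Comparator<Result>() {
            public int compare(Result a, Result b) {
                return Double.compare(b.staticScore + b.dynamicScore,
                                      a.staticScore + a.dynamicScore);
            }
        });
        return candidates.subList(0, 10);  // one page, locally reordered
    }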
Try putting them all in one index. Your fields can be s1_name for
schema 1, s2_name for schema 2, and so on.
The only reason to have separate indexes is if each group of
content has a different update schedule and if you have high
traffic (over 1M queries/day).
wunder
On 8/8/08 8:19 AM,
I meant update frequency more than schedule. If one group of content
is updated once per day and another every ten minutes, and most of
the traffic is going to the slow collection, splitting them could help.
wunder
On 8/8/08 8:25 AM, Walter Underwood [EMAIL PROTECTED] wrote:
Try putting
Stripping accents doesn't quite work. The correct translation
is language-dependent. In German, o-dieresis should turn into
oe, but in English, it should be o (as in coöperate or
Mötley Crüe). In Swedish, it should not be converted at all.
There are other character-to-string conversions:
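For example (illustrative, not a complete table):

    // German folds ö to oe; a Swedish table would leave it alone.
    Map<Character, String> germanFolds = new HashMap<Character, String>();
    germanFolds.put('ö', "oe");
    germanFolds.put('ü', "ue");
    germanFolds.put('ä', "ae");
    germanFolds.put('ß', "ss");  // a true character-to-string conversion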
This is fairly high on our to-do list. I'm inclined to index the
bi-words at the same position as the first word, like synonyms.
wunder
On 8/13/08 2:27 PM, Brendan Grainger [EMAIL PROTECTED] wrote:
Hi Ryan,
We do basically the same thing, using a modified ShingleFilter
I hate to blame the JDK, but we tried 1.6 for our production
webapp and it was crashing too often. Unless you need 1.6,
you might try 1.5. --wunder
On 8/16/08 1:54 PM, Chris Harris [EMAIL PROTECTED] wrote:
On Sat, Aug 16, 2008 at 4:33 AM, Grant Ingersoll [EMAIL PROTECTED] wrote:
What version
I would do it in the client, even if it meant parsing the query,
modifying it, then unparsing it.
This is exactly like changing To: to Zu: in a mail header.
Show that in the client, but make it standard before it goes
onto the network.
If queries at the Solr/Lucene level are standard, then users
Also, + in a URL parameter turns into a space. The URL for this query:
+field:Jake
should look like this:
?q=%2Bfield%3AJake
The admin UI takes care of that for you.
wunder
On 8/21/08 5:53 PM, Erik Hatcher [EMAIL PROTECTED] wrote:
On Aug 21, 2008, at 7:33 PM, Jake Conk wrote:
I'm
On 8/27/08 5:54 PM, Yonik Seeley [EMAIL PROTECTED] wrote:
That's really only one use case though... the other being to have a
single stored field that is analyzed multiple different ways.
We are the other use case. We take a title and put it in three
fields: one merely lowercased, one stemmed
You don't need two schemas. Have a "type" field with values
job_post and job_profile, then filter based on type:job_post
and type:job_profile.
wunder
On 8/28/08 4:57 AM, Norberto Meijome [EMAIL PROTECTED] wrote:
On Thu, 28 Aug 2008 02:01:05 -0700 (PDT)
sanraj25 [EMAIL PROTECTED] wrote:
I
title field?
Thanks,
- Jake
On Wed, Aug 27, 2008 at 7:41 PM, Walter Underwood
[EMAIL PROTECTED] wrote:
On 8/27/08 5:54 PM, Yonik Seeley [EMAIL PROTECTED] wrote:
That's really only one use case though... the other being to have a
single stored field that is analyzed multiple
How many documents do you have in your index? How many unique
queries per day, bot and human? What are your cache hit ratios?
Maybe you can increase the size of the caches and not worry about
it. Search engine position is important. Have marketing pay for
the extra memory (I'm not kidding).
color:red AND color:green
+color:red +color:green
Either one works.
wunder
On 9/9/08 3:47 PM, hernan [EMAIL PROTECTED] wrote:
Hey Solr users,
My schema defines a field like this:
<field name="color" type="string" indexed="true" required="true"
multiValued="true"/>
If I have a document indexed
Perhaps we need a syntax option on DisMax. At Netflix, we've modified it
to be pure text, with no operators. My current favorite unsearchable
name is this band:
(+/-)
wunder
On 9/11/08 7:32 AM, Smiley, David W. (DSMILEY) [EMAIL PROTECTED] wrote:
I have also wanted to use the very cool DisMax
A free text option would be really nice. When our users type
mission:impossible, they are not searching a field named mission.
wunder
On 9/11/08 4:39 PM, Chris Hostetter [EMAIL PROTECTED] wrote:
: I think the point is that Viaj would like to permit users to specify the
: field if they so
We need no field queries, never, no way. We don't want accidental
collisions between a new movie title and an existing fieldname that
requires an emergency software push to production.
Same thing for plus, minus, AND, OR, and NOT.
Our customers really, really don't do that. They are not native
It depends entirely on the needs of the project. For some things,
Solr is superior to Autonomy, for other things, not.
I used to work at Autonomy (and Verity and Inktomi and Infoseek),
and I chose Solr for Netflix. It is working great for us.
wunder
==
Walter Underwood
Former Ultraseek Architect
I would do the field visibility one layer up from the search engine.
That layer already knows about the user and can request the appropriate
fields. Or request them all (better HTTP caching) and only show the
appropriate ones.
As I understand your application, putting access control in Solr
Save the file to disk with a name ending in .xml, then open it in a
browser. The browser will show you a parse error, usually with the line
and column number.
You cannot ignore illegal characters. You must send legal XML.
Oddly, I answered this same question on the search_dev list yesterday.
This is probably not useful because synonyms work better at index time
than at query time. Reloading synonyms also requires reindexing all
the affected documents.
wunder
On 9/23/08 7:45 AM, Batzenmann [EMAIL PROTECTED] wrote:
Hi,
I'm quite new to solr and I'm looking for a way to extend
I replied to this exact same question yesterday from another Solr user.
Please check the mailing list archives.
http://www.nabble.com/Refresh-of-synonyms.txt-without-reload-to19629361.html
wunder
On 9/24/08 8:55 AM, Stephen Weiss [EMAIL PROTECTED] wrote:
Hi,
I'm running Solr 1.2, we are
More details on index-time vs. query-time synonyms are here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter
wunder
On 9/23/08 7:47 AM, Walter Underwood [EMAIL PROTECTED] wrote:
This is probably not useful because synonyms work better at index time
than at query
I process our HTTP logs. I'm sure there are log analyzers that
handle search terms, though I wrote a bit of Python to do it.
If you extract the search queries to a file, then use a Unix
pipe to get a list:
sort queries.txt | uniq -c | sort -rn > counted-queries.txt
wunder
On 9/25/08 12:29 AM,
First, define separate analyzer/filter chains for index and query.
Do not include synonyms in the query chain.
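In schema.xml, that separation looks like this (type name and the
exact filter stack are an example):

    <fieldType name="text_syn" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>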
Second, use a separate indexing system and use Solr index distribution
to sync the indexes to one or more query systems. This will create a new
Searcher and caches on the query systems,
This will cause the result counts to be wrong and the deleted docs
will stay in the search index forever.
Some approaches for incremental update:
* full sweep garbage collection: fetch every ID in the Solr DB and
check whether that exists in the source DB, then delete the ones
that don't exist.
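In outline (fetchAllSolrIds() and dbHasId() are hypothetical stand-ins
for your own plumbing):

    // Delete anything Solr has that the source database no longer has,
    // then commit once at the end of the sweep.
    for (String id : fetchAllSolrIds()) {
        if (!dbHasId(id)) {
            server.deleteById(id);
        }
    }
    server.commit();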
That should be: flag it in a boolean column. --wunder
On 9/25/08 11:51 AM, Walter Underwood [EMAIL PROTECTED] wrote:
This will cause the result counts to be wrong and the deleted docs
will stay in the search index forever.
Some approaches for incremental update:
* full sweep garbage
Make a view in your database and index that. No point in duplicating
database views in Solr. --wunder
On 9/27/08 2:47 PM, Britske [EMAIL PROTECTED] wrote:
Looking at the wiki, code of DataImportHandler and it looks impressive.
There's talk about ways to use Transformers to be able to create
Solr index distribution already does this with a slightly different
mechanism. It moves the files instead of the directory. I recommend
understanding and using the standard scripts for index distribution.
http://wiki.apache.org/solr/CollectionDistribution
wunder
On 9/29/08 9:55 PM, Otis
Synonyms are domain-specific, so general-purpose lists are not very useful.
Ultraseek shipped a British-American synonym list as an example, but even
that wasn't very general. One of our customers was a chemical company and
was very surprised when the search "rocket fuel" suggested "arugula",
even