How many synonym sets do you have? I'm using about 600 sets with
no problem. --wunder
On 11/19/07 8:23 PM, climbingrose [EMAIL PROTECTED] wrote:
Correction for last message: you need to modify or extend
SynonymFilterFactory instead of SynonymFilter. SynonymFilterFactory is
responsible for
1000 qps is a lot of load, at least 30M queries/day.
We are running dual CPU Power P5 machines and getting about 80 qps
with worst case response times of 5 seconds. 90% of responses are
under 70 msec.
Our expected peak load is 300 qps on our back-end Solr farm.
We execute multiple back-end
This can be useful, but it is limited. At Infoseek, we used this
for demoting porn and spam in the index in 1996, but replaced it
with more precise approaches.
wunder
On 11/22/07 6:49 AM, Ryan McKinley [EMAIL PROTECTED] wrote:
Jörg Kiegeland wrote:
Yes, SOLR-139 will eventually do what you
AM, Walter Underwood [EMAIL PROTECTED]
wrote:
OpenSearch was a pretty poor design and is dead now, so I wouldn't
expect any new implementations. Google's GData (based on Atom)
reuses the few useful OpenSearch elements needed for things
like number of hits. Solr's Atom support really should
implementers. Heck, Doug Cutting was there.
http://infolab.stanford.edu/~gravano/workshop_participants.html
wunder
On 11/26/07 6:28 PM, Ed Summers [EMAIL PROTECTED] wrote:
On Nov 26, 2007 5:35 PM, Walter Underwood [EMAIL PROTECTED] wrote:
GData is really pretty useful. OpenSearch was just
Dictionaries are surprisingly expensive to build and maintain and
bi-gram is surprisingly effective for Chinese. See this paper:
http://citeseer.ist.psu.edu/kwok97comparing.html
I expect that n-gram indexing would be less effective for Japanese
because it is an inflected language. Korean is
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Walter Underwood [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Tuesday, November 27, 2007 2:41:38 PM
Subject: Re: CJK Analyzers for Solr
Dictionaries are surprisingly expensive to build
Since they all use the same schema, can you add a client ID to each document
when it is indexed? Filter by clientid:4 and you get a subset of the
index.
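For example, with a hypothetical client id of 4:

    http://localhost:8983/solr/select?q=widgets&fq=clientid:4

Putting the clause in fq instead of q keeps it out of relevance scoring
and lets the filterCache reuse it across queries.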
wunder
On 12/11/07 1:01 PM, Owens, Martin [EMAIL PROTECTED] wrote:
Hello everyone,
The system we're moving from (dtSearch) allows each of
Fetch your 70,000 results in 70 chunks of 1000 results. Parse each chunk
and add it to your internal list.
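A rough SolrJ sketch of that loop (URL and query are placeholders;
SolrJ of that era used CommonsHttpSolrServer):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class ChunkedFetch {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery("*:*");   // placeholder query
            query.setRows(1000);                      // chunk size
            List<SolrDocument> all = new ArrayList<SolrDocument>();
            for (int start = 0; start < 70000; start += 1000) {
                query.setStart(start);
                QueryResponse rsp = solr.query(query);
                all.addAll(rsp.getResults());         // parse and accumulate
            }
        }
    }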
If you are allowed to parse Python results, why can't you use a different
XML parser?
What sort of more work are you doing? I've implemented lots of stuff
on top of a paged model, including
That is not a very useful load test, since it doesn't match what
you'll see in production. About half our requests are served
from cache. Cache hits are all CPU, cache misses are heavy
on IO. Testing with all cache misses will underestimate CPU
by a huge amount.
It is very hard to simulate a
I recommend the opencsv library for Java or the csv package for Python.
Either one can write legal CSV files.
There are lots of corner cases in CSV and some differences between
applications, like whether newlines are allowed inside a quoted field.
It is best to use a library for this instead of
Yes, they are reputable. They've been doing consulting with Verity,
Ultraseek, and other platforms for many years. --wunder
On 1/12/08 1:22 AM, Chris Hostetter [EMAIL PROTECTED] wrote:
It is pretty cool to see a reputable
Search company (is ideaeng.com a reputable search consulting company?
This error means that the JVM has run out of heap space. Increase the
heap space. That is an option on the java command. I set my heap to
600 MB and do it this way with Tomcat 6:
JAVA_OPTS=-Xmx600M tomcat/bin/startup.sh
wunder
On 1/16/08 8:33 AM, David Thibault [EMAIL PROTECTED] wrote:
Solr filters already provide a restricted view of results, so the
code that calls Solr can choose the appropriate handler for each
class of users. Make sure that end users cannot directly access the
Solr server, or at least not the search URL (/solr/select).
Building authentication and
How often does the index change? Can you use an HTTP cache and do this
once for each new index?
wunder
On 1/31/08 9:09 AM, Andy Blower [EMAIL PROTECTED] wrote:
Actually I do need all facets for a field, although I've just realised that
the tests are limited to only 100. Ooops. So it should
Our users can blow up the parser without special characters.
AND THE BAND PLAYED ON
TO HAVE AND HAVE NOT
Lower-casing in the front end avoids that.
We have auto-complete on titles, so there are plenty
of chances to inadvertently use special characters:
Romeo + Juliet
Airplane!
How about the query parser respecting backslash escaping? I need
free-text input, no syntax at all. Right now, I'm escaping every
Lucene special character in the front end. I just figured out that
it breaks for colon; I can't search for 12:01 with 12\:01.
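For reference, that front-end escaping amounts to something like this
(a sketch, not the actual code):

    // Backslash-escape Lucene query syntax characters. As noted above,
    // the parser does not honor the escape for every one of them
    // (colon is the problem case here).
    static String escapeLucene(String q) {
        StringBuilder sb = new StringBuilder(q.length());
        for (char c : q.toCharArray()) {
            if ("+-!(){}[]^\"~*?:\\&|".indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }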
wunder
On 2/7/08 11:06 AM, Chris Hostetter
We have a movie with this title: 6'2
I can get that string indexed, but I can't get it through the query
parser and into DisMax. It goes through the analyzers fine. I can
run the analysis tool in the admin interface and get a match with
that exact string.
These variants don't work:
6'2
6'2\
On 2/11/08 8:42 PM, Chris Hostetter [EMAIL PROTECTED] wrote:
if you want to worry about smart load balancing, try to load balance based
on the nature of the URL query string ... make you load balancer pick
a slave by hashing on the q param for example.
This is very effective. We used this at
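A minimal sketch of that routing (slave list and hash choice are
illustrative):

    // Identical q strings always land on the same slave, so its
    // query caches see far more repeats.
    static String pickSlave(String q, String[] slaves) {
        return slaves[(q.hashCode() & 0x7fffffff) % slaves.length];
    }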
On 2/12/08 7:40 AM, Ken Krugler [EMAIL PROTECTED] wrote:
In general immediate updating of an index with a continuous stream of
new content, and fast search results, work in opposition. The
searcher's various caches are getting continuously flushed to avoid
stale content, which can easily kill
That does seem really slow. Is the index on NFS-mounted storage?
wunder
On 2/12/08 7:04 AM, Erick Erickson [EMAIL PROTECTED] wrote:
Well, the *first* sort to the underlying Lucene engine is expensive since
it builds up the terms to sort. I wonder if you're closing and opening the
underlying
Python marshal format is worth a try. It is binary and can represent
the same data as JSON. It should be a good fit to Solr.
We benchmarked that against XML several years ago and it was 2X faster.
Of course, XML parsers are a lot faster now.
wunder
On 2/21/08 10:50 AM, Grant Ingersoll [EMAIL
I saw a 100X slowdown running with indexes on NFS.
I don't understand going through a lot of effort with unsupported
configurations just to share an index. Local disk is cheap, the
snapshot stuff works well, and local discs avoid a single point
of failure.
The testing time to make a shared index
is not done successfully, so I need to do something manually.
If you have only one index, there is a risk of messing up the index.
Thanks,
Jae
-Original Message-
From: Walter Underwood [mailto:[EMAIL PROTECTED]
Sent: Tue 2/26/2008 1:27 PM
To: solr-user@lucene.apache.org
Have you timed how long it takes to copy the index files? Optimizing
can never be faster than that, since it must read every byte and write
a whole new set. Disc speed may be your bottleneck.
You could also look at disc access rates in a monitoring tool.
Is there read contention between the
, and that optimise time is going to be at
least O(n)
James
On 28 Feb 2008, at 09:07, Walter Underwood wrote:
Have you timed how long it takes to copy the index files? Optimizing
can never be faster than that, since it must read every byte and write
a whole new set. Disc speed may be your bottleneck
Please answer with the size of your index (post-optimize) and how long
an optimize takes. I'll collect the data and see if I can draw a line
through it.
190 MB, 55 seconds
$ du -sk /apps/wss/solr_home/data/index
191592 /apps/wss/solr_home/data/index
$ grep commit
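(Back of the envelope: that optimize read about 190 MB and wrote about
190 MB in 55 seconds, roughly 7 MB/s of combined I/O, well under raw
sequential disc speed, so the time probably went to merge work rather
than pure copying.)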
You have no cache at all when you stop and restart Solr. I recommend
using the provided scripts for index distribution. Run snappuller
and snapinstaller every two hours.
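A typical way to schedule that (paths are examples; the scripts take
their settings from conf/scripts.conf):

    # crontab on each query slave: new snapshot every two hours
    0 */2 * * * /apps/solr/bin/snappuller && /apps/solr/bin/snapinstaller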
The scripts already do the right thing. A snapshot is created after
a commit on the indexer. Snappuller only copies over an
Good point. My numbers are from a full rebuild. Let's collect maximum
times, to keep it simple. --wunder
On 2/28/08 7:28 PM, Alex Benjamen [EMAIL PROTECTED] wrote:
It mostly depends on whether the index is completely new or incremental:
4Gb, 28MM docs, ~30min (new index)
4Gb, 28MM
28, 2008, at 1:15 PM, Walter Underwood wrote:
Please answer with the size of your index (post-optimize) and how long
an optimize takes. I'll collect the data and see if I can draw a line
through it.
190 MB, 55 seconds
$ du -sk /apps/wss/solr_home/data/index
191592 /apps/wss/solr_home
In solrconfig.xml, configure a listener for postOptimize but not for
postCommit. That listener runs snapshooter. You will only create
snapshots after an optimize. That's what I do.
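In solrconfig.xml that looks like this (dir depends on where the
scripts live in your install):

    <listener event="postOptimize" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">solr/bin</str>
      <bool name="wait">true</bool>
    </listener>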
wunder
On 2/29/08 11:38 AM, Alex Benjamen [EMAIL PROTECTED] wrote:
OK, I'll give it a shot... Couple of issues I
Section 2.2 of the XML spec. Three characters from the 0x00-0x19 block
are allowed: 0x09, 0x0A, 0x0D.
Annotated version: http://www.xml.com/axml/testaxml.htm
Section 2.2 in current official spec: http://www.w3.org/TR/REC-xml/#charsets
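If your source data can contain the disallowed control characters,
strip them before building the XML. A sketch in Java:

    // Keep only code points legal in XML 1.0: 0x09, 0x0A, 0x0D,
    // 0x20-0xD7FF, 0xE000-0xFFFD, and 0x10000-0x10FFFF.
    static String stripIllegalXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            boolean ok = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (ok) sb.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        return sb.toString();
    }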
wunder
On 3/2/08 6:44 AM, Brian Whitman [EMAIL PROTECTED]
Ultraseek has recent and relevant as an option. We used the document age
in days (now - document_date) and took the log of that. You need to adjust
the boost to have the desired amount of influence.
The most conservative approach is to use it as a tiebreaker, so that
you can distinguish between
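A hypothetical boost in that spirit (the constants are invented; tune
them so recency only breaks ties, per the advice above):

    // Log-of-age demotion: a week-old and a month-old document differ
    // much less than a day-old and a week-old one.
    static double recencyBoost(long docDateMillis) {
        double ageDays =
            Math.max(0L, System.currentTimeMillis() - docDateMillis) / 86400000.0;
        return 1.0 / (1.0 + 0.1 * Math.log(1.0 + ageDays));
    }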
Generally, the accented version will have a higher IDF, so it
will score higher.
wunder
On 3/11/08 8:44 AM, Renaud Waldura [EMAIL PROTECTED]
wrote:
Peter:
Very interesting. To take care of the issue you mention, could you add
multiple synonyms with progressively less accents?
E.g. you'd
Golly, let me think. I can use the out-of-the-box, tested Solr
stuff for syncing indexes or I can invent some command line kludge
that does the same thing, except I will need to write it and test
it myself. Which one is easier?
Seriously, the existing Solr index distribution is great stuff.
I
Getting 10,000 records will be slow.
What are you doing with 10,000 records?
wunder
On 3/19/08 10:07 PM, 李银松 [EMAIL PROTECTED] wrote:
I want to get the top 1-10010 records from two different servers, so I have
to get the top 10010 scores from each server and merge them to get
the results.
The data to transport is about 500k (1 docs' scores)
and the QTime is about 100ms, but the total time I used is about 10+
seconds. I want to know whether it really costs that much time or whether
something else is wrong.
2008/3/20, Walter Underwood [EMAIL PROTECTED]:
Getting 10,000
records will be slow.
What
the same
language. We didn't do it in Ultraseek because it would have been an
incompatible index change and the benefit didn't justify that.
wunder
==
Walter Underwood
Former Ultraseek Architect
Current Entire Netflix Search Department
On 3/20/08 9:45 AM, Benson Margulies [EMAIL PROTECTED] wrote:
Token
We do a similar thing with a no stopword, no stemming field.
There are a surprising number of movie titles that are entirely
stopwords. Being There was the first one I noticed, but
To be and to have wins the prize for being all-stopwords
in two languages.
See my list, here:
We use two fields, one with and one without stopwords. The exact
field has a higher boost than the other. That works pretty well.
It helps to have an automated relevance test when tuning the boost
(and other things). I extracted queries and clicks from the logs
for a couple of months. Not
I've started implementing something to use fuzzy queries for selected fields
in dismax. The request handler spec looks like this:
exact~0.7^4.0 stemmed^2.0
If anyone has already done this, I'd be glad to use it.
I'm working with an older version of Solr, so I won't have a 1.2 patch
right
In order to do that I have to change to a 64-bit OS so I can have more than
4 GB of RAM. Is there any way to see how long it takes Solr to warm up
the searcher?
On Wed, Apr 16, 2008 at 11:40 AM, Walter Underwood [EMAIL PROTECTED]
wrote:
A commit every two minutes means that the Solr caches
memory for the
index? I was under the impression that Solr did not support a RAMIndex.
Walter Underwood wrote:
Do it. 32-bit OS's went out of style five years ago in server-land.
I would start with 8GB of RAM. 4GB for your index, 2 for Solr, 1 for
the OS and 1 for other processes
It should help to weight the terms with their frequency in the
original document. That will distinguish between two documents
with the same terms, but different focus.
wunder
On 4/22/08 7:46 AM, Erik Hatcher [EMAIL PROTECTED] wrote:
No, the MLT feature does not have that kind of field-specific
DisMax preserves a fair amount of syntax. It isn't a pure text
query.
We have a small client library (written before solrj) that
escapes all the stuff that Solr doesn't. If you are already
lowercasing queries, then you can fix AND, OR, and NOT by
replacing them with their lowercase equivalents.
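That replacement is a one-liner per operator, something like:

    // Turn operator tokens into plain words so AND/OR/NOT lose
    // their special meaning in the query parser.
    static String neutralizeOperators(String q) {
        return q.replaceAll("\\bAND\\b", "and")
                .replaceAll("\\bOR\\b", "or")
                .replaceAll("\\bNOT\\b", "not");
    }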
Status pages should be sent with Pragma: no-cache. That is a bug.
wunder
On 4/24/08 6:29 PM, Erik Hatcher [EMAIL PROTECTED] wrote:
The issue is the HTTP caching feature of Solr, for better or worse in
this case. It confuses me often when I hit this myself. Try hitting
that URL with curl
In our setup, snapshooter is triggered on optimize, not commit.
We can commit all we want on the master without making a
snapshot. That only happens when we optimize.
The new Searcher is the biggest performance impact for us.
We don't have that many documents (~250K), so copying an
entire index
Custom trickery is pretty standard for access controls in search.
A couple of the high points from deploying Ultraseek: three incompatible
single sign-on systems in one company, and a system that controlled
which links were shown instead of access to the docs themselves.
The latter amazed me. If
On 4/28/08 10:20 AM, Chris Hostetter [EMAIL PROTECTED] wrote:
the recursive mapping was something i put in the DismaxQueryParser because
it was easy. The param syntax of the DismaxRequestHandler has never
supported it, but it's possible someone out there has a subclass that
takes advantage
I've been doing it with synonyms and I have several hundred of them.
Concatenating bi-word groups is pretty useful for English. We have a
habit of gluing words together: "database" used to be two words, and
dictionaries still think "webserver" should be "web server".
wunder
On 5/1/08 10:47 AM, Geoffrey Young
ghost world => ghost world, ghostworld
ghostbusters => ghostbusters, ghost busters
I don't see as many in personal names. Mostly, things like De Niro
and DiCaprio.
wunder
On 5/1/08 11:13 AM, Geoffrey Young [EMAIL PROTECTED] wrote:
Walter Underwood wrote:
I've been doing it with synonyms and I have
I wrote a prefix map (ternary search tree) in Java and load it with
queries to Solr every two hours. That keeps the autocomplete and
search index in sync.
Our autocomplete gets over 25M hits per day, so we don't really
want to send all that traffic to Solr.
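Not the actual ternary search tree, but a simpler in-memory stand-in
that shows the idea of answering prefixes without touching Solr:

    import java.util.SortedMap;
    import java.util.TreeMap;

    public class Autocomplete {
        // title -> popularity, reloaded from Solr queries every two hours
        private final TreeMap<String, Integer> titles =
            new TreeMap<String, Integer>();

        public SortedMap<String, Integer> complete(String prefix) {
            // every key that starts with the prefix, in sorted order
            return titles.subMap(prefix, prefix + '\uffff');
        }
    }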
wunder
On 5/6/08 2:37 AM, Nishant
to match the max cached
request in our middle tier HTTP server. We have over twenty front
end webapps and five back end Solr servers.
wunder
On 5/6/08 9:50 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote:
Hi Wunder,
- Original Message
From: Walter Underwood [EMAIL PROTECTED]
To: solr
And use a log of real queries, captured from your website or one
like it. Query statistics are not uniform.
wunder
On 5/9/08 6:20 AM, Erick Erickson [EMAIL PROTECTED] wrote:
This still isn't very helpful. How big are the docs? How many fields do you
expect to index? What is your expected
ASAP means As Soon As Possible, not As Soon As Convenient.
Please don't say that if you don't mean it. --wunder
On 5/12/08 6:48 AM, Ricky [EMAIL PROTECTED] wrote:
Hi Mike,
Thanx for your reply. I have got the answer to the question posted.
I know people are donating time here. ASAP doesn't
There is one huge advantage of talking to Solr with SolrJ (or any
other client that uses the REST API), and that is that you can
put an HTTP cache between that and Solr. We get a 75% hit rate
on that cache. SOAP is not cacheable in any useful sense.
I designed and implemented the SOAP interface
We have some useful single character terms in the rating field,
like G and R, alongside PG and others.
wunder
On 5/12/08 1:33 PM, Yonik Seeley [EMAIL PROTECTED] wrote:
On Mon, May 12, 2008 at 4:13 PM, Naomi Dushay [EMAIL PROTECTED] wrote:
So I'm now asking: why would SOLR want single
Try creating a separate field that does not remove stopwords,
populating that with copyfield and configuring the phrase
queries to go against that field instead.
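In schema.xml terms (field and type names made up):

    <field name="title_exact" type="text_nostop" indexed="true" stored="false"/>
    <copyField source="title" dest="title_exact"/>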
I do something similar. For both regular and phrase queries,
we have a stemmed and stopped field and another field with
neither. The
N-gram works pretty well for Chinese, there are even studies to
back that up.
Do not use the N-gram matches for highlighting. They look really
stupid to native speakers.
wunder
On 5/14/08 2:03 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote:
There are no free morphological analyzers for Chinese
I've worked with the Basis products. Solid, good support.
Last time I talked to them, they were working on hooking
them into Lucene.
For really good quality results from any of these, you need
to add terms to the user dictionary of the segmenter. These
may be local jargon, product names, personal
The people working on Lucene are pretty smart, and this sort of
query optimization is a well-known trick, so I would not worry
about it.
A dozen years ago at Infoseek, we checked the count of matches
for each term in an AND, and evaluated the smallest one first.
If any of them had zero matches,
Do you need all the results? I have never seen a search UI that showed
all results at once.
Fetching all the results will be slow. Most sites fetch just the
results needed to display one page.
wunder
On 6/5/08 12:46 AM, khirb7 [EMAIL PROTECTED] wrote:
hello everybody,
I want to improve
I recommend using the OpenCSV package. Works fine, Apache 2.0 license.
http://opencsv.sourceforge.net/
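A minimal writing example with that library (file name invented; the
package was au.com.bytecode.opencsv in the versions of that era):

    import java.io.FileWriter;
    import au.com.bytecode.opencsv.CSVWriter;

    public class CsvDump {
        public static void main(String[] args) throws Exception {
            CSVWriter writer = new CSVWriter(new FileWriter("results.csv"));
            writer.writeNext(new String[] { "id", "title" });
            writer.writeNext(new String[] { "1", "To Have and Have Not" });
            writer.close();  // quoting and embedded newlines handled for you
        }
    }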
wunder
On 6/11/08 10:00 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote:
Hi Marshall,
I don't think there is a CSV Writer, but here are some pointers for writing
one:
$ ff \*Writer\*java
We use it out of the box. Our extensions are new filters or new
request handlers, all configured through the XML files.
wunder
On 6/13/08 11:15 AM, Chris Hostetter [EMAIL PROTECTED] wrote:
The Solr Developers would like some feedback from the user community
regarding some changes that have
The spider was given an admin login so it could access all
content. Reasonable decision if the pages had been designed well.
Even with a confirmation, never delete with a GET. Use POST.
If the spider ever discovers the URL that the confirmation
uses, it will still delete the content.
Luckily,
Send multiple deletes, with a commit after the last one. --wunder
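Concretely, that is a series of update messages with the commit last
(ids are examples):

    <delete><id>101</id></delete>
    <delete><id>102</id></delete>
    <delete><id>103</id></delete>
    <commit/>

If the ids share something filterable, the query form
<delete><query>...</query></delete> avoids sending thousands of ids.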
On 7/4/08 8:40 AM, Jonathan Ariel [EMAIL PROTECTED] wrote:
yeah I know. the problem with a query is that there is a maximum number of
query terms that I can add, which is reasonable. The problem is that I have
thousands of Ids.
be sufficient.
-Mike
On 4-Jul-08, at 9:06 AM, Jonathan Ariel wrote:
Yes, I just wanted to avoid N requests and do just 2.
On Fri, Jul 4, 2008 at 12:48 PM, Walter Underwood [EMAIL PROTECTED]
wrote:
Send multiple deletes, with a commit after the last one. --wunder
On 7/4/08 8:40 AM
Why do you want random hits? If we know more about the bigger
problem, we can probably make better suggestions.
Fundamentally, Lucene is designed to quickly return the best
hits for a query. Returning random hits from the entire
matched set is likely to be very slow. It just isn't what
Lucene is
starting at a given random number .. would that work? Sounds a bit
kludgy to me even as I say it.
Sean
--
From: Walter Underwood [EMAIL PROTECTED]
Sent: Monday, July 07, 2008 5:06 PM
To: solr-user@lucene.apache.org
Subject: Re
For capacity planning, our autocomplete gets more than 10X as many
requests as our search. Solr can handle our search just fine, but
I wrote an in-memory prefix match to handle the 25-30M autocomplete
matches each day. I load that by doing Solr queries, so the two
stay in sync.
wunder
On 7/9/08
On 7/12/08 7:00 PM, Chris Harris [EMAIL PROTECTED] wrote:
Mike, your idea of indexing bigrams is also interesting. Do you know
if any text search platforms do this behind the scenes as their
default way of handling phrase queries?
Infoseek indexed biwords with their Ultra engine, which lives
You might be able to split the ranking into a common score and
a dynamic score. Return the results nearly the right order, then
do a minimal reordering after. If you plan to move a result by
a maximum of five positions, then you could fetch 15 results to
show 10 results. That is far, far cheaper
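A sketch of that two-stage ranking (Result, its score fields, and
solrTop() are hypothetical stand-ins):

    List<Result> firstPage(String query) {
        List<Result> candidates = solrTop(query, 15);  // static ranking
        Collections.sort(candidates, new Comparator<Result>() {
            public int compare(Result a, Result b) {
                return Double.compare(b.staticScore + b.dynamicScore,
                                      a.staticScore + a.dynamicScore);
            }
        });
        return candidates.subList(0, 10);  // one page, locally reordered
    }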
Try putting them all in one index. Your fields can be s1_name for
schema 1, s2_name for schema 2, and so on.
The only reason to have separate indexes is if each group of
content has a different update schedule and if you have high
traffic (over 1M queries/day).
wunder
On 8/8/08 8:19 AM,
I meant update frequency more than schedule. If one group of content
is updated once per day and another every ten minutes, and most of
the traffic is going to the slow collection, splitting them could help.
wunder
On 8/8/08 8:25 AM, Walter Underwood [EMAIL PROTECTED] wrote:
Try putting
Stripping accents doesn't quite work. The correct translation
is language-dependent. In German, o-dieresis should turn into
oe, but in English, it should be o (as in coöperate or
Mötley Crüe). In Swedish, it should not be converted at all.
There are other character-to-string conversions:
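For example (illustrative, not a complete table):

    // German folds ö to oe; a Swedish table would leave it alone.
    Map<Character, String> germanFolds = new HashMap<Character, String>();
    germanFolds.put('ö', "oe");
    germanFolds.put('ü', "ue");
    germanFolds.put('ä', "ae");
    germanFolds.put('ß', "ss");  // a true character-to-string conversion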
This is fairly high on our to-do list. I'm inclined to index the
bi-words at the same position as the first word, like synonyms.
wunder
On 8/13/08 2:27 PM, Brendan Grainger [EMAIL PROTECTED] wrote:
Hi Ryan,
We do basically the same thing, using a modified ShingleFilter
I hate to blame the JDK, but we tried 1.6 for our production
webapp and it was crashing too often. Unless you need 1.6,
you might try 1.5. --wunder
On 8/16/08 1:54 PM, Chris Harris [EMAIL PROTECTED] wrote:
On Sat, Aug 16, 2008 at 4:33 AM, Grant Ingersoll [EMAIL PROTECTED] wrote:
What version
I would do it in the client, even if it meant parsing the query,
modifying it, then unparsing it.
This is exactly like changing To: to Zu: in a mail header.
Show that in the client, but make it standard before it goes
onto the network.
If queries at the Solr/Lucene level are standard, then users
Also, + in a URL parameter turns into a space. The URL for this query:
+field:Jake
should look like this:
?q=%2Bfield%3AJake
The admin UI takes care of that for you.
wunder
On 8/21/08 5:53 PM, Erik Hatcher [EMAIL PROTECTED] wrote:
On Aug 21, 2008, at 7:33 PM, Jake Conk wrote:
I'm
On 8/27/08 5:54 PM, Yonik Seeley [EMAIL PROTECTED] wrote:
That's really only one use case though... the other being to have a
single stored field that is analyzed multiple different ways.
We are the other use case. We take a title and put it in three
fields: one merely lowercased, one stemmed
You don't need two schemas. Have a "type" field with values
job_post and job_profile, then filter based on type:job_post
and type:job_profile.
wunder
On 8/28/08 4:57 AM, Norberto Meijome [EMAIL PROTECTED] wrote:
On Thu, 28 Aug 2008 02:01:05 -0700 (PDT)
sanraj25 [EMAIL PROTECTED] wrote:
I
title field?
Thanks,
- Jake
On Wed, Aug 27, 2008 at 7:41 PM, Walter Underwood
[EMAIL PROTECTED] wrote:
On 8/27/08 5:54 PM, Yonik Seeley [EMAIL PROTECTED] wrote:
That's really only one use case though... the other being to have a
single stored field that is analyzed multiple
How many documents do you have in your index? How many unique
queries per day, bot and human? What are your cache hit ratios?
Maybe you can increase the size of the caches and not worry about
it. Search engine position is important. Have marketing pay for
the extra memory (I'm not kidding).
color:red AND color:green
+color:red +color:green
Either one works.
wunder
On 9/9/08 3:47 PM, hernan [EMAIL PROTECTED] wrote:
Hey Solr users,
My schema defines a field like this:
<field name="color" type="string" indexed="true" required="true"
multiValued="true"/>
If I have a document indexed
Perhaps we need a syntax option on DisMax. At Netflix, we've modified it
to be pure text, with no operators. My current favorite unsearchable
name is this band:
(+/-)
wunder
On 9/11/08 7:32 AM, Smiley, David W. (DSMILEY) [EMAIL PROTECTED] wrote:
I have also wanted to use the very cool DisMax
A free text option would be really nice. When our users type
mission:impossible, they are not searching a field named mission.
wunder
On 9/11/08 4:39 PM, Chris Hostetter [EMAIL PROTECTED] wrote:
: I think the point is that Viaj would like to permit users to specify the
: field if they so
We need no field queries, never, no way. We don't want accidental
collisions between a new movie title and an existing fieldname that
requires an emergency software push to production.
Same thing for plus, minus, AND, OR, and NOT.
Our customers really, really don't do that. They are not native
It depends entirely on the needs of the project. For some things,
Solr is superior to Autonomy, for other things, not.
I used to work at Autonomy (and Verity and Inktomi and Infoseek),
and I chose Solr for Netflix. It is working great for us.
wunder
==
Walter Underwood
Former Ultraseek Architect
I would do the field visibility one layer up from the search engine.
That layer already knows about the user and can request the appropriate
fields. Or request them all (better HTTP caching) and only show the
appropriate ones.
As I understand your application, putting access control in Solr
Save the file to disk with a name ending in .xml, then open it in a
browser. The browser will show you a parse error, usually with the line
and column number.
You cannot ignore illegal characters. You must send legal XML.
Oddly, I answered this same question on the search_dev list yesterday.
This is probably not useful because synonyms work better at index time
than at query time. Reloading synonyms also requires reindexing all
the affected documents.
wunder
On 9/23/08 7:45 AM, Batzenmann [EMAIL PROTECTED] wrote:
Hi,
I'm quite new to solr and I'm looking for a way to extend
I replied to this exact same question yesterday from another Solr user.
Please check the mailing list archives.
http://www.nabble.com/Refresh-of-synonyms.txt-without-reload-to19629361.html
wunder
On 9/24/08 8:55 AM, Stephen Weiss [EMAIL PROTECTED] wrote:
Hi,
I'm running Solr 1.2, we are
More details on index-time vs. query-time synonyms are here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter
wunder
On 9/23/08 7:47 AM, Walter Underwood [EMAIL PROTECTED] wrote:
This is probably not useful because synonyms work better at index time
than at query
I process our HTTP logs. I'm sure there are log analyzers that
handle search terms, though I wrote a bit of Python to do it.
If you extract the search queries to a file, then use a Unix
pipe to get a list:
sort queries.txt | uniq -c | sort -rn > counted-queries.txt
wunder
On 9/25/08 12:29 AM,
First, define separate analyzer/filter chains for index and query.
Do not include synonyms in the query chain.
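In schema.xml, that separation looks like this (type name and the
exact filter stack are an example):

    <fieldType name="text_syn" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>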
Second, use a separate indexing system and use Solr index distribution
to sync the indexes to one or more query systems. This will create a new
Searcher and caches on the query systems,
This will cause the result counts to be wrong and the deleted docs
will stay in the search index forever.
Some approaches for incremental update:
* full sweep garbage collection: fetch every ID in the Solr DB and
check whether that exists in the source DB, then delete the ones
that don't exist.
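In outline (fetchAllSolrIds() and dbHasId() are hypothetical stand-ins
for your own plumbing):

    // Delete anything Solr has that the source database no longer has,
    // then commit once at the end of the sweep.
    for (String id : fetchAllSolrIds()) {
        if (!dbHasId(id)) {
            server.deleteById(id);
        }
    }
    server.commit();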
That should be: flag it in a boolean column. --wunder
On 9/25/08 11:51 AM, Walter Underwood [EMAIL PROTECTED] wrote:
This will cause the result counts to be wrong and the deleted docs
will stay in the search index forever.
Some approaches for incremental update:
* full sweep garbage
Make a view in your database and index that. No point in duplicating
database views in Solr. --wunder
On 9/27/08 2:47 PM, Britske [EMAIL PROTECTED] wrote:
Looking at the wiki, code of DataImportHandler and it looks impressive.
There's talk about ways to use Transformers to be able to create
Solr index distribution already does this with a slightly different
mechanism. It moves the files instead of the directory. I recommend
understanding and using the standard scripts for index distribution.
http://wiki.apache.org/solr/CollectionDistribution
wunder
On 9/29/08 9:55 PM, Otis
Synonyms are domain-specific, so general-purpose lists are not very useful.
Ultraseek shipped a British-American synonym list as an example, but even
that wasn't very general. One of our customers was a chemical company and
was very surprised when the search "rocket fuel" suggested "arugula",
even