Re: How do I best store and retrieve ISO country codes?
On 8/24/07, Simon Peter Nicholls [EMAIL PROTECTED] wrote: I've just noticed that for ISO 2-character country codes such as BE and IT, my queries are not working as expected. The field is being stored as country_t, dynamically from acts_as_solr v0.9, as follows (from schema.xml): <dynamicField name="*_t" type="text" indexed="true" stored="false"/> The thing that sprang to my mind was that BE and IT are also valid words, and perhaps Solr is doing something I'm not expecting (ignoring them, which would make sense mid-text). With this in mind, perhaps an _s type of field is needed, since it is indeed a single important string rather than text composed of many strings.

Right, the "text" type by default in Solr has stopword removal and stemmers (see the fieldType definition in schema.xml). A "string" field would give you exact values with no analysis at all. If you want to lowercase (for case-insensitive matches), start off with a text field and configure it with a keyword tokenizer followed by a lowercase filter. If it can have multiple words, an analyzer with a whitespace tokenizer followed by a lowercase filter would fit the bill. -Yonik
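A minimal schema.xml sketch of such a lowercasing single-token field type, using the stock KeywordTokenizerFactory and LowerCaseFilterFactory (the "string_lc" name and "*_lc" pattern are illustrative, not something acts_as_solr provides):

  <fieldType name="string_lc" class="solr.TextField">
    <analyzer>
      <!-- keep the entire field value as a single token -->
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <!-- lowercase it, so BE matches be -->
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <dynamicField name="*_lc" type="string_lc" indexed="true" stored="false"/>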
Re: sort problem
On 9/2/07, michael ravits [EMAIL PROTECTED] wrote: this is the field definition: <field name="msgid" type="slong" indexed="true" stored="true" required="true"/> It holds message IDs; values range from 0 to 127132531. Can I disable this cache?

No, sorting wouldn't work without it. The cache structure certainly isn't optimal for this (every doc probably has a different value). If you could live with a cap of 2B on message id, switching to type "int" would decrease the memory usage to 4 bytes per doc (presumably you don't need range queries?) -Yonik
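A sketch of the corresponding schema change, assuming the example schema's plain "integer" type (solr.IntField), which sorts via the FieldCache but does not order range queries numerically:

  <field name="msgid" type="integer" indexed="true" stored="true" required="true"/>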
Re: sort problem
On 9/3/07, Marcus Stratmann [EMAIL PROTECTED] wrote: If you could live with a cap of 2B on message id, switching to type int would decrease the memory usage to 4 bytes per doc (presumably you don't need range queries?) I haven't found exact definitions of the fieldTypes anywhere. Does integer span the common range from -2^31 to 2^31-1? And there seems to be no unsigned int, am I right?

Right, these map to Java native types, so it's signed. -Yonik
Re: -field:[* TO *] doesn't seem to work
Can you provide the full query response (with debugging output)? -Yonik

On 9/3/07, Jérôme Etévé [EMAIL PROTECTED] wrote: Hi all, I've got a problem here with the '-field:[* TO *]' syntax. It doesn't seem to work as expected.
Re: Multiple Values -Structured?
You could index both a compound field and the components separately. This could be simplified by sending the value in once in the compound format:

  review,1 Jan 2007
  revision,2 Jan 2007

and then using a copyField with a regex tokenizer to extract and index the date into a separate field. You could index the type separately via the same mechanism. -Yonik

On 9/3/07, Bharani [EMAIL PROTECTED] wrote: Hi, I have got two sets of documents: 1) the primary document, and 2) occurrences of the primary document. Since there is no such thing as a join, I can either a) post the primary document with occurrences as a multi-valued field, or b) post the primary document once for every occurrence, i.e. the classic de-normalized route. My problem with option a): this works great as long as the occurrence is a single field, but if a group of fields describes the occurrence then the search returns wrong results because of the nature of text search, i.e.: <date>1 Jan 2007</date> <type>review</type> <date>2 Jan 2007</date> <type>revision</type> If I search for "2 Jan 2007" and <date>1 Jan 2007</date> I will get a hit (which is wrong) because there is no grouping of fields to associate date and type as one unit. If I merge them as one entity then I can't use range queries for the date. My problem with option b): this would result in a large number of documents, and even if I index only and don't store, I still have to deal with duplicate hits - because all I want is the primary document. Is there a better approach to the problem? Thanks, Bharani
Re: solr.py problems with german Umlaute
On 9/6/07, Brian Carmalt [EMAIL PROTECTED] wrote: Try it with title.encode('utf-8'), as in: kw = {'id':'12', 'title':title.encode('utf-8'), 'system':'plone', 'url':'http://www.google.de'}

It seems like the client library should be responsible for encoding, not the user. So try changing title = 'Übersicht' into a unicode string via title = u'Übersicht', and that should hopefully get your test program working. If it doesn't, it's probably a solr.py bug and should be fixed there. -Yonik
Re: Replication broken.. no helpful errors?
On 9/6/07, Matthew Runo [EMAIL PROTECTED] wrote: The thing is that a new searcher is not opened, if I look in the stats.jsp page. The index version never changes.

The index version is read from the index... hence if the Lucene index doesn't change (even if a new snapshot was taken), the version won't change even if a new searcher was opened. Is the problem on the master side now, since it looks like the slave is pulling a temp-snapshot? -Yonik
Re: searching where a value is not null?
On 9/6/07, David Whalen [EMAIL PROTECTED] wrote: Hi all. I'm trying to construct a query that in pseudo-code would read like this: field != ''. I'm finding it difficult to write this as a solr query, though. Stuff like NOT field:() doesn't seem to do the trick. Any ideas?

Perhaps field:[* TO *] -Yonik
Re: caching query result
On 9/6/07, Jae Joo [EMAIL PROTECTED] wrote: I have 13 million documents and have facets by state (50). If there is a mechanism to cache, I may get faster results back.

How fast are you getting results back with standard field faceting (facet.field=state)?
Re: FW: Minor mistake on the Wiki
On 9/7/07, Lance Norskog [EMAIL PROTECTED] wrote: In the page http://wiki.apache.org/solr/UpdateXmlMessages we find: Optional attributes on doc * boost = float - default is 1.0 (See Lucene docs for definition of boost.) * NOTE: make sure norms are enabled (omitNorms="false" in the schema.xml) for any fields where the index-time boost should be stored. This NOTE appears to be block-copied from the following entry about field-level boosts, and makes no sense here.

Perhaps it could be worded better, but there is some sense behind it. There is no document boost in a Lucene index... a doc boost is simply multiplied into the boost for each field as the document is indexed. -Yonik
Re: adding without overriding dups - DirectUpdateHandler2.java does not implement?
On 9/7/07, Lance Norskog [EMAIL PROTECTED] wrote: It appears that DirectUpdateHandler2.java does not actually implement the parameters that control whether to override existing documents. It's been proposed that most of these be deprecated anyway and replaced with a simple overwrite=true/false. Are you trying to do something different than standard overwriting? -Yonik
Re: adding without overriding dups - DirectUpdateHandler2.java does not implement?
On 9/7/07, Lance Norskog [EMAIL PROTECTED] wrote: No, I'm just doing standard overwriting. It just took a little digging to be able to do it :)

Overwriting is the default... you shouldn't have to specify anything extra when indexing the document. -Yonik
Re: quirks with sorting
On 9/10/07, David Whalen [EMAIL PROTECTED] wrote: I'm seeing a weird problem with sorting that I can't figure out. I have a query that uses two fields -- a source column and a date column. I search on the source and I sort by the date descending. What I'm seeing is that depending on the value in the source, the date sort works in reverse. For example, the query: content_source:(mv); content_date desc returns 2007-09-10T09:25:00.000Z in its first row, which is what I expect. BUT, the query: content_source:(thomson); content_date desc returns 2008-08-17T00:00:00.000Z, which is the first date we put into SOLR.

Isn't it the last (highest date) since it's 2008? -Yonik
Re: My Solr index keeps growing
On 9/10/07, Robin Bonin [EMAIL PROTECTED] wrote: I had created a new index over the weekend, and the final size was a few hundred megs. I just checked and now the index folder is up to 1.7 gig. Is this due to results being cached? Can I set a limit to how large the index will grow? Is there anything else that could be affecting this file size?

"index" normally refers to the index files on the disk... is this what you mean? If so, it shouldn't grow unless new documents are added. -Yonik
Re: Solr and KStem
Some other notes: I just read the license... it's nice and short, and appears to be ASL compatible to me. We could either include the source in Solr and build it, or add it as a pre-compiled jar into lib. The FilterFactory should probably have its package changed to org.apache.solr.analysis (definitely if it will be included in source form in our repository). -Yonik

On 9/10/07, Mike Klaas [EMAIL PROTECTED] wrote: Hi Harry, Thanks for your contribution! Unfortunately, we can't include it in Solr unless the necessary legal hurdles are cleared. An issue needs to be opened on http://issues.apache.org/jira/browse/SOLR and you have to attach the file and check the "Grant License to ASF" button. It is also important to verify that you have the legal right to grant the code to ASF (since it is probably your employer's intellectual property). Legal issues are a hassle, but are unavoidable, I'm afraid. Thanks again, -Mike

On 10-Sep-07, at 10:22 AM, Wagner,Harry wrote: Hi Yonik, The modified KStemmer source is attached. The original KStemFilter is now wrapped (and replaced) by KStemFilterFactory. I also changed the path to avoid any naming collisions with existing Lucene code. I included the jar file also, for anyone who wants to just drop and play: - put KStem2.jar in your solr/lib directory. - change your schema to use: <filter class="org.oclc.solr.analysis.KStemFilterFactory" cacheSize="2"/> - restart your app server. I don't know if you credit contributions, but if so please include OCLC. Seems only fair since I did this on their dime :) Cheers! harry

-----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Friday, September 07, 2007 3:59 PM To: solr-user@lucene.apache.org Subject: Re: Solr and KStem

On 9/7/07, Wagner,Harry [EMAIL PROTECTED] wrote: I've implemented a Solr plug-in that wraps KStem for Solr use. KStem is considered to be more appropriate for library usage since it is much less aggressive than Porter (i.e., searches for "organization" do NOT match on "organ"!). If there is any interest in feeding this back into Solr I would be happy to contribute it.

Absolutely. We need to make sure that the license for that k-stemmer is ASL compatible of course. -Yonik

kstem_solr.tar.gz
Re: Removing lengthNorm from the calculation
If you aren't using index-time document boosting, or field boosting for that field specifically, then set omitNorms=true for that field in the schema, shut down solr, completely remove the index, and then re-index. The norms for each field consist of the index-time boost multiplied by the length normalization. -Yonik On 9/10/07, Kyle Banerjee [EMAIL PROTECTED] wrote: I know I'm missing something really obvious, but I'm spinning my wheels figuring out how to eliminate lengthNorm from the calculations. The specific problem I'm trying to solve is that naive queries are resulting in crummy short records near the top of the list. The reality is that the longer records tend to be higher quality, so if anything, they need to be emphasized. However, I'm missing something simple. Any advice or a pointer to an example I could model off would be greatly appreciated. Thanks, kyle
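For example, assuming the field is named "body" (a hypothetical name), the schema entry would look like:

  <field name="body" type="text" indexed="true" stored="true" omitNorms="true"/>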
Re: largish test data set?
If you want to see what performance will be like on the next release, you could try upgrading Solr's internal version of Lucene to trunk (the current dev version)... there have been some fantastic improvements in indexing speed. For query speed/throughput, Solr 1.2 or trunk should do fine. -Yonik

On 9/17/07, David Welton [EMAIL PROTECTED] wrote: Hi, I'm in the process of evaluating solr and sphinx, and have come to realize that actually having a large data set to run them against would be handy. However, I'm pretty new to both systems, so thought that perhaps asking around may produce something useful. What *I* mean by largish is something that won't fit into memory - say 5 or 6 gigs, which is probably puny for some and huge for others. BTW, I would also welcome any input from others who have done the above comparison, although what we'll be using it for is specific enough that of course I'll need to do my own testing. Thanks! -- David N. Welton http://www.welton.it/davidw/
Re: EdgeNGramTokenFilter, term position?
On 9/16/07, Ryan McKinley [EMAIL PROTECTED] wrote: Should the EdgeNGramFilter use the same term position for the ngrams within a single token? It feels like that is the right approach. I don't see value in having them sequential, and I can think of uses for having them overlap. -Yonik
Re: Customize the way relevancy is calculated
On 9/18/07, Amitha Talasila [EMAIL PROTECTED] wrote: The 65% of the relevance can be computed while indexing the document and posted as a field. But the keyword match is a run-time score. Is there any way of getting the relevance score as a combination of this 65% and 35%?

A FunctionQuery can get you the value of a field to use in a relevancy score. Put that in a boolean query with the relevancy query and boost each portion to give the correct weight: +text:foo^.65 _val_:scorefield^.35 -Yonik
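As a hedged sketch of the full request (assuming the default search field is "text" and "scorefield" holds the precomputed 65% value; note the "+" must be percent-encoded as %2B in a URL):

  http://localhost:8983/solr/select?q=%2Btext:foo^.65+_val_:scorefield^.35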
Re: pluggable functions
On 9/18/07, Jon Pierce [EMAIL PROTECTED] wrote: Reflection could be used to look up and invoke the constructor with appropriately-typed arguments. If we assume only primitive types and ValueSources are used, I don't think it would be too hard to craft a drop-in replacement that works with existing implementations. In any case, the more flexible alternative would probably be to do as you're suggesting (if I understand you correctly) -- let the function handle the parsing, The parser is a quick hack I threw together, and any value source factories should not be exposed to it. It seems like either 1) a value source factory would expose the types it expects or 2) a value source factory would take a ListValueSource and throw a ParseException if it didn't get what it expected Reflection might be fine if the cost of construction via reflection ends up being small compared to the parsing itself. -Yonik
Re: How can i make a distribute search on Solr?
On 9/19/07, Norberto Meijome [EMAIL PROTECTED] wrote: On Wed, 19 Sep 2007 01:46:53 -0400 Ryan McKinley [EMAIL PROTECTED] wrote: Stu is referring to "Federated Search" - where each index has some of the...

It really should be "Distributed Search" I think (my mistake... I started out calling it Federated). I think Federated search is more about combining search results from different data sources.

...data and results are combined before they are returned. This is not yet supported out of the box.

Maybe this is related: how does this compare to the map-reduce functionality in Nutch/Hadoop?

map-reduce is more for batch jobs. Nutch only uses map-reduce for parallel indexing, not searching. -Yonik
Re: useColdSearcher = false... not working in 1.2?
On 9/19/07, Adam Goldband [EMAIL PROTECTED] wrote: Anyone else using this, and finding it not working in Solr 1.2? Since we've got an automated release process, I really need to be able to have the appserver not see itself as done warming up until the firstSearcher is ready to go... but with 1.2 this no longer seems to be the case. I took a quick peek at the code, and it should still work (it's pretty simple). false is also the default. How are you determining that it isn't working? -Yonik
Re: Getting only size of getFacetCounts , to simulate count(group by( a field) ) using facets
On 9/19/07, Laurent Hoss [EMAIL PROTECTED] wrote: We want to (mis)use facet search to get the number of (unique) field values appearing in a document resultset. We have paging of facets, so just like normal search results, it does make sense to list the total number of facets matching.

The main problem with implementing this is trying to figure out where to put the info in a backward-compatible manner. Here is how the info is currently returned (JSON format):

  "facet_fields": {
    "cat": [
      "camera",1,
      "card",2,
      "connector",2,
      "copier",1,
      "drive",2
    ]
  },

Unfortunately, there's not a good place to put this extra info without older clients choking on it. Within "cat" there should have been another element called "values" or something... then we could easily add extra fields like "nvalues":

  "cat": {
    "nvalues":5042,
    "values": [
      "camera",1,
      "card",2,
      "connector",2,
      "copier",1,
      "drive",2
    ]
  }

-Yonik
Re: How can i make a distribute search on Solr?
On 9/19/07, Norberto Meijome [EMAIL PROTECTED] wrote: Maybe I got this wrong... but isn't this what map-reduce is meant to deal with?

Not really... you could force a *lot* of different problems into map-reduce (that's sort of the point... being able to automatically parallelize a lot of different problems). It really isn't the best fit though, and would end up being much slower than a custom job. Then there is the issue that the way map-reduce is implemented (like hadoop) is also tuned for longer-running batch jobs on huge data (temporary files are used, external sorts, initial input and final output via files, etc). Check out the google map-reduce paper - they don't use it for their search side either. Things are already progressing in the distributed search area: https://issues.apache.org/jira/browse/SOLR-303 Hopefully I'll have time to dig into it more myself in a few weeks. -Yonik
Re: Term extraction
On 9/19/07, Pieter Berkel [EMAIL PROTECTED] wrote: However, I'd like to be able to analyze documents more intelligently to recognize phrase keywords such as "open source", "Microsoft Office", "Bill Gates" rather than splitting each word into separate tokens (the field is never used in search queries so matching is not an issue). I've been looking at SynonymFilterFactory as a possible solution to this problem but haven't been able to work out the specifics of how to configure it for phrase mappings.

SynonymFilter works out-of-the-box with multi-token synonyms:

  Microsoft Office => microsoft_office
  Bill Gates, William Gates => bill_gates

Just don't use a word-delimiter filter if you use an underscore to join words. -Yonik
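A sketch of an analyzer chain for such a field, assuming the mappings above live in synonyms.txt (the "keyphrase" type name is made up for illustration):

  <fieldType name="keyphrase" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- collapse multi-token phrases into single tokens;
           expand="false" keeps only the right-hand side of each mapping -->
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>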
Re: Solr and FieldCache
On 9/20/07, Walter Ferrara [EMAIL PROTECTED] wrote: I'm just wondering, as this cached object could be (theoretically) pretty big, do I need to be aware of some OOM? I know that FieldCache use weakmaps, so I presume the cached array for the older reader(s) will be gc-ed when the reader is no longer referenced (i.e. when solr load the new one, after its warmup and so on), is that right? Right. You will need room for two entries (one for the current searcher and one for the warming searcher). -Yonik
Re: Solr and FieldCache
On 9/20/07, Walter Ferrara [EMAIL PROTECTED] wrote: I have an index with several fields, but just one stored: ID (string, unique). I need to access that ID field for each of the top nodes docs in my results (this is done inside a handler I wrote); the code looks like:

  Hits hits = searcher.search(query);
  for (int i=0; i<nodes; i++) {
    id[i] = hits.doc(i).get("ID");
    score[i] = hits.score(i);
  }

What is the higher-level use-case you are trying to address that makes it necessary to write a plugin? -Yonik
Re: Problem getting the FacetCount
On 9/21/07, Amitha Talasila [EMAIL PROTECTED] wrote: But when we make a facet query like http://localhost:8983/solr/select?q=ipod&rows=0&facet=true&facet.limit=-1&facet.query=weight:{0m TO 100m}, the facet count is coming back as 0. We are indexing it as a string field because if the user searches for 12m he needs to see that result. Can anyone suggest a better way of querying this?

In a string field, 12m is greater than 100m, so it won't be in the range. You need to index that field as a numeric type where range queries work: use type sint or sfloat. As for the "m", you should have a frontend that allows input in the form desired and converts it to a valid query for Solr. -Yonik
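A sketch under those assumptions: index the numeric part in an sfloat field and have the frontend strip the unit before building the query (the field name is reused from the question):

  <field name="weight" type="sfloat" indexed="true" stored="true"/>

  facet.query=weight:{0 TO 100}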
Re: Term extraction
On 9/21/07, Pieter Berkel [EMAIL PROTECTED] wrote: Yonik: This is the approach I had in mind, will it still work if I put the SynonymFilter after the word-delimiter filter in the schema config?

SynonymFilter doesn't currently have the capability to handle multiple tokens at the same position in the input. You could simply remove the WordDelimiterFilter unless you need it.

Ideally I want to strip out the underscore char before it gets indexed.

Why's that? You could just define your synonyms like that initially: Bill Gates, William Gates => billgates -Yonik
Re: I can't delete, why?
On 9/25/07, Ben Shlomo, Yatir [EMAIL PROTECTED] wrote: I know I can delete multiple docs with the following: <delete><query>mediaId:(6720 OR 6721 OR ...)</query></delete> My question is: can I do something like this? <delete><query>languageId:123 AND manufacturer:456</query></delete> (It does not work for me, and I didn't forget to commit.)

Do you get an error, or do you just not see this document deleted? Does a query identical to this show matching documents after a commit? Also keep in mind that delete-by-id is currently more efficient than delete-by-query, so if mediaId is your uniqueKeyField, you would be better served by using that. -Yonik
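For comparison, a delete-by-id message (one per document, usable when mediaId is the uniqueKey):

  <delete><id>6720</id></delete>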
Re: How to get debug information while indexing?
On 9/26/07, Urvashi Gadi [EMAIL PROTECTED] wrote: Hi, I am trying to create my own application using SOLR, and while trying to index my data I get "Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/update" or "Server returned HTTP response code: 500 for URL: http://localhost:8983/solr/update". Is there a way to get more debug information than this (any logs, which file is wrong, schema.xml? etc)

Both the HTTP reason and response body should contain more information. What are you using to communicate with Solr? Try a bad request with curl and you can see the info that comes back:

  [EMAIL PROTECTED] /cygdrive/f/code/lucene
  $ curl -i "http://localhost:8983/solr/select?q=foo:bar"
  HTTP/1.1 400 undefined_field_foo
  Content-Type: text/html; charset=iso-8859-1
  Content-Length: 1398
  Server: Jetty(6.1.3)

  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
  <title>Error 400 </title>
  </head>
  <body><h2>HTTP ERROR: 400</h2><pre>undefined field foo</pre>
  <p>RequestURI=/solr/select</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p>
  </body>
  </html>

Errors should also be logged. -Yonik
Re: searching for non-empty fields
On 9/27/07, Pieter Berkel [EMAIL PROTECTED] wrote: While in theory -URL:"" should be valid syntax, the Lucene query parser doesn't accept it and throws a ParseException.

I don't have time to work on that now, but I did just open a bug: https://issues.apache.org/jira/browse/LUCENE-1006 -Yonik
Re: moving index
On 9/27/07, Jae Joo [EMAIL PROTECTED] wrote: I do need to move the index files, but have a concern: any potential problems, including performance? Do I have to keep the original documents for querying?

I assume you posted XML documents in Solr XML format (like <add><doc>...)? If so, that is just an example way to get the data into Solr. Those XML files aren't needed, and any high-speed indexing will avoid creating files at all - just create the XML doc in memory and send it to Solr via HTTP POST. -Yonik
Re: searching for non-empty fields
On 9/27/07, Yonik Seeley [EMAIL PROTECTED] wrote: On 9/27/07, Pieter Berkel [EMAIL PROTECTED] wrote: While in theory -URL: should be valid syntax, the Lucene query parser doesn't accept it and throws a ParseException. I don't have time to work on that now, OK, I lied :-) It was simple (and a nice diversion). -Yonik but I did just open a bug: https://issues.apache.org/jira/browse/LUCENE-1006
Re: custom sorting
On 9/27/07, Erik Hatcher [EMAIL PROTECTED] wrote: Using something like this, how would the custom SortComparatorSource get a parameter from the request to use in sorting calculations?

Perhaps hook in via a function query: dist(10.4,20.2,geoloc) and either manipulate the score with that and sort by score:

  q=+(foo bar)^0 dist(10.4,20.2,geoloc)&sort=score asc

or extend Solr's sorting mechanisms to allow specifying a function to sort by:

  sort=dist(10.4,20.2,geoloc) asc

-Yonik
Re: Color search
If it were just a couple of colors, you could have a separate field for each color and then index the percent in that field: black:70 grey:20, and then you could use a function query to influence the score (or you could sort by the color percent). However, this doesn't scale well to a large index with a large number of colors. Each field used like that will take up 4 bytes per document in the index, so if you have 1M documents, that's 1M docs * 100 colors * 4 bytes = 400MB. Doable depending on your index size (use int or float and not sint or sfloat type for this... it will be better on the memory). If you needed to be better on the memory, you could encode all of the colors into a single value (perhaps into a compact string... one percentile per byte or something) and then have a custom function that extracts the value for a particular color (this involves some Java development). -Yonik

On 9/28/07, Guangwei Yuan [EMAIL PROTECTED] wrote: Hi, We're running an e-commerce site that provides product search. We've been able to extract colors from product images, and we think it'd be cool and useful to search products by color. A product image can have up to 5 colors (from a color space of about 100 colors), so we can implement it easily with Solr's facet search (thanks all who've developed Solr). The problem arises when we try to sort the results by color relevancy. What's different from a normal facet search is that colors are weighted. For example, a black dress can have 70% of black, 20% of gray, 10% of brown. A search query "color:black" should return results in which the black dress ranks higher than other products with a lower percentage of black. My question is: how to configure and index the color field so that products with a higher percentage of color X rank higher for the query color:X? Thanks for your help! - Guangwei
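A minimal sketch of the separate-field approach (all names illustrative): one dynamic float field per color, plus a query that combines the color match with its percentage via a function query:

  <dynamicField name="color_*" type="float" indexed="true" stored="false"/>

  q=+color:black _val_:color_black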
Re: small rsync index question
On 9/28/07, Brian Whitman [EMAIL PROTECTED] wrote: For some reason sending a <commit/> is not refreshing the index.

It should... are there any errors in the logs? Do you see the commit in the logs? Check the stats page to see info about when the current searcher was last opened too. -Yonik
Re: Schema version question
On 9/28/07, Robert Purdy [EMAIL PROTECTED] wrote: I was wondering if anyone could help me. I just completed a full index of my data (about 4 million documents) and noticed that when I was first setting up the schema I set the version number to 1.2, thinking that solr 1.2 uses schema version 1.2... ugh... so I am wondering if I can just set the schema to 1.1 without having to rebuild the full index? I ask because I am hoping that, given an invalid schema version number, version 1.0 is not used by default and all my fields are now multivalued. Any help would be greatly appreciated. Thanks in advance.

Yes, it should be OK to set it back to 1.1 w/o reindexing. The index format does not differentiate between single and multi-valued fields, so you should be fine there. -Yonik
Re: Request for graphics
On 9/28/07, Clay Webster [EMAIL PROTECTED] wrote: i'm late for dinner out, so i'm just attaching it here. Most attachments are stripped :-) -Yonik
Re: Searching combined English-Japanese index
On 10/1/07, Maximilian Hütter [EMAIL PROTECTED] wrote: When I search using an English term, I get results but the Japanese is not encoded correctly in the response. (although it is UTF-8 encoded) One quick thing to try is the python writer (wt=python) to see the actual unicode values of what you are getting back (since the python writer automatically escapes non-ascii). That can help rule out incorrect charset handling by clients. -Yonik
Re: Major CPU performance problems under heavy user load with solr 1.2
On 10/1/07, Robert Purdy [EMAIL PROTECTED] wrote: Hi there, I am having some major CPU performance problems with heavy user load with solr 1.2. I currently have approximately 4 million documents in the index and I am doing some pretty heavy faceting on multi-valued columns. I know that doing facets is expensive on multi-valued columns, but the CPU seems to max out (400%) with apache bench with just 5 identical concurrent requests...

One can always max out CPU (unless one is IO bound) with concurrent requests greater than the number of CPUs on the system. This isn't a problem by itself and would exist even if Solr were an order of magnitude slower or faster. You should be looking at things like the peak throughput (queries per sec) you need to support and the latency of the requests (look at the 90th percentile, or whatever).

...and I have the potential for a lot more concurrent requests than that with the large number of users that hit our site per day, so I am wondering if there are any workarounds. Currently I am running the out-of-the-box solr solution (the example jetty application with my own schema.xml and solrconfig.xml) on a dual Intel Duo core 64-bit box with 8 gigs of ram allocated to the start.jar process dedicated to solr, with no slaves. I have set up some aggressive caching in the solrconfig.xml for the filterCache (class="solr.LRUCache" size=300 initialSize=200) and have the HashDocSet set to 1 to help with faceting, but still I am getting some pretty poor performance. I have also tried autowarming the facets by performing a query that hits all my multivalued facets with no facet limits across all the documents in the index. This does seem to reduce my query times by a lot, because the filterCache grows to about 2.1 million lookups, and it finishes the query in about 70 secs.

OK, that's long. So focus on the latency of a single request instead of jumping straight to load testing. 2.1 million is a lot - what's the field with the largest number of unique values that you are faceting on?

However, I have noticed an issue with this: each time I do an optimize or a commit after prewarming the facets, the cache gets cleared (according to the stats on the admin page) but the RSize does not shrink for the process, and the queries get slow again. So I prewarm the facets again, and the memory usage keeps growing like the cache is not being recycled...

The old searcher and cache won't be discarded until all requests using it have completed.

...and as a result the prewarm query gets slower and slower each time this occurs (after about 5 rounds of prewarm-then-commit, the query takes about 30 mins... ugh) and it almost runs out of memory. Any thoughts on how to help improve this and fix the memory issue?

You could try the minDf param to reduce the number of facets stored in the cache and reduce memory consumption. -Yonik
Re: Searching combined English-Japanese index
On 10/1/07, Maximilian Hütter [EMAIL PROTECTED] wrote: Yonik Seeley schrieb: On 10/1/07, Maximilian Hütter [EMAIL PROTECTED] wrote: When I search using an English term, I get results but the Japanese is not encoded correctly in the response (although it is UTF-8 encoded). / One quick thing to try is the python writer (wt=python) to see the actual unicode values of what you are getting back (since the python writer automatically escapes non-ascii). That can help rule out incorrect charset handling by clients. -Yonik / Thanks for the tip, it turns out that the unicode values are wrong... I mean, the browser displays correctly what is sent. But I don't know how solr gets these values.

OK, so they never got into the index correctly. The most likely explanation is that the charset wasn't set correctly when the update message was sent to Solr. -Yonik
Re: Searching combined English-Japanese index
On 10/2/07, Maximilian Hütter [EMAIL PROTECTED] wrote: Are you sure they are wrong in the index? It's not an issue with Jetty output encoding, since the python writer takes the string and converts it to ascii before that.

Since Solr does no charset encoding itself on output, that must mean that it's in the index incorrectly.

When I use the Lucene Index Monitor (http://limo.sourceforge.net/) to look at the document in the index, the Japanese is displayed correctly.

I've never really used limo, but it's possible it's incorrectly interpreting what's in the index (and by luck doing the reverse transformation that got the data in there incorrectly). Try indexing a document with a unicode character specified via an entity, to remove the issues of input char encodings. For example, if a Japanese char has a unicode value of \u1234, then in the XML doc use &#x1234; -Yonik
Re: Seeing if an entry exists in an index for a set of terms
On 10/3/07, Ian Holsman [EMAIL PROTECTED] wrote: Hi. I was wondering if there was an easy way to give solr a list of things and find out which have entries, i.e. I pass a list: Bill Clinton, George Bush, Mary Papas (and possibly 20 others) to a solr index which contains news articles about presidents. I would like a response saying Bill Clinton was found in 20 records, George Bush was found in 15 - possibly with the links, but that's not too important. I know I can do this by doing ~20 individual queries, but I thought there may be a more efficient way.

How about facet.query="Bill Clinton"&facet.query="George Bush", etc.? Will give you counts, but not the links. -Yonik
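A hedged sketch of the full request, assuming the article text is searched via a default "text" field; rows=0 returns only the counts, with the phrases quoted and percent-encoded:

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.query=text:%22Bill+Clinton%22&facet.query=text:%22George+Bush%22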
Re: Best way to change weighting based on the presence of a field
On 10/5/07, Mike Klaas [EMAIL PROTECTED] wrote: The other option is to use a function query on the value stored in a field (which could represent a range of 'badness'). This can be used directly in the dismax handler using the bf (boost function) query parameter. In the near future, you can do a real query-time boost (score multiplication) by another field or function https://issues.apache.org/jira/browse/SOLR-334 And even quickly update all the values of the field being used as the boost: https://issues.apache.org/jira/browse/SOLR-351 -Yonik
Re: Urldecode Problem
On 10/6/07, Frederik M. Kraus [EMAIL PROTECTED] wrote: Looks like we ran into a urldecode problem when having certain query strings. This is what happens: Client: Jeffrey's Bay -> Jeffrey%26%2339%3Bs+Bay (php 5.2 urlencode/rawurlencode)

It looks like the client is doing XML escaping, as it replaces ' with &#39; and then each char of the &#39; is URL-encoded. This is incorrect of course; urlencoding has nothing to do with XML. -Yonik
Re: High-Availability deployment
On 10/8/07, Daniel Alheiros [EMAIL PROTECTED] wrote: I'm about to deploy SOLR in a production environment Cool, can you share exactly what it will be used for? and so far I'm a bit concerned about availability. I have a system that is responsible for fetching data from a database and then pushing it to SOLR using its XML/HTTP interface. So I'm going to deploy N instances of my application so it's going to be redundant enough. And I'm deploying SOLR in a Master / Slaves structure, so I'm using the slaves nodes as a way to keep my index replicated and to be able to use them to serve my queries. But my problem lies on the indexing side of things. Is there a good alternative like a Master/Master structure that I could use so if my current master dies I can automatically switch to my secondary master keeping my index integrity? In all the setups I've dealt with, master redundancy wasn't an issue. If something bad happens to corrupt the index, shut off replication to the slaves and do a complete rebuild on the master. If the master hardware dies, reconfigure one of the slaves to be the new master. These are manual steps and assumes that it's not the end of the world if your search is stale for a couple of hours. A schema change that required reindexing would also cause this window of staleness. If your index build takes a long time, you could set up a secondary master to pull from the primary (just like another slave). But there's no support for automatically switching over slaves, and the secondary wouldn't have stuff between the last commit and the primary crash... so something would need to update it... (query for latest doc and start from there). You could also have two search tiers... another copy of the master and multiple slaves. If one was down, being upgraded, or being rebuilt, you could direct search traffic to the other set of servers. -Yonik
Re: High-Availability deployment
On 10/8/07, Daniel Alheiros [EMAIL PROTECTED] wrote: Well, I believe I can live with some staleness at certain moments, but it's not good as users are supposed to need it 24x7. So the common practice is to make one of the slaves the new master and switch things over to it, and after the outage put them in sync again and do the proper switch back? OK, I'll follow this, but I'm still concerned about the number of manual steps to be done...

That was the plan - never needed it though... (never had a master completely die that I know of). Having the collection not be updated for an hour or so while the ops folks fixed things always worked fine.

And another important issue: how frequently have you seen indexes getting corrupted?

Just once I think - no idea of the cause (and I think it was quite an old version of Lucene).

If I try to run a commit or optimize on a Solr master instance and its index got corrupted, will it run the command?

Almost all of the cases I've seen of a master failing were an OOM error, often during segment merging (again, older versions of Lucene, and someone forgot to change the JVM heap size from the default). This could cause a situation where you added a document but the old one was not deleted (overwritten). Not corrupted at the Lucene level, but if the JVM died at the wrong spot, search results could possibly return two documents for the same unique key. We normally just rebuilt after a crash.

And more importantly, will it run the postOptimize/postCommit scripts, generating snapshots and then possibly propagating the bad index?

Normally not, I think... the JVM crash/restart left the Lucene write lock acquired on the index and further attempts to modify it failed. -Yonik
Re: High-Availability deployment
On 10/8/07, Daniel Alheiros [EMAIL PROTECTED] wrote: Hmm, is there any exception thrown in case the index get corrupted (if it's not caused by OOM and the JVM crashes)? The document uniqueness SOLR offers is one of the many reasons I'm using it and should be excellent to know when it's gone. :) Does it mean that after recovering from a JVM crash should be recommended to rebuild my indexes instead of just re-starting it? Yes, it's safer to do so. I think in a future release we will be able to guarantee document uniqueness even in the face of a crash. -Yonik
Re: Availability Issues
On 10/8/07, David Whalen [EMAIL PROTECTED] wrote: Have you taken a thread dump to see what is going on? We can't do it b/c during the unresponsive time we can't access the admin site (/solr/admin) at all. I don't know how to do a thread dump via the command line.

kill -3 <pid_of_jvm_running_solr>

Start with the thread dump. I bet it's multiple queries piling up around some synchronization points in lucene (sometimes caused by multiple threads generating the same big filter that isn't yet cached). -Yonik
Re: Availability Issues
On 10/8/07, David Whalen [EMAIL PROTECTED] wrote: The logs show nothing but regular activity. We do a tail -f on the logfile and we can read it during the unresponsive period and we don't see any errors. You don't see log entries for requests until after they complete. When a server becomes unresponsive, try shutting off further traffic to it, and let it finish whatever requests it's working on (assuming that's the issue) so you can see them in the log. Do you see any requests that took a really long time to finish? -Yonik
Re: Availability Issues
On 10/8/07, David Whalen [EMAIL PROTECTED] wrote: Do you see any requests that took a really long time to finish? The requests that take a long time to finish are just simple queries. And the same queries run at a later time come back much faster. Our logs contain 99% inserts and 1% queries. We are constantly adding documents to the index at a rate of 10,000 per minute, so the logs show mostly that. Oh, so you are using the same boxes for updating and querying? When you insert, are you using multiple threads? If so, how many? What is the full URL of those slow query requests? Do the slow requests start after a commit? Start with the thread dump. I bet it's multiple queries piling up around some synchronization points in lucene (sometimes caused by multiple threads generating the same big filter that isn't yet cached). What would be my next steps after that? I'm not sure I'd understand enough from the dump to make heads-or-tails of it. Can I share that here? Yes, post it here. Most likely a majority of the threads will be blocked somewhere deep in lucene code, and you will probably need help from people here to figure it out. -Yonik
Re: Facets and running out of Heap Space
On 10/9/07, David Whalen [EMAIL PROTECTED] wrote: I run a faceted query against a very large index on a regular schedule. Every now and then the query throws an out of heap space error, and we're sunk. So, naturally we increased the heap size and things worked well for a while and then the errors would happen again. We've increased the initial heap size to 2.5GB and it's still happening. Is there anything we can do about this? Try facet.enum.cache.minDf param: http://wiki.apache.org/solr/SimpleFacetParameters -Yonik
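For example (the field name is hypothetical), setting a per-field minimum document frequency so that only terms matching many documents get cached filters:

  http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=media_type&f.media_type.facet.enum.cache.minDf=100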
Re: Facets and running out of Heap Space
On 10/9/07, David Whalen [EMAIL PROTECTED] wrote: This is only used during the term enumeration method of faceting (facet.field type faceting on multi-valued or full-text fields). What if I'm faceting on just a plain String field? It's not full-text, and I don't have multiValued set for it.

Then you will be using the FieldCache counting method, and this param is not applicable :-) Are all the fields you facet on like this? The FieldCache entry might be taking up too much room, esp. if the number of entries is high and the entries are big. The requests themselves can take up a good chunk of memory temporarily (4 bytes * nValuesInField). You could try a memory profiling tool and see where all the memory is being taken up too. -Yonik
Re: Facets and running out of Heap Space
On 10/10/07, Mike Klaas [EMAIL PROTECTED] wrote: Have you tried setting multiValued="true" without reindexing? I'm not sure, but I think it will work.

Yes, that will work fine. One thing that will change is the response format for stored fields: <arr name="foo"><str>val1</str></arr> instead of <str name="foo">val1</str>. Hopefully in the future we can specify a faceting method w/o having to change the schema. -Yonik
Re: Internal Server Error and waitSearcher=false for commit/optimize
On 10/10/07, Jason Rennie [EMAIL PROTECTED] wrote: We're using solr 1.2 and a nightly build of the solrj client code. We very occasionally see things like this:

  org.apache.solr.client.solrj.SolrServerException: Error executing query
      at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:86)
      at org.apache.solr.client.solrj.impl.BaseSolrServer.query(BaseSolrServer.java:99)
      ...
  Caused by: org.apache.solr.common.SolrException: Internal Server Error

Is there a longer stack trace somewhere concerning the internal server error?

We also occasionally see solr taking too long to respond. We currently make our commit/optimize calls without any arguments. I'm wondering whether setting waitSearcher=false might allow search queries to be served while a commit/optimize is being run. I found this in an old message from this list: "While commit/optimize is being run, requests are served using the old searcher - there shouldn't be any blocking." Is waitSearcher=false designed to allow queries to be processed while a commit/optimize is being run?

No, waitSearcher=true was designed such that a client could do a commit and wait for a new searcher to be registered, such that a new query request is guaranteed to see the changes. waitSearcher=true/false only affects the thread calling commit... it has no effect on other query requests, which will continue to use the previous searcher until the new one is registered. -Yonik
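For reference, a sketch of a commit with both flags off, as posted to /update in the standard XML form (solrj's commit(waitFlush, waitSearcher) overload does the equivalent):

  <commit waitFlush="false" waitSearcher="false"/>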
Re: doubled/halved performance?
On 10/11/07, Mike Klaas [EMAIL PROTECTED] wrote: I'm seeing some interesting behaviour when doing benchmarks of query and facet performance. Note that the query cache is disabled, the index is entirely in the OS disk cache, and the filterCache is fully primed. Often when repeatedly measuring the same query, I'll see pretty consistent results (within a few ms), but occasionally a run which is almost exactly half the time: 240ms vs. 120ms:

  solr: DEBUG INFO: /select/ facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 239
  solr: DEBUG INFO: /select/ facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 237
  solr: DEBUG INFO: /select/ facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 120
  solr: DEBUG INFO: /select/ facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 120
  solr: DEBUG INFO: /select/ facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 237
  solr: DEBUG INFO: /select/ facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 238

The strange thing is that the execution time is halved across _all_ parts of query processing:

  101.0 total time
    1.0 setup/query parsing
   68.0 main query
   30.0 faceting
    0.0 pre fetch
    2.0 debug

  201.0 total time
    1.0 setup/query parsing
  138.0 main query
   58.0 faceting
    0.0 pre fetch
    4.0 debug

I can't really think of a plausible explanation. Fortuitous instruction pipelining?

It is hard to imagine a cause that wouldn't exhibit consistency. So the queries are one at a time, the index isn't changing, and nothing else is happening on the system? It would be easier to explain an occasional long query than an occasional short one. It's weird how the granularity seems to be on the basis of a request (if the speedup sometimes happened half way through, then you'd get an average of the times). You could try -Xbatch to see if it's hotspot somehow, but I doubt that's it. -Yonik
Re: Instant deletes without committing
On 10/11/07, BrendanD [EMAIL PROTECTED] wrote: Yes, we have some huge performance issues with non-cached queries, so doing a commit is very expensive for us. We have our autowarm count for our filterCache and queryResultCache both set to 4096. But I don't think that's near high enough. We did have it as high as 16384 before, but it took over an hour to warm.

Look in the logs... what took an hour to warm? There are separate autowarm log messages for the query and filter caches.

Some of our queries take 30-60 seconds to complete if they're not cached.

1) Configure static warming requests for any faceting that's common (see the sketch after this message)
2) Configure static warming requests for any filters (fq) that are common
3) Size the filter cache larger than what's needed to hold all the facets (if that's too much memory, try the minDf param... see the wiki)
4) If indexing performance isn't an issue, lower mergeFactor to lower the average number of segments in the index (or optimize if you can)

-Yonik
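A sketch of static warming for (1) and (2) in solrconfig.xml, with hypothetical field names; the same list of queries can also go under the firstSearcher event:

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="fq">product_is_active:true</str>
        <str name="facet">true</str>
        <str name="facet.field">category_id</str>
      </lst>
    </arr>
  </listener>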
Re: query syntax performance difference?
On 10/11/07, BrendanD [EMAIL PROTECTED] wrote: Is there a difference in the performance for the following 2 variations on query syntax? The first query was a response from Solr using a single fq parameter in the URL. The second query was a response from Solr using a separate fq parameter for each field.

  <str name="fq">product_is_active:true AND product_status_code:complete AND category_id:1001570 AND attribute_id_value_en_pair:1005758\:Elvis Presley</str>

vs:

  <arr name="fq">
    <str>product_is_active:true</str>
    <str>product_status_code:complete</str>
    <str>category_id:1001570</str>
    <str>attribute_id_value_en_pair:1005758\:Elvis Presley</str>
  </arr>

I'm just wondering if the queries get executed differently and whether it's better to split out each individual query into its own statement or combine them using the AND operator.

If they almost always appear together, then use an AND and put them in the same filter. If they are relatively independent, use different filters. Having solr intersect a few filters is normally very fast, so independent filters is usually fine. -Yonik
Re: Non-sortable types in sample schema
On 10/13/07, Lance Norskog [EMAIL PROTECTED] wrote: The sample schema in Solr 1.2 supplies two variants of integers, longs, floats, doubles. One variant is sortable and one is not. What is the point of having both? Why would I choose the non-sorting variants? Do they store fewer bytes per record?

They both sort (because sorting uses the un-inverted FieldCache entry)... but they don't both do range queries correctly (which rely on term index order). One might choose "integer" for reading a legacy lucene index, or because they only need it for sorting or for function queries and the FieldCache entry is smaller. -Yonik
Re: comment-out a filter?
On 10/15/07, David Whalen [EMAIL PROTECTED] wrote: I want to comment-out a filter in my schema.xml, specifically the solr.EnglishPorterFilterFactory filter. I want to know -- will this cause me to have to re-build my index? Or will a restart of SOLR get the job done? Yes, you will need to rebuild because the index will have stemmed terms and queries will no longer match those terms in the index. -Yonik
Re: Search results problem
On 10/17/07, Maximilian Hütter [EMAIL PROTECTED] wrote: I also found this: "Controls the maximum number of terms that can be added to a Field for a given Document, thereby truncating the document. Increase this number if large documents are expected. However, setting this value too high may result in out-of-memory errors." Coming from: http://www.ibm.com/developerworks/library/j-solr2/index.html That might be a problem for me. I was thinking about using copyFields instead of one large fulltext field. Would that solve my problem, or would the maxFieldLength still apply when using copyFields?

maxFieldLength is a setting on the IndexWriter and applies to all fields. If you want more tokens indexed, simply increase the value of maxFieldLength to something like 2000000000 and you should be fine. There's no penalty for setting it higher than the largest field you are indexing (no diff between 1M and 2B if all your docs have field lengths less than 1M tokens anyway). -Yonik
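The setting lives in solrconfig.xml; a minimal sketch (2147483647 is Integer.MAX_VALUE, effectively removing the limit):

  <indexDefaults>
    ...
    <maxFieldLength>2147483647</maxFieldLength>
    ...
  </indexDefaults>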
Re: GET_SCORES flag in SolrIndexSearcher
On 10/19/07, Chris Hostetter [EMAIL PROTECTED] wrote: (it doesn't matter that parseSort returns null when the sort string is just score ... SolrIndexSearcher recognizes a null Sort as being the default sort by score) Yep... FYI, I did this early on specifically because no sort and score desc get you the same results from Lucene's IndexSearcher.search(), but they take different code paths (the former being slightly faster). -Yonik
Re: Performance when indexing or cold cache
On 10/22/07, Walter Underwood [EMAIL PROTECTED] wrote:

  <lst name="appends">
    <str name="fq">(pushstatus:A AND (type:movie OR type:person))</str>
  </lst>
  </requestHandler>

Perhaps try setting up a static warming query for this filter and any other common filters? Also look for correlations between when slow queries happen and the number of segments in the index (and perhaps lower mergeFactor to compensate if possible). -Yonik
Re: Using wildcard with accented words
On 10/22/07, Erik Hatcher [EMAIL PROTECTED] wrote: Perhaps this is a case that Solr could address with a third analyzer configuration (it already has "query" and "index" differentiation) that could be incorporated for wildcard queries. Thoughts on that?

I've actually thought about it previously; it would be nice for it all to work automatically for the user. It seems like the implementation should be based at the TokenFilter level; then things like synonym filters, stemmers, etc. would do nothing. Perhaps add some new methods to BaseTokenFilterFactory to do prefix, wildcard, etc. transformations? Another gotcha is handling multiple tokens. What happens if someone queries for myfield:foo-bar* with a letter tokenizer or a word-delimiter filter? It's not a simple prefix query at all! -Yonik
Re: Search results problem
On 10/19/07, Maximilian Hütter [EMAIL PROTECTED] wrote: Yonik Seeley schrieb: On 10/17/07, Maximilian Hütter [EMAIL PROTECTED] wrote: I also found this: "Controls the maximum number of terms that can be added to a Field for a given Document, thereby truncating the document. Increase this number if large documents are expected. However, setting this value too high may result in out-of-memory errors." Coming from: http://www.ibm.com/developerworks/library/j-solr2/index.html That might be a problem for me. I was thinking about using copyFields instead of one large fulltext field. Would that solve my problem, or would the maxFieldLength still apply when using copyFields? / maxFieldLength is a setting on the IndexWriter and applies to all fields. If you want more tokens indexed, simply increase the value of maxFieldLength to something like 2000000000 and you should be fine. There's no penalty for setting it higher than the largest field you are indexing (no diff between 1M and 2B if all your docs have field lengths less than 1M tokens anyway). -Yonik / Yes, that would be an easy solution, as there is no performance penalty, as you say. I am still unsure whether the maxFieldLength applies to copyFields?

maxFieldLength applies to all fields (it's a Lucene concept, not a Solr one). copyField and maxFieldLength are not related.

When using copyFields I get an array back for that field (the one I copied to). So it seems to be different.

??? maxFieldLength only applies to the number of tokens indexed. You will always get the complete field back if it's stored, regardless of what maxFieldLength is.

Is there a performance penalty for using copyFields when indexing?

copyFields are done as a discrete step before indexing... almost no cost to do that. Indexing itself will have a performance impact if there are more fields to index + store as a result of the copyField commands.

How about mixed fieldtypes in the source fields? What happens when I copy an sint-based field and a string-based field to a string-based field?

copyField is done based on the string values, before any analysis. Mixed content should be fine. -Yonik
Re: Search results problem
On 10/23/07, Maximilian Hütter [EMAIL PROTECTED] wrote: What I meant was that it is different from just having a field with all the tokens, compared to using copyField to copy all the content to a field. copyField doesn't just copy the contents to the field but seems to somehow link them there.

copyField simply creates an additional value for the target... it would end up the same as if you sent it in yourself.

So if my maxFieldLength is for example set to 100 and I use copyField for 101 other fields, will the 101st get truncated?

copyField and maxFieldLength have nothing to do with each other. maxFieldLength limits the number of *tokens* in all values of a given name in a given document. So if you had field1:"this is a test" and a maxFieldLength of 3, then the "test" token would be dropped. If you had field1:"this is" field1:"a test" and a maxFieldLength of 3, then the "test" token would still be dropped.

Is there a performance penalty for using copyFields when indexing?

copyFields are done as a discrete step before indexing... almost no cost to do that. Indexing itself will have a performance impact if there are more fields to index + store as a result of the copyField commands.

The documents in my application have something like 400+ fields (many multivalued). For easy searching, the application copies all the contents of the 400+ fields to one field (a fulltext field) which is used as the defaultfield. This field is quite large for many documents (it gets as long as 55 tokens). I was thinking about using copyField to copy the fields onto that field instead of having the application do it before sending it to Solr.

The indexing cost will be identical in either case. Since copyField is a little more elegant (why force the user to send the data more than once), I'd use that. If you don't need to search on all 400+ fields individually, don't index them (just index your defaultfield). And I wouldn't store your defaultfield since it's redundant info. -Yonik
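A sketch of letting Solr do the copying (source field names are hypothetical; the target is unstored and multiValued since it accumulates many values):

  <field name="fulltext" type="text" indexed="true" stored="false" multiValued="true"/>
  <copyField source="title" dest="fulltext"/>
  <copyField source="abstract" dest="fulltext"/>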
Re: Payloads for multiValued fields?
On 10/24/07, Alf Eaton [EMAIL PROTECTED] wrote: Yonik Seeley wrote: Could you perhaps index the captions as

  #1 this is the first caption
  #2 this is the second caption

and then just look for #n in the highlighted results? For display, you could also strip out the #n in the captions. / This was working ok for a while, but there's a problem: the highlighter doesn't return the whole caption - just the highlighted part - so sometimes the #n at the start of the caption field doesn't get returned and isn't available. Any other ideas? Perhaps there's a way for the response to say which fields of each document were matched?

Perhaps try hl.fragsize=0 http://wiki.apache.org/solr/HighlightingParameters -Yonik
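A hedged example request (the "caption" field name follows the discussion; hl.fragsize=0 makes the highlighter return the whole field value as one fragment):

  http://localhost:8983/solr/select?q=caption:something&hl=true&hl.fl=caption&hl.fragsize=0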
Re: Empty field error when boosting a dismax query using bf
On 10/24/07, Alf Eaton [EMAIL PROTECTED] wrote: I'm trying to use the bf parameter to boost a dismax query based on the value of a certain (integer) field. The trouble is that for some of the documents this field is empty (rather than zero), which means that there's an error when using the bf parameter:

    select?q=query+string&qf=body&qt=dismax&bf=intfield

    java.lang.NumberFormatException: For input string: ""

It looks like you are indexing a zero-length string for that field. Instead, completely leave the field out. In the future, we should probably have Solr remove (not index) empty non-string fields. -Yonik
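In other words, when building the add command, skip the field element entirely for documents with no value. A sketch (the field name follows the thread; the id value is made up):

    <!-- problematic: indexes a zero-length string into an integer field -->
    <doc><field name="id">42</field><field name="intfield"></field></doc>

    <!-- better: leave intfield out entirely when there is no value -->
    <doc><field name="id">42</field></doc>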
Re: where did my foreign language go?
On 10/24/07, Ian Holsman [EMAIL PROTECTED] wrote: Hi. I'm in the middle of bringing up a new solr server and am using the trunk (where I was using an earlier nightly release of about 2-3 weeks ago on my old server). Now, when I do a search for 日本 (japan), it used to show the kanji in the q area, but now it shows gibberish (æ—¥æœ¬) instead. Any hints on where I should start investigating why this is happening?

My standard answer is to use the python writer (wt=python) to see what the actual unicode values are when debugging an issue like this. When I try your URL with the example server from the solr trunk, I get 'q':u'\u65e5\u672c', and when I try your server, I get 'q':u'\u00e6\u0097\u00a5\u00e6\u009c\u00ac', so the answer is that your app-server isn't correctly handling UTF-8 encoded URLs. I see you are using Tomcat... see http://wiki.apache.org/solr/SolrTomcat under "URI Charset Config":

If you are going to query Solr using international characters (>127) using HTTP-GET, you must configure Tomcat to conform to the URI standard by accepting percent-encoded UTF-8. Edit Tomcat's conf/server.xml and add the attribute URIEncoding="UTF-8" to the correct Connector element:

    <Server ...>
      <Service ...>
        <Connector ... URIEncoding="UTF-8">
          ...
        </Connector>
      </Service>
    </Server>

This is only an issue when sending non-ascii characters in a query request... no configuration is needed for Solr/Tomcat to return non-ascii chars in a response, or accept non-ascii chars in an HTTP-POST body. -Yonik
Re: My filters are not used
On 10/24/07, Norskog, Lance [EMAIL PROTECTED] wrote: I am creating a filter that is never used. Here is the query sequence:

    q=*:*&fq=contentid:00*&start=0&rows=200
    q=*:*&fq=contentid:00*&start=200&rows=200
    q=*:*&fq=contentid:00*&start=400&rows=200
    q=*:*&fq=contentid:00*&start=600&rows=200
    q=*:*&fq=contentid:00*&start=700&rows=200

According to the statistics, here is my filter cache usage: lookups : 1 [...] I'm completely confused. I thought this should be 1 insert, 4 lookups, 4 hits, and a hitratio of 100%.

Solr has a query cache too... the query cache is checked, there's a hit, and the query process is short-circuited. -Yonik
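Both caches are declared in solrconfig.xml; a sketch using the stock element names (the sizes are illustrative):

    <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>

Because the whole request (q + fq + paging window) can be satisfied by the queryResultCache, the filterCache never needs to be consulted on the repeat queries.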
Re: Forced Top Document
On 10/25/07, Chris Hostetter [EMAIL PROTECTED] wrote:

: The typical use case, though, is for the featured document to be on top only
: for certain queries. Like in an intranet where someone queries 401K or
: retirement or similar, you want to feature a document about benefits that
: would otherwise rank really low for that query. I have not been able to make
: sorting strategies work very well.

this type of question typically falls into two use cases: 1) targeted ads 2) sponsored results

in the targeted ads case, the special matches aren't part of the normal flow of results, and don't fit into pagination -- they always appear at the top, or to the right, on every page, no matter what the sort. this kind of usage doesn't really need any special logic; it can be solved as easily by a second Solr hit as it can by custom request handler logic.

in the sponsored results use case, the special matches should appear in the normal flow of results as the #1 (2, 3, etc) matches, so that they don't appear on page #2 ... but that also means that it's extremely disconcerting for users if those matches are still at the top when the users re-sort. if a user is looking at product listings sorted by relevancy and the top 3 results all say they are sponsored, that's fine ... but if the user sorts by price and those 3 results are still at the top of the list, even though they clearly aren't the cheapest, that's just going to piss the user off.

in my professional opinion: don't fuck with your users. default to whatever order you want, but if the user specifically requests to sort the results by some option, do it. assuming you follow my professional opinion, then boosting docs to have an artificially high score will work fine.

if you absolutely *MUST* have certain docs sorting before others, regardless of which sort option the user picks, then it is still possible to do ... i'm hesitant to even say how, but if people insist on knowing... always sort by score first, then by whatever field the user wants to sort by ... but when the user wants to sort on a specific field, move the user's main query input into an fq (so it doesn't influence the score) ... and use an extremely low boost MatchAllDocs query along with your special doc matching query as the main (scoring) query param. the key being that even though your primary sort is on score, every doc except your special matches has an identical score.

That sorts by relevance for your sponsored results, right? What if you want absolute ordering based on dollars spent on that result, for example. (This may not be possible with dismax because it's not trivial to move the query into an fq.)

Should be easier in trunk: fq={!dismax}foo bar or fq={!dismax v=$userq} -Yonik
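A sketch of the scoring trick Hoss describes, using the standard query parser (the field names, boosts, and user query are all made up for illustration; only sponsored docs get a meaningful score, so the secondary sort field orders everything else):

    q=(*:*^0.00001 OR sponsored:true^1000)&fq=ipod&sort=score desc, price asc

On older versions where the standard handler doesn't accept a separate sort param, the legacy "q=...;score desc, price asc" syntax expresses the same sort.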
Re: prefix-search ignores the lowerCaseFilter
On 10/25/07, Max Scheffler [EMAIL PROTECTED] wrote: Is it possible that the prefix-processing ignores the filters?

Yes, it's a known limitation that we haven't worked out a fix for yet. The issue is that you can't just run the prefix through the filters because of things like stop words, stemming, minimum length filters, etc. -Yonik
Re: indexing one document with different populated fields causes deletion of documents with other populated fields
On 10/25/07, Anton Valdstein [EMAIL PROTECTED] wrote: Does Solr automatically check for duplicate text in other fields, and delete documents that have the same text stored in other fields?

Solr automatically overwrites (deletes old versions of) documents with the same uniqueKey field (normally called "id"). Both Lucene and Solr lack the ability to change (or add fields to) existing documents. -Yonik
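The relevant schema.xml declarations, roughly as in the stock example schema:

    <field name="id" type="string" indexed="true" stored="true"/>
    ...
    <uniqueKey>id</uniqueKey>

An add with an existing id replaces the whole old document; matching is on the uniqueKey only, not on the text of any other field.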
Re: SOLR 1.3 Release?
On 10/25/07, Matthew Runo [EMAIL PROTECTED] wrote: Any ideas on when 1.3 might be released? We're starting a new project and I'd love to use 1.3 for it - is SVN head stable enough for use?

I think it's stable in the sense of "does the right thing and doesn't crash", but IMO isn't stable in the sense that new interfaces (internal and external) added since 1.2 may still be changing. Lots of new stuff going in (and has gone in), and I wouldn't expect to see 1.3 super soon. Just IMO of course. -Yonik
Re: indexing one document with different populated fields causes deletion of documents with other populated fields
On 10/25/07, Anton Valdstein [EMAIL PROTECTED] wrote: Thanks, that explains a lot (: I have another question about how the idf is calculated: is the document frequency the number of documents containing the term in any of their fields, or just in the field the query contained?

idfs are field (fieldname) specific. So it's based on the count of documents containing that word in that field. Things are done on the basis of "term" in Lucene, and a term consists of the fieldname and the word. -Yonik
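For reference, Lucene's DefaultSimilarity (2.x) computes idf per term, i.e. per (field, word) pair:

    idf(t) = 1 + log( numDocs / (docFreq(t) + 1) )

so the same word appearing in two different fields gets two independent docFreq counts, matching the answer above.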
Re: CollectionDistribution - Changes reflected immediately on master, but only after tomcat restart on slave
On 10/26/07, Karen Loughran [EMAIL PROTECTED] wrote: But after distribution of this latest snapshot to the slave, the collection does not show the update (with the solr admin query url or via a java query client) UNLESS I restart tomcat?

Sounds like a config issue with the scripts... pulling the snapshot is obviously working, but snapinstaller (calling commit) is broken. Try running bin/commit -V by hand on the slave. -Yonik
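If commit is indeed the broken step, a manual commit against the slave's update handler should also make the snapshot visible (a sketch; host and port are illustrative):

    curl http://slave:8983/solr/update --data-binary '<commit/>' -H 'Content-type:text/xml'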
Re: prefix-search ignores the lowerCaseFilter
On 10/29/07, Martin Grotzke [EMAIL PROTECTED] wrote: On Thu, 2007-10-25 at 10:48 -0400, Yonik Seeley wrote: On 10/25/07, Max Scheffler [EMAIL PROTECTED] wrote: Is it possible that the prefix-processing ignores the filters? Yes, it's a known limitation that we haven't worked out a fix for yet. The issue is that you can't just run the prefix through the filters because of things like stop words, stemming, minimum length filters, etc.

What about having, in addition to facet.prefix, a facet.filtered.prefix that runs the prefix through the filters? Would that be possible?

The underlying issue remains - it's not safe to treat the prefix like any other word when running it through the filters. -Yonik
Re: Phrase Query Performance Question
On 10/30/07, Haishan Chen [EMAIL PROTECTED] wrote: Thanks a lot for replying, Yonik! I am running solr on a windows 2003 server (standard version), intel Xeon CPU 3.00GHz, with 4.00 GB RAM. The index is located on RAID5 with 2 million documents. Is there any way to improve query performance without moving to a more powerful computer? I understand that the query performance of a phrase query (auto repair) has to do with the number of documents containing the two words. In fact the number of documents that have auto and repair is about 100k - something like 5% of the documents. It seems to me 937 ms is too slow.

Chen, that does seem slow; I'm not sure why.
1) was this the first search on the index? if so, try running some other searches to warm things up first.
2) was the jvm in server mode? (start with -server)
3) shut down unrelated things on the system so that there is more memory available to the OS to cache the index files

Would it be faster if I ran solr on a linux system?

Maybe... Lucene does rely on the OS caching often-used parts of the index, so this can differ the most between Windows and Linux. If you have a Linux box lying around, trying it out quickly to remove that variable would be a good idea. -Yonik
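For point 1, warming can be automated in solrconfig.xml so the first real user never hits a cold searcher; a sketch (the query text is illustrative):

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">auto repair</str><str name="start">0</str><str name="rows">10</str></lst>
      </arr>
    </listener>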
Re: FW: Score customization
On 10/31/07, Victoria Kaganski [EMAIL PROTECTED] wrote: Does FunctionQuery actually override the default similarity function? If it does, how can I still access the similarity value? FunctionQuery returns the *value* of a field (or a function of it) as the value for a query - it does not use Similarity at all. If you put a FunctionQuery in a BooleanQuery with other queries (like normal relevance queries), the scores will be added together. If you use a BoostedQuery, the FunctionQuery score will be multiplied by the normal relevance score. -Yonik
Re: fieldNorm seems to be killing my score
Hmmm, a norm of 0.0??? That implies that the boost for that field (text) was set to zero when it was indexed. How did you index the data (straight HTTP, SolrJ, etc)? What does your schema for this field (and copyFields) look like? -Yonik

On 11/1/07, Robert Young [EMAIL PROTECTED] wrote: Hi, I've been trying to debug why one of my test cases doesn't work. I have an index with two documents in it, one talking mostly about apples and one talking mostly about oranges (for the sake of this test case), both of which have 'test_site' in their site field. If I run the query +(apple^4 orange) +(site:test_site) I would expect the document which talks about apples to always appear first, but it does not. Looking at the debug output (below) it looks like fieldNorm is killing the first part of the query. Why is this and how can I stop it?

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="rows">10</str>
      <str name="start">0</str>
      <str name="indent">on</str>
      <str name="q">+(apple^4 orange) +(site:test_site)</str>
      <str name="debugQuery">on</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="guid">test_index-test_site-integration:124</str>
      <str name="index">test_index</str>
      <str name="link">/oranges</str>
      <str name="site">test_site</str>
      <str name="snippet">orange orange orange</str>
      <str name="title">orange</str>
    </doc>
    <doc>
      <str name="guid">test_index-test_site-integration:123</str>
      <str name="index">test_index</str>
      <str name="link">/me</str>
      <str name="site">test_site</str>
      <str name="snippet">apple apple apple</str>
      <str name="title">apple</str>
    </doc>
  </result>
  <lst name="debug">
    <str name="rawquerystring">+(apple^4 orange) +(site:test_site)</str>
    <str name="querystring">+(apple^4 orange) +(site:test_site)</str>
    <str name="parsedquery">+(text:appl^4.0 text:orang) +site:test_site</str>
    <str name="parsedquery_toString">+(text:appl^4.0 text:orang) +site:test_site</str>
    <lst name="explain">
      <str name="id=test_index-test_site-integration:124,internal_docid=13">
        0.14332592 = (MATCH) sum of:
          0.0 = (MATCH) product of:
            0.0 = (MATCH) sum of:
              0.0 = (MATCH) weight(text:orang in 13), product of:
                0.24034579 = queryWeight(text:orang), product of:
                  1.9162908 = idf(docFreq=5)
                  0.1254224 = queryNorm
                0.0 = (MATCH) fieldWeight(text:orang in 13), product of:
                  2.236068 = tf(termFreq(text:orang)=5)
                  1.9162908 = idf(docFreq=5)
                  0.0 = fieldNorm(field=text, doc=13)
            0.5 = coord(1/2)
          0.14332592 = (MATCH) weight(site:test_site in 13), product of:
            0.13407566 = queryWeight(site:test_site), product of:
              1.0689929 = idf(docFreq=13)
              0.1254224 = queryNorm
            1.0689929 = (MATCH) fieldWeight(site:test_site in 13), product of:
              1.0 = tf(termFreq(site:test_site)=1)
              1.0689929 = idf(docFreq=13)
              1.0 = fieldNorm(field=site, doc=13)
      </str>
      <str name="id=test_index-test_site-integration:123,internal_docid=14">
        0.14332592 = (MATCH) sum of:
          0.0 = (MATCH) product of:
            0.0 = (MATCH) sum of:
              0.0 = (MATCH) weight(text:appl^4.0 in 14), product of:
                0.96138316 = queryWeight(text:appl^4.0), product of:
                  4.0 = boost
                  1.9162908 = idf(docFreq=5)
                  0.1254224 = queryNorm
                0.0 = (MATCH) fieldWeight(text:appl in 14), product of:
                  2.236068 = tf(termFreq(text:appl)=5)
                  1.9162908 = idf(docFreq=5)
                  0.0 = fieldNorm(field=text, doc=14)
            0.5 = coord(1/2)
          0.14332592 = (MATCH) weight(site:test_site in 14), product of:
            0.13407566 = queryWeight(site:test_site), product of:
              1.0689929 = idf(docFreq=13)
              0.1254224 = queryNorm
            1.0689929 = (MATCH) fieldWeight(site:test_site in 14), product of:
              1.0 = tf(termFreq(site:test_site)=1)
              1.0689929 = idf(docFreq=13)
              1.0 = fieldNorm(field=site, doc=14)
      </str>
    </lst>
  </lst>
</response>
Re: SOLR 1.3: defaultOperator always defaults to OR although AND is specifed.
Try the latest... I just fixed this. -Yonik

On 11/1/07, Britske [EMAIL PROTECTED] wrote: Experimenting with SOLR 1.3, I discovered that although I specified <solrQueryParser defaultOperator="AND"/> in schema.xml, q=a+b behaves as q=a OR b instead of q=a AND b. Obviously this is not correct. I used the nightly of 29 oct. Cheers, Geert-Jan -- View this message in context: http://www.nabble.com/SOLR-1.3%3A-defaultOperator-always-defaults-to-OR-although-AND-is-specifed.-tf4731773.html#a13529997 Sent from the Solr - User mailing list archive at Nabble.com.
Re: overlapping onDeckSearchers message
On 11/3/07, Brian Whitman [EMAIL PROTECTED] wrote: I have a solr index that hasn't had many problems recently but I had the logs open and noticed this a lot during indexing: [16:23:34.086] PERFORMANCE WARNING: Overlapping onDeckSearchers=2 That means that one searcher hasn't yet finished warming in the background, and a commit was just done and another searcher started warming. -Yonik
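For reference, the related cap lives in solrconfig.xml: the warning above fires as soon as more than one searcher is warming, and commits start failing outright once this limit is exceeded:

    <maxWarmingSearchers>2</maxWarmingSearchers>

Committing less frequently, or reducing autowarm counts so warming finishes faster, usually makes the warning go away.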
Re: FW: Score customization
On 11/3/07, Victoria Kaganski [EMAIL PROTECTED] wrote: I guess I was not clear... I understand that if I use FunctionQuery, its result value will be returned as the score, instead of the similarity. Am I right?

Only for the FunctionQuery part... it's not an all-or-nothing thing. Let me give you a specific example in Solr query syntax:

    +text:"spider man"~100 _val_:popularity

This query will result in the full-text relevance score (yes, using similarity) of the first part, added to the value of the popularity field. Try some examples out and let us know if you don't get what you expect. -Yonik
Re: custom request handler doesn't invoke the query tokenization chain
On 11/4/07, Yu-Hui Jin [EMAIL PROTECTED] wrote: Let's say we defined a custom field type where, for both querying and indexing, solr.LowerCaseFilterFactory is used as the last filter to lowercase all letters. In the Analysis UI, we found tokenization is working correctly. We also defined a custom request handler which always creates a boolean query that ANDs all tokens for fielded queries (we overrode the getFieldQuery method only).

First, if all you are doing is ANDing all the tokens, you can just change the default operator to AND (q.op=AND). Analysis is done during query parsing by the query parser... if you create your own queries, you need to do that analysis yourself. -Yonik
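For example (a hypothetical request against the standard handler):

    select?q=apple orange&q.op=AND

is parsed as apple AND orange, with the field type's full analysis chain applied - no custom handler code needed.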
Re: custom request handler doesn't invoke the query tokenization chain
On 11/5/07, Yu-Hui Jin [EMAIL PROTECTED] wrote: Just curious, does the default operator (AND or OR) specify the relationship between field/value components, or between the tokens of the same field/value component?

Between any clauses in a boolean query.

e.g. for a query like this: field1:abc field2:xyz does the operator connect field1:abc and field2:xyz, or does it connect the tokens from abc and xyz for their respective fields?

These are two different query clauses (the fieldnames don't matter). If the default operator is OR, then it will be interpreted as field1:abc OR field2:xyz (both optional); if the default operator is set to AND, then it will be field1:abc AND field2:xyz (both required). -Yonik
Re: Phrase Query Performance Question and score threshold
On 11/5/07, Haishan Chen [EMAIL PROTECTED] wrote: As for the first issue: the number of different phrase queries with performance issues that I have found so far is about 10.

If these are normal phrase queries (no slop), a good solution might be to simply index and query these phrases as a single token. One could do this with a SynonymFilter. Oh, and no, a score threshold won't help performance.

I believe there will be a lot more; I just haven't tried. It could be solved by using faster hardware though. Also I believe it would help if SOLR had a distributed search architecture similar to NUTCH, so that it can scale out instead of scale up.

It's coming... -Yonik
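A sketch of the single-token idea with SynonymFilter (the mapping and filename are illustrative):

    # synonyms.txt - collapse the phrase to one token at index and query time
    auto repair => autorepair

and in the field type's analyzer chain in schema.xml:

    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>

A phrase query for "auto repair" then becomes a cheap single-term lookup for autorepair instead of a positional intersection of two very common terms.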
Re: specify index location
On 11/5/07, evol__ [EMAIL PROTECTED] wrote: Just a remark:

    <!-- Used to specify an alternate directory to hold all index data
         other than the default ./data under the Solr home.
         If replication is in use, this should match the replication configuration. -->

Might be a good idea to change this to ./data/index to reflect the location that is expected in there.

./data is the generic solr data directory; "index" stores the main index under the data directory. -Yonik
Re: value boosts? (boosting a multiValued field's data)
On 11/6/07, evol__ [EMAIL PROTECTED] wrote: Hi. Is the expansion method described in the following year-old post still the best available way to do this? http://www.nabble.com/newbie-Q-regarding-schema-configuration-tf1814271.html#a4956602 The way I understand it, indexing these

    <field name="foo" boost="1.0">First val</field>
    <field name="foo" boost="0.8">Less important value</field>

would just make the boost 0.8 field-wide?

Yes... all boost values for multivalued fields are multiplied together. Nothing we can do about that... only one norm (boost * lengthNorm) is stored per document per unique field. -Yonik
Re: query syntax
On 11/6/07, Traut [EMAIL PROTECTED] wrote: I have an indexed document with field name, and its value is somename123. Why can't I find anything with the query name:somename123*?

This is a prefix query. No analysis is done on the prefix, so it may not match the analysis that was done when the document was indexed. For example, if you use WordDelimiterFilter, this may be indexed as somename 123.

But there are results on the query name:somename123*

This is not a prefix query. The * will most likely be removed by the analyzer, leaving you effectively with a query of name:somename123. -Yonik
Re: Can you parse the contents of a field to populate other fields?
On 11/6/07, Kristen Roth [EMAIL PROTECTED] wrote: Yonik - thanks so much for your help! Just to clarify, where should the regex go for each field?

Each field should have a different FieldType (referenced by the "type" XML attribute). Each FieldType can have its own analyzer. You can use a different PatternTokenizer (which specifies a regex) for each analyzer. -Yonik
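A sketch of one such field type (the type name and regex are illustrative; with a capture group, PatternTokenizerFactory emits the group's text as the token):

    <fieldType name="dateFromPair" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.PatternTokenizerFactory"
                   pattern="[^,]+,\s*(.*)" group="1"/>
      </analyzer>
    </fieldType>

Given input like "review,1 Jan 2007", this would index just "1 Jan 2007".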
Re: SOLR 1.2 - Duplicate Documents??
On Nov 7, 2007 12:30 PM, realw5 [EMAIL PROTECTED] wrote: We did have Tomcat crash once (JVM OutOfMem) during an indexing process; could that be a possible source of the issue?

Yes. Deletes are buffered and carried out in a different phase. -Yonik
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
On Nov 10, 2007 4:24 PM, David Neubert [EMAIL PROTECTED] wrote: So if I am hitting multiple fields (in the same search request) that invoke different Analyzers -- am I at a dead end, and have to resort to consecutive multiple queries instead?

Solr handles that for you automatically.

The app that I am replacing (and trying to enhance) has the ability to search multiple books at once, with sen/par and case sensitivity settings individually selectable per book.

You could easily select case sensitivity or not *per query* across all books. You should step back and see what the requirements actually are (i.e. the reasons why one needs to be able to select case sensitive/insensitive on a book level... it doesn't make sense to me at first blush). It could be done on a per-book level in solr with a more complex query structure though:

    (+case:sensitive +(normal relevancy query on the case sensitive fields goes here))
    OR (+case:insensitive +(normal relevancy query on the case insensitive fields goes here))

-Yonik
Re: solr range query
On Nov 12, 2007 8:02 AM, Heba Farouk [EMAIL PROTECTED] wrote: I would like to use solr to return ranges of searches on an integer field. If I write offset:[0 TO 10] in the url, it returns documents with offset values 0, 1, 10 only, but I want it to return the range 0, 1, 2, 3, 4, ..., 10. How can I do that with solr?

Use fieldType="sint" (sortable int... see the schema.xml), and reindex. -Yonik
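The pieces involved, roughly as in the stock example schema (the field name follows the thread):

    <fieldType name="sint" class="solr.SortableIntField" omitNorms="true"/>
    ...
    <field name="offset" type="sint" indexed="true" stored="true"/>

SortableIntField encodes integers so that their lexicographic order matches their numeric order, which is what makes offset:[0 TO 10] behave as a numeric range instead of a string range (where "10" sorts before "2").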
Re: no segments* file found
On Nov 12, 2007 3:46 AM, SDIS M. Beauchamp [EMAIL PROTECTED] wrote: If I don't optimize, I've got a "too many open files" error at about 450K files and a 3 GB index.

You may need to increase the number of file descriptors in your system. If you're using Linux, see this: http://www.cs.uwaterloo.ca/~brecht/servers/openfiles.html Check the system-wide limit and the per-process limit. -Yonik
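For example, on Linux (the numbers are illustrative, and raising hard limits may require root or an entry in /etc/security/limits.conf):

    # per-process limit for the shell that launches Solr
    ulimit -n
    ulimit -n 65536

    # system-wide limit
    cat /proc/sys/fs/file-max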
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
On Nov 12, 2007 2:20 PM, David Neubert [EMAIL PROTECTED] wrote: Erik - thanks, I am considering this approach versus explicit redundant indexing -- and am also considering Lucene.

There's not a well-defined solution in either, IMO.

The problem is, I am one week into both technologies (though I have years in the search space) -- wish I could go to Hong Kong -- any discounts available anywhere :)

Unfortunately the OS Summit has been canceled. -Yonik
Re: Exception in SOLR when querying for fields of type string
On Nov 13, 2007 6:23 PM, Kasi Sankaralingam [EMAIL PROTECTED] wrote: It is not tokenized, it is a string field, so will it still match "photo" for field 'title_s' and "book" for the default field?

Yes, because the query parser splits things up by whitespace before analyzers are even applied. Do you have a default field defined? -Yonik
Re: how to load custom valuesource as plugin
Unfortunately, the function query parser isn't currently pluggable. -Yonik

On Nov 14, 2007 2:02 PM, Britske [EMAIL PROTECTED] wrote: I've created a simple ValueSource which is supposed to calculate a weighted sum over a list of supplied ValueSources. How can I let Solr recognise this ValueSource? I tried to simply upload it as a plugin and reference it by its name (wsum) in a function query, but got an "Unknown function wsum in FunctionQuery" error. Can anybody tell me what I'm missing here? Thanks in advance, Geert-Jan
Re: score customization
On Nov 15, 2007 11:06 AM, Jae Joo [EMAIL PROTECTED] wrote: I am looking for a way to get the score rounded to hundredths - ex. 4.09, something like that. Currently, it has 7 decimal digits: <float name="score">1.8032384</float>

If you want to display scores only to the hundredths place, simply do that in your client. There's not a good reason to try and add this to solr... saving 5 bytes per document wouldn't be worth it. -Yonik
Re: Payloads in Solr
On Nov 17, 2007 2:18 PM, Tricia Williams [EMAIL PROTECTED] wrote: I was wondering how Solr people feel about the inclusion of Payload functionality in the Solr codebase?

All for it... depending on what one means by payload functionality, of course. We should probably hold off on adding a new lucene version to Solr until the Payload API has stabilized (it will most likely be changing very soon).

From a recent message to the [EMAIL PROTECTED] mailing list: I'm working on the issue https://issues.apache.org/jira/browse/SOLR-380 which is a feature request that allows one to index a "Structured Document", which is anything that can be represented by XML, in order to provide more context to hits in the result set. This allows us to do things like query the index for "Canada" and be able to not only say that the query matched a document titled "Some Nonsense", but also that the query term appeared on page 7 of chapter 1. We can then take this one step further and markup/highlight the image of this page based on our OCR and position hit. For example:

    <book title="Some Nonsense">
      <chapter title="One">
        <page name="1">Some text from page one of a book.</page>
        <page name="7">Some more text from page seven of a book. Oh and I'm from Canada.</page>
      </chapter>
    </book>

I accomplished this by creating a custom Tokenizer which strips the xml elements and stores them as a Payload at each of the Tokens created from the character data in the input. The payload is the string that describes the XPath at that location. So for "Canada" the payload is /book[title='Some Nonsense']/chapter[title='One']/page[name='7']

That's a lot of data to associate with every token... I wonder how others have accomplished this? One could compress it with a dictionary somewhere. I wonder if one could index special begin_tag and end_tag tokens, and somehow use span queries?

Using Payloads requires me to include lucene-core-2.3-dev.jar, which might be a barrier. Also, using my Tokenizer with Solr-specific TokenFilter(s) loses the Payload at modified tokens.

Yes, this will be an issue for many custom tokenizers that don't yet know about payloads but that create tokens. It's not clear what to do in some cases when multiple tokens are created from one... should identical payloads be created for the new tokens? It depends on what the semantics of those payloads are. -Yonik