Re: Updating FAQ for International Characters?
So I am using Sunspot to post over, which means an extra layer of indirection between me and my XML! I will look tomorrow.

On Mar 10, 2010, at 7:21 PM, Chris Hostetter wrote:

: Any time a character like that was indexed, Solr threw an unknown entity error.
: But if converted to &#192; or &Agrave; then everything works great.
:
: I tried out using Tomcat versus Jetty and got the same results. Before I edit

Uh, you mean like the characters in exampledocs/utf8-example.xml? It contains literal utf8 characters, and it works fine. Based on your &#192; comment I assume you are posting XML ... are you sure you are using the utf8 charset? -Hoss

- Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com Co-Author: Solr 1.4 Enterprise Search Server available from http://www.packtpub.com/solr-1-4-enterprise-search-server Free/Busy: http://tinyurl.com/eric-cal
Facet pagination
Is there a way to get the *total count of facet values* per field? Meaning, if my facets are:

<lst name="facet_fields">
  <lst name="first_char">
    <int name="s">305807</int>
    <int name="d">264748</int>
    <int name="p">181084</int>
    <int name="m">130546</int>
    <int name="r">98544</int>
    <int name="b">82741</int>
    <int name="k">77157</int>
  </lst>
</lst>

then is the following possible?

<lst name="first_char" totalFacetCount="7">

where 7 is the count of all facet values available - in this example s, d, p, m, r, b and k. I need this to fetch paginated facets of a field for a given query, not just by doing next-previous. Cheers Avlesh
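A minimal SolrJ sketch, not from the thread: Solr 1.4 can page through facet values with facet.limit and facet.offset, though it does not report the total number of facet values, which is what is being asked for here. The URL, field name, and page size below are assumptions for illustration.

import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetPager {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        int pageSize = 20;
        int page = 2; // zero-based page of facet values to fetch

        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);                  // facets only, no documents
        q.setFacet(true);
        q.addFacetField("first_char");
        q.setFacetLimit(pageSize);     // facet.limit
        q.set("facet.offset", String.valueOf(page * pageSize));

        QueryResponse rsp = server.query(q);
        FacetField ff = rsp.getFacetField("first_char");
        List<FacetField.Count> values = ff.getValues();
        if (values != null) {
            for (FacetField.Count c : values) {
                System.out.println(c.getName() + " -> " + c.getCount());
            }
        }
    }
}

Since the total count is not returned, reaching the last page can only be detected by getting back fewer than pageSize values.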
Advance Search
How can I achieve advanced search in Solr? I need to search for books, e.g. title = The Book of Three, author = Lloyd Alexander, price = 99.00. How can I query this?
Can't commit on 125 GB index
Hi, I'm having timeouts committing on a 125 GB index with about 2200 docs. I'm inserting new docs every 5m and committing after that. I would like to try the autocommit option and see if I can get better results. I need the docs indexed and available for searching within about 10 minutes after the insert. I was thinking of using something like:

<autoCommit>
  <maxDocs>5000</maxDocs>
  <maxTime>86000</maxTime>
</autoCommit>

I update about 4000 docs every 15m. Can you share your thoughts on this config? Do you think this will solve my commit timeout problem? Thanks, Frederico
Multiple SOLR queries on same index
Hi, Is it possible to execute multiple Solr queries (basically the same structure/fields, but due to the header-size limitations on long query URLs I'm thinking of having multiple Solr queries) against a single index, like a batch? Best Regards, Kranti K K Parisa
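A minimal sketch, not from the thread, of one common way around URL-length limits: send the query as an HTTP POST so the parameters travel in the request body instead of the URL. With SolrJ this is a one-argument change; the URL below is an assumption.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PostQuery {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // Imagine a very long boolean query that would overflow a GET URL
        StringBuilder sb = new StringBuilder("id:(");
        for (int i = 0; i < 5000; i++) {
            sb.append(i).append(i < 4999 ? " OR " : ")");
        }
        SolrQuery q = new SolrQuery(sb.toString());
        // POST keeps the long parameter list out of the URL entirely
        QueryResponse rsp = server.query(q, SolrRequest.METHOD.POST);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}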
Re: index merge
Hi All, Thank you for the very valuable suggestions. I am planning to try using the Master - Slave configuration. Best Rgds, Mark.

On Mon, Mar 8, 2010 at 11:17 AM, Mark Miller markrmil...@gmail.com wrote:

On 03/08/2010 10:53 AM, Mark Fletcher wrote:
Hi Shalin, Thank you for the mail. My main purpose of having 2 identical cores (COREX - always serves user requests; COREY - once a day, takes the updates/latest data and passes it on to COREX) is this: suppose I have only one COREY, and suppose a request comes to COREY while the update of the latest data is happening on it. Wouldn't it degrade performance of the requests at that point in time?

Yes - but you're not going to help anything by using two indexes - the best you can do is use two boxes. Two indexes on the same box will actually be worse than one if they are identical and you are swapping between them. Writes on an index will not affect reads in the way you are thinking - only in that they use I/O and CPU that the read process can't. That's going to happen with 2 indexes on the same box too - except now you have way more data to cache and flip between, and you can't take any advantage of things just written possibly being in the cache for reads. Lucene indexes use a write-once strategy - when writing new segments, you are not touching the segments being read from. Lucene is already doing the index juggling for you at the segment level.

So I was planning to keep COREX and COREY always identical. Once COREY has the latest it should somehow sync with COREX so that COREX also now has the latest. COREY keeps on getting the updates at a particular time of day and it will again pass them on to COREX. This process continues every day. What is the best possible way to implement this? Thanks, Mark.

On Mon, Mar 8, 2010 at 9:53 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:
Hi Mark,

On Mon, Mar 8, 2010 at 7:38 PM, Mark Fletcher mark.fletcher2...@gmail.com wrote:
I ran the SWAP command. Now COREX has its dataDir pointing to the updated dataDir of COREY, so COREX has the latest. Again, COREY (on which the update regularly runs) is pointing to the old index of COREX, so it now doesn't have the most updated index. Now shouldn't I update the index of COREY (now pointing to the old COREX) so that it has the latest footprint as in COREX (which has the latest COREY index), so that when the update again happens to COREY, it has the latest and I again do the SWAP? Is physically copying the index named COREY (the latest, and now the dataDir of COREX after the SWAP) over the index COREX (now the dataDir of COREY, the original non-updated index of COREX) the best way to do this, or is there any other better option? Once again, later when COREY is again updated with the latest, I will run the SWAP again and it will be fine, with COREX again pointing to its original dataDir (now the updated one). So every even SWAP command run will point COREX back to its original dataDir (same case with COREY). My only concern is, after the SWAP is done, updating the old index (which was serving previously and is now replaced by the new index). What is the best way to do that? Physically copy the latest index to the old one and make it in sync with the latest one, so that by the time it is to get the latest updates it already has the latest in it, the new ones can be added to it, it becomes the latest, and is again swapped?

Perhaps it is best if we take a step back and understand why you need two identical cores?

-- Regards, Shalin Shekhar Mangar.

-- - Mark http://www.lucidimagination.com
field length normalization
Hi, In my schema the document title field has omitNorms=false which, if I am not wrong, causes the length of titles to be counted in the scoring. But when I query with: word1 word2 word3, I don't know why the top two documents' titles have these words plus other words, whereas the document whose title has exactly and only these query words comes in third place. Setting omitNorms to false should bring the titles with the exact words to the top, shouldn't it? Also I realized when I debugged the query that all three top documents have the same score; shouldn't this differ, since they have different title lengths? Thanks very much. -A
mincount doesn't work with FacetQuery
I'm faceting with a query range (with addFacetQuery) and setting mincount to 10 (with setFacetMinCount(10)), but Solr is not respecting this mincount; it's still giving me all responses, even those having fewer than 10 retrieved documents. I'm wondering whether there's another way to define the mincount while using addFacetQuery. Actually, when I use this same mincount with addFacetField, it works perfectly. Any ideas? Thanks
Re: Architectural help
Assuming you create the view in such a way that it returns 1 row for each Solr document you want indexed: yes.

On Wed, Mar 10, 2010 at 7:54 PM, blargy zman...@hotmail.com wrote:
So I can just create a view (or temporary table) and then just have a simple select * from (view or table) in my DIH config?

Constantijn Visinescu wrote:
Try making a database view that contains everything you want to index, and then just use the DIH. Worked when I tested it ;)

On Wed, Mar 10, 2010 at 1:56 AM, blargy zman...@hotmail.com wrote:
I was wondering if someone could be so kind as to give me some architectural guidance. A little about our setup: we are a RoR shop that is currently using Ferret (no laughs please) as our search technology. Our indexing process at the moment is quite poor, as are our search results. After some deliberation we have decided to switch to Solr to satisfy our search requirements. We have about 5M records ranging in size, all coming from a DB source (only 2 tables). What will be the most efficient way of indexing all of these documents? I am looking at DIH, but before I go down that road I wanted to get some guidance. Are there any pitfalls I should be aware of before I start? Anything I can do now that will help me down the road? I have also been exploring the Sunspot rails plugin (http://outoftime.github.com/sunspot/) which so far seems amazing. There is an easy way to reindex all of your models, like Model.reindex, but I doubt this is the most efficient. Has anyone had any experience using Sunspot with their rails environment, and if so should I bother with the DIH? Please let me know of any suggestions/opinions you may have. Thanks.
Re: Advance Search
Have you looked at dismax? Erick

On Thu, Mar 11, 2010 at 4:40 AM, Suram reactive...@yahoo.com wrote:
How can I achieve advanced search in Solr? I need to search for books, e.g. title = The Book of Three, author = Lloyd Alexander, price = 99.00. How can I query this?
Apache Solr module with drupal - where to change key word in context?
I am using the Apache Solr module with our Drupal site. Our data is not clean enough to use the key-word-in-context blurb under the title in the result set. I would like to change it to the first N characters of the body of the node. Can anyone direct me to the file and line(s) where I would do this? Thanks! Lynn
Re: Distributed search fault tolerance
I guess I must be including too much information in my questions, running into tl;dr with them. Later today when I have more time I'll try to make it more bite-size. On 3/9/2010 2:28 PM, Shawn Heisey wrote: I attended the Webinar on March 4th. Many thanks to Yonik for putting that on. That has led to some questions about the best way to bring fault tolerance to our distributed search. High level question: Should I go with SolrCloud, or stick with 1.4 and use load balancing? I hope the rest of this email isn't too disjointed for understanding.
Re: mincount doesn't work with FacetQuery
Steve - I'm a bit confused... each facet.query (using HTTP parameter nomenclature) only adds a single value to the response: the number of docs within the current constraints that match that query. facet.mincount is specifically for facet.field, which adds a name/value pair for each value in the field, and that's where you want to limit the number of values returned. Perhaps you could provide a concrete example with the Solr response (XML, JSON, or some other readable format) where the facet data isn't making sense to you. Or maybe SolrJ has some faults in presenting the response properly? Erik

On Mar 11, 2010, at 8:11 AM, Steve Radhouani wrote:
I'm faceting with a query range (with addFacetQuery) and setting mincount to 10 (with setFacetMinCount(10)), but Solr is not respecting this mincount; it's still giving me all responses, even those having fewer than 10 retrieved documents. I'm wondering whether there's another way to define the mincount while using addFacetQuery. Actually, when I use this same mincount with addFacetField, it works perfectly. Any ideas? Thanks
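A hedged SolrJ sketch of what Erik describes: every facet.query contributes exactly one count, which SolrJ exposes via QueryResponse.getFacetQuery(), so a mincount would have to be applied client-side. The field names and ranges are assumptions.

import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetQueryCounts {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetQuery("price:[0 TO 100]");   // each facet.query yields one count
        q.addFacetQuery("price:[100 TO *]");
        q.setFacetMinCount(10);                // only affects facet.field

        QueryResponse rsp = server.query(q);
        Map<String, Integer> counts = rsp.getFacetQuery();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // emulate mincount for facet.query on the client side
            if (e.getValue() != null && e.getValue() >= 10) {
                System.out.println(e.getKey() + " -> " + e.getValue());
            }
        }
    }
}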
Aggregate functions on faceted result
Hi. We would like to be able to create trend graphs with date on the X-axis and sum(pagerank) on the Y-axis. We have the field pageRank stored as an external field (since it is updated all the time). I have started to build a SearchComponent which will be named something like FacetFunctionComponent, but felt that I should drop a mail here asking if it is already possible. Is it even remotely possible to create this function in Solr? Cheers //Marcus Herou -- Marcus Herou CTO and co-founder Tailsweep AB +46702561312 marcus.he...@tailsweep.com http://www.tailsweep.com/
Re: distinct on my result
okay. we have a lot of products and I just imported the name of each product into a core, made an edgengram on it, and my autoCOMPLETION runs. but I want an auto-suggestion. example: autocompletion -- I: harry O: harry potter... but when the input is -- I: potter -- O: / so what I want is that I get "harry potter ..." when I type potter into my search field! any idea? I think the solution is a mix of TermsComponent and EdgeNGram, or not? I am a little bit desperate, and in this forum there is too much information about it =(

gwk-4 wrote:
Hi, The autosuggest core is filled by a simple script (written in PHP) which requests facet values for all the possible strings one can search for and adds them one by one as a document. Our case has some special issues due to the fact that we search in multiple languages (typing España will suggest Spain, and the other way around when on the Spanish site). We have about 97500 documents yielding approximately 12500 different documents in our autosuggest core, and the autosuggest-update script takes about 5 minutes to do a full re-index (all this is done on a separate server and replicated, so the indexing has no impact on the performance of the site). Regards, gwk

On 3/10/2010 3:09 PM, stocki wrote:
okay. thx my suggestion runs in another core ;) do you distinct during the import with DIH?
Index size on disk
Hello, I needed an easy way to see the index size (the actual size on disk, not just the number of documents indexed) and as I didn't find anything for doing that in the documentation or on the list, I coded a quick solution. I added the index size as a statistic of the searcher; that way the value can be seen on the statistics page of the Solr admin. To do this I modified the method public NamedList getStatistics() {...} in the class org.apache.solr.search.SolrIndexSearcher by adding the line

lst.add("indexSize", this.calculateIndexSize(reader.directory()).toString() + " MB");

and added the methods:

private BigDecimal calculateIndexSize(Directory directory) {
    long size = 0L;
    try {
        for (String filePath : directory.listAll()) {
            size += directory.fileLength(filePath);
        }
    } catch (IOException e) {
        return new BigDecimal(-1);
    }
    return getSizeInMB(size, 2);
}

private BigDecimal getSizeInMB(long size, int scale) {
    BigDecimal divisor = new BigDecimal(1024);
    BigDecimal sizeKb = new BigDecimal(size).divide(divisor, scale + 1, BigDecimal.ROUND_HALF_UP);
    return sizeKb.divide(divisor, scale, BigDecimal.ROUND_HALF_UP);
}

I'm running Solr 1.4 on JBoss 4.0.5 with Java 1.5 and this worked just fine. Does anyone see a potential problem with this? I'm assuming that the Solr index will never have directories inside (that's why I'm just looping over the index parent directory); is there any case where this is not true? Tomás
Call for presentations - Berlin Buzzwords - Summer 2010
Call for Presentations Berlin Buzzwords
http://buzzwordsberlin.de
Berlin Buzzwords 2010 - Search, Store, Scale
7/8 June 2010

This is to announce Berlin Buzzwords 2010, the first conference on scalable and open search, data processing and data storage in Germany, taking place in Berlin. The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics:

Information retrieval / Search - Lucene, Solr, katta or comparable solutions
NoSQL - like CouchDB, MongoDB, Jackrabbit, HBase and others
Hadoop - Hadoop itself, MapReduce, Cascading or Pig and relatives

Closely related topics not explicitly listed above are welcome. We are looking for presentations on the implementation of the systems themselves, real world applications and case studies.

Important Dates (all dates in GMT +2):
Submission deadline: April 17th 2010, 23:59
Notification of accepted speakers: May 1st, 2010
Publication of final schedule: May 9th, 2010
Conference: June 7/8, 2010

High quality, technical submissions are called for, ranging from principles to practice. We are looking for real world use cases, background on the architecture of specific projects and a deep dive into architectures built on top of e.g. Hadoop clusters. Proposals should be submitted at http://berlinbuzzwords.de/content/cfp no later than April 17th, 2010. Acceptance notifications will be sent out on May 1st. Please include your name, bio and email, the title of the talk, and a brief abstract in English. Please indicate whether you want to give a short (30min) or long (45min) presentation, and indicate the level of experience with the topic your audience should have (e.g. whether your talk will be suitable for newbies or is targeted at experienced users). The presentation format is short: either 30 or 45 minutes including questions. We will be enforcing the schedule rigorously.

If you are interested in sponsoring the event (e.g. we would be happy to provide videos after the event, free drinks for attendees as well as an after-show party), please contact us. Follow @hadoopberlin on Twitter for updates. News on the conference will be published on our website at http://berlinbuzzwords.de

Program Chairs: Isabel Drost, Jan Lehnardt, and Simon Willnauer.

Schedule and further updates on the event will be published on http://berlinbuzzwords.de Please re-distribute this CfP to people who might be interested.

Contact us at: newthinking communications GmbH, Schönhauser Allee 6/7, 10119 Berlin, Germany. Andreas Gebhard a...@newthinking.de, Isabel Drost i...@newthinking.de, +49(0)30-9210 596
Solr Performance Issues
Hi everyone, I have an index corresponding to ~2.5 million documents; the index size is 43GB. The configuration of the machine running Solr is: Dual Processor Quad Core Xeon 5430 - 2.66GHz (Harpertown) - 2 x 12MB cache, 8GB RAM, and 250 GB HDD. I'm observing a strange trend in the queries that I send to Solr: the times for queries sent earlier are much lower than for queries sent later. For instance, if I write a script to query Solr 5000 times (with 5000 distinct queries, most of them containing not more than 3-5 words) with 10 threads running in parallel, the average query time goes from ~50ms in the beginning to ~6000ms. Is this expected, or is there something wrong with my configuration? Currently I've configured the queryResultCache and the documentCache to contain 2048 entries (hit ratios for both are close to 50%). Apart from this, a general question I want to ask: is such hardware enough for this scenario? I'm aiming at achieving around 20 queries per second with the hardware mentioned above. Thanks, Regards, -- - Siddhant
Re: mincount doesn't work with FacetQuery
: I'm faceting with a query range (with addFacetQuery) and setting mincount to
: 10 (with setFacetMinCount(10)), but Solr is not respecting this mincount;
: it's still giving me all responses, even those having less than 10 retrieved
: documents.

If by all responses you mean all facet queries, then that is the correct behavior -- facet.mincount is a param that affects facet.field, not facet.query. The documentation notes this, in that all of the params are divided by section... http://wiki.apache.org/solr/SimpleFacetParameters ...if you'd like to open a feature request, it would be fairly easy to make facet.query (and facet.date) consider facet.mincount as well. -Hoss
Re: Solr Performance Issues
How many outstanding queries do you have at a time? Is it possible that when you start you have only a few queries executing concurrently, but as your test runs you have hundreds? This really is a question of how your load test is structured. You might get a better sense of how it works if your tester had a limited number of threads running, so the max concurrent requests SOLR was serving at once were capped (30, 50, whatever) - see the sketch after the quoted message below. But no, I wouldn't expect SOLR to bog down the way you're describing just because it was running for a while. HTH Erick

On Thu, Mar 11, 2010 at 9:39 AM, Siddhant Goel siddhantg...@gmail.com wrote:
Hi everyone, I have an index corresponding to ~2.5 million documents; the index size is 43GB. The configuration of the machine running Solr is: Dual Processor Quad Core Xeon 5430 - 2.66GHz (Harpertown) - 2 x 12MB cache, 8GB RAM, and 250 GB HDD. I'm observing a strange trend in the queries that I send to Solr: the times for queries sent earlier are much lower than for queries sent later. For instance, if I write a script to query Solr 5000 times (with 5000 distinct queries, most of them containing not more than 3-5 words) with 10 threads running in parallel, the average query time goes from ~50ms in the beginning to ~6000ms. Is this expected, or is there something wrong with my configuration? Currently I've configured the queryResultCache and the documentCache to contain 2048 entries (hit ratios for both are close to 50%). Apart from this, a general question I want to ask: is such hardware enough for this scenario? I'm aiming at achieving around 20 queries per second with the hardware mentioned above. Thanks, Regards, -- - Siddhant
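A minimal sketch, not from the thread, of the capped-concurrency load test Erick suggests above: a fixed-size thread pool guarantees that at most MAX_CONCURRENT requests hit Solr at once, so latency can be read against a known concurrency level. The cap and the query list are assumptions.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;

public class CappedLoadTest {
    static final int MAX_CONCURRENT = 30;

    public static void run(final SolrServer server, List<String> queries) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(MAX_CONCURRENT);
        final AtomicLong totalMs = new AtomicLong();
        final AtomicLong done = new AtomicLong();
        for (final String qs : queries) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        long t0 = System.currentTimeMillis();
                        server.query(new SolrQuery(qs));
                        totalMs.addAndGet(System.currentTimeMillis() - t0);
                        done.incrementAndGet();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        System.out.println("avg ms: " + totalMs.get() / Math.max(1, done.get()));
    }
}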
Content Highlighting
With the highlighting options, will Solr highlight the found text, something like Google search does? I can't seem to get this working. Hope someone can advise.
Re: Content Highlighting
Please see: http://wiki.apache.org/solr/UsingMailingLists and repost with additional information. Best Erick

On Thu, Mar 11, 2010 at 10:10 AM, Lee Smith l...@weblee.co.uk wrote:
With the highlighting options, will Solr highlight the found text, something like Google search does? I can't seem to get this working. Hope someone can advise.
release schedule?
Hello, I'm new to this list, so please excuse me if I'm asking in the wrong place. I have been tasked with planning the next release of our software. Today we are using Solr 1.4.0, and we plan to release a new version of our software later this year. I would like to know, if possible:
- Are there any planned Solr releases for this year?
- What are the planned release dates/contents, etc.?
- Are there any beta releases to work with in the meantime?
Thank you, Harold Ship NGSoft
Re: Call for presentations - Berlin Buzzwords - Summer 2010
On 11.03.2010 Isabel Drost wrote: Call for Presentations Berlin Buzzwords

It should have been http://berlinbuzzwords.de of course... Isabel
Re: distinct on my result
hey, okay I show you my settings ;) I use an extra core with the standard request handler.

SCHEMA.XML

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="name" type="text" indexed="true" stored="true" required="true"/>
<field name="suggest" type="autocomplete" indexed="true" stored="true" multiValued="true"/>
<copyField source="name" dest="suggest"/>

so I copy my names to the field suggest and use the EdgeNGramFilter and some others:

<fieldType name="autocomplete" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="100" minGramSize="1"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="100" minGramSize="1"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

so with this config I get the results above ... maybe I have too many filters ;) ?!

gwk-4 wrote:
Hi, I'm no expert on the full-text search features of Solr, but I guess that has something to do with your fieldtype, or query. Are you using the standard request handler or dismax for your queries? And what analysers are you using on your product name field? Regards, gwk

On 3/11/2010 3:24 PM, stocki wrote:
okay. we have a lot of products and I just imported the name of each product into a core, made an edgengram on it, and my autoCOMPLETION runs. but I want an auto-suggestion. example: autocompletion -- I: harry O: harry potter... but when the input is -- I: potter -- O: / so what I want is that I get "harry potter ..." when I type potter into my search field! any idea? I think the solution is a mix of TermsComponent and EdgeNGram, or not? I am a little bit desperate, and in this forum there is too much information about it =(

gwk-4 wrote:
Hi, The autosuggest core is filled by a simple script (written in PHP) which requests facet values for all the possible strings one can search for and adds them one by one as a document. Our case has some special issues due to the fact that we search in multiple languages (typing España will suggest Spain, and the other way around when on the Spanish site). We have about 97500 documents yielding approximately 12500 different documents in our autosuggest core, and the autosuggest-update script takes about 5 minutes to do a full re-index (all this is done on a separate server and replicated, so the indexing has no impact on the performance of the site). Regards, gwk

On 3/10/2010 3:09 PM, stocki wrote:
okay. thx my suggestion runs in another core ;) do you distinct during the import with DIH?
Re: Solr Performance Issues
Hi Erick, The way the load test works is that it picks up 5000 queries and splits them according to the number of threads (so if we have 10 threads, it schedules 10 threads, each one sending 500 queries). So it might be possible that the number of queries at a later point in time is greater than the number earlier; I'm not very sure about that though. It's a simple Ruby script that starts up threads, calls the search function in each thread, and then waits for each of them to exit. How many queries per second can we expect Solr to serve, given this kind of hardware? If what you suggest is true, then is it possible that while Solr is serving a query, another query hits it, which increases the response time even further? I'm not sure about it, but yes, I can observe the query times going up as I increase the number of threads. Thanks, Regards,

On Thu, Mar 11, 2010 at 8:30 PM, Erick Erickson erickerick...@gmail.com wrote:
How many outstanding queries do you have at a time? Is it possible that when you start you have only a few queries executing concurrently, but as your test runs you have hundreds? This really is a question of how your load test is structured. You might get a better sense of how it works if your tester had a limited number of threads running, so the max concurrent requests SOLR was serving at once were capped (30, 50, whatever). But no, I wouldn't expect SOLR to bog down the way you're describing just because it was running for a while. HTH Erick

On Thu, Mar 11, 2010 at 9:39 AM, Siddhant Goel siddhantg...@gmail.com wrote:
Hi everyone, I have an index corresponding to ~2.5 million documents; the index size is 43GB. The configuration of the machine running Solr is: Dual Processor Quad Core Xeon 5430 - 2.66GHz (Harpertown) - 2 x 12MB cache, 8GB RAM, and 250 GB HDD. I'm observing a strange trend in the queries that I send to Solr: the times for queries sent earlier are much lower than for queries sent later. For instance, if I write a script to query Solr 5000 times (with 5000 distinct queries, most of them containing not more than 3-5 words) with 10 threads running in parallel, the average query time goes from ~50ms in the beginning to ~6000ms. Is this expected, or is there something wrong with my configuration? Currently I've configured the queryResultCache and the documentCache to contain 2048 entries (hit ratios for both are close to 50%). Apart from this, a general question I want to ask: is such hardware enough for this scenario? I'm aiming at achieving around 20 queries per second with the hardware mentioned above. Thanks, Regards, -- - Siddhant

-- - Siddhant
Re: distinct on my result
Hi, Try replacing KeywordTokenizerFactory with a WhitespaceTokenizerFactory so it'll create separate terms per word. After a reindex it should work. Regards, gwk

On 3/11/2010 4:33 PM, stocki wrote:
hey, okay I show you my settings ;) I use an extra core with the standard request handler.

SCHEMA.XML

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="name" type="text" indexed="true" stored="true" required="true"/>
<field name="suggest" type="autocomplete" indexed="true" stored="true" multiValued="true"/>
<copyField source="name" dest="suggest"/>

so I copy my names to the field suggest and use the EdgeNGramFilter and some others:

<fieldType name="autocomplete" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="100" minGramSize="1"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="100" minGramSize="1"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

so with this config I get the results above ... maybe I have too many filters ;) ?!

gwk-4 wrote:
Hi, I'm no expert on the full-text search features of Solr, but I guess that has something to do with your fieldtype, or query. Are you using the standard request handler or dismax for your queries? And what analysers are you using on your product name field? Regards, gwk

On 3/11/2010 3:24 PM, stocki wrote:
okay. we have a lot of products and I just imported the name of each product into a core, made an edgengram on it, and my autoCOMPLETION runs. but I want an auto-suggestion. example: autocompletion -- I: harry O: harry potter... but when the input is -- I: potter -- O: / so what I want is that I get "harry potter ..." when I type potter into my search field! any idea? I think the solution is a mix of TermsComponent and EdgeNGram, or not? I am a little bit desperate, and in this forum there is too much information about it =(

gwk-4 wrote:
Hi, The autosuggest core is filled by a simple script (written in PHP) which requests facet values for all the possible strings one can search for and adds them one by one as a document. Our case has some special issues due to the fact that we search in multiple languages (typing España will suggest Spain, and the other way around when on the Spanish site). We have about 97500 documents yielding approximately 12500 different documents in our autosuggest core, and the autosuggest-update script takes about 5 minutes to do a full re-index (all this is done on a separate server and replicated, so the indexing has no impact on the performance of the site). Regards, gwk

On 3/10/2010 3:09 PM, stocki wrote:
okay. thx my suggestion runs in another core ;) do you distinct during the import with DIH?
Re: Snapshot / Distribution Process
Have you started rsyncd on the master? Make sure that it is enabled before you start: http://wiki.apache.org/solr/SolrCollectionDistributionOperationsOutline You can also try running snappuller with the -V option to get more debugging info. Bill

On Wed, Mar 10, 2010 at 4:09 PM, Lars R. Noldan l...@sixfeetup.com wrote:
Is anyone aware of a comprehensive guide for setting up the Snapshot Distribution process on Solr 1.3? I'm working through: http://wiki.apache.org/solr/CollectionDistribution#The_Snapshot_and_Distribution_Process and have run into a roadblock where solr/bin/snappuller finds the appropriate snapshot, but rsync fails (according to the logs). Any guidance you can provide, even if it's just asking for additional troubleshooting information, is welcome and appreciated. Thanks Lars -- l...@sixfeetup.com | +1 (317) 861-5948 x609 six feet up presents INDIGO : The Help Line for Plone More info at http://sixfeetup.com/indigo or call +1 (866) 749-3338
What do ~2, ~3, ~4 mean in DisjunctionMaxQuery?
I am debugging a 2-word query built using dismax, so it's built from DisjunctionMaxQueries with minShouldMatch 100% and tie breaker multiplier = 0.3:

+((DisjunctionMaxQuery((content:john | title:john)~0.3) DisjunctionMaxQuery((content:malone | title:malone)~0.3))~2)

And a 3-word one (with the same tie and mm):

+((DisjunctionMaxQuery((content:john^3.0 | region:john)~0.3) DisjunctionMaxQuery((content:malone^3.0 | region:malone)~0.3) DisjunctionMaxQuery((content:lawyer^3.0 | region:lawyer)~0.3))~3)

I have tried to read the Lucene documentation for DisjunctionMaxQuery carefully but can't find what ~2 (for the first query) and ~3 (for the second query) do. In case I search for 4 words it will be ~4. I know ~ is used to specify the slop in phrase queries. Does it mean any sort of slop here in the DisjunctionMaxQueries? Thanks in advance
Re: What do ~2, ~3, ~4 mean in DisjunctionMaxQuery?
On Mar 11, 2010, at 11:42 AM, Marc Sturlese wrote:
I am debugging a 2-word query built using dismax, so it's built from DisjunctionMaxQueries with minShouldMatch 100% and tie breaker multiplier = 0.3:
+((DisjunctionMaxQuery((content:john | title:john)~0.3) DisjunctionMaxQuery((content:malone | title:malone)~0.3))~2)

The ~2 is BooleanQuery's way of saying the minimum number that should match value.

And a 3-word one (with the same tie and mm):
+((DisjunctionMaxQuery((content:john^3.0 | region:john)~0.3) DisjunctionMaxQuery((content:malone^3.0 | region:malone)~0.3) DisjunctionMaxQuery((content:lawyer^3.0 | region:lawyer)~0.3))~3)

And likewise for ~3 here. It's being computed based on the mm parameter you're providing, which is 100%.

I know ~ is used to specify the slop in phrase queries. Does it mean any sort of slop here in the DisjunctionMaxQueries?

It's actually purely on the BooleanQuery for that factor. Erik
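A small Lucene sketch, not from the thread, of what Erik describes: the trailing ~N is BooleanQuery's toString() rendering of minimumNumberShouldMatch, which dismax derives from the mm parameter. Field names mirror the queries above.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.TermQuery;

public class MinShouldMatchDemo {
    public static void main(String[] args) {
        BooleanQuery outer = new BooleanQuery();
        for (String word : new String[] {"john", "malone"}) {
            DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(0.3f); // tie breaker
            dmq.add(new TermQuery(new Term("content", word)));
            dmq.add(new TermQuery(new Term("title", word)));
            outer.add(dmq, BooleanClause.Occur.SHOULD);
        }
        outer.setMinimumNumberShouldMatch(2); // mm=100% of two words
        System.out.println(outer); // toString() renders a trailing ~2
    }
}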
Re: Snapshot / Distribution Process
: Subject: Snapshot / Distribution Process
: In-Reply-To: 27854256.p...@talk.nabble.com

http://people.apache.org/~hossman/#threadhijack

Thread Hijacking on Mailing Lists: When starting a new discussion on a mailing list, please do not reply to an existing message; instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to, and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult.

See Also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking

-Hoss
Re: HTMLStripTransformer not working with data importer
Hi - I can't seem to make any of the transformers work. I am using the DataImporter to pull in data from a wordpress instance (see below). Neither REGEX nor HTMLStrip seems to do anything to my content. Do I have to include a separate jar with the transformers? Are the transformers in 1.4 (particularly the HTMLStrip)? James

On Wed, Mar 10, 2010 at 10:47 PM, James Ostheimer james.osthei...@gmail.com wrote:
Hi - I am working a contract to index some wordpress data. For the posts I of course have html in the content of the column; I'd like to strip it out. Here is my data importer config:

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/econetsm" user="***" password="***"/>
  <document>
    <entity name="post" transformer="HTMLStripTransformer"
            query="SELECT id, post_content, post_title FROM elinstmkting_posts e"
            onError="abort"
            deltaQuery="SELECT * FROM elinstmkting_posts e where post_modified_gmt > '${dataimporter.last_index_time}'">
      <field column="POST_TITLE" name="post_title" stripHTML="false"/>
      <field column="POST_CONTENT" name="post_content" stripHTML="true"/>
    </entity>
  </document>
</dataConfig>

It looks perfect according to the wiki docs, but the html is found when I search for strong (the <strong> tag) and html is returned in the field. I assume I am doing something stupid wrong; I am using the latest stable Solr (1.4.0). Does it matter that the post data is not a complete html document (it doesn't have an html start tag or a body tag)? James
How to edit / compile the SOLR source code
Hi, Sorry for asking this very simple question, but I am very new to Solr and I want to play with its source code. As an initial step I have a requirement to enable wildcard search (*text) in Solr. I am trying to figure out a way to import the complete Solr build into Eclipse and edit QueryParsing.java, but I am not able to import it (I tried to import with ant project in Eclipse, selected the build.xml file, and got an error stating javac is not present in the build.xml file). Can someone help me out with the initial steps on how to import / edit / compile / test the Solr source? Thanks a lot for your help! Thanks, B
Re: Architectural help
: We have about 5M records ranging in size all coming from a DB source (only 2
: tables). What will be the most efficient way of indexing all of these
: documents? I am looking at DIH but before I go down that road I wanted to

The main question to ask yourself is what your indexing-freshness requirements are. If you have a small amount of data, or if a large percentage of your data is changing all the time, and you can tolerate lag in how quickly updates to your data make it into the index, then doing complete full rebuilds (with DIH or anything else) periodically is certainly the simplest way to go. If you have a lot of data, or only a small percentage of your data is changing within the largest interval of time you are willing to wait before your index is updated, then a batch delta-indexing approach like DIH's deltaQuery provides is only a little bit more effort on top of implementing full builds. If you really need your index to be updated as soon as the authoritative data changes, then having your publishing flow immediately make changes to the index by pushing them over HTTP to the /update API is probably your best bet. -Hoss
Re: Scaling indexes with high document count
: I wonder if anyone might have some insight/advice on index scaling for high
: document count vs size deployments...

Your general approach sounds reasonable, although specifics of how you'll need to tune the caches and how much hardware you'll need will largely depend on the specifics of the data and the queries. I'm not sure what you mean by this though...

: As searching would always be performed on replicas - the indexing cores
: wouldn't be tuned with much autowarming/read cache, but have loads of
: 'maxdocs' cache. The searchers would be the other way 'round - lots of

what do you mean by 'maxdocs' cache ?

-Hoss
issue with delete index
Hi, I have made some changes to my schema, including setting omitNorms to false for a few fields. I am using Solr 1.4 with the SolrJ client. I deleted my index using the client:

solrserver.deleteByQuery("*:*");
solrserver.optimize();

But after reindexing and running the queries I don't see any difference in query results, as if it didn't take the omitNorms settings into consideration. Can anyone tell me how to delete the index entirely so that the new changes can take effect? Thanks!! -M
Re: embedded server / servlet container
: I am trying to provide an embedded server to a web application deployed in a
: servlet container (like tomcat).

If you are trying to use Solr inside another webapp, my suggestion would just be to incorporate the existing Solr servlets, jsps, dispatch filter, and web.xml specifics from Solr into your app, and let them do their own thing -- it's going to make your life much easier from an upgrade standpoint. Better still: run solr.war as its own webapp in the same servlet container. -Hoss
Highlighting Results
Hi All, I'm not sure where I'm going wrong, but highlighting does not seem to work for me. I have indexed around 5000 PDF documents, which went well. Running normal queries against attr_content works well. When adding any hl options it does not seem to make a bit of difference. Here is an example query:

?q=attr_content:Some Name&hl=true&hl.fl=attr_content&hl.fragsize=50&rows=5

If I am correct, fragsize should be limiting the returned content for attr_content, and the keywords found in attr_content should be surrounded with <em> tags? The attr_content field is stored, if this helps. Hope someone can point me in the right direction. Thank you if you can!
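A minimal SolrJ highlighting sketch, not from the thread, assuming a stored attr_content field and the Solr 1.4 defaults; note that the highlighted snippets come back in a separate highlighting section of the response, not in the stored field itself.

import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HighlightDemo {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("attr_content:\"Some Name\"");
        q.setHighlight(true);              // hl=true
        q.addHighlightField("attr_content");
        q.setHighlightFragsize(50);        // hl.fragsize=50
        q.setRows(5);

        QueryResponse rsp = server.query(q);
        // doc id -> (field -> snippets, with matches wrapped in <em> tags)
        Map<String, Map<String, List<String>>> hl = rsp.getHighlighting();
        System.out.println(hl);
    }
}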
Re: How to edit / compile the SOLR source code
Yep, as you've discovered, the import-from-ant-build-file doesn't work for the Solr build.xml in Eclipse. There is an excellent how-to for getting Solr up and running in Eclipse for debugging purposes here: http://www.lucidimagination.com/developers/articles/setting-up-apache-solr-in-eclipse

Once you have the setup in place from the above tutorial, you can go to any of the Solr jar files and attach the source, which will allow you to debug into and modify the Solr code. If you need to step into any of the Lucene code you'll have to pull it down separately, but you can attach it the same way. The last step (after you've made your changes) is to rebuild with Ant (run ant from the directory containing the build.xml file to see the build options for Solr). I think that just running ant example there should do the trick. -Trey

On Thu, Mar 11, 2010 at 12:07 PM, JavaGuy84 bbar...@gmail.com wrote:
Hi, Sorry for asking this very simple question, but I am very new to Solr and I want to play with its source code. As an initial step I have a requirement to enable wildcard search (*text) in Solr. I am trying to figure out a way to import the complete Solr build into Eclipse and edit QueryParsing.java, but I am not able to import it (I tried to import with ant project in Eclipse, selected the build.xml file, and got an error stating javac is not present in the build.xml file). Can someone help me out with the initial steps on how to import / edit / compile / test the Solr source? Thanks a lot for your help! Thanks, B
Re: field length normalization
Did you reindex after setting omitNorms to false? I'm not sure whether or not it is needed, but it makes sense.

On Thu, Mar 11, 2010 at 5:34 PM, muneeb muneeba...@hotmail.com wrote:
Hi, In my schema the document title field has omitNorms=false which, if I am not wrong, causes the length of titles to be counted in the scoring. But when I query with: word1 word2 word3, I don't know why the top two documents' titles have these words plus other words, whereas the document whose title has exactly and only these query words comes in third place. Setting omitNorms to false should bring the titles with the exact words to the top, shouldn't it? Also I realized when I debugged the query that all three top documents have the same score; shouldn't this differ, since they have different title lengths? Thanks very much. -A

-- - Siddhant
Re: field length normalization
: Did you reindex after setting omitNorms to false? I'm not sure whether or
: not it is needed, but it makes sense.

Yes, I deleted the old index and reindexed it. Just to add another fact: the titles' lengths are less than 10. I am not sure if Solr has pre-set values for length normalization, because for titles with 3 as well as 4 terms the fieldNorm comes up as 0.5 (in the debugQuery section).
dismax and WordDelimiterFilterFactory with PreserveOriginal = 1
Hi all, I'm facing the same issue as the previous post here: http://www.mail-archive.com/solr-user@lucene.apache.org/msg19511.html. Since no one answered this post, I thought I'd ask again. In my case, I use the setting below for index:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>

and

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>

for query. When I query with the word ain't, no result is returned. When I turned on logging, I found the word is interpreted as (ain't ain) t:

0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  0.0 = no match on required clause ((description:(ain't ain) t^2.0 | name:(ain't ain) t^3.0 | search_keywords:(ain't ain) t)~0.1)

Does anyone know why ain't is parsed as (ain't ain) t, and how to fix it so it can match documents that include ain't in the name? Thanks in advance! Wen
Profiling Solr
Hi, I'm trying to identify the bottleneck so I can get acceptable performance out of a single shard containing 4.7 million documents, using my own machine (Mac Pro - Quad Core with 8Gb of RAM, 4Gb of it allocated to the JVM). I tried using YourKit but I don't get anything about Solr classes. I'm new to YourKit so I might be doing something wrong, but it seems pretty straightforward. I am running Solr within a Tomcat instance within Eclipse. Does anyone have an idea about what could be wrong in my setup? I'm making individual requests (one at a time) and the response times are horrible (about 15 sec on average). I need to bring this way below 1 second. Here is a sample query:

http://localhost:8983/jobs_part3/select/?q=*:*&collapse=true&collapse.field=hash_id&facet=true&facet.field=county_id&facet.field=advertiser_id&facet.field=county_id&sort=county_id+asc&rows=100&collapse.type=adjacent

I know that collapsing results has a big hit on performance, but it is a must-have for us. Thanks for any hints.

= JVM Parameters =
-Xms4g -Xmx4g -d64 -server
Re: Profiling Solr
On Thu, Mar 11, 2010 at 1:11 PM, Jean-Sebastien Vachon js.vac...@videotron.ca wrote: Hi, I'm trying to identify the bottleneck to get acceptable performance of a single shard containing 4.7 millions of documents using my own machine (Mac Pro - Quad Core with 8Gb of RAM with 4Gb allocated to the JVM). I tried using YourKit but I don't get anything about Solr classes. Sometimes org.apache.* can be in the ignore list by default along with java.*, I guess because people are looking for bottlenecks in their own code and don't want to look into other libraries. -Yonik http://www.lucidimagination.com
Multi valued fields
Hi All, I'd like to know if it is possible to do the following on a multi-valued field. Given the following data:

document A: field1 = [A B C D]
document B: field1 = [A B]
document C: field1 = [A]

can I build a query such as -field1:A which returns all documents that do not exclusively have A in their field's values? By exclusive I mean that I don't want documents that only have A in their list of values. In my sample case, the query would return docs A and B, because they both have other values in field1. Is this kind of query possible with Solr/Lucene? Thanks
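Not an answer from the thread, but one common workaround worth sketching: index the number of values alongside the multi-valued field (the field1_count name is a made-up example), so "contains A plus something else" becomes field1:A AND field1_count:[2 TO *].

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ValueCountIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "A");
        String[] values = {"A", "B", "C", "D"};
        for (String v : values) {
            doc.addField("field1", v);                // multi-valued field
        }
        doc.addField("field1_count", values.length);  // hypothetical int field
        server.add(doc);
        server.commit();
        // query time: q=field1:A AND field1_count:[2 TO *]
    }
}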
Re: issue with delete index
On Thu, Mar 11, 2010 at 12:22 PM, muneeb muneeba...@hotmail.com wrote:
I have made some changes to my schema, including setting omitNorms to false for a few fields. I am using Solr 1.4 with the SolrJ client. I deleted my index using the client:
solrserver.deleteByQuery("*:*");
solrserver.optimize();

Solr implements a *:* delete by removing the index, so this should have been fine.

But after reindexing and running the queries I don't see any difference in query results, as if it didn't take the omitNorms settings into consideration.

Did you restart Solr so that the schema was re-read? -Yonik http://www.lucidimagination.com
Re: dismax and WordDelimiterFilterFactory with PreserveOriginal = 1
On Thu, Mar 11, 2010 at 1:07 PM, Ya-Wen Hsu y...@eline.com wrote:
Hi all, I'm facing the same issue as the previous post here: http://www.mail-archive.com/solr-user@lucene.apache.org/msg19511.html. Since no one answered this post, I thought I'd ask again. In my case, I use the setting below for index:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>

and

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>

for query. When I query with the word ain't, no result is returned. When I turned on logging, I found the word is interpreted as (ain't ain) t.

The problem is preserving the original in the query analyzer - try removing that. And if you aren't doing prefix or wildcard queries, preserveOriginal doesn't buy you anything but wasted index space. It's the same issue of why you can't generate and catenate at the same time with the query parser. -Yonik http://www.lucidimagination.com
Re: How to edit / compile the SOLR source code
See Trey's comment, but before you go there: what about Solr's wildcard searching capabilities isn't working for you now? There are a couple of tricks for making leading-wildcard searches work quickly, but this is a solved problem. Although whether the existing solutions work in your situation may be an open question... Or do you have to hack into the parser for other reasons? Best Erick

On Thu, Mar 11, 2010 at 12:07 PM, JavaGuy84 bbar...@gmail.com wrote:
Hi, Sorry for asking this very simple question, but I am very new to Solr and I want to play with its source code. As an initial step I have a requirement to enable wildcard search (*text) in Solr. I am trying to figure out a way to import the complete Solr build into Eclipse and edit QueryParsing.java, but I am not able to import it (I tried to import with ant project in Eclipse, selected the build.xml file, and got an error stating javac is not present in the build.xml file). Can someone help me out with the initial steps on how to import / edit / compile / test the Solr source? Thanks a lot for your help! Thanks, B
RE: Cleaning up dirty OCR
Thanks Robert, I've been thinking about this since you suggested it on another thread. One problem is that it would also remove real words. Apparently 40-60% of the words in large corpora occur only once (http://en.wikipedia.org/wiki/Hapax_legomenon).

There are a couple of use cases where removing words that occur only once might be a problem. One is genealogical searches, where a user might want to retrieve a document if their relative is mentioned only once in the document. We have quite a few government documents and other resources such as the Lineage Book of the Daughters of the American Revolution. Another use case is humanities researchers doing phrase searching for quotes. In this case, if we remove one of the words in the quote because it occurs only once in a document, then the phrase search would fail. For example, if someone were searching Macbeth and entered the phrase query "Eye of newt and toe of frog", it would fail if we had removed newt from the index, because newt occurs only once in Macbeth.

I ran a quick check against a couple of our copies of Macbeth and found that out of about 5,000 unique words, about 3,000 occurred only once. Of these, about 1,800 were in the unix dictionary, so at least 1,800 of the words that would be removed would be real words as opposed to OCR errors (a spot check of the words not in the unix /usr/share/dict/words file revealed most of them to also be real words rather than OCR errors). I also ran a quick check against a document with bad OCR, and out of about 30,000 unique words, 20,000 occurred only once. Of those 20,000, only about 300 were in the unix dictionary, so your intuition that a lot of OCR errors will occur only once seems spot on. A quick look at the words not in the dictionary revealed a mix of technical terms, common names, and obvious OCR nonsense such as ffll.lj'slall'lm.

I guess the question I need to determine is whether the benefit of removing words that occur only once outweighs the costs in terms of the two use cases outlined above. When we get our new test server set up, sometime in the next month, I think I will go ahead and prune a test index of 500K docs and do some performance testing, just to get an idea of the potential performance gains of pruning the index.

I have some other questions about index pruning, but I want to do a bit more reading and then I'll post a question to either the Solr or Lucene list. Can you suggest which list I should post an index pruning question to?

Tom

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Tuesday, March 09, 2010 2:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Cleaning up dirty OCR

: Can anyone suggest any practical solutions to removing some fraction of
: the tokens containing OCR errors from our input stream?

one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812 and filter terms that only appear once in the document.

-- Robert Muir rcm...@gmail.com
Re: dismax and WordDelimiterFilterFactory with PreserveOriginal = 1
Kind of a shot in the dark here, but your parameters for index and query on WordDelimiterFilterFactory are different; especially suspicious is catenateWords. You could test this by looking in your index with the SOLR admin page and/or Luke to see what your actual terms are. And don't forget you'll have to re-index after restarting SOLR for any index changes to take effect. HTH Erick

On Thu, Mar 11, 2010 at 2:20 PM, Ya-Wen Hsu y...@eline.com wrote:
Yonik, thank you for your reply. When I don't use preserveOriginal=1 for WordDelimiterFilterFactory, the query ain't is parsed as ain t and no match is found in this case either. If I remove the ' from the query, then I can get results. I used the analysis tool and saw the term ain't processed as ain t, and got matches when the title includes ain't. But I got no result when using the ain't query with dismax. The debug output looks like:

(NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
+(long_description:ain t^2.0 | name:ain t^3.0 | search_keywords:ain t)~0.1 (long_description:save^2.0 | name:save^3.0 | search_keywords:saved)~0.1) ()

Below is my configuration for the text field type:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/ -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

I get results back when I try to use solr.LowerCaseTokenizerFactory instead of solr.WhitespaceTokenizerFactory. However, the concern here is that this might reduce relevance quality. Does anyone have a better idea on what to try next? Thanks! Wen

-----Original Message-----
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Thursday, March 11, 2010 10:51 AM
To: solr-user@lucene.apache.org
Subject: Re: dismax and WordDelimiterFilterFactory with PreserveOriginal = 1

On Thu, Mar 11, 2010 at 1:07 PM, Ya-Wen Hsu y...@eline.com wrote:
Hi all, I'm facing the same issue as the previous post here: http://www.mail-archive.com/solr-user@lucene.apache.org/msg19511.html. Since no one answered this post, I thought I'd ask again.
In my case, I use below setting for index:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>

and for query:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>

When I use a query with the word "ain't", no result is returned. When I turned on the logging, I found the word is interpreted as (ain't ain) t.

The problem is preserving the original in the query analyzer - try removing that. And if you aren't doing prefix or wildcard queries, preserveOriginal doesn't buy you anything but wasted index space. It's the same issue of why you can't generate and catenate at the same time with the query parser. -Yonik http://www.lucidimagination.com
Re: Cleaning up dirty OCR
On Thu, Mar 11, 2010 at 3:37 PM, Burton-West, Tom tburt...@umich.edu wrote: Thanks Robert, I've been thinking about this since you suggested it on another thread. One problem is that it would also remove real words. Apparently 40-60% of the words in large corpora occur only once (http://en.wikipedia.org/wiki/Hapax_legomenon).

You are correct. I really hate recommending you 'remove data', but at the same time, as perhaps an intermediate step, this could be a brutally simple approach to move you along.

I guess the question I need to answer is whether the benefit of removing words that occur only once outweighs the costs in terms of the two use cases outlined above. When we get our new test server set up, sometime in the next month, I think I will go ahead and prune a test index of 500K docs and do some performance testing just to get an idea of the potential performance gains of pruning the index.

Well, one thing I did with Andrzej's patch is immediately relevance-test this approach against some corpora I had. The results are on the JIRA issue, and the test collection itself is in openrelevance. In my opinion the P@n is probably overstated, and the MAP values are probably understated (due to it being a pooled relevance collection), but I think it's fair to say that for that specific large text collection, pruning terms that appear in a document only a single time does not hurt relevance. At the same time I will not dispute that it could actually help P@n; I am just saying I'm not sold :) Either way it's extremely interesting: cut your index size in half and get the same relevance!

I have some other questions about index pruning, but I want to do a bit more reading and then I'll post a question to either the Solr or Lucene list. Can you suggest which list I should post an index pruning question to?

I would recommend posting it to the JIRA issue: http://issues.apache.org/jira/browse/LUCENE-1812 This way someone who knows more (Andrzej) could see it, too. -- Robert Muir rcm...@gmail.com
Re: Cleaning up dirty OCR
Thanks Simon, We can probably implement your suggestion about runs of punctuation and unlikely mixes of alpha/numeric/punctuation. I'm also thinking about looking for unlikely mixes of unicode character blocks. For example, some of the CJK material ends up with Cyrillic characters. (except we would have to watch out for any Russian-Chinese dictionaries :) Tom

There wasn't any completely satisfactory solution; there were a large number of two and three letter n-grams, so we were able to use a dictionary approach to eliminate those (names tend to be longer). We also looked for runs of punctuation, unlikely mixes of alpha/numeric/punctuation, and also eliminated longer words which consisted of runs of not-occurring-in-English bigrams. Hope this helps -Simon -- View this message in context: http://old.nabble.com/Cleaning-up-dirty-OCR-tp27840753p27869940.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Cleaning up dirty OCR
On Thu, Mar 11, 2010 at 4:14 PM, Tom Burton-West tburtonw...@gmail.com wrote: Thanks Simon, We can probably implement your suggestion about runs of punctuation and unlikely mixes of alpha/numeric/punctuation. I'm also thinking about looking for unlikely mixes of unicode character blocks. For example, some of the CJK material ends up with Cyrillic characters. (except we would have to watch out for any Russian-Chinese dictionaries :)

OK, this is a new one for me; I am just curious, have you figured out why this is happening? Separately, I would love to see some character frequency data for your non-English text; are you OCR'ing that data too? Are you using Unicode normalization or anything to prevent explosion of terms that are really the same? -- Robert Muir rcm...@gmail.com
Re: Cleaning up dirty OCR
: We can probably implement your suggestion about runs of punctuation and : unlikely mixes of alpha/numeric/punctuation. I'm also thinking about : looking for unlikely mixes of unicode character blocks. For example some of : the CJK material ends up with Cyrillic characters. (except we would have to : watch out for any Russian-Chinese dictionaries :)

Since you are dealing with multiple languages, and multiple variant usages of languages (i.e. olde english), I wonder if one way to try and generalize the idea of unlikely letter combinations into a math problem (instead of a grammar/spelling problem) would be to score all the hapax legomenon words in your index based on the frequency of (character) N-grams in each of those words, relative to the entire corpus, and then eliminate any of the hapax legomenon words whose score is below some cutoff threshold (that you'd have to pick arbitrarily, probably by eyeballing the sorted list of words and their contexts to decide if they are legitimate)? -Hoss
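[A toy rendering of Hoss's scoring idea (my own sketch, not code from the thread): build character-trigram counts over the vocabulary, score each word by its average trigram probability, and eyeball a cutoff. The sample vocabulary is made up; OCR junk like "ffllj" scores visibly lower than words sharing common trigrams.]

import java.util.*;

public class NgramScorer {
    static Map<String, Integer> tri = new HashMap<String, Integer>();
    static long total = 0;

    static void addWord(String w) {
        String padded = "_" + w + "_";          // mark word boundaries
        for (int i = 0; i + 3 <= padded.length(); i++) {
            String g = padded.substring(i, i + 3);
            Integer c = tri.get(g);
            tri.put(g, c == null ? 1 : c + 1);
            total++;
        }
    }

    static double score(String w) {
        String padded = "_" + w + "_";
        double sum = 0;
        int n = 0;
        for (int i = 0; i + 3 <= padded.length(); i++) {
            Integer c = tri.get(padded.substring(i, i + 3));
            sum += (c == null ? 0 : c) / (double) total;
            n++;
        }
        return n == 0 ? 0 : sum / n;            // average trigram probability
    }

    public static void main(String[] args) {
        String[] vocab = { "note", "notes", "newt", "nets", "tone", "ffllj" };
        for (String w : vocab) addWord(w);
        for (String w : vocab)
            System.out.printf("%-8s %.6f%n", w, score(w)); // eyeball a cutoff
    }
}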
Re: Cleaning up dirty OCR
On Mar 11, 2010, at 1:34 PM, Chris Hostetter wrote: I wonder if one way to try and generalize the idea of unlikely letter combinations into a math problem (instead of a grammar/spelling problem) would be to score all the hapax legomenon words in your index

Hmm, how about a classifier? Common words are the yes training set, hapax legomena are the no set, and n-grams are the features. But why isn't the OCR program already doing this? wunder
RE: Scaling indexes with high document count
Hi, Thanks for your reply (and apologies for the orig msg being sent multiple times to the list - googlemail problems). I actually meant to put 'maxBufferedDocs'. I admit I'm not that familiar with this parameter, but as I understand it, it is the number of documents that are held in RAM before flushing to disk. I've noticed that ramBufferSizeMB is a similar parameter, but using memory as the threshold rather than number of docs. Is it best not to set these too high on indexers?

In my environment, all writes are done via SolrJ, where documents are placed in a SolrDocumentList and commit()ed when the list reaches 1000 (default value), or a configured commit thread interval is reached (default is 20s, whichever comes first). I suppose this is a SolrJ-side version of 'maxBufferedDocs', so maybe I don't need to set maxBufferedDocs in solrconfig? (the SolrJ 'client' is on the same machine as the index)

For the indexer cores (essentially write-only indexes), I wasn't planning on configuring extra memory for read cache (Lucene value cache or filter cache), as no queries would/should be received on these. Should I reconsider this? There'll be plenty of RAM available for indexers to use and still leave enough for the OS file system cache to do its thing. Do you have any suggestions as to what would be the best way to use this memory to achieve optimal indexing speed? The main things I do now to tune for fast indexing are:

* committing lists of docs rather than each one separately
* not optimizing too often
* bumping up the mergeFactor (I use a value of 25)

Many Thanks! Peter

Date: Thu, 11 Mar 2010 09:19:12 -0800 From: hossman_luc...@fucit.org To: solr-user@lucene.apache.org Subject: Re: Scaling indexes with high document count : I wonder if anyone might have some insight/advice on index scaling for high : document count vs size deployments... Your general approach sounds reasonable, although specifics of how you'll need to tune the caches and how much hardware you'll need will largely depend on the specifics of the data and the queries. I'm not sure what you mean by this though... : As searching would always be performed on replicas - the indexing cores : wouldn't be tuned with much autowarming/read cache, but have loads of : 'maxdocs' cache. The searchers would be the other way 'round - lots of what do you mean by 'maxdocs' cache? -Hoss
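[For reference, the SolrJ-side batching Peter describes boils down to something like the sketch below (my own illustration; the URL and field names are made up, and a real setup would also flush on a timer as he notes, rather than only on batch size).]

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        List<SolrInputDocument> buffer = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("text", "body of document " + i);
            buffer.add(doc);
            if (buffer.size() >= 1000) {     // flush the batch, then commit
                server.add(buffer);
                server.commit();
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {             // don't forget the tail
            server.add(buffer);
            server.commit();
        }
    }
}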
Re: Cleaning up dirty OCR
Interesting. I wonder, though: if we have 4 million English documents and 250 in Urdu, would the Urdu words score badly when compared to ngram statistics for the entire corpus?

hossman wrote: Since you are dealing with multiple languages, and multiple variant usages of languages (i.e. olde english), I wonder if one way to try and generalize the idea of unlikely letter combinations into a math problem (instead of a grammar/spelling problem) would be to score all the hapax legomenon words in your index based on the frequency of (character) N-grams in each of those words, relative to the entire corpus, and then eliminate any of the hapax legomenon words whose score is below some cutoff threshold (that you'd have to pick arbitrarily, probably by eyeballing the sorted list of words and their contexts to decide if they are legitimate)? -Hoss -- View this message in context: http://old.nabble.com/Cleaning-up-dirty-OCR-tp27840753p27871353.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Cleaning up dirty OCR
We've been thinking about running some kind of a classifier against each book to select books with a high percentage of dirty OCR for special processing. We haven't quite figured out a multilingual feature set yet, other than the punctuation/alphanumeric and character block ideas mentioned above. I'm not sure I understand your suggestion. Since real-word hapax legomena are generally pretty common (maybe 40-60% of unique words), wouldn't using them as the no set send mixed signals to the classifier? Tom

Walter Underwood-2 wrote: Hmm, how about a classifier? Common words are the yes training set, hapax legomena are the no set, and n-grams are the features. But why isn't the OCR program already doing this? wunder -- View this message in context: http://old.nabble.com/Cleaning-up-dirty-OCR-tp27840753p27871444.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: field length normalization
The fieldNorm is computed like this: fieldNorm = lengthNorm * documentBoost * documentFieldBoosts and the lengthNorm is: lengthNorm = 1/(numTermsInField)**.5 [note that the value is encoded as a single byte, so there is some precision loss] So the values are not pre-set for the lengthNorm, but for some counts the lengthNorm value winds up being the same because of the precision loss. Here is a list of lengthNorm values for 1 to 10 term fields:

# of terms   lengthNorm
 1           1.0
 2           .625
 3           .5
 4           .5
 5           .4375
 6           .375
 7           .375
 8           .3125
 9           .3125
10           .3125

That's why, in your example, the lengthNorm for 3 and 4 terms is the same. -Jay http://www.lucidimagination.com

On Thu, Mar 11, 2010 at 9:50 AM, muneeb muneeba...@hotmail.com wrote: : : Did you reindex after setting omitNorms to false? I'm not sure whether or : not it is needed, but it makes sense. Yes, I deleted the old index and reindexed it. Just to add another fact: the titles' length is less than 10. I am not sure if solr has pre-set values for length normalizations, because for titles with 3 as well as 4 terms the fieldNorm is coming up as 0.5 (in the debugQuery section). -- View this message in context: http://old.nabble.com/field-length-normalization-tp27862618p27867025.html Sent from the Solr - User mailing list archive at Nabble.com.
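[The single-byte encoding Jay mentions is exactly where the collisions come from. A small sketch (my own; it assumes the Lucene 2.9-era static Similarity.encodeNorm/decodeNorm methods) that reproduces the table above by round-tripping the raw lengthNorm through the byte encoding:]

import org.apache.lucene.search.Similarity;

public class NormTable {
    public static void main(String[] args) {
        for (int terms = 1; terms <= 10; terms++) {
            // raw lengthNorm: 1/sqrt(number of terms in the field)
            float lengthNorm = (float) (1.0 / Math.sqrt(terms));
            // encode to the single byte Lucene stores, then decode it back
            byte b = Similarity.encodeNorm(lengthNorm);
            System.out.printf("%2d terms: raw %.4f -> stored %.4f%n",
                    terms, lengthNorm, Similarity.decodeNorm(b));
        }
    }
}

[Running this shows 3-term and 4-term fields both decoding to 0.5, matching what muneeb sees in debugQuery.]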
Re: Solr Performance Issues
I don't mean to turn this into a sales pitch, but there is a tool for Java app performance management that you may find helpful. It's called New Relic (www.newrelic.com) and the tool can be installed in 2 minutes. It can give you very deep visibility inside Solr and other Java apps. (Full disclosure: I work at New Relic.) Mike

Siddhant Goel wrote: Hi everyone, I have an index corresponding to ~2.5 million documents. The index size is 43GB. The configuration of the machine which is running Solr is - Dual Processor Quad Core Xeon 5430 - 2.66GHz (Harpertown) - 2 x 12MB cache, 8GB RAM, and 250 GB HDD. I'm observing a strange trend in the queries that I send to Solr. The times for queries that I send earlier are much lower than for queries I send later. For instance, if I write a script to query solr 5000 times (with 5000 distinct queries, most of them containing not more than 3-5 words) with 10 threads running in parallel, the average query time goes from ~50ms in the beginning to ~6000ms. Is this expected, or is there something wrong with my configuration? Currently I've configured the queryResultCache and the documentCache to contain 2048 entries (hit ratios for both are close to 50%). Apart from this, a general question that I want to ask is: is such hardware enough for this scenario? I'm aiming at achieving around 20 queries per second with the hardware mentioned above. Thanks, Regards, -- - Siddhant -- View this message in context: http://old.nabble.com/Solr-Performance-Issues-tp27864278p27872139.html Sent from the Solr - User mailing list archive at Nabble.com.
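[For anyone wanting to reproduce Siddhant's measurement, here is a rough SolrJ sketch of that kind of load test (my own; the URL and queries are stand-ins). Logging latency per query over the run, rather than only the final average, is what would expose the degradation he describes.]

import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class QueryLoadTest {
    public static void main(String[] args) throws Exception {
        final SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        ExecutorService pool = Executors.newFixedThreadPool(10); // 10 parallel clients
        final AtomicLong totalMs = new AtomicLong();
        final int numQueries = 5000;
        for (int i = 0; i < numQueries; i++) {
            final String q = "field:term" + i;       // stand-in for real distinct queries
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        long start = System.currentTimeMillis();
                        server.query(new SolrQuery(q));
                        totalMs.addAndGet(System.currentTimeMillis() - start);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        System.out.println("avg latency: " + (totalMs.get() / numQueries) + " ms");
    }
}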
Re: How to edit / compile the SOLR source code
Leaving aside some historical reasons, the root of the issue is that any search has to identify all the terms in a field that satisfy it. Let's take a normal non-leading wildcard case first. Finding all the terms like 'some*' will have to deal with many fewer terms than 's*'. Just dealing with that many terms will decrease performance, regardless of the underlying mechanisms used. Imagine you're searching down an ordered list of all the terms for a field, assembling a list of matching terms, and then comparing all the terms in that field with your list. So pure wildcard searches, i.e. just *, would have to handle all the terms in the index for the field.

The situation with leading wildcards is worse than trailing, since all the terms in the index have to be examined. Even doing something as bad as 'a*' will examine only terms starting in 'a'. But looking for '*a' has to examine each and every term in the index, because "australia" and "zebra" both qualify; there aren't any good shortcuts if you think of having an ordered list of terms in a field. So performance can degrade pretty dramatically when you allow this kind of thing, and the original writers (my opinion here, I wasn't one of them) decided it was much better to disallow it by default and require users to dig around for the why, rather than have them crash and burn a lot by something that seems innocent if you aren't familiar with the issues involved.

A better approach, and this isn't very obvious, is to index your terms reversed, and do leading wildcard searches on the *reversed* field as trailing wildcards. E.g. 'some' gets indexed as 'emos' and the wildcard search '*me' gets searched in the reversed field as 'em*'. There may still be performance issues if you allow single-letter wildcards, e.g. 's*' or '*s', although a lot of work has been done in this area in the last few years. You'll have to measure in your situation. And beware that a really common problem when deciding how many real letters to allow is that it all works fine in your test data, but when you load your real corpus and suddenly SOLR/Lucene has to deal with 100,000 terms that might match rather than the 1,000 in your test set, response time changes for the worse. So I'd look around for the reversed idea (see SOLR-1321 in JIRA), and at least one of the schema examples has it.

One hurdle for me was asking the question "does it really help the user to allow one or two leading characters in a wildcard search?". Surprisingly often, that's of no use to real users because so many terms match that it's overwhelming. YMMV, but it's a good question to ask if you find yourself in a quagmire because you allow 'a*' types of queries. There are other strategies too, but that seems easiest.

Now, all that said, SOLR has done significant work to make wildcards work well; these are just general things to look out for when thinking about wildcards... I really think hacking the parser will come back to bite you as both a maintenance and a performance issue; I wouldn't go there without a pretty exhaustive look at other options. HTH Erick

On Thu, Mar 11, 2010 at 6:29 PM, JavaGuy84 bbar...@gmail.com wrote: Erick, Thanks a lot for your reply. I was able to successfully hack the query parser and enable leading wildcard search. As of today I hacked the code for this reason only; I am not sure how to make leading wildcard search work without hacking the code, and this type of search is the preferred type of search in our organization.
I had previously searched all over the web to find out 'why' that feature was disabled by default but couldn't find any solid answer stating the reason. In one of the postings on Nabble it was mentioned that it might take a performance hit if we enable leading wildcard search; can you please let me know your comments on that? But I am very much interested in contributing some new stuff to the SOLR group, so I consider this a starting point. Thanks, Barani

Erick Erickson wrote: See Trey's comment, but before you go there. What about SOLR's wildcard searching capabilities aren't working for you now? There are a couple of tricks for making leading wildcard searches work quickly, but this is a solved problem. Although whether the existing solutions work in your situation may be an open question... Or do you have to hack into the parser for other reasons? Best Erick

On Thu, Mar 11, 2010 at 12:07 PM, JavaGuy84 bbar...@gmail.com wrote: Hi, Sorry for asking this very simple question but I am very new to SOLR and I want to play with its source code. As an initial step I have a requirement to enable wildcard search (*text) in SOLR. I am trying to figure out a way to import the complete SOLR build to Eclipse and edit QueryParsing.java file but I am not able to import (I tried to import with ant project in Eclipse and selected the build.xml file and got an error stating
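[The reversed-field trick Erick describes (what ReversedWildcardFilterFactory in SOLR-1321 does inside the analysis chain) is easy to see in miniature. A toy demonstration of the idea (mine, not the actual Solr filter): store every term reversed, then rewrite a leading wildcard as a cheap prefix scan over the ordered reversed terms.]

import java.util.*;

public class ReversedWildcardDemo {
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        String[] terms = { "some", "same", "time", "tame", "zebra" };
        // index side: store the reversed form of every term
        List<String> reversedIndex = new ArrayList<String>();
        for (String t : terms) reversedIndex.add(reverse(t));
        Collections.sort(reversedIndex);    // an ordered term list, as in Lucene

        String query = "*me";               // leading wildcard from the user
        String prefix = reverse(query.substring(1)); // "*me" becomes "em*"
        for (String rt : reversedIndex)
            if (rt.startsWith(prefix))      // prefix scan instead of full-term scan
                System.out.println("match: " + reverse(rt));
    }
}

[The full term scan that '*me' would otherwise force collapses to a prefix lookup, which is exactly why the reversed field is cheap.]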
Re: How to edit / compile the SOLR source code
Erick, That was a wonderful explanation; I hope many folks in this forum will benefit from the explanation you have given here. Actually, I Googled and found the solution after you mentioned earlier that I can do a leading wildcard search without hacking the code. I found the patch already available to resolve this issue (using ReversedWildcardFilterFactory) and I have started to implement that idea. Thanks a lot for your valuable time. SOLR rocks. Thanks, Barani

Erick Erickson wrote: Leaving aside some historical reasons, the root of the issue is that any search has to identify all the terms in a field that satisfy it. Let's take a normal non-leading wildcard case first. Finding all the terms like 'some*' will have to deal with many fewer terms than 's*'. Just dealing with that many terms will decrease performance, regardless of the underlying mechanisms used. Imagine you're searching down an ordered list of all the terms for a field, assembling a list of matching terms, and then comparing all the terms in that field with your list. So pure wildcard searches, i.e. just *, would have to handle all the terms in the index for the field. The situation with leading wildcards is worse than trailing, since all the terms in the index have to be examined. Even doing something as bad as 'a*' will examine only terms starting in 'a'. But looking for '*a' has to examine each and every term in the index, because "australia" and "zebra" both qualify; there aren't any good shortcuts if you think of having an ordered list of terms in a field. So performance can degrade pretty dramatically when you allow this kind of thing, and the original writers (my opinion here, I wasn't one of them) decided it was much better to disallow it by default and require users to dig around for the why, rather than have them crash and burn a lot by something that seems innocent if you aren't familiar with the issues involved. A better approach, and this isn't very obvious, is to index your terms reversed, and do leading wildcard searches on the *reversed* field as trailing wildcards. E.g. 'some' gets indexed as 'emos' and the wildcard search '*me' gets searched in the reversed field as 'em*'. There may still be performance issues if you allow single-letter wildcards, e.g. 's*' or '*s', although a lot of work has been done in this area in the last few years. You'll have to measure in your situation. And beware that a really common problem when deciding how many real letters to allow is that it all works fine in your test data, but when you load your real corpus and suddenly SOLR/Lucene has to deal with 100,000 terms that might match rather than the 1,000 in your test set, response time changes for the worse. So I'd look around for the reversed idea (see SOLR-1321 in JIRA), and at least one of the schema examples has it. One hurdle for me was asking the question "does it really help the user to allow one or two leading characters in a wildcard search?". Surprisingly often, that's of no use to real users because so many terms match that it's overwhelming. YMMV, but it's a good question to ask if you find yourself in a quagmire because you allow 'a*' types of queries. There are other strategies too, but that seems easiest. Now, all that said, SOLR has done significant work to make wildcards work well; these are just general things to look out for when thinking about wildcards...
I really think hacking the parser will come back to bite you as both a maintenance and a performance issue; I wouldn't go there without a pretty exhaustive look at other options. HTH Erick On Thu, Mar 11, 2010 at 6:29 PM, JavaGuy84 bbar...@gmail.com wrote: Erick, Thanks a lot for your reply. I was able to successfully hack the query parser and enable leading wildcard search. As of today I hacked the code for this reason only; I am not sure how to make leading wildcard search work without hacking the code, and this type of search is the preferred type of search in our organization. I had previously searched all over the web to find out 'why' that feature was disabled by default but couldn't find any solid answer stating the reason. In one of the postings on Nabble it was mentioned that it might take a performance hit if we enable leading wildcard search; can you please let me know your comments on that? But I am very much interested in contributing some new stuff to the SOLR group, so I consider this a starting point. Thanks, Barani Erick Erickson wrote: See Trey's comment, but before you go there. What about SOLR's wildcard searching capabilities aren't working for you now? There are a couple of tricks for making leading wildcard searches work quickly, but this is a solved problem. Although whether the existing solutions work in your situation may be an open question... Or do you
Re: Cleaning up dirty OCR
: Interesting. I wonder though if we have 4 million English documents and 250 : in Urdu, if the Urdu words would score badly when compared to ngram : statistics for the entire corpus.

Well, it doesn't have to be a strict ratio cutoff... you could look at the average frequency of all character Ngrams in your index, then consider any Ngram whose frequency is more than X stddevs below the average to be suspicious, and eliminate any word that contains Y or more suspicious Ngrams. Or you could just start really simple and eliminate any word that contains an Ngram that doesn't appear in *any* other word in your corpus.

I don't deal with a lot of multi-lingual stuff, but my understanding is that this sort of thing gets a lot easier if you can partition your docs by language -- and even if you can't, doing some language detection on the (dirty) OCRed text to get a language guess (and then partition by language and attempt to find the suspicious words in each partition). -Hoss
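[Hoss's "really simple" variant is easy to sketch (again my own illustration, not code from the thread): flag any word containing a trigram that occurs in no other word of the vocabulary. Note that even this toy vocabulary flags real words alongside the OCR junk - the remove-real-words caveat in miniature.]

import java.util.*;

public class SuspiciousWords {
    static List<String> trigrams(String w) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + 3 <= w.length(); i++)
            out.add(w.substring(i, i + 3));
        return out;
    }

    public static void main(String[] args) {
        // toy vocabulary standing in for all indexed terms
        String[] vocab = { "stone", "stones", "stove", "stoves",
                           "store", "stores", "shore", "shored", "xqzjw" };
        // document frequency of each trigram across the vocabulary
        Map<String, Integer> df = new HashMap<String, Integer>();
        for (String w : vocab)
            for (String g : new HashSet<String>(trigrams(w))) {
                Integer c = df.get(g);
                df.put(g, c == null ? 1 : c + 1);
            }
        for (String w : vocab)
            for (String g : trigrams(w))
                if (df.get(g) == 1) {   // trigram unique to this word
                    // "xqzjw" is flagged, but so are the real plurals
                    System.out.println("suspicious: " + w + " (" + g + ")");
                    break;
                }
    }
}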
Re: Cleaning up dirty OCR
I don't deal with a lot of multi-lingual stuff, but my understanding is that this sort of thing gets a lot easier if you can partition your docs by language -- and even if you can't, doing some language detection on the (dirty) OCRed text to get a language guess (and then partition by language and attempt to find the suspicious words in each partition)

And if you are really OCR'ing Urdu text and trying to search it automatically, then this is your last priority. -- Robert Muir rcm...@gmail.com
Re: embedded server / servlet container
How would that work in a PHP environment? I've already come to my own conclusion that using the JSON output would be safer (definitely) and faster (probably) than using PHP output and eval(). So what to do when it gets to the PHP process is no problem. But it's setting up an embedded server on a shared host that I'm working on. I assume I'd use PHP to access the localhost port for SOLR once I get it all going. Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php

--- On Thu, 3/11/10, Chris Hostetter hossman_luc...@fucit.org wrote: From: Chris Hostetter hossman_luc...@fucit.org Subject: Re: embedded server / servlet container To: solr-user@lucene.apache.org Date: Thursday, March 11, 2010, 9:24 AM : I am trying to provide an embedded server to a web application deployed in a : servlet container (like tomcat). If you are trying to use Solr inside another webapp, my suggestion would just be to incorporate the existing Solr servlets, jsps, dispatch filter, and web.xml specifics from solr into your app, and let them do their own thing -- it's going to make your life much easier from an upgrade standpoint. Better still: run solr.war as its own webapp in the same servlet container. -Hoss
Re: Architectural help
What is DIH? I feel like I'm saying, Duh . . ., sorry. Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php

--- On Thu, 3/11/10, Constantijn Visinescu baeli...@gmail.com wrote: From: Constantijn Visinescu baeli...@gmail.com Subject: Re: Architectural help To: solr-user@lucene.apache.org Date: Thursday, March 11, 2010, 5:25 AM Assuming you create the view in such a way that it returns 1 row for each Solr document you want indexed: yes

On Wed, Mar 10, 2010 at 7:54 PM, blargy zman...@hotmail.com wrote: So I can just create a view (or temporary table) and then just have a simple select * from (view or table) in my DIH config?

Constantijn Visinescu wrote: Try making a database view that contains everything you want to index, and then just use the DIH. Worked when I tested it ;)

On Wed, Mar 10, 2010 at 1:56 AM, blargy zman...@hotmail.com wrote: I was wondering if someone could be so kind as to give me some architectural guidance. A little about our setup. We are a RoR shop that is currently using Ferret (no laughs please) as our search technology. Our indexing process at the moment is quite poor, as are our search results. After some deliberation we have decided to switch to Solr to satisfy our search requirements. We have about 5M records ranging in size, all coming from a DB source (only 2 tables). What will be the most efficient way of indexing all of these documents? I am looking at DIH, but before I go down that road I wanted to get some guidance. Are there any pitfalls I should be aware of before I start? Anything I can do now that will help me down the road? I have also been exploring the Sunspot rails plugin (http://outoftime.github.com/sunspot/) which so far seems amazing. There is an easy way to reindex all of your models, like Model.reindex, but I doubt this is the most efficient. Has anyone had any experience using Sunspot with their rails environment, and if so, should I bother with the DIH? Please let me know of any suggestions/opinions you may have. Thanks. -- View this message in context: http://old.nabble.com/Architectural-help-tp27844268p27844268.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://old.nabble.com/Architectural-help-tp27844268p27854256.html Sent from the Solr - User mailing list archive at Nabble.com.
How to get Facet results only on a range of search results documents
Hi, I would like to return facet results only for a range of search results (say 1-100), not for the whole set of search results. Any idea how I can do it? Here is the reason I want to do it: my document set is quite huge, about 100 million documents. When a query is run, the returned results are on average about 1 or so. And I want to do faceting on a defined window of 100 documents around the results the user is looking at, as the faceting is most relevant only around the result document the user is looking at. Thanks Regards, Shishir Jain
local solr geo_distance
Hi, I'm getting geo_distance as str even though I've defined the field as tdouble. My search looks like /solr/select?qt=geo&lat=xx.xx&long=yy.yy&q=*&radius=10 Is there any way I can get it as double instead of str? -- View this message in context: http://old.nabble.com/local-solr-geo_distance-tp27873810p27873810.html Sent from the Solr - User mailing list archive at Nabble.com.
Best Practices for Runtime Index Updates
Hi, What are the best practices for runtime index updates? Meaning: we have an index, and users may add some data like tags, notes, etc. to each Solr document. In this scenario, how quickly can we update the index, and how quickly can we show the updates to the end user in the UI? Best Regards, Kranti K K Parisa
DIH field options
How can you simply add a static value, like <field name="id" value="123"/>? How does one add a static multi-value field, like <field name="category_ids" values="123, 456"/>? Is there any documentation on all the options for the field tag in data-config.xml? Thanks for the help -- View this message in context: http://old.nabble.com/DIH-field-options-tp27873996p27873996.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DIH field options
The wiki page has most of the info you need: http://wiki.apache.org/solr/DataImportHandler To use multi-value fields, your schema.xml must define them with multiValued="true". On 3/11/10 10:58 PM, blargy wrote: How can you simply add a static value, like <field name="id" value="123"/>? How does one add a static multi-value field, like <field name="category_ids" values="123, 456"/>? Is there any documentation on all the options for the field tag in data-config.xml? Thanks for the help -- Tommy Chheng Programmer and UC Irvine Graduate Student Twitter @tommychheng http://tommy.chheng.com