Sorting by value of field
Hi, Say I have a field type in multiple documents which can be either type:bike, type:boat, type:car or type:van, and I want to order a search to give me documents in the following order: type:car, type:van, type:boat, type:bike. Is there a way I can do this just using the sort method? Thanks
Re: Sorting by value of field
Thanks, yes, this is the workaround I am currently using. Still wondering if the sort method can be used alone. On 29 June 2011 18:34, Michael Ryan mr...@moreover.com wrote: You could try adding a new int field (like typeSort) that has the desired sort values. So when adding a document with type:car, also add typeSort:1; when adding type:van, also add typeSort:2; etc. Then you could do sort=typeSort asc to get them in your desired order. I think this is also possible with custom function queries, but I've never done that. -Michael
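Michael's typeSort suggestion can be sketched outside Solr as a plain lookup table (a hypothetical illustration in Python, not Solr code; the rank values are the ones proposed above):

```python
# The typeSort idea: each type maps to an int rank that is indexed
# alongside the document, and the query then sorts on that rank.
TYPE_SORT = {"car": 1, "van": 2, "boat": 3, "bike": 4}

def type_sort_value(doc):
    """The typeSort value that would be indexed with this document."""
    return TYPE_SORT[doc["type"]]

docs = [
    {"id": 1, "type": "bike"},
    {"id": 2, "type": "car"},
    {"id": 3, "type": "boat"},
    {"id": 4, "type": "van"},
]

# Equivalent of sort=typeSort asc
ordered = sorted(docs, key=type_sort_value)
print([d["type"] for d in ordered])  # ['car', 'van', 'boat', 'bike']
```

At index time the client would add typeSort with these values; the sort itself then stays a plain single-field sort.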
Replication without configs
I have replicated a Solr instance without configs, as the slave has its own config. The replication has failed. My plan was to use replication to remove the indexes I no longer wish to use, which is why the slave has a different schema.xml file. Does anyone know why the replication failed? Thanks. Error below:

HTTP ERROR 500
Problem accessing /solr/select/. Reason: null

java.lang.NullPointerException
    at org.apache.solr.response.XMLWriter.writePrim(XMLWriter.java:828)
    at org.apache.solr.response.XMLWriter.writeStr(XMLWriter.java:686)
    at org.apache.solr.schema.StrField.write(StrField.java:49)
    at org.apache.solr.schema.SchemaField.write(SchemaField.java:124)
    at org.apache.solr.response.XMLWriter.writeDoc(XMLWriter.java:373)
    at org.apache.solr.response.XMLWriter$3.writeDocs(XMLWriter.java:545)
    at org.apache.solr.response.XMLWriter.writeDocuments(XMLWriter.java:482)
    at org.apache.solr.response.XMLWriter.writeDocList(XMLWriter.java:519)
    at org.apache.solr.response.XMLWriter.writeVal(XMLWriter.java:582)
    at org.apache.solr.response.XMLWriter.writeResponse(XMLWriter.java:131)
    at org.apache.solr.response.XMLResponseWriter.write(XMLResponseWriter.java:35)
    at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:343)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:265)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Do unused indexes affect performance?
Hi, As a proof of concept I have imported around ~11 million documents into a Solr index. My schema file has multiple dynamic fields defined, the ones below being the most important for my question:

<dynamicField name="*_id" type="text" indexed="true" stored="true"/>
<dynamicField name="*_start" type="tdate" indexed="true" stored="true"/>
<dynamicField name="*_end" type="tdate" indexed="true" stored="true"/>
<dynamicField name="*" type="string" indexed="true" stored="true"/>

The average document has around 40 attributes. Each document has:
* a minimum of 2 tdate fields (max of 10)
* a minimum of 2 *_id fields, each containing a space-delimited list of ids (i.e. 4de5656 q23ew9h)

The final dynamicField causes all fields within a document to be indexed. This was done firstly to show the flexibility of Solr, and also because I did not know which fields we would query / filter on. The total size of my index is ~18GB. However... we now know the fields we will be querying on. I have 3 questions:

1) Do unused indexes on the same dynamicField affect Solr's performance? Our query will always be (type:book book_id:*). Will the presence of 4 million documents matching (type:location store_id:*) affect Solr's performance? The answer sounds like an obvious yes, but may not be the case.

2) Do unused dynamicField indexes affect Solr's performance? All documents have an attribute version which is indexed as text, yet this is never used in any queries. Does its existence (in 11 million documents) affect performance?

3) How does one improve query times against an index? Once an index is built, is there a method to optimise the query analyzers, or a method of removing unused indexes without rebuilding the entire index? The latter is a very important one. We want to replace the current schema with a more restrictive version. Most importantly,

<dynamicField name="*" type="string" indexed="true" stored="true"/>

becomes

<dynamicField name="*" type="string" indexed="false" stored="true"/>

But this change alone does not cause the index to shrink.
It would be lovely if there were a method to re-analyze an index post-import. More than happy to be referred to related documentation. I have read and considered http://wiki.apache.org/solr/SolrPerformanceFactors and http://wiki.apache.org/lucene-java/ImproveSearchingSpeed but there may be some fluid knowledge held here which is undocumented. Thank you in advance for any answers.
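There is no built-in way to re-analyze an index in place; the usual route is to read the stored fields back out and re-post them against the new schema. A minimal, hypothetical sketch of that loop (the fetch_page/post_docs callables stand in for HTTP calls to /select and /update; this only works because every field in the schema above is stored):

```python
def reindex(fetch_page, post_docs, rows=500):
    """Page through all stored documents and re-submit them, so the new
    schema's analyzers and indexed flags take effect.

    fetch_page(start, rows) -> list of stored-field dicts (e.g. from /select)
    post_docs(docs)         -> sends docs to the new core's /update handler
    """
    start = 0
    total = 0
    while True:
        page = fetch_page(start, rows)
        if not page:
            break
        post_docs(page)
        total += len(page)
        start += rows
    return total

# In-memory stand-ins for the HTTP calls, just to show the control flow:
corpus = [{"id": str(i), "title": f"doc {i}"} for i in range(1234)]
received = []
count = reindex(lambda s, r: corpus[s:s + r], received.extend, rows=500)
print(count)  # 1234
```

With real HTTP calls you would also want to sort by the unique key and commit on the target core once the loop finishes.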
Boost Strangeness
WONDERFUL! Just reporting back. This document is ACE: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters for explaining what the filters are and how they affect the analyzer. Erick, your statement "First, boosting isn't absolute" stayed with me, so I continued to investigate boosting. I found this document that (at last) explains the dismax logic: http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/ The reason why I was not getting the order I require was due to: a) my boost metrics being too close together; b) similar ids in a document affecting the score. It seems that if a partial match is made, the product (a % of the total boost) contributes to the document's score. This meant that one type of document in the index had a higher aggregate score, because it had all but one of the boosted fields (it does not have parent_id), and those fields were populated with content that was *very* similar to the requested id. For example: required id = b011mg62, X_id = b011mgsf. Due to the partial matching and closeness of the boost ranges, this type of document always acquired a higher score than another document with just one matching field (i.e. the id field). My solution was to increase the value of the fields I wanted to *really* count: id^10 parent_id^5000 brand_container_id^500 As a result, even if there are similar matches in any field, the id and parent_id matches should always receive a higher boost. This was also useful: http://stackoverflow.com/questions/2179497/adding-date-boosting-to-complex-solr-queries Thanks for the help!
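The effect described here can be sketched with a toy model of dismax's max-over-fields score (a simplification: it ignores queryNorm, which is constant within a single query and so cannot change the ordering, and uses the idf/fieldNorm numbers from the debugQuery output elsewhere in this thread):

```python
# Toy model of dismax's "max over boosted field scores" per document.
def score(field_hits):
    """field_hits: (boost, idf, fieldNorm) for each field that matched."""
    return max(boost * idf * norm for boost, idf, norm in field_hits)

# idf / fieldNorm values taken from the debugQuery explain output.
exact_id_match = [(10.0, 15.431123, 1.0)]   # id:b007vty6 with qf boost ^10
partial_phrase = [(8.0, 49.55458, 0.5)]     # rare partial phrase match, ^8

# The surprising result: the rare partial match outscores the exact id,
# because its tokens are so rare that their idf dwarfs the boost gap.
assert score(partial_phrase) > score(exact_id_match)

# Widening the boost gap, as described above, flips the ordering back:
boosted_exact = [(5000.0, 15.431123, 1.0)]  # a parent_id^5000-style boost
assert score(boosted_exact) > score(partial_phrase)
```

Real Lucene scoring has more terms (tf, coord, queryNorm), but the max-of-boosted-fields part is what makes boost spacing matter.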
Re: Boost Strangeness
: { - time: 0 } - - org.apache.solr.handler.component.DebugComponent: { - time: 18 } } } } } On 15 June 2011 13:16, Erick Erickson erickerick...@gmail.com wrote: First off, you didn't violate the group's etiquette. In fact, yours was one of the better first posts in terms of providing enough information for us to actually help! A very useful page is the admin/analysis page, to see how the analysis chain works. For instance, if you haven't changed the field type (i.e. <fieldType name="text">), your input is being broken up by WordDelimiterFilterFactory. Be sure to check the verbose checkbox and enter text in both the query and index boxes! Here's an invaluable page, though do note that it's not exhaustive: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters But on to your problem: First, boosting isn't absolute; boosting terms just tends to bubble things up, and you have to experiment with various weights. To get the full comparison for both documents you're curious about, try using explainOther. See: http://wiki.apache.org/solr/SolrRelevancyFAQ#Why_doesn.27t_document_id:juggernaut_appear_in_the_top_10_results_for_my_query If you use that against the two docs in question, you should see (although it's a hard read!) the reason the docs got their relative scores. Finally, your next e-mail hints at what's happening. If you're putting multiple tokens in some of these fields, the length normalization may be causing the matches to score lower. You can try disabling those calculations (omitNorms=true in your field definition). See: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr String types accept spaces just fine, but you might want to define the fields with multiValued=true and index each id as a separate value (note that won't work with a field that's also your uniqueKey).
Best, Erick On Wed, Jun 15, 2011 at 7:16 AM, Judioo cont...@judioo.com wrote: <dynamicField name="*_id" type="text" indexed="true" stored="true"/> so all attributes except 'id' are of type text. I didn't know that about the string type. So is my problem as described (that partial matches are contributing to the calculation), and does defining the field type as string solve this problem? Or is my understanding completely incorrect? Thanks in advance On 15 June 2011 12:08, Ahmet Arslan iori...@yahoo.com wrote: /solr/select/?q=b007vty6&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on same result (just higher scores). It's almost as if partial matches on brand|series_container_id and id are being considered in the 1st document. Surely this can't be right / expected? What is your fieldType definition? Don't you think it is better to use the string type, which is not tokenized?
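Erick's explainOther suggestion, applied to the two documents in this thread, might look like the following request (a sketch: the explainOther parameter and syntax are from the Solr wiki; spaces would be %20-encoded in a real URL, and the line breaks are only for readability):

```
/solr/select/?q=b007vty6
  &defType=dismax
  &qf=id^10 parent_id^9 brand_container_id^8 series_container_id^8 subseries_container_id^8 clip_container_id^1 clip_episode_id^1
  &debugQuery=on
  &explainOther=id:b007vsvm
  &wt=json&indent=on
```

explainOther returns the full scoring explanation for any document matching its query, alongside the main results, which makes it easy to compare the two scores side by side.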
Boost Strangeness
Hi, I'm confused about exactly how boosts affect relevancy scores. Apologies if I am violating this group's etiquette, but I could not find Solr's pastebin anywhere. I have 2 document types but want to return any documents where the requested ID appears. The ID appears in multiple attributes, but I want to boost results based on which attribute contains the ID. So my query is

q=id:b007vty6 parent_id:b007vty6 brand_container_id:b007vty6 series_container_id:b007vty6 subseries_container_id:b007vty6 clip_container_id:b007vty6 clip_episode_id:b007vty6

and I use qf to boost fields:

qf=id^10 parent_id^9 brand_container_id^8 series_container_id^8 subseries_container_id^8 clip_container_id^1 clip_episode_id^1

I expect any document with id:b007vty6 to be returned 1st (with the highest score), yet this is not the case. Can anyone explain why this is? Extra info below.

Complete URL:

/solr/select/?q=id:b007vty6%20parent_id:b007vty6%20brand_container_id:b007vty6%20series_container_id:b007vty6%20subseries_container_id:b007vty6%20clip_container_id:b007vty6%20clip_episode_id:b007vty6&start=0&rows=10&wt=json&indent=on&debugQuery=on&fl=id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1

Results:

{
  "responseHeader": {
    "status": 0,
    "QTime": 12,
    "params": {
      "debugQuery": "on",
      "fl": "id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score",
      "indent": "on",
      "start": "0",
      "q": "id:b007vty6 parent_id:b007vty6 brand_container_id:b007vty6 series_container_id:b007vty6 subseries_container_id:b007vty6 clip_container_id:b007vty6 clip_episode_id:b007vty6",
      "qf": "id^10 parent_id^9 brand_container_id^8 series_container_id^8 subseries_container_id^8 clip_container_id^1 clip_episode_id^1",
      "wt": "json",
      "rows": "10"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "maxScore": 1.5543144,
    "docs": [
      {
        "series_container_id": "b007vm94",
        "id": "b007vsvm",
        "brand_container_id": "b007hhk5",
        "subseries_container_id": "b007vty6",
        "clip_episode_id": "",
        "score": 1.5543144
      },
      {
        "parent_id": "b007vm94",
        "id": "b007vty6",
        "score": 0.3014368
      }
    ]
  },
  "debug": {
    "rawquerystring": "id:b007vty6 parent_id:b007vty6 brand_container_id:b007vty6 series_container_id:b007vty6 subseries_container_id:b007vty6 clip_container_id:b007vty6 clip_episode_id:b007vty6",
    "querystring": "id:b007vty6 parent_id:b007vty6 brand_container_id:b007vty6 series_container_id:b007vty6 subseries_container_id:b007vty6 clip_container_id:b007vty6 clip_episode_id:b007vty6",
    "parsedquery": "id:b007vty6 PhraseQuery(parent_id:\"b 007 vty 6\") PhraseQuery(brand_container_id:\"b 007 vty 6\") PhraseQuery(series_container_id:\"b 007 vty 6\") PhraseQuery(subseries_container_id:\"b 007 vty 6\") PhraseQuery(clip_container_id:\"b 007 vty 6\") PhraseQuery(clip_episode_id:\"b 007 vty 6\")",
    "parsedquery_toString": "id:b007vty6 parent_id:\"b 007 vty 6\" brand_container_id:\"b 007 vty 6\" series_container_id:\"b 007 vty 6\" subseries_container_id:\"b 007 vty 6\" clip_container_id:\"b 007 vty 6\" clip_episode_id:\"b 007 vty 6\"",
    "explain": {
      "b007vsvm": "1.5543144 = (MATCH) product of: 10.8802 = (MATCH) sum of: 10.8802 = (MATCH) weight(subseries_container_id:\"b 007 vty 6\" in 39526), product of: 0.43911988 = queryWeight(subseries_container_id:\"b 007 vty 6\"), product of: 49.55458 = idf(subseries_container_id: b=547 007=31 vty=1 6=87) 0.008861338 = queryNorm 24.77729 = fieldWeight(subseries_container_id:\"b 007 vty 6\" in 39526), product of: 1.0 = tf(phraseFreq=1.0) 49.55458 = idf(subseries_container_id: b=547 007=31 vty=1 6=87) 0.5 = fieldNorm(field=subseries_container_id, doc=39526) 0.14285715 = coord(1/7)",
      "b007vty6": "0.3014368 = (MATCH) product of: 2.1100576 = (MATCH) sum of: 2.1100576 = (MATCH) weight(id:b007vty6 in 39512), product of: 0.13674039 = queryWeight(id:b007vty6), product of: 15.431123 = idf(docFreq=1, maxDocs=3701577) 0.008861338 = queryNorm 15.431123 = (MATCH) fieldWeight(id:b007vty6 in 39512), product of: 1.0 = tf(termFreq(id:b007vty6)=1) 15.431123 = idf(docFreq=1, maxDocs=3701577) 1.0 = fieldNorm(field=id, doc=39512) 0.14285715 = coord(1/7)"
    },
    "QParser": "LuceneQParser",
    "timing": { "time": 12, "prepare": { "time": 3,
Re: Boost Strangeness
Apologies, I have tried that method as well:

/solr/select/?q=b007vty6&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on

Same result (just higher scores). It's almost as if partial matches on brand|series_container_id and id are being considered in the 1st document. Surely this can't be right / expected?

{
  "responseHeader": {
    "status": 0,
    "QTime": 13,
    "params": {
      "debugQuery": "on",
      "fl": "id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score",
      "indent": "on",
      "q": "b007vty6",
      "qf": "id^10 parent_id^9 brand_container_id^8 series_container_id^8 subseries_container_id^8 clip_container_id^1 clip_episode_id^1",
      "wt": "json",
      "defType": "dismax"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "maxScore": 21.138214,
    "docs": [
      {
        "series_container_id": "b007vm94",
        "id": "b007vsvm",
        "brand_container_id": "b007hhk5",
        "subseries_container_id": "b007vty6",
        "clip_episode_id": "",
        "score": 21.138214
      },
      {
        "parent_id": "b007vm94",
        "id": "b007vty6",
        "score": 5.1243143
      }
    ]
  },
  "debug": {
    "rawquerystring": "b007vty6",
    "querystring": "b007vty6",
    "parsedquery": "+DisjunctionMaxQuery((id:b007vty6^10.0 | clip_episode_id:\"b 007 vty 6\" | subseries_container_id:\"b 007 vty 6\"^8.0 | series_container_id:\"b 007 vty 6\"^8.0 | clip_container_id:\"b 007 vty 6\" | brand_container_id:\"b 007 vty 6\"^8.0 | parent_id:\"b 007 vty 6\"^9.0)) ()",
    "parsedquery_toString": "+(id:b007vty6^10.0 | clip_episode_id:\"b 007 vty 6\" | subseries_container_id:\"b 007 vty 6\"^8.0 | series_container_id:\"b 007 vty 6\"^8.0 | clip_container_id:\"b 007 vty 6\" | brand_container_id:\"b 007 vty 6\"^8.0 | parent_id:\"b 007 vty 6\"^9.0) ()",
    "explain": {
      "b007vsvm": "21.138214 = (MATCH) sum of: 21.138214 = (MATCH) max of: 21.138214 = (MATCH) weight(subseries_container_id:\"b 007 vty 6\"^8.0 in 39526), product of: 0.85312855 = queryWeight(subseries_container_id:\"b 007 vty 6\"^8.0), product of: 8.0 = boost 49.55458 = idf(subseries_container_id: b=547 007=31 vty=1 6=87) 0.0021519922 = queryNorm 24.77729 = fieldWeight(subseries_container_id:\"b 007 vty 6\" in 39526), product of: 1.0 = tf(phraseFreq=1.0) 49.55458 = idf(subseries_container_id: b=547 007=31 vty=1 6=87) 0.5 = fieldNorm(field=subseries_container_id, doc=39526)",
      "b007vty6": "5.1243143 = (MATCH) sum of: 5.1243143 = (MATCH) max of: 5.1243143 = (MATCH) weight(id:b007vty6^10.0 in 39512), product of: 0.33207658 = queryWeight(id:b007vty6^10.0), product of: 10.0 = boost 15.431123 = idf(docFreq=1, maxDocs=3701577) 0.0021519922 = queryNorm 15.431123 = (MATCH) fieldWeight(id:b007vty6 in 39512), product of: 1.0 = tf(termFreq(id:b007vty6)=1) 15.431123 = idf(docFreq=1, maxDocs=3701577) 1.0 = fieldNorm(field=id, doc=39512)"
    },
    "QParser": "DisMaxQParser",
    "altquerystring": null,
    "boostfuncs": null,
    "timing": {
      "time": 13,
      "prepare": { "time": 3, "org.apache.solr.handler.component.QueryComponent": { "time": 3 }, "org.apache.solr.handler.component.FacetComponent": { "time": 0 }, "org.apache.solr.handler.component.MoreLikeThisComponent": { "time": 0 }, "org.apache.solr.handler.component.HighlightComponent": { "time": 0 }, "org.apache.solr.handler.component.StatsComponent": { "time": 0 }, "org.apache.solr.handler.component.DebugComponent": { "time": 0 } },
      "process": { "time": 10, "org.apache.solr.handler.component.QueryComponent": { "time": 0 }, "org.apache.solr.handler.component.FacetComponent": { "time": 0 }, "org.apache.solr.handler.component.MoreLikeThisComponent": { "time": 0 }, "org.apache.solr.handler.component.HighlightComponent": { "time": 0 }, "org.apache.solr.handler.component.StatsComponent": { "time": 0
Re: Boost Strangeness
<dynamicField name="*_id" type="text" indexed="true" stored="true"/> so all attributes except 'id' are of type text. I didn't know that about the string type. So is my problem as described (that partial matches are contributing to the calculation), and does defining the field type as string solve this problem? Or is my understanding completely incorrect? Thanks in advance On 15 June 2011 12:08, Ahmet Arslan iori...@yahoo.com wrote: /solr/select/?q=b007vty6&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on same result (just higher scores). It's almost as if partial matches on brand|series_container_id and id are being considered in the 1st document. Surely this can't be right / expected? What is your fieldType definition? Don't you think it is better to use the string type, which is not tokenized?
Re: Boost Strangeness
String also does not seem to accept spaces. Currently the _id fields can contain multiple ids (I am using this as an alternative to multiValued). This is why I used the text type. On 15 June 2011 12:16, Judioo cont...@judioo.com wrote: <dynamicField name="*_id" type="text" indexed="true" stored="true"/> so all attributes except 'id' are of type text. I didn't know that about the string type. So is my problem as described (that partial matches are contributing to the calculation), and does defining the field type as string solve this problem? Or is my understanding completely incorrect? Thanks in advance On 15 June 2011 12:08, Ahmet Arslan iori...@yahoo.com wrote: /solr/select/?q=b007vty6&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on same result (just higher scores). It's almost as if partial matches on brand|series_container_id and id are being considered in the 1st document. Surely this can't be right / expected? What is your fieldType definition? Don't you think it is better to use the string type, which is not tokenized?
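For reference, the multiValued alternative Erick describes might look like this in schema.xml and at index time (a sketch based on the field names in this thread; the second parent_id value is hypothetical, and per Erick it won't work for the uniqueKey field):

```xml
<dynamicField name="*_id" type="string" indexed="true" stored="true" multiValued="true"/>

<!-- index time: one <field> element per id, instead of a space-delimited list -->
<doc>
  <field name="parent_id">b007vm94</field>
  <field name="parent_id">b007vm95</field> <!-- hypothetical second value -->
</doc>
```

Each value is then indexed as a single untokenized term, so exact id matches no longer compete with partial token matches.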
Pattern: Is there a method of resolving multivalued date ranges into a single document?
Hi All, Question on best methods again :) I have the following type of document:

<film>
  <title>Tron</title>
  <times>
    <time start='2010-09-23T12:00:00Z' end='2010-09-23T14:30:00Z' theater_id='445632'/>
    <time start='2010-09-23T15:00:00Z' end='2010-09-23T17:30:00Z' theater_id='445633'/>
    <time start='2010-09-23T18:00:00Z' end='2010-09-23T20:30:00Z' theater_id='445634'/>
  </times>
  ...
</film>

where theater_id identifies the place where the film is showing. Each theater is stored in another document. I want to store the timings in the same document as the film details. This is so I can perform a range search like (type:film AND start:[NOW TO *] AND end:[NOW TO *]), i.e. give me all the films that are scheduled to start in the future. I was hoping I could submit a document like the following:

<doc>
  <field name="id">12345-67890-12345</field>
  <field name="title">Tron</field>
  <field name="445632_start">2010-09-23T12:00:00Z</field>
  <field name="445632_end">2010-09-23T14:30:00Z</field>
  <field name="445633_start">2010-09-23T15:00:00Z</field>
  <field name="445633_end">2010-09-23T17:30:00Z</field>
  <field name="445634_start">2010-09-23T18:00:00Z</field>
  <field name="445634_end">2010-09-23T20:30:00Z</field>
</doc>

My assumption is that I could then perform a wildcard date range search like (type:film AND *_start:[NOW TO *] AND *_end:[NOW TO *]), using the attribute name (theater_id + _start|_end) as an indicator of the theater. However, I do not think date ranges support this. Can ANYONE suggest a method to accomplish this, with examples? Thank you in advance.
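Since wildcard field names can't be used in range queries, a common alternative (an assumption on my part, not from the thread) is to denormalize: index one "showing" document per theater/time pair, so start and end stay single-valued and plain range queries work. A Python sketch of the flattening step, using the film above:

```python
# Flatten a film with multiple showings into one Solr document per
# showing, so start/end are single-valued fields a range query can use.
film = {
    "id": "12345-67890-12345",
    "title": "Tron",
    "times": [
        {"theater_id": "445632", "start": "2010-09-23T12:00:00Z", "end": "2010-09-23T14:30:00Z"},
        {"theater_id": "445633", "start": "2010-09-23T15:00:00Z", "end": "2010-09-23T17:30:00Z"},
        {"theater_id": "445634", "start": "2010-09-23T18:00:00Z", "end": "2010-09-23T20:30:00Z"},
    ],
}

def flatten(film):
    for t in film["times"]:
        yield {
            "id": f'{film["id"]}-{t["theater_id"]}',  # synthetic unique key
            "type": "showing",
            "film_id": film["id"],       # back-reference to the film doc
            "title": film["title"],      # duplicated so it stays searchable
            "theater_id": t["theater_id"],
            "start": t["start"],
            "end": t["end"],
        }

docs = list(flatten(film))
print(len(docs))  # 3
```

A query like type:showing AND start:[NOW TO *] then finds future showings directly, at the cost of duplicating the film fields on each showing document.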
Re: Solr Indexing Patterns
Very informative links and statement Jonathan, thank you. On 6 June 2011 20:55, Jonathan Rochkind rochk...@jhu.edu wrote: This is a start, for many common best practices: http://wiki.apache.org/solr/SolrRelevancyFAQ Many of the questions in there have an answer that involves de-normalizing, as an example. It may be that even if your specific problem isn't in there, I myself anyway found reading through there gave me a general sense of common patterns in Solr. (It's certainly true that some things are hard to do in Solr. It turns out that an RDBMS is a remarkably flexible thing -- but when it doesn't do something you need well, and you turn to a specialized tool instead like Solr, you certainly give up some things. One of the biggest areas of limitation involves hierarchical or relationship data, definitely. There are a variety of features, some more fully baked than others, some not yet in a Solr release, meant to provide tools to get at different aspects of this. Including pivot faceting, join (https://issues.apache.org/jira/browse/SOLR-2272), and field collapsing. Each, IMO, is trying to deal with different aspects of hierarchical or multi-class data, or data that is entities with relationships.) On 6/6/2011 3:43 PM, Judioo wrote: I do think that Solr would be better served if there was a *best practice section* of the site. Looking at the majority of emails to this list, they revolve around "how do I do X?". Seems like tutorials with real-world examples would serve Solr no end of good. I still do not have an example of the best method to approach my problem, although Erick has helped me understand the limitations of Solr. Just thought I'd say. On 6 June 2011 20:26, Judioo cont...@judioo.com wrote: Thanks On 6 June 2011 19:32, Erick Erickson erickerick...@gmail.com wrote: #Everybody# (including me) who has any RDBMS background doesn't want to flatten data, but that's usually the way to go in Solr.
Part of whether it's a good idea or not depends on how big the index gets, and unfortunately the only way to figure that out is to test. But that's the first approach I'd try. Good luck! Erick On Mon, Jun 6, 2011 at 11:42 AM, Judioo cont...@judioo.com wrote: On 5 June 2011 14:42, Erick Erickson erickerick...@gmail.com wrote: See: http://wiki.apache.org/solr/SchemaXml By adding multiValued="true" to the field, you can add the same field multiple times in a doc, something like

<add>
  <doc>
    <field name="mv">value1</field>
    <field name="mv">value2</field>
  </doc>
</add>

I can't see how that would work, as one would need to associate the right start / end dates and price. As I understand it, using multiValued and thus flattening the discounts would result in:

{
  "name": "The Book",
  "price": "$9.99", "price": "$3.00", "price": "$4.00",
  "synopsis": "thanksgiving special", "synopsis": "Canadian thanksgiving special",
  "starts": "11-24-2011", "starts": "10-10-2011",
  "ends": "11-25-2011", "ends": "10-11-2011"
},

How does one differentiate the different offers? But there's no real ability in Solr to store sub-documents, so you'd have to get creative in how you encoded the discounts... This is what I'm asking :) What are the best / recommended / known patterns for doing this? But I suspect a better approach would be to store each discount as a separate document. If you're on the trunk version, you could then group results by, say, ISBN and get responses grouped together... This is an option but seems sub-optimal. So say I store the discounts in multiple documents with ISBN as an attribute, and also store the title again with ISBN as an attribute. To get all books currently discounted requires two requests: * get all discounts currently active * get all books using the ISBNs retrieved from the above search Not that bad. However, what happens when I want all books that are currently on discount in the horror genre containing the word 'elm' in the title? The only way I can see of catering for the above search is to duplicate all searchable fields of my book document in my discount document. Coming from an RDBMS background this seems wrong. Is this the correct approach to take? Best Erick On Sat, Jun 4, 2011 at 1:42 AM, Judioo cont...@judioo.com wrote: Hi, Discounts can change daily. Also there can be a lot of them (over time and in a given time period). Could you give an example of what you mean by multi-valuing the field? Thanks On 3 June 2011 14:29, Erick Erickson erickerick...@gmail.com wrote: How often are the discounts changed? Because you can simply re-index the book information with a multiValued discounts field and get something similar to your example (wt=json) Best Erick On Fri, Jun 3, 2011 at 8:38 AM, Judioo cont...@judioo.com wrote: What is the best practice method to index the following in Solr: I'm attempting to use Solr for a book store site
Re: Solr Indexing Patterns
On 5 June 2011 14:42, Erick Erickson erickerick...@gmail.com wrote: See: http://wiki.apache.org/solr/SchemaXml By adding multiValued="true" to the field, you can add the same field multiple times in a doc, something like

<add>
  <doc>
    <field name="mv">value1</field>
    <field name="mv">value2</field>
  </doc>
</add>

I can't see how that would work, as one would need to associate the right start / end dates and price. As I understand it, using multiValued and thus flattening the discounts would result in:

{
  "name": "The Book",
  "price": "$9.99", "price": "$3.00", "price": "$4.00",
  "synopsis": "thanksgiving special", "synopsis": "Canadian thanksgiving special",
  "starts": "11-24-2011", "starts": "10-10-2011",
  "ends": "11-25-2011", "ends": "10-11-2011"
},

How does one differentiate the different offers? But there's no real ability in Solr to store sub-documents, so you'd have to get creative in how you encoded the discounts... This is what I'm asking :) What are the best / recommended / known patterns for doing this? But I suspect a better approach would be to store each discount as a separate document. If you're on the trunk version, you could then group results by, say, ISBN and get responses grouped together... This is an option but seems sub-optimal. So say I store the discounts in multiple documents with ISBN as an attribute, and also store the title again with ISBN as an attribute. To get all books currently discounted requires two requests: * get all discounts currently active * get all books using the ISBNs retrieved from the above search Not that bad. However, what happens when I want all books that are currently on discount in the horror genre containing the word 'elm' in the title? The only way I can see of catering for the above search is to duplicate all searchable fields of my book document in my discount document. Coming from an RDBMS background this seems wrong. Is this the correct approach to take? Best Erick On Sat, Jun 4, 2011 at 1:42 AM, Judioo cont...@judioo.com wrote: Hi, Discounts can change daily. Also there can be a lot of them (over time and in a given time period). Could you give an example of what you mean by multi-valuing the field? Thanks On 3 June 2011 14:29, Erick Erickson erickerick...@gmail.com wrote: How often are the discounts changed? Because you can simply re-index the book information with a multiValued discounts field and get something similar to your example (wt=json) Best Erick On Fri, Jun 3, 2011 at 8:38 AM, Judioo cont...@judioo.com wrote: What is the best practice method to index the following in Solr: I'm attempting to use Solr for a book store site. Each book will have a price, but on occasion this will be discounted. The discounted price exists for a defined time period, but there may be many discount periods. Each discount will have a brief synopsis, start and end time. A subset of the desired output would be as follows:

...
"response": { "numFound": 1, "start": 0, "docs": [
  {
    "name": "The Book",
    "price": "$9.99",
    "discounts": [
      { "price": "$3.00", "synopsis": "thanksgiving special", "starts": "11-24-2011", "ends": "11-25-2011" },
      { "price": "$4.00", "synopsis": "Canadian thanksgiving special", "starts": "10-10-2011", "ends": "10-11-2011" }
    ]
  },
...

A requirement is to be able to search for just discounted publications. I think I could use date faceting for this (return publications that are within a discount window). When a discount search is performed, no publications that are not currently discounted will be returned. My questions are: - Does Solr support this type of sub-document? In the above example the discounts are the sub-documents. I know Solr is not a relational DB, but I would like to store and index the above representation in a single document if possible. - What is the best method to approach the above? I can see in many examples the authors tend to denormalize to solve similar problems. This suggests that for each discount I am required to duplicate the book data or form a document association (http://stackoverflow.com/questions/2689399/solr-associations). Which method would you advise?
It would be nice if Solr could return a response structured as above. Much thanks
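A sketch of the separate-discount-document approach discussed in this thread, including the duplication of searchable book fields that the poster is wary of (Python for illustration only; the ISBN and genre values are hypothetical placeholders):

```python
# One Solr document per discount, copying the book's searchable fields so
# that e.g. (genre:horror AND starts:[* TO NOW] AND ends:[NOW TO *]) can
# be answered in a single query.
book = {
    "isbn": "978-0000000000",  # hypothetical ISBN
    "name": "The Book",
    "price": "$9.99",
    "genre": "horror",         # hypothetical searchable field
}
discounts = [
    {"price": "$3.00", "synopsis": "thanksgiving special",
     "starts": "2011-11-24", "ends": "2011-11-25"},
    {"price": "$4.00", "synopsis": "Canadian thanksgiving special",
     "starts": "2011-10-10", "ends": "2011-10-11"},
]

def discount_docs(book, discounts):
    for i, d in enumerate(discounts):
        doc = dict(book)       # duplicate the searchable book fields
        doc.update(d)          # the discount price replaces the list price
        doc["id"] = f'{book["isbn"]}-discount-{i}'
        doc["type"] = "discount"
        yield doc

docs = list(discount_docs(book, discounts))
print(len(docs))  # 2
```

This is exactly the denormalization trade-off raised above: queries become simple, at the cost of keeping the duplicated book fields in sync whenever the book changes.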
Re: Solr Indexing Patterns
Thanks

On 6 June 2011 19:32, Erick Erickson erickerick...@gmail.com wrote:

#Everybody# (including me) who has any RDBMS background doesn't want to flatten data, but that's usually the way to go in Solr. Part of whether it's a good idea or not depends on how big the index gets, and unfortunately the only way to figure that out is to test. But that's the first approach I'd try. Good luck! Erick

On Mon, Jun 6, 2011 at 11:42 AM, Judioo cont...@judioo.com wrote:

On 5 June 2011 14:42, Erick Erickson erickerick...@gmail.com wrote:

See: http://wiki.apache.org/solr/SchemaXml By adding multiValued="true" to the field, you can add the same field multiple times in a doc, something like:

    <add>
      <doc>
        <field name="mv">value1</field>
        <field name="mv">value2</field>
      </doc>
    </add>

I can't see how that would work, as one would need to associate the right start/end dates and price. As I understand it, using a multiValued field and thus flattening the discounts would result in:

    {
      "name": "The Book",
      "price": "$9.99",
      "price": "$3.00",
      "price": "$4.00",
      "synopsis": "thanksgiving special",
      "starts": "11-24-2011",
      "starts": "10-10-2011",
      "ends": "11-25-2011",
      "ends": "10-11-2011",
      "synopsis": "Canadian thanksgiving special"
    }

How does one differentiate the different offers?

"But there's no real ability in Solr to store sub documents, so you'd have to get creative in how you encoded the discounts..."

This is what I'm asking :) What are the best / recommended / known patterns for doing this?

"But I suspect a better approach would be to store each discount as a separate document. If you're on the trunk version, you could then group results by, say, ISBN and get responses grouped together..."

This is an option but seems suboptimal. Say I store the discounts in multiple documents with the ISBN as an attribute, and also store the title again with the ISBN as an attribute. To get all books currently discounted then requires two requests:

* get all discounts currently active
* get all books using the ISBNs retrieved from the above search

Not that bad. However, what happens when I want all books that are currently on discount, in the horror genre, containing the word 'elm' in the title? The only way I can see of catering for that search is to duplicate all the searchable fields of my book document in my discount document. Coming from an RDBMS background this seems wrong. Is this the correct approach to take?

On Sat, Jun 4, 2011 at 1:42 AM, Judioo cont...@judioo.com wrote:

Hi, discounts can change daily. Also there can be a lot of them (over time and in a given time period). Could you give an example of what you mean by multi-valuing the field? Thanks

On 3 June 2011 14:29, Erick Erickson erickerick...@gmail.com wrote:

How often are the discounts changed? Because you can simply re-index the book information with a multiValued discounts field and get something similar to your example (wt=json). Best, Erick
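The "store each discount as a separate document" approach discussed above can be sketched in a few lines. This is only an illustration of the denormalization pattern, not a fixed schema: the field names (`isbn`, `type`, `genre`, etc.) and the `flatten_book` helper are my own inventions. Each discount becomes its own document that repeats the searchable book fields, so a single query can filter on genre/title and discount dates at once.

```python
def flatten_book(book):
    """Turn one nested book record into flat Solr-style documents:
    a 'book' doc plus one 'discount' doc per discount, with the
    searchable book fields duplicated into each discount doc."""
    docs = [{
        "id": book["isbn"],
        "type": "book",
        "name": book["name"],
        "genre": book["genre"],
        "price": book["price"],
    }]
    for i, d in enumerate(book.get("discounts", [])):
        docs.append({
            "id": "%s-discount-%d" % (book["isbn"], i),
            "type": "discount",
            "isbn": book["isbn"],
            # duplicated searchable book fields
            "name": book["name"],
            "genre": book["genre"],
            "price": d["price"],
            "synopsis": d["synopsis"],
            "starts": d["starts"],
            "ends": d["ends"],
        })
    return docs

book = {
    "isbn": "978-0000000000",
    "name": "The Book",
    "genre": "horror",
    "price": 9.99,
    "discounts": [
        {"price": 3.00, "synopsis": "thanksgiving special",
         "starts": "2011-11-24T00:00:00Z", "ends": "2011-11-25T00:00:00Z"},
    ],
}
print(len(flatten_book(book)))  # 2: one "book" doc + one "discount" doc
```

The duplication is exactly the cost Erick describes: every searchable book field has to be re-indexed whenever a discount changes, which is why index size has to be tested.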
Re: Solr Indexing Patterns
I do think that Solr would be better served if there were a *best practice* section on the site. Looking at the majority of emails to this list, they revolve around "how do I do X?". Tutorials with real-world examples would serve Solr no end of good. I still do not have an example of the best method to approach my problem, although Erick has helped me understand the limitations of Solr. Just thought I'd say.
Solr Indexing Patterns
What is the best practice method to index the following in Solr?

I'm attempting to use Solr for a book store site. Each book will have a price, but on occasion this will be discounted. The discounted price exists for a defined time period, but there may be many discount periods. Each discount will have a brief synopsis, start time and end time. A subset of the desired output would be as follows:

    ...
    "response": {
      "numFound": 1,
      "start": 0,
      "docs": [
        {
          "name": "The Book",
          "price": "$9.99",
          "discounts": [
            {
              "price": "$3.00",
              "synopsis": "thanksgiving special",
              "starts": "11-24-2011",
              "ends": "11-25-2011"
            },
            {
              "price": "$4.00",
              "synopsis": "Canadian thanksgiving special",
              "starts": "10-10-2011",
              "ends": "10-11-2011"
            }
          ]
        },
    ...

A requirement is to be able to search for just discounted publications. I think I could use date faceting for this (return publications that are within a discount window). When a discount search is performed, no publications that are not currently discounted will be returned.

My questions are:

- Does Solr support this type of sub-document? In the above example the discounts are the sub-documents. I know Solr is not a relational DB, but I would like to store and index the above representation in a single document if possible.
- What is the best method to approach the above? I can see that in many examples the authors tend to denormalize to solve similar problems. This suggests that for each discount I am required to duplicate the book data, or form a document association (http://stackoverflow.com/questions/2689399/solr-associations). Which method would you advise?

It would be nice if Solr could return a response structured as above. Much thanks
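For the "just discounted publications" requirement, a date range filter is usually simpler than date faceting. Assuming the discounts are indexed as separate documents with real Solr date fields (`starts`, `ends` and a `type` marker here are illustrative names, not an established schema), the standard Solr range syntax `starts:[* TO NOW] AND ends:[NOW TO *]` selects only currently active discounts. A sketch of building such a request:

```python
from urllib.parse import urlencode

# Illustrative query: horror books with 'elm' in the title that are
# on discount right now. Field names are assumptions; the
# field:[* TO NOW] range syntax and NOW date math are standard Solr.
params = {
    "q": "name:elm AND genre:horror",
    "fq": "type:discount AND starts:[* TO NOW] AND ends:[NOW TO *]",
    "wt": "json",
}
query_string = urlencode(params)
print(query_string)
```

Appending this string to `http://localhost:8983/solr/select?` (path is the stock example setup) would run the search; the `fq` filter also keeps the date restriction out of relevance scoring and lets Solr cache it.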
Storing, indexing and searching XML documents in Solr
Hi, I'm new to Solr, so apologies if the solution is already documented. I have installed and populated a Solr index using the examples as a template, with a version of the data below. I have XML of the form:

    <entity>
      <resource>
        <guid>123898-2092099098982</guid>
        <media_format>Blu-Ray</media_format>
        <updated>2011-05-05T11:25:35+0500</updated>
      </resource>
      <price currency="usd">3.99</price>
      <discounts>
        <discount type="percentage" rate="30" start="2011-05-03T00:00:00" end="2011-05-10T00:00:00" />
        <discount type="decimal" amount="1.99" coupon="1" />
        ...
      </discounts>
      <aspect_ratio>16:9</aspect_ratio>
      <duration>1620</duration>
      <categories>
        <category id="drama" />
        <category id="horror" />
      </categories>
      <rating>
        <rate id="D1">contains some scenes which some viewers may find upsetting</rate>
      </rating>
      ...
      <media_type>Video</media_type>
    </entity>

Can I populate Solr directly with this document (as I believe MarkLogic allows)? If yes: can I search on any element (i.e. find all records where /entity/resource/media_format equals Blu-Ray)? If no: what is the best practice for importing the attributes above into Solr (i.e. patterns for sub-dividing / flattening the document)? Does Solr support attached documents, and if so is this advised (how does it affect performance)? Any help is greatly appreciated. Pointers to documentation that address my issues are even more helpful. Thanks again, OJ
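Solr will not ingest nested XML like this as-is; it indexes flat fields. One common pre-processing step is to flatten the tree into path-named fields before posting. The sketch below uses a naming scheme of my own (dotted element paths, `@` for attributes), not any Solr convention:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document above.
SAMPLE = """<entity>
  <resource>
    <guid>123898-2092099098982</guid>
    <media_format>Blu-Ray</media_format>
  </resource>
  <price currency="usd">3.99</price>
</entity>"""

def flatten(elem, prefix=""):
    """Map each leaf element to a 'path.to.leaf' field and each
    attribute to a 'path.to.elem@attr' field."""
    fields = {}
    for child in elem:
        path = prefix + child.tag if not prefix else prefix + "." + child.tag
        if len(child):                       # has sub-elements: recurse
            fields.update(flatten(child, path))
        else:
            fields[path] = (child.text or "").strip()
        for attr, value in child.attrib.items():
            fields[path + "@" + attr] = value
    return fields

doc = flatten(ET.fromstring(SAMPLE))
print(doc["resource.media_format"])  # Blu-Ray
```

Each resulting key would then be declared (or matched by a dynamic field) in schema.xml, and a query like `resource.media_format:Blu-Ray` recovers the "search on any element" behaviour asked about above.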
Re: Storing, indexing and searching XML documents in Solr
The data is being imported directly from MySQL. The document is indeed a good starting place, though. Thanks 2011/5/18 Yury Kats yuryk...@yahoo.com On 5/18/2011 4:19 PM, Judioo wrote: Any help is greatly appreciated. Pointers to documentation that address my issues is even more helpful. I think this would be a good start: http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2BAC8-HTTP_Datasource
Re: Storing, indexing and searching XML documents in Solr
Great document. I can see how to import the data directly from the database. However, it seems that I need to write XPath expressions in the config to extract the fields I wish to transform into a Solr document. So it seems there is no way of storing the document structure in Solr as-is? 2011/5/18 Yury Kats yuryk...@yahoo.com On 5/18/2011 4:19 PM, Judioo wrote: Any help is greatly appreciated. Pointers to documentation that address my issues is even more helpful. I think this would be a good start: http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2BAC8-HTTP_Datasource
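That is correct: the DataImportHandler's XPathEntityProcessor maps a limited XPath subset onto flat Solr fields rather than preserving the tree. A rough data-config sketch for the sample document earlier in the thread (the file path and `column` names are placeholders, and the columns must still exist in schema.xml):

```xml
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- forEach picks the repeating root; each match becomes one Solr doc -->
    <entity name="video"
            processor="XPathEntityProcessor"
            url="/path/to/entities.xml"
            forEach="/entity">
      <field column="guid"         xpath="/entity/resource/guid"/>
      <field column="media_format" xpath="/entity/resource/media_format"/>
      <field column="price"        xpath="/entity/price"/>
      <field column="currency"     xpath="/entity/price/@currency"/>
      <field column="category"     xpath="/entity/categories/category/@id"/>
    </entity>
  </document>
</dataConfig>
```

Repeating elements like `category` land in a multiValued field, which is the usual way the hierarchy gets flattened.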