Re: Data Import handler and join select
First of all, thank you very much for the answer, James. It is very complete and gives us several alternatives :) I think we will try the cache approach first: after solving https://issues.apache.org/jira/browse/SOLR-5954 the performance has already improved, so together with the cache solution we may reach the performance we expect. We've also tried modifying the transformers and got it working the way we wanted, though the solutions you propose seem much cleaner.

Regarding indexing through SolrJ, it was our first idea. The problem is that when we started the project DIH seemed to fit our needs perfectly, until we tried with real data and noticed the performance issues, so now it may be a bit late for us to change everything :( If we have no other option we will go that way, but we need to try less drastic solutions first.

Thanks!

2014-08-07 18:11 GMT+02:00 Dyer, James james.d...@ingramcontent.com:

Alejandro,

You can use a sub-entity with a cache using DIH. This will solve the n+1-select problem and make it run quickly. Unfortunately, the only built-in cache implementation is in-memory, so it doesn't scale. There is a fast, disk-backed cache using bdb-je, which I use in production. See https://issues.apache.org/jira/browse/SOLR-2613 . You will need to build this yourself and include it on the classpath, and obtain a copy of bdb-je from Oracle. While bdb-je is open source, its license is incompatible with the ASL, so this will never officially be part of Solr.

Once you have a disk-backed cache, you can specify it on the child entity like this:

<entity name="parent" query="select id, ... from parent_table">
  <entity name="child" query="select foreignKey, ... from child_table"
          cacheKey="foreignKey" cacheLookup="parent.id"
          processor="SqlEntityProcessor" transformer="..."
          cacheImpl="BerkleyBackedCache" />
</entity>

If you don't want to go down this path, you can achieve it all with one query if you include an ORDER BY to sort by whatever field is used as Solr's uniqueKey, and add a dummy row at the end with a UNION:

SELECT p.uniqueKey, ..., 'A' as lastInd
  FROM PRODUCTS p INNER JOIN DESCRIPTIONS d ON p.uniqueKey = d.productKey
UNION
SELECT 0 as uniqueKey, ..., 'B' as lastInd FROM dual
ORDER BY uniqueKey, lastInd

Then your transformer would need to keep the last uniqueKey in an instance variable and keep a running map of everything it has seen for that key. When the key changes, or on the last row, send that map as the document; otherwise, the transformer returns null. This collects the data from every row seen onto one document.
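(For illustration only, not code from this thread: a transformer along those lines might look like the sketch below. The column names uniqueKey, LANGUAGE and DESCRIPTION, and the target field names, are placeholders for whatever the real query returns.)

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class CollapsingTransformer extends Transformer {

    private Object lastKey;              // uniqueKey of the previous row
    private Map<String, Object> pending; // document currently being accumulated

    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        Object key = row.get("uniqueKey");

        // Key changed (the dummy UNION row guarantees one final change):
        // emit the accumulated document and start a new one.
        Map<String, Object> toEmit = null;
        if (pending != null && !key.equals(lastKey)) {
            toEmit = pending;
            pending = null;
        }
        if (pending == null) {
            pending = new HashMap<String, Object>(row);
            pending.remove("LANGUAGE");
            pending.remove("DESCRIPTION");
            pending.put("languages", new ArrayList<String>());
        }

        // Fold this row's language/description into the pending document.
        String lang = (String) row.get("LANGUAGE");
        if (lang != null) {
            ((List<String>) pending.get("languages")).add(lang);
            pending.put("description_" + lang, row.get("DESCRIPTION"));
        }

        lastKey = key;
        return toEmit; // null means: skip this row, keep accumulating
    }
}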
Keep in mind also that, in a lot of cases like this, it might just be easiest to write a program that uses SolrJ to send your documents rather than trying to make DIH's features fit your use case.

James Dyer
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Alejandro Marqués Rodríguez [mailto:amarq...@paradigmatecnologico.com]
Sent: Thursday, August 07, 2014 1:43 AM
To: solr-user@lucene.apache.org
Subject: Data Import handler and join select

Hi,

I have a problem while indexing with the data import handler when doing a join select. I have two tables, one with products and another with descriptions for each product in several languages:

Products: ID, NAME, BRAND, PRICE, ...
Descriptions: ID, LANGUAGE, DESCRIPTION

I would like every product indexed as a document with a multivalued field "language" containing every language that has an associated description, plus several dynamic fields "description_*", one per language. For example:

Id: 1
Name: Product
Brand: Brand
Price: 10
Languages: [es, en]
Description_es: Descripción en español
Description_en: English description

Our first approach used sub-entities in the data import handler, and after implementing some transformers we had everything indexed the way we wanted: the sub-entity step added the descriptions for each language to the Solr document before it was indexed. The problem was performance. I've read that sub-entities hurt performance badly, so we changed the process to use a join instead.

Performance improved greatly that way, but now we have a problem. Each time a row is processed a Solr document is generated and indexed, but the data is not added to any previous data; it replaces it. With the example above, the query resulting from the join would be:

Id - Name - Brand - Price - Language - Description
1 - Product - Brand - 10 - es - Descripción en español
1 - Product - Brand - 10 - en - English description

Since both rows have the same id, after indexing the only information I keep is the second row. Is there any way for the data import handler to manage this and allow the documents to be indexed updating any previous data?

Thanks in advance
Data Import handler and join select
Hi,

I have a problem while indexing with the data import handler when doing a join select. I have two tables, one with products and another with descriptions for each product in several languages:

Products: ID, NAME, BRAND, PRICE, ...
Descriptions: ID, LANGUAGE, DESCRIPTION

I would like every product indexed as a document with a multivalued field "language" containing every language that has an associated description, plus several dynamic fields "description_*", one per language. For example:

Id: 1
Name: Product
Brand: Brand
Price: 10
Languages: [es, en]
Description_es: Descripción en español
Description_en: English description

Our first approach used sub-entities in the data import handler, and after implementing some transformers we had everything indexed the way we wanted: the sub-entity step added the descriptions for each language to the Solr document before it was indexed. The problem was performance. I've read that sub-entities hurt performance badly, so we changed the process to use a join instead.

Performance improved greatly that way, but now we have a problem. Each time a row is processed a Solr document is generated and indexed, but the data is not added to any previous data; it replaces it. With the example above, the query resulting from the join would be:

Id - Name - Brand - Price - Language - Description
1 - Product - Brand - 10 - es - Descripción en español
1 - Product - Brand - 10 - en - English description

Since both rows have the same id, after indexing the only information I keep is the second row. Is there any way for the data import handler to manage this and allow the documents to be indexed updating any previous data?

Thanks in advance

--
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42
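(Illustrative sketch only, not from the thread: this is roughly how the desired document from the example above could be built with SolrJ instead of DIH, assuming a Solr 4.x core at the URL shown; the core name and field names are placeholders.)

import java.util.Arrays;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexProductExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/products");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("name", "Product");
        doc.addField("brand", "Brand");
        doc.addField("price", 10);
        for (String lang : Arrays.asList("es", "en")) {
            doc.addField("language", lang);                       // multivalued field
        }
        doc.addField("description_es", "Descripción en español"); // dynamic fields
        doc.addField("description_en", "English description");

        server.add(doc);
        server.commit();
        server.shutdown();
    }
}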
Volatile spellcheck index
Hi,

I'm having a problem with the spell check index building. I've configured the spell checker component to have the index built on optimize:

<!-- Spell Check: http://wiki.apache.org/solr/SpellCheckComponent -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">spell</str>
  <lst name="spellchecker">
    <str name="name">spellchecker</str>
    <str name="field">spell</str>
    <str name="accuracy">0.7</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

<!-- A request handler for demonstrating the spellcheck component.
     See http://wiki.apache.org/solr/SpellCheckComponent for details -->
<requestHandler name="/spell" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck.dictionary">spellchecker</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">1</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

After the indexing process I launch an optimize request, the spellcheck index is generated and everything works fine. However, if I restart Solr the spell check no longer works until I execute another optimize request.

So, is this the expected behaviour? Is the spell check index deleted after every server restart? Is there any way to make it persistent?

And just one more question: I remember that in previous Solr versions the spellcheck had its own folder under the data folder, so I could see whether the spell check index had been generated just by listing the files in that folder. Does that folder still exist? Is there any way of knowing whether the spell check index has been generated without executing a query that is supposed to return a correction?

Thanks in advance

--
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42
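(Side note, not from the thread: while testing, the spellcheck index can also be rebuilt on demand, independently of buildOnOptimize, by sending spellcheck.build=true to the handler. A SolrJ sketch, assuming the /spell handler and dictionary name configured above and a core at the URL shown:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class BuildSpellcheckIndex {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("*:*");
        query.setRequestHandler("/spell");
        query.set("spellcheck", "true");
        query.set("spellcheck.dictionary", "spellchecker");
        query.set("spellcheck.build", "true"); // (re)build the spellcheck index now

        server.query(query);
        server.shutdown();
    }
}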
Re: Volatile spellcheck index
Thanks for the answer, James. My fault for not specifying the Solr version; we are working with Solr 4.5. Anyway, thank you very much for pointing out the change to DirectSolrSpellChecker. I hadn't even noticed that change, and I don't think I was using it, as the line

<str name="classname">solr.DirectSolrSpellChecker</str>

was missing from my configuration. Once I added it, I think everything is working fine even after a server restart.

Thanks again James, you've saved me from some serious headache ;)

2014-02-05 Dyer, James james.d...@ingramcontent.com:

Alejandro,

Assuming you're using Solr 3.x, under:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    ...
  </lst>
</searchComponent>

...you can add:

<str name="spellcheckIndexDir">./spellchecker</str>

...and then the spell check index will be created on disk and not in memory. But in Solr 4.0 the default spellcheck implementation changed to org.apache.solr.spelling.DirectSolrSpellChecker, which does not create a separate index for spellchecking; "build" does nothing, and you need not worry about these things at all. The wiki still says "experimental" here, but that is woefully out-of-date.

James Dyer
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Alejandro Marqués Rodríguez [mailto:amarq...@paradigmatecnologico.com]
Sent: Wednesday, February 05, 2014 3:41 AM
To: solr-user@lucene.apache.org
Subject: Volatile spellcheck index

Hi,

I'm having a problem with the spell check index building. I've configured the spell checker component to have the index built on optimize:

<!-- Spell Check: http://wiki.apache.org/solr/SpellCheckComponent -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">spell</str>
  <lst name="spellchecker">
    <str name="name">spellchecker</str>
    <str name="field">spell</str>
    <str name="accuracy">0.7</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

<!-- A request handler for demonstrating the spellcheck component.
     See http://wiki.apache.org/solr/SpellCheckComponent for details -->
<requestHandler name="/spell" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck.dictionary">spellchecker</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">1</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

After the indexing process I launch an optimize request, the spellcheck index is generated and everything works fine. However, if I restart Solr the spell check no longer works until I execute another optimize request.

So, is this the expected behaviour? Is the spell check index deleted after every server restart? Is there any way to make it persistent?

And just one more question: I remember that in previous Solr versions the spellcheck had its own folder under the data folder, so I could see whether the spell check index had been generated just by listing the files in that folder. Does that folder still exist? Is there any way of knowing whether the spell check index has been generated without executing a query that is supposed to return a correction?

Thanks in advance

--
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42

--
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42
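(Illustrative sketch only, not from the thread: a quick way to see from SolrJ whether spellchecking is returning suggestions, using an intentionally misspelled term. The handler name and URL are assumptions based on the configuration above.)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class SpellcheckProbe {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("hotle");   // deliberately misspelled
        query.setRequestHandler("/spell");
        query.set("spellcheck", "true");

        QueryResponse response = server.query(query);
        SpellCheckResponse spellcheck = response.getSpellCheckResponse();
        if (spellcheck != null) {
            for (SpellCheckResponse.Suggestion s : spellcheck.getSuggestions()) {
                System.out.println(s.getToken() + " -> " + s.getAlternatives());
            }
        }
        server.shutdown();
    }
}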
Re: Is there any limit how many documents can be indexed by apache solr
Hi,

In Lucene a single index can hold up to around 2.1 billion documents (see the limitations section at http://lucene.apache.org/core/3_0_3/fileformats.html#Limitations , which also mentions a limit of roughly 274 billion unique terms), and Solr is bound by the same limits. Either way, the maximum is far bigger than those 11,000 ;)

Could it be that you are reusing IDs, so the new documents overwrite the old ones?

2013/11/26 Kamal Palei palei.ka...@gmail.com

Dear All,

I am using Apache Solr 3.6.2 with Drupal 7. Users keep adding their profiles (resumes) and, with a cron task from Drupal, the documents get indexed. Recently I observed that, after indexing around 11,000 documents, further documents are not getting indexed. Is there any configuration for the maximum number of documents that can be indexed?

Kindly help.

Thanks
kamal

--
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42
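(Illustrative sketch only, not from the thread: a quick SolrJ check of how many documents the index actually holds, which helps spot whether new profiles are overwriting old ones because of reused IDs. The URL is a placeholder, and HttpSolrServer is the SolrJ 4.x client class; on SolrJ 3.x the equivalent is CommonsHttpSolrServer.)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class CountDocuments {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0); // we only want the count, not the documents

        long numFound = server.query(query).getResults().getNumFound();
        System.out.println("Documents in the index: " + numFound);
        server.shutdown();
    }
}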
Best implementation for multi-price store?
Hi,

I've recently been asked to implement an application to search products from several stores, each store having different prices and stock for the same product. So I have products with the usual fields (name, description, brand, etc.) and also a number of units and a price for each store. I must be able to filter by a given store and order by stock or price for that store. The application should also allow increasing the number of stores, store-dependent fields and products without much work. The numbers are roughly 100 stores and 7M products.

I've been thinking of some ways of defining the index structure, but I don't know which one is better, as each has its pros and cons.

1. Each product-store as a document: denormalizing the information, so for every product and store I have a different document. Pros: I can filter and order without problems, and adding a new store-dependent field is very easy. Cons: the index goes from 7M documents to 700M, and most of the info is redundant, as most fields are repeated across stores.

2. Each field-store as a field: for example, for price I would have store1_price, store2_price, ... Pros: the index stays at 7M documents, and I can still filter and sort by those fields. Cons: I have to add some logic so that if I filter by one store I order by the associated price field, and the number of fields grows as (number of store-dependent fields) x (number of stores). I don't know whether having more fields affects performance, but adding new store-dependent fields will increase the number of fields even more.

3. Join: the first time I read about Solr joins I thought it was the way to go in this case, but after reading a bit more and doing some tests I'm not so sure... Maybe I've done it wrong, but I think it also denormalizes the info (so I would also have 700M documents), and besides I can't order or filter by store fields.

I must say my preferred option is number 2, so I don't duplicate information, I keep a relatively small number of documents and I can filter and sort by the store fields. However, my main concern is that I don't know if having too many fields in a document will be harmful to performance.

Which one do you think is the best approach for this application? Is there a better approach that I have missed?

Thanks in advance

--
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42
Re: Best implementation for multi-price store?
Hi Robert,

That was the idea, dynamic fields, so that, as you said, it is easier to sort and filter. Besides, with dynamic fields it will be easier to add new stores, as I won't have to modify the schema :)

Thanks for the answer!

2013/11/21 Petersen, Robert robert.peter...@mail.rakuten.com

Hi,

I'd go with (2) also, but using dynamic fields so you don't have to define all the storeX_price fields in your schema, but rather just one *_price field. Then when you filter on store:store1 you'd know to sort with store1_price, and so forth for units. That should be pretty straightforward.

Hope that helps,
Robi

-----Original Message-----
From: Alejandro Marqués Rodríguez [mailto:amarq...@paradigmatecnologico.com]
Sent: Thursday, November 21, 2013 1:36 AM
To: solr-user@lucene.apache.org
Subject: Best implementation for multi-price store?

Hi,

I've recently been asked to implement an application to search products from several stores, each store having different prices and stock for the same product. So I have products with the usual fields (name, description, brand, etc.) and also a number of units and a price for each store. I must be able to filter by a given store and order by stock or price for that store. The application should also allow increasing the number of stores, store-dependent fields and products without much work. The numbers are roughly 100 stores and 7M products.

I've been thinking of some ways of defining the index structure, but I don't know which one is better, as each has its pros and cons.

1. Each product-store as a document: denormalizing the information, so for every product and store I have a different document. Pros: I can filter and order without problems, and adding a new store-dependent field is very easy. Cons: the index goes from 7M documents to 700M, and most of the info is redundant, as most fields are repeated across stores.

2. Each field-store as a field: for example, for price I would have store1_price, store2_price, ... Pros: the index stays at 7M documents, and I can still filter and sort by those fields. Cons: I have to add some logic so that if I filter by one store I order by the associated price field, and the number of fields grows as (number of store-dependent fields) x (number of stores). I don't know whether having more fields affects performance, but adding new store-dependent fields will increase the number of fields even more.

3. Join: the first time I read about Solr joins I thought it was the way to go in this case, but after reading a bit more and doing some tests I'm not so sure... Maybe I've done it wrong, but I think it also denormalizes the info (so I would also have 700M documents), and besides I can't order or filter by store fields.

I must say my preferred option is number 2, so I don't duplicate information, I keep a relatively small number of documents and I can filter and sort by the store fields. However, my main concern is that I don't know if having too many fields in a document will be harmful to performance.

Which one do you think is the best approach for this application? Is there a better approach that I have missed?

Thanks in advance

--
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42

--
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42
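(Illustrative sketch only, not from the thread: a SolrJ example of the dynamic-field approach, indexing one product with per-store *_price / *_units fields and then filtering on one store while sorting by that store's price. The multivalued "store" field, the other field names and the URL are assumptions.)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class MultiStoreExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/products");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("name", "Product");
        doc.addField("store", "store1");      // multivalued: stores carrying the product
        doc.addField("store", "store2");
        doc.addField("store1_price", 10.0);   // dynamic *_price fields
        doc.addField("store2_price", 12.5);
        doc.addField("store1_units", 3);      // dynamic *_units fields
        doc.addField("store2_units", 0);
        server.add(doc);
        server.commit();

        // Filter by one store and sort by that store's price field.
        SolrQuery query = new SolrQuery("*:*");
        query.addFilterQuery("store:store1");
        query.setSort("store1_price", SolrQuery.ORDER.asc);
        System.out.println(server.query(query).getResults().getNumFound());

        server.shutdown();
    }
}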
SolrCloud recovery issue during search stress test
Hi,

We've been experiencing some problems during search stress tests and we don't even have a clue why this is happening. We have the following:

- 3 servers
- Websphere 7
- Zookeeper 3.4.5 on each server
- Solr 4.5.0 on each server
- 1 shard (so one leader and 2 replicas)
- The index contains 7M documents (about 2GB)

We've run several stress tests with JMeter with 100-500 concurrent threads. Depending on how many threads, we get different scenarios, but apart from the times and whether the system fully recovers or not, the steps are:

1. The Solrs begin responding to queries, with a stable number of threads for each Solr (fewer than 10).
2. Once the test has been running for several minutes we kill one of the Solrs (most of the time the one that is the leader).
3. The remaining Solrs respond to the queries, slightly increasing the number of threads used.
4. After a few minutes we restart the killed Solr (and here is where our problem starts).
5. Once it starts, it keeps increasing the number of threads used (up to 100 or above) and, worse, even the other two Solrs start responding slowly (or not responding at all). Then, depending on the number of concurrent queries: if there are few, everything goes back to normal in roughly 3 minutes (though almost no queries are served during that period); if there are more than 200 concurrent queries, the restarted server increases its thread usage so much that it crashes.

During the minutes in which the three Solrs are not responding there are no logs, and after taking a thread dump we've seen a lot of stalled threads with sun.misc.Unsafe.park stack traces.

I don't understand this behaviour at all: not only does the cluster work better with two Solrs than after restarting the third, but the restart also affects the behaviour of the two remaining Solrs... Does anybody have any clue about this?

Thanks in advance

--
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42
SolrCloud full index replication on leader failure
Hi,

I have a problem with SolrCloud in a specific test case and I wanted to know whether this is the way it should work or whether there is any way to avoid it. I have the following scenario:

- Three machines
- Each one with one Zookeeper and one Solr 4.1.0
- Each Solr stores 7 million documents and the index is 2GB

The test consists of sending queries to Solr (100 concurrent queries continuously) and then forcing the leader's failure by shutting down both its Zookeeper and its Solr.

When we shut down any Solr that is not the leader there are no problems; the other two respond to the queries without issues. However, if we shut down the leader the following happens:

- Both Solrs continue responding to the queries until the leader election starts.
- One of them is elected as leader and the other one stops responding to queries (I've read it goes into recovery mode until its index is synchronized with the leader's).
- Then, even though both indexes are the same (they were synchronized before the leader failure), the whole index is replicated.
- While the 2GB are replicated from the leader to the recovering server, that server does not respond to queries, so the leader must handle the whole query load on its own and finally crashes from having too many queries to answer (aside from replicating its index).

My question is: is it normal that the whole index is replicated on a leader change even though the leader's index and the other Solr's index should be the same? Is there any way to avoid it? Maybe I have some configuration wrong? Would upgrading Solr to 4.5.x avoid this behaviour?

Aside from this problem everything seems to work fine, but that point of failure is too risky for us.

Thanks in advance

--
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42
Re: Query help
I can't see a way of retrieving five results from one type and five from another in a single query. The only way I can think of that would have similar behaviour would be:

?q=ContentType:(News+OR+Analysis)&sort=DatePublished+desc&start=0&rows=10

This way you'll have the first 10 results being News or Analysis, though it could be 7 News and 3 Analysis, or even 10 and 0...

If you need Solr to return 5 results from each type, I think the only way to improve the search speed would be, instead of using just one query, to make two parallel queries.

Regards

2010/7/15 Rupert Bates rupert.ba...@guardian.co.uk

Sorry, my mistake, the example should have been as follows:

?q=ContentType:News&sort=DatePublished+desc&start=0&rows=5
?q=ContentType:Analysis&sort=DatePublished+desc&start=0&rows=5

Rupert

On 15 July 2010 13:02, kenf_nc ken.fos...@realestate.com wrote:

Your example though doesn't show different ContentType, it shows a different sort order. That would be difficult to achieve in one call. Sounds like your best bet is asynchronous (multi-threaded) calls if your architecture will allow for it.

--
Rupert Bates
Software Development Manager
Guardian News and Media
Tel: 020 3353 3315
rupert.ba...@guardian.co.uk

--
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42
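(Illustrative sketch only, not from the thread: issuing the two queries above in parallel with SolrJ so that each ContentType contributes exactly five of the newest results. The URL is a placeholder, and HttpSolrServer is the SolrJ 4.x client class; on the SolrJ versions current in 2010 the equivalent was CommonsHttpSolrServer.)

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

public class ParallelTypeQueries {

    // Builds a task that fetches the five newest documents of one ContentType.
    private static Callable<SolrDocumentList> newestFive(final SolrServer server, final String type) {
        return new Callable<SolrDocumentList>() {
            public SolrDocumentList call() throws Exception {
                SolrQuery q = new SolrQuery("ContentType:" + type);
                q.setSort("DatePublished", SolrQuery.ORDER.desc);
                q.setStart(0);
                q.setRows(5);
                return server.query(q).getResults();
            }
        };
    }

    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        ExecutorService pool = Executors.newFixedThreadPool(2);

        Future<SolrDocumentList> news = pool.submit(newestFive(server, "News"));
        Future<SolrDocumentList> analysis = pool.submit(newestFive(server, "Analysis"));

        System.out.println("News: " + news.get().size()
                + " results, Analysis: " + analysis.get().size() + " results");

        pool.shutdown();
        server.shutdown();
    }
}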
Re: partial word searching
Hi,

You can use wildcards, but I suppose that would only work with one word (though maybe if you tokenize the input you could use something like field:sun* AND field:hot*).

You could also use N-grams to achieve partial searches. For example, if you use 3-grams for "hotel" you'll index "hot", "ote" and "tel", so you could find "hotel" by searching for any of those three strings. There's an N-gram filter you could apply, though I don't know how it behaves when generating N-grams from an expression with more than one word:

<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>

Again, I suppose that if you use a tokenizer first you would get the 3-grams "hot", "ote", "tel", "sun", "unw", "nwa", "way", and therefore searching for field:sun AND field:hot would retrieve the "sunway hotel" document.

Regards

2010/4/21 Chamnap Chhorn chamnapchh...@gmail.com

Hi everyone,

I'm quite new to Solr 1.4. I have a requirement to be able to search partial words (sun hot = Sunway Hotel) as well as full words (sunway hotel = Sunway Hotel). Currently I am only able to search full words. Does anyone have any suggestions?

--
Chhorn Chamnap
http://chamnapchhorn.blogspot.com/

--
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42