Re: Need help with DIH dataconfig.xml
Use TemplateTransformer:

<dataConfig>
  <dataSource name="wld" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/wld" user="root" password="pass"/>
  <document name="variants">
    <entity name="III_1_1" query="SELECT * FROM `wld`.`III_1_1`" transformer="TemplateTransformer">
      <field column="id" template="${III_1_1.id}III_1_1"/>
      <field column="lemmatitel" name="lemma"/>
      <field column="vraagtekst" name="vraagtekst"/>
      <field column="lexical_variant" name="variant"/>
    </entity>
    <entity name="III_1_2" query="SELECT * FROM `wld`.`III_1_2`">
      <field column="id" name="${III_1_2_ + id}"/>
      <field column="lemmatitel" name="lemma"/>
      <field column="vraagtekst" name="vraagtekst"/>
      <field column="lexical_variant" name="variant"/>
    </entity>
  </document>
</dataConfig>

On Wed, Jun 15, 2011 at 4:41 PM, MartinS martin.snijd...@gmail.com wrote:

Hello,

I want to perform a data import from a relational database. That all works well. However, I want to dynamically create a unique id for my Solr documents while importing, by using my data config file. I can't get it to work; maybe it's not possible this way, but I thought I would ask you all. (I set up schema.xml to use the field id as the unique id for Solr documents.) My Solr config looks like this:

<dataConfig>
  <dataSource name="wld" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/wld" user="root" password="pass"/>
  <document name="variants">
    <entity name="III_1_1" query="SELECT * FROM `wld`.`III_1_1`">
      <field column="id" name="${variants.name + id}"/>
      <field column="lemmatitel" name="lemma"/>
      <field column="vraagtekst" name="vraagtekst"/>
      <field column="lexical_variant" name="variant"/>
    </entity>
    <entity name="III_1_2" query="SELECT * FROM `wld`.`III_1_2`">
      <field column="id" name="${III_1_2_ + id}"/>
      <field column="lemmatitel" name="lemma"/>
      <field column="vraagtekst" name="vraagtekst"/>
      <field column="lexical_variant" name="variant"/>
    </entity>
  </document>
</dataConfig>

For a unique id I would like to concatenate the primary key of the table (column id) with the table name. How can I do this? Both ways shown in the example data config don't work while importing. Any help is appreciated.

Martin

--
Noble Paul
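To see what the transformer buys (my own illustration, assuming DIH's usual ${entity.column} placeholder resolution):

  row in III_1_1 with id=42
  template "${III_1_1.id}III_1_1"  ->  document id "42III_1_1"

so ids from different tables can no longer collide. The same transformer="TemplateTransformer" plus template attribute would be applied to the III_1_2 entity as well; the name="${III_1_2_ + id}" form from the original config is not evaluated as concatenation, since the name attribute only renames the target field.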
Re: fieldCache problem OOM exception
Hi Erik,

yes, I'm sorting and faceting.

1) Fields for sorting:
sort=f_dccreator_sort, sort=f_dctitle, sort=f_dcyear
The parameter facet.sort= is empty; I only use the parameter sort=.

2) Fields for faceting:
f_dcperson, f_dcsubject, f_dcyear, f_dccollection, f_dclang, f_dctypenorm, f_dccontenttype
Other faceting parameters:
...&facet=true&facet.mincount=1&facet.limit=100&facet.sort=&facet.prefix=...

3) The LukeRequestHandler takes too long for my huge index, so these unique-term counts are from the standalone Luke (compiled for Solr 3.2):

f_dccreator_sort = 10.029.196
f_dctitle        = 21.514.939
f_dcyear         =      1.471
f_dcperson       = 14.138.165
f_dcsubject      =  8.012.319
f_dccollection   =      1.863
f_dclang         =        299
f_dctypenorm     =         14
f_dccontenttype  =        497

numDocs: 28.940.964
numTerms: 686.813.235
optimized: true
hasDeletions: false

What can you read/calculate from these values? Is my index too big for Lucene/Solr? What I don't understand is why the fieldCache is not garbage collected, and therefore reduced in size, from time to time.

Regards,
Bernd

Am 15.06.2011 17:50, schrieb Erick Erickson:

The first question I have is whether you're sorting and/or faceting on many unique string values? I'm guessing that somewhere you are. So, some questions to help pin it down:
1) what fields are you sorting on?
2) what fields are you faceting on?
3) how many unique terms in each (see the solr admin page)?

Best
Erick

On Wed, Jun 15, 2011 at 8:22 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:

Dear list,

after getting an OOM exception after one week of operation with Solr 3.2, I used MemoryAnalyzer on the heap dump file. It looks like the fieldCache eats up all memory:

Class                                                      Objects    Shallow Heap      Retained Heap
org.apache.lucene.search.FieldCache                              0               0   >= 14,636,950,632
org.apache.lucene.search.FieldCacheImpl                          1              32   >= 14,636,950,384
org.apache.lucene.search.FieldCacheImpl$StringIndexCache         1              32   >= 14,636,947,080
org.apache.lucene.search.FieldCache$StringIndex                 10             320   >= 14,636,944,352
java.lang.String[]                                             519     567,811,040   >= 13,503,733,312
char[]                                                  81,766,595  11,604,293,712   >= 11,604,293,712

fieldCache retains over 14g of heap. When looking on the stats page under fieldCache, the description says: "Provides introspection of the Lucene FieldCache, this is **NOT** a cache that is managed by Solr." So is this a Jetty problem and not Solr? Why is fieldCache growing and growing until OOM?

Regards,
Bernd
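The dump itself allows a rough consistency check (my own back-of-the-envelope arithmetic, assuming Lucene 3.x's FieldCache.StringIndex layout of one int per document plus one String per unique term, per sorted/faceted field):

  81,766,595 cached char[] holding 11,604,293,712 B  ->  ~142 B of character data per cached term
  order array per field: 28,940,964 docs * 4 B       ->  ~110 MB

With roughly 54 million unique values summed over the listed fields (f_dctitle alone has 21.5M), the term strings plus their String wrappers plus the per-field int arrays land in the same mid-teens-GB region the dump reports, so this looks like the genuine cost of sorting/faceting on high-cardinality string fields rather than a leak. FieldCache entries are keyed by the IndexReader and only become collectible once that reader is closed, which is why the garbage collector never shrinks the cache while a searcher is open (and why usage can transiently grow while an old and a new searcher coexist).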
Re: Copying few field using copyField to non multiValued field
Hi Omri,

there are two limitations:
1. You can't sort on a multiValued field. (Anyway, on which of the copied fields would you want to sort first?)
2. You can't make the multiValued field the unique key.

Neither is a real limitation:
1. Better to sort on at_country, at_state, at_city instead.
2. Simply choose another unique key field. (Your location wouldn't be unique anyway.)

Greetings,
Kuli

Am 16.06.2011 06:40, schrieb Omri Cohen:

I just don't want to suffer all the limitations a multiValued field has.. (it does have some limitations, doesn't it?) I just remember I read somewhere that it does.
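A minimal schema sketch of that setup (my own example with invented field names; untested):

<field name="at_country" type="string" indexed="true" stored="true"/>
<field name="at_state"   type="string" indexed="true" stored="true"/>
<field name="at_city"    type="string" indexed="true" stored="true"/>
<!-- search-only catch-all; multiValued because several sources are copied in -->
<field name="location" type="string" indexed="true" stored="false" multiValued="true"/>

<copyField source="at_country" dest="location"/>
<copyField source="at_state"   dest="location"/>
<copyField source="at_city"    dest="location"/>

Queries then search the combined location field, while sorting stays on the single-valued originals, e.g. sort=at_country asc,at_state asc,at_city asc.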
Re: DIH abort doesn't close datasources
On Wed, Jun 15, 2011 at 8:10 PM, Frank Wesemann f.wesem...@fotofinder.net wrote:

Hi,
I just came across this: if I abort an import via /dataimport/?command=abort, the connections to the (in my case) database stay open. Shouldn't DocBuilder#rollback() call something like cleanup(), which in turn tries to close EntityProcessors, DataSources etc., instead of relying on finalize() to sometimes do its job?

The abort command just sets an atomic boolean flag which is checked frequently by the import threads to see if they should stop. If you look at DataImporter.java's doFullImport or doDeltaImport methods, you'll see that config.clearCaches is the cleanup method which is called in a finally block. So the data sources should be closed once the import actually aborts. Note that there may be a time lag between calling the abort method and the import actually getting aborted if the import threads are waiting for I/O.

--
Regards,
Shalin Shekhar Mangar.
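A self-contained sketch of the control flow Shalin describes (paraphrased; the real DIH classes and signatures differ):

import java.util.concurrent.atomic.AtomicBoolean;

public class ImportRunner {
    private final AtomicBoolean stop = new AtomicBoolean(false);

    /** The abort command only raises this flag; nothing is torn down here. */
    public void abort() {
        stop.set(true);
    }

    /** Shape of doFullImport/doDeltaImport: cleanup always runs in the finally block. */
    public void runImport() {
        try {
            while (!stop.get()) {
                processNextRow(); // may block on I/O, delaying the next flag check
            }
        } finally {
            clearCaches(); // despite the name, this also closes data sources etc.
        }
    }

    private void processNextRow() { /* fetch and index one row */ }

    private void clearCaches() { /* close EntityProcessors, DataSources, ... */ }
}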
RE: Multiple indexes
Are there any plans to support a kind of federated search in a future Solr version? I think there are reasons to use separate indexes for each document type but do combined searches on these indexes (for example if you need separate TFs for each document type). I am aware of http://wiki.apache.org/solr/DistributedSearch and a workaround to do federated search with sharding (http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set), but this seems to be too much network and maintenance overhead.

Perhaps it is worth a try to use an IndexReaderFactory which returns a Lucene MultiReader!? Is the IndexReaderFactory still "Experimental"? https://issues.apache.org/jira/browse/SOLR-1366

Regards,
Kai Gülzau

-----Original Message-----
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Wednesday, June 15, 2011 8:43 PM
To: solr-user@lucene.apache.org
Subject: Re: Multiple indexes

Next, however, I predict you're going to ask how you do a 'join' or otherwise query across both these cores at once. You can't do that in Solr.

On 6/15/2011 1:00 PM, Frank Wesemann wrote:
You'll configure multiple cores: http://wiki.apache.org/solr/CoreAdmin

Hi. How to have multiple indexes in SOLR, with different fields and different types of data? Thank you very much! Bye.
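The MultiReader idea might look roughly like this (an untested sketch against the Lucene/Solr 3.x APIs; the second index path is invented, and uniqueKey clashes plus merged term statistics would need thought):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.solr.core.IndexReaderFactory;

public class CombinedReaderFactory extends IndexReaderFactory {
    @Override
    public IndexReader newReader(Directory indexDir, boolean readOnly) throws IOException {
        IndexReader own = IndexReader.open(indexDir, readOnly);
        IndexReader other = IndexReader.open(
                FSDirectory.open(new File("/indexes/otherDocType")), readOnly);
        // Present both physical indexes to Solr as one virtual index.
        return new MultiReader(new IndexReader[] { own, other });
    }
}

wired into solrconfig.xml with something like:

<indexReaderFactory name="IndexReaderFactory" class="CombinedReaderFactory"/>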
Field Collapsing and Grouping in Solr 3.2
Hello. Does anybody know if Field Collapsing and Grouping is available in Solr 3.2? I mean directly available, not as a patch. I have read conflicting statements about it... Thanks a lot!

Sergio Martín Cantero
playence KG
http://www.playence.com/
Re: DIH abort doesn't close datasources
Shalin, thank you for the answer. I indeed didn't look into clearCache(). I thought it would just do that (clear caches). :)

Shalin Shekhar Mangar schrieb:
The abort command just sets an atomic boolean flag which is checked frequently by the import threads to see if they should stop. [...]

--
mit freundlichem Gruß,
Frank Wesemann
Fotofinder GmbH
Showing facet of first N docs
Hi all, Do you know if it is possible to show the facets for a particular field related only to the first N docs of the total number of results? It seems facet.limit doesn't help with it as it defines a window in the facet constraints returned. Thanks in advance, Tommaso
Re: Field Collapsing and Grouping in Solr 3.2
Alas, no, not yet... grouping/field collapse has had a long history with Solr. There were many iterations on SOLR-236, but that impl was never committed. Instead, SOLR-1682 was committed, but committed only to trunk (never backported to 3.x despite requests). Then, a new grouping module was factored out of Solr's trunk implementation and backported to 3.x. Finally, there is now an effort to cut over Solr trunk (SOLR-2564) and Solr 3.x (SOLR-2524) to the new grouping module, which looks like it's close to being done! So hopefully for 3.3, but no promises! This is open-source...

Mike McCandless

http://blog.mikemccandless.com

2011/6/16 Sergio Martín sergio.mar...@playence.com:
Hello. Does anybody know if Field Collapsing and Grouping is available in Solr 3.2? I mean directly available, not as a patch. I have read conflicting statements about it... Thanks a lot!
RE: Field Collapsing and Grouping in Solr 3.2
Mike, thanks a lot for your quick and precise answer!

Sergio Martín Cantero
playence KG

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: jueves, 16 de junio de 2011 12:51
To: solr-user@lucene.apache.org
Subject: Re: Field Collapsing and Grouping in Solr 3.2

[...]
Re: DIH abort doesn't close datasources
On Thu, Jun 16, 2011 at 3:46 PM, Frank Wesemann f.wesem...@fotofinder.netwrote: Shalin, thank you for the answer. I indeed didn't look into clearCache(). I thought it would just do that ( clear caches ). :) Yeah, it is not the most aptly named method :) Thanks for reviewing the code though! -- Regards, Shalin Shekhar Mangar.
Re: Mahout Solr
You're right... It would be nice to be able to see the cluster results coming from Solr, though...

Adam

On Thu, Jun 16, 2011 at 3:21 AM, Andrew Clegg andrew.clegg+mah...@gmail.com wrote:

Well, it does have the ability to pull TermVectors from an index:
https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html#CreatingVectorsfromText-FromLucene
Nothing Solr-specific about it, though.

On 15 June 2011 15:38, Mark static.void@gmail.com wrote:

"Apache Mahout is a new Apache TLP project to create scalable, machine learning algorithms under the Apache license. It is related to other Apache Lucene projects and integrates well with Solr." How does Mahout integrate well with Solr? Can someone give a brief overview of what's available? I'm guessing one of the features would be replacing the Carrot2 clustering algorithm with something a little more sophisticated?
Thanks

--
http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
Complex situation
Hello,

First I will try to explain the situation: I have some companies with opening hours. Some companies have multiple seasons with different opening hours. Here is some example data:

Companyid   Startdate(d-m)   Enddate(d-m)   Openinghours_end
1           01-01            01-04          17:00
1           01-04            01-08          18:00
1           01-08            31-12          17:30
2           01-01            31-12          20:00
3           01-01            01-06          17:00
3           01-06            31-12          18:00

What I want is some facets on the left side of my page. They have to look like this:

Closing today at:
17:00 (23)
18:00 (2)
20:00 (1)

So I need to use NOW to know which opening hours (seasons) I need in my facet results. How should my index look? Can anybody help me with how to store this data in the Solr index?
Performance loss - querying more than 64 cores (randomly)
Hi,

I set up a Solr instance with 512 cores. Each core has 100k documents and 15 fields. Solr is running on a CPU with 4 cores (2.7GHz) and 16GB RAM. Now I've done some benchmarks with JMeter; on each thread iteration JMeter queries another core at random. Here are the results (duration: each run 180 seconds):

Randomly queried cores | queries per second
                     1 | 2016
                     2 | 2001
                     4 | 1978
                     8 | 1958
                    16 | 2047
                    32 | 1959
                    64 | 1879
                   128 | 1446
                   256 | 1009
                   512 |  428

Why are the queries per second constant up to 64 cores, and then the performance decreases rapidly? Solr only uses 10GB of the 16GB memory, so I think it is not a memory issue.
Re: query routing with shards
Hi Otis,

I followed your recommendation and decided to implement the SearchComponent::modifyRequest(ResponseBuilder rb, SearchComponent who, ShardRequest sreq) method, where the query routing happens. So far it is working OK for the non-facet search, which is good news. The bad news is that it fails on the facet search.

This is how the request modification happens:

[code_snippet, SearchComponent::modifyRequest]
SolrQueryRequest req_routed = rb.req;
req_routed = routeRequest(req_routed);
rb.req = req_routed;
sreq.shards = shards.toString().split(",");
[/code_snippet]

where shards is a StringBuilder that accumulates the shards the request should go to. req_routed also contains the target shards. Those are set like this:

[code_snippet, my function routeRequest(SolrQueryRequest req)]
// could not find clone(), used ref reassignment
SolrQueryRequest req_local = req;
ModifiableSolrParams params = new ModifiableSolrParams(req_local.getParams());
...
params.remove(ShardParams.SHARDS);
params.set(ShardParams.SHARDS, getShardsParams(yearToQuarterMap));
params.remove(ShardParams.IS_SHARD);
params.set(ShardParams.IS_SHARD, true);
req_local.setParams(params);
...
return req_local;
[/code_snippet]

The NPE happens down the road during the facet search, in FacetComponent::countFacets(); the cause is that OpenBitSet obs is null for shardNum=0. Do you have any idea why this happens? Should some other field of ResponseBuilder, SearchComponent or ShardRequest be changed?

BTW, I have tried to call the FacetInfo::parse method inside FacetComponent::modifyRequest() and countFacets(). Where do the fi.facets.values() get initiated, is there some method to call?

Thanks,
Dmitry

On Fri, Jun 3, 2011 at 8:00 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

Nah, if you can quickly figure out which shard a given query maps to, then all this component needs to do is stick the appropriate shards param value in the request and let the request pass through to the other SearchComponents in the chain, including QueryComponent, which will know what to do with the shards param.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
From: Dmitry Kan dmitry@gmail.com
To: solr-user@lucene.apache.org
Sent: Fri, June 3, 2011 12:56:15 PM
Subject: Re: query routing with shards

Hi Otis,

Thanks! This sounds promising. This custom implementation, will it hurt in any way the stability of the front-end SOLR? After implementing it, can I run some tests to verify the stability / performance?

Dmitry

On Fri, Jun 3, 2011 at 4:49 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

Hi Dmitry,

Yes, you could also implement your own custom SearchComponent. In this component you could grab the query param, examine the query value, and based on that add the shards URL param with appropriate value, so that when the regular QueryComponent grabs stuff from the request, it has the correct shard in there already.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
From: Dmitry Kan dmitry@gmail.com
To: solr-user@lucene.apache.org
Sent: Fri, June 3, 2011 2:47:00 AM
Subject: Re: query routing with shards

Hi Otis,

I merely followed on the gmail's suggestion to include other people into the recipients list, Yonik was the first one :) I won't do it next time. Thanks for a rapid reply.
The reason for doing this query routing is that we abstract the distributed SOLR from the client code for security reasons (that is, we don't want to expose the entire shard farm to the world, but only the frontend SOLR) and for better decoupling. Is it possible to implement a plugin to SOLR that would map queries to shards? We have other choices too, they'll take quite some time, that's why I decided to quickly ask, if I was missing something from the SOLR main components design and configuration. Dmitry On Fri, Jun 3, 2011 at 8:25 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Dmitry (you may not want to additionally copy Yonik, he's subscribed to this list, too) It sounds like you have the knowledge of which query maps to which shard. If so, why not control/change the value of shards param in the request to your front-end Solr (aka distributed request dispatcher) within your app, which is the one calling Solr? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original
Re: Boost Strangeness
Fascinating! Thank you so much Erik, I'm slowly beginning to understand. So I've discovered that by defining splitOnNumerics="0" on the filter class solr.WordDelimiterFilterFactory (for ONLY the query analyzer) I can get *closer* to my required goal! Now something else odd is occurring: it only returns 2 results where there are over 70? Why is that? I can't find where this is explained :(

query:

/solr/select?omitNorms=true&q=b006m86d&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on&omitNorms=true

output:

{
  "responseHeader": {
    "status": 0,
    "QTime": 51,
    "params": {
      "debugQuery": "on",
      "fl": "type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score",
      "indent": "on",
      "q": "b006m86d",
      "qf": "id^10 parent_id^9 brand_container_id^8 series_container_id^8 subseries_container_id^8 clip_container_id^1 clip_episode_id^1",
      "wt": "json",
      "omitNorms": ["true", "true"],
      "defType": "dismax"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "maxScore": 13.473297,
    "docs": [
      {
        "parent_id": "",
        "id": "b006m86d",
        "type": "brand",
        "score": 13.473297
      },
      {
        "series_container_id": "",
        "id": "b00y1w9h",
        "type": "episode",
        "brand_container_id": "b006m86d",
        "subseries_container_id": "",
        "clip_episode_id": "",
        "score": 11.437143
      }
    ]
  },
  "debug": {
    "rawquerystring": "b006m86d",
    "querystring": "b006m86d",
    "parsedquery": "+DisjunctionMaxQuery((id:b006m86d^10.0 | clip_episode_id:b006m86d | subseries_container_id:b006m86d^8.0 | series_container_id:b006m86d^8.0 | clip_container_id:b006m86d | brand_container_id:b006m86d^8.0 | parent_id:b006m86d^9.0)) ()",
    "parsedquery_toString": "+(id:b006m86d^10.0 | clip_episode_id:b006m86d | subseries_container_id:b006m86d^8.0 | series_container_id:b006m86d^8.0 | clip_container_id:b006m86d | brand_container_id:b006m86d^8.0 | parent_id:b006m86d^9.0) ()",
    "explain": {
      "b006m86d": "13.473297 = (MATCH) sum of: 13.473297 = (MATCH) max of: 13.473297 = (MATCH) fieldWeight(id:b006m86d in 27636), product of: 1.0 = tf(termFreq(id:b006m86d)=1) 13.473297 = idf(docFreq=2, maxDocs=783800) 1.0 = fieldNorm(field=id, doc=27636)",
      "b00y1w9h": "11.437143 = (MATCH) sum of: 11.437143 = (MATCH) max of: 11.437143 = (MATCH) weight(brand_container_id:b006m86d^8.0 in 61), product of: 0.82407516 = queryWeight(brand_container_id:b006m86d^8.0), product of: 8.0 = boost 13.878762 = idf(docFreq=1, maxDocs=783800) 0.007422088 = queryNorm 13.878762 = (MATCH) fieldWeight(brand_container_id:b006m86d in 61), product of: 1.0 = tf(termFreq(brand_container_id:b006m86d)=1) 13.878762 = idf(docFreq=1, maxDocs=783800) 1.0 = fieldNorm(field=brand_container_id, doc=61)"
    },
    "QParser": "DisMaxQParser",
    "altquerystring": null,
    "boostfuncs": null,
    "timing": {
      "time": 51,
      "prepare": {
        "time": 6,
        "org.apache.solr.handler.component.QueryComponent": { "time": 5 },
        "org.apache.solr.handler.component.FacetComponent": { "time": 0 },
        "org.apache.solr.handler.component.MoreLikeThisComponent": { "time": 0 },
        "org.apache.solr.handler.component.HighlightComponent": { "time": 1 },
        "org.apache.solr.handler.component.StatsComponent": { "time": 0 },
        "org.apache.solr.handler.component.DebugComponent": { "time": 0 }
      },
      "process": {
        "time": 45,
        "org.apache.solr.handler.component.QueryComponent": { "time": 27 },
        "org.apache.solr.handler.component.FacetComponent": { "time": 0 },
        "org.apache.solr.handler.component.MoreLikeThisComponent": { "time": 0 },
        "org.apache.solr.handler.component.HighlightComponent": { "time": 0 },
        "org.apache.solr.handler.component.StatsComponent": {
Re: Showing facet of first N docs
http://wiki.apache.org/solr/SimpleFacetParameters facet.offset This param indicates an offset into the list of constraints to allow paging. The default value is 0. This parameter can be specified on a per field basis. Dmitry On Thu, Jun 16, 2011 at 1:39 PM, Tommaso Teofili tommaso.teof...@gmail.comwrote: Hi all, Do you know if it is possible to show the facets for a particular field related only to the first N docs of the total number of results? It seems facet.limit doesn't help with it as it defines a window in the facet constraints returned. Thanks in advance, Tommaso -- Regards, Dmitry Kan
Re: Performance loss - querying more than 64 cores (randomly)
On 6/16/11 3:22 PM, Mark Schoy wrote:
[...] Why are the queries per second constant up to 64 cores, and then the performance decreases rapidly? Solr only uses 10GB of the 16GB memory, so I think it is not a memory issue.

This may be an OS-level disk buffer issue. With limited disk buffer space, the more random IO occurs from different files, the higher the churn rate is; and if the buffers are full, the churn rate may increase dramatically (and the performance will drop then). Modern OSes try to keep as much data in memory as possible, so the memory usage itself is not that informative - but check what the pagein/pageout rates are when you start hitting the 32 vs 64 cores.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
RE: getFieldValue always returns an ArrayList?
Interesting. You guessed right. I changed "multivalued" to "multiValued" and all of a sudden I get Strings. But doesn't multiValued default to false? In my schema, I originally did not set multivalued at all. I only put in multivalued="false" after I experienced this issue.

-Rich

For the record, I had a number of fields which had no settings for multivalued, because none of them were multivalued and I expected the default to be false. When I experienced this problem, I added multivalued="false" to all of them. I still had the problem. So, I added a method to deal with the returned ArrayLists:

private Object getFieldValue(String field, SolrDocument document) {
    ArrayList list = (ArrayList) document.getFieldValue(field);
    return list.get(0);
}

I deliberately did not test if the returned Object was an ArrayList because I wanted to get an exception if any of them were Strings; I got no exceptions, so they were all returned as ArrayLists. I then changed one of the fields to use multiValued="false", and I got an exception, trying to cast String to ArrayList! So, I changed all the troublesome fields to use multiValued, and changed my helper method to look like this:

private Object getFieldValue(String field, SolrDocument document) {
    Object o = document.getFieldValue(field);
    if (o instanceof ArrayList) {
        System.out.println("### Field " + field + " is an instance of ArrayList.");
        ArrayList list = (ArrayList) document.getFieldValue(field);
        return list.get(0);
    } else {
        if (!(o instanceof String)) {
            System.out.println("## ERROR");
        } else {
            System.out.println("### Field " + field + " is an instance of String.");
        }
        return o;
    }
}

Here's the output, interspersed with the schema definitions of the fields:

<field name="uri" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
### Field uri is an instance of String.

<field name="entity_label" type="string" indexed="false" stored="true" required="false"/>
### Field entity_label is an instance of ArrayList.

<field name="institution_uri" type="string" indexed="true" stored="true" required="false"/>
### Field institution_uri is an instance of ArrayList.

<field name="asserted_type_uri" type="string" indexed="true" stored="true" required="false"/>
### Field asserted_type_uri is an instance of ArrayList.

<field name="asserted_type_label" type="text_eaglei" indexed="true" stored="true" required="false"/>
### Field asserted_type_label is an instance of ArrayList.

<field name="provider_uri" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
### Field provider_uri is an instance of String.

<field name="provider_label" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
### Field provider_label is an instance of String.

As you can see, the ones with no declaration for multivalued are returned as ArrayLists, while the ones with multiValued="false" are returned as Strings. So, it looks like there are two problems here: "multivalued" (small v) is not recognized, since using that in the schema still causes all fields to be returned as ArrayLists; and multivalued does not default to false (or, at least, not setting it causes a field to be returned as an ArrayList, as though it were set to true).

-Rich

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, June 15, 2011 4:25 PM
To: solr-user@lucene.apache.org
Subject: Re: getFieldValue always returns an ArrayList?

Hmmm, I admit I'm not using embedded, and I'm using 3.2, but I'm not seeing the behavior you are.
My question about reindexing could have been better stated; I was just making sure you didn't have some leftover cruft where your field was multi-valued from previous experiments, but if you're reindexing each time, that's not the problem.

Arrrh, camel case may be striking again. Try multiValued, not multivalued.

If that's still not it, can we see the code?

Best
Erick

On Wed, Jun 15, 2011 at 3:47 PM, Simon, Richard T richard_si...@hms.harvard.edu wrote:

We rebuild the index from scratch each time we start (for now). The fields in question are not multi-valued; in fact, I explicitly set multi-valued to false, just to be sure. Yes, this is SolrJ, using the embedded server, if that matters. Using Solr/Lucene 3.1.0.

-Rich

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, June 15, 2011 3:44 PM
To: solr-user@lucene.apache.org
Subject: Re: getFieldValue always returns an ArrayList?

Did you perhaps change the schema but not re-index? I'm
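As an aside, SolrJ's SolrDocument also offers getFirstValue(), which sidesteps the cast entirely (a sketch, from memory of the SolrJ 3.x API):

import org.apache.solr.common.SolrDocument;

public class FieldAccess {
    /** Works whether the stored value is a bare object or a list of values. */
    static Object firstValue(SolrDocument document, String field) {
        return document.getFirstValue(field);
    }
}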
Re: Performance loss - querying more than 64 cores (randomly)
I am assuming that you are running on Linux here; I have found atop to be very useful to see what is going on:

http://freshmeat.net/projects/atop/

dstat is also very useful, but needs a little more work to 'decode'. Obviously there is contention going on; you just need to figure out where it is. Most likely it is disk I/O, but it could also be the number of cores you have. Also, I would not say that performance is decreasing rapidly, probably more of a gentle slope down if you plot it (you double the number of cores every time). I would be very interested in hearing about what you find.

Cheers

François

On Jun 16, 2011, at 10:00 AM, Andrzej Bialecki wrote:
[...]
RE: getFieldValue always returns an ArrayList?
FYI: Using multiValued="false" for all string fields results in the following output:

### Field uri is an instance of String.
### Field entity_label is an instance of String.
### Field institution_uri is an instance of String.
### Field asserted_type_uri is an instance of String.
### Field asserted_type_label is an instance of String.
### Field provider_uri is an instance of String.
### Field provider_label is an instance of String.

-Rich

-----Original Message-----
From: Simon, Richard T
Sent: Thursday, June 16, 2011 10:08 AM
To: solr-user@lucene.apache.org
Cc: Simon, Richard T
Subject: RE: getFieldValue always returns an ArrayList?

[...]
Re: Showing facet of first N docs
Thanks Dmitry, but maybe I didn't explain correctly, as I am not sure facet.offset is the right solution: I'd like not to page but to filter facets. I'll try to explain better with an example. Imagine I make a query and the first 2 docs in the results have both 'xyz' and 'abc' as values for field 'lemmas', while other docs in the results also have 'xyz' or 'abc' as values of field 'lemmas'; then I would like to show facets coming from only the first 2 docs in the results, thus having:

<lst name="lemmas">
  <str name="xyz">2</str>
  <str name="abc">2</str>
</lst>

You can imagine this like a 'give me only facets related to the most relevant docs in the results' functionality. Any idea on how to do that?

Tommaso

2011/6/16 Dmitry Kan dmitry@gmail.com:
[...]
Re: How to index correctly a text save with tinyMCE
I have the following problem: I am using the Spanish analyzer to index and query, but because I am using tinyMCE some characters of the text get HTML-encoded; for example the text "En españa ..." is changed to "En espa&ntilde;a", so I need a way to decode that text to make queries work correctly. Could you help me please?

Regards
Ariel

On Wed, Jun 15, 2011 at 9:49 PM, Erick Erickson erickerick...@gmail.com wrote:

Please review this page: http://wiki.apache.org/solr/UsingMailingLists

You haven't stated what your problem is. Some examples of what your inputs and desired outputs are would be helpful. Meanwhile, see this page: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters but that's a wild guess.

Best
Erick

On Wed, Jun 15, 2011 at 2:30 PM, Ariel isaacr...@gmail.com wrote:

Hi everybody, I am using tinyMCE to save the text I am indexing, but as you know the characters with accents are changed. Could anybody tell me how to solve that problem? Are there any analyzers that recognize rich text? I would appreciate your help.

Regards,
Ariel
Re: query routing with shards
Hi Otis,

I have fixed it by assigning to rb the same value as assigned to sreq:

rb.shards = shards.toString().split(",");

Not tested fully yet, but distributed faceting works, at least on my PC (3 shards + 1 router setup).

Dmitry

On Thu, Jun 16, 2011 at 4:53 PM, Dmitry Kan dmitry@gmail.com wrote:
[...]
Encoding of alternate fields in highlighting
I have an index with various fields and I want to highlight query matches on the title and content fields. These fields may contain HTML tags, so I've configured the HtmlFormatter for highlighting. The problem is that if the query doesn't match the text of the field, Solr returns the value of the configured alternate field without encoding it. Is there any way to get the encoded value also for alternate fields? And in general, is there a way to do HTML escaping on values returned from a response writer?

I'm using Solr 3.1, and here is an excerpt from the requestHandler configuration:

[...]
<str name="wt">json</str>
<str name="hl">true</str>
<str name="hl.fl">title,content</str>
<str name="hl.simple.pre"><![CDATA[<b>]]></str>
<str name="hl.simple.post"><![CDATA[</b>]]></str>
<str name="f.title.hl.fragsize">1024</str>
<str name="f.title.hl.alternateField">title</str>
<str name="f.title.hl.maxAlternateFieldLength">512</str>
<int name="f.title.hl.snippets">1</int>
<str name="f.content.hl.alternateField">content</str>
<str name="f.content.hl.maxAlternateFieldLength">512</str>
<int name="f.content.hl.snippets">2</int>
[...]

and from the highlighting configuration:

[...]
<highlighting>
  <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true"></formatter>
  <encoder name="html" class="org.apache.solr.highlight.HtmlEncoder" default="true"/>
  <fragmentsBuilder name="default" class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder" default="true"/>
</highlighting>
[...]

Thanks
Massimo
Re: Complex situation
Am I right that you are only interested in results / facets for the current season? If so, then you can index the start/end dates as separate number fields and build your search filter like this:

fq=+start_date_month:[* TO 6] +start_date_day:[* TO 17] +end_date_month:[6 TO *] +end_date_day:[16 TO *]

where 6/16 is the current month/day.

On Thu, Jun 16, 2011 at 5:20 PM, roySolr royrutten1...@gmail.com wrote:

Hello,

First I will try to explain the situation: I have some companies with opening hours, and some companies have multiple seasons with different opening hours. [...] So I need to use NOW to know which opening hours (seasons) I need in my facet results. How should my index look? Can anybody help me with how to store this data in the Solr index?
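To make that concrete, a sketch of how the rows could be indexed and queried (my own illustration; field names are invented, and closing_time is a plain string used only for faceting):

One Solr document per company/season row, e.g. for company 1's first season:

  companyid=1, start_month=1, start_day=1, end_month=4, end_day=1, closing_time=17:00

The left-column facet is then a single request combining the seasonal filter with a field facet:

  q=*:*&fq=<seasonal filter as above>&facet=true&facet.field=closing_time&facet.mincount=1

A variant that avoids comparing months and days separately (again my own suggestion) is to index each bound as one sortable number month*100+day, so June 16 becomes 616 and the filter collapses to fq=+start_mmdd:[* TO 616] +end_mmdd:[616 TO *].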
Re: Showing facet of first N docs
Hi Tommaso,

the FacetComponent works with DocListAndSet#docSet. It should be easy to switch it to DocListAndSet#docList, which contains only the documents of the current result page (by default the top 10, but e.g. docs 15-25 if start=15&rows=11). That means changing the source code. Instead of changing the source code, the easier way should be to send a second request with a relevance filter (if your sort criterion is relevance):

http://lucene.472066.n3.nabble.com/Filter-by-relevance-td1837486.html

Best regards
Karsten

http://lucene.472066.n3.nabble.com/Showing-facet-of-first-N-docs-td3071395.html

-------- Original-Nachricht --------
Datum: Thu, 16 Jun 2011 12:39:32 +0200
Von: Tommaso Teofili tommaso.teof...@gmail.com
An: solr-user@lucene.apache.org
Betreff: Showing facet of first N docs

Hi all, Do you know if it is possible to show the facets for a particular field related only to the first N docs of the total number of results? It seems facet.limit doesn't help with it as it defines a window in the facet constraints returned.

Thanks in advance,
Tommaso
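The second-request route could be sketched like this (my own example, assuming the {!frange} parser over the query() function; the score cutoff has to be read from the first response):

  1) q=foo&rows=2&fl=id,score
     -> note the score of the last doc you want facets for, say 1.234

  2) q=foo&rows=0&facet=true&facet.field=lemmas&fq={!frange l=1.234}query($q)
     -> facets are computed only over documents scoring at least 1.234,
        i.e. roughly the top-2 set (ties permitting)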
Re: Performance loss - querying more than 64 cores (randomly)
Thanks for your answers. Andrzej was right with his assumption. Solr only needs about 9GB of memory, but the system needs the rest of it for disk IO buffering:

64 cores: 64 * 100MB index size = 6.4GB, + 9GB Solr cache + about 600MB OS = 16GB

Conclusion: my system can buffer the data of exactly 64 cores. Every additional core can't be buffered, and the performance decreases.

2011/6/16 François Schiettecatte fschietteca...@gmail.com:
[...]
Document Scoring
Hi,

I am designing my indexes to have 1 write-only master core and 2 read-only slave cores. That means the read-only cores will only have snapshots pulled from the master and will not have near-real-time changes. I was thinking about adding a hybrid read-and-write master core that will have the most recent changes from my primary data source. I am thinking of querying the hybrid master together with the read-only slaves and somehow combining the results in order to support near-real-time full-text search. Is this feasible?

Thank you,
Zarni
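If feasibility is the question: the usual way to search several cores as one logical index, rather than combining result sets by hand, is Solr's distributed shards parameter; a sketch with invented host and core names:

  http://host:8983/solr/nrt_master/select?q=foo&shards=host:8983/solr/nrt_master,host:8983/solr/slave1,host:8983/solr/slave2

One caveat from the DistributedSearch wiki: the uniqueKey is expected to be unique across all shards, so a document present in both the fresh hybrid core and a stale slave snapshot makes merging non-deterministic; the overlap needs handling, e.g. by keeping a document in the hybrid core only until it has reached the slaves.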
Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)
So a search for a product, once the user logs in and searches for only the products that he has access to, will translate to something like this (the product ids are obtained from the db for a particular user and can run into n number):

q=<search term>&fq=product_id:(100 10001 ... n)

but we are currently running into a "too many boolean clauses" expansion error. We are also not able to tie the users into roles, as each user is basically anyone who comes to the site and purchases a product.

I'm wondering if the new trunk Solr join functionality can help here:

* http://wiki.apache.org/solr/Join

In theory you can index your products (product_id, ...) and the user_id-to-product many-to-many relation (user_product_id, user_id) into a single core (or different cores) and then do a join, like:

q=<search terms>&fq={!join from=user_product_id to=product_id}user_id:10101

But I haven't tried that, so I'm just speculating.
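To make the speculation concrete, a tiny worked example with invented data (note that the "from" field must exist on the documents matched by the inner query, i.e. the relation docs here):

  product docs:   {product_id:100, title:...}   {product_id:101, title:...}
  relation docs:  {user_product_id:100, user_id:10101}   {user_product_id:101, user_id:20202}

  q=<search terms>&fq={!join from=user_product_id to=product_id}user_id:10101

user_id:10101 matches the relation docs, their user_product_id values (here 100) are collected, and the filter keeps only products whose product_id is in that set, replacing the huge product_id:(...) boolean list that blows the clause limit.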
RE: How to index correctly a text save with tinyMCE
Hi Ariel,

On 6/16/2011 at 10:45 AM, Ariel wrote:
I have the following problem: I am using the Spanish analyzer to index and query, but because I am using tinyMCE some characters of the text get HTML-encoded; for example the text "En españa ..." is changed to "En espa&ntilde;a", so I need a way to decode that text to make queries work correctly.

HTMLStripCharFilterFactory, which strips out HTML tags, also converts named character entities like &ntilde; to their equivalent character.

Steve
Re: How to index correctly a text save with tinyMCE
Thanks for your answer, I have just put the filter in my schema.xml but it doesn't work. I am using Solr 1.4 and my conf is:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.HTMLStripCharFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

In the Tomcat 6 logs I get this error:

java.lang.ClassCastException: org.apache.solr.analysis.HTMLStripCharFilterFactory cannot be cast to org.apache.solr.analysis.TokenFilterFactory
        at org.apache.solr.schema.IndexSchema$6.<init>(IndexSchema.java:831)
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:149)
        at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:835)
        at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:58)
        at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:424)
        at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:447)
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:141)
        at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:456)
        at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:95)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:426)
        at org.apache.solr.core.CoreContainer.load(CoreContainer.java:278)
        at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:117)
        ...

Any idea? How can I solve this problem?

Regards
Ariel

On Thu, Jun 16, 2011 at 6:24 PM, Steven A Rowe sar...@syr.edu wrote:
[...]
Re: How to index correctly a text save with tinyMCE
On 6/16/2011 11:12 AM, Ariel wrote: Thanks for your answer. I have just put the filter in my schema.xml but it doesn't work. I am using Solr 1.4 and my conf is:

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.HTMLStripCharFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

But it doesn't work; in the Tomcat 6 logs I get this error:

  java.lang.ClassCastException: org.apache.solr.analysis.HTMLStripCharFilterFactory cannot be cast to org.apache.solr.analysis.TokenFilterFactory

According to the wiki, the output of that filter must be passed to either another CharFilter or a Tokenizer. Try moving it before WhitespaceTokenizerFactory. Shawn
RE: getFieldValue always returns an ArrayList?
: and all of a sudden I get Strings. But, doesn't multivalued default to
: false? In my schema, I originally did not set multivalued. I only put in
: multivalued=false after I experienced this issue.

That's dependent on the version of Solr, and it's where the version property of the schema comes in. (As the default behavior in Solr changes, it does so dependent on what version you specify in your schema, to prevent radical behavior changes if you upgrade but keep the same configs)...

  <schema name="example" version="1.4">
    <!-- attribute "name" is the name of this schema and is only used for display purposes.
         Applications should change this to reflect the nature of the search collection.
         version="1.4" is Solr's version number for the schema syntax and semantics.
         It should not normally be changed by applications.
         1.0: multiValued attribute did not exist, all fields are multiValued by nature
         1.1: multiValued attribute introduced, false by default
         1.2: omitTermFreqAndPositions attribute introduced, true by default except for text fields.
         1.3: removed optional field compress feature
         1.4: default auto-phrase (QueryParser feature) to off
    -->

-Hoss
RE: getFieldValue always returns an ArrayList?
We haven't changed Solr versions. We've been using 3.1.0 all along. Plus, I have some code that runs during indexing and retrieves the fields from a SolrInputDocument, rather than a SolrDocument. That code gets Strings without any problem, and always has, even without saying multiValued=false. -Rich

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Thursday, June 16, 2011 2:18 PM
To: solr-user@lucene.apache.org
Cc: Simon, Richard T
Subject: RE: getFieldValue always returns an ArrayList?

: and all of a sudden I get Strings. But, doesn't multivalued default to
: false? In my schema, I originally did not set multivalued. I only put in
: multivalued=false after I experienced this issue.

That's dependent on the version of Solr, and it's where the version property of the schema comes in. (As the default behavior in Solr changes, it does so dependent on what version you specify in your schema, to prevent radical behavior changes if you upgrade but keep the same configs)...

  <schema name="example" version="1.4">
    <!-- attribute "name" is the name of this schema and is only used for display purposes.
         Applications should change this to reflect the nature of the search collection.
         version="1.4" is Solr's version number for the schema syntax and semantics.
         It should not normally be changed by applications.
         1.0: multiValued attribute did not exist, all fields are multiValued by nature
         1.1: multiValued attribute introduced, false by default
         1.2: omitTermFreqAndPositions attribute introduced, true by default except for text fields.
         1.3: removed optional field compress feature
         1.4: default auto-phrase (QueryParser feature) to off
    -->

-Hoss
Re: Strange behavior
Have you stopped Solr before manually copying the data? This way you can be sure that the index is the same and you didn't have any new docs on the fly.

2011/6/14 Denis Kuzmenok forward...@ukr.net: What should I provide? The OS is the same, the environment is the same, Solr is completely copied, searches work, except that one, and that is strange..

I think you will need to provide more information than this, no-one on this list is omniscient AFAIK. François

On Jun 14, 2011, at 10:44 AM, Denis Kuzmenok wrote: Hi. I've debugged search on a test machine; after copying the entire Solr directory to the production server, I've noticed that one query (SDR S70EE K) does match on the test server, and does not on production. How can that be?
Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)
Peter, thanks for the clarification. The reason I specifically asked is that we have many search instances (200+) in a single JVM. Each of these instances could have n users, and each user can subscribe to n products. Now, according to your suggestion, I need to maintain an in-memory list of all users and their subscribed products for each of the instances, and use this list to filter a given query. We are maintaining the user and subscription details in a DB. I was wondering if, instead, it would make more sense (with respect to memory) to dynamically get the subscribed product ids whenever a user logs in (as access is only for the user session) and use this data to filter the query?

And we really do not have budget, and hence won't be able to contract LI for this, though I will certainly need to get some Java experts' help within my org. Thanks for your time. Regards Sujatha

On Wed, Jun 15, 2011 at 11:29 PM, Peter Sturge peter.stu...@gmail.com wrote: Hi, By in-memory, I mean you hold a list of users (+ some other parameters like order number, expiry, whatever else you need) in one of those Greek HashMaps, and use this list to determine what query parameters/results will be processed for a given search request (SOLR-1872 reads an acl file to populate such a list). So if you had 500 users who had purchased stuff at a given moment, you'd have 500 entries in the table that hold the relevant data to filter/not filter searches/results. This won't cause a memory problem unless you have a million users and stored their autobiography in each entry. I wouldn't call this sort of thing a novice or even journeyman's task; you would definitely need to know about using and maintaining tables etc. Would you be able to contract someone to do the work on your behalf? There are some excellent resources around, and Lucid would certainly do a great job, but of course you'd need budget for this approach. Alternatively, maybe you can tap some Java expertise within your organization to help out? HTH, Peter

On Wed, Jun 15, 2011 at 6:17 PM, Sujatha Arun suja.a...@gmail.com wrote: Thanks, Peter. I am not a Java programmer and hence the code seems all Greek and Latin to me. I do have a basic knowledge, but all this Map, HashMap, Hashlist, NamedList I don't understand. However, I would like to implement the solution that you have mentioned, so any pointers would be great. I would also try to dig deep into Java. What is meant by in-memory? Is it the RAM? So if I have n concurrent users, each having n products subscribed, what would be the impact on memory? Regards Sujatha

On Tue, Jun 14, 2011 at 5:43 PM, Peter Sturge peter.stu...@gmail.com wrote: SOLR-1872 doesn't add discrete booleans to the query, it does it programmatically, so you shouldn't see this problem. (If you have a look at the code, you'll see how it filters queries.) I suppose you could modify SOLR-1872 to use an in-memory, dynamically-updated user list (+ associated filters) instead of using the acl file. This would give you the 'changing users' and 'expiry' functionality you need.

On Tue, Jun 14, 2011 at 10:08 AM, Sujatha Arun suja.a...@gmail.com wrote: Thanks Peter, for your input. I really would like a document- and schema-agnostic solution as in SOLR-1872. Am I right in my assumption that SOLR-1872 is the same as the solution that we currently have, where we add a filter query of the products to the original query, and hence (SOLR-1872) will also run into the too-many-Boolean-clauses expansion error?

Regards Sujatha

On Tue, Jun 14, 2011 at 1:53 PM, Peter Sturge peter.stu...@gmail.com wrote: Hi, SOLR-1834 is good when the original documents' ACL is accessible. SOLR-1872 is good where the usernames are persistent; neither of these really fits your use case. It sounds like you need more of an 'in-memory', transient access control mechanism. Does the access have to exist beyond the user's session (or the Solr VM session)? Your best bet is probably something like a custom SearchComponent or similar that keeps track of user purchases and either adjusts/limits the query or the results to suit. With your own module in the query chain, you can then decide when the 'expiry' is and limit results accordingly. SearchComponents are pretty easy to write and integrate. Have a look at http://wiki.apache.org/solr/SearchComponent for info on SearchComponent and its usage.

On Tue, Jun 14, 2011 at 8:18 AM, Sujatha Arun suja.a...@gmail.com wrote: Hello, our use case is as follows: several Solr webapps (one JVM), each webapp catering to one client. Each client has their users who can purchase products from the site. Once they purchase, they have full access to the products, otherwise
RE: getFieldValue always returns an ArrayList?
Ah! That was the problem. The version was 1.0. I'll change it to 1.2. Thanks! -Rich

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Thursday, June 16, 2011 2:33 PM
To: Simon, Richard T
Cc: solr-user@lucene.apache.org
Subject: RE: getFieldValue always returns an ArrayList?

: We haven't changed Solr versions. We've been using 3.1.0 all along.

But that's not what I'm talking about. I'm talking about the schema version ... a specific property declared in your schema.xml file. Did you check it? (Even when people start with Solr X, they sometimes are using schema.xml files provided by external packages -- Drupal, WordPress, etc. -- and don't notice that those are from older versions.)

: Plus, I have some code that runs during indexing and retrieves the
: fields from a SolrInputDocument, rather than a SolrDocument. That code
: gets Strings without any problem, and always has, even without saying
: multiValued=false.

SolrInputDocuments are irrelevant: they are used to index data, but they don't know anything about the schema. A SolrInputDocument might be completely invalid because of multiple values for single-valued fields, or missing values for required fields, etc. What comes back from a search *is* consistent with the schema (even when there is only one value stored in a multiValued field).

-Hoss
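P.S. For anyone else who hits this: the fix is literally just the version attribute on the root element of schema.xml, e.g. changing

  <schema name="example" version="1.0">

to

  <schema name="example" version="1.2">

(the name value is whatever your schema already uses; only version matters here, and per the list above 1.1+ makes multiValued default to false).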
Re: Updating only one indexed field for all documents quickly.
with the integer field. If you just want to influence the score, then a plain external file field should work for you.

: Is this an appropriate solution, given our use case?

Yes, check out ExternalFileField:
* http://search.lucidimagination.com/search/document/CDRG_ch04_4.4.4
* http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
* http://www.slideshare.net/greggdonovan/solr-lucene-etsy-by-gregg-donovan/28
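To sketch what that looks like in practice (the field and file names here are illustrative, not from this thread; see the links above for the authoritative details):

  <!-- schema.xml -->
  <fieldType name="externalRank" class="solr.ExternalFileField" keyField="id" defVal="0" stored="false" indexed="false" valType="pfloat"/>
  <field name="rank" type="externalRank"/>

The values go in a plain-text file named external_rank in the index data directory, one key=value line per document:

  doc1=1.5
  doc2=0.2

The field isn't really indexed, so it is only usable from function queries, e.g. q={!boost b=rank}your query to fold it into the score. Edit the file and commit, and the new values are picked up by the next searcher without reindexing the documents.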
It's not possible to decide at run-time which similarity class to use, right?
Hello, I'm testing out different Similarity implementations; to try a different similarity class I change the class attribute of the similarity element in schema.xml and restart Solr each time. Besides running multiple cores, each with its own schema, is there a way to tell the RequestHandler which similarity class to use? -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
Re: Minimum Should Match + External Field + Function Query with boost
: Seem to have a solution but I am still trying to figure out how/why it works.
:
: Addition of defType=edismax in the boost query seems to honor MM and
: correct boosting based on the external file source.

You didn't post enough details in your original question to be 100% certain (we would have needed to see the *full* Solr URL, including path, and your requestHandler declaration from solrconfig.xml to be sure), but I suspect the problem you were having is that you weren't actually using dismax (or edismax) at all until you added the explicit defType you mentioned...

: The new query syntax
: q={!boost b=dishRating v=$qq defType=edismax}&qq=hot chicken wings

Compare the parsedquery_toString in the debug output of your previous message with the debug output you get now, and I think you'll see a clear indication of when a DisjunctionMaxQuery is used (and what the mm is set to).

-Hoss
RE: HTMLStripTransformer will remove the content in XML??
FYI: There's a new patch specifically for dealing with xml tags and entities that handles the CDATA case... https://issues.apache.org/jira/browse/SOLR-2597

: Date: Fri, 27 May 2011 17:01:26 +0800
: From: Ellery Leung elleryle...@be-o.com
: Reply-To: solr-user@lucene.apache.org, elleryle...@be-o.com
: To: solr-user@lucene.apache.org
: Subject: RE: HTMLStripTransformer will remove the content in XML??
:
: Got it. Actually I use solr.MappingCharFilterFactory to replace the "<![CDATA[" and "]]>" with the empty string first, and then use HTMLStripCharFilterFactory to get "hello" and "solr".
:
: For future reference, here is part of schema.xml:
:
: <fieldType name="textMaxWord" class="solr.TextField">
:   <analyzer type="index">
:     <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
:     <charFilter class="solr.HTMLStripCharFilterFactory"/>
:     ...
:
: In mappings.txt (2 lines):
:
: "<![CDATA[" => ""
: "]]>" => ""
:
: Restart Solr. It works.
:
: Thank you
:
: -----Original Message-----
: From: bryan rasmussen [mailto:rasmussen.br...@gmail.com]
: Sent: May 27, 2011 4:20 PM
: To: solr-user@lucene.apache.org; elleryle...@be-o.com
: Subject: Re: HTMLStripTransformer will remove the content in XML??
:
: I would expect that it doesn't understand CDATA and thinks of everything between < and > as a 'tag'.
:
: Best Regards,
: Bryan Rasmussen
:
: On Fri, May 27, 2011 at 9:41 AM, Ellery Leung elleryle...@be-o.com wrote:
: I have an XML string like this:
:
: <?xml version="1.0" encoding="UTF-8"?><language><intl><![CDATA[hello]]></intl><loc><![CDATA[solr]]></loc></language>
:
: By using HTMLStripTransformer, I expect to get 'hello,solr'.
: But actually this transformer will remove ALL THE TEXT INSIDE!
: Did I do something silly, or is it a bug?
: Thank you

-Hoss
Re: It's not possible to decide at run-time which similarity class to use, right?
No, there's not a way to control Similarity on a per-request basis. Some factors from Similarity are computed at index-time, though. What factors are you trying to tweak that way, and why? Maybe doing boosting using some other mechanism (boosting functions, boosting clauses) would be a better way to go? Erik

On Jun 16, 2011, at 14:55, Gabriele Kahlout wrote: Hello, I'm testing out different Similarity implementations; to try a different similarity class I change the class attribute of the similarity element in schema.xml and restart Solr each time. Besides running multiple cores, each with its own schema, is there a way to tell the RequestHandler which similarity class to use? -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
Re: Performance loss - querying more than 64 cores (randomly)
On 6/16/11 5:31 PM, Mark Schoy wrote: Thanks for your answers. Andrzej was right with his assumption. Solr only needs about 9GB memory, but the system needs the rest of it for disk IO: 64 cores * 100MB index size = 6.4GB, + 9GB Solr cache + about 600MB OS = 16GB. Conclusion: my system can buffer the data of exactly 64 cores. Every additional core can't be buffered, and the performance decreases.

Glad to be of help... You could formulate this conclusion in a different way, too: if you specify too large a heap size then you stifle the OS disk buffers. Solr won't be able to use that excess of memory, but it won't be available for OS-level disk IO either. Therefore reducing the heap size may actually increase your performance.

-- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: It's not possible to decide at run-time which similarity class to use, right?
On Thu, Jun 16, 2011 at 9:14 PM, Erik Hatcher erik.hatc...@gmail.com wrote: No, there's not a way to control Similarity on a per-request basis. Some factors from Similarity are computed at index-time though.

You got me on this.

What factors are you trying to tweak that way and why? Maybe doing boosting using some other mechanism (boosting functions, boosting clauses) would be a better way to go?

I'm trying to assess the impact of coord (search-time) on QTime. In one implementation coord returns 1, while in another it's actually computed. Running multiple cores adds considerable complication (you must configure them to share data but not conf). Patching the request handler to change similarity (I didn't yet look into this) will only change the search-time similarity. How about breaking similarity up into search-time and index-time parts, so the RequestHandler could take a parameter to 'safely' set the search-time similarity? I think many would welcome such a separation of responsibilities.

Erik On Jun 16, 2011, at 14:55, Gabriele Kahlout wrote: Hello, I'm testing out different Similarity implementations; to try a different similarity class I change the class attribute of the similarity element in schema.xml and restart Solr each time. Besides running multiple cores, each with its own schema, is there a way to tell the RequestHandler which similarity class to use? -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)). -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
RE: How to index correctly a text save with tinyMCE
Hi Ariel, As Shawn says, char filters come before tokenizers. You need to use a <charFilter> tag instead of a <filter> tag. I've updated the HTMLStripCharFilter documentation on the Solr wiki to include this information: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory Steve

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Thursday, June 16, 2011 1:32 PM
To: solr-user@lucene.apache.org
Subject: Re: How to index correctly a text save with tinyMCE

On 6/16/2011 11:12 AM, Ariel wrote: Thanks for your answer. I have just put the filter in my schema.xml but it doesn't work. I am using Solr 1.4 and my conf is:

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.HTMLStripCharFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

But it doesn't work; in the Tomcat 6 logs I get this error:

  java.lang.ClassCastException: org.apache.solr.analysis.HTMLStripCharFilterFactory cannot be cast to org.apache.solr.analysis.TokenFilterFactory

According to the wiki, the output of that filter must be passed to either another CharFilter or a Tokenizer. Try moving it before WhitespaceTokenizerFactory. Shawn
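To spell out what that ends up looking like, the corrected analyzer would be something like this (the surrounding fieldType wrapper here is illustrative):

  <fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

Note that the charFilter element comes before the tokenizer: char filters run on the raw text, so the HTML tags and entities are decoded before tokenization and before the stemmer ever sees the terms.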
getting started
Hello, I am new to Solr and am in the beginning planning stage of a large project and could use some advice so as not to make a huge design blunder that I will regret down the road. Currently I have about 10 MySQL databases that store information about different archival collections. For example, we have data and metadata about a political poster collection, a television program, documents and photographs of and about a famous author, etc. My job is to work with the staff archivists to come up with a standard metadata template so the 10 databases can be consolidated into one. Currently the info in these databases is accessed through 10 different sets of PHP pages that were written a long time ago for PHP 4. My plan is to write a new Java application that will handle both public display of the info as well as an administrative interface so that staff members can add or edit the records. I have decided to use Solr as the search mechanism for this project. Because the info in each of our 10 collections is slightly different (e.g., a record about a poster does not contain duration information, but a record about a TV show does) I was thinking it would be good to separate each collection's index into a separate Solr core so that commits coming from one collection do not bog down the other unrelated collections. One reservation I have is that eventually we would like to be able to type in Iraq and find records across all of the collections at once instead of having to search each collection separately. Although I don't know anything about it at this stage, I did Google sharding after reading someone's recent post on this list and it sounds like that may be a potential answer to my question. Does anyone have any advice on how I should initially set up Solr for my situation? I am slowly making my way through the wiki and RTFMing, but I wanted to see what the experts have to say because at this point I don't really know where to start. Thank you very much, Mari
Re: It's not possible to decide at run-time which similarity class to use, right?
On Thu, Jun 16, 2011 at 3:23 PM, Gabriele Kahlout gabri...@mysimpatico.com wrote: I'm trying to assess the impact of coord (search-time) on QTime. In one implementation coord returns 1, while in another it's actually computed.

At query time? coord should be really cheap (unless your impl does something like calculate a million digits of pi), as it is not actually computed per-document. Instead, the result of all possible coord factors (e.g. 1/5, 2/5, 3/5, 4/5, 5/5) is computed up-front by BooleanQuery's scorers into a table. See http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanScorer.java and http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanScorer2.java
Re: getting started
On 6/16/2011 4:41 PM, Mari Masuda wrote: One reservation I have is that eventually we would like to be able to type in Iraq and find records across all of the collections at once instead of having to search each collection separately. Although I don't know anything about it at this stage, I did Google sharding after reading someone's recent post on this list and it sounds like that may be a potential answer to my question.

So this kind of stuff can be tricky, but with that eventual requirement I would NOT put these in separate cores. Sharding isn't (IMO; if someone disagrees, they will hopefully say so!) a good answer to searching across entirely different 'schemas', or to avoiding frequent-commit issues. Sharding is really just for scaling/performance when your index gets very, very large. (Which it doesn't sound like yours will be, but you can deal with that as a separate issue if it becomes so.)

If you're going to want to search across all the collections, put them all in the same core, either in the exact same indexed fields, or using certain common indexed fields; those common ones are the ones you'll be able to search across all collections on. It's okay if some collections have unique indexed fields too: documents in the core that don't belong to that collection just won't have any terms in an indexed field that is only used by a certain collection, no problem. (Then you can distribute this single core into shards if you need to for performance reasons related to the number of documents/size of index.)

You're right to be thinking about the fact that very frequent commits can be a performance issue in Solr. But separating into different cores is going to create more problems for yourself (if you want to be able to search across all collections) in an attempt to solve that one. (Among other things, not every Solr feature works in a distributed/sharded environment; it's just a more complicated and somewhat less mature setup for Solr.)

The way I deal with the frequent-commit issue is by NOT doing frequent commits to my production Solr. Instead, I use Solr replication to have a 'master' Solr index that I do commits to whenever I want, and a 'slave' Solr index that serves the production searches and only replicates from the master periodically, not so often as to cause too-frequent commits. That seems to be a somewhat common solution, if that use pattern works for you. There are also some near-real-time features in more recent versions of Solr that I'm not very familiar with (not sure if any are included in the current latest release, or if they are all still only in the repo). My sense is that they too only work for certain use patterns; they aren't magic bullets for committing whatever you want, as often as you want, to Solr. In general Solr isn't so great at very frequent major changes to the index.

Depending on exactly what sort of use pattern you are predicting/planning for your commits, maybe people can give you advice on how (or whether) to do it. But I personally don't think your idea of splitting your collections (which you'll eventually want to search across in a single search) into shards is a good solution to frequent-commit issues. You'd be complicating your setup and causing other problems for yourself, and not really even entirely addressing the too-frequent-commit issue with that setup.
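To make the shared-core idea concrete, here's a rough sketch of the kind of schema I mean (all the field names are invented for illustration):

  <!-- common fields, searched across all collections -->
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="collection" type="string" indexed="true" stored="true"/>
  <field name="title" type="text" indexed="true" stored="true"/>
  <field name="description" type="text" indexed="true" stored="true"/>
  <!-- collection-specific fields; docs from other collections just leave them empty -->
  <field name="duration" type="int" indexed="true" stored="true"/>

Then q=iraq against the common fields searches everything at once, and adding fq=collection:posters narrows a search to one collection.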
Re: getting started
Hi Mari, it depends ... * How many records are stored in your MySQL databases? * How often will updates occur? * How many db records / index documents are changed per update? I would suggest to start with a single Solr core first. Thereby, you can concentrate on the basics and do not need to deal with more advanced things like sharding. In case you encounter performance issues later on, you can switch to a multi-core setup. -Sascha Mari Masuda wrote: Hello, I am new to Solr and am in the beginning planning stage of a large project and could use some advice so as not to make a huge design blunder that I will regret down the road. Currently I have about 10 MySQL databases that store information about different archival collections. For example, we have data and metadata about a political poster collection, a television program, documents and photographs of and about a famous author, etc. My job is to work with the staff archivists to come up with a standard metadata template so the 10 databases can be consolidated into one. Currently the info in these databases is accessed through 10 different sets of PHP pages that were written a long time ago for PHP 4. My plan is to write a new Java application that will handle both public display of the info as well as an administrative interface so that staff members can add or edit the records. I have decided to use Solr as the search mechanism for this project. Because the info in each of our 10 collections is slightly different (e.g., a record about a poster does not contain duration information, but a record about a TV show does) I was thinking it would be good to separate each collection's index into a separate Solr core so that commits coming from one collection do not bog down the other unrelated collections. One reservation I have is that eventually we would like to be able to type in Iraq and find records across all of the collections at once instead of having to search each collection separately. Although I don't know anything about it at this stage, I did Google sharding after reading someone's recent post on this list and it sounds like that may be a potential answer to my question. Does anyone have any advice on how I should initially set up Solr for my situation? I am slowly making my way through the wiki and RTFMing, but I wanted to see what the experts have to say because at this point I don't really know where to start. Thank you very much, Mari
sending results of function query to range query
I am not sure if I can use function queries this way. I have a query like this: attributeX:[* TO ?] in my DB. I replace the ? with input from the front end. Obviously, this works fine. However, what I really want to do is attributeX:[* TO (3 * ?)]. Is there any way to embed the result of a function query inside the query?
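One thing I'm considering, assuming we can use Solr 1.4+ where the frange query parser is available: since frange filters on the value of an arbitrary function rather than a field, the multiplication can be folded into the function, e.g.

  fq={!frange u=10}div(attributeX,3)

keeps documents where attributeX/3 <= 10, i.e. attributeX <= 3 * 10, with 10 being the raw front-end input. The fallback is to just compute 3 * ? in the application before building the plain range query.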
Re: Encoding of alternate fields in highlighting
(11/06/17 0:15), Massimo Schiavon wrote: I have an index with various fields and I want to highlight query matches on the title and content fields. These fields can contain HTML tags, so I've configured the HtmlFormatter for highlighting. The problem is that if the query doesn't match the text of the field, Solr returns the value of the configured alternate field without encoding it. Is there a way to get the encoded value for alternate fields as well? And, in general, is there a way to do HTML escaping on values returned from a response writer?

Massimo, at first glance I think the requirement is reasonable. As long as we support HtmlEncoder, we had better support it with the alternateField option too. Please open a JIRA issue, and if possible suggest an appropriate option and attach a patch (a patch is not required, but it is very helpful). koji -- http://www.rondhuit.com/en/
SOlR -- Out of Memory exception
We just started using SOLR. I am trying to load a single file with 20 million records into SOLR using the CSV uploader. I keep getting an Out of Memory after loading 7 million records. Here is the config:

  <autoCommit>
    <maxDocs>1</maxDocs>
    <maxTime>6</maxTime>
  </autoCommit>

I also encountered a LockObtainFailedException:

  org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@D:\work\solr\.\data\index\write.lock
    at org.apache.lucene.store.Lock.obtain(Lock.java:84)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1097)

So I changed the lockType to single; now again I am getting an Out of Memory exception. I also increased the JVM heap space to 2048M but am still getting an Out of Memory.

-- View this message in context: http://lucene.472066.n3.nabble.com/SOlR-Out-of-Memory-exception-tp3074636p3074636.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: fieldCache problem OOM exception
Well, if my theory is right, you should be able to generate OOMs at will by sorting and faceting on all your fields in one query. But Lucene's cache should be garbage collected; can you take some memory snapshots during the week? It should hit a point and stay steady there. How much memory are you giving your JVM? It looks like a lot, given your memory snapshot. Best Erick

On Thu, Jun 16, 2011 at 3:01 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Hi Erick, yes I'm sorting and faceting.

1) Fields for sorting: sort=f_dccreator_sort, sort=f_dctitle, sort=f_dcyear. The parameter facet.sort= is empty, only using parameter sort=.
2) Fields for faceting: f_dcperson, f_dcsubject, f_dcyear, f_dccollection, f_dclang, f_dctypenorm, f_dccontenttype. Other faceting parameters: ...facet=true&facet.mincount=1&facet.limit=100&facet.sort=&facet.prefix=...
3) The LukeRequestHandler takes too long for my huge index, so this is from standalone Luke (compiled for Solr 3.2):

  f_dccreator_sort = 10.029.196
  f_dctitle = 21.514.939
  f_dcyear = 1.471
  f_dcperson = 14.138.165
  f_dcsubject = 8.012.319
  f_dccollection = 1.863
  f_dclang = 299
  f_dctypenorm = 14
  f_dccontenttype = 497
  numDocs: 28.940.964
  numTerms: 686.813.235
  optimized: true
  hasDeletions: false

What can you read/calculate from these values? Is my index too big for Lucene/Solr? What I don't understand is why the fieldCache is not garbage collected and therefore reduced in size from time to time. Regards Bernd

Am 15.06.2011 17:50, schrieb Erick Erickson: The first question I have is whether you're sorting and/or faceting on many unique string values? I'm guessing that sometime you are. So, some questions to help pin it down:
1> what fields are you sorting on?
2> what fields are you faceting on?
3> how many unique terms in each (see the solr admin page)?

Best Erick

On Wed, Jun 15, 2011 at 8:22 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Dear list, after getting an OOM exception after one week of operation with Solr 3.2, I used MemoryAnalyzer on the heap dump file. It looks like the fieldCache eats up all memory.

  Objects / Shallow Heap / Retained Heap:
  org.apache.lucene.search.FieldCache 0 / 0 / 14,636,950,632
  org.apache.lucene.search.FieldCacheImpl 1 / 32 / 14,636,950,384
  org.apache.lucene.search.FieldCacheImpl$StringIndexCache 1 / 32 / 14,636,947,080
  org.apache.lucene.search.FieldCache$StringIndex 10 / 320 / 14,636,944,352
  java.lang.String[] 519 / 567,811,040 / 13,503,733,312
  char[] 81,766,595 / 11,604,293,712 / 11,604,293,712

fieldCache retains over 14g of heap. When looking on the stats page under fieldCache, the description says: Provides introspection of the Lucene FieldCache, this is **NOT** a cache that is managed by Solr. So is this a jetty problem and not solr? Why is fieldCache growing and growing until OOM? Regards Bernd
Re: Boost Strangeness
Right, if you've only changed WordDelimiterFilterFactory in the query analyzer, then the tokens you're analyzing may be split up. Try running some of the terms through the admin/analysis page. Unless you have catenateAll=1 in the definition, the whole term won't be there. It becomes a question of why you even want WDFF in there in the first place: do you ever want to split these fields up this way? Maybe start by just taking it out completely? Best Erick

On Thu, Jun 16, 2011 at 9:55 AM, Judioo cont...@judioo.com wrote: Fascinating. Thank you so much Erick, I'm slowly beginning to understand. So I've discovered that by defining 'splitOnNumerics=0' on the filter class 'solr.WordDelimiterFilterFactory' (for ONLY the query analyzer) I can get *closer* to my required goal! Now something else odd is occurring: it only returns 2 results when there are over 70. Why is that? I can't find where this is explained :(

query

/solr/select?omitNorms=true&q=b006m86d&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on&omitNorms=true

output

{
  responseHeader: {
    status: 0,
    QTime: 51,
    params: {
      debugQuery: on,
      fl: type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score,
      indent: on,
      q: b006m86d,
      qf: id^10 parent_id^9 brand_container_id^8 series_container_id^8 subseries_container_id^8 clip_container_id^1 clip_episode_id^1,
      wt: json,
      omitNorms: [true, true],
      defType: dismax
    }
  },
  response: {
    numFound: 2,
    start: 0,
    maxScore: 13.473297,
    docs: [
      { parent_id: , id: b006m86d, type: brand, score: 13.473297 },
      { series_container_id: , id: b00y1w9h, type: episode, brand_container_id: b006m86d, subseries_container_id: , clip_episode_id: , score: 11.437143 }
    ]
  },
  debug: {
    rawquerystring: b006m86d,
    querystring: b006m86d,
    parsedquery: +DisjunctionMaxQuery((id:b006m86d^10.0 | clip_episode_id:b006m86d | subseries_container_id:b006m86d^8.0 | series_container_id:b006m86d^8.0 | clip_container_id:b006m86d | brand_container_id:b006m86d^8.0 | parent_id:b006m86d^9.0)) (),
    parsedquery_toString: +(id:b006m86d^10.0 | clip_episode_id:b006m86d | subseries_container_id:b006m86d^8.0 | series_container_id:b006m86d^8.0 | clip_container_id:b006m86d | brand_container_id:b006m86d^8.0 | parent_id:b006m86d^9.0) (),
    explain: {
      b006m86d: 13.473297 = (MATCH) sum of: 13.473297 = (MATCH) max of: 13.473297 = (MATCH) fieldWeight(id:b006m86d in 27636), product of: 1.0 = tf(termFreq(id:b006m86d)=1) 13.473297 = idf(docFreq=2, maxDocs=783800) 1.0 = fieldNorm(field=id, doc=27636),
      b00y1w9h: 11.437143 = (MATCH) sum of: 11.437143 = (MATCH) max of: 11.437143 = (MATCH) weight(brand_container_id:b006m86d^8.0 in 61), product of: 0.82407516 = queryWeight(brand_container_id:b006m86d^8.0), product of: 8.0 = boost 13.878762 = idf(docFreq=1, maxDocs=783800) 0.007422088 = queryNorm 13.878762 = (MATCH) fieldWeight(brand_container_id:b006m86d in 61), product of: 1.0 = tf(termFreq(brand_container_id:b006m86d)=1) 13.878762 = idf(docFreq=1, maxDocs=783800) 1.0 = fieldNorm(field=brand_container_id, doc=61)
    },
    QParser: DisMaxQParser,
    altquerystring: null,
    boostfuncs: null,
    timing: {
      time: 51,
      prepare: {
        time: 6,
        org.apache.solr.handler.component.QueryComponent: { time: 5 },
        org.apache.solr.handler.component.FacetComponent: { time: 0 },
        org.apache.solr.handler.component.MoreLikeThisComponent: { time: 0 },
        org.apache.solr.handler.component.HighlightComponent: { time: 1 },
        org.apache.solr.handler.component.StatsComponent: { time: 0 },
        org.apache.solr.handler.component.DebugComponent: { time: 0 }
      },
      process: {
        time: 45, ...
Re: Document Scoring
I really wouldn't go there, it sounds like there are endless opportunities for errors! How real-time is real-time? Could you fix this entirely by:
1> adjusting expectations to, say, 5 minutes, and
2> adjusting your commit (on the master) and poll (on the slave) intervals appropriately?

Best Erick

On Thu, Jun 16, 2011 at 11:41 AM, zarni aung zau...@gmail.com wrote: Hi, I am designing my indexes to have 1 write-only master core and 2 read-only slave cores. That means the read-only cores will only have snapshots pulled from the master and will not have near-real-time changes. I was thinking about adding a hybrid read-and-write master core that will have the most recent changes from my primary data source. I am thinking to query the hybrid master and the read-only slaves and somehow intersect the results in order to support near-real-time full text search. Is this feasible? Thank you, Zarni
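For reference, the commit/poll knobs I mean are plain replication config, roughly like this (the host name and the 5-minute poll are illustrative). On the master, in solrconfig.xml:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
    </lst>
  </requestHandler>

And on each slave:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/replication</str>
      <str name="pollInterval">00:05:00</str>
    </lst>
  </requestHandler>

The pollInterval (HH:mm:ss) is effectively the how-real-time-is-real-time dial.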
Re: SOlR -- Out of Memory exception
Hmmm, are you still getting your OOM after 7M records? Or some larger number? And how are you using the CSV uploader? Best Erick

On Thu, Jun 16, 2011 at 9:14 PM, jyn7 jyotsna.namb...@gmail.com wrote: We just started using SOLR. I am trying to load a single file with 20 million records into SOLR using the CSV uploader. I keep getting an Out of Memory after loading 7 million records. Here is the config:

  <autoCommit>
    <maxDocs>1</maxDocs>
    <maxTime>6</maxTime>
  </autoCommit>

I also encountered a LockObtainFailedException:

  org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@D:\work\solr\.\data\index\write.lock
    at org.apache.lucene.store.Lock.obtain(Lock.java:84)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1097)

So I changed the lockType to single; now again I am getting an Out of Memory exception. I also increased the JVM heap space to 2048M but am still getting an Out of Memory.

-- View this message in context: http://lucene.472066.n3.nabble.com/SOlR-Out-of-Memory-exception-tp3074636p3074636.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOlR -- Out of Memory exception
Yes Erick, after changing the lock type to single, I got an OOM after loading 5.5 million records. I am using the curl command to upload the CSV. -- View this message in context: http://lucene.472066.n3.nabble.com/SOlR-Out-of-Memory-exception-tp3074636p3074765.html Sent from the Solr - User mailing list archive at Nabble.com.
omitTermFreqAndPositions in a TextField fieldType
Is it possible to use omitTermFreqAndPositions="true" in a fieldType declaration that uses class="solr.TextField"? I've tried doing this and it does not seem to work (i.e., the prx file size does not change). Using it in a field declaration does work, but I'd rather set it in the fieldType so I don't have to repeat it multiple times in my schema. From my schema.xml file:

  <fieldType name="foobar" class="solr.TextField" sortMissingLast="true" omitNorms="true" omitTermFreqAndPositions="true" indexed="true" stored="true" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

In the TextField class I found that it disables OMIT_TF_POSITIONS, which I'm assuming is the cause of my problem:

  if (schema.getVersion() > 1.1f) properties &= ~OMIT_TF_POSITIONS;

Does it even make sense to use omitTermFreqAndPositions for a TextField, or am I perhaps doing something I shouldn't be? -Michael
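P.S. For reference, the per-field form that does work for me looks like this (the field name is invented for illustration):

  <field name="product_code" type="foobar" indexed="true" stored="true" omitTermFreqAndPositions="true"/>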
Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)
Alexey, Do you mean that we keep the current index as it is and have a separate core which holds only the user-id/product-id relation, and while querying, do a join between the two cores based on the user-id? This would require us to index/delete the relation docs as and when a user's subscription for a product changes, which would introduce some latency if the indexing (we have a queue system for indexing across the various instances) or deletion is delayed.

If we want to go ahead with this solution: we are currently using Solr 1.3, so is this functionality available as a patch for Solr 1.3? Also, would it be possible to do this with a separate index instead of a core? Then I could create only one index common to all our instances and use that instance to do the join.

Thanks Sujatha

On Thu, Jun 16, 2011 at 9:27 PM, Alexey Serba ase...@gmail.com wrote: So a search for a product, once the user logs in, has to be limited to only the products that he has access to, and will translate to something like this (the product ids are obtained from the db for a particular user and can run into n values): q=search term&fq=product_id:(100 10001 ... n), but we are currently running into the too-many-Boolean-clauses expansion error. We are not able to tie the user into roles either, as each user is simply anyone who comes to the site and purchases a product. I'm wondering if the new trunk Solr join functionality can help here. * http://wiki.apache.org/solr/Join In theory you can index your products (product_id, ...) and the user-to-product many-to-many relation (user_product_id, user_id) into a single core (or different cores) and then do a join, like q=search terms&fq={!join from=product_id to=user_product_id}user_id:10101 But I haven't tried that, so I'm just speculating.
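P.S. From the wiki page it looks like the trunk join parser also accepts a fromIndex local param for joining from another core on the same Solr instance, something like

  fq={!join fromIndex=subscriptions from=user_product_id to=product_id}user_id:10101

(the core and field names here are only my guesses). But that is still core-to-core within one Solr instance, not a truly separate index, and since the join parser only exists on trunk it would presumably have to be backported to run on 1.3.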
Re: SOlR -- Out of Memory exception
If you are sending the whole CSV in a single HTTP request using curl, why not consider sending it in smaller chunks? -- View this message in context: http://lucene.472066.n3.nabble.com/SOlR-Out-of-Memory-exception-tp3074636p3075091.html Sent from the Solr - User mailing list archive at Nabble.com.
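For example, after splitting the file into pieces (each piece either needs the CSV header line, or you pass header=false&fieldnames=... instead; the URL and file names below are illustrative), something along these lines keeps each request small and defers the commit to the end:

  curl 'http://localhost:8983/solr/update/csv' --data-binary @records_part_01.csv -H 'Content-type: text/plain; charset=utf-8'
  curl 'http://localhost:8983/solr/update/csv' --data-binary @records_part_02.csv -H 'Content-type: text/plain; charset=utf-8'
  ...
  curl 'http://localhost:8983/solr/update' --data-binary '<commit/>' -H 'Content-type: text/xml; charset=utf-8'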