Re: Best practice for Delta every 2 Minutes.
http://10.1.0.10:8983/solr/payment/dataimport?command=delta-import&debug=on doesn't work; no debug is started =( Thanks, I will try mergeFactor=2.
RE: distributed architecture
Okay, I'll see what I can do. Also, for what it is worth, if anyone is in London tomorrow, I'm giving a presentation which covers this topic at the (free) Online Information 2010 exhibition at Kensington Olympia, at 3:20pm. Anyone interested is welcome to come along. I believe we're hoping to video it, so if successful, I expect it'll get put online somewhere.

Upayavira

On Wed, 01 Dec 2010 03:44, Jayant Das jayan...@hotmail.com wrote:

Hi, a diagram will be very much appreciated. Thanks, Jayant

From: u...@odoko.co.uk To: solr-user@lucene.apache.org Subject: Re: distributed architecture Date: Wed, 1 Dec 2010 00:39:40

I cannot say how mature the code for B) is, but it is not yet included in a release. If you want the ability to distribute content across multiple nodes (due to volume) and want resilience, then use both.

I've had one setup where we have two master servers, each with four cores. Then we have two pairs of slaves. Each pair mirrors the masters, so we have two hosts covering each of our cores.

Then comes the complicated bit to explain... Each of these four slave hosts had a core that was configured with a hardwired shards request parameter, which pointed to each of our shards. Actually, it pointed to VIPs on a load balancer. Those two VIPs then balanced across each of our pairs of hosts. Then, put all four of these servers behind another VIP, and we had a single address we could push requests to, for sharded and resilient search.

Now if that doesn't make any sense, let me know and I'll have another go at explaining it (or even attempt a diagram).

Upayavira

On Tue, 30 Nov 2010 13:27 -0800, Cinquini, Luca (3880) luca.cinqu...@jpl.nasa.gov wrote:

Hi, I'd like to know if anybody has suggestions/opinions on what is currently the best architecture for a distributed search system using Solr. The use case is that of a system composed of N indexes, each hosted on a separate machine, each index containing unique content. Options that I know of are:

A) Using Solr distributed search
B) Using Solr + ZooKeeper integration
C) Using replication, i.e. each node replicates all the others

It seems like options A) and B) would suffer from a fault-tolerance standpoint: if any of the nodes goes down, the search won't (at this time) return partial results, but will instead report an exception. Option C) would provide fault tolerance, at least for any search initiated at a node that is available, but would incur a large replication overhead. Did I get any of the above wrong, or does somebody have some insight on what is the best system architecture for this use case?

thanks in advance, Luca
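For reference, the hardwired shards parameter Upayavira describes is typically set as a default on a search handler in solrconfig.xml, along these lines (a sketch; the VIP hostnames and core names are illustrative, not from the thread):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <!-- each entry is a VIP that load-balances one shard across its pair of slaves -->
      <str name="shards">shard1-vip:8983/solr/core1,shard2-vip:8983/solr/core2</str>
    </lst>
  </requestHandler>

A query sent to any of the four slaves then fans out to one slave per shard via the VIPs.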
Re: distributed architecture
On Tue, 30 Nov 2010 23:11 -0800, Dennis Gearon gear...@sbcglobal.net wrote:

Wow, would you put a diagram somewhere up on the Solr site? Or, here, and I will put it somewhere there.

I'll see what I can do to make a diagram.

And, what is a VIP?

Virtual IP. It is what a load balancer uses. You assign a 'virtual IP' to your load balancer, and it is responsible for forwarding traffic sent to that IP to one of the hosts in that particular pool.

Upayavira
Re: Dynamically change master
Note, all extracted from http://wiki.apache.org/solr/SolrReplication

You'd put:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <!-- Replicate on 'startup' and 'commit'. 'optimize' is also a valid value for replicateAfter. -->
      <str name="replicateAfter">startup</str>
      <str name="replicateAfter">commit</str>
    </lst>
  </requestHandler>

into every box you want to be able to act as a master, then use:

  http://slave_host:port/solr/replication?command=fetchindex&masterUrl=<your master URL>

As the above page says better than I can: "It is possible to pass an extra attribute 'masterUrl' or other attributes like 'compression' (or any other parameter which is specified in the <lst name="slave"> tag) to do a one time replication from a master. This obviates the need for hardcoding the master in the slave."

HTH, Upayavira

On Wed, 01 Dec 2010 06:24 +0100, Tommaso Teofili tommaso.teof...@gmail.com wrote:

Hi Upayavira, this is a good start for solving my problem. Can you please tell me how such a replication URL looks? Thanks, Tommaso

2010/12/1 Upayavira u...@odoko.co.uk

Hi Tommaso, I believe you can tell each server to act as a master (which means it can have its indexes pulled from it). You can then include the master hostname in the URL that triggers a replication process. Thus, if you triggered replication from outside Solr, you'd have control over which master you pull from. Does this answer your question? Upayavira

On Tue, 30 Nov 2010 09:18 -0800, Ken Krugler kkrugler_li...@transpac.com wrote:

Hi Tommaso,

On Nov 30, 2010, at 7:41am, Tommaso Teofili wrote:

Hi all, in a replication environment, if the host where the master is running goes down for some reason, is there a way to tell the slaves to point to a different (backup) master without manually changing configuration (and restarting the slaves or their cores)? Basically I'd like to be able to change the replication master dynamically inside the slaves. Do you have any idea of how this could be achieved?

One common approach is to use VIP (virtual IP) support provided by load balancers. Your slaves are configured to use a VIP to talk to the master, so that it's easy to dynamically change which master they use, via updates to the load balancer config.

-- Ken

-- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c   w e b   m i n i n g
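Concretely, a one-time pull from a backup master would look something like this (hostnames hypothetical):

  http://slave1:8983/solr/replication?command=fetchindex&masterUrl=http://backup-master:8983/solr/replication

Since masterUrl is supplied per request, the slave's solrconfig.xml never needs to name a master at all.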
Re: Dynamically change master
Thanks Upayavira, that sounds very good. p.s.: I read that page some weeks ago and didn't get back to check on it.

2010/12/1 Upayavira u...@odoko.co.uk wrote: [full quote of the previous message snipped]
Re: ArrayIndexOutOfBoundsException in sort
Sorry, that got lost; following is my schema.xml config. I use IKTokenizer for Chinese characters.

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
      <!-- <tokenizer class="solr.WhitespaceTokenizerFactory"/> -->
      <!-- in this example, we will only use synonyms at query time
      <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
      -->
      <!-- Case insensitive stop word removal. Add enablePositionIncrements=true in both the index
           and query analyzers to leave a 'gap' for more accurate phrase queries. -->
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
              catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="true"/>
      <!-- <tokenizer class="solr.WhitespaceTokenizerFactory"/> -->
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
              catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
  </fieldType>

  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="documentId" type="tlong" indexed="true" stored="true" required="true"/>
  <field name="headline" type="text" indexed="true" stored="true" omitNorms="true" required="true"/>
  <field name="content" type="text" indexed="true" stored="true" compressed="true" omitNorms="true" required="true"/>
  <field name="author" type="text" indexed="true" stored="true" required="true" default=""/>
  <field name="pubName" type="text" indexed="true" stored="true" required="true" default=""/>
  <field name="pubType" type="tint" indexed="true" stored="true" required="true"/>
  <field name="section" type="text" indexed="true" stored="true" required="true"/>
  <field name="column" type="text" indexed="true" stored="true" required="true"/>
  <field name="folderId" type="tint" indexed="true" stored="true" required="true"/>
  <field name="userId" type="string" indexed="true" stored="true" required="true"/>
  <field name="readType" type="tint" indexed="true" stored="true" required="true"/>
  <field name="downloadType" type="tint" indexed="true" stored="true" required="true"/>
  <field name="hasImg" type="tint" indexed="false" stored="true" required="true"/>
  <field name="hasText" type="tint" indexed="false" stored="true" required="true"/>
  <field name="pubDate" type="tint" indexed="true" stored="true" required="true"/>
  <field name="trackingTime" type="tint" indexed="true" stored="true" required="true"/>
  <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

  <uniqueKey>id</uniqueKey>
  <defaultSearchField>text</defaultSearchField>
  <copyField source="headline" dest="text"/>
  <copyField source="content" dest="text"/>

On Wed, Dec 1, 2010 at 2:50 PM, Gora Mohanty g...@mimirtech.com wrote:

On Wed, Dec 1, 2010 at 10:56 AM, Jerry Li zongjie...@gmail.com wrote:

Hi team, my Solr version is 1.4. There is an ArrayIndexOutOfBoundsException when I sort on one field; the following is my code and log info. Any help will be appreciated.

Code:

  SolrQuery query = new SolrQuery();
  query.setSortField("author", ORDER.desc);

[...]

Please show us how the field author is defined in your schema.xml. Sorting has to be done on a non-tokenized field, e.g., a StrField. Regards, Gora

-- Best Regards. Jerry. Li | 李宗杰
Re: ArrayIndexOutOfBoundsException in sort
Hi, it seems to work fine again after I changed the author field type from text to string. Could anybody give some info about why? Much appreciated.

  <field name="author" type="string" indexed="true" stored="true" required="true" default=""/>

On Wed, Dec 1, 2010 at 5:20 PM, Jerry Li zongjie...@gmail.com wrote: [full quote of the previous message snipped]

-- Best Regards. Jerry. Li | 李宗杰
Spatial Search
Hi, I am a newbie to Solr. I found it really interesting, especially spatial search. I am very interested to go into its depth, but I am facing a problem using it, as I have version 1.4.1 installed on my machine while spatial search is a feature of the 4.0 version, which is not released yet. I have also read somewhere that we can use a patch for this purpose. As I am a newbie, I don't know how to install the patch or where to download it from. If anyone could help me, I'll be very thankful. Thanks in advance, and bye.
Troubles with forming query for solr.
Hi, I have some trouble with forming a query for Solr. Here is my task: I'm indexing objects with 3 fields, for example {field1, field2, field3}. In Solr's response I want to get the objects in a special order:

1. First I want to get objects where all 3 fields are matched.
2. Then I want to get objects where ONLY field1 and field2 are matched.
3. And finally I want to get objects where ONLY field2 and field3 are matched.

Could you explain to me how to form a query for my task?
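One possible shape for such a query (a sketch, not from the thread; the terms and boost values are illustrative) is a single boolean query whose clauses carry descending boosts, with negations enforcing the "ONLY" conditions:

  q=(field1:a AND field2:b AND field3:c)^100
    OR (field1:a AND field2:b AND -field3:c)^10
    OR (field2:b AND field3:c AND -field1:a)

Documents matching all three fields then outscore the field1+field2-only matches, which in turn outscore the field2+field3-only matches.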
schema design for related fields
Hi, I've built a schema for a proof of concept and it is all working fairly fine; naive maybe, but fine. However, I think we might run into trouble in the future if we ever use facets. The data models train destination city routes from an origin city:

Doc: City
  Name: cityname [unique key]
  CityType: city type values [nine possible values, so good for faceting]
  ... [other city attributes which relate directly to the doc unique key; all have a limited vocabulary, so good for faceting]
  FareJanStandard: cheapest standard fare in January (float value)
  FareJanFirst: cheapest first class fare in January (float value)
  FareFebStandard: cheapest standard fare in February (float value)
  FareFebFirst: cheapest first class fare in February (float value)
  ... etc.

The question is: how would I best facet fare price? The desire is to return the number of cities with January prices in a set of ranges, the number of cities with first class prices in a set of ranges, etc. The install is 1.4.1 running in WebLogic. Any ideas?

Lee C
Re: ArrayIndexOutOfBoundsException for query with rows=0 and sort param
On Tue, Nov 30, 2010 at 7:51 PM, Martin Grotzke martin.grot...@googlemail.com wrote:

On Tue, Nov 30, 2010 at 3:09 PM, Yonik Seeley yo...@lucidimagination.com wrote:

On Tue, Nov 30, 2010 at 8:24 AM, Martin Grotzke martin.grot...@googlemail.com wrote:

Still I'm wondering why this issue does not occur with the plain example Solr setup with 2 indexed docs. Any explanation?

It's an old option you have in your solrconfig.xml that causes a different code path to be followed in Solr:

  <!-- An optimization that attempts to use a filter to satisfy a search.
       If the requested sort does not include score, then the filterCache
       will be checked for a filter matching the query. If found, the filter
       will be used as the source of document ids, and then the sort will
       be applied to that. -->
  <useFilterForSortedQuery>true</useFilterForSortedQuery>

Most apps would be better off commenting that out or setting it to false. It only makes sense when a high number of queries will be duplicated, but with different sorts.

Great, this sounds really promising; it would be a very easy fix. I need to check this tomorrow on our test/integration server to see if changing this does the trick for us.

I just verified this fix on our test/integration system and it works - cool! Thanks a lot for this hint. Cheers, Martin
Re: SOLR for Log analysis feasibility
My thoughts exactly: it may seem fairly straightforward, but I fear for when a client wants a perfectly reasonable new feature added to their report and Solr simply cannot support it. I am hoping we won't have the scalability issues Loggly has, because we don't index and store large documents of data within Solr; most of our documents will be very small. Does anyone have any experience with using field collapsing in a production environment? Thank you for all your replies. Joe
send XML multiValued Field Solr-PHP-Client
Hello. Is anyone using Solr-PHP-Client? How are you using multivalued fields with the method addFields()? Solr says to me:

  SEVERE: java.lang.NumberFormatException: empty String

when I send a raw XML like this:

  <doc>
    <field name="uniquekey">24038608</field>
    <field name="user_id">778</field>
    <field name="reason">reason1</field>
    <field name="reason">reason1</field>
  </doc>

In the schema I defined:

  <field name="reason" type="text" indexed="true" stored="false" multiValued="true"/>
  <dynamicField name="reason_*" type="text" indexed="true" stored="false"/>

Why doesn't this work? =(
Re: QueryNorm and FieldNorm
Thanks for the answer. Is it possible to remove the queryNorm, so that all the bf boosts become an addition to the Solr score? Is omitNorms about fieldNorm or queryNorm? Thanks, Gastone

2010/11/30 Jayendra Patil jayendra.patil@gmail.com

fieldNorm is the combination of the length of the field with index and query time boosts:

1. lengthNorm = a measure of the importance of a term according to the total number of terms in the field
   - Implementation: 1/sqrt(numTerms)
   - Implication: a term matched in a field with fewer terms gets a higher score
   - Rationale: a term in a field with fewer terms is more important than one in a field with more
2. boost (index) = boost of the field at index time
   - If an index-time boost is specified, the fieldNorm value in the score includes it.
3. boost (query) = boost of the field at query time

bf is the query-time boost for a field and should affect the fieldNorm value.

queryNorm is just a normalization factor so that queries can be compared, and it will differ based on the query and results:

1. queryNorm is not related to the relevance of the document, but rather tries to make scores between different queries comparable. It is implemented as 1/sqrt(sumOfSquaredWeights).

You should not be bothered about queryNorm, as for a given query it will have the same value for all the results. Regards, Jayendra

On Tue, Nov 30, 2010 at 9:37 AM, Gastone Penzo gastone.pe...@gmail.com wrote:

Hello, can someone explain the difference between queryNorm and fieldNorm in debugQuery? Why, if I push one bf boost up, does the queryNorm go down? I made some changes; before, the situation was different. Why? Thanks

-- Gastone Penzo
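As a quick worked example of the lengthNorm formula above (numbers illustrative, ignoring boosts and norm encoding):

  lengthNorm for a 4-term field:  1/sqrt(4)  = 0.5
  lengthNorm for a 16-term field: 1/sqrt(16) = 0.25

So, all else being equal, a term matched in the 4-term field contributes twice the fieldNorm of the same match in the 16-term field.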
Re: distributed architecture
Hi, also take a look at Solandra: https://github.com/tjake/Lucandra/tree/solandra

I don't have it in prod yet, but regarding administration overhead it looks very promising. And you'll get some other neat features, like (soft) real time, for free. So it's something like A) + C) + X) - Y) ;-)

Regards, Peter.

[original question snipped]

-- http://jetwick.com twitter search prototype
Re: Spatial Search
Check JTeam's spatial search plugin; it is very easy to install.

Aisha Zafar aishazafar...@yahoo.com wrote: [original question snipped]

This e-mail was sent from an Archos 7.
Re: Solr DataImportHandler (DIH) and Cassandra
This is good timing; I am/was just about to embark on a spike, if anyone is keen to help out.

On 30 Nov 2010, at 00:37, Mark wrote:

The DataSource subclass route is what I will probably be interested in. Are there any working examples of this already out there?

On 11/29/10 12:32 PM, Aaron Morton wrote:

AFAIK there is nothing pre-written to pull the data out for you. You should be able to create your DataSource subclass: http://lucene.apache.org/solr/api/org/apache/solr/handler/dataimport/DataSource.html

Use the Hector Java library to pull data from Cassandra. I'm guessing you will need to consider how to perform delta imports, perhaps using the secondary indexes in 0.7*, or maintaining your own queues or indexes to know what has changed.

There is also the Lucandra project; not exactly what you're after, but it may be of interest anyway: https://github.com/tjake/Lucandra

Hope that helps. Aaron

On 30 Nov, 2010, at 05:04 AM, Mark static.void@gmail.com wrote:

Is there any way to use DIH to import from Cassandra? Thanks
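For reference, a minimal skeleton of that DataSource subclass route might look like the sketch below. The class shape follows the DIH DataSource API linked above, while the Cassandra/Hector specifics are left as placeholder comments (nothing here is working client code):

  import java.util.Iterator;
  import java.util.Map;
  import java.util.Properties;

  import org.apache.solr.handler.dataimport.Context;
  import org.apache.solr.handler.dataimport.DataSource;

  // Sketch of a DIH DataSource that reads rows from Cassandra.
  // Each call to getData() returns an iterator of column-name -> value maps,
  // which the entity processor turns into documents.
  public class CassandraDataSource extends DataSource<Iterator<Map<String, Object>>> {

      @Override
      public void init(Context context, Properties initProps) {
          // Read connection settings (hosts, keyspace, column family)
          // from data-config.xml and open the Hector/Thrift connection here.
      }

      @Override
      public Iterator<Map<String, Object>> getData(String query) {
          // Interpret 'query' (e.g. a column family name or key range),
          // fetch the rows via the Cassandra client, and adapt them into
          // an Iterator<Map<String, Object>>.
          throw new UnsupportedOperationException("placeholder - wire up Hector here");
      }

      @Override
      public void close() {
          // Release the client connection.
      }
  }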
Re: ArrayIndexOutOfBoundsException in sort
"It seems to work fine again after I change the author field type from text to string; could anybody give some info about it?"

http://wiki.apache.org/solr/FAQ#Why_Isn.27t_Sorting_Working_on_my_Text_Fields.3F

And also see Erick's explanation: http://search-lucene.com/m/7fnj1TtNde/sort+on+a+tokenized+field&subj=Re+Solr+sorting+problem
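In short, sorting needs a single term per document. If the field must stay tokenized for searching, one common pattern (a sketch; the _sort field name is illustrative) is to keep the text field for queries and copy it into an untokenized string field used only for sorting:

  <field name="author" type="text" indexed="true" stored="true"/>
  <field name="author_sort" type="string" indexed="true" stored="false"/>
  <copyField source="author" dest="author_sort"/>

Queries then search on author but pass sort=author_sort desc.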
Re: [PECL-DEV] Re: PHP Solr API
Hi again, I'm actually trying to implement spellcheck in a different way, and had the idea to access /solr/spellcheck to get all the required data before executing the final query to /solr/select. But that seemed to be impossible, since there is no configuration option to change the /select part of the URL: the part before it can be configured through 'path', but nothing else. Maybe it would be an idea to allow this part of the URL to be configured too, in whatever way? Regards, Stefan
Re: [PECL-DEV] Re: PHP Solr API
Oooh, sorry, I used the wrong thread for my suggestion... please just ignore this :)

On Wed, Dec 1, 2010 at 2:01 PM, Stefan Matheis matheis.ste...@googlemail.com wrote: [previous message snipped]
Re: Failover setup (is this a bad idea)
I agree with the Master with multiple Slaves setup; it is very easy using the built-in Java replication in 1.4.1.

When we set this up, it made our developers think about how we were writing to Solr. We were using the DataImportHandler (DIH) for most writes, but our app was also writing deletes directly to Solr. Since we wanted to load balance the Slaves, we couldn't have the app writing to them. Once we discussed the Master/Slave setup with our developers, we found all the areas where our app was writing and moved/centralized those into the DIH. Now the app only runs queries against the load-balanced Slaves, while the Master is used for DIH and backups only.

Thanks, robo

On Tue, Nov 30, 2010 at 7:58 AM, Jayendra Patil jayendra.patil@gmail.com wrote:

Rather, have a Master and multiple Slaves combination, with the master only being used for writes and the slaves used for reads. Master-to-slave replication is easily configurable. Two Solr instances sharing the same index, with both writing to it, is not at all a good idea. Regards, Jayendra

On Tue, Nov 30, 2010 at 7:13 AM, Keith Pope keith.p...@inflightproductions.com wrote:

Hi, I have a Windows cluster that I would like to install Solr onto; there are two nodes that provide basic failover. I was thinking of this setup:

- Tomcat installed as a Windows service
- Two Solr instances sharing the same index

The second instance would take over when the first fails, so you should never get two writes/reads at once. Is this a bad idea? Would I end up corrupting my index?

Thx Keith
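For reference, the slave side of such a setup is configured in solrconfig.xml along these lines (a sketch based on the SolrReplication wiki page; the hostname and poll interval are illustrative):

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <!-- where to pull the index from, and how often to check for a new version -->
      <str name="masterUrl">http://master-host:8983/solr/replication</str>
      <str name="pollInterval">00:00:60</str>
    </lst>
  </requestHandler>

The master carries the matching <lst name="master"> block shown earlier in this digest.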
Re: schema design for related fields
I'd think that facet.query would work for you; something like:

  facet=true&facet.query=FareJanStandard:[price1 TO price2]&facet.query=FareJanStandard:[price2 TO price3]

You can string together as many facet.query clauses as you want, across as many fields as you want; they're all independent and will get their own sections in the response.

Best, Erick

On Wed, Dec 1, 2010 at 4:55 AM, lee carroll lee.a.carr...@googlemail.com wrote: [original question snipped]
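Spelled out as a full request (the host, field names, and price boundaries are illustrative), that might look like:

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true
    &facet.query=FareJanStandard:[0 TO 25]
    &facet.query=FareJanStandard:[25 TO 50]
    &facet.query=FareJanFirst:[0 TO 25]

Each facet.query clause comes back as its own entry under facet_counts/facet_queries, holding the number of matching city docs.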
Re: Spatial Search
1.4.1 spatial is pretty much superseded by geospatial in the current code. You can download a nightly build from here: https://hudson.apache.org/hudson/

Scroll down to Solr-trunk and pick a nightly build that suits you. Follow the link through build artifacts and checkout/solr/dist, and you'll find the zip/tar files. Hudson is reporting some kinda flaky failures, but if you look at the build results you can determine whether you care. For instance, the Dec-1 build has a red ball, but all the tests pass!

Here's a good place to start with geospatial: http://wiki.apache.org/solr/SpatialSearch

Best, Erick

On Wed, Dec 1, 2010 at 2:35 AM, Aisha Zafar aishazafar...@yahoo.com wrote: [original question snipped]
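As a taste of what the trunk geospatial support looks like once you have a build (in the spirit of the SpatialSearch wiki page; the field name, point, and distance are illustrative), a radius filter is written as:

  ...&q=*:*&fq={!geofilt sfield=store pt=45.15,-93.85 d=5}

where sfield names a location-typed field, pt is the center point, and d is the radius in kilometers.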
Re: schema design for related fields
Hi Erick, so if I understand you, we could do something like:

- if Jan is selected in the user interface and we have 10 price ranges, the query would have 20 clauses (10 ranges * 2 fare classes)
- if first class is selected in the user interface and we have 10 price ranges, the query would have 120 clauses (12 months * 10 price ranges)
- if first class and Jan are selected with 10 price ranges, the query would have 10 clauses
- if we required facets to be returned for all price combinations, we'd need to supply 240 clauses

The user interface would also need to collate the individual fields into meaningful aggregates for the user (i.e. numbers by month, numbers by fare class). Have I understood, or missed the point (I usually have)?

On 1 December 2010 15:00, Erick Erickson erickerick...@gmail.com wrote: [previous messages snipped]
Re: Best practice for Delta every 2 Minutes.
If your index warming takes longer than two minutes but you're doing a commit every two minutes, you're going to run into trouble with overlapping index preparations, eventually leading to an OOM. Could this be it?

On 11/30/2010 11:36 AM, Erick Erickson wrote:

I don't know, you'll have to debug it to see if it's the thing that takes so long. Solr should be able to handle 1,200 updates in a very short time, unless there's something else going on, like you're committing after every update or something. This may help you track down performance with DIH: http://wiki.apache.org/solr/DataImportHandler#interactive

Best, Erick

On Tue, Nov 30, 2010 at 9:01 AM, stocki ist...@shopgate.com wrote:

How do you think the deltaQuery is better? XD
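For reference, the delta setup under discussion lives in DIH's data-config.xml and generally follows this shape (a sketch; the table and column names are illustrative):

  <entity name="payment" pk="id"
          query="SELECT * FROM payment"
          deltaQuery="SELECT id FROM payment WHERE last_modified &gt; '${dataimporter.last_index_time}'"
          deltaImportQuery="SELECT * FROM payment WHERE id = '${dataimporter.delta.id}'"/>

It is triggered with .../dataimport?command=delta-import; the commit each run issues is what interacts with index warming as described above.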
Re: Good example of multiple tokenizers for a single field
On 11/29/2010 5:43 PM, Robert Muir wrote:

On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

* As a tokenizer, I use the WhitespaceTokenizer.
* Then I apply a custom filter that looks for CJK chars, and re-tokenizes any CJK chars into one token per char. This custom filter was written by someone other than me; it is open source; but I'm not sure if it's actually in a public repo, or how well documented it is. I can put you in touch with the author to try and ask. There may also be a more standard filter other than the custom one I'm using that does the same thing?

You are describing what StandardTokenizer does.

Wait, StandardTokenizer already handles CJK and will put each CJK char into its own token? Really? I had no idea! Is that documented anywhere, or do you just have to look at the source to see it? I had assumed that StandardTokenizer didn't have any special handling of bytes known to be UTF-8 CJK, because that wasn't mentioned in the documentation... but it does? That would be convenient and would not require my custom code.

Jonathan
Re: Good example of multiple tokenizers for a single field
(Jonathan, I apologize for emailing you twice; I meant to hit reply-all.)

On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind rochk...@jhu.edu wrote:

Wait, StandardTokenizer already handles CJK and will put each CJK char into its own token? Really? I had no idea! Is that documented anywhere, or do you just have to look at the source to see it?

Yes, you are right, the documentation should have been more explicit: in previous releases it doesn't say anything about how it tokenizes CJK. But it does tokenize them this way, and tags them with the CJ token type. I think the documentation issue is fixed in branch_3x and trunk:

  As of Lucene version 3.1, this class implements the Word Break rules from the
  Unicode Text Segmentation algorithm, as specified in
  Unicode Standard Annex #29 (http://unicode.org/reports/tr29/).

(from http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java)

So you can read the UAX#29 report and then you know how it tokenizes text. You can also just use this demo app to see how the new one works: http://unicode.org/cldr/utility/breaks.jsp (choose Word)
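In schema.xml terms, relying on StandardTokenizer for this behavior needs nothing custom; a minimal field type (the name is illustrative) is just:

  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

CJK characters then come out one token per character, tagged with the CJ token type, while Latin text is split on the usual word boundaries.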
Re: schema design for related fields
if first is selected in the user interface and we have 10 price ranges query would be 120 cluases (12 months * 10 price ranges) What would you intend to do with the returned facet-results in this situation? I doubt you want to display 12 categories (1 for each month) ? When a user hasn't selected a date, perhaps it would be more useful to show the cheapest fare regardless of month and facet on that? This would involve introducing 2 new fields: FareDateDontCareStandard, FareDateDontCareFirst Populate these fields on indexing time, by calculating the cheapest fares over all months. This then results in every query having to support at most 20 price ranges (10 for normal and 10 for first class) HTH, Geert-Jan 2010/12/1 lee carroll lee.a.carr...@googlemail.com Hi Erick, so if i understand you we could do something like: if Jan is selected in the user interface and we have 10 price ranges query would be 20 cluases in the query (10 * 2 fare clases) if first is selected in the user interface and we have 10 price ranges query would be 120 cluases (12 months * 10 price ranges) if first and jan selected with 10 price ranges query would be 10 cluases if we required facets to be returned for all price combinations we'd need to supply 240 cluases the user interface would also need to collate the individual fields into meaningful aggragates for the user (ie numbers by month, numbers by fare class) have I understood or missed the point (i usually have) On 1 December 2010 15:00, Erick Erickson erickerick...@gmail.com wrote: I'd think that facet.query would work for you, something like: facet=truefacet.query=FareJanStandard:[price1 TO price2]facet.query:fareJanStandard[price2 TO price3] You can string as many facet.query clauses as you want, across as many fields as you want, they're all independent and will get their own sections in the response. Best Erick On Wed, Dec 1, 2010 at 4:55 AM, lee carroll lee.a.carr...@googlemail.com wrote: Hi I've built a schema for a proof of concept and it is all working fairly fine, niave maybe but fine. However I think we might run into trouble in the future if we ever use facets. The data models train destination city routes from a origin city: Doc:City Name: cityname [uniq key] CityType: city type values [nine possible values so good for faceting] ... [other city attricbutes which relate directy to the doc unique key] all have limited vocab so good for faceting FareJanStandard:cheapest standard fare in january(float value) FareJanFirst:cheapest first class fare in january(float value) FareFebStandard:cheapest standard fare in feb(float value) FareFebFirst:cheapest first fare in feb(float value) . etc The question is how would i best facet fare price? The desire is to return number of citys with jan prices in a set of ranges etc number of citys with first prices in a set of ranges etc install is 1.4.1 running in weblogic Any ideas ? Lee C
RE: how to set maxFieldLength to unlimited
Does anyone know how to index a PDF file of very big size (more than 100MB)? Thanks so much, Xiaohui

-----Original Message-----
From: Ma, Xiaohui (NIH/NLM/LHC) [C]
Sent: Tuesday, November 30, 2010 4:22 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: how to set maxFieldLength to unlimited

I set maxFieldLength to 2147483647, restarted Tomcat, and re-indexed the PDF files again. I also commented out the one in the mainIndex section. Unfortunately the files are still chopped off if the file size is more than 20MB. Any suggestions? I really appreciate your help! Xiaohui

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, November 30, 2010 2:01 PM
To: solr-user@lucene.apache.org
Subject: Re: how to set maxFieldLength to unlimited

Set the maxFieldLength value in solrconfig.xml to, say, 2147483647. Also, see this thread for a common gotcha: http://lucene.472066.n3.nabble.com/Solr-ignoring-maxFieldLength-td473263.html ; it appears you can just comment out the one in the mainIndex section.

Best, Erick

On Tue, Nov 30, 2010 at 1:48 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] xiao...@mail.nlm.nih.gov wrote:

I need to index and search some PDF files which are very big (around 1000 pages each). How can I set maxFieldLength to unlimited? Thanks so much for your help in advance, Xiaohui
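For reference, the gotcha mentioned above comes from maxFieldLength appearing twice in a stock 1.4 solrconfig.xml; the copy in <mainIndex> silently overrides the one in <indexDefaults>:

  <indexDefaults>
    <maxFieldLength>2147483647</maxFieldLength>
  </indexDefaults>
  <mainIndex>
    <!-- comment this out, or it wins over the value set above -->
    <!-- <maxFieldLength>10000</maxFieldLength> -->
  </mainIndex>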
Re: schema design for related fields
Hmmm, that's getting to be a pretty clunky query, sure enough. Now you're going to have to ensure that HTTP requests that long get through, and stuff like that...

I'm reaching a bit here, but you can facet on a tokenized field. Although that's not often done, there's no prohibition against it. So, what if you had just one field for each city that contained some abstract information about your fares? Something like:

  janstdfareclass1 jancheapfareclass3 febstdfareclass6

Now just facet on that field. Not on #values# in that field, just the field itself. You'd then have to turn those tokens into human-readable text, but that would considerably simplify your query. This probably only works if your user is selecting from pre-defined ranges; if they expect to put in arbitrary ranges, this scheme probably wouldn't work...

Best, Erick

On Wed, Dec 1, 2010 at 10:22 AM, lee carroll lee.a.carr...@googlemail.com wrote: [previous messages snipped]
${dataimporter.last_index_time} Format?
Hello All, I have a simple problem. In my conf/dataimport.properties I have last_index_time in the format '%Y-%m-%d %H:%M:%S', for example: last_index_time=2010-12-01 16\:53\:16. But when I use this property in my data-config.conf, the value arrives in the format %Y-%m-%d. For example:

  url="http://server/_solr/?last_time=${dataimporter.last_index_time}"

makes: http://server/_solr/?last_time=2010-12-01

Do you have an idea for me? Thanks a lot!

-- ~sahid
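The space in the timestamp is the likely culprit: everything after it is dropped when the value is pasted into the URL unencoded. One possible fix (a sketch using DIH's built-in encodeUrl function) is to URL-encode the property where the URL is built:

  url="http://server/_solr/?last_time=${dataimporter.functions.encodeUrl(dataimporter.last_index_time)}"

The space then arrives encoded (as + or %20) and the full timestamp survives.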
RE: how to set maxFieldLength to unlimited
You just can't set it to unlimited. What you could do is ignore the positions and put in a filter that sets the position increment for all but the first token to 0 (meaning the field length will be just 1, with all tokens stacked on the first position). You could also break per page, so you put each page at a new position.

Jan

-----Original Message-----
From: ext Ma, Xiaohui (NIH/NLM/LHC) [C] [mailto:xiao...@mail.nlm.nih.gov]
Sent: Tuesday, 30 November 2010 19:49
To: solr-user@lucene.apache.org; 'solr-user-i...@lucene.apache.org'; 'solr-user-...@lucene.apache.org'
Subject: how to set maxFieldLength to unlimited

I need to index and search some PDF files which are very big (around 1000 pages each). How can I set maxFieldLength to unlimited? Thanks so much for your help in advance, Xiaohui
Re: schema design for related fields
Geert, the UI would be something like:

User selections for the price facet:
  max price: £100
  fare class: any
City attributes facet:
  cityattribute1 etc: xxx

Results displayed something like:

  Facet price
    Standard fares [10]
    First fares [3]
    in Jan [9]
    in Feb [10]
    in March [1]
  etc.

Is this compatible with your approach?

Erick, the price is on an interval scale, i.e. a fare can be any value (not high, low, medium, etc.).

How sensible would the following approach be? Index city docs with fields only related to the city unique key; in the same index, also index fare docs, which would be something like:

  Fare:
    cityID: xxx
    Fareclass: standard
    FareMonth: Jan
    FarePrice: 100

The query would be something like:

  q=FarePrice:[* TO 100] AND FareMonth:Jan&fl=cityID

returning facets for FareClass and FareMonth. Hold on, this will not facet city docs correctly. Sorry, that's not going to work.

On 1 December 2010 16:25, Erick Erickson erickerick...@gmail.com wrote: [previous messages snipped]
Solr 3x segments file and deleting index
If I want to delete an entire index and start over, in previous versions of Solr you could stop Solr, delete all the files in the index directory, and restart Solr. Solr would then create empty segments files and you could start indexing. In Solr 3x, if I delete all the files in the index directory, I get a large stack trace with this error:

  org.apache.lucene.index.IndexNotFoundException: no segments* file found

As a workaround, whenever I delete an index (by deleting all files in the index directory), I copy the segments files that come with the Solr example to the index directory and then restart Solr. Is this a feature or a bug? What is the rationale?

Tom Burton-West
RE: how to set maxFieldLength to unlimited
Thanks so much for your reply, Jan. I just found I cannot index PDF files with a file size of more than 20MB. I use curl to index them, and didn't get any error either. Do you have any suggestions for indexing PDF files of more than 20MB? Thanks, Xiaohui

-----Original Message-----
From: jan.kure...@nokia.com [mailto:jan.kure...@nokia.com]
Sent: Wednesday, December 01, 2010 11:30 AM
To: solr-user@lucene.apache.org; solr-user-i...@lucene.apache.org; solr-user-...@lucene.apache.org
Subject: RE: how to set maxFieldLength to unlimited

[previous message snipped]
Re: schema design for related fields
Sorry Geert, I missed off the price value bit from the user interface, so we'd display:

  Facet price
    Standard fares [10]
    First fares [3]
  When traveling
    in Jan [9]
    in Feb [10]
    in March [1]
  Fare Price
    0 - 25 [20]
    25 - 50 [10]
    50 - 100 [2]

Cheers, Lee C

On 1 December 2010 17:00, lee carroll lee.a.carr...@googlemail.com wrote: [previous messages snipped]
RE: entire farm fails at the same time with OOM issues
It has typically been when query traffic was lowest! We are at 12 GB heap, so I will try to bump it to 14 GB. We have 64GB main memory installed now. Here are our settings; do these look OK? export JAVA_OPTS=-Xmx12228m -Xms12228m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Tuesday, November 30, 2010 6:44 PM To: solr-user@lucene.apache.org Subject: Re: entire farm fails at the same time with OOM issues On Tue, Nov 30, 2010 at 6:04 PM, Robert Petersen rober...@buy.com wrote: My question is this. Why in the world would all of my slaves, after running fine for some days, suddenly all at the exact same minute experience OOM heap errors and go dead? If there is no change in query traffic when this happens, then it's due to what the index looks like. My guess is a large index merge happened, which means that when the searchers re-open on the new index, it requires more memory than normal (much less can be shared with the previous index). I'd try bumping the heap a little bit, and then optimizing once a day during off-peak hours. If you still get OOM errors, bump the heap a little more. -Yonik http://www.lucidimagination.com
Re: distributed architecture
Hi, thanks all, this has been very instructive. It looks like in the short term using a combination of replication and sharding, based on Upayavira's setup, might be the safest thing to do, while in the longer term following the zookeeper integration and solandra development might provide a more dynamic environment and perhaps an easier setup. Please keep the good suggestions coming if you feel like it. thanks again, Luca On Dec 1, 2010, at 4:17 AM, Peter Karich wrote: Hi, also take a look at solandra: https://github.com/tjake/Lucandra/tree/solandra I don't have it in prod yet, but regarding administration overhead it looks very promising. And you'll get some other neat features, like (soft) real time, for free. So it's the same as A) + C) + X) - Y) ;-) Regards, Peter. -- http://jetwick.com twitter search prototype
Re: Good example of multiple tokenizers for a single field
On Wed, Dec 1, 2010 at 11:01 AM, Robert Muir rcm...@gmail.com wrote: (Jonathan, I apologize for emailing you twice, i meant to hit reply-all) On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind rochk...@jhu.edu wrote: Wait, StandardTokenizer already handles CJK and will put each CJK char into its own token? Really? I had no idea! Is that documented anywhere, or do you just have to look at the source to see it? Yes, you are right, the documentation should have been more explicit: in previous releases it didn't say anything about how it tokenizes CJK. But it does tokenize them this way, and tags them with the CJ token type. I think the documentation issue is fixed in branch_3x and trunk: * As of Lucene version 3.1, this class implements the Word Break rules from the * Unicode Text Segmentation algorithm, as specified in * <a href="http://unicode.org/reports/tr29/">Unicode Standard Annex #29</a>. (from http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java ) So you can read the UAX#29 report and then you know how it tokenizes text. You can also just use this demo app to see how the new one works: http://unicode.org/cldr/utility/breaks.jsp (choose Word) What does this mean to those of us on Solr 1.4 and Lucene 2.9.3? Does the current stable StandardTokenizer handle CJK? -- Jacob Elder @jelder (646) 535-3379
Re: Good example of multiple tokenizers for a single field
On Wed, Dec 1, 2010 at 12:25 PM, Jacob Elder jel...@locamoda.com wrote: What does this mean to those of us on Solr 1.4 and Lucene 2.9.3? Does the current stable StandardTokenizer handle CJK? yes
Re: entire farm fails at the same time with OOM issues
On Nov 30, 2010, at 5:16pm, Robert Petersen wrote: What would I do with the heap dump though? Run one of those java heap analyzers looking for memory leaks or something? I have no experience with those. I saw there was a bug fix in solr 1.4.1 for a 100-byte memory leak occurring on each commit, but it would take thousands of commits to make that add up to anything, right? Typically when I run out of memory in Solr, it's during an index update, when the new index searcher is getting warmed up. Looking at the heap often shows ways to reduce memory requirements, e.g. you'll see a really big chunk used for a sorted field. See http://wiki.apache.org/solr/SolrCaching and http://wiki.apache.org/solr/SolrPerformanceFactors for more details. -- Ken -Original Message- From: Ken Krugler [mailto:kkrugler_li...@transpac.com] Sent: Tuesday, November 30, 2010 3:12 PM To: solr-user@lucene.apache.org Subject: Re: entire farm fails at the same time with OOM issues Hi Robert, I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=path to where you want the file to go, so then you have something to look at versus a Gedankenexperiment :) -- Ken On Nov 30, 2010, at 3:04pm, Robert Petersen wrote: Greetings, we are running one master and four slaves of our multicore solr setup. We just served searches for our catalog of 8 million products with this farm during Black Friday and Cyber Monday, our busiest days of the year, and the servers did not break a sweat! Index size is about 28GB. However, twice now recently during a time of low load we have had a fire drill where I have seen tomcat/solr fail and become unresponsive after some OOM heap errors. Solr wouldn't even serve up its admin pages. I've had to go in and manually kill tomcat and then restart it. These solr slaves are load balanced and the load balancers always probe the solr slaves, so if they stop serving up searches they are automatically removed from the load balancer. When all four fail at the same time we have an issue! My question is this. Why in the world would all of my slaves, after running fine for some days, suddenly all at the exact same minute experience OOM heap errors and go dead? The load balancer kicks them all out at the same time each time. Each slave only talks to the master and not to each other, but the master shows no errors in the logs at all. Something must be triggering this though. The only other odd thing I saw in the logs was after the first OOM errors were recorded, the slaves started occasionally not being able to get to the master. This behavior makes me a little nervous... =:-o eek! Environment: Lucid Imagination distro of Solr 1.4 on Tomcat Platform: RHEL with Sun JRE 1.6.0_18 on dual quad xeon machines with 64GB memory etc etc http://ken-blog.krugler.org +1 530-265-2225 -- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
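For reference, wiring in the flags Ken mentions just means extending the Tomcat launch settings along these lines (the dump path is illustrative):

export JAVA_OPTS="-Xmx12228m -Xms12228m -XX:+UseConcMarkSweepGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/solr-heapdumps"

The resulting .hprof file can then be opened in a heap analyzer such as jhat or Eclipse MAT to see which structures (caches, sorted fields, etc.) dominate the heap.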
Re: Solr 3x segments file and deleting index
On 12/1/2010 10:12 AM, Burton-West, Tom wrote: If I want to delete an entire index and start over, in previous versions of Solr you could stop Solr, delete all files in the index directory, and restart Solr. Solr would then create empty segments files and you could start indexing. In Solr 3.x, if I delete all the files in the index directory I get a large stack trace with this error: You have to delete the index directory entirely. This looks like a change in Lucene, not Solr specifically. If the directory exists but has nothing in it, it throws an exception. I'll leave the rationale question that you also asked to someone who might actually know. I personally think it shouldn't behave this way, but the dev team may have encountered something that required that the directory either be a valid index or not exist at all. Shawn
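In shell terms (paths illustrative), under 3.x you would remove the directory itself rather than just its contents:

rm -rf /path/to/solr/data/index    # OK: the directory is recreated on startup
rm -rf /path/to/solr/data/index/*  # triggers the exception: directory exists but holds no valid index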
Re: schema design for related fields
Ok, longer answer than anticipated (and good conceptual practice ;-) Yeah, I believe that would work, if I understand correctly that 'in Jan [9] in feb [10] in march [1]' has nothing to do with pricing, but only with availability? If so you could separate it out as two separate issues: 1.) showing pricing (based on context) 2.) showing availabilities (based on context) For 1.) you get 39 price fields ([jan,feb,..,dec,dc] * [standard,first,dc]) note: 'dc' indicates 'don't care'. Depending on the context you query the correct price field to populate the price facet-values. For discussion let's call the fields _p[fare][date]. In other words, the price field for no preference at all would become _pdcdc. For 2.) define a multivalued field 'FaresPerDate' which indicates availability, which is used to display: A) Standard fares [10] First fares [3] B) in Jan [9] in feb [10] in march [1] A) depends on your selection (or not caring) about a month; B) vice versa depends on your selection (or not caring) about a fare type. Given all possible date values [jan,feb,..,dec,dontcare] and all possible fare values [standard,first,dontcare], FaresPerDate consists of multiple values per document, where each value indicates the availability of a combination of 'fare' and 'date': (standardJan,firstJan,DCJan,...,standardDec,firstDec,DCDec,standardDC,firstDC,DCDC). Note that the nr of possible values = 39. Example: 1.) the user hasn't selected any preference: q=*:*&facet.field=FaresPerDate&facet.query=_pdcdc:[0 TO 20]&facet.query=_pdcdc:[20 TO 40], etc. In the client you have to make sure to select the correct values of 'FaresPerDate' for display, in this case: Standard fares [10] <- FaresPerDate.standardDC First fares [3] <- FaresPerDate.firstDC in Jan [9] <- FaresPerDate.DCJan in feb [10] <- FaresPerDate.DCFeb in march [1] <- FaresPerDate.DCMarch 2.) the user has selected January: q=*:*&facet.field=FaresPerDate&fq=FaresPerDate:DCJan&facet.query=_pDCJan:[0 TO 20]&facet.query=_pDCJan:[20 TO 40] Standard fares [10] <- FaresPerDate.standardJan First fares [3] <- FaresPerDate.firstJan in Jan [9] <- FaresPerDate.DCJan in feb [10] <- FaresPerDate.DCFeb in march [1] <- FaresPerDate.DCMarch Hope that helps, Geert-Jan
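A minimal schema.xml sketch of Geert-Jan's layout, assuming the stock example field types (a dynamic field keeps the 39 per-fare/per-month price fields manageable; names follow the convention above):

<dynamicField name="_p*" type="tfloat" indexed="true" stored="false"/>
<field name="FaresPerDate" type="string" indexed="true" stored="false" multiValued="true"/>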
Re: schema design for related fields
Also, filtering and sorting on price can be done as well. Just be sure to use the correct price field. Geert-Jan
Re: schema design for related fields
Hi Geert, Ok I think I follow. The magic is in the multi-valued field. The only danger would be complexity if we allow users to multi-select months/prices/fare classes. For example they can search for first prices in Jan, April and November. I think what you describe is possible in this case, just complicated. I'll see if I can hack some facets into the prototype tomorrow. Thanks for your help Lee C
ramBufferSizeMB not reflected in segment sizes in index
We are using a recent Solr 3.x (see below for exact version). We have set the ramBufferSizeMB to 320 in both the indexDefaults and the mainIndex sections of our solrconfig.xml: <ramBufferSizeMB>320</ramBufferSizeMB> <mergeFactor>20</mergeFactor> We expected that this would mean that the index would not write to disk until it reached somewhere approximately over 300MB in size. However, we see many small segments that look to be around 80MB in size. We have not yet issued a single commit, so nothing else should force a write to disk. With a merge factor of 20 we also expected to see larger segments somewhere around 320 * 20 = 6GB in size; however, we see several around 1GB. We understand that the sizes are approximate, but these seem nowhere near what we expected. Can anyone explain what is going on? BTW maxBufferedDocs is commented out, so this should not be affecting the buffer flushes: <!-- <maxBufferedDocs>1000</maxBufferedDocs> --> Solr Specification Version: 3.0.0.2010.11.19.16.00.54; Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54; Lucene Specification Version: 3.1-SNAPSHOT; Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10 Tom Burton-West
Re: how to set maxFieldLength to unlimited
I don't know about upload limitations, but for sure there are some in the default settings; this could explain the limit of 20MB. Which upload mechanism on the Solr side do you use? I guess this is not a Lucene problem but rather the HTTP layer of Solr. If you manage to stream your PDF and start parsing it on the stream, you should then go for the filter that sets the positionIncrement to 0, as mentioned. What we did once for PDF files: we parsed them beforehand into plain text and indexed that (but we were using Lucene directly) with a stream reader. Regards, Jan On 01.12.2010 at 18:13, ext Ma, Xiaohui (NIH/NLM/LHC) [C] xiao...@mail.nlm.nih.gov wrote: Thanks so much for your reply, Jan. I just found I cannot index pdf files with a file size of more than 20MB. I use curl to index them, and didn't get any error either. Do you have any suggestions for indexing pdf files of more than 20MB? Thanks, Xiaohui -Original Message- From: jan.kure...@nokia.com [mailto:jan.kure...@nokia.com] Sent: Wednesday, December 01, 2010 11:30 AM To: solr-user@lucene.apache.org; solr-user-i...@lucene.apache.org; solr-user-...@lucene.apache.org Subject: RE: how to set maxFieldLength to unlimited You just can't set it to unlimited. What you could do is ignore the positions and put a filter in that sets the position increment for all but the first token to 0 (meaning the field length will be just 1, with all tokens stacked on the first position). You could also break per page, so you put each page on a new position. Jan -Original Message- From: ext Ma, Xiaohui (NIH/NLM/LHC) [C] [mailto:xiao...@mail.nlm.nih.gov] Sent: Dienstag, 30. November 2010 19:49 To: solr-user@lucene.apache.org; 'solr-user-i...@lucene.apache.org'; 'solr-user-...@lucene.apache.org' Subject: how to set maxFieldLength to unlimited I need to index and search some pdf files which are very big (around 1000 pages each). How can I set maxFieldLength to unlimited? Thanks so much for your help in advance, Xiaohui
Re: ramBufferSizeMB not reflected in segment sizes in index
The ram efficiency (= size of segment once flushed divided by size of RAM buffer) can vary drastically. Because the in-RAM data structures must be growable (to append new docs to the postings as they are encountered), the efficiency is never 100%. I think 50% is actually a good ram efficiency, and lower than that (even down to 27%) I think is still normal. Do you have many unique or low-doc-freq terms? That brings the efficiency down. If you turn on IndexWriter's infoStream and post the output, we can see if anything odd is going on... 80 * 20 = ~1.6 GB, so I'm not sure why you're getting 1 GB segments. Do you do any deletions in this run? A merged segment's size will often be less than the sum of the parts, especially if there are many terms that are shared across segments; the infoStream will also show what merges are taking place. Mike
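For anyone wanting to try Mike's suggestion, infoStream can also be switched on from the indexDefaults section of solrconfig.xml rather than through IndexWriter directly; a sketch (the file name is illustrative):

<infoStream file="INFOSTREAM.txt">true</infoStream>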
Solr highlighting is double-quotes-aware?
Not sure how to write that subject line. I'm getting some weird behavior out of the highlighter in Solr. It seems like an edge case, but I'm curious to hear if this is known about, or if it's something worth looking into further. Background: I'm using Solr's highlighting facility to tag words found in content crawled via Nutch. I split up the content based on those tags, which is later fed into a moderation process. Sample Data (snippet from larger content): [url=\http://www.sampleurl.com/baffle_prices.html\]baffle[/url] (My hl.simple.pre is set to TEST_KEYWORD_START and my hl.simple.post is set to TEST_KEYWORD_END) Query for baffle, and Solr highlights it thus: TEST_KEYWORD_STARTbaffle_prices.html\]baffleTEST_KEYWORD_END What should be happening is this: TEST_KEYWORD_STARTbaffleTEST_KEYWORD_END_prices.html\]TEST_KEYWORD_STARTbaffleTEST_KEYWORD_END Is there something about this data that makes the highlighter not want to split it up? Do I have to have Solr tokenize the words by some character that I somehow excluded? Thank you, Scott Gonyea
Re: ramBufferSizeMB not reflected in segment sizes in index
I have seen this. In Solr 1.4.1, the .fdt, .fdx, and the .tv* files do not segment, but all the other files do. I can't remember whether it behaves the same under 3.1, or whether it also creates these files in each segment. Here's the first segment created during a test reindex I just started, excluding the previously mentioned files, which will be prefixed by _57 until I choose to optimize the index:
-rw-r--r-- 1 ncindex ncindex       315 Dec  1 12:40 _58.fnm
-rw-r--r-- 1 ncindex ncindex  26000115 Dec  1 12:40 _58.frq
-rw-r--r-- 1 ncindex ncindex    399124 Dec  1 12:40 _58.nrm
-rw-r--r-- 1 ncindex ncindex  23879227 Dec  1 12:40 _58.prx
-rw-r--r-- 1 ncindex ncindex    205874 Dec  1 12:40 _58.tii
-rw-r--r-- 1 ncindex ncindex  16000953 Dec  1 12:40 _58.tis
My ramBufferSize is 256MB, and those files add up to about 66MB. My guess is that it takes 256MB of RAM to represent what condenses down to 66MB on the disk. When it had accumulated 16 segments, it merged them down to this, all the while continuing to index. This is about 870MB:
-rw-r--r-- 1 ncindex ncindex       338 Dec  1 12:56 _5n.fnm
-rw-r--r-- 1 ncindex ncindex 376423659 Dec  1 12:58 _5n.frq
-rw-r--r-- 1 ncindex ncindex   5726860 Dec  1 12:58 _5n.nrm
-rw-r--r-- 1 ncindex ncindex 331890058 Dec  1 12:58 _5n.prx
-rw-r--r-- 1 ncindex ncindex   2037072 Dec  1 12:58 _5n.tii
-rw-r--r-- 1 ncindex ncindex 154470775 Dec  1 12:58 _5n.tis
If this merge were to happen 16 more times (256 segments created), it would then do a super-merge down to one very large segment. In your case, with a mergeFactor of 20, that would take 400 segments. I only ever saw this happen once - when I built a single index with all 49 million documents in it. Shawn
Re: schema design for related fields
Indeed, selecting the best price for January OR April OR November and sorting on it isn't possible with this solution (if that's what you mean). However, any combination of selecting 1 month and/or 1 price-range and/or 1 fare-type IS possible.
RE: how to set maxFieldLength to unlimited
Thanks so much, Jan. I use curl to index pdf files. Is there another way to do it? I changed the positionIncrement to 0, but I didn't get it to work either. Thanks, Xiaohui
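For the record, the usual curl invocation against the ExtractingRequestHandler (Solr Cell) looks roughly like this; the id literal and file name are made up, and the /update/extract handler must be configured in solrconfig.xml:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@bigfile.pdf"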
RE: ramBufferSizeMB not reflected in segment sizes in index
Thanks Mike, Yes, we have many unique terms due to dirty OCR and 400 languages, and probably lots of low-doc-freq terms as well (although with the ICUTokenizer and ICUFoldingFilter we should get fewer terms caused by bad tokenization and normalization.) Is this additional overhead because each unique term takes a certain amount of space, compared to adding entries to a list for an existing term? Does turning on IndexWriter's infoStream have a significant impact on memory use or indexing speed? If it does, I'll reproduce this on our test server rather than turning it on for a bit on the production indexer. If it doesn't, I'll turn it on and post here. Tom
Re: entire farm fails at the same time with OOM issues
also try to minimize maxWarmingSearchers to 1(?) or 2. And decrease cache usage (especially autowarming) if possible at all. But again: only if it doesn't affect performance ... Regards, Peter.
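In solrconfig.xml terms, Peter's two suggestions translate to something like the following (values are illustrative; tune them against your own traffic):

<maxWarmingSearchers>2</maxWarmingSearchers>
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>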
Re: ramBufferSizeMB not reflected in segment sizes in index
On Wed, Dec 1, 2010 at 3:16 PM, Burton-West, Tom tburt...@umich.edu wrote: Thanks Mike, Yes we have many unique terms due to dirty OCR and 400 languages and probably lots of low doc freq terms as well (although with the ICUTokenizer and ICUFoldingFilter we should get fewer terms due to bad tokenization and normalization.) OK, likely this explains the lowish RAM efficiency. Is this additional overhead because each unique term takes a certain amount of space compared to adding entries to a list for an existing term? Exactly. There's a highish startup cost for each term, but then appending docs/positions to that term is more efficient, especially for higher-frequency terms. In the limit, a single unique term across all docs will have very high RAM efficiency... Does turning on IndexWriter's infoStream have a significant impact on memory use or indexing speed? I don't believe so. Mike
RE: entire farm fails at the same time with OOM issues
Good idea. Our farm is behind Akamai so that should be ok to do.
Re: Good example of multiple tokenizers for a single field
On Tue, Nov 30, 2010 at 10:07 AM, Robert Muir rcm...@gmail.com wrote: On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder jel...@locamoda.com wrote: Right. CJK doesn't tend to have a lot of whitespace to begin with. In the past, we were using a patched version of StandardTokenizer which treated @twitteruser and #hashtag better, but this became a release engineering nightmare, so we switched to Whitespace. In this case, have you considered using a CharFilter (e.g. MappingCharFilter) before the tokenizer? This way you could map your special things such as @ and # to some other string that the tokenizer doesn't split on, e.g. # => HASH_, so your #foobar goes to HASH_foobar. If you want searches of #foobar to only match #foobar and not also foobar itself, and vice versa, you are done. Maybe you want searches of #foobar to only match #foobar, but searches of foobar to match both #foobar and foobar. In this case, you would probably use a WordDelimiterFilter w/ preserveOriginal at index time only, followed by a StopFilter containing HASH, so you index HASH_foobar and foobar. Anyway, I think you have a lot of flexibility to reuse StandardTokenizer but customize things like this without maintaining your own tokenizer; this is the purpose of CharFilters. That worked brilliantly. Thank you very much, Robert. -- Jacob Elder @jelder (646) 535-3379
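A sketch of the analyzer chain Robert describes, for anyone wanting to try it (the field type name and mapping file name are made up for illustration):

<fieldType name="text_social" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specials.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

with mapping-specials.txt containing:

"#" => "HASH_"
"@" => "AT_"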
Re: Return Lucene DocId in Solr Results
Take this with a sizeable grain of salt as I haven't actually tried doing this. But you might try using an IndexReader, which it looks like you can get from this class: http://lucene.apache.org/solr/api/org/apache/solr/core/StandardIndexReaderFactory.html sasank On Tue, Nov 30, 2010 at 6:45 AM, Lohrenz, Steven steven.lohr...@hmhpub.com wrote: Hmm, I found some similar queries on stackoverflow and they did not recommend exposing the lucene docId. So, I guess my question becomes: What is the best way, from within my custom QParser, to take a list of solr primary keys (that were retrieved from elsewhere) and turn them into docIds? I also saw something about caching them using a FieldCache - how would I do that? Thanks, Steve -Original Message- From: Lohrenz, Steven [mailto:steven.lohr...@hmhpub.com] Sent: 30 November 2010 11:57 To: solr-user@lucene.apache.org Subject: Return Lucene DocId in Solr Results Hi, I was wondering how I would go about getting the lucene docid included in the results from a solr query? I've built a QueryParser to query another solr instance and join the results of the two instances through the use of a Filter. The Filter needs the lucene docid to work. This is the only bit I'm missing right now. Thanks, Steve
Re: Return Lucene DocId in Solr Results
On the face of it, this doesn't make sense, so perhaps you can explain a bit. The doc IDs from one Solr instance have no relation to the doc IDs from another Solr instance, so anything that uses doc IDs from one Solr instance to create a filter on another instance doesn't seem to be something you'd want to do... Which may just mean I don't understand what you're trying to do. Can you back up a bit and describe the higher-level problem? This seems like it may be an XY problem, see: http://people.apache.org/~hossman/#xyproblem Best Erick On Tue, Nov 30, 2010 at 6:57 AM, Lohrenz, Steven steven.lohr...@hmhpub.com wrote: Hi, I was wondering how I would go about getting the lucene docid included in the results from a solr query? I've built a QueryParser to query another solr instance and join the results of the two instances through the use of a Filter. The Filter needs the lucene docid to work. This is the only bit I'm missing right now. Thanks, Steve
Re: ArrayIndexOutOfBoundsException in sort
Got it, with thanks. On Wed, Dec 1, 2010 at 8:02 PM, Ahmet Arslan iori...@yahoo.com wrote: It seems to work fine again after I changed the author field type from text to string; could anybody give some info about it? Very appreciated. http://wiki.apache.org/solr/FAQ#Why_Isn.27t_Sorting_Working_on_my_Text_Fields.3F And also see Erick's explanation http://search-lucene.com/m/7fnj1TtNde/sort+on+a+tokenized+fieldsubj=Re+Solr+sorting+problem -- Best Regards. Jerry. Li
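For anyone hitting the same sorting issue, the usual pattern is to keep the tokenized field for searching and add an untokenized string copy purely for sorting; a sketch (field names are illustrative):

<field name="author" type="text" indexed="true" stored="true"/>
<field name="author_sort" type="string" indexed="true" stored="false"/>
<copyField source="author" dest="author_sort"/>

Queries then sort with sort=author_sort asc while still searching against the tokenized author field.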
Re: spatial query parsing error: org.apache.lucene.queryParser.ParseException
Try this... http://localhost:8080/solr/select?wt=json&indent=true&q={!spatial%20lat=37.326375%20lng=-121.892639%20radius=3%20unit=km%20threadCount=3}title:Art%20Loft - Original Message - From: Dennis Gearon gear...@sbcglobal.net To: solr-user@lucene.apache.org Sent: Wednesday, December 01, 2010 7:51 PM Subject: spatial query parsing error: org.apache.lucene.queryParser.ParseException I am trying to get spatial search to work on my Solr installation. I am running version 1.4.1 with the Jayway Team spatial-solr-plugin. I am performing the search with the following url: http://localhost:8080/solr/select?wt=json&indent=true&q=title:Art%20Loft{!spatial%20lat=37.326375%20lng=-121.892639%20radius=3%20unit=km%20threadCount=3} The result that I get is the following error: HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse 'title:Art Loft{!spatial lat=37.326375 lng=-121.892639 radius=3 unit=km threadCount=3}': Encountered RANGEEX_GOOP lng=-121.892639 at line 1, column 38. Was expecting: } Not sure why it would be complaining about the lng parameter in the query. I double-checked to make sure that I had the right name for the longitude field in my solrconfig.xml file. Any help/suggestions would be greatly appreciated Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: Preventing index segment corruption when windows crashes
Is there any way that Windows 7 and disk drivers are not honoring the fsync() calls? That would cause files and/or blocks to get saved out of order. On Tue, Nov 30, 2010 at 3:24 PM, Peter Sturge peter.stu...@gmail.com wrote: After a recent Windows 7 crash (:-\), upon restart, Solr starts giving LockObtainFailedException errors: (excerpt) 30-Nov-2010 23:10:51 org.apache.solr.common.SolrException log SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: nativefsl...@solr\.\.\data0\index\lucene-ad25f73e3c87e6f192c4421756925f47-write.lock When I run CheckIndex, I get: (excerpt) 30 of 30: name=_2fi docCount=857 compound=false hasProx=true numFiles=8 size (MB)=0.769 diagnostics = {os.version=6.1, os=Windows 7, lucene.version=3.1-dev ${svnversion} - 2010-09-11 11:09:06, source=flush, os.arch=amd64, java.version=1.6.0_18, java.vendor=Sun Microsystems Inc.} no deletions test: open reader.FAILED WARNING: fixIndex() would remove reference to this segment; full exception: org.apache.lucene.index.CorruptIndexException: did not read all bytes from file _2fi.fnm: read 1 vs size 512 at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:367) at org.apache.lucene.index.FieldInfos.init(FieldInfos.java:71) at org.apache.lucene.index.SegmentReader$CoreReaders.init(SegmentReader.java:119) at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:583) at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:561) at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:467) at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:878) WARNING: 1 broken segments (containing 857 documents) detected This seems to happen every time Windows 7 crashes, and it would seem extraordinarily bad luck for this tiny test index to be in the middle of a commit every time. (it is set to commit every 40 secs, but for such a small index it only takes millis to complete) Does this seem right? I don't remember seeing so many corruptions in the index - maybe it is the world of Win7 dodgy drivers, but it would be worth investigating if there's something amiss in Solr/Lucene when things go down unexpectedly... Thanks, Peter On Tue, Nov 30, 2010 at 9:19 AM, Peter Sturge peter.stu...@gmail.com wrote: The index itself isn't corrupt - just one of the segment files. This means you can read the index (less the offending segment(s)), but once this happens it's no longer possible to access the documents that were in that segment (they're gone forever), nor write/commit to the index (depending on the env/request, you get 'Error reading from index file..' and/or WriteLockError) (note that for my use case, documents are dynamically created so can't be re-indexed). Restarting Solr fixes the write lock errors (an indirect environmental symptom of the problem), and running CheckIndex -fix is the only way I've found to repair the index so it can be written to (rewrites the corrupted segment(s)). I guess I was wondering if there's a mechanism that would support something akin to a transactional rollback for segments. Thanks, Peter On Mon, Nov 29, 2010 at 5:33 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Mon, Nov 29, 2010 at 10:46 AM, Peter Sturge peter.stu...@gmail.com wrote: If a Solr index is running at the time of a system halt, this can often corrupt a segments file, requiring the index to be -fix'ed by rewriting the offending file. Really? That shouldn't be possible (if you mean the index is truly corrupt - i.e. you can't open it).
-Yonik http://www.lucidimagination.com -- Lance Norskog goks...@gmail.com
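For reference, the CheckIndex runs mentioned in this thread are invoked straight from the Lucene core jar, roughly like this (jar name and index path are illustrative; back up the index first, since -fix permanently removes whatever it cannot repair):

java -cp lucene-core-3.1-dev.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /path/to/solr/data/index -fix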
Re: spatial query parsing error: org.apache.lucene.queryParser.ParseException
Thanks Jean-Sebastien. I forwarded it to my partner. His membership is still being held up. I'll be the go-between until he has access. Dennis Gearon
Re: spatial query parsing error: org.apache.lucene.queryParser.ParseException
I just saw the parameter 'lng' in your query... I believe it should be 'long'. Give it a try if the link I sent you is not working.

- Original Message - From: Dennis Gearon gear...@sbcglobal.net To: solr-user@lucene.apache.org Sent: Wednesday, December 01, 2010 11:39 PM Subject: Re: spatial query parsing error: org.apache.lucene.queryParser.ParseException

Thanks Jean-Sebastien. I forwarded it to my partner. His membership is still being held up. I'll be the go-between until he has access.

Dennis Gearon
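The essential difference between the two URLs in this thread is that the {!spatial ...} local-params block must come at the start of the q parameter; appending it after title:Art%20Loft is what produced the ParseException. A minimal SolrJ sketch under that assumption (host, field name and plugin parameters are taken from the thread; whether the plugin wants 'lng' or 'long' is the open question above):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class SpatialQueryExample {
      public static void main(String[] args) throws Exception {
          CommonsHttpSolrServer server =
              new CommonsHttpSolrServer("http://localhost:8080/solr");
          SolrQuery query = new SolrQuery();
          // The local-params block prefixes the query string; putting it
          // after title:"Art Loft" triggers the RANGEEX_GOOP parse error.
          query.setQuery("{!spatial lat=37.326375 lng=-121.892639 radius=3 "
              + "unit=km threadCount=3}title:\"Art Loft\"");
          QueryResponse rsp = server.query(query);
          System.out.println("hits: " + rsp.getResults().getNumFound());
      }
  }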
best way to get maxDocs in java (i.e. as on stats.jsp page).
hi all, What's the best way to programmatically (in Java) get the 'maxDoc' attribute (as seen on the stats.jsp page)? I don't see any hooks in the SolrJ API. Currently I plan to use an HTTP client to get stats.jsp (which returns XML) and parse it with XPath. If anyone can recommend a better approach, please opine. thanks will
Re: spatial query parsing error: org.apache.lucene.queryParser.ParseException
Forwarded to my partner, thx, will let you know.

Dennis Gearon

Signature Warning: It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.

- Original Message - From: Jean-Sebastien Vachon js.vac...@videotron.ca To: solr-user@lucene.apache.org Sent: Wed, December 1, 2010 8:50:58 PM Subject: Re: spatial query parsing error: org.apache.lucene.queryParser.ParseException

I just saw the parameter 'lng' in your query... I believe it should be 'long'. Give it a try if the link I sent you is not working.
Re: best way to get maxDocs in java (i.e. as on stats.jsp page).
(10/12/02 13:51), Will Milspec wrote:

hi all, What's the best way to programmatically (in Java) get the 'maxDoc' attribute (as seen on the stats.jsp page)? I don't see any hooks in the SolrJ API. Currently I plan to use an HTTP client to get stats.jsp (which returns XML) and parse it with XPath. If anyone can recommend a better approach, please opine. thanks will

Will,

Try: http://localhost:8983/solr/admin/luke

LukeRequestHandler: http://wiki.apache.org/solr/LukeRequestHandler

Koji
--
http://www.rondhuit.com/en/
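If SolrJ is already on the classpath, the Luke handler Koji points to can be queried without hand-rolling HTTP and XPath. A sketch assuming SolrJ 1.4's LukeRequest/LukeResponse classes and the stock example URL:

  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.request.LukeRequest;
  import org.apache.solr.client.solrj.response.LukeResponse;

  public class MaxDocExample {
      public static void main(String[] args) throws Exception {
          CommonsHttpSolrServer server =
              new CommonsHttpSolrServer("http://localhost:8983/solr");
          LukeRequest req = new LukeRequest();  // hits /admin/luke
          req.setShowSchema(false);             // index stats only
          LukeResponse rsp = req.process(server);
          // getIndexInfo() carries the same counters stats.jsp displays.
          Integer maxDoc = (Integer) rsp.getIndexInfo().get("maxDoc");
          System.out.println("maxDoc: " + maxDoc);
      }
  }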
problems with custom SolrCache.init() - fails on startup
My project has a couple of custom caches that descend from FastLRUCache. These worked fine in Solr 1.3. Then I started migrating my project to Solr 1.4.1 and had problems during startup. I believe the problem is that I attempt to access the core during the init process. I currently use the deprecated SolrCore.getSolrCore(), but had the same problem when attempting to use CoreContainer. During initialization, I need access to the IndexSchema object. I assume the problem is that startup now creates objects in a different order. Does anyone have any suggestions on how to get access to the core infrastructure when the caches start up?
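One workaround worth trying (a sketch, not a verified fix - the class name is hypothetical, the init() signature is Solr 1.4's SolrCache API): keep init() free of any core lookups and resolve the IndexSchema lazily on first use, by which time the core is registered.

  import java.util.Map;

  import org.apache.solr.core.SolrCore;
  import org.apache.solr.schema.IndexSchema;
  import org.apache.solr.search.CacheRegenerator;
  import org.apache.solr.search.FastLRUCache;

  // Hypothetical custom cache: defers all core access until first use.
  public class SchemaAwareCache extends FastLRUCache {

      private volatile IndexSchema schema;

      public Object init(Map args, Object persistence, CacheRegenerator regenerator) {
          // Touch nothing core-related here: caches are constructed
          // before the core has finished registering in 1.4.
          return super.init(args, persistence, regenerator);
      }

      protected IndexSchema getSchema() {
          if (schema == null) {
              // Deprecated and single-core only, but by the time the cache
              // is first used the core exists and the lookup succeeds.
              schema = SolrCore.getSolrCore().getSchema();
          }
          return schema;
      }
  }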
Restrict access to localhost
Hello all,

1) I want to restrict access to Solr to localhost only. How do I achieve that?

2) If I want to allow clients to search but not to delete, how do I restrict access?

Any thoughts?

Regards,
Ganesh.
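A generic servlet-level sketch covering both restrictions (the class name is hypothetical and this is not Solr-specific configuration - binding the servlet container to 127.0.0.1 achieves (1) more simply): a filter that rejects non-loopback clients, which could be mapped to the /update path to block writes while leaving /select open to everyone.

  import java.io.IOException;

  import javax.servlet.Filter;
  import javax.servlet.FilterChain;
  import javax.servlet.FilterConfig;
  import javax.servlet.ServletException;
  import javax.servlet.ServletRequest;
  import javax.servlet.ServletResponse;
  import javax.servlet.http.HttpServletResponse;

  // Hypothetical filter: allow only loopback clients through.
  public class LocalhostOnlyFilter implements Filter {
      public void init(FilterConfig cfg) {}
      public void destroy() {}

      public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
              throws IOException, ServletException {
          String addr = req.getRemoteAddr();
          if ("127.0.0.1".equals(addr) || "0:0:0:0:0:0:0:1".equals(addr)) {
              chain.doFilter(req, res);  // local caller: pass through
          } else {
              ((HttpServletResponse) res).sendError(HttpServletResponse.SC_FORBIDDEN);
          }
      }
  }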