Re: Lucene FieldCache - Out of memory exception
Here is one sample query that I picked up from the log file : q=*%3A*&fq=Category%3A%223__107%22&fq=S_P1540477699%3A%22MICROCIRCUIT%2C+LINE+TRANSCEIVERS%22&rows=0&facet=true&facet.mincount=1&facet.limit=2&facet.field=S_C1503120369&facet.field=S_P1406389942&facet.field=S_P1430116878&facet.field=S_P1430116881&facet.field=S_P1406453552&facet.field=S_P1406451296&facet.field=S_P1406452465&facet.field=S_C2968809156&facet.field=S_P1406389980&facet.field=S_P1540477699&facet.field=S_P1406389982&facet.field=S_P1406389984&facet.field=S_P1406451284&facet.field=S_P1406389926&facet.field=S_P1424886581&facet.field=S_P2017662632&facet.field=F_P1946367021&facet.field=S_P1430116884&facet.field=S_P2017662620&facet.field=F_P1406451304&facet.field=F_P1406451306&facet.field=F_P1406451308&facet.field=S_P1500901421&facet.field=S_P1507138990&facet.field=I_P1406452433&facet.field=I_P1406453565&facet.field=I_P1406452463&facet.field=I_P1406453573&facet.field=I_P1406451324&facet.field=I_P1406451288&facet.field=S_P1406451282&facet.field=S_P1406452471&facet.field=S_P1424886605&facet.field=S_P1946367015&facet.field=S_P1424886598&facet.field=S_P1946367018&facet.field=S_P1406453556&facet.field=S_P1406389932&facet.field=S_P2017662623&facet.field=S_P1406450978&facet.field=F_P1406452455&facet.field=S_P1406389972&facet.field=S_P1406389974&facet.field=S_P1406389986&facet.field=F_P1946367027&facet.field=F_P1406451294&facet.field=F_P1406451286&facet.field=F_P1406451328&facet.field=S_P1424886593&facet.field=S_P1406453567&facet.field=S_P2017662629&facet.field=S_P1406453571&facet.field=F_P1946367030&facet.field=S_P1406453569&facet.field=S_P2017662626&facet.field=S_P1406389978&facet.field=F_P1946367024 My primary question here is: can Solr handle this kind of query, with so many facet fields? I have tried using both enum and fc for facet.method and there is no improvement with either. Appreciate any help on this. Thank you.
- Rahul

On Mon, Apr 30, 2012 at 2:53 PM, Rahul R rahul.s...@gmail.com wrote:

Hello, I am using solr 1.3 with jdk 1.5.0_14 and weblogic 10MP1 application server on Solaris. I use embedded solr server. More details:
Number of docs in solr index: 1.4 million
Physical size of index: 640MB
Total number of fields in the index: 700 (99% of these are dynamic fields)
Total number of fields enabled for faceting: 440
Avg number of facet fields participating in a faceted query: 50-70
Total RAM allocated to weblogic appserver: 3GB (max possible)

In a multi-user environment with 3 users using this application for a period of around 40 minutes, the application runs out of memory. Analysis of the heap dump shows that almost 85% of the memory is retained by the FieldCache. Now I understand that the field cache is out of our control, but I would appreciate some suggestions on how to handle this issue. Some questions on this front:
- Some mail threads on this forum seem to indicate that there could be some connection between having dynamic fields and usage of FieldCache. Is this true? Most of the fields in my index are dynamic fields.
- As mentioned above, most of my faceted queries could have around 50-70 facet fields (I would do SolrQuery.addFacetField() for around 50-70 fields per query). Could this be the source of the problem? Is this too high for Solr to support?
- Initially, I had a facet.sort defined in solrconfig.xml. Since FieldCache builds up on sorting, I even removed the facet.sort and tried, but no respite. The behavior is the same as before.
- The document id that I have for each document is quite big (around 50 characters on average). Can this be a problem? I reduced this to around 15 characters and tried, but still there is no improvement.
- Can the size of the data be a problem? But on this forum, I see many users talking of more than 100 million documents in their index. I have only 1.4 million with a physical size of 640MB. The physical server on which this application is running has sufficient RAM and CPU.
- What gets stored in the FieldCache? Is it the entire document or just the document id?

Any help is much appreciated. Thank you. regards Rahul
Re: Removing old documents
With which client? paul

On 2 May 2012, at 01:29, alx...@aim.com wrote: all caching is disabled and I restarted jetty. The same results.
Re: Solr: extracting/indexing HTML via cURL
You can have two fields: one which is stripped, and another which stores the original data. You can use copyField directives and make the stripped field indexed but not stored, and the original field stored but not indexed. You only have to upload the file once, and only store the text once. If you look in the default schema, you'll find a bunch of text fields are all copied to text or text_all, which is indexed but not stored. This catch-all field is the default search field. http://lucidworks.lucidimagination.com/display/solr/Copying+Fields

On Mon, Apr 30, 2012 at 2:06 PM, okayndc bodymo...@gmail.com wrote: Great, thank you for the input. My understanding of HTMLStripCharFilter is that it strips HTML tags, which is not what I want ~ is this correct? I want to keep the HTML tags intact.

On Mon, Apr 30, 2012 at 11:55 AM, Jack Krupansky j...@basetechnology.com wrote: If by extracting HTML content via cURL you mean using SolrCell to parse html files, this seems to make sense. The sequence is that regardless of the file type, each file extraction parser will strip off all formatting and produce a raw text stream. Office, PDF, and HTML files are all treated the same in that way. Then the unformatted text stream is sent through the field type analyzers to be tokenized into terms that Lucene can index. The input string to the field type analyzer is what gets stored for the field, but this occurs after the extraction file parser has already removed formatting. There is no way for the formatting to be preserved in that case, other than to go back to the original input document before extraction parsing. If you really do want to preserve full HTML formatted text, you would need to define a field whose field type uses the HTMLStripCharFilter and then directly add documents that direct the raw HTML to that field. There may be some other way to hook into the update processing chain, but that may be too much effort compared to the HTML strip filter.
-- Jack Krupansky

-Original Message- From: okayndc Sent: Monday, April 30, 2012 10:07 AM To: solr-user@lucene.apache.org Subject: Solr: extracting/indexing HTML via cURL

Hello, over the weekend I experimented with extracting HTML content via cURL and was just wondering why the extraction/indexing process does not include the HTML tags. It seems as though the HTML tags are either being ignored or stripped somewhere in the pipeline. If this is the case, is it possible to include the HTML tags, as I would like to keep the formatted HTML intact? Any help is greatly appreciated.

-- Lance Norskog goks...@gmail.com
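The two-field copyField arrangement Lance describes above might be sketched in schema.xml roughly as follows. This is an assumption-laden illustration, not config from the thread: the field names (body_html, body_text) and the text_html_stripped type are made up, and the analyzer chain is just one plausible choice.

```xml
<!-- Stored but not indexed: keeps the original HTML intact for display -->
<field name="body_html" type="string" indexed="false" stored="true"/>

<!-- Indexed but not stored: the searchable copy, with tags stripped -->
<field name="body_text" type="text_html_stripped" indexed="true" stored="false"/>

<!-- One upload populates both fields -->
<copyField source="body_html" dest="body_text"/>

<!-- Hypothetical type whose analyzer removes HTML before tokenizing -->
<fieldType name="text_html_stripped" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Queries then search body_text while the application displays body_html, so the raw markup is stored exactly once.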
RE: Solr Merge during off peak times
Ok, thanks Otis Another question on merging What is the best way to monitor merging? Is there something in the log file that I can look for? It seems like I have to monitor the system resources - read/write IOPS etc.. and work out when a merge happened It would be great if I can do it by looking at log files or in the admin UI. Do you know if this can be done or if there is some tool for this? Thanks Prabhu -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: 01 May 2012 15:12 To: solr-user@lucene.apache.org Subject: Re: Solr Merge during off peak times Hi Prabhu, I don't think such a merge policy exists, but it would be nice to have this option and I imagine it wouldn't be hard to write if you really just base the merge or no merge decision on the time of day (and maybe day of the week). Note that this should go into Lucene, not Solr, so if you decide to contribute your work, please see http://wiki.apache.org/lucene-java/HowToContribute Otis Performance Monitoring for Solr - http://sematext.com/spm From: Prakashganesh, Prabhu prabhu.prakashgan...@dowjones.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Tuesday, May 1, 2012 8:45 AM Subject: Solr Merge during off peak times Hi, I would like to know if there is a way to configure index merge policy in solr so that the merging happens during off peak hours. Can you please let me know if such a merge policy configuration exists? Thanks Prabhu
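One way to watch merges from the log side (my suggestion, not something mentioned in the thread): Lucene's infoStream can be enabled in solrconfig.xml, which makes IndexWriter log its low-level activity, including merge starts and finishes, to a file. In Solr 3.x the setting lives under indexDefaults; the filename here is just a placeholder.

```xml
<indexDefaults>
  <!-- Verbose IndexWriter diagnostics, including segment merges -->
  <infoStream file="INFOSTREAM.txt">true</infoStream>
</indexDefaults>
```

The output is verbose and meant for debugging, so it is usually turned on only while investigating merge behavior, not left on in production.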
Re: should slave replication be turned off / on during master clean and re-index?
Simply turn off replication during your rebuild-from-scratch. See http://wiki.apache.org/solr/SolrReplication#HTTP_API - the disablereplication command. The autocommit thing was, I think, in reference to keeping a partially rebuilt index from being replicated. Autocommit is usually a fine thing. So your full rebuild looks like this:
1. disable replication on the master
2. rebuild the index (autocommit on or off makes little difference as far as replication)
3. enable replication on the master
Best Erick

On Tue, May 1, 2012 at 8:55 AM, geeky2 gee...@hotmail.com wrote: hello shawn, thanks for the reply. ok - i did some testing and yes you are correct. autocommit is doing the commit work in chunks. yes - the slaves are also going from having everything to nothing, then slowly building back up again, lagging behind the master. ... and yes - this is probably not what we need as far as a replication strategy for the slaves. you said you don't use autocommit. if so - then why don't you use / like autocommit? since we have not done this here - there is no established reference point, from an operations perspective. i am looking to formulate some sort of operations strategy, so ANY ideas or input are really welcome. it seems to me that we have to account for two operational strategies - the first operational mode is a daily append to the solr core after the database tables have been updated. this can probably be done with a simple delta import. i would think that autocommit could remain on for the master and replication could also be left on so the slaves pick up the changes ASAP. this seems like the mode that we would / should be in most of the time. the second operational mode would be a build-from-scratch mode, where changes in the schema necessitated a full re-index of the data.
given that our site (powered by solr) must be up all of the time, and that our full index time on the master (for the moment) is hovering somewhere around 16 hours - it makes sense that some sort of parallel path - with a cut-over - must be used. in this situation, is it possible to have the indexing process going on in the background - then have one commit at the end - then turn replication on for the slaves? are there disadvantages to this approach? also - i really like your suggestion of a build core and a live core. is this the approach you use? thank you for all of the great input -- View this message in context: http://lucene.472066.n3.nabble.com/should-slave-replication-be-turned-off-on-during-master-clean-and-re-index-tp3945531p3952904.html Sent from the Solr - User mailing list archive at Nabble.com.
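Erick's steps 1 and 3 above map onto the replication handler's HTTP API (per the SolrReplication wiki page he links). A sketch of the sequence, with host, port, and core name as placeholders:

```
http://master:8983/solr/replication?command=disablereplication
    ... run the full rebuild, with one commit at the end ...
http://master:8983/solr/replication?command=enablereplication
```

Once replication is re-enabled, slaves pick up the fully rebuilt index on their next poll, so they never see a partially built state.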
Re: Solr Merge during off peak times
Why do you care? Merging is generally a background process, or are you doing heavy indexing? In a master/slave setup, it's usually not really relevant except that (with 3.x) massive merges may temporarily stop indexing. Is that the problem? Look at the merge policies; there are configurations that make this less painful. In trunk, DocumentsWriterPerThread makes merges happen in the background, which helps the long-pause-while-indexing problem. Best Erick
Re: Lucene FieldCache - Out of memory exception
The FieldCache gets populated the first time a given field is referenced as a facet and then will stay around forever. So, as additional queries get executed with different facet fields, the number of FieldCache entries will grow. If I understand what you have said, these faceted queries do work initially, but after a while they stop working with OOM, correct? The size of a single FieldCache depends on the field type. Since you are using dynamic fields, it depends on your dynamicField types - which you have not told us about. From your query I see that your fields start with S_ and F_ - presumably you have dynamic field types S_* and F_*? Are they strings, integers, floats, or what? Each FieldCache will be an array with maxdoc entries (your total number of documents - 1.4 million) times the size of the field value, or whatever a string reference is in your JVM. String fields will take more space than numeric fields for the FieldCache, since a separate table is maintained for the unique terms in that field. Roughly what is the typical or average length of one of your facet field values? And, on average, how many unique terms are there within a typical faceted field? If you can convert many of these faceted fields to simple integers the size should go down dramatically, but that depends on your application. 3 GB sounds like it might not be enough for such heavy use of faceting. It is probably not the 50-70 number, but the 440 or accumulated number across many queries that pushes the memory usage up. When you hit OOM, what does the Solr admin stats display say for FieldCache?
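A rough back-of-envelope check of Jack's sizing argument. All the per-entry sizes are illustrative assumptions (4 bytes per doc for an int cache; about an 8-byte ordinal/reference per doc plus a unique-term table for strings; 100k unique terms of ~20 bytes each), not measured numbers:

```python
# FieldCache sizing sketch for the setup in the thread: 1.4M docs,
# 440 faceted fields, 3 GB heap.
maxdoc = 1_400_000

# A numeric (int) FieldCache is one 4-byte entry per document.
int_field_bytes = maxdoc * 4
print(f"int field: {int_field_bytes / 2**20:.1f} MB")       # ~5.3 MB

# A string FieldCache holds roughly a reference per document plus the
# table of unique terms (assume 100k terms averaging 20 bytes).
string_field_bytes = maxdoc * 8 + 100_000 * 20
print(f"string field: {string_field_bytes / 2**20:.1f} MB")  # ~12.6 MB

# If most of the 440 faceted fields end up cached as strings, the total
# comfortably exceeds the 3 GB heap, matching the observed OOM.
total = 440 * string_field_bytes
print(f"440 string fields: {total / 2**30:.1f} GB")          # ~5.4 GB
```

This is why converting string facet fields to integers, or faceting on fewer distinct fields, shrinks the footprint so dramatically.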
-- Jack Krupansky -Original Message- From: Rahul R Sent: Wednesday, May 02, 2012 2:22 AM To: solr-user@lucene.apache.org Subject: Re: Lucene FieldCache - Out of memory exception
RE: Solr Merge during off peak times
We have a fairly large scale system - about 200 million docs and fairly high indexing activity - about 300k docs per day with peak ingestion rates of about 20 docs per sec. I want to work out what a good mergeFactor setting would be by testing with different mergeFactor settings. I think the default of 10 might be high; I want to try with 5 and compare. Unless I know when a merge starts and finishes, it would be quite difficult to work out the impact of changing mergeFactor. I want to be able to measure how long merges take, run queries during the merge activity and see what the response times are, etc. Thanks Prabhu
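For reference, the knob Prabhu wants to experiment with lives in solrconfig.xml. A minimal sketch for Solr 3.x (the value 5 is the trial setting he mentions; indexDefaults is where this sits in the stock 3.x config):

```xml
<indexDefaults>
  <!-- How many segments accumulate at each level before being merged;
       lower values mean fewer segments but more frequent merge work -->
  <mergeFactor>5</mergeFactor>
</indexDefaults>
```

Comparing query latency and indexing throughput under mergeFactor 5 versus the default 10 is then an A/B test against otherwise identical configs.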
Re: Solr Merge during off peak times
But again, with a master/slave setup merging should be relatively benign. And at 200M docs, having an M/S setup is probably indicated. Here's a good writeup of merge policy internals: http://juanggrande.wordpress.com/2011/02/07/merge-policy-internals/ If you're indexing and searching on a single machine, merging is much less important than how often you commit. In an M/S situation, your polling interval on the slave is important. I'd look at commit frequency long before I worried about merging; that's usually where people shoot themselves in the foot - by committing too often. Overall, your mergeFactor is probably less important than other parts of how you perform indexing/searching, but it does have some effect for sure... Best Erick
Null Pointer Exception in SOLR
Hi, When I tried to remove data from the UI (which will in turn hit SOLR), the whole application got stuck. When we took the log files of the UI, we could see that this set of requests did not reach SOLR itself. In the SOLR log file, we were able to find the following exception occurring at the same time:

SEVERE: org.apache.solr.common.SolrException: java.lang.NullPointerException
java.lang.NullPointerException
request: http://solr/coreX/select
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request
at org.apache.solr.handler.component.HttpCommComponent$1.call
at org.apache.solr.handler.component.HttpCommComponent$1.call
at java.util.concurrent.FutureTask$Sync.innerRun
at java.util.concurrent.FutureTask.run
at java.util.concurrent.Executors$RunnableAdapter.call
at java.util.concurrent.FutureTask$Sync.innerRun
at java.util.concurrent.FutureTask.run
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask
at java.util.concurrent.ThreadPoolExecutor$Worker.run
at java.lang.Thread.run

This situation persisted for another few hours. No one was able to perform any operation with the application, and if anyone tried to perform any action, it resulted in the above exception during that period. But this situation resolved by itself after a few hours and it started working normally again. Can you tell me if this situation was due to a deadlock condition, or was it due to the CPU utilization going beyond 100%? If it was due to a deadlock, then why did we not get any such messages in the log files? Or is it due to some other problem? Am I missing anything? Can you guide me on this? -- View this message in context: http://lucene.472066.n3.nabble.com/Null-Pointer-Exception-in-SOLR-tp3954952.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Newbie question on sorting
Erick, I'll do that. Thank you very much. Regards, Jacek

On Tue, May 1, 2012 at 7:19 AM, Erick Erickson erickerick...@gmail.com wrote: The easiest way is to do that in the app. That is, return the top 10 to the app (by score), then re-order them there. There's nothing in Solr that I know of that does what you want out of the box. Best Erick

On Mon, Apr 30, 2012 at 11:10 AM, Jacek pjac...@gmail.com wrote: Hello all, I'm facing this simple problem, yet impossible to resolve for me (I'm a newbie in Solr). I need to sort the results by score (which is simple, of course), but then what I need is to take the top 10 results and re-order them (only those top 10 results) by a date field. It's not the same as sort=score,creationdate. Any suggestions will be greatly appreciated!
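Erick's in-app re-ordering can be sketched as follows. The dicts stand in for whatever documents your Solr client returns, and the field names (score, creationdate) are assumptions from Jacek's example, not a real response schema:

```python
from datetime import date

# Documents as returned by Solr, already sorted by score (highest first).
docs = [
    {"id": "a", "score": 9.1, "creationdate": date(2012, 3, 1)},
    {"id": "b", "score": 8.7, "creationdate": date(2012, 4, 15)},
    {"id": "c", "score": 8.2, "creationdate": date(2012, 1, 20)},
]

# Take the top 10 by score, then re-order just those by date, newest first.
top_n = docs[:10]
reordered = sorted(top_n, key=lambda d: d["creationdate"], reverse=True)
print([d["id"] for d in reordered])  # ['b', 'a', 'c']
```

Note how this differs from sort=score,creationdate: the date only breaks ties there, whereas here it fully re-ranks the already-selected top 10.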
RE: Solr Merge during off peak times
Actually we are not thinking of an M/S setup. We are planning to have x number of shards on N number of servers, with each shard handling both indexing and searching. The expected query volume is not that high, so we don't think we would need to replicate to slaves. We think each shard will be able to handle its share of the indexing and searching. If we need to scale query capacity in the future, yeah, we'd probably need to do it by replicating each shard to its slaves. I agree autoCommit settings would be good to set up appropriately. Another question I had is the pros/cons of optimising the index. We would be purging old content every week, and I am thinking whether to run an index optimise at the weekend after purging old data. Because we are going to be continuously indexing data - a mix of adds, updates, and deletes - I am not sure if the benefit of optimising would last long enough to be worth doing. Maybe setting a low mergeFactor would be good enough. Optimising makes sense if the index is more static, perhaps? Thoughts? Thanks Prabhu
ExtractRH: How to strip metadata
Greetings Solr folk, How can I instruct the extract request handler to ignore metadata/headers etc. when it constructs the content of the document I send to it? For example, I created an MS Word document containing just the word SEARCHWORD and nothing else. However, when I ship this doc to my solr server, here's what's thrown in the index: <str name="meta">Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 1086 Creation-Date 2008-11-05T20:19:00Z stream_content_type application/octet-stream Character Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD</str> All I want is the body of the document, in this case the word SEARCHWORD. For further reference, here's my extraction handler: <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler"> <lst name="defaults"> <!-- All the main content goes into "text"... if you need to return the extracted text or do highlighting, use a stored field. --> <str name="fmap.content">meta</str> <str name="lowernames">true</str> <str name="uprefix">ignored_</str> </lst> </requestHandler> (Ironically, meta is the field in the solr schema to which I'm attempting to extract the body of the document. Don't ask). Thanks in advance for any pointers you can provide me. -- - Joe
Re: Solr Merge during off peak times
Optimizing is much less important query-speed wise than historically, essentially it's not recommended much any more. A significant effect of optimize _used_ to be purging obsolete data (i.e. that from deleted docs) from the index, but that is now done on merge. There's no harm in optimizing on off-peak hours, and combined with an appropriate merge policy that may make indexing a little better (I'm thinking of not doing as many massive merges here). BTW, in 4.0, there's DocumentWriterPerThread that merges in the background and pretty much removes even this as a motivation for optimizing. All that said, optimizing isn't _bad_, it's just often unnecessary. Best Erick On Wed, May 2, 2012 at 9:29 AM, Prakashganesh, Prabhu prabhu.prakashgan...@dowjones.com wrote: Actually we are not thinking of a M/S setup We are planning to have x number of shards on N number of servers, each of the shard handling both indexing and searching The expected query volume is not that high, so don't think we would need to replicate to slaves. We think each shard will be able to handle its share of the indexing and searching. If we need to scale query capacity in future, yeah probably need to do it by replicating each shard to its slaves I agree autoCommit settings would be good to set up appropriately Another question I had is pros/cons of optimising the index. We would be purging old content every week and am thinking whether to run an index optimise in the weekend after purging old data. Because we are going to be continuously indexing data which would be mix of adds, updates, deletes, not sure if the benefit of optimising would last long enough to be worth doing it. Maybe setting a low mergeFactor would be good enough. Optimising makes sense if the index is more static, perhaps? Thoughts? 
Thanks Prabhu -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: 02 May 2012 13:15 To: solr-user@lucene.apache.org Subject: Re: Solr Merge during off peak times But again, with a master/slave setup merging should be relatively benign. And at 200M docs, having a M/S setup is probably indicated. Here's a good writeup of mergepolicy http://juanggrande.wordpress.com/2011/02/07/merge-policy-internals/ If you're indexing and searching on a single machine, merging is much less important than how often you commit. If a M/S situation, then your polling interval on the slave is important. I'd look at commit frequency long before I worried about merging, that's usually where people shoot themselves in the foot - by committing too often. Overall, your mergeFactor is probably less important than other parts of how you perform indexing/searching, but it does have some effect for sure... Best Erick
Dumb question: Streaming collector /query results
I doubt that SOLR has this capability, given that it is based on a RESTful architecture, but I wanted to ask in case I'm mistaken. In Lucene, it is easier to gain a direct handle to the collector / scorer and access all the results as they're collected (as opposed to the SOLR query call that performs the same internally but returns only a subset of results based on the spec'd number of results and offset from the first result). What are my options if I want to access results as they're generated? My first thought would be to write a custom collector to handle the hits as they're scored. Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Dumb-question-Streaming-collector-query-results-tp3955175.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Dumb question: Streaming collector /query results
In other words, .. as an alternative , what's the most efficient way to gain access to all of the document ids that match a query -- View this message in context: http://lucene.472066.n3.nabble.com/Dumb-question-Streaming-collector-query-results-tp3955175p3955194.html Sent from the Solr - User mailing list archive at Nabble.com.
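At the Lucene level, the custom-collector idea mentioned above might look like the following minimal sketch against the Lucene 3.x Collector API (the class name is illustrative; requires lucene-core on the classpath):

```java
// Sketch: a Lucene 3.x Collector that gathers every matching doc id.
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

public class AllDocIdsCollector extends Collector {
    private final List<Integer> docIds = new ArrayList<Integer>();
    private int docBase;

    @Override
    public void setScorer(Scorer scorer) { /* scores not needed here */ }

    @Override
    public void collect(int doc) {
        docIds.add(docBase + doc); // called once per hit, as it is collected
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) {
        this.docBase = docBase;    // offset per-segment ids to index-wide ids
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;               // we only accumulate ids; order is irrelevant
    }

    public List<Integer> getDocIds() { return docIds; }
}
// usage: searcher.search(query, new AllDocIdsCollector());
```

Because collect() fires per hit as matching proceeds, this avoids the start/rows paging that a normal Solr query response imposes.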
Re: ExtractRH: How to strip metadata
Check to see if you have a copyField for a wildcard pattern that copies to meta, which would copy all of the Tika-generated fields to meta. -- Jack Krupansky -Original Message- From: Joseph Hagerty Sent: Wednesday, May 02, 2012 9:56 AM To: solr-user@lucene.apache.org Subject: ExtractRH: How to strip metadata
Re: ExtractRH: How to strip metadata
I do not. I commented out all of the copyFields provided in the default schema.xml that ships with 3.5. My schema is rather minimal. Here is my fields block, if this helps: <fields> <field name="cust" type="string" indexed="true" stored="true" required="true" /> <field name="asset" type="string" indexed="true" stored="true" required="true" /> <field name="ent" type="string" indexed="true" stored="true" required="true" /> <field name="meta" type="text_en" indexed="true" stored="true" required="true" /> <dynamicField name="ignored_*" type="ignored" multiValued="true"/> <!--field name="modified" type="dateTime" indexed="true" stored="true" required="false" /--> </fields> On Wed, May 2, 2012 at 10:59 AM, Jack Krupansky j...@basetechnology.com wrote: Check to see if you have a copyField for a wildcard pattern that copies to meta, which would copy all of the Tika-generated fields to meta. -- Jack Krupansky -- - Joe
question about dates
Hi :) I'm starting to use Solr and I'm facing a little problem with dates. My documents have a date property which is of the form 'yyyyMMdd'. To index these dates, I use the following code: String dateString = "20101230"; SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd"); Date date = sdf.parse(dateString); doc.addField("date", date); In the index, the date 20101230 is saved as 2010-12-29T23:00:00Z (because of GMT). Now I would like to query documents which have their date property equal to 20101230 but I don't know how to handle this. I tried the following code: String dateString = "20101230"; SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd"); Date date = sdf.parse(dateString); SimpleDateFormat gmtSdf = new SimpleDateFormat("yyyy-MM-dd'T'HH\\:mm\\:ss'Z'"); String gmtString = gmtSdf.format(date); The problem is that gmtString is equal to 2010-12-30T00\:00\:00Z. There is a difference between the indexed value and the parameter value of my query :/ I see that there might be something to do with the timezones during the date-to-string and string-to-date conversions but I can't find it. Thanks, Gary
Re: question about dates
The trailing Z is required in your input data to be indexed, but the Z is not actually stored. Your query must have the trailing Z though, unless you are doing a wildcard or prefix query. -- Jack Krupansky -Original Message- From: G.Long Sent: Wednesday, May 02, 2012 11:18 AM To: solr-user@lucene.apache.org Subject: question about dates
SOLRJ: Is there a way to obtain a quick count of total results for a query
I can achieve this by building a query with start and rows = 0, and using queryResponse.getResults().getNumFound(). Are there any more efficient approaches to this? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/SOLRJ-Is-there-a-way-to-obtain-a-quick-count-of-total-results-for-a-query-tp3955322.html Sent from the Solr - User mailing list archive at Nabble.com.
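For reference, the start/rows=0 approach in SolrJ terms (server URL and query are illustrative); this is generally the idiomatic way to get a count, since with rows=0 no documents are fetched and only the response header and numFound come back:

```java
// Sketch (SolrJ 3.x): count matches without retrieving any documents
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrQuery query = new SolrQuery("*:*");
query.setRows(0);                             // header + numFound only
QueryResponse response = server.query(query);
long total = response.getResults().getNumFound();
```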
Re: question about dates
Oops... I meant to say that Solr doesn't *index* the trailing Z, but it is stored (the stored value, not the indexed value.) The query must match the indexed value, not the stored value. -- Jack Krupansky -Original Message- From: Jack Krupansky Sent: Wednesday, May 02, 2012 11:55 AM To: solr-user@lucene.apache.org Subject: Re: question about dates
Re: question about dates
That wasn't right either... the query must have the trailing Z, which Solr will strip off to match the indexed value which doesn't have the Z. So, my corrected original statement is: The trailing Z is required in your input data to be indexed, but the Z is not actually indexed by Solr (it is stripped), although the stored value of the field, if any, would have the original value with the Z. Your query must have the trailing Z though (which Solr will strip off), unless you are doing a wildcard or prefix query. Sorry about that. -- Jack Krupansky -Original Message- From: Jack Krupansky Sent: Wednesday, May 02, 2012 11:59 AM To: solr-user@lucene.apache.org Subject: Re: question about dates
Re: Error with distributed search and Suggester component (Solr 3.4)
Hi Robert, On May 1, 2012, at 7:07pm, Robert Muir wrote: On Tue, May 1, 2012 at 6:48 PM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi list, Does anybody know if the Suggester component is designed to work with shards? I'm not really sure it is? They would probably have to override the default merge implementation specified by SpellChecker. What confuses me is that Suggester says it's based on SpellChecker, which supposedly does work with shards. But, all of the current suggesters pump out over 100,000 QPS on my machine, so I'm wondering what the usefulness of this is? And if it was useful, merging results from different machines is pretty inefficient, for suggest you would shard by term instead so that you need only contact a single host? The issue is that I've got a configuration with 8 shards already that I'm trying to leverage for auto-complete. My quick dirty work-around would be to add a custom response handler that wraps the suggester, and returns results with the fields that the SearchHandler needs to do the merge. -- Ken -- Ken Krugler http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Mahout Solr
Solr 3.5 - Elevate.xml causing issues when placed under /data directory
Hello, I just started using elevation for solr. I am on solr 3.5, running with Drupal 7, Linux. 1. I updated my solrconfig.xml from <dataDir>${solr.data.dir:./solr/data}</dataDir> to <dataDir>/usr/local/tomcat2/data/solr/dev_d7/data</dataDir> 2. I placed my elevate.xml in my solr's data directory. Based on forum answers, I thought placing elevate.xml under the data directory would pick up my latest changes. I restarted tomcat. 3. When I placed my elevate.xml under the conf directory, elevation was working with the url: http://mysolr.www.com:8181/solr/elevate?q=games&wt=xml&sort=score+desc&fl=id,bundle_name But when I moved it to the data directory, I am not seeing any results. NOTE: I can see catalina.out printing solr reading the file from the data directory. I tried to give invalid entries; I noticed solr errors parsing elevate.xml from the data directory. I even tried to send some documents to index, thinking a commit might help to read the elevate config file. But nothing helped. I don't understand why the below url does not work anymore. There are no errors in the log files. http://mysolr.www.com:8181/solr/elevate?q=games&wt=xml&sort=score+desc&fl=id,bundle_name Any help on this topic is appreciated. Thanks
Re: ExtractRH: How to strip metadata
I did some testing, and evidently the meta field is treated specially from the ERH. I copied the example schema, and added both meta and metax fields and set fmap.content=metax, and lo and behold only the doc content appears in metax, but all the doc metadata appears in meta. Although, I did get 400 errors with Solr complaining that meta was not a multivalued field. This is with Solr 3.6. What release of Solr are you using? I was not aware of this undocumented feature. I haven't checked the code yet. -- Jack Krupansky -Original Message- From: Joseph Hagerty Sent: Wednesday, May 02, 2012 11:10 AM To: solr-user@lucene.apache.org Subject: Re: ExtractRH: How to strip metadata
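Following Jack's metax experiment above, a config sketch that simply maps the extracted body to a field whose name doesn't collide with the metadata behavior (the field name body is illustrative and would have to exist in schema.xml; the handler declaration otherwise mirrors the one quoted in this thread):

```xml
<!-- sketch: map the extracted document body to a non-special field name -->
<requestHandler name="/update/extract" startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">body</str>  <!-- "body" is illustrative -->
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>   <!-- unmapped Tika fields -> ignored_* -->
  </lst>
</requestHandler>
```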
Re: Dumb question: Streaming collector /query results
I did some small research with a fairly modest result: https://github.com/m-khl/solr-patches/tree/streaming You can start exploring it from the trivial test: https://github.com/m-khl/solr-patches/blob/17cd45ce7693284de08d39ebc8812aa6a20b8fb3/solr/core/src/test/org/apache/solr/response/ResponseStreamingTest.java Pls let me know whether it's useful for you. On Wed, May 2, 2012 at 6:48 PM, vybe3142 vybe3...@gmail.com wrote: In other words, as an alternative, what's the most efficient way to gain access to all of the document ids that match a query -- Sincerely yours Mikhail Khludnev Tech Lead Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Removing old documents
I use jetty that comes with solr. I use solr's dedupe: <updateRequestProcessorChain name="dedupe"> <processor class="solr.processor.SignatureUpdateProcessorFactory"> <bool name="enabled">true</bool> <str name="signatureField">id</str> <bool name="overwriteDupes">true</bool> <str name="fields">url</str> <str name="signatureClass">solr.processor.Lookup3Signature</str> </processor> <processor class="solr.LogUpdateProcessorFactory" /> <processor class="solr.RunUpdateProcessorFactory" /> </updateRequestProcessorChain> and because of this, id is not the url itself but its encoded signature. I see solrclean uses url to delete a document. Is it possible that the issue is because of this mismatch? Thanks. Alex. -Original Message- From: Paul Libbrecht p...@hoplahup.net To: solr-user solr-user@lucene.apache.org Sent: Tue, May 1, 2012 11:43 pm Subject: Re: Removing old documents With which client? paul On 2 May 2012, at 01:29, alx...@aim.com wrote: all caching is disabled and I restarted jetty. The same results.
Re: ExtractRH: How to strip metadata
How interesting! You know, I did at one point consider that perhaps the field name meta might be treated specially, but I talked myself out of it. I reasoned that a field name in my local schema should have no bearing on how a plugin such as solr-cell/Tika behaves. I should have tested my hypothesis; even if this phenomenon turns out to be undocumented behavior, I consider myself a victim of my own assumptions. I am running version 3.5. You may have gotten the multivalue errors due to the way your test schema and/or extracting request handler is laid out (my bad). I am using the ignored fieldtype and a dynamicField called ignored_* as a catch-all for extraneous fields delivered by Tika. Thanks for your help! Please keep me posted on any further insights/revelations, and I'll do the same. On Wed, May 2, 2012 at 12:54 PM, Jack Krupansky j...@basetechnology.com wrote: I did some testing, and evidently the meta field is treated specially from the ERH. -- - Joe
Dynamic core creation works in 3.5.0 fails in 3.6.0: At least one core definition required at run-time for Solr 3.6.0?
Hi: I have been working on an integration project involving Solr 3.5.0 that dynamically registers cores as needed at run-time, but does not contain any cores by default. The current solr.xml configuration file is: <?xml version="1.0" encoding="UTF-8" ?> <solr persistent="false" sharedLib="lib"> <cores adminPath="/admin/cores"/> </solr> This configuration does not include any cores, as those are created dynamically by each application that is using the Solr server. This is working fine with Solr 3.5.0; the server starts and running web applications can register a new core using SolrJ CoreAdminRequest and everything is working correctly. However, I tried to update to Solr 3.6.0 and this configuration fails with a SolrException due to the following code in CoreContainer.java (lines 171-173): if (cores.cores.isEmpty()) { throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "No cores were created, please check the logs for errors"); } This is a change from Solr 3.5.0, which has no such check. I have searched but cannot find any ticket or notice that this is a planned change in 3.6.0, but before I file a ticket I am asking the community in case this is an issue that has been discussed and this is a planned direction for Solr. Thanks, Matthew
Re: question about dates
: String dateString = "20101230";
: SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");
: Date date = sdf.parse(dateString);
: doc.addField("date", date);
:
: In the index, the date 20101230 is saved as 2010-12-29T23:00:00Z (because
: of GMT).

"because of GMT" is misleading and vague ... what you get in your index is a value of 2010-12-29T23:00:00Z because that is the canonical string representation of the Date object you have passed to doc.addField -- the Date object you have passed in represents that time, because you constructed a SimpleDateFormat object without specifying which TimeZone that SDF object should assume is in use when it parses its string input. So when you give it the input "20101230" it treats that as Dec 30, 2010, 00:00:00.000 in whatever the local timezone of your client is. If you want it to treat that input string as a date expression in GMT, then you need to configure the parser to use GMT (SimpleDateFormat.setTimeZone).

: I tried the following code:
:
: String dateString = "20101230";
: SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");
: Date date = sdf.parse(dateString);
: SimpleDateFormat gmtSdf = new
: SimpleDateFormat("yyyy-MM-dd'T'HH\\:mm\\:ss'Z'");
: String gmtString = gmtSdf.format(date);
:
: The problem is that gmtString equals "2010-12-30T00\:00\:00Z".

again, that is not a "gmtString" ... in this case, both of the SDF objects you are using have not been configured with an explicit TimeZone, so they use whatever the platform default is where this code is run -- so the variable you are calling gmtString is actually a string representation of a Date object formatted in your local TimeZone. Bottom line...
* when parsing a string into a Date, you really need to know (and be explicit to the parser) about what timezone is represented in that string (unless the format of the string includes the TimeZone) * when building a query string to pass to Solr, the DateFormat you use to format a Date object must format it using GMT -- there is a DateUtil class included in SolrJ to make this easier. If you really don't care at all about TimeZones, then just use GMT everywhere ... but if you actually care about what time of day something happened, and want to be able to query for events with hour/min/sec granularity, then you need to be precise about the TimeZone in every Formatter you use. -Hoss
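To make the advice above concrete, here is a minimal sketch (plain JDK, no Solr classes; the class and method names are illustrative, not from the thread) that pins both the parser and the formatter to GMT, so the result no longer depends on the client machine's timezone:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class GmtDates {
    // Parse "yyyyMMdd" input as a GMT date and render it in Solr's
    // canonical "yyyy-MM-dd'T'HH:mm:ss'Z'" form, also in GMT.
    public static String toSolrDate(String yyyymmdd) throws ParseException {
        SimpleDateFormat in = new SimpleDateFormat("yyyyMMdd");
        in.setTimeZone(TimeZone.getTimeZone("GMT"));   // be explicit about the input TZ
        Date d = in.parse(yyyymmdd);

        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        out.setTimeZone(TimeZone.getTimeZone("GMT")); // and about the output TZ
        return out.format(d);
    }

    public static void main(String[] args) throws ParseException {
        // Prints 2010-12-30T00:00:00Z regardless of the local timezone.
        System.out.println(toSolrDate("20101230"));
    }
}
```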
Re: Error with distributed search and Suggester component (Solr 3.4)
On Wed, May 2, 2012 at 12:16 PM, Ken Krugler kkrugler_li...@transpac.com wrote: What confuses me is that Suggester says it's based on SpellChecker, which supposedly does work with shards. It is based on spellchecker APIs, but spellchecker's ranking is based on simple comparators like string similarity, whereas suggesters use weights. When spellchecker merges from shards, it just merges all their top-N into one set and recomputes this same distance stuff over again. So the suggester can't possibly work correctly like this (forget about any technical details): how can it make assumptions about the weights you provided? If they were e.g. log() weights from your query logs, then it needs to do log-summation across the shards for the final combined weight to be correct. This is specific to how you originally computed the weights you gave it; it certainly cannot be recomputing anything like spellchecker does :) Anyway, if you really want to do it, maybe https://issues.apache.org/jira/browse/SOLR-2848 is helpful. The background is that in 3.x there is really only one spellchecker impl (AbstractLucene or something like that). I don't think distributed spellcheck works with any other SpellChecker subclasses in 3.x; I think it's wired to only work with the Abstract-Lucene ones. When we added another subclass to 4.0, DirectSpellChecker, James saw that it was broken here and cleaned up the APIs so that spellcheckers can override this merge() operation. Unfortunately I forgot to commit those refactorings James did (which let any spellchecker override merge()ing) to the 3.x branch, but the ideas might be useful. -- lucidimagination.com
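As a sketch of the weight-merging problem described above: if each shard stores log(frequency) as its suggestion weight, the merger cannot just take the max or re-rank by string distance; the weight the suggestion would have had in a single combined index is the log of the summed frequencies. This helper is hypothetical, not Solr code:

```java
public class SuggestWeights {
    // Merge per-shard log(frequency) weights into the weight the
    // suggestion would have had in one combined index:
    //   log(f1 + f2 + ...) = log(exp(w1) + exp(w2) + ...)
    public static double mergeLogWeights(double... logWeights) {
        double sum = 0.0;
        for (double w : logWeights) {
            sum += Math.exp(w);
        }
        return Math.log(sum);
    }

    public static void main(String[] args) {
        double w1 = Math.log(10);  // term seen 10 times on shard 1
        double w2 = Math.log(30);  // term seen 30 times on shard 2
        // A combined index would have seen it 40 times, so the merged
        // weight should be Math.log(40).
        System.out.println(mergeLogWeights(w1, w2));
    }
}
```

The point is that the correct merge depends entirely on how the weights were computed, which is why a generic spellchecker-style merge cannot do it.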
need some help with a multicore config of solr3.6.0+tomcat7. mine reports: Severe errors in solr configuration.
i've installed tomcat7 and solr 3.6.0 on linux/64. i'm trying to get a single webapp + multicore setup working. my efforts have gone off the rails :-/ i suspect i've followed too many of the wrong examples. i'd appreciate some help/direction getting this working. so far, i've configured:

grep /etc/tomcat7/server.xml -A2 -B2
Java AJP Connector: /docs/config/ajp.html APR (HTTP/AJP) Connector: /docs/apr.html Define a non-SSL HTTP/1.1 Connector on port -- Connector port= protocol=HTTP/1.1 connectionTimeout=2 redirectPort=8443 / -- !-- Connector executor=tomcatThreadPool port= protocol=HTTP/1.1 connectionTimeout=2 redirectPort=8443 /

cat /etc/tomcat7/Catalina/localhost/solr.xml
<Context docBase="/srv/tomcat7/webapps/solr.war" debug="0" privileged="true" allowLinking="true" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/srv/www/solrbase" override="true" />
</Context>

after tomcat restart, ps ax | grep tomcat
6129 pts/4 Sl 0:06 /etc/alternatives/jre/bin/java -classpath :/usr/share/tomcat7/bin/bootstrap.jar:/usr/share/tomcat7/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar -Dcatalina.base=/usr/share/tomcat7 -Dcatalina.home=/usr/share/tomcat7 -Djava.endorsed.dirs= -Djava.io.tmpdir=/var/cache/tomcat7/temp -Djava.util.logging.config.file=/usr/share/tomcat7/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager org.apache.catalina.startup.Bootstrap start

if i nav to http://127.0.0.1: i see as expected: Server Information Tomcat Version JVM Version JVM Vendor OS Name OS Version OS Architecture Apache Tomcat/7.0.26 1.7.0_147-icedtea-b147 Oracle Corporation Linux 3.1.10-1.9-desktop amd64

now, i'm trying to set up multicore properly. i configured,

cat /srv/www/solrbase/solr.xml
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="false">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>

then

mkdir -p /srv/www/solrbase/{core0,core1}
cp -a /srv/www/solrbase/conf /srv/www/solrbase/core0/
cp -a /srv/www/solrbase/conf /srv/www/solrbase/core1/

if i nav to http://localhost:/solr/core0 i get,

HTTP Status 500 - Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. If you want solr to continue after configuration errors, change: <abortOnConfigurationError>false</abortOnConfigurationError> in solr.xml - org.apache.solr.common.SolrException: No cores were created, please check the logs for errors
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:172)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:96)
at org.apache.catalina.core.ApplicationFilterConfig.initFilter(ApplicationFilterConfig.java:277)
at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:258)
at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
at org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:103)
at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4638)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5294)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:895)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:871)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:615)
at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:649)
at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1581)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at
synonyms
Hello everybody, I have a question about synonyms in Solr. In our company we are looking for a solution that resolves synonyms from a database rather than from a text file, as SynonymFilterFactory does. The idea is to save all the synonyms in the database, index them, and have them ready for querying, but we haven't found a database-backed solution. Another idea is to create a plugin that extends SynonymFilterFactory, but I don't know if this is possible. I hope someone can help me. Regards, Carlos Andrés García García
Re: synonyms
I'm not sure I completely follow, but are you simply saying that you want to have a synonym filter that reads the synonym table from a database rather than the current text file? If so, sure, you could develop a replacement for the current synonym filter which loads its table from a database, but you would have to develop that code yourself (or get some assistance doing it). If that is not what you are trying to do, please explain in a little more detail. -- Jack Krupansky -Original Message- From: Carlos Andres Garcia Sent: Wednesday, May 02, 2012 4:31 PM To: solr-user@lucene.apache.org Subject: synonyms
RE: synonyms
Another solution is to write a script to read the database and create the synonyms.txt file, dump the file to Solr, and reload the core. This gives you the custom synonym solution. -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Wednesday, May 02, 2012 4:54 PM To: solr-user@lucene.apache.org Subject: Re: synonyms
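A minimal sketch of that script approach in Java (the class name and the mapping format choice are illustrative; here the database rows are faked with an in-memory map rather than a JDBC query, and in a real script you would write the result into the core's conf/synonyms.txt before reloading):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SynonymsFileBuilder {
    // Render rows of (canonical term -> comma-separated synonyms) into
    // the explicit-mapping synonyms.txt format SynonymFilterFactory
    // understands:  synonym1,synonym2 => canonical
    public static String render(Map<String, String> rows) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : rows.entrySet()) {
            out.append(e.getValue()).append(" => ").append(e.getKey()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // In a real script these rows would come from a JDBC ResultSet.
        Map<String, String> rows = new LinkedHashMap<String, String>();
        rows.put("barcelona", "camp nou,cataluña");
        System.out.print(render(rows));
        // After writing the file into conf/, reload the core (e.g. via
        // the CoreAdmin RELOAD action) so the new synonyms take effect.
    }
}
```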
RE: need some help with a multicore config of solr3.6.0+tomcat7. mine reports: Severe errors in solr configuration.
I don't know if this will help, but I usually add a dataDir element to each core's solrconfig.xml to point at a local data folder for the core, like this:

<!-- Used to specify an alternate directory to hold all index data
     other than the default ./data under the Solr home. If replication
     is in use, this should match the replication configuration. -->
<dataDir>${solr.data.dir:./solr/core0/data}</dataDir>

-Original Message- From: loc...@mm.st [mailto:loc...@mm.st] Sent: Wednesday, May 02, 2012 1:06 PM To: solr-user@lucene.apache.org Subject: need some help with a multicore config of solr3.6.0+tomcat7. mine reports: Severe errors in solr configuration.
Re: need some help with a multicore config of solr3.6.0+tomcat7. mine reports: Severe errors in solr configuration.
I chronicled exactly what I had to configure to slay this dragon at http://vinaybalamuru.wordpress.com/2012/04/12/solr4-tomcat-multicor/ Hope that helps -- View this message in context: http://lucene.472066.n3.nabble.com/need-some-help-with-a-multicore-config-of-solr3-6-0-tomcat7-mine-reports-Severe-errors-in-solr-confi-tp3957196p3957389.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Phrase Slop problem
You are missing the pf, pf2, and pf3 request parameters, which say which fields to do phrase proximity boosting on. pf boosts using the whole query as a phrase, pf2 boosts bigrams, and pf3 boosts trigrams. You can use any combination of them, but if you use none of them, ps appears to be ignored. Maybe it should default to doing some boost if none of the field lists is given, like boosting using bigrams in the qf fields, but it doesn't. -- Jack Krupansky -Original Message- From: André Maldonado Sent: Wednesday, May 02, 2012 3:29 PM To: solr-user@lucene.apache.org Subject: Phrase Slop problem Hi all. In my index I have a multivalued field that contains a lot of information; all text searches are based on it. So, when I do:

http://xxx.xx.xxx.xxx/Index/select/?start=0&rows=12&q=term1+term2+term3&qf=textoboost&fq=field1%3Aanother_term&defType=edismax&mm=100%25

I get the same result as in:

http://xxx.xx.xxx.xxx/Index/select/?start=0&rows=12&q=term1+term2+term3&ps=0&qf=textoboost&fq=field1%3Aanother_term&defType=edismax&mm=100%25

And the same result in:

http://xxx.xx.xxx.xxx/Index/select/?start=0&rows=12&q=term1+term2+term3&ps=10&qf=textoboost&fq=field1%3Aanother_term&defType=edismax&mm=100%25

What am I doing wrong? Thanks. -- E conhecereis a verdade, e a verdade vos libertará.
(João 8:32) andre.maldonado@gmail.com
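Following the reply above, a hedged sketch of the parameter set that actually turns phrase-slop on (the field name textoboost is the poster's; the helper class and the chosen values are illustrative): ps only has an effect once at least one of pf/pf2/pf3 names the fields to boost on.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class EdismaxParams {
    // Build an edismax query string with phrase-proximity boosting
    // enabled via pf/pf2, so the ps slop value is no longer ignored.
    public static String build() throws UnsupportedEncodingException {
        Map<String, String> p = new LinkedHashMap<String, String>();
        p.put("q", "term1 term2 term3");
        p.put("defType", "edismax");
        p.put("qf", "textoboost");
        p.put("pf", "textoboost");   // whole-query phrase boost
        p.put("pf2", "textoboost");  // bigram phrase boost
        p.put("ps", "10");           // slop, now applied to pf/pf2
        p.put("mm", "100%");
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : p.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(e.getKey()).append('=')
              .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build());
    }
}
```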
RE: synonyms
Thanks for your answers. Now I have another question: if I develop the filter to replace the current synonym filter, I understand that this processing would happen at index time, because query-time synonym expansion has a number of well-known problems. If so, how should the index be created? For example, suppose I have two synonyms for "barcelona" in the database: "Camp Nou" and "Cataluña". Option 1) At index time, create two records like this: <doc> <field>barcelona</field> <field>Camp Nou</field> ... </doc> and <doc> <field>barcelona</field> <field>Cataluña</field> ... </doc> Option 2) Or create only one record, like this: <doc> <field>barcelona</field> <field>Camp Nou,Cataluña</field> ... </doc> With option 1, I can search by "Camp Nou" and by "Cataluña", but when I search for "barcelona" Solr returns 2 records, which is an error because there is only one "barcelona". With option 2, I have to search with wildcards, for example *Camp Nou* or *Cataluña*, and Solr would return one record; likewise, searching by "barcelona" would return one record, which is good. But I want to know if this is the better option, or whether Solr has some better feature that can resolve this in a cleaner way.
Re: Solr Merge during off peak times
Hello Prabhu, Look at SPM for Solr (URL in sig below). It includes Index Statistics graphs, and from these graphs you can tell: * how many docs are in your index * how many docs are deleted * size of index on disk * number of index segments * number of index files * maybe something else I'm forgetting now So from size, # of segments, and index files you will be able to tell when merges happened and before/after size, segment and index file count. Otis Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm From: Prakashganesh, Prabhu prabhu.prakashgan...@dowjones.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org; Otis Gospodnetic otis_gospodne...@yahoo.com Sent: Wednesday, May 2, 2012 7:22 AM Subject: RE: Solr Merge during off peak times Ok, thanks Otis Another question on merging What is the best way to monitor merging? Is there something in the log file that I can look for? It seems like I have to monitor the system resources - read/write IOPS etc.. and work out when a merge happened It would be great if I can do it by looking at log files or in the admin UI. Do you know if this can be done or if there is some tool for this? Thanks Prabhu -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: 01 May 2012 15:12 To: solr-user@lucene.apache.org Subject: Re: Solr Merge during off peak times Hi Prabhu, I don't think such a merge policy exists, but it would be nice to have this option and I imagine it wouldn't be hard to write if you really just base the merge or no merge decision on the time of day (and maybe day of the week). 
Note that this should go into Lucene, not Solr, so if you decide to contribute your work, please see http://wiki.apache.org/lucene-java/HowToContribute Otis Performance Monitoring for Solr - http://sematext.com/spm From: Prakashganesh, Prabhu prabhu.prakashgan...@dowjones.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Tuesday, May 1, 2012 8:45 AM Subject: Solr Merge during off peak times Hi, I would like to know if there is a way to configure index merge policy in solr so that the merging happens during off peak hours. Can you please let me know if such a merge policy configuration exists? Thanks Prabhu
solr broke a pipe
Anyone have any clues about this exception? It happened during the course of normal indexing. This is new to me (we're running solr 3.6 on tomcat 6/redhat RHEL) and we've been running smoothly for some time now until this showed up:

Red Hat Enterprise Linux Server release 5.3 (Tikanga)
Apache Tomcat Version 6.0.20
java.runtime.version = 1.6.0_25-b06
java.vm.name = Java HotSpot(TM) 64-Bit Server VM

May 2, 2012 4:07:48 PM org.apache.solr.handler.ReplicationHandler$FileStream write
WARNING: Exception while writing response for params: indexversion=1276893500358&file=_1uca.frq&command=filecontent&checksum=true&wt=filestream
ClientAbortException: java.net.SocketException: Broken pipe
at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:358)
at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:354)
at org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:381)
at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:370)
at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:89)
at org.apache.solr.common.util.FastOutputStream.write(FastOutputStream.java:87)
at org.apache.solr.handler.ReplicationHandler$FileStream.write(ReplicationHandler.java:1076)
at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:936)
at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:345)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(Unknown Source)
at java.net.SocketOutputStream.write(Unknown Source)
at org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalOutputBuffer.java:740)
at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:434)
at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:349)
at org.apache.coyote.http11.InternalOutputBuffer$OutputStreamOutputBuffer.doWrite(InternalOutputBuffer.java:764)
at org.apache.coyote.http11.filters.ChunkedOutputFilter.doWrite(ChunkedOutputFilter.java:126)
at org.apache.coyote.http11.InternalOutputBuffer.doWrite(InternalOutputBuffer.java:573)
at org.apache.coyote.Response.doWrite(Response.java:560)
at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:353)
... 21 more
Re: syntax for negative query OR something
: How do I search for things that have no value or a specified value?

Things with no value...
(*:* -fieldName:[* TO *])

Things with a specific value...
fieldName:A

Things with no value or a specific value...
(*:* -fieldName:[* TO *]) fieldName:A

...or if you aren't using OR as your default op
(*:* -fieldName:[* TO *]) OR fieldName:A

: I have a few variations of:
: -fname:[* TO *] OR fname:(A B C)

that is just syntactic sugar for...

-fname:[* TO *] fname:(A B C)

which is an empty set. You need to be explicit that the "exclude docs with a value in this field" clause should be applied to the set of all documents. -Hoss
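A trivial sketch of the pattern from the reply above, for building that clause programmatically (class and method names are made up for illustration; no SolrJ dependency):

```java
public class OptionalFieldQuery {
    // Build a query matching docs where the field is either absent or
    // has one of the given values, per the pattern in the reply:
    //   (*:* -field:[* TO *]) OR field:(A B C)
    // The *:* prefix anchors the negative clause to the set of all
    // documents, which is what makes the OR work.
    public static String noValueOrIn(String field, String values) {
        return "(*:* -" + field + ":[* TO *]) OR " + field + ":(" + values + ")";
    }

    public static void main(String[] args) {
        System.out.println(noValueOrIn("fname", "A B C"));
    }
}
```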
Re: syntax for negative query OR something
Sounds good. OR in the negation of any query that matches any possible value in the field. The Solr query parser doc lists the open range as you used: -field:[* TO *] finds all documents without a value for field. See: http://wiki.apache.org/solr/SolrQuerySyntax This also includes the pure wildcard that can generate a PrefixQuery: -fname:* OR fname:(A B C) -- Jack Krupansky -Original Message- From: Ryan McKinley Sent: Wednesday, May 02, 2012 7:18 PM To: solr-user@lucene.apache.org Subject: syntax for negative query OR something How do I search for things that have no value or a specified value? Essentially I have a field that *may* exist, and I want the absence of the field to also match. I have a few variations of: -fname:[* TO *] OR fname:(A B C) Thanks for any pointers ryan
Re: syntax for negative query OR something
Oops... that is: (-fname:*) OR fname:(A B C) or (-fname:[* TO *]) OR fname:(A B C) -- Jack Krupansky -Original Message- From: Jack Krupansky Sent: Wednesday, May 02, 2012 7:48 PM To: solr-user@lucene.apache.org Subject: Re: syntax for negative query OR something
Re: syntax for negative query OR something
Hmmm... I thought that worked in edismax. And I thought that pure negative queries were allowed in SolrQueryParser. Oh well. In any case, in the Lucene or Solr query parser, add *:* to select all docs before negating the docs that have any value in the field: (*:* -fname:*) OR fname:(A B C) or (*:* -fname:[* TO *]) OR fname:(A B C) -- Jack Krupansky -Original Message- From: Jack Krupansky Sent: Wednesday, May 02, 2012 7:52 PM To: solr-user@lucene.apache.org Subject: Re: syntax for negative query OR something
Re: synonyms
There are lots of different strategies for dealing with synonyms, depending on what exactly is most important and what exactly you are willing to tolerate. In your latest example, you seem to be using string fields, which is somewhat different from the "text" synonyms we talk about in Solr. You can certainly have multiple string fields, or even a multi-valued string field, to store variations on selected categories of terms. That works well when you have a well-defined number of categories. So, you can have a user query go against a combination of normal text fields and these category string fields. If that is sufficient for your application, great. -- Jack Krupansky -Original Message- From: Carlos Andres Garcia Sent: Wednesday, May 02, 2012 6:57 PM To: solr-user@lucene.apache.org Subject: RE: synonyms
Re: Solr 3.5 - Elevate.xml causing issues when placed under /data directory
(12/05/03 1:39), Noordeen, Roxy wrote: Hello, I just started using elevation for Solr. I am on Solr 3.5, running with Drupal 7, Linux. 1. I updated my solrconfig.xml from <dataDir>${solr.data.dir:./solr/data}</dataDir> to <dataDir>/usr/local/tomcat2/data/solr/dev_d7/data</dataDir> 2. I placed my elevate.xml in my Solr data directory. Based on forum answers, I thought placing elevate.xml under the data directory would pick up my latest change. I restarted tomcat. 3. When I placed my elevate.xml under the conf directory, elevation was working with the url: http://mysolr.www.com:8181/solr/elevate?q=games&wt=xml&sort=score+desc&fl=id,bundle_name But when I moved it to the data directory, I am not seeing any results. NOTE: I can see catalina.out printing that Solr read the file from the data directory. I tried to give invalid entries; I noticed Solr errors parsing elevate.xml from the data directory. I even tried to send some documents to index, thinking a commit might help to read the elevate config file. But nothing helped. I don't understand why the url above does not work anymore. There are no errors in the log files. Any help on this topic is appreciated. Hi Noordeen, What do you mean by "I am not seeing any results"? Is it no docs in the response (numFound=0)? And have you tried the original ${solr.data.dir:./solr/data} for the dataDir? Isn't that working for you either? koji -- Query Log Visualizer for Apache Solr http://soleami.com/
Re: synonyms
I think a regular sync of the database table with the synonyms text file is the simplest of the solutions. It will allow you to use Solr natively without any customization, and it is not a very complicated operation to update the synonyms file with the entries in the database.
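The sync described above can be sketched in a few lines. This is a hypothetical example, not the poster's actual code: the table name `synonyms` and columns `group_id`/`term` are assumptions, and a real deployment would write the result to synonyms.txt and reload the Solr core.

```python
# Hypothetical sketch: export a database table of synonym groups into
# Solr's synonyms.txt format (one comma-separated group per line).
import sqlite3

def export_synonyms(conn):
    """Return the synonyms file content built from the synonyms table."""
    groups = {}
    for group_id, term in conn.execute(
            "SELECT group_id, term FROM synonyms ORDER BY group_id"):
        groups.setdefault(group_id, []).append(term)
    return "\n".join(",".join(terms) for terms in groups.values())

# Demo with an in-memory table standing in for the real database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE synonyms (group_id INTEGER, term TEXT)")
conn.executemany("INSERT INTO synonyms VALUES (?, ?)",
                 [(1, "tv"), (1, "television"), (2, "couch"), (2, "sofa")])
print(export_synonyms(conn))
```

Run on a schedule (cron or similar), followed by a core reload so the SynonymFilterFactory picks up the new file.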
Re: syntax for negative query OR something
thanks! On Wed, May 2, 2012 at 4:43 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : How do I search for things that have no value or a specified value? Things with no value... (*:* -fieldName:[* TO *]) Things with a specific value... fieldName:A Things with no value or a specific value... (*:* -fieldName:[* TO *]) fieldName:A ...or if you aren't using OR as your default op (*:* -fieldName:[* TO *]) OR fieldName:A : I have a few variations of: : -fname:[* TO *] OR fname:(A B C) that is just syntactic sugar for... -fname:[* TO *] fname:(A B C) which is an empty set. You need to be explicit that the "exclude docs with a value in this field" clause should be applied to the set of all documents -Hoss
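The set logic behind Hoss's answer can be modeled with plain Python sets (a toy model, not Solr itself): the negative clause only subtracts from something, so without an explicit *:* the query intersects "matches A" with "has no value", which is always empty.

```python
# Toy model of the query semantics above: four docs, field value or None.
docs = {"d1": "A", "d2": "B", "d3": None, "d4": None}
all_docs = set(docs)                                        # *:*
has_value = {d for d, v in docs.items() if v is not None}   # fname:[* TO *]
matches = {d for d, v in docs.items() if v == "A"}          # fname:A

# (*:* -fname:[* TO *]) OR fname:A -> docs with no value, or the value A
correct = (all_docs - has_value) | matches

# -fname:[* TO *] fname:A -> "matches A" minus "has a value": always empty,
# because every doc matching A necessarily has a value
broken = matches - has_value

print(sorted(correct), sorted(broken))
```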
Re: Lucene FieldCache - Out of memory exception
Jack, Yes, the queries work fine till I hit the OOM. The fields that start with S_* are strings, F_* are floats, I_* are ints and so on. The dynamic field definitions from schema.xml: <dynamicField name="S_*" type="string" indexed="true" stored="true" omitNorms="true"/> <dynamicField name="I_*" type="sint" indexed="true" stored="true" omitNorms="true"/> <dynamicField name="F_*" type="sfloat" indexed="true" stored="true" omitNorms="true"/> <dynamicField name="D_*" type="date" indexed="true" stored="true" omitNorms="true"/> <dynamicField name="B_*" type="boolean" indexed="true" stored="true" omitNorms="true"/> *Each FieldCache will be an array with maxdoc entries (your total number of documents - 1.4 million) times the size of the field value or whatever a string reference is in your JVM* So if I understand correctly - every field (dynamic or normal) will have its own field cache. The size of the field cache for any field will be (maxDocs * sizeOfField)? If the field has only 100 unique values, will it occupy (100 * sizeOfField) or will it still be (maxDocs * sizeOfField)? *Roughly what is the typical or average length of one of your facet field values? And, on average, how many unique terms are there within a typical faceted field?* Each field's length may vary from 10-30 characters, average of 20 maybe. The number of unique terms within a faceted field will vary from 100-1000, average of 300. How will the number of unique terms affect performance? *3 GB sounds like it might not be enough for such heavy use of faceting. It is probably not the 50-70 number, but the 440 or accumulated number across many queries that pushes the memory usage up* I am using jdk1.5.0_14 - 32 bit. With a 32-bit JDK, I think there is a limitation that more RAM cannot be allocated. *When you hit OOM, what does the Solr admin stats display say for FieldCache?* I don't have Solr deployed as a separate web app. All Solr jar files are present in my webapp's WEB-INF\lib directory. I use EmbeddedSolrServer.
So is there a way I can get this information that the admin would show? Thank you for your time. -Rahul On Wed, May 2, 2012 at 5:19 PM, Jack Krupansky j...@basetechnology.com wrote: The FieldCache gets populated the first time a given field is referenced as a facet and then will stay around forever. So, as additional queries get executed with different facet fields, the number of FieldCache entries will grow. If I understand what you have said, these faceted queries do work initially, but after a while they stop working with OOM, correct? The size of a single FieldCache depends on the field type. Since you are using dynamic fields, it depends on your dynamicField types - which you have not told us about. From your query I see that your fields start with S_ and F_ - presumably you have dynamic field types S_* and F_*? Are they strings, integers, floats, or what? Each FieldCache will be an array with maxdoc entries (your total number of documents - 1.4 million) times the size of the field value or whatever a string reference is in your JVM. String fields will take more space than numeric fields for the FieldCache, since a separate table is maintained for the unique terms in that field. Roughly what is the typical or average length of one of your facet field values? And, on average, how many unique terms are there within a typical faceted field? If you can convert many of these faceted fields to simple integers the size should go down dramatically, but that depends on your application. 3 GB sounds like it might not be enough for such heavy use of faceting. It is probably not the 50-70 number, but the 440 or accumulated number across many queries that pushes the memory usage up. When you hit OOM, what does the Solr admin stats display say for FieldCache?
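Jack's sizing model can be turned into a back-of-envelope estimate using the numbers quoted in this thread. This is rough illustrative arithmetic under stated assumptions (4-byte per-doc references on a 32-bit JVM, 2 bytes per char for Java strings), not a measurement:

```python
# Back-of-envelope FieldCache estimate for the numbers in this thread.
max_doc = 1_400_000      # documents in the index
faceted_fields = 440     # total fields enabled for faceting
ref_bytes = 4            # per-doc reference/ord on a 32-bit JVM (assumed)
avg_term_len = 20        # chars per value, from Rahul's estimate
unique_terms = 300       # unique terms per field, from Rahul's estimate

# Each string FieldCache entry holds one reference per document (maxDoc
# entries regardless of unique-value count), plus the unique-term table.
per_field = max_doc * ref_bytes + unique_terms * avg_term_len * 2
total_gb = faceted_fields * per_field / 1024**3
print(f"~{total_gb:.1f} GB")
```

Under these assumptions the per-doc arrays alone approach 2.3 GB once all 440 fields have been faceted on, which would explain exhausting a 3 GB heap after some minutes of mixed queries.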
-- Jack Krupansky -----Original Message----- From: Rahul R Sent: Wednesday, May 02, 2012 2:22 AM To: solr-user@lucene.apache.org Subject: Re: Lucene FieldCache - Out of memory exception