Distributed field collapsing
Hi, is there any patch available for distributed field collapsing? I need it in my app. If you have any ideas, please add them. Regards, V. Sriram
RE: How to search for special chars like ä from ae?
Hi Steve, thanks for the reply. I did not understand which file I need to rename. I'm working on Solr 1.4. The file in the examples/solr/conf directory is mapping-ISOLatin1Accent.txt. The schema.xml has the following commented-out entry: <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>. Do I need to replace mapping-ISOLatin1Accent.txt with mapping-FoldToASCII.txt (http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/conf/mapping-FoldToASCII.txt) and change the charFilter mapping to <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>?
does copyField recurse?
Hello list, if I have a field title which is copied to text, and a field text that is copied to text.stemmed, am I going to get a copy from the field title into the field text.stemmed, or should I add that copy explicitly? Thanks in advance, Paul
Re: does copyField recurse?
Field values are copied before being analyzed. There is no cascading of analyzers.
Re: does copyField recurse?
And no cascading of copying either (as I verified by experiment). I just enriched the wiki at http://wiki.apache.org/solr/SchemaXml#Copy_Fields accordingly, with proof. Paul. On 8 Feb 2011 at 11:16, Markus Jelsma wrote: Field values are copied before being analyzed. There is no cascading of analyzers. [...]
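For reference, a minimal schema.xml sketch of the rules in this thread (field names taken from Paul's question); copyField copies the original source value before analysis and does not chain, so an explicit third rule is needed to get title's content into text.stemmed:

    <copyField source="title" dest="text"/>
    <copyField source="text" dest="text.stemmed"/>
    <!-- copies do not cascade: the two rules above do not imply title -> text.stemmed -->
    <copyField source="title" dest="text.stemmed"/>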
Re: Solr n00b question: writing a custom QueryComponent
I'm still not quite clear what you are attempting to achieve, and more so why you need to extend Solr rather than just wrap it. You have data with title, description and content fields. You make no mention of an ID field. Surely, if you want to store some in MySQL and some in Solr, you could make your Solr client code enhance the data it gets back after querying Solr with data extracted from MySQL. What is the issue here? Upayavira

On Mon, 07 Feb 2011 23:17 -0800, Ishwar ishwarsridha...@yahoo.com wrote:

Hi all, I've been a Solr user for a while now, and now I need to add some functionality to Solr, for which I'm trying to write a custom QueryComponent. I couldn't get much help from web search, so I'm turning to solr-user for help. I'm implementing search functionality for (micro)blog aggregation. We use Solr 1.4.1. In the current Solr config, the title and content fields are both indexed and stored in Solr. Storing takes up a lot of space, even with compression. I'd like to store the title and description fields in MySQL and retrieve these fields for results from MySQL with an id lookup. Using the DataImportHandler won't work because we store just the title and content fields in MySQL; the rest of the fields are in Solr itself.

I wrote a custom component by extending QueryComponent and overriding only the finishStage(ResponseBuilder) function, where I try to retrieve the necessary records from MySQL. This is how the new QueryComponent is specified in solrconfig.xml:

    <searchComponent name="query" class="org.apache.solr.handler.component.TestSolr"/>

I see that the component is getting loaded from the Solr debug output:

    <lst name="prepare">
      <double name="time">1.0</double>
      <lst name="org.apache.solr.handler.component.TestSolr">
        <double name="time">0.0</double>
      </lst>
      ...

But the strange thing is that the finishStage() function is not being called before returning results. What am I missing? Secondly, members like ResponseBuilder._responseDocs are visible only in the package org.apache.solr.handler.component. How do I access the results in my package? If you folks can give me links to a wiki or some sample custom QueryComponent, that'll be great. -- Thanks in advance, Ishwar. Just another resurrected Neozoic Archosaur comics. http://www.flickr.com/photos/mojosaurus/sets/72157600257724083/

--- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
Re: q.alt=*:* for every request?
I'm not sure what you mean, but you may be looking for debugQuery=true? On Tuesday 08 February 2011 08:28:12 Paul Libbrecht wrote: To be able to see this well, it would be lovely to have a switch that would activate logging of the query-expansion result. The dismax QParserPlugin is particularly powerful there, so it'd be nice to see what's happening. Any logging category I need to activate? Paul. On 8 Feb 2011 at 03:22, Markus Jelsma wrote: There is no measurable performance penalty when setting the parameter, except maybe the execution of the query with a high value for rows. To make things easy, you can define q.alt=*:* as a default in your request handler. No need to specify it in the URL. Hi, I use the dismax handler with Solr 1.4. Sometimes my request comes with q and fq, and sometimes it doesn't come with q (only fq and q.alt=*:*). Is it OK to send q.alt=*:* for every request? Does it have side effects on performance? -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
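As a hedged illustration of Markus's suggestion (handler name assumed; adjust to your own solrconfig.xml), the default can be declared once on the request handler so clients never have to send it:

    <requestHandler name="/search" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">dismax</str>
        <str name="q.alt">*:*</str> <!-- used whenever the client sends no q -->
      </lst>
    </requestHandler>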
Re: Solr n00b question: writing a custom QueryComponent
Hi Upayavira, apologies for the lack of clarity in the mail. The feeds have the following fields: id, url, title, content, refererurl, createdDate, author, etc. We need search functionality on title and content. As mentioned earlier, storing title and content in Solr takes up a lot of space. So we index title and content in Solr, and we wish to store title and content in MySQL, which has the fields id, title, content. I'm also looking at a Solr client (SolrJ) to query MySQL based on what Solr returns. But that means another component which needs to be maintained. I was wondering if it's a good idea to implement the functionality in Solr itself. -- Thanks, Ishwar. Just another resurrected Neozoic Archosaur comics. http://www.flickr.com/photos/mojosaurus/sets/72157600257724083/ From: Upayavira u...@odoko.co.uk To: solr-user@lucene.apache.org Sent: Tuesday, February 8, 2011 4:36 PM Subject: Re: Solr n00b question: writing a custom QueryComponent [...]
Re: Http Connection is hanging while deleteByQuery
Hi, at last the migration to Solr 1.4.1 did solve this issue :-). Cheers
Re: Solr n00b question: writing a custom QueryComponent
The conventional way to do it would be to index your title and content fields in Solr, along with the ID to identify the document. You could do a search against Solr and just return an ID field; your client code would then match that up with the title/content data from your database. And yes, SolrJ would be the obvious route to take here for your client application.

Yes, it does mean another component that needs to be maintained, but by using Solr's external interface you will be protected from changes to internals that could break your custom components, and you will likely be more able to take advantage of other Solr features that are also available via the standard interfaces.

My next question is: are you going to be using the data you're storing in MySQL for something other than just enhancing search results? If not, it may still make sense to store the data in Solr. It would mean you just have one index to manage, rather than an index and a database - after all, the words *have* to take up disk space somewhere :-). If you end up with so many documents indexed that performance grinds (over 10 million?) you can split your index across multiple shards. Once you get search results back from Solr, you would do a query against your database to return the additional fields.

Upayavira

On Tue, 08 Feb 2011 03:38 -0800, Ishwar ishwarsridha...@yahoo.com wrote: [...]

--- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
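A rough sketch of that wrapping approach with SolrJ 1.4 (the JDBC URL, table and column names are hypothetical): ask Solr for matching ids only, then hydrate title/content from MySQL with a single IN(...) lookup.

    import java.sql.*;
    import java.util.*;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class SearchAndHydrate {
      public static void main(String[] args) throws Exception {
        // 1. Ask Solr for matching ids only (fl=id keeps the response small).
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("title:apple OR content:apple");
        q.setFields("id");
        q.setRows(10);
        QueryResponse rsp = solr.query(q);
        List<String> ids = new ArrayList<String>();
        for (SolrDocument doc : rsp.getResults()) {
          ids.add((String) doc.getFieldValue("id"));
        }
        if (ids.isEmpty()) return;

        // 2. Fetch title/content for those ids from MySQL in one query.
        StringBuilder in = new StringBuilder();
        for (int i = 0; i < ids.size(); i++) in.append(i == 0 ? "?" : ",?");
        Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost/feeds", "user", "pass"); // hypothetical
        PreparedStatement ps = conn.prepareStatement(
            "SELECT id, title, content FROM feed_docs WHERE id IN (" + in + ")");
        for (int i = 0; i < ids.size(); i++) ps.setString(i + 1, ids.get(i));
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
          System.out.println(rs.getString("id") + ": " + rs.getString("title"));
        }
        conn.close();
      }
    }

One caveat: rows come back from MySQL in arbitrary order, so re-order them by the id list if Solr's relevance ordering matters to the application.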
Re: Search for FirstName with first Char uppercase followed by * not giving result; getting result with all lowercase and *
What you are missing is that the analysis page shows what happens when the text is run through analysis. Wildcards ARE NOT ANALYZED, so you cannot assume that the analysis page shows you what the search terms will be in that case. Regardless of whether george* is shown in the analysis page, the term searched will be George*, capitalized, and not found. Pre-processing your wildcards to lowercase them all is the easiest solution, as Ahmet said. Best, Erick

On Tue, Feb 8, 2011 at 8:04 AM, Mark Fletcher mark.fletcher2...@gmail.com wrote: Hi Savvas, thank you for the reply. In the analysis screen (screenshots attached), *George* is finally stored as *george*. Also, the keyword which I use for search later, namely *George**, is finally analyzed as *george* and *george**. Both are depicted in the screenshots (index as well as query analyzers). If one of the index terms is finally *george* and one of the query terms is also finally *george*, why is it that a match is not found? I am not sending the mail to the group as I am not sure whether I am missing something basic which I am supposed to know here. I believe both the index and query analyzers have the same set of tokenizers and filters (please refer to the analysis attachment). Thanks for your time. BR, Mark.

On Sun, Jan 30, 2011 at 2:13 PM, Savvas-Andreas Moysidis savvas.andreas.moysi...@googlemail.com wrote: Hi Mark, regarding "When I indexed *George* it was also finally analyzed and stored as *george*. Then why is it that I don't get a match as per the analysis report I had attached?": your indexed term is george, but you search for George*, which does not go through the same analysis process as it did when it was indexed. So, since the terms you are searching for are not lowercased, you are trying to find something that starts with George (capital G), which doesn't exist in your index. If you are not hitting Solr directly, maybe you can lowercase your input text before feeding it to Solr?

On 30 January 2011 16:38, Mark Fletcher mark.fletcher2...@gmail.com wrote: Hi Ahmet, thanks for the reply. I had attached the analysis report of the query George*. It is found to be split into the terms *George** and *George* by the WordDelimiterFilterFactory, and the LowerCaseFilterFactory converts them to *george** and *george*. When I indexed *George* it was also finally analyzed and stored as *george*. Then why is it that I don't get a match as per the analysis report I had attached in my previous mail? Or am I missing something basic here? Many thanks, M

On Sun, Jan 30, 2011 at 4:34 AM, Ahmet Arslan iori...@yahoo.com wrote: : When I try george* I get results. Whereas George* fetches no results. Wildcard queries are not analyzed by QueryParser.
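A minimal sketch of that pre-processing, assuming your client code builds the query string (the name field is taken from the thread):

    // Hedged sketch: lowercase the raw input before it becomes a wildcard query,
    // so George* is searched as george*, matching the lowercased index terms.
    String userInput = "George*";
    String wildcardTerm = userInput.toLowerCase(java.util.Locale.ENGLISH); // "george*"
    String q = "name:" + wildcardTerm; // send this as the q parameter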
Re: Solr n00b question: writing a custom QueryComponent
Thanks for the detailed reply, Upayavira. To answer your question: our index is growing much faster than expected, and our performance is grinding to a halt. Currently it has over 150 million records. We're planning to split the index into multiple shards very soon and move index creation to Hadoop.

Our current situation is that we need to run optimize once every couple of days to keep the index in shape. Given the size (index + stored fields), it takes a long time to complete, during which we can't add new documents to the index. And because of the size of the stored fields, we need double the storage size of the current index to optimize. Since we're on EC2, this requires frequent increases in storage capacity. Even after sharding the index, the time taken to optimize it is going to be significant. That's the reason why we decided to store these fields in MySQL. If there's some easier solution that I've overlooked, please point it out.

On a related note, is there a way to 'automagically' split an existing index into multiple shards? -- Thanks, Ishwar. Just another resurrected Neozoic Archosaur comics. http://www.flickr.com/photos/mojosaurus/sets/72157600257724083/

From: Upayavira u...@odoko.co.uk To: solr-user@lucene.apache.org Sent: Tuesday, February 8, 2011 7:17 PM Subject: Re: Solr n00b question: writing a custom QueryComponent [...]
Re: Solr n00b question: writing a custom QueryComponent
Hi, I agree with Upayavira: it's probably better to create an external app that retrieves content from the db. Anyway, if I am not wrong, finishStage() is a method called by the coordinator only if you have a distributed search. If your Solr is on a single machine, every component should implement only the prepare() and process() methods. HTH, Edo

On Tue, Feb 8, 2011 at 7:17 AM, Ishwar ishwarsridha...@yahoo.com wrote: [...]
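Along those lines, a bare-bones sketch (package and class names hypothetical) that keeps the standard query behaviour and post-processes in process(), which does run on a single node; finishStage() only runs on the coordinating node of a distributed (sharded) search:

    package org.example.solr; // hypothetical

    import java.io.IOException;
    import org.apache.solr.handler.component.QueryComponent;
    import org.apache.solr.handler.component.ResponseBuilder;

    public class MySqlHydratingComponent extends QueryComponent {
      @Override
      public void process(ResponseBuilder rb) throws IOException {
        super.process(rb); // run the normal query first
        // On a non-distributed setup, enrich rb.rsp here (e.g. look up
        // title/content in MySQL for the returned ids) instead of relying
        // on finishStage(), which this node will never call.
      }
    }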
RequestHandler code within 1.4.0 dist
Hello list, I have been searching through the 1.4.0 source for a standard requestHandler plug-in example. I understand that for my purposes extending RequestHandlerBase is a starting point; however, I was wondering if there are any examples of plug-ins which I can view, such as those contained within /contrib. My experience with plug-ins so far relates to those contained within the /contrib folder in Solr, or the /plugins folder in Nutch, but the structure does not seem to be the same in Solr. Can anyone please help? Thank you, Lewis
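Pending a /contrib example, a minimal hedged sketch of such a plug-in (package, class and parameter names are invented); compile it against the Solr 1.4.0 jars, drop the jar in your core's lib directory, and register it in solrconfig.xml:

    package org.example.solr; // hypothetical

    import org.apache.solr.handler.RequestHandlerBase;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;

    public class HelloRequestHandler extends RequestHandlerBase {
      @Override
      public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
          throws Exception {
        // echo a request parameter back, just to prove the plug-in is wired up
        rsp.add("greeting", "hello " + req.getParams().get("name", "world"));
      }
      @Override public String getDescription() { return "example handler"; }
      @Override public String getSourceId()    { return "$Id$"; }
      @Override public String getSource()      { return "$URL$"; }
      @Override public String getVersion()     { return "1.0"; }
    }

registered with, for example: <requestHandler name="/hello" class="org.example.solr.HelloRequestHandler"/>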
difference between filter_queries and parsed_filter_queries
Hi everybody, please tell me the difference between these two things: after what processing on filter_queries are the parsed_filter_queries generated? Basically, when I search city as fq=city:'noida', then filter_queries and parsed_filter_queries are both 'noida', and in this case I do not get any results. But when I query like fq=city:noida, then filter_queries is noida, parsed_filter_queries is noida, it matches the city, and I get correct results. What processing goes on from filter_queries to parsed_filter_queries? My schema for city is:

    <fieldType name="facetstr_city" class="solr.TextField" sortMissingLast="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_city_facet.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Please advise.
Re: difference between filter_queries and parsed_filter_queries
Hi, parsed_filter_queries contains the value after it has passed through the analyzer. In this case it remains the same because it was already lowercased and no synonyms were used. You're also using single quotes; these have no special meaning, so you're searching for 'noida' in the first fq and noida in the second. Cheers,

On Tuesday 08 February 2011 15:52:23 Bagesh Sharma wrote: [...] -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
General question about Solr Caches
Hello, I am going through the wiki page related to cache configuration, http://wiki.apache.org/solr/SolrCaching, and I have a question regarding the general cache architecture and implementation. In my understanding, the current index searcher uses a cache instance, and when a new index searcher is registered, a new cache instance is used, which is also auto-warmed. However, what happens when the new index searcher is a view of an index which has been modified? If the entries contained in the old cache are copied to the new cache during auto-warming, wouldn't that new cache contain invalid entries? Thanks, - Savvas
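For reference, this is the kind of solrconfig.xml entry that wiki page describes (sizes here are illustrative only):

    <filterCache class="solr.FastLRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="128"/>

For query-based caches like the filter cache, autowarming re-executes the cached keys (the queries) against the new searcher rather than copying the old result values, so the warmed entries are computed from the new view of the index.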
Re: EdgeNgram Auto suggest - doubles ignore
Hi Erick, if you have time, can you please take a look and provide your comments or suggestions on this problem? Please let me know if you need any more information. Thanks, Johnny
Re: TermVector query using Solr Tutorial
Inline... On Feb 5, 2011, at 4:28 AM, Ryan Chan wrote: Hello all, I am following this tutorial: http://lucene.apache.org/solr/tutorial.html and playing with the TermVector component. Here are my steps:

1. Launch the example server: java -jar start.jar
2. Index monitor.xml: java -jar post.jar monitor.xml, which contains the following:

    <add><doc>
      <field name="id">3007WFP</field>
      <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
      <field name="manu">Dell, Inc.</field>
      <field name="cat">electronics</field>
      <field name="cat">monitor</field>
      <field name="features">30" TFT active matrix LCD, 2560 x 1600, .25mm dot pitch, 700:1 contrast</field>
      <field name="includes">USB cable</field>
      <field name="weight">401.6</field>
      <field name="price">2199</field>
      <field name="popularity">6</field>
      <field name="inStock">true</field>
    </doc></add>

3. Execute the query to search for 25 (as you can see, there are two occurrences of 25 in the features field):
http://localhost/solr/select/?q=25&version=2.2&start=0&rows=10&indent=on&qt=tvrh&tv.all=true

4. The term vector in the result does not make sense to me:

    <lst name="termVectors">
      <lst name="doc-2">
        <str name="uniqueKey">3007WFP</str>
        <lst name="includes">
          <lst name="cabl">
            <int name="tf">1</int>
            <lst name="offsets">
              <int name="start">4</int>
              <int name="end">9</int>
            </lst>
            <lst name="positions">
              <int name="position">1</int>
            </lst>
            <int name="df">1</int>
            <double name="tf-idf">1.0</double>
          </lst>
          <lst name="usb">
            <int name="tf">1</int>
            <lst name="offsets">
              <int name="start">0</int>
              <int name="end">3</int>
            </lst>
            <lst name="positions">
              <int name="position">0</int>
            </lst>
            <int name="df">1</int>
            <double name="tf-idf">1.0</double>
          </lst>
        </lst>
      </lst>
      <str name="uniqueKeyFieldName">id</str>
    </lst>

What I want to know is the relative position of the keywords within a field. Can anyone explain the above result to me?

It's a little hard to read due to the indentation, but AFAICT you have two terms, usb and cabl. usb appears at position 0 and cabl at position 1. Those are their positions relative to each other. Perhaps you can explain a bit more what you are trying to do? -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem docs using Solr/Lucene: http://www.lucidimagination.com/search
Cache size
Hi folks, is there any way to know the size *in bytes* occupied by a cache (filter cache, doc cache, ...)? I can't find such information on the stats page. Regards -- Mehdi BEN HAJ ABBES
Re: Cache size
You can dump the heap and analyze it with a tool like jhat. IBM's heap analyzer is also a very good tool, and if I'm not mistaken people also use one that comes with Eclipse.

On Tuesday 08 February 2011 16:35:35 Mehdi Ben Haj Abbes wrote: [...] -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
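For example, with the standard JDK tools (the pid and file names are placeholders):

    jmap -dump:live,format=b,file=solr-heap.hprof <solr-pid>
    jhat solr-heap.hprof    # then browse the report at http://localhost:7000/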
Re: EdgeNgram Auto suggest - doubles ignore
I'm afraid I'll have to pass; I'm absolutely swamped at the moment. Perhaps someone else can pick it up. I will say that you should be getting terms back when you pre-lowercase them, so look in your index via the admin page or Luke to see if what's really in your index for the name field is what you think. As for sorting, I haven't a clue. Start by backing out your custom sorting, verify that things are as you expect for everything *except* sorting, and then add it back in. Best, Erick

On Tue, Feb 8, 2011 at 10:11 AM, johnnyisrael johnnyi.john...@gmail.com wrote: [...]
RE: How to search for special chars like ä from ae?
Hi Anithya, yes, that sounds right. You will want to edit mapping-FoldToASCII.txt, and my suggestion is that you rename mapping-FoldToASCII.txt to reflect your changes (for example, if your target language is German, you could rename it to mapping-German-FoldToASCII.txt); otherwise it would be easy to mistake this file for the unchanged original. Steve

-----Original Message----- From: Anithya [mailto:surysha...@gmail.com] Sent: Monday, February 07, 2011 6:28 PM To: solr-user@lucene.apache.org Subject: RE: How to search for special chars like ä from ae? [...]
Re: Separating Index Reader and Writer
Just wanted to bump this topic. Regards

Em wrote: Hi Peter, I must jump into this discussion. From a logical point of view, what you are saying makes sense only if both instances do not run on the same machine, or at least not on the same drive. When both run on the same machine and the same drive, the overall memory used should be equal, and I do not understand why this setup should affect cache warming etc., since the process of re-warming should be the same. Well, my knowledge of the internals is not very deep, but from a purely logical point of view, to me, the same thing is happening as if I did it in a single Solr instance. So what is the difference; what am I overlooking? Another thing: while W is committing and writing to the index, is there any inconsistency in R, or isn't there any because W is writing a new segment, so nothing is different for R until the commit finishes? Are there problems during optimizing an index? How do you inform R about the finished commit? Thank you for your explanation; it's a really interesting topic! Regards, Em

Peter Sturge-2 wrote: Hi, we use this scenario in production, where we have one write-only Solr instance and one read-only instance pointing to the same data. We do this so we can optimize caching etc. for each instance for write/read. The main performance gain is in cache warming and associated parameters. For your index W, it's worth turning off cache warming altogether, so commits aren't slowed down by warming. Peter

On Sun, Feb 6, 2011 at 3:25 PM, Isan Fulia isan.fu...@germinait.com wrote: Hi all, I have set up two indexes, one for reading (R) and the other for writing (W). Index R refers to the same data dir as W (defined in solrconfig.xml via dataDir). To make sure the R index sees the indexed documents of W, I am firing an empty commit on R. With this, I am getting a performance improvement compared to using the same index for reading and writing. Can anyone help me understand why this performance improvement is taking place even though both indexes are pointing to the same data directory? -- Thanks & Regards, Isan Fulia.
Jira problem
Hi list, I wanted to create a Jira issue for the CSVUpdateHandler topic I started a few days ago. However, I cannot create a Jira account - I do not receive any confirmation mail or anything like that. Are there any problems with the Jira? Regards
Scoring: Precedent for a Rules-/Priority-based Approach?
Hey everyone, I have a question about Lucene/Solr scoring in general. There are many factors at play in the final score for each document, and very often one factor will completely dominate everything else when that may not be the intention. ** The question: might there be a way to enforce strict requirements that certain factors are higher priority than other factors, and/or certain factors shouldn't overtake other factors? Perhaps a set of rules where one factor is considered before even examining another factor? Tuning boost numbers around and hoping for the best seems imprecise and very fragile. ** To make this more concrete, an example: We previously added the scores of multi-field matches together via an OR, so: score(query apple) = score(field1:apple) + score(field2:apple). I changed that to be more in-line with DisMaxParser, namely a max: score(query apple) = max(score(field1:apple), score(field2:apple)). I also modified coord such that coord would only consider actual unique terms (apple vs. orange), rather than terms across multiple fields (field1:apple vs. field2:apple). This seemed like a good idea, but it actually introduced a bug that was previously hidden. Suddenly, documents matching apple in the title and *nothing* in the body were being boosted over documents matching apple in the title and apple in the body! I investigated, and it was due to lengthNorm: previously, documents matching apple in both title and body were getting very high scores and completely overwhelming lengthNorm. Now that they were no longer getting *such* high scores, which was beneficial in many respects, they were also no longer overwhelming lengthNorm. This allowed lengthNorm to dominate everything else. I'd love to hear your thoughts :) Tavi
Tokenization: How to Allow Multiple Strategies?
Hey everyone, tokenization seems inherently fuzzy and imprecise, yet Solr/Lucene does not appear to provide an easy mechanism to account for this fuzziness. Let's take an example, where the document I'm indexing is "v1.1.0 mr. jones www.gmail.com". I may want to tokenize this as follows: [v1.1.0, mr, jones, www.gmail.com] ...or I may want to tokenize this as follows: [v1, 1.0, mr, jones, www, gmail.com] ...or I may want to tokenize it another way. I would think that the best approach would be indexing using multiple strategies, such as: [v1.1.0, v1, 1.0, mr, jones, www.gmail.com, www, gmail.com]. However, this would destroy phrase queries. And while Lucene lets you index multiple tokens at the same position, I haven't found a way to deal with cases where you want to index a set of tokens at one position - nor does that even make sense. For instance, I can't index [www, gmail.com] in the same position as www.gmail.com. So: - Any thoughts, in general, about how you all approach this fuzziness? Do you just choose one tokenization strategy and hope for the best? - Might there be a way to use multiple strategies and *not* break phrase queries that I'm overlooking? Thanks! Tavi
Re: Scoring: Precedent for a Rules-/Priority-based Approach?
Hi Tavi, in my understanding the scoring formula Lucene (and therefore Solr) uses is based on a mathematical model which is proven to work for general-purpose full-text searching. The real challenge, as you mention, comes when you need to achieve high-quality scoring based on the domain you are working in. For example, a general search portal for songs might need to score songs based on search relevance, but a search application for a music publisher might need to score songs first by relevance, with matched documents boosted according to the revenue they have generated... and the ranking from that second scoring strategy could be widely different from the first one. Personally, I can't think of a generic scoring strategy that would come out of the box with Solr and allow for all the widely different use cases. I don't really agree that tuning Solr, and in general experimenting for better scoring quality, is something fragile or awkward. As the name suggests, it is a tuning process which targets your specific environment. :) Technically, in our case we were able to significantly improve scoring quality (as judged by our domain experts) by using the dismax search handler and by experimenting with different boost values, function queries, the mm parameter, and by setting omitNorms to true for the fields we were having problems with. Regards, - Savvas

On 8 February 2011 16:23, Tavi Nathanson tavi.nathan...@gmail.com wrote: [...]
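As a concrete illustration of that kind of tuning (field names, boosts and the date field are invented for the example), the dismax parameters might look like:

    defType=dismax
    qf=title^3.0 body^1.0
    pf=title^5.0
    mm=2<75%
    bf=recip(ms(NOW,publish_date),3.16e-11,1,1)

Here qf spreads the query across fields with per-field boosts, pf boosts documents where the terms appear as a phrase, mm requires 75% of the terms to match once there are more than two, and bf adds a recency boost; each knob can be adjusted independently against the domain experts' expectations.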
Re: Scoring: Precedent for a Rules-/Priority-based Approach?
Hi Tavi, could you please provide an example query for your problem and the debugQuery output? It confuses me that you write score(query apple) = max(score(field1:apple), score(field2:apple)). I think your problem could come from the norms of your request, but I am not sure. If you can, show us a piece of your schema.xml and the debugQuery output so that we can have a look at it. I have to agree with Savvas: tuning scoring for a specific domain is an exciting thing, and there are lots of approaches out there to make scoring good. Regards

Tavi Nathanson wrote: [...]
Re: HTTP ERROR 400 undefined field: *
So I re-indexed some of the content, but no dice. Per Hoss, I tried disabling the TVC and it worked great. We're not really using the TVC right now, since we decided to turn off highlighting for the moment, so this isn't a huge deal. I'll create a new Jira issue. FYI, here are my queries from the logs.

This one breaks (undefined field):
webapp=/solr path=/select params={explainOther=&fl=*,score&indent=on&start=0&q=bruce&hl.fl=&qt=standard&wt=standard&fq=&version=2.2&rows=10} hits=114 status=400 QTime=21

This one works:
webapp=/solr path=/select params={explainOther=&indent=on&hl.fl=&wt=standard&version=2.2&rows=10&fl=*,score&start=0&q=bruce&tv=false&qt=standard&fq=} hits=128 status=0 QTime=48

Though I'm not sure why, with the TVC disabled, there are more hits but the QTime is slower. That's a different issue, though, and something I can work through. Thanks for your help.

On 02/07/2011 11:38 AM, Chris Hostetter wrote: : The stack trace is attached. I also saw this warning in the logs, not sure

From your attachment...

SEVERE: org.apache.solr.common.SolrException: undefined field: score
  at org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:142)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1357)

...this is one of the key pieces of info that was missing from your earlier email: that you are using the TermVectorComponent. It's likely that something changed in the TVC on 3x between the two versions you were using, and that change now freaks out on * or score in the fl. You still haven't given us an example of the full URLs you are using that trigger this error (it's possible there is something slightly off in your syntax - we don't know because you haven't shown us). All in all: this sounds like a newly introduced bug in TVC; please post the details into a new Jira issue.

As to the warning you asked about...

: Feb 3, 2011 8:14:10 PM org.apache.solr.core.Config getLuceneVersion
: WARNING: the luceneMatchVersion is not specified, defaulting to LUCENE_24
: emulation. You should at some point declare and reindex to at least 3.0,
: because 2.4 emulation is deprecated and will be removed in 4.0. This parameter
: will be mandatory in 4.0.

If you look at the example configs on the 3x branch it should be explained. It's basically just a new feature that lets you specify which quirks of the underlying Lucene code you want (so on upgrading, you are in control of whether you eliminate old quirks or not). -Hoss
Re: Tokenization: How to Allow Multiple Strategies?
Hi Tavi, if you want to use multiple tokenization strategies (different tokenizers, so to speak), you have to use different fieldTypes. Maybe you have to create your own tokenizer to do what you want, or a PatternTokenizer might help you. However, your examples with the different positions of specific terms remind me of the WordDelimiterFilter (see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory ). It does almost everything you wrote and is close to what you want, I think. Have a look at it. Regards
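A hedged sketch of a fieldType using that filter (parameter values chosen for illustration); for an input like www.gmail.com it emits the parts www/gmail/com plus the catenated wwwgmailcom, and preserveOriginal keeps www.gmail.com itself at the same starting position, which is close to the multi-strategy indexing described above:

    <fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1"
                preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>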
Re: HTTP ERROR 400 undefined field: *
Here is the ticket: https://issues.apache.org/jira/browse/SOLR-2352

On 02/08/2011 11:27 AM, Jed Glazner wrote: [...]
relational db mapping for advanced search
Hi, I was just after some advice on how to map some relational metadata to a Solr index. The web application I'm working on is based around people, and the searching is based around properties of these people. Several properties are more complex - for example, a person's occupations have place, from/to dates and other descriptive text; texts about a person have authors, sources and publication dates. Despite the usefulness of facets and search-based navigation, an advanced search feature is a non-negotiable requirement of the application. An advanced search needs to be able to query a person on any set of attributes (e.g. gender, birth date, death date, place of birth, etc.), including the more complex search criteria described above (occupation, texts).

Taking occupation as an example: because occupation has its own metadata, and a person could have worked an arbitrary number of occupations throughout their lifetime, I was wondering how/if this information can be denormalised into a single person index document to support such a search. I can't use text concatenation in a multivalued field, as I need to be able to run date-based range queries (e.g. publication dates, occupation dates). And I'm not sure that resorting to multiple repeated fields based on the current limits (e.g. occ1, occ1startdate, occ1enddate, occ1place, occ2, etc.) is a good approach (although that would work).

If there isn't a sensible way to denormalise this, what is the best approach? For example, should I have an occupation document type, a person document type, and a text/source document type, each containing the relevant person id, and (in the advanced search context) run a query against each document type and then use the intersecting set of person ids as the result used by the application for its display/pagination? And if so, how do I ensure I capture all records? For example, if there are 100,000 hits on someone having worked in Australia in 1956, is there any way to ensure all 100,000 are returned in a query (similar to facet.limit=-1), other than specifying an arbitrarily high number in the rows parameter and hoping a query doesn't hit more than that and thus exclude records above the limit from the intersect processing? Or is there a single-query solution? Any advice/hints welcome. Scott.
RE: relational db mapping for advanced search
I have no great answer for you; this is, to me, a generally unanswered question - it's hard to do this sort of thing in Solr, and you seem to understand the problem properly. There ARE some interesting new features in trunk (not 1.4) that may be relevant, although from my perspective none of them provides a magic-bullet solution. But there is a 'join' feature which could be awfully useful with the setup you suggest of having different 'types' of documents all together in the same index. https://issues.apache.org/jira/browse/SOLR-2272

From: Scott Yeadon [scott.yea...@anu.edu.au] Sent: Tuesday, February 08, 2011 4:41 PM To: solr-user@lucene.apache.org Subject: relational db mapping for advanced search [...]
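For illustration, a query using that trunk-only join (all field names hypothetical: occupation documents carry a person_id pointing at a person document's id) to find people who held an occupation in Australia starting in 1956:

    q={!join from=person_id to=id}type:occupation AND place:Australia AND startdate:[1956-01-01T00:00:00Z TO 1956-12-31T23:59:59Z]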
Re: relational db mapping for advanced search
Yes, I saw something in the dev stream about compound types as well, which would also be useful (so in my example an occupation field could comprise multiple fields of different types), but these are up-and-coming features. I suspect using multiple document types is probably the best way for now, but thanks for the heads-up on the join - it looks like these issues will be better addressed in the future. An RDBMS in my context won't work well, as it requires lots of joins (and self-joins) for complex searches; in the old system these tend to lock up the DB as the temp table size grows exponentially. Scott. On 9/02/11 8:57 AM, Jonathan Rochkind wrote:
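On the side question of capturing all matching records: Solr has no rows=-1, but you can page through the full result set with start/rows rather than guessing an upper bound. A minimal SolrJ sketch under assumed names (the URL, query and person_id field are hypothetical; exception handling omitted):

import java.util.HashSet;
import java.util.Set;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrQuery q = new SolrQuery("doctype:occupation AND place:Australia");
q.setFields("person_id"); // fetch only the id used for the intersection
q.setRows(1000);          // page size

Set<String> personIds = new HashSet<String>();
int start = 0;
SolrDocumentList page;
do {
    q.setStart(start);
    page = server.query(q).getResults();
    for (SolrDocument doc : page) {
        personIds.add((String) doc.getFieldValue("person_id"));
    }
    start += page.size();
} while (page.size() > 0 && start < page.getNumFound());

Repeating this per document type and intersecting the sets gives the combined result, at the cost of one round trip per page.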
RE: How to search for special chars like ä from ae?
Thanks for the help Steve, it worked!!!
RE: How to search for special chars like ä from ae?
Hi Anithya, That's good to hear. Again, please consider donating your work: http://wiki.apache.org/solr/HowToContribute#Making_Changes. Steve -----Original Message----- From: Anithya [mailto:surysha...@gmail.com] Sent: Tuesday, February 08, 2011 5:16 PM To: solr-user@lucene.apache.org Subject: RE: How to search for special chars like ä from ae? Thanks for the help Steve, it worked!!!
Re: How to search for special chars like ä from ae?
Hello, Quick question on Solr replication: what effect does the index reload after a replication have on search requests? Can the server still respond to user queries with the old index, especially during the following phase of replication on slaves? http://wiki.apache.org/solr/SolrReplication#How_does_the_slave_replicate.3F "After the download completes, all the new files are 'mov'ed to the slave's live index directory and the files' timestamps will match the timestamps in the master. A 'commit' command is issued on the slave by the Slave's ReplicationHandler and the new index is loaded." Thanks, Charan
RE: How to search for special chars like ä from ae?
So - how did you end up setting it up? In my reading of the thread, it seems you could have a search for 'mäcman' hit 'macman' or 'maecman', but not both, since it seems you could only map the ä to a single replacement. Or can it be mapped multiple times, generating multiple tokens? Thanks!
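For reference, a MappingCharFilter mapping file maps each input sequence to exactly one replacement, one rule per line, e.g.:

"ä" => "ae"
"ö" => "oe"

so a single field can only fold ä one way. As far as I can tell, to have a search for 'mäcman' hit both 'macman' and 'maecman' you would need two fieldTypes with different mapping files (e.g. one mapping "ä" => "a" and one mapping "ä" => "ae"), copyField the source text into a field of each type, and query across both fields.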
Re: Solr n00b question: writing a custom QueryComponent
Your observation regarding optimisation is an interesting one; it does at least make sense that reducing the size of a segment will speed up optimisation and reduce the disk space needed. In a situation where we had multiple shards, we had two 'rows', for redundancy purposes. In that situation, we could take one row offline while it optimised and allow the other to serve search during that time. If we offset optimisation by 12 hours for each of our rows, we can optimise daily and not have a problem with loss of up-to-date content or slow searches during an optimisation. As to splitting indexes, it isn't an easy task to do properly, and there's nothing in Solr to do it. However, there is a very clever class in Lucene contrib that you can use to split a Lucene index [1], and you can safely use it to split a Solr index so long as the index isn't in use while you're doing it. Upayavira [1] for example: http://lucene.apache.org/java/3_0_2/api/contrib-misc/org/apache/lucene/index/MultiPassIndexSplitter.html On Tue, 08 Feb 2011 06:24 -0800, Ishwar ishwarsridha...@yahoo.com wrote: Thanks for the detailed reply Upayavira. To answer your question, our index is growing much faster than expected and our performance is grinding to a halt. Currently, it has over 150 million records. We're planning to split the index into multiple shards very soon and move the index creation to hadoop. Our current situation is that we need to run optimize once every couple of days to keep it in shape. Given the size (index + stored), it takes a long time to complete, during which time we can't add new documents into the index. And because of the size of the stored fields, we need double the storage size of the current index to optimize. Since we're on EC2, this requires frequent increases in storage capacity. Even after sharding the index, the time taken to optimize the index is going to be significant. That's the reason why we decided to store these fields in MySQL. If there's some easier solution that I've overlooked, please point it out. On a related note, is there a way to 'automagically' split the existing index into multiple shards? -- Thanks, Ishwar Just another resurrected Neozoic Archosaur comics. http://www.flickr.com/photos/mojosaurus/sets/72157600257724083/ From: Upayavira u...@odoko.co.uk To: solr-user@lucene.apache.org Cc: Sent: Tuesday, February 8, 2011 7:17 PM Subject: Re: Solr n00b question: writing a custom QueryComponent The conventional way to do it would be to index your title and content fields in Solr, along with the ID to identify the document. You could do a search against solr, and just return an ID field; then your 'client code' would match that up with the title/content data from your database. And yes, SolrJ would be the obvious route to take here, for your client application. Yes, it does mean another component that needs to be maintained, but by using Solr's external interface you will be protected from changes to internals that could break your custom components, and you will likely be more able to take advantage of other Solr features that are also available via the standard interfaces. My next question is: are you going to be using the data you're storing in mysql for something other than just enhancing search results? If not, it may still make sense to store the data in Solr. It would mean you just have one index to manage, rather than an index and a database - after all, the words *have* to take up disk space somewhere :-).
If you end up with so many documents indexed that performance grinds (over 10 million??) you can split your index across multiple shards. Upayavira Once you get search results back from Solr, you would do a query against your database to return the additional On Tue, 08 Feb 2011 03:38 -0800, Ishwar ishwarsridha...@yahoo.com wrote: Hi Upayavira, Apologies for the lack of clarity in the mail. The feeds have the following fields: id, url, title, content, refererurl, createdDate, author, etc. We need search functionality on title and content. As mentioned earlier, storing title and content in solr takes up a lot of space. So, we index title and content in solr, and we wish to store title and content in MySQL, which has the fields id, title, content. I'm also looking at a solr client - solrj - to query MySQL based on what solr returns. But that means another component which needs to be maintained. I was wondering if it's a good idea to implement the functionality in solr itself. -- Thanks, Ishwar Just another resurrected Neozoic Archosaur comics. http://www.flickr.com/photos/mojosaurus/sets/72157600257724083/ From: Upayavira u...@odoko.co.uk To: solr-user@lucene.apache.org Cc: Sent: Tuesday, February 8, 2011 4:36 PM Subject: Re: Solr n00b question: writing a custom QueryComponent
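As an aside on the optimise mechanics discussed above: an optimise can be triggered remotely with an update message or via SolrJ - for example (URL hypothetical, exception handling omitted):

curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<optimize/>'

or, from SolrJ:

new CommonsHttpSolrServer("http://localhost:8983/solr").optimize();

Either way, plan for roughly double the index size in free disk while segments are rewritten, as noted in the thread.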
Re: How to search for special chars like ä from ae?
When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. See: http://people.apache.org/~hossman/#threadhijack On Tue, Feb 8, 2011 at 5:59 PM, charan kumar charan.ku...@gmail.com wrote:
Re: Tokenization: How to Allow Multiple Strategies?
Thanks for the suggestions! Using a new field makes sense, except it would double the size of the index. I'd like to add additional terms, at my discretion, only when there's ambiguity. More specifically, do you know of any way to put multiple *token sets* at the same position of the same field? If I can tokenize "123-4567 apple" as: [Token(123), Token(-), Token(4567), Token(apple)] or [Token(123-4567), Token(apple)] ...might there be a way to put [Token(123), Token(-), Token(4567)] *and* [Token(123-4567)] in the index in such a way that the PhraseQuery "Token(123-4567) Token(apple)" would match the above string, *and* the PhraseQuery "Token(123) Token(-) Token(4567) Token(apple)" would also match it? Thanks! Tavi On Tue, Feb 8, 2011 at 10:34 AM, Em mailformailingli...@yahoo.de wrote: Hi Tavi, if you want to use multiple tokenization strategies (different tokenizers, so to speak) you have to use different fieldTypes. Maybe you have to create your own tokenizer for doing what you want, or a PatternTokenizer might help you. However, your examples for the different positions of specific terms remind me of the WordDelimiterFilter (see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory ). It does almost everything you wrote and is close to what you want, I think. Have a look at it. Regards
Re: jndi datasource in dataimport
Hi, still no luck with this. Is the problem with the name attribute of the dataSource element in the data config? On 5 February 2011 10:48, lee carroll lee.a.carr...@googlemail.com wrote: ah, should this work, or am i doing something obviously wrong? in config: <dataSource jndiName="java:sourcepathName" type="JdbcDataSource" user="xxx" password="xxx"/> in dataimport config: <dataSource type="JdbcDataSource" name="java:sourcepathName"/> what am i doing wrong? On 5 February 2011 10:16, lee carroll lee.a.carr...@googlemail.com wrote: Hi list, It looks like you can use a jndi datasource in the data import handler; however, i can't find any syntax for this. Where is the best place to look? (and can anyone confirm jndi does work in DataImportHandler?)
Re: Tokenization: How to Allow Multiple Strategies?
A couple of things... First, you haven't provided any evidence that increasing the index size is actually a concern. If your index isn't all that large, it really doesn't matter, and conserving index size may not be worth the trouble. WordDelimiterFilterFactory (WDFF) will handle the use cases you outlined below, but don't get stuck on, for instance, having the '-' be a token unless you can say for certain that it has benefits over just indexing and searching on 123 followed by 4567, which is what would happen with WDFF. I recommend that you look at the analysis page (check the verbose box) to see the effects of tokenization with various analysis chains before making any firm decisions. Best Erick On Tue, Feb 8, 2011 at 6:24 PM, Tavi Nathanson tavi.nathan...@gmail.com wrote:
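On the token-stacking question specifically: WDFF can emit the original token at the same position as its split parts (a position increment of zero), so phrase queries against either form have a chance to match. A sketch of the kind of fieldType involved - untested, and the attribute choices here are my assumption:

<fieldType name="text_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- preserveOriginal keeps "123-4567" stacked alongside "123" and "4567" -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateNumbers="1" preserveOriginal="1"/>
  </analyzer>
</fieldType>

Exact position assignment is subtle (and the query-side analyzer matters just as much), so verify the behaviour on the analysis page before committing to it.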
Does Distributed Search support {!boost }?
Is it possible to do a query like {!boost b=log(popularity)}foo over sharded indexes? I looked at the wiki on distributed search (http://wiki.apache.org/solr/DistributedSearch) and it has a list of components that are supported in distributed search. Just wondering which component {!boost} belongs to. Thanks.
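For what it's worth, {!boost} is a query parser rather than a separate search component, so the resulting query is executed by the standard QueryComponent - the component that carries distributed support. The request being asked about would look something like this (hosts hypothetical; the q parameter would need URL-encoding in practice):

http://host1:8983/solr/select?q={!boost b=log(popularity)}foo&shards=host1:8983/solr,host2:8983/solr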
Re: General question about Solr Caches
: In my understanding, the Current Index Searcher uses a cache instance and : when a New Index Searcher is registered a new cache instance is used which : is also auto-warmed. However, what happens when the New Index Searcher is a : view of an index which has been modified? If the entries contained in the : old cache are copied during auto warming to the new cache wouldn't that new : cache contain invalid entries? a) i'm not sure what you mean by "view of an index which has been modified" ... except for the first time an index is created, an Index Searcher always contains a view of an index which has been modified -- the view that the IndexSearcher represents is entirely consistent and doesn't change as documents are added/removed - that's why a new Searcher needs to be opened. b) entries are not copied during autowarming. the *keys* of the entries in the old cache are used to warm the new cache -- using the new searcher to generate new values. (caveat: if you have a custom cache, you could write a custom cache regenerator that did copy the values from the old cache verbatim -- i have done that in special cases where the type of object i was caching didn't vary based on the IndexSearcher -- or did vary, but in such a way that i could use the new Searcher to determine a cheap piece of information and, based on the result, either reuse an old value that was expensive to compute or recompute it using the new Searcher ... but none of the default cache regenerators for the stock solr caches work this way) : Thanks, : - Savvas -Hoss
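For context, autowarming is configured per cache in solrconfig.xml; autowarmCount is the number of keys from the old cache that are replayed against the new searcher. The sizes below are illustrative only:

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>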
Re: jndi datasource in dataimport
: It looks like you can use a jndi datasource in the data import handler. : however i can't find any syntax on this. : : Where is the best place to look for this? (and confirm if jndi does work in : dataimporthandler) It's been a long time since i used JNDI on anything, and i've never tried it with DIH, but a google search for "JNDI DataImportHandler" pointed to... http://wiki.apache.org/solr/DataImportHandlerFaq#How_do_I_use_a_JNDI_DataSource.3F -Hoss
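For what it's worth, the pattern from that FAQ looks like the following (the JNDI name is illustrative; note that in many containers, e.g. Tomcat, JNDI resources live under java:comp/env/, which may be what the config earlier in this thread is missing):

<dataSource jndiName="java:comp/env/jdbc/myDataSource" type="JdbcDataSource"/>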
[WKT] Spatial Searching
I just came across a ~nudge post over in the SIS list on what the status is for that project. This got me looking more into spatial mods with Solr 4.0. I found this enhancement in Jira: https://issues.apache.org/jira/browse/SOLR-2155. In this issue, David mentions that he's already integrated JTS into Solr 4.0 for querying on polygons stored as WKT. It's relatively easy to get WKT strings into Solr, but does the field type exist yet? Is there a patch or something that I can test out? Here's how I would do it using GDAL/OGR and the already existing csv update handler. http://www.gdal.org/ogr/drv_csv.html

ogr2ogr -f CSV output.csv input.shp -lco GEOMETRY=AS_WKT

This converts a shapefile to a csv with the geometries intact in the form of WKT. You can then get the data into Solr by running the following command.

curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,attr1,attr2,attr3,geom&stream.file=C:\tmp\output.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"

There are lots of flavors of geometries, so I suspect that this will be a daunting task, but because JTS recognizes each geometry type it should be possible to work with them. Does anyone know of a patch, or even when this functionality might be included in Solr 4.0? I need to query for polygons ;-) Thanks, Adam
Re: How to search for special chars like ä from ae?
Sorry for cross-posting, but that is the only way I could get my question posted - the Solr mailing server treats my question as SPAM: Technical details of permanent failure: Google tried to deliver your message, but it was rejected by the recipient domain. We recommend contacting the other email provider for further information about the cause of this error. The error that the other server returned was: 552 552 spam score (5.1) exceeded threshold (FREEMAIL_FROM,FS_REPLICA, HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL (state 18). On Tue, Feb 8, 2011 at 3:17 PM, Erick Erickson erickerick...@gmail.com wrote:
Re: [WKT] Spatial Searching
+1 to David's patch from SOLR-2155. It would be great to see it implemented. Great job using GDAL to convert to WKT, Adam! Cheers, Chris On Feb 8, 2011, at 8:18 PM, Adam Estrada wrote: ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
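For anyone tracking this: the query style the SOLR-2155/JTS work points toward puts a WKT shape directly inside a field query. A hedged sketch, assuming a WKT-capable field named geom (syntax as it later took shape on trunk, so treat it as an approximation):

fq=geom:"Intersects(POLYGON((-80 35, -80 36, -79 36, -79 35, -80 35)))"

i.e. the polygon itself is expressed as WKT inside the filter query.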
Re: Solr n00b question: writing a custom QueryComponent
In the situation that you'd explained, I'm assuming one of the rows is the master and the other is the slave. How did you continue feeding documents while the master was down for optimisation? And thanks for the link to MultiPassIndexSplitter. I shall check it out. -- Thanks, Ishwar Just another resurrected Neozoic Archosaur comics. http://www.flickr.com/photos/mojosaurus/sets/72157600257724083/ From: Upayavira u...@odoko.co.uk To: solr-user@lucene.apache.org Sent: Wednesday, February 9, 2011 4:42 AM Subject: Re: Solr n00b question: writing a custom QueryComponent
Help migrating from Lucene
Hey guys, We're migrating from Lucene to Solr. So far the migration has been smooth; however, there is one feature I'm having issues adapting. Our calls to our indexing service are defined in a central interface. Here is an example of a query executed from a programmatically constructed Lucene query.

BooleanQuery query = new BooleanQuery();
BooleanQuery inputTerms = new BooleanQuery();
inputTerms.add(new TermQuery(new Term(FIELD_EMAIL, input)), Occur.SHOULD);
inputTerms.add(new TermQuery(new Term(FIELD_PHONE, getNumericString(input))), Occur.SHOULD);
query.add(inputTerms, Occur.MUST);
query.add(new TermQuery(new Term(FIELD_RESOLVED, String.valueOf(false))), Occur.MUST);
NumericRangeQuery time = NumericRangeQuery.newLongRange(FIELD_CREATETIME, null, endTime, true, true);
query.add(time, Occur.MUST);
SortField sort = new SortField(FIELD_CREATETIME, SortField.LONG, true);

CommonsHttpSolrServer client = getClient(indexName);
SolrQuery solrQuery = new SolrQuery();
// TODO how do I set the sort?
solrQuery.setQuery(query.toString());
QueryResponse response = client.query(solrQuery);

How can I set the sort in the Java client? Also, with the annotations of POJOs outlined here: http://wiki.apache.org/solr/Solrj#Directly_adding_POJOs_to_Solr - how are sets handled? For instance, how are Lists of other POJOs added to the document? Thanks, Todd
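On the sort question, a hedged sketch: SolrQuery exposes sorting directly, so the Lucene SortField doesn't need to be carried over. Assuming the same FIELD_CREATETIME constant:

solrQuery.setQuery(query.toString());
// reverse=true on the Lucene SortField corresponds to descending order here
solrQuery.addSortField(FIELD_CREATETIME, SolrQuery.ORDER.desc);
QueryResponse response = client.query(solrQuery);

On the POJO question: as far as I know, SolrJ's @Field annotation binds multivalued fields to collections of simple types (e.g. List<String>), but it does not serialise nested POJOs - a Solr document is flat, so a list of child objects has to be flattened into ordinary fields before indexing.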
Re: Solr n00b question: writing a custom QueryComponent
Actually, in that situation, we indexed twice, to both, so there was no master and no slave. Our testing showed that search was not slowed down unduly by indexing. Upayavira On Tue, 08 Feb 2011 22:34 -0800, Ishwar ishwarsridha...@yahoo.com wrote: