Re: solr init.d script
Sorry, forgot to mention, CentOS. Thanks. I have a very similar script to this CentOS one and I am missing the status portion of the script.

On 11/09/2010 08:47 AM, Eric Martin wrote:

Er, what flavor?

RHEL / CentOS:

#!/bin/sh
# Starts, stops, and restarts Apache Solr.
#
# chkconfig: 35 92 08
# description: Starts and stops Apache Solr

SOLR_DIR="/var/solr"
JAVA_OPTIONS="-Xmx1024m -DSTOP.PORT=8079 -DSTOP.KEY=mustard -jar start.jar"
LOG_FILE="/var/log/solr.log"
JAVA="/usr/bin/java"

case $1 in
  start)
    echo "Starting Solr"
    cd $SOLR_DIR
    $JAVA $JAVA_OPTIONS 2> $LOG_FILE &
    ;;
  stop)
    echo "Stopping Solr"
    cd $SOLR_DIR
    $JAVA $JAVA_OPTIONS --stop
    ;;
  restart)
    $0 stop
    sleep 1
    $0 start
    ;;
  *)
    echo "Usage: $0 {start|stop|restart}" >&2
    exit 1
    ;;
esac

Debian: http://xdeb.org/node/1213

__

Ubuntu STEPS

Type in the following command in TERMINAL to install the nano text editor:

sudo apt-get install nano

Type in the following command in TERMINAL to add a new script:

sudo nano /etc/init.d/solr

TERMINAL will display a new page titled GNU nano 2.0.x. Paste the below script into this TERMINAL window:

#!/bin/sh -e
# Starts, stops, and restarts solr

SOLR_DIR="/apache-solr-1.4.0/example"
JAVA_OPTIONS="-Xmx1024m -DSTOP.PORT=8079 -DSTOP.KEY=stopkey -jar start.jar"
LOG_FILE="/var/log/solr.log"
JAVA="/usr/bin/java"

case $1 in
  start)
    echo "Starting Solr"
    cd $SOLR_DIR
    $JAVA $JAVA_OPTIONS 2> $LOG_FILE &
    ;;
  stop)
    echo "Stopping Solr"
    cd $SOLR_DIR
    $JAVA $JAVA_OPTIONS --stop
    ;;
  restart)
    $0 stop
    sleep 1
    $0 start
    ;;
  *)
    echo "Usage: $0 {start|stop|restart}" >&2
    exit 1
    ;;
esac

Note: In the above script you might have to replace /apache-solr-1.4.0/example with the appropriate directory name.

Press the CTRL-X keys. Type in Y. When asked for the File Name to Write, press the ENTER key. You're now back at the TERMINAL command line.

Type in the following command in TERMINAL to create all the links to the script:

sudo update-rc.d solr defaults

Type in the following command in TERMINAL to make the script executable:

sudo chmod a+rx /etc/init.d/solr

To test: Reboot your Ubuntu Server.
Wait until the Ubuntu Server reboot is completed. Wait 2 minutes for Apache Solr to start up. Using your internet browser, go to your website and try a Solr search.

-----Original Message-----
From: Nikola Garafolic [mailto:nikola.garafo...@srce.hr]
Sent: Monday, November 08, 2010 11:42 PM
To: solr-user@lucene.apache.org
Subject: solr init.d script

Hi,

Does anyone have some kind of init.d script for solr, that can start, stop and check solr status?

--
Nikola Garafolic
SRCE, Sveucilisni racunski centar
tel: +385 1 6165 804
email: nikola.garafo...@srce.hr
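Since the original question asked for a status portion that neither script provides, here is a hedged sketch of what one could add; it assumes Solr was launched via start.jar as in the scripts above and that pgrep is available (function name and wording are illustrative, not from the thread):

```shell
#!/bin/sh
# Sketch only: a "status" check matching the start.jar-based scripts in
# this thread. Assumes pgrep is available; a real init script would also
# exit non-zero (e.g. 3, per LSB convention) when Solr is stopped.
solr_status() {
    if pgrep -f "start.jar" > /dev/null 2>&1; then
        echo "Solr is running"
    else
        echo "Solr is not running"
    fi
}

solr_status
```

In the init scripts themselves this would become another case branch, e.g. status) solr_status ;; next to start) and stop).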
Re: Replication and ignored fields
Not sure about that. I have read that the replication handler actually issues a commit() on itself once the index is downloaded. But probably a better way for Markus' case is to hook the prune job on the master, writing to another core (myIndexPruned). Then you replicate from that core instead, and you also get the benefit of transferring a smaller index across the network. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com On 8. nov. 2010, at 23.50, Shalin Shekhar Mangar wrote: On Fri, Nov 5, 2010 at 2:30 PM, Jan Høydahl / Cominvent jan@cominvent.com wrote: How about hooking in Andrzej's pruning tool at the postCommit event, literally removing unused fields. I believe a commit is fired on the slave by itself after every successful replication, to put the index live. You could execute a script which prunes away the dead meat and then call a new commit? Well, I don't think it will work because a new commit will cause the index version on the slave to be ahead of the master which will cause Solr replication to download a full index from the master and it'd go in an infinite loop. -- Regards, Shalin Shekhar Mangar.
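For reference, if one did hang a prune job on the master at commit time as suggested, solrconfig.xml supports event listeners that run an external command. A sketch only; the script path is a placeholder and whether pruning belongs here depends on the setup:

```xml
<!-- solrconfig.xml sketch: run an external prune script after each
     commit on the master. /path/to/prune.sh is a placeholder. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <listener event="postCommit" class="solr.RunExecutableListener">
    <str name="exe">/path/to/prune.sh</str>
    <str name="dir">.</str>
    <bool name="wait">true</bool>
  </listener>
</updateHandler>
```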
solr dynamic core creation
Hi, I’m not sure this is the right place, hopefully you can help. Anyway, I also sent mail to solr-user@lucene.apache.org. I’m using solr – one master with 17 slaves in the server and using solrj as the java client. Currently there’s only one core in all of them (master and slaves) – only the cpaCore. I thought about using multi-core solr, but I have some problems with that. I don’t know in advance which cores I’d need – when my java program runs, I call for documents to be indexed to a certain url, which contains the core name, and I might create a url based on a core that is not yet created. For example (at the beginning, the only core is cpaCore):

Calling to index – http://localhost:8080/cpaCore – existing core, everything as usual
Calling to index – http://localhost:8080/newCore – currently throws an exception. What I'd like to happen is: the server realizes there’s no core “newCore”, creates it and indexes to it. After that – also creates the new core in the slaves.
Calling to index – http://localhost:8080/newCore – existing core, everything as usual

What I’d like the server side to do is realize by itself whether the core exists or not, and if not – create it. One other restriction – I can’t change anything in the client side – the client can only make the calls it’s making now – for index and search – and cannot make calls for core creation via the CoreAdminHandler. All I can do is something in the server itself. What can I do to get it done? Write some RequestHandler? RequestProcessor? Any other option?

Thanks, nizan

-- View this message in context: http://lucene.472066.n3.nabble.com/solr-dynamic-core-creation-tp1867705p1867705.html Sent from the Solr - User mailing list archive at Nabble.com.
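For what it's worth, the standard way to create a core at runtime — which a custom server-side hook could invoke internally when it sees an unknown core name — is the CoreAdmin CREATE action. A sketch, with host, port and paths as placeholders (not the poster's actual URLs):

```shell
# Sketch only: CoreAdmin CREATE call. Assumes a shared config/schema and
# that the servlet container exposes Solr at this placeholder URL.
curl "http://localhost:8080/solr/admin/cores?action=CREATE&name=newCore&instanceDir=newCore&config=solrconfig.xml&schema=schema.xml"
```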
RE: Tomcat special character problem
The problem was firstly the wrong URIEncoding of tomcat itself. The second problem came from the application's side: The params were wrongly encoded, so it was not possible to show the desired results. If you need to convert from different encodings to utf8, I can give you the following piece of pseudocode: string = urlencode(encodeForUtf8(myString)); And if you need to decode for several reasons, keep in mind that you must change the order of decodings: value = decodeFromUtf8(urldecode(string)); Hope that helps. Thank you! -- View this message in context: http://lucene.472066.n3.nabble.com/Tomcat-special-character-problem-tp1857648p1868024.html Sent from the Solr - User mailing list archive at Nabble.com.
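For anyone hitting the same thing, the Tomcat half of the fix is typically the URIEncoding attribute on the HTTP connector in server.xml; a sketch (port and other attributes are illustrative):

```xml
<!-- server.xml sketch: tell Tomcat to decode request URIs as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           URIEncoding="UTF-8"/>
```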
Re: solr init.d script
I have two nodes running one jboss server each and using one (single) solr instance, that's how I run it for now. Do you recommend running jboss with solr via servlet? The two jboss run in load-balancing for high availability purposes. For now it seems to be ok.

On 11/09/2010 03:17 PM, Israel Ekpo wrote:

I think it would be a better idea to load solr via a servlet container like Tomcat and then create the init.d script for tomcat instead. http://wiki.apache.org/solr/SolrTomcat#Installing_Tomcat_6

--
Nikola Garafolic
SRCE, Sveucilisni racunski centar
tel: +385 1 6165 804
email: nikola.garafo...@srce.hr
Re: Replication and ignored fields
On Tue, Nov 9, 2010 at 12:33 AM, Jan Høydahl / Cominvent jan@cominvent.com wrote: Not sure about that. I have read that the replication handler actually issues a commit() on itself once the index is downloaded. That was true with the old replication scripts. The Java based replication just re-opens the IndexReader after all the files are downloaded so the index version on the slave remains in sync with the one on the master. But probably a better way for Markus' case is to hook the prune job on the master, writing to another core (myIndexPruned). Then you replicate from that core instead, and you also get the benefit of transferring a smaller index across the network. I agree, that is a good idea. -- Regards, Shalin Shekhar Mangar.
Re: How to Facet on a price range
Just to add to this: if you want to allow the user more choice in his option to select ranges, perhaps by using a 2-sided JavaScript slider for the price range (a la kayak.com), it may be very worthwhile to discretize the allowed values for the slider (e.g. steps of 5 dollars). Most js-slider implementations allow for this easily. This has the advantages of:

- having far fewer possible facet queries and thus a far greater chance of these facet queries hitting the cache.
- a better user experience, although that's debatable.

Just to be clear: for this the Solr side would still use:

facet=on&facet.query=price:[50 TO *]&facet.query=price:[* TO 100]

and not the optimized pre-computed variant suggested above.

Geert-Jan

2010/11/9 jayant jayan...@hotmail.com

That was very well thought of and a clever solution. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-Facet-on-a-price-range-tp1846392p1869201.html Sent from the Solr - User mailing list archive at Nabble.com.
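Spelled out as a full request, the two facet queries above might look like this (host, port and core path are placeholders; --data-urlencode handles the escaping of the range syntax):

```shell
# Sketch: the two facet queries from the thread as one request against a
# placeholder Solr instance; rows=0 returns only the facet counts.
curl -G "http://localhost:8983/solr/select" \
  --data-urlencode "q=*:*" \
  --data-urlencode "rows=0" \
  --data-urlencode "facet=on" \
  --data-urlencode "facet.query=price:[50 TO *]" \
  --data-urlencode "facet.query=price:[* TO 100]"
```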
Re: Replication and ignored fields
Cool, thanks for the clarification, Shalin. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com On 9. nov. 2010, at 15.12, Shalin Shekhar Mangar wrote: On Tue, Nov 9, 2010 at 12:33 AM, Jan Høydahl / Cominvent jan@cominvent.com wrote: Not sure about that. I have read that the replication handler actually issues a commit() on itself once the index is downloaded. That was true with the old replication scripts. The Java based replication just re-opens the IndexReader after all the files are downloaded so the index version on the slave remains in sync with the one on the master. But probably a better way for Markus' case is to hook the prune job on the master, writing to another core (myIndexPruned). Then you replicate from that core instead, and you also get the benefit of transferring a smaller index across the network. I agree, that is a good idea. -- Regards, Shalin Shekhar Mangar.
Re: How to Facet on a price range
That was very well thought of and a clever solution. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-Facet-on-a-price-range-tp1846392p1869201.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr init.d script
I think it would be a better idea to load solr via a servlet container like Tomcat and then create the init.d script for tomcat instead.

http://wiki.apache.org/solr/SolrTomcat#Installing_Tomcat_6

On Tue, Nov 9, 2010 at 2:47 AM, Eric Martin e...@makethembite.com wrote:

Er, what flavor?

RHEL / CentOS:

#!/bin/sh
# Starts, stops, and restarts Apache Solr.
#
# chkconfig: 35 92 08
# description: Starts and stops Apache Solr

SOLR_DIR="/var/solr"
JAVA_OPTIONS="-Xmx1024m -DSTOP.PORT=8079 -DSTOP.KEY=mustard -jar start.jar"
LOG_FILE="/var/log/solr.log"
JAVA="/usr/bin/java"

case $1 in
  start)
    echo "Starting Solr"
    cd $SOLR_DIR
    $JAVA $JAVA_OPTIONS 2> $LOG_FILE &
    ;;
  stop)
    echo "Stopping Solr"
    cd $SOLR_DIR
    $JAVA $JAVA_OPTIONS --stop
    ;;
  restart)
    $0 stop
    sleep 1
    $0 start
    ;;
  *)
    echo "Usage: $0 {start|stop|restart}" >&2
    exit 1
    ;;
esac

Debian: http://xdeb.org/node/1213

__

Ubuntu STEPS

Type in the following command in TERMINAL to install the nano text editor:

sudo apt-get install nano

Type in the following command in TERMINAL to add a new script:

sudo nano /etc/init.d/solr

TERMINAL will display a new page titled GNU nano 2.0.x. Paste the below script into this TERMINAL window:

#!/bin/sh -e
# Starts, stops, and restarts solr

SOLR_DIR="/apache-solr-1.4.0/example"
JAVA_OPTIONS="-Xmx1024m -DSTOP.PORT=8079 -DSTOP.KEY=stopkey -jar start.jar"
LOG_FILE="/var/log/solr.log"
JAVA="/usr/bin/java"

case $1 in
  start)
    echo "Starting Solr"
    cd $SOLR_DIR
    $JAVA $JAVA_OPTIONS 2> $LOG_FILE &
    ;;
  stop)
    echo "Stopping Solr"
    cd $SOLR_DIR
    $JAVA $JAVA_OPTIONS --stop
    ;;
  restart)
    $0 stop
    sleep 1
    $0 start
    ;;
  *)
    echo "Usage: $0 {start|stop|restart}" >&2
    exit 1
    ;;
esac

Note: In the above script you might have to replace /apache-solr-1.4.0/example with the appropriate directory name.

Press the CTRL-X keys. Type in Y. When asked for the File Name to Write, press the ENTER key. You're now back at the TERMINAL command line.

Type in the following command in TERMINAL to create all the links to the script:
sudo update-rc.d solr defaults

Type in the following command in TERMINAL to make the script executable:

sudo chmod a+rx /etc/init.d/solr

To test: Reboot your Ubuntu Server. Wait until the Ubuntu Server reboot is completed. Wait 2 minutes for Apache Solr to start up. Using your internet browser, go to your website and try a Solr search.

-----Original Message-----
From: Nikola Garafolic [mailto:nikola.garafo...@srce.hr]
Sent: Monday, November 08, 2010 11:42 PM
To: solr-user@lucene.apache.org
Subject: solr init.d script

Hi,

Does anyone have some kind of init.d script for solr, that can start, stop and check solr status?

--
Nikola Garafolic
SRCE, Sveucilisni racunski centar
tel: +385 1 6165 804
email: nikola.garafo...@srce.hr

--
°O° Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once. http://www.israelekpo.com/
dynamically create unique key
I'm trying to use Solr to store information from a few different sources in one large index. I need to create a unique key for the Solr index that will be unique per document. If I have 3 systems, and they all have a document with id=1, then I need to create a uniqueId field in my schema that contains both the system name and that id, along the lines of: sysa1, sysb1, and sysc1. That way, each document will have a unique id. I added this to my schema.xml:

<copyField source="source" dest="uniqueId"/>
<copyField source="id" dest="uniqueId"/>

However, after trying to insert, I got this:

java.lang.Exception: ERROR: multiple values encountered for non multiValued copy field uniqueId: sysa

So instead of just appending to the uniqueId field, it tried to do a multiValued. Does anyone have an idea on how I can make this work? Thanks!

-- Chris
Re: dynamically create unique key
On Tue, Nov 9, 2010 at 10:39 AM, Christopher Gross cogr...@gmail.com wrote:

I'm trying to use Solr to store information from a few different sources in one large index. I need to create a unique key for the Solr index that will be unique per document. If I have 3 systems, and they all have a document with id=1, then I need to create a uniqueId field in my schema that contains both the system name and that id, along the lines of: sysa1, sysb1, and sysc1. That way, each document will have a unique id. I added this to my schema.xml:

<copyField source="source" dest="uniqueId"/>
<copyField source="id" dest="uniqueId"/>

However, after trying to insert, I got this:

java.lang.Exception: ERROR: multiple values encountered for non multiValued copy field uniqueId: sysa

So instead of just appending to the uniqueId field, it tried to do a multiValued. Does anyone have an idea on how I can make this work? Thanks!

-- Chris

Chris,

Depending on how you insert your documents into SOLR will determine how to create your unique field. If you are POST'ing the data via HTTP, then you would be responsible for building your unique id (i.e., your program/language would use string concatenation to add the unique id to the output before it gets to the update handler in SOLR). If you're using the DataImportHandler, then you can use the TemplateTransformer (http://wiki.apache.org/solr/DataImportHandler#TemplateTransformer) to dynamically build your unique id at document insertion time. For example, we here at bizjournals use SOLR and the DataImportHandler to index our documents. Like you, we run the risk of two or more ids clashing, and thus overwriting a different type of document. As such, we take two or three different fields and combine them together using the TemplateTransformer to generate a more unique id for each document we index. With respect to the multiValued option, that is used more for an array-like structure within a field.
For example, if you have a blog entry with multiple tag keywords, you would probably want a field in SOLR that can contain the various tag keywords for each blog entry; this is where multiValued comes in handy. I hope that this helps to clarify things for you. - Ken Stanley
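As a concrete illustration of the TemplateTransformer suggestion above, a data-config.xml sketch; the entity name, query and the "sysa" prefix are made up for the example, not taken from the thread:

```xml
<!-- DIH sketch: prefix the database id with a per-source string so the
     resulting uniqueId is unique across systems. Names are illustrative. -->
<entity name="doc" transformer="TemplateTransformer"
        query="SELECT id, title FROM documents">
  <field column="uniqueId" template="sysa${doc.id}"/>
</entity>
```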
Re: dynamically create unique key
Thanks Ken. I'm using a script with Java/SolrJ to copy documents from their original locations into the Solr Index. I wasn't sure if the copyField would help me, but from your answers it seems that I'll have to handle it on my own. That's fine -- it is definitely not hard to pass a new field myself. I was just thinking that there should be an easy way to have Solr build the unique field, since it was getting everything anyway. I was just confused as to why I was getting a multiValued error, since I was just trying to append to a field. I wasn't sure if I was missing something. Thanks again!

-- Chris

On Tue, Nov 9, 2010 at 10:47 AM, Ken Stanley doh...@gmail.com wrote:

On Tue, Nov 9, 2010 at 10:39 AM, Christopher Gross cogr...@gmail.com wrote:

I'm trying to use Solr to store information from a few different sources in one large index. I need to create a unique key for the Solr index that will be unique per document. If I have 3 systems, and they all have a document with id=1, then I need to create a uniqueId field in my schema that contains both the system name and that id, along the lines of: sysa1, sysb1, and sysc1. That way, each document will have a unique id. I added this to my schema.xml:

<copyField source="source" dest="uniqueId"/>
<copyField source="id" dest="uniqueId"/>

However, after trying to insert, I got this:

java.lang.Exception: ERROR: multiple values encountered for non multiValued copy field uniqueId: sysa

So instead of just appending to the uniqueId field, it tried to do a multiValued. Does anyone have an idea on how I can make this work? Thanks!

-- Chris

Chris,

Depending on how you insert your documents into SOLR will determine how to create your unique field. If you are POST'ing the data via HTTP, then you would be responsible for building your unique id (i.e., your program/language would use string concatenation to add the unique id to the output before it gets to the update handler in SOLR).
If you're using the DataImportHandler, then you can use the TemplateTransformer (http://wiki.apache.org/solr/DataImportHandler#TemplateTransformer) to dynamically build your unique id at document insertion time. For example, we here at bizjournals use SOLR and the DataImportHandler to index our documents. Like you, we run the risk of two or more ids clashing, and thus overwriting a different type of document. As such, we take two or three different fields and combine them together using the TemplateTransformer to generate a more unique id for each document we index. With respect to the multiValued option, that is used more for an array-like structure within a field. For example, if you have a blog entry with multiple tag keywords, you would probably want a field in SOLR that can contain the various tag keywords for each blog entry; this is where multiValued comes in handy. I hope that this helps to clarify things for you. - Ken Stanley
Re: dynamically create unique key
On Tue, Nov 9, 2010 at 10:53 AM, Christopher Gross cogr...@gmail.com wrote: Thanks Ken. I'm using a script with Java/SolrJ to copy documents from their original locations into the Solr Index. I wasn't sure if the copyField would help me, but from your answers it seems that I'll have to handle it on my own. That's fine -- it is definitely not hard to pass a new field myself. I was just thinking that there should be an easy way to have Solr build the unique field, since it was getting everything anyway. I was just confused as to why I was getting a multiValued error, since I was just trying to append to a field. I wasn't sure if I was missing something. Thanks again! -- Chris Chris, I definitely understand your sentiment. The thing to keep in mind with SOLR is that it really has limited logic mechanisms; in fact, unless you're willing to use the DataImportHandler (dih) and the ScriptTransformer, you really have no logic. The copyField directive in schema.xml is mainly used to help you easily copy the contents of one field into another so that it may be indexed in multiple ways; for example, you can index a string so that it is stored literally (i.e., Hello World), parsed using a whitespace tokenizer (i.e., Hello, World), parsed for an nGram tokenizer (i.e., H, He, Hel... ). This is beneficial to you because you wouldn't have to explicitly define each possible instance in your data stream. You just define the field once, and SOLR is smart enough to copy it where it needs to go. Glad to have helped. :) - Ken
Re: How to Facet on a price range
Hi,

Instead of all the facet queries, you can also make use of range facets (http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range), which is in trunk afaik; it should also be patchable into older versions of Solr, although that should not be necessary. We make use of it (http://www.mysecondhome.co.uk/search.html) to create the nice sliders Geert-Jan describes. We've also used it to add the sparklines above the sliders which give a nice indication of how the current selection is spread out.

Regards, gwk

On 11/9/2010 3:33 PM, Geert-Jan Brits wrote:

Just to add to this: if you want to allow the user more choice in his option to select ranges, perhaps by using a 2-sided JavaScript slider for the price range (a la kayak.com), it may be very worthwhile to discretize the allowed values for the slider (e.g. steps of 5 dollars). Most js-slider implementations allow for this easily. This has the advantages of: - having far fewer possible facet queries and thus a far greater chance of these facet queries hitting the cache. - a better user experience, although that's debatable. Just to be clear: for this the Solr side would still use: facet=on&facet.query=price:[50 TO *]&facet.query=price:[* TO 100] and not the optimized pre-computed variant suggested above. Geert-Jan

2010/11/9 jayant jayan...@hotmail.com

That was very well thought of and a clever solution. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-Facet-on-a-price-range-tp1846392p1869201.html Sent from the Solr - User mailing list archive at Nabble.com.
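For reference, the range-facet parameters from the wiki page above look roughly like this as a request (field name, bounds and gap are placeholders, as is the Solr URL):

```shell
# Sketch: facet-by-range request per the SimpleFacetParameters wiki page,
# against a placeholder Solr instance.
curl -G "http://localhost:8983/solr/select" \
  --data-urlencode "q=*:*" \
  --data-urlencode "rows=0" \
  --data-urlencode "facet=true" \
  --data-urlencode "facet.range=price" \
  --data-urlencode "facet.range.start=0" \
  --data-urlencode "facet.range.end=1000" \
  --data-urlencode "facet.range.gap=50"
```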
spell check vs terms component
Hi, We are trying to implement an auto-suggest feature in our application. I would like to know the difference between the terms and spell check components. Both handlers seem to display almost the same output; can anyone let me know the difference, and also when to go for spell check and when to go for the terms component? Thanks, Barani -- View this message in context: http://lucene.472066.n3.nabble.com/spell-check-vs-terms-component-tp1870214p1870214.html Sent from the Solr - User mailing list archive at Nabble.com.
Is there a way to embed terms handler in search handler?
Hi, I am trying to figure out if there is a way to embed the terms handler as part of the default search handler and access it using a URL something like the below:

http://localhost:8990/solr/db/select?q=*:*&terms.prefix=a&terms.fl=name

Couple of other questions: I would like to know if it's possible to mention * in fl.name to search on all fields, or should we specify the field names only? Will the autosuggest suggest the whole phrase or just the word it matches? Thanks, Barani -- View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-embed-terms-handler-in-search-handler-tp1870505p1870505.html Sent from the Solr - User mailing list archive at Nabble.com.
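One way to get a single URL to serve both regular search and terms requests is to register the TermsComponent on the search handler's component chain. A solrconfig.xml sketch, not a tested setup; handler name and defaults are illustrative:

```xml
<!-- solrconfig.xml sketch: add the terms component to the default
     search handler so terms.* parameters work on /select. -->
<searchComponent name="terms" class="solr.TermsComponent"/>

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="terms">true</bool>
  </lst>
  <arr name="last-components">
    <str>terms</str>
  </arr>
</requestHandler>
```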
Re: solr init.d script
Yes. I recommend running Solr via a servlet container. It is much easier to manage compared to running it by itself.

On Tue, Nov 9, 2010 at 10:03 AM, Nikola Garafolic nikola.garafo...@srce.hr wrote:

I have two nodes running one jboss server each and using one (single) solr instance, thats how I run it for now. Do you recommend running jboss with solr via servlet? Two jboss run in load-balancing for high availability purpose. For now it seems to be ok. On 11/09/2010 03:17 PM, Israel Ekpo wrote: I think it would be a better idea to load solr via a servlet container like Tomcat and then create the init.d script for tomcat instead. http://wiki.apache.org/solr/SolrTomcat#Installing_Tomcat_6 -- Nikola Garafolic SRCE, Sveucilisni racunski centar tel: +385 1 6165 804 email: nikola.garafo...@srce.hr

--
°O° Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once. http://www.israelekpo.com/
Re: spell check vs terms component
On Tue, Nov 9, 2010 at 8:20 AM, bbarani bbar...@gmail.com wrote: Hi, We are trying to implement auto suggest feature in our application. I would like to know the difference between terms vs spell check component. Both the handlers seems to display almost the same output, can anyone let me know the difference and also I would like to know when to go for spell check and when to go for terms component. SpellCheckComponent is designed to operate on whole words and not partial words so I don't know how well it will work for auto-suggest, if at all. As far as differences between SpellCheckComponent and Terms Component is concerned, TermsComponent is a straight prefix match whereas SCC takes edit distance into account. Also, SCC can deal with phrases composed of multiple words and also gives back a collated suggestion. -- Regards, Shalin Shekhar Mangar.
Re: How to Facet on a price range
@ http://www.mysecondhome.co.uk/search.html -- when you drag the sliders, an update of how many results would match is immediately shown. I really like this. How did you do this? Is this out-of-the-box available with the suggested Facet_by_Range patch?

Thanks, Geert-Jan

2010/11/9 gwk g...@eyefi.nl

Hi, Instead of all the facet queries, you can also make use of range facets (http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range), which is in trunk afaik; it should also be patchable into older versions of Solr, although that should not be necessary. We make use of it (http://www.mysecondhome.co.uk/search.html) to create the nice sliders Geert-Jan describes. We've also used it to add the sparklines above the sliders which give a nice indication of how the current selection is spread out. Regards, gwk

On 11/9/2010 3:33 PM, Geert-Jan Brits wrote:

Just to add to this: if you want to allow the user more choice in his option to select ranges, perhaps by using a 2-sided JavaScript slider for the price range (a la kayak.com), it may be very worthwhile to discretize the allowed values for the slider (e.g. steps of 5 dollars). Most js-slider implementations allow for this easily. This has the advantages of: - having far fewer possible facet queries and thus a far greater chance of these facet queries hitting the cache. - a better user experience, although that's debatable. Just to be clear: for this the Solr side would still use: facet=on&facet.query=price:[50 TO *]&facet.query=price:[* TO 100] and not the optimized pre-computed variant suggested above. Geert-Jan

2010/11/9 jayant jayan...@hotmail.com

That was very well thought of and a clever solution. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-Facet-on-a-price-range-tp1846392p1869201.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: dynamically create unique key
: one large index. I need to create a unique key for the Solr index that will
: be unique per document. If I have 3 systems, and they all have a document
: with id=1, then I need to create a uniqueId field in my schema that
: contains both the system name and that id, along the lines of: sysa1,
: sysb1, and sysc1. That way, each document will have a unique id.

take a look at the SignatureUpdateProcessorFactory...

http://wiki.apache.org/solr/Deduplication

: <copyField source="source" dest="uniqueId"/>
: <copyField source="id" dest="uniqueId"/>
...
: So instead of just appending to the uniqueId field, it tried to do a
: multiValued. Does anyone have an idea on how I can make this work?

copyField doesn't append, it copies Field (value) instances from the source field to the dest field -- so you get multiple values for the dest field.

-Hoss
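A solrconfig.xml sketch of the Deduplication wiki approach pointed to above, using the field names from this thread (source and id feeding uniqueId); treat the exact processor options as illustrative rather than a tested configuration:

```xml
<!-- Sketch: signature-based unique key built from source + id.
     Wire the chain into the update handler to activate it. -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">uniqueId</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">source,id</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```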
Solr highlighter question
Hey guys, I have 3 fields: FirstName, LastName, Biography. They are all string fields. In schema, I copy them to the default search field which is text. Is there any way to get Solr to highlight all the fields when someone searches the default search field but when someone searches for FirstName then only highlight that? For example: if someone searches: medical +FirstName:dave then medical should be highlighted in all fields and dave only in FirstName. Thanks in advance, Moazzam
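One parameter worth trying here (hedged, not tested against this schema): hl.requireFieldMatch=true makes the highlighter mark a term only in the field where that term actually matched, which is close to the per-field behaviour described. A sketch, with the Solr URL as a placeholder:

```shell
# Sketch: highlighting request with hl.requireFieldMatch, against a
# placeholder Solr instance; field names follow the thread.
curl -G "http://localhost:8983/solr/select" \
  --data-urlencode "q=medical +FirstName:dave" \
  --data-urlencode "hl=true" \
  --data-urlencode "hl.fl=FirstName,LastName,Biography" \
  --data-urlencode "hl.requireFieldMatch=true"
```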
Re: spell check vs terms component
On Tue, Nov 9, 2010 at 1:02 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Tue, Nov 9, 2010 at 8:20 AM, bbarani bbar...@gmail.com wrote: Hi, We are trying to implement auto suggest feature in our application. I would like to know the difference between terms vs spell check component. Both the handlers seems to display almost the same output, can anyone let me know the difference and also I would like to know when to go for spell check and when to go for terms component. SpellCheckComponent is designed to operate on whole words and not partial words so I don't know how well it will work for auto-suggest, if at all. As far as differences between SpellCheckComponent and Terms Component is concerned, TermsComponent is a straight prefix match whereas SCC takes edit distance into account. Also, SCC can deal with phrases composed of multiple words and also gives back a collated suggestion. -- Regards, Shalin Shekhar Mangar. An alternative to using the SpellCheckComponent and/or the TermsComponent, would be the (Edge)NGrams filter. Basically, this filter breaks words down into auto-suggest-friendly tokens (i.e., Hello = H, He, Hel, Hell, Hello) that works great for auto suggestion querying. Here is an article from Lucid Imagination on using the ngram filter: http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ Here is the SOLR wiki entry for the filter: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory - Ken Stanley
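A schema.xml sketch of the EdgeNGram approach from the linked article; the field type name, tokenizer choice and gram sizes here are illustrative, not the article's exact settings:

```xml
<!-- Sketch: index-side edge n-grams so a prefix query like "hel"
     matches the indexed grams of "hello"; query side stays un-grammed. -->
<fieldType name="autosuggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
  </analyzer>
</fieldType>
```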
Re: dynamically create unique key
Thanks Hoss, I'll look into that!

-- Chris

On Tue, Nov 9, 2010 at 1:43 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: one large index. I need to create a unique key for the Solr index that will
: be unique per document. If I have 3 systems, and they all have a document
: with id=1, then I need to create a uniqueId field in my schema that
: contains both the system name and that id, along the lines of: sysa1,
: sysb1, and sysc1. That way, each document will have a unique id.

take a look at the SignatureUpdateProcessorFactory...

http://wiki.apache.org/solr/Deduplication

: <copyField source="source" dest="uniqueId"/>
: <copyField source="id" dest="uniqueId"/>
...
: So instead of just appending to the uniqueId field, it tried to do a
: multiValued. Does anyone have an idea on how I can make this work?

copyField doesn't append, it copies Field (value) instances from the source field to the dest field -- so you get multiple values for the dest field.

-Hoss
Re: solr init.d script
On 11/09/2010 07:00 PM, Israel Ekpo wrote:

Yes. I recommend running Solr via a servlet container. It is much easier to manage compared to running it by itself. On Tue, Nov 9, 2010 at 10:03 AM, Nikola Garafolic nikola.garafo...@srce.hr wrote:

But in my case, that would make things more complex as I see it. Two jboss servers with solr as servlet container, and then I need the same data dir, right? I am now running a single solr instance as a cluster service, with the data dir set to a shared lun, that can be started on either of the two hosts. Can you explain the benefits of two solr instances via servlet, maybe more performance?

Regards,
Nikola

--
Nikola Garafolic
SRCE, Sveucilisni racunski centar
tel: +385 1 6165 804
email: nikola.garafo...@srce.hr
RE: returning message to sender
Hi guys, I have been exploring Solr for the last few weeks. Our main intention is to expose the data, as WS, across various data sources by linking them using some scenario. I have a couple of questions. Is there any good document/URL which answers:

- How does the indexing happen/get built for the queries across different data sources (DIH)?
- Does Lucene store the actual data of each individual query or a combination? If yes, where?
- Whenever we do a query against the built index, when exactly does it fire the query to the database?
- How does the index get the updates from the DIH? For example, if my query includes 3 DIHs, what is the max number of data sources I can include to get better performance?
- How do we measure the scalability? Can I run these search engines in a grid mode?

Thanks.

-- View this message in context: http://lucene.472066.n3.nabble.com/Storage-tp1871155p1871155.html Sent from the Solr - User mailing list archive at Nabble.com.
Thanks.img class='smiley' src='http://n3.nabble.com/images/smiley/anim_confused.gif' / brhr align=left width=300 View this message in context: a href=http://lucene.472066.n3.nabble.com/Storage-tp1871155p1871155.html; Storage/abr Sent from the a href=http://lucene.472066.n3.nabble.com/Solr-User-f472068.html;Solr - User mailing list archive/a at Nabble.com.br --=_Part_27114_30663314.1289327581322-- Standard Poor's: Empowering Investors and Markets for 150 Years The information contained in this message is intended only for the recipient, and may be a confidential attorney-client communication or may otherwise be privileged and confidential and protected from disclosure. If the reader of this message is not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient, please be aware that any dissemination or copying of this communication is strictly prohibited. If you have received this communication in error, please immediately notify us by replying to the message and deleting it from your computer. The McGraw-Hill Companies, Inc. reserves the right, subject to applicable local law, to monitor and review the content of any electronic message or information sent to or from McGraw-Hill employee e-mail addresses without informing the sender or recipient of the message.
Using Multiple Cores for Multiple Users
All, I have a web application that requires the user to register and then login to gain access to the site. Pretty standard stuff...Now I would like to know what the best approach would be to implement a customized search experience for each user. Would this mean creating a separate core per user? I think that this is not possible without restarting Solr after each core is added to the multi-core xml file, right? My use case is this...User A would like to index 5 RSS feeds and User B would like to index 5 completely different RSS feeds and he is not interested at all in what User A is interested in. This means that they would have to be separate index cores, right? What is the best approach for this kind of thing? Thanks in advance, Adam
Re: returning message to sender
Hmmm, this is a little murky. I'm inferring that you believe that DIH somehow queries the data source at *query* time, and this is not true. DIH is an *index-time* concept: DIH is used to add data to an index. Once that index is created, all searches against it are unaware that there were different data sources. So, with a single Solr schema, you can use DIH on as many different data sources as you want, mapping the various bits of information from each data source into your Solr schema. Searches go against fields defined in the schema, so you're automatically searching against all the databases (assuming you've mapped your data into your schema). If I've misunderstood, perhaps you can add some details? Best, Erick On Tue, Nov 9, 2010 at 1:39 PM, Teki, Prasad prasad_t...@standardandpoors.com wrote: Hi guys, I have been exploring Solr since last few weeks. Our main intension is to expose the data, as WS, across various data sources by linking them using some scenario. I have couple of questions. Is there any good document/URL, which answers... How the indexing happens/built for the queries across different data sources (DIH)? Does the Lucene store the actual data of each individual query or a combination?, where, if yes? Whenever we do a query against built index, when exactly it fires the query to database? How does the index get the updates from the DIH, For example, if my query includes 3 DIH and What is the max number of data sources, I can include to get better performace? How do we measure the scalablity? Can I run these search engines in a grid mode? Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Storage-tp1871155p1871155.html Sent from the Solr - User mailing list archive at Nabble.com.
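Erick's index-time point can be illustrated with a sketch of a DIH data-config.xml. This is a hypothetical configuration (the data source names, JDBC URLs, queries, and column mappings are all illustrative, not from the thread): two databases feed the same schema fields, so later searches see one uniform index with no trace of which source a document came from.

```xml
<!-- Hypothetical data-config.xml: two JDBC sources mapped into one schema.
     All names (sources, tables, fields) are illustrative. -->
<dataConfig>
  <dataSource name="crm" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://crm-host/crm" user="solr" password="secret"/>
  <dataSource name="wiki" driver="org.postgresql.Driver"
              url="jdbc:postgresql://wiki-host/wiki" user="solr" password="secret"/>
  <document>
    <!-- Each entity reads from a different source but fills the same schema fields -->
    <entity name="customer" dataSource="crm"
            query="SELECT id, name AS title, notes AS body FROM customers">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="body" name="body"/>
    </entity>
    <entity name="page" dataSource="wiki"
            query="SELECT page_id AS id, heading AS title, content AS body FROM pages">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="body" name="body"/>
    </entity>
  </document>
</dataConfig>
```

After a full-import, a search on the title or body fields matches documents from both databases; the database is only contacted at import time, never at query time.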
Re: Using Multiple Cores for Multiple Users
Hi, All, I have a web application that requires the user to register and then login to gain access to the site. Pretty standard stuff...Now I would like to know what the best approach would be to implement a customized search experience for each user. Would this mean creating a separate core per user? I think that this is not possible without restarting Solr after each core is added to the multi-core xml file, right? No, you can dynamically manage cores and parts of their configuration. Sometimes you must reindex after a change, and the same is true for reloading cores. Check the wiki on this one [1]. My use case is this...User A would like to index 5 RSS feeds and User B would like to index 5 completely different RSS feeds and he is not interested at all in what User A is interested in. This means that they would have to be separate index cores, right? If you view the documents within an RSS feed as separate documents, you can assign a user ID to those documents, creating a multi-user index with RSS documents per user, or group, or whatever. Having a core per user isn't a good idea if you have many users. It takes up additional memory and disk space, doesn't share caches, etc. There is also more maintenance, and you need some support scripts to dynamically create new cores - Solr currently doesn't create a new core directory structure. But, reindexing a very large index takes up a lot more time and resources, and relevancy might be an issue depending on the RSS feeds' contents. What is the best approach for this kind of thing? I'd usually store the feeds in a single index and shard if it's too many for a single server with your specifications. Unless the demands are too specific. Thanks in advance, Adam [1]: http://wiki.apache.org/solr/CoreAdmin Cheers
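The single-index approach Markus describes, tagging each feed item with its owner, might look like this Solr XML update message. The field names and values here are hypothetical, not from the thread; the point is that every document carries the ID of the user who subscribed to the feed.

```xml
<!-- Hypothetical update message: each RSS item carries the owning user's ID -->
<add>
  <doc>
    <field name="id">userA-feed3-item17</field>
    <field name="user_id">userA</field>
    <field name="feed_url">http://example.com/feed.rss</field>
    <field name="title">Example item title</field>
  </doc>
</add>
```

At query time the application then filters on the user_id field so each user only ever sees items from their own feeds, with a single shared index and shared caches.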
Re: Using Multiple Cores for Multiple Users
I'm willing to bet a lot that the standard approach is to use a server-side language to customize the queries for the user . . . on the same core/set of cores. The only reasons that my limited experience suggests for a 'core per user' are privacy/performance. Unless you have a very small set of users, I would think managing cores for LOTS of users would be a pain. Create one (takes time), replicate to it (takes MORE time), use it, destroy it after the session expires (requires a garbage-collection program running pretty often) - LOTS more time/CPU resource taken up. I am happy to be corrected on any of this. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Markus Jelsma markus.jel...@openindex.io To: solr-user@lucene.apache.org Cc: Adam Estrada estrada.adam.gro...@gmail.com Sent: Tue, November 9, 2010 3:57:34 PM Subject: Re: Using Multiple Cores for Multiple Users Hi, All, I have a web application that requires the user to register and then login to gain access to the site. Pretty standard stuff...Now I would like to know what the best approach would be to implement a customized search experience for each user. Would this mean creating a separate core per user? I think that this is not possible without restarting Solr after each core is added to the multi-core xml file, right? No, you can dynamically manage cores and parts of their configuration. Sometimes you must reindex after a change, the same is true for reloading cores. Check the wiki on this one [1]. My use case is this...User A would like to index 5 RSS feeds and User B would like to index 5 completely different RSS feeds and he is not interested at all in what User A is interested in. 
This means that they would have to be separate index cores, right? If you view documents within an rss feed as a separate documents, you can assign an user ID to those documents, creating a multi user index with rss documents per user, or group or whatever. Having a core per user isn't a good idea if you have many users. It takes up additional memory and disk space, doesn't share caches etc. There is also more maintenance and your need some support scripts to dynamically create new cores - Solr currently doesn't create a new core directory structure. But, reindexing a very large index takes up a lot more time and resources and relevancy might be an issue depending on the rss feeds' contents. What is the best approach for this kind of thing? I'd usually store the feeds in a single index and shard if it's too many for a single server with your specifications. Unless the demands are too specific. Thanks in advance, Adam [1]: http://wiki.apache.org/solr/CoreAdmin Cheers
Re: dynamically create unique key
Seems to me it would be a good idea to put into the Solr code a unique ID per instance or installation or both, accessible either with Java or a query. Kind of like all the browsers do for their SSL connections. Then it's automatically easy to implement what is described below. Maybe it should be written to the config file upon first run when it does not exist, and then any updates or reinstalls would reuse the same installation/instance ID. From: Christopher Gross cogr...@gmail.com To: solr-user@lucene.apache.org Sent: Tue, November 9, 2010 11:37:03 AM Subject: Re: dynamically create unique key Thanks Hoss, I'll look into that! -- Chris On Tue, Nov 9, 2010 at 1:43 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : one large index. I need to create a unique key for the Solr index that will : be unique per document. If I have 3 systems, and they all have a document : with id=1, then I need to create a uniqueId field in my schema that : contains both the system name and that id, along the lines of: sysa1, : sysb1, and sysc1. That way, each document will have a unique id. take a look at the SignatureUpdateProcessorFactory... http://wiki.apache.org/solr/Deduplication : copyField source=source dest=uniqueId/ : copyField source=id dest=uniqueId/ ... : So instead of just appending to the uniqueId field, it tried to do a : multiValued. Does anyone have an idea on how I can make this work? copyField doesn't append; it copies Field (value) instances from the source field to the dest field -- so you get multiple values for the dest field. -Hoss
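The sysa1/sysb1/sysc1 scheme from the quoted question can also be built on the client side before documents are posted, which sidesteps the copyField pitfall Hoss describes (copyField copies values, it doesn't concatenate them). A minimal sketch; the function name is made up for illustration:

```python
def composite_key(system: str, local_id: str) -> str:
    """Prefix a document's local id with its source-system name so ids
    that collide across systems (e.g. id=1 in every system) stay unique."""
    return f"{system}{local_id}"

# Three systems that each have a document with id=1:
keys = [composite_key(s, "1") for s in ("sysa", "sysb", "sysc")]
# keys == ["sysa1", "sysb1", "sysc1"] -- all distinct
```

The composite value goes into the schema's uniqueKey field; the original system name and local id can still be stored in their own fields for faceting or display.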
RE: Using Multiple Cores for Multiple Users
If storing in a single index (possibly sharded if you need it), you can simply include a Solr field that specifies the user ID of the saved thing. On the client side, in your application, simply ensure that there is an fq parameter limiting to the current user, if you want to limit to the current user's stuff. Relevancy ranking should work just as if you had 'separate cores'; there is no relevancy issue. It IS true that when your index gets very large, commits will start taking longer, which can be a problem. I don't mean commits will take longer just because there is more stuff to commit -- the larger the index, the longer an update to a single document will take to commit. In general, I suspect that having dozens or hundreds (or thousands!) of cores is not going to scale well; it is not going to make good use of your cpu/ram/hd resources. Not really the intended use case of multiple cores. However, you are probably going to run into some issues with the single index approach too. In general, how to deal with multi-tenancy in Solr is an oft-asked question for which there doesn't seem to be any 'just works and does everything for you without needing to think about it' solution in Solr, judging from past threads. I am not a Solr developer or expert. From: Markus Jelsma [markus.jel...@openindex.io] Sent: Tuesday, November 09, 2010 6:57 PM To: solr-user@lucene.apache.org Cc: Adam Estrada Subject: Re: Using Multiple Cores for Multiple Users Hi, All, I have a web application that requires the user to register and then login to gain access to the site. Pretty standard stuff...Now I would like to know what the best approach would be to implement a customized search experience for each user. Would this mean creating a separate core per user? I think that this is not possible without restarting Solr after each core is added to the multi-core xml file, right? No, you can dynamically manage cores and parts of their configuration. 
Sometimes you must reindex after a change, the same is true for reloading cores. Check the wiki on this one [1]. My use case is this...User A would like to index 5 RSS feeds and User B would like to index 5 completely different RSS feeds and he is not interested at all in what User A is interested in. This means that they would have to be separate index cores, right? If you view documents within an rss feed as a separate documents, you can assign an user ID to those documents, creating a multi user index with rss documents per user, or group or whatever. Having a core per user isn't a good idea if you have many users. It takes up additional memory and disk space, doesn't share caches etc. There is also more maintenance and your need some support scripts to dynamically create new cores - Solr currently doesn't create a new core directory structure. But, reindexing a very large index takes up a lot more time and resources and relevancy might be an issue depending on the rss feeds' contents. What is the best approach for this kind of thing? I'd usually store the feeds in a single index and shard if it's too many for a single server with your specifications. Unless the demands are too specific. Thanks in advance, Adam [1]: http://wiki.apache.org/solr/CoreAdmin Cheers
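The fq approach Jonathan describes can be sketched as a small query-building helper. This is a hypothetical client-side example (the field name user_id and the localhost URL are assumptions, not from the thread):

```python
from urllib.parse import urlencode

def user_search_url(q: str, user_id: str) -> str:
    """Build a Solr select URL where fq restricts results to one user's
    documents. fq filters the result set without affecting the relevance
    scoring of the main query q."""
    params = {"q": q, "fq": f"user_id:{user_id}", "wt": "json"}
    return "http://localhost:8983/solr/select?" + urlencode(params)

url = user_search_url("monkeys", "userA")
# the fq=user_id%3AuserA parameter limits hits to userA's feed items
```

Because the filter lives in the application layer, one shared index (and its caches) serves every user, which is the main scaling advantage over per-user cores.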
Re: Using Multiple Cores for Multiple Users
Thanks a lot for all the tips, guys! I think that we may explore both options just to see what happens. I'm sure that scalability will be a huge mess with the core-per-user scenario. I like the idea of creating a user ID field and agree that it's probably the best approach. We'll see...I will be sure to let the list know what I find! Please don't stop posting your comments everyone ;-) My inquiring mind wants to know... Adam On Tue, Nov 9, 2010 at 7:34 PM, Jonathan Rochkind rochk...@jhu.edu wrote: If storing in a single index (possibly sharded if you need it), you can simply include a solr field that specifies the user ID of the saved thing. On the client side, in your application, simply ensure that there is an fq parameter limiting to the current user, if you want to limit to the current user's stuff. Relevancy ranking should work just as if you had 'seperate cores', there is no relevancy issue. It IS true that when your index gets very large, commits will start taking longer, which can be a problem. I don't mean commits will take longer just because there is more stuff to commit -- the larger the index, the longer an update to a single document will take to commit. In general, i suspect that having dozens or hundreds (or thousands!) of cores is not going to scale well, it is not going to make good use of your cpu/ram/hd resources. Not really the intended use case of multiple cores. However, you are probably going to run into some issues with the single index approach too. In general, how to deal with multi-tenancy in Solr is an oft-asked question that there doesn't seem to be any just works and does everything for you without needing to think about it solution for in solr. Judging from past thread. I am not a Solr developer or expert. 
From: Markus Jelsma [markus.jel...@openindex.io] Sent: Tuesday, November 09, 2010 6:57 PM To: solr-user@lucene.apache.org Cc: Adam Estrada Subject: Re: Using Multiple Cores for Multiple Users Hi, All, I have a web application that requires the user to register and then login to gain access to the site. Pretty standard stuff...Now I would like to know what the best approach would be to implement a customized search experience for each user. Would this mean creating a separate core per user? I think that this is not possible without restarting Solr after each core is added to the multi-core xml file, right? No, you can dynamically manage cores and parts of their configuration. Sometimes you must reindex after a change, the same is true for reloading cores. Check the wiki on this one [1]. My use case is this...User A would like to index 5 RSS feeds and User B would like to index 5 completely different RSS feeds and he is not interested at all in what User A is interested in. This means that they would have to be separate index cores, right? If you view documents within an rss feed as a separate documents, you can assign an user ID to those documents, creating a multi user index with rss documents per user, or group or whatever. Having a core per user isn't a good idea if you have many users. It takes up additional memory and disk space, doesn't share caches etc. There is also more maintenance and your need some support scripts to dynamically create new cores - Solr currently doesn't create a new core directory structure. But, reindexing a very large index takes up a lot more time and resources and relevancy might be an issue depending on the rss feeds' contents. What is the best approach for this kind of thing? I'd usually store the feeds in a single index and shard if it's too many for a single server with your specifications. Unless the demands are too specific. Thanks in advance, Adam [1]: http://wiki.apache.org/solr/CoreAdmin Cheers
Highlighter - multiple instances of term being combined
I'm finding that if a keyword appears in a field multiple times very close together, it will get highlighted as a phrase even though there are other terms between the two instances. So this search: http://localhost:8983/solr/select/?hl=true&hl.snippets=1&q=residue&hl.fragsize=0&mergeContiguous=false&indent=on&hl.usePhraseHighlighter=false&debugQuery=on&hl.fragmenter=gap&hl.highlightMultiTerm=false Highlights as: What does low-<em>residue mean? Like low-residue</em> diet? Trying to get it to highlight as: What does low-<em>residue</em> mean? Like low-<em>residue</em> diet? I've tried playing with various combinations of mergeContiguous, highlightMultiTerm, and usePhraseHighlighter, but they all yield the same output. For reference, the field type uses a StandardTokenizerFactory and SynonymFilterFactory, StopFilterFactory, StandardFilterFactory and SnowballFilterFactory. I've confirmed that the intermediate words don't appear in either the synonym or the stop words list. I can post the full definition if helpful. Any pointers as to how to debug this would be greatly appreciated! sasank
Re: solr init.d script
As many solrs as you want can open an index for read-only queries. If you have a shared disk with a global file system, this could work very well. A note: Solr sessions are stateless. There is no reason to run JBoss Solr in fail-over mode with session replication. On Tue, Nov 9, 2010 at 12:25 PM, Nikola Garafolic nikola.garafo...@srce.hr wrote: On 11/09/2010 07:00 PM, Israel Ekpo wrote: Yes. I recommend running Solr via a servlet container. It is much easier to manage compared to running it by itself. On Tue, Nov 9, 2010 at 10:03 AM, Nikola Garafolic nikola.garafo...@srce.hrwrote: But in my case, that would make things more complex as I see it. Two jboss servers with solr as servlet container, and then I need the same data dir, right? I am now running single solr instance as cluster service, with data dir set to shared lun, that can be started on any of two hosts. Can you explain my benefits with two solr instances via servlet, maybe more performance? Regards, Nikola -- Nikola Garafolic SRCE, Sveucilisni racunski centar tel: +385 1 6165 804 email: nikola.garafo...@srce.hr -- Lance Norskog goks...@gmail.com
Re: returning message to sender
David Smiley and Eric Pugh wrote a wonderful book on Solr: http://www.lucidimagination.com/blog/2010/01/11/book-review-solr-packt-book/ Reading through this book and trying the examples will address all of your questions. On Tue, Nov 9, 2010 at 3:23 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, this is a little murky I'm inferring that you believe that DIH somehow queries the data source at #query# time, and this is not true. DIH is an #index time# concept. DIH is used to add data to an index. Once that index is created, all searches against are unaware that there were different data sources. So, with a single Solr schema, you can use DIH on as many different data sources as you want, mapping the various bits of information from each data source into your Solr schema. Searches go against fields defined in the schema, so you're automatically searching against all the databases (assuming you've mapped your data into your schema) If I've misunderstood, perhaps you can add some details? Best Erick On Tue, Nov 9, 2010 at 1:39 PM, Teki, Prasad prasad_t...@standardandpoors.com wrote: Hi guys, I have been exploring Solr since last few weeks. Our main intension is to expose the data, as WS, across various data sources by linking them using some scenario. I have couple of questions. Is there any good document/URL, which answers... How the indexing happens/built for the queries across different data sources (DIH)? Does the Lucene store the actual data of each individual query or a combination?, where, if yes? Whenever we do a query against built index, when exactly it fires the query to database? How does the index get the updates from the DIH, For example, if my query includes 3 DIH and What is the max number of data sources, I can include to get better performace? How do we measure the scalablity? Can I run these search engines in a grid mode? Thanks. 
-- View this message in context: http://lucene.472066.n3.nabble.com/Storage-tp1871155p1871155.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
Re: dynamically create unique key
Here is an exhausting and exhaustive discussion about picking a unique key: http://wiki.apache.org/solr/UniqueKey On Tue, Nov 9, 2010 at 4:20 PM, Dennis Gearon gear...@sbcglobal.net wrote: Seems to me, it would be a good idea to put into the Solr Code, a unique ID per instance or installation or both, accessible either with JAVA or a query. Kind of like all the browsers do for their SSL connections. Then, it's automatically easy to implement what is described below. Maybe it should be written to the config file upon first run when it does not exist, and then any updates or reinstalls would reuse the same installation/instance ID. From: Christopher Gross cogr...@gmail.com To: solr-user@lucene.apache.org Sent: Tue, November 9, 2010 11:37:03 AM Subject: Re: dynamically create unique key Thanks Hoss, I'll look into that! -- Chris On Tue, Nov 9, 2010 at 1:43 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : one large index. I need to create a unique key for the Solr index that will : be unique per document. If I have 3 systems, and they all have a document : with id=1, then I need to create a uniqueId field in my schema that : contains both the system name and that id, along the lines of: sysa1, : sysb1, and sysc1. That way, each document will have a unique id. take a look at the SignatureUpdateProcessorFactory... http://wiki.apache.org/solr/Deduplication : copyField source=source dest=uniqueId/ : copyField source=id dest=uniqueId/ ... : So instead of just appending to the uniqueId field, it tried to do a : multiValued. Does anyone have an idea on how I can make this work? copyField doesn't append; it copies Field (value) instances from the source field to the dest field -- so you get multiple values for the dest field. -Hoss -- Lance Norskog goks...@gmail.com
Re: Using Multiple Cores for Multiple Users
There is a standard problem with this: relevance is determined from all of the words in a field across all documents, not just the documents that match the query. That is, when user A searches for 'monkeys' and one of his feeds has a document with this word, but someone else is a zoophile, 'monkeys' will be a common word in the index. This will skew the relevance computation for user A. You could have a separate text field for each user. This might work better, but you can't use field norms (they take up space for all documents). Lance On Tue, Nov 9, 2010 at 6:00 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Thanks a lot for all the tips, guys! I think that we may explore both options just to see what happens. I'm sure that scalability will be a huge mess with the core-per-user scenario. I like the idea of creating a user ID field and agree that it's probably the best approach. We'll see...I will be sure to let the list know what I find! Please don't stop posting your comments everyone ;-) My inquiring mind wants to know... Adam On Tue, Nov 9, 2010 at 7:34 PM, Jonathan Rochkind rochk...@jhu.edu wrote: If storing in a single index (possibly sharded if you need it), you can simply include a solr field that specifies the user ID of the saved thing. On the client side, in your application, simply ensure that there is an fq parameter limiting to the current user, if you want to limit to the current user's stuff. Relevancy ranking should work just as if you had 'seperate cores', there is no relevancy issue. It IS true that when your index gets very large, commits will start taking longer, which can be a problem. I don't mean commits will take longer just because there is more stuff to commit -- the larger the index, the longer an update to a single document will take to commit. In general, i suspect that having dozens or hundreds (or thousands!) of cores is not going to scale well, it is not going to make good use of your cpu/ram/hd resources. 
Not really the intended use case of multiple cores. However, you are probably going to run into some issues with the single index approach too. In general, how to deal with multi-tenancy in Solr is an oft-asked question that there doesn't seem to be any just works and does everything for you without needing to think about it solution for in solr. Judging from past thread. I am not a Solr developer or expert. From: Markus Jelsma [markus.jel...@openindex.io] Sent: Tuesday, November 09, 2010 6:57 PM To: solr-user@lucene.apache.org Cc: Adam Estrada Subject: Re: Using Multiple Cores for Multiple Users Hi, All, I have a web application that requires the user to register and then login to gain access to the site. Pretty standard stuff...Now I would like to know what the best approach would be to implement a customized search experience for each user. Would this mean creating a separate core per user? I think that this is not possible without restarting Solr after each core is added to the multi-core xml file, right? No, you can dynamically manage cores and parts of their configuration. Sometimes you must reindex after a change, the same is true for reloading cores. Check the wiki on this one [1]. My use case is this...User A would like to index 5 RSS feeds and User B would like to index 5 completely different RSS feeds and he is not interested at all in what User A is interested in. This means that they would have to be separate index cores, right? If you view documents within an rss feed as a separate documents, you can assign an user ID to those documents, creating a multi user index with rss documents per user, or group or whatever. Having a core per user isn't a good idea if you have many users. It takes up additional memory and disk space, doesn't share caches etc. There is also more maintenance and your need some support scripts to dynamically create new cores - Solr currently doesn't create a new core directory structure. 
But reindexing a very large index takes a lot more time and resources, and relevancy might be an issue depending on the RSS feeds' contents.

What is the best approach for this kind of thing?

I'd usually store the feeds in a single index and shard if it's too many for a single server with your specifications. Unless the demands are too specific.

Thanks in advance, Adam

[1]: http://wiki.apache.org/solr/CoreAdmin

Cheers -- Lance Norskog goks...@gmail.com
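Jonathan's fq suggestion can be sketched as a small query builder. This is a minimal example under assumptions: a core at http://localhost:8983/solr and a field named user_id are hypothetical; your schema and URL will differ.

```python
# Sketch of per-user filtering in a shared index. The base URL and the
# "user_id" field name are assumptions -- adjust to your own deployment.
from urllib.parse import urlencode

def build_user_query(base_url, query, user_id):
    """Build a Solr select URL restricted to one user's documents via fq."""
    params = {
        "q": query,                       # the user's search terms (scored)
        "fq": "user_id:%s" % user_id,     # filter query: cached, restricts the
                                          # result set without changing scoring
        "wt": "json",
    }
    return "%s/select?%s" % (base_url, urlencode(params))

url = build_user_query("http://localhost:8983/solr", "monkeys", "adam")
print(url)
```

Because fq only filters and never scores, the relevance ordering within the filtered set is the same as it would be without the filter; Lance's caveat about index-wide term statistics still applies, though.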
Re: Highlighter - multiple instances of term being combined
Have you looked at solr/admin/analysis.jsp? This is the 'Analysis' link off the main Solr admin page. It will show you how text is broken up for both the indexing and query processes. You might get some insight about how these words are torn apart and assigned positions. Trying the different analyzers and options might get you there. But to be frank, highlighting is a tough problem and has always had a lot of edge cases.

On Tue, Nov 9, 2010 at 6:08 PM, Sasank Mudunuri sas...@gmail.com wrote:

I'm finding that if a keyword appears in a field multiple times very close together, it will get highlighted as a phrase even though there are other terms between the two instances. So this search:

http://localhost:8983/solr/select/?hl=true&hl.snippets=1&q=residue&hl.fragsize=0&mergeContiguous=false&indent=on&hl.usePhraseHighlighter=false&debugQuery=on&hl.fragmenter=gap&hl.highlightMultiTerm=false

Highlights as:

What does low-<em>residue mean? Like low-residue</em> diet?

Trying to get it to highlight as:

What does low-<em>residue</em> mean? Like low-<em>residue</em> diet?

I've tried playing with various combinations of mergeContiguous, highlightMultiTerm, and usePhraseHighlighter, but they all yield the same output. For reference, the field type uses a StandardTokenizerFactory and SynonymFilterFactory, StopFilterFactory, StandardFilterFactory, and SnowballFilterFactory. I've confirmed that the intermediate words don't appear in either the synonym or the stop-words list. I can post the full definition if helpful. Any pointers as to how to debug this would be greatly appreciated! sasank

-- Lance Norskog goks...@gmail.com
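For reference, the request above can be reconstructed with the parameters spelled out. This is only a sketch: it assumes the standard hl.-prefixed parameter names (note that the merge option is hl.mergeContiguous, which the URL above omits the prefix from), and tries hl.mergeContiguous=false explicitly as one combination.

```python
# Reconstruct the highlighter request with explicit parameters.
# The host/port and the hl.-prefixed names are assumptions based on
# standard Solr highlighting parameters -- verify against your version.
from urllib.parse import urlencode

params = {
    "q": "residue",
    "hl": "true",
    "hl.snippets": "1",
    "hl.fragsize": "0",              # 0 = highlight the whole field value
    "hl.fragmenter": "gap",
    "hl.mergeContiguous": "false",   # note the hl. prefix
    "hl.usePhraseHighlighter": "false",
    "hl.highlightMultiTerm": "false",
    "debugQuery": "on",
    "indent": "on",
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```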
scheduling imports and heartbeats
Hi, Can I configure solr to schedule imports at a specified time (say once a day, once an hour, etc)? Also, does solr have some sort of heartbeat mechanism? Thanks, Tri
Re: Using Multiple Cores for Multiple Users
hm, relevance is before filtering, probably during indexing?

Dennis Gearon

Signature Warning: It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself. (from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036') EARTH has a Right To Life, otherwise we all die.

- Original Message From: Lance Norskog goks...@gmail.com To: solr-user@lucene.apache.org Sent: Tue, November 9, 2010 7:07:45 PM Subject: Re: Using Multiple Cores for Multiple Users

There is a standard problem with this: relevance is determined from all of the words in a field across all documents, not just the documents that match the query. [...]
Re: Using Multiple Cores for Multiple Users
Relevance is TF*IDF. TF is the number of times the term appears in the document; DF is the number of documents in the whole index that contain the term. There is no quick calculation of document frequency restricted to only these documents. Facets do this, and they're very, very slow.

On Tue, Nov 9, 2010 at 7:50 PM, Dennis Gearon gear...@sbcglobal.net wrote:

hm, relevance is before filtering, probably during indexing? [...]
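Lance's point -- that document frequency is computed over the whole index, not over one user's documents -- can be illustrated with a toy IDF calculation. This is not Lucene's exact scoring formula, just a sketch of the effect:

```python
# Toy IDF illustration (not Lucene's exact formula): a term's IDF depends on
# how many documents in the WHOLE index contain it, so another user's feeds
# can dilute the weight of a term in user A's results.
import math

def idf(term, docs):
    df = sum(1 for d in docs if term in d)   # document frequency over the whole index
    return math.log(len(docs) / df) if df else 0.0

user_a = [{"monkeys", "bananas"}, {"parrots"}]        # user A's two feed documents
user_b = [{"monkeys"}, {"monkeys"}, {"monkeys"}]      # another user's monkey-heavy feeds

idf_alone  = idf("monkeys", user_a)            # index containing only user A's docs
idf_shared = idf("monkeys", user_a + user_b)   # shared index: 'monkeys' is now common
print(idf_alone, idf_shared)                   # the shared-index IDF is lower
```

The drop from idf_alone to idf_shared is exactly the skew Lance describes: user A's 'monkeys' documents score lower merely because someone else indexed many documents containing that word.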
Re: scheduling imports and heartbeats
You should use cron for that.

On 10 Nov 2010 08:47, Tri Nguyen tringuye...@yahoo.com wrote:

Hi, Can I configure Solr to schedule imports at a specified time (say once a day, once an hour, etc.)? Also, does Solr have some sort of heartbeat mechanism? Thanks, Tri
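Since Solr's DataImportHandler is driven over HTTP, cron entries that curl its command endpoint work well. A sketch, assuming a handler registered at /solr/dataimport in solrconfig.xml and the standard ping handler at /solr/admin/ping -- adjust host, port, and paths for your setup:

```shell
# Crontab sketch (edit with `crontab -e`). Paths/host are assumptions.

# Delta import every hour, on the hour:
0 * * * *  curl -s "http://localhost:8983/solr/dataimport?command=delta-import" > /dev/null

# Full import once a day at 02:00:
0 2 * * *  curl -s "http://localhost:8983/solr/dataimport?command=full-import" > /dev/null

# Heartbeat: hit the ping handler every 5 minutes, log failures:
*/5 * * * * curl -sf "http://localhost:8983/solr/admin/ping" > /dev/null || echo "solr down $(date)" >> /var/log/solr-ping.log
```

For the heartbeat, Solr's PingRequestHandler returns an OK status while the core is healthy; curl's -f flag makes a non-2xx response exit nonzero so the failure branch fires.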
Re: Using Multiple Cores for Multiple Users
So, if my other filter/selection criteria select some subset of the whole index that goes, say, from 50% relevance to 60% relevance, the subset still gets ordered by relevance, and each item in the returned set is still ranked by its relevance relative to the set, right? That would only be a problem if some minimum relevance were desired, right?

Dennis Gearon

- Original Message From: Lance Norskog goks...@gmail.com To: solr-user@lucene.apache.org Sent: Tue, November 9, 2010 8:00:09 PM Subject: Re: Using Multiple Cores for Multiple Users [...]