Solr results in null response
Hi Team,

I have an application running on Solr 4.7.0, and I am frequently seeing null responses for requests to the application. On the Solr console I see the error below, related to grouping parameters, although I am setting all grouping parameters in code. Could you please suggest why it throws this error, the scenario in which it is thrown, and how I can rectify it? Thanks in advance. The full error details are below:

org.apache.solr.common.SolrException: Specify at least one field, function or query to group by.
	at org.apache.solr.search.Grouping.execute(Grouping.java:298)
	at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:433)
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:214)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:780)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:503)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
	at com.googlecode.psiprobe.Tomcat70AgentValve.invoke(Tomcat70AgentValve.java:38)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
	at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
	at org.apache.catalina.ha.session.JvmRouteBinderValve.invoke(JvmRouteBinderValve.java:218)
	at org.apache.catalina.ha.tcp.ReplicationValve.invoke(ReplicationValve.java:333)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:421)
	at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1070)
	at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:611)
	at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.doRun(AprEndpoint.java:2462)
	at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:2451)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
	at java.lang.Thread.run(Thread.java:745)

best,
Amit
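For context on the exception itself: Grouping.execute() throws "Specify at least one field, function or query to group by" when a request arrives with group=true but none of group.field, group.func, or group.query set, so it is likely that some requests are reaching Solr with those parameters missing or blank. A minimal valid grouped request carries parameters like the following (the field and query values are only examples):

```
q=*:*
group=true
group.field=category          # group on a field value, and/or:
group.query=price:[0 TO 100]  # group on an arbitrary query
```

Logging the final request parameters on the client side for the failing requests should show which of the three is absent.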
SegmentInfo from (SolrIndexSearcher) LeafReader
Hey guys,

I am writing a SearchComponent for Solr 5.4.0 that does some caching at the level of segments, and I want to be able to get SegmentInfo from a LeafReader, but I am unable to figure that out. The LeafReader I get is not an instance of SegmentReader, which is what exposes the segment information. Is there still a way to get the SegmentInfo that I might be missing, given that I am in SearchComponent.prepare/process?

Many thanks,
Amit
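One possible approach (a sketch, not verified against 5.4.0): iterate the leaf contexts of the searcher's top reader (e.g. `rb.req.getSearcher().getIndexReader()` inside `process`), unwrap each leaf with `FilterLeafReader.unwrap`, and check for `SegmentReader`, which exposes `getSegmentInfo()`:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.FilterLeafReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentReader;

public final class SegmentInfoUtil {
    /** Collect per-segment info from a top-level reader; sketch only. */
    public static List<SegmentCommitInfo> segmentInfos(IndexReader top) {
        List<SegmentCommitInfo> infos = new ArrayList<>();
        for (LeafReaderContext ctx : top.leaves()) {
            // Strip any FilterLeafReader wrappers that may have been applied.
            LeafReader leaf = FilterLeafReader.unwrap(ctx.reader());
            if (leaf instanceof SegmentReader) {
                // SegmentCommitInfo.info is the SegmentInfo (name, maxDoc, codec, ...).
                infos.add(((SegmentReader) leaf).getSegmentInfo());
            }
            // Leaves that are not SegmentReaders (e.g. in-memory readers) are
            // skipped; a per-segment cache could fall back to the leaf's
            // core cache key for those.
        }
        return infos;
    }
}
```

This relies on the leaves of a normal on-disk index being SegmentReaders, which is an implementation detail rather than an API guarantee.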
Re: How fast indexing?
When I run the same SQL on the DB it takes only 1 sec, and only 6-7 documents are getting indexed per second. As I have a 4-node SolrCloud setup, can I run 4 import handlers to index the same data? Will they not overwrite each other? 10-20 KB is very high; where can I get the actual size of a document?

Rgds
AJ

> On 22-Mar-2016, at 05:32, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 3/20/2016 6:11 PM, Amit Jha wrote:
>> In my case I am using DIH to index the data and the query has 2 join
>> statements. To index 70K documents it is taking 3-4 hours. Document size
>> would be around 10-20 KB. The DB is MSSQL and I am using Solr 4.2.10 in
>> cloud mode.
>
> My source data is in a MySQL database. I use DIH for full rebuilds and
> SolrJ for maintenance.
>
> My index is sharded, but I'm not running SolrCloud. When using DIH, all
> of my shards build at once, and each one achieves about 750 docs per
> second. With six large shards, rebuilding a 146 million document index
> takes 9-10 hours. It produces a total index size in the ballpark of 170GB.
>
> DIH has a performance limitation -- it's single-threaded. I obtain the
> speeds that I do because all of my shards import at the same time -- six
> dataimport instances running at the same time, each one with a single
> thread, importing a little more than 24 million documents. I have
> discovered that Solr is the bottleneck on my setup. The data retrieval
> from MySQL can proceed much faster than Solr can handle with a single
> indexing thread. My situation is a little bit unusual -- as Erick
> mentioned, usually the bottleneck is data retrieval, not Solr.
>
> At this point, if I want to make bulk indexing go faster, I need to
> build a SolrJ application that can index with multiple threads to each
> Solr core at the same time. This is on my roadmap, but it's not going
> to be a trivial project.
>
> At 10-20K, your documents are large, but not excessively so. If 70K
> documents take 3-4 hours, then one of a few problems is happening:
>
> 1) Your database is VERY slow.
> 2) Your analysis chain in schema.xml is running SUPER slow analysis
> components.
> 3) Your server or its configuration is not providing enough resources
> (CPU/RAM/IO) so Solr can run efficiently.
>
> #2 seems rather unlikely, so I would suspect one of the other two.
>
> I have seen one situation related to the Microsoft side of your setup
> that might cause a problem like this. If any of your machines are
> running on Windows Server 2012 and you have bridged NICs (usually for
> failover in the event of a switch failure), then you will need to break
> the bridge and just run one NIC.
>
> The performance improvement on the network when a bridged NIC is removed
> from Server 2012 is enough to blow your mind, especially if the access
> is over a high-latency network link, like a VPN or WAN connection. The
> same setup on Server 2003 or Server 2008 has very good performance.
> Microsoft seems to have a bug with bridged NICs in Server 2012. Last
> time I tried to figure out whether it could be fixed, I ran into this
> problem:
>
> https://xkcd.com/979/
>
> Thanks,
> Shawn
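On Shawn's point about a multi-threaded SolrJ indexer: a minimal sketch of that idea is below, assuming SolrJ 4.x on the classpath. The URL, collection name, and DB-reading helper are hypothetical; ConcurrentUpdateSolrServer itself batches documents and sends them from several background threads, so one single-threaded DIH is no longer the ceiling.

```java
import java.util.List;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // queueSize=1000 buffered docs, threadCount=4 sender threads (tune both).
        ConcurrentUpdateSolrServer solr =
            new ConcurrentUpdateSolrServer("http://localhost:8983/solr/mycollection", 1000, 4);
        for (List<SolrInputDocument> batch : fetchBatchesFromDb()) { // your DB reader
            solr.add(batch); // returns quickly; real sends happen on background threads
        }
        solr.blockUntilFinished(); // wait for the internal queues to drain
        solr.commit();             // a single commit at the end of the bulk load
        solr.shutdown();
    }

    // Placeholder for the JDBC side; a real indexer would stream rows here.
    private static Iterable<List<SolrInputDocument>> fetchBatchesFromDb() {
        return java.util.Collections.emptyList();
    }
}
```

In SolrCloud a CloudSolrServer per indexing thread is an alternative that also routes documents to the correct shard leader.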
Re: How fast indexing?
Yes, I do have multiple nodes in my SolrCloud setup.

Rgds
AJ

> On 21-Mar-2016, at 22:20, fabigol <fabien.stou...@vialtis.com> wrote:
>
> Amit Jha,
> do you have several Solr servers with SolrCloud?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-fast-indexing-tp4264994p4265122.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: How fast indexing?
Hi All,

In my case I am using DIH to index the data and the query has 2 join statements. To index 70K documents it is taking 3-4 hours. Document size would be around 10-20 KB. The DB is MSSQL and I am using Solr 4.2.10 in cloud mode.

Rgds
AJ

> On 21-Mar-2016, at 05:23, Erick Erickson wrote:
>
> In my experience, a majority of the time the bottleneck is in
> the data acquisition, not the Solr indexing per se. Take a look
> at the CPU utilization on Solr; if it's not running very heavy,
> then you need to look upstream.
>
> You haven't told us anything about _how_ you're indexing.
> SolrJ? DIH? Something from some other party? So it's hard to
> say much useful.
>
> You might review:
>
> http://wiki.apache.org/solr/UsingMailingLists
>
> Best,
> Erick
>
> On Sun, Mar 20, 2016 at 3:31 PM, Nick Vasilyev wrote:
>
>> There can be a lot of factors; can you provide a bit of additional
>> information to get started?
>>
>> - How many items are you indexing per second?
>> - What does the indexing process look like?
>> - How large is each item?
>> - What hardware are you using?
>> - How is your Solr set up? JVM memory, collection layout, etc...
>> - What is your current commit frequency?
>> - What is the query volume while you are indexing?
>>
>> On Sun, Mar 20, 2016 at 6:25 PM, fabigol wrote:
>>
>>> Hi,
>>> I have a Solr project where I do the indexing from a PostgreSQL
>>> database. The indexing is very slow. How can I accelerate it?
>>> Can I modify autoCommit in the file solrconfig.xml?
>>> Does someone have some ideas? I looked on Google but found little.
>>> Help me please.
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/How-fast-indexing-tp4264994.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
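On fabigol's autocommit question: commit frequency is controlled in solrconfig.xml. A common bulk-indexing baseline (the values are examples to tune, not a recommendation for every setup) is a periodic hard commit that does not open a new searcher, so commits do not interrupt indexing throughput:

```xml
<!-- solrconfig.xml -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>60000</maxTime>           <!-- hard commit at most once per minute -->
    <openSearcher>false</openSearcher> <!-- flush to disk without rebuilding searchers -->
  </autoCommit>
</updateHandler>
```

With this in place the indexing client should not issue its own commit per batch.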
SolrCloud Document Update Problem
Hi,

I have set up a SolrCloud with 2 shards, each having 2 replicas, with a 3-node ZooKeeper ensemble. We add and update documents from a web app. While updating, we delete the document and add the same document with updated values under the same unique id. I am facing a very strange issue: sometimes 2 documents have the same unique ID, one document with the old values and another with the new values. It happens only when we update a document. Please suggest or guide...

Rgds
Re: SolrCloud Document Update Problem
It was because of the issues.

Rgds
AJ

> On Jun 29, 2015, at 6:52 PM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:
>
>> On Mon, Jun 29, 2015 at 4:37 PM, Amit Jha <shanuu@gmail.com> wrote:
>>
>> Hi,
>> I have set up a SolrCloud with 2 shards, each having 2 replicas, with a
>> 3-node ZooKeeper ensemble. We add and update documents from a web app.
>> While updating, we delete the document and add the same document with
>> updated values under the same unique id.
>
> I am not sure why you delete the document. If you use the same unique
> key and send the whole document again (with some other fields changed),
> Solr will automatically overwrite the old document with the new one.
>
>> I am facing a very strange issue: sometimes 2 documents have the same
>> unique ID. One document with old values and another one with new values.
>> It happens only when we update the document. Please suggest or guide...
>> Rgds
>
> --
> Regards,
> Shalin Shekhar Mangar.
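As Shalin says, the delete step is unnecessary, and it is what opens the window in which two versions of a document can coexist; sending the full document again with the same uniqueKey replaces the old one in a single operation. For example, posting this to the collection's /update handler with Content-Type: application/json (the field names are illustrative) overwrites any existing doc-1:

```json
[
  {"id": "doc-1", "firstname": "Amit", "status": "updated"}
]
```

Solr 4.x also supports atomic updates, e.g. `{"id": "doc-1", "status": {"set": "updated"}}`, provided the other fields are stored, so only the changed fields need to be sent.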
Real Time indexing and Scalability
Hi,

In my use case, I am adding documents to Solr through a Spring application using spring-data-solr. This setup works well with a single Solr instance, but in the current setup it is a single point of failure. So we decided to use Solr replication, because we also need centralized search; we therefore set up two instances, both in repeater mode. The problem with this setup was that sometimes data was not getting indexed. So we moved to SolrCloud, with 3 ZooKeeper nodes and a 2-shard, 2-replica setup, but still we sometimes found that documents are not getting indexed. I would like to know the best way to have a highly available setup.

Rgds
AJ
Re: Real Time indexing and Scalability
I want to have realtime indexing and realtime search.

Rgds
AJ

> On Jun 5, 2015, at 10:12 PM, Amit Jha <shanuu@gmail.com> wrote:
>
> Hi,
> In my use case, I am adding documents to Solr through a Spring
> application using spring-data-solr. This setup works well with a single
> Solr instance, but in the current setup it is a single point of failure.
> So we decided to use Solr replication because we also need centralized
> search; therefore we set up two instances, both in repeater mode. The
> problem with this setup was that sometimes data was not getting indexed.
> So we moved to SolrCloud, with 3 ZooKeeper nodes and a 2-shard,
> 2-replica setup, but still we sometimes found that documents are not
> getting indexed. I would like to know the best way to have a highly
> available setup.
> Rgds
> AJ
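For the realtime-search requirement, near-real-time visibility in SolrCloud is normally configured with soft commits in solrconfig.xml rather than a client-issued commit per document. A common sketch (the intervals are examples to tune):

```xml
<autoSoftCommit>
  <maxTime>1000</maxTime>            <!-- new docs become searchable within ~1s -->
</autoSoftCommit>
<autoCommit>
  <maxTime>60000</maxTime>           <!-- durable hard commit once a minute -->
  <openSearcher>false</openSearcher>
</autoCommit>
```

Soft commits make documents visible cheaply; the periodic hard commit handles durability.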
Re: Real Time indexing and Scalability
Thanks Erick. What about a document that is committed to the master? Then the document should be visible from the master; is that correct?

I was using replication with repeater mode because LBHttpSolrServer can send a write request to any of the Solr servers, and that Solr should index the document because it is a master. We have a polling interval of 2 sec; after the polling interval the slave can poll the data. It is worth mentioning here that the application issues the commit command. If a document is committed to the master and a search request comes to the same master, then the document should be retrieved, irrespective of replication, because the master doesn't know who the slaves are. In repeater mode a document can be indexed on both Solr instances. Is that understanding correct? Also, why do you say that the commit is inappropriate?

Rgds
AJ

> On Jun 5, 2015, at 11:16 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> You have to provide a _lot_ more details. You say:
> "The problem... some data was not getting indexed... still sometimes we
> found that documents are not getting indexed."
>
> Neither of these should be happening, so I suspect
> 1> your expectations aren't correct. For instance, in the master/slave
> setup you won't see docs on the slave until after the polling interval
> has expired and the index is replicated.
> 2> In SolrCloud you aren't committing appropriately.
>
> You might review:
> http://wiki.apache.org/solr/UsingMailingLists
>
> Best,
> Erick
>
> On Fri, Jun 5, 2015 at 9:45 AM, Amit Jha <shanuu@gmail.com> wrote:
>
>> I want to have realtime indexing and realtime search.
>> Rgds
>> AJ
>>
>> On Jun 5, 2015, at 10:12 PM, Amit Jha <shanuu@gmail.com> wrote:
>>
>>> Hi,
>>> In my use case, I am adding documents to Solr through a Spring
>>> application using spring-data-solr. This setup works well with a
>>> single Solr instance, but in the current setup it is a single point of
>>> failure. So we decided to use Solr replication because we also need
>>> centralized search; therefore we set up two instances, both in
>>> repeater mode. The problem with this setup was that sometimes data was
>>> not getting indexed. So we moved to SolrCloud, with 3 ZooKeeper nodes
>>> and a 2-shard, 2-replica setup, but still we sometimes found that
>>> documents are not getting indexed. I would like to know the best way
>>> to have a highly available setup.
>>> Rgds
>>> AJ
Re: Real Time indexing and Scalability
Thanks Shawn for reminding me of CloudSolrServer; yes, I have moved to SolrCloud.

I agree that a repeater is a slave that acts as a master for other slaves. But it is still a master, and logically it has to obey what a master is supposed to obey: if 2 servers are masters, that means writing can be done on both. If I set up replication between 2 servers and configure both as repeaters, then both can act as master and slave for each other; therefore writing can be done on both.

Rgds
AJ

> On Jun 6, 2015, at 1:26 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 6/5/2015 1:38 PM, Amit Jha wrote:
>>
>> Thanks Erick. What about a document that is committed to the master?
>> Then the document should be visible from the master; is that correct?
>> I was using replication with repeater mode because LBHttpSolrServer can
>> send a write request to any of the Solr servers, and that Solr should
>> index the document because it is a master. We have a polling interval
>> of 2 sec; after the polling interval the slave can poll the data. It is
>> worth mentioning here that the application issues the commit command.
>> If a document is committed to the master and a search request comes to
>> the same master, then the document should be retrieved, irrespective of
>> replication, because the master doesn't know who the slaves are. In
>> repeater mode a document can be indexed on both Solr instances. Is that
>> understanding correct? Also, why do you say that the commit is
>> inappropriate?
>
> If you are not using SolrCloud, then you must index to the master *ONLY*.
> A repeater does not enable two-way replication. A repeater is a slave
> that is also a master for additional slaves. Master-slave replication is
> *only* one-way - from the master to slaves, and if any of those slaves
> are repeaters, from there to additional slaves.
>
> SolrCloud is probably a far better choice for your setup, especially if
> you are using the SolrJ client. You mentioned LBHttpSolrServer, which is
> why I am thinking you're using SolrJ.
>
> With a proper configuration on your collection, SolrCloud lets you index
> to any machine in the cloud and the data will end up exactly where it
> needs to go. If you use CloudSolrServer/CloudSolrClient and a very
> recent Solr/SolrJ version, the data will be sent directly to the correct
> instance for best performance.
>
> Thanks,
> Shawn
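A minimal sketch of the CloudSolrServer usage Shawn mentions (SolrJ 4.x; the ZooKeeper hosts and collection name are examples): the client is given the ZooKeeper ensemble rather than any single Solr URL, so there is no single point of failure on the client side and updates are routed to the right shard leader.

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexExample {
    public static void main(String[] args) throws Exception {
        // Point at the ZooKeeper ensemble, not at an individual Solr node.
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("mycollection");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "hello solrcloud");
        solr.add(doc);   // routed to the shard leader that owns this id
        solr.commit();   // or rely on autoCommit/autoSoftCommit instead
        solr.shutdown();
    }
}
```

In SolrJ 5+ the equivalent class is CloudSolrClient with the same zkHost-based construction.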
Re: Real Time indexing and Scalability
Thanks everyone. I got the answer.

Rgds
AJ

> On Jun 6, 2015, at 7:00 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> bq: if 2 servers are masters, that means writing can be done on both.
>
> If there's a single piece of documentation that supports this
> contention, we'll correct it immediately. But it's simply not true. As
> Shawn says, the entire design behind the master/slave architecture is
> that there is exactly one (and only one) master that _ever_ gets
> documents indexed to it.
>
> Repeaters were introduced as a way to fan out the replication process,
> particularly across data centers that had expensive pipes connecting
> them. You could have the repeater in DC2 relay the index from the master
> in DC1 to all slaves in DC2. In that kind of setup, you then replicate
> the index across the expensive pipe once rather than once for each slave
> in DC2. But even in this situation you are only ever indexing to the
> master in DC1.
>
> Best,
> Erick
>
> On Fri, Jun 5, 2015 at 1:20 PM, Amit Jha <shanuu@gmail.com> wrote:
>
>> Thanks Shawn for reminding me of CloudSolrServer; yes, I have moved to
>> SolrCloud. I agree that a repeater is a slave that acts as a master for
>> other slaves. But it is still a master, and logically it has to obey
>> what a master is supposed to obey: if 2 servers are masters, that means
>> writing can be done on both. If I set up replication between 2 servers
>> and configure both as repeaters, then both can act as master and slave
>> for each other; therefore writing can be done on both.
>> Rgds
>> AJ
>>
>> On Jun 6, 2015, at 1:26 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>>
>>> On 6/5/2015 1:38 PM, Amit Jha wrote:
>>>
>>>> Thanks Erick. What about a document that is committed to the master?
>>>> Then the document should be visible from the master; is that correct?
>>>> I was using replication with repeater mode because LBHttpSolrServer
>>>> can send a write request to any of the Solr servers, and that Solr
>>>> should index the document because it is a master. We have a polling
>>>> interval of 2 sec; after the polling interval the slave can poll the
>>>> data. It is worth mentioning here that the application issues the
>>>> commit command. If a document is committed to the master and a search
>>>> request comes to the same master, then the document should be
>>>> retrieved, irrespective of replication, because the master doesn't
>>>> know who the slaves are. In repeater mode a document can be indexed
>>>> on both Solr instances. Is that understanding correct? Also, why do
>>>> you say that the commit is inappropriate?
>>>
>>> If you are not using SolrCloud, then you must index to the master
>>> *ONLY*. A repeater does not enable two-way replication. A repeater is
>>> a slave that is also a master for additional slaves. Master-slave
>>> replication is *only* one-way - from the master to slaves, and if any
>>> of those slaves are repeaters, from there to additional slaves.
>>>
>>> SolrCloud is probably a far better choice for your setup, especially
>>> if you are using the SolrJ client. You mentioned LBHttpSolrServer,
>>> which is why I am thinking you're using SolrJ.
>>>
>>> With a proper configuration on your collection, SolrCloud lets you
>>> index to any machine in the cloud and the data will end up exactly
>>> where it needs to go. If you use CloudSolrServer/CloudSolrClient and a
>>> very recent Solr/SolrJ version, the data will be sent directly to the
>>> correct instance for best performance.
>>>
>>> Thanks,
>>> Shawn
SolrCloud Replication Issue
Hi,

A few days ago I deployed a Solr 4.9.0 cluster, which consists of 2 collections. Each collection has 1 shard with 3 replicas on 3 different machines. On the first day I noticed this error appear on the leader.

Full log - http://pastebin.com/wcPMZb0s

4/23/2015, 2:34:37 PM SEVERE SolrCmdDistributor org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://production-solrcloud-004:8080/solr/bookings_shard1_replica2
4/23/2015, 2:34:37 PM WARNING DistributedUpdateProcessor Error sending update
4/23/2015, 2:34:37 PM WARNING ZkController Leader is publishing core=bookings_shard1_replica2 state=down on behalf of un-reachable replica http://production-solrcloud-004:8080/solr/bookings_shard1_replica2/; forcePublishState? false

The other 2 replicas had 0 errors. I thought it might be a one-off, but the same error occurred on day 2, which has got me slightly concerned. During these periods I didn't notice any issues with the cluster, and everything looks healthy in the cloud summary. All of the instances are hosted on AWS.

Any idea what may be causing this issue and what I can do to mitigate it?

Thanks
Amit
Re: SolrCloud Replication Issue
Appreciate the response; to answer your questions:

* Do you see this happen often? How often?
It has happened twice in five days - the first two days after deployment.

* Are there any known network issues?
There are no obvious network issues, but as these instances reside in AWS I cannot rule out network blips.

* Do you have any idea about the GC on those replicas?
I have been monitoring the memory usage and all instances are using no more than 30% of their JVM memory allocation.

On 27 April 2015 at 21:36, Anshum Gupta <ans...@anshumgupta.net> wrote:

> Looks like LeaderInitiatedRecovery, or LIR. When a leader receives a
> document (update) but fails to successfully forward it to a replica, it
> marks that replica as down and asks the replica to recover (hence the
> name, Leader Initiated Recovery). It could be due to multiple reasons,
> e.g. a network issue or GC. The replica generally comes back up and
> syncs with the leader transparently. As an end-user, you don't have to
> really worry much about this, but if you want to dig deeper, here are a
> few questions that might help us in suggesting what to do/look at.
>
> * Do you see this happen often? How often?
> * Are there any known network issues?
> * Do you have any idea about the GC on those replicas?
>
> On Mon, Apr 27, 2015 at 1:25 PM, Amit L <amitlal...@gmail.com> wrote:
>
>> Hi,
>> A few days ago I deployed a Solr 4.9.0 cluster, which consists of 2
>> collections. Each collection has 1 shard with 3 replicas on 3 different
>> machines. On the first day I noticed this error appear on the leader.
>>
>> Full log - http://pastebin.com/wcPMZb0s
>>
>> 4/23/2015, 2:34:37 PM SEVERE SolrCmdDistributor org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://production-solrcloud-004:8080/solr/bookings_shard1_replica2
>> 4/23/2015, 2:34:37 PM WARNING DistributedUpdateProcessor Error sending update
>> 4/23/2015, 2:34:37 PM WARNING ZkController Leader is publishing core=bookings_shard1_replica2 state=down on behalf of un-reachable replica http://production-solrcloud-004:8080/solr/bookings_shard1_replica2/; forcePublishState? false
>>
>> The other 2 replicas had 0 errors. I thought it might be a one-off, but
>> the same error occurred on day 2, which has got me slightly concerned.
>> During these periods I didn't notice any issues with the cluster, and
>> everything looks healthy in the cloud summary. All of the instances are
>> hosted on AWS. Any idea what may be causing this issue and what I can
>> do to mitigate it?
>>
>> Thanks
>> Amit
>
> --
> Anshum Gupta
Re: Retrieving Phonetic Code as result
Can I extend Solr to add phonetic codes at indexing time, the same way a UUID field gets added? I want to precompute the metaphone code, because calculating the code at runtime will give me a performance hit.

Rgds
AJ

> On Jan 23, 2015, at 5:37 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>
> Your app can use the field analysis API (FieldAnalysisRequestHandler) to
> query Solr for what the resulting field values are for each filter in
> the analysis chain for a given input string. This is what the Solr Admin
> UI Analysis web page uses.
>
> See:
> http://lucene.apache.org/solr/4_10_2/solr-core/org/apache/solr/handler/FieldAnalysisRequestHandler.html
> and the corresponding request handler in solrconfig.xml.
>
> -- Jack Krupansky
>
> On Thu, Jan 22, 2015 at 8:42 AM, Amit Jha <shanuu@gmail.com> wrote:
>
>> Hi,
>> I need to know how I can retrieve phonetic codes. Does Solr provide
>> them as part of the result? I need the codes for record matching.
>>
>> The following is the schema fragment:
>>
>> <fieldType name="phonetic" stored="true" indexed="true" class="solr.TextField">
>>   <analyzer type="index">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.DoubleMetaphoneFilterFactory" inject="true" maxCodeLength="4"/>
>>   </analyzer>
>> </fieldType>
>> <field name="firstname" type="text_general" indexed="true" stored="true"/>
>> <field name="firstname_phonetic" type="phonetic"/>
>> <field name="lastname_phonetic" type="phonetic"/>
>> <field name="lastname" type="text_general" indexed="true" stored="true"/>
>> <copyField source="lastname" dest="lastname_phonetic"/>
>> <copyField source="firstname" dest="firstname_phonetic"/>
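The FieldAnalysisRequestHandler call Jack describes looks roughly like this (the host, core, and input value are examples); the JSON response shows the token stream after each stage of the `phonetic` chain, including the DoubleMetaphone output:

```
http://localhost:8983/solr/collection1/analysis/field
    ?analysis.fieldtype=phonetic
    &analysis.fieldvalue=Amit%20Jha
    &wt=json
```

This runs the analysis chain on demand without indexing anything, so it can be used to fetch codes for arbitrary strings at match time.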
Retrieving Phonetic Code as result
Hi,

I need to know how I can retrieve phonetic codes. Does Solr provide them as part of the result? I need the codes for record matching.

The following is the schema fragment:

<fieldType name="phonetic" stored="true" indexed="true" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="true" maxCodeLength="4"/>
  </analyzer>
</fieldType>
<field name="firstname" type="text_general" indexed="true" stored="true"/>
<field name="firstname_phonetic" type="phonetic"/>
<field name="lastname_phonetic" type="phonetic"/>
<field name="lastname" type="text_general" indexed="true" stored="true"/>
<copyField source="lastname" dest="lastname_phonetic"/>
<copyField source="firstname" dest="firstname_phonetic"/>
Re: Retrieving Phonetic Code as result
> Hi,
> I need to know how I can retrieve phonetic codes. Does Solr provide them
> as part of the result? I need the codes for record matching.
>
> The following is the schema fragment:
>
> <fieldType name="phonetic" stored="true" indexed="true" class="solr.TextField">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.DoubleMetaphoneFilterFactory" inject="true" maxCodeLength="4"/>
>   </analyzer>
> </fieldType>
> <field name="firstname" type="text_general" indexed="true" stored="true"/>
> <field name="firstname_phonetic" type="phonetic"/>
> <field name="lastname_phonetic" type="phonetic"/>
> <field name="lastname" type="text_general" indexed="true" stored="true"/>
> <copyField source="lastname" dest="lastname_phonetic"/>
> <copyField source="firstname" dest="firstname_phonetic"/>

Hi,

Thanks for the response. I can see the generated Metaphone codes using Luke. I am using Solr only because it creates the phonetic code at indexing time; otherwise, for each record I would need to call the Metaphone algorithm in realtime to get the codes and compare them. I think when Luke can read and display it, why can't Solr?
Re: Retrieving Phonetic Code as result
Thanks for the response. I can see the generated Metaphone codes using Luke. I am using Solr only because it creates the phonetic code at indexing time; otherwise, for each record I would need to call the Metaphone algorithm in realtime to get the codes and compare them. I think when Luke can read and display it, why can't Solr?

On Thu, Jan 22, 2015 at 7:54 PM, Amit Jha <shanuu@gmail.com> wrote:

> Hi,
> I need to know how I can retrieve phonetic codes. Does Solr provide them
> as part of the result? I need the codes for record matching.
>
> The following is the schema fragment:
>
> <fieldType name="phonetic" stored="true" indexed="true" class="solr.TextField">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.DoubleMetaphoneFilterFactory" inject="true" maxCodeLength="4"/>
>   </analyzer>
> </fieldType>
> <field name="firstname" type="text_general" indexed="true" stored="true"/>
> <field name="firstname_phonetic" type="phonetic"/>
> <field name="lastname_phonetic" type="phonetic"/>
> <field name="lastname" type="text_general" indexed="true" stored="true"/>
> <copyField source="lastname" dest="lastname_phonetic"/>
> <copyField source="firstname" dest="firstname_phonetic"/>
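On computing codes in the application instead: the schema's DoubleMetaphoneFilterFactory wraps the DoubleMetaphone implementation from Apache Commons Codec (`org.apache.commons.codec.language.DoubleMetaphone`), which can be called directly at comparison time. As a dependency-free illustration of the same idea, here is a simplified American Soundex encoder (it ignores the special h/w adjacency rule of full Soundex; all names below are examples):

```java
// Simplified American Soundex: keep the first letter, map the remaining
// letters to digit classes, drop vowels, collapse adjacent duplicate codes,
// and pad/truncate the result to 4 characters.
public final class SimpleSoundex {
    // Code for each letter A..Z; '0' marks letters that yield no digit.
    private static final String CODES = "01230120022455012623010202";

    public static String encode(String name) {
        String u = name.toUpperCase();
        StringBuilder out = new StringBuilder();
        out.append(u.charAt(0));
        char prev = CODES.charAt(u.charAt(0) - 'A');
        for (int i = 1; i < u.length() && out.length() < 4; i++) {
            char ch = u.charAt(i);
            if (ch < 'A' || ch > 'Z') continue;      // skip non-letters
            char c = CODES.charAt(ch - 'A');
            if (c != '0' && c != prev) out.append(c); // collapse repeats, drop vowels
            prev = c;
        }
        while (out.length() < 4) out.append('0');     // pad short codes
        return out.toString();
    }
}
```

SimpleSoundex.encode("Smith") and SimpleSoundex.encode("Smyth") both yield "S530", which is exactly the property record matching relies on.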
Re: De Duplication using Solr
Thanks for the reply... I have already seen the wiki. My case is more like record matching.

On Sat, Jan 3, 2015 at 7:39 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

> First, see if you can get your requirements to align with the de-dupe
> feature that Solr already has:
> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>
> -- Jack Krupansky
>
> On Sat, Jan 3, 2015 at 2:54 AM, Amit Jha <shanuu@gmail.com> wrote:
>
>> I am trying to find out duplicate records based on distance and
>> phonetic algorithms. Can I utilize Solr for that? I have the following
>> fields and conditions to identify exact or possible duplicates.
>>
>> 1. Fields:
>> prefix, suffix, firstname, lastname,
>> email (primary_email1, email2, email3),
>> phone (primary_phone1, phone2, phone3)
>>
>> 2. Conditions:
>>
>> Two records are said to be exact duplicates if:
>> 1. IsExactMatchFunction(record1_prefix, record2_prefix) AND
>>    IsExactMatchFunction(record1_suffix, record2_suffix) AND
>>    IsExactMatchFunction(record1_firstname, record2_firstname) AND
>>    IsExactMatchFunction(record1_lastname, record2_lastname) AND
>>    IsExactMatchFunction(record1_primary_email, record2_primary_email) OR
>>    IsExactMatchFunction(record1_primary_phone, record2_primary_phone)
>>
>> Two records are said to be possible duplicates if:
>> 1. IsExactMatchFunction(record1_prefix, record2_prefix) OR
>>    IsExactMatchFunction(record1_suffix, record2_suffix) OR
>>    IsExactMatchFunction(record1_firstname, record2_firstname) AND
>>    IsExactMatchFunction(record1_lastname, record2_lastname) AND
>>    IsExactMatchFunction(record1_primary_email, record2_primary_email) OR
>>    IsExactMatchFunction(record1_primary_phone, record2_primary_phone)
>> ELSE
>> 2. IsFuzzyMatchFunction(record1_firstname, record2_firstname) AND
>>    IsExactMatchFunction(record1_lastname, record2_lastname) AND
>>    IsExactMatchFunction(record1_primary_email, record2_primary_email) OR
>>    IsExactMatchFunction(record1_primary_phone, record2_primary_phone)
>> ELSE
>> 3. IsFuzzyMatchFunction(record1_firstname, record2_firstname) AND
>>    IsExactMatchFunction(record1_lastname, record2_lastname) AND
>>    IsExactMatchFunction(record1_any_email, record2_any_email) OR
>>    IsExactMatchFunction(record1_any_phone, record2_any_phone)
>>
>> IsFuzzyMatchFunction() will perform distance and phonetic algorithm
>> calculations and compare the result with a predefined threshold. For
>> example: if the threshold defined for firstname is 85, then
>> IsFuzzyMatchFunction() returns true only if one of the algorithms
>> (distance or phonetic) returns a similarity score >= 85.
>>
>> Can I use Solr to perform this job? Or can you guys suggest how I can
>> approach this problem? I have seen Duke (a de-duplication API) but I
>> cannot use Duke out of the box.
De Duplication using Solr
I am trying to find duplicate records based on distance and phonetic algorithms. Can I utilize Solr for that? I have the following fields and conditions to identify exact or possible duplicates.

1. Fields:
prefix, suffix, firstname, lastname, email (primary_email1, email2, email3), phone (primary_phone1, phone2, phone3)

2. Conditions:

Two records are said to be exact duplicates if:
1. IsExactMatchFunction(record1_prefix, record2_prefix) AND IsExactMatchFunction(record1_suffix, record2_suffix) AND IsExactMatchFunction(record1_firstname, record2_firstname) AND IsExactMatchFunction(record1_lastname, record2_lastname) AND IsExactMatchFunction(record1_primary_email, record2_primary_email) OR IsExactMatchFunction(record1_primary_phone, record2_primary_phone)

Two records are said to be possible duplicates if:
1. IsExactMatchFunction(record1_prefix, record2_prefix) OR IsExactMatchFunction(record1_suffix, record2_suffix) OR IsExactMatchFunction(record1_firstname, record2_firstname) AND IsExactMatchFunction(record1_lastname, record2_lastname) AND IsExactMatchFunction(record1_primary_email, record2_primary_email) OR IsExactMatchFunction(record1_primary_phone, record2_primary_phone)
ELSE
2. IsFuzzyMatchFunction(record1_firstname, record2_firstname) AND IsExactMatchFunction(record1_lastname, record2_lastname) AND IsExactMatchFunction(record1_primary_email, record2_primary_email) OR IsExactMatchFunction(record1_primary_phone, record2_primary_phone)
ELSE
3. IsFuzzyMatchFunction(record1_firstname, record2_firstname) AND IsExactMatchFunction(record1_lastname, record2_lastname) AND IsExactMatchFunction(record1_any_email, record2_any_email) OR IsExactMatchFunction(record1_any_phone, record2_any_phone)

IsFuzzyMatchFunction() performs distance and phonetic algorithm calculations and compares the result with a predefined threshold. For example: if the threshold defined for firstname is 85, IsFuzzyMatchFunction() returns true if and only if one of the algorithms (distance or phonetic) returns a similarity score >= 85. Can I use Solr to perform this job? Or can you suggest how I should approach this problem? I have looked at Duke (a de-duplication API), but I cannot use Duke out of the box.
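For the distance half of IsFuzzyMatchFunction(), a plain Levenshtein similarity scaled to 0-100 matches the threshold convention described above (the phonetic half would come from something like Soundex or Double Metaphone, e.g. via commons-codec). A minimal, self-contained sketch — the method names and the 0-100 scoring convention mirror the question, not any Solr API:

```java
// Sketch of the distance half of IsFuzzyMatchFunction():
// similarity = (1 - levenshtein/maxLen) * 100, compared against a threshold.
public class FuzzyMatch {
    static int levenshtein(String a, String b) {
        // Standard two-row dynamic-programming edit distance
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) return 100.0;
        return (1.0 - (double) levenshtein(a, b) / maxLen) * 100.0;
    }

    // true iff the similarity score meets the per-field threshold (85 for firstname)
    static boolean isFuzzyMatch(String a, String b, double threshold) {
        return similarity(a, b) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(similarity("jonathan", "jonathon")); // 87.5
    }
}
```

Solr itself is a better fit for the candidate-generation step (fuzzy queries like firstname:jon~ or n-gram fields to find likely pairs); the final pairwise scoring above would typically run in your own code over the candidates Solr returns.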
Re: different fields for user-supplied phrases in edismax
Hi Mike, What exactly is your use case? What do you mean by controlling the fields used for phrase queries? Rgds AJ

On 12-Dec-2014, at 20:11, Michael Sokolov msoko...@safaribooksonline.com wrote:

Doug - I believe pf controls the fields that are used for the phrase queries *generated by the parser*. What I am after is controlling the fields used for the phrase queries *supplied by the user* -- ie surrounded by double-quotes. -Mike

On 12/12/2014 08:53 AM, Doug Turnbull wrote:

Michael, I typically solve this problem by using a copyField and running different analysis on the destination field. Then you could use this field as pf instead of qf. If I recall, fields in pf must also be mentioned in qf for this to work. -Doug

On Fri, Dec 12, 2014 at 8:13 AM, Michael Sokolov msoko...@safaribooksonline.com wrote:

Yes, I guess it's a common expectation that searches work this way. It was actually almost trivial to add as an extension to the edismax parser, and I have what I need now; I opened SOLR-6842; if there's interest I'll try to find the time to contribute back to Solr -Mike

On 12/11/14 5:20 PM, Ahmet Arslan wrote:

Hi Mike, If I am not wrong, you are trying to simulate Google behaviour. If you use quotes, Google returns exact matches. I think that makes perfect sense and will be a valuable addition. I remember some folks asked/requested this behaviour on the list. Ahmet

On Thursday, December 11, 2014 10:50 PM, Michael Sokolov msoko...@safaribooksonline.com wrote:

I'd like to supply a different set of fields for phrases than for bare terms. Specifically, we'd like to treat phrases as more exact - probably turning off stemming and generally having a tighter analysis chain. Note: this is *not* what's done by configuring pf, which controls fields for the auto-generated phrases. What we want to do is provide our users more precise control via explicit use of double-quotes. Is there a way to do this by configuring edismax? I don't think there is, and then if you agree, a followup question - if I want to extend the EDismax parser, does anybody have advice as to the best way in? I'm looking at: Query getFieldQuery(String field, String val, int slop) and altering getAliasedQuery() to accept an aliases parameter, which would be a different set of aliases for phrases ... does that make sense? -Mike
Re: Fault Tolerant Technique of Solr Cloud
Solr will complain only if you bring down both the replica and the leader of the same shard. It is difficult to have a highly available environment if you have only a small number of physical servers. Rgds AJ

On 18-Feb-2014, at 18:35, Vineet Mishra clearmido...@gmail.com wrote:

Hi All, I want to have a clear idea about the fault-tolerance capability of SolrCloud. Consider that I have set up SolrCloud with an external ZooKeeper, 2 shards, each having a replica, with a single collection, as given in the official Solr documentation: https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud

Collection1
  Shard 1: localhost:8983, localhost:7574
  Shard 2: localhost:8900, localhost:7500

I indexed some documents, and then if I shut down any of the replicas or leaders, say for example localhost:8900, I can't query the collection on that particular port: http://localhost:8900/solr/collection1/select?q=*:*

Then how is it fault tolerant, and how should the query be made? Regards
Solr Deduplication use of overWriteDupes flag
Hello, I had a configuration with overwriteDupes=false. I added a few duplicate documents. Result: I got duplicate documents in the index. When I changed to overwriteDupes=true, the duplicate documents started overwriting the older documents. Question 1: How do I achieve [add if not there, fail if a duplicate is found], i.e. mimic the behaviour of a DB, which fails when trying to insert a record that violates some unique constraint? I thought that overwriteDupes=false would do that, but apparently not. Question 2: Is there some documentation around overwriteDupes? I have checked the existing wiki; there is very little explanation of the flag there. Thanks, -Amit
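For context, the overwriteDupes behaviour comes from the SignatureUpdateProcessorFactory chain in solrconfig.xml. A typical configuration looks like the sketch below (signatureField and fields values are illustrative). With overwriteDupes=false Solr only *stores* the signature, it does not enforce uniqueness, which matches what you observed; as far as I know there is no built-in "fail on duplicate" mode, so you would have to query by signature before adding, or enforce it in a custom update processor.

```
<!-- Sketch: de-duplication chain; signatureField/fields values are illustrative -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- true:  a doc whose signature already exists overwrites the older copy -->
    <!-- false: the signature is stored, but duplicates are still added -->
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain is then referenced from the update handler (or per request) via the update.chain parameter.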
Re: Boosting documents by categorical preferences
Chris, Sounds good! Thanks for the tips.. I'll be glad to submit my talk to this as I have a writeup pretty much ready to go. Cheers Amit On Tue, Jan 28, 2014 at 11:24 AM, Chris Hostetter hossman_luc...@fucit.orgwrote: : The initial results seem to be kinda promising... of course there are many : more optimizations I could do like decay user ratings over time to indicate : that preferences decay over time so a 5 rating a year ago doesn't count as : much as a 5 rating today. : : Hope this helps others. I'll open source what I have soon and post back. If : there is feedback or other thoughts let me know! Hey Amit, Glad to hear your user based boosting experiments are paying off. I would definitely love to see a more detailed writeup down the road showing off how it affects your final user metrics -- or perhaps even give a session on your technique at ApacheCon? http://events.linuxfoundation.org/events/apachecon-north-america/program/cfp -Hoss http://www.lucidworks.com/
Re: Boosting documents by categorical preferences
Hi Chris (and others interested in this), Sorry for dropping off. I got sidetracked with other work, came back to this, and finally got a V1 of this implemented. The final process is as follows:

1) Pre-compute the global categorical num_ratings/average/std-dev (so for Action the average rating may be 3.49 with a std-dev of .99).
2) For a given user, retrieve the last X (X for me is 10) ratings and compute the user's categorical affinities by taking the average rating for all movies in that particular category (Action), subtracting the global category average, and dividing by the category std-dev. Furthermore, multiply this by the fraction of total user ratings in that category.
- For example, if a user's last 10 ratings consisted of 9/10 Drama and 1/10 Thriller, the z-score of the Thriller should be discounted relative to that of the Drama, so that the user's preference (either positive or negative) for Drama is more prominent.
3) Sort by the absolute value of the z-score (thanks Hossman, great thought).
4) Return the top 3 (arbitrary number).
5) Modify the query to look like the following:

qq=tom hanks
q={!boost b=$b defType=edismax v=$qq}
cat1=category:Children
cat2=category:Fantasy
cat3=category:Animation
b=sum(1,sum(product(query($cat1),0.22267872),product(query($cat2),0.21630952),product(query($cat3),0.21120241)))

basically b = 1+(pref1*query(category:something1) + pref2*query(category:something2) + pref3*query(category:something3))

The initial results seem to be kind of promising... of course there are many more optimizations I could do, like decaying user ratings over time to indicate that preferences decay over time, so a 5 rating a year ago doesn't count as much as a 5 rating today. Hope this helps others. I'll open source what I have soon and post back. If there is feedback or other thoughts let me know! Cheers Amit

On Fri, Nov 22, 2013 at 11:38 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : I thought about that but my concern/question was how.
If I used the pow : function then I'm still boosting the bad categories by a small : amount... alternatively I could multiply by a negative number, but does that : work as expected?

I'm not sure I understand your concern: negative powers would give you values less than 1, positive powers would give you values greater than 1, and then you'd use those values as multiplicative boosts -- so the values less than 1 would penalize the scores of existing matching docs in the categories the user dislikes. Oh wait ... I see, in your original email (and in my subsequent suggested tweak to use pow()) you were talking about sum()ing up these 3 category boosts (and I cut/pasted sum() in my example as well) ... yeah, using multiplication there would make more sense if you wanted to do the negative preferences as well, because then the score of any matching doc will be reduced if it matches on an undesired category -- and the amount it will be reduced will be determined by how strongly it matches on that category (ie: the base score returned by the nested query() func) and how negative the undesired preference value (ie: the pow() exponent) is

qq=...
q={!boost b=$b v=$qq}
b=prod(pow(query($cat1),$cat1z),pow(query($cat2),$cat2z),pow(query($cat3),$cat3z))
cat1=...action... cat1z=1.48
cat2=...comedy... cat2z=1.33
cat3=...kids... cat3z=-1.7

-Hoss
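The b= function in this thread is ultimately just string assembly over the per-category z-score weights. A small self-contained sketch of how the weights become the boost parameter (the category names and weight values are the examples from this thread, not real data):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: turn per-category z-score affinities into the edismax boost
// function b = sum(1,product(query($cat1),w1),...) used in this thread.
// The cat1/cat2/... parameter names match the query shown above.
public class BoostBuilder {
    static String buildBoost(Map<String, Double> affinities) {
        StringBuilder b = new StringBuilder("sum(1");
        int i = 1;
        for (double w : affinities.values()) {
            b.append(",product(query($cat").append(i++).append("),")
             .append(w).append(")");
        }
        return b.append(")").toString();
    }

    public static void main(String[] args) {
        // Top-3 categories by |z-score|, as in step 4 of the recipe
        Map<String, Double> prefs = new LinkedHashMap<>();
        prefs.put("category:Children", 0.22267872);
        prefs.put("category:Fantasy", 0.21630952);
        prefs.put("category:Animation", 0.21120241);
        System.out.println(buildBoost(prefs));
    }
}
```

Each map key would also be emitted as its own request parameter (cat1=category:Children, and so on) alongside the generated b= value.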
SolrCloud Cluster Setup - Shard Replica
Hi, I tried to create a 2-shard cluster with a replica per shard for a collection. For this setup I used two physical machines: 1 shard and 1 replica on machine A, and another 1 shard and 1 replica on machine B. Now when I stop both the shard and the replica on machine B, I am not able to perform searches. I would like to know how I can set up a fail-safe cluster using two machines. I would like to achieve the use case where, if a machine goes down, I can still serve search requests. I have a constraint that I cannot add more machines. Is there any alternative way to achieve this use case? Regards Amit
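With only two machines, the layout that survives a single-machine failure puts one copy of *each* shard on *each* machine, so that either machine alone still holds the whole collection; queries should then go through a ZooKeeper-aware client (e.g. SolrJ's CloudSolrServer in 4.x) or a load balancer rather than a fixed host:port. A collection along those lines could be created with the Collections API, roughly as below (host name is a placeholder):

```
http://machineA:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=2&replicationFactor=2&maxShardsPerNode=2
```

With replicationFactor=2 and maxShardsPerNode=2, SolrCloud can place a replica of shard1 and shard2 on both machines; note that ZooKeeper itself also needs to remain reachable for the surviving node to keep serving.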
Index size - to determine storage
Hi, I would like to know: if I index a file, i.e. a PDF of 100KB, then what would be the size of the index? What factors should be considered to determine the disk size? Rgds AJ
Re: DateField - Invalid JSON String Exception - converting Query Response to JSON Object
I am using it. But the timestamp having ':' in between causes the issue. Please help.

On Tue, Jan 7, 2014 at 11:46 AM, Ahmet Arslan iori...@yahoo.com wrote: Hi Amit, if you want a JSON response, why don't you use wt=json? Ahmet

On Tuesday, January 7, 2014 7:34 AM, Amit Jha shanuu@gmail.com wrote: Hi, we have an index where a date field has the default value 'NOW'. We are using SolrJ to query Solr, and when we try to convert the query response (response.getResponse) to a JSON object in Java, the JSON API (org.json) throws an 'invalid JSON string' exception. The API says so because the date field value, i.e. yyyy-MM-ddTHH:mm:ssZ, is not surrounded by double quotes (""). So it says a ',' or '}' character is required when the API sees the colon. Could you please help me retrieve the date field value as a string in the JSON response? Or any pointers. Any help would be highly appreciated.

On Tue, Jan 7, 2014 at 12:28 AM, Amit Jha shanuu@gmail.com wrote: Hi, Wish You All a Very Happy New Year. We have an index where a date field has the default value 'NOW'. We are using SolrJ to query Solr, and when we try to convert the query response (response.getResponse) to a JSON object in Java, the JSON API (org.json) throws an 'invalid JSON string' exception. The API says so because the date field value, i.e. yyyy-MM-ddTHH:mm:ssZ, is not surrounded by double quotes (""). So it says a ',' or '}' character is required when the API sees the colon. Could you please help me retrieve the date field value as a string in the JSON response? Or any pointers. Any help would be highly appreciated.
Re: DateField - Invalid JSON String Exception - converting Query Response to JSON Object
Hey Hoss, Thanks for replying back. Here is the response generated by SolrJ (ignore the braces; I copied it from a bigger chunk):

*SolrJ Response*:
{responseHeader={status=0,QTime=0,params={lowercaseOperators=true,sort=score desc,cache=false,qf=content,wt=javabin,rows=100,defType=edismax,version=2,fl=*,score,start=0,q=White+Paper,stopwords=true,fq=type:White Paper}},response={numFound=9,start=0,maxScore=0.61586785,docs=[SolrDocument{id=007, type=White Paper, source=Documents, title=White Paper 003, body=White Paper 004 Body, author=[Author 3], keywords=[Keyword 3], description=Vivamus turpis eros, mime_type=pdf, _version_=1456609602022932480, *publication_date=Wed Jan 08 03:16:06 IST 2014*, score=0.61586785}]}

Please note the publication_date value. Whenever I enable stored=true for this field I get the error:

*org.json.JSONException: Expected a ',' or '}' at 853 [character 854 line 1]*

*Solr Query String*:
q=%22White%2BPaper%22&qf=content&start=0&rows=100&sort=score+desc&defType=edismax&stopwords=true&lowercaseOperators=true&wt=json&cache=false&fl=*%2Cscore&fq=type%3A%22White+Paper%22

Hope this may help you to answer.

On Tue, Jan 7, 2014 at 10:29 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: We have index where date field have default value as 'NOW'. We are using : solrj to query solr and when we try to convert query : response(response.getResponse) to JSON object in java. The JSON

You're going to have to show us some real code, some real data, and a real error exception that you are getting -- because it's not at all clear what you are trying to do, or why you would get an error about invalid JSON. If you generate a JSON response from Solr, you'll get properly quoted strings for the dates...

$ curl 'http://localhost:8983/solr/collection1/query?q=SOLR&fl=*_dt'
{
  "responseHeader":{
    "status":0,
    "QTime":8,
    "params":{
      "fl":"*_dt",
      "q":"SOLR"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "incubationdate_dt":"2006-01-17T00:00:00Z"}]
  }}

...but it appears you are trying to *generate* JSON yourself, using the Java objects you get back from a parsed SolrJ response -- so I'm not sure where you would be getting an error about invalid JSON, unless you were doing something invalid in the code you are writing to create that JSON. -Hoss http://www.lucidworks.com/
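To spell out the likely root cause: response.getResponse() is a NamedList, and its toString() output (which is what the pasted "response" looks like) is Java's map formatting, not JSON, so org.json fails at the first unquoted value containing a colon — the date. Either request wt=json from Solr directly, or, if you must build JSON from SolrJ objects, format and quote Date values yourself. A minimal sketch of the formatting half (the field name is illustrative):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch: render a java.util.Date the way Solr's JSON writer does
// (ISO-8601, UTC, quoted), so a hand-built JSON string stays valid.
public class DateJson {
    static String toSolrDate(Date d) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(d);
    }

    static String jsonField(String name, Date d) {
        // Quoting the value is exactly what org.json was complaining about
        return "\"" + name + "\":\"" + toSolrDate(d) + "\"";
    }

    public static void main(String[] args) {
        // Epoch zero renders as 1970-01-01T00:00:00Z in UTC
        System.out.println(jsonField("publication_date", new Date(0L)));
    }
}
```

In practice, letting Solr produce the JSON (wt=json) is simpler and avoids hand-escaping every field type.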
DateField - Invalid JSON String Exception - converting Query Response to JSON Object
Hi, Wish You All a Very Happy New Year. We have an index where a date field has the default value 'NOW'. We are using SolrJ to query Solr, and when we try to convert the query response (response.getResponse) to a JSON object in Java, the JSON API (org.json) throws an 'invalid JSON string' exception. The API says so because the date field value, i.e. yyyy-MM-ddTHH:mm:ssZ, is not surrounded by double quotes (""). So it says a ',' or '}' character is required when the API sees the colon. Could you please help me retrieve the date field value as a string in the JSON response? Or any pointers. Any help would be highly appreciated.
Re: DateField - Invalid JSON String Exception - converting Query Response to JSON Object
Hi, We have an index where a date field has the default value 'NOW'. We are using SolrJ to query Solr, and when we try to convert the query response (response.getResponse) to a JSON object in Java, the JSON API (org.json) throws an 'invalid JSON string' exception. The API says so because the date field value, i.e. yyyy-MM-ddTHH:mm:ssZ, is not surrounded by double quotes (""). So it says a ',' or '}' character is required when the API sees the colon. Could you please help me retrieve the date field value as a string in the JSON response? Or any pointers. Any help would be highly appreciated.

On Tue, Jan 7, 2014 at 12:28 AM, Amit Jha shanuu@gmail.com wrote: Hi, Wish You All a Very Happy New Year. We have an index where a date field has the default value 'NOW'. We are using SolrJ to query Solr, and when we try to convert the query response (response.getResponse) to a JSON object in Java, the JSON API (org.json) throws an 'invalid JSON string' exception. The API says so because the date field value, i.e. yyyy-MM-ddTHH:mm:ssZ, is not surrounded by double quotes (""). So it says a ',' or '}' character is required when the API sees the colon. Could you please help me retrieve the date field value as a string in the JSON response? Or any pointers. Any help would be highly appreciated.
Re: /select with 'q' parameter does not work
Because in your solrconfig, against /select, DirectUpdateHandler2 is mentioned. It should be solr.SearchHandler.

On 11-Dec-2013 3:11 PM, Nutan nutanshinde1...@gmail.com wrote:

I have indexed 9 docs. This is my *schema.xml*:

<schema name="documents">
  <fields>
    <field name="doc_id" type="uuid" indexed="true" stored="true" default="NEW" multiValued="false"/>
    <field name="id" type="integer" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="contents" type="text" indexed="true" stored="true" multiValued="false"/>
    <field name="author" type="title_text" indexed="true" stored="true" multiValued="true"/>
    <field name="title" type="title_text" indexed="true" stored="true"/>
    <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
    <copyField source="id" dest="text"/>
    <dynamicField name="ignored_*" type="text" indexed="false" stored="false" multiValued="true"/>
    <field name="description_ngram" type="text_ngram" indexed="true" stored="false"/>
    <copyField source="contents" dest="description_ngram"/>
  </fields>
  <types>
    <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
      </analyzer>
    </fieldType>
    <fieldType name="uuid" class="solr.UUIDField" indexed="true"/>
    <fieldType name="ignored" stored="false" indexed="false" class="solr.StrField"/>
    <fieldType name="integer" class="solr.IntField" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.LongField"/>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="title_text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnCaseChange="1" generateNumberParts="1" splitOnNumerics="1"/>
        <filter class="solr.StemmerOverrideFilterFactory" dictionary="my_stemmer.txt"/>
        <filter class="solr.SnowballPorterFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnCaseChange="1" generateNumberParts="1" splitOnNumerics="1"/>
        <filter class="solr.StemmerOverrideFilterFactory" dictionary="my_stemmer.txt"/>
        <filter class="solr.SnowballPorterFilterFactory"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <defaultSearchField>contents</defaultSearchField>
  <uniqueKey>id</uniqueKey>
</schema>

This is my *solrconfig.xml*:

<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <luceneMatchVersion>LUCENE_42</luceneMatchVersion>
  <dataDir>${solr.document.data.dir:}</dataDir>
  <requestDispatcher handleSelect="false">
    <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="8500"/>
  </requestDispatcher>
  <lib dir="../lib" regex=".*\.jar"/>
  <requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">20</int>
      <str name="fl">*</str>
      <str name="df">id</str>
      <str name="version">2.1</str>
    </lst>
  </requestHandler>
  <updateHandler name="/select" class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.document.data.dir:}</str>
    </updateLog>
  </updateHandler>
  <requestHandler name="/analysis/field" startup="lazy" class="solr.FieldAnalysisRequestHandler"/>
  <requestHandler name="/admin/" class="solr.admin.AdminHandlers"/>
  <requestHandler name="/update" class="solr.UpdateRequestHandler"/>
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="df">contents</str>
    </lst>
  </requestHandler>
</config>

(I have also added extract, analysis, elevator, promotion, spell, and suggester components in solrconfig, but I guess those won't affect the select query.) When I run this: http://localhost:8080/solr/document/select?q=*:* -- all 9 docs are returned. But when I run this: http://localhost:8080/solr/document/select?q=programmer (or anything in place of programmer) -- the output shows numFound=0, even though programmer appears about 34 times in the docs. Initially it worked fine, but not now. Why is it so?

-- View this message in context: http://lucene.472066.n3.nabble.com/select-with-q-parameter-does-not-work-tp4106099.html Sent from the Solr - User mailing list archive at Nabble.com.
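Concretely, the config quoted in this thread registers an updateHandler under the name "/select", which collides with the /select SearchHandler mapping; the updateHandler element is not a request handler and should not carry a request path at all. A cleaned-up sketch of the two relevant sections (defaults copied from the thread, not verified against a live setup):

```
<!-- The update handler is not a request handler: no name="/select" here -->
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.document.data.dir:}</str>
  </updateLog>
</updateHandler>

<!-- /select stays mapped to the search handler -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">contents</str>
  </lst>
</requestHandler>
```

If numFound=0 persists after this, the analysis chain of the contents field (or the df default) would be the next thing to check, e.g. via the Analysis screen in the admin UI.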
Re: /select with 'q' parameter does not work
When you start Solr, do you find any error or exception? java -jar start.jar ... then see if there is any problem. Otherwise take the stock Solr solrconfig.xml and try to run; it should run.

On 11-Dec-2013 5:41 PM, Nutan nutanshinde1...@gmail.com wrote:

<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">20</int>
    <str name="fl">*</str>
    <str name="df">id</str>
    <str name="version">2.1</str>
  </lst>
</requestHandler>
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.document.data.dir:}</str>
  </updateLog>
</updateHandler>
<requestHandler name="/update" class="solr.UpdateRequestHandler"/>
<requestHandler name="/analysis/field" startup="lazy" class="solr.FieldAnalysisRequestHandler"/>
<requestHandler name="/admin/" class="solr.admin.AdminHandlers"/>
<requestHandler name="/update" class="solr.UpdateRequestHandler"/>
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">contents</str>
  </lst>
</requestHandler>

I made changes in this new solrconfig.xml, but still the query does not work.

-- View this message in context: http://lucene.472066.n3.nabble.com/select-with-q-parameter-does-not-work-tp4106099p4106133.html Sent from the Solr - User mailing list archive at Nabble.com.
Please help me to understand debugQuery output
Hello All, Can anyone help me understand debugQuery output like this?

<lst name="explain">
<str>
0.6276088 = (MATCH) sum of:
  0.6276088 = (MATCH) max of:
    0.18323982 = (MATCH) sum of:
      0.18323982 = (MATCH) weight(state_search:a in 327) [DefaultSimilarity], result of:
        0.18323982 = score(doc=327,freq=2.0 = termFreq=2.0), product of:
          0.3188151 = queryWeight, product of:
            3.2512918 = idf(docFreq=35, maxDocs=342)
            0.098057985 = queryNorm
          0.5747526 = fieldWeight in 327, product of:
            1.4142135 = tf(freq=2.0), with freq of: 2.0 = termFreq=2.0
            3.2512918 = idf(docFreq=35, maxDocs=342)
            0.125 = fieldNorm(doc=327)
    0.2505932 = (MATCH) sum of:
      0.2505932 = (MATCH) weight(country_search:a in 327) [DefaultSimilarity], result of:
        0.2505932 = score(doc=327,freq=1.0 = termFreq=1.0), product of:
          0.3135134 = queryWeight, product of:
            3.1972246 = idf(docFreq=37, maxDocs=342)
            0.098057985 = queryNorm
          0.79930615 = fieldWeight in 327, product of:
            1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
            3.1972246 = idf(docFreq=37, maxDocs=342)
            0.25 = fieldNorm(doc=327)
    0.25283098 = (MATCH) sum of:
      0.25283098 = (MATCH) weight(area_search:a in 327) [DefaultSimilarity], result of:
        0.25283098 = score(doc=327,freq=1.0 = termFreq=1.0), product of:
          0.398 = queryWeight, product of:
            4.06 = idf(docFreq=15, maxDocs=342)
            0.098057985 = queryNorm
          0.6347222 = fieldWeight in 327, product of:
            1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
            4.06 = idf(docFreq=15, maxDocs=342)
            0.15625 = fieldNorm(doc=327)
    0.6276088 = (MATCH) sum of:
      0.12957011 = (MATCH) weight(city_search:a in 327) [DefaultSimilarity], result of:
        0.12957011 = score(doc=327,freq=1.0 = termFreq=1.0), product of:
          0.3188151 = queryWeight, product of:
            3.2512918 = idf(docFreq=35, maxDocs=342)
            0.098057985 = queryNorm
          0.40641147 = fieldWeight in 327, product of:
            1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
            3.2512918 = idf(docFreq=35, maxDocs=342)
            0.125 = fieldNorm(doc=327)
      0.3638727 = (MATCH) weight(city_search:ab in 327) [DefaultSimilarity], result of:
        0.3638727 = score(doc=327,freq=1.0 = termFreq=1.0), product of:
          0.5342705 = queryWeight, product of:
            5.4485164 = idf(docFreq=3, maxDocs=342)
            0.098057985 = queryNorm
          0.68106455 = fieldWeight in 327, product of:
            1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
            5.4485164 = idf(docFreq=3, maxDocs=342)
            0.125 = fieldNorm(doc=327)
      0.13416591 = (MATCH) weight(city_search:b in 327) [DefaultSimilarity], result of:
        0.13416591 = score(doc=327,freq=1.0 = termFreq=1.0), product of:
          0.32441998 = queryWeight, product of:
            3.3084502 = idf(docFreq=33, maxDocs=342)
            0.098057985 = queryNorm
          0.41355628 = fieldWeight in 327, product of:
            1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
            3.3084502 = idf(docFreq=33, maxDocs=342)
            0.125 = fieldNorm(doc=327)
</str>
</lst>

Any links where this explanation format is documented? Thanks -- Amit Aggarwal 8095552012
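Each leaf of a debugQuery explanation like the one above is Lucene's TF-IDF (DefaultSimilarity): score = queryWeight × fieldWeight, with queryWeight = idf × queryNorm and fieldWeight = tf × idf × fieldNorm, where tf = sqrt(termFreq). The first leaf (state_search:a, doc 327) can be recomputed by hand from the printed factors:

```java
// Sketch: recompute the first explain leaf (state_search:a, doc 327)
// from the factors printed by debugQuery (DefaultSimilarity TF-IDF).
public class ExplainLeaf {
    static double leafScore(double freq, double idf,
                            double queryNorm, double fieldNorm) {
        double tf = Math.sqrt(freq);                 // tf(freq=2.0) = 1.4142135
        double queryWeight = idf * queryNorm;        // 0.3188151
        double fieldWeight = tf * idf * fieldNorm;   // 0.5747526
        return queryWeight * fieldWeight;            // ~0.18323982
    }

    public static void main(String[] args) {
        // idf(docFreq=35, maxDocs=342) = 3.2512918; fieldNorm(doc=327) = 0.125
        System.out.println(leafScore(2.0, 3.2512918, 0.098057985, 0.125));
    }
}
```

The "max of:" node comes from dismax-style scoring (the best-matching field wins, here the city_search sum at 0.6276088), and that maximum is what becomes the document score.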
Re: Can I use boosting fields with edismax ?
Ok Erick, I will try. Thanks.

On 25-Nov-2013 2:46 AM, Erick Erickson erickerick...@gmail.com wrote: This should work. Try adding debug=all to your URL, and examine the output both with and without your boosting. I believe you'll see the difference in the score calculations. From there it's a matter of adjusting the boosts to get the results you want. Best, Erick

On Sat, Nov 23, 2013 at 9:17 AM, Amit Aggarwal amit.aggarwa...@gmail.com wrote: Hello All, I am using defType=edismax. So will boosting work like this in solrconfig.xml?

<str name="qf">value_search^2.0 desc_search country_search^1.5 state_search^2.0 city_search^2.5 area_search^3.0</str>

I think it is not working. If yes, then what should I do?
Can I use boosting fields with edismax ?
Hello All, I am using defType=edismax. So will boosting work like this in solrconfig.xml?

<str name="qf">value_search^2.0 desc_search country_search^1.5 state_search^2.0 city_search^2.5 area_search^3.0</str>

I think it is not working. If yes, then what should I do?
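Per-field ^boosts in qf of this form are supported by edismax; one way to check whether they are taking effect is to pass the same qf at request time with debug=all and compare the score explanations. A small sketch of assembling such a request URL (the field names are the ones from the question; the encoding helper is illustrative):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch: build an edismax request with per-field qf boosts,
// equivalent to the <str name="qf"> entry in solrconfig.xml.
public class EdismaxUrl {
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }

    static String buildQuery(String q, String qf) {
        return "q=" + enc(q)
             + "&defType=edismax"
             + "&qf=" + enc(qf)
             + "&debug=all"; // show per-field score contributions
    }

    public static void main(String[] args) {
        String qf = "value_search^2.0 desc_search country_search^1.5 "
                  + "state_search^2.0 city_search^2.5 area_search^3.0";
        System.out.println(buildQuery("delhi", qf));
    }
}
```

If the request-time qf changes the ranking but the solrconfig.xml version does not, the defaults block carrying that <str name="qf"> is probably not attached to the handler actually being queried.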
Re: Boosting documents by categorical preferences
I thought about that, but my concern/question was how. If I used the pow function then I'm still boosting the bad categories by a small amount... alternatively I could multiply by a negative number, but does that work as expected? I haven't done much with negative boosting except for the sledgehammer approach of category exclusion through filters. Thanks Amit

On Nov 19, 2013 8:51 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

: My approach was something like: : 1) Look at the categories that the user has preferred and compute the : z-score : 2) Pick the top 3 among those : 3) Use those to boost search results.

I think that totally makes sense ... the additional bit I was suggesting that you consider is that instead of picking the highest 3 z-scores, pick the z-scores with the greatest absolute value ... that way, if someone is a very boring person and their positive interests are all basically exactly the same as the mean for everyone else, but they have some very strong dis-interests, you don't bother boosting on those minuscule interests and instead you negatively boost on the things they are antagonistic against. -Hoss
Re: How to get score with getDocList method Solr API
Hello Shekhar, Thanks for answering. Do I have to set the GET_SCORES flag as the last parameter of the getDocList method? Thanks

On 19-Nov-2013 1:43 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: A few flags are supported:

public static final int GET_DOCSET = 0x4000;
public static final int TERMINATE_EARLY = 0x04;
public static final int GET_DOCLIST = 0x02; // get the documents actually returned in a response
public static final int GET_SCORES = 0x01;

Use the GET_SCORES flag to get the score with each document.

On Tue, Nov 19, 2013 at 8:08 AM, Amit Aggarwal amit.aggarwa...@gmail.com wrote: Hello All, I am trying to develop a custom request handler. Here is the snippet:

// returnMe is nothing but a list of Documents going to be returned
try {
    // FLAG ???
    DocList docList = searcher.getDocList(parsedQuery, parsedFilterQueryList,
                                          Sort.RELEVANCE, 1, maxDocs, FLAG);
    // Now get a DocIterator
    DocIterator it = docList.iterator();
    // Now for each id, get the doc and add it to the list of Documents
    while (it.hasNext()) {
        returnMe.add(searcher.doc(it.next()));
    }
}

Ques 1 - My question is, what does FLAG represent in the getDocList method? Ques 2 - How can I ensure that the searcher.getDocList method gives me the score along with each document? -- Amit Aggarwal 8095552012 -- Regards, Shalin Shekhar Mangar.
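To answer the follow-up concretely: the last argument of getDocList is a bitmask, and the flags Shalin lists are combined with bitwise OR. A self-contained sketch (the constant values are copied from the reply above; in real code you would reference SolrIndexSearcher.GET_DOCLIST / GET_SCORES directly):

```java
// Sketch: the getDocList() flags form a bitmask, combined with bitwise OR.
// Values copied from the quoted reply; use SolrIndexSearcher's constants
// in real code rather than redefining them.
public class Flags {
    static final int GET_SCORES = 0x01;
    static final int GET_DOCLIST = 0x02;
    static final int TERMINATE_EARLY = 0x04;
    static final int GET_DOCSET = 0x4000;

    public static void main(String[] args) {
        int flags = GET_DOCLIST | GET_SCORES; // ask for docs *and* scores
        System.out.println(flags);            // prints 3
        // searcher.getDocList(query, filters, Sort.RELEVANCE, 0, maxDocs, flags);
        // ...then DocIterator.score() is populated for each hit.
    }
}
```

With GET_SCORES set, the DocIterator obtained from the returned DocList exposes the score for the current document via its score() method, alongside the docid from next().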
How to get score with getDocList method Solr API
Hello All, I am trying to develop a custom request handler. Here is the snippet:

// returnMe is nothing but a list of Documents going to be returned
try {
    // FLAG ???
    DocList docList = searcher.getDocList(parsedQuery, parsedFilterQueryList,
                                          Sort.RELEVANCE, 1, maxDocs, FLAG);
    // Now get a DocIterator
    DocIterator it = docList.iterator();
    // Now for each id, get the doc and add it to the list of Documents
    while (it.hasNext()) {
        returnMe.add(searcher.doc(it.next()));
    }
}

Ques 1 - My question is, what does FLAG represent in the getDocList method? Ques 2 - How can I ensure that the searcher.getDocList method gives me the score along with each document? -- Amit Aggarwal 8095552012
Re: Boosting documents by categorical preferences
Hey Chris, Sorry for the delay, and thanks for your response. This was inspired by your talk on boosting and biasing that you presented way back when at a meetup. I'm glad that my general approach seems to make sense. My approach was something like: 1) Look at the categories that the user has preferred and compute the z-score 2) Pick the top 3 among those 3) Use those to boost search results. I'll look at using the boosts as an exponent instead of a multiplier, as I think that would make sense... also as it handles the 0 case. This is for a prototype I am doing, but I'll share the results one day in a meetup as I think it'll be kind of interesting. Thanks again Amit

On Thu, Nov 14, 2013 at 11:11 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

: I have a question around boosting. I wanted to use the boost= to write a : nested query that will boost a document based on categorical preferences.

You have no idea how stoked I am to see you working on this in a real world application.

: Currently I have the weights set to the z-score equivalent of a user's : preference for that category, which is simply how many standard deviations : above the global average is this user's preference for that movie category. : : My question though is basically whether or not semantically the equation : query(category:Drama)*some weight + query(category:Comedy)*some weight : + query(category:Action)*some weight makes sense?

My gut says that your approach makes sense -- but if I'm understanding you correctly, I think that you need to add 1 to all your weights: the boost is a multiplier, so if someone's rating for every category is 0 std devs above the average rating (ie: the most average person imaginable), you don't want to give every movie in every category a score of 0. Are you picking the top 3 categories the user prefers as a cut-off, or are you arbitrarily using N category boosts for however many N categories the user is above the global average in their pref for that category? Are your preferences coming from explicit user feedback on the categories (ie: rate how much you like comedies on a scale of 1-5), or are you inferring it from user ratings of the movies themselves? (ie: rate this movie, which happens to be a scifi/action/comedy, on a scale of 1-5) ... because if it's the latter, you probably want to be careful to also normalize based on how many categories the movie is in. The other thing to consider is whether you want to include negative preferences (ie: weights less than 1) based on how many std devs the user's average is *below* the global average for a category ... in this case I *think* you'd want to divide the raw value by -1 to get a useful multiplier. Alternatively: you could experiment with using the weights as exponents instead of multipliers...

b=sum(pow(query($cat1),1.482),pow(query($cat2),0.1199),pow(query($cat3),1.448))

...that would simplify the math you'd have to worry about, both for the totally boring average user (x**0 = 1) and for the categories users hate (x**-5 = some positive fraction that will act as a penalty) ... but you'd definitely need to run some tests to see if it over-boosts as the std-dev variations get really high (might want to take a root first before using them as the exponent). -Hoss
Re: Why do people want to deploy to Tomcat?
Agreed with Doug On 12-Nov-2013 6:46 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote: As an aside, I think one reason people feel compelled to deviate from the distributed jetty distribution is because the folder is named example. I've had to explain to a few clients that this is a bit of a misnomer. The IT dept especially sees example and feels uncomfortable using that as a starting point for a jetty install. I wish it was called default or bin or something where it's more obviously the default jetty distribution of Solr. On Tue, Nov 12, 2013 at 7:06 AM, Roland Everaert reveatw...@gmail.com wrote: In my case, the first time I had to deploy and configure solr on tomcat (and jboss) it was a requirement to reuse as much as possible the application/web server already in place. The next deployment I also used tomcat, because I was used to deploying on tomcat and I don't know jetty at all. I could ask the same question with regard to jetty: why use/bundle (if not recommend) jetty with solr over other webserver solutions? Regards, Roland Everaert. On Tue, Nov 12, 2013 at 12:33 PM, Alvaro Cabrerizo topor...@gmail.com wrote: In my case, the selection of the servlet container has never been a hard requirement. I mean, some customers provide us a virtual machine configured with java/tomcat, others have a tomcat installed and want to share it with solr, others prefer jetty because their sysadmins are used to configuring it... At least in the projects I've been working on, the selection of the servlet engine has not been a key factor in the project's success. Regards. On Tue, Nov 12, 2013 at 12:11 PM, Andre Bois-Crettez andre.b...@kelkoo.comwrote: We are using Solr running on Tomcat.
I think the top reasons for us are: - we already have nagios monitoring plugins for tomcat that trace queries ok/error, http codes / response time etc in access logs, number of threads, jvm memory usage etc - start, stop, watchdogs, logs: we also use our standard tools for that - what about security filters? Is that possible with jetty? André On 11/12/2013 04:54 AM, Alexandre Rafalovitch wrote: Hello, I keep seeing here and on Stack Overflow people trying to deploy Solr to Tomcat. We don't usually ask why, just help where we can. But the question happens often enough that I am curious. What is the actual business case? Is it because Tomcat is well known? Is it because other apps are running under Tomcat and it is ops' requirement? Is it because Tomcat gives something - to Solr - that Jetty does not? It might be useful to know. Especially, since the Solr team is considering making the server part into a black box component. What use cases will that break? So, if somebody runs Solr under Tomcat (or needed to and gave up), let's use this thread to collect this knowledge. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) -- André Bois-Crettez Software Architect Search Developer http://www.kelkoo.com/ -- Doug Turnbull Search Big Data Architect OpenSource Connections http://o19s.com
Boosting documents by categorical preferences
Hi all, I have a question around boosting. I wanted to use the boost= to write a nested query that will boost a document based on categorical preferences. For a movie search for example, say that a user likes drama, comedy, and action. I could use things like qq=...&q={!boost%20b=$b%20defType=edismax%20v=$qq}&b=sum(product(query($cat1),1.482),product(query($cat2),0.1199),product(query($cat3),1.448))&cat1=category:Drama&cat2=category:Comedy&cat3=category:Action where cat1=Drama cat2=Comedy cat3=Action Currently I have the weights set to the z-score equivalent of a user's preference for that category, which is simply how many standard deviations above the global average is this user's preference for that movie category. My question though is basically whether or not semantically the equation query(category:Drama)*some weight + query(category:Comedy)*some weight + query(category:Action)*some weight makes sense? What are some techniques people use to boost documents based on discrete things like category, manufacturer, genre etc? Thanks! Amit
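For reference, the request above can be assembled programmatically. A Python sketch (the inner user query *:* is a placeholder, and the weights are the example z-scores from the mail):

```python
from urllib.parse import urlencode

# Assemble the nested-boost request described above. "qq" holds the
# inner user query (placeholder here); $b multiplies in the weighted
# category queries via the boost query parser.
params = {
    "qq": "*:*",
    "q": "{!boost b=$b defType=edismax v=$qq}",
    "b": "sum(product(query($cat1),1.482),"
         "product(query($cat2),0.1199),"
         "product(query($cat3),1.448))",
    "cat1": "category:Drama",
    "cat2": "category:Comedy",
    "cat3": "category:Action",
}
query_string = urlencode(params)  # URL-escapes the local-params syntax
print(query_string)
```

Appending this string to the select handler URL keeps the per-user weights out of the main query text, so only the cat/b parameters change per user.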
return value from SolrJ client to php
Hello All, I have a requirement where I have to connect to Solr using a SolrJ client, and documents returned by Solr to the SolrJ client have to be returned to PHP. I know it's simple to get documents from Solr to SolrJ, but how do I return documents from SolrJ to PHP? Thanks Amit Aggarwal
Re: When is/should qf different from pf?
Thanks Erick. Numeric fields make sense, as I guess would strictly string fields too since it's one term? In the normal text searching case though, does it make sense to have qf and pf differ? Thanks Amit On Oct 28, 2013 3:36 AM, Erick Erickson erickerick...@gmail.com wrote: The facetious answer is when phrases aren't important in the fields. If you're doing a simple boolean match, adding phrase fields will add expense to no good purpose, etc. Phrases on numeric fields seem wrong. FWIW, Erick On Mon, Oct 28, 2013 at 1:03 AM, Amit Nithian anith...@gmail.com wrote: Hi all, I have been using Solr for years but never really stopped to wonder: when using the dismax/edismax handler, when do you have the qf different from the pf? I have always set them to be the same (maybe different weights) but I was wondering if there is a situation where you would have a field in the qf not in the pf or vice versa. My understanding from the docs is that qf is a term-wise hard filter while pf is a phrase-wise boost of documents that made it past the qf filter. Thanks! Amit
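To make the distinction concrete, here is a sketch of request parameters where qf and pf deliberately differ (field names and weights are hypothetical):

```python
# qf matches individual terms; pf adds a phrase-proximity boost on top.
# A numeric field like "year" belongs in qf but is pointless in pf,
# since a single numeric term can never match as a phrase.
params = {
    "defType": "edismax",
    "q": "star wars",
    "qf": "title^2.0 plot^1.0 year^0.5",  # term-wise matching
    "pf": "title^4.0 plot^2.0",           # phrase boost: year omitted
}
qf_fields = {f.split("^")[0] for f in params["qf"].split()}
pf_fields = {f.split("^")[0] for f in params["pf"].split()}
print(qf_fields - pf_fields)  # fields matched but never phrase-boosted
```

Per Erick's point, dropping a field from pf simply skips the phrase-boost expense there while leaving term matching intact.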
Re: How to configure solr to our java project in eclipse
How do you start your other project? If it is maven or ant then you can use the antrun plugin to start solr. Otherwise you can write a small shell script to start solr .. On 27-Oct-2013 9:15 PM, giridhar girimc...@gmail.com wrote: Hi friends, I am giridhar. Please clarify my doubt. We are using solr for our project. The problem is the solr is outside of our project (in another folder); we have to manually type java -jar start.jar to start the solr and use its services. But what we need is, when we run the project, the solr should start automatically. Our project is a java project with tomcat in eclipse. How can i achieve this. Please help me. Thank you. Giridhar -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-configure-solr-to-our-java-project-in-eclipse-tp4097954.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr For
It depends. One core has one schema file and one solrconfig.xml. So if you want only one core, then put all the required fields for both searches in one schema file and carry out your searches. Otherwise make two cores with two schema files and perform the searches accordingly ... On 27-Oct-2013 7:22 AM, Baskar Sikkayan baskar@gmail.com wrote: Hi, Looking for solr config for a job site. In a job site there are 2 main searches: 1) Employee can search for a job (based on skill set, job location, title, salary) 2) Employer can search for employees (based on skill set, exp, location) Should i have a separate config xml for both searches? Thanks, Baskar
Re: Stop solr service
Lol ... Unsubscribe from this mailing list . On 27-Oct-2013 5:02 PM, veena rani veenara...@gmail.com wrote: I want to stop the mail On Sun, Oct 27, 2013 at 4:37 PM, Rafał Kuć r@solr.pl wrote: Hello! Could you please write more about what you want to do? Do you need to stop running Solr process. If yes what you need to do is stop the container (Jetty/Tomcat) that Solr runs in. You can also kill JVM running Solr, however it will be usually enough to just stop the container. -- Regards, Rafał Kuć Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/ Hi Team, Pla stop the solr service. -- Regards, Veena Rani P N Banglore. 9538440458
Re: How to configure solr to our java project in eclipse
Try this: http://hokiesuns.blogspot.com/2010/01/setting-up-apache-solr-in-eclipse.html I use this today and it still works. If anything is outdated (as it's a relatively old post) let me know. I wrote this so ping me if you have any questions. Thanks Amit On Sun, Oct 27, 2013 at 7:33 PM, Amit Aggarwal amit.aggarwa...@gmail.comwrote: How so you start your another project ? If it is maven or ant then you can use anturn plugin to start solr . Otherwise you can write a small shell script to start solr .. On 27-Oct-2013 9:15 PM, giridhar girimc...@gmail.com wrote: Hi friends,Iam giridhar.please clarify my doubt. we are using solr for our project.the problem the solr is outside of our project( in another folder) we have to manually type java -start.jar to start the solr and use that services. But what we need is,when we run the project,the solr should be automatically start. our project is a java project with tomcat in eclipse. How can i achieve this. Please help me. Thankyou. Giridhar -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-configure-solr-to-our-java-project-in-eclipse-tp4097954.html Sent from the Solr - User mailing list archive at Nabble.com.
When is/should qf different from pf?
Hi all, I have been using Solr for years but never really stopped to wonder: When using the dismax/edismax handler, when do you have the qf different from the pf? I have always set them to be the same (maybe different weights) but I was wondering if there is a situation where you would have a field in the qf not in the pf or vice versa. My understanding from the docs is that qf is a term-wise hard filter while pf is a phrase-wise boost of documents who made it past the qf filter. Thanks! Amit
Please explain SolrConfig.xml in terms of Solr APIs (Java Pseudo Code)
Hello All, Can someone explain the following snippet of SolrConfig.xml in terms of the Solr API (Java pseudo code) for better understanding?

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">BLAH</str>
  </updateLog>
</updateHandler>

Here I want to know: 1. What is updateHandler? Is it some package, class or interface? 2. What is solr.DirectUpdateHandler2? Is it a class? 3. What is updateLog? Is it a package? 4. How do we know that updateLog has a sub-element dir? 5. How do we know that updateLog would be a sub-element of updateHandler? Is updateLog some kind of subclass of something else? I KNOW that all these things are given in SolrConfig.xml but I do not want to cram those things. One example is jetty.xml: whatever we write there can be translated to Java pseudo code.
Re: Please explain SolrConfig.xml in terms of Solr APIs (Java Pseudo Code)
Yeah, you caught it right. Yes, it was a kind of DTD I was after. Anyways thanks a lot for clearing my doubt .. SOLVED. On 25-Oct-2013 6:34 PM, Daniel Collins danwcoll...@gmail.com wrote: I think what you are looking for is some kind of DTD/schema you can use to see all the possible parameters in SolrConfig.xml; short answer, there isn't one (currently) :( jetty.xml has a DTD schema, and its XMLConfiguration format is inherently designed to convert to code, so the list of possible options can be generated by Java reflection, but Solr's isn't quite that advanced. Generally speaking the config is described in http://wiki.apache.org/solr/SolrConfigXml. However, that is (by the nature of manually generated documentation) a bit out of date, so things like the updateLog aren't referenced there. There is no schema or DTD for SolrConfig; the best place to look for what the various options are is either the sample config, which is generally quite good, or the code (org.apache.solr.core.SolrConfig.java). At the end of the day updateLog is just the name of a config parameter; it is grouped under updateHandler since it relates to that. How we know such a parameter exists: 1) it was in the sample config (and commented to indicate what it means) 2) it's referenced in the code if you look through that On 25 October 2013 13:06, Alexandre Rafalovitch arafa...@gmail.com wrote: I think better understanding is a bit too vague. Is there a specific problem you have? Your Jetty example would make sense if, for example, your goal was to automatically generate solrconfig.xml from some other configuration. But even then, you would probably use fillable templates and wouldn't need a fully corresponding Java API. For example, you are unlikely to edit the very line you are asking about, it's a little too esoteric: <updateHandler class="solr.DirectUpdateHandler2"> Perhaps what you want to do is to look at the smallest possible solrconfig.xml and then expand from there by looking at additional options.
Regarding specific options available, most are documented on the Wiki and in the comments of the sample file. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: Committing when indexing in parallel
Hi, As per my knowledge, any number of requests can be issued in parallel to index documents. Any commit request will write them to the index. So if P1 issues a commit, then all documents of P2 that are eligible get committed, and the remaining documents will get committed on another commit request. Rgds AJ On 14-Sep-2013, at 2:51, Phani Chaitanya pvempaty@gmail.com wrote: I'm wondering what happens to commit while we are indexing in parallel in Solr. Are the indexing update requests blocked until the commit finishes? Let's say I've a process P1 which issued a commit request and there is another process P2 which is still indexing to the same index. What happens to the index in that scenario? Are the P2 indexing requests blocked until the P1 commit request finishes? I'm just wondering about what is the behavior of Solr in the above case. - Phani Chaitanya -- View this message in context: http://lucene.472066.n3.nabble.com/Committing-when-indexing-in-parallel-tp4089953.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: MySQL Data import handler
Hi Baskar, Just create a single schema.xml which should contain the required fields from the 3 tables. Add a status column to the child table, i.e. 1 = add, 2 = update, 3 = delete, 4 = indexed, etc. Write a program using solrj which will read the status and act accordingly. Rgds AJ On 15-Sep-2013, at 5:46, Baskar Sikkayan baskar@gmail.com wrote: Hi, If i am supposed to go with the Java client, should i still do any configuration in solrconfig.xml or schema.xml? Thanks, Baskar.S On Sat, Sep 14, 2013 at 8:46 PM, Gora Mohanty g...@mimirtech.com wrote: On 14 September 2013 20:07, Baskar Sikkayan baskar@gmail.com wrote: Hi Gora, Thanks a lot for your reply. My requirement is to combine 3 tables in mysql for a search operation, and I am planning to sync these 3 tables (not all the columns) into Apache Solr. Whenever there is any change (adding a new row, deleting a row, modifying the column data (any column in the 3 tables)), the same has to be updated in solr. Guess, for this requirement, instead of going with delta-import, the Apache Solr java client will be useful. [...] Yes, if you are comfortable with programming in Java, the Solr client would be a good alternative, though the DataImportHandler can also do what you want. Regards, Gora
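The status-column workflow AJ describes can be sketched in a few lines; this is plain Python with hypothetical solr_add/solr_delete callbacks standing in for the SolrJ add/deleteById calls:

```python
# Poll the child table, act on each row's status flag, then mark it
# indexed so the next poll skips it. solr_add / solr_delete are
# hypothetical stand-ins for the SolrJ calls.
ADD, UPDATE, DELETE, INDEXED = 1, 2, 3, 4

def sync(rows, solr_add, solr_delete):
    for row in rows:
        if row["status"] in (ADD, UPDATE):
            solr_add(row)            # add and update are both an upsert
        elif row["status"] == DELETE:
            solr_delete(row["id"])
        row["status"] = INDEXED      # mark processed
    return rows

added, deleted = [], []
sync([{"id": 1, "status": ADD}, {"id": 2, "status": DELETE}],
     added.append, deleted.append)
print(deleted)  # ids removed from the index: [2]
```

In the real program, marking a row INDEXED would be an UPDATE back to MySQL after a successful Solr commit, so a crash mid-sync just re-processes the batch.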
Re: Solr Java Client
Add a field called source in schema.xml and its value would be your table names. Rgds AJ On 15-Sep-2013, at 5:38, Baskar Sikkayan baskar@gmail.com wrote: Hi, I am new to Solr and trying to use the Solr java client instead of using the Data handler. Is there any configuration i need to do for this? I got the following sample code: SolrInputDocument doc = new SolrInputDocument(); doc.addField("cat", "book"); doc.addField("id", "book-" + i); doc.addField("name", "The Legend of the Hobbit part " + i); server.add(doc); server.commit(); // periodically flush I am confused here. I am going to index 3 different tables for 3 different kinds of searches. Here I don't have any option to differentiate the 3 kinds of indexes. Am i missing anything here? Could anyone please shed some light here? Thanks, Baskar.S
Re: Solr Java Client
Question is not clear to me. Please be more elaborate in your query. Why do you want to store the index in DB tables? Rgds AJ On 15-Sep-2013, at 7:20, Baskar Sikkayan baskar@gmail.com wrote: How to add index to 3 diff tables from java ... On Sun, Sep 15, 2013 at 6:49 AM, Amit Jha shanuu@gmail.com wrote: Add a field called source in schema.xml and value would be your table names. Rgds AJ
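A sketch of the single-core source-field approach from this thread (plain Python; the documents and field names are invented for illustration):

```python
# Every document carries a "source" field naming the table it came from;
# each search then filters on that field (fq=source:<table> in Solr).
docs = [
    {"id": "u1", "source": "users", "name": "Baskar"},
    {"id": "j1", "source": "jobs", "title": "Search Engineer"},
    {"id": "j2", "source": "jobs", "title": "Solr Consultant"},
]

def search(docs, source):
    # in-memory equivalent of adding fq=source:<table> to the query
    return [d for d in docs if d["source"] == source]

print([d["id"] for d in search(docs, "jobs")])  # ['j1', 'j2']
```

The filter keeps the three logical indexes separate inside one core; the alternative, as noted above, is one core per table.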
Re: Combining Solr score with customized user ratings for a document
You can use a DB for storing user preferences, and later if you want you can flush them to solr as an update along with the userid. Or you may add a result pipeline filter. Rgds AJ On 13-Feb-2013, at 17:50, Á_o chachime...@yahoo.es wrote: Hi: I am working on a project where we want to recommend our users products based on their previous 'likes', purchases and so on (typical stuff of a recommender system), while we want to let them browse the catalogue freely by search queries, making use of facets, more-like-this and so on (typical stuff of a Solr index). After reading here and there, I have reached the conclusion that it's better to keep the Solr index apart from the database. Solr is for products (which can be reindexed from the DB as a nightly batch) while the DB is for everything else, including -the products and- user profiles. So, given a user and a particular search (which can be as simple as q=*), on one hand we have Solr results (i.e. docs + scores) for the query, while on the other we have user predicted ratings (i.e. recommender scores) coming from the DB (though they could be cached elsewhere) for each of the products returned by Solr. And what I want is clear -to state-: combine both scores (e.g. by a simple product) so the user receives a sorted list of relevant products biased by his/her preferences. I have been googling for the last few days without finding which is the best way to achieve this. I think it's not a matter of boosting, or at least I can't see which boosting method could be useful, as the boost should be user-based. I think that I need to extend -somewhere- Solr so I can alter the result scores by providing the user ID and connecting to the DB at query time, doing the necessary maths and returning the final score in a -quite- transparent way for the Web app.
A less elegant solution could be letting Solr do its work as usual, and then navigating through the XML, modifying the scores and reordering the whole list of products (or maybe just the first N results) by the new combined score. What do you think? A big THANKS in advance Álvaro -- View this message in context: http://lucene.472066.n3.nabble.com/Combining-Solr-score-with-customized-user-ratings-for-a-document-tp4040200.html Sent from the Solr - User mailing list archive at Nabble.com.
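The post-processing route Álvaro calls "less elegant" is simple to sketch, assuming the DB lookup yields a per-user rating multiplier per product id (all values here invented):

```python
# Multiply each Solr score by the user's predicted rating from the DB,
# then re-sort. Products with no rating keep their raw Solr score.
def rerank(solr_results, user_ratings, default=1.0):
    combined = [(doc_id, score * user_ratings.get(doc_id, default))
                for doc_id, score in solr_results]
    return sorted(combined, key=lambda pair: pair[1], reverse=True)

results = [("p1", 3.0), ("p2", 2.5)]   # (product id, Solr score)
ratings = {"p2": 2.0}                  # recommender strongly favors p2
print(rerank(results, ratings))        # p2 now outranks p1: 5.0 vs 3.0
```

Re-ranking only the first N results keeps this cheap, at the cost of ignoring preferences below the cut-off; doing it inside Solr instead requires a custom component with DB access at query time, as the original mail suggests.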
Re: More on topic of Meta-search/Federated Search with Solr
Hi, I would suggest the following: 1. Create custom search connectors for each individual source. 2. The connector will be responsible for querying the source of any type (web, gateways etc.), getting the results, and writing the top N results to Solr. 3. Query the same keyword in solr and display the result. Would you like to create something like http://knimbus.com ? Rgds AJ On 27-Aug-2013, at 2:28, Dan Davis dansm...@gmail.com wrote: One more question here - is this topic more appropriate to a different list? On Mon, Aug 26, 2013 at 4:38 PM, Dan Davis dansm...@gmail.com wrote: I have now come to the task of estimating man-days to add Blended Search Results to Apache Solr. The argument has been made that this is not desirable (see Jonathan Rochkind's blog entries on Bento search with blacklight). But the estimate remains. No estimate is worth much without a design. So, I have come to the difficulty of estimating this without having an in-depth knowledge of the Apache core. Here is my design, likely imperfect, as it stands. - Configure a core specific to each search source (local or remote) - On cores that index remote content, implement a periodic delete query that deletes documents whose timestamp is too old - Implement a custom requestHandler for the remote cores that goes out and queries the remote source. For each result in the top N (configurable), it computes an id that is stable (e.g. it is based on the remote resource URL, doi, or hash of data returned). It uses that id to look up the document in the lucene database. If the data is not there, it updates the lucene core and sets a flag that commit is required. Once it is done, it commits if needed. - Configure a core that uses a custom SearchComponent to call the requestHandler that goes and gets new documents and commits them. Since the cores for remote content are different cores, they can restart their searcher at this point if any commit is needed.
The custom SearchComponent will wait for commit and reload to be completed. Then, search continues using the other cores as shards. - Auto-warming on this will assure that the most recently requested data is present. It will, of course, be very slow a good part of the time. Erick and others, I need to know whether this design has legs and what other alternatives I might consider. On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson erickerick...@gmail.comwrote: The lack of global TF/IDF has been answered in the past, in the sharded case, by usually you have similar enough stats that it doesn't matter. This pre-supposes a fairly evenly distributed set of documents. But if you're talking about federated search across different types of documents, then what would you rescore with? How would you even consider scoring docs that are somewhat/totally different? Think magazine articles and meta-data associated with pictures. What I've usually found is that one can use grouping to show the top N of a variety of results. Or show tabs with different types. Or have the app intelligently combine the different types of documents in a way that makes sense. But I don't know how you'd just get the right thing to happen with some kind of scoring magic. Best Erick On Fri, Aug 16, 2013 at 4:07 PM, Dan Davis dansm...@gmail.com wrote: I've thought about it, and I have no time to really do a meta-search during evaluation. What I need to do is to create a single core that contains both of my data sets, and then describe the architecture that would be required to do blended results, with liberal estimates. From the perspective of evaluation, I need to understand whether any of the solutions to better ranking in the absence of global IDF have been explored? I suspect that one could retrieve a much larger than N set of results from a set of shards, and re-score in some way that doesn't require IDF, e.g. storing both results in the same priority queue and *re-scoring* before *re-ranking*.
The other way to do this would be to have a custom SearchHandler that works differently - it performs the query, retrieves all results deemed relevant by another engine, adds them to the Lucene index, and then performs the query again in the standard way. This would be quite slow, but perhaps useful as a way to evaluate my method. I still welcome any suggestions on how such a SearchHandler could be implemented.
Solr admin search with wildcard
I'm looking to search (in the solr admin search screen) a certain field for: *youtube* I know that leading wildcards take a lot of resources but I'm not worried about that. My only question is about the syntax; would this work: field:*youtube* ? Thanks, I'm using Solr 3.6.2
Re: Solr admin search with wildcard
The stored and indexed string is actually a url like "http://www.youtube.com/somethingsomething". It looks like removing the quotes does the job: iframe:*youtube* - or am I wrong? For now, performance is not an issue, but accuracy is, and I would like to know, for example, how many URLs have an iframe source leading to YouTube. So a query like iframe:*youtube* with max rows 10 or something will return in the response's numFound field the total number of pages that have an iframe tag with a source matching *youtube*, no? On Thu, Jun 27, 2013 at 3:24 PM, Jack Krupansky j...@basetechnology.comwrote: No, you cannot use wildcards within a quoted term. Tell us a little more about what your strings look like. You might want to consider tokenizing or using ngrams to avoid the need for wildcards. -- Jack Krupansky -Original Message- From: Amit Sela Sent: Thursday, June 27, 2013 3:33 AM To: solr-user@lucene.apache.org Subject: Solr admin search with wildcard I'm looking to search (in the solr admin search screen) a certain field for: *youtube* I know that leading wildcards takes a lot of resources but I'm not worried with that My only question is about the syntax, would this work: field:*youtube* ? Thanks, I'm using Solr 3.6.2
Re: Solr admin search with wildcard
Forgive my ignorance but I want to be sure, do I add <copyField source="iframe" dest="text"/> to solrindex-mapping.xml? So that my solrindex-mapping.xml looks like this:

<fields>
  <field dest="content" source="content"/>
  <field dest="title" source="title"/>
  <field dest="iframe" source="iframe"/>
  <field dest="host" source="host"/>
  <field dest="segment" source="segment"/>
  <field dest="boost" source="boost"/>
  <field dest="digest" source="digest"/>
  <field dest="tstamp" source="tstamp"/>
  <field dest="id" source="url"/>
  <copyField source="url" dest="url"/>
  <copyField source="iframe" dest="text"/>
</fields>
<uniqueKey>url</uniqueKey>

And what do you mean by standard tokenization? Thanks! On Thu, Jun 27, 2013 at 3:43 PM, Jack Krupansky j...@basetechnology.comwrote: Just copyField from the string field to a text field and use standard tokenization, then you can search the text field for youtube or even something that is a component of the URL path. No wildcard required. -- Jack Krupansky -Original Message- From: Amit Sela Sent: Thursday, June 27, 2013 8:37 AM To: solr-user@lucene.apache.org Subject: Re: Solr admin search with wildcard The stored and indexed string is actually a url like http://www.youtube.com/somethingsomething . It looks like removing the quotes does the job: iframe:*youtube* or am I wrong ? For now, performance is not an issue, but accuracy is and I would like to know for example how many URLS have iframe source leading to YouTube for example. So query like: iframe:*youtube* with max rows 10 or something will return in the response numFound field the total number of pages that have a tag iframe with a source matching *youtube, No ? On Thu, Jun 27, 2013 at 3:24 PM, Jack Krupansky j...@basetechnology.com wrote: No, you cannot use wildcards within a quoted term. Tell us a little more about what your strings look like. You might want to consider tokenizing or using ngrams to avoid the need for wildcards.
-- Jack Krupansky -Original Message- From: Amit Sela Sent: Thursday, June 27, 2013 3:33 AM To: solr-user@lucene.apache.org Subject: Solr admin search with wildcard I'm looking to search (in the solr admin search screen) a certain field for: *youtube* I know that leading wildcards takes a lot of resources but I'm not worried with that My only question is about the syntax, would this work: field:*youtube* ? Thanks, I'm using Solr 3.6.2
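Jack's copyField-plus-tokenization suggestion works because a standard tokenizer splits the URL on punctuation, so youtube becomes its own indexed term and a plain term query replaces the costly double wildcard. A rough Python approximation of that split:

```python
import re

# Rough stand-in for standard tokenization applied to a URL: split on
# non-alphanumerics and lowercase, so each URL component is its own term.
def tokenize(url):
    return [t for t in re.split(r"[^A-Za-z0-9]+", url.lower()) if t]

tokens = tokenize("http://www.youtube.com/somethingsomething")
print(tokens)  # ['http', 'www', 'youtube', 'com', 'somethingsomething']
```

A term query like text:youtube then matches directly, and numFound counts the matching pages with no wildcard scan.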
Re: Restaurant availability from database
Hossman did a presentation on something similar to this using spatial data at a Solr meetup some months ago. http://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/ May be helpful to you. On Thu, May 23, 2013 at 9:40 AM, rajh ron...@trimm.nl wrote: Thank you for your answer. Do you mean I should index the availability data as a document in Solr? Because the availability data in our databases is around 6,509,972 records and contains the availability per number of seats and per 15 minutes. I also tried this method, and as far as I know it's only possible to join the availability documents and not to include that information per result document. An example API response (created from the Solr response): { restaurants: [ { id: 13906, name: Allerlei, zipcode: 6511DP, house_number: 59, available: true }, { id: 13907, name: Voorbeeld, zipcode: 6512DP, house_number: 39, available: false } ], resultCount: 12156, resultCountAvailable: 55, } I'm currently hacking around the problem by executing the search again with a very high value for the rows parameter and counting the number of available restaurants on the backend, but this causes a big performance impact (as expected). -- View this message in context: http://lucene.472066.n3.nabble.com/Restaurant-availability-from-database-tp4065609p4065710.html Sent from the Solr - User mailing list archive at Nabble.com.
solr doesn't start on tomcat on aws
I am installing solr on tomcat7 in aws using the Bitnami tomcat stack. My solr server is not starting; below is the error: INFO: Starting service Catalina May 15, 2013 7:01:51 AM org.apache.catalina.core.StandardEngine startInternal INFO: Starting Servlet Engine: Apache Tomcat/7.0.39 May 15, 2013 7:01:51 AM org.apache.catalina.startup.HostConfig deployDescriptor INFO: Deploying configuration descriptor /opt/bitnami/apache-tomcat/conf/Catalina/localhost/solr.xml May 15, 2013 7:01:52 AM org.apache.catalina.startup.HostConfig deployDescriptor SEVERE: Error deploying configuration descriptor /opt/bitnami/apache-tomcat/conf/Catalina/localhost/solr.xml java.lang.NullPointerException at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:625) at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1637) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722) May 15, 2013 7:01:52 AM org.apache.catalina.startup.HostConfig deployDescriptors SEVERE: Error waiting for multi-thread deployment of context descriptors to complete java.util.concurrent.ExecutionException: java.lang.NullPointerException at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252) at java.util.concurrent.FutureTask.get(FutureTask.java:111) at org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:579) at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:475) at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1402) at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:318) at
org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119) at org.apache.catalina.util.LifecycleBase.fireLifecycleEvent(LifecycleBase.java:90) at org.apache.catalina.util.LifecycleBase.setStateInternal(LifecycleBase.java:402) at org.apache.catalina.util.LifecycleBase.setState(LifecycleBase.java:347) The /opt/bitnami/apache-tomcat/conf/Catalina/localhost/solr.xml looks like this: <?xml version="1.0" encoding="utf-8"?> The contents of /usr/share/solr/ also look fine:
bitnami@ip-10-144-66-148:/usr/share/solr$ ls -l
total 11384
drwxr-xr-x 2 tomcat tomcat 4096 Jul 17 2012 bin
drwxr-xr-x 5 tomcat tomcat 4096 May 13 13:11 conf
drwxr-xr-x 9 tomcat tomcat 4096 Jul 17 2012 contrib
drwxr-xr-x 2 tomcat tomcat 4096 May 13 13:20 data
drwxr-xr-x 2 tomcat tomcat 4096 May 13 13:21 lib
-rw-r--r-- 1 tomcat tomcat 2259 Jul 17 2012 README.txt
-rw-r--r-- 1 tomcat tomcat 11628199 May 14 12:58 solr.war
-rw-r--r-- 1 tomcat tomcat 1676 Jul 17 2012 solr.xml
Not sure what is wrong, but this is killing me :-(
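For comparison, a typical Catalina/localhost/solr.xml context descriptor for this kind of deployment looks roughly like the following (a hedged sketch; the original descriptor's contents were lost in the archive, and the paths here simply mirror the listing above). A malformed or empty descriptor is one common cause of a NullPointerException in HostConfig.deployDescriptor.

```xml
<?xml version="1.0" encoding="utf-8"?>
<Context docBase="/usr/share/solr/solr.war" debug="0" crossContext="true">
  <!-- tells the Solr webapp where solr home (conf, data, lib) lives -->
  <Environment name="solr/home" type="java.lang.String"
               value="/usr/share/solr" override="true"/>
</Context>
```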
Re: writing a custom Filter plugin?
At first I thought you were referring to Filters in Lucene at query time (i.e. bitset filters), but I think you are referring to token filters at indexing/text-analysis time? I have had success writing my own Filter as the link presents. The key is to write a custom class that extends TokenFilter ( http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/analysis/TokenFilter.html) and put the implementation in your incrementToken() method. My recollection is that instead of returning a Token object, as you would in earlier versions of Lucene, you set attribute values on a notional current token. One obvious attribute is the term text itself, plus any positional information. The best place to start is to pick a fairly simple example from the Solr source (maybe LowerCaseFilter) and try to mimic that. Cheers! Amit On Mon, May 13, 2013 at 1:33 PM, Jonathan Rochkind rochk...@jhu.edu wrote: Does anyone know of any tutorials, basic examples, and/or documentation on writing your own Filter plugin for Solr? For Solr 4.x/4.3? I would like a Solr 4.3 version of the normalization filters found here for Solr 1.4: https://github.com/billdueber/lib.umich.edu-solr-stuff But those are old, for Solr 1.4. Does anyone have any hints for writing a simple substitution Filter for Solr 4.x? Or, does a simple sourcecode example exist anywhere?
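A minimal sketch of the pattern Amit describes (requires the Lucene 4.x analysis classes on the classpath; the filter name and the normalization it applies are made up for illustration):

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Illustrative filter: lowercases ASCII letters on the "current token". */
public final class SimpleNormalizeFilter extends TokenFilter {

  // Attributes replace the old Token object: you mutate them in place.
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public SimpleNormalizeFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false; // upstream stream is exhausted
    }
    // Mutate the term text in place on the current token.
    final char[] buffer = termAtt.buffer();
    final int length = termAtt.length();
    for (int i = 0; i < length; i++) {
      if (buffer[i] >= 'A' && buffer[i] <= 'Z') {
        buffer[i] += 'a' - 'A';
      }
    }
    return true; // produced (mutated) one token
  }
}
```

To plug it into a schema you would also write a small factory extending TokenFilterFactory whose create(TokenStream) returns this filter, and reference that factory class in the field type's analyzer chain.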
Re: Need solr query help
Is it possible instead to store in your Solr index a bounding box of store location + delivery radius, and do a bounding-box intersection between your user's point + radius (as a bounding box) and the shop's delivery bounding box? If you want further precision, frange may work, assuming it's a post-filter implementation, so that you are doing heavy computation on a presumably small set of data only to filter out the corner cases around the edge of the radius circle. I haven't looked at Solr's spatial querying in a while to know if this is possible or not. Cheers Amit On Sat, May 11, 2013 at 10:42 AM, smsolr sms...@hotmail.com wrote: Hi Abhishek, I forgot to explain why it works. It uses the frange filter which is mentioned here:- http://wiki.apache.org/solr/CommonQueryParameters and it works because it filters in results where the geodist minus the shopMaxDeliveryDistance is less than zero (that's what the u=0 means, upper limit=0), i.e.:- geodist - shopMaxDeliveryDistance < 0 => geodist < shopMaxDeliveryDistance i.e. the geodist is less than the shopMaxDeliveryDistance and so the shop is within delivery range of the location specified. smsolr
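Putting smsolr's explanation into a concrete request, the filter looks something like this (a sketch; the field names store_geo and shopMaxDeliveryDistance are illustrative, and geodist() is taking its location from the sfield/pt parameters):

```
fq={!frange u=0}sub(geodist(), shopMaxDeliveryDistance)
&sfield=store_geo
&pt=52.37,4.89
```

Documents pass the filter only when sub(geodist(), shopMaxDeliveryDistance) <= 0, i.e. the distance from the user's point to the store is within that store's own delivery distance.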
edismax returns far fewer matches than regular query
I have a simple system. I put the title of web pages into the name field and the content of the web pages into the description field. I want to search both fields and give name a little more boost. A search on the name field or description field returns records close to hundreds. http://localhost:8983/solr/select/?q=name:%28coldfusion^2%20cache^1%29&fq=author:[*%20TO%20*]%20AND%20-author:chinmoyp&start=0&rows=10&fl=author,score,%20id But a search on both fields using boosts gives just 5 matches. http://localhost:8983/solr/mindfire/?q=%28coldfusion^2%20cache^1%29&defType=edismax&qf=name^1.5%20description^1.0&fq=author:[*%20TO%20*]%20AND%20-author:chinmoyp&start=0&rows=10&fl=author,score,%20id I am wondering what is wrong, because there are valid results returned by the first query which are ignored by edismax. I am on Solr 3.6.
using edismax without velocity
I am using Solr 3.6 and trying to use the edismax handler. The config has a /browse requestHandler, but it doesn't work because of a missing class definition (VelocityResponseWriter error): <queryResponseWriter name="velocity" class="solr.VelocityResponseWriter" startup="lazy"/> I have copied the jars to solr/lib following the steps here, but no luck: http://wiki.apache.org/solr/VelocityResponseWriter#Using_the_VelocityResponseWriter_in_Solr_Core I just want to search on multiple fields with different boosts. Can I use edismax with the /select requestHandler? If I write a query like the one below, does it search in both the name and description fields? Does the query below solve my purpose? http://localhost:8080/solr/select/?q=(coldfusion^2 cache^1)&defType=edismax&qf=name^2 description^1&fq=author:[* TO *] AND -author:chinmoyp&start=0&rows=10&fl=author,score, id
Re: Sharing index amongst multiple nodes
I don't understand why this would be more performant.. seems like it'd be more memory- and resource-intensive, as you'd have multiple class loaders and multiple cache spaces for no good reason. Just have a single core with sufficiently large caches to handle your response needs. If you want to load-balance reads, consider having multiple physical nodes with master/slaves or SolrCloud. On Sat, Apr 6, 2013 at 9:21 AM, Daire Mac Mathúna daire...@gmail.comwrote: Hi. What are the thoughts on having multiple SOLR instances, i.e. multiple SOLR war files, sharing the same index (i.e. sharing the same solr_home), where only one SOLR instance is used for writing and the others for reading? Is this possible? Is it beneficial - is it more performant than having just one Solr instance? How does it affect auto-commits, i.e. how would the read nodes know the index has been changed and re-populate caches etc.? Solr 3.6.1 Thanks.
Re: how to skip test while building
If you generate the Maven pom files you can do this, I think, by doing mvn <whatever here> -DskipTests=true. On Sat, Apr 6, 2013 at 7:25 AM, Erick Erickson erickerick...@gmail.comwrote: Don't know a good way to skip compiling the tests, but there isn't any harm in compiling them... changing to the solr directory and just issuing ant example dist builds pretty much everything. You don't execute tests unless you specify ant test. ant -p shows you all the targets. Note that you have different targets depending on whether you're executing it in solr_home or solr_home/solr or solr_home/lucene. Since you mention Solr, you probably want to work in solr_home/solr to start. Best Erick On Sat, Apr 6, 2013 at 5:36 AM, parnab kumar parnab.2...@gmail.com wrote: Hi All, I am new to Solr. I am using Solr 3.4. I want to build without compiling the Lucene test files and skip running the tests. Can anyone please help with where to make the necessary changes? Thanks, Pom
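Spelling out the Maven route above (a sketch; the exact targets can vary between Lucene/Solr releases, so check `ant -p` in your checkout):

```
# generate the maven poms from the ant build, then build without running tests
ant get-maven-poms
mvn -DskipTests=true install
```

Note that -DskipTests still compiles the test sources; if the goal is also to skip compiling them, -Dmaven.test.skip=true does both.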
Re: Solr 4.2 single server limitations
There's a whole heap of information that is missing like what you plan on storing vs indexing and yes QPS too. My short answer is try with one server until it falls over then start adding more. When you say multiple-server setup do you mean multiple servers where each server acts as a slave storing the entire index so you have load balancing across multiple servers OR do you mean multiple servers where each server stores a portion of the data? If it's the former, sometimes a simple master/slave setup in Solr 4.x works but the latter may mean SolrCloud. Master/Slave is easy but I don't know much about SolrCloud. Questions to think about (this is not exhaustive by any means) 1) When you say 5-10 pages per website (300+ websites) that you are crawling 2x per hour, are you *replacing* the old copy of the web page in your index or storing some form of history for some reason. 2) What are you planning on storing vs indexing which would dictate your memory requirements. 3) You mentioned you don't know QPS but having some guess would help.. is it mostly for storage and occasional lookup (where slow responses is probably tolerable) or is this powering a real user-facing website (where low latency is prob desired). Again, I like to start simple and use one server until it dies then expand from there. Cheers Amit On Thu, Apr 4, 2013 at 7:58 AM, imehesz imeh...@gmail.com wrote: hello, I'm using a single server setup with Nutch (1.6) and Solr (4.2) I plan to trigger the Nutch crawling process every 30 minutes or so and add about 300+ websites a month with (~5-10 pages each). At this point I'm not sure about the query requests/sec. Can I run this on a single server (how long)? If not, what would be the best and most efficient way to have multiple server setup? thanks, --iM -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-2-single-server-limitations-tp4053829.html Sent from the Solr - User mailing list archive at Nabble.com.
unknown field error when indexing with nutch
Hi all, I'm trying to run a nutch crawler and index to Solr. I'm running Nutch 1.6 and Solr 4.2. I managed to crawl and index with that Nutch version into Solr 3.6.2 but I can't seem to manage to run it with Solr 4.2 I re-built Nutch with the schema-solr4.xml and copied that file to SOLR_HOME/example/solr/collection1/conf/schema.xml but the job fails when trying to index: SolrException: ERROR: [doc= http://0movies.com/watchversion.php?id=3818link=1364879137] unknown field 'host' It looks like Solr is not aware of the schema... Did I miss something ? Thanks.
Re: unknown field error when indexing with nutch
I'm using the solrconfig supplied with Solr 4.2 and I added the nutch request handler. But I keep getting the same errors. On Apr 5, 2013 8:11 PM, Jack Krupansky j...@basetechnology.com wrote: Check your solrconfig.xml file for references to a host field. But maybe more importantly, make sure you use a Solr 4.1 solrconfig and merge in any of your application-specific changes. -- Jack Krupansky -Original Message- From: Amit Sela Sent: Friday, April 05, 2013 12:57 PM To: solr-user@lucene.apache.org Subject: unknown field error when indexing with nutch Hi all, I'm trying to run a nutch crawler and index to Solr. I'm running Nutch 1.6 and Solr 4.2. I managed to crawl and index with that Nutch version into Solr 3.6.2 but I can't seem to manage to run it with Solr 4.2. I re-built Nutch with the schema-solr4.xml and copied that file to SOLR_HOME/example/solr/collection1/conf/schema.xml but the job fails when trying to index: SolrException: ERROR: [doc= http://0movies.com/watchversion.php?id=3818link=1364879137] unknown field 'host' It looks like Solr is not aware of the schema... Did I miss something? Thanks.
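Since the complaint is an unknown 'host' field, it is worth verifying that the schema.xml actually loaded by the running core declares it. A hedged sketch of what the Nutch-style declaration looks like (the field type name may differ in your schema-solr4.xml):

```xml
<!-- sketch: make sure the core serving requests declares the field
     Nutch writes, then reload the core or restart Tomcat -->
<field name="host" type="string" stored="false" indexed="true"/>
```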
Re: Solr scores remain the same for exact match and nearly exact match
Thanks Jack and Andre. I am trying to use edismax but am stuck with NoClassDefFoundError: org/apache/solr/response/QueryResponseWriter. I am using Solr 3.6. I have followed the steps here: http://wiki.apache.org/solr/VelocityResponseWriter#Using_the_VelocityResponseWriter_in_Solr_Core Just the jars were copied; the rest was already there in solrconfig.xml.
Re: do SearchComponents have access to response contents
We need to also track the size of the response (as the size in bytes of the whole xml response that is streamed, with stored fields and all). I was a bit worried because I am wondering if a searchcomponent will actually have access to the response bytes... == Can't you get this from your container access logs after the fact? I may be misunderstanding something, but why wouldn't mining the Jetty/Tomcat logs for the response size here suffice? Thanks! Amit On Thu, Apr 4, 2013 at 1:34 AM, xavier jmlucjav jmluc...@gmail.com wrote: A custom QueryResponseWriter...this makes sense, thanks Jack On Wed, Apr 3, 2013 at 11:21 PM, Jack Krupansky j...@basetechnology.com wrote: The search components can see the response as a namedlist, but it is only when SolrDispatchFilter calls the QueryResponseWriter that XML or JSON or whatever other format (Javabin as well) is generated from the named list for final output in an HTTP response. You probably want a custom query response writer that wraps the XML response writer. Then you can generate the XML and then do whatever you want with it. See the QueryResponseWriter class and queryResponseWriter in solrconfig.xml. -- Jack Krupansky -Original Message- From: xavier jmlucjav Sent: Wednesday, April 03, 2013 4:22 PM To: solr-user@lucene.apache.org Subject: do SearchComponents have access to response contents I need to implement some SearchComponent that will deal with metrics on the response. Some things I see will be easy to get, like number of hits for instance, but I am more worried with this: We need to also track the size of the response (as the size in bytes of the whole xml response that is streamed, with stored fields and all). I was a bit worried because I am wondering if a searchcomponent will actually have access to the response bytes... Can someone confirm one way or the other? We are targeting Solr 4.0 thanks xavier
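Mining the container access logs, as Amit suggests, can be done offline with a few lines. A sketch assuming the common/combined log format (Tomcat's AccessLogValve records response-body bytes with the %b pattern token; '-' means no body):

```python
import re

# Common/combined access-log format: the response-body size in bytes is the
# numeric field immediately after the HTTP status code.
LOG_RE = re.compile(r'" (?P<status>\d{3}) (?P<size>\d+|-)')

def total_response_bytes(lines):
    """Sum response sizes from container access-log lines ('-' means no body)."""
    total = 0
    for line in lines:
        m = LOG_RE.search(line)
        if m and m.group("size") != "-":
            total += int(m.group("size"))
    return total

if __name__ == "__main__":
    sample = [
        '10.0.0.5 - - [03/Apr/2013:16:22:01 -0700] "GET /solr/select?q=foo HTTP/1.1" 200 5120',
        '10.0.0.5 - - [03/Apr/2013:16:22:02 -0700] "GET /solr/admin/ping HTTP/1.1" 200 98',
    ]
    print(total_response_bytes(sample))  # prints 5218
```

This avoids writing any Solr plugin at all, at the cost of the numbers being available only after the fact.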
Solr ZooKeeper ensemble with HBase
Hi all, I have a running Hadoop + HBase cluster and the HBase cluster is running it's own zookeeper (HBase manages zookeeper). I would like to deploy my SolrCloud cluster on a portion of the machines on that cluster. My question is: Should I have any trouble / issues deploying an additional ZooKeeper ensemble ? I don't want to use the HBase ZooKeeper because, well first of all HBase manages it so I'm not sure it's possible and second I have HBase working pretty hard at times and I don't want to create any connection issues by overloading ZooKeeper. Thanks, Amit.
Re: Solr scores remain the same for exact match and nearly exact match
Thanks. I added a copy field and that fixed the issue. On Wed, Apr 3, 2013 at 12:29 PM, Gora Mohanty-3 [via Lucene] ml-node+s472066n4053412...@n3.nabble.com wrote: On 3 April 2013 10:52, amit [hidden email] wrote: Below is my query http://localhost:8983/solr/select/?q=subject:session management in php&fq=category:[*%20TO%20*]&fl=category,score,subject [...] Add debugQuery=on to your Solr URL, and you will get an explanation of the score. Your subject field is tokenised, so that there is no a priori reason that an exact match should score higher. Several strategies are available if you want that behaviour. Try searching Google, e.g., for solr exact match higher score. Regards, Gora
Re: Solr ZooKeeper ensemble with HBase
Trouble in what way? If I have enough memory - HBase RegionServer 10GB and maybe 2GB for Solr? - or do you mean CPU / disk? On Wed, Apr 3, 2013 at 5:54 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Hello, Amit: My guess is that, if HBase is working hard, you're going to have more trouble with HBase and Solr on the same nodes than HBase and Solr sharing a Zookeeper. Solr's usage of Zookeeper is very minimal. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Wed, Apr 3, 2013 at 8:06 AM, Amit Sela am...@infolinks.com wrote: Hi all, I have a running Hadoop + HBase cluster and the HBase cluster is running its own zookeeper (HBase manages zookeeper). I would like to deploy my SolrCloud cluster on a portion of the machines on that cluster. My question is: Should I have any trouble / issues deploying an additional ZooKeeper ensemble? I don't want to use the HBase ZooKeeper because, well, first of all HBase manages it so I'm not sure it's possible, and second I have HBase working pretty hard at times and I don't want to create any connection issues by overloading ZooKeeper. Thanks, Amit.
Re: Solr scores remain the same for exact match and nearly exact match
When I use the copy field destination as text it works fine; I get a boost for exact match. But if I use some other field the score is not boosted for exact match.
<field name="keywords" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="subject" dest="keywords"/>
Not sure if I am on the right track. I am new to Solr, please bear with me. I checked this link http://wiki.apache.org/solr/SolrRelevancyCookbook and am trying to index the same field multiple times to get an exact match.
Solr scores remain the same for exact match and nearly exact match
Below is my query: http://localhost:8983/solr/select/?q=subject:session management in php&fq=category:[*%20TO%20*]&fl=category,score,subject The result is like below:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">983</int>
    <lst name="params">
      <str name="fq">category:[* TO *]</str>
      <str name="q">subject:session management in php</str>
      <str name="fl">category,score,subject</str>
    </lst>
  </lst>
  <result name="response" maxScore="0.8770298" start="0" numFound="2">
    <doc>
      <float name="score">0.8770298</float>
      <str name="category">Annapurnap</str>
      <str name="subject">session management in asp.net</str>
    </doc>
    <doc>
      <float name="score">0.8770298</float>
      <str name="category">Annapurnap</str>
      <str name="subject">session management in PHP</str>
    </doc>
  </result>
</response>
The question is how come both have the same score when one is an exact match and the other isn't. This is the schema:
<field name="subject" type="text_en_splitting" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true"/>
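As the replies in this thread suggest, one common fix is to copy the tokenized field into an untokenized (string) field and boost matches on it. A hedged sketch, with illustrative field names that are not from the original schema:

```xml
<!-- subject_exact holds the whole subject as a single untokenized term -->
<field name="subject_exact" type="string" indexed="true" stored="false"/>
<copyField source="subject" dest="subject_exact"/>
```

A query such as q=subject:(session management in php) OR subject_exact:"session management in php"^10 would then rank the exact-match document above the near match.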
Re: SOLR on hdfs
Why wouldn't SolrCloud help you here? You can setup shards and replicas etc to have redundancy b/c HDFS isn't designed to serve real time queries as far as I understand. If you are using HDFS as a backup mechanism to me you'd be better served having multiple slaves tethered to a master (in a non-cloud environment) or setup SolrCloud either option would give you more redundancy than copying an index to HDFS. - Amit On Wed, Mar 6, 2013 at 12:23 PM, Joseph Lim ysli...@gmail.com wrote: Hi Upayavira, sure, let me explain. I am setting up Nutch and SOLR in hadoop environment. Since I am using hdfs, in the event if there is any crashes to the localhost(running solr), i will still have the shards of data being stored in hdfs. Thanks you so much =) On Thu, Mar 7, 2013 at 1:19 AM, Upayavira u...@odoko.co.uk wrote: What are you actually trying to achieve? If you can share what you are trying to achieve maybe folks can help you find the right way to do it. Upayavira On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote: Hello Otis , Is there any configuration where it will index into hdfs instead? I tried crawlzilla and lily but I hope to update specific package such as Hadoop only or nutch only when there are updates. That's y would prefer to install separately . Thanks so much. Looking forward for your reply. On Wednesday, March 6, 2013, Otis Gospodnetic wrote: Hello Joseph, You can certainly put them there, as in: hadoop fs -copyFromLocal localsrc URI But searching such an index will be slow. See also: http://katta.sourceforge.net/ Otis -- Solr ElasticSearch Support http://sematext.com/ On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim ysli...@gmail.com javascript:; wrote: Hi, Would like to know how can i put the indexed solr shards into hdfs? Thanks.. Joseph On Mar 6, 2013 7:28 PM, Otis Gospodnetic otis.gospodne...@gmail.comjavascript:; wrote: Hi Joseph, What exactly are you looking to to? 
See http://incubator.apache.org/blur/ Otis -- Solr ElasticSearch Support http://sematext.com/ On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim ysli...@gmail.com javascript:; wrote: Hi I am running hadoop distributed file system, how do I put my output of the solr dir into hdfs automatically? Thanks so much.. -- Best Regards, *Joseph* -- Best Regards, *Joseph* -- Best Regards, *Joseph*
Re: SOLR on hdfs
Joseph, Doing what Otis said will do literally what you want which is copying the index to HDFS. It's no different than copying it to a different machine which btw is what Solr's master/slave replication scheme does. Alternatively, I think people are starting to setup new Solr instances with SolrCloud which doesn't have the concept of master/slave but rather a series of nodes with the option of having replicas (what I believe to be backup nodes) so that you have the redundancy you want. Honestly HDFS in the way that you are looking for is probably no different than storing your solr index in a RAIDed storage format but I don't pretend to know much about RAID arrays. What exactly are you trying to achieve from a systems perspective? Why do you want Hadoop in the mix here and how does copying the index to HDFS help you? If SolrCloud seems complicated try just setting up a simple master/slave replication scheme for that's really easy. Cheers Amit On Wed, Mar 6, 2013 at 9:55 PM, Joseph Lim ysli...@gmail.com wrote: Hi Amit, so you mean that if I just want to get redundancy for solr in hdfs, the only best way to do it is to as per what Otis suggested using the following command hadoop fs -copyFromLocal localsrc URI Ok let me try out solrcloud as I will need to make sure it works well with nutch too.. Thanks for the help.. On Thu, Mar 7, 2013 at 5:47 AM, Amit Nithian anith...@gmail.com wrote: Why wouldn't SolrCloud help you here? You can setup shards and replicas etc to have redundancy b/c HDFS isn't designed to serve real time queries as far as I understand. If you are using HDFS as a backup mechanism to me you'd be better served having multiple slaves tethered to a master (in a non-cloud environment) or setup SolrCloud either option would give you more redundancy than copying an index to HDFS. - Amit On Wed, Mar 6, 2013 at 12:23 PM, Joseph Lim ysli...@gmail.com wrote: Hi Upayavira, sure, let me explain. I am setting up Nutch and SOLR in hadoop environment. 
Since I am using hdfs, in the event if there is any crashes to the localhost(running solr), i will still have the shards of data being stored in hdfs. Thanks you so much =) On Thu, Mar 7, 2013 at 1:19 AM, Upayavira u...@odoko.co.uk wrote: What are you actually trying to achieve? If you can share what you are trying to achieve maybe folks can help you find the right way to do it. Upayavira On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote: Hello Otis , Is there any configuration where it will index into hdfs instead? I tried crawlzilla and lily but I hope to update specific package such as Hadoop only or nutch only when there are updates. That's y would prefer to install separately . Thanks so much. Looking forward for your reply. On Wednesday, March 6, 2013, Otis Gospodnetic wrote: Hello Joseph, You can certainly put them there, as in: hadoop fs -copyFromLocal localsrc URI But searching such an index will be slow. See also: http://katta.sourceforge.net/ Otis -- Solr ElasticSearch Support http://sematext.com/ On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim ysli...@gmail.com javascript:; wrote: Hi, Would like to know how can i put the indexed solr shards into hdfs? Thanks.. Joseph On Mar 6, 2013 7:28 PM, Otis Gospodnetic otis.gospodne...@gmail.comjavascript:; wrote: Hi Joseph, What exactly are you looking to to? See http://incubator.apache.org/blur/ Otis -- Solr ElasticSearch Support http://sematext.com/ On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim ysli...@gmail.com javascript:; wrote: Hi I am running hadoop distributed file system, how do I put my output of the solr dir into hdfs automatically? Thanks so much.. -- Best Regards, *Joseph* -- Best Regards, *Joseph* -- Best Regards, *Joseph* -- Best Regards, *Joseph*
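For completeness, the simple master/slave replication suggested above is configured per core in solrconfig.xml; a minimal sketch (host name, core path, and poll interval are placeholders):

```xml
<!-- on the master -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- on each slave -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```

Each slave polls the master at the configured interval and pulls down only the changed index segments, which gives the redundancy discussed in the thread without involving HDFS.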
Re: ping query frequency
We too run a ping every 5 seconds and I think the concurrent Mark/Sweep helps to avoid the LB from taking a box out of rotation due to long pauses. Either that or I don't see large enough pauses for my LB to take it out (it'd have to fail 3 times in a row or 15 seconds total before it's gone). The ping query does execute an actual query so of course you want to make this as simple as possible (i.e. q=primary_key:value) so that there's limited to no scanning of the index. I think our query does an id:0 which would always return 0 docs but also any stupid-simple query is fine so long as it hits the caches on subsequent hits. The goal, to me at least, is not that the ping query yields actual docs but that it's a mechanism to remove a solr server out of rotation without having to login to an ops controlled device directly. I'd definitely remove the ping per request (wouldn't the fact that you are doing /select serve as the ping and hence defeat the purpose of the ping query) and definitely do the frequent ping as we are describing if you want to have your solr boxes behind some load balancer. On Sun, Mar 3, 2013 at 8:21 AM, Shawn Heisey s...@elyograg.org wrote: On 3/3/2013 2:15 AM, adm1n wrote: I'm wonderring how frequent this query should be made. Currently it is done before each select request (some very old legacy). I googled a little and found out that it is bad practice and has performance impact. So the question is should I completely remove it or just do it once in some period of time. Can you point me at the place where it says that it's bad practice to do frequent pings? I use the ping functionality in my haproxy load balancer that sits in front of Solr. It executes a ping request against all my Solr instances every five seconds. Most of the time, the ping request (which is distributed) finishes in single-digit milliseconds. If that is considered bad practice, I want to figure out why and submit issues to get the problem fixed. 
I can imagine that sending a ping before every query would be a bad idea, but I am hoping that the way I'm using it is OK. The only problem with ping requests that I have ever noticed was caused by long garbage collection pauses on my 8GB Solr heap. Those pauses caused the load balancer to incorrectly mark the active Solr instance(s) as down and send requests to a backup. Through experimentation with -XX memory tuning options, I have now eliminated the GC pause problem. For machines running Solr 4.2-SNAPSHOT, I have reduced the heap to 6GB, the 3.5.0 machines are still running with 8GB. Thanks, Shawn
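The haproxy setup Shawn describes (a ping every five seconds, a box removed after repeated failures) looks roughly like this (a sketch; backend and server names are placeholders, and the thresholds should match your own tolerance):

```
backend solr
    option httpchk GET /solr/admin/ping
    server solr1 10.0.0.1:8983 check inter 5000 rise 2 fall 3
    server solr2 10.0.0.2:8983 check inter 5000 rise 2 fall 3
```

With inter 5000 and fall 3, a node must fail three consecutive five-second checks (15 seconds) before it is taken out of rotation, which is why short GC pauses usually do not trip it.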
Re: Poll: SolrCloud vs. Master-Slave usage
But does that mean that in SolrCloud, slave nodes are busy indexing documents? On Fri, Mar 1, 2013 at 5:37 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Amit, NRT is not possible in a master-slave setup because of the necessity of a hard commit and replication, both of which add considerable delay. Solr Cloud sends each document for a given shard to each node hosting that shard, so there's no need for the hard commit and replication for visibility. You could conceivably get NRT on a single node without Solr Cloud, but there would be no redundancy. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Fri, Mar 1, 2013 at 1:22 AM, Amit Nithian anith...@gmail.com wrote: Erick, Well put and thanks for the clarification. One question: And if you need NRT, you just can't get it with traditional M/S setups. == Can you explain how that works with SolrCloud? I agree with what you said too because there was an article or discussion I read that said having high-availability masters requires some fairly complicated setups and I guess I am under-estimating how expensive/complicated our setup is relative to what you can get out of the box with SolrCloud. Thanks! Amit On Thu, Feb 28, 2013 at 6:29 PM, Erick Erickson erickerick...@gmail.com wrote: Amit: It's a balancing act. If I was starting fresh, even with one shard, I'd probably use SolrCloud rather than deal with the issues around the how do I recover if my master goes down question. Additionally, SolrCloud allows one to monitor the health of the entire system by monitoring the state information kept in Zookeeper rather than build a monitoring system that understands the changing topology of your network. And if you need NRT, you just can't get it with traditional M/S setups. 
In a mature production system where all the operational issues are figured out and you don't need NRT, it's easier just to plop 4.x in traditional M/S setups and not go to SolrCloud. And you're right, you have to understand Zookeeper which isn't all that difficult, but is another moving part and I'm a big fan of keeping the number of moving parts down if possible. It's not a one-size-fits-all situation. From what you've described, I can't say there's a compelling reason to do the SolrCloud thing. If you find yourself spending lots of time building monitoring or High Availability/Disaster Recovery tools, then you might find the cost/benefit analysis changing. Personally, I think it's ironic that the memory improvements that came along _with_ SolrCloud make it less necessary to shard. Which means that traditional M/S setups will suit more people longer G Best Erick On Thu, Feb 28, 2013 at 8:22 PM, Amit Nithian anith...@gmail.com wrote: I don't know a ton about SolrCloud but for our setup and my limited understanding of it is that you start to bleed operational and non-operational aspects together which I am not comfortable doing (i.e. software load balancing). Also adding ZooKeeper to the mix is yet another thing to install, setup, monitor, maintain etc which doesn't add any value above and beyond what we have setup already. For example, we have a hardware load balancer that can do the actual load balancing of requests among the slaves and taking slaves in and out of rotation either on demand or if it's down. We've placed a virtual IP on top of our multiple masters so that we have redundancy there. While we have multiple cores, the data volume is large enough to fit on one node so we aren't at the data volume necessary for sharding our indices. I suspect that if we had a sufficiently large dataset that couldn't fit on one box SolrCloud is perfect but when you can fit on one box, why add more complexity? 
Please correct me if I'm wrong for I'd like to better understand this! On Thu, Feb 28, 2013 at 12:53 AM, rulinma ruli...@gmail.com wrote: I am doing research on SolrCloud. -- View this message in context: http://lucene.472066.n3.nabble.com/Poll-SolrCloud-vs-Master-Slave-usage-tp4042931p4043582.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Poll: SolrCloud vs. Master-Slave usage
I don't know a ton about SolrCloud but for our setup and my limited understanding of it is that you start to bleed operational and non-operational aspects together which I am not comfortable doing (i.e. software load balancing). Also adding ZooKeeper to the mix is yet another thing to install, setup, monitor, maintain etc which doesn't add any value above and beyond what we have setup already. For example, we have a hardware load balancer that can do the actual load balancing of requests among the slaves and taking slaves in and out of rotation either on demand or if it's down. We've placed a virtual IP on top of our multiple masters so that we have redundancy there. While we have multiple cores, the data volume is large enough to fit on one node so we aren't at the data volume necessary for sharding our indices. I suspect that if we had a sufficiently large dataset that couldn't fit on one box SolrCloud is perfect but when you can fit on one box, why add more complexity? Please correct me if I'm wrong for I'd like to better understand this! On Thu, Feb 28, 2013 at 12:53 AM, rulinma ruli...@gmail.com wrote: I am doing research on SolrCloud. -- View this message in context: http://lucene.472066.n3.nabble.com/Poll-SolrCloud-vs-Master-Slave-usage-tp4042931p4043582.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Poll: SolrCloud vs. Master-Slave usage
Erick, Well put and thanks for the clarification. One question: And if you need NRT, you just can't get it with traditional M/S setups. == Can you explain how that works with SolrCloud? I agree with what you said too because there was an article or discussion I read that said having high-availability masters requires some fairly complicated setups and I guess I am under-estimating how expensive/complicated our setup is relative to what you can get out of the box with SolrCloud. Thanks! Amit On Thu, Feb 28, 2013 at 6:29 PM, Erick Erickson erickerick...@gmail.comwrote: Amit: It's a balancing act. If I was starting fresh, even with one shard, I'd probably use SolrCloud rather than deal with the issues around the how do I recover if my master goes down question. Additionally, SolrCloud allows one to monitor the health of the entire system by monitoring the state information kept in Zookeeper rather than build a monitoring system that understands the changing topology of your network. And if you need NRT, you just can't get it with traditional M/S setups. In a mature production system where all the operational issues are figured out and you don't need NRT, it's easier just to plop 4.x in traditional M/S setups and not go to SolrCloud. And you're right, you have to understand Zookeeper which isn't all that difficult, but is another moving part and I'm a big fan of keeping the number of moving parts down if possible. It's not a one-size-fits-all situation. From what you've described, I can't say there's a compelling reason to do the SolrCloud thing. If you find yourself spending lots of time building monitoring or High Availability/Disaster Recovery tools, then you might find the cost/benefit analysis changing. Personally, I think it's ironic that the memory improvements that came along _with_ SolrCloud make it less necessary to shard. 
Which means that traditional M/S setups will suit more people longer G Best Erick On Thu, Feb 28, 2013 at 8:22 PM, Amit Nithian anith...@gmail.com wrote: I don't know a ton about SolrCloud but for our setup and my limited understanding of it is that you start to bleed operational and non-operational aspects together which I am not comfortable doing (i.e. software load balancing). Also adding ZooKeeper to the mix is yet another thing to install, setup, monitor, maintain etc which doesn't add any value above and beyond what we have setup already. For example, we have a hardware load balancer that can do the actual load balancing of requests among the slaves and taking slaves in and out of rotation either on demand or if it's down. We've placed a virtual IP on top of our multiple masters so that we have redundancy there. While we have multiple cores, the data volume is large enough to fit on one node so we aren't at the data volume necessary for sharding our indices. I suspect that if we had a sufficiently large dataset that couldn't fit on one box SolrCloud is perfect but when you can fit on one box, why add more complexity? Please correct me if I'm wrong for I'd like to better understand this! On Thu, Feb 28, 2013 at 12:53 AM, rulinma ruli...@gmail.com wrote: I am doing research on SolrCloud. -- View this message in context: http://lucene.472066.n3.nabble.com/Poll-SolrCloud-vs-Master-Slave-usage-tp4042931p4043582.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: numFound is not correct while using Result Grouping
I need to write some tests which I hope to do tonight and then I think it'll get into 4.2 On Tue, Feb 26, 2013 at 6:24 AM, Nicholas Ding nicholas...@gmail.com wrote: Thanks Amit, that's cool! So it will also be fixed on Solr 4.2, right? On Mon, Feb 25, 2013 at 6:04 PM, Amit Nithian anith...@gmail.com wrote: Yeah I had a similar problem. I filed and submitted this patch: https://issues.apache.org/jira/browse/SOLR-4310 Let me know if this is what you are looking for! Amit On Mon, Feb 25, 2013 at 1:50 PM, Teun Duynstee t...@duynstee.com wrote: Ah, I see. The docs say Although this result format does not have as much information, it may be easier for existing solr clients to parse. I guess the ngroups value could be added to this format, but apparently it isn't. I do agree with you that to be useful (as in possible to read for a client that doesn't know of the grouped format), the number should be that of the groups, not of the documents. A quick glance at the code shows that it is indeed not calculated in this case, but it is not completely trivial to fix. Could you use format=simple instead? That will work with ngroups. Teun 2013/2/25 Nicholas Ding nicholas...@gmail.com Thanks Teun and Carlos, I set group.ngroups=true, but I don't get this ngroups number when I am using group.main=true. On Mon, Feb 25, 2013 at 12:02 PM, Carlos Maroto cmar...@searchtechnologies.com wrote: Use group.ngroups, check it in the Solr wiki for FieldCollapsing Carlos Maroto Search Architect at Search Technologies (www.searchtechnologies.com) Nicholas Ding nicholas...@gmail.com wrote: Hello, I grouped the result and set group.main=true. I was expecting numFound to equal the number of groups, but it did not. How do I get the number of groups? Thanks Nicholas
Re: numFound is not correct while using Result Grouping
Yeah I had a similar problem. I filed and submitted this patch: https://issues.apache.org/jira/browse/SOLR-4310 Let me know if this is what you are looking for! Amit On Mon, Feb 25, 2013 at 1:50 PM, Teun Duynstee t...@duynstee.com wrote: Ah, I see. The docs say Although this result format does not have as much information, it may be easier for existing solr clients to parse. I guess the ngroups value could be added to this format, but apparently it isn't. I do agree with you that to be useful (as in possible to read for a client that doesn't know of the grouped format), the number should be that of the groups, not of the documents. A quick glance at the code shows that it is indeed not calculated in this case, but it is not completely trivial to fix. Could you use format=simple instead? That will work with ngroups. Teun 2013/2/25 Nicholas Ding nicholas...@gmail.com Thanks Teun and Carlos, I set group.ngroups=true, but I don't get this ngroups number when I am using group.main=true. On Mon, Feb 25, 2013 at 12:02 PM, Carlos Maroto cmar...@searchtechnologies.com wrote: Use group.ngroups, check it in the Solr wiki for FieldCollapsing Carlos Maroto Search Architect at Search Technologies (www.searchtechnologies.com) Nicholas Ding nicholas...@gmail.com wrote: Hello, I grouped the result and set group.main=true. I was expecting numFound to equal the number of groups, but it did not. How do I get the number of groups? Thanks Nicholas
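For anyone hitting the same numFound/ngroups confusion, here is a minimal sketch (Python; the core URL and grouping field are assumptions, adjust for your installation) of the parameter combination being discussed: group.ngroups=true makes Solr report the distinct group count, but only in the grouped response format — with group.main=true the response is flattened and that count is dropped.

```python
from urllib.parse import urlencode

# Assumed core URL -- adjust for your installation.
base = "http://localhost:8983/solr/collection1/select"

params = {
    "q": "*:*",
    "group": "true",
    "group.field": "category",   # hypothetical field to group on
    "group.ngroups": "true",     # report the number of distinct groups
    # Deliberately NOT setting group.main=true: the flattened "main"
    # format drops the ngroups/matches counts, which is the behavior
    # discussed in this thread (see SOLR-4310).
    "wt": "json",
}

query_string = urlencode(params)
print(base + "?" + query_string)
```

The same parameters apply whether you build the URL by hand or set them through a SolrJ query object.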
Re: [ANN] vifun: tool to help visually tweak Solr boosting
This is cool! I had done something similar except changing via JConsole/JMX: https://issues.apache.org/jira/browse/SOLR-2306 We had something not as nice at Zvents but I wanted to expose these as MBean properties so you could change them via any JMX UI like JVisualVM Cheers! Amit On Mon, Feb 25, 2013 at 2:36 PM, jmlucjav jmluc...@gmail.com wrote: Apologies...instructions are wrong on the cd, these commands are to be run at the top level of the project...I fixed the doc to read: cd vifun griffon run-app On Mon, Feb 25, 2013 at 10:45 PM, Jan Høydahl jan@cominvent.com wrote: Hi, I actually tried ../griffonw run-app but it says griffon-app does not appear to be part of a Griffon application. I installed griffon and tried again griffon run-app inside of griffon-app, but same error. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com 25. feb. 2013 kl. 19:51 skrev jmlucjav jmluc...@gmail.com: Jan, thanks for looking at this! - Running from source: would you care to send me the error you get (if any) when running from source? I assume you have griffon1.1.0 installed right? - Binary dist: the distrib is created by griffon, so I'll check if the permission issue (I develop on windows, and tested on a clean windows too, so I don't face the issue you mention) is known or can be fixed somehow. I'll update the doc anyway. - wt param: I am already overriding wt param (in order to use javabin). What I didn't allow is to choose the handler to be used when submitting the query. I guess any handler that does not have appends/invariants that would interfere would work fine, I just thought /select is mostly available in most installations and that is one thing less to configure. But yes, I could let the user configure it, I'll open an issue. xavier On Mon, Feb 25, 2013 at 3:10 PM, Jan Høydahl jan@cominvent.com wrote: Cool. 
I tried running from source (using the bundled griffonw), but I think the instructions may be wrong, so I had to download the binary dist. The file permissions for bin/vifun in the binary dist should have +x so you can execute it with ./vifun What about the ability to override the wt param, so that you can point it to the /browse handler directly? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com 23. feb. 2013 kl. 15:12 skrev jmlucjav jmluc...@gmail.com: Hi, I have built a small tool to help me tweak some params in Solr (typically qf, bf in edismax). As maybe others find it useful, I am open sourcing it on github: https://github.com/jmlucjav/vifun Check github for some more info and screenshots. I include part of the github page below. regards

Description
Did you ever spend lots of time trying to tweak all the numbers in an *edismax* handler's *qf*, *bf*, etc. params so docs get scored to your liking? Imagine you have the params below: is 20 the right boost for *name* or is it too much? Is *population* being boosted too much versus distance? What about new documents?

<!-- fields, boost some -->
<str name="qf">name^20 textsuggest^10 edge^5 ngram^2 phonetic^1</str>
<str name="mm">33%</str>
<!-- boost closest hits -->
<str name="bf">recip(geodist(),1,500,0)</str>
<!-- boost by population -->
<str name="bf">product(log(sum(population,1)),100)</str>
<!-- boost newest docs -->
<str name="bf">recip(rord(moddate),1,1000,1000)</str>

This tool was developed in order to help me tweak the values of boosting functions etc. in Solr, typically when using the edismax handler. If you are fed up with: change a number a bit, restart Solr, run the same query to see how documents are scored now... then this tool is for you.
Features (https://github.com/jmlucjav/vifun#features)
- Can tweak numeric values in the following params: *qf, pf, bf, bq, boost, mm* (others can be easily added), even in *appends or invariants*
- View side by side a baseline query result and how it changes when you gradually change each value in the params
- Colorized values; color depends on how the document does relative to the baseline query
- Tooltips give you Explain info
- Works on remote Solr installations
- Tested with Solr 3.6, 4.0 and 4.1 (other versions should work too, as long as the wt=javabin format is compatible)
- Developed using Groovy/Griffon

Requirements (https://github.com/jmlucjav/vifun#requirements)
- The */select* handler should be available, and should not have any *appends or invariants*, as they could interfere with how vifun works.
- Java 6 is needed (maybe it runs on Java 5 too). A JRE should be enough.

Getting started (https://github.com/jmlucjav/vifun#getting-started)
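To make the bf numbers above less abstract, here is a small sketch (Python) of the two distance and population boosts, using Solr's documented function-query semantics: recip(x,m,a,b) = a/(m*x+b), and log() being base 10. The distances and population value are purely illustrative.

```python
import math

def recip(x, m, a, b):
    # Solr function query recip(x,m,a,b) = a / (m*x + b)
    # (note that b=0, as in the config above, would divide by zero at x=0)
    return a / (m * x + b)

def population_boost(population):
    # bf=product(log(sum(population,1)),100); Solr's log() is base 10
    return math.log10(population + 1) * 100

# bf=recip(geodist(),1,500,0): boost decays with distance from the query point
for km in (1, 10, 100):
    print("distance %3d -> boost %.1f" % (km, recip(km, 1, 500, 0)))

print("population 999 -> boost %.1f" % population_boost(999))
```

Plotting a few values like this is a quick way to sanity-check whether one bf dominates the others before reaching for a tool.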
Re: Slaves always replicate entire index Index versions
A few others have posted about this too apparently, and SOLR-4413 is the root problem. Basically what I am seeing is that if your index directory is not index/ but rather index.timestamp (set in index.properties), a new index will be downloaded all the time, because the download expects your index to be in solr_data_dir/index. Sounds like a quick solution might be to rename your index directory to just index and see if the problem goes away. To confirm, look at line 728 in the SnapPuller.java file (in downloadIndexFiles). I am hoping that the patch and a more unified getIndexDir can be added to the next release of Solr, as this is a fairly significant bug to me. Cheers Amit On Thu, Feb 21, 2013 at 12:56 AM, Amit Nithian anith...@gmail.com wrote: So the diff in generation numbers is due to the commits I believe Solr does when it has the new index files, but the fact that it's downloading a new index each time is baffling, and I just noticed that too (hit the replicate button and noticed a full index download). I'm going to pop into the source and see what's going on, unless there's a known bug filed about this? On Tue, Feb 19, 2013 at 1:48 AM, Raúl Grande Durán raulgrand...@hotmail.com wrote: Hello. We have recently updated our Solr from 3.5 to 4.1 and everything is running perfectly except the replication between nodes. We have a master-repeater-2slaves architecture and we have seen some things that weren't happening before: When a slave (repeater or slaves) starts to replicate, it needs to download the entire index, even when only some little changes have been made to the index at the master. This takes a long time since our index is more than 20 GB. After a replication cycle we have different index generations in master, repeater and slaves. For example:

Master: gen. 64590
Repeater: gen. 64591
Both slaves: gen. 64592

My replicationHandler configuration is like this:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="enable">${enable.master:false}</str>
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
  <lst name="slave">
    <str name="enable">${enable.slave:false}</str>
    <str name="masterUrl">${solr.master.url:http://localhost/solr}</str>
    <str name="pollInterval">00:03:00</str>
  </lst>
</requestHandler>

Our problems are very similar to those explained here: http://lucene.472066.n3.nabble.com/Problem-with-replication-td2294313.html Any ideas?? Thanks
Re: Slaves always replicate entire index Index versions
Thanks for the links... I have updated SOLR-4471 with a proposed solution that I hope can be incorporated or amended so we can get a clean fix into the next version, so our operations and network staff will be happier with not having gigs of data flying around the network :-) On Thu, Feb 21, 2013 at 1:24 AM, raulgrande83 raulgrand...@hotmail.com wrote: Hi Amit, I have come across some JIRAs that may be useful in this issue: https://issues.apache.org/jira/browse/SOLR-4471 https://issues.apache.org/jira/browse/SOLR-4354 https://issues.apache.org/jira/browse/SOLR-4303 https://issues.apache.org/jira/browse/SOLR-4413 https://issues.apache.org/jira/browse/SOLR-2326 Please, let us know if you find any solution. Regards. -- View this message in context: http://lucene.472066.n3.nabble.com/Slaves-always-replicate-entire-index-Index-versions-tp4041256p4041817.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Slaves always replicate entire index Index versions
Sounds good. I am trying the combination of my patch and SOLR-4413 now to see how it works, and will have to see if I can put unit tests around them, as some of what I thought may not be true with respect to the commit generation numbers. For your issue above in your last post, is it possible that there was a commit on the master in that slight window after Solr checks for the latest generation of the master but before it downloads the actual files? How frequent are the commits on your master? On Thu, Feb 21, 2013 at 2:00 AM, raulgrande83 raulgrand...@hotmail.com wrote: Thanks for the patch, we'll try to install these fixes and post whether replication works or not. I renamed 'index.timestamp' folders to just 'index' but it didn't work. These lines appeared in the log:

INFO: Master's generation: 64594
21-feb-2013 10:42:00 org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave's generation: 64593
21-feb-2013 10:42:00 org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Starting replication process
21-feb-2013 10:42:00 org.apache.solr.handler.SnapPuller fetchFileList
SEVERE: No files to download for index generation: 64594

-- View this message in context: http://lucene.472066.n3.nabble.com/Slaves-always-replicate-entire-index-Index-versions-tp4041256p4041827.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Anyone else see this error when running unit tests?
Okay so I think I found a solution: if you are a maven user and don't mind forcing the test codec to Lucene40, then add this to your pom.xml under the build > pluginManagement > plugins section:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <version>2.13</version>
  <configuration>
    <argLine>-Dtests.codec=Lucene40</argLine>
  </configuration>
</plugin>

If you are running in Eclipse, simply add -Dtests.codec=Lucene40 as a VM argument. The default test codec is set to random, and this means there is a possibility of picking Lucene3x if a random variable takes a certain value and other conditions are met. For me, my test-framework jar must not be ahead of the lucene one (b/c I don't control the classpath order and honestly this shouldn't be a requirement to run a test) so it periodically bombed. This little fix seems to have helped, provided that you don't care about Lucene3x vs Lucene40 for your tests (I am on Lucene40 so it's fine for me). HTH! Amit On Mon, Feb 4, 2013 at 6:18 PM, Roman Chyla roman.ch...@gmail.com wrote: Me too, it fails randomly with test classes. We use Solr 4.0 for testing, no maven, only ant. --roman On 4 Feb 2013 20:48, Mike Schultz mike.schu...@gmail.com wrote: Yes. Just today actually. I had some unit tests based on AbstractSolrTestCase which worked in 4.0 but in 4.1 they would fail intermittently with that error message. The key to this behavior is found by looking at the code in the lucene class TestRuleSetupAndRestoreClassEnv. I don't understand it completely, but there are a number of random code paths through there. The following helped me get around the problem, at least in the short term:

@org.apache.lucene.util.LuceneTestCase.SuppressCodecs({"Lucene3x", "Lucene40"})
public class CoreLevelTest extends AbstractSolrTestCase {

I also need to call this inside my setUp() method; in 4.0 this wasn't required:

initCore("solrconfig.xml", "schema.xml", "/tmp/my-solr-home");

-- View this message in context: http://lucene.472066.n3.nabble.com/Anyone-else-see-this-error-when-running-unit-tests-tp4015034p4038472.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: replication problems with solr4.1
I may be missing something, but let me go back to your original statements: 1) You build the index once per week from scratch. 2) You replicate this from master to slave. My understanding of the way replication works is that it's meant to only send along files that are new, and if any files named the same between the master and slave have different sizes, then this is treated as a corruption of sorts: Solr creates an index.timestamp directory and sends the full index down. This, I think, explains your index.timestamp issue, although why the old index/ directory isn't being deleted I'm not sure about. This is why I was asking about OS details, file system details etc. (perhaps something else is locking that directory, preventing Java from deleting it?) The second issue is the index generation, which is governed by commits and is represented by the last few characters in the segments_XX file. When the slave downloads the index and does the copy of the new files, it does a commit to force a new searcher, hence why the slave generation will be +1 from the master. The index version is a timestamp, and it may be the case that the version represents the point in time when the index was downloaded to the slave? In general, these details shouldn't matter, because replication is only triggered if the master's version differs from the slave's version and the clocks that all servers use are synched to some common clock. Caveat however in my answer is that I have yet to try 4.1, as this is next on my TODO list, so maybe I'll run into the same problem :-) but I wanted to provide some info as I just recently dug through the replication code to understand it better myself. Cheers Amit On Wed, Feb 13, 2013 at 11:57 PM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: OK, then index generation and index version are out of count when it comes to verifying that master and slave index are in sync. What else is possible? The strange thing is: if the master is 2 or more generations ahead of the slave, then it works!
With your logic the slave must _always_ be one generation ahead of the master, because the slave replicates from the master and then does an additional commit to recognize the changes on the slave. This implies that the slave acts as follows: - if the master is one generation ahead, then do an additional commit - if the master is 2 or more generations ahead, then do _no_ commit OR - if the master is 2 or more generations ahead, then do a commit but don't change the generation and version of the index. Can this be true? I would say not really. Regards Bernd Am 13.02.2013 20:38, schrieb Amit Nithian: Okay so then that should explain the generation difference of 1 between the master and slave On Wed, Feb 13, 2013 at 10:26 AM, Mark Miller markrmil...@gmail.com wrote: On Feb 13, 2013, at 1:17 PM, Amit Nithian anith...@gmail.com wrote: doesn't it do a commit to force solr to recognize the changes? yes. - Mark
Re: Boost Specific Phrase
Have you looked at the pf parameter for dismax handlers? pf does, I think, what you are looking for, which is to boost documents where the query terms match exactly in the various fields, with some phrase slop. On Wed, Feb 13, 2013 at 2:59 AM, Hemant Verma hemantverm...@gmail.com wrote: Hi All I have a use case with phrase search. Let's say I have a list of phrases in a file/dictionary which are important as per our search content. One entry in the dictionary is, let's say, project manager. If the user's query contains any entry specified in the dictionary then I want to boost the score of documents which have an exact match of that entry. Let's take one example: suppose the user searches for (project manager in India with 2 yrs experience). The words 'project manager' appear in the query in the exact order specified in the dictionary, so I want to boost the score of documents having 'project manager' as an exact match. This can be done at the web application level, after processing the user query against the dictionary, by creating a query like: q=project manager in India with 2 yrs experience&qf=title&bq=title:"project manager"^5 I want to know if there is any better solution available for this use case at the Solr level. AFAIK there is something very similar available in FAST ESP known as Phrase Recognition. Thanks Hemant -- View this message in context: http://lucene.472066.n3.nabble.com/Boost-Specific-Phrase-tp4040188.html Sent from the Solr - User mailing list archive at Nabble.com.
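The dictionary-driven bq idea described above can be sketched at the application level like this (Python; the field name, boost value, and helper function are hypothetical illustrations, not a Solr API):

```python
def phrase_boost_clauses(user_query, dictionary, field="title", boost=5):
    # For each dictionary phrase found verbatim in the user's query,
    # emit a bq clause boosting exact phrase matches on that field.
    q = user_query.lower()
    return ['%s:"%s"^%d' % (field, p, boost)
            for p in dictionary if p.lower() in q]

bq = " ".join(phrase_boost_clauses(
    "project manager in India with 2 yrs experience",
    ["project manager", "data scientist"]))
print(bq)  # title:"project manager"^5
```

The resulting string would be appended to the request as the bq parameter alongside q and qf.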
Re: what do you use for testing relevance?
Ultimately this is dependent on what your metrics for success are. For some places it may be just raw CTR (did my click-through rate increase), but for other places it may be a function of money (gross revenue, profits, # items sold, etc.). I don't know if there is a generic answer for this question, which is what leads people to write their own frameworks, b/c it's very specific to your needs. A scoring change that leads to an increase in CTR may not necessarily lead to an increase in the metric that makes your business go. On Tue, Feb 12, 2013 at 10:31 PM, Steffen Elberg Godskesen steffen.godske...@gmail.com wrote: Hi Roman, If you're looking for regression testing then https://github.com/sul-dlss/rspec-solr might be worth looking at. If you're not a ruby shop, doing something similar in another language shouldn't be too hard. The basic idea is that you set up a set of tests like: If the query is X, then the document with id Y should be in the first 10 results. If the query is S, then a document with title T should be the first result. If the query is P, then a document with author Q should not be in the first 10 results. You run these whenever you tune your scoring formula to ensure that you haven't introduced unintended effects. New ideas/requirements for your relevance ranking should always result in writing new tests - tests that will probably fail until you tune your scoring formula. This is certainly no magic bullet, but it will give you some confidence that you didn't make things worse. And - in my humble opinion - it also gives you the benefit of discouraging you from tuning your scoring just for fun. To put it bluntly: if you cannot write up a requirement in the form of a test, you probably have no need to tune your scoring. Regards, -- Steffen On Tuesday, February 12, 2013 at 23:03, Roman Chyla wrote: Hi, I do realize this is a very broad question, but still I need to ask it. Suppose you make a change into the scoring formula.
How do you test/know/see what impact it had? Any framework out there? It seems like people are writing their own tools to measure relevancy. Thanks for any pointers, roman
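The rspec-solr style of regression test described above can be sketched in any language; here is a minimal Python version in which `search()` is a stub standing in for a real query against your engine (all names and data are illustrative):

```python
def search(query):
    # Stub standing in for a real search call; returns ranked doc ids.
    fake_results = {
        "solr replication": ["doc9", "doc2", "doc5"],
        "zookeeper": ["doc1", "doc7"],
    }
    return fake_results.get(query, [])

def in_top_n(query, doc_id, n=10):
    # "If the query is X, doc Y should be in the first n results."
    return doc_id in search(query)[:n]

# Relevance requirements, written as tests:
checks = [
    ("solr replication", "doc2", 10),  # doc2 must be in the first 10 results
    ("zookeeper", "doc1", 1),          # doc1 must be the first result
]
failures = [(q, d, n) for (q, d, n) in checks if not in_top_n(q, d, n)]
print("relevance regressions:", failures)
```

Run against a live index before and after a scoring change, a non-empty `failures` list flags the unintended effects the post describes.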
Re: replication problems with solr4.1
So just a hunch... but when the slave downloads the data from the master, doesn't it do a commit to force solr to recognize the changes? In so doing, wouldn't that increase the generation number? In theory it shouldn't matter because the replication looks for files that are different to determine whether or not to do a full download or a partial replication. In the event of a full replication (an optimize would cause this), I think the replication handler considers this a corruption and forces a full download into this index.timestamp folder with the index.properties pointing at this folder to tell solr this is the new index directory. Since you mentioned you rebuild the index from scratch once per week I'd expect to see this behavior you are mentioning. I remember debugging the code to find out how replication works in 4.0 because of a bug that was fixed in 4.1 but I haven't read through the 4.1 code to see how much (if any) has changed from this logic. In short, I don't know why you'd have the old index/ directory there.. that seems either like a bug or something was locking that directory in the filesystem preventing it from being removed. What OS are you using and is the index/ directory stored on a local file system vs NFS? HTH Amit On Tue, Feb 12, 2013 at 2:26 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Now this is strange, the index generation and index version is changing with replication. e.g. master has index generation 118 index version 136059533234 and slave has index generation 118 index version 136059533234 are both same. Now add one doc to master with commit. master has index generation 119 index version 1360595446556 Next replicate master to slave. The result is: master has index generation 119 index version 1360595446556 slave has index generation 120 index version 1360595564333 I have not seen this before. I thought replication is just taking over the index from master to slave, more like a sync? 
Am 11.02.2013 09:29, schrieb Bernd Fehling: Hi list, after upgrading from solr4.0 to solr4.1 and running it for two weeks now, it turns out that replication has problems and unpredictable results. My installation is a single index, 41 mio. docs / 115 GB index size / 1 master / 3 slaves. - the master builds a new index from scratch once a week - a replication is started manually with the Solr admin GUI. What I see is one of these cases: - after a replication, a new searcher is opened on an index.xxx directory, the old data/index/ directory is never deleted, and besides the file replication.properties there is also a file index.properties OR - the replication takes place and everything looks fine, but when opening the admin GUI the statistics report:

Last Modified: a day ago
Num Docs: 42262349
Max Doc: 42262349
Deleted Docs: 0
Version: 45174
Segment Count: 1

        Version        Gen  Size
Master: 1360483635404  112  116.5 GB
Slave:  1360483806741  113  116.5 GB

In the first case, why is the replication doing that??? It is an offline slave, no search activity, it is just there for backup! In the second case, why are the version and generation different right after a full replication? Any thoughts on this? - Bernd -- * Bernd Fehling, Bielefeld University Library Dipl.-Inform. (FH), LibTec - Library Technology and Knowledge Management Universitätsstr. 25, 33615 Bielefeld Tel. +49 521 106-4060 bernd.fehling(at)uni-bielefeld.de BASE - Bielefeld Academic Search Engine - www.base-search.net *
Re: replication problems with solr4.1
Okay so then that should explain the generation difference of 1 between the master and slave On Wed, Feb 13, 2013 at 10:26 AM, Mark Miller markrmil...@gmail.com wrote: On Feb 13, 2013, at 1:17 PM, Amit Nithian anith...@gmail.com wrote: doesn't it do a commit to force solr to recognize the changes? yes. - Mark
Re: Boost Specific Phrase
Ah yes, sorry, I misunderstood. Another option is to use n-grams so that projectmanager is a term; any query involving project manager in india with 2 years experience would then match higher, because the query would contain projectmanager as a term. On Wed, Feb 13, 2013 at 9:56 PM, Hemant Verma hemantverm...@gmail.com wrote: Thanks for the response. The pf parameter actually boosts the documents considering all search keywords mentioned in the main query, but I am looking for something which boosts the documents considering only a few search keywords from the user query. As per the example, the user query is (project manager in India with 2 yrs experience) and my dictionary contains one entry 'project manager', which specifies that if the user query has 'project manager' in it, then boost those documents which contain 'project manager' as an exact match. -- View this message in context: http://lucene.472066.n3.nabble.com/Boost-Specific-Phrase-tp4040188p4040371.html Sent from the Solr - User mailing list archive at Nabble.com.
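A sketch (Python) of the word-level n-gram idea suggested above: adjacent words are joined into single tokens at index and query time, so the phrase 'project manager' survives as one matchable term. In Solr this would be done with an analysis-chain filter such as a shingle filter; the function below only illustrates the token stream such a filter would produce, it is not Solr code.

```python
def word_shingles(text, size=2, sep=""):
    # Join each run of `size` adjacent words into one token, e.g.
    # "project manager in india" -> projectmanager, managerin, inindia
    words = text.lower().split()
    return [sep.join(words[i:i + size])
            for i in range(len(words) - size + 1)]

print(word_shingles("project manager in India with 2 yrs experience"))
```

A document containing "project manager" and a query containing the same phrase would both produce the token projectmanager, so the exact phrase match scores without a bq clause.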
Re: Benefits of Solr over Lucene?
Adding to Jack's reply: Solr can also be embedded into an application and run in the same process. Solr is the server-ization of Lucene. The line is very blurred, and Solr is not a very thin wrapper around the Lucene library. Many Solr features are distinct from Lucene, like:
- detailed breakdown of scoring mathematics
- text analysis phases - Solr adds to Lucene's text analysis library and makes it configurable through XML
- introduces the notion of field types
- runtime performance stats including cache hit/miss rate
Rgds AJ On 12-Feb-2013, at 22:17, Jack Krupansky j...@basetechnology.com wrote: Here's yet another short list of benefits of Solr over Lucene (not that any of them take away from Lucene, since Solr is based on Lucene): - Multiple core index - go beyond the limits of a single lucene index - Support for multi-core or named collections - richer query parsers (e.g., schema-aware, edismax) - schema language, including configurable field types and configurable analyzers - easier to do per-field/type analysis - plugin architecture, easily configured and customized - Generally, develop a search engine without writing any code, and what code you may write is mostly easily configured plugins - Editable configuration file rather than hard-coded or app-specific properties - Tomcat/Jetty container support enables system administration as corporate IT ops teams already know it - Web-based Admin UI, including debugging features such as field/type analysis - Solr search features are available to any app written in any language, not just Java. All you need is HTTP access. (Granted, there is SOME support for Lucene in SOME other languages.) In short, if you want to embed search engine capabilities in your Java app, Lucene is the way to go, but if you want a web architecture, with the search engine in a separate process from the app in a multi-tier architecture, Solr is the way to go.
Granted, you could also use ElasticSearch or roll your own, but Solr basically runs right out of the box, with no code development needed to get started and no Java knowledge needed. And to be clear, Solr is not simply an extension of Lucene - Solr is a distinct architectural component that is based on Lucene. In OOP terms, think of composition rather than derivation.

-- Jack Krupansky

-----Original Message----- From: JohnRodey Sent: Tuesday, February 12, 2013 10:40 AM To: solr-user@lucene.apache.org Subject: Benefits of Solr over Lucene?

I know that Solr web-enables a Lucene index, but I'm trying to figure out what other things Solr offers over Lucene. The Solr features list says "Solr uses the Lucene search library and extends it!", but what exactly are the extensions, and what does Lucene already give you? Also, if I have an index built through Solr, is there a non-HTTP way to search that index? Because SolrJ essentially just makes HTTP requests, correct? Some features I'm particularly interested in are: geospatial search, highlighting, dynamic fields, near-real-time indexing, and multiple search indices. Thanks!

-- View this message in context: http://lucene.472066.n3.nabble.com/Benefits-of-Solr-over-Lucene-tp4039964.html Sent from the Solr - User mailing list archive at Nabble.com.
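Jack's point that Solr is usable from any language via HTTP can be illustrated with a short sketch using only the Python standard library; the host, port, and core name here are hypothetical defaults, not taken from the thread:

```python
from urllib.parse import urlencode

# Hypothetical local Solr instance and core; any HTTP client in any
# language can issue the same request.
base = "http://localhost:8983/solr/collection1/select"

# Build a select query that also requests highlighting on the title field.
params = urlencode({
    "q": "title:manager",
    "hl": "true",      # enable highlighting
    "hl.fl": "title",  # highlight the title field
    "wt": "json",      # ask for a JSON response
})
url = base + "?" + params

# Against a running Solr, urllib.request.urlopen(url) would return the
# JSON search response; here we only construct the request URL.
print(url)
```

Features such as highlighting, geospatial search, and dynamic fields are all driven by request parameters like these, which is why no Java is required on the client side.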
Re: Solr HTTP Replication Question
Okay, one last note... just for closure... it looks like this was addressed in Solr 4.1+ (I was looking at 4.0).

On Thu, Jan 24, 2013 at 11:14 PM, Amit Nithian anith...@gmail.com wrote:

Okay, so after some debugging I found the problem. The replication piece will download the index from the master server and move the files to the index directory, but during the commit phase these older-generation files are deleted and the index is essentially left intact. I noticed that a full copy is needed if the index is stale (meaning that files in common between the master and slave have different sizes), but I think a full copy should also be needed if the slave's generation is higher than the master's. In short, to me it's not sufficient to simply say a full copy is needed if the slave's index version is >= the master's index version. I'll create a patch and file a bug along with a more thorough writeup of how I got into this state. Thanks! Amit

On Thu, Jan 24, 2013 at 2:33 PM, Amit Nithian anith...@gmail.com wrote:

Does Solr's replication look at the generation difference between master and slave when determining whether or not to replicate? To be more clear: what happens if a slave's generation is higher than the master's, yet the slave's index version is less than the master's index version? I looked at the source and didn't see any reason why the generation matters, other than fetching the file list from the master for a given generation. It's too wordy to explain how this happened, so I'll go into detail if anyone cares. Thanks! Amit
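The check Amit is proposing can be sketched as a small decision function; the function name and exact semantics are my paraphrase of the thread, not Solr's actual replication code:

```python
def is_full_copy_needed(master_version, master_gen, slave_version, slave_gen):
    """Decide whether the slave should discard its index and do a full copy.

    The existing check was roughly: full copy if the slave's index version
    is at or ahead of the master's. Amit's point is that a slave whose
    *generation* is ahead of the master's should also force a full copy.
    """
    if slave_version >= master_version:
        return True
    if slave_gen > master_gen:  # the proposed additional condition
        return True
    return False

# A slave that is behind in version but somehow ahead in generation
# (the state described in the thread) now triggers a full copy:
print(is_full_copy_needed(master_version=10, master_gen=5,
                          slave_version=8, slave_gen=7))  # → True
```

Without the second condition, this state would fall through to an incremental fetch, which is exactly the stuck-slave behavior the thread describes.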