Block Join Score Highlighting
Hello, I am trying out block joins for my index at the moment, as I have many documents that are mainly variations of the same search content. In my case denormalization is not an option, so I am using nested documents. The structure looks like this:

  <doc>
    content
    <doc>
      filter
      boost
      required info
    </doc>
  </doc>

I search within the parent document and filter on the child documents. I get the correct documents this way, but I have issues with scoring and highlighting. I am currently searching on the parent document and returning the child document, as the children hold specific information I require. I use the boost field of the child to boost the score of the documents individually.

1. I want the highlighting snippet from the parent document, but the snippets returned are empty as they are based on the children.
2. I also want to use the score from the parent document search together with the child boost, but now I only get the score from filtering the child nodes (which is 0).

I also tried it the other way around, returning the parent node and only filtering on the child node, but in that case I can't boost on the specific child or return the information within that child that I need. Are there options to work around these issues? Or are they just not supported at the moment?
How to Add a New core
Hello All. How do I add a new core in Solr? My solr directory is /usr/share/solr-4.6.1/example/solr and it has only one collection, i.e. collection1. To add the new core I added a directory collection2, and in it I created two more directories: /conf and /lib. Now, what should the entry in the solr.xml file be?

  <solr>
    <solrcloud>
      <str name="host">${host:}</str>
      <int name="hostPort">${jetty.port:8983}</int>
      <str name="hostContext">${hostContext:solr}</str>
      <int name="zkClientTimeout">${zkClientTimeout:15000}</int>
      <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
    </solrcloud>
    <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
      <int name="socketTimeout">${socketTimeout:0}</int>
      <int name="connTimeout">${connTimeout:0}</int>
    </shardHandlerFactory>
  </solr>

What do I add in this file to register the new core? Please guide me. -- Regards, *Sohan Kalsariya*
Getting min and max of a solr field for each group while doing field collapsing/result grouping
Hi, I am using SolrCloud for getting results grouped by a particular field. Now, I also want to get the min and max value of a particular field for each group. For example, if I am grouping results by city, then I also want to get the minimum and maximum price for each city. Is this possible to do with Solr? Thanks in advance! -- Varun Gupta
Re: Getting min and max of a solr field for each group while doing field collapsing/result grouping
Hi Varun, I think you can use group.truncate=true with the stats component: http://wiki.apache.org/solr/StatsComponent "If true, facet counts are based on the most relevant document of each group matching the query. Same applies for StatsComponent. Default is false. Supported from Solr 3.4 and up." On Thursday, May 1, 2014 12:30 PM, Varun Gupta varun.vgu...@gmail.com wrote: Hi, I am using SolrCloud for getting results grouped by a particular field. Now, I also want to get the min and max value of a particular field for each group. For example, if I am grouping results by city, then I also want to get the minimum and maximum price for each city. Is this possible to do with Solr? Thanks in advance! -- Varun Gupta
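(For reference, Ahmet's suggestion would look roughly like this as a request, reusing the city/price fields from the example above; a sketch, not tested:

  q=*:*&rows=0&group=true&group.field=city&group.truncate=true&stats=true&stats.field=price

With group.truncate=true, the stats are computed only over the most relevant document of each group.)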
Re: Getting min and max of a solr field for each group while doing field collapsing/result grouping
Hi Ahmet, Thanks for the information! But as per the Solr documentation, group.truncate is not supported in distributed searches, and I am looking for a solution that can work on SolrCloud. -- Varun Gupta On Thu, May 1, 2014 at 4:12 PM, Ahmet Arslan iori...@yahoo.com wrote: Hi Varun, I think you can use group.truncate=true with the stats component: http://wiki.apache.org/solr/StatsComponent "If true, facet counts are based on the most relevant document of each group matching the query. Same applies for StatsComponent. Default is false. Supported from Solr 3.4 and up." On Thursday, May 1, 2014 12:30 PM, Varun Gupta varun.vgu...@gmail.com wrote: Hi, I am using SolrCloud for getting results grouped by a particular field. Now, I also want to get the min and max value of a particular field for each group. For example, if I am grouping results by city, then I also want to get the minimum and maximum price for each city. Is this possible to do with Solr? Thanks in advance! -- Varun Gupta
Re: How to Add a New core
On 5/1/2014 1:49 AM, Sohan Kalsariya wrote: Hello All. How do I add a new core in Solr? My solr directory is /usr/share/solr-4.6.1/example/solr and it has only one collection, i.e. collection1. To add the new core I added a directory collection2, and in it I created two more directories: /conf and /lib. Now, what should the entry in the solr.xml file be?

Your solr.xml is the new format. This format became usable in 4.4.0, and the solr.xml in the example was upgraded in 4.4 to use it. The new solr.xml format means that Solr is doing core discovery. This means that you don't add anything to the solr.xml.

One way to do it: Create a core.properties file in the collection2 directory that is similar to the one you'll find in the collection1 directory, and restart Solr. The original core and the new core will both be discovered because Solr is looking for the core.properties file. http://wiki.apache.org/solr/Core%20Discovery%20%284.4%20and%20beyond%29

Another way to do it: Once the conf directory exists with schema.xml, solrconfig.xml, and any other config files that those reference, you can also use the CoreAdmin API, which is exposed in the Solr admin UI. You need the CREATE action. This should create the core.properties file and add the core without requiring a Solr restart.

Thanks, Shawn
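(For illustration, the two approaches Shawn describes might look like this; host, port and paths are assumptions based on the example layout above:

  # collection2/core.properties -- a single line is enough; an empty file
  # also works, in which case the core name defaults to the directory name
  name=collection2

or, with the CoreAdmin API:

  http://localhost:8983/solr/admin/cores?action=CREATE&name=collection2&instanceDir=collection2
)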
Searching for tokens does not return any results
Hello everyone, I am new to Solr and this is my first post in this list. I have been working on this problem for a couple of days. I tried everything I found in Google, but it looks like I am missing something. Here is my problem: I have a field called DBASE_LOCAT_NM_TEXT. It contains values like CRD_PROD. The goal is to be able to search this field either by putting the exact string CRD_PROD or part of it (tokenized by _) like CRD or PROD. Currently, this query returns results: q=DBASE_LOCAT_NM_TEXT:CRD_PROD, but this does not: q=DBASE_LOCAT_NM_TEXT:CRD. I want to understand why the second query does not return any results. Here is how I configured the field:

  <field name="DBASE_LOCAT_NM_TEXT" type="text_general" indexed="true" stored="true" required="false" multiValued="false"/>

And here is how I configured the field type:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

I am also using the analysis panel in the Solr admin console. It shows this:

  WT    CRD_PROD
  WDF   CRD_PROD CRD PROD CRDPROD
  SF    CRD_PROD CRD PROD CRDPROD
  LCF   crd_prod crd prod crdprod
  SKMF  crd_prod crd prod crdprod
  RDTF  crd_prod crd prod crdprod

I am not sure if it is related or not, but this index was created by a Java program using the Lucene interface. It used StandardAnalyzer for writing, and the field was configured as tokenized, indexed and stored. Does this affect the Solr configuration? Can you please help me understand what I am missing and how I can debug it? Thanks, Yetkin
Re: Block Join Score Highlighting
Hello, Score support is addressed at https://issues.apache.org/jira/browse/SOLR-5882. Highlighting is another story. Be aware of http://heliosearch.org/expand-block-join/ as it might be useful for your problem.

On Thu, May 1, 2014 at 11:32 AM, StrW_dev r.j.bamb...@structweb.nl wrote: Hello, I am trying out block joins for my index at the moment, as I have many documents that are mainly variations of the same search content. In my case denormalization is not an option, so I am using nested documents. The structure looks like this:

  <doc>
    content
    <doc>
      filter
      boost
      required info
    </doc>
  </doc>

I search within the parent document and filter on the child documents. I get the correct documents this way, but I have issues with scoring and highlighting. I am currently searching on the parent document and returning the child document, as the children hold specific information I require. I use the boost field of the child to boost the score of the documents individually. 1. I want the highlighting snippet from the parent document, but the snippets returned are empty as they are based on the children. 2. I also want to use the score from the parent document search together with the child boost, but now I only get the score from filtering the child nodes (which is 0). I also tried it the other way around, returning the parent node and only filtering on the child node, but in that case I can't boost on the specific child or return the information within that child that I need. Are there options to work around these issues? Or are they just not supported at the moment?

-- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
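(As a point of reference, the block join query parsers discussed above use this syntax; the field names here are hypothetical:

  Return parent documents whose children match:
    q={!parent which="type_s:parent"}child_field:value

  Return child documents whose parents match:
    q={!child of="type_s:parent"}parent_field:value

In both cases the filter given to which/of must match all parent documents in the index.)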
Re: Searching for tokens does not return any results
Hi Yetkin, You are on the right track by examining the analysis page. How is your query analyzed using the query analyzer? According to what you pasted, q=CRD should return your example document. Did you change something in schema.xml and forget to restart Solr and re-index? By the way, a simple letter-tokenizer-based lowercase tokenizer seems a better fit for your use case. With this you don't have to deal with WDF's parameters. https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-LowerCaseTokenizer Ahmet

On Thursday, May 1, 2014 5:04 PM, Yetkin Ozkucur yetkin.ozku...@asg.com wrote: Hello everyone, I am new to Solr and this is my first post in this list. I have been working on this problem for a couple of days. I tried everything I found in Google, but it looks like I am missing something. Here is my problem: I have a field called DBASE_LOCAT_NM_TEXT. It contains values like CRD_PROD. The goal is to be able to search this field either by putting the exact string CRD_PROD or part of it (tokenized by _) like CRD or PROD. Currently, this query returns results: q=DBASE_LOCAT_NM_TEXT:CRD_PROD, but this does not: q=DBASE_LOCAT_NM_TEXT:CRD. I want to understand why the second query does not return any results. Here is how I configured the field:

  <field name="DBASE_LOCAT_NM_TEXT" type="text_general" indexed="true" stored="true" required="false" multiValued="false"/>

And here is how I configured the field type:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

I am also using the analysis panel in the Solr admin console. It shows this:

  WT    CRD_PROD
  WDF   CRD_PROD CRD PROD CRDPROD
  SF    CRD_PROD CRD PROD CRDPROD
  LCF   crd_prod crd prod crdprod
  SKMF  crd_prod crd prod crdprod
  RDTF  crd_prod crd prod crdprod

I am not sure if it is related or not, but this index was created by a Java program using the Lucene interface. It used StandardAnalyzer for writing, and the field was configured as tokenized, indexed and stored. Does this affect the Solr configuration? Can you please help me understand what I am missing and how I can debug it? Thanks, Yetkin
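(A minimal field type along the lines of Ahmet's suggestion might look like this; the type name is made up:

  <fieldType name="text_lowercase" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    </analyzer>
  </fieldType>

LowerCaseTokenizer splits on non-letters and lowercases, so CRD_PROD is indexed as crd and prod, with no WordDelimiterFilter parameters to tune.)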
Roll up query with original facets
Hello All, I am having a query issue I cannot seem to find the correct answer for. I am searching against a list of items and returning facets for that list of items. I would like to group the result set on a field such as "parentItemId". parentItemId maps to other documents within the same core. I would like my query to return the documents that match parentItemId, but still return the facets of the original query. Is this possible with Solr 4.3, which I am running? I can provide more details if needed, thanks! Darin
Please add me as Solr Contributor
My wiki username is KeithThoma. Please add me to the list so I will be able to make updates to the Solr Wiki. Keith Thoma
Re: timeAllowed is not honored
On 4/30/2014 5:53 PM, Aman Tandon wrote: Shawn - Yes we have some plans to move to SolrCloud, Our total index size is 40GB with 11M of Docs, Available RAM 32GB, Allowed heap space for solr is 14GB, the GC tuning parameters using in our server is -XX:+UseConcMarkSweepGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCTimeStamps. This means that you have about 18GB of RAM left over to cache a 40GB index. That's less than 50 percent. Every index is different, but this is in the ballpark of where performance problems begin. If you had 48GB of RAM, your performance (not counting possible GC problems) would likely be very good. 64GB would be ideal. Your only GC tuning is switching the collector to CMS. This won't be enough. When I had a config like this and heap of only 8GB, I was seeing GC pauses of 10 to 12 seconds. http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning One question: Do you really need 14GB of heap? One of my servers has a total of 65GB of index (54 million docs) with a 7GB heap and 64GB of RAM. Currently I don't use facets, though. When I do, they will be enum. If you switch all your facets to enum, your heap requirements may go down. Decreasing the heap size will make more memory available for index caching. Thanks, Shawn
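(As a rough sketch of what fuller CMS tuning looks like, beyond just switching the collector; the exact values here are illustrative, and the wiki page linked above has a worked-out set of parameters:

  -Xms7g -Xmx7g
  -XX:+UseConcMarkSweepGC
  -XX:+UseParNewGC
  -XX:CMSInitiatingOccupancyFraction=70
  -XX:+UseCMSInitiatingOccupancyOnly
)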
Re: Please add me as Solr Contributor
I’ve added you Keith, go ahead :) -Stefan On Thursday, May 1, 2014 at 4:42 PM, Keith Thoma wrote: My wiki username is KeithThoma. Please add me to the list so I will be able to make updates to the Solr Wiki. Keith Thoma
Re: overseer queue clogged
I saw a clogged overseer queue as well, due to a bad message in the queue. Unfortunately this went unnoticed for a while, until there were 130K messages in the overseer queue. Since it was a production system, we were not able to simply stop everything and delete all ZooKeeper data, so we manually deleted messages by issuing commands directly through the zkCli.sh tool. After all the messages had been cleared, some nodes were in the wrong state (e.g. 'down' when they should have been 'active'). Restarting the 'down' or 'recovery failed' nodes brought the whole cluster back to a stable and healthy state. Since it can take some digging to determine that there is a backlog in the overseer queue, some of the symptoms we saw were:

- The Overseer throwing an exception like "Path must not end with / character"
- Random nodes throwing an exception like "ClusterState says we are the leader, but locally we don't think so"
- Bringing up new replicas timing out when attempting to fetch the shard id
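(For anyone in the same situation, the manual cleanup described above can be done with the standard ZooKeeper CLI; a sketch, where the host and node names are illustrative:

  $ ./zkCli.sh -server zkhost:2181
  [zk] ls /overseer/queue
  [zk] delete /overseer/queue/qn-0000000001

Deleting messages out from under a running Overseer is a last resort; as noted above, expect to restart nodes that end up in the wrong state afterwards.)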
Re: overseer queue clogged
What version are you running? This was fixed in a recent release. It can happen if you hit "add core" with the defaults on the admin page in older versions. -- Mark Miller about.me/markrmiller

On May 1, 2014 at 11:19:54 AM, ryan.cooke (ryan.co...@gmail.com) wrote: I saw a clogged overseer queue as well, due to a bad message in the queue. Unfortunately this went unnoticed for a while, until there were 130K messages in the overseer queue. Since it was a production system, we were not able to simply stop everything and delete all ZooKeeper data, so we manually deleted messages by issuing commands directly through the zkCli.sh tool. After all the messages had been cleared, some nodes were in the wrong state (e.g. 'down' when they should have been 'active'). Restarting the 'down' or 'recovery failed' nodes brought the whole cluster back to a stable and healthy state. Since it can take some digging to determine that there is a backlog in the overseer queue, some of the symptoms we saw were:

- The Overseer throwing an exception like "Path must not end with / character"
- Random nodes throwing an exception like "ClusterState says we are the leader, but locally we don't think so"
- Bringing up new replicas timing out when attempting to fetch the shard id
RE: Shards don't return documents in same order
Hi Erick, thank you for your response. You are right, I changed alphaOnlySort to keep letters and numbers and to remove some articles (a, an, the). This is the field type definition:

  <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TrimFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" replace="all" replacement="" pattern="(\b(a|an|the)\b|[^a-z,0-9])"/>
    </analyzer>
  </fieldType>

Then I tested each name with the admin UI on each server, and these are the results:

  server1
    MB20140410A     = mb20140410a
    MB20140411A     = mb20140411a
    MB20140410A-New = mb20140410anew
  server2
    MB20140410A     = mb20140410a
    MB20140411A     = mb20140411a
    MB20140410A-New = mb20140410anew
  server3
    MB20140410A     = mb20140410a
    MB20140411A     = mb20140411a
    MB20140410A-New = mb20140410anew

Unfortunately, all results are identical, so is there a way to view the data actually indexed in these documents? Could it be a problem with a particular server? All configs are in ZooKeeper, so all cores should have the same config, right? Is there any way to force a replica to resynchronize? Regards, Francois.

From: Erick Erickson [erickerick...@gmail.com]
Sent: April 30, 2014 16:36
To: solr-user@lucene.apache.org
Subject: Re: Shards don't return documents in same order

Hmmm, take a look at the admin/analysis page for these inputs for alphaOnlySort. If you're using the stock Solr distro, you're probably not considering the effects of PatternReplaceFilterFactory, which is removing all non-letters. So these three terms reduce to mba, mba and mbanew. You can look at the actual indexed terms with the admin/schema-browser as well. That said, unless you transposed the order because you were concentrating on the numeric part, the doc with MB20140410A-New should always be sorting last. All of which is irrelevant if you're doing something else with alphaOnlySort, so please paste in the fieldType definition if you've changed it. What gets returned in the doc for _stored_ data is a verbatim copy, NOT the output of the analysis chain, which can be confusing. Oh, and Solr uses the internal Lucene doc ID to break ties, and docs on different replicas can have different internal Lucene doc IDs relative to each other as a result of merging, so that's something else to watch out for. Best, Erick

On Wed, Apr 30, 2014 at 1:06 PM, Francois Perron francois.per...@ticketmaster.com wrote: Hi guys, I have a small SolrCloud setup (3 servers, 1 collection with 1 shard and 3 replicas). In my schema, I have an alphaOnlySort field with a copyField. This is a part of my managed-schema:

  <field name="_root_" type="string" indexed="true" stored="false"/>
  <field name="_uid" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
  <field name="_version_" type="long" indexed="true" stored="true"/>
  <field name="event_id" type="string" indexed="true" stored="true"/>
  <field name="event_name" type="text_general" indexed="true" stored="true"/>
  <field name="event_name_sort" type="alphaOnlySort"/>

with the copyField:

  <copyField source="event_name" dest="event_name_sort"/>

The problem is: I query my collection with a sort on my alphaOnlySort field, but on one of my servers the sort order is not the same.
On server 1 and 2, I have this result:

  <doc>
    <str name="event_name">MB20140410A</str>
  </doc>
  <doc>
    <str name="event_name">MB20140410A-New</str>
  </doc>
  <doc>
    <str name="event_name">MB20140411A</str>
  </doc>

and on the third one, this:

  <doc>
    <str name="event_name">MB20140410A</str>
  </doc>
  <doc>
    <str name="event_name">MB20140411A</str>
  </doc>
  <doc>
    <str name="event_name">MB20140410A-New</str>
  </doc>

The doc named MB20140411A should be at the end ... Any idea? Regards
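(Regarding viewing what was really indexed: besides the admin schema browser Erick mentioned, the terms component can list indexed terms per field, e.g., assuming the stock /terms handler is configured:

  http://server1:8983/solr/collection1/terms?terms.fl=event_name_sort&terms.prefix=mb

Running that against each replica would show whether the indexed terms actually differ between servers.)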
HDS 4.8.0_01 released - solr tomcat distro
For those Tomcat fans out there, we've released HDS 4.8.0_01, based on Solr 4.8.0 of course. HDS is pretty much just Apache Solr, with the addition of a Tomcat based server.

Download: http://heliosearch.com/heliosearch-distribution-for-solr/

HDS details:
- includes a pre-configured (threads, logging, connection settings, message sizes, etc) and tested Tomcat based Solr server in the server directory
- start scripts can be run from anywhere, and allow passing JVM args on the command line (just like jetty, so it makes it easier to use)
- start scripts work around known JVM bugs
- start scripts allow setting the port from the command line, and default the stop port based off of the http port (to make it easy to run multiple servers on a single box)
- the server directory has been kept clean by stuffing all of Tomcat under the server/tc directory

Getting started:
  $ cd server
  $ bin/startup.sh

To start on a different port (e.g. 7574):
  $ cd server
  $ bin/startup.sh -Dhttp.port=7574

To shut down:
  $ cd server
  $ bin/shutdown.sh -Dhttp.port=7574

The scripts even accept -Djetty.port=7574 to make it easier to cut-n-paste from start examples using jetty. The example directory is still there too, so you can still run the jetty based server if you want.

-Yonik
http://heliosearch.org - solve Solr GC pauses with off-heap filters and fieldcache
Over-ride q.op setting at query time
I have set q.op=AND in solrconfig.xml and use edismax. I see the matches I would expect, except when I explicitly try to add boolean logic. When I type termA OR termB, I am still getting the results of termA AND termB. Am I being stupid or is this just not possible? Thanks, -Bob
Re: Over-ride q.op setting at query time
Hi Bob, Can you paste the output of debugQuery=true? On Thursday, May 1, 2014 8:00 PM, Bob Laferriere spongeb...@icloud.com wrote: I have set q.op=AND in solrconfig.xml and use edismax. I see the matches I would expect, except when I explicitly try to add boolean logic. When I type termA OR termB, I am still getting the results of termA AND termB. Am I being stupid or is this just not possible? Thanks, -Bob
Falling back on SlowFuzzyQuery
I'm working on upgrading our Solr 3 applications to Solr 4. The last piece of the puzzle involves the change in how fuzzy matching works in the new version. I'll have to rework how a key feature of our application is implemented to get the same behavior with the new FuzzyQuery as I did in the old version. I'd love to be able to get the rest of the system upgraded first and deal with that separately. I found a previous discussion pointing towards using SlowFuzzyQuery from the sandbox package: http://mail-archives.apache.org/mod_mbox/lucene-java-user/201308.mbox/%3C03be01ce98f7$da6c0760$8f441620$@thetaphi.de%3E Can someone provide a tip on how one might re-introduce SlowFuzzyQuery? After a brief search of the configuration options, it doesn't appear to be an obvious direct swap of the class. Would I need to implement a custom Query Parser or Query Handler, or is this something that can be accomplished through configuration?
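(One possible shape for the custom-parser route, sketched and untested: a QParserPlugin that builds a SlowFuzzyQuery from lucene-sandbox, which must be on the classpath. The class, parser name and parameters below are all made up, and note that no analysis is applied to the term.

  import org.apache.lucene.index.Term;
  import org.apache.lucene.sandbox.queries.SlowFuzzyQuery;
  import org.apache.lucene.search.Query;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.common.util.NamedList;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.QParser;
  import org.apache.solr.search.QParserPlugin;
  import org.apache.solr.search.SyntaxError;

  // Usage (hypothetical): q={!slowfuzzy f=name sim=0.6}smith
  public class SlowFuzzyQParserPlugin extends QParserPlugin {
    @Override
    public void init(NamedList args) {}

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
      return new QParser(qstr, localParams, params, req) {
        @Override
        public Query parse() throws SyntaxError {
          String field = localParams.get("f");           // field to search
          float sim = localParams.getFloat("sim", 0.5f); // old-style min similarity
          return new SlowFuzzyQuery(new Term(field, qstr), sim);
        }
      };
    }
  }

It would then be registered in solrconfig.xml with something like <queryParser name="slowfuzzy" class="com.example.SlowFuzzyQParserPlugin"/>.)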
Re: Over-ride q.op setting at query time
On 5/1/2014 10:59 AM, Bob Laferriere wrote: I have set q.op=AND in solrconfig.xml and use edismax. I see the match as I would expect except when I explicitly try to add binary logic. When I type termA OR term B I am still getting the results of termA AND termB. Am I being stupid or is this just not possible? This is probably the following issue: https://issues.apache.org/jira/browse/SOLR-2649 It hasn't been fixed yet. There's a very long history laid out there. I've commented on it too. Thanks, Shawn
Re: Over-ride q.op setting at query time
When using the query screen:

1. chocolate cake results in the following:

  <str name="parsedquery_toString">+(((Category2Name:chocol^40.0 | ManfProdNum:chocolate | ProductNumber:chocolate | ProductName:chocol^100.0 | Category3Name:chocol^80.0 | Category4Name:chocol^80.0 | Keywords:chocol^300.0 | ProductNameGrams:chocolate^100.0 | Category1Name:chocol) (Category2Name:cake^40.0 | ManfProdNum:cake | ProductNumber:cake | ProductName:cake^100.0 | Category3Name:cake^80.0 | Category4Name:cake^80.0 | Keywords:cake^300.0 | ProductNameGrams:cake^100.0 | Category1Name:cake))~2) (ProductName:"chocol cake"^100.0) (Keywords:"chocol cake"^300.0) (ProductNameGrams:"chocolate cake"^75.0) (Keywords:"chocol cake"^100.0)</str>

2. chocolate OR cake results in the following:

  <str name="parsedquery_toString">+((Category2Name:chocol^40.0 | ManfProdNum:chocolate | ProductNumber:chocolate | ProductName:chocol^100.0 | Category3Name:chocol^80.0 | Category4Name:chocol^80.0 | Keywords:chocol^300.0 | ProductNameGrams:chocolate^100.0 | Category1Name:chocol) (Category2Name:cake^40.0 | ManfProdNum:cake | ProductNumber:cake | ProductName:cake^100.0 | Category3Name:cake^80.0 | Category4Name:cake^80.0 | Keywords:cake^300.0 | ProductNameGrams:cake^100.0 | Category1Name:cake)) (ProductName:"chocol cake"^100.0) (Keywords:"chocol cake"^300.0) (ProductNameGrams:"chocolate cake"^75.0) (Keywords:"chocol cake"^100.0)</str>

3. If I remove q.op=AND to default to chocolate or cake:

  <str name="parsedquery_toString">+((Category2Name:chocol^40.0 | ManfProdNum:chocolate | ProductNumber:chocolate | ProductName:chocol^100.0 | Category3Name:chocol^80.0 | Category4Name:chocol^80.0 | Keywords:chocol^300.0 | ProductNameGrams:chocolate^100.0 | Category1Name:chocol) (Category2Name:cake^40.0 | ManfProdNum:cake | ProductNumber:cake | ProductName:cake^100.0 | Category3Name:cake^80.0 | Category4Name:cake^80.0 | Keywords:cake^300.0 | ProductNameGrams:cake^100.0 | Category1Name:cake)) (ProductName:"chocol cake"^100.0) (Keywords:"chocol cake"^300.0) (ProductNameGrams:"chocolate cake"^75.0) (Keywords:"chocol cake"^100.0)</str>

The parsed queries for #2 and #3 are identical. Do you know where the "AND" and "OR" logic would show up in debugQuery? I would expect the same results from the query for #2 and #3 but get different results. -Bob

On May 1, 2014, at 12:27 PM, Ahmet Arslan iori...@yahoo.com wrote: Hi Bob, Can you paste the output of debugQuery=true? On Thursday, May 1, 2014 8:00 PM, Bob Laferriere spongeb...@icloud.com wrote: I have set q.op=AND in solrconfig.xml and use edismax. I see the matches I would expect, except when I explicitly try to add boolean logic. When I type termA OR termB, I am still getting the results of termA AND termB. Am I being stupid or is this just not possible? Thanks, -Bob
XSLT Caching Warning
I get this warning when Solr (4.7.2) starts:

  WARN org.apache.solr.util.xslt.TransformerProvider - The TransformerProvider's simplistic XSLT caching mechanism is not appropriate for high load scenarios, unless a single XSLT transform is used and xsltCacheLifetimeSeconds is set to a sufficiently high value.

The solrconfig.xml setting is:

  <queryResponseWriter name="xslt" class="solr.XSLTResponseWriter">
    <int name="xsltCacheLifetimeSeconds">10</int>
  </queryResponseWriter>

Is there a different class that I should be using? Is there a higher number than 10 that will do the trick? Thanks! -- Chris
Question about Facets in Solr Cloud and their accuracy (especially when not ordered by count)
I found the following discussion from back in 2012 very helpful: http://markmail.org/thread/lkl7ffi77w7hpv6n It is probably the best description I've seen of how facets are actually calculated in SolrCloud. Thanks. I presume this is for the most part still accurate. But I have a slightly different question: instead of ordering the results for a facet by count, what if I choose to order my results by index? Will the counts still be correct in SolrCloud? I would assume the same magic (as was described in the above link) could have been just as easily applied to this approach. I assume that I could apply a prefix to the facet when ordering by index (such as 'A', to identify those products beginning with an 'A') and that the counts would still be correct. Thanks. Darin.
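(For what it's worth, the combination described above would be expressed as the following request parameters; the field name is hypothetical:

  facet=true&facet.field=product_name&facet.sort=index&facet.prefix=A
)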
Fastest way to import a large number of documents in SolrCloud
Hi guys, What would you say is the fastest way to import data into SolrCloud? Our use case: each day, do a single import of a large number of documents. Should we use SolrJ/DataImportHandler/other? Or perhaps is there a bulk import feature in Solr? I came upon this promising link: http://wiki.apache.org/solr/UpdateCSV Any idea how UpdateCSV compares performance-wise with SolrJ/DataImportHandler? If SolrJ, should we split the data into chunks and start multiple clients at once? That way we could perhaps take advantage of the multiple servers in the SolrCloud configuration. Either way, after the import is finished, should we do an optimize, a commit, or neither (http://wiki.solarium-project.org/index.php/V1:Optimize_command)? Any tips and tricks to perform this process the right way are gladly appreciated. Thanks, Costi
Re: Fastest way to import a large number of documents in SolrCloud
Hi Costi, I'd recommend SolrJ; parallelize the inserts. Also, it helps to set the commit intervals reasonably. Just to get a better perspective:

* Why do you want to do a full index every day?
* How much data are we talking about?
* What's your SolrCloud setup like?
* Do you already have some benchmarks which you're not happy with?

On Thu, May 1, 2014 at 1:47 PM, Costi Muraru costimur...@gmail.com wrote: Hi guys, What would you say is the fastest way to import data into SolrCloud? Our use case: each day, do a single import of a large number of documents. Should we use SolrJ/DataImportHandler/other? Or perhaps is there a bulk import feature in Solr? I came upon this promising link: http://wiki.apache.org/solr/UpdateCSV Any idea how UpdateCSV compares performance-wise with SolrJ/DataImportHandler? If SolrJ, should we split the data into chunks and start multiple clients at once? That way we could perhaps take advantage of the multiple servers in the SolrCloud configuration. Either way, after the import is finished, should we do an optimize, a commit, or neither (http://wiki.solarium-project.org/index.php/V1:Optimize_command)? Any tips and tricks to perform this process the right way are gladly appreciated. Thanks, Costi

-- Anshum Gupta http://www.anshumgupta.net
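(As an illustration of the SolrJ route, a minimal bulk loader might look like the sketch below; it uses the SolrJ 4.x API, and the ZooKeeper host, collection and field names are made up. Running several instances of this in parallel over different slices of the input is the kind of parallelization suggested above.

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BulkLoader {
    public static void main(String[] args) throws Exception {
      CloudSolrServer server = new CloudSolrServer("zkhost:2181");
      server.setDefaultCollection("collection1");

      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (int i = 0; i < 500000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("content_t", "body of document " + i);
        batch.add(doc);
        if (batch.size() == 1000) { // send documents in chunks, not one by one
          server.add(batch);        // no commit per batch; rely on autoCommit
          batch.clear();
        }
      }
      if (!batch.isEmpty()) server.add(batch);
      server.commit();              // one hard commit at the end of the run
      server.shutdown();
    }
  }
)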
Re: timeAllowed is not honored
Hi Shawn, Please check this link: http://wiki.apache.org/solr/SimpleFacetParameters#facet.method There is something mentioned in the facet.method wiki: "The default value is fc (except for BoolField which uses enum) since it tends to use less memory and is faster than the enumeration method when a field has many unique terms in the index." So can you explain how enum is faster than the default? Also, we are currently using Solr 4.2; does that support facet.method=enum? If not, which version should we pick? We are planning to move to SolrCloud with version 4.7.1, so will 14GB of RAM be sufficient, or should we increase it? With Regards Aman Tandon

On Thu, May 1, 2014 at 8:20 PM, Shawn Heisey s...@elyograg.org wrote: On 4/30/2014 5:53 PM, Aman Tandon wrote: Shawn - Yes we have some plans to move to SolrCloud, Our total index size is 40GB with 11M of Docs, Available RAM 32GB, Allowed heap space for solr is 14GB, the GC tuning parameters using in our server is -XX:+UseConcMarkSweepGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCTimeStamps. This means that you have about 18GB of RAM left over to cache a 40GB index. That's less than 50 percent. Every index is different, but this is in the ballpark of where performance problems begin. If you had 48GB of RAM, your performance (not counting possible GC problems) would likely be very good. 64GB would be ideal. Your only GC tuning is switching the collector to CMS. This won't be enough. When I had a config like this and heap of only 8GB, I was seeing GC pauses of 10 to 12 seconds. http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning One question: Do you really need 14GB of heap? One of my servers has a total of 65GB of index (54 million docs) with a 7GB heap and 64GB of RAM. Currently I don't use facets, though. When I do, they will be enum. If you switch all your facets to enum, your heap requirements may go down. Decreasing the heap size will make more memory available for index caching. Thanks, Shawn
Re: XSLT Caching Warning
Hi Chris, Looking at the source code reveals that the warning message is always printed, independent of the xsltCacheLifetimeSeconds value:

  /** singleton */
  private TransformerProvider() {
    // tell'em: currently, we only cache the last used XSLT transform, and blindly recompile it
    // once cacheLifetimeSeconds expires
    log.warn("The TransformerProvider's simplistic XSLT caching mechanism is not appropriate "
        + "for high load scenarios, unless a single XSLT transform is used "
        + "and xsltCacheLifetimeSeconds is set to a sufficiently high value.");
  }

On Thursday, May 1, 2014 11:29 PM, Christopher Gross cogr...@gmail.com wrote: I get this warning when Solr (4.7.2) starts: WARN org.apache.solr.util.xslt.TransformerProvider - The TransformerProvider's simplistic XSLT caching mechanism is not appropriate for high load scenarios, unless a single XSLT transform is used and xsltCacheLifetimeSeconds is set to a sufficiently high value. The solrconfig.xml setting is:

  <queryResponseWriter name="xslt" class="solr.XSLTResponseWriter">
    <int name="xsltCacheLifetimeSeconds">10</int>
  </queryResponseWriter>

Is there a different class that I should be using? Is there a higher number than 10 that will do the trick? Thanks! -- Chris
Re: timeAllowed is not honored
On 5/1/2014 3:03 PM, Aman Tandon wrote: Please check that link http://wiki.apache.org/solr/SimpleFacetParameters#facet.method there is something mentioned in facet.method wiki *The default value is fc (except for BoolField which uses enum) since it tends to use less memory and is faster then the enumeration method when a field has many unique terms in the index.* So can you explain how enum is faster than default. Also we are currently using the solr 4.2 does that support this facet.method=enum, if not then which version should we pick. We are planning to move to SolrCloud with the version solr 4.7.1, so does this 14 GB of RAM will be sufficient? or should we increase it? The fc method (which means fieldcache) puts all the data required to build facets on that field into the fieldcache, and that data stays there until the next commit or restart. If you are committing frequently, that memory use might be wasted. I was surprised to read that fc uses less memory. It may be very true that the amount of memory required for a single call with facet.method=enum is more than the amount of memory required in the fieldcache for facet.method=fc, but that memory can be recovered as garbage -- with the fc method, it can't be recovered. It sits there, waiting for that facet to be used again, so it can speed it up. When you commit and open a new searcher, it gets thrown away. If you use a lot of different facets, the fieldcache can become HUGE with the fc method. *If you don't do all those facets at the same time* (a very important qualifier), you can switch to enum and the total amount of resident heap memory required will be a lot less. There may be a lot of garbage to collect, but the total heap requirement at any given moment should be smaller. If you actually need to build a lot of different facets at nearly the same time, enum may not actually help. The enum method is actually a little slower than fc for a single run, but the java heap characteristics for multiple runs can cause enum to be faster in bulk. Try both and see what your results are. Thanks, Shawn
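(For reference, the facet method can be chosen per request or per field; the facet.method parameter has been around for a long time, so it is available in 4.2. The field name below is hypothetical:

  facet=true&facet.field=category&facet.method=enum
  f.category.facet.method=enum
)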
Re: Fastest way to import a large number of documents in SolrCloud
Thanks for the reply, Anshum. Please see my answers to your questions below.

* Why do you want to do a full index every day?
Not sure I understand what you mean by full index. Every day we want to import additional documents to the existing ones. Of course, we want to remove older ones as well, so the total amount remains roughly the same.

* How much data are we talking about?
The number of new documents is around 500k each day.

* What's your SolrCloud setup like?
We're currently using Solr 3.6 with 16 shards and planning to switch to SolrCloud, hence the inquiry.

* Do you already have some benchmarks which you're not happy with?
Not yet. Planning to do some tests quite soon. I was looking for some guidance before jumping in.

Also, it helps to set the commit intervals reasonably.
What do you mean by *reasonably*? Also, do you recommend using autoCommit? We are currently doing an optimize after each import (in Solr 3), in order to speed up future queries. This is proving to take very long though (several hours). Doing a commit instead of an optimize usually brings the master and slave nodes down. We reverted to calling optimize on every ingest.

On Thu, May 1, 2014 at 11:57 PM, Anshum Gupta ans...@anshumgupta.net wrote: Hi Costi, I'd recommend SolrJ; parallelize the inserts. Also, it helps to set the commit intervals reasonably. Just to get a better perspective: * Why do you want to do a full index every day? * How much data are we talking about? * What's your SolrCloud setup like? * Do you already have some benchmarks which you're not happy with? On Thu, May 1, 2014 at 1:47 PM, Costi Muraru costimur...@gmail.com wrote: Hi guys, What would you say is the fastest way to import data into SolrCloud? Our use case: each day, do a single import of a large number of documents. Should we use SolrJ/DataImportHandler/other? Or perhaps is there a bulk import feature in Solr? I came upon this promising link: http://wiki.apache.org/solr/UpdateCSV Any idea how UpdateCSV compares performance-wise with SolrJ/DataImportHandler? If SolrJ, should we split the data into chunks and start multiple clients at once? That way we could perhaps take advantage of the multiple servers in the SolrCloud configuration. Either way, after the import is finished, should we do an optimize, a commit, or neither (http://wiki.solarium-project.org/index.php/V1:Optimize_command)? Any tips and tricks to perform this process the right way are gladly appreciated. Thanks, Costi -- Anshum Gupta http://www.anshumgupta.net
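(A common autoCommit starting point in solrconfig.xml looks like this; the values are illustrative, not a recommendation for this particular workload:

  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit every 60s -->
    <openSearcher>false</openSearcher>  <!-- don't open a searcher on hard commit -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>300000</maxTime>           <!-- new searcher every 5 minutes -->
  </autoSoftCommit>
)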
Re: Solr 4.7 not showing parsedQuery / parsedquery_toString information
Shamik: I'm not sure what the cause of this is, but it definitely seems like a bug to me. I've opened SOLR-6039 and noted a workaround for folks who don't care about the new track debug info and just want the same debug info that was available before 4.7...

https://issues.apache.org/jira/browse/SOLR-6039

: Date: Thu, 24 Apr 2014 11:50:37 -0700
: From: Shamik Bandopadhyay sham...@gmail.com
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Solr 4.7 not showing parsedQuery / parsedquery_toString information
:
: Hi,
:
: Not sure if this has been a feature change, but I've observed that
: parsedquery and parsedquery_toString information are not displayed if
: the search doesn't return any result. Here's what is being returned.
:
: <lst name="debug">
:   <lst name="track">
:     <str name="rid">54.215.121.xx-collection1-1398xx4900921-48</str>
:     <lst name="EXECUTE_QUERY">
:       <lst name="http://54.215.121.xx:8983/solr/collection1/|http://54.215.117.xxx:8983/solr/collection1/">
:         <str name="QTime">6</str>
:         <str name="ElapsedTime">11</str>
:         <str name="RequestPurpose">GET_TOP_IDS,GET_FACETS</str>
:         <str name="NumFound">0</str>
:         <str name="Response">{responseHeader={status=0,QTime=6,params={facet=on,tie=0.01,f.text.hl.fragsize=250,q.alt=*:*,facet.method=enum,f.ADSKAudience.facet.mincount=1,v.layout=layout,NOW=1398xx4900920,bq=Source2:sfdcarticles^3 Source2:downloads^3 Source2:CloudHelp^2.5 Source2:blog^1 Source2:discussion^2 Source2:documentation^1.5 Source2:youtube^1.5 Source2:education-curriculum^2 Source2:mne-help^1.5,fl=id,score,f.ADSKDocumentType.facet.limit=-1,bf=recip(ms(NOW/DAY,PublishDate),3.16e-11,1,1)^2.0,facet.field=[ADSKProductLine, ADSKContentGroup, ADSKReleaseYear, ADSKHelpTopic, ADSKDocumentType, ADSKAudience],v.template=browse,fq=Source2:(mne-help OR CloudHelp OR documentation OR videos OR youtube OR discussion OR blog OR sfdcarticles OR downloads) AND -workflowparentid:[* TO *] AND -ADSKAccessMode:internal AND -ADSKAccessMode:beta,fsv=true,f.ADSKReleaseYear.facet.mincount=1,spellcheck.extendedResults=false,f.ADSKProductLine.facet.mincount=1,hl.fl=text title,wt=javabin,spellcheck.collate=true,requestPurpose=GET_TOP_IDS,GET_FACETS,rows=1,defType=edismax,f.ADSKReleaseYear.facet.limit=-1,facet.sort=index,start=0,q.op=AND,f.ADSKContentGroup.facet.mincount=1,spellcheck=true,f.ADSKContentGroup.facet.limit=-1,distrib=false,debug=track,shards.tolerant=true,hl=false,version=2,v.channel=adskhelpportal,title=Project Sunshine - HelpPortal Bundle,shard.url=http://54.215.121.xx:8983/solr/collection1/|http://54.215.117.xxx:8983/solr/collection1/,df=text,debugQuery=false,v.contentType=text/html;charset=UTF-8,spellcheck.count=5,f.text.hl.alternateField=ShortDesc,f.ADSKHelpTopic.facet.mincount=1,qf=text^1.5 title^2 IndexTerm^.9 keywords^1.2 ADSKCommandSrch^2 ADSKLikes^2,f.ADSKHelpTopic.facet.limit=-1,spellcheck.onlyMorePopular=false,rid=54.215.121.xx-collection1-1398xx4900921-48,q=How can I obtain local offline Help,f.ADSKDocumentType.facet.mincount=1,f.ADSKAudience.facet.limit=-1,isShard=true,f.ADSKProductLine.facet.limit=-1}},response={numFound=0,start=0,maxScore=0.0,docs=[]},sort_values={},facet_counts={facet_queries={},facet_fields={ADSKProductLine={},ADSKContentGroup={},ADSKReleaseYear={},ADSKHelpTopic={},ADSKDocumentType={},ADSKAudience={}},facet_dates={},facet_ranges={}},debug={}}</str>
:       </lst>
:       <lst name="http://54.215.122.xxx:8983/solr/collection1/|http://50.18.135.xxx:8983/solr/collection1/">
:         <str name="QTime">7</str>
:         <str name="ElapsedTime">15</str>
:         <str name="RequestPurpose">GET_TOP_IDS,GET_FACETS</str>
:         <str name="NumFound">0</str>
:         <str name="Response">{responseHeader={status=0,QTime=7,params={facet=on,tie=0.01,f.text.hl.fragsize=250,q.alt=*:*,facet.method=enum,f.ADSKAudience.facet.mincount=1,v.layout=layout,NOW=1398xx4900920,bq=Source2:sfdcarticles^3 Source2:downloads^3 Source2:CloudHelp^2.5 Source2:blog^1 Source2:discussion^2 Source2:documentation^1.5 Source2:youtube^1.5 Source2:education-curriculum^2 Source2:mne-help^1.5,fl=id,score,f.ADSKDocumentType.facet.limit=-1,bf=recip(ms(NOW/DAY,PublishDate),3.16e-11,1,1)^2.0,facet.field=[ADSKProductLine, ADSKContentGroup, ADSKReleaseYear, ADSKHelpTopic, ADSKDocumentType, ADSKAudience],v.template=browse,fq=Source2:(mne-help OR CloudHelp OR documentation OR videos OR youtube OR discussion OR blog OR sfdcarticles OR downloads) AND -workflowparentid:[* TO *] AND -ADSKAccessMode:internal AND -ADSKAccessMode:beta,fsv=true,f.ADSKReleaseYear.facet.mincount=1,spellcheck.extendedResults=false,f.ADSKProductLine.facet.mincount=1,hl.fl=text
Re: Roll up query with original facets
https://people.apache.org/~hossman/#threadhijack

Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult.

: Subject: Roll up query with original facets
: From: Darin Amos dari...@gmail.com
: In-Reply-To: 1398953952.39792.yahoomail...@web124702.mail.ne1.yahoo.com
: Message-Id: 5902ae5b-7545-45d4-8662-a9700e1ec...@gmail.com
: References: d6259d1ccf526540b1cb447e5f3bc39b8e344f5...@gechem8mail.asg.com 1398953952.39792.yahoomail...@web124702.mail.ne1.yahoo.com

-Hoss
http://www.lucidworks.com/
Re: Roll up query with original facets
My apologies!! On May 1, 2014 6:56 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

https://people.apache.org/~hossman/#threadhijack

Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult.

: Subject: Roll up query with original facets
: From: Darin Amos dari...@gmail.com
: In-Reply-To: 1398953952.39792.yahoomail...@web124702.mail.ne1.yahoo.com
: Message-Id: 5902ae5b-7545-45d4-8662-a9700e1ec...@gmail.com
: References: d6259d1ccf526540b1cb447e5f3bc39b8e344f5...@gechem8mail.asg.com 1398953952.39792.yahoomail...@web124702.mail.ne1.yahoo.com

-Hoss
http://www.lucidworks.com/
Re: XSLT Caching Warning
The message implies that there is a better way of having XSLT transformations. Is that the case, or is there just this perpetual warning for normal operations? -- Chris

On Thu, May 1, 2014 at 5:08 PM, Ahmet Arslan iori...@yahoo.com wrote: Hi Chris, Looking at the source code reveals that the warning message is always printed, independent of the xsltCacheLifetimeSeconds value:

  /** singleton */
  private TransformerProvider() {
    // tell'em: currently, we only cache the last used XSLT transform, and blindly recompile it
    // once cacheLifetimeSeconds expires
    log.warn("The TransformerProvider's simplistic XSLT caching mechanism is not appropriate "
        + "for high load scenarios, unless a single XSLT transform is used "
        + "and xsltCacheLifetimeSeconds is set to a sufficiently high value.");
  }

On Thursday, May 1, 2014 11:29 PM, Christopher Gross cogr...@gmail.com wrote: I get this warning when Solr (4.7.2) starts: WARN org.apache.solr.util.xslt.TransformerProvider - The TransformerProvider's simplistic XSLT caching mechanism is not appropriate for high load scenarios, unless a single XSLT transform is used and xsltCacheLifetimeSeconds is set to a sufficiently high value. The solrconfig.xml setting is:

  <queryResponseWriter name="xslt" class="solr.XSLTResponseWriter">
    <int name="xsltCacheLifetimeSeconds">10</int>
  </queryResponseWriter>

Is there a different class that I should be using? Is there a higher number than 10 that will do the trick? Thanks! -- Chris
Re: XSLT Caching Warning
On 5/1/2014 7:30 PM, Christopher Gross wrote: The message implies that there is a better way of having XSLT transformations. Is that the case, or is there just this perpetual warning for normal operations? When I was using XSLT, I got a warning for every core, even though I had a cached lifetime that would prevent problems. I don't remember what that lifetime was any more, probably at least five minutes, but it might have been longer. I also met the other criteria -- there was only one XSLT defined. Perhaps the warning needs to be suppressed if the lifetime is above a certain value and there is only one transform defined? I would expect that even 60 seconds would be a long enough lifetime to prevent major issues in high load scenarios ... but we could bikeshed that number forever. Thanks, Shawn
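(For example, raising the lifetime into the five-minute range mentioned above is just a config change; the value is illustrative:

  <queryResponseWriter name="xslt" class="solr.XSLTResponseWriter">
    <int name="xsltCacheLifetimeSeconds">300</int>
  </queryResponseWriter>
)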
Re: XSLT Caching Warning
I think the key message here is: "simplistic XSLT caching mechanism is not appropriate for high load scenarios". As in, maybe this is not really a production-level component. One exception is given, and it is not just the lifetime, it's also a single transform. Are you satisfying both of those conditions? If so, it's probably ok to just ignore the warning. Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Fri, May 2, 2014 at 3:28 AM, Christopher Gross cogr...@gmail.com wrote: I get this warning when Solr (4.7.2) starts: WARN org.apache.solr.util.xslt.TransformerProvider - The TransformerProvider's simplistic XSLT caching mechanism is not appropriate for high load scenarios, unless a single XSLT transform is used and xsltCacheLifetimeSeconds is set to a sufficiently high value. The solrconfig.xml setting is:

  <queryResponseWriter name="xslt" class="solr.XSLTResponseWriter">
    <int name="xsltCacheLifetimeSeconds">10</int>
  </queryResponseWriter>

Is there a different class that I should be using? Is there a higher number than 10 that will do the trick? Thanks! -- Chris