Performing DIH on a predefined list of IDs
Relatively frequently (about once a month) we need to reindex the data by using DIH to copy the data from one index to another. Because we have a large index, it can take from 12 to 24 hours to complete. At the same time the old index is being queried by users. Sometimes DIH is interrupted in the middle because of some unexpected exception caused by OutOfMemory or something else (many times it has failed when more than 90% was completed). More than this, almost every time some items are missing in the new index, and it is very complicated to find them. At this stage I can't be sure exactly which documents were missed, and I have to do it all again and wait for many hours. At the same time the old index constantly receives new items.

I want to suggest the following way to solve the problem:
• Get a list of all item IDs (call the Lucene API, like CLUE does, for example).
• Start DIH, which will iterate over those IDs and each time make a query for n items. Of course the original DIH class would have to be changed to support this.

This would give the following advantages:
1. I will know exactly which items failed.
2. I can restart the process from any point, and in case of a DIH failure restart it from the point of failure.

So the main difference is that DIH currently runs on a *:* query, and I suggest running it on a list of IDs. For example, if I have 1000 docs and want this new DIH to take 100 docs each time, it will do 10 queries, each one with 100 IDs (like id:(1 2 3 ... 100), then id:(101 102 ... 200), etc.).

The question is: what do you think about it? Or could all of this be done another way, and am I trying to reinvent the wheel?
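(For comparison, the same restartable batching can be sketched outside DIH with a small SolrJ client. This is only a sketch: the hostnames, collection names, and the plain "id" field are assumptions, and it presumes all fields are stored so documents can be copied as-is.)

import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class IdBatchReindexer {

    // Copies documents from a source index to a target index, one ID batch at
    // a time, so a failed batch can be logged and the run resumed from there.
    public static void reindex(List<String> allIds, int batchSize) throws Exception {
        HttpSolrServer source = new HttpSolrServer("http://oldhost:8983/solr/oldcollection");
        HttpSolrServer target = new HttpSolrServer("http://newhost:8983/solr/newcollection");

        for (int from = 0; from < allIds.size(); from += batchSize) {
            List<String> batch = allIds.subList(from, Math.min(from + batchSize, allIds.size()));
            // Build an id:(1 2 3 ... 100) style query for this batch.
            SolrQuery query = new SolrQuery("id:(" + String.join(" ", batch) + ")");
            query.setRows(batchSize);
            try {
                for (SolrDocument found : source.query(query).getResults()) {
                    SolrInputDocument in = ClientUtils.toSolrInputDocument(found);
                    in.removeField("_version_"); // avoid optimistic-locking clashes on the target
                    target.add(in);
                }
            } catch (Exception e) {
                // Record the failed offset; a later run can resume exactly here.
                System.err.println("batch at offset " + from + " failed: " + e);
            }
        }
        target.commit();
        source.shutdown();
        target.shutdown();
    }
}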
Remove all parent docs having specific child doc
Hi, I want to remove all the parent docs having a specific child doc. E.g.:

<doc>Employee1
  <doc><field>Dept1</field></doc>
  <doc><field>Dept2</field></doc>
</doc>
<doc>Employee2
  <doc><field>Dept2</field></doc>
  <doc><field>Dept3</field></doc>
</doc>

Query: remove all employees which lie in Dept1. The response should be Employee2 *only*.

Problem: the *NOT operator is not supported* in the block join query parser.

q = {!parent which=employee:*}*:* -department:Dept1 - returns Employee1, Employee2
q = -{!parent which=employee:*}department:Dept1 - does not work with the block join query parser.

Please suggest how to filter those employees which lie in Dept1 using block join or any other query parser?

Thanks, Lokesh
Use multiple collections having different configurations
Hello, I have a scenario where I want to create/use 2 collections in the same Solr, named collection1 and collection2. I want to use distributed servers. Each collection has multiple shards, and each collection contains different configurations (solrconfig.xml and schema.xml). How can I do that? And if I then want to re-configure any one collection, how do I do that?

As I know, if we use a single collection with multiple shards, then we need to use this upconfig command:

example/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir example/solr/collection1/conf -confname default

and restart all the nodes. For 2 collections in the same Solr, how can I re-configure?
Advantage of using Java programming with Solr over Solr API
Hi, What are the advantages of Java programming with Solr over the Solr API?
Re: ignoring bad documents during index
I want to experiment with this issue; where exactly should I take a look? I want to try to fix this missing aggregation. What class is responsible for that?
RE: Committed before 500
Hi Shawn, I do not want to increase the timeout as these errors are very few, and the current timeout of 90 seconds is good enough. Is there a way to find out why Solr is timing out (at times)? Could it be that Solr is busy doing other activities like re-indexing, commits etc.? Additionally, I found that some non-leader nodes move to "recovering" or "recovery failed" after these timeout errors. I am just wondering if these are related to a performance issue and whether Solr commits need to be controlled.

Regards, Naresh Jakher

From: Shawn Heisey-2 [via Lucene] [mailto:ml-node+s472066n4187382...@n3.nabble.com] Sent: Thursday, February 19, 2015 8:12 PM To: Jakher, Naresh Subject: Re: Committed before 500

On 2/19/2015 6:30 AM, NareshJakher wrote:

I am using Solr cloud with 3 nodes; at times the following error is observed in logs during delete operations. Is it a performance issue? What can be done to resolve this issue? Committed before 500 {msg=Software caused connection abort: socket write error,trace=org.eclipse.jetty.io.EofException I did search on old topics but couldn't find anything concrete related to Solr cloud. Would appreciate any help on the issue as I am relatively new to Solr.

A jetty EofException indicates that one specific thing is happening: the TCP connection from the client was severed before Solr responded to the request. Usually this happens because the client has been configured with an absolute timeout or an inactivity timeout, and the timeout was reached. Configuring timeouts so that you can be sure clients don't get stuck is a reasonable idea, but any configured timeouts should be VERY long. You'd want to use a value like five minutes, rather than 10, 30, or 60 seconds. The timeouts MIGHT be in the HttpShardHandler config that Solr and SolrCloud use for distributed searches, and they also might be in operating-system-level config.

https://wiki.apache.org/solr/SolrConfigXml?highlight=%28HttpShardHandler%29#Configuration_of_Shard_Handlers_for_Distributed_searches

Thanks, Shawn
Re: ignoring bad documents during index
On 20 February 2015 at 15:31, SolrUser1543 osta...@gmail.com wrote: I want to experiment with this issue; where exactly should I take a look? I want to try to fix this missing aggregation. What class is responsible for that?

Are you indexing through SolrJ, DIH, or what?

Regards,
Re: Advantage of using Java programming with Solr over Solr API
On 2/20/2015 6:38 AM, Nitin Solanki wrote: I mean embedded Solr. On Fri, Feb 20, 2015 at 7:05 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: This question makes no sense. Do you mean embedded Solr vs. standalone? Regards, Alex On 20 Feb 2015 3:30 am, Nitin Solanki nitinml...@gmail.com wrote: Hi, What are the advantages of Java programming with Solr over the Solr API?

Standalone Solr offers the admin UI and the ability to do some of your testing with hand-typed URLs in a browser. The embedded server is completely unreachable from anywhere but the Java program that embeds it, and has no options for redundancy and high availability.

The Java client implementations offer objects and methods that are very easy for a Java developer to understand and write with a very small amount of code, and do not require any user code for building URLs or communicating over HTTP.

If you were thinking about using EmbeddedSolrServer, you can use one of the other SolrServer (SolrClient in 5.0) implementations instead with a standalone Solr installation. The resulting client code will be nearly identical to what you'd use with EmbeddedSolrServer, because EmbeddedSolrServer is simply another implementation of the same abstract class and interfaces that are used by objects like HttpSolrServer and CloudSolrServer ("Server" is replaced by "Client" in 5.0).

Thanks, Shawn
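(To illustrate Shawn's point, a minimal sketch; the URL and core name are placeholders. The client logic is written against the abstract SolrServer type, so an EmbeddedSolrServer could be passed in without changing the method.)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchClient {

    // Works identically whether "solr" is an HttpSolrServer, a CloudSolrServer,
    // or an EmbeddedSolrServer, because all of them extend SolrServer.
    static long countMatches(SolrServer solr, String queryString) throws Exception {
        QueryResponse response = solr.query(new SolrQuery(queryString));
        return response.getResults().getNumFound();
    }

    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        System.out.println(countMatches(solr, "*:*"));
        solr.shutdown();
    }
}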
Re: Advantage of using Java programming with Solr over Solr API
This question makes no sense. Do you mean embedded Solr vs. standalone?

Regards, Alex

On 20 Feb 2015 3:30 am, Nitin Solanki nitinml...@gmail.com wrote: Hi, What are the advantages of Java programming with Solr over the Solr API?
Re: Advantage of using Java programming with Solr over Solr API
I mean embedded Solr.

On Fri, Feb 20, 2015 at 7:05 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: This question makes no sense. Do you mean embedded Solr vs. standalone? Regards, Alex On 20 Feb 2015 3:30 am, Nitin Solanki nitinml...@gmail.com wrote: Hi, What are the advantages of Java programming with Solr over the Solr API?
Re: Collations are not working fine.
How do I get only the best collations (those with the most hits), and how do I sort them?

On Wed, Feb 18, 2015 at 3:53 AM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote:

Hi Nitin, I was trying many different options for a couple of different queries. In fact, I have collations working ok now with the Suggester and WFSTLookup. The problem may have been due to a different dictionary and/or lookup implementation and the specific options I was sending.

In general, we're using spellcheck for search suggestions. The Suggester component (vs. the Suggester spellcheck implementation) doesn't handle all of our cases, but we can get things working using the spellcheck interface. What gives us particular trouble are the cases where a term may be valid by itself, but also be the start of longer words. The specific terms are acronyms specific to our business, but I'll attempt to show generic examples. E.g. a partial term like "fo" can expand to fox, fog, etc., and a full term like "brown" can also expand to something like brownstone. And, yes, the collation "brownstone fox" is nonsense. But assume, for the sake of argument, it appears in our documents somewhere.

For a multiple-term query with a spelling error (or a partially typed term), "brown fo", we get collations in order of hits, descending, like: brown fox, brown fog, brownstone fox. So far, so good.

For a single-term query, "brown", we get a single suggestion, "brownstone", and no collations. So we don't know to keep the term "brown"! At this point, we need spellcheck.extendedResults=true and to look at the origFreq value in the suggested corrections. Unfortunately, the Suggester (spellcheck dictionary) does not populate the original frequency information, and without this information, the SpellCheckComponent cannot format the extended results. However, with a simple change to Suggester.java, it was easy to get the needed frequency information and use it to make a sound decision to keep or drop the input term. But I'd be much obliged if there is a better way to go about it.

Configs below.

Thanks, Charlie

<!-- SpellCheck component -->
<searchComponent class="solr.SpellCheckComponent" name="suggestSC">
  <lst name="spellchecker">
    <str name="name">suggestDictionary</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.WFSTLookupFactory</str>
    <str name="field">text_all</str>
    <float name="threshold">0.0001</float>
    <str name="exactMatchFirst">true</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<!-- Request Handler -->
<requestHandler name="/tcSuggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="title">Search Suggestions (spellcheck)</str>
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="rows">0</str>
    <str name="defType">edismax</str>
    <str name="df">text_all</str>
    <str name="fl">id,name,ticker,entityType,transactionType,accountType</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.dictionary">suggestDictionary</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">5</str>
  </lst>
  <arr name="last-components">
    <str>suggestSC</str>
  </arr>
</requestHandler>

-----Original Message-----
From: Nitin Solanki [mailto:nitinml...@gmail.com] Sent: Tuesday, February 17, 2015 3:17 AM To: solr-user@lucene.apache.org Subject: Re: Collations are not working fine.

Hi Charles, will you please send the configuration which you tried? It will help to solve my problem.
Have you sorted the collations on hits or frequencies of suggestions? If you did, then please assist me.

On Mon, Feb 16, 2015 at 7:59 PM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote:

I have been working with collations the last couple of days, and I kept adding the collation-related parameters until it started working for me. It seems I needed <str name="spellcheck.collateMaxCollectDocs">50</str>. But I am using the Suggester with the WFSTLookupFactory. Also, I needed to patch the suggester to get frequency information in the spellcheck response.

-----Original Message-----
From: Rajesh Hazari [mailto:rajeshhaz...@gmail.com] Sent: Friday, February 13, 2015 3:48 PM To: solr-user@lucene.apache.org Subject: Re: Collations are not working fine.

Hi Nitin, can you try the below config? We have this config and it seems to be working for us.

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_general</str>
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">textSpell</str>
    <str name="combineWords">true</str>
    <str name="breakWords">false</str>
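(Back to Nitin's question at the top of the thread: once collations come back, they can be sorted by hits on the client. A rough SolrJ sketch follows; the host and the /tcSuggest handler come from Charles' config above, and note that hits are only gathered when spellcheck.maxCollationTries is greater than 0, with the expanded collation format enabled via spellcheck.collateExtendedResults=true.)

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.SpellCheckResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse.Collation;

public class BestCollations {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery query = new SolrQuery("brown fo");
        query.setRequestHandler("/tcSuggest");

        SpellCheckResponse spell = solr.query(query).getSpellCheckResponse();
        if (spell != null && spell.getCollatedResults() != null) {
            List<Collation> collations = new ArrayList<Collation>(spell.getCollatedResults());
            // Sort by hits, descending, so the best collations come first.
            Collections.sort(collations, new Comparator<Collation>() {
                public int compare(Collation a, Collation b) {
                    return Long.compare(b.getNumberOfHits(), a.getNumberOfHits());
                }
            });
            for (Collation c : collations) {
                System.out.println(c.getCollationQueryString() + " -> " + c.getNumberOfHits() + " hits");
            }
        }
        solr.shutdown();
    }
}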
Re: Use multiple collections having different configurations
On 2/20/2015 4:06 AM, Nitin Solanki wrote: I have a scenario where I want to create/use 2 collections in the same Solr, named collection1 and collection2. I want to use distributed servers. Each collection has multiple shards, and each collection contains different configurations (solrconfig.xml and schema.xml). How can I do that? And if I then want to re-configure any one collection, how do I do that? As I know, if we use a single collection with multiple shards, then we need to use this upconfig command: example/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir example/solr/collection1/conf -confname default and restart all the nodes. For 2 collections in the same Solr, how can I re-configure?

First, upload your two different configurations with zkcli upconfig using two different names. Create your collections with the Collections API, and tell each one to use a different collection.configName. If the collection already exists, use the zkcli linkconfig command, and reload the collection.

If you need to change a config, edit the config on disk and re-do the zkcli upconfig, then reload the collection with the Collections API. Alternately, you could upload a whole new config and then link it to the existing collection.

The Collections API is not yet exposed in the admin interface; you will need to do those calls yourself. If you're doing this with SolrJ, there are some objects inside CollectionAdminRequest that let you do all the API actions.

Thanks, Shawn
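(If you end up scripting this with SolrJ 5.0, a sketch of those calls might look like the following. The ZooKeeper address, collection names, and config names are placeholders, and this assumes the setter-style Create/Reload request objects in 5.0's CollectionAdminRequest.)

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class TwoCollections {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("localhost:9983");

        // Each collection points at a different config set uploaded with zkcli upconfig.
        CollectionAdminRequest.Create create1 = new CollectionAdminRequest.Create();
        create1.setCollectionName("collection1");
        create1.setNumShards(2);
        create1.setConfigName("conf1");
        client.request(create1);

        CollectionAdminRequest.Create create2 = new CollectionAdminRequest.Create();
        create2.setCollectionName("collection2");
        create2.setNumShards(3);
        create2.setConfigName("conf2");
        client.request(create2);

        // After re-uploading an edited config, reload the collection that uses it.
        CollectionAdminRequest.Reload reload = new CollectionAdminRequest.Reload();
        reload.setCollectionName("collection2");
        client.request(reload);

        client.close();
    }
}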
Re: Performing DIH on a predefined list of IDs
My index has about 110 million documents, split over several shards. Maybe the number is not so big, but each document is relatively large.

The reason to perform the reindex is something like adding new fields, or adding an update processor which can extract something from one field and put it in another, etc. Each time I need to reindex the data, I create a new collection and start to import the data from the old one. That gives the update processors the opportunity to act. The DIH runs with a *:* query and takes some number of items each time. In case of an exception, the process stops in the middle and I can't restart it from that point.

That's the reason I want to run on a predefined list of IDs. In this case I will be able to restart from any point and to know about failed IDs.
Re: Performing DIH on a predefined list of IDs
On 2/20/2015 3:46 PM, Shawn Heisey wrote: If the URL parameter is idlist then you can use ${dih.request.idlist} in your SELECT statement. I realized after I sent this that you are not using a database ... the list would simply go in the query you send to the other server. I don't know whether the request that the SolrEntityProcessor sends is a GET or a POST, so for a really large list of IDs, you might need to edit the container config on both servers. Thanks, Shawn
Re: Performing DIH on a predefined list of IDs
On 2/20/2015 2:57 PM, SolrUser1543 wrote: That's the reason I want to run on a predefined list of IDs. In this case I will be able to restart from any point and to know about failed IDs.

You can include information in a URL parameter and then use that URL parameter inside your dih config. If the URL parameter is idlist, then you can use ${dih.request.idlist} in your SELECT statement.

Be aware that most servlet containers have a default header length limit of about 8192 characters, affecting the length of the URL that can be sent successfully. If the list of IDs is going to get huge, you will either need to switch from a GET to a POST request where the parameter is in the post body, or increase the header length limit in the servlet container that is running Solr.

Thanks, Shawn
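(For illustration, triggering such an import from SolrJ could look roughly like this. The /dataimport path and idlist parameter mirror the example above; clean=false just avoids wiping the index on every batch, and POST keeps the long ID list out of the URL, per the header-length concern.)

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class DihIdListTrigger {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("command", "full-import");
        params.set("clean", "false");
        // Available inside the DIH configuration as ${dih.request.idlist}.
        params.set("idlist", "101 102 103 104 105");

        // POST puts the parameters in the request body instead of the URL.
        QueryRequest request = new QueryRequest(params, SolrRequest.METHOD.POST);
        request.setPath("/dataimport");
        System.out.println(solr.request(request));
        solr.shutdown();
    }
}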
Re: Clarification of locktype=single and implications of use
Thanks Hoss,

Protection from misconfiguration and/or from starting separate Solr instances pointing to the same index dir I can understand. The current documentation on the wiki and in the ref guide (along with just enough understanding of Solr/Lucene indexing to be dangerous) left me wondering if maybe somehow a correctly configured Solr might have multiple processes writing to the same file. I'm wondering if your explanation above might be added to the documentation.

Tom

On Fri, Feb 20, 2015 at 1:25 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: We are using Solr. We would not configure two different Solr instances to
: write to the same index. So why would a normal Solr set-up possibly end
: up having more than one process writing to the same index?

The risk here is that if you configure lockType=single, and then have some unintended user error such that two distinct java processes both attempt to use the same index dir, the lockType will not protect you in that situation. For example: you normally run Solr on port 8983, but someone accidentally starts a second instance of Solr on port 7574 using the exact same configs with the exact same index dir -- lockType=single won't help you spot this error. lockType=native will (assuming your FileSystem can handle it).

lockType=single should protect you, however, if, for example, multiple SolrCores within the same Solr java process attempted to refer to the same index dir because you accidentally put an absolute path in a solrconfig.xml that gets shared by multiple cores.

-Hoss
http://www.lucidworks.com/
[ANNOUNCE] Apache Solr 5.0.0 and Reference Guide for Solr 5.0 released
20 February 2015, Apache Solr™ 5.0.0 and Reference Guide for Solr 5.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 5.0.0.

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

Solr 5.0 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the CHANGES.txt file included with the release for a full list of details.

Solr 5.0 Release Highlights:

* Usability improvements that include improved bin scripts and new and restructured examples.
* Scripts to support installing and running Solr as a service on Linux.
* Distributed IDF is now supported and can be enabled via the config. Currently, there are four supported implementations for the same:
  * LocalStatsCache: Local document stats.
  * ExactStatsCache: One-time-use aggregation.
  * ExactSharedStatsCache: Stats shared across requests.
  * LRUStatsCache: Stats shared in an LRU cache across requests.
* Solr will no longer ship a war file and instead be a downloadable application.
* SolrJ now has first-class support for the Collections API.
* Implicit registration of replication, get, and admin handlers.
* Config API that supports paramsets for easily configuring Solr parameters and configuring fields. This API also supports managing of pre-existing request handlers and editing common solrconfig.xml via overlay.
* API for managing blobs allows uploading request handler jars and registering them via the config API.
* BALANCESHARDUNIQUE Collection API that allows for even distribution of custom replica properties.
* There's now an option to not shuffle the nodeSet provided during collection creation.
* Option to configure bandwidth usage by the Replication handler to prevent it from using up all the bandwidth.
* Splitting of clusterstate to per-collection enables scalability improvement in SolrCloud. This is also the default format for new Collections that would be created going forward.
* timeAllowed is now used to prematurely terminate requests during query expansion and SolrClient request retry.
* pivot.facet results can now include nested stats.field results constrained by those pivots.
* stats.field can be used to generate stats over the results of arbitrary numeric functions. It also allows for requesting statistics for pivot facets using tags.
* A new DateRangeField has been added for indexing date ranges, especially multi-valued ones.
* Spatial fields that used to require units=degrees now take distanceUnits=degrees/kilometers/miles instead.
* MoreLikeThis query parser allows requesting documents similar to an existing document and also works in SolrCloud mode.
* Logging improvements:
  * Transaction log replay status is now logged.
  * Optional logging of slow requests.

Solr 5.0 also includes many other new features as well as numerous optimizations and bugfixes of the corresponding Apache Lucene release.

Detailed change log: http://lucene.apache.org/solr/5_0_0/changes/Changes.html

Also available is the *Solr Reference Guide for Solr 5.0*. This 535-page PDF serves as the definitive user's manual for Solr 5.0.
It can be downloaded from the Apache mirror network: https://s.apache.org/Solr-Ref-Guide-PDF Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. -- Anshum Gupta http://about.me/anshumgupta
Re: Performing DIH on a predefined list of IDs
It's a little bit hard to get the overall context, e.g. why you live with OOME as usual, what the reasoning is for pulling from one index to another, and what's added during this process.

Make sure that you are aware of http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor which queries another Solr, and http://wiki.apache.org/solr/DataImportHandler#LogTransformer which you can use to log recently imported ids, to be able to restart indexing from that point.

You can drop me more details in your native language if you wish.

On Fri, Feb 20, 2015 at 1:32 PM, SolrUser1543 osta...@gmail.com wrote: Relatively frequently (about once a month) we need to reindex the data by using DIH to copy the data from one index to another. Because we have a large index, it can take from 12 to 24 hours to complete. At the same time the old index is being queried by users. Sometimes DIH is interrupted in the middle because of some unexpected exception caused by OutOfMemory or something else (many times it has failed when more than 90% was completed). More than this, almost every time some items are missing in the new index, and it is very complicated to find them. At this stage I can't be sure exactly which documents were missed, and I have to do it all again and wait for many hours. At the same time the old index constantly receives new items. I want to suggest the following way to solve the problem: • Get a list of all item IDs (call the Lucene API, like CLUE does, for example). • Start DIH, which will iterate over those IDs and each time make a query for n items. Of course the original DIH class would have to be changed to support this. This would give the following advantages: 1. I will know exactly which items failed. 2. I can restart the process from any point, and in case of a DIH failure restart it from the point of failure. So the main difference is that DIH currently runs on a *:* query, and I suggest running it on a list of IDs. For example, if I have 1000 docs and want this new DIH to take 100 docs each time, it will do 10 queries, each one with 100 IDs (like id:(1 2 3 ... 100), then id:(101 102 ... 200), etc.). The question is: what do you think about it? Or could all of this be done another way, and am I trying to reinvent the wheel?

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Performing DIH on a predefined list of IDs
Personally, I much prefer indexing from an independent SolrJ client to using DIH when I have to take explicit control of errors etc. Here's an example: https://lucidworks.com/blog/indexing-with-solrj/

In your example, you seem to be assuming that the Lucene IDs (and here I'm assuming you're not talking about the internal Lucene ID) correspond to some kind of primary key in your database table. But the correspondence isn't necessarily straightforward; how would it handle composite keys?

I'll leave actual comments on DIH's internals to people who, you know, actually understand the code ;)...

Erick

On Fri, Feb 20, 2015 at 2:32 AM, SolrUser1543 osta...@gmail.com wrote: Relatively frequently (about once a month) we need to reindex the data by using DIH to copy the data from one index to another. Because we have a large index, it can take from 12 to 24 hours to complete. At the same time the old index is being queried by users. Sometimes DIH is interrupted in the middle because of some unexpected exception caused by OutOfMemory or something else (many times it has failed when more than 90% was completed). More than this, almost every time some items are missing in the new index, and it is very complicated to find them. At this stage I can't be sure exactly which documents were missed, and I have to do it all again and wait for many hours. At the same time the old index constantly receives new items. I want to suggest the following way to solve the problem: • Get a list of all item IDs (call the Lucene API, like CLUE does, for example). • Start DIH, which will iterate over those IDs and each time make a query for n items. Of course the original DIH class would have to be changed to support this. This would give the following advantages: 1. I will know exactly which items failed. 2. I can restart the process from any point, and in case of a DIH failure restart it from the point of failure. So the main difference is that DIH currently runs on a *:* query, and I suggest running it on a list of IDs. For example, if I have 1000 docs and want this new DIH to take 100 docs each time, it will do 10 queries, each one with 100 IDs (like id:(1 2 3 ... 100), then id:(101 102 ... 200), etc.). The question is: what do you think about it? Or could all of this be done another way, and am I trying to reinvent the wheel?
Re: Strange search behaviour when upgrading to 4.10.3
Hi Shawn,

Also, the tokenizer we use is very similar to the following:
ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalTokenizer.java
ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalLexer.jflex

From the looks of it, the text is being indexed as a single token and not broken across whitespace.

Thanks, Rishi.

-----Original Message-----
From: Shawn Heisey apa...@elyograg.org To: solr-user solr-user@lucene.apache.org Sent: Fri, Feb 20, 2015 11:52 am Subject: Re: Strange search behaviour when upgrading to 4.10.3

The links you provided lead to zero-byte files when I try them, so I could not look deeper.

Have you recompiled your custom analysis components against the newer versions of the Solr/Lucene libraries? Anytime you're dealing with custom components, you cannot assume that a component compiled to work with one version of Solr will work with another version. The internal API does change, and there is less emphasis on avoiding API breaks in minor Solr releases than there is with Lucene, because the vast majority of Solr users are not writing their own code that uses the Solr API. Recompiling against the newer libraries may cause compiler errors that reveal places in your code that require changes.

Thanks, Shawn
Re: rankquery usage bug?
Ryan, This looks like a good jira ticket to me. Joel Bernstein Search Engineer at Heliosearch On Fri, Feb 20, 2015 at 6:40 PM, Ryan Josal rjo...@gmail.com wrote: Hey guys, I put a rq in defaults but I can't figure out how to override it with no rankquery. Looks like one option might be checking for empty string before trying to use it in QueryComponent? I can work around it in the prep method of an earlier searchcomponent for now. Ryan
Re: Use multiple collections having different configurations
Thanks Shawn.

On Fri, Feb 20, 2015 at 7:53 PM, Shawn Heisey apa...@elyograg.org wrote:

First, upload your two different configurations with zkcli upconfig using two different names. Create your collections with the Collections API, and tell each one to use a different collection.configName. If the collection already exists, use the zkcli linkconfig command, and reload the collection. If you need to change a config, edit the config on disk and re-do the zkcli upconfig, then reload the collection with the Collections API. Alternately, you could upload a whole new config and then link it to the existing collection. The Collections API is not yet exposed in the admin interface; you will need to do those calls yourself. If you're doing this with SolrJ, there are some objects inside CollectionAdminRequest that let you do all the API actions.

Thanks, Shawn
Re: Strange search behaviour when upgrading to 4.10.3
On 2/20/2015 4:24 PM, Rishi Easwaran wrote: Also, the tokenizer we use is very similar to the following. ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalTokenizer.java ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalLexer.jflex From the looks of it the text is being indexed as a single token and not broken across whitespace. I can't claim to know how analyzer code works. I did manage to see the code, but it doesn't mean much to me. I would suggest using the analysis tab in the Solr admin interface. On that page, select the field or fieldType, set the verbose flag and type the actual field contents into the index side of the page. When you click the Analyze Values button, it will show you what Solr does with the input at index time. Do you still have access to any machines (dev or otherwise) running the old version with the custom component? If so, do the same things on the analysis page for that version that you did on the new version, and see whether it does something different. If it does do something different, then you will need to track down the problem in the code for your custom analyzer. Thanks, Shawn
Re: ignoring bad documents during index
At the layer right before you send that XML out, add a fallback option on error: if the batch fails, re-send each document one at a time.

Michael Della Bitta
Senior Software Engineer
o: +1 646 532 3062

appinions inc.
"The Science of Influence Marketing"
18 East 41st Street New York, NY 10017
t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions
w: appinions.com http://www.appinions.com/

On Fri, Feb 20, 2015 at 10:26 AM, SolrUser1543 osta...@gmail.com wrote: I am sending a bulk of XML documents via HTTP request, the same way as indexing documents via the Solr admin interface.
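(In SolrJ terms, that fallback might be sketched like this; a rough sketch only, with the client and batch plumbing assumed rather than taken from the poster's setup.)

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class FallbackIndexer {

    // Tries the whole batch first; if it fails, retries one document at a time
    // so a single bad document no longer sinks the rest of the batch.
    static List<SolrInputDocument> addWithFallback(HttpSolrServer solr,
                                                   List<SolrInputDocument> batch) {
        List<SolrInputDocument> rejected = new ArrayList<SolrInputDocument>();
        try {
            solr.add(batch);
        } catch (Exception batchFailure) {
            for (SolrInputDocument doc : batch) {
                try {
                    solr.add(doc);
                } catch (Exception docFailure) {
                    rejected.add(doc); // collect the bad documents for later inspection
                }
            }
        }
        return rejected;
    }
}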
Clarification of locktype=single and implications of use
Hello,

We don't want to use locktype=native (we are using NFS) or locktype=simple (we mount a read-only snapshot of the index on our search servers, and with locktype=simple, Solr refuses to start up because it sees the lock file). However, we don't quite understand the warnings about using locktype=single in the context of normal Solr operation. The ref guide and the wiki (http://wiki.apache.org/lucene-java/AvailableLockFactories) seem to indicate there is some danger in using locktype=single.

The wiki says:

locktype=single: Uses an object instance to represent the lock, so this is useful when you are certain that all modifications to a given index are running against a single shared in-process Directory instance. This is currently the default locking for RAMDirectory, but it could also make sense on an FSDirectory provided the other processes use the index in read-only.

We are using Solr. We would not configure two different Solr instances to write to the same index. So why would a normal Solr set-up possibly end up having more than one process writing to the same index? At the Lucene level there are multiple indexing threads, but they each write their own segments, and (I think) all the threads are in the same Solr process. Are we safe using locktype=single?

Tom
Re: Getting unique key of a document inside of a Similarity class.
from all the examples of what you've described, i'm fairly certain all you really need is a TFIDF based Similarity where coord(), idf(), tf() and queryNorm() return 1 always, and you omitNorms from all fields.

Yeah, that's what I did in the very first iteration. It works only for cases #1 and #2. If you try queries 3 and 4 with such a Similarity, you'll get:

3. place:(34\ High\ Street)^3 = doc1(score=9), doc2(score=9)
4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 = doc1(score=16), doc2(score=9)

That is not what I need. As I described above, in the case of a multiple-token match on a field, the method SimScorer.score is called X times, where X is the number of matched tokens (in cases #3 and #4 there are 3 tokens), and therefore the scores sum up. I need to score only once in this case, regardless of the number of tokens. How do I do that?

The first idea was a HashSet based on fieldName, so that after scoring a field once, it doesn't score anymore. But in this case only the first document was scored (since the second and subsequent documents have the same field name). So I understood that I also need the docID for that. And it worked fine until I found out (thank you for that) that docID is segment-specific. So now I need a segment ID as well (or something similar).

(You didn't give any examples of what you expect to happen with exclusion clauses in your BooleanQueries.) For my needs I won't need exclusion clauses, but in that case the same would happen: it would score depending on the weight, because the condition is true:

5. (NOT name:DocumentOne)^7 = doc2(score=7)
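(For reference, the "first iteration" flat Similarity described above looks roughly like this in Lucene 4.x terms; a sketch, not the poster's actual class. As the examples show, a BooleanQuery still sums these contributions per matched token, which is why place:(34\ High\ Street)^3 scores 9 rather than 3.)

import org.apache.lucene.search.similarities.DefaultSimilarity;

// Flattens TF-IDF so each matching term contributes exactly its query boost.
// Field norms must also be disabled (omitNorms=true on every field).
public class FlatSimilarity extends DefaultSimilarity {
    @Override public float tf(float freq) { return 1f; }
    @Override public float idf(long docFreq, long numDocs) { return 1f; }
    @Override public float coord(int overlap, int maxOverlap) { return 1f; }
    @Override public float queryNorm(float sumOfSquaredWeights) { return 1f; }
}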
Re: Committed before 500
Since you are getting these failures, the 90-second timeout is not "good enough". Try increasing it.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On Feb 20, 2015, at 5:22 AM, NareshJakher naresh.jak...@capgemini.com wrote:

Hi Shawn, I do not want to increase the timeout as these errors are very few, and the current timeout of 90 seconds is good enough. Is there a way to find out why Solr is timing out (at times)? Could it be that Solr is busy doing other activities like re-indexing, commits etc.? Additionally, I found that some non-leader nodes move to "recovering" or "recovery failed" after these timeout errors. I am just wondering if these are related to a performance issue and whether Solr commits need to be controlled.

Regards, Naresh Jakher
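(For what it's worth, if the 90-second limit lives in a SolrJ client, raising it along the lines Shawn suggested is a two-liner; the URL is a placeholder, and the timeout may equally be set in the servlet container or the shard-handler config, which is worth ruling out first.)

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class TimeoutConfig {
    public static void main(String[] args) {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        solr.setConnectionTimeout(15000); // 15 seconds to establish the TCP connection
        solr.setSoTimeout(300000);        // 5 minutes of socket inactivity, per Shawn's advice
    }
}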
Re: Remove all parent docs having specific child doc
On Fri, Feb 20, 2015 at 2:10 PM, Lokesh Chhaparwal xyzlu...@gmail.com wrote:

Hi, I want to remove all the parent docs having a specific child doc. E.g.:

<doc>Employee1
  <doc><field>Dept1</field></doc>
  <doc><field>Dept2</field></doc>
</doc>
<doc>Employee2
  <doc><field>Dept2</field></doc>
  <doc><field>Dept3</field></doc>
</doc>

Query: remove all employees which lie in Dept1. The response should be Employee2 *only*.

Problem: the *NOT operator is not supported* in the block join query parser.

q = {!parent which=employee:*}*:* -department:Dept1 - returns Employee1, Employee2

AFAIK the space in "}*:* -department:Dept1" breaks the parsing after the first space following q=; btw, you can confirm this by looking at the debugQuery=true output. Hence try either:

q={!parent which=employee:*}*:* -department:Dept1
q={!parent which=employee:*}*:*\ -department:Dept1
q={!parent which=employee:* v=$cq}&cq=*:* -department:Dept1

q = -{!parent which=employee:*}department:Dept1 - it does not work with the block join query parser.

Please suggest how to filter those employees which lie in Dept1 using block join or any other query parser?

Thanks, Lokesh

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
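(A sketch of Mikhail's third variant issued from SolrJ, which keeps the child query in its own parameter and sidesteps the space-parsing problem entirely; the field names are the ones from this thread, and the URL is a placeholder.)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class BlockJoinFilter {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // v=$cq pulls the child query from a separate parameter, so the space
        // before -department:Dept1 never touches the {!parent} local params.
        SolrQuery query = new SolrQuery();
        query.setQuery("{!parent which=employee:* v=$cq}");
        query.set("cq", "*:* -department:Dept1");
        query.set("debugQuery", "true"); // confirm how the child query was parsed

        System.out.println(solr.query(query).getResults());
        solr.shutdown();
    }
}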
Re: ignoring bad documents during index
I am sending a bulk of XML documents via HTTP request, the same way as indexing documents via the Solr admin interface.
Solr synonyms logic
Hi all,

I'm querying a recipe database in Solr. By using synonyms, I'm trying to make my search a little smarter. What I'm trying to do here is that a search for pastry returns all lasagne, penne and cannelloni recipes. However, a search for lasagne should only return lasagne recipes.

In my synonyms.txt, I have these lines:

lasagne,pastry
penne,pastry
cannelloni,pastry

The filter in my schema.xml looks like this:

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.WhitespaceTokenizerFactory" />

Only in the index analyzer, not in the query. When using the Solr analysis tool, I can see that my index for lasagne has a synonym pastry and my query only queries lasagne. Same for penne and cannelloni: they both have the synonym pastry. Currently my Solr query for lasagne also returns all penne and cannelloni recipes. I cannot understand why this is the case. Can someone explain this behaviour to me please?
Re: Strange search behaviour when upgrading to 4.10.3
Yes, the analyzers and tokenizers were recompiled with the new version of Solr/Lucene and there were some errors; most of them were related to using BytesRefBuilder, and I made those changes.

Can you try these links?
ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/ZimbraAnalyzer.java
ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalAnalyzer.java

-----Original Message-----
From: Shawn Heisey apa...@elyograg.org To: solr-user solr-user@lucene.apache.org Sent: Fri, Feb 20, 2015 11:52 am Subject: Re: Strange search behaviour when upgrading to 4.10.3

The links you provided lead to zero-byte files when I try them, so I could not look deeper.

Have you recompiled your custom analysis components against the newer versions of the Solr/Lucene libraries? Anytime you're dealing with custom components, you cannot assume that a component compiled to work with one version of Solr will work with another version. The internal API does change, and there is less emphasis on avoiding API breaks in minor Solr releases than there is with Lucene, because the vast majority of Solr users are not writing their own code that uses the Solr API. Recompiling against the newer libraries may cause compiler errors that reveal places in your code that require changes.

Thanks, Shawn
Strange search behaviour when upgrading to 4.10.3
Hi,

We are trying to upgrade from Solr 4.6 to 4.10.3. When testing search, 4.10.3 search results are not being returned; actually it looks like only the first word in a sentence is getting indexed. Ex: inserting "This is a test message" only returns results when searching for content:this*. Searching for content:test* or content:message* does not work with 4.10. Only searching for content:*message* works. This leads me to believe there is something wrong with the behaviour of our analyzer and tokenizers.

A little bit of background. We have had our own analyzer and tokenizer since pre Solr 1.4, and it's been regularly updated. The analyzer works with Solr 4.6, which we have running in production (I also tested that search works with Solr 4.9.1). It is very similar to the tokenizers and analyzers located here:

ftp://193.87.16.77/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/ZimbraAnalyzer.java
ftp://193.87.16.77/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalAnalyzer.java
ftp://193.87.16.77/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/

but with modifications to work with the latest Solr/Lucene code, e.g. overriding createComponents.

The schema of the field being analyzed is as follows:

<fields>
  <field name="content" type="ourType" stored="false" indexed="true" required="false" multiValued="true" />
</fields>
<fieldType name="ourType" indexed="true" class="solr.TextField">
  <analyzer class="com.zimbra.cs.index.ZimbraAnalyzer" />
</fieldType>

Looking at the release notes from Solr and Lucene:
http://lucene.apache.org/solr/4_10_1/changes/Changes.html
http://lucene.apache.org/core/4_10_1/changes/Changes.html
Nothing really sticks out, at least to me. Any help to get it working with 4.10 would be great.

Thanks, Rishi.
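(For reference, the createComponents override mentioned above has roughly this shape in Lucene 4.x; UniversalTokenizer stands in for the custom JFlex tokenizer from the links, and any token filters would be chained around it.)

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;

public final class UniversalAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // UniversalTokenizer is the custom JFlex-generated tokenizer referenced above.
        Tokenizer source = new UniversalTokenizer(reader);
        return new TokenStreamComponents(source);
    }
}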
Re: Strange search behaviour when upgrading to 4.10.3
On 2/20/2015 9:37 AM, Rishi Easwaran wrote:

We are trying to upgrade from Solr 4.6 to 4.10.3. When testing search, 4.10.3 search results are not being returned; actually it looks like only the first word in a sentence is getting indexed. Ex: inserting "This is a test message" only returns results when searching for content:this*. Searching for content:test* or content:message* does not work with 4.10. Only searching for content:*message* works. This leads me to believe there is something wrong with the behaviour of our analyzer and tokenizers.

<snip>

<fields>
  <field name="content" type="ourType" stored="false" indexed="true" required="false" multiValued="true" />
</fields>
<fieldType name="ourType" indexed="true" class="solr.TextField">
  <analyzer class="com.zimbra.cs.index.ZimbraAnalyzer" />
</fieldType>

Looking at the release notes from Solr and Lucene:
http://lucene.apache.org/solr/4_10_1/changes/Changes.html
http://lucene.apache.org/core/4_10_1/changes/Changes.html
Nothing really sticks out, at least to me. Any help to get it working with 4.10 would be great.

The links you provided lead to zero-byte files when I try them, so I could not look deeper.

Have you recompiled your custom analysis components against the newer versions of the Solr/Lucene libraries? Anytime you're dealing with custom components, you cannot assume that a component compiled to work with one version of Solr will work with another version. The internal API does change, and there is less emphasis on avoiding API breaks in minor Solr releases than there is with Lucene, because the vast majority of Solr users are not writing their own code that uses the Solr API. Recompiling against the newer libraries may cause compiler errors that reveal places in your code that require changes.

Thanks, Shawn
rankquery usage bug?
Hey guys, I put a rq in defaults but I can't figure out how to override it with no rankquery. Looks like one option might be checking for empty string before trying to use it in QueryComponent? I can work around it in the prep method of an earlier searchcomponent for now. Ryan
Re: Clarification of locktype=single and implications of use
: We are using Solr. We would not configure two different Solr instances to
: write to the same index. So why would a normal Solr set-up possibly end
: up having more than one process writing to the same index?

The risk here is that if you configure lockType=single, and then have some unintended user error such that two distinct java processes both attempt to use the same index dir, the lockType will not protect you in that situation.

For example: you normally run Solr on port 8983, but someone accidentally starts a second instance of Solr on port 7574 using the exact same configs with the exact same index dir -- lockType=single won't help you spot this error. lockType=native will (assuming your FileSystem can handle it).

lockType=single should protect you, however, if, for example, multiple SolrCores within the same Solr java process attempted to refer to the same index dir because you accidentally put an absolute path in a solrconfig.xml that gets shared by multiple cores.

-Hoss
http://www.lucidworks.com/
Re: Remove all parent docs having specific child doc
*q= -{!parent which=employee:*}department:Dept1* - it does not work with the block join query parser.

What do you mean? What does this query (no spaces, with brackets) return in your case?

q=-({!parent which=employee:*}department:Dept1)

20.02.2015, 18:02, Mikhail Khludnev mkhlud...@griddynamics.com:

On Fri, Feb 20, 2015 at 2:10 PM, Lokesh Chhaparwal xyzlu...@gmail.com wrote: Hi, I want to remove all the parent docs having a specific child doc. Query: remove all employees which lie in Dept1. The response should be Employee2 *only*. Problem: the *NOT operator is not supported* in the block join query parser. q = {!parent which=employee:*}*:* -department:Dept1 - returns Employee1, Employee2

AFAIK the space in "}*:* -department:Dept1" breaks the parsing after the first space following q=; btw, you can confirm this by looking at the debugQuery=true output. Hence try either:

q={!parent which=employee:*}*:* -department:Dept1
q={!parent which=employee:*}*:*\ -department:Dept1
q={!parent which=employee:* v=$cq}&cq=*:* -department:Dept1

q = -{!parent which=employee:*}department:Dept1 - it does not work with the block join query parser. Please suggest how to filter those employees which lie in Dept1 using block join or any other query parser? Thanks, Lokesh

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com