[ANNOUNCE] Apache Solr 8.0.0 released
14 March 2019, Apache Solr™ 8.0.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 8.0.0.

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search and analytics, rich document parsing, geospatial search, extensive REST APIs as well as parallel SQL. Solr is enterprise grade, secure and highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

The release is available for immediate download at: http://www.apache.org/dyn/closer.lua/lucene/solr/8.0.0

Please read CHANGES.txt for a detailed list of changes: https://lucene.apache.org/solr/8_0_0/changes/Changes.html

Solr 8.0.0 Release Highlights:

* Solr now uses HTTP/2 for inter-node communication

Being a major release, Solr 8 removes many deprecated APIs and changes various parameter defaults and behaviors. Some changes may require a re-index of your content. You are thus encouraged to thoroughly read the "Upgrade Notes" at http://lucene.apache.org/solr/8_0_0/changes/Changes.html or in the CHANGES.txt file accompanying the release.

Solr 8.0 also includes many other new features as well as numerous optimizations and bugfixes from the corresponding Apache Lucene release.

Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access.
[ANNOUNCE] Apache Solr 7.7.0 released
11 February 2019, Apache Solr™ 7.7.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 7.7.0.

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

Solr 7.7.0 is available for immediate download at: http://lucene.apache.org/solr/downloads.html

See http://lucene.apache.org/solr/7_7_0/changes/Changes.html for a full list of details.

Solr 7.7.0 Release Highlights:

Bug Fixes:

* URI Too Long with large streaming expressions in SolrJ.
* A failure while reloading a SolrCore can result in the SolrCore not being closed.
* Spellcheck parameters not working in the new UI.
* New Admin UI Query does not URL-encode the query produced in the URL box.
* Rule-based Authorization plugin skips authorization if the querying node does not have a collection replica.
* Solr installer fails on SuSE Linux.
* Fix incorrect SOLR_SSL_KEYSTORE_TYPE variable in the solr start script.

Improvements:

* JSON 'terms' faceting now supports a 'prelim_sort' option used when initially selecting the top-ranking buckets, prior to the final 'sort' option applied after refinement.
* Add a login page to the Admin UI, with initial support for Basic Auth and Kerberos.
* New node-level health check handler at the /admin/info/healthcheck and /node/health paths that checks whether the node is live, connected to ZooKeeper, and not shut down.
* It is now possible to configure a host whitelist for distributed search.

You are encouraged to thoroughly read the "Upgrade Notes" at http://lucene.apache.org/solr/7_7_0/changes/Changes.html or in the CHANGES.txt file accompanying the release.

Solr 7.7 also includes many other new features as well as numerous optimizations and bugfixes from the corresponding Apache Lucene release.

Please report any feedback to the mailing lists (http://lucene.apache.org/solr/community.html#mailing-lists-irc)

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access.
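As a rough illustration of the new 'prelim_sort' option, here is a sketch (not taken from the announcement) of a JSON Facet request body; the `category` field and `avg_price` stat are hypothetical, so check the Solr Reference Guide for the authoritative syntax:

```python
import json

# Sketch of a JSON Facet request using 'prelim_sort' (Solr 7.7+):
# buckets are first selected using the cheap prelim_sort, and the more
# expensive 'sort' (here a stats function) is applied after refinement.
# Field and stat names are hypothetical.
facet_request = {
    "query": "*:*",
    "facet": {
        "categories": {
            "type": "terms",
            "field": "category",
            "limit": 10,
            "prelim_sort": "count desc",   # initial bucket selection
            "sort": "avg_price desc",      # final sort after refinement
            "facet": {"avg_price": "avg(price)"},
        }
    },
}

# The body would be POSTed to /solr/<collection>/query.
print(json.dumps(facet_request, indent=2))
```

The point of the two-phase sort is that the stats function only has to be computed for the buckets that survive the cheap initial ranking.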
[ANNOUNCE] Apache Solr 7.5.0 released
24 September 2018, Apache Solr™ 7.5.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 7.5.0.

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

Solr 7.5.0 is available for immediate download at: http://lucene.apache.org/solr/downloads.html

See http://lucene.apache.org/solr/7_5_0/changes/Changes.html for a full list of details.

Solr 7.5.0 Release Highlights:

* Nested/child documents may now be supplied as a field value instead of stand-off. Future releases will leverage this semantic information.
* Enhanced Autoscaling policy support to equally distribute replicas on the basis of arbitrary properties.
* Nodes are now visible inside a view of the Admin UI "Cloud" tab, listing nodes and key metrics.
* The status of the ZooKeeper ensemble is now accessible under the Admin UI Cloud tab.
* The new Korean morphological analyzer ("nori") has been added to the default distribution.

You are encouraged to thoroughly read the "Upgrade Notes" at http://lucene.apache.org/solr/7_5_0/changes/Changes.html or in the CHANGES.txt file accompanying the release.

Solr 7.5 also includes many other new features as well as numerous optimizations and bugfixes of the corresponding Apache Lucene release.

Please report any feedback to the mailing lists (http://lucene.apache.org/solr/community.html#mailing-lists-irc)

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access.
[ANNOUNCE] Apache Solr 7.2.1 released
15 January 2018, Apache Solr™ 7.2.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 7.2.1.

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search and analytics, rich document parsing, geospatial search, extensive REST APIs as well as parallel SQL. Solr is enterprise grade, secure and highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

This release includes 4 bug fixes since the 7.2.0 release:

* Overseer can never process some last messages.
* Renaming a core in Solr standalone mode is not persisted.
* QueryComponent's rq parameter parsing no longer considers the defType parameter.
* Fix NPE in SolrQueryParser when the query terms inside a filter clause reduce to nothing.

Furthermore, this release includes Apache Lucene 7.2.1, which includes 1 bug fix since the 7.2.0 release.

The release is available for immediate download at: http://www.apache.org/dyn/closer.lua/lucene/solr/7.2.1

Please read CHANGES.txt for a detailed list of changes: https://lucene.apache.org/solr/7_2_1/changes/Changes.html

Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access.
[ANNOUNCE] Apache Solr 6.5.1 released
27 April 2017, Apache Solr™ 6.5.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 6.5.1.

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search and analytics, rich document parsing, geospatial search, extensive REST APIs as well as parallel SQL. Solr is enterprise grade, secure and highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

This release includes 11 bug fixes since the 6.5.0 release. Some of the major fixes are:

* bin\solr.cmd delete and healthcheck now work again; fixed continuation chars ^
* Fix debug-related NullPointerException in the solr/contrib/ltr OriginalScoreFeature class.
* The JSON output of /admin/metrics is fixed to write the container as a map (SimpleOrderedMap) instead of an array (NamedList).
* On 'downnode', lots of wasteful mutations are done to ZK.
* Fix params persistence for the solr/contrib/ltr (MinMax|Standard)Normalizer classes.
* The fetch() streaming expression wouldn't work if a value included query syntax chars (like :+-). Fixed, and enhanced the generated query to not pollute the queryCache.
* Disable graph query production via schema configuration. This fixes broken queries for ShingleFilter-containing query-time analyzers when the request param sow=false.
* Fix indexed="false" on numeric PointFields.
* SQL AVG function misinterprets field type.
* SQL interface does not use client cache.
* edismax with sow=false fails to create dismax-per-term queries when any field is boosted.

Furthermore, this release includes Apache Lucene 6.5.1, which includes 3 bug fixes since the 6.5.0 release.

The release is available for immediate download at: http://www.apache.org/dyn/closer.lua/lucene/solr/6.5.1

Please read CHANGES.txt for a detailed list of changes: https://lucene.apache.org/solr/6_5_1/changes/Changes.html

Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access.
[ANNOUNCE] Apache Solr 6.5.0 released
27 March 2017, Apache Solr 6.5.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 6.5.0.

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search and analytics, rich document parsing, geospatial search, extensive REST APIs as well as parallel SQL. Solr is enterprise grade, secure and highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

Solr 6.5.0 is available for immediate download at:
- http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Please read CHANGES.txt for a full list of new features and changes:
- https://lucene.apache.org/solr/6_5_0/changes/Changes.html

Highlights of this Solr release include:

- PointFields (fixed-width multi-dimensional numeric & binary types enabling fast range search) are now supported
- In-place updates to numeric docValues fields (single-valued, non-stored, non-indexed) are supported using the atomic update syntax
- A new LatLonPointSpatialField that uses points or doc values for query
- It is now possible to declare a field as "large" in order to bypass the document cache
- New sow=false request param (split-on-whitespace) for the edismax & standard query parsers enables query-time multi-term synonyms
- XML QueryParser (defType=xmlparser) now supports span queries
- hl.maxAnalyzedChars now has a consistent default across highlighters
- UnifiedSolrHighlighter and PostingsSolrHighlighter now support CustomSeparatorBreakIterator
- Scoring formula is adjusted for the scoreNodes function
- Calcite Planner now applies constant reduction rules to optimize plans
- A new significantTerms streaming expression that is able to extract the significant terms in an index
- StreamHandler is now able to use runtimeLib jars
- Arithmetic operations are added to the SelectStream
- Added a modernized self-documenting /v2 API
- The .system collection is now created on first request if it does not exist
- Admin UI: Added shard deletion button
- Metrics API now supports non-numeric metrics (version, disk type, component state, system properties...)
- The disk free and aggregated disk free metrics are now reported
- DirectUpdateHandler2 now implements MetricsProducer and exposes stats via the metrics API and configured reporters
- BlockCache is faster due to fewer failures when caching a new block
- MMapDirectoryFactory now supports a "preload" option to ask for mapped pages to be loaded into physical memory on init
- Security: BasicAuthPlugin now supports standalone mode
- Arbitrary Java system properties can be passed to zkcli
- SolrHttpClientBuilder can be configured via a Java system property
- Javadocs and Changes.html are no longer included in the binary distribution, but are hosted online

Further details of changes are available in the change log at: http://lucene.apache.org/solr/6_5_0/changes/Changes.html

Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also applies to Maven access.
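The in-place update feature above rides on the existing atomic-update syntax. A minimal sketch (the `popularity` and `views` field names are hypothetical) of what such an update document might look like:

```python
import json

# Sketch of the atomic-update syntax behind in-place updates (Solr 6.5+).
# If 'popularity' and 'views' are single-valued, non-stored, non-indexed
# numeric docValues fields, Solr can apply 'set'/'inc' in place rather
# than re-indexing the whole document. Field names are hypothetical.
update = [
    {
        "id": "doc1",
        "popularity": {"set": 42},  # overwrite the docValues entry
        "views": {"inc": 1},        # increment a numeric counter
    }
]

# The body would be POSTed to /solr/<collection>/update.
print(json.dumps(update))
```

Because only the docValues entry changes, the document's inverted-index entries are untouched, which is what makes the update cheap.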
[ANNOUNCE] Apache Solr 6.4.0 released
23 January 2017, Apache Solr™ 6.4.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 6.4.0.

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search and analytics, rich document parsing, geospatial search, extensive REST APIs as well as parallel SQL. Solr is enterprise grade, secure and highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

Solr 6.4.0 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Highlights of this Solr release include:

Streaming:
* Addition of a HavingStream to the Streaming API and Streaming Expressions
* Addition of a priority Streaming Expression
* Streaming expressions now support collection aliases

Machine Learning:
* Configurable Learning-To-Rank (LTR) support: upload feature definitions, extract feature values, upload your own machine-learnt models and use them to rerank search results.

Faceting:
* Added "param" query type to the facet domain filter specification to obtain filters via query parameters
* Any facet command can be filtered using a new parameter filter. Example: { type:terms, field:category, filter:"user:yonik" }

Scripts / Command line:
* A new command-line tool to manage the snapshots functionality
* bin/solr and bin/solr.cmd now use the mkroot command

SolrCloud / SolrJ:
* LukeResponse now supports dynamic fields
* The SolrJ client now supports hierarchical clusters and other topics marker
* Collection backup/restore are extensible.

Security:
* Support Secure Impersonation / Proxy User for Solr authentication
* Key store type can be specified in the solr.in.sh file for SSL
* New generic authentication plugins: 'HadoopAuthPlugin' and 'ConfigurableInternodeAuthHadoopPlugin' that delegate all functionality to the Hadoop authentication framework

Query / QueryParser / Highlighting:
* A new highlighter: the Unified Highlighter. Try it via hl.method=unified; many popular highlighting parameters/features are supported. It's the highest-performing highlighter, especially for large documents. Highlighting of phrase queries and exotic queries is supported equally as well as in the Original Highlighter (aka the default/standard one). Please use this new highlighter and report issues, since it will likely become the default one day.
* Leading wildcards in the complexphrase query parser are now accepted and optimized with the ReversedWildcardFilterFactory when it's provided

Metrics:
* Use of the metrics-jvm library to instrument JVM internals such as GC, memory usage and others.
* A lot of metrics have been added to the collection: index merges, index store I/Os, query, update, core admin, core load thread pools, shard replication, tlog replay and replicas
* A new /admin/metrics API to return all metrics collected by Solr.

Misc changes:
* The new config parameter 'maxRamMB' can now limit the memory consumed by the FastLRUCache
* A new document processor 'SkipExistingDocumentsProcessor' that skips duplicate inserts and ignores updates to missing docs
* FieldCache information fetched via the mbeans handler or seen via the UI now displays the total size used.
* A new config flag 'enable' allows enabling/disabling of any cache

Please note, this release cannot be built from source with Java 8 update 121; use an earlier version instead! This is caused by a bug introduced into the Javadocs tool shipped with that update. The workaround was too late for this Lucene release. Of course, you can use the binary artifacts.

See the Solr CHANGES.txt files included with the release for a full list of details.

Thanks,
Jim Ferenczi
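The "param" facet filter mentioned under Faceting lets the domain filter be supplied as a request parameter rather than inlined in the JSON facet spec. A hedged sketch of what such a request might look like (the `userfilt` parameter and `category` field are hypothetical; consult the Solr Reference Guide for the exact domain-filter syntax):

```python
# Sketch of the 6.4 "param" facet filter type: the facet domain filter
# is looked up from a request parameter ('userfilt') instead of being
# written inline. Parameter and field names are hypothetical.
params = {
    "q": "*:*",
    "userfilt": "user:yonik",  # referenced by the facet domain below
    "json.facet": '{top_categories: {type: terms, field: category,'
                  ' domain: {filter: {param: userfilt}}}}',
}

# The params would be sent to /solr/<collection>/select.
print(params["json.facet"])
```

Indirecting the filter through a parameter lets the same facet spec be reused while the filter varies per request.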
Re: Very high memory and CPU utilization.
Well, it seems that doing q="network se*" is working, but not in the way you expect. Doing q="network se*" does not trigger a prefix query: the "*" character is treated like any other character. I suspect that your query is in fact "network se" (assuming you're using a StandardTokenizer) and that the word "se" is very popular in your documents. That would explain the slow response time. Bottom line: "network se*" will not trigger a prefix query at all (I may be wrong, but this is the expected behaviour for Solr up to 4.3).

2015-11-02 13:47 GMT+01:00 Modassar Ather <modather1...@gmail.com>:

> The problem is with the same query as phrase. q="network se*".
> The last . is the full stop for the sentence, and the query is q=field:"network se*"
>
> Best,
> Modassar
>
> On Mon, Nov 2, 2015 at 6:10 PM, jim ferenczi <jim.feren...@gmail.com> wrote:
>
> > Oups, I did not read the thread carefully.
> > *The problem is with the same query as phrase. q="network se*".*
> > I was not aware that you could do that with Solr ;). I would say this is expected, because in such a case, if the number of expansions for "se*" is big, then you would have to check the positions for a significant number of words. I don't know if there is a limitation on the number of expansions for a prefix query contained in a phrase query, but I would look at this parameter first (limit the number of expansions per prefix search, let's say to the N most significant words based on word frequency, for instance).
> >
> > 2015-11-02 13:36 GMT+01:00 jim ferenczi <jim.feren...@gmail.com>:
> >
> > > *I am not able to get the above point. So when I start Solr with 28g RAM, for all the activities related to Solr it should not go beyond 28g. And the remaining heap will be used for activities other than Solr. Please help me understand.*
> > >
> > > Well, those 28GB of heap are the memory "reserved" for your Solr application, though some parts of the index (not to say all) are retrieved via MMap (if you use the default MMapDirectory), which does not use the heap at all. This is a very important part of Lucene/Solr: the heap should be sized in a way that leaves a significant amount of RAM available for the index. If not, then you rely on the speed of your disk; if you have SSDs it's better, but reads are still significantly slower with SSDs than with direct RAM access. Another thing to keep in mind is that mmap will always try to put things in RAM; this is why I suspect that swap activity is killing your performance.
> > >
> > > 2015-11-02 11:55 GMT+01:00 Modassar Ather <modather1...@gmail.com>:
> > >
> > >> Thanks Jim for your response.
> > >>
> > >> The remaining size after you removed the heap usage should be reserved for the index (not only the other system activities).
> > >> I am not able to get the above point. So when I start Solr with 28g RAM, for all the activities related to Solr it should not go beyond 28g. And the remaining heap will be used for activities other than Solr. Please help me understand.
> > >>
> > >> *Also the CPU utilization goes up to 400% in a few of the nodes:*
> > >> You said that only one machine is used, so I assumed that 400% CPU is for a single process (one Solr node), right?
> > >> Yes, you are right that 400% is for a single process.
> > >> The disks are SSDs.
> > >>
> > >> Regards,
> > >> Modassar
> > >>
> > >> On Mon, Nov 2, 2015 at 4:09 PM, jim ferenczi <jim.feren...@gmail.com> wrote:
> > >>
> > >> > *if it correlates with the bad performance you're seeing. One important thing to notice is that a significant part of your index needs to be in RAM (especially if you're using SSDs) in order to achieve good performance.*
> > >> >
> > >> > Especially if you're not using SSDs, sorry ;)
> > >> >
> > >> > 2015-11-02 11:38 GMT+01:00 jim ferenczi <jim.feren...@gmail.com>:
> > >>
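For reference (this goes beyond what the thread discusses), Solr does ship a parser that expands wildcards inside quoted phrases: the complexphrase query parser. A hedged sketch of such a request, where the `field` name is hypothetical:

```python
from urllib.parse import urlencode

# Sketch: the ComplexPhraseQParserPlugin, unlike the standard phrase
# handling discussed in the thread, does expand wildcards inside quotes,
# so "network se*" really matches "network" followed by a se-prefixed
# term. The 'field' name is hypothetical.
params = {
    "q": '{!complexphrase inOrder=true}field:"network se*"',
    "rows": 10,
}

# Would be sent as GET /solr/<collection>/select?<query_string>
query_string = urlencode(params)
print(query_string)
```

Note that this makes the expansion cost the thread worries about explicit: a popular prefix such as se* still expands to many terms whose positions must all be checked.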
Re: Very high memory and CPU utilization.
12 shards with 28GB for the heap and 90GB for each index means that you need at least 336GB for the heap (assuming you're using all of it, which may easily be the case considering the way the GC is handling memory) and ~1TB for the index. Let's say that you don't need your entire index in RAM; the problem as I see it is that you don't have enough RAM for your index + heap. Assuming your machine has 370GB of RAM, there are only 34GB left for your index, and 1TB/34GB means that you can only have about 1/30 of your entire index in RAM. I would advise you to check the swap activity on the machine and see if it correlates with the bad performance you're seeing. One important thing to notice is that a significant part of your index needs to be in RAM (especially if you're using SSDs) in order to achieve good performance:

*As mentioned above this is a big machine with 370+ gb of RAM and Solr (12 nodes total) is assigned 336 GB. The rest is still good for other system activities.*
The remaining size after you removed the heap usage should be reserved for the index (not only the other system activities).

*Also the CPU utilization goes up to 400% in a few of the nodes:*
You said that only one machine is used, so I assumed that 400% CPU is for a single process (one Solr node), right? This seems impossible if you are sure that only one query is played at a time and no indexing is performed. The best thing to do is to dump the stack traces of the Solr nodes during the query and check what the threads are doing.

Jim

2015-11-02 10:38 GMT+01:00 Modassar Ather:

> Just to add one more point: one external Zookeeper instance is also running on this particular machine.
>
> Regards,
> Modassar
>
> On Mon, Nov 2, 2015 at 2:34 PM, Modassar Ather wrote:
>
> > Hi Toke,
> > Thanks for your response. My comments in-line.
> >
> > That is 12 machines, running a shard each?
> > No! This is a single big machine with 12 shards on it.
> >
> > What is the total amount of physical memory on each machine?
> > Around 370 gb on the single machine.
> >
> > Well, se* probably expands to a great deal of documents, but a huge bump in memory utilization and 3 minutes+ sounds strange.
> >
> > - What are your normal query times?
> > A few simple queries are returned within a couple of seconds. But the more complex queries with proximity and wildcards have taken more than 3-4 minutes, and sometimes some queries have timed out too, where the timeout is set to 5 minutes.
> > - How many hits do you get from 'network se*'?
> > More than a million records.
> > - How many results do you return (the rows parameter)?
> > It is the default one, 10. Grouping is enabled on a field.
> > - If you issue a query without wildcards, but with approximately the same amount of hits as 'network se*', how long does it take?
> > A query resulting in around half a million records returns within a couple of seconds.
> >
> > That is strange, yes. Have you checked the logs to see if something unexpected is going on while you test?
> > Have not seen anything particular. Will try to check again.
> >
> > If you are using spinning drives and only have 32GB of RAM in total in each machine, you are probably struggling just to keep things running.
> > As mentioned above this is a big machine with 370+ gb of RAM and Solr (12 nodes total) is assigned 336 GB. The rest is still good for other system activities.
> >
> > Thanks,
> > Modassar
> >
> > On Mon, Nov 2, 2015 at 1:30 PM, Toke Eskildsen wrote:
> >
> >> On Mon, 2015-11-02 at 12:00 +0530, Modassar Ather wrote:
> >> > I have a setup of a 12 shard cluster started with 28gb memory each on a single server. There are no replicas. The size of the index is around 90gb on each shard. The Solr version is 5.2.1.
> >>
> >> That is 12 machines, running a shard each?
> >>
> >> What is the total amount of physical memory on each machine?
> >>
> >> > When I query "network se*", the memory utilization goes up to 24-26 gb and the query takes around 3+ minutes to execute. Also the CPU utilization goes up to 400% in a few of the nodes.
> >>
> >> Well, se* probably expands to a great deal of documents, but a huge bump in memory utilization and 3 minutes+ sounds strange.
> >>
> >> - What are your normal query times?
> >> - How many hits do you get from 'network se*'?
> >> - How many results do you return (the rows parameter)?
> >> - If you issue a query without wildcards, but with approximately the same amount of hits as 'network se*', how long does it take?
> >>
> >> > Why is the CPU utilization so high and more than one core used? As far as I understand, querying is single threaded.
> >>
> >> That is strange, yes. Have you checked the logs to see if something unexpected is going on while you test?
> >>
> >> > How can I disable replication (as it is implicitly enabled) permanently
> >>
Re: Very high memory and CPU utilization.
*if it correlates with the bad performance you're seeing. One important thing to notice is that a significant part of your index needs to be in RAM (especially if you're using SSDs) in order to achieve good performance.* Especially if you're not using SSDs, sorry ;) 2015-11-02 11:38 GMT+01:00 jim ferenczi <jim.feren...@gmail.com>: > 12 shards with 28GB for the heap and 90GB for each index means that you > need at least 336GB for the heap (assuming you're using all of it which may > be easily the case considering the way the GC is handling memory) and ~= > 1TO for the index. Let's say that you don't need your entire index in RAM, > the problem as I see it is that you don't have enough RAM for your index + > heap. Assuming your machine has 370GB of RAM there are only 34GB left for > your index, 1TO/34GB means that you can only have 1/30 of your entire index > in RAM. I would advise you to check the swap activity on the machine and > see if it correlates with the bad performance you're seeing. One important > thing to notice is that a significant part of your index needs to be in RAM > (especially if you're using SSDs) in order to achieve good performance: > > > > *As mentioned above this is a big machine with 370+ gb of RAM and Solr (12 > nodes total) is assigned 336 GB. The rest is still a good for other system > activities.* > The remaining size after you removed the heap usage should be reserved for > the index (not only the other system activities). > > > *Also the CPU utilization goes upto 400% in few of the nodes:* > You said that only machine is used so I assumed that 400% cpu is for a > single process (one solr node), right ? > This seems impossible if you are sure that only one query is played at a > time and no indexing is performed. Best thing to do is to dump stack trace > of the solr nodes during the query and to check what the threads are doing. 
> Jim
>
> 2015-11-02 10:38 GMT+01:00 Modassar Ather <modather1...@gmail.com>:
>
>> Just to add one more point: one external ZooKeeper instance is also
>> running on this particular machine.
>>
>> Regards,
>> Modassar
>>
>> On Mon, Nov 2, 2015 at 2:34 PM, Modassar Ather <modather1...@gmail.com> wrote:
>>
>>> Hi Toke,
>>> Thanks for your response. My comments in-line.
>>>
>>> That is 12 machines, running a shard each?
>>> No! This is a single big machine with 12 shards on it.
>>>
>>> What is the total amount of physical memory on each machine?
>>> Around 370 GB on the single machine.
>>>
>>> Well, se* probably expands to a great deal of documents, but a huge
>>> bump in memory utilization and 3+ minutes sounds strange.
>>>
>>> - What are your normal query times?
>>> A few simple queries return within a couple of seconds. But the more
>>> complex queries with proximity and wildcards have taken more than 3-4
>>> minutes, and sometimes queries have timed out where the timeout is
>>> set to 5 minutes.
>>> - How many hits do you get from 'network se*'?
>>> More than a million records.
>>> - How many results do you return (the rows parameter)?
>>> It is the default of 10. Grouping is enabled on a field.
>>> - If you issue a query without wildcards, but with approximately the
>>> same amount of hits as 'network se*', how long does it take?
>>> A query resulting in around half a million records returns within a
>>> couple of seconds.
>>>
>>> That is strange, yes. Have you checked the logs to see if something
>>> unexpected is going on while you test?
>>> Have not seen anything in particular. Will try to check again.
>>>
>>> If you are using spinning drives and only have 32GB of RAM in total
>>> in each machine, you are probably struggling just to keep things
>>> running.
>>> As mentioned above, this is a big machine with 370+ GB of RAM and
>>> Solr (12 nodes total) is assigned 336 GB. The rest is still good for
>>> other system activities.
>>>
>>> Thanks,
>>> Modassar
>>>
>>> On Mon, Nov 2, 2015 at 1:30 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
>>>
>>>> On Mon, 2015-11-02 at 12:00 +0530, Modassar Ather wrote:
>>>> > I have a setup of a 12-shard cluster started with 28GB memory each
>>>> > on a single server. There are no replicas. The size of the index
>>>> > is around 90GB on each shard. The Solr version is 5.2.1.
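Jim's arithmetic above can be checked directly. A quick back-of-the-envelope sketch using the numbers from this thread (12 shards, 28 GB heap each, 90 GB index per shard, 370 GB of physical RAM); the exact fraction differs slightly from Jim's rounded "1/30" because 12 x 90 GB is a bit more than 1 TB:

```python
# Back-of-the-envelope RAM budget for the setup described in this thread.
shards = 12
heap_per_shard_gb = 28
index_per_shard_gb = 90
total_ram_gb = 370

total_heap_gb = shards * heap_per_shard_gb            # 336 GB reserved by the JVMs
total_index_gb = shards * index_per_shard_gb          # 1080 GB (~1 TB) of index
ram_left_for_index_gb = total_ram_gb - total_heap_gb  # 34 GB left for the OS page cache

fraction_cached = ram_left_for_index_gb / total_index_gb
print(total_heap_gb, total_index_gb, ram_left_for_index_gb)
print(f"only ~1/{round(1 / fraction_cached)} of the index fits in RAM")
```

Everything the page cache cannot hold has to come from disk on every cold read, which is why the thread keeps returning to swap activity and disk speed.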
Re: Very high memory and CPU utilization.
*I am not able to get the above point. So when I start Solr with 28g RAM,
for all the activities related to Solr it should not go beyond 28g. And the
remaining heap will be used for activities other than Solr. Please help me
understand.*

Well, those 28GB of heap are the memory "reserved" for your Solr
application, though some parts of the index (not to say all of it) are
retrieved via mmap (if you use the default MMapDirectory), which does not
use the heap at all. This is a very important part of Lucene/Solr: the
heap should be sized in a way that leaves a significant amount of RAM
available for the index. If not, then you rely on the speed of your disk;
if you have SSDs it's better, but reads are still significantly slower
with SSDs than with direct RAM access. Another thing to keep in mind is
that mmap will always try to put things in RAM, which is why I suspect
that swap activity is killing your performance.

2015-11-02 11:55 GMT+01:00 Modassar Ather <modather1...@gmail.com>:

> Thanks Jim for your response.
>
> The remaining size after you removed the heap usage should be reserved
> for the index (not only the other system activities).
> I am not able to get the above point. So when I start Solr with 28g RAM,
> for all the activities related to Solr it should not go beyond 28g. And
> the remaining heap will be used for activities other than Solr. Please
> help me understand.
>
> *Also the CPU utilization goes up to 400% in a few of the nodes:*
> You said that only one machine is used, so I assumed that 400% CPU is
> for a single process (one Solr node), right?
> Yes, you are right that 400% is for a single process.
> The disks are SSDs.
>
> Regards,
> Modassar
Re: Very high memory and CPU utilization.
Oops, I did not read the thread carefully.

*The problem is with the same query as phrase. q="network se*".*

I was not aware that you could do that with Solr ;). I would say this is
expected, because in such a case, if the number of expansions for "se*"
is big, then you have to check the positions for a significant number of
words. I don't know if there is a limit on the number of expansions for a
prefix query contained in a phrase query, but I would look at that
parameter first (limit the number of expansions per prefix search, say to
the N most significant words based on word frequency, for instance).

2015-11-02 13:36 GMT+01:00 jim ferenczi <jim.feren...@gmail.com>:

> *I am not able to get the above point. So when I start Solr with 28g
> RAM, for all the activities related to Solr it should not go beyond 28g.
> And the remaining heap will be used for activities other than Solr.
> Please help me understand.*
>
> Well, those 28GB of heap are the memory "reserved" for your Solr
> application, though some parts of the index (not to say all of it) are
> retrieved via mmap (if you use the default MMapDirectory), which does
> not use the heap at all.
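To see why a prefix inside a phrase is expensive, consider how many terms a two-letter prefix can expand to. A toy illustration (the vocabulary here is made up; real indexes can have hundreds of thousands of terms matching a short prefix): every expanded term turns into a separate phrase whose positions must be checked per candidate document.

```python
# Toy illustration of prefix expansion inside a phrase query.
# Hypothetical vocabulary; not Solr's actual expansion code.
vocabulary = ["search", "server", "service", "session", "set", "secure",
              "network", "node", "query"]

prefix = "se"
expansions = [t for t in vocabulary if t.startswith(prefix)]
print(len(expansions), expansions)

# A phrase query "network se*" effectively becomes the union of the
# phrases "network search", "network server", ... -- one position check
# per expansion, per candidate document.
phrases = [("network", t) for t in expansions]
print(len(phrases))
```

With a million-hit prefix, the position checks, not the term lookup, dominate the cost, which matches the long query times reported in this thread.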
Re: fq versus q
> In part of queries we see strange behavior where q performs 5-10x better
> than fq. The question is why?

Are you sure that the query result cache is disabled?

2015-06-24 13:28 GMT+02:00 Esther Goldbraich <estherg...@il.ibm.com>:

Hi,
We are comparing the performance of fq versus q for queries that are
actually filters and should not be cached. In part of the queries we see
strange behavior where q performs 5-10x better than fq. The question is
why?

An example:
  q=maildate:{DATE1 TO DATE2}
compared to
  fq={!cache=false}maildate:{DATE1 TO DATE2}
with:
  sort=maildate_sort desc
  rows=50
  start=0
  group=true
  group.query=some query (without dates)
  group.query=*:*
  group.sort=maildate_sort desc
  additional fqs

Schema:
  <field name="maildate" stored="true" indexed="true" type="tdate"/>
  <field name="maildate_sort" stored="false" indexed="false" type="tdate" docValues="true"/>

Thank you,
Esther
-
Esther Goldbraich
Social Technologies & Analytics - IBM Haifa Research Lab
Phone: +972-4-8281059
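For reference, the two variants being compared can be written out as full request URLs. A sketch (the host and collection name are placeholders, and DATE1/DATE2 stand in for real timestamps as in the mail):

```python
from urllib.parse import urlencode

# Placeholder host/collection; parameters mirror the example in the mail.
base = "http://localhost:8983/solr/mail/select"

common = [
    ("sort", "maildate_sort desc"),
    ("rows", "50"),
    ("start", "0"),
    ("group", "true"),
    ("group.query", "some query (without dates)"),
    ("group.query", "*:*"),
    ("group.sort", "maildate_sort desc"),
]

# Variant 1: the range restriction inside the main query.
q_variant = urlencode([("q", "maildate:{DATE1 TO DATE2}")] + common)

# Variant 2: the range as an explicitly uncached filter query.
fq_variant = urlencode([("q", "*:*"),
                        ("fq", "{!cache=false}maildate:{DATE1 TO DATE2}")] + common)

print(base + "?" + q_variant)
print(base + "?" + fq_variant)
```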
Re: Solr returns incorrect results after sorting
Then you just have to remove the group.sort, especially if your group
limit is set to 1.

On 19 March 2015 at 16:57, kumarraj <rajitpro2...@gmail.com> wrote:

*if the number of documents in one group is more than one then you cannot
ensure that this document reflects the main sort*
Is there a way the top record which comes up in the group can be
considered for sorting? We need to show the record from 212 (even though
its price is low) in both the high-to-low and low-to-high cases, and the
main sorting should still work.

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-returns-incorrect-results-after-sorting-tp4193266p4194008.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr returns incorrect results after sorting
Hi Raj,
The group.sort you are using defines multiple criteria. The first
criterion is the big Solr function starting with max(). This means that
inside each group the documents will be sorted by this criterion, and if
the values are equal between two documents then the comparison falls back
to the second criterion (inStock_boolean desc), and so on.

*Even though I add price asc in the group.sort, the main sort still does
not consider that.*
The main sort does not have to consider what's in the group.sort. The
group.sort defines the way the documents are sorted inside each group. So
if you want to sort the documents inside each group in the same order as
the main sort, you can remove the group.sort, or you can make
pricecommon_double desc the primary criterion of your group.sort:

*group.sort=pricecommon_double desc,
max(if(exists(query({!v='storeName_string:212'})),2,0),if(exists(query({!v='storeName_string:203'})),1,0))
desc, inStock_boolean desc, geodist() asc*

Cheers,
Jim

2015-03-18 7:28 GMT+01:00 kumarraj <rajitpro2...@gmail.com>:

Hi Jim,
Yes, you are right: that document has price 499.99. But I want the first
record in the group to be considered as part of the main sort. Even
though I add price asc in the group.sort, the main sort still does not
consider that.

group.sort=max(if(exists(query({!v='storeName_string:212'})),2,0),if(exists(query({!v='storeName_string:203'})),1,0))
desc, inStock_boolean desc, geodist() asc, pricecommon_double asc
sort=pricecommon_double desc

Is there any other workaround so that the sort is always based on the
first record pulled up in each group?

Regards,
Raj

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-returns-incorrect-results-after-sorting-tp4193266p4193658.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr returns incorrect results after sorting
Hi,
Please note that you have two sort criteria: one to sort the documents
inside each group (group.sort) and one to sort the groups (sort). In the
example you sent, the group 10002 has two documents and your group.limit
is set to 1. If you redo the query with group.limit=2, I suspect that
you'll see the second document of this group with a pricecommon_double
between 479.99 and 729.97. This would mean that the sorting is correct ;).
Bottom line: when you have a group.sort different from the main sort, if
the number of documents in one group is more than one, then you cannot
ensure that the document shown for the group reflects the main sort. Try
for instance group.sort=pricecommon_double asc (main sort in inverse
order) and you'll see that the sort inside each group is always applied
after the main sort. This is the only way to meet the expectations ;).
Cheers,
Jim

2015-03-17 9:48 GMT+01:00 kumarraj <rajitpro2...@gmail.com>:

Thanks David, that was a typo. Do you see any other issues? When Solr
does the grouping, and more than one document matches a given group
(numFound=2), that particular group is not sorted correctly when
sort=pricecommon_double desc is applied across all the groups.
Example: below is the sample result.
<arr name="groups">
  <lst>
    <str name="groupValue">10001</str>
    <result name="doclist" numFound="1" start="0">
      <doc>
        <double name="pricecommon_double">729.97</double>
        <str name="code_string">10001</str>
        <str name="name_text">Product1</str>
        <str name="storeName_string">203</str>
        <double name="geodist()">198.70324062133778</double>
      </doc>
    </result>
  </lst>
  <lst>
    <str name="groupValue">10002</str>
    <result name="doclist" numFound="2" start="0">
      <doc>
        <double name="pricecommon_double">279.99</double>
        <str name="code_string">10002</str>
        <str name="name_text">Product2</str>
        <str name="storeName_string">212</str>
        <double name="geodist()">0.0</double>
      </doc>
    </result>
  </lst>
  <lst>
    <str name="groupValue">10003</str>
    <result name="doclist" numFound="1" start="0">
      <doc>
        <double name="pricecommon_double">479.99</double>
        <str name="code_string">10003</str>
        <str name="name_text">Product3</str>
        <str name="storeName_string">203</str>
        <double name="geodist()">198.70324062133778</double>
      </doc>
    </result>
  </lst>
</arr>

I expect product 10002 to be sorted and shown after 10003, but it is not
sorted correctly.

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-returns-incorrect-results-after-sorting-tp4193266p4193457.html
Sent from the Solr - User mailing list archive at Nabble.com.
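Jim's point can be reproduced with a small simulation (made-up documents loosely mirroring the sample response above, including the 499.99 document mentioned elsewhere in the thread): groups are ordered by their best document under the main sort, but the displayed head of each group is chosen by group.sort, so the visible prices can look unsorted even though the group ordering is correct.

```python
# Simulation of grouped results where group.sort differs from sort.
# Documents are made up to mirror the sample response in this thread.
docs = [
    {"group": "10001", "price": 729.97, "preferred_store": 0},
    {"group": "10002", "price": 279.99, "preferred_store": 1},  # store 212
    {"group": "10002", "price": 499.99, "preferred_store": 0},
    {"group": "10003", "price": 479.99, "preferred_store": 0},
]

groups = {}
for d in docs:
    groups.setdefault(d["group"], []).append(d)

# sort: groups are ordered by their best document under price desc.
order = sorted(groups.values(), key=lambda ds: -max(d["price"] for d in ds))

# group.sort: the displayed head prefers the document from store 212
# (like the max(if(exists(query(...)))) function), group.limit=1.
page = [max(ds, key=lambda d: d["preferred_store"]) for ds in order]

print([(d["group"], d["price"]) for d in page])
# -> [('10001', 729.97), ('10002', 279.99), ('10003', 479.99)]
```

Group 10002 ranks second because its best document (499.99) sits between 479.99 and 729.97, exactly as Jim predicts, while the head shown for it is the 279.99 document selected by group.sort.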
Re: Filter cache pollution during sharded edismax queries
I think you should test with facet.shard.limit=-1; this disables the limit
for the facets on the shards and removes the need for facet refinements. I
bet that returning every facet value with a count greater than 0 on
internal queries is cheaper than using the filter cache to handle a lot of
refinements.
Jim

2014-10-01 10:24 GMT+02:00 Charlie Hull <char...@flax.co.uk>:

On 30/09/2014 22:25, Erick Erickson wrote:

Just from a 20,000 ft. view, using the filterCache this way seems... odd.
+1 for using a different cache, but that's me being quite unfamiliar with
the code.

Here's a quick update:
1. LFUCache performs worse, so we returned to LRUCache.
2. Making the cache smaller than the default 512 reduced performance.
3. Raising the cache size to 2048 didn't seem to have a significant effect
on performance but did reduce CPU load significantly. This may help our
client as they can reduce their system spec considerably.
We're continuing to test with our client, but the upshot is that even if
you think you don't need the filter cache, if you're doing distributed
faceting you probably do, and you should size it based on experimentation.
In our case there is a single filter, but the cache needs to be
considerably larger than that!
Cheers
Charlie

On Tue, Sep 30, 2014 at 1:53 PM, Alan Woodward <a...@flax.co.uk> wrote:

Once all the facets have been gathered, the co-ordinating node then asks
the subnodes for an exact count for the final top-N facets.

What's the point of refining these counts? I've thought that it makes
sense only for facet.limit-ed requests. Is that a correct statement? Can
those who suffer from the low performance just unlimit facet.limit to
avoid that distributed hop?

Presumably yes, but if you've got a sufficiently high cardinality field
then any gains made by missing out the hop will probably be offset by
having to stream all the return values back again.
Alan

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com

--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk
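The trade-off Jim describes can be sketched with a toy merge (made-up shard counts, not Solr's refinement code): with a per-shard facet limit, a term that is popular overall can be cut off on one shard, so the coordinator must issue a refinement round-trip to get an exact count; with no per-shard limit, one pass is exact.

```python
# Toy two-shard facet merge. With a per-shard limit, a globally popular
# term can be cut off on one shard, forcing a refinement round-trip;
# with no shard limit (facet.shard.limit=-1 in the mail), the merged
# counts are exact in one pass.
shard1 = {"red": 50, "blue": 40, "green": 39}
shard2 = {"green": 45, "red": 5, "blue": 4}

def topk(counts, k):
    return dict(sorted(counts.items(), key=lambda kv: -kv[1])[:k])

def merge(shards, shard_limit=None):
    merged = {}
    for counts in shards:
        if shard_limit is not None:
            counts = topk(counts, shard_limit)
        for term, c in counts.items():
            merged[term] = merged.get(term, 0) + c
    return merged

print(merge([shard1, shard2], shard_limit=2))  # 'green' is undercounted: 45
print(merge([shard1, shard2]))                 # exact: green = 84
```

The cost of the unlimited variant is, as Alan notes, streaming every value with a nonzero count back from each shard, which is why it only wins on fields whose cardinality is not too high.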
Re: FAST-like document vector data structures in Solr?
Hi,
Something like this?
https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component

And just to show off some impressive search functionality of the wiki ;):
https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=document+vectors

Cheers,
Jim

2014-09-05 9:44 GMT+02:00 Jürgen Wagner (DVT) <juergen.wag...@devoteam.com>:

Hello all,
as the migration from FAST to Solr is a relevant topic for several of our
customers, there is one issue that does not seem to be addressed by
Lucene/Solr: document vectors FAST-style. These document vectors are used
to form metrics of similarity, i.e., they may be used as a semantic
fingerprint of documents to define similarity relations. I can think of
several ways of approximating a mapping of this mechanism to Solr, but
there are always drawbacks - mostly performance-wise. Has anybody else
encountered and possibly approached this challenge so far? Is there
anything in the roadmap of Solr that has not revealed itself to me,
addressing this issue? Your input is greatly appreciated!
Cheers,
--Jürgen
Re: Incorrect group.ngroups value
Hi Bryan,
This is a known limitation of grouping:
https://wiki.apache.org/solr/FieldCollapsing#RequestParameters

group.ngroups: *WARNING: If this parameter is set to true in a sharded
environment, all the documents that belong to the same group have to be
located in the same shard, otherwise the counts will be incorrect. If you
are using SolrCloud (https://wiki.apache.org/solr/SolrCloud), consider
using custom hashing.*

Cheers,
Jim

2014-08-21 21:44 GMT+02:00 Bryan Bende <bbe...@gmail.com>:

Is there any known issue with using group.ngroups in a distributed Solr
using version 4.8.1? I recently upgraded a cluster from 4.6.1 to 4.8.1,
and I'm noticing several queries where ngroups will be more than the
actual number of groups returned in the response. For example, ngroups
will say 5, but then there will be 3 groups in the response. It is not
happening on all queries, only some.
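The "custom hashing" fix works because it co-locates each group's documents on one shard: if the shard is derived from the group key (SolrCloud's compositeId router does this when documents are indexed with a "groupKey!docId" id), per-shard group counts never overlap and can safely be summed. A toy sketch of the idea (the modulo CRC hash below is for illustration only, not SolrCloud's actual hash):

```python
# Toy document routing: deriving the shard from the grouping key keeps
# all documents of a group on one shard, so per-shard ngroups can be
# summed without double counting. (SolrCloud's compositeId router hashes
# a "groupKey!docId" prefix; this modulo hash is just an illustration.)
import zlib

NUM_SHARDS = 4

def shard_for(group_key: str) -> int:
    return zlib.crc32(group_key.encode()) % NUM_SHARDS

docs = [("userA", 1), ("userA", 2), ("userB", 3), ("userC", 4), ("userB", 5)]

shards = {i: set() for i in range(NUM_SHARDS)}  # groups present on each shard
for group, _doc_id in docs:
    shards[shard_for(group)].add(group)

# Each group lives on exactly one shard, so the summed per-shard group
# counts equal the true number of distinct groups.
total = sum(len(g) for g in shards.values())
print(total)  # 3
```

With default (uniform) routing, a group split across two shards would be counted once per shard, which is exactly the inflated ngroups Bryan observes.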
Re: Bloom filter
Hi Per,
First of all, the BloomFilter implementation in Lucene is not exactly a
bloom filter. It uses only one hash function, and you cannot set the
false positive ratio beforehand. Elasticsearch has its own bloom filter
implementation (a Guava-like BloomFilter); you should take a look at their
implementation if you really need this feature. What is your use-case? If
your index fits in RAM the bloom filter won't help (and it may have a
negative impact if you have a lot of segments). In fact, the only use
case where the bloom filter can help is when your term dictionary does
not fit in RAM, which is rarely the case.
Regards,
Jim

2014-07-28 16:13 GMT+02:00 Per Steffensen <st...@designware.dk>:

Yes, I found that one, along with SOLR-3950. Well, at least it seems like
the support is there in Lucene. I will figure out myself how to make it
work via Solr, the way I need it to work. My use-case is not as specified
in SOLR-1375, but the solution might be the same. Any input is of course
still very much appreciated.
Regards, Per Steffensen

On 28/07/14 15:42, Lukas Drbal wrote:

Hi Per,
link to jira - https://issues.apache.org/jira/browse/SOLR-1375
Unresolved ;-)
L.

On Mon, Jul 28, 2014 at 1:17 PM, Per Steffensen <st...@designware.dk> wrote:

Hi
Where can I find documentation on how to use Bloom filters in Solr (4.4)?
http://wiki.apache.org/solr/BloomIndexComponent seems to be outdated -
there is no BloomIndexComponent included in the 4.4 code.
Regards, Per Steffensen
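For reference, the classic bloom filter that Jim contrasts with Lucene's single-hash variant uses k independent hash functions over one bit array. A minimal sketch (illustration only; this is not the Lucene or Guava implementation discussed above):

```python
# Minimal classic bloom filter with k hash functions (illustration only;
# not the Lucene or Elasticsearch/Guava implementation discussed above).
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.m, self.k = num_bits, num_hashes
        self.bits = 0  # bit array packed into one big int

    def _positions(self, item: str):
        # Derive k positions by salting one strong hash k ways.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True may be a false positive.
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter()
for term in ["network", "search", "solr"]:
    bf.add(term)
print(bf.might_contain("solr"))   # True
print(bf.might_contain("nosuchterm"))
```

More hash functions and more bits lower the false-positive rate at the cost of memory and hashing work, which is the tuning knob Jim notes is missing from Lucene's single-hash variant.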
Re: Compression vs FieldCache for doc ids retrieval
@William: Firstly, because I was sure that the ticket (or an equivalent)
was already open but I just could not find it. Thanks @Manuel. Secondly,
because I wanted to start a discussion: I have the feeling that the
compression of the documents, activated by default, can be a killer for
some applications (if the number of shards is big, or if you have a lot
of deep-paging queries), and I wanted to check if someone had noticed the
problem in a benchmark. Let's say that you have 10 shards and you want to
return 10 documents per request: in the first stage of the search each
shard would need to decompress 10 blocks of 16k each, whereas the second
stage would need to decompress only 10 blocks total. This makes me
believe that this patch should be the default behaviour for any
distributed search in Solr (I mean more than 1 shard). Maybe it's better
to continue the discussion on the ticket created by Manuel, but still, I
think that it could speed up every query (not only deep-paging queries
like in the patch proposed in Manuel's ticket).
Jim

2014-06-01 14:06 GMT+09:00 William Bell <billnb...@gmail.com>:

Why not just submit a JIRA issue and add your patch so that we can all
benefit?

On Fri, May 30, 2014 at 5:34 AM, Manuel Le Normand
<manuel.lenorm...@gmail.com> wrote:

Is the issue SOLR-5478 what you were looking for?

--
Bill Bell
billnb...@gmail.com
cell 720-256-8076
Compression vs FieldCache for doc ids retrieval
Dear Solr users,
we migrated our solution from Solr 4.0 to Solr 4.3 and we noticed a
degradation of the search performance. We compared the two versions and
found out that most of the time is spent in the decompression of the
retrievable fields in Solr 4.3. The block compression of the documents is
a great feature for us because it reduces the size of our index, but we
don't have enough resources (I mean CPUs) to safely migrate to the new
version. In order to reduce the cost of the decompression we tried a
simple patch in the BinaryResponseWriter: during the first phase of the
distributed search, the response writer gets the documents from the index
reader only to extract the doc ids of the top N results. Our patch uses
the field cache to get the doc ids during the first phase and thus
replaces a full decompression of a 16k block (for a single document) by a
simple get in an array (the field cache or the doc values). Thanks to
this patch we are now able to handle the same number of QPS as before
(with Solr 4.0). Of course the document cache could help as well, but not
as much as one would have thought (mainly because we have a lot of
deep-paging queries). I am sure that the idea we implemented is not new,
but I haven't seen any Jira about it. Should we create one? (I mean, does
it have a chance to be included in a future release of Solr, or is
anybody already working on this?)
Cheers,
Jim
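The block-count arithmetic from the two messages above can be made concrete. A sketch using the numbers from the mail (10 shards, 10 rows, one 16 KB compressed block decompressed per stored-document fetch versus one array read per field-cache lookup); this is a cost model, not Solr's code:

```python
# Cost-model sketch of a two-phase distributed search (numbers from the mail).
# Phase 1: every shard fetches its top `rows` docs just to extract ids.
# Phase 2: the coordinator fetches only the final `rows` full documents.
shards = 10
rows = 10

# Without the patch: ids come from stored fields, so phase 1 decompresses
# one 16KB block per document on every shard.
blocks_phase1_stored = shards * rows   # 100 block decompressions
blocks_phase2 = rows                   # 10 block decompressions

# With the patch: phase 1 reads ids from the field cache / doc values
# (plain array lookups), so only phase 2 touches compressed blocks.
blocks_phase1_fieldcache = 0

print(blocks_phase1_stored + blocks_phase2)      # 110 without the patch
print(blocks_phase1_fieldcache + blocks_phase2)  # 10 with the patch
```

The 11x reduction in decompression work grows with the shard count, which is why the mail argues the patch should be the default for any multi-shard search.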
Re: Memory + WeakIdentityMap
Hi,
If you are not on Windows, you can try to disable the tracking of clones
in the MMapDirectory by setting unmap to false in your solrconfig.xml:

<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory">
  <bool name="unmap">false</bool>
</directoryFactory>

The MMapDirectory keeps track of all clones in a weak map and forces the
unmapping of the buffers on close. This was added because on Windows
mmapped files cannot be modified or deleted. If unmap is false, the weak
map is not created and the weak references you see in your heap should
disappear as well. You can find more information here:
https://issues.apache.org/jira/browse/LUCENE-4740
Thanks,
Jim

2014-03-21 6:56 GMT+01:00 Shawn Heisey <s...@elyograg.org>:

On 3/20/2014 6:54 PM, Harish Agarwal wrote:

I'm transitioning my index from a 3.x version to 4.6. I'm running a large
heap (20G), primarily to accommodate a large facet cache (~5G), but have
been able to run it on 3.x stably. On 4.6.0, after stress testing, I'm
finding that all of my shards are spending all of their time in GC. After
taking a heap dump and analyzing it, it appears that
org.apache.lucene.util.WeakIdentityMap is using many GBs of memory. Does
anyone have any insight into which Solr component(s) use this and whether
this kind of memory consumption is to be expected?

I can't really say what WeakIdentityMap is doing. I can trace the only
usage in Lucene to MMapDirectory, but it doesn't make a lot of sense for
this to use a lot of memory, unless this is the source of the memory
misreporting that Java 7 seems to do with MMap. See this message in a
recent thread on this mailing list:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201403.mbox/%3c53285ca1.9000...@elyograg.org%3E

If you have a lot of facets, one approach for performance is to use
facet.method=enum so that your Java heap does not need to be super large.
This does not actually reduce the overall system memory requirements.
It just shifts the responsibility for caching to the operating system instead of Solr, and requires that you have enough memory to put a majority of the index into the OS disk cache. Ideally, there would be enough RAM for the entire index to fit. http://wiki.apache.org/solr/SolrPerformanceProblems Another option for facet memory optimization is docValues. One caveat: It is my understanding that the docValues content is the same as a stored field. Depending on your schema definition, this may be different than the indexed values that facets normally use. The docValues feature also helps with sorting. Thanks, Shawn
Re: run filter queries after post filter
Hi Rohit,
The main problem is that if the query inside the filter does not have a
PostFilter implementation, then your post filter is silently transformed
into a simple filter. The query field:value is based on the inverted
lists and does not have PostFilter support. If your field is a numeric
field, take a look at the frange query parser, which has PostFilter
support. To filter out documents with a field value less than 5:

fq={!frange l=5 cache=false cost=200}field(myField)

Cheers,
Jim

2013/10/9 Rohit Harchandani <rhar...@gmail.com>

yes, i get that. actually i should have explained in more detail:
- i have a query which gets certain documents.
- the post filter gets these matched documents, does some processing on
  them, and filters the results.
- but after this is done i need to apply another filter, which is why i
  gave it a higher cost.
the reason i need to do this is because the processing done by the post
filter depends on the documents matching the query till that point. since
the normal fq clause is also getting executed before the post filter
(despite the cost), the final results are not accurate.
thanks,
Rohit

On Wed, Oct 9, 2013 at 4:14 PM, Erick Erickson <erickerick...@gmail.com> wrote:

Ah, I think you're misunderstanding the nature of post-filters. Or I'm
confused, which happens a lot! The whole point of post filters is that
they're assumed to be expensive (think ACL calculation), so you want them
to run on the fewest documents possible. Only docs that make it through
the primary query _and_ all lower-cost filters will get to this
post-filter. This means they can't be cached, for instance, because they
don't see (hopefully) very many docs. This is radically different from
normal fq clauses, which are calculated on the entire corpus and can thus
be cached.
Best,
Erick

On Wed, Oct 9, 2013 at 11:59 AM, Rohit Harchandani <rhar...@gmail.com> wrote:

Hey, so the post filter logs the number of ids that it receives.
With the above filter having cost=200, the post filter should have
received the same number of ids as before (when the filter was not
present). But that does not seem to be the case: with the filter query on
the index, the number of ids that the post filter receives is reduced.
Thanks,
Rohit

On Tue, Oct 8, 2013 at 8:29 PM, Erick Erickson <erickerick...@gmail.com> wrote:

Hmmm, seems like it should. What's our evidence that it isn't working?
Best,
Erick

On Tue, Oct 8, 2013 at 4:10 PM, Rohit Harchandani <rhar...@gmail.com> wrote:

Hey,
I am using Solr 4.0 with my own PostFilter implementation, which is
executed after the normal Solr query is done. This filter has a cost of
100. Is it possible to run filter queries on the index after the
execution of the post filter? I tried adding the below line to the url,
but it did not seem to work:

fq={!cache=false cost=200}field:value

Thanks,
Rohit
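The cost/cache semantics Jim and Erick describe can be sketched as a small simulation (a toy model, not Solr's code): a filter runs as a post filter only when cache=false, cost >= 100 AND the underlying query implements PostFilter; otherwise it silently runs with the ordinary filters, which is exactly why Rohit's fq executed before his post filter despite its higher cost.

```python
# Toy model of Solr filter ordering (illustration, not Solr's code).
# A filter runs as a post filter only if cache=false, cost >= 100 AND
# the underlying query supports the PostFilter interface (e.g. {!frange}).
def execution_order(filters):
    normal, post = [], []
    for f in filters:
        if not f["cache"] and f["cost"] >= 100 and f["supports_postfilter"]:
            post.append(f)
        else:
            normal.append(f)  # silently demoted to a normal filter
    normal.sort(key=lambda f: f["cost"])
    post.sort(key=lambda f: f["cost"])
    return [f["name"] for f in normal + post]

filters = [
    {"name": "myPostFilter", "cache": False, "cost": 100, "supports_postfilter": True},
    # fq={!cache=false cost=200}field:value -- a term query, no PostFilter impl:
    {"name": "field:value", "cache": False, "cost": 200, "supports_postfilter": False},
    # fq={!frange l=5 cache=false cost=200}field(myField) -- has PostFilter impl:
    {"name": "frange", "cache": False, "cost": 200, "supports_postfilter": True},
]
print(execution_order(filters))
# field:value runs before myPostFilter despite its higher cost;
# frange, which really implements PostFilter, runs after it.
```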