[ANNOUNCE] Apache Solr 8.0.0 released

2019-03-14 Thread jim ferenczi
14 March 2019, Apache Solr™ 8.0.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 8.0.0

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search and analytics, rich document
parsing, geospatial search, extensive REST APIs as well as parallel SQL.
Solr is enterprise grade, secure and highly scalable, providing fault
tolerant distributed search and indexing, and powers the search and
navigation features of many of the world's largest internet sites.

The release is available for immediate download at:

http://www.apache.org/dyn/closer.lua/lucene/solr/8.0.0

Please read CHANGES.txt for a detailed list of changes:

https://lucene.apache.org/solr/8_0_0/changes/Changes.html

Solr 8.0.0 Release Highlights
* Solr now uses HTTP/2 for inter-node communication

Being a major release, Solr 8 removes many deprecated APIs and changes
various parameter defaults and behaviors. Some changes may require a re-index of
your content. You are thus encouraged to thoroughly read the "Upgrade
Notes" at http://lucene.apache.org/solr/8_0_0/changes/Changes.html or in
the CHANGES.txt file accompanying the release.

Solr 8.0 also includes many other new features as well as numerous
optimizations and bugfixes of the corresponding Apache Lucene release.

Please report any feedback to the mailing lists (
http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using may
not have replicated the release yet. If that is the case, please try
another mirror. This also goes for Maven access.


[ANNOUNCE] Apache Solr 7.7.0 released

2019-02-11 Thread jim ferenczi
11 February 2019, Apache Solr™ 7.7.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 7.7.0

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search, dynamic clustering, database
integration, rich document (e.g., Word, PDF) handling, and geospatial
search. Solr is highly scalable, providing fault tolerant distributed
search and indexing, and powers the search and navigation features of many
of the world's largest internet sites.

Solr 7.7.0 is available for immediate download at:
http://lucene.apache.org/solr/downloads.html

See http://lucene.apache.org/solr/7_7_0/changes/Changes.html for a full
list of details.

Solr 7.7.0 Release Highlights:

Bug Fixes:
  * URI Too Long with large streaming expressions in SolrJ.
  * A failure while reloading a SolrCore can result in the SolrCore not
being closed.
  * Spellcheck parameters not working in new UI.
  * New Admin UI Query does not URL-encode the query produced in the URL
box.
  * Rule-based Authorization plugin skips authorization if the querying node
does not have a collection replica.
  * Solr installer fails on SuSE Linux.
  * Fix incorrect SOLR_SSL_KEYSTORE_TYPE variable in the solr start script.

Improvements:
  * JSON 'terms' Faceting now supports a 'prelim_sort' option to use when
initially selecting the top ranking buckets, prior to the final 'sort'
option used after refinement.
  * Add a login page to Admin UI, with initial support for Basic Auth and
Kerberos.
  * New node-level health check handler at the /admin/info/healthcheck and
/node/health paths that checks whether the node is live, connected to
ZooKeeper, and not shut down.
  * It is now possible to configure a host whitelist for distributed search.
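To make the 'prelim_sort' improvement above concrete, here is a hedged sketch of a JSON Facet request body that uses it; the collection layout and the field names ("category", "price") are invented for illustration and are not part of the release notes:

```python
import json

# Hypothetical JSON Facet request: 'prelim_sort' cheaply selects the top
# buckets by count first, and the more expensive 'sort' (on an aggregated
# stat) is applied only after refinement. Field names are assumptions.
facet_request = {
    "query": "*:*",
    "facet": {
        "categories": {
            "type": "terms",
            "field": "category",          # assumed string field
            "limit": 10,
            "prelim_sort": "count desc",  # cheap initial bucket selection
            "sort": "avg_price desc",     # final sort after refinement
            "facet": {
                "avg_price": "avg(price)"  # assumed numeric field
            },
        }
    },
}

# This JSON string would be POSTed to a collection's search handler.
body = json.dumps(facet_request)
print(body)
```

This only builds the request body; sending it to a live Solr node is left out so the sketch stays self-contained.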

You are encouraged to thoroughly read the "Upgrade Notes" at
http://lucene.apache.org/solr/7_7_0/changes/Changes.html or in the
CHANGES.txt file accompanying the release.

Solr 7.7 also includes many other new features as well as numerous
optimizations and bugfixes of the corresponding Apache Lucene release.

Please report any feedback to the mailing lists (
http://lucene.apache.org/solr/community.html#mailing-lists-irc)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using may
not have replicated the release yet. If that is the case, please try
another mirror. This also goes for Maven access.


[ANNOUNCE] Apache Solr 7.5.0 released

2018-09-24 Thread jim ferenczi
24 September 2018, Apache Solr™ 7.5.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 7.5.0

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search, dynamic clustering, database
integration, rich document (e.g., Word, PDF) handling, and geospatial
search. Solr is highly scalable, providing fault tolerant distributed
search and indexing, and powers the search and navigation features of many
of the world's largest internet sites.

Solr 7.5.0 is available for immediate download at:
http://lucene.apache.org/solr/downloads.html

See http://lucene.apache.org/solr/7_5_0/changes/Changes.html for a full
list of details.

Solr 7.5.0 Release Highlights:

  Nested/child documents may now be supplied as a field value instead of
stand-off. Future releases will leverage this semantic information.
  Enhance Autoscaling policy support to equally distribute replicas on the
basis of arbitrary properties.
  Nodes are now visible inside a view of the Admin UI "Cloud" tab, listing
nodes and key metrics.
  The status of the ZooKeeper ensemble is now accessible under the Admin UI
Cloud tab.
  The new Korean morphological analyzer ("nori") has been added to the
default distribution.

You are encouraged to thoroughly read the "Upgrade Notes" at
http://lucene.apache.org/solr/7_5_0/changes/Changes.html or in the
CHANGES.txt file accompanying the release.

Solr 7.5 also includes many other new features as well as numerous
optimizations and bugfixes of the corresponding Apache Lucene release.

Please report any feedback to the mailing lists (
http://lucene.apache.org/solr/community.html#mailing-lists-irc)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using may
not have replicated the release yet. If that is the case, please try
another mirror. This also goes for Maven access.


[ANNOUNCE] Apache Solr 7.2.1 released

2018-01-15 Thread jim ferenczi
15 January 2018, Apache Solr™ 7.2.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 7.2.1

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search and analytics, rich document
parsing, geospatial search, extensive REST APIs as well as parallel SQL.
Solr is enterprise grade, secure and highly scalable, providing fault
tolerant distributed search and indexing, and powers the search and
navigation features of many of the world's largest internet sites.

This release includes 4 bug fixes since the 7.2.0 release:

* Overseer can never process some last messages.

* Rename core in solr standalone mode is not persisted.

* QueryComponent's rq parameter parsing no longer considers the defType
parameter.

* Fix NPE in SolrQueryParser when the query terms inside a filter clause
reduce to nothing.

Furthermore, this release includes Apache Lucene 7.2.1 which includes 1 bug
fix since the 7.2.0 release.

The release is available for immediate download at:

http://www.apache.org/dyn/closer.lua/lucene/solr/7.2.1

Please read CHANGES.txt for a detailed list of changes:

https://lucene.apache.org/solr/7_2_1/changes/Changes.html

Please report any feedback to the mailing lists (
http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using may
not have replicated the release yet. If that is the case, please try
another mirror. This also goes for Maven access.


[ANNOUNCE] Apache Solr 6.5.1 released

2017-04-27 Thread jim ferenczi
27 April 2017, Apache Solr™ 6.5.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 6.5.1

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search and analytics, rich document
parsing, geospatial search, extensive REST APIs as well as parallel SQL.
Solr is enterprise grade, secure and highly scalable, providing fault
tolerant distributed search and indexing, and powers the search and
navigation features of many of the world's largest internet sites.

This release includes 11 bug fixes since the 6.5.0 release. Some of the
major fixes are:

* bin\solr.cmd delete and healthcheck now work again; fixed continuation
chars ^

* Fix debug-related NullPointerException in the solr/contrib/ltr
OriginalScoreFeature class.

* The JSON output of /admin/metrics is fixed to write the container as a
map (SimpleOrderedMap) instead of an array (NamedList).

* On 'downnode', lots of wasteful mutations are done to ZK.

* Fix params persistence for the solr/contrib/ltr (MinMax|Standard)Normalizer
classes.

* The fetch() streaming expression wouldn't work if a value included query
syntax chars (like :+-). Fixed, and enhanced the generated query to not
pollute the queryCache.

* Disable graph query production via schema configuration. This fixes
broken queries for ShingleFilter-containing query-time analyzers when the
request param sow=false.

* Fix indexed="false" on numeric PointFields.

* SQL AVG function mis-interprets field type.

* SQL interface does not use client cache.

* edismax with sow=false fails to create dismax-per-term queries when any
field is boosted.

Furthermore, this release includes Apache Lucene 6.5.1, which includes 3 bug
fixes since the 6.5.0 release.

The release is available for immediate download at:

http://www.apache.org/dyn/closer.lua/lucene/solr/6.5.1

Please read CHANGES.txt for a detailed list of changes:

https://lucene.apache.org/solr/6_5_1/changes/Changes.html

Please report any feedback to the mailing lists (
http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using may
not have replicated the release yet. If that is the case, please try
another mirror. This also goes for Maven access.


[ANNOUNCE] Apache Solr 6.5.0 released

2017-03-27 Thread jim ferenczi
27 March 2017, Apache Solr 6.5.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 6.5.0.

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search and analytics, rich document
parsing, geospatial search, extensive REST APIs as well as parallel SQL.
Solr is enterprise grade, secure and highly scalable, providing fault
tolerant distributed search and indexing, and powers the search and
navigation features of many of the world's largest internet sites.

Solr 6.5.0 is available for immediate download at:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Please read CHANGES.txt for a full list of new features and changes:

https://lucene.apache.org/solr/6_5_0/changes/Changes.html

Highlights of this Solr release include:

   - PointFields (fixed-width multi-dimensional numeric & binary types
   enabling fast range search) are now supported
   - In-place updates to numeric docValues fields (single valued,
   non-stored, non-indexed) supported using atomic update syntax
   - A new LatLonPointSpatialField that uses points or doc values for query
   - It is now possible to declare a field as "large" in order to bypass
   the document cache
   - New sow=false request param (split-on-whitespace) for edismax &
   standard query parsers enables query-time multi-term synonyms
   - XML QueryParser (defType=xmlparser) now supports span queries
   - hl.maxAnalyzedChars now has a consistent default across highlighters
   - UnifiedSolrHighlighter and PostingsSolrHighlighter now support
   CustomSeparatorBreakIterator
   - Scoring formula is adjusted for the scoreNodes function
   - Calcite Planner now applies constant Reduction Rules to optimize plans
   - A new significantTerms Streaming Expression that is able to extract
   the significant terms in an index
   - StreamHandler is now able to use runtimeLib jars
   - Arithmetic operations are added to the SelectStream
   - Added modernized self-documenting /v2 API
   - The .system collection is now created on first request if it does not
   exist
   - Admin UI: Added shard deletion button
   - Metrics API now supports non-numeric metrics (version, disk type,
   component state, system properties...)
   - The disk free and aggregated disk free metrics are now reported
   - The DirectUpdateHandler2 now implements MetricsProducer and exposes
   stats via the metrics api and configured reporters.
   - BlockCache is faster due to fewer failures when caching a new block
   - MMapDirectoryFactory now supports "preload" option to ask mapped pages
   to be loaded into physical memory on init
   - Security: BasicAuthPlugin now supports standalone mode
   - Arbitrary java system properties can be passed to zkcli
   - SolrHttpClientBuilder can be configured via java system property
   - Javadocs and Changes.html are no longer included in the binary
   distribution, but are hosted online
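Among the highlights above, the in-place updates to numeric docValues fields reuse the existing atomic update syntax. A hedged sketch of such an update payload follows; the document id and the field name ("popularity_d") are assumptions for illustration, and the field would need to be single valued, non-stored, and non-indexed with docValues enabled:

```python
import json

# Sketch, not an authoritative example: an atomic update that sets a
# numeric docValues field. With Solr 6.5+, an update shaped like this can
# be applied in place, without re-indexing the whole document, provided
# the field meets the single-valued/non-stored/non-indexed constraints.
update = [
    {
        "id": "doc-42",                  # assumed document id
        "popularity_d": {"set": 3.5},    # atomic-update "set" operation
    }
]

# This JSON array would be the body of an update request.
body = json.dumps(update)
print(body)
```

Only the payload is constructed here; the HTTP call to an update handler is omitted so the sketch stays self-contained.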

Further details of changes are available in the change log available at:
http://lucene.apache.org/solr/6_5_0/changes/Changes.html

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using may
not have replicated the release yet. If that is the case, please try
another mirror. This also applies to Maven access.



[ANNOUNCE] Apache Solr 6.4.0 released

2017-01-23 Thread jim ferenczi
23 January 2017 - Apache Solr™ 6.4.0 Available
The Lucene PMC is pleased to announce the release of Apache Solr 6.4.0.

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search and analytics, rich document
parsing, geospatial search, extensive REST APIs as well as parallel SQL.
Solr is enterprise grade, secure and highly scalable, providing fault
tolerant distributed search and indexing, and powers the search and
navigation features of many of the world's largest internet sites.

Solr 6.4.0 is available for immediate download at:
http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Highlights of this Solr release include:

Streaming:
  * Addition of a HavingStream to Streaming API and Streaming Expressions
  * Addition of a priority Streaming Expression
  * Streaming expressions now support collection aliases

Machine Learning:
  * Configurable Learning-To-Rank (LTR) support: upload feature
definitions, extract feature values, upload your own machine learnt models
and use them to rerank search results.

Faceting:
  * Added "param" query type to facet domain filter specification to obtain
filters via query parameters
  * Any facet command can be filtered using a new parameter filter.
Example: { type:terms, field:category, filter:"user:yonik" }
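The inline facet example above can be wrapped in a complete JSON Facet request body. This is a hedged sketch using the same field names as the announcement's example; the facet label "categories" is an invented name:

```python
import json

# Sketch of a full JSON Facet request around the announcement's example:
# a terms facet over "category" whose domain is restricted, via the new
# "filter" parameter, to documents matching user:yonik before bucketing.
request_body = {
    "query": "*:*",
    "facet": {
        "categories": {                # invented facet label
            "type": "terms",
            "field": "category",
            "filter": "user:yonik",    # new in 6.4: per-facet filter
        }
    },
}

print(json.dumps(request_body))
```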

Scripts / Command line:
  * A new command-line tool to manage the snapshots functionality
  * bin/solr and bin/solr.cmd now use mkroot command

SolrCloud / SolrJ
  * LukeResponse now supports dynamic fields
  * The SolrJ client now supports hierarchical clusters and the "other
topics" marker
  * Collection backup/restore are extensible.

Security:
  * Support Secure Impersonation / Proxy User for Solr authentication
  * Key Store type can be specified in solr.in.sh file for SSL
  * New generic authentication plugins: 'HadoopAuthPlugin' and
'ConfigurableInternodeAuthHadoopPlugin' that delegate all functionality to
Hadoop authentication framework

Query / QueryParser / Highlighting:
  * A new highlighter: The Unified Highlighter. Try it via
hl.method=unified; many popular highlighting parameters / features are
supported. It's the highest performing highlighter, especially for large
documents. Highlighting phrase queries and exotic queries are supported
equally as well as the Original Highlighter (aka the default/standard one).
Please use this new highlighter and report issues since it will likely
become the default one day.
  * Leading wildcards in the complexphrase query parser are now accepted
and optimized with the ReversedWildcardFilterFactory when it is provided

Metrics:
  * Use metrics-jvm library to instrument jvm internals such as GC, memory
usage and others.
  * A lot of metrics have been added to the collection: index merges, index
store I/Os, query, update, core admin, core load thread pools, shard
replication, tlog replay and replicas
  * A new /admin/metrics API to return all metrics collected by Solr via
API.

Misc changes:
  * The new config parameter 'maxRamMB' can now limit the memory consumed by
the FastLRUCache
  * A new document processor 'SkipExistingDocumentsProcessor' that skips
duplicate inserts and ignores updates to missing docs
  * FieldCache information fetched via the mbeans handler or seen via the
UI now displays the total size used.
  * A new config flag 'enable' allows enabling or disabling any cache

Please note that this release cannot be built from source with Java 8 Update
121; use an earlier version instead. This is caused by a bug introduced
into the Javadocs tool shipped with that update. The workaround came too
late for this Lucene release. Of course, you can use the binary artifacts.

See the Solr CHANGES.txt files included with the release for a full list of
details.

Thanks,
Jim Ferenczi


Re: Very high memory and CPU utilization.

2015-11-02 Thread jim ferenczi
Well, it seems that doing q="network se*" is working, but not in the way you
expect. Inside a phrase, the "*" would not trigger a prefix query; it is
treated like any other character. I suspect that your query is in fact
"network se" (assuming you're using a StandardTokenizer) and that the word
"se" is very popular in your documents. That would explain the slow response
time. The bottom line is that "network se*" will not trigger a prefix query
at all (I may be wrong but this is the expected behaviour for Solr up to
4.3).
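A toy illustration of the analysis behaviour described above. This is not Solr's actual analysis code, just a crude stand-in for a StandardTokenizer-like chain, to show that inside a phrase the "*" is not a wildcard operator and typically gets stripped, leaving the bare (and very common) term "se":

```python
import re

def standard_tokenize(text):
    # Crude stand-in for a StandardTokenizer-style analyzer: keep only
    # alphanumeric runs, lowercase them, and drop everything else,
    # including the '*' character.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

# Inside a phrase query the '*' is not treated as a wildcard, so the
# analyzed terms are just two ordinary terms -- a phrase of "network"
# followed by the very frequent "se", which explains the heavy matching.
print(standard_tokenize('network se*'))   # ['network', 'se']
```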

2015-11-02 13:47 GMT+01:00 Modassar Ather <modather1...@gmail.com>:

> The problem is with the same query as phrase. q="network se*".
>
> The last . is fullstops for the sentence and the query is q=field:"network
> se*"
>
> Best,
> Modassar
>
> On Mon, Nov 2, 2015 at 6:10 PM, jim ferenczi <jim.feren...@gmail.com>
> wrote:
>
> > Oops, I did not read the thread carefully.
> > *The problem is with the same query as phrase. q="network se*".*
> > I was not aware that you could do that with Solr ;). I would say this is
> > expected because in such case if the number of expansions for "se*" is
> big
> > then you would have to check the positions for a significant words. I
> don't
> > know if there is a limitation in the number of expansions for a prefix
> > query contained into a phrase query but I would look at this parameter
> > first (limit the number of expansion per prefix search, let's say the N
> > most significant words based on the frequency of the words for instance).
> >
> > 2015-11-02 13:36 GMT+01:00 jim ferenczi <jim.feren...@gmail.com>:
> >
> > >
> > >
> > >
> > > *I am not able to get  the above point. So when I start Solr with 28g
> > RAM,
> > > for all the activities related to Solr it should not go beyond 28g. And
> > the
> > > remaining heap will be used for activities other than Solr. Please help
> > me
> > > understand.*
> > >
> > > Well those 28GB of heap are the memory "reserved" for your Solr
> > > application, though some parts of the index (not to say all) are
> > retrieved
> > > via MMap (if you use the default MMapDirectory) which do not use the
> heap
> > > at all. This is a very important part of Lucene/Solr, the heap should
> be
> > > sized in a way that let a significant amount of RAM available for the
> > > index. If not then you rely on the speed of your disk, if you have SSDs
> > > it's better but reads are still significantly slower with SSDs than
> with
> > > direct RAM access. Another thing to keep in mind is that mmap will
> always
> > > tries to put things in RAM, this is why I suspect that you swap
> activity
> > is
> > > killing your performance.
> > >
> > > 2015-11-02 11:55 GMT+01:00 Modassar Ather <modather1...@gmail.com>:
> > >
> > >> Thanks Jim for your response.
> > >>
> > >> The remaining size after you removed the heap usage should be reserved
> > for
> > >> the index (not only the other system activities).
> > >> I am not able to get  the above point. So when I start Solr with 28g
> > RAM,
> > >> for all the activities related to Solr it should not go beyond 28g.
> And
> > >> the
> > >> remaining heap will be used for activities other than Solr. Please
> help
> > me
> > >> understand.
> > >>
> > >> *Also the CPU utilization goes upto 400% in few of the nodes:*
> > >> You said that only machine is used so I assumed that 400% cpu is for a
> > >> single process (one solr node), right ?
> > >> Yes you are right that 400% is for single process.
> > >> The disks are SSDs.
> > >>
> > >> Regards,
> > >> Modassar
> > >>
> > >> On Mon, Nov 2, 2015 at 4:09 PM, jim ferenczi <jim.feren...@gmail.com>
> > >> wrote:
> > >>
> > >> > *if it correlates with the bad performance you're seeing. One
> > important
> > >> > thing to notice is that a significant part of your index needs to be
> > in
> > >> RAM
> > >> > (especially if you're using SSDs) in order to achieve good
> > performance.*
> > >> >
> > >> > Especially if you're not using SSDs, sorry ;)
> > >> >
> > >> > 2015-11-02 11:38 GMT+01:00 jim ferenczi <jim.feren...@gmail.com>:
> > >>

Re: Very high memory and CPU utilization.

2015-11-02 Thread jim ferenczi
12 shards with 28GB for the heap and 90GB for each index means that you
need at least 336GB for the heap (assuming you're using all of it, which may
easily be the case considering the way the GC handles memory) and ~1TB
for the index. Let's say that you don't need your entire index in RAM;
the problem as I see it is that you don't have enough RAM for your index +
heap. Assuming your machine has 370GB of RAM, there are only 34GB left for
your index, and 1TB/34GB means that you can only have about 1/30 of your
entire index in RAM. I would advise you to check the swap activity on the
machine and
see if it correlates with the bad performance you're seeing. One important
thing to notice is that a significant part of your index needs to be in RAM
(especially if you're using SSDs) in order to achieve good performance:



*As mentioned above this is a big machine with 370+ gb of RAM and Solr (12
nodes total) is assigned 336 GB. The rest is still a good for other system
activities.*
The remaining size after you removed the heap usage should be reserved for
the index (not only the other system activities).
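The sizing argument in this message can be sanity-checked with a quick back-of-the-envelope calculation, using the exact figures quoted in the thread (the thread rounds the index size to ~1TB, which gives its 1/30 estimate; the exact numbers come out closer to 1/32):

```python
# Back-of-the-envelope check of the RAM-vs-index arithmetic above.
# Figures taken from the thread: 12 shards, 28GB heap and 90GB index
# per shard, on a single machine with 370GB of RAM.
shards = 12
heap_per_shard_gb = 28
index_per_shard_gb = 90
total_ram_gb = 370

total_heap_gb = shards * heap_per_shard_gb       # RAM reserved for heaps
total_index_gb = shards * index_per_shard_gb     # total on-disk index size
ram_left_for_index_gb = total_ram_gb - total_heap_gb  # RAM left to cache index

fraction_cached = ram_left_for_index_gb / total_index_gb
print(total_heap_gb, total_index_gb, ram_left_for_index_gb)
print(f"~1/{round(1 / fraction_cached)} of the index can sit in RAM")
```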


*Also the CPU utilization goes upto 400% in few of the nodes:*
You said that only one machine is used, so I assumed that 400% CPU is for a
single process (one Solr node), right?
This seems impossible if you are sure that only one query is played at a
time and no indexing is performed. The best thing to do is to dump stack
traces of the Solr nodes during the query and check what the threads are
doing.

Jim



2015-11-02 10:38 GMT+01:00 Modassar Ather :

> Just to add one more point that one external Zookeeper instance is also
> running on this particular machine.
>
> Regards,
> Modassar
>
> On Mon, Nov 2, 2015 at 2:34 PM, Modassar Ather 
> wrote:
>
> > Hi Toke,
> > Thanks for your response. My comments in-line.
> >
> > That is 12 machines, running a shard each?
> > No! This is a single big machine with 12 shards on it.
> >
> > What is the total amount of physical memory on each machine?
> > Around 370 gb on the single machine.
> >
> > Well, se* probably expands to a great deal of documents, but a huge bump
> > in memory utilization and 3 minutes+ sounds strange.
> >
> > - What are your normal query times?
> > Few simple queries are returned with in a couple of seconds. But the more
> > complex queries with proximity and wild cards have taken more than 3-4
> > minutes and some times some queries have timed out too where time out is
> > set to 5 minutes.
> > - How many hits do you get from 'network se*'?
> > More than a million records.
> > - How many results do you return (the rows-parameter)?
> > It is the default one 10. Grouping is enabled on a field.
> > - If you issue a query without wildcards, but with approximately the
> > same amount of hits as 'network se*', how long does it take?
> > A query resulting in around half a million record return within a couple
> > of seconds.
> >
> > That is strange, yes. Have you checked the logs to see if something
> > unexpected is going on while you test?
> > Have not seen anything particularly. Will try to check again.
> >
> > If you are using spinning drives and only have 32GB of RAM in total in
> > each machine, you are probably struggling just to keep things running.
> > As mentioned above this is a big machine with 370+ gb of RAM and Solr (12
> > nodes total) is assigned 336 GB. The rest is still good enough for other
> > system activities.
> >
> > Thanks,
> > Modassar
> >
> > On Mon, Nov 2, 2015 at 1:30 PM, Toke Eskildsen 
> > wrote:
> >
> >> On Mon, 2015-11-02 at 12:00 +0530, Modassar Ather wrote:
> >> > I have a setup of 12 shard cluster started with 28gb memory each on a
> >> > single server. There are no replica. The size of index is around 90gb
> on
> >> > each shard. The Solr version is 5.2.1.
> >>
> >> That is 12 machines, running a shard each?
> >>
> >> What is the total amount of physical memory on each machine?
> >>
> >> > When I query "network se*", the memory utilization goes upto 24-26 gb
> >> and
> >> > the query takes around 3+ minutes to execute. Also the CPU utilization
> >> goes
> >> > upto 400% in few of the nodes.
> >>
> >> Well, se* probably expands to a great deal of documents, but a huge bump
> >> in memory utilization and 3 minutes+ sounds strange.
> >>
> >> - What are your normal query times?
> >> - How many hits do you get from 'network se*'?
> >> - How many results do you return (the rows-parameter)?
> >> - If you issue a query without wildcards, but with approximately the
> >> same amount of hits as 'network se*', how long does it take?
> >>
> >> > Why the CPU utilization is so high and more than one core is used.
> >> > As far as I understand querying is single threaded.
> >>
> >> That is strange, yes. Have you checked the logs to see if something
> >> unexpected is going on while you test?
> >>
> >> > How can I disable replication(as it is implicitly enabled) permanently
> >> 

Re: Very high memory and CPU utilization.

2015-11-02 Thread jim ferenczi
*if it correlates with the bad performance you're seeing. One important
thing to notice is that a significant part of your index needs to be in RAM
(especially if you're using SSDs) in order to achieve good performance.*

Especially if you're not using SSDs, sorry ;)
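For reference, the sizing arithmetic in the quoted message below works out as follows (a quick Python sketch using only the numbers from this thread; whether you call the resulting fraction 1/30 or 1/32 is just rounding):

```python
# Rough sizing arithmetic for the setup discussed in this thread:
# 12 Solr nodes on one machine, 28GB heap each, ~90GB of index per shard.
total_ram_gb = 370
nodes = 12
heap_per_node_gb = 28
index_per_shard_gb = 90

total_heap_gb = nodes * heap_per_node_gb               # 336GB reserved for the JVMs
total_index_gb = nodes * index_per_shard_gb            # ~1080GB (~1TB) of index
left_for_page_cache_gb = total_ram_gb - total_heap_gb  # 34GB left for the OS to cache the index

fraction_cached = left_for_page_cache_gb / total_index_gb
print(total_heap_gb, total_index_gb, left_for_page_cache_gb)  # 336 1080 34
print(round(1 / fraction_cached))  # ~1/32 of the index can be resident in RAM
```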

2015-11-02 11:38 GMT+01:00 jim ferenczi <jim.feren...@gmail.com>:

> 12 shards with 28GB for the heap and 90GB for each index means that you
> need at least 336GB for the heap (assuming you're using all of it which may
> be easily the case considering the way the GC is handling memory) and ~=
> 1TB for the index. Let's say that you don't need your entire index in RAM,
> the problem as I see it is that you don't have enough RAM for your index +
> heap. Assuming your machine has 370GB of RAM there are only 34GB left for
> your index, 1TB/34GB means that you can only have 1/30 of your entire index
> in RAM. I would advise you to check the swap activity on the machine and
> see if it correlates with the bad performance you're seeing. One important
> thing to notice is that a significant part of your index needs to be in RAM
> (especially if you're using SSDs) in order to achieve good performance:
>
>
>
> *As mentioned above this is a big machine with 370+ gb of RAM and Solr (12
> nodes total) is assigned 336 GB. The rest is still a good for other system
> activities.*
> The remaining size after you removed the heap usage should be reserved for
> the index (not only the other system activities).
>
>
> *Also the CPU utilization goes upto 400% in few of the nodes:*
> You said that only one machine is used, so I assumed that 400% CPU is for a
> single process (one Solr node), right?
> This seems impossible if you are sure that only one query is played at a
> time and no indexing is performed. The best thing to do is to dump stack
> traces of the Solr nodes during the query and to check what the threads
> are doing.
>
> Jim
>
>
>
> 2015-11-02 10:38 GMT+01:00 Modassar Ather <modather1...@gmail.com>:
>
>> Just to add one more point that one external Zookeeper instance is also
>> running on this particular machine.
>>
>> Regards,
>> Modassar
>>
>> On Mon, Nov 2, 2015 at 2:34 PM, Modassar Ather <modather1...@gmail.com>
>> wrote:
>>
>> > Hi Toke,
>> > Thanks for your response. My comments in-line.
>> >
>> > That is 12 machines, running a shard each?
>> > No! This is a single big machine with 12 shards on it.
>> >
>> > What is the total amount of physical memory on each machine?
>> > Around 370 gb on the single machine.
>> >
>> > Well, se* probably expands to a great deal of documents, but a huge bump
>> > in memory utilization and 3 minutes+ sounds strange.
>> >
>> > - What are your normal query times?
>> > Few simple queries are returned with in a couple of seconds. But the
>> more
>> > complex queries with proximity and wild cards have taken more than 3-4
>> > minutes and some times some queries have timed out too where time out is
>> > set to 5 minutes.
>> > - How many hits do you get from 'network se*'?
>> > More than a million records.
>> > - How many results do you return (the rows-parameter)?
>> > It is the default one 10. Grouping is enabled on a field.
>> > - If you issue a query without wildcards, but with approximately the
>> > same amount of hits as 'network se*', how long does it take?
>> > A query resulting in around half a million record return within a couple
>> > of seconds.
>> >
>> > That is strange, yes. Have you checked the logs to see if something
>> > unexpected is going on while you test?
>> > Have not seen anything particularly. Will try to check again.
>> >
>> > If you are using spinning drives and only have 32GB of RAM in total in
>> > each machine, you are probably struggling just to keep things running.
>> > As mentioned above this is a big machine with 370+ gb of RAM and Solr
>> (12
>> > nodes total) is assigned 336 GB. The rest is still a good for other
>> system
>> > activities.
>> >
>> > Thanks,
>> > Modassar
>> >
>> > On Mon, Nov 2, 2015 at 1:30 PM, Toke Eskildsen <t...@statsbiblioteket.dk>
>> > wrote:
>> >
>> >> On Mon, 2015-11-02 at 12:00 +0530, Modassar Ather wrote:
>> >> > I have a setup of 12 shard cluster started with 28gb memory each on a
>> >> > single server. There are no replica. The size of index is around
>> 90gb on
>> >> > each shard. The Solr version is 5.2.1.
>> >>
>

Re: Very high memory and CPU utilization.

2015-11-02 Thread jim ferenczi
*I am not able to get  the above point. So when I start Solr with 28g RAM,
for all the activities related to Solr it should not go beyond 28g. And the
remaining heap will be used for activities other than Solr. Please help me
understand.*

Well, those 28GB of heap are the memory "reserved" for your Solr
application. However, some parts of the index (not to say all) are retrieved
via MMap (if you use the default MMapDirectory), which does not use the heap
at all. This is a very important part of Lucene/Solr: the heap should be
sized in a way that leaves a significant amount of RAM available for the
index. If not, you rely on the speed of your disk; SSDs make this better,
but reads are still significantly slower from an SSD than from direct RAM
access. Another thing to keep in mind is that mmap will always try to put
things in RAM, which is why I suspect that swap activity is killing your
performance.

2015-11-02 11:55 GMT+01:00 Modassar Ather <modather1...@gmail.com>:

> Thanks Jim for your response.
>
> The remaining size after you removed the heap usage should be reserved for
> the index (not only the other system activities).
> I am not able to get  the above point. So when I start Solr with 28g RAM,
> for all the activities related to Solr it should not go beyond 28g. And the
> remaining heap will be used for activities other than Solr. Please help me
> understand.
>
> *Also the CPU utilization goes upto 400% in few of the nodes:*
> You said that only machine is used so I assumed that 400% cpu is for a
> single process (one solr node), right ?
> Yes you are right that 400% is for single process.
> The disks are SSDs.
>
> Regards,
> Modassar
>
> On Mon, Nov 2, 2015 at 4:09 PM, jim ferenczi <jim.feren...@gmail.com>
> wrote:
>
> > *if it correlates with the bad performance you're seeing. One important
> > thing to notice is that a significant part of your index needs to be in
> RAM
> > (especially if you're using SSDs) in order to achieve good performance.*
> >
> > Especially if you're not using SSDs, sorry ;)
> >
> > 2015-11-02 11:38 GMT+01:00 jim ferenczi <jim.feren...@gmail.com>:
> >
> > > 12 shards with 28GB for the heap and 90GB for each index means that you
> > > need at least 336GB for the heap (assuming you're using all of it which
> > may
> > > be easily the case considering the way the GC is handling memory) and
> ~=
> > > 1TB for the index. Let's say that you don't need your entire index in
> > RAM,
> > > the problem as I see it is that you don't have enough RAM for your
> index
> > +
> > > heap. Assuming your machine has 370GB of RAM there are only 34GB left
> for
> > > your index, 1TB/34GB means that you can only have 1/30 of your entire
> > index
> > > in RAM. I would advise you to check the swap activity on the machine
> and
> > > see if it correlates with the bad performance you're seeing. One
> > important
> > > thing to notice is that a significant part of your index needs to be in
> > RAM
> > > (especially if you're using SSDs) in order to achieve good performance:
> > >
> > >
> > >
> > > *As mentioned above this is a big machine with 370+ gb of RAM and Solr
> > (12
> > > nodes total) is assigned 336 GB. The rest is still a good for other
> > system
> > > activities.*
> > > The remaining size after you removed the heap usage should be reserved
> > for
> > > the index (not only the other system activities).
> > >
> > >
> > > *Also the CPU utilization goes upto 400% in few of the nodes:*
> > > You said that only machine is used so I assumed that 400% cpu is for a
> > > single process (one solr node), right ?
> > > This seems impossible if you are sure that only one query is played at
> a
> > > time and no indexing is performed. Best thing to do is to dump stack
> > trace
> > > of the solr nodes during the query and to check what the threads are
> > doing.
> > >
> > > Jim
> > >
> > >
> > >
> > > 2015-11-02 10:38 GMT+01:00 Modassar Ather <modather1...@gmail.com>:
> > >
> > >> Just to add one more point that one external Zookeeper instance is
> also
> > >> running on this particular machine.
> > >>
> > >> Regards,
> > >> Modassar
> > >>
> > >> On Mon, Nov 2, 2015 at 2:34 PM, Modassar Ather <
> modather1...@gmail.com>
> > >> wrote:
> > >>
> > >> > Hi Toke,
> > >> > Thanks for your response. My comments in-line.
>

Re: Very high memory and CPU utilization.

2015-11-02 Thread jim ferenczi
Oops, I did not read the thread carefully.
*The problem is with the same query as phrase. q="network se*".*
I was not aware that you could do that with Solr ;). I would say this is
expected, because in such a case, if the number of expansions for "se*" is
big, then you have to check the positions for a significant number of
words. I don't know if there is a limit on the number of expansions for a
prefix query contained in a phrase query, but I would look at this
parameter first (limit the number of expansions per prefix search to, say,
the N most significant words based on word frequency, for instance).
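The mitigation suggested above can be sketched in a few lines (an illustration only, not an actual Solr parameter; the term dictionary and frequencies here are made up):

```python
# Toy illustration of limiting prefix expansion inside a phrase query.
# term_freqs stands in for the term dictionary with document frequencies.
term_freqs = {
    "search": 1_000_000, "security": 400_000, "server": 900_000,
    "service": 800_000, "session": 50_000, "settings": 20_000,
}

def expand_prefix(prefix, n):
    """Return the n most frequent terms matching the prefix."""
    matches = [t for t in term_freqs if t.startswith(prefix)]
    matches.sort(key=lambda t: term_freqs[t], reverse=True)
    return matches[:n]

# Unlimited expansion: every matching term's positions must be checked
# against the phrase, which is what makes "network se*" so expensive.
print(len(expand_prefix("se", 100)))  # 6 (all terms match here)
# Keeping only the N most significant expansions bounds the position checks.
print(expand_prefix("se", 3))  # ['search', 'server', 'service']
```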

2015-11-02 13:36 GMT+01:00 jim ferenczi <jim.feren...@gmail.com>:

>
>
>
> *I am not able to get  the above point. So when I start Solr with 28g RAM,
> for all the activities related to Solr it should not go beyond 28g. And the
> remaining heap will be used for activities other than Solr. Please help me
> understand.*
>
> Well, those 28GB of heap are the memory "reserved" for your Solr
> application. However, some parts of the index (not to say all) are retrieved
> via MMap (if you use the default MMapDirectory), which does not use the heap
> at all. This is a very important part of Lucene/Solr: the heap should be
> sized in a way that leaves a significant amount of RAM available for the
> index. If not, you rely on the speed of your disk; SSDs make this better,
> but reads are still significantly slower from an SSD than from direct RAM
> access. Another thing to keep in mind is that mmap will always try to put
> things in RAM, which is why I suspect that swap activity is killing your
> performance.
>
> 2015-11-02 11:55 GMT+01:00 Modassar Ather <modather1...@gmail.com>:
>
>> Thanks Jim for your response.
>>
>> The remaining size after you removed the heap usage should be reserved for
>> the index (not only the other system activities).
>> I am not able to get  the above point. So when I start Solr with 28g RAM,
>> for all the activities related to Solr it should not go beyond 28g. And
>> the
>> remaining heap will be used for activities other than Solr. Please help me
>> understand.
>>
>> *Also the CPU utilization goes upto 400% in few of the nodes:*
>> You said that only machine is used so I assumed that 400% cpu is for a
>> single process (one solr node), right ?
>> Yes you are right that 400% is for single process.
>> The disks are SSDs.
>>
>> Regards,
>> Modassar
>>
>> On Mon, Nov 2, 2015 at 4:09 PM, jim ferenczi <jim.feren...@gmail.com>
>> wrote:
>>
>> > *if it correlates with the bad performance you're seeing. One important
>> > thing to notice is that a significant part of your index needs to be in
>> RAM
>> > (especially if you're using SSDs) in order to achieve good performance.*
>> >
>> > Especially if you're not using SSDs, sorry ;)
>> >
>> > 2015-11-02 11:38 GMT+01:00 jim ferenczi <jim.feren...@gmail.com>:
>> >
>> > > 12 shards with 28GB for the heap and 90GB for each index means that
>> you
>> > > need at least 336GB for the heap (assuming you're using all of it
>> which
>> > may
>> > > be easily the case considering the way the GC is handling memory) and
>> ~=
>> > > 1TB for the index. Let's say that you don't need your entire index in
>> > RAM,
>> > > the problem as I see it is that you don't have enough RAM for your
>> index
>> > +
>> > > heap. Assuming your machine has 370GB of RAM there are only 34GB left
>> for
>> > > your index, 1TB/34GB means that you can only have 1/30 of your entire
>> > index
>> > > in RAM. I would advise you to check the swap activity on the machine
>> and
>> > > see if it correlates with the bad performance you're seeing. One
>> > important
>> > > thing to notice is that a significant part of your index needs to be
>> in
>> > RAM
>> > > (especially if you're using SSDs) in order to achieve good
>> performance:
>> > >
>> > >
>> > >
>> > > *As mentioned above this is a big machine with 370+ gb of RAM and Solr
>> > (12
>> > > nodes total) is assigned 336 GB. The rest is still a good for other
>> > system
>> > > activities.*
>> > > The remaining size after you removed the heap usage should be reserved
>> > for
>> > > the index (not only the other system activities).
>> > >
>> > >
>> > > *Also the CPU utilization goes upto 400% in few of the nodes:*

Re: fq versus q

2015-06-24 Thread jim ferenczi
 In part of queries we see strange behavior where q performs 5-10x better
 than fq. The question is why?
Are you sure that the query result cache is disabled?
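To illustrate why the question matters, here is a toy model (not Solr's actual cache code): the queryResultCache memoizes whole result pages keyed by query/sort/rows, so a repeated q can be answered with no work at all, while an fq marked {!cache=false} is re-executed on every request.

```python
# Toy model of why a repeated q can look 5-10x faster than an uncached fq:
# the query result cache memoizes entire (q, sort, rows) result pages.
query_result_cache = {}
calls = 0

def run_query(q, sort, rows, execute):
    key = (q, sort, rows)
    if key in query_result_cache:   # cache hit: no real work done
        return query_result_cache[key]
    result = execute()              # cache miss: pay the full query cost
    query_result_cache[key] = result
    return result

def expensive():
    global calls
    calls += 1
    return list(range(10))

# The first run pays the cost; the second identical run is served from cache.
run_query("maildate:[DATE1 TO DATE2]", "maildate_sort desc", 50, expensive)
run_query("maildate:[DATE1 TO DATE2]", "maildate_sort desc", 50, expensive)
print(calls)  # 1 - the second identical query never re-executed
```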

2015-06-24 13:28 GMT+02:00 Esther Goldbraich estherg...@il.ibm.com:

 Hi,

 We are comparing the performance of fq versus q for queries that are
 actually filters and should not be cached.
 In part of queries we see strange behavior where q performs 5-10x better
 than fq. The question is why?

 An example1:
 q=maildate:{DATE1 TO DATE2} COMPARED TO fq={!cache=false}maildate:{DATE1
 TO DATE2}
 sort=maildate_sort* desc
 rows=50
 start=0
 group=true
 group.query=some query (without dates)
 group.query=*:*
 group.sort=maildate_sort desc
 additional fqs

 Schema:
 <field name="maildate" stored="true" indexed="true" type="tdate"/>
 <field name="maildate_sort" stored="false" indexed="false" type="tdate"
 docValues="true"/>

 Thank you,
 Esther
 -
 Esther Goldbraich
 Social Technologies & Analytics - IBM Haifa Research Lab
 Phone: +972-4-8281059


Re: Solr returns incorrect results after sorting

2015-03-19 Thread jim ferenczi
Then you just have to remove the group.sort, especially if your group limit
is set to 1.
Le 19 mars 2015 16:57, kumarraj rajitpro2...@gmail.com a écrit :

 *if the number of documents in one group is more than one then you cannot
 ensure that this document reflects the main sort

 Is there a way the top record which is coming up in the group is considered
 for sorting?
 We require to show the record from 212(even though price is low) in both
 the
 cases of high to low or low to high..and still the main sorting should
 work?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-returns-incorrect-results-after-sorting-tp4193266p4194008.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr returns incorrect results after sorting

2015-03-18 Thread jim ferenczi
Hi Raj,
The group.sort you are using defines multiple criteria. The first criterion
is the big Solr function starting with max(). This means that inside
each group the documents will be sorted by this criterion, and if the values
are equal between two documents then the comparison falls back to the
second criterion (inStock_boolean desc), and so on.

*Even though if i add price asc in the group.sort, but still the main
sort does not consider that.*
The main sort does not have to consider what's in the group.sort. The
group.sort defines the way the documents are sorted inside each group. So
if you want to sort the documents inside each group in the same order as
the main sort, you can remove the group.sort or you can have a primary
sort on pricecommon_double desc in your group.sort:
*group.sort=pricecommon_double desc,
max(if(exists(query({!v='storeName_string:212'})),2,0),if(exists(query({!v='storeName_string:203'})),1,0)) desc,
inStock_boolean desc,
geodist() asc*


Cheers,
Jim



2015-03-18 7:28 GMT+01:00 kumarraj rajitpro2...@gmail.com:

 Hi Jim,

 Yes, you are right.. that document is having price 499.99,
 But i want to consider the first record in the group as part of the main
 sort.
 Even though if i add price asc in the group.sort, but still the main sort
 does not consider that.

 group.sort=max(if(exists(query({!v='storeName_string:212'})),2,0),if(exists(query({!v='storeName_string:203'})),1,0))
 desc, inStock_boolean desc, geodist() asc, pricecommon_double asc
 sort=pricecommon_double desc

 Is there any other workaround so that sort is always based on the first
 record which is pulled up in each group?


 Regards,
 Raj



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-returns-incorrect-results-after-sorting-tp4193266p4193658.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr returns incorrect results after sorting

2015-03-17 Thread jim ferenczi
Hi,
Please note that you have two sort criteria: one to sort the documents
inside each group and one to sort the groups. In the example you sent, the
group 10002 has two documents and your group.limit is set to 1. If you redo
the query with group.limit=2, I suspect that you'll see the second document
of this group with a pricecommon_double between 479.99 and 729.97. This
would mean that the sorting is correct ;). The bottom line is that when you
have a group.sort different from the main sort, and the number of documents
in one group is more than one, you cannot ensure that the displayed document
reflects the main sort. Try for instance group.sort=pricecommon_double
asc (the main sort in inverse order) and you'll see that the sort inside each
group is always applied after the main sort. This is the only way to meet
the expectations ;).
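This can be reproduced with a toy model (not Solr code; the 600.00 document is hypothetical, standing in for the suspected second document of group 10002): groups are positioned by their best document according to the main sort, while group.sort only chooses which document is shown.

```python
# Toy model of Solr grouping: the main `sort` orders groups by the best
# document in each group *according to the main sort*, while `group.sort`
# only selects/orders the documents shown inside each group.
docs = [
    {"group": "10001", "price": 729.97, "store": "203"},
    {"group": "10002", "price": 279.99, "store": "212"},
    {"group": "10002", "price": 600.00, "store": "203"},  # hypothetical 2nd doc
    {"group": "10003", "price": 479.99, "store": "203"},
]

def grouped(docs, main_key, group_key, limit=1):
    groups = {}
    for d in docs:
        groups.setdefault(d["group"], []).append(d)
    ordered = []
    for members in groups.values():
        pos = min(main_key(m) for m in members)        # group position: main sort
        head = sorted(members, key=group_key)[:limit]  # shown docs: group.sort
        ordered.append((pos, head))
    ordered.sort(key=lambda t: t[0])
    return [d for _, heads in ordered for d in heads]

# main sort: price desc; group.sort: prefer store 212, then price asc
result = grouped(
    docs,
    main_key=lambda d: -d["price"],
    group_key=lambda d: (d["store"] != "212", d["price"]),
)
print([d["price"] for d in result])  # [729.97, 279.99, 479.99]
```

The output reproduces the "misplaced" 10002 from the thread: the group is positioned by its hidden 600.00 member, but the group.sort head shown is the 279.99 document.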

Cheers,
Jim



2015-03-17 9:48 GMT+01:00 kumarraj rajitpro2...@gmail.com:

 Thanks David, that was a typo.
 Do you see any other issues? While solr does the grouping and if more than
 one document which are matched with given group.sort condition(numfound=2),
 then that particular document is not sorted correctly, when sorted by
 price.(sort=price) is applied across all the groups.

 Example: Below is the sample result.

 <arr name="groups">
   <lst>
     <str name="groupValue">10001</str>
     <result name="doclist" numFound="1" start="0">
       <doc>
         <double name="pricecommon_double">729.97</double>
         <str name="code_string">10001</str>
         <str name="name_text">Product1</str>
         <str name="storeName_string">203</str>
         <double name="geodist()">198.70324062133778</double>
       </doc>
     </result>
   </lst>
   <lst>
     <str name="groupValue">10002</str>
     <result name="doclist" numFound="2" start="0">
       <doc>
         <double name="pricecommon_double">279.99</double>
         <str name="code_string">10002</str>
         <str name="name_text">Product2</str>
         <str name="storeName_string">212</str>
         <double name="geodist()">0.0</double>
       </doc>
     </result>
   </lst>
   <lst>
     <str name="groupValue">10003</str>
     <result name="doclist" numFound="1" start="0">
       <doc>
         <double name="pricecommon_double">479.99</double>
         <str name="code_string">10003</str>
         <str name="name_text">Product3</str>
         <str name="storeName_string">203</str>
         <double name="geodist()">198.70324062133778</double>
       </doc>
     </result>
   </lst>
 </arr>

 I expect product 10002 to be sorted and shown after 10003, but it is not
 sorted correctly.





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-returns-incorrect-results-after-sorting-tp4193266p4193457.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Filter cache pollution during sharded edismax queries

2014-10-01 Thread jim ferenczi
I think you should test with facet.shard.limit=-1: this removes the
limit for the facets on the shards and the need for facet
refinements. I bet that returning every facet value with a count greater
than 0 on internal queries is cheaper than using the filter cache to handle
a lot of refinements.

Jim

2014-10-01 10:24 GMT+02:00 Charlie Hull char...@flax.co.uk:

 On 30/09/2014 22:25, Erick Erickson wrote:

 Just from a 20,000 ft. view, using the filterCache this way seems...odd.

 +1 for using a different cache, but that's being quite unfamiliar with the
 code.


 Here's a quick update:

 1. LFUCache performs worse so we returned to LRUCache
 2. Making the cache smaller than the default 512 reduced performance.
 3. Raising the cache size to 2048 didn't seem to have a significant effect
 on performance but did reduce CPU load significantly. This may help our
 client as they can reduce their system spec considerably.

 We're continuing to test with our client, but the upshot is that even if
 you think you don't need the filter cache, if you're doing distributed
 faceting you probably do, and you should size it based on experimentation.
 In our case there is a single filter but the cache needs to be considerably
 larger than that!

 Cheers

 Charlie



 On Tue, Sep 30, 2014 at 1:53 PM, Alan Woodward a...@flax.co.uk wrote:



  Once all the facets have been gathered, the co-ordinating node then
 asks
 the subnodes for an exact count for the final top-N facets,



 What's the point of refining these counts? I thought it makes sense
 only for facet.limit-ed requests. Is that a correct statement? Can those who
 suffer from low performance just unlimit facet.limit to avoid that
 distributed hop?


 Presumably yes, but if you've got a sufficiently high cardinality field
 then any gains made by missing out the hop will probably be offset by
 having to stream all the return values back again.

 Alan


  --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com






 --
 Charlie Hull
 Flax - Open Source Enterprise Search

 tel/fax: +44 (0)8700 118334
 mobile:  +44 (0)7767 825828
 web: www.flax.co.uk



Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread jim ferenczi
Hi,
Something like this?:
https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
And just to show some impressive search functionality of the wiki ;):
https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=document+vectors

Cheers,
Jim


2014-09-05 9:44 GMT+02:00 Jürgen Wagner (DVT) juergen.wag...@devoteam.com
:

 Hello all,
   as the migration from FAST to Solr is a relevant topic for several of
 our customers, there is one issue that does not seem to be addressed by
 Lucene/Solr: document vectors FAST-style. These document vectors are
 used to form metrics of similarity, i.e., they may be used as a
 semantic fingerprint of documents to define similarity relations. I
 can think of several ways of approximating a mapping of this mechanism
 to Solr, but there are always drawbacks - mostly performance-wise.

 Has anybody else encountered and possibly approached this challenge so far?

 Is there anything in the roadmap of Solr that has not revealed itself to
 me, addressing this issue?

 Your input is greatly appreciated!

 Cheers,
 --Jürgen




Re: Incorrect group.ngroups value

2014-08-22 Thread jim ferenczi
Hi Bryan,
This is a known limitations of the grouping.
https://wiki.apache.org/solr/FieldCollapsing#RequestParameters

group.ngroups:


*WARNING: If this parameter is set to true on a sharded environment, all
the documents that belong to the same group have to be located in the same
shard, otherwise the count will be incorrect. If you are using SolrCloud
https://wiki.apache.org/solr/SolrCloud, consider using custom hashing*

Cheers,
Jim



2014-08-21 21:44 GMT+02:00 Bryan Bende bbe...@gmail.com:

 Is there any known issue with using group.ngroups in a distributed Solr
 using version 4.8.1 ?

 I recently upgraded a cluster from 4.6.1 to 4.8.1, and I'm noticing several
 queries where ngroups will be more than the actual groups returned in the
 response. For example, ngroups will say 5, but then there will be 3 groups
 in the response. It is not happening on all queries, only some.



Re: Bloom filter

2014-07-30 Thread jim ferenczi
Hi Per,
First of all, the BloomFilter implementation in Lucene is not exactly a
bloom filter: it uses only one hash function and you cannot set the false
positive ratio beforehand. Elasticsearch has its own bloom filter
implementation (a Guava-like BloomFilter); you should take a look at
their implementation if you really need this feature.
What is your use-case? If your index fits in RAM the bloom filter won't
help (and it may have a negative impact if you have a lot of segments). In
fact, the only use case where the bloom filter can help is when your term
dictionary does not fit in RAM, which is rarely the case.
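For reference, the classic bloom filter being contrasted here uses k hash functions and a tunable false-positive ratio (a minimal sketch; this is neither Lucene's nor Elasticsearch's implementation):

```python
# Minimal k-hash bloom filter. With k=1 (as in Lucene's single-hash
# variant) the false-positive rate for the same bit array is worse,
# and the target ratio cannot be tuned via k.
import hashlib

class BloomFilter:
    def __init__(self, size_bits, num_hashes):
        self.m = size_bits
        self.k = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive k independent positions by salting the hash input.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter(size_bits=1024, num_hashes=3)
for term in ["lucene", "solr", "bloom"]:
    bf.add(term)

print(bf.might_contain("solr"))      # True: bloom filters have no false negatives
print(bf.might_contain("zookeeper")) # almost certainly False at this low load
```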

Regards,
Jim



2014-07-28 16:13 GMT+02:00 Per Steffensen st...@designware.dk:

 Yes I found that one, along with SOLR-3950. Well at least it seems like
 the support is there in Lucene. I will figure out myself how to make it
 work via Solr, the way I need it to work. My use-case is not as specified
 in SOLR-1375, but the solution might be the same. Any input is of course
 still very much appreciated.

 Regards, Per Steffensen


 On 28/07/14 15:42, Lukas Drbal wrote:

 Hi Per,

 link to jira - https://issues.apache.org/jira/browse/SOLR-1375 Unresolved
 ;-)

 L.


 On Mon, Jul 28, 2014 at 1:17 PM, Per Steffensen st...@designware.dk
 wrote:

  Hi

 Where can I find documentation on how to use Bloom filters in Solr (4.4).
 http://wiki.apache.org/solr/BloomIndexComponent seems to be outdated -
 there is no BloomIndexComponent included in 4.4 code.

 Regards, Per Steffensen







Re: Compression vs FieldCache for doc ids retrieval

2014-06-02 Thread jim ferenczi
@William Firstly, because I was sure that the ticket (or an equivalent) was
already opened but I just could not find it. Thanks @Manuel. Secondly,
because I wanted to start the discussion: I have the feeling that the
compression of the documents, activated by default, can be a killer for
some applications (if the number of shards is big or if you have a lot of
deep paging queries), and I wanted to check if someone had noticed the
problem in a benchmark. Let's say that you have 10 shards and you want to
return 10 documents per request: in the first stage of the search each
shard would need to decompress 10 blocks of 16k each, whereas the second
stage would need to decompress only 10 blocks total. This makes me believe
that this patch should be the default behaviour for any distributed search
in Solr (I mean more than 1 shard).
Maybe it's better to continue the discussion on the ticket created by
Manuel, but still, I think that it could speed up every query (not only
deep paging queries like in the patch proposed in Manuel's ticket).
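The arithmetic behind that example, made concrete (a sketch using the numbers above):

```python
# Block-decompression cost for one distributed request, using the
# numbers from this thread: 10 shards, rows=10, 16k compression blocks.
shards = 10
rows = 10
block_kb = 16

# Phase 1 (id gathering): every shard decompresses one block per candidate doc.
phase1_blocks = shards * rows   # 100 blocks across the cluster
# Phase 2 (fetching the merged top `rows`): only the winners are decompressed.
phase2_blocks = rows            # 10 blocks total

print(phase1_blocks, phase2_blocks, phase1_blocks * block_kb)  # 100 10 1600
```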

Jim



2014-06-01 14:06 GMT+09:00 William Bell billnb...@gmail.com:

 Why not just submit a JIRA issue - and add your patch so that we can all
 benefit?


 On Fri, May 30, 2014 at 5:34 AM, Manuel Le Normand 
 manuel.lenorm...@gmail.com wrote:

  Is the issue SOLR-5478 what you were looking for?
 



 --
 Bill Bell
 billnb...@gmail.com
 cell 720-256-8076



Compression vs FieldCache for doc ids retrieval

2014-05-26 Thread jim ferenczi
Dear Solr users,

we migrated our solution from Solr 4.0 to Solr 4.3 and noticed a
degradation of search performance. We compared the two versions and
found out that most of the time is spent in the decompression of the
retrievable fields in Solr 4.3. The block compression of the documents is a
great feature for us because it reduces the size of our index, but we don’t
have enough resources (I mean CPUs) to safely migrate to the new version.
In order to reduce the cost of the decompression we tried a simple patch in
the BinaryResponseWriter: during the first phase of the distributed search
the response writer gets the documents from the index reader only to
extract the doc ids of the top N results. Our patch uses the field cache to
get the doc ids during the first phase and thus replaces a full
decompression of 16k blocks (for a single document) with a simple get in an
array (the field cache or the doc values). Thanks to this patch we are now
able to handle the same number of QPS as before (with Solr 4.0). Of
course the document cache could help as well, but not as much as one
would have thought (mainly because we have a lot of deep paging queries).

I am sure that the idea we implemented is not new, but I haven’t seen any
Jira about it. Should we create one? (I mean, does it have a chance to be
included in a future release of Solr, or is anybody already working on
this?)

Cheers,

Jim


Re: Memory + WeakIdentityMap

2014-03-21 Thread jim ferenczi
Hi,
If you are not on Windows, you can try to disable the tracking of clones in
the MMapDirectory by setting unmap to false in your solrconfig.xml:

<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory">
  <bool name="unmap">false</bool>
</directoryFactory>

The MMapDirectory keeps track of all clones in a weak map and forces the
unmapping of the buffers on close. This was added because on Windows
mmapped files cannot be modified or deleted. If unmap is false, the weak map
is not created and the weak references you see in your heap should disappear
as well.
You can find more information here:
https://issues.apache.org/jira/browse/LUCENE-4740

Thanks,
Jim






2014-03-21 6:56 GMT+01:00 Shawn Heisey s...@elyograg.org:

 On 3/20/2014 6:54 PM, Harish Agarwal wrote:
  I'm transitioning my index from a 3.x version to 4.6.  I'm running a
 large
  heap (20G), primarily to accomodate a large facet cache (~5G), but have
  been able to run it on 3.x stably.
 
  On 4.6.0 after stress testing I'm finding that all of my shards are
  spending all of their time in GC.  After taking a heap dump and
 analyzing,
  it appears that org.apache.lucene.util.WeakIdentityMap is using many Gs
 of
  memory.  Does anyone have any insight into which Solr component(s) use
 this
  and whether this kind of memory consumption is to be expected?

 I can't really say what WeakIdentityMap is doing.  I can trace the only
 usage in Lucene to MMapDirectory, but it doesn't make a lot of sense for
 this to use a lot of memory, unless this is the source of the memory
 misreporting that Java 7 seems to do with MMap.  See this message in a
 recent thread on this mailing list:


 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201403.mbox/%3c53285ca1.9000...@elyograg.org%3E

 If you have a lot of facets, one approach for performance is to use
 facet.method=enum so that your Java heap does not need to be super large.

 This does not actually reduce the overall system memory requirements.
 It just shifts the responsibility for caching to the operating system
 instead of Solr, and requires that you have enough memory to put a
 majority of the index into the OS disk cache.  Ideally, there would be
 enough RAM for the entire index to fit.

 http://wiki.apache.org/solr/SolrPerformanceProblems

 Another option for facet memory optimization is docValues.  One caveat:
 It is my understanding that the docValues content is the same as a
 stored field.  Depending on your schema definition, this may be
 different than the indexed values that facets normally use.  The
 docValues feature also helps with sorting.

 Thanks,
 Shawn




Re: run filter queries after post filter

2013-10-09 Thread jim ferenczi
Hi Rohit,
The main problem is that if the query inside the filter does not have a
PostFilter implementation then your post filter is silently transformed
into a simple filter. The query field:value is based on the inverted
lists and does not have a postfilter support.
If your field is a numeric field take a look at the frange query parser
which has post filter support:
To filter out documents with a field value less than 5:
fq={!frange l=5 cache=false cost=200}field(myField)
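Put together with a main query, such a request might look like this (field name invented); it is the combination of cache=false and cost >= 100 that makes Solr run the clause as a true post filter:

```text
# Hypothetical request; "myField" is an invented numeric field.
# cache=false plus cost >= 100 makes the frange clause a post filter:
# it only sees documents that already matched q and any cheaper fq clauses.
q=*:*&fq={!frange l=5 cache=false cost=200}field(myField)
```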

Cheers,
Jim


2013/10/9 Rohit Harchandani rhar...@gmail.com

 Yes, I get that. Actually, I should have explained in more detail:

 - I have a query which gets certain documents.
 - The post filter gets these matched documents, does some processing on
 them, and filters the results.
 - But after this is done I need to apply another filter - which is why I
 gave it a higher cost.

 The reason I need to do this is that the processing done by the post
 filter depends on the documents matching the query up to that point.
 Since the normal fq clause is also executed before the post filter
 (despite the cost), the final results are not accurate.

 thanks
 Rohit




 On Wed, Oct 9, 2013 at 4:14 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Ah, I think you're misunderstanding the nature of post-filters.
  Or I'm confused, which happens a lot!
 
  The whole point of post filters is that they're assumed to be
  expensive (think ACL calculation). So you want them to run
  on the fewest documents possible. So only docs that make it
  through the primary query _and_ all lower-cost filters will get
  to this post-filter. This also means they can't be cached, for
  instance, because they (hopefully) don't see very many docs.
 
  This is radically different from normal fq clauses, which are
  calculated on the entire corpus and can thus be cached.
 
  Best,
  Erick
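This ordering can be sketched with a toy model (this is NOT Solr's implementation; the doc ids, predicates, and the run_query helper are all invented) showing why a post filter sees only the docs that survive the main query and the cheaper filters:

```python
# Toy model (not Solr code) of the ordering described above: fq clauses
# with cache=false and cost >= 100 run as post filters, applied AFTER the
# main query and all cheaper filters. Everything here is invented.

def run_query(query_matches, filters):
    """Return (surviving doc ids, docs seen by each post filter, in cost order)."""
    cheap = [f for f in filters if f["cache"] or f["cost"] < 100]
    post = sorted((f for f in filters if not f["cache"] and f["cost"] >= 100),
                  key=lambda f: f["cost"])
    docs = set(query_matches)
    for f in cheap:                        # evaluated over the whole candidate set
        docs = {d for d in docs if f["pred"](d)}
    seen_by_post = []
    for f in post:                         # streamed per doc, cheapest first
        seen_by_post.append(len(docs))     # this filter only "sees" the survivors
        docs = {d for d in docs if f["pred"](d)}
    return sorted(docs), seen_by_post

matches = range(0, 100, 2)                 # 50 docs match the main query
filters = [
    {"cache": True,  "cost": 0,   "pred": lambda d: d < 50},       # normal fq
    {"cache": False, "cost": 200, "pred": lambda d: d % 10 == 0},  # post filter
]
result, seen = run_query(matches, filters)
# The post filter sees only the 25 docs matching q AND the cheap fq,
# not the full corpus.
```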
 
  On Wed, Oct 9, 2013 at 11:59 AM, Rohit Harchandani rhar...@gmail.com
  wrote:
   Hey,
   So the post filter logs the number of ids that it receives. With the
   above filter having cost=200, the post filter should have received the
   same number of ids as before (when the filter was not present). But
   that does not seem to be the case: with the filter query on the index,
   the number of ids that the post filter receives goes down.
  
   Thanks,
   Rohit
  
  
    On Tue, Oct 8, 2013 at 8:29 PM, Erick Erickson erickerick...@gmail.com wrote:
  
   Hmmm, seems like it should. What's our evidence that it isn't working?
  
   Best,
   Erick
  
   On Tue, Oct 8, 2013 at 4:10 PM, Rohit Harchandani rhar...@gmail.com
   wrote:
     Hey,
     I am using Solr 4.0 with my own PostFilter implementation, which is
     executed after the normal Solr query is done. This filter has a cost
     of 100. Is it possible to run filter queries on the index after the
     execution of the post filter?
     I tried adding the below line to the url but it did not seem to work:
     fq={!cache=false cost=200}field:value
     Thanks,
     Rohit