Re: Facets based on sampling

2017-10-20 Thread John Davis
Hi Yonik,
Any update on sampling-based facets? The current faceting is really slow
for fields with high cardinality, even with method=uif. Or are there
alternative workarounds to only look at N docs when computing facets?

On Fri, Nov 4, 2016 at 4:43 PM, Yonik Seeley  wrote:

> Sampling has been on my TODO list for the JSON Facet API.
> How much it would help depends on where the bottlenecks are, but that
> in conjunction with a hashing approach to collection (assuming field
> cardinality is high) should definitely help.
>
> -Yonik
>
>
> On Fri, Nov 4, 2016 at 3:02 PM, John Davis 
> wrote:
> > Hi,
> > I am trying to improve the performance of queries with facets. I
> understand
> > that for queries with high facet cardinality and a large number of results the
> > current facet computation algorithms can be slow as they are trying to
> loop
> > across all docs and facet values.
> >
> > Does there exist an option to compute facets by just looking at the top-n
> > results instead of all of them or a sample of results based on some query
> > parameters? I couldn't find one and if it does not exist, has this come
> up
> > before? This would definitely not be a precise facet count but using
> > reasonable sampling algorithms we should be able to extrapolate well.
> >
> > Thank you in advance for any advice!
> >
> > John
>


Re: Upload/update full schema and solrconfig in standalone mode

2017-10-20 Thread Erick Erickson
SCP is just a copy program that allows you to copy files to a remote
system. Think "cp -r ...", but to a remote machine.


On Oct 20, 2017 11:47, "Alessandro Hoss"  wrote:

> Thanks for your comments Rick,
>
> Sorry, but I didn't understand what you mean with "scp". But let me explain
> our scenario:
> Our application is on-premises, so I can't control the infrastructure of
> the customer, they just tell me the Solr address and if Solr is running on
> Cloud mode or not.
>
> As our app is responsible for indexing and searching, we used to control
> schema and solrconfig configurations by sending these files to the
> customers, but this requires to trust in their Ops and often causes some
> troubles because of misconfigurations.
>
> I was looking for a way of doing this automatically from my app.
>
> Thanks again,
> Alessandro Hoss
>
> On Fri, Oct 20, 2017 at 2:39 PM Rick Leir  wrote:
>
> > Alessandro
> > First, let me say that the whole idea makes me nervous.
> > 1/ are you better off with scp? I would not want to do this via Solr API
> > 2/ the right way to do this is with Ansible, Puppet or Docker,
> > 3/ would you like to update a 'QA' installation, test it, then flip it
> > into production? Cheers -- Rick
> >
> > On October 20, 2017 8:49:14 AM EDT, Alessandro Hoss 
> > wrote:
> > >Hello,
> > >
> > >Is it possible to upload the entire schema and solrconfig.xml to a Solr
> > >running on standalone mode?
> > >
> > >I know about the Config API
> > >, but it
> > >allows
> > >only add or modify solrconfig properties, and what I want is to change
> > >the
> > >whole config (schema and solrconfig) to ensure it's up to date.
> > >
> > >What I need is something similar to the Configsets API
> > >, where
> > >I'm
> > >able to upload a zip containing both schema and solrconfig.xml, but
> > >unfortunately it's SolrCloud only.
> > >
> > >Is there a way of doing that in standalone mode?
> > >
> > >Thanks in advance.
> > >Alessandro Hoss
> >
> > --
> > Sorry for being brief. Alternate email is rickleir at yahoo dot com
>


Re: Jetty maxThreads

2017-10-20 Thread Yonik Seeley
The high number of maxThreads is to avoid distributed deadlock.
The fix is multiple thread pools, depending on request type:
https://issues.apache.org/jira/browse/SOLR-7344

-Yonik


On Wed, Oct 18, 2017 at 4:41 PM, Walter Underwood  wrote:
> Jetty maxThreads is set to 10,000, which seems way too big.
>
> The comment suggests 5X the number of CPUs. We have 36 CPUs, which would mean 
> 180 threads, which seems more reasonable.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>


Re: Jetty maxThreads

2017-10-20 Thread Shawn Heisey
On 10/18/2017 2:41 PM, Walter Underwood wrote:
> Jetty maxThreads is set to 10,000, which seems way too big.
>
> The comment suggests 5X the number of CPUs. We have 36 CPUs, which would mean 
> 180 threads, which seems more reasonable.

I have not seen any evidence that maxThreads at 10000 causes memory
issues.  The out-of-the-box heap size for all recent releases is 512MB,
and Solr starts up just fine with 10000 maxThreads.

Most containers (including Jetty and Tomcat) default to a maxThreads
value of 200.  The Jetty included with Solr has had a setting of 10000
since I first started testing with version 1.4.  Users who provide their
own containers frequently run into problems where the container will not
allow Solr to start the threads it needs to run properly, so they must
increase the value.
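
For reference, the setting lives in server/etc/jetty.xml in the Solr
download.  The thread pool section looks roughly like this -- treat it as
a sketch, since the exact property names vary a little between Solr
versions:

  <Get name="ThreadPool">
    <!-- solr.jetty.threads.max is what you would lower if you wanted to -->
    <Set name="minThreads"><Property name="solr.jetty.threads.min" default="10"/></Set>
    <Set name="maxThreads"><Property name="solr.jetty.threads.max" default="10000"/></Set>
  </Get>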

This is a graph of threads on a couple of my Solr servers:

https://www.dropbox.com/s/4ux2y3xwvsjjrmt/solr-thread-graph.png?dl=0

The server named bigindy5 (rear graph in the screenshot) is my dev
server, running 6.6.2-SNAPSHOT.  The server named idxb6 is running
5.3.2-SNAPSHOT and is a backup production server.

The dev server has 8 CPU cores without hyperthreading and 24 Solr cores
(indexes).  Most of those cores have index data in them -- the dev
server has copies of *all* my indexes onboard.  It has very little
activity though -- aside from once-a-minute maintenance updates and a
monitoring server, there's virtually no query activity.

The production server has 20 CPU cores with hyperthreading (so it looks
like 40 to the OS), the same 24 Solr cores, but only a handful of those
cores have data, the rest are idle.  There's one critical difference in
activity for this server compared to the dev server -- four of the cores
on the machine are actively indexing from MySQL with the dataimport
handler, because I'm doing a full rebuild on that index.  Because this
server is in a backup role currently, its query load is similar to the
dev server -- almost nothing.

These servers handle distributed indexes, but they are NOT running in
cloud mode.  If there were active queries, then more threads would be
needed than currently are running.  If there were more Solr cores
(indexes), then more threads would be needed.

My installation is probably bigger than typical, but is definitely not
what I would call large.  As you can see from the screenshot, these
servers have reached thread counts in the 300 range, and are currently
sitting at about 250.  If I followed that recommendation of 5 threads
per CPU, I would configure a value of 40 on the dev server, which
wouldn't be anywhere near enough.

I've got another server running version 4.7.2 with 8 CPU cores (no
hyperthreading) and slightly fewer Solr cores.  This is a server that
actively gets queries at a fairly low QPS rate.  It shows a steady
thread count of around 200, with a peak thread count of 1032.  That
instance of Solr has an uptime of 208 days.

Based on what I have seen on my servers, I would not run with maxThreads
less than 2000, and I don't see any reason to change it from the
provided default of 10000.

Thanks,
Shawn



Re: Deploy Solr to production: best practices

2017-10-20 Thread Shawn Heisey
On 10/18/2017 11:32 PM, maximka19 wrote:
> *1.* Container: from Solr 5 there is no .WAR file provided in the package. I
> couldn't deploy Solr 7.1 to Tomcat 9. None of the existing tutorials or guides
> helped. No such information for newer versions.

The included Jetty is the only supported option since version 5.0.

https://wiki.apache.org/solr/WhyNoWar

> So, does this mean that Solr doesn't officially support other containers like
> Tomcat? Can we use Jetty as the main container in production? Is it
> officially recommended by the developers/maintainers? If so, how can I host Solr
> as a service on Windows Server? There are no scripts in the package for
> Windows, only for *nix machines. How do I do that? What are the best practices? No
> information, tutorials, or guides cover this question, especially for
> Windows users.

Correct, running in a user-supplied container is not a supported
option.  It *is* still possible to install Solr into a third party
container like Tomcat, but you'll be on your own.  Since version 5.3,
the webapp is no longer compressed into a WAR file, instead it is
already exploded to server/solr-webapp/webapp.  As far as I am aware,
Tomcat does have the ability to run an already-exploded webapp.

Jetty is considered an enterprise-grade platform.  You might have heard
of this product, which is built with Jetty:

https://cloud.google.com/appengine/docs/flexible/

As for the OS, you're probably getting the impression that Windows is a
second class citizen to the average Solr developer.  That would indeed
be a correct impression.  As for my own opinion, although I do consider
Windows to be technically inferior to Linux and other open source
operating systems, it's a capable platform that can run Solr just fine.

My primary objection to running Solr on Windows has little to do with
the technology; it's mostly about cost.  You wouldn't want to run
production Solr on a client OS like Windows 10.  The server operating
systems usually add a significant cost to new hardware deployments.

I think that NSSM is the program most commonly used for turning the Solr
download into a Windows service.

https://nssm.cc/

There is an issue in Jira to create a Windows service out of the box,
but work on it has stalled.

https://issues.apache.org/jira/browse/SOLR-7105

Thanks,
Shawn



Re: Certificate issue ERR_SSL_VERSION_OR_CIPHER_MISMATCH

2017-10-20 Thread Shawn Heisey
On 10/19/2017 6:30 AM, Younge, Kent A - Norman, OK - Contractor wrote:
> Built a clean Solr server imported my certificates and when I go to the 
> SSL/HTTPS page it tells me that I have ERR_SSL_VERSION_OR_CIPHER_MISMATCH in 
> Chrome and in IE tells me that I need to TURN ON TLS 1.0, TLS 1.1, and TLS 
> 1.2.

What Java version?  What Java vendor?  What operating system?  The OS
won't have a lot of impact on HTTPS; I just ask in case other
information is desired, so we can tailor the information requests.

I see other messages where you mention Solr 6.6, which requires Java 8.

As Hoss mentioned to you in another thread, *all* of the SSL capability
is provided by Java.  The Jetty that ships with Solr includes a config
for HTTPS.  The included Jetty config *excludes* a handful of
low-quality ciphers that your browser probably already refuses to use,
but that's the only cipher-specific configuration.  If you haven't
changed the Jetty config in the Solr download, then Jetty defaults and
your local Java settings will control everything else.  As far as I am
aware, Solr doesn't influence the SSL config at all.

  <Set name="ExcludeCipherSuites">
    <Array type="String">
      <Item>SSL_RSA_WITH_DES_CBC_SHA</Item>
      <Item>SSL_DHE_RSA_WITH_DES_CBC_SHA</Item>
      <Item>SSL_DHE_DSS_WITH_DES_CBC_SHA</Item>
      <Item>SSL_RSA_EXPORT_WITH_RC4_40_MD5</Item>
      <Item>SSL_RSA_EXPORT_WITH_DES40_CBC_SHA</Item>
      <Item>SSL_DHE_RSA_EXPORT_WITH_DES40_CBC_SHA</Item>
      <Item>SSL_DHE_DSS_EXPORT_WITH_DES40_CBC_SHA</Item>
    </Array>
  </Set>

It is extremely unlikely that Solr itself is causing these problems.  It
is more likely that there's something about your environment (java
version, custom java config, custom Jetty config, browser customization,
or maybe something else) that is resulting in a protocol and cipher list
that your browser doesn't like.

Thanks,
Shawn



Re: Concern on solr commit

2017-10-20 Thread Shawn Heisey
On 10/18/2017 3:09 AM, Leo Prince wrote:
> Is there any known negative impacts in setting up autoSoftCommit as 1
> second other than RAM usage..?

For most users, setting autoSoftCommit to one second is a BAD idea.  In
many indexes, commits take longer than one second to complete.  If you
do heavy indexing with that setting and the commit takes longer than one
second to complete, then you will end up with overlapping commits, all
of which will *try* to open a new searcher.  In a hypothetical situation
where heavy indexing lasts for an hour and commits take enough time,
Solr will be doing *constant* commits (which may overlap) for the entire
hour.

In that kind of scenario, Solr will prevent most of the new searchers
from actually opening because of the maxWarmingSearchers configuration,
so you won't see changes within the configured one second anyway. 
Frequent commits, especially if they overlap, will pound the CPU and
Java's garbage collection, so the server is going to be extremely busy
and may have difficulty handling indexing and query requests in a timely
manner.

Soft commits are sometimes a little bit faster than hard commits that
open a new searcher, but the difference is not extreme.  Later in the
thread you stated "Since we are using SoftCommits, the docs written will
be in RAM until a AutoCommit to reflect onto Disk".  This is not
*exactly* how it works.  Soft commits have the *possibility* of being
handled by the caching nature of the NRTCachingDirectory
implementation.  See the javadoc for the directory implementation:

http://lucene.apache.org/core/7_0_0/core/org/apache/lucene/store/NRTCachingDirectory.html

The default cache sizes stated in that documentation are 5 and 60 MB,
but the actual defaults in the code are 4 and 48.  If a new segment is
larger than 4MB or the total amount to be cached is larger than 48MB,
then it won't be saved to RAM, it'll go to disk.  In a nutshell, with
very heavy indexing, soft commits won't save you anything, since you're
probably going to be writing some information to disk anyway.
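
If you want to experiment with those thresholds, they can be passed to the
directory factory in solrconfig.xml.  Something like the following should
work -- I'm going from memory on the parameter names, so double-check them
against the javadoc above:

  <directoryFactory name="DirectoryFactory" class="solr.NRTCachingDirectoryFactory">
    <!-- segments larger than this (MB) bypass the RAM cache and go straight to disk -->
    <double name="maxMergeSizeMB">4</double>
    <!-- total RAM (MB) the directory may use for cached segments -->
    <double name="maxCachedMB">48</double>
  </directoryFactory>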

You should set autoSoftCommit to the longest possible interval you can
stand for the visibility of changes.  I would recommend AT LEAST 60
seconds, but if your commits finish quickly, you might be able to go
lower.  The autoCommit section should set openSearcher to false.  Unless
your autoCommit interval has been increased beyond the typical default
of 15 seconds, I usually recommend making the autoSoftCommit interval
longer than the autoCommit interval.
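
As a concrete sketch of what that looks like in solrconfig.xml (the
intervals are examples, not magic numbers):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <!-- hard commit for durability; does not open a new searcher -->
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <!-- controls visibility of changes; use the longest interval you can stand -->
      <maxTime>60000</maxTime>
    </autoSoftCommit>
  </updateHandler>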

You have already been given the following link.  It covers most of the
caveats where commits are concerned.  The title of the article mentions
SolrCloud, but the information applies to standalone mode too:

https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Thanks,
Shawn



Re: hi load mitigation

2017-10-20 Thread Shawn Heisey
On 10/17/2017 7:10 AM, j.s. wrote:
> i run a stand alone solr instance in which usage has suddenly spiked a
> bit. the load was at 8, but by adding another CPU i brought it down to
> 2. much better but not where i'd like it to be.
>
> i guess i'm writing to see if anyone has any suggestions about where
> to look to improve this. the data size is 24G. i have 4G of memory
> dedicated to server.

I see two potential problems based on the limited available
information.  Both of these potential problems involve memory.  It could
be possible that *both* problems are contributing to what you've
observed.  It's also possible that I am completely wrong and don't
understand your setup well enough to make any assessment.

One possible problem is that your heap is too small for what you have
Solr doing, causing Java to constantly perform garbage collections.  The
default heap size in recent Solr versions is 512MB, which is quite
small.  The default is intentionally small, so that Solr will start out
of the box on most hardware.  Production installations are almost
certainly going to need to increase the heap size.

The other possible problem is that you don't have enough total system
memory for good performance.

If you're saying that the machine has 4GB of total system memory, then I
can almost guarantee that you don't have enough.  For an index size of
24GB, if that is the only index on the system, you're going to want the
OS to be able to cache several gigabytes of index data in memory.  What
I would want for that index is a machine with between 16GB and 32GB of
total memory, possibly more.  Some of that memory will be assigned to
programs, including Solr's heap, and whatever is left will be used by
the OS to cache data.

If your statement about 4GB memory is only talking about the Java heap,
then it will be important to know how much memory is left and whether
there is other software on the machine.

https://wiki.apache.org/solr/SolrPerformanceProblems#RAM

If there is insufficient memory for caching, then Solr will be forced to
read data from the disk to perform its most basic functions.  Disks are
SLOW, and when processes like Solr are waiting for disk access, the
system load average will usually be high.

Later in the thread, you mention that you're looking at "top" on a Linux
system.  It would be useful to see the top display.  Run top, press
shift-M to sort the list by memory usage, and grab a screenshot.  Share
that screenshot with us using a file sharing website.

Thanks,
Shawn



Retrieve DocIdSet from Query in lucene 5.x

2017-10-20 Thread Jamie Johnson
I am trying to migrate some old code that used to retrieve DocIdSets from
filters, but with Filters being deprecated in Lucene 5.x I am trying to
move away from those classes, but I'm not sure of the right way to do this
now.  Are there any examples of doing this?


Re: Upload/update full schema and solrconfig in standalone mode

2017-10-20 Thread Alessandro Hoss
Thanks for your comments Rick,

Sorry, but I didn't understand what you mean by "scp". But let me explain
our scenario:
Our application is on-premises, so I can't control the customer's
infrastructure; they just tell me the Solr address and whether Solr is
running in Cloud mode or not.

As our app is responsible for indexing and searching, we used to control
the schema and solrconfig configurations by sending these files to the
customers, but this requires trusting their Ops and often causes
trouble because of misconfigurations.

I was looking for a way of doing this automatically from my app.

Thanks again,
Alessandro Hoss

On Fri, Oct 20, 2017 at 2:39 PM Rick Leir  wrote:

> Alessandro
> First, let me say that the whole idea makes me nervous.
> 1/ are you better off with scp? I would not want to do this via Solr API
> 2/ the right way to do this is with Ansible, Puppet or Docker,
> 3/ would you like to update a 'QA' installation, test it, then flip it
> into production? Cheers -- Rick
>
> On October 20, 2017 8:49:14 AM EDT, Alessandro Hoss 
> wrote:
> >Hello,
> >
> >Is it possible to upload the entire schema and solrconfig.xml to a Solr
> >running on standalone mode?
> >
> >I know about the Config API
> >, but it
> >allows
> >only add or modify solrconfig properties, and what I want is to change
> >the
> >whole config (schema and solrconfig) to ensure it's up to date.
> >
> >What I need is something similar to the Configsets API
> >, where
> >I'm
> >able to upload a zip containing both schema and solrconfig.xml, but
> >unfortunately it's SolrCloud only.
> >
> >Is there a way of doing that in standalone mode?
> >
> >Thanks in advance.
> >Alessandro Hoss
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com


Re: Solr facets counts deep paged returns inconsistent counts

2017-10-20 Thread Yonik Seeley
On Fri, Oct 20, 2017 at 2:22 PM, kenny  wrote:

> Thanks for the clear explanation. A couple of follow up questions
>
> - can we tune overrequesting in json API?
>

Yes, I still need to document it, but you can specify a specific number of
documents to overrequest:
{
  type : field,
  field : cat,
  overrequest : 500
}

Also note that the JSON facet API does not do refinement by default (it's
not always desired).
Add refine:true to the field facet if you do want it.


> - we do see conflicting counts but that's when we have offsets different
> from 0. We have now already tested it in solr 6.6 with json api. We
> sometimes get the same value in different offsets: for example the range of
> constraints [0,500] and [500,1000] might contain the same constraint.
>

That can happen with both regular faceting and with the JSON Facet API
(deeper paging "discoveres" a new constraint which ranks higher).
Regular faceting does more overrequest by default, and does refinement by
default.  So adding refine:true and a deeper overrequest for json facets
should perform equivalently.

 -Yonik

Kenny
>
> On 20-10-17 17:12, Yonik Seeley wrote:
>
> Facet refinement in Solr guarantees that counts for returned
> constraints are correct, but does not guarantee that the top N
> returned isn't missing a constraint.
>
> Consider the following shard counts (3 shards) for the following
> constraints (aka facet values):
> constraintA: 2 0 0
> constraintB: 0 2 0
> constraintC: 0 0 2
> constraintD: 1 1 1
>
> Now for simplicity consider facet.limit=1:
> Phase 1: retrieve the top 1 facet counts from all 3 shards (this gets
> back A=2,B=2,C=2)
> Phase 2: refinement: retrieve counts for A,B,C for any shard that did
> not contribute to the count in Phase 1: (for example we ask shard2 and
> shard3 for the count of A)
> The counts are all correct, but we missed "D" because it never
> appeared in Phase #1
>
> Solr actually has overrequesting in the first phase to reduce the
> chances of this happening (i.e. it won't actually happen with the
> exact scenario above), but it can still happen.
>
> You can increase the overrequest amount 
> (see https://lucene.apache.org/solr/guide/6_6/faceting.html)
> Or use streaming expressions or the SQL that goes on top of that in
> the latest Solr releases.
>
> -Yonik
>
>
> On Fri, Oct 20, 2017 at 10:19 AM, kenny  
>  wrote:
>
> Hi all,
>
> When we run some 'deep' facet counts (eg facet values from 0 to 500 and then
> from 500 to 1000), we see small but disturbing difference in counts between
> the two (for example last count on first batch 165, first count on second
> batch 167)
> We run this on solr 5.3.1 in cloud mode (3 shards) in non-json facet module
> Anyone seen this before? I could not find any bug reported like this.
>
> Thanks
>
> Kenny
>
>
>
> --
>
> ONTOFORCE
> Kenny Knecht, PhD
> CTO and technical lead
> +32 486 75 66 16
> ke...@ontoforce.com
> www.ontoforce.com
> Meetdistrict, Ottergemsesteenweg-Zuid 808, 9000 Gent, Belgium
> 
> CIC, One Broadway, MA 02142 Cambridge, United States
>


Re: Solr facets counts deep paged returns inconsistent counts

2017-10-20 Thread kenny

Thanks for the clear explanation. A couple of follow up questions

- can we tune overrequesting in json API?

- we do see conflicting counts but that's when we have offsets different 
from 0. We have now already tested it in solr 6.6 with json api. We 
sometimes get the same value in different offsets: for example the range 
of constraints [0,500] and [500,1000] might contain the same constraint.



Kenny


On 20-10-17 17:12, Yonik Seeley wrote:

Facet refinement in Solr guarantees that counts for returned
constraints are correct, but does not guarantee that the top N
returned isn't missing a constraint.

Consider the following shard counts (3 shards) for the following
constraints (aka facet values):
constraintA: 2 0 0
constraintB: 0 2 0
constraintC: 0 0 2
constraintD: 1 1 1

Now for simplicity consider facet.limit=1:
Phase 1: retrieve the top 1 facet counts from all 3 shards (this gets
back A=2,B=2,C=2)
Phase 2: refinement: retrieve counts for A,B,C for any shard that did
not contribute to the count in Phase 1: (for example we ask shard2 and
shard3 for the count of A)
The counts are all correct, but we missed "D" because it never
appeared in Phase #1

Solr actually has overrequesting in the first phase to reduce the
chances of this happening (i.e. it won't actually happen with the
exact scenario above), but it can still happen.

You can increase the overrequest amount (see
https://lucene.apache.org/solr/guide/6_6/faceting.html)
Or use streaming expressions or the SQL that goes on top of that in
the latest Solr releases.

-Yonik


On Fri, Oct 20, 2017 at 10:19 AM, kenny  wrote:

Hi all,

When we run some 'deep' facet counts (eg facet values from 0 to 500 and then
from 500 to 1000), we see small but disturbing difference in counts between
the two (for example last count on first batch 165, first count on second
batch 167)
We run this on solr 5.3.1 in cloud mode (3 shards) in non-json facet module
Anyone seen this before? I could not find any bug reported like this.

Thanks

Kenny



--

ONTOFORCE  
Kenny Knecht, PhD
CTO and technical lead
+32 486 75 66 16 
ke...@ontoforce.com 
www.ontoforce.com 

Meetdistrict, Ottergemsesteenweg-Zuid 808, 9000 Gent, Belgium
CIC, One Broadway, MA 02142 Cambridge, United States



Re: Solr nodes going into recovery mode and eventually failing

2017-10-20 Thread shamik
Zisis, thanks for chiming in. This is really interesting information and
probably in line with what I'm trying to fix. In my case, the facet fields are
certainly not high-cardinality ones. Most of them have a finite set of data,
the max being 200 (though it has a low usage percentage). Earlier I had
facet.limit=-1, but then scaled down to 200 to eliminate any performance
overhead.

I was not aware of the maxRamMB parameter; it looks like it's only available
for queryResultCache. Is that what you are referring to? Can you please share
your cache configuration?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


LTR feature extraction performance issues

2017-10-20 Thread Brian Yee
I enabled LTR feature extraction and response times spiked. I suppose that was 
to be expected, but are there any tips regarding performance? I have the 
feature values cache set up as described in the docs:



Do I simply have to wait for the cache to fill up and hope that response times 
go down? Should I make these cache values bigger?


Brian


Re: OOM during indexing with 24G heap - Solr 6.5.1

2017-10-20 Thread Shawn Heisey
On 10/16/2017 5:38 PM, Randy Fradin wrote:
> Each shard has around 4.2 million documents which are around 40GB on disk.
> Two nodes have 3 shard replicas each and the third has 2 shard replicas.
>
> The text of the exception is: java.lang.OutOfMemoryError: Java heap space
> And the heap dump is a full 24GB indicating the full heap space was being
> used.
>
> Here is the solrconfig as output by the config request handler:

I was hoping for the actual XML, but I don't see any red flags in the
output you've provided.  It does look like you've probably got a very
minimal configuration.  Some things that I expected to see (and do see
on my own systems) aren't in the handler output at all.

With only 12 million docs on the machine, I would not expect any need
for 24GB of heap except in the case of a large number of particularly
RAM-hungry complex queries.  The ratio of index size to document count
says that the documents are bigger than what I think of as typical, but
not what I would call enormous.  If there's any way you can adjust your
schema to remove unused parts and reduce the index size, that would be a
good idea, but I don't consider that to be an immediate action item. 
Your index size is well within what Solr should be able to handle easily
-- if there are sufficient system resources, memory in particular.

The 6.5.1 version of Solr that you're running should have most known
memory leak issues fixed -- and there are not many of those.  I'm not
aware of any leak problems that would affect Lucene's DocumentsWriter
class, where you said most of the heap was being consumed.  That doesn't
necessarily mean there isn't a leak bug that applies, just that I am not
aware of any.

You have indicated that you're doing a very large number of concurrent
update requests, up to 240 at the same time.  I cannot imagine a
situation where Lucene would require a buffer (100 MB in your config)
for every indexing thread.  That would really cause some major memory
issues with Lucene and Solr installations.
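
The buffer I'm referring to is the ramBufferSizeMB setting in the
indexConfig section of solrconfig.xml, presumably something like this in
your case:

  <indexConfig>
    <!-- RAM Lucene may fill with buffered documents before flushing a segment -->
    <ramBufferSizeMB>100</ramBufferSizeMB>
  </indexConfig>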

Your description of what you have in your heap sounds a little bit
different than a buffer per indexing thread.  It sounds like your
indexing has resulted in a LOT of flushes, which is probably normal,
except that the flush queue doesn't appear to be getting emptied.  If
I'm right, either your indexing is happening faster than Lucene can
flush the segments that get built, or there is something preventing
Lucene from actually doing the flush.  I do not see any indication in
the code that Lucene ever imposes a limit on the number of queued
flushes, but in a system that's working correctly, it probably doesn't
have to.  My theories here should be validated by somebody who has much
better insight into Lucene than I do.

I'm interested in seeing some details about the system and the processes
running.  What OS is this running on?  If it's something other than
Windows, you probably have the "top" utility installed.  The gnu version
of top has a keyboard shortcut (shift-M) to sort by memory usage.  If
it's available, run top (not htop or any other variant), press the key
to sort by memory, and grab a screenshot.

On recent versions of Windows, there's a program called Resource
Monitor.  If you're on Windows, run that program, click on the memory
tab, sort by Private, make sure that the memory graph and MB counts
below the process list are fully visible, and grab a screenshot.

It is unlikely that you'll be able to send a screenshot image to the
list, so you'll probably need a file sharing website.

Thanks,
Shawn



Re: Solr nodes going into recovery mode and eventually failing

2017-10-20 Thread shamik
Thanks Erick, in my case, each replica is running on its own JVM, so even if
we consider 8GB of filter cache, it still has 27GB to play with. Isn't that
a decent amount of memory to handle the rest of the JVM operation?

Here's an example of the implicit filters that get applied to almost all the
queries. Except for Source2 and AccessMode, the rest of the fields have
docValues enabled. Our sorting is done mostly on relevance, so there's little
impact there.

fq=language:("english")=ContentGroup:"Learn & Explore" OR "Getting
Started" OR "Troubleshooting" OR "Downloads")=Source2:("Help" OR
"documentation" OR "video" OR (+Source2:"discussion" +Solution:"yes") OR
"sfdcarticles" OR "downloads" OR "topicarticles" OR "screen" OR "blog" OR
"simplecontent" OR "auonline" OR "contributedlink" OR "collection") AND
-workflowparentid:[*+TO+*] AND -AccessMode:"internal" AND -AccessMode:"beta"
AND -DisplayName:Partner AND -GlobalDedup:true AND -Exclude:"knowledge" AND
-Exclude:"all" =recip(ms(NOW/DAY,PublishDate),3.16e-11,1,1)^1.0

As you can see, there's a bunch, so the filter cache is sort of important for
us from a performance standpoint. The hit ratio of 25% is abysmal, and I don't
think there are too many unique queries contributing to this. As I
mentioned earlier, increasing the size parameter does improve the hit
count. Just wondering, what are the best practices around scenarios like
this? Looks like I've pretty much exhausted my options :-).



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr nodes going into recovery mode and eventually failing

2017-10-20 Thread Zisis T.
I'll post my experience too; I believe it might be related to the low
FilterCache hit ratio issue. Please let me know if you think I'm off topic
here and should create a separate thread.

I've run search stress tests on 2 different Solr 6.5.1 installations,
sending distributed search queries with facets (using facet.field; the
faceted fields have docValues="true").

*1-shard*
1 shard of ~200GB/8M docs
FilterCache with default 512 entries. 90% hit ratio

*2-shards*
2 shards of ~100GB/4M docs each
FilterCache with default 512 entries. 10% hit ratio. Huge drop in search
performance

I noticed that the majority of the FilterCache entries look like "filters on
facet terms", and instead of a FixedBitSet whose size is equal to the # of
docs in the shard, each entry contains an int[] of the matching docids:

  Key   Value
-
FacetField1:Term1   ->  int[] of matching docids
FacetField1:Term2   ->  int[] of matching docids
FacetField2:Term3   ->  int[] of matching docids
FacetField2:Term4   ->  int[] of matching docids

Given that Field1 and Field2 are high-cardinality fields, there are too many
keys in the cache, but with few matched documents in most cases.
Since the cache values therefore do not need much memory, I ended up
using *maxRamMB*=120, which in my case gives an ~80% hit ratio, allowing more
entries in the cache and better control over consumed memory.
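
For illustration, a RAM-capped filterCache entry in solrconfig.xml looks
roughly like this (the cache class and the exact numbers are illustrative,
not a recommendation):

  <filterCache class="solr.LRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"
               maxRamMB="120"/>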

This has been previously discussed here too
http://lucene.472066.n3.nabble.com/Filter-cache-pollution-during-sharded-edismax-queries-td4074867.html#a4162151

Is this "overuse" of FilterCache normal in distributed search? 



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Upload/update full schema and solrconfig in standalone mode

2017-10-20 Thread Rick Leir
Alessandro
First, let me say that the whole idea makes me nervous.
1/ are you better off with scp? I would not want to do this via Solr API
2/ the right way to do this is with Ansible, Puppet or Docker, 
3/ would you like to update a 'QA' installation, test it, then flip it into 
production? Cheers -- Rick

On October 20, 2017 8:49:14 AM EDT, Alessandro Hoss  wrote:
>Hello,
>
>Is it possible to upload the entire schema and solrconfig.xml to a Solr
>running on standalone mode?
>
>I know about the Config API
>, but it
>allows
>only add or modify solrconfig properties, and what I want is to change
>the
>whole config (schema and solrconfig) to ensure it's up to date.
>
>What I need is something similar to the Configsets API
>, where
>I'm
>able to upload a zip containing both schema and solrconfig.xml, but
>unfortunately it's SolrCloud only.
>
>Is there a way of doing that in standalone mode?
>
>Thanks in advance.
>Alessandro Hoss

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: E-Commerce Search: tf-idf, tie-break and boolean model

2017-10-20 Thread Walter Underwood
Setting mm to 100% means that any misspelled word in a query means zero 
results. That is not a good experience. Usually, 10% of queries contain a 
misspelling.

Set mm to 1.
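
In edismax request handler defaults that is just (the qf fields here are
placeholders, not a recommendation):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="qf">name^3 description</str>
      <str name="mm">1</str>
      <str name="tie">0.1</str>
    </lst>
  </requestHandler>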

The F-measure is not a good choice for this because recall is not very 
important in e-commerce. Use precision-oriented measures. P@3 is a good start. 
If there is usually exactly one correct answer (this was true when I did search 
at Netflix), MRR is a better choice. That measures the position of the first 
relevant result.

https://techblog.chegg.com/2012/12/12/measuring-search-relevance-with-mrr/

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 20, 2017, at 1:05 AM, Vincenzo D'Amore  wrote:
> 
> Thanks for all the info, I really appreciate your help. I'm working on the
> configuration and following your suggestions.
> 
> We already had a golden set of query-results pairs (~1000) used to tune and
> check how my application (and Solr configuration) performs.
> But I've to entirely double check if this set is still relevant.
> The results of each query are used to calculate F1.
> 
> Nevertheless, having this base of tests le me able to try few rounds adding
> and removing custom similarity, changing the tie configuration and so on
> and so forth.
> 
> Now I want share with you my results:
> 
> - I've just set mm=100%
> 
> - TF - set as constant 1.0 - slight improvement in search results,
> basically it seems perform better when there are few products that are
> almost identical, but some of them have the same keyword repeated many
> times. For example a product "iphone charger for iphone 5, iphone
> 5s, iphone 6" versus a product "iphone charge"
> 
> - IDF - set as constant 1.0 - the results were not catastrophic but, for
> sure, worse than having default similarity. So I've roll backed this
> change, it seems to me the results are flattened too much.
> 
> - tie - I've just tried 0.1 and 1.0, at moment 1.0 seems to perform better.
> But not sure why.
> 
> I want try to add some relevant fields (tags, categories) in order to the
> have more chances to match the correct results.
> 
> Best regards,
> Vincenzo
> 
> On Tue, Oct 17, 2017 at 11:38 PM, Walter Underwood 
> wrote:
> 
>> That page from Stanford is not about e-commerce search. Westlaw is
>> professional librarian search.
>> 
>> I agree with Emir’s advice. Start with edismax. Use a small value for the
>> tie-breaker. It is one of the least important configuration values. I use
>> the default from the sample configs:
>> 
>>   0.1
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Oct 16, 2017, at 1:53 AM, Emir Arnautović <
>> emir.arnauto...@sematext.com> wrote:
>>> 
>>> Hi Vincenzo,
>>> Unless you have really specific ranking requirements, I would not
>> suggest you to start with you proprietary similarity implementation. In
>> most cases edismax will be good enough to cover your requirements. It is
>> not an easy task to tune edismax since it has a lot of knobs that you can use.
>>> In general there are two approaches that you can use: Create a golden
>> set of query-results pairs and use it with some metric (e.g. you can start
>> with simple F-measure) and tune parameters to maximize metric. The
>> alternative approach (complements the first one) is to let user use your
>> search, track clicks and monitor search metrics like mean reciprocal rank,
>> zero result queries, page depth etc. and tune queries to get better
>> results. If you can do A/B testing, you can use that as well to see which
>> changes are better.
>>> In most cases, this is iterative process and you should not expect to
>> get it right the first time and that you will be able to tune it to cover
>> all cases.
>>> 
>>> Good luck!
>>> 
>>> HTH,
>>> Emir
>>> 
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>> 
>>> 
>>> 
 On 16 Oct 2017, at 10:30, Vincenzo D'Amore  wrote:
 
 Hi all,
 
 I'm trying to figure out how to tune Solr for an e-commerce search.
 
 I want to share with you what I did in the hope to understand if I was
 right and, if there, I could also improve my configuration.
 
 I also read that the boolean model has to be preferred in this case.
 
 https://nlp.stanford.edu/IR-book/html/htmledition/the-extend
>> ed-boolean-model-versus-ranked-retrieval-1.html
 
 
 So, I first wrote my own implementation of DefaultSimilarity returning
 constantly 1.0 for TF and IDF.
 
 Now I'm struggling to understand how to configure tie-break parameter,
>> my
 opinion was to configure it to 0.1 or 0.0, thats because, if I
>> understood
 well, in this way the boolean model should be preferred, that's because
only the maximum scoring subquery contributes to final score.

Re: Solr facets counts deep paged returns inconsistent counts

2017-10-20 Thread Yonik Seeley
Facet refinement in Solr guarantees that counts for returned
constraints are correct, but does not guarantee that the top N
returned isn't missing a constraint.

Consider the following shard counts (3 shards) for the following
constraints (aka facet values):
constraintA: 2 0 0
constraintB: 0 2 0
constraintC: 0 0 2
constraintD: 1 1 1

Now for simplicity consider facet.limit=1:
Phase 1: retrieve the top 1 facet counts from all 3 shards (this gets
back A=2,B=2,C=2)
Phase 2: refinement: retrieve counts for A,B,C for any shard that did
not contribute to the count in Phase 1: (for example we ask shard2 and
shard3 for the count of A)
The counts are all correct, but we missed "D" because it never
appeared in Phase #1

Solr actually has overrequesting in the first phase to reduce the
chances of this happening (i.e. it won't actually happen with the
exact scenario above), but it can still happen.

You can increase the overrequest amount (see
https://lucene.apache.org/solr/guide/6_6/faceting.html)
Or use streaming expressions or the SQL that goes on top of that in
the latest Solr releases.

-Yonik


On Fri, Oct 20, 2017 at 10:19 AM, kenny  wrote:
> Hi all,
>
> When we run some 'deep' facet counts (eg facet values from 0 to 500 and then
> from 500 to 1000), we see small but disturbing difference in counts between
> the two (for example last count on first batch 165, first count on second
> batch 167)
> We run this on solr 5.3.1 in cloud mode (3 shards) in non-json facet module
> Anyone seen this before? I could not find any bug reported like this.
>
> Thanks
>
> Kenny


SynonymFilterFactory deprecated

2017-10-20 Thread Vincenzo D'Amore
Hi all,

I see that in Solr the SynonymFilterFactory is deprecated:

https://lucene.apache.org/core/7_1_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymFilterFactory.html

the documentation suggests:

> Use SynonymGraphFilterFactory instead, but be sure to also use
> FlattenGraphFilterFactory at index time (not at search time) as well.


On the other hand, the documentation also says FlattenGraphFilterFactory is
experimental and might change in incompatible ways in the next release.

Not sure what to do in this case. It's not clear what
FlattenGraphFilterFactory does and why I should have it after the
SynonymGraphFilterFactory.

And again, if I have many SynonymGraphFilterFactory instances at index time,
may I have only one FlattenGraphFilterFactory at the end of the chain, or
should I add a FlattenGraphFilterFactory for each SynonymGraphFilterFactory
found in the chain?
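
To make the question concrete, a minimal index/query analyzer layout of the
kind the javadoc seems to suggest would be (the field type name and synonyms
file are just placeholders):

  <fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
      <!-- flattens the token graph so it can be written to the index -->
      <filter class="solr.FlattenGraphFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>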

Thanks for your time and best regards,
Vincenzo


Re: solr core replication

2017-10-20 Thread Erick Erickson
Does that persist even after you restart Solr on the target cluster?

And that clears up one bit of confusion I had: I didn't know how you
were having each shard on the target cluster use a different master URL
given they all use the same solrconfig file. I was guessing some magic with
system variables, but it turns out you were way ahead of me and
not configuring the replication in solrconfig at all.

But no, I know of no API-level command that does what you're asking.
I also don't know where that data is persisted; I'm afraid you'll have to go
code-diving for all the help I can be.

Using fetchindex this way in SolrCloud is something of an edge case. It'll
probably be around forever since replication is used as a fall-back when
a replica syncs, but there'll be some bits like this hanging around I'd guess.

Best,
Erick

On Thu, Oct 19, 2017 at 11:55 PM, Hendrik Haddorp
 wrote:
> Hi Erick,
>
> that is actually the call I'm using :-)
> If you invoke
> http://solr_target_machine:port/solr/core/replication?command=details after
> that you can see the replication status. But even after a Solr restart the
> call still shows the replication relation and I would like to remove this so
> that the core looks "normal" again.
>
> regards,
> Hendrik
>
> On 20.10.2017 02:31, Erick Erickson wrote:
>>
>> Little known trick:
>>
>> The fetchIndex replication API call can take any parameter you specify
>> in your config. So you don't have to configure replication at all on
>> your target collection, just issue the replication API command with
>> masterUrl, something like:
>>
>>
>> http://solr_target_machine:port/solr/core/replication?command=fetchindex=http://solr_source_machine:port/solr/core
>>
>> NOTE, "core" above will be something like collection1_shard1_replica1
>>
>> During the fetchindex, you won't be able to search on the target
>> collection although the source will be searchable.
>>
>> Now, all that said this is just copying stuff. So let's say you've
>> indexed to your source cluster and set up your target cluster (but
>> don't index anything to the target or do the replication etc). Now if
>> you shut down the target cluster and just copy the entire data dir
>> from each source replica to each target replica then start all the
>> target Solr instances up you'll be fine.
>>
>> Best,
>> Erick
>>
>> On Thu, Oct 19, 2017 at 1:33 PM, Hendrik Haddorp
>>  wrote:
>>>
>>> Hi,
>>>
>>> I want to transfer a Solr collection from one SolrCloud to another one.
>>> For
>>> that I create a collection in the target cloud using the same config set
>>> as
>>> on the source cloud but with a replication factor of one. After that I'm
>>> using the Solr core API with a "replication?command=fetchindex" command
>>> to
>>> transfer the data. In the last step I'm increasing the replication
>>> factor.
>>> This seems to work fine so far. When I invoke
>>> "replication?command=details"
>>> I can see my replication setup and check if the replication is done. In
>>> the
>>> end I would like to remove this relation again but there does not seem to
>>> be
>>> an API call for that. Given that the replication should be a one time
>>> replication according to the API on
>>> https://lucene.apache.org/solr/guide/6_6/index-replication.html this
>>> should
>>> not be a big problem. It just does not look clean to me to leave this in
>>> the
>>> system. Is there anything I'm missing?
>>>
>>> regards,
>>> Hendrik
>
>


Solr facets counts deep paged returns inconsistent counts

2017-10-20 Thread kenny

Hi all,

When we run some 'deep' facet counts (eg facet values from 0 to 500 and 
then from 500 to 1000), we see small but disturbing differences in counts 
between the two (for example last count on first batch 165, first count 
on second batch 167)

We run this on solr 5.3.1 in cloud mode (3 shards) in non-json facet module
Anyone seen this before? I could not find any bug reported like this.

Thanks

Kenny


Upload/update full schema and solrconfig in standalone mode

2017-10-20 Thread Alessandro Hoss
Hello,

Is it possible to upload the entire schema and solrconfig.xml to a Solr
running on standalone mode?

I know about the Config API, but it allows only adding or modifying
solrconfig properties, and what I want is to change the whole config
(schema and solrconfig) to ensure it's up to date.

What I need is something similar to the Configsets API, where I'm able to
upload a zip containing both schema and solrconfig.xml, but unfortunately
it's SolrCloud-only.

Is there a way of doing that in standalone mode?

Thanks in advance.
Alessandro Hoss


Re: Goal: reverse chronological display Methods? (1) boost, and/or (2) disable idf

2017-10-20 Thread Rick Leir
Bill,
In the debug score calculations, the bf boosting does not appear at all. I 
would expect it to at least show up with a small value. So maybe we need to 
look at the query. 
Cheers -- Rick
-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Solr json facet API contains option

2017-10-20 Thread kenny

Hi,

I don't seem to find a 'contains' (with or without ignorecase) in the 
available descriptions of the JSON facet API. Is that because there is 
none? Or is it just not adequately described? For example, in the 
official ref guide for 6.6 or 7.0 there is no mention of this feature. 
Is it production-ready? Where can I find an up-to-date description? 
Right now my only resource is http://yonik.com/json-facet-api/



Thanks


Kenny



Re: LTR features and searching for field using multiple words

2017-10-20 Thread Dariusz Wojtas
I have found a solution based on Yonik's post in this thread:
http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/95646
The answer was to surround the searched value with double quotes. In my
case they had to be escaped, because there are already quotes in the SOLR
feature definition.
The working feature definition is as follows:

{
  "store": "store_incidentDB",
  "name": "scoreFullAddressStreet",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params":{ "q": "{!parent which='type:entity' score='max'
v='keyword_address:\"${fullAddressStreet}\"'}" }
}

Best regards,
Dariusz Wojtas


On Fri, Oct 20, 2017 at 11:49 AM, Dariusz Wojtas  wrote:

> Hi,
>
> Recently I work with LTR features.
> In some of these features I use the block join parent parser.
> It works as expected until I pass multi-word value into the query.
> I have a parameter called 'fullAddressStreet' and it
> - works when I pass value 'something'
> - does not work if I pass value 'something street' or 'something 17B'.
>
> It fails with:
>   java.lang.RuntimeException: Error while creating weights in LTR:
>   java.lang.RuntimeException: Exception from createWeight for SolrFeature
>   [name=scoreFullAddressStreet,
> params={q={!parent which='type:entity' score='max'}keyword_address:${
> fullAddressStreet}}]
> no field name specified in query and no default specified via 'df' param
>
> I have tried different variants:
>{!parent which='type:entity' score='max'}keyword_address:${
> fullAddressStreet}
>{!parent which='type:entity' score='max' v='keyword_address:${
> fullAddressStreet}'}
>{!parent which='type:entity' score='max' df='keyword_address'
> v='keyword_address:${fullAddressStreet}'}
>
> When using in LTR feature, all query definition is surrounded with double
> quotes.
> All these variants work when I ask for single word, fail the same way with
> multiple words.
>
> The 'keyword_address' field definition is:
>  stored="false" multiValued="true"/>
>
> and the value is copied from other fields:
> 
> 
> 
>
> How do I correctly use this parser?
> Character escaping? How?
>
> Best regards,
> Dariusz Wojtas
>


[ANNOUNCE] Luke 7.1.0 released

2017-10-20 Thread Tomoko Uchida
Download the release zip here:

https://github.com/DmitryKey/luke/releases/tag/luke-7.1.0

Upgrade to Lucene 7.1.0.
and, other changes in this release:


-- 
Tomoko Uchida


LTR features and searching for field using multiple words

2017-10-20 Thread Dariusz Wojtas
Hi,

Recently I work with LTR features.
In some of these features I use the block join parent parser.
It works as expected until I pass multi-word value into the query.
I have a parameter called 'fullAddressStreet' and it
- works when I pass value 'something'
- does not work if I pass value 'something street' or 'something 17B'.

It fails with:
  java.lang.RuntimeException: Error while creating weights in LTR:
  java.lang.RuntimeException: Exception from createWeight for SolrFeature
  [name=scoreFullAddressStreet,
params={q={!parent which='type:entity'
score='max'}keyword_address:${fullAddressStreet}}]
no field name specified in query and no default specified via 'df' param

I have tried different variants:
   {!parent which='type:entity'
score='max'}keyword_address:${fullAddressStreet}
   {!parent which='type:entity' score='max'
v='keyword_address:${fullAddressStreet}'}
   {!parent which='type:entity' score='max' df='keyword_address'
v='keyword_address:${fullAddressStreet}'}

When using in LTR feature, all query definition is surrounded with double
quotes.
All these variants work when I ask for single word, fail the same way with
multiple words.

The 'keyword_address' field definition is:


and the value is copied from other fields:




How do I correctly use this parser?
Character escaping? How?

Best regards,
Dariusz Wojtas


Re: Measuring time spent in analysis and writing to index

2017-10-20 Thread Zisis T.
Another thing you can do - and which has helped me in the past quite a few
times - is to just run JVisualVM, attach to Solr's Java process and enable
the CPU sampler under the Sampler tab.

As you run indexing, the methods that most time is spent in will appear near
the top.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Concern on solr commit

2017-10-20 Thread Emir Arnautović
Hi Leo,
If you gracefully shut down Solr documents will be committed.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 20 Oct 2017, at 08:44, Leo Prince  wrote:
> 
> Thank you Yonik.
> 
> Since we are using SoftCommits, the docs written will be in RAM until a
> AutoCommit to reflect onto Disk, I just wanted to know what happens when
> Solr restarts. Being said, I am using 4.10 and tomcat is handling the Solr,
> when we restart the tomcat service just before an AutoCommit, what happens
> to the temporary soft written docs which is in RAM. Will they gracefully
> write to the disk before restart or should I have to do
> "/solr/update?commit=true" manually every time before restarting Solr..?
> 
> On Wed, Oct 18, 2017 at 6:08 PM, Yonik Seeley  wrote:
> 
>> On Wed, Oct 18, 2017 at 5:09 AM, Leo Prince
>>  wrote:
>>> Is there any known negative impacts in setting up autoSoftCommit as 1
>>> second other than RAM usage..?
>> 
>> Briefly:
>> Don't use autowarming (but keep caches enabled!)
>> Use docValues for fields you will facet and sort on (this will avoid
>> using FieldCache)
>> 
>> -Yonik
>> 



Sort by field from another collection

2017-10-20 Thread Dmitry Gerasimov
Hi!

I have one main collection of people and a few more collections with
additional data. All search queries are on the main collection with
joins to one or more additional collections. A simple example would
be:

(*:* {!join from=people_person_id to=people_person_id
fromIndex=fundraising_donor_info v='total_donations_1y: [1000 TO
2000]'})


I need to sort results by fields from additional collections (e.g.
"total_donations_1y”) . Is there any way to do that through the common
query parameters? Or the only way is using streaming expressions?

Dmitry


Re: ClassicAnalyzer Behavior on accent character

2017-10-20 Thread Chitra
Hi,
 So, is it not advisable to use ClassicTokenizer and ClassicAnalyzer?

On Thu, Oct 19, 2017 at 8:29 PM, Erick Erickson 
wrote:

> Have you looked at the specification to see how it's _supposed_ to work?
>
> From the javadocs:
> "implements Unicode text segmentation, * as specified by UAX#29."
>
> See http://unicode.org/reports/tr29/#Word_Boundaries
>
> If you look at the spec and feel that ClassicAnalyzer incorrectly
> implements the word break rules then perhaps there's a JIRA.
>
> Best,
> Erick
>
> On Thu, Oct 19, 2017 at 6:39 AM, Chitra  wrote:
> > Hi,
> >   I indexed a term 'ⒶeŘꝋꝒɫⱯŋɇ' (aeroplane) and the term was
> > indexed as "er l n", some characters were trimmed while indexing.
> >
> > Here is my code
> >
> > protected Analyzer.TokenStreamComponents createComponents(final String
> > fieldName, final Reader reader)
> > {
> > final ClassicTokenizer src = new ClassicTokenizer(getVersion(),
> > reader);
> > src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
> >
> > TokenStream tok = new ClassicFilter(src);
> > tok = new LowerCaseFilter(getVersion(), tok);
> > tok = new StopFilter(getVersion(), tok, stopwords);
> > tok = new ASCIIFoldingFilter(tok); // to enable AccentInsensitive
> > search
> >
> > return new Analyzer.TokenStreamComponents(src, tok)
> > {
> > @Override
> > protected void setReader(final Reader reader) throws
> IOException
> > {
> >
> > src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
> > super.setReader(reader);
> > }
> > };
> > }
> >
> >
> > Am I missing anything? Is that expected behavior for my input or any
> reason
> > behind such abnormal behavior?
> >
> >
> > --
> > Regards,
> > Chitra
>



-- 
Regards,
Chitra


Re: E-Commerce Search: tf-idf, tie-break and boolean model

2017-10-20 Thread Vincenzo D'Amore
Thanks for all the info, I really appreciate your help. I'm working on the
configuration and following your suggestions.

We already had a golden set of query-results pairs (~1000) used to tune and
check how my application (and Solr configuration) performs.
But I have to entirely double-check whether this set is still relevant.
The results of each query are used to calculate F1.

Nevertheless, having this base of tests lets me try a few rounds of adding
and removing the custom similarity, changing the tie configuration, and so on
and so forth.

Now I want to share my results with you:

- I've just set mm=100%

- TF - set as constant 1.0 - slight improvement in search results;
basically it seems to perform better when there are a few products that are
almost identical, but some of them have the same keyword repeated many
times. For example, a product "iphone charger for iphone 5, iphone
5s, iphone 6" versus a product "iphone charge"

- IDF - set as constant 1.0 - the results were not catastrophic but, for
sure, worse than having the default similarity. So I've rolled back this
change; it seems to me the results are flattened too much.

- tie - I've just tried 0.1 and 1.0; at the moment 1.0 seems to perform better.
But I'm not sure why.

I want to try adding some relevant fields (tags, categories) in order to
have more chances of matching the correct results.

Best regards,
Vincenzo

On Tue, Oct 17, 2017 at 11:38 PM, Walter Underwood 
wrote:

> That page from Stanford is not about e-commerce search. Westlaw is
> professional librarian search.
>
> I agree with Emir’s advice. Start with edismax. Use a small value for the
> tie-breaker. It is one of the least important configuration values. I use
> the default from the sample configs:
>
>0.1
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Oct 16, 2017, at 1:53 AM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
> >
> > Hi Vincenzo,
> > Unless you have really specific ranking requirements, I would not
> suggest you to start with you proprietary similarity implementation. In
> most cases edismax will be good enough to cover your requirements. It is
> not an easy task to tune edismax since it has a lot of knobs that you can use.
> > In general there are two approaches that you can use: Create a golden
> set of query-results pairs and use it with some metric (e.g. you can start
> with simple F-measure) and tune parameters to maximize metric. The
> alternative approach (complements the first one) is to let user use your
> search, track clicks and monitor search metrics like mean reciprocal rank,
> zero result queries, page depth etc. and tune queries to get better
> results. If you can do A/B testing, you can use that as well to see which
> changes are better.
> > In most cases, this is iterative process and you should not expect to
> get it right the first time and that you will be able to tune it to cover
> all cases.
> >
> > Good luck!
> >
> > HTH,
> > Emir
> >
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >
> >
> >> On 16 Oct 2017, at 10:30, Vincenzo D'Amore  wrote:
> >>
> >> Hi all,
> >>
> >> I'm trying to figure out how to tune Solr for an e-commerce search.
> >>
> >> I want to share with you what I did in the hope to understand if I was
> >> right and, if there, I could also improve my configuration.
> >>
> >> I also read that the boolean model has to be preferred in this case.
> >>
> >> https://nlp.stanford.edu/IR-book/html/htmledition/the-extend
> ed-boolean-model-versus-ranked-retrieval-1.html
> >>
> >>
> >> So, I first wrote my own implementation of DefaultSimilarity returning
> >> constantly 1.0 for TF and IDF.
> >>
> >> Now I'm struggling to understand how to configure tie-break parameter,
> my
> >> opinion was to configure it to 0.1 or 0.0, thats because, if I
> understood
> >> well, in this way the boolean model should be preferred, that's because
> >> only the maximum scoring subquery contributes to final score.
> >>
> >> https://lucene.apache.org/solr/guide/6_6/the-dismax-query-
> parser.html#TheDisMaxQueryParser-Thetie_TieBreaker_Parameter
> >>
> >>
> >> Not sure if this could be enough or if you need more information,
> thanks in
> >> advance for anyone would add a bit in this discussion.
> >>
> >> Best regards,
> >> Vincenzo
> >>
>
>


Re: solr core replication

2017-10-20 Thread Hendrik Haddorp

Hi Erick,

that is actually the call I'm using :-)
If you invoke 
http://solr_target_machine:port/solr/core/replication?command=details 
after that you can see the replication status. But even after a Solr 
restart the call still shows the replication relation and I would like 
to remove this so that the core looks "normal" again.


regards,
Hendrik

On 20.10.2017 02:31, Erick Erickson wrote:

Little known trick:

The fetchIndex replication API call can take any parameter you specify
in your config. So you don't have to configure replication at all on
your target collection, just issue the replication API command with
masterUrl, something like:

http://solr_target_machine:port/solr/core/replication?command=fetchindex=http://solr_source_machine:port/solr/core

NOTE, "core" above will be something like collection1_shard1_replica1

During the fetchindex, you won't be able to search on the target
collection although the source will be searchable.

Now, all that said this is just copying stuff. So let's say you've
indexed to your source cluster and set up your target cluster (but
don't index anything to the target or do the replication etc). Now if
you shut down the target cluster and just copy the entire data dir
from each source replica to each target replica then start all the
target Solr instances up you'll be fine.

Best,
Erick

On Thu, Oct 19, 2017 at 1:33 PM, Hendrik Haddorp
 wrote:

Hi,

I want to transfer a Solr collection from one SolrCloud to another one. For
that I create a collection in the target cloud using the same config set as
on the source cloud but with a replication factor of one. After that I'm
using the Solr core API with a "replication?command=fetchindex" command to
transfer the data. In the last step I'm increasing the replication factor.
This seems to work fine so far. When I invoke "replication?command=details"
I can see my replication setup and check if the replication is done. In the
end I would like to remove this relation again but there does not seem to be
an API call for that. Given that the replication should be a one time
replication according to the API on
https://lucene.apache.org/solr/guide/6_6/index-replication.html this should
not be a big problem. It just does not look clean to me to leave this in the
system. Is there anything I'm missing?

regards,
Hendrik




Re: Concern on solr commit

2017-10-20 Thread Leo Prince
Thank you Yonik.

Since we are using SoftCommits, the docs written will be in RAM until a
AutoCommit to reflect onto Disk, I just wanted to know what happens when
Solr restarts. Being said, I am using 4.10 and tomcat is handling the Solr,
when we restart the tomcat service just before an AutoCommit, what happens
to the temporary soft written docs which is in RAM. Will they gracefully
write to the disk before restart or should I have to do
"/solr/update?commit=true" manually every time before restarting Solr..?

On Wed, Oct 18, 2017 at 6:08 PM, Yonik Seeley  wrote:

> On Wed, Oct 18, 2017 at 5:09 AM, Leo Prince
>  wrote:
> > Is there any known negative impacts in setting up autoSoftCommit as 1
> > second other than RAM usage..?
>
> Briefly:
> Don't use autowarming (but keep caches enabled!)
> Use docValues for fields you will facet and sort on (this will avoid
> using FieldCache)
>
> -Yonik
>