Re: Changing Leadership in SolrCloud

2018-03-02 Thread Zahra Aminolroaya
Dear Mr. Shalin,

Yes, I mean the "state" shown in the Cluster State API and the UI.

Let me explain in detail what happened over the past few days:

Suppose I have Collection A distributed across node 1 (the leader), node 2, and
node 3.

I used the following command to block node 1's Solr and ZooKeeper ports
(2888, 3888, 2181, and 4239) from being listened on, running it once per port:

firewall-cmd --remove-port=<port>/tcp --permanent

Node 1's state is still "active", and its leader flag is still "true", in the
Cluster State API response.

The Solr log of node 1 shows:


org.apache.solr.common.SolrException: ClusterState says we are the leader
(:4239/solr/collectionA_shard2_replica1), but locally we don't
think so. Request came from :4239/solr/collectionA_shard4_replica3/
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:658)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:418)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:346)
at ..

The error in node 2's Solr log is:

forwarding update to :4239/solr/collectionA_shard5_replica1/
failed - retrying ... retries: 24 add{,id=121,commitWithin=1000}
params:update.chain=add-unknown-fields-to-the-schema&update.distrib=TOLEADER&distrib.from=node2:4239/solr/collectionA_shard2_replica2/
rsp:503:org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at node1IP:4239/solr/collectionA_shard5_replica1: Service
Unavailable

The error in node 3's Solr log is similar to node 2's.



Unfortunately, today I found that node 4 and node 5, from collections B and
C, went down. The log errors were as follows:

2018-03-01 00:26:46.133 ERROR
(zkCallback-4-thread-28-processing-n:node4IP:4239_solr-EventThread) [   ]
o.a.s.c.ZkController :org.apache.solr.common.SolrException: There was a
problem making a request to the leader
at
org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1551)
at
org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:476)
at org.apache.solr.cloud.ZkController.access$500(ZkController.java:121)
at org.apache.solr.cloud.ZkController$1.command(ZkController.java:338)
at
org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:168)
at
org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:57)
at
org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:142)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)

and 

Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /collections/Collection B/state.json
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1212)
at
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:357)
at
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:354)
at
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
at 
org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:354)
at
org.apache.solr.common.cloud.ZkStateReader.fetchCollectionState(ZkStateReader.java:1110)
at
org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:1096)
... 39 more


I think these errors are related to blocking the ports of node 1.

I would appreciate your help.

Regards,
Zahra
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Rename solr to another name

2018-03-02 Thread Zheng Lin Edwin Yeo
Hi Shawn,

Thanks for the info.

I have managed to change the one that starts Solr, and it's working so far.

Now I'm working on changing things like solr.xml and the JAR files in the
dist directory, like solr-cell-6.5.1.jar to names like my-cell-6.5.1.jar.
Can we change the names of those JAR files as well?

We are customizing it based on the requirements for the project that we are
handling.

Regards,
Edwin

On 3 March 2018 at 08:20, Shawn Heisey  wrote:

> On 3/2/2018 4:07 PM, Zheng Lin Edwin Yeo wrote:
> > Does this mean that we have to recompile some of the JAR files that come
> > with Solr in order for it to work? As they have been hard-coded with
> > things like "solr-webapp"?
>
> I don't see it in any of the java source code, so a recompile wouldn't
> be necessary.  But most of the scripts in the bin directory and other
> places have it hardcoded, including the one that starts Solr.
>
> Once you start down the road of changing these things, you're probably
> going to be forever hunting for things that don't work right and working
> to get them fixed.  Better to just leave it alone and be sure it'll work.
>
> Thanks,
> Shawn
>
>


Re: Filesystems supported by Solr

2018-03-02 Thread Shawn Heisey
On 3/2/2018 2:30 PM, Ritesh Chaman wrote:
> I am trying to deploy Solr on my ADLS subscription. Can you tell me if it
> is tested and compatible.

Walter says that this is storage related to Azure.

https://azure.microsoft.com/en-us/services/data-lake-store/

If this is what you are talking about, that page says they use HDFS
APIs.  Which means that you MIGHT be able to use it with the HDFS
support built into Solr.  You'd have to ask Microsoft what they can support.
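
For what it's worth, if you do end up trying the HDFS route, that support
is enabled through the directory factory in solrconfig.xml.  A minimal
sketch, with a placeholder namenode address (the "Running Solr on HDFS"
page in the reference guide covers the rest of the options):

  <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  </directoryFactory>

  <!-- and inside <indexConfig>: -->
  <lockType>${solr.lock.type:hdfs}</lockType>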

But I can almost guarantee that unless you have plenty of memory in the
system for HDFS to cache that data, it's going to be VERY slow,
just due to the amount of latency involved in accessing data over the
Internet.  Accessing the data after a reboot or a restart of the
particular service that caches the data is probably going to also be
extremely slow, even if you DO have plenty of memory for caching.

You'll see far better performance if you just install enough disk space
in your Solr servers to hold your index data, and install enough memory
that the OS can effectively cache that index data.

Thanks,
Shawn



Re: Performance Implications of Different Routing Schemes

2018-03-02 Thread Shawn Heisey
On 3/2/2018 11:43 AM, Stephen Lewis wrote:
> I'm wondering what information you may be able to provide on performance
> implications of implicit routing VS composite ID routing. In particular,
> I'm curious what the horizontal scaling behavior may be of implicit routing
> or composite ID routing with and without the "/" param appended on.

The hash calculations should probably introduce so little overhead that
you'd never notice it.

I once implemented a hash algorithm using two hash classes built into
Java.  I'm pretty sure that it was NOT a fast implementation ... and it
could calculate over a million hashes per second on input strings of
about 20 characters.

The hash algorithm used by CompositeId (one of the MurmurHash
implementations) is supposed to be one of the fastest algorithms in the
world.  Unless your uniqueId field values are extremely huge, I really
doubt that hash calculation is a significant source of overhead.

The use of implicit doesn't automatically mean there's no overhead for
routing.  The implicit router can still redirect documents to different
shards, it just does it explicitly, usually with a shard name in a
particular field, rather than by hash calculation.
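
For illustration, the compositeId convention is just a prefix on the
uniqueKey value.  A minimal SolrJ sketch (the collection, field, and key
names here are hypothetical):

  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class CompositeIdExample {
    public static void main(String[] args) throws Exception {
      SolrClient client = new HttpSolrClient.Builder(
          "http://localhost:8983/solr/mycollection").build();
      SolrInputDocument doc = new SolrInputDocument();
      // Everything before "!" is the shard key.  Solr hashes the key, so
      // all of customerA's documents land in the same hash range.
      doc.addField("id", "customerA!doc42");
      // With a bits param, e.g. "customerA/2!doc42", only the top 2 bits
      // of the hash come from the key, spreading customerA's documents
      // over roughly a quarter of the collection's hash range.
      doc.addField("name_t", "example");
      client.add(doc);
      client.commit();
      client.close();
    }
  }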

> A relatively simple assessment I've done below leads me to believe the
> following is likely the case: if we have S shards and B as our "/bits"
> param, then resource usage would Big O scale as follows (note: Previously
> I've received the advice that any shard should be capped at a max of 120M
> documents, which is where the cap on docs/shard-key comes from)

Query performance should be about the same for any routing type.  It
does look like when you use compositeId and actually implement shard
keys, you can limit queries to those shards, but a *general* query is
going to hit all shards.
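
As a sketch of how you limit a query that way from SolrJ -- the _route_
parameter is the documented mechanism, the names are hypothetical:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.response.QueryResponse;

  // Only the shard(s) covering customerA's hash range receive this query:
  SolrQuery q = new SolrQuery("name_t:example");
  q.set("_route_", "customerA!");
  QueryResponse rsp = client.query("mycollection", q);
  // Without _route_, this is a general query that fans out to all shards.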

If your query rate is very low (or shards are distributed across a lot
of hardware that has significant spare CPU capacity) performance isn't
going to be dramatically different for a query that hits 2 shards versus
one that hits 6 shards.  If your query rate is high, then more shards
probably will be noticeably slower than fewer shards.

For the maximum docs to allow per shard:  It depends.  For some indexes
and use cases, a million documents per shard might be way too big.  For
others, 500 million per shard might have incredible performance.  There
are no hard rules about this.  It's entirely dependent on what you're
actually doing.

Thanks,
Shawn



Re: Updating documents and commit/rollback

2018-03-02 Thread Shawn Heisey
On 3/2/2018 10:39 AM, Christopher Schultz wrote:
> The problem is that I'm updating the index after my SQL UPDATE(s) have
> run, but before my SQL COMMIT occurs. I have had a problem where the SQL
> fails and rolls-back, but the solrClient is not rolled-back.
>
> I'm a little wary of rolling-back Solr because, as I understand it, the
> client itself doesn't carry any transactional information. That is, it
> should be a shared-resource (within the web application) and indeed,
> other clients could be connecting from other places (like other app
> servers running the same application). Performing either commit() or
> rollback() on the Solr client will commit/rollback *all* writes since
> the last commit, right?

Correct.  Relational databases typically keep track of transactions on
one connection separately from transactions on another connection, and
can roll one of them back without affecting the others.

Solr doesn't have this capability.  The reason that it doesn't have this
capability is that Lucene doesn't have it, and the majority of Solr
functionality is provided by Lucene.

If updates are happening concurrently from multiple sources, then
there's no way to have any kind of meaningful rollback.

I see two solutions:

1) Funnel all updates through a single thread/process, which will not
move on from one update to another until the final decision is made
about that update.  Then rolling back becomes possible, because there is
only one source for updates.  The disadvantage here is that this
thread/process becomes a bottleneck, and performance may suffer
greatly.  Also, it can be a single point of failure.  If the rate of
updates is low, then the bottleneck may not be a problem.
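
A minimal sketch of such a funnel in Java, just to make the idea concrete
(UserUpdate and its two methods are hypothetical placeholders for your own
update type):

  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.LinkedBlockingQueue;

  BlockingQueue<UserUpdate> queue = new LinkedBlockingQueue<>();

  Runnable writer = () -> {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        UserUpdate u = queue.take();          // handle one update at a time
        try {
          u.applyToDatabase();                // SQL UPDATE ... COMMIT
          solrClient.add("users", u.toSolrDoc());
          solrClient.commit("users");
        } catch (Exception e) {
          // Safe only because this thread owns all Solr writes, so the
          // rollback can't undo anyone else's uncommitted work.
          solrClient.rollback("users");
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  };
  new Thread(writer).start();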

2) Have your updating software revert the changes "manually" in
situations where the SQL change is rolled back ... by either deleting
the record or sending another update to change values back to what they
were before.
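
A sketch of that compensation logic (the JDBC handling and helper methods
are hypothetical; the catch block is the point):

  try {
    updateUserInDatabase(conn, user);          // SQL UPDATE(s)
    solrClient.add("users", toSolrDoc(user));  // index the new state
    solrClient.commit("users");
    conn.commit();
  } catch (SQLException e) {
    conn.rollback();
    // Compensate in Solr for the SQL rollback:
    if (previousState != null) {
      solrClient.add("users", toSolrDoc(previousState)); // restore old values
    } else {
      solrClient.deleteById("users", user.getId());      // record was new
    }
    solrClient.commit("users");
  }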

Thanks,
Shawn



Re: solr url control

2018-03-02 Thread Shawn Heisey
On 3/2/2018 10:29 AM, Becky Bonner wrote:
> We are trying to setup one solr server for several applications each with a 
> different collection.  Is there a way to have 2 collections under one 
> folder and the url be something like this:
> http://mysolrinstance.com/solr/myParent1/collection1
> http://mysolrinstance.com/solr/myParent1/collection2
> http://mysolrinstance.com/solr/myParent2
> http://mysolrinstance.com/solr/myParent3

No. I am not aware of any way to set up a hierarchy like this. 
Collections and cores have one identifier for their names.  You could
use myparent1_collection1 as a name.

Implementing such a hierarchy would likely be difficult for
the dev team, and would probably be a large source of bugs for several
releases after it first became available.  I don't think a feature like
this is likely to happen.

Later, you said "We would not want the data from one collection to ever
show up in another collection query."  That's not ever going to happen
unless the software making the query explicitly requests it, and it will
need to know details about the indexes in your Solr server to be able to
do it successfully.  FYI: People who cannot be trusted shouldn't ever
have direct access to your Solr installation.

Are you running SolrCloud?  I ask because if you're not, then the
terminology for each index isn't a "collection" ... it's a core.  This
is a pedantic statement, but you'll get better answers if your
terminology is correct.

If you were running SolrCloud, it would be extremely unlikely for you to
have a directory structure like you describe.  SolrCloud normally
handles all core creation behind the scenes and isn't going to set up a
directory structure like that.

Information about how core discovery works:

https://wiki.apache.org/solr/Core%20Discovery%20%284.4%20and%20beyond%29#Finding_cores

Thanks,
Shawn



Re: Rename solr to another name

2018-03-02 Thread Shawn Heisey
On 3/2/2018 4:07 PM, Zheng Lin Edwin Yeo wrote:
> Does this mean that we have to recompile some of the JAR files that come
> with Solr in order for it to work? As they have been hard-coded with things
> like "solr-webapp"?

I don't see it in any of the java source code, so a recompile wouldn't
be necessary.  But most of the scripts in the bin directory and other
places have it hardcoded, including the one that starts Solr.

Once you start down the road of changing these things, you're probably
going to be forever hunting for things that don't work right and working
to get them fixed.  Better to just leave it alone and be sure it'll work.

Thanks,
Shawn



Re: Filesystems supported by Solr

2018-03-02 Thread Walter Underwood
From a quick google search, ADLS seems like the Azure version of S3. Putting 
Solr indexes on S3 would be unbelievably slow, if it worked at all. Years ago, 
I accidentally put indexes on NFS and it was 100X slower.

Tell us more about what you are trying to do. It is unusual to put Solr indexes 
on anything but a local filesystem. HDFS is the only exception I can think of.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 2, 2018, at 1:30 PM, Ritesh Chaman  wrote:
> 
> Hi team
> 
> I am trying to deploy Solr on my ADLS subscription. Can you tell me if it
> is tested and compatible.
> 
> Regards
> 
> On Tue, Feb 20, 2018 at 2:22 PM, Ritesh Chaman 
> wrote:
> 
>> Hi team
>> 
>> May I know which filesystems are supported by Solr, e.g. ADLS, WASB,
>> S3, etc.? Thanks.
>> 
>> Ritesh
>> 



Re: Filesystems supported by Solr

2018-03-02 Thread Ritesh Chaman
Hi team

I am trying to deploy Solr on my ADLS subscription. Can you tell me if it
is tested and compatible.

Regards

On Tue, Feb 20, 2018 at 2:22 PM, Ritesh Chaman 
wrote:

> Hi team
>
> May I know which filesystems are supported by Solr, e.g. ADLS, WASB,
> S3, etc.? Thanks.
>
> Ritesh
>


Re: Rename solr to another name

2018-03-02 Thread Zheng Lin Edwin Yeo
Hi Shawn,

Thanks for the reply.

Regarding this:
>  The scripts included with Solr have this path hardcoded, and for that
reason, Solr probably won't even start without manual script edits if the
webapp directory is changed

Does this mean that we have to recompile some of the JAR files that come
with Solr in order for it to work? As they have been hard-coded with things
like "solr-webapp"?

Regards,
Edwin

On 3 March 2018 at 00:31, Shawn Heisey  wrote:

> On 3/2/2018 8:27 AM, Zheng Lin Edwin Yeo wrote:
>
>> Are we able to rename the folder name like solr-webapp, or names like
>> solr-jetty-context.xml, to customised names like my-webapp and
>> my-jetty-context.xml?
>>
>> I'm currently using Solr 6.5.1, and will upgrade to Solr 7.2.1 soon.
>>
>
> When people start wanting to customize internal details to this level, I
> seriously have to ask "why?"  The dev team has spent a lot of time and
> effort coming up with these configurations. They are not designed to be
> customizable.  Some of them certainly CAN be customized, with a lot of
> manual work.
>
> These particular questions involve Jetty, not so much Solr.  From what I
> can tell, the solr-jetty-context.xml file can be renamed to anything, it's
> the fact that it's in the "contexts" directory that makes Jetty read it.  I
> do not know whether it needs an xml extension ... I would preserve that if
> it were me.
>
> The "solr-webapp" directory location is defined within the
> solr-jetty-context.xml file, so if you change that, you're going to need to
> edit the context file in order to keep Solr working. The scripts included
> with Solr have this path hardcoded, and for that reason, Solr probably
> won't even start without manual script edits if the webapp directory is
> changed.
>
> Anticipating another question you might have:  Another piece of
> information in solr-jetty-context.xml is the context path -- set to "/solr"
> -- this is the first part of the URL path on all API calls.  You can change
> this, and Solr itself will still work properly, but a lot of things have
> this path hard-coded (like the scripts, certain Java libraries like
> SolrCli, and the admin UI), so those features will break unless you
> manually edit them for the new path.
>
> Thanks,
> Shawn
>
>


Re: Solr 7.2.0 CDCR Issue with TLOG collections

2018-03-02 Thread Webster Homer
It looks like the data is getting to the target servers. I see tlog files
with the right timestamps. Looking at the timestamps on the documents in
the collection, none of the data appears to have been loaded.
In the solr.log I see lots of /cdcr messages: action=LASTPROCESSEDVERSION,
action=COLLECTIONCHECKPOINT, and action=SHARDCHECKPOINT.

No errors.

autoCommit is set to 6. I tried sending a commit explicitly; no
difference. CDCR is uploading data, but no new data appears in the
collection.

On Fri, Mar 2, 2018 at 1:39 PM, Webster Homer 
wrote:

> We have been having strange behavior with CDCR on Solr 7.2.0.
>
> We have a number of replicas which have identical schemas. We found that
> TLOG replicas give much more consistent search results.
>
> We created a collection using TLOG replicas in our QA clouds.
> We have a locally hosted solrcloud with 2 nodes, all our collections have
> 2 shards. We use CDCR to replicate the collections from this environment to
> 2 data centers hosted in Google cloud. This seems to work fairly well for
> our collections with NRT replicas. However the new TLOG collection has
> problems.
>
> The google cloud solrclusters have 4 nodes each (3 separate Zookeepers). 2
> shards per collection with 2 replicas per shard.
>
> We never see data show up in the cloud collections, but we do see tlog
> files show up on the cloud servers. I can see that all of the servers have
> cdcr started, buffers are disabled.
> The cdcr source configuration is:
>
> "requestHandler":{"/cdcr":{
>   "name":"/cdcr",
>   "class":"solr.CdcrRequestHandler",
>   "replica":[
> {
>   "zkHost":"xxx-mzk01.sial.com:2181,xxx-mzk02.sial.com:2181,x
> xx-mzk03.sial.com:2181/solr",
>   "source":"b2b-catalog-material-180124T",
>   "target":"b2b-catalog-material-180124T"},
> {
>   "zkHost":"-mzk01.sial.com:2181,-mzk02.sial.com:2181,
> -mzk03.sial.com:2181/solr",
>   "source":"b2b-catalog-material-180124T",
>   "target":"b2b-catalog-material-180124T"}],
>   "replicator":{
> "threadPoolSize":4,
> "schedule":500,
> "batchSize":250},
>   "updateLogSynchronizer":{"schedule":6
>
> The target configurations in the 2 clouds are the same:
> "requestHandler":{"/cdcr":{ "name":"/cdcr", "class":"solr.
> CdcrRequestHandler", "buffer":{"defaultState":"disabled"}}}
>
> All of our collections have a timestamp field, index_date. In the source
> collection all the records have a date of 2/28/2018 but the target
> collections have a latest date of 1/26/2018
>
> I don't see cdcr errors in the logs, but we use logstash to search them,
> and we're still perfecting that.
>
> We have a number of similar collections that behave correctly. This is the
> only collection that is a TLOG collection. It appears that CDCR doesn't
> support TLOG collections.
>
> This begins to look like a bug
>
>



Re: Shard replica labels in Solr Admin graph?

2018-03-02 Thread Shawn Heisey
On 3/2/2018 12:51 PM, Scott Prentice wrote:
> I made the adjustment to /etc/hosts, and now all's well. This also
> fixed an underlying problem that I hadn't noticed at the time I send
> my query .. that only one Solr server was actually running. Turns out
> that Zookeeper saw them all as 127.0.1.1 and didn't let the other
> instances fully start up.
>
> These were brand new, fresh, Ubuntu installs. Strange that the
> /etc/hosts isn't set up to handle this.

I know that if you set up a manual IP address during install, and choose
correct hostname/domain values, that Ubuntu/Debian should do the right
thing with regard to /etc/hosts.

But there are situations in which those operating systems will use
127.0.1.1.  I am not aware of what situations those are ... maybe if you
let it use DHCP during install?  I never do that.

https://serverfault.com/a/363098

Thanks,
Shawn



Re: Shard replica labels in Solr Admin graph?

2018-03-02 Thread Scott Prentice

Thanks Shawn!

I made the adjustment to /etc/hosts, and now all's well. This also fixed 
an underlying problem that I hadn't noticed at the time I send my query 
.. that only one Solr server was actually running. Turns out that 
Zookeeper saw them all as 127.0.1.1 and didn't let the other instances 
fully start up.


These were brand new, fresh, Ubuntu installs. Strange that the 
/etc/hosts isn't set up to handle this.


Cheers,
...scott


On 2/28/18 8:48 PM, Shawn Heisey wrote:

On 2/28/2018 5:42 PM, Scott Prentice wrote:
We initially tested our Solr Cloud implementation on a single VM with 
3 Solr servers and 3 Zookeeper servers. Once that seemed good, we 
moved to 3 VMs with 1 Solr/Zookeeper on each. That's all looking 
good, but in the Solr Admin > Cloud > Graph, all of my shard replicas 
are on "127.0.1.1" .. with the single VM setup it listed the port 
number so you could tell which "server" it was on.


Is there some way to get the shard replicas to list with the actual 
IPs of the server that the replica is on, rather than 127.0.1.1?


That is not going to work if those are separate machines.

There are two ways to fix this.

One is to figure out why Java is choosing a loopback address when it 
attempts to detect the machine's hostname.  I'm almost certain that 
/etc/hosts is set up incorrectly.  In my opinion, a typical /etc/hosts 
file should have two IPv4 lines, one defining localhost as 127.0.0.1, 
and another defining the machine's actual IP address as both the fully 
qualified domain name and the short hostname. An example:


127.0.0.1   localhost
192.168.1.200   smeagol.REDACTED.com    smeagol

The machine's hostname should not be found on any line that does not 
have a real IP address on it.


The other way to solve the problem is to specify the "host" system 
property to override Java's detection of the machine 
address/hostname.  You can either add a commandline option to set the 
property, or add it to solr.xml.  Note that if your solr.xml file is 
in zookeeper, then you can't use solr.xml.  This is because with 
solr.xml in zookeeper, every machine would have the same host 
definition, and that won't work.
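
A minimal sketch of that override in solr.xml -- the address here is just a
placeholder:

  <solr>
    <solrcloud>
      <str name="host">192.168.1.200</str>
      <int name="hostPort">${jetty.port:8983}</int>
      <str name="hostContext">${hostContext:solr}</str>
    </solrcloud>
  </solr>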


https://lucene.apache.org/solr/guide/6_6/format-of-solr-xml.html#the-code-solrcloud-code-element 



Thanks,
Shawn






Solr 7.2.0 CDCR Issue with TLOG collections

2018-03-02 Thread Webster Homer
We have been having strange behavior with CDCR on Solr 7.2.0.

We have a number of replicas which have identical schemas. We found that
TLOG replicas give much more consistent search results.

We created a collection using TLOG replicas in our QA clouds.
We have a locally hosted solrcloud with 2 nodes, all our collections have 2
shards. We use CDCR to replicate the collections from this environment to 2
data centers hosted in Google cloud. This seems to work fairly well for our
collections with NRT replicas. However the new TLOG collection has problems.

The google cloud solrclusters have 4 nodes each (3 separate Zookeepers). 2
shards per collection with 2 replicas per shard.

We never see data show up in the cloud collections, but we do see tlog
files show up on the cloud servers. I can see that all of the servers have
cdcr started, buffers are disabled.
The cdcr source configuration is:

"requestHandler":{"/cdcr":{
  "name":"/cdcr",
  "class":"solr.CdcrRequestHandler",
  "replica":[
{
  "zkHost":"xxx-mzk01.sial.com:2181,xxx-mzk02.sial.com:2181,
xxx-mzk03.sial.com:2181/solr",
  "source":"b2b-catalog-material-180124T",
  "target":"b2b-catalog-material-180124T"},
{
  "zkHost":"-mzk01.sial.com:2181,-mzk02.sial.com:2181,
-mzk03.sial.com:2181/solr",
  "source":"b2b-catalog-material-180124T",
  "target":"b2b-catalog-material-180124T"}],
  "replicator":{
"threadPoolSize":4,
"schedule":500,
"batchSize":250},
  "updateLogSynchronizer":{"schedule":6

The target configurations in the 2 clouds are the same:
"requestHandler":{"/cdcr":{ "name":"/cdcr",
  "class":"solr.CdcrRequestHandler", "buffer":{"defaultState":"disabled"}}}

All of our collections have a timestamp field, index_date. In the source
collection all the records have a date of 2/28/2018 but the target
collections have a latest date of 1/26/2018

I don't see cdcr errors in the logs, but we use logstash to search them,
and we're still perfecting that.

We have a number of similar collections that behave correctly. This is the
only collection that is a TLOG collection. It appears that CDCR doesn't
support TLOG collections.

This begins to look like a bug



RE: solr url control

2018-03-02 Thread Becky Bonner
So the thing is ... these collections all have very different schemas, and the
data are unrelated to each other.  And we do a lot of field queries on the
content.  We would not want the data from one collection to ever show up in
another collection query.  They are used by different audiences, with different
security requirements as well.  We want to keep them separated.

While it is not required that the URLs include the myParentX ... it would be
consistent with our current implementation, which we are upgrading from 4.6 to
7.2.  This was a very simple task under Apache, but I can't figure out how to
do it in Solr 7.

-Original Message-
From: Becky Bonner 
Sent: Friday, March 2, 2018 1:11 PM
To: 'solr-user@lucene.apache.org' 
Subject: RE: solr url control

Sorry Webster - I meant to make this a new question ... but accidentally sent 
it. You wrote
From: Webster Homer [mailto:webster.ho...@sial.com] 
Sent: Friday, March 2, 2018 12:20 PM
To: solr-user@lucene.apache.org
Subject: Re: NRT replicas miss hits and return duplicate hits when paging 
solrcloud searches

Becky,
This should have been its own question.

Solrcloud is different from standalone solr, the configurations live in 
Zookeeper and the index is created under SOLR_HOME. You might want to rethink 
your solution, What problem are you trying to solve with that layout? Would it 
be solved by creating the Parent1 collection with 2 shards?

-Original Message-
From: Becky Bonner 
Sent: Friday, March 2, 2018 11:29 AM
To: solr-user@lucene.apache.org
Subject: solr url control

We are trying to setup one solr server for several applications each with a 
different collection.  Is there a way to have 2 collections under one 
folder and the url be something like this:
http://mysolrinstance.com/solr/myParent1/collection1
http://mysolrinstance.com/solr/myParent1/collection2
http://mysolrinstance.com/solr/myParent2
http://mysolrinstance.com/solr/myParent3


We organized it like that under the solr folder but the URLs to the collections 
do not include the "myParent1".
This makes the names of my collections more confusing because you can't tell 
what application they belong to.  It wasn’t a problem until we had 2 
collections for one of the apps.


RE: solr url control

2018-03-02 Thread Becky Bonner
Sorry Webster - I meant to make this a new question ... but accidentally sent 
it. You wrote
From: Webster Homer [mailto:webster.ho...@sial.com] 
Sent: Friday, March 2, 2018 12:20 PM
To: solr-user@lucene.apache.org
Subject: Re: NRT replicas miss hits and return duplicate hits when paging 
solrcloud searches

Becky,
This should have been its own question.

Solrcloud is different from standalone solr, the configurations live in 
Zookeeper and the index is created under SOLR_HOME. You might want to rethink 
your solution, What problem are you trying to solve with that layout? Would it 
be solved by creating the Parent1 collection with 2 shards?

-Original Message-
From: Becky Bonner 
Sent: Friday, March 2, 2018 11:29 AM
To: solr-user@lucene.apache.org
Subject: solr url control

We are trying to setup one solr server for several applications each with a 
different collection.  Is there a way to have 2 collections under one 
folder and the url be something like this:
http://mysolrinstance.com/solr/myParent1/collection1
http://mysolrinstance.com/solr/myParent1/collection2
http://mysolrinstance.com/solr/myParent2
http://mysolrinstance.com/solr/myParent3


We organized it like that under the solr folder but the URLs to the collections 
do not include the "myParent1".
This makes the names of my collections more confusing because you can't tell 
what application they belong to.  It wasn’t a problem until we had 2 
collections for one of the apps.


Performance Implications of Different Routing Schemes

2018-03-02 Thread Stephen Lewis
Hello!

I'm wondering what information you may be able to provide on performance
implications of implicit routing VS composite ID routing. In particular,
I'm curious what the horizontal scaling behavior may be of implicit routing
or composite ID routing with and without the "/" param appended on.

I've been following the official documentation on document routing, and a few
other blogs/articles I've seen around the web (including on Lucidworks). Many
of these discuss the techniques and general philosophy of different document
routing schemes, but I haven't been able to find a "Big O" assessment so far
by searching online. I'm aware that any particular workload really needs a
sizing exercise to fully understand its implications, but I'm hoping to plan
high level architecture beyond what I can currently foresee in scale.

A relatively simple assessment I've done below leads me to believe the
following is likely the case: if we have S shards and B as our "/bits" param,
then resource usage would scale (in Big O terms) as follows. (Note: previously
I've received the advice that any shard should be capped at a max of 120M
documents, which is where the cap on docs/shard-key comes from.)

   - Implicit routing
      - One Read: O(S)
         - hits horizontal scaling limit eventually as S grows
         - No cap on docs per shard key (no shard key)
   - Composite ID routing, no bits param:
      - One Read: O(1)
         - no horizontal scaling limit as S grows
         - Docs on a shard key capped at 120 million
   - Composite ID routing with bits param:
      - One Read: O(2^B)
         - no horizontal scaling limit as S grows for fixed B
         - Docs on a shard key capped at 120 million * 2^B


So my questions: Is this "big O" analysis about correct? Does SOLR have an
ability to scale horizontally on implicit routing despite what my simple
analysis would suggest? Are there other considerations here you can
enlighten me on?

I would guess the answer to the second question is "no" because otherwise
it wouldn't seem to me that composite ID routing would add much concrete
value. But perhaps there are some other factors I've yet to consider.

Thanks for your time and help! Looking forward to hearing back :)

Cheers,
Stephen


Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

2018-03-02 Thread Webster Homer
Becky,
This should have been its own question.

Solrcloud is different from standalone solr, the configurations live in
Zookeeper and the index is created under SOLR_HOME. You might want to
rethink your solution, What problem are you trying to solve with that
layout? Would it be solved by creating the Parent1 collection with 2 shards?

On Fri, Mar 2, 2018 at 10:56 AM, Becky Bonner  wrote:

> We are trying to setup one solr server for several applications each with
> a different collection.  Is there a way to have 2 collections under
> one folder and the url be something like this:
> http://mysolrinstance.com/solr/myParent1/collection1
> http://mysolrinstance.com/solr/myParent1/collection2
> http://mysolrinstance.com/solr/myParent2
> http://mysolrinstance.com/solr/myParent3
>
>
> We organized it like that under the solr folder but the URLs to the
> collections do not include the "myParent1".
> This makes the names of my collections more confusing because you can't
> tell what application they belong to.  It wasn’t a problem until we had 2
> collections for one of the apps.
>
>
>
>
> -Original Message-
> From: Webster Homer [mailto:webster.ho...@sial.com]
> Sent: Friday, March 2, 2018 10:29 AM
> To: solr-user@lucene.apache.org
> Subject: Re: NRT replicas miss hits and return duplicate hits when paging
> solrcloud searches
>
> I am trying to test if enabling stats cache as suggested by Eric would
> also address this issue. I added this to my solrconfig.xml:
>
>   <statsCache class="org.apache.solr.search.stats.ExactSharedStatsCache"/>
>
> I executed queries and saw no differences. Then I re-indexed the data,
> again I saw no differences in behavior.
> Then I found this,  SOLR-10952. It seems we need to disable the
> queryResultCache for the global stats cache to work.
> I've never disabled this before. I edited the solrconfig.xml setting the
> sizes to 0. I'm not sure if this is how to disable the cache or not.
>
>   <queryResultCache class="solr.LRUCache"
>                     size="0"
>                     initialSize="0"
>                     autowarmCount="0"/>
>
> I also set this:
>0
>
> Then uploaded the solrconfig.xml and reloaded the collection. It still made
> no difference. Do I need to restart solr for this to take effect?
> When I look in the admin console, the queryResultCache still seems to have
> the old settings.
>
> Does enabling statsCache require a solr restart too? Does enabling the
> statsCache require that the data be re-indexed? The documentation on this
> feature is skimpy.
> Is there a way to see if it's enabled in the Admin Console?
>
> On Tue, Feb 27, 2018 at 9:31 AM, Webster Homer 
> wrote:
>
> > Emir,
> >
> > Using tlog replica types addresses my immediate problem.
> >
> > The secondary issue is that all of our searches show inconsistent
> results.
> > These are all normal paging use cases. We regularly test our
> > relevancy, and these differences create confusion in the testers.
> > Moreover, we are migrating from Endeca which has very consistent results.
> >
> > I'm hoping that using the global stats cache will make the other
> > searches more stable. I think we will eventually move to favoring tlog
> > replicas. We have a couple of collections where NRT makes sense, but
> > those collections don't need to return data in relevancy order. I
> > think NRT should be considered a niche use case for a search engine,
> > tlog and pull replicas are a much better fit for a search engine
> > (imho)
> >
> > On Tue, Feb 27, 2018 at 4:01 AM, Emir Arnautović <
> > emir.arnauto...@sematext.com> wrote:
> >
> >> Hi Webster,
> >> Since you are returning all hits, returning the last page is almost
> >> as heavy for Solr as returning all documents. Maybe you should
> >> consider just returning one large page and completely avoid this issue.
> >> I agree with you that this should be handled by Solr. ES solved this
> >> issue with “preference” search parameter where you can set session id
> >> as preference and it will stick to the same shards. I guess you could
> >> try similar thing on your own but that would require you to send list
> >> of shards as parameter for your search and balance it for different
> sessions.
> >>
> >> HTH,
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection Solr &
> >> Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >> > On 26 Feb 2018, at 21:03, Webster Homer 
> wrote:
> >> >
> >> > Erick,
> >> >
> >> > No we didn't look at that. I will add it to the list. We have  not
> >> > seen performance issues with solr. We have much slower technologies
> >> > in our stack. This project was to replace a system that was too slow.
> >> >
> >> > Thank you, I will look into it
> >> >
> >> > Webster
> >> >
> >> > On Mon, Feb 26, 2018 at 1:13 PM, Erick Erickson <
> >> erickerick...@gmail.com>
> >> > wrote:
> >> >
> >> >> Did you try enabling distributed IDF (statsCache)? See:
> >> >> 

Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

2018-03-02 Thread Webster Homer
Thanks Shawn.

Commenting it out works to remove it. If I change the values, e.g. change
the 512 to 0, it does require a restart to take effect.

Tested using statsCache set to
org.apache.solr.search.stats.ExactSharedStatsCache,
with the queryResultCache disabled, and I still see the problem with NRT
replicas. So using TLOG replicas still looks like the best workaround for
the NRT issue.

On Fri, Mar 2, 2018 at 10:44 AM, Shawn Heisey  wrote:

> On 3/2/2018 9:28 AM, Webster Homer wrote:
>
>> I've never disabled this before. I edited the solrconfig.xml setting the
>> sizes to 0. I'm not sure if this is how to disable the cache or not.
>>
>>   <queryResultCache class="solr.LRUCache"
>>                     size="0"
>>                     initialSize="0"
>>                     autowarmCount="0"/>
>>
>
> To completely disable a cache, either comment it out or remove it from the
> config.  I do not know whether setting the size to 0 will actually work or
> not.
>
> Does enabling statsCache require a solr restart too? Does enabling the
>> statsCache require that the data be re-indexed? The documentation on this
>> feature is skimpy.
>>
>
> Most changes to solrconfig.xml just require a reload.  I would expect any
> cache configurations to fall into that category.
>
> Is there a way to see if it's enabled in the Admin Console?
>>
>
> I don't know anything about the statsCache.  If you don't see it in the
> Plugins/Stats tab, that's probably something that was forgotten, and needs
> to be added to the admin UI.
>
> Thanks,
> Shawn
>
>



Updating documents and commit/rollback

2018-03-02 Thread Christopher Schultz
Hey, folks. I've been a long-time Lucene user (running a hilariously-old
1.9.1 version forever), but I'm only just now getting into using Solr.

My particular use-case is storing information about web-application
users so they can be found more quickly than our current RDBMS-based
search (SELECT ... FROM user WHERE username LIKE '%foo%' OR
email_address LIKE '%foo%' OR last_name LIKE '%foo%'...).

I've set up my Solr (very basic... just untar, bin/solr start), created
a core/collection (I'm running single-server for now, no cloudy
zookeeper stuff ATM), customized my schema (using the Schema API, since
hand-editing is discouraged) and loaded my data. I can search just fine
through the Solr dashboard.

I've also used solr-solrj to perform searches from within my
application, replacing the previous JDBC-based search with the
Solr-based one. All is well.

Now I'm trying to figure out the best way to update users in the index
when their information (e.g. first/last names) change. I have used
solr-solrj quite simply like this:

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", user.getId());
doc.addField("username", user.getUsername());
doc.addField("first_name", user.getFirstName());
doc.addField("last_name", user.getLastName());
...
solrClient.add("users", doc);
solrClient.commit();

I'm having a problem, though, and I'd like to know what the "right"
solution is.

The problem is that I'm updating the index after my SQL UPDATE(s) have
run, but before my SQL COMMIT occurs. I have had a problem where the SQL
fails and rolls-back, but the solrClient is not rolled-back.

I'm a little wary of rolling-back Solr because, as I understand it, the
client itself doesn't carry any transactional information. That is, it
should be a shared-resource (within the web application) and indeed,
other clients could be connecting from other places (like other app
servers running the same application). Performing either commit() or
rollback() on the Solr client will commit/rollback *all* writes since
the last commit, right?

That means that there is no meaningful way that I can say to Solr "oops,
I actually need you to NOT add that document I just told you about".
Instead, I have to either commit the document I don't want (and, I
dunno, delete it later or whatever) or risk rolling-back other writes
that other clients have performed.

Do I have that right?

So... what's the best way to do this kind of thing? Can I ask Solr to
add-and-commit at the same time? If so, how? Is there a meaningful
"rollback this one addition" that I can perform? If so, how?

Thanks for a great product,
-chris





solr url control

2018-03-02 Thread Becky Bonner
We are trying to setup one solr server for several applications each with a 
different collection.  Is there a way to have 2 collections under one 
folder and the url be something like this:
http://mysolrinstance.com/solr/myParent1/collection1
http://mysolrinstance.com/solr/myParent1/collection2
http://mysolrinstance.com/solr/myParent2
http://mysolrinstance.com/solr/myParent3


We organized it like that under the solr folder but the URLs to the collections 
do not include the "myParent1".
This makes the names of my collections more confusing because you can't tell 
what application they belong to.  It wasn’t a problem until we had 2 
collections for one of the apps.


RE: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

2018-03-02 Thread Becky Bonner
We are trying to setup one solr server for several applications each with a 
different collection.  Is there a way to have 2 collections under one 
folder and the url be something like this:
http://mysolrinstance.com/solr/myParent1/collection1
http://mysolrinstance.com/solr/myParent1/collection2
http://mysolrinstance.com/solr/myParent2
http://mysolrinstance.com/solr/myParent3


We organized it like that under the solr folder but the URLs to the collections 
do not include the "myParent1".
This makes the names of my collections more confusing because you can't tell 
what application they belong to.  It wasn’t a problem until we had 2 
collections for one of the apps.




-Original Message-
From: Webster Homer [mailto:webster.ho...@sial.com] 
Sent: Friday, March 2, 2018 10:29 AM
To: solr-user@lucene.apache.org
Subject: Re: NRT replicas miss hits and return duplicate hits when paging 
solrcloud searches

I am trying to test if enabling stats cache as suggested by Eric would also 
address this issue. I added this to my solrconfig.xml:

  <statsCache class="org.apache.solr.search.stats.ExactSharedStatsCache"/>

I executed queries and saw no differences. Then I re-indexed the data, again I 
saw no differences in behavior.
Then I found this,  SOLR-10952. It seems we need to disable the 
queryResultCache for the global stats cache to work.
I've never disabled this before. I edited the solrconfig.xml setting the sizes
to 0. I'm not sure if this is how to disable the cache or not.

  <queryResultCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>
I also set this:
   0

Then uploaded the solrconfig.xml and reloaded the collection. It still made no 
difference. Do I need to restart solr for this to take effect?
When I look in the admin console, the queryResultCache still seems to have the 
old settings.

Does enabling statsCache require a solr restart too? Does enabling the 
statsCache require that the data be re-indexed? The documentation on this 
feature is skimpy.
Is there a way to see if it's enabled in the Admin Console?

On Tue, Feb 27, 2018 at 9:31 AM, Webster Homer 
wrote:

> Emir,
>
> Using tlog replica types addresses my immediate problem.
>
> The secondary issue is that all of our searches show inconsistent results.
> These are all normal paging use cases. We regularly test our 
> relevancy, and these differences create confusion in the testers. 
> Moreover, we are migrating from Endeca which has very consistent results.
>
> I'm hoping that using the global stats cache will make the other 
> searches more stable. I think we will eventually move to favoring tlog 
> replicas. We have a couple of collections where NRT makes sense, but 
> those collections don't need to return data in relevancy order. I 
> think NRT should be considered a niche use case for a search engine, 
> tlog and pull replicas are a much better fit for a search engine 
> (imho)
>
> On Tue, Feb 27, 2018 at 4:01 AM, Emir Arnautović < 
> emir.arnauto...@sematext.com> wrote:
>
>> Hi Webster,
>> Since you are returning all hits, returning the last page is almost 
>> as heavy for Solr as returning all documents. Maybe you should 
>> consider just returning one large page and completely avoid this issue.
>> I agree with you that this should be handled by Solr. ES solved this 
>> issue with “preference” search parameter where you can set session id 
>> as preference and it will stick to the same shards. I guess you could 
>> try similar thing on your own but that would require you to send list 
>> of shards as parameter for your search and balance it for different sessions.
>>
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
>> Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>> > On 26 Feb 2018, at 21:03, Webster Homer  wrote:
>> >
>> > Erick,
>> >
>> > No we didn't look at that. I will add it to the list. We have  not 
>> > seen performance issues with solr. We have much slower technologies 
>> > in our stack. This project was to replace a system that was too slow.
>> >
>> > Thank you, I will look into it
>> >
>> > Webster
>> >
>> > On Mon, Feb 26, 2018 at 1:13 PM, Erick Erickson <
>> erickerick...@gmail.com>
>> > wrote:
>> >
>> >> Did you try enabling distributed IDF (statsCache)? See:
>> >> https://lucene.apache.org/solr/guide/6_6/distributed-requests.html
>> >>
>> >> It's may not totally fix the issue, but it's worth trying. It does 
>> >> come with a performance penalty of course.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Mon, Feb 26, 2018 at 11:00 AM, Webster Homer <
>> webster.ho...@sial.com>
>> >> wrote:
>> >>> Thanks Shawn, I had settled on this as a solution.
>> >>>
>> >>> All our use cases for Solr is to return results in order of 
>> >>> relevancy
>> to
>> >>> the query, so having a deterministic sort would defeat that purpose.
>> >> Since
>> >>> we wanted to be able to return all the results for a query, I
>> originally
>> >>> looked at using the Streaming API, but that doesn't support 
>> >>> returning results sorted by relevancy

Re: [poll] which loadbalancer are you using for SolrCloud

2018-03-02 Thread Shawn Heisey

On 3/2/2018 9:11 AM, David Hastings wrote:

I'll have to take a look at HAProxy.  How much faster than nginx is it?


I know very little about nginx.

Here's some information about haproxy performance.  It's information 
they provide themselves, so configure your grain of salt accordingly. :)


http://www.haproxy.org#perf
http://www.haproxy.org/10g.html

Thanks,
Shawn



Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

2018-03-02 Thread Shawn Heisey

On 3/2/2018 9:28 AM, Webster Homer wrote:

I've never disabled this before. I edited the solrconfig.xml setting the
sizes to 0. I'm not sure if this is how to disable the cache or not.

  <queryResultCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>


To completely disable a cache, either comment it out or remove it from 
the config.  I do not know whether setting the size to 0 will actually 
work or not.



Does enabling statsCache require a solr restart too? Does enabling the
statsCache require that the data be re-indexed? The documentation on this
feature is skimpy.


Most changes to solrconfig.xml just require a reload.  I would expect 
any cache configurations to fall into that category.



Is there a way to see if it's enabled in the Admin Console?


I don't know anything about the statsCache.  If you don't see it in the 
Plugins/Stats tab, that's probably something that was forgotten, and 
needs to be added to the admin UI.


Thanks,
Shawn



Re: Rename solr to another name

2018-03-02 Thread Shawn Heisey

On 3/2/2018 8:27 AM, Zheng Lin Edwin Yeo wrote:

Are we able to rename the folder name like solr-webapp, or names like
solr-jetty-context.xml, to customised names like my-webapp and
my-jetty-context.xml?

I'm currently using Solr 6.5.1, and will upgrade to Solr 7.2.1 soon.


When people start wanting to customize internal details to this level, I 
seriously have to ask "why?"  The dev team has spent a lot of time and 
effort coming up with these configurations. They are not designed to be 
customizable.  Some of them certainly CAN be customized, with a lot of 
manual work.


These particular questions involve Jetty, not so much Solr.  From what I 
can tell, the solr-jetty-context.xml file can be renamed to anything, 
it's the fact that it's in the "contexts" directory that makes Jetty 
read it.  I do not know whether it needs an xml extension ... I would 
preserve that if it were me.


The "solr-webapp" directory location is defined within the 
solr-jetty-context.xml file, so if you change that, you're going to need 
to edit the context file in order to keep Solr working. The scripts 
included with Solr have this path hardcoded, and for that reason, Solr 
probably won't even start without manual script edits if the webapp 
directory is changed.


Anticipating another question you might have:  Another piece of 
information in solr-jetty-context.xml is the context path -- set to 
"/solr" -- this is the first part of the URL path on all API calls.  You 
can change this, and Solr itself will still work properly, but a lot of 
things have this path hard-coded (like the scripts, certain Java 
libraries like SolrCli, and the admin UI), so those features will break 
unless you manually edit them for the new path.
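
For reference, the relevant part of server/contexts/solr-jetty-context.xml 
looks roughly like this (a sketch, so check your own copy before editing):

  <Configure class="org.eclipse.jetty.webapp.WebAppContext">
    <Set name="contextPath"><Property name="hostContext" default="/solr"/></Set>
    <Set name="war"><Property name="jetty.base"/>/solr-webapp/webapp</Set>
  </Configure>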


Thanks,
Shawn



Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

2018-03-02 Thread Webster Homer
I am trying to test if enabling stats cache as suggested by Eric would also
address this issue. I added this to my solrconfig.xml:

  <statsCache class="org.apache.solr.search.stats.ExactSharedStatsCache"/>

I executed queries and saw no differences. Then I re-indexed the data,
again I saw no differences in behavior.
Then I found this,  SOLR-10952. It seems we need to disable the
queryResultCache for the global stats cache to work.
I've never disabled this before. I edited the solrconfig.xml setting the
sizes to 0. I'm not sure if this is how to disable the cache or not.

  <queryResultCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>

I also set this:
   0

Then uploaded the solrconfig.xml and reloaded the collection. It still made
no difference. Do I need to restart solr for this to take effect?
When I look in the admin console, the queryResultCache still seems to have
the old settings.

Does enabling statsCache require a solr restart too? Does enabling the
statsCache require that the data be re-indexed? The documentation on this
feature is skimpy.
Is there a way to see if it's enabled in the Admin Console?

On Tue, Feb 27, 2018 at 9:31 AM, Webster Homer 
wrote:

> Emir,
>
> Using tlog replica types addresses my immediate problem.
>
> The secondary issue is that all of our searches show inconsistent results.
> These are all normal paging use cases. We regularly test our relevancy, and
> these differences create confusion in the testers. Moreover, we are
> migrating from Endeca which has very consistent results.
>
> I'm hoping that using the global stats cache will make the other searches
> more stable. I think we will eventually move to favoring tlog replicas. We
> have a couple of collections where NRT makes sense, but those collections
> don't need to return data in relevancy order. I think NRT should be
> considered a niche use case for a search engine, tlog and pull replicas are
> a much better fit for a search engine (imho)
>
> On Tue, Feb 27, 2018 at 4:01 AM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
>
>> Hi Webster,
>> Since you are returning all hits, returning the last page is almost as
>> heavy for Solr as returning all documents. Maybe you should consider just
>> returning one large page and completely avoid this issue.
>> I agree with you that this should be handled by Solr. ES solved this
>> issue with “preference” search parameter where you can set session id as
>> preference and it will stick to the same shards. I guess you could try
>> similar thing on your own but that would require you to send list of shards
>> as parameter for your search and balance it for different sessions.
>>
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>> > On 26 Feb 2018, at 21:03, Webster Homer  wrote:
>> >
>> > Erick,
>> >
>> > No we didn't look at that. I will add it to the list. We have  not seen
>> > performance issues with solr. We have much slower technologies in our
>> > stack. This project was to replace a system that was too slow.
>> >
>> > Thank you, I will look into it
>> >
>> > Webster
>> >
>> > On Mon, Feb 26, 2018 at 1:13 PM, Erick Erickson <
>> erickerick...@gmail.com>
>> > wrote:
>> >
>> >> Did you try enabling distributed IDF (statsCache)? See:
>> >> https://lucene.apache.org/solr/guide/6_6/distributed-requests.html
>> >>
>> >> It's may not totally fix the issue, but it's worth trying. It does
>> >> come with a performance penalty of course.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Mon, Feb 26, 2018 at 11:00 AM, Webster Homer <
>> webster.ho...@sial.com>
>> >> wrote:
>> >>> Thanks Shawn, I had settled on this as a solution.
>> >>>
>> >>> All our use cases for Solr are to return results in order of relevancy
>> >>> to the query, so having a deterministic sort would defeat that purpose.
>> >>> Since we wanted to be able to return all the results for a query, I
>> >>> originally looked at using the Streaming API, but that doesn't support
>> >>> returning results sorted by relevancy.
>> >>>
>> >>> I disagree with you about NRT replicas though. They may function as
>> >>> designed, but since they cannot guarantee consistent results, their
>> >>> design is buggy, at least for a search engine.
>> >>>
>> >>>
>> >>> On Mon, Feb 26, 2018 at 12:20 PM, Shawn Heisey 
>> >> wrote:
>> >>>
>>  On 2/26/2018 10:26 AM, Webster Homer wrote:
>> > We need the results by relevancy so the application sorts the
>> results
>> >> by
>> > score desc, and the unique id ascending as the tie breaker
>> 
>>  This is the reason for the discrepancy, and why the different replica
>>  types don't have the same issue.
>> 
>>  Each NRT replica can have different deleted documents than the
>> others,
>>  just due to the way that NRT replicas work.  Deleted documents affect
>>  relevancy scoring.  When one replica has say 5000 deleted documents
>> and
>>  another has 200, or has 5000 but they're 

Re: [poll] which loadbalancer are you using for SolrCloud

2018-03-02 Thread Daniel Carrasco
I use HAProxy because it is much more configurable than Nginx, and I can send
commands to a Solr collection and search the response for text to check
whether the node is healthy.

Nginx is very fast too, but its health checks are weaker than HAProxy's.
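A rough sketch of the relevant HAProxy options (the backend name, hosts,
and collection name are examples, not a tested config):

# send a real query to each node and look for text in the response
backend solr_nodes
    option httpchk GET /solr/mycollection/select?q=*:*&rows=0&wt=json
    http-check expect string QTime
    server solr1 192.168.0.1:8983 check inter 5s
    server solr2 192.168.0.2:8983 check inter 5s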

Greetings!!

2018-03-02 17:11 GMT+01:00 David Hastings :

> I'll have to take a look at HAProxy. How much faster than nginx is it?
>
> To answer the question, I personally use nginx for load balancing/failovers
> and it's been good; I use the same nginx servers to load balance a Galera
> cluster as well.
>
> On Fri, Mar 2, 2018 at 11:09 AM, Shawn Heisey 
> wrote:
>
> > On 3/2/2018 6:13 AM, Bernd Fehling wrote:
> >
> >> I would like to poll for the loadbalancer you are using for SolrCloud.
> >>
> >> Are you using a loadbalancer for SolrCloud?
> >>
> >> If yes, which one (SolrJ, HAProxy, Varnish, Nginx,...) and why?
> >>
> >
> > I use haproxy for Solr -- not SolrCloud.  It is an amazing and FAST piece
> > of software, without the overhead of a full webserver (apache, nginx).
> It
> > also has zero cost, which is far more attractive than hardware load
> > balancers, and can do anything I've seen a hardware load balancer do.
> With
> > the presence of another piece of software (such as pacemaker) you can
> even
> > have hardware redundancy for the load balancer.
> >
> > Most of my clients talking to Solr are Java, so they use
> > HttpSolrClient/HttpSolrServer from SolrJ, connecting to the load
> balancer.
> >
> > For SolrCloud, if your clients are Java, you don't need a load balancer,
> > because the client (CloudSolrClient in SolrJ) talks to the entire cluster
> > and dynamically adjusts to changes in clusterstate.
> >
> > Thanks,
> > Shawn
> >
> >
>



-- 
_

  Daniel Carrasco Marín
  Ingeniería para la Innovación i2TIC, S.L.
  Tlf:  +34 911 12 32 84 Ext: 223
  www.i2tic.com
_


Re: [poll] which loadbalancer are you using for SolrCloud

2018-03-02 Thread David Hastings
I'll have to take a look at HAProxy. How much faster than nginx is it?

To answer the question, I personally use nginx for load balancing/failovers
and it's been good; I use the same nginx servers to load balance a Galera
cluster as well.

On Fri, Mar 2, 2018 at 11:09 AM, Shawn Heisey  wrote:

> On 3/2/2018 6:13 AM, Bernd Fehling wrote:
>
>> I would like to poll for the loadbalancer you are using for SolrCloud.
>>
>> Are you using a loadbalancer for SolrCloud?
>>
>> If yes, which one (SolrJ, HAProxy, Varnish, Nginx,...) and why?
>>
>
> I use haproxy for Solr -- not SolrCloud.  It is an amazing and FAST piece
> of software, without the overhead of a full webserver (apache, nginx).  It
> also has zero cost, which is far more attractive than hardware load
> balancers, and can do anything I've seen a hardware load balancer do.  With
> the presence of another piece of software (such as pacemaker) you can even
> have hardware redundancy for the load balancer.
>
> Most of my clients talking to Solr are Java, so they use
> HttpSolrClient/HttpSolrServer from SolrJ, connecting to the load balancer.
>
> For SolrCloud, if your clients are Java, you don't need a load balancer,
> because the client (CloudSolrClient in SolrJ) talks to the entire cluster
> and dynamically adjusts to changes in clusterstate.
>
> Thanks,
> Shawn
>
>


Re: solo source build in local error

2018-03-02 Thread Shawn Heisey

On 3/2/2018 7:42 AM, ramyogi wrote:

solr-repo/lucene-solr/build.xml:21: The following error occurred while
executing this line:
/solr-repo/lucene-solr/lucene/common-build.xml:623: java.lang.NullPointerException
        at java.util.Arrays.stream(Arrays.java:5004)
        at java.util.stream.Stream.of(Stream.java:1000)
        at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)

I am trying to build and debug Solr. The build seems to be throwing this
error. Is there anything I need to set before running the build?


Downgrade your ant version from 1.10.2 to 1.10.1 or 1.9.x. There's a bug
in 1.10.2 that produces a NullPointerException any time ant is run in the
Lucene/Solr source code.
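For example (the 1.10.1 install path below is just wherever you happen to
unpack it):

# confirm which ant version is on the PATH; 1.10.2 is the broken one
ant -version
# one option: unpack an apache-ant-1.10.1 binary release and put it first
export ANT_HOME=/opt/apache-ant-1.10.1
export PATH="$ANT_HOME/bin:$PATH"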


Thanks,
Shawn



Re: solo source build in local error

2018-03-02 Thread Erick Erickson
Ant 1.10.2 has a bug; are you using that version? 1.10.1 works fine.


On Fri, Mar 2, 2018 at 6:42 AM, ramyogi  wrote:
> solr-repo/lucene-solr/build.xml:21: The following error occurred while
> executing this line:
> /solr-repo/lucene-solr/lucene/common-build.xml:623: java.lang.NullPointerException
>         at java.util.Arrays.stream(Arrays.java:5004)
>         at java.util.stream.Stream.of(Stream.java:1000)
>         at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)
>         at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>
> I am trying to build and debug Solr. The build seems to be throwing this
> error. Is there anything I need to set before running the build?
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: [poll] which loadbalancer are you using for SolrCloud

2018-03-02 Thread Shawn Heisey

On 3/2/2018 6:13 AM, Bernd Fehling wrote:

I would like to poll for the loadbalancer you are using for SolrCloud.

Are you using a loadbalancer for SolrCloud?

If yes, which one (SolrJ, HAProxy, Varnish, Nginx,...) and why?


I use haproxy for Solr -- not SolrCloud.  It is an amazing and FAST 
piece of software, without the overhead of a full webserver (apache, 
nginx).  It also has zero cost, which is far more attractive than 
hardware load balancers, and can do anything I've seen a hardware load 
balancer do.  With the presence of another piece of software (such as 
pacemaker) you can even have hardware redundancy for the load balancer.


Most of my clients talking to Solr are Java, so they use 
HttpSolrClient/HttpSolrServer from SolrJ, connecting to the load balancer.


For SolrCloud, if your clients are Java, you don't need a load balancer, 
because the client (CloudSolrClient in SolrJ) talks to the entire 
cluster and dynamically adjusts to changes in clusterstate.
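
A minimal SolrJ sketch of that (the ZooKeeper address, chroot, and collection
name are placeholders; the builder signature is the SolrJ 7.x one):

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudClientSketch {
    public static void main(String[] args) throws Exception {
        // the client is handed the ZooKeeper ensemble, not a Solr URL,
        // so it always sees the current clusterstate
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"),
                Optional.of("/solr")).build()) {
            client.setDefaultCollection("mycollection");
            QueryResponse rsp = client.query(new SolrQuery("*:*"));
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}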


Thanks,
Shawn



dataimporthandler ignoring configured timezone for indexStartTime?

2018-03-02 Thread Elizabeth Haubert
I'm getting incorrect reported time deltas on the admin console for
"indexing since" and "started". It looks like DIH is converting the last
start time to UTC:

Last Update: 09:57:15

Indexing completed. Added/Updated: 94078 documents. Deleted 0 documents.
(Duration: 06s)

Requests: 1 , Fetched: 94,078 15,680/s, Skipped: 0 , Processed: 94,078
15,680/s

Started: about 5 hours ago


Server is configured for the EST timezone.

Timezone is set in solr.in.sh:
# By default the start script uses UTC; override the timezone if needed
SOLR_TIMEZONE="EST"

The DIH propertyWriter specifies the timezone in its date format:


And timezone is actually being written out in dataimport.properties:
#Fri Mar 02 09:55:11 EST 2018
last_index_time=2018-03-02 09\:55\:06 EST
autosuggest.last_index_time=2018-03-02 09\:55\:06 EST


The code in DataImporter.doFullImport looks like it is pulling the start time
directly from the PropertyWriter, so I'm a little stuck on what else needs to
be configured here:


  public void doFullImport(DIHWriter writer, RequestInfo requestParams) {
    LOG.info("Starting Full Import");
    setStatus(Status.RUNNING_FULL_DUMP);
    try {
      DIHProperties dihPropWriter = createPropertyWriter();
      setIndexStartTime(dihPropWriter.getCurrentTimestamp());
      ...


Suggestions?


Thank you,

Elizabeth


Re: Configuring Solr Data and Index directories

2018-03-02 Thread Shawn Heisey

On 3/2/2018 2:15 AM, YELESWARAPU, VENKATA BHAN wrote:

While deploying Solr I just see one parameter where we provide solr_home path.
For ex: -Dsolr.solr.home=/usr/local/clo/ven/solr_home

1)  Is there any path where we can configure data and index directories.
2)  Can we separate data directory from solr_home.
3)  Also, how to enable password protection for solr so that only limited 
people can access.


If you're running SolrCloud, I strongly recommend that you don't try to 
customize beyond the solr home.  SolrCloud can do a lot more automation 
than standalone, and doesn't offer any way to customize some things 
until AFTER you create your collections.  In SolrCloud mode, about the 
only thing found in each core's directory is the index data -- configs 
are in ZooKeeper -- so there's generally no reason to customize more.


Even when customizing things, I recommend the absolute minimum amount of 
customization that will meet your needs.


In each core's instanceDir, which is normally a directory under the solr 
home, there is a core.properties file.  The "dataDir" property is 
relative to instanceDir, and defaults to "./data".  The index directory 
is normally found in dataDir.  It's probably possible to customize the 
index directory, but I am not immediately familiar with how to do it, 
and it's not something I would generally worry about -- move dataDir if 
you need to, but let the index directory live there.  These properties 
can be set during core creation with the CoreAdmin API.  When in cloud 
mode, the CoreAdmin API should not be used, which is why I don't 
recommend a lot of customization for SolrCloud mode.
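
As a sketch, a core.properties that relocates the data directory might look
like this (the core name and path are examples):

# <solr_home>/mycore/core.properties
name=mycore
dataDir=/ssd/solr-data/mycore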


More customization is possible, but like I said above, I recommend using 
a minimum, and the options I've mentioned will handle most of what 
people want, without making it very difficult to understand where things 
are.


Basic authentication in SolrCloud mode has been available since version 
5.3.  Since version 6.5, it is also available in standalone mode.


https://lucene.apache.org/solr/guide/6_6/basic-authentication-plugin.html
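
If I remember the example from that page correctly, the stock security.json
(user "solr", password "SolrRocks") looks like this:

{
  "authentication": {
    "blockUnknown": true,
    "class": "solr.BasicAuthPlugin",
    "credentials": {
      "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="
    }
  },
  "authorization": {
    "class": "solr.RuleBasedAuthorizationPlugin",
    "permissions": [{ "name": "security-edit", "role": "admin" }],
    "user-role": { "solr": "admin" }
  }
}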

My thoughts about configuring security:  If you follow recommendations 
and place Solr in a network location where it cannot be reached except 
by trusted applications and trusted admins (enforced with something like 
a firewall), you do not need extra security like HTTPS and authentication.


Thanks,
Shawn



Re: SolrCloud 7.2.1 - UnsupportedOperationException thrown after query on specific environments

2018-03-02 Thread Andy Jolly
Erick Erickson wrote
> Maybe your remote job server is using a different set of jars than
> your local one? How does the remote job server work?

The remote job server is running the same code as our local environment, and
both are making queries against the same SolrCloud cluster. The main
difference is that on our local we run the job through a unit test that kicks
off the entire job.

We have noticed that these errors are being thrown on all of our Solr nodes,
not just the node containing the collection that is being queried.


Erick Erickson wrote
> No log snippets came through BTW, so I'm guessing a bit. The Apache
> mail server is quite aggressive about stripping stuff

Here is the log snippet without any formatting.  Hopefully that should work.

2018-03-01 20:01:13.009 INFO  (qtp20671747-2258) [c:mycollection s:shard1
r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request
[mycollection_shard1_replica_n1]  webapp=/solr path=/select
params={q=id:79ea39cb1fe01706a05d9595088fc0e04af7b5bf=edismax=recip(ms(NOW,published_on),3.16e-11,1,1)^2.0=0=-excluded_tenants:(1)=type:(News)=1=2.2}
hits=1 status=0 QTime=0
2018-03-01 20:01:12.998 INFO  (qtp20671747-2231) [c:mycollection s:shard1
r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request
[mycollection_shard1_replica_n1]  webapp=/solr path=/select
params={q=id:66d7fa7c716633e33aacf5b8514052f42889267f=edismax=0=type:(Job)=1=2.2}
hits=0 status=0 QTime=0
2018-03-01 20:01:12.998 INFO  (qtp20671747-2257) [c:mycollection s:shard1
r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request
[mycollection_shard1_replica_n1]  webapp=/solr path=/select
params={q=id:5f02f0d8034a15c4604baec33c40c1f48152ffdf=edismax=0=type:(Job)=1=2.2}
hits=1 status=0 QTime=0
2018-02-28 20:00:11.713 ERROR (qtp20671747-314) [   ] o.a.s.s.HttpSolrCall
null:java.lang.UnsupportedOperationException
at java.util.AbstractList.add(AbstractList.java:148)
at java.util.AbstractList.add(AbstractList.java:108)
at
org.apache.solr.servlet.HttpSolrCall.getRemotCoreUrl(HttpSolrCall.java:901)
at
org.apache.solr.servlet.HttpSolrCall.extractRemotePath(HttpSolrCall.java:432)
at org.apache.solr.servlet.HttpSolrCall.init(HttpSolrCall.java:289)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:470)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1751)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
at
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at 

Is there a way to sort by conditional function in the Solr 7.2 JSON API?

2018-03-02 Thread Tom Van Cuyck
Hi,

In the Solr 7.2 JSON API, when faceting over terms, I would like to sort
the buckets by the average of a numeric property, as shown below:
curl http://localhost:8983/solr/core/select -d '
q=*:*&
rows=0&
wt=json&
json.facet={
  "field" : {
    "type" : "terms",
    "field" : "string-field",
    "sort" : "avg desc",
    "limit" : 50,
    "facet" : {
      "avg" : "avg(number_i)",
      "unique" : "unique(number_i)"
    }
  }
}'


However, when none of the documents in a bucket has a value for the
numerical property (e.g. unique = 0 in this case), an average value avg = 0
is returned.
This average value of 0 is then used for sorting the buckets.

I would like the buckets with no value for the numerical property to be
sorted last.
Is there a way to e.g. use conditional sorting? E.g.
sort: "if(gt(unique,0),avg,-9) desc"

I can't get this to work, while in the old API this appears to be possible.

Or is there another way to sort the buckets with a missing numeric value
last?

Kind regards, Tom


Rename solr to another name

2018-03-02 Thread Zheng Lin Edwin Yeo
Hi,

Are we able to rename folders like solr-webapp, or files like
solr-jetty-context.xml, to customised names like my-webapp and
my-jetty-context.xml?

I'm currently using Solr 6.5.1, and will upgrade to Solr 7.2.1 soon.

Regards,
Edwin


Re: 7.2.1 ExactStatsCache seems no longer functioning

2018-03-02 Thread Webster Homer
Your problem seems a lot like an issue I see with Near Real Time (NRT)
replicas. I posted about it in this forum. I was told that a possible
solution was to use the Global Stats feature. I am looking at testing that
now.

Have you tried using Tlog replicas? That fixed my issues with relevancy
differences between queries.

On Mon, Feb 19, 2018 at 9:41 AM, Markus Jelsma 
wrote:

> Hello,
>
> We're on 7.2.1 and rely on ExactStatsCache to work around the problem of
> not all nodes sharing the same maxDoc within a shard. But it doesn't work
> anymore!
>
> I've looked things up in Jira but nothing so far. SOLR-10952 also doesn't
> cause it because with queryResultCache disabled, document scores don't
> match up, the ordering of search results is not constant for the same query
> in consecutive searches.
>
> We see this on a local machine, just with default similarity and classic
> query parser.
>
> Any hints on what to do now?
>
> Many thanks,
> Markus
>



solo source build in local error

2018-03-02 Thread ramyogi
solr-repo/lucene-solr/build.xml:21: The following error occurred while
executing this line:
/solr-repo/lucene-solr/lucene/common-build.xml:623: java.lang.NullPointerException
        at java.util.Arrays.stream(Arrays.java:5004)
        at java.util.stream.Stream.of(Stream.java:1000)
        at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)

I am trying to build and debug Solr. The build seems to be throwing this
error. Is there anything I need to set before running the build?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

[poll] which loadbalancer are you using for SolrCloud

2018-03-02 Thread Bernd Fehling
Dear list,

I would like to poll for the loadbalancer you are using for SolrCloud.

Are you using a loadbalancer for SolrCloud?

If yes, which one (SolrJ, HAProxy, Varnish, Nginx,...) and why?

If not, why not?


Regards, Bernd


index mail with MailEntityProcessor

2018-03-02 Thread Dimitris Kardarakos

Hello everyone.

I have created a collection and indexed mail from a Gmail mailbox.
Nevertheless, only plain text is indexed; neither HTML-formatted mail nor
attachments are indexed.


To index the mail, I included the libs below in solrconfig.xml:

<lib dir="..." regex=".*\.jar" />
<lib dir="..." regex="solr-cell-\d.*\.jar" />


Created mail-data-config.xml as below:

<dataConfig>
  <document>
    <entity processor="MailEntityProcessor" ...
            fetchMailsSince="2018-01-31 00:00:00" batchSize="20"
            folders="inbox" processAttachement="true" name="mail_entity"/>
  </document>
</dataConfig>

and added the below as well to solrconfig.xml:

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">mail-data-config.xml</str>
  </lst>
</requestHandler>
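For completeness, I trigger the import like this (the host and collection
name here are placeholders):

curl "http://localhost:8983/solr/mycollection/dataimport?command=full-import"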

Thanks in advance for your support :)

--
Dimitris Kardarakos



Re: Word / PDF document snippet rendering in search

2018-03-02 Thread Charlie Hull

On 02/03/2018 00:15, T Wild wrote:

I'm interested in building a software system which will connect to various
document sources, extract the content from the documents contained within
each source, and make the extracted content available to a search engine
such as Solr. This search engine will serve as the back-end for a web-based
search application.
This is basically an 'enterprise search' system. You use 'connectors' to 
get text out of the source documents - in Solr applications we often use 
Apache Tika to extract text from common formats like Office or PDF;
Apache ManifoldCF is another useful project for connecting to repositories.




I'm interested in rendering snippets of these documents in the search
results for well-known types, such as Microsoft Word and PDF. How would one
go about implementing document snippet rendering in search?


If you just want the snippets as text, you can use Solr highlighters,
which can provide contextual snippets (i.e. chunks of text around the
query matches).
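
For example, a request like this returns up to three snippets per matching
document (the core and field names are assumptions):

curl "http://localhost:8983/solr/mycore/select?q=content:report&hl=true&hl.fl=content&hl.snippets=3"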


I'd be happy with serving up these snippets in any format, including as
images. I just want to be able to give my users some kind of formatted
preview of their results for well-known types.


If, however, you want to show bits of the original documents, that's more
difficult. You'll need to store a reference to the original document in
Solr and use an external system to display it - you'll need specific 
systems for different doc types: PDFs can be shown in various browser 
plugins for example. Another approach is illustrated in this open source 
code we wrote a while ago - it uses OpenOffice in 'headless' mode to 
provide images of the source document:

https://github.com/flaxsearch/flaxcode/tree/master/flax_basic/libs/previewgen

Hope this helps!

Cheers

Charlie


Thank you!




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Solr thread problems

2018-03-02 Thread 苗海泉
Thank you for reading my question in detail. Let me explain. With 1169
threads, we have only 937 collections, and that figure is the thread count of
a single Solr node, not the total for the cluster. With more than a thousand
collections SolrCloud was in poor condition, so we have reduced the number of
collections.

Our workload is append-only: we add documents and never modify them, and we
do not delete individual documents, so the DeleteByQuery waiting you describe
should not apply, because we never issue deletes.


I examined the SolrCloud commit-scheduling and query threads and found that
the number of commit-scheduling threads is tied to the number of replicas on
a Solr node: with automatic commit enabled, commit-scheduling threads = 2 *
the node's replica count; without automatic commit, they equal the replica
count. What puzzles me is that the commit-scheduling threads are not
destroyed after they finish (without automatic commit), so as the number of
collections grows, the commit-scheduling threads keep increasing until
SolrCloud is restarted.
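
For reference, by "automatic commit" I mean the autoCommit block in
solrconfig.xml; ours looks roughly like this (the values are examples):

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>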

2018-03-02 10:15 GMT+08:00 Shawn Heisey :

> On 3/1/2018 4:31 AM, 苗海泉 wrote:
>
>> My question is, what is the relationship between the number of threads in
>> the commitScheduler thread pool and what? The number of searcherExecutor
>> thread pool and the above have a relationship, why so much, thank you!
>>
>
> I don't understand this question.  Can you try again?
>
> Looking at the information you've provided ... are you doing a large
> number of simultaneous DeleteByQuery operations? That does seem like a lot
> of threads in a WAITING state.  I'm not sure what is going on, but if there
> are a lot of DeleteByQuery operations happening, it MIGHT cause that.
>
> In another discussion, you mentioned having more than one thousand
> collections.  This is going to result in Solr creating a large number of
> threads.  I'm actually surprised that you don't have far more than 1169
> threads.
>
> Every experiment I've done with thousands of collections has turned out
> badly, often before I even reach 1000 of them. SolrCloud just does not like
> dealing with it.
>
> Three years ago, I filed the following issue about it. Somebody marked it
> as fixed, though I have no idea why.  It doesn't appear to be fixed to me,
> and there were never any code changes related to the issue:
>
> https://issues.apache.org/jira/browse/SOLR-7191
>
>
>
> In that discussion about lots of collections, you asked this:
>
> Thank you for your advice on gc tools, what do you suggest to me?
>>
>
> I don't understand this question either.  I had just given you some advice
> about GC tools.  What are you asking?
>
> Thanks,
> Shawn
>
>


-- 
==
联创科技
知行如一
==


Configuring Solr Data and Index directories

2018-03-02 Thread YELESWARAPU, VENKATA BHAN

Dear Team,

While deploying Solr I just see one parameter where we provide solr_home path.
For ex: -Dsolr.solr.home=/usr/local/clo/ven/solr_home

1)  Is there any path where we can configure data and index directories.
2)  Can we separate data directory from solr_home.
3)  Also, how to enable password protection for solr so that only limited 
people can access.

Could you please help answer these?

Thank you very much,
Dutt