Re: is ConcurrentUpdateSolrClient.Builder thread safe?

2018-01-10 Thread Shawn Heisey

On 1/11/2018 12:05 AM, Bernd Fehling wrote:

This will never pass a Jepsen test and I call it _NOT_ thread safe.

I haven't looked into the code yet, to see if the queue is FIFO, otherwise
this would be stupid.


I was not thinking about order of operations when I said that the client 
was threadsafe.  I meant that one client object can be used 
simultaneously by multiple threads without anything getting 
cross-contaminated within the program.


If you are absolutely reliant on operations happening in a precise 
order, such that a document could get indexed in one request and then 
replaced (or updated) with a later request, you should not use the 
concurrent client.  You could define it with a single thread, but if you 
do that, then the concurrent client doesn't work any faster than the 
standard client.


When a concurrent client is built, it creates the specified number of 
processing threads.  When updates are sent, they are added to an 
internal queue.  The processing threads will handle requests from the 
queue as long as the queue is not empty.


Those threads will process the requests they have been assigned 
simultaneously.  Although I'm sure that each thread pulls requests off 
the queue in a FIFO manner, I have a scenario for you to consider.  This 
scenario is not just an intellectual exercise, it is the kind of thing 
that can easily happen in the wild.


Let's say that when document X is initially indexed, it is at position 
997 in a batch of 1000 documents.  Then two update requests later, the 
new version of document X is at position 2 in another batch of 1000 
documents.


If there are at least three threads in the concurrent client, those 
update requests may begin execution at nearly the same time.  In that 
situation, Solr is likely to index document X in the request added later 
before it indexes document X in the request added earlier, resulting in 
outdated information ending up in the index.


The same thing can happen even with a non-concurrent client when it is 
used in a multi-threaded manner.


Preserving order of operations cannot be guaranteed if there are 
multiple threads.  It could be possible to add some VERY sophisticated 
synchronization capabilities, but writing code to do that would be very 
difficult, and it wouldn't be trivial to use either.
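
For concreteness, here is a minimal SolrJ sketch (not from the original mail; the URL and queue size are placeholders) of the single-thread configuration mentioned above: one runner thread drains the internal queue, so updates reach Solr in the order they were added, but throughput is then similar to the standard client.

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;

public class SingleThreadedCUSC {
    public static ConcurrentUpdateSolrClient build() {
        // Placeholder URL; adjust to your core/collection.
        return new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/mycollection")
                .withQueueSize(10_000)  // documents buffered before add() blocks
                .withThreadCount(1)     // single drain thread => FIFO delivery to Solr
                .build();
    }
}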


Thanks,
Shawn


Re: is ConcurrentUpdateSolrClient.Builder thread safe?

2018-01-10 Thread Bernd Fehling
Hi Shawn,

from your answer I see that you are obviously not using 
ConcurrentUpdateSolrClient.
I didn't say that I use ConcurrentUpdateSolrClient in multiple threads.
I said that ConcurrentUpdateSolrClient.Builder has a method, "withThreadCount",
to empty the client's queue with multiple threads.
This is useful for bulk loading huge data volumes or replaying a backup into the index.

As far as I can see from the indexer's infostream output, there are _no_ indexing errors.

I have now tried with one thread several times and everything was fine.
The newer docs replaced the older docs (which were marked deleted) in the index.
With more than 1 "threadCount" for emptying the queue there are problems with
ConcurrentUpdateSolrClient.

This will never pass a Jepsen test and I call it _NOT_ thread safe.

I haven't looked into the code yet, to see if the queue is FIFO, otherwise
this would be stupid.

Regards
Bernd


Am 11.01.2018 um 02:27 schrieb Shawn Heisey:
> On 1/10/2018 8:33 AM, Bernd Fehling wrote:
>> after some strange search results I was trying to locate the problem
>> and it turned out that it starts with bulk loading with SolrJ
>> and ConcurrentUpdateSolrClient.Builder with several threads.
>>
>> I assume that ConcurrentUpdateSolrClient.Builder is _NOT_ thread safe
>> regarding the order in which docs are sent to the indexer?
> 
> Why would you need the Builder to be threadsafe?
> 
> The actual client object (ConcurrentUpdateSolrClient) should be perfectly 
> threadsafe, but the Builder probably isn't, and I can't think of any
> reason to try and use it with multiple threads.  In a well-constructed 
> program, you will use the Builder exactly once, in an initialization
> thread, and then have all the indexing threads use the client object that the 
> Builder creates.
> 
> I hope you're aware that the concurrent client swallows all indexing errors 
> and does not tell your program about them.
> 
> Thanks,
> Shawn
> 


Re: Ingestion not scaling horizontally as I add more cores to Solr

2018-01-10 Thread Shawn Heisey

On 1/10/2018 12:58 PM, Shashank Pedamallu wrote:

As you can see, the number of documents being ingested per core is not scaling 
horizontally as I'm adding more cores. Rather, the total number of documents 
getting ingested for the Solr JVM is topping out at around 90k documents per second.


I would call 90K documents per second a very respectable speed.  I can't 
get my indexing to happen at anywhere near that rate.  My indexing is 
not multi-threaded, though.



From the iostats and top commands, I do not see any bottlenecks with the iops 
or CPU, respectively. CPU usage is around 65% and a sample of iostats is below:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          55.32    0.00    2.33    1.64    0.00   40.71

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda5           2523.00     45812.00    298312.00      45812     298312


Nearly 300 megabytes per second write speed?  That's a LOT of data. 
This storage must be quite a bit better than a single spinning disk. 
You won't get that kind of sustained transfer speed out of standard 
spinning disks unless they are using something like RAID10 or RAID0. 
This transfer speed is also well beyond the capabilities of Gigabit 
Ethernet.


When Gus asked whether you were sending documents to the cloud from your 
local machine, I don't think he was referring to a public cloud.  I 
think he assumed you were running SolrCloud, so "cloud" was probably 
referring to your Solr installation, not a public cloud service.  If I 
had to guess, I think the intent was to find out what caliber of machine 
you're using to send the indexing requests.


I don't know if the bottleneck is on the client side or the server side. 
 But I would imagine that with everything on a single machine, you may 
not be able to get the ingestion rate to go much higher.


Is the jmeter running on a different machine from Solr or on the same 
machine?


Thanks,
Shawn


Re: Spatial search, nested docs, feature density

2018-01-10 Thread Mikhail Khludnev
The problem itself sounds really challenging, but two pointers for the last
question are:
 -
https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-Scoring
 - find the field() function in
https://lucene.apache.org/solr/guide/6_6/function-queries.html#FunctionQueries-AvailableFunctions
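
To make those two pointers concrete, here is a rough SolrJ sketch. It is an illustration only: the collection name, the child "density" field, and the type:parent marker are assumptions about the schema described in the question, and score=total is the block-join parser's sum aggregation described on the linked Scoring page.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DensityRanking {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/layers").build()) {
            // {!parent} rolls child scores up to the parent; score=total sums them,
            // and {!func}density makes each child's score equal to its density field.
            SolrQuery q = new SolrQuery("{!parent which=\"type:parent\" score=total v=$childq}");
            q.set("childq", "{!func}density");
            q.setFields("id", "score");
            QueryResponse rsp = solr.query(q);
            rsp.getResults().forEach(d -> System.out.println(d.get("id") + " " + d.get("score")));
        }
    }
}

Restricting the child query to the search extent (for example, an additional spatial filter on the grid-cell geometry combined with the {!func}density clause) would make the aggregated score reflect only the overlapping cells.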


On Thu, Jan 11, 2018 at 2:13 AM, Leila Deljkovic <
leila.deljko...@koordinates.com> wrote:

> Hi,
>
> https://lucene.apache.org/solr/guide/7_0/uploading-data-with-index-handlers.html#UploadingDatawithIndexHandlers-NestedChildDocuments
>
> I have never used nested documents, but a bit of background on what I’m
> doing is that a spatial data layer consisting of features (points, lines,
> polygons, or an aerial image) is split up into sections (grid cells) based
> on the density of these features over the layer; smaller grid cells
> indicate high density of features in that area.
>
> I need to rank results based on density of features and whether dense
> areas of the layer overlap with the region of space on a map I am searching
> in. This is important because a layer could cover an entire country, for
> example if I query for “roads”, the layer would be dense in urban areas as
> there are more roads there, and less dense in rural areas, and if I am
> searching for a particular city, this layer would be of interest to me even
> though it covers the entire country. The idea is for the original layer to
> be the parent document (which is what should be returned when a query is
> made), and the child documents are the individual grid cells (which will
> hold the geometry of the cell and a density field for the features inside
> the cell).
>
> I would like to know if it is possible to rank the parent document based
> on a function which aggregates fields from the child documents (in this
> case, the density field). There is not much info on this that I could find
> online.
>
> Thanks




-- 
Sincerely yours
Mikhail Khludnev


Re: ClassicTokenizer

2018-01-10 Thread Shawn Heisey

On 1/10/2018 2:27 PM, Rick Leir wrote:

I did not express that clearly.
The reference guide says "The Classic Tokenizer preserves the same behavior as the 
Standard Tokenizer of Solr versions 3.1 and previous. "

So I am curious to know why they changed StandardTokenizer after 3.1 to break 
on hyphens, when it seems to me to work better the old way?


I really have no idea.  Those are Lucene classes, not Solr.  Maybe 
someone who was around for whatever discussions happened on Lucene lists 
back in those days will comment.


I wasn't able to find the issue where ClassicTokenizer was created, and 
I couldn't find any information discussing the change.


If I had to guess why StandardTokenizer was updated this way, I think it 
is to accommodate searches where people were searching for one word in 
text where that word was part of something larger with a hyphen, and it 
wasn't being found.  There was probably a discussion among the 
developers about what a typical Lucene user would want, so they could 
decide what they would have the standard tokenizer do.


Likely because there was a vocal segment of the community reliant on the 
old behavior, they preserved that behavior in ClassicTokenizer, but 
updated the standard one to do what they felt would be normal for a 
typical user.


Obviously *your* needs do not fall in line with what was decided ... so 
the standard tokenizer isn't going to work for you.
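
For anyone who wants to see the hyphen behavior directly, here is a small, self-contained Lucene sketch (not from the thread; the sample string is made up). Per the classic grammar, a hyphenated token that contains a digit is treated as a product number and kept whole, while the UAX#29-based StandardTokenizer splits it at the hyphen.

import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class HyphenDemo {
    static void dump(String label, Tokenizer tok, String text) throws Exception {
        tok.setReader(new StringReader(text));
        CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
        tok.reset();
        System.out.print(label + ":");
        while (tok.incrementToken()) {
            System.out.print(" [" + term + "]");
        }
        System.out.println();
        tok.end();
        tok.close();
    }

    public static void main(String[] args) throws Exception {
        String text = "order AS-3313 today";
        // Expected (per the classic grammar): the digit-containing token stays whole.
        dump("classic ", new ClassicTokenizer(), text);   // [order] [AS-3313] [today]
        dump("standard", new StandardTokenizer(), text);  // [order] [AS] [3313] [today]
    }
}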


Thanks,
Shawn


Re: ./fs-manager process run under solr

2018-01-10 Thread Shawn Heisey

On 1/10/2018 12:19 PM, Andy Fake wrote:

I use Solr 5.5. I recently noticed a process, ./fs-manager, running
under user solr that takes quite high CPU usage. I don't think I have seen such a
process before.


I have never heard of this, and have never seen it.

Searching the source code, I cannot find that string.

What OS is Solr running on?  How did you start it?  Exactly what are you 
looking at when you see this "fs-manager" process?


Thanks,
Shawn


Re: Ingestion not scaling horizontally as I add more cores to Solr

2018-01-10 Thread Shashank Pedamallu
They are separate cases. In attempt 1 I was ingesting to only 1 core, then to 
3 cores, and then to 5 cores. Yes, they are completely independent cores.

I think I was not reading the iostat output right. With the –x option, the 'avgrq-sz' 
parameter is constantly above 300. From some reading online, I see that a 3-digit 
number for this parameter is a red flag. I'm trying to run the 
experiments on a better disk now. 

Yes, the intent is to max out the cpu to find the maximum load the system can 
handle.

Thanks,
Shashank

On 1/10/18, 4:59 PM, "Erick Erickson"  wrote:

OK, so I'm assuming your indexer indexes to 1, 3 and 5 separate cores
depending on how many are available, right? And these cores are essentially
totally independent.

I'd guess your gating factor is your ingestion process. Try spinning up two
identical ones from two separate clients. Eventually you should be able to
max out your CPU as you add cores. The fact that your indexing rate is
fairly constant at 90K docs/sec is a red flag that that's the rate you're
feeding docs to Solr.

At some point you'll max out our CPU and that'll be the limit.

Best,
Erick

On Wed, Jan 10, 2018 at 1:52 PM, Shashank Pedamallu 
wrote:

> - Did you sept up an actual multiple node cluster or are you running this
> all on one box?
> Sorry, I should have mentioned this earlier. I’m running Solr in non-cloud
> mode. It is just a single node Solr.
>
> - Are you configuring Jmeter to send with multiple threads?
> Yes, multiple threads looping a fixed number of times
>
> - Are they all sending to the same node, or are you distributing across
> nodes? Is there a load balancer?
> Yes, since there is only one node.
>
> - If you are sending requests up to the cloud from your local machine,
> that is frequently a slow link.
> Not a public cloud. Our private one.
>
> - are you sending one document at a time or batching them up?
> Batching them up. About 1000 documents in one request
>
> Thanks,
> Shashank
>
> On 1/10/18, 1:35 PM, "Gus Heck"  wrote:
>
> Ok then here's a few things to check...
>
>- Did you sept up an actual multiple node cluster or are you 
running
>this all on one box?
>- Are you configuring Jmeter to send with multiple threads?
>- Are they all sending to the same node, or are you distributing
> across
>nodes? Is there a load balancer?
>- Are you sending from a machine on the same network as the
> machines in
>the Solr cluster?
>- If you are sending requests up to the cloud from your local
> machine,
>that is frequently a slow link.
>- Also don't forget to check your zookeeper cluster's health... if
> it's
>bogged down that will slow down solr.
>
> If you have all machines on the same network, many threads, load
> balancing
> and no questionable equipment (or networking limitations put in place
> by
> IT) in the middle, then something (either CPU or network interface)
> should
> be maxed out somewhere on at least one machine, either on the Jmeter
> side
> or Solr side.
>
> -Gus
>
> On Wed, Jan 10, 2018 at 3:54 PM, Shashank Pedamallu <
> spedama...@vmware.com>
> wrote:
>
> > Hi Gus,
> >
> > Thank  for the reply. I’m sending via jmeter running on my local
> machine
> > to Solr running on a remote vm.
> >
> > Thanks,
> > Shashank
> >
> > On 1/10/18, 12:34 PM, "Gus Heck"  wrote:
> >
> > Ingested how? Sounds like your document sending mechanism is
> maxed,
> > not the
> > solr cluster...
> >
> > On Wed, Jan 10, 2018 at 2:58 PM, Shashank Pedamallu <
> > spedama...@vmware.com>
> > wrote:
> >
> > > Hi,
> > >
> > >
> > >
> > > I’m trying to find the upper thresholds of ingestion and I 
have
> > tried the
> > > following. In each of the experiments, I’m ingesting random
> > documents with
> > > 5 fields.
> > >
> > >
> > > Number of Cores Number of documents ingested per second per
> core
> > > 1   89000
> > > 3   33000
> > > 5   18000
> > >
> > >
> > > As you can see, the number of documents being ingested per
> core is
> > not
> > > scaling horizontally as I'm adding more cores. Rather the 
total
> > number of
> > > documents getting ingested 

Re: regarding exposing merge metrics

2018-01-10 Thread Shawn Heisey

On 1/10/2018 11:08 AM, S G wrote:

Last comment by Shawn on SOLR-10130 is:
Metrics was just a theory, sounds like that's not it.

It would be very interesting to know what really caused the slowdown and whether
we really need the config or not.


That comment wasn't actually about SOLR-10130 itself.  I commented on that 
issue because it was semi-related to what I was looking into, so I 
figured the developer who fixed it might have some insight to the 
question I needed answered.


I think that SOLR-10130 was probably handled correctly and that they did 
indeed find/fix the problem.


My comments were speculating about completely different performance 
issues -- someone was seeing reduced performance upgrading from 4.x to a 
version that included the fix for SOLR-10130.  Because SOLR-10130 
already indicated that metrics could cause performance issues, I was 
wondering whether the metrics that were added for keeping track of Jetty 
operation could possibly be the source of the user's problems.  The 
response I got indicated that it was unlikely that the Jetty metrics 
were the cause.


Performance regressions between 4.x and later versions have been noted 
by many users.  I think some of them were caused by changes in Lucene, 
and it's possible that those who understand these issues have not yet 
found a solution to fix them.


Thanks,
Shawn



Re: is ConcurrentUpdateSolrClient.Builder thread safe?

2018-01-10 Thread Shawn Heisey

On 1/10/2018 8:33 AM, Bernd Fehling wrote:

after some strange search results I was trying to locate the problem
and it turned out that it starts with bulk loading with SolrJ
and ConcurrentUpdateSolrClient.Builder with several threads.

I assume that ConcurrentUpdateSolrClient.Builder is _NOT_ thread safe
regarding the order in which docs are sent to the indexer?


Why would you need the Builder to be threadsafe?

The actual client object (ConcurrentUpdateSolrClient) should be 
perfectly threadsafe, but the Builder probably isn't, and I can't think 
of any reason to try and use it with multiple threads.  In a 
well-constructed program, you will use the Builder exactly once, in an 
initialization thread, and then have all the indexing threads use the 
client object that the Builder creates.
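
A minimal sketch of that pattern (placeholder URL, sizes, and field names; not from the original mail): the Builder is used exactly once during initialization, and the resulting client is shared by all indexing threads.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SharedClientIndexer {
    public static void main(String[] args) throws Exception {
        // Builder used exactly once, in the initialization thread.
        ConcurrentUpdateSolrClient client = new ConcurrentUpdateSolrClient.Builder(
                "http://localhost:8983/solr/mycollection")
                .withQueueSize(50_000)
                .withThreadCount(4)
                .build();

        ExecutorService indexers = Executors.newFixedThreadPool(8);
        for (int t = 0; t < 8; t++) {
            final int id = t;
            indexers.submit(() -> {
                // All indexing threads share the one client instance.
                for (int i = 0; i < 1000; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", "thread" + id + "-" + i);
                    client.add(doc);
                }
                return null;
            });
        }
        indexers.shutdown();
        indexers.awaitTermination(10, TimeUnit.MINUTES);
        client.commit();
        client.close();
    }
}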


I hope you're aware that the concurrent client swallows all indexing 
errors and does not tell your program about them.


Thanks,
Shawn



Re: Ingestion not scaling horizontally as I add more cores to Solr

2018-01-10 Thread Erick Erickson
OK, so I'm assuming your indexer indexes to 1, 3 and 5 separate cores
depending on how many are available, right? And these cores are essentially
totally independent.

I'd guess your gating factor is your ingestion process. Try spinning up two
identical ones from two separate clients. Eventually you should be able to
max out your CPU as you add cores. The fact that your indexing rate is
fairly constant at 90K docs/sec is a red flag that that's the rate you're
feeding docs to Solr.

At some point you'll max out your CPU and that'll be the limit.

Best,
Erick

On Wed, Jan 10, 2018 at 1:52 PM, Shashank Pedamallu 
wrote:

> - Did you sept up an actual multiple node cluster or are you running this
> all on one box?
> Sorry, I should have mentioned this earlier. I’m running Solr in non-cloud
> mode. It is just a single node Solr.
>
> - Are you configuring Jmeter to send with multiple threads?
> Yes, multiple threads looping a fixed number of times
>
> - Are they all sending to the same node, or are you distributing across
> nodes? Is there a load balancer?
> Yes, since there is only one node.
>
> - If you are sending requests up to the cloud from your local machine,
> that is frequently a slow link.
> Not a public cloud. Our private one.
>
> - are you sending one document at a time or batching them up?
> Batching them up. About 1000 documents in one request
>
> Thanks,
> Shashank
>
> On 1/10/18, 1:35 PM, "Gus Heck"  wrote:
>
> Ok then here's a few things to check...
>
>- Did you sept up an actual multiple node cluster or are you running
>this all on one box?
>- Are you configuring Jmeter to send with multiple threads?
>- Are they all sending to the same node, or are you distributing
> across
>nodes? Is there a load balancer?
>- Are you sending from a machine on the same network as the
> machines in
>the Solr cluster?
>- If you are sending requests up to the cloud from your local
> machine,
>that is frequently a slow link.
>- Also don't forget to check your zookeeper cluster's health... if
> it's
>bogged down that will slow down solr.
>
> If you have all machines on the same network, many threads, load
> balancing
> and no questionable equipment (or networking limitations put in place
> by
> IT) in the middle, then something (either CPU or network interface)
> should
> be maxed out somewhere on at least one machine, either on the Jmeter
> side
> or Solr side.
>
> -Gus
>
> On Wed, Jan 10, 2018 at 3:54 PM, Shashank Pedamallu <
> spedama...@vmware.com>
> wrote:
>
> > Hi Gus,
> >
> > Thank  for the reply. I’m sending via jmeter running on my local
> machine
> > to Solr running on a remote vm.
> >
> > Thanks,
> > Shashank
> >
> > On 1/10/18, 12:34 PM, "Gus Heck"  wrote:
> >
> > Ingested how? Sounds like your document sending mechanism is
> maxed,
> > not the
> > solr cluster...
> >
> > On Wed, Jan 10, 2018 at 2:58 PM, Shashank Pedamallu <
> > spedama...@vmware.com>
> > wrote:
> >
> > > Hi,
> > >
> > >
> > >
> > > I’m trying to find the upper thresholds of ingestion and I have
> > tried the
> > > following. In each of the experiments, I’m ingesting random
> > documents with
> > > 5 fields.
> > >
> > >
> > > Number of Cores Number of documents ingested per second per
> core
> > > 1   89000
> > > 3   33000
> > > 5   18000
> > >
> > >
> > > As you can see, the number of documents being ingested per
> core is
> > not
> > > scaling horizontally as I'm adding more cores. Rather the total
> > number of
> > > documents getting ingested for Solr JVM is being topped around
> 90k
> > > documents per second.
> > >
> > >
> > > From the iostats and top commands, I do not see any
> bottlenecks with
> > the
> > > iops or cpu respectively, CPU usaeg is around 65% and a sample
> of
> > iostats
> > > is below:
> > >
> > > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > >
> > >   55.320.002.331.640.00   40.71
> > >
> > >
> > > Device:tpskB_read/skB_wrtn/skB_read
> > kB_wrtn
> > >
> > > sda5   2523.00 45812.00298312.00  45812
> >  298312
> > >
> > >
> > > Can someone please guide me as to how I can debug this further
> and
> > > root-cause the bottleneck for not being able to increase the
> > ingestion
> > > horizontally.
> > >
> > >
> > > Thanks,
> > >
> > > Shashank
> > >
> >
> >
> >
> 

Mixing simple and nested docs in same update?

2018-01-10 Thread Jan Høydahl
Hi,

We index several large nested documents. We found that querying the data 
behaves differently depending on how the documents are indexed.

To reproduce:

solr start
solr create -c nested
# Index one plain document, “friend" and a nested one, “mother” and “daughter”, 
in same request:
curl localhost:8983/solr/nested/update -d '
<add>
  <doc>
    <field name="id">friend</field>
    <field name="type">other</field>
  </doc>
  <doc>
    <field name="id">mother</field>
    <field name="type">parent</field>
    <doc>
      <field name="id">daughter</field>
      <field name="type">child</field>
    </doc>
  </doc>
</add>'

# Query for mother’s children using either child transformer or child query 
parser
curl 
"localhost:8983/solr/a/query?q=id:mother&fl=%2A%2C%5Bchild%20parentFilter%3Dtype%3Aparent%5D"
{
  "responseHeader":{
"zkConnected":true,
"status":0,
"QTime":4,
"params":{
  "q":"id:mother",
  "fl":"*,[child parentFilter=type:parent]"}},
  "response":{"numFound":1,"start":0,"docs":[
  {
"id":"mother",
"type":["parent"],
"_version_":1589249812802306048,
"type_str":["parent"],
"_childDocuments_":[
{
  "id":"friend",
  "type":["other"],
  "_version_":1589249812729954304,
  "type_str":["other"]},
{
  "id":"daughter",
  "type":["child"],
  "_version_":1589249812802306048,
  "type_str":["child"]}]}]
  }}

As you can see, the “friend” got included as a child of “mother”.
If you index the exact same request, putting “friend” after “mother” in the xml,
the query works as expected.

Inspecting the index, everything looks correct, and only “daughter” and 
“mother” have _root_=mother.
Is there a rule that you should start a new update request for each type of 
parent/child relationship
that you need to index, and not mix them in the same request?
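
For reference, here is a SolrJ sketch of the workaround the question points at: the plain document and the parent/child block are sent in two separate update requests. It uses the same ids and types as the XML above, but whether separate requests are actually a required rule is exactly what is being asked, so treat this as a sketch of the workaround rather than a confirmed answer.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SeparateRequests {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/nested").build()) {
            // Request 1: the plain, non-nested document on its own.
            SolrInputDocument friend = new SolrInputDocument();
            friend.addField("id", "friend");
            friend.addField("type", "other");
            solr.add(friend);

            // Request 2: the parent/child block, sent separately so the plain doc
            // cannot end up inside the block.
            SolrInputDocument mother = new SolrInputDocument();
            mother.addField("id", "mother");
            mother.addField("type", "parent");
            SolrInputDocument daughter = new SolrInputDocument();
            daughter.addField("id", "daughter");
            daughter.addField("type", "child");
            mother.addChildDocument(daughter);
            solr.add(mother);

            solr.commit();
        }
    }
}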

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com



Spatial search, nested docs, feature density

2018-01-10 Thread Leila Deljkovic
Hi,

https://lucene.apache.org/solr/guide/7_0/uploading-data-with-index-handlers.html#UploadingDatawithIndexHandlers-NestedChildDocuments
 


I have never used nested documents, but a bit of background on what I’m doing 
is that a spatial data layer consisting of features (points, lines, polygons, 
or an aerial image) is split up into sections (grid cells) based on the density 
of these features over the layer; smaller grid cells indicate high density of 
features in that area. 

I need to rank results based on density of features and whether dense areas of 
the layer overlap with the region of space on a map I am searching in. This is 
important because a layer could cover an entire country, for example if I query 
for “roads”, the layer would be dense in urban areas as there are more roads 
there, and less dense in rural areas, and if I am searching for a particular 
city, this layer would be of interest to me even though it covers the entire 
country. The idea is for the original layer to be the parent document (which is 
what should be returned when a query is made), and the child documents are the 
individual grid cells (which will hold the geometry of the cell and a density 
field for the features inside the cell). 

I would like to know if it is possible to rank the parent document based on a 
function which aggregates fields from the child documents (in this case, the 
density field). There is not much info on this that I could find online.

Thanks

Re: Spatial search (and nested docs)

2018-01-10 Thread Leila Deljkovic
Hi Emir,

Thanks for the reply. My problem has been simplified a bit now. 

https://lucene.apache.org/solr/guide/7_0/uploading-data-with-index-handlers.html#UploadingDatawithIndexHandlers-NestedChildDocuments
 


I have never used nested documents, but a bit of background is that a spatial 
data layer consisting of features (points, lines, polygons, or an aerial image) 
is split up into sections (grid cells) based on the density of these features 
over the layer; smaller grid cells indicate high density of features in that 
area. 

I need to rank results based on density of features and whether dense areas of 
the layer overlap with the region of space on a map I am searching in. This is 
important because a layer could cover an entire country, for example if I query 
for “roads”, the layer would be dense in urban areas as there are more roads 
there, and less dense in rural areas, and if I am searching for a particular 
city, this layer would be of interest to me even though it covers the entire 
country. The idea is for the original layer to be the parent document (which is 
what should be returned when a query is made), and the child documents are the 
individual grid cells (which will hold the geometry of the cell and a density 
field for the features inside the cell). 

I would like to know if it is possible to rank the parent document based on a 
function which aggregates fields from the child documents (in this case, the 
density field). There is not much info on this that I could find online.

Thanks

> On 10/01/2018, at 11:58 PM, Emir Arnautović  
> wrote:
> 
> Hi Leila,
> Maybe I need to refresh my spatial terminology, but I am having troubles 
> following your case. Can you explain a bit more, what is dataset that is 
> indexed and what are query inputs and what should be the result. The one 
> thing that puzzles me the most is “nested documents”.
> 
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 
>> On 10 Jan 2018, at 04:15, Leila Deljkovic  
>> wrote:
>> 
>> Hi,
>> 
>> I’m quite new to Solr and am interested in using spatial search for 
>> geospatial data (Solr 7.1).
>> 
>> One problem is addressing feature density over a layer and using this to 
>> determine if a layer would be a relevant result over a search extent. I’d 
>> like to know is it feasible/possible to “split” a data layer into nested 
>> documents and index them, then at query time, count the number of nested 
>> documents that coincide with the search extent. Or maybe make use of 
>> overlapRatio or similar.
>> 
>> Thanks
>> 
>> 
> 



./fs-manager process run under solr

2018-01-10 Thread Andy Fake
Hi,

I use Solr 5.5. I recently noticed a process, ./fs-manager, running
under user solr that takes quite high CPU usage. I don't think I have seen such a
process before.

Is that a legitimate process from Solr?

Thanks.


Re: Ingestion not scaling horizontally as I add more cores to Solr

2018-01-10 Thread Erick Erickson
And I'd add
- are you sending one document at a time or batching them up? See:
https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/

Best,
Erick
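
A rough SolrJ sketch of the batching the linked post describes (the core name and field names here are placeholders; the batch size of 1000 matches the number mentioned later in the thread): collect documents into a list and send them in one add call rather than one HTTP request per document.

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIndexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 100_000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("value_s", "random-" + i);
                batch.add(doc);
                if (batch.size() == 1000) {   // ~1000 docs per request
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();
        }
    }
}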

On Wed, Jan 10, 2018 at 1:35 PM, Gus Heck  wrote:

> Ok then here's a few things to check...
>
>- Did you sept up an actual multiple node cluster or are you running
>this all on one box?
>- Are you configuring Jmeter to send with multiple threads?
>- Are they all sending to the same node, or are you distributing across
>nodes? Is there a load balancer?
>- Are you sending from a machine on the same network as the machines in
>the Solr cluster?
>- If you are sending requests up to the cloud from your local machine,
>that is frequently a slow link.
>- Also don't forget to check your zookeeper cluster's health... if it's
>bogged down that will slow down solr.
>
> If you have all machines on the same network, many threads, load balancing
> and no questionable equipment (or networking limitations put in place by
> IT) in the middle, then something (either CPU or network interface) should
> be maxed out somewhere on at least one machine, either on the Jmeter side
> or Solr side.
>
> -Gus
>
> On Wed, Jan 10, 2018 at 3:54 PM, Shashank Pedamallu  >
> wrote:
>
> > Hi Gus,
> >
> > Thank  for the reply. I’m sending via jmeter running on my local machine
> > to Solr running on a remote vm.
> >
> > Thanks,
> > Shashank
> >
> > On 1/10/18, 12:34 PM, "Gus Heck"  wrote:
> >
> > Ingested how? Sounds like your document sending mechanism is maxed,
> > not the
> > solr cluster...
> >
> > On Wed, Jan 10, 2018 at 2:58 PM, Shashank Pedamallu <
> > spedama...@vmware.com>
> > wrote:
> >
> > > Hi,
> > >
> > >
> > >
> > > I’m trying to find the upper thresholds of ingestion and I have
> > tried the
> > > following. In each of the experiments, I’m ingesting random
> > documents with
> > > 5 fields.
> > >
> > >
> > > Number of Cores Number of documents ingested per second per core
> > > 1   89000
> > > 3   33000
> > > 5   18000
> > >
> > >
> > > As you can see, the number of documents being ingested per core is
> > not
> > > scaling horizontally as I'm adding more cores. Rather the total
> > number of
> > > documents getting ingested for Solr JVM is being topped around 90k
> > > documents per second.
> > >
> > >
> > > From the iostats and top commands, I do not see any bottlenecks
> with
> > the
> > > iops or cpu respectively, CPU usaeg is around 65% and a sample of
> > iostats
> > > is below:
> > >
> > > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > >
> > >   55.320.002.331.640.00   40.71
> > >
> > >
> > > Device:tpskB_read/skB_wrtn/skB_read
> > kB_wrtn
> > >
> > > sda5   2523.00 45812.00298312.00  45812
> >  298312
> > >
> > >
> > > Can someone please guide me as to how I can debug this further and
> > > root-cause the bottleneck for not being able to increase the
> > ingestion
> > > horizontally.
> > >
> > >
> > > Thanks,
> > >
> > > Shashank
> > >
> >
> >
> >
> > --
> > http://www.the111shift.com
> >
> >
> >
>
>
> --
> http://www.the111shift.com
>


Re: Very high number of deleted docs, part 2

2018-01-10 Thread Erick Erickson
There's some background here:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

the 2.5G "live" document limit is really "50% of the max segment size",
hard-coded in TieredMergePolicy.

bq: Well, maxSegments with optimize or commit with expungeDeletes did not
do the job in testing

Surprising. What actually happened? Do note that expungeDeletes does not
promise to remove all deleted docs; it only rewrites segments whose share of
deleted documents exceeds a configurable percentage.

Best,
Erick
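
For reference, a SolrJ sketch (not from the thread; the URL is a placeholder) of the forced-merge variant being discussed. The expungeDeletes variant is noted only in a comment, since it is just a parameter on a commit sent to the update handler.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class MergeDown {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            // Forced merge ("optimize") down to at most 10 segments. Unlike normal
            // merging, this ignores TieredMergePolicy's 5GB max segment size, which is
            // how the oversized segments discussed in this thread come into existence.
            solr.optimize(true, true, 10);
            // The expungeDeletes variant is a plain commit parameter on the update
            // handler (/update?commit=true&expungeDeletes=true) and only rewrites
            // segments whose deleted ratio exceeds the merge policy's threshold.
        }
    }
}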

On Wed, Jan 10, 2018 at 9:45 AM, Markus Jelsma 
wrote:

> Well, maxSegments with optimize or commit with expungeDeletes did not do
> the job in testing. But tell me more about 2.5G live documents limit, no
> idea what it is.
>
> Thanks,
> Markus
>
> -Original message-
> > From:Erick Erickson 
> > Sent: Friday 5th January 2018 17:56
> > To: solr-user 
> > Subject: Re: Very high number of deleted docs, part 2
> >
> > I'm not 100% sure that playing with maxSegments will work.
> >
> > what will work is to re-index everything. You can re-index into the
> > existing collection, no need to start with a new collection. Eventually
> > you'll replace enough docs in the over-sized segments that they'll fall
> > under the 2.5G live documents limit and be merged away. Not elegant, but
> > it'd work.
> >
> > Best,
> > Erick
> >
> > On Fri, Jan 5, 2018 at 6:46 AM, Markus Jelsma <
> markus.jel...@openindex.io>
> > wrote:
> >
> > > It could be that when this index was first reconstructed, it was
> optimized
> > > to one segment before packed and shipped.
> > >
> > > How about optimizing it again, with maxSegments set to ten, it should
> > > recover right?
> > >
> > > -Original message-
> > > > From:Shawn Heisey 
> > > > Sent: Friday 5th January 2018 14:34
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: Very high number of deleted docs, part 2
> > > >
> > > > On 1/5/2018 5:33 AM, Markus Jelsma wrote:
> > > > > Another collection, now on 7.1, also shows this problem and has
> > > default TMP settings. This time size is different, each shard of this
> > > collection is over 40 GB, and each shard has about 50 % deleted
> documents.
> > > Each shard's largest segment is just under 20 GB with about 75 %
> deleted
> > > documents. After that are a few five/six GB segments with just under
> 50 %
> > > deleted documents.
> > > > >
> > > > > What do i need to change to make Lucene believe that at least that
> > > twenty GB and three month old segment should be merged away. And how
> what
> > > would the predicted indexing performance penalty be?
> > > >
> > > > Quick answer: Erick's statements in the previous thread can be
> > > > summarized as this:  On large indexes that do a lot of deletes or
> > > > updates, once you do an optimize, you have to continue to do
> optimizes
> > > > regularly, or you're going to have this problem.
> > > >
> > > > TL;DR:
> > > >
> > > > I think Erick covered most of this (possibly all of it) in the
> previous
> > > > thread.
> > > >
> > > > If you've got a 20GB segment and TMP's settings are default, then
> that
> > > > means at some point in the past, you've done an optimize.  The
> default
> > > > TMP settings have a maximum segment size of 5GB, so if you never
> > > > optimize, then there will never be a segment larger than 5GB, and the
> > > > deleted document percentage would be less likely to get out of
> control.
> > > > The optimize operation ignores the maximum segment size and reduces
> the
> > > > index to a single large segment with zero deleted docs.
> > > >
> > > > TMP's behavior with really big segments is apparently completely as
> the
> > > > author intended, but this specific problem wasn't ever addressed.
> > > >
> > > > If you do an optimize once and then don't ever do it again, any very
> > > > large segments are going to be vulnerable to this problem, and the
> only
> > > > way (currently) to fix it is to do another optimize.
> > > >
> > > > See this issue for a more in-depth discussion and an attempt to
> figure
> > > > out how to avoid it:
> > > >
> > > > https://issues.apache.org/jira/browse/LUCENE-7976
> > > >
> > > > Thanks,
> > > > Shawn
> > > >
> > > >
> > >
> >
>


Re: Ingestion not scaling horizontally as I add more cores to Solr

2018-01-10 Thread Gus Heck
Ok then here's a few things to check...

   - Did you set up an actual multiple node cluster or are you running
   this all on one box?
   - Are you configuring Jmeter to send with multiple threads?
   - Are they all sending to the same node, or are you distributing across
   nodes? Is there a load balancer?
   - Are you sending from a machine on the same network as the machines in
   the Solr cluster?
   - If you are sending requests up to the cloud from your local machine,
   that is frequently a slow link.
   - Also don't forget to check your zookeeper cluster's health... if it's
   bogged down that will slow down solr.

If you have all machines on the same network, many threads, load balancing
and no questionable equipment (or networking limitations put in place by
IT) in the middle, then something (either CPU or network interface) should
be maxed out somewhere on at least one machine, either on the Jmeter side
or Solr side.

-Gus

On Wed, Jan 10, 2018 at 3:54 PM, Shashank Pedamallu 
wrote:

> Hi Gus,
>
> Thank  for the reply. I’m sending via jmeter running on my local machine
> to Solr running on a remote vm.
>
> Thanks,
> Shashank
>
> On 1/10/18, 12:34 PM, "Gus Heck"  wrote:
>
> Ingested how? Sounds like your document sending mechanism is maxed,
> not the
> solr cluster...
>
> On Wed, Jan 10, 2018 at 2:58 PM, Shashank Pedamallu <
> spedama...@vmware.com>
> wrote:
>
> > Hi,
> >
> >
> >
> > I’m trying to find the upper thresholds of ingestion and I have
> tried the
> > following. In each of the experiments, I’m ingesting random
> documents with
> > 5 fields.
> >
> >
> > Number of Cores Number of documents ingested per second per core
> > 1   89000
> > 3   33000
> > 5   18000
> >
> >
> > As you can see, the number of documents being ingested per core is
> not
> > scaling horizontally as I'm adding more cores. Rather the total
> number of
> > documents getting ingested for Solr JVM is being topped around 90k
> > documents per second.
> >
> >
> > From the iostats and top commands, I do not see any bottlenecks with
> the
> > iops or cpu respectively, CPU usaeg is around 65% and a sample of
> iostats
> > is below:
> >
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >
> >   55.320.002.331.640.00   40.71
> >
> >
> > Device:tpskB_read/skB_wrtn/skB_read
> kB_wrtn
> >
> > sda5   2523.00 45812.00298312.00  45812
>  298312
> >
> >
> > Can someone please guide me as to how I can debug this further and
> > root-cause the bottleneck for not being able to increase the
> ingestion
> > horizontally.
> >
> >
> > Thanks,
> >
> > Shashank
> >
>
>
>
> --
> http://www.the111shift.com
>
>
>


-- 
http://www.the111shift.com


Re: ClassicTokenizer

2018-01-10 Thread Rick Leir
Shawn
I did not express that clearly. 
The reference guide says "The Classic Tokenizer preserves the same behavior as 
the Standard Tokenizer of Solr versions 3.1 and previous. "

So I am curious to know why they changed StandardTokenizer after 3.1 to break 
on hyphens, when it seems to me to work better the old way?
Thanks
Rick

On January 9, 2018 7:07:59 PM EST, Shawn Heisey  wrote:
>On 1/9/2018 9:36 AM, Rick Leir wrote:
>> A while ago the default was changed to StandardTokenizer from
>ClassicTokenizer. The biggest difference seems to be that Classic does
>not break on hyphens. There is also a different character pr(mumble). I
>prefer the Classic's non-break on hyphens.
>
>To have any ability to research changes, we're going to need to know
>precisely what you mean by "default" in that statement.
>
>Are you talking about the example schemas, or some kind of inherent
>default when an analysis chain is not specified?
>
>Probably the reason for the change is an attempt to move into the
>modern
>era, become more standardized, and stop using old/legacy
>implementations.  The name of the new default contains the word
>"Standard" which would fit in with that goal.
>
>I can't locate any changes in the last couple of years that change the
>classic tokenizer to standard.  Maybe I just don't know the right place
>to look.
>
>> What was the reason for changing this default? If I understand this
>better I can avoid some pitfalls, perhaps.
>
>If you are talking about example schemas, then the following may apply:
>
>Because you understand how analysis components work well enough to even
>ask your question, I think you're probably the kind of admin who is
>going to thoroughly customize the schema and not rely on the defaults
>for TextField types that come with Solr.  You're free to continue using
>the classic tokenizer in your schema if that meets your needs better
>than whatever changes are made to the examples by the devs.  The
>examples are only starting points, virtually all Solr installs require
>customizing the schema.
>
>Thanks,
>Shawn

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: Ingestion not scaling horizontally as I add more cores to Solr

2018-01-10 Thread Shashank Pedamallu
Hi Gus,

Thanks for the reply. I’m sending via jmeter running on my local machine to 
Solr running on a remote VM.

Thanks,
Shashank

On 1/10/18, 12:34 PM, "Gus Heck"  wrote:

Ingested how? Sounds like your document sending mechanism is maxed, not the
solr cluster...

On Wed, Jan 10, 2018 at 2:58 PM, Shashank Pedamallu 
wrote:

> Hi,
>
>
>
> I’m trying to find the upper thresholds of ingestion and I have tried the
> following. In each of the experiments, I’m ingesting random documents with
> 5 fields.
>
>
> Number of Cores Number of documents ingested per second per core
> 1   89000
> 3   33000
> 5   18000
>
>
> As you can see, the number of documents being ingested per core is not
> scaling horizontally as I'm adding more cores. Rather the total number of
> documents getting ingested for Solr JVM is being topped around 90k
> documents per second.
>
>
> From the iostats and top commands, I do not see any bottlenecks with the
> iops or cpu respectively, CPU usaeg is around 65% and a sample of iostats
> is below:
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>
>   55.320.002.331.640.00   40.71
>
>
> Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
>
> sda5   2523.00 45812.00298312.00  45812 298312
>
>
> Can someone please guide me as to how I can debug this further and
> root-cause the bottleneck for not being able to increase the ingestion
> horizontally.
>
>
> Thanks,
>
> Shashank
>



-- 

http://www.the111shift.com




Re: Ingestion not scaling horizontally as I add more cores to Solr

2018-01-10 Thread Gus Heck
Ingested how? Sounds like your document sending mechanism is maxed, not the
solr cluster...

On Wed, Jan 10, 2018 at 2:58 PM, Shashank Pedamallu 
wrote:

> Hi,
>
>
>
> I’m trying to find the upper thresholds of ingestion and I have tried the
> following. In each of the experiments, I’m ingesting random documents with
> 5 fields.
>
>
> Number of Cores Number of documents ingested per second per core
> 1   89000
> 3   33000
> 5   18000
>
>
> As you can see, the number of documents being ingested per core is not
> scaling horizontally as I'm adding more cores. Rather the total number of
> documents getting ingested for Solr JVM is being topped around 90k
> documents per second.
>
>
> From the iostats and top commands, I do not see any bottlenecks with the
> iops or cpu respectively, CPU usaeg is around 65% and a sample of iostats
> is below:
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>
>   55.320.002.331.640.00   40.71
>
>
> Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
>
> sda5   2523.00 45812.00298312.00  45812 298312
>
>
> Can someone please guide me as to how I can debug this further and
> root-cause the bottleneck for not being able to increase the ingestion
> horizontally.
>
>
> Thanks,
>
> Shashank
>



-- 
http://www.the111shift.com


Ingestion not scaling horizontally as I add more cores to Solr

2018-01-10 Thread Shashank Pedamallu
Hi,



I’m trying to find the upper thresholds of ingestion and I have tried the 
following. In each of the experiments, I’m ingesting random documents with 5 
fields.


Number of Cores Number of documents ingested per second per core
1   89000
3   33000
5   18000


As you can see, the number of documents being ingested per core is not scaling 
horizontally as I'm adding more cores. Rather, the total number of documents 
getting ingested for the Solr JVM is topping out at around 90k documents per second.


From the iostats and top commands, I do not see any bottlenecks with the iops 
or CPU, respectively. CPU usage is around 65% and a sample of iostats is below:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          55.32    0.00    2.33    1.64    0.00   40.71

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda5           2523.00     45812.00    298312.00      45812     298312


Can someone please guide me as to how I can debug this further and root-cause 
the bottleneck for not being able to increase the ingestion horizontally.


Thanks,

Shashank


Re: regarding exposing merge metrics

2018-01-10 Thread S G
Last comment by Shawn on SOLR-10130 is:
Metrics was just a theory, sounds like that's not it.

It would be very interesting to know what really caused the slowdown and whether
we really need the config or not.

Thanks
SG



On Tue, Jan 9, 2018 at 12:00 PM, suresh pendap 
wrote:

> Thanks Shalin for sharing the link. However if I follow the thread then it
> seems like there was no conclusive evidence found that the performance
> degradation was due to the merge or index related metrics.
> If that is the case then can we just get rid of the config and publish
> these metrics by default?
>
>
> Regards
> suresh
>
>
>
> On Mon, Jan 8, 2018 at 10:25 PM, Shalin Shekhar Mangar <
> shalinman...@gmail.com> wrote:
>
> > The merge metrics were enabled by default in 6.4 but they were turned
> > off in 6.4.2 because of large performance degradations. For more
> > details, see https://issues.apache.org/jira/browse/SOLR-10130
> >
> > On Tue, Jan 9, 2018 at 9:11 AM, S G  wrote:
> > > Yes, this is actually confusing and the documentation (
> > > https://lucene.apache.org/solr/guide/7_2/metrics-reporting.html) does
> > not
> > > help either:
> > >
> > > *Index Merge Metrics* : These metrics are collected in respective
> > > registries for each core (e.g., solr.core.collection1….), under the
> INDEX
> > > category.
> > > Basic metrics are always collected - collection of additional metrics
> can
> > > be turned on using boolean parameters in the
> /config/indexConfig/metrics.
> > >
> > > However, we do not see the merge-metrics being collected if the above
> > > config is absent. So what basic metrics are always collected for merge?
> > > And why do the merge metrics require an additional config while most of
> > the
> > > others are reported directly?
> > >
> > > Thanks
> > > SG
> > >
> > >
> > >
> > > On Mon, Jan 8, 2018 at 2:02 PM, suresh pendap  >
> > > wrote:
> > >
> > >> Hi,
> > >> I am following the instructions from
> > >> https://lucene.apache.org/solr/guide/7_1/metrics-reporting.html
> > >>  in order to expose the Index merge related metrics.
> > >>
> > >> The document says that we have to add the below snippet in order to
> > expose
> > >> the merge metrics
> > >>
> > >> <config>
> > >>   ...
> > >>   <indexConfig>
> > >>     <metrics>
> > >>       <majorMergeDocs>524288</majorMergeDocs>
> > >>       <bool name="mergeDetails">true</bool>
> > >>     </metrics>
> > >>     ...
> > >>   </indexConfig>
> > >>   ...
> > >> </config>
> > >>
> > >>
> > >>
> > >> I would like to know why is this metrics not exposed by default just
> > like
> > >> all the other metrics?
> > >>
> > >> Is there any performance overhead that we should be concerned about
> it?
> > >>
> > >> If there was no particular reason, then can we expose it by default?
> > >>
> > >>
> > >>
> > >> Regards
> > >> Suresh
> > >>
> >
> >
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
> >
>


RE: Very high number of deleted docs, part 2

2018-01-10 Thread Markus Jelsma
Well, maxSegments with optimize or commit with expungeDeletes did not do the 
job in testing. But tell me more about the 2.5G live documents limit; I have no idea 
what it is.

Thanks,
Markus 
 
-Original message-
> From:Erick Erickson 
> Sent: Friday 5th January 2018 17:56
> To: solr-user 
> Subject: Re: Very high number of deleted docs, part 2
> 
> I'm not 100% sure that playing with maxSegments will work.
> 
> what will work is to re-index everything. You can re-index into the
> existing collection, no need to start with a new collection. Eventually
> you'll replace enough docs in the over-sized segments that they'll fall
> under the 2.5G live documents limit and be merged away. Not elegant, but
> it'd work.
> 
> Best,
> Erick
> 
> On Fri, Jan 5, 2018 at 6:46 AM, Markus Jelsma 
> wrote:
> 
> > It could be that when this index was first reconstructed, it was optimized
> > to one segment before packed and shipped.
> >
> > How about optimizing it again, with maxSegments set to ten, it should
> > recover right?
> >
> > -Original message-
> > > From:Shawn Heisey 
> > > Sent: Friday 5th January 2018 14:34
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Very high number of deleted docs, part 2
> > >
> > > On 1/5/2018 5:33 AM, Markus Jelsma wrote:
> > > > Another collection, now on 7.1, also shows this problem and has
> > default TMP settings. This time size is different, each shard of this
> > collection is over 40 GB, and each shard has about 50 % deleted documents.
> > Each shard's largest segment is just under 20 GB with about 75 % deleted
> > documents. After that are a few five/six GB segments with just under 50 %
> > deleted documents.
> > > >
> > > > What do i need to change to make Lucene believe that at least that
> > twenty GB and three month old segment should be merged away. And how what
> > would the predicted indexing performance penalty be?
> > >
> > > Quick answer: Erick's statements in the previous thread can be
> > > summarized as this:  On large indexes that do a lot of deletes or
> > > updates, once you do an optimize, you have to continue to do optimizes
> > > regularly, or you're going to have this problem.
> > >
> > > TL;DR:
> > >
> > > I think Erick covered most of this (possibly all of it) in the previous
> > > thread.
> > >
> > > If you've got a 20GB segment and TMP's settings are default, then that
> > > means at some point in the past, you've done an optimize.  The default
> > > TMP settings have a maximum segment size of 5GB, so if you never
> > > optimize, then there will never be a segment larger than 5GB, and the
> > > deleted document percentage would be less likely to get out of control.
> > > The optimize operation ignores the maximum segment size and reduces the
> > > index to a single large segment with zero deleted docs.
> > >
> > > TMP's behavior with really big segments is apparently completely as the
> > > author intended, but this specific problem wasn't ever addressed.
> > >
> > > If you do an optimize once and then don't ever do it again, any very
> > > large segments are going to be vulnerable to this problem, and the only
> > > way (currently) to fix it is to do another optimize.
> > >
> > > See this issue for a more in-depth discussion and an attempt to figure
> > > out how to avoid it:
> > >
> > > https://issues.apache.org/jira/browse/LUCENE-7976
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
> >
> 


is ConcurrentUpdateSolrClient.Builder thread safe?

2018-01-10 Thread Bernd Fehling
Hi list,

after some strange search results I was trying to locate the problem
and it turned out that it starts with bulk loading with SolrJ
and ConcurrentUpdateSolrClient.Builder with several threads.

I assume that ConcurrentUpdateSolrClient.Builder is _NOT_ thread safe
regarding the order in which docs are sent to the indexer?

It feels like documents with the same doc_id are not always indexed
in the order they are sent to the indexer. Which document wins seems almost random.

Example:
file LR00010.xml
<doc>
  <field name="...">my_uniq_id_1234</field>
  <field name="...">2017-03-28T23:21:40Z</field>
  ...
</doc>

file LR01000.xml
<doc>
  <field name="...">my_uniq_id_1234</field>
  <field name="...">2017-04-26T00:42:10Z</field>
  ...
</doc>

The files are in the same subdir.
They are loaded, processed, and sent to the indexer in ascending natural order.
LR00010.xml is handled way before LR01000.xml.

But the result is that sometimes the older doc of LR00010.xml is in the index
and the newer doc from LR01000.xml is marked as deleted, and sometimes the
newer doc of LR01000.xml is in the index and the older doc from LR00010.xml
is marked as deleted.

Has anyone seen this?

I could try ConcurrentUpdateSolrClient.Builder with only one thread and
see if the problem still exists.
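
A sketch of that one-thread control test in SolrJ (the URL, the "id" key, and the datestamp_dt field name are made up; the values are from the example above): with a single drain thread the two adds below should reach Solr in order, so the newer document should be the one that survives.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class OrderTest {
    public static void main(String[] args) throws Exception {
        // Single drain thread; with more threads the two adds can reach Solr in either order.
        try (SolrClient solr = new ConcurrentUpdateSolrClient.Builder(
                "http://localhost:8983/solr/mycore")
                .withQueueSize(100).withThreadCount(1).build()) {
            SolrInputDocument older = new SolrInputDocument();
            older.addField("id", "my_uniq_id_1234");
            older.addField("datestamp_dt", "2017-03-28T23:21:40Z");
            solr.add(older);

            SolrInputDocument newer = new SolrInputDocument();
            newer.addField("id", "my_uniq_id_1234");
            newer.addField("datestamp_dt", "2017-04-26T00:42:10Z");
            solr.add(newer);

            solr.commit();
            // With one drain thread the surviving doc should be the newer one.
            System.out.println(solr.query(new SolrQuery("id:my_uniq_id_1234")).getResults());
        }
    }
}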

Regards
Bernd




Re: Regarding document routing

2018-01-10 Thread Shawn Heisey

On 1/10/2018 12:18 AM, manish tanger wrote:

I have a question about implicit routing and didn't find much info about
this on the internet, so please help me out on this.

*About environment:*
M/c 1: Zookeeper 1 and Solr 1
M/c 2: Zookeeper 2 and Solr 2


For redundancy with ZK, you need three hosts minimum.  A two-host ZK 
ensemble is actually *less* reliable than using one server.  You aren't 
protected against failure until you have at least three.  You would only 
need a minimum of two Solr hosts, though.



I am using clustered zookeeper and using "CloudSolrClient" from solrJ
API in java.

*this.solrCloudClient = new
CloudSolrClient.Builder().withZkHost(zkHostList).build();*

*Requirement:*

My requirement is to store lots of data in Solr using a single collection,
so my idea is to create a new shard for every hour so that
indexing doesn't take much time.

I chose implicit document routing, but I am unable to direct the
docs to a particular shard. ZooKeeper is still distributing them across all
nodes and shards.


ZooKeeper isn't responsible for distributing documents between shards.  
It is Solr that does this, using information in the ZK database.  With 
the implicit router, the only routing information in ZK is the shard 
names.  Solr cannot make decisions about which shard gets the documents, 
that information must come from the system doing the indexing.
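
A hedged SolrJ sketch of what "the routing information must come from the indexing system" looks like with the implicit router. The collection name, shard names, field values, and ZooKeeper hosts are placeholders, and the collection-creation parameters are shown only as a Collections API comment.

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ImplicitRoutingSketch {
    public static void main(String[] args) throws Exception {
        // Assumes the collection was created with the implicit router and a routing
        // field, e.g. via the Collections API:
        //   /admin/collections?action=CREATE&name=events&router.name=implicit
        //     &shards=2018011000,2018011001&router.field=dateandhour
        try (CloudSolrClient solr = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
            solr.setDefaultCollection("events");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            // With the implicit router the routing value must literally be the name of
            // an existing shard; Solr will not compute or hash it for you.
            doc.addField("dateandhour", "2018011001");
            solr.add(doc);
            solr.commit();
        }
    }
}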



*What I have tried:*
1. I have created a collection with implicit routing, set the custom
routing field "*dateandhour*", and added it as a field in my collection.

 While adding the Solr input doc I am setting this field to the shard name.


What were the precise commands or API calls that you used to create the 
collection?  What is the definition of the dateandhour field?



2. I have also tried to add the shard name to the id field like:
  id="*shardName!*uniquedocumentId"


If you want to use a prefix in the uniqueId field, you must be using the 
compositeId router, not the implicit router.  The compositeId router 
will not fit your use case, though -- you cannot add shards to a 
collection if it uses compositeId.  Also, the prefix does not specify 
the shard by name, the value of the prefix is hashed to determine which 
shard(s) are used.


Here's the documentation on document routing:

https://lucene.apache.org/solr/guide/7_2/shards-and-indexing-data-in-solrcloud.html#ShardsandIndexingDatainSolrCloud-DocumentRouting

Thanks,
Shawn