Re: Searching for a term which isn't a part of an expression

2016-12-14 Thread Dean Gurvitz
Hi,
The list of phrases will be relatively dynamic, so changing the indexing
process isn't a very good solution for us.

We also considered using a PostFilter or adding a SearchComponent to filter
out the "bad" results, but obviously a true query-time support would be a
lot better.


On Wed, Dec 14, 2016 at 10:52 PM, Ahmet Arslan 
wrote:

> Hi,
>
> Do you have a common list of phrases that you want to prohibit partial
> match?
> You can index those phrases in a special way, for example,
>
> This is a new world hello_world hot_dog tap_water etc.
>
> ahmet
>
>
> On Wednesday, December 14, 2016 9:20 PM, deansg  wrote:
> We would like to enable queries for a specific term that doesn't appear as
> a
> part of a given expression. Negating the expression will not help, as we
> still want to return items that contain the term independently, even if
> they
> contain the full expression as well.
> For example, we would like to search for items that have the term "world"
> but not as a part of "hello world". If the text is: "This is a new world.
> Hello world", we would still want to return the item, as "world" appears
> independently as well as a part of "Hello world". However, we will not want
> to return items that only have the expression "hello world" in them.
> Does Solr support these types of queries? We thought about using regex, but
> since the text is tokenized I don't think that will be possible.
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Searching-for-a-term-which-isn-t-a-part-of-an-
> expression-tp4309746.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr Cloud Replica Cores Give different Results for the Same query

2016-12-14 Thread Erick Erickson
Let's back up a bit. You say "This seems to cause two replicas to
return different hits depending upon which one is queried."

OK, _how_ are they different? I've been assuming different numbers of
hits. If you're getting the same number of hits but different document
ordering, that's a completely different issue and may be easily
explainable. If this is true, skip the rest of this message. I only
realized we may be using a different definition of "different hits"
part way through writing this reply.



Having the timestamp as a string isn't a problem, you can do something
very similar with wildcards and the like if it's a string that sorts
the same way the timestamp would. And it's best if it's created
upstream anyway; that way it's guaranteed to be the same for the doc on
all replicas.

If the date is in canonical form (-MM-DDTHH:MM:SSZ) then a simple
copyfield to a date field would do the trick.

But there's no real reason to do any of that. Given that you see this
when there's no indexing going on, there's no point to those
tests; those were just a way to examine your nodes while there was
active indexing.

How do you fix this problem when you see it? If it goes away by itself
that would give us at least a start on where to look. If you have to
manually intervene it would be good to know what you do.

The CDCR pattern is: docs flow from the leader on the source cluster to
the leader on the target cluster. Once the target leader gets the
docs, it's supposed to send them to all the replicas.

To try to narrow down the issue, next time it occurs can you look at
_both_ the source and target clusters and see if they _both_ show the
same discrepancy? What I'm looking for is whether both are
self-consistent. That is, all the replicas for shardN on the source
cluster show the same documents (M). All the replicas for shardN on
the target cluster show the same number of docs (N). I'm not as
concerned if M != N at this point. Note I'm looking at the number of
hits here, not say the document ordering.

To do this you'll have to do the trick I mentioned where you query
each replica separately.
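That is, hit each core directly and keep the query from being distributed --
something like (host and core names will differ on your setup):

http://host:port/solr/collection_shard1_replica1/select?q=*:*&distrib=false&rows=0
http://host:port/solr/collection_shard1_replica2/select?q=*:*&distrib=false&rows=0

and compare numFound between the two.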

And are you absolutely sure that your different results are coming
from the _same_ cluster? If you're comparing a query from the source
cluster with a query from the target cluster, that's different than if
the queries come from the same cluster.

Best,
Erick

On Wed, Dec 14, 2016 at 2:48 PM, Webster Homer  wrote:
> Thanks for the quick feedback.
>
> We are not doing continuous indexing, we do a complete load once a week and
> then have a daily partial load for any documents that have changed since
> the load. These partial loads take only a few minutes every morning.
>
> The problem is we see this discrepancy long after the data load completes.
>
> We have a source collection that uses cdcr to replicate to the target. I
> see the current=false setting in both the source and target collections.
> Only the target collection is being heavily searched so that is where my
> concern is. So what could cause this kind of issue?
> Do we have a configuration problem?
>
> It doesn't happen all the time, so I don't currently have a reproducible
> test case, yet.
>
> I will see about adding the timestamp, we have one, but it was created as a
> string, and was generated by our ETL job
>
> On Wed, Dec 14, 2016 at 3:42 PM, Erick Erickson 
> wrote:
>
>> The commit points on different replicas will trip at different wall
>> clock times so the leader and replica may return slightly different
>> results depending on whether doc X was included in the commit on one
>> replica but not on the second. After the _next_ commit interval (2
>> seconds in your case), doc X will be committed on the second replica:
>> that is it's not lost.
>>
>> Here's a couple of ways to verify:
>>
>> 1> turn off indexing and wait a few seconds. The replicas should have
>> the exact same documents. "A few seconds" is your autocommit (soft in
>> your case) interval + autowarm time. This last is unknown, but you can
>> check your admin/plugins-stats search handler times, it's reported
>> there. Now issue your queries. If the replicas don't report the same
>> docs, that's A Bad Thing and should be worrying. BTW, with a 2 second soft
>> commit interval, which is really aggressive, you _better not_ have
>> very large autowarm intervals!
>>
>> 2> Include a timestamp in your docs when they are indexed. There's an
>> automatic way to do that, BTW. Now do your queries and append an FQ
>> clause like &fq=timestamp:[* TO some_point_in_the_past]. The replicas
>> should have the same counts unless you are deleting documents. I
>> mention deletes on the off chance that you're deleting documents that
>> fall in the interval and then the same as above could theoretically
>> occur. Updates should be fine.
>>
>> BTW, I've seen continuous monitoring of this done by automated
>> scripts. The key is to get the shard URL and ping that with
>> &distrib=false. It'll look something like
>> http://host:port/solr/collection_shard1_replica1 People usually
>> just use *:* and compare numFound.

Re: Nested JSON Facets (Subfacets)

2016-12-14 Thread Yonik Seeley
That should work... what version of Solr are you using?  Did you
change the type of the popularity field w/o completely reindexing?

You can try to verify the number of documents in each bucket that have
the popularity field by adding another sub-facet next to cat_pop:
num_pop:{query:"popularity:[* TO *]"}
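i.e. the whole parameter would look something like:

json.facet={shop_cat:{type:terms, field:shop_cat,
  facet:{cat_pop:"sum(popularity)", num_pop:{query:"popularity:[* TO *]"}}}}

If num_pop comes back much smaller than a bucket's count, that's the culprit.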

> A quick check with this json.facet parameter:
>
> json.facet: {cat_pop:"sum(popularity)"}
>
> returns:
>
> "facets": {
> "count":2508,
> "cat_pop":21.0},

That looks like a pretty low sum for all those documents... perhaps
most of them are missing "popularity" (or have a 0 popularity).
To test one of the buckets at the top-level this way, you could add
fq=shop_cat:"Men > Clothing > Jumpers & Cardigans"
and see if you get anything.

-Yonik


On Wed, Dec 14, 2016 at 12:46 PM, CA  wrote:
> Hi all,
>
> this is about using a function in nested facets, specifically the „sum()“ 
> function inside a „terms“ facet using the json.facet api.
>
> My json.facet parameter looks like this:
>
> json.facet={shop_cat: {type:terms, field:shop_cat, facet: 
> {cat_pop:"sum(popularity)"}}}
>
> A snippet of the result:
>
> "facets“: {
> "count":2508,
> "shop_cat“: {
> "buckets“: [{
> "val“: "Men > Clothing > Jumpers & Cardigans",
> "count":252,
> "cat_pop“:0.0
>  }, {
>"val":"Men > Clothing > Jackets & Coats",
>"count":157,
>"cat_pop“:0.0
>  }, // and more
>
> This looks fine all over but it turns out that „cat_pop“, the result of 
> „sum(popularity)“ is always 0.0 even if the documents for this facet value 
> have popularities > 0.
>
> A quick check with this json.facet parameter:
>
> json.facet: {cat_pop:"sum(popularity)"}
>
> returns:
>
> "facets": {
> "count":2508,
> "cat_pop":21.0},
>
> To me, it seems it works fine on the base level but not when nested. Still, 
> Yonik’s documentation and the Jira issues indicate that it is possible to use 
> functions in nested facets so I might just be using the wrong structure? I 
> have a hard time finding any other examples on the i-net and I had no luck 
> changing the structure around.
> Could someone shed some light on this for me? It would also help to know if 
> it is not possible to sum the values up this way.
>
> Thanks a lot!
> Chantal
>
>


Re: Solr Cloud Replica Cores Give different Results for the Same query

2016-12-14 Thread Webster Homer
Thanks for the quick feedback.

We are not doing continuous indexing, we do a complete load once a week and
then have a daily partial load for any documents that have changed since
the load. These partial loads take only a few minutes every morning.

The problem is we see this discrepancy long after the data load completes.

We have a source collection that uses cdcr to replicate to the target. I
see the current=false setting in both the source and target collections.
Only the target collection is being heavily searched so that is where my
concern is. So what could cause this kind of issue?
Do we have a configuration problem?

It doesn't happen all the time, so I don't currently have a reproducible
test case, yet.

I will see about adding the timestamp, we have one, but it was created as a
string, and was generated by our ETL job

On Wed, Dec 14, 2016 at 3:42 PM, Erick Erickson 
wrote:

> The commit points on different replicas will trip at different wall
> clock times so the leader and replica may return slightly different
> results depending on whether doc X was included in the commit on one
> replica but not on the second. After the _next_ commit interval (2
> seconds in your case), doc X will be committed on the second replica:
> that is it's not lost.
>
> Here's a couple of ways to verify:
>
> 1> turn off indexing and wait a few seconds. The replicas should have
> the exact same documents. "A few seconds" is your autocommit (soft in
> your case) interval + autowarm time. This last is unknown, but you can
> check your admin/plugins-stats search handler times, it's reported
> there. Now issue your queries. If the replicas don't report the same
> docs, that's A Bad Thing and should be worrying. BTW, with a 2 second soft
> commit interval, which is really aggressive, you _better not_ have
> very large autowarm intervals!
>
> 2> Include a timestamp in your docs when they are indexed. There's an
> automatic way to do that, BTW. Now do your queries and append an FQ
> clause like &fq=timestamp:[* TO some_point_in_the_past]. The replicas
> should have the same counts unless you are deleting documents. I
> mention deletes on the off chance that you're deleting documents that
> fall in the interval and then the same as above could theoretically
> occur. Updates should be fine.
>
> BTW, I've seen continuous monitoring of this done by automated
> scripts. The key is to get the shard URL and ping that with
> &distrib=false. It'll look something like
> http://host:port/solr/collection_shard1_replica1 People usually
> just use *:* and compare numFound.
>
> Best,
> Erick
>
>
>
> On Wed, Dec 14, 2016 at 1:10 PM, Webster Homer 
> wrote:
> > We are using Solr Cloud 6.2
> >
> > We have been noticing an issue where the index in a core shows as
> current =
> > false
> >
> > We have autocommit set for 15 seconds, and soft commit at 2 seconds
> >
> > This seems to cause two replicas to return different hits depending upon
> > which one is queried.
> >
> > What would lead to the indexes not being "current"? The documentation on
> > the meaning of current is vague.
> >
> > The collections in our cloud have two shards each with two replicas. I
> see
> > this with several of the collections.
> >
> > We don't know how they get like this but it's troubling
> >
> > --
> >
> >
> > This message and any attachment are confidential and may be privileged or
> > otherwise protected from disclosure. If you are not the intended
> recipient,
> > you must not copy this message or attachment or disclose the contents to
> > any other person. If you have received this transmission in error, please
> > notify the sender immediately and delete the message and any attachment
> > from your system. Merck KGaA, Darmstadt, Germany and any of its
> > subsidiaries do not accept liability for any omissions or errors in this
> > message which may arise as a result of E-Mail-transmission or for damages
> > resulting from any unauthorized changes of the content of this message
> and
> > any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
> > subsidiaries do not guarantee that this message is free of viruses and
> does
> > not accept liability for any damages caused by any virus transmitted
> > therewith.
> >
> > Click http://www.merckgroup.com/disclaimer to access the German, French,
> > Spanish and Portuguese versions of this disclaimer.
>

-- 


This message and any attachment are confidential and may be privileged or 
otherwise protected from disclosure. If you are not the intended recipient, 
you must not copy this message or attachment or disclose the contents to 
any other person. If you have received this transmission in error, please 
notify the sender immediately and delete the message and any attachment 
from your system. Merck KGaA, Darmstadt, Germany and any of its 
subsidiaries do not accept liability for any omissions or errors in this 
message which may arise as a result of E-Mail-transmission or for damages 
resulting from any unauthorized changes of the content of this message and 
any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its 
subsidiaries do not guarantee that this message is free of viruses and does 
not accept liability for any damages caused by any virus transmitted 
therewith.

Click http://www.merckgroup.com/disclaimer to access the German, French, 
Spanish and Portuguese versions of this disclaimer.

RE: DocTransformer not always working

2016-12-14 Thread Markus Jelsma
Hello - i just looked up the DocTransformer Javadoc and spotted the 
getExtraRequestFields method. 

What you mention makes sense, so i immediately tried:
solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id 
asc&q=*:*&fl=minhash,minhash:[binstr]
{
  "response":{"numFound":97895,"start":0,"docs":[
  {

"minhash":"11101101001010001101001010111101100100110010"}]
  }}

So as i get it, instead of using getRequestedFields, i just did an 
explicit get for that field. Don't mind the changed numFound, it's a live 
index.

Well, i can work with this just fine knowing this, but does it make sense? I 
did assume (perhaps wrongly) that fl=minhash:[binstr] should mean: get 
that field and pass it through the transformer. At least i just now fell for 
it, maybe others shouldn't :)

Anyway, thanks again today,
Markus

-Original message-
> From:Chris Hostetter 
> Sent: Wednesday 14th December 2016 23:14
> To: solr-user 
> Subject: Re: DocTransformer not always working
> 
> 
> Fairly certain you aren't overriding getExtraRequestFields, so when your 
> DocTransformer is evaluated it can't find the field you want it to 
> transform.
> 
> By default, the ResponseWriters don't provide any fields that aren't 
> explicitly requested by the user, or specified as "extra" by the 
> DocTransformer.
> 
> IIUC you want the stored value of the "minhash" field to be available to 
> you, but the response writer code doesn't know that -- it just knows you 
> want "minhash" to be the output respons key for the "[binstr]" 
> transformer.
> 
> 
> Take a look at RawValueTransformerFactory as an example to borrow from.
> 
> 
> 
> 
> : Date: Wed, 14 Dec 2016 21:55:26 +
> : From: Markus Jelsma 
> : Reply-To: solr-user@lucene.apache.org
> : To: solr-user 
> : Subject: DocTransformer not always working
> : 
> : Hello - I just spotted an oddity with all two custom DocTransformers we 
> sometimes use on Solr 6.3.0. This particular transformer in the example just 
> transforms a long (or int) into a sequence of bits. I just use it as an 
> convenience to compare minhashes with my eyeballs. First example is very 
> straightforward, fl=minhash:[binstr], show only the minhash field, but as a 
> bit sequence.
> : 
> : 
> solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id%20asc&q=*:*&fl=minhash:[binstr]
> : {
> :   "response":{"numFound":96933,"start":0,"docs":[
> :   {}]
> :   }}
> : 
> : The document is empty! This also happens with another transformer. The next 
> example i also request the lang field:
> : 
> : solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id 
> asc&q=*:*&fl=lang,minhash:[binstr]
> : {
> :   "response":{"numFound":96933,"start":0,"docs":[
> :   {
> : "lang":"nl"}]
> :   }}
> : 
> : Ok, at least i now get the lang field, but the transformed minhash is 
> nowhere to be seen. In the next example i request all fields and the 
> transformed minhash:
> : 
> : 
> /solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id%20asc&q=*:*&fl=*,minhash:[binstr]
> : {
> :   "response":{"numFound":96933,"start":0,"docs":[
> :   {
> : 
> "minhash":"11101101001010001101001010111101100100110010",
> : ...other fields here
> : "_version_":1553728923368423424}]
> :   }}
> : 
> : So it seems that right now, i can only use a transformer properly if i 
> request all fields. I believe it used to work with all three examples just as 
> you would expect. But since i haven't used transformers for a while, i don't 
> know at which version it stopped working like that (if it ever did of course 
> :)
> : 
> : Did i mess something up or did a bug creep on me?
> : 
> : Thanks,
> : Markus
> : 
> 
> -Hoss
> http://www.lucidworks.com/
> 


Re: DocTransformer not always working

2016-12-14 Thread Chris Hostetter

Fairly certain you aren't overriding getExtraRequestFields, so when your 
DocTransformer is evaluated it can't find the field you want it to 
transform.

By default, the ResponseWriters don't provide any fields that aren't 
explicitly requested by the user, or specified as "extra" by the 
DocTransformer.

IIUC you want the stored value of the "minhash" field to be available to 
you, but the response writer code doesn't know that -- it just knows you 
want "minhash" to be the output respons key for the "[binstr]" 
transformer.


Take a look at RawValueTransformerFactory as an example to borrow from.
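The crux is just declaring the source field as "extra"; a rough, untested
sketch along those lines (class and param names made up):

import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.transform.DocTransformer;
import org.apache.solr.response.transform.TransformerFactory;

public class BinStrTransformerFactory extends TransformerFactory {
  @Override
  public DocTransformer create(String display, SolrParams params, SolrQueryRequest req) {
    // "display" is the output key, e.g. "minhash" in fl=minhash:[binstr]
    final String source = params.get("f", display);
    return new DocTransformer() {
      @Override
      public String getName() { return display; }

      @Override
      public String[] getExtraRequestFields() {
        // ask the response writer to fetch the stored value even when the
        // user's fl doesn't list the field on its own
        return new String[] { source };
      }

      @Override
      public void transform(SolrDocument doc, int docid) {
        Object val = doc.getFieldValue(source);
        if (val instanceof Number) {
          doc.setField(display, Long.toBinaryString(((Number) val).longValue()));
        }
      }
    };
  }
}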




: Date: Wed, 14 Dec 2016 21:55:26 +
: From: Markus Jelsma 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user 
: Subject: DocTransformer not always working
: 
: Hello - I just spotted an oddity with all two custom DocTransformers we 
sometimes use on Solr 6.3.0. This particular transformer in the example just 
transforms a long (or int) into a sequence of bits. I just use it as an 
convenience to compare minhashes with my eyeballs. First example is very 
straightforward, fl=minhash:[binstr], show only the minhash field, but as a bit 
sequence.
: 
: 
solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id%20asc&q=*:*&fl=minhash:[binstr]
: {
:   "response":{"numFound":96933,"start":0,"docs":[
:   {}]
:   }}
: 
: The document is empty! This also happens with another transformer. The next 
example i also request the lang field:
: 
: solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id 
asc&q=*:*&fl=lang,minhash:[binstr]
: {
:   "response":{"numFound":96933,"start":0,"docs":[
:   {
: "lang":"nl"}]
:   }}
: 
: Ok, at least i now get the lang field, but the transformed minhash is nowhere 
to be seen. In the next example i request all fields and the transformed 
minhash:
: 
: 
/solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id%20asc&q=*:*&fl=*,minhash:[binstr]
: {
:   "response":{"numFound":96933,"start":0,"docs":[
:   {
: 
"minhash":"11101101001010001101001010111101100100110010",
: ...other fields here
: "_version_":1553728923368423424}]
:   }}
: 
: So it seems that right now, i can only use a transformer properly if i 
request all fields. I believe it used to work with all three examples just as 
you would expect. But since i haven't used transformers for a while, i don't 
know at which version it stopped working like that (if it ever did of course :)
: 
: Did i mess something up or did a bug creep on me?
: 
: Thanks,
: Markus
: 

-Hoss
http://www.lucidworks.com/


Re: High increasing slab memory solr 6

2016-12-14 Thread Shawn Heisey
On 12/14/2016 7:12 AM, moscovig wrote:
> Shawn, thanks for the reply
>
> Please take a look at that post. It's describing the same issue with ES
>
> They describe the issue as "dentry cache is bloating memory"
>
> https://discuss.elastic.co/t/memory-usage-of-the-machine-with-es-is-continuously-increasing/23537/5

They concluded that it was not a problem in ES or Lucene.  It's an OS
issue, and is mostly only an annoyance, because the memory is
reclaimable.  If the amount of memory involved is very large, apparently
that can cause long stop-the-world pauses as the memory is automatically
cleaned up by the OS.

There is absolutely nothing that Solr or Lucene (or even ES) can do
about this issue.  It is perfectly normal for programs to check for the
existence of files that do not actually exist at that moment of the
check.  The issues that can be reached from that post say that
attempting to stat nonexistent files is the trigger for the problem in
the OS.

Updating your OS to the newest update packages (and probably rebooting)
might fix it.

Thanks,
Shawn



DocTransformer not always working

2016-12-14 Thread Markus Jelsma
Hello - I just spotted an oddity with both of the two custom DocTransformers we 
sometimes use on Solr 6.3.0. This particular transformer in the example just 
transforms a long (or int) into a sequence of bits. I just use it as a 
convenience to compare minhashes with my eyeballs. The first example is very 
straightforward, fl=minhash:[binstr], show only the minhash field, but as a bit 
sequence.

solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id%20asc&q=*:*&fl=minhash:[binstr]
{
  "response":{"numFound":96933,"start":0,"docs":[
  {}]
  }}

The document is empty! This also happens with another transformer. In the next 
example i also request the lang field:

solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id 
asc&q=*:*&fl=lang,minhash:[binstr]
{
  "response":{"numFound":96933,"start":0,"docs":[
  {
"lang":"nl"}]
  }}

Ok, at least i now get the lang field, but the transformed minhash is nowhere 
to be seen. In the next example i request all fields and the transformed 
minhash:

/solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id%20asc&q=*:*&fl=*,minhash:[binstr]
{
  "response":{"numFound":96933,"start":0,"docs":[
  {

"minhash":"11101101001010001101001010111101100100110010",
...other fields here
"_version_":1553728923368423424}]
  }}

So it seems that right now, i can only use a transformer properly if i request 
all fields. I believe it used to work with all three examples just as you would 
expect. But since i haven't used transformers for a while, i don't know at 
which version it stopped working like that (if it ever did of course :)

Did i mess something up or did a bug creep on me?

Thanks,
Markus


Re: Solr Cloud Replica Cores Give different Results for the Same query

2016-12-14 Thread Erick Erickson
The commit points on different replicas will trip at different wall
clock times so the leader and replica may return slightly different
results depending on whether doc X was included in the commit on one
replica but not on the second. After the _next_ commit interval (2
seconds in your case), doc X will be committed on the second replica:
that is it's not lost.

Here's a couple of ways to verify:

1> turn off indexing and wait a few seconds. The replicas should have
the exact same documents. "A few seconds" is your autocommit (soft in
your case) interval + autowarm time. This last is unknown, but you can
check your admin/plugins-stats search handler times, it's reported
there. Now issue your queries. If the replicas don't report the same
docs, that's A Bad Thing and should be worrying. BTW, with a 2 second soft
commit interval, which is really aggressive, you _better not_ have
very large autowarm intervals!

2> Include a timestamp in your docs when they are indexed. There's an
automatic way to do that, BTW. Now do your queries and append an FQ
clause like &fq=timestamp:[* TO some_point_in_the_past]. The replicas
should have the same counts unless you are deleting documents. I
mention deletes on the off chance that you're deleting documents that
fall in the interval and then the same as above could theoretically
occur. Updates should be fine.

BTW, I've seen continuous monitoring of this done by automated
scripts. The key is to get the shard URL and ping that with
&distrib=false. It'll look something like
http://host:port/solr/collection_shard1_replica1 People usually
just use *:* and compare numFound.
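A bare-bones SolrJ version of that check could look something like this
(untested; hosts and core names are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ReplicaCheck {
  public static void main(String[] args) throws Exception {
    // the replica cores of one shard, queried directly
    String[] cores = {
        "http://host1:8983/solr/collection_shard1_replica1",
        "http://host2:8983/solr/collection_shard1_replica2"
    };
    for (String url : cores) {
      try (HttpSolrClient client = new HttpSolrClient.Builder().withBaseSolrUrl(url).build()) {
        SolrQuery q = new SolrQuery("*:*");
        q.set("distrib", false);   // keep the query on this core only
        q.setRows(0);              // we only care about numFound
        System.out.println(url + " numFound="
            + client.query(q).getResults().getNumFound());
      }
    }
  }
}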

Best,
Erick



On Wed, Dec 14, 2016 at 1:10 PM, Webster Homer  wrote:
> We are using Solr Cloud 6.2
>
> We have been noticing an issue where the index in a core shows as current =
> false
>
> We have autocommit set for 15 seconds, and soft commit at 2 seconds
>
> This seems to cause two replicas to return different hits depending upon
> which one is queried.
>
> What would lead to the indexes not being "current"? The documentation on
> the meaning of current is vague.
>
> The collections in our cloud have two shards each with two replicas. I see
> this with several of the collections.
>
> We don't know how they get like this but it's troubling
>
> --
>
>
> This message and any attachment are confidential and may be privileged or
> otherwise protected from disclosure. If you are not the intended recipient,
> you must not copy this message or attachment or disclose the contents to
> any other person. If you have received this transmission in error, please
> notify the sender immediately and delete the message and any attachment
> from your system. Merck KGaA, Darmstadt, Germany and any of its
> subsidiaries do not accept liability for any omissions or errors in this
> message which may arise as a result of E-Mail-transmission or for damages
> resulting from any unauthorized changes of the content of this message and
> any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
> subsidiaries do not guarantee that this message is free of viruses and does
> not accept liability for any damages caused by any virus transmitted
> therewith.
>
> Click http://www.merckgroup.com/disclaimer to access the German, French,
> Spanish and Portuguese versions of this disclaimer.


Solr Cloud Replica Cores Give different Results for the Same query

2016-12-14 Thread Webster Homer
We are using Solr Cloud 6.2

We have been noticing an issue where the index in a core shows as current =
false

We have autocommit set for 15 seconds, and soft commit at 2 seconds

This seems to cause two replicas to return different hits depending upon
which one is queried.

What would lead to the indexes not being "current"? The documentation on
the meaning of current is vague.

The collections in our cloud have two shards each with two replicas. I see
this with several of the collections.

We don't know how they get like this but it's troubling

-- 


This message and any attachment are confidential and may be privileged or 
otherwise protected from disclosure. If you are not the intended recipient, 
you must not copy this message or attachment or disclose the contents to 
any other person. If you have received this transmission in error, please 
notify the sender immediately and delete the message and any attachment 
from your system. Merck KGaA, Darmstadt, Germany and any of its 
subsidiaries do not accept liability for any omissions or errors in this 
message which may arise as a result of E-Mail-transmission or for damages 
resulting from any unauthorized changes of the content of this message and 
any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its 
subsidiaries do not guarantee that this message is free of viruses and does 
not accept liability for any damages caused by any virus transmitted 
therewith.

Click http://www.merckgroup.com/disclaimer to access the German, French, 
Spanish and Portuguese versions of this disclaimer.


RE: Traverse over response docs in SearchComponent impl.

2016-12-14 Thread Markus Jelsma
Thanks!

Running the same code in cloud mode worked nicely almost right away. Getting it 
to work in non-cloud mode is still non-trivial. I can get the DocList in 
process(), but AFAIK it just provides Lucene docIds, not a nice DocumentList we 
could work with.

The use-case is straightforward: the resultset contains ids. I collect them 
and do a bulk getById to another Solr index. Fields retrieved from the remote 
index (specified via fl) are then added to the resultset, enriching each document in 
the server, without intervening middleware.
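For the record, the cloud-mode part boils down to roughly this (stripped down;
remoteClient and remoteFields are fields of the component, and the real code
has more checks):

// act once the aggregated stored fields are present, then enrich each doc
@Override
public void finishStage(ResponseBuilder rb) {
  if (rb.stage != ResponseBuilder.STAGE_GET_FIELDS || rb.getResponseDocs() == null) {
    return;
  }
  List<String> ids = new ArrayList<>();
  for (SolrDocument doc : rb.getResponseDocs()) {
    ids.add((String) doc.getFieldValue("id"));
  }
  try {
    SolrDocumentList remote = remoteClient.getById(ids);  // bulk real-time get
    Map<Object, SolrDocument> byId = new HashMap<>();
    for (SolrDocument d : remote) {
      byId.put(d.getFieldValue("id"), d);
    }
    for (SolrDocument doc : rb.getResponseDocs()) {
      SolrDocument extra = byId.get(doc.getFieldValue("id"));
      if (extra == null) continue;
      for (String f : remoteFields) {
        doc.setField(f, extra.getFieldValue(f));
      }
    }
  } catch (Exception e) {
    throw new RuntimeException("remote enrichment failed", e);
  }
}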

All our servers run in cloud mode, so getting it to work in local mode is just a 
convenience when developing. We have quite a few components that run in cloud 
and non-cloud mode. Non-cloud mode is for some reason almost always harder to 
implement, sometimes even at Lucene level with IndexSearcher, hand crafted 
queries and all.

Thanks again, it runs like a charm.
Markus

 
-Original message-
> From:Chris Hostetter 
> Sent: Tuesday 13th December 2016 23:27
> To: solr-user 
> Subject: Re: Traverse over response docs in SearchComponent impl.
> 
> 
> FWIW: Perhaps an XY problem? Can you explain more in depth what it is you 
> plan on doing in this search component?
> 
> : I can see that Solr calls the component's process() method, but from 
> : within that method, rb.getResponseDocs(); is always null. No matter what 
> : i try, i do not seem to be able to get a hold of that list of response 
> : docs.
> 
> IIRC getResponseDocs() is only non-null when aggregating distributed/cloud 
> results from multiple shards (where we already have a fully 
> populated SolrDocumentList due to aggregating the remote responses), but in 
> a single-node Solr request only a "DocList" is used, and the stored field 
> values are read lazily from the IndexReader by the ResponseWriter.
> 
> So if you're not writing a distributed component, check 
> ResponseBuilder.getResults() ?
> 
> Even if you are writing a component for a distributed solr setup, what 
> method you call (and where you call it) depends a lot on when/where you 
> expect your code to run...
> 
> IIRC: 
> * prepare() runs on every node for every request (original aggregation 
> request and every sub-request to each shard).  
> * distributedProcess runs on the aggregation node, and is called 
> repeatedly for each "stage" requested by any components (so at a minimum 
> once, 
> usually twice to fetch stored fields, maybe more if there are multiple 
> facet refinement phases, etc...).  
> * modifyRequest() & handleResponses() are called on the aggregation node 
> prior/after every sub-request to every shard.
> * process() is called on each shard for each sub request. 
> * finishStage is called on the aggregation node at the end of each stage 
> (after all the responses from all shards for that sub-request)
> 
> 
> ...so something like HighlightComponent does its main work in the 
> process() method, because it only needs the data for each doc; the other 
> (aggregated) docs don't affect its results -- then later 
> finishStage combines the results.
> 
> If you on the otherhand want to look at all of the *final* documents being 
> returned to the user, not on a per-shard basis but on an aggregate basis, 
> you'd want to put that logic in something like finishStage and check for 
> the stage that does a GET_FIELDS -- but if you want your component to 
> *also* work in non-cloud mode, you'd need the same logic in your process() 
> method (looking at the DocList instead of the SolrDocumentList, with a 
> conditional to check for distrib=false so you don't waste a bunch of work 
> on per-shard queries when it is in fact being used in cloud-mode)
> 
> 
> None of this is very straightforward, but you are admittedly getting into 
> very advanced expert territory here.
> 
> 
> 
> -Hoss
> http://www.lucidworks.com/
> 


Re: Searching for a term which isn't a part of an expression

2016-12-14 Thread Ahmet Arslan
Hi,

Do you have a common list of phrases that you want to prohibit partial match?
You can index those phrases in a special way, for example,

This is a new world hello_world hot_dog tap_water etc.
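One way to get those joined tokens (just a sketch, adapt it to your analysis
chain) is an index-time-only synonym rule per phrase, e.g. in synonyms.txt:

hello world => hello_world
hot dog => hot_dog

With the phrase collapsed into a single token at index time, q=world only
matches documents where "world" occurs outside the phrase, which is what you
want. Multi-word synonyms have their quirks, so test this carefully.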

ahmet


On Wednesday, December 14, 2016 9:20 PM, deansg  wrote:
We would like to enable queries for a specific term that doesn't appear as a
part of a given expression. Negating the expression will not help, as we
still want to return items that contain the term independently, even if they
contain the full expression as well.
For example, we would like to search for items that have the term "world"
but not as a part of "hello world". If the text is: "This is a new world.
Hello world", we would still want to return the item, as "world" appears
independently as well as a part of "Hello world". However, we will not want
to return items that only have the expression "hello world" in them.
Does Solr support these types of queries? We thought about using regex, but
since the text is tokenized I don't think that will be possible.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searching-for-a-term-which-isn-t-a-part-of-an-expression-tp4309746.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr on HDFS: increase in query time with increase in data

2016-12-14 Thread Chetas Joshi
Hi everyone,

I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have
the following config.
maxShardsperNode: 1
replicationFactor: 1

I have been ingesting data into Solr for the last 3 months. With increase
in data, I am observing increase in the query time. Currently the size of
my indices is 70 GB per shard (i.e. per node).

I am using cursor approach (/export handler) using SolrJ client to get back
results from Solr. All the fields I am querying on and all the fields that
I get back from Solr are indexed and have docValues enabled as well. What
could be the reason behind increase in query time?

Has this got something to do with the OS disk cache that is used for
loading the Solr indices? When a query is fired, will Solr wait for all
(70GB) of disk cache to be available so that it can load the index files?

Thanks!


Searching for a term which isn't a part of an expression

2016-12-14 Thread deansg
We would like to enable queries for a specific term that doesn't appear as a
part of a given expression. Negating the expression will not help, as we
still want to return items that contain the term independently, even if they
contain the full expression as well.
For example, we would like to search for items that have the term "world"
but not as a part of "hello world". If the text is: "This is a new world.
Hello world", we would still want to return the item, as "world" appears
independently as well as a part of "Hello world". However, we will not want
to return items that only have the expression "hello world" in them.
Does Solr support these types of queries? We thought about using regex, but
since the text is tokenized I don't think that will be possible.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searching-for-a-term-which-isn-t-a-part-of-an-expression-tp4309746.html
Sent from the Solr - User mailing list archive at Nabble.com.


Nested JSON Facets (Subfacets)

2016-12-14 Thread CA
Hi all,

this is about using a function in nested facets, specifically the „sum()“ 
function inside a „terms“ facet using the json.facet api.

My json.facet parameter looks like this:

json.facet={shop_cat: {type:terms, field:shop_cat, facet: 
{cat_pop:"sum(popularity)"}}}

A snippet of the result:

"facets“: {
"count":2508,
"shop_cat“: {
"buckets“: [{
"val“: "Men > Clothing > Jumpers & Cardigans",
"count":252,
"cat_pop“:0.0
 }, {
   "val":"Men > Clothing > Jackets & Coats",
   "count":157,
   "cat_pop“:0.0
 }, // and more

This looks fine all over but it turns out that „cat_pop“, the result of 
„sum(popularity)“ is always 0.0 even if the documents for this facet value have 
popularities > 0.

A quick check with this json.facet parameter:

json.facet: {cat_pop:"sum(popularity)"}

returns:

"facets": {
"count":2508,
"cat_pop":21.0},

To me, it seems it works fine on the base level but not when nested. Still, 
Yonik’s documentation and the Jira issues indicate that it is possible to use 
functions in nested facets so I might just be using the wrong structure? I have 
a hard time finding any other examples on the i-net and I had no luck changing 
the structure around.
Could someone shed some light on this for me? It would also help to know if it 
is not possible to sum the values up this way.

Thanks a lot!
Chantal




Re: "on deck" searcher vs warming searcher

2016-12-14 Thread Chris Hostetter

: In a situation where searchers A-E are queued in the states
: A: Current
: B: Warming
: C: Ondeck
: D: Ondeck
: E: Being created with newSearcher
: 
: wouldn't it make sense to discard C before it gets promoted to Warming,
: as the immediate action after warming C would be to start warming D?
: 
: Are there some situations where the (potentially extremely short lived)
: C searcher must be visible before D replaces it?

In theory it might make sense to throw out C, but in practice: 

1) since maxWarmingSearchers is typically a small value, E (and 
sometimes D) are rarely created

2) because of how the code is structured, discarding C isn't 
particularly easy ... the calls are happening in parallel threads, ie: 
some Thread#1 is warming B while some thread #2 has just opened C and is 
blocked on the single threaded warming executor while waiting to warm it.  
When Thread #3 comes along and opens D, it also gets blocked on the same 
executor.  

We'd need to revamp that code in some way that the existence of Thread #3 
(and beyond) while Thread #2 is queued up would cause Thread #2 to close C 
(w/o warming it) and instead be blocked waiting for D to warm -- such that 
once D completes warming both Thread #2 and Thread #3 both return D.

All of which is complicated by the fact that the code is actually 
returning the Searchers immediately, but also returning/setting a Future 
ref that is what's waiting on the warming to finish -- so callers can 
actually use the searchers concurrently with the warming (ie: 
useColdSearcher) if they wish.


So in a nutshell: yes, but it would be a pretty invasive change, and AFAIK 
rarely impacts people who don't already have bigger problems.




-Hoss
http://www.lucidworks.com/


Re: Reg: Is there a way to query solr leader directly using solrj?

2016-12-14 Thread Erick Erickson
First off I'm a bit confused. You say you're working with an
UpdateProcessorFactory but then want to use SolrJ to get
a leader. Why do this? Why not just work entirely locally and
reach into the _local_ index (note, you have to do this
after the doc has been routed to the correct shard)? Once there
you should be able to use the real-time get functionality to get
the latest version that's been sent regardless of whether it's
been committed or not.

And in the middle of this you say you're "pointing to the leader",
which implies you're really doing this from some external SolrJ
client, not as part of an update chain at all. So I'm missing
something.

Or are you talking about doing this on the _client_?

To answer your question, though,
CloudSolrClient.getCollection(...).getLeader(...)...
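Roughly, from memory (so double-check against the javadocs):

// look up the leader of a shard from the cluster state
ClusterState clusterState = cloudSolrClient.getZkStateReader().getClusterState();
Replica leader = clusterState.getCollection("mycollection").getLeader("shard1");
String leaderUrl = leader.getStr(ZkStateReader.BASE_URL_PROP)
    + "/" + leader.getStr(ZkStateReader.CORE_NAME_PROP);
// then point a plain HttpSolrClient at leaderUrl to hit that core directly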

Best,
Erick

On Wed, Dec 14, 2016 at 4:48 AM, indhu priya  wrote:
> Hi,
>
> In my project I have one leader and one replica architecture.
> I am using custom code( using DocumentUpdateProcessorFactory) for merging
> old documents with incoming new documents.
>
> eg. 1. if 1st document have 10 fields, all 10 fields will be indexed.
>   2. if 2nd document have 8 fields, 5 of which are in old document and
> 3 are in new document, then we will find the old document in index(using
> solrJ), then update the 5 fields of old document and add the new 3 fields
> with old document and hence we have total 13 updated fields in result
> document.
>
> When I am pointing to leader and do indexing, I am not facing any issues.
> But if I point to replica, then I am facing issues. since document
> distribution from replica to leader and again to replica is taking time.
>
> Eg. If first document comes in replica at time t1, then the distribution to
> leader happens at t2 and then the leader distributes it to replica at time
> t3. But the second in-coming document is coming before t3 and hence the
> custom code is not able to find its old document for merge.
>
> Hence, I need to know whether there is any simple way to query leader
> directly using solrj other than finding the leader using zookeeper and then
> hitting http url?
>
> Notes: We are using SOLR 5.5 and i tried using zookeeper but zookeeper is
> distributing the query.
>
> Please let me know if any queries.
>
> Thanks,
> Indhupriya.S


Reg: Is there a way to query solr leader directly using solrj?

2016-12-14 Thread indhu priya
Hi,

In my project I have one leader and one replica architecture.
I am using custom code( using DocumentUpdateProcessorFactory) for merging
old documents with incoming new documents.

eg. 1. if 1st document have 10 fields, all 10 fields will be indexed.
  2. if 2nd document have 8 fields, 5 of which are in old document and
3 are in new document, then we will find the old document in index(using
solrJ), then update the 5 fields of old document and add the new 3 fields
with old document and hence we have total 13 updated fields in result
document.

When I am pointing to leader and do indexing, I am not facing any issues.
But if I point to replica, then I am facing issues. since document
distribution from replica to leader and again to replica is taking time.

Eg. If first document comes in replica at time t1, then the distribution to
leader happens at t2 and then the leader distributes it to replica at time
t3. But the second in-coming document is coming before t3 and hence the
custom code is not able to find its old document for merge.

Hence, I need to know whether there is any simple way to query leader
directly using solrj other than finding the leader using zookeeper and then
hitting http url?

Notes: We are using SOLR 5.5 and i tried using zookeeper but zookeeper is
distributing the query.

Please let me know if any queries.

Thanks,
Indhupriya.S


Re: High increasing slab memory solr 6

2016-12-14 Thread moscovig
In the mean time I am removing all the explicit commits we have in the code.


Will update if it got better



--
View this message in context: 
http://lucene.472066.n3.nabble.com/High-increasing-slab-memory-solr-6-tp4309708p4309718.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Has anyone used linode.com to run Solr | ??Best way to deliver PHP/Apache clients with Solr question

2016-12-14 Thread GW
Thanks,

I understand accessing solr directly. I'm doing REST calls to a single
machine.

If I have a cluster of five servers and say three Apache servers, I can
round robin the REST calls to all five in the cluster?

I guess I'm going to find out. :-)  If so I might be better off just
running Apache on all my solr instances.





On 14 December 2016 at 07:08, Dorian Hoxha  wrote:

> See replies inline:
>
> On Wed, Dec 14, 2016 at 11:16 AM, GW  wrote:
>
> > Hello folks,
> >
> > I'm about to set up a Web service I created with PHP/Apache <--> Solr
> Cloud
> >
> > I'm hoping to index a bazillion documents.
> >
> ok , how many inserts/second ?
>
> >
> > I'm thinking about using Linode.com because the pricing looks great. Any
> > opinions??
> >
> Pricing is 'ok'. For bazillion documents, I would skip vps and go straight
> dedicated. Check out ovh.com / online.net etc etc
>
> >
> > I envision using an Apache/PHP round robin in front of a solr cloud
> >
> > My thoughts are that I send my requests to the Solr instances on the
> > Zookeeper Ensemble. Am I missing something?
> >
> You contact with solr directly, don't have to connect to zookeeper for
> loadbalancing.
>
> >
> > What can I say.. I'm software oriented and a little hardware challenged.
> >
> > Thanks in advance,
> >
> > GW
> >
>


Re: High increasing slab memory solr 6

2016-12-14 Thread moscovig
Shawn, thanks for the reply

Please take a look at that post. It's describing the same issue with ES

They describe the issue as "dentry cache is bloating memory"

https://discuss.elastic.co/t/memory-usage-of-the-machine-with-es-is-continuously-increasing/23537/5

Thanks
Gilad






--
View this message in context: 
http://lucene.472066.n3.nabble.com/High-increasing-slab-memory-solr-6-tp4309708p4309713.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr has a CPU% spike when indexing a batch of data

2016-12-14 Thread Shawn Heisey
On 12/14/2016 1:28 AM, forest_soup wrote:
> We are doing indexing on the same http endpoint. But as we have shardnum=1 and
> replicafactor=1, each collection only has one core. So there should be no
> distributed update/query, as we are using solrj's CloudSolrClient which will
> get the target URL of the solrnode when requesting to each collection.
>
> For the questions:
> * What is the total physical memory in the machine? 
> 128GB
>
> * What is the max heap on each of the two Solr processes? 
> 32GB for each 
>
> * What is the total index size in each Solr process?
> Each Solr node(process) has 16 cores. 130GB for each solr core. So totally
>> 2000G for each solr node. 

This means that you have approximately 64GB left  for your OS after
deducting the heap sizes, which it must use for itself and for OS disk
caching.  With nearly 2 terabytes of index data on the machine, 64GB is
nowhere near enough for good performance.  The server will be VERY busy
whenever there is query activity, so the CPU spike is what I would
expect.  For that much index data, I would hope to have somewhere
between 512GB and 2 terabytes of memory.  Adding machines and/or
increasing memory in each machine would make your performance better and
reduce CPU load.

https://wiki.apache.org/solr/SolrPerformanceProblems

> * What is the total tlog size in each Solr process? 
> 25m for each core. So totally 400m for each solr node.
>
> 
> ${solr.ulog.dir:}
>  name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}
> 1
> 100
> 

Compared to the amount of index data, 400MB is tiny, but this will take
a long time to process on restart.  You might want to consider lowering
the amount of data that the update log keeps so restarts are faster.
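The knobs for that are numRecordsToKeep and maxNumLogsToKeep on the updateLog
config, e.g. something like (the numbers are just a starting point):

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
  <int name="numRecordsToKeep">100</int>
  <int name="maxNumLogsToKeep">10</int>
</updateLog>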

> * What are your commit characteristics like -- both manual and automatic. 
>
> <autoCommit>
>   <maxDocs>1</maxDocs>
>   <maxTime>${solr.autoCommit.maxTime:59000}</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>
>
> <autoSoftCommit>
>   <maxDocs>5000</maxDocs>
>   <maxTime>${solr.autoSoftCommit.maxTime:31000}</maxTime>
> </autoSoftCommit>

I would personally remove the "maxDocs" portion of these settings and do
the automatic commits based purely on time.  For the amount of data
you're handling, those are very low maxDocs numbers, and could result in
very frequent commits when you index.  The time values are lower than I
would prefer, but are probably OK.

The number of collections should be no problem.  If there were hundreds
or thousands, that might be different.

Thanks,
Shawn



Re: Collection API CREATE creates name like '_shard1_replica1'

2016-12-14 Thread Shawn Heisey
On 12/14/2016 1:36 AM, Sandeep Khanzode wrote:
> I uploaded (upconfig) config (schema and solrconfig XMLs) to Zookeeper
> and then linked (linkconfig) the confname to a collection name. When I
> attempt to create a collection using the API like this
> .../solr/admin/collections?action=CREATE&name=abc&numShards=1&collection.configName=abc
>   ...
> it creates a collection core named abc_shard1_replica1 and not simply
> abc. 

This is exactly what it is supposed to do.  These are the *core* names. 
Each core is a shard replica.  The minimum value for shard count and
replica count is 1.  When making queries or update requests to Solr, you
can still use the "abc" name, and SolrCloud will figure out which cores
on which machines need to receive the request.

Thanks,
Shawn



Re: Solr - Amazon like search

2016-12-14 Thread Shawn Heisey
On 12/13/2016 10:55 PM, vasanth vijayaraj wrote:
> We are building an e-commerce mobile app. I have implemented Solr search and 
> autocomplete. 
> But we like the Amazon search and are trying to implement something like 
> that. Attached a screenshot 
> of what has been implemented so far
>
> The search/suggest should sort list of products based on popularity, document 
> hits and more. 
> How do we achieve this? Please help us out here. 

Your attachment didn't make it to the list.  They rarely do.  We can't
see whatever it is you were trying to include.

Sorting on things like popularity and hits requires putting that
information into the index so that each document has fields that encode
this information, allowing you to use Solr's standard sorting
functionality with those fields.  You also need a process to update that
information when there's a new hit.  It's possible, but you have to
write this into your indexing system.
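Once the popularity information is in a numeric field, the query side is just
ordinary sorting or boosting, e.g. something like:

q=shoes&sort=score desc, popularity desc

or, with edismax, a multiplicative boost such as boost=log(sum(popularity,1)).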

Solr doesn't include special functionality for this.  It would be hard
to generalize, and it can all be done without special functionality.

Thanks,
Shawn



Re: High increasing slab memory solr 6

2016-12-14 Thread Shawn Heisey
On 12/14/2016 5:55 AM, moscovig wrote:
> We have solr 6.2.1.
> One of the collection is causing lots of updates.
> We see the next logs: 
> 
> /INFO  org.apache.solr.core.SolrDeletionPolicy :
> SolrDeletionPolicy.onCommit: commits: num=2
> 
> commit{dir=/opt/solr-6.2.1/server/solr/collection_shard1_replica2/data/index,segFN=segments_qbmv,generation=1228135}
> 
> commit{dir=/opt/solr-6.2.1/server/solr/collection_shard1_replica2/data/index,segFN=segments_qbmw,generation=1228136}/

Those do not look like any problem at all.  The first one says INFO, the
others probably do too, but what's here doesn't include the severity.

> As a result we are running out of memory in the instances hosting the
> collection. 
> The used memory is increased by 1 percent per day. 
> 
> The used memory is not part of the Solr's JVM, but part of the Slab memory
> (which I get to know now :) )

Solr does not explicitly allocate memory outside of the JVM.  Solr (via
Java) uses MMAP for access to index data, which relies on the OS using
memory for the disk cache, but this is normal OS function, and not
anything unusual.  The OS can instantly re-allocate any of that memory
for use by programs that request it.

> when cat over /proc/meminfo we get:
> /
> Slab:   17906760 kB
> SReclaimable:   17841548 kB
> /
> 
> and slabtop gives:
> 91635138 91635138   6%0.19K 4363578   21  17454312K dentry
> 
> ~17 GB for dentry.
> 
> Is there any way to avoid this "memory leak"?
> 
> echo 2 > /proc/sys/vm/drop_caches ; sync is cleaning this "clean" cache
> but - 

This sounds like the memory is being used for the OS disk cache -- which
is completely normal, and exactly how your spare memory SHOULD be used.
Solr has no control over this, and it's very likely that Java doesn't
either.  This is *NOT* a memory leak.  Your OS is working exactly how it
is supposed to work -- using otherwise unallocated memory to speed up
the system.  If a program requests any of that memory, the OS will
instantly release whatever the program requests.

https://en.wikipedia.org/wiki/Page_cache
https://wiki.apache.org/solr/SolrPerformanceProblems
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

> 2. What's the meaning of the SolrDeletionPolicy logs we get? do we commit
> lots of updates? deletes? 

The deletion policy has to do with commit points.  This is Lucene
functionality that Solr doesn't really use -- a typical deletion policy
will delete all but the most recent commit point.  Solr has a lot of
logging by default that is not useful to the average person, but can
provide information vital to a developer who's familiar with Lucene.

Thanks,
Shawn



High increasing slab memory solr 6

2016-12-14 Thread moscovig
Hi

We have solr 6.2.1.
One of the collections is causing lots of updates.
We see the following logs: 

/INFO  org.apache.solr.core.SolrDeletionPolicy :
SolrDeletionPolicy.onCommit: commits: num=2

commit{dir=/opt/solr-6.2.1/server/solr/collection_shard1_replica2/data/index,segFN=segments_qbmv,generation=1228135}

commit{dir=/opt/solr-6.2.1/server/solr/collection_shard1_replica2/data/index,segFN=segments_qbmw,generation=1228136}/


As a result we are running out of memory in the instances hosting the
collection. 
The used memory is increased by 1 percent per day. 

The used memory is not part of the Solr's JVM, but part of the Slab memory
(which I get to know now :) )

when cat over /proc/meminfo we get:
/
Slab:   17906760 kB
SReclaimable:   17841548 kB
/

and slabtop gives:
91635138 91635138   6%0.19K 4363578   21  17454312K dentry

~17 GB for dentry.


Is there any way to avoid this "memory leak"?

echo 2 > /proc/sys/vm/drop_caches ; sync is cleaning this "clean" cache
but - 

1. Either from the OS side or from the solr collection side?

2. What's the meaning of the SolrDeletionPolicy logs we get? do we commit
lots of updates? deletes? 

Thanks
Gilad






--
View this message in context: 
http://lucene.472066.n3.nabble.com/High-increasing-slab-memory-solr-6-tp4309708.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Has anyone used linode.com to run Solr | ??Best way to deliver PHP/Apache clients with Solr question

2016-12-14 Thread Dorian Hoxha
See replies inline:

On Wed, Dec 14, 2016 at 11:16 AM, GW  wrote:

> Hello folks,
>
> I'm about to set up a Web service I created with PHP/Apache <--> Solr Cloud
>
> I'm hoping to index a bazillion documents.
>
ok , how many inserts/second ?

>
> I'm thinking about using Linode.com because the pricing looks great. Any
> opinions??
>
Pricing is 'ok'. For bazillion documents, I would skip vps and go straight
dedicated. Check out ovh.com / online.net etc etc

>
> I envision using an Apache/PHP round robin in front of a solr cloud
>
> My thoughts are that I send my requests to the Solr instances on the
> Zookeeper Ensemble. Am I missing something?
>
You contact solr directly; you don't have to connect to zookeeper for
load balancing.

>
> What can I say.. I'm software oriented and a little hardware challenged.
>
> Thanks in advance,
>
> GW
>


Has anyone used linode.com to run Solr | ??Best way to deliver PHP/Apache clients with Solr question

2016-12-14 Thread GW
Hello folks,

I'm about to set up a Web service I created with PHP/Apache <--> Solr Cloud

I'm hoping to index a bazillion documents.

I'm thinking about using Linode.com because the pricing looks great. Any
opinions??

I envision using an Apache/PHP round robin in front of a solr cloud

My thoughts are that I send my requests to the Solr instances on the
Zookeeper Ensemble. Am I missing something?

What can I say.. I'm software oriented and a little hardware challenged.

Thanks in advance,

GW


Re: "on deck" searcher vs warming searcher

2016-12-14 Thread Toke Eskildsen
On Tue, 2016-12-13 at 16:07 -0700, Chris Hostetter wrote:
> ** "warming" happens i na single threaded executor -- so if there
> are multiple ondeck searchers, only one of them at a time is ever a
> "warming" searcher
> ** multiple ondeck searchers can be a sign of a potential performance
> problem (hence the log warning) [...]

In a situation where searchers A-E are queued in the states
A: Current
B: Warming
C: Ondeck
D: Ondeck
E: Being created with newSearcher

wouldn't it make sense to discard C before it gets promoted to Warming,
as the immediate action after warming C would be to start warming D?

Are there some situations where the (potentially extremely short lived)
C searcher must be visible before D replaces it?

- Toke Eskildsen, State and University Library, Denmark


Collection API CREATE creates name like '_shard1_replica1'

2016-12-14 Thread Sandeep Khanzode
Hi,
I uploaded (upconfig) config (schema and solrconfig XMLs) to Zookeeper and then 
linked (linkconfig) the confname to a collection name.
When I attempt to create a collection using the API like this 
.../solr/admin/collections?action=CREATE&name=abc&numShards=1&collection.configName=abc
  ... it creates a collection core named abc_shard1_replica1 and not simply abc.
What is missing? 
SRK

Re: Solr has a CPU% spike when indexing a batch of data

2016-12-14 Thread forest_soup
Thanks, Shawn!

We are doing indexing on the same http endpoint. But as we have shardnum=1 and
replicafactor=1, each collection only has one core. So there should be no
distributed update/query, as we are using solrj's CloudSolrClient which will
get the target URL of the solrnode when requesting to each collection.

For the questions:
* What is the total physical memory in the machine? 
128GB

* What is the max heap on each of the two Solr processes? 
32GB for each 

* What is the total index size in each Solr process?
Each Solr node(process) has 16 cores. 130GB for each solr core. So totally
>2000G for each solr node. 
 
* What is the total tlog size in each Solr process? 
25m for each core. So totally 400m for each solr node.


${solr.ulog.dir:}
${solr.ulog.numVersionBuckets:65536}
1
100


* What are your commit characteristics like -- both manual and automatic. 


<autoCommit>
  <maxDocs>1</maxDocs>
  <maxTime>${solr.autoCommit.maxTime:59000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxDocs>5000</maxDocs>
  <maxTime>${solr.autoSoftCommit.maxTime:31000}</maxTime>
</autoSoftCommit>


* Do you have WARN or ERROR messages in your logfile? 
No.

* How many collections are in each cloud? 
80 collections with only one shard each. And replicafactor=1.

* How many servers are in each cloud? 
5 solr nodes. So each solr node has 16 cores.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-has-a-CPU-spike-when-indexing-a-batch-of-data-tp4309529p4309669.html
Sent from the Solr - User mailing list archive at Nabble.com.