Re: Stream InnerJoin to merge hierarchical data

2020-02-07 Thread Joel Bernstein
This is working as designed, I believe. The issue is that innerJoin relies on
the sort order of the incoming streams in order to perform a streaming merge join. The
first join works because both sorts line up on childId.

  innerJoin(search(collection_name,
                   q="type:grandchild",
                   qt="/export",
                   fl="grandchild.name, grandId, childId, parentId",
                   sort="childId asc"),
            search(collection_name,
                   q="type:child",
                   qt="/export",
                   fl="child.name, childId, parentId",
                   sort="childId asc"),
            on="childId")

The second join, though, is attempting to join on parentId, but the sorts do not
allow that, as the output of the first join is sorted on childId.

One possible solution is to use fetch to retrieve the parent for the child:
https://lucene.apache.org/solr/guide/8_0/stream-decorator-reference.html#fetch
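
For example, something along these lines might work (an untested sketch based on the
expressions above; note that fetch does point lookups by the "on" field, so with parents,
children and grandchildren all in the same collection you may need to make sure the
parentId lookup resolves to the parent documents, since fetch is not designed for
one-to-many fetches):

fetch(collection_name,
      innerJoin(search(collection_name, q="type:grandchild", qt="/export",
                       fl="grandchild.name,grandId,childId,parentId", sort="childId asc"),
                search(collection_name, q="type:child", qt="/export",
                       fl="child.name,childId,parentId", sort="childId asc"),
                on="childId"),
      fl="parent.name",
      on="parentId=parentId")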


Joel Bernstein
http://joelsolr.blogspot.com/


On Fri, Feb 7, 2020 at 2:23 PM sambasivarao giddaluri <
sambasiva.giddal...@gmail.com> wrote:

> Hi All,
>
> Our dataset is about 50M records and we are using complex graph queries; we are
> now trying to do an innerJoin on the records and are facing the issue below.
> This is a critical issue for us.
>
> Parent
> {
> parentId:"1"
> parent.name:"foo"
> type:"parent"
>
> }
> Child
> {
> childId:"2"
> parentId:"1"
> child.name:"bar"
> type:"child"
> }
> GrandChild
> {
> grandId:"3"
> childId:"2"
> parentId:"1"
> grandchild.name:"too"
> type:"grandchild"
> }
>
> innerJoin(search(collection_name, q="type:grandchild", qt="/export",
>                  fl="grandchild.name,grandId,childId,parentId", sort="childId asc"),
>           search(collection_name, q="type:child", qt="/export",
>                  fl="child.name,childId,parentId", sort="childId asc"),
>           on="childId")
>
> This works and gives the result:
> {
> "parentId": "1",
> "childId": "2",
> "grandId: "3",
> "grandchild.name": "too",
> "child.name": "bar"
>  }
>
> But if I try to join the parent as well with another innerJoin, it gives an
> error:
>
> innerJoin(
>   innerJoin(search(collection_name, q="type:grandchild", qt="/export",
>                    fl="grandchild.name,grandId,childId,parentId", sort="childId asc"),
>             search(collection_name, q="type:child", qt="/export",
>                    fl="child.name,childId,parentId", sort="childId asc"),
>             on="childId"),
>   search(collection_name, q="type:parent", qt="/export",
>          fl="parent.name,parentId", sort="parentId asc"),
>   on="parentId")
>
> ERROR
> {
>   "result-set": {
> "docs": [
>   {
> "EXCEPTION": "Invalid JoinStream - all incoming stream comparators
> (sort) must be a superset of this stream's equalitor.",
> "EOF": true
>   }
> ]
>   }
> }
>
>
> If we change the key parentId in the child doc to childParentId, and similarly
> change childId and parentId in the grandchild doc to grandchildId and grandParentId,
> then the query will work, but this is a big change to the schema.
> I also referred to this issue: https://issues.apache.org/jira/browse/SOLR-10512
>
> Thanks
> sam
>


Stream InnerJoin to merge hierarchical data

2020-02-07 Thread sambasivarao giddaluri
Hi All,

Our dataset is about 50M records and we are using complex graph queries; we are
now trying to do an innerJoin on the records and are facing the issue below.
This is a critical issue for us.

Parent
{
parentId:"1"
parent.name:"foo"
type:"parent"

}
Child
{
childId:"2"
parentId:"1"
child.name:"bar"
type:"child"
}
GrandChild
{
grandId:"3"
childId:"2"
parentId:"1"
grandchild.name:"too"
type:"grandchild"
}

innerJoin(search(collection_name, q="type:grandchild", qt="/export",
                 fl="grandchild.name,grandId,childId,parentId", sort="childId asc"),
          search(collection_name, q="type:child", qt="/export",
                 fl="child.name,childId,parentId", sort="childId asc"),
          on="childId")

This works and gives the result:
{
"parentId": "1",
"childId": "2",
"grandId: "3",
"grandchild.name": "too",
"child.name": "bar"
 }

But if I try to join the parent as well with another innerJoin, it gives an
error:

innerJoin(
  innerJoin(search(collection_name, q="type:grandchild", qt="/export",
                   fl="grandchild.name,grandId,childId,parentId", sort="childId asc"),
            search(collection_name, q="type:child", qt="/export",
                   fl="child.name,childId,parentId", sort="childId asc"),
            on="childId"),
  search(collection_name, q="type:parent", qt="/export",
         fl="parent.name,parentId", sort="parentId asc"),
  on="parentId")

ERROR
{
  "result-set": {
"docs": [
  {
"EXCEPTION": "Invalid JoinStream - all incoming stream comparators
(sort) must be a superset of this stream's equalitor.",
"EOF": true
  }
]
  }
}


If we change the key parentId in the child doc to childParentId, and similarly
change childId and parentId in the grandchild doc to grandchildId and grandParentId,
then the query will work, but this is a big change to the schema.
I also referred to this issue: https://issues.apache.org/jira/browse/SOLR-10512

Thanks
sam


Solr Analyzer: Filter to drop tokens based on some logic which needs access to adjacent tokens

2020-02-07 Thread Pratik Patel
Hello Everyone,

Let's say I have an analyzer which has the following token stream as output.

*token stream: [], a, ab, [], c, [], d, de, def*

Now let's say I want to add another filter which will drop certain tokens
based on whether the adjacent token on the right side is [] or some string.

For a given token,
 drop/replace it with an empty string if there is a non-empty string
token on its right, and
 keep it if there is an empty string token on its right.

Based on this, the resulting token stream would be like this.

*desired output stream: [], [a], ab, [], c, [], d, de, def*


*Is there any Filter available in Solr with which this can be achieved?*
*If writing a custom filter is the only possible option, then I want to know
whether it's possible to access adjacent tokens in the custom filter.*

*Any idea about this would be really helpful.*
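
(For illustration, a rough sketch of the kind of look-ahead custom filter described
above, using a hypothetical class rather than an existing Solr/Lucene filter; it
buffers one token and blanks its term when the following token is non-empty:)

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeSource;

public final class LookAheadBlankingFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private AttributeSource.State pending; // the buffered "current" token
  private boolean exhausted = false;

  public LookAheadBlankingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (exhausted && pending == null) {
      return false;
    }
    if (pending == null) {
      // Buffer the first token so we always know what follows it.
      if (!input.incrementToken()) {
        exhausted = true;
        return false;
      }
      pending = captureState();
    }
    // Peek at the token to the right; the shared attributes now describe it.
    boolean hasNext = input.incrementToken();
    boolean nextIsEmpty = hasNext && termAtt.length() == 0;
    AttributeSource.State next = hasNext ? captureState() : null;

    // Re-emit the buffered token, blanking it when a non-empty token follows.
    restoreState(pending);
    if (hasNext && !nextIsEmpty) {
      termAtt.setEmpty();
    }
    pending = next;
    if (!hasNext) {
      exhausted = true;
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending = null;
    exhausted = false;
  }
}

(To use something like this in Solr it would also need a corresponding, equally
hypothetical, TokenFilterFactory registered in the field type's analyzer chain.)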

Thanks,
Pratik


Re: Checking in on Solr Progress

2020-02-07 Thread Walter Underwood
I wrote some Python that checks CLUSTERSTATUS and reports replica status to 
Telegraf. Great for charts and alerts, but it only shows status, not progress.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 7, 2020, at 7:58 AM, Erick Erickson  wrote:
> 
> I was wondering about using metrics myself. I confess I didn’t look to see 
> what was already there either ;)
> 
> Actually, using metrics might be easiest all told, but I also confess I have 
> no clue what it takes to build a new metric in. Nor how to use the same (?) 
> collection process for the 5 situations I outlined, and those just off the 
> top of my head.
> 
> It’s particularly frustrating when diagnosing these not knowing whether the 
> “recovering” state is going to resolve itself sometime or not. I’ve seen Solr 
> replicas stuck in that state forever….
> 
> Andrzej could certainly shed some light on that question.
> 
> All ideas welcome of course!
> 
>> On Feb 7, 2020, at 10:40 AM, Jan Høydahl  wrote:
>> 
>> Could we expose some high level recovery info as part of metrics api? Then 
>> people could track number of cores recovering, recovery time, recovery 
>> phase, number of recoveries failed etc, and also build alerts on top of that.
>> 
>> Jan Høydahl
>> 
>>> 6. feb. 2020 kl. 19:42 skrev Erick Erickson :
>>> 
>>> There’s actually a crying need for this, but there’s nothing that’s there 
>>> yet, basically you have to look at the log files and try to figure it out. 
>>> 
>>> Actually I think this would be a great thing to work on, but it’d be pretty 
>>> much all new. If you’d like, you can create a Solr Improvement Proposal 
>>> here: https://cwiki.apache.org/confluence/display/SOLR/SIP+Template to 
>>> flesh out what this would look like.
>>> 
>>> A couple of thoughts off the top of my head:
>>> 
>>> I really think what would be most useful would be a collections API 
>>> command, something like “RECOVERYSTATUS”, or maybe extend CLUSTERSTATUS. 
>>> Currently a replica can be stuck in recovery and never get out. There are 
>>> several scenarios that’d have to be considered:
>>> 
>>> 1> normal startup. The replica briefly goes from down->recovering->active 
>>> which should be quite brief. 
>>> 1a> Waiting for a leader to be elected before continuing
>>> 
>>> 2> “peer sync” where another replica is replaying documents from the tlog.
>>> 
>>> 3> situations where the replica is replaying documents from its own tlog. 
>>> This can be very, very, very long too.
>>> 
>>> 4> full sync where it’s copying the entire index from a leader.
>>> 
>>> 5> knickers in a knot, it’s given up even trying to recover.
>>> 
>>> In either case, you’d want to report “all ok” if nothing was in recovery, 
>>> “just the ones having trouble” and “everything because I want to look”.
>>> 
>>> But like I said, there’s nothing really built into the system to accomplish 
>>> this now that I know of.
>>> 
>>> Best,
>>> Erick
>>> 
 On Feb 6, 2020, at 12:15 PM, dj-manning  wrote:
 
 Erick Erickson wrote
> When you say “look”, where are you looking from? Http requests? SolrJ? The
> admin UI?
 
 I'm open to looking from anywhere - an HTTP request, the admin UI, or
 following a log if possible.
 
 My objective for this ask would be to interactively follow/watch
 Solr's recovery progress as a human - if that's even possible.
 
 Stretch goal would be to autonomously report on recovery progress.
 
 The question stems from seeing recovery in the log or the admin UI, then
 wondering what the progress is.
 
 Appreciation.
 
 
 
 
 --
 Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>>> 
> 



Re: Checking in on Solr Progress

2020-02-07 Thread Erick Erickson
I was wondering about using metrics myself. I confess I didn’t look to see what 
was already there either ;)

Actually, using metrics might be easiest all told, but I also confess I have no 
clue what it takes to build a new metric in. Nor how to use the same (?) 
collection process for the 5 situations I outlined, and those just off the top 
of my head.

It’s particularly frustrating when diagnosing these not knowing whether the 
“recovering” state is going to resolve itself sometime or not. I’ve seen Solr 
replicas stuck in that state forever….

Andrzej could certainly shed some light on that question.

All ideas welcome of course!

> On Feb 7, 2020, at 10:40 AM, Jan Høydahl  wrote:
> 
> Could we expose some high level recovery info as part of metrics api? Then 
> people could track number of cores recovering, recovery time, recovery phase, 
> number of recoveries failed etc, and also build alerts on top of that.
> 
> Jan Høydahl
> 
>> 6. feb. 2020 kl. 19:42 skrev Erick Erickson :
>> 
>> There’s actually a crying need for this, but there’s nothing that’s there 
>> yet, basically you have to look at the log files and try to figure it out. 
>> 
>> Actually I think this would be a great thing to work on, but it’d be pretty 
>> much all new. If you’d like, you can create a Solr Improvement Proposal 
>> here: https://cwiki.apache.org/confluence/display/SOLR/SIP+Template to flesh 
>> out what this would look like.
>> 
>> A couple of thoughts off the top of my head:
>> 
>> I really think what would be most useful would be a collections API command, 
>> something like “RECOVERYSTATUS”, or maybe extend CLUSTERSTATUS. Currently a 
>> replica can be stuck in recovery and never get out. There are several 
>> scenarios that’d have to be considered:
>> 
>> 1> normal startup. The replica briefly goes from down->recovering->active 
>> which should be quite brief. 
>> 1a> Waiting for a leader to be elected before continuing
>> 
>> 2> “peer sync” where another replica is replaying documents from the tlog.
>> 
>> 3> situations where the replica is replaying documents from its own tlog. 
>> This can be very, very, very long too.
>> 
>> 4> full sync where it’s copying the entire index from a leader.
>> 
>> 5> knickers in a knot, it’s given up even trying to recover.
>> 
>> In either case, you’d want to report “all ok” if nothing was in recovery, 
>> “just the ones having trouble” and “everything because I want to look”.
>> 
>> But like I said, there’s nothing really built into the system to accomplish 
>> this now that I know of.
>> 
>> Best,
>> Erick
>> 
>>> On Feb 6, 2020, at 12:15 PM, dj-manning  wrote:
>>> 
>>> Erick Erickson wrote
 When you say “look”, where are you looking from? Http requests? SolrJ? The
 admin UI?
>>> 
>>> I'm open to looking from anywhere - an HTTP request, the admin UI, or
>>> following a log if possible.
>>> 
>>> My objective for this ask would be to interactively follow/watch
>>> Solr's recovery progress as a human - if that's even possible.
>>> 
>>> Stretch goal would be to autonomously report on recovery progress.
>>> 
>>> The question stems from seeing recovery in the log or the admin UI, then
>>> wondering what the progress is.
>>> 
>>> Appreciation.
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>> 



Re: Checking in on Solr Progress

2020-02-07 Thread Jan Høydahl
Could we expose some high level recovery info as part of metrics api? Then 
people could track number of cores recovering, recovery time, recovery phase, 
number of recoveries failed etc, and also build alerts on top of that.

Jan Høydahl

> 6. feb. 2020 kl. 19:42 skrev Erick Erickson :
> 
> There’s actually a crying need for this, but there’s nothing that’s there 
> yet, basically you have to look at the log files and try to figure it out. 
> 
> Actually I think this would be a great thing to work on, but it’d be pretty 
> much all new. If you’d like, you can create a Solr Improvement Proposal here: 
> https://cwiki.apache.org/confluence/display/SOLR/SIP+Template to flesh out 
> what this would look like.
> 
> A couple of thoughts off the top of my head:
> 
> I really think what would be most useful would be a collections API command, 
> something like “RECOVERYSTATUS”, or maybe extend CLUSTERSTATUS. Currently a 
> replica can be stuck in recovery and never get out. There are several 
> scenarios that’d have to be considered:
> 
> 1> normal startup. The replica briefly goes from down->recovering->active 
> which should be quite brief. 
> 1a> Waiting for a leader to be elected before continuing
> 
> 2> “peer sync” where another replica is replaying documents from the tlog.
> 
> 3> situations where the replica is replaying documents from its own tlog. 
> This can be very, very, very long too.
> 
> 4> full sync where it’s copying the entire index from a leader.
> 
> 5> knickers in a knot, it’s given up even trying to recover.
> 
> In either case, you’d want to report “all ok” if nothing was in recovery, 
> “just the ones having trouble” and “everything because I want to look”.
> 
> But like I said, there’s nothing really built into the system to accomplish 
> this now that I know of.
> 
> Best,
> Erick
> 
>> On Feb 6, 2020, at 12:15 PM, dj-manning  wrote:
>> 
>> Erick Erickson wrote
>>> When you say “look”, where are you looking from? Http requests? SolrJ? The
>>> admin UI?
>> 
>> I'm open to looking from anywhere - an HTTP request, the admin UI, or
>> following a log if possible.
>> 
>> My objective for this ask would be to interactively follow/watch
>> Solr's recovery progress as a human - if that's even possible.
>> 
>> Stretch goal would be to autonomously report on recovery progress.
>> 
>> The question stems from seeing recovery in the log or the admin UI, then
>> wondering what the progress is.
>> 
>> Appreciation.
>> 
>> 
>> 
>> 
>> --
>> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 


Re: Solr 7.7 heap space is getting full

2020-02-07 Thread Erick Erickson
Walter’s comment (that I’ve seen too BTW) is something
to pursue if (and only if) you have proof that Solr is spinning
up thousands of threads. Do you have any proof of that?

Having several hundred threads running is quite common BTW.

Attach jconsole or take a thread dump and it’ll be obvious.

However, having thousands of threads is fairly rare in my experience.

You simply must take a heap dump and analyze it to have any hope
of identifying exactly what the issue is. It’s quite possible that you
simply need more memory. It’s possible you don’t have docValues
enabled for all the fields you facet, group, sort, or use function
queries with. It’s possible that…. 
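
(For reference, and only as an illustrative placeholder rather than the actual schema
in question, enabling docValues on a field used for sorting/faceting looks roughly
like this in the schema; note that changing it requires reindexing.)

  <field name="category" type="string" indexed="true" stored="true" docValues="true"/>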

Best,
Erick

> On Feb 6, 2020, at 9:07 PM, Rajdeep Sahoo  wrote:
> 
> If we reduce the number of threads, is it going to help?
> Is there any other way to debug this?
> 
> 
> On Mon, 3 Feb, 2020, 2:52 AM Walter Underwood, 
> wrote:
> 
>> The only time I’ve ever had an OOM is when Solr gets a huge load
>> spike and fires up 2000 threads. Then it runs out of space for stacks.
>> 
>> I’ve never run anything other than an 8GB heap, starting with Solr 1.3
>> at Netflix.
>> 
>> Agreed about filter cache, though I’d expect heavy use of that to most
>> often be part of a faceted search system.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Feb 2, 2020, at 12:36 PM, Erick Erickson 
>> wrote:
>>> 
>>> Mostly I was reacting to the statement that the number
>>> of docs increased by over 4x and then there were
>>> memory problems.
>>> 
>>> Hmmm, that said, what does “heap space is getting full”
>>> mean anyway? If you’re hitting OOMs, that’s one thing. If
>>> you’re measuring the amount of heap consumed and
>>> noticing that it fills up, that’s totally normal. Java will
>>> collect garbage when it needs to. If you attach something
>>> like jconsole to Solr you’ll see memory grow and shrink
>>> quite regularly. Take a look at your garbage collection logs
>>> with something like GCViewer to see how much memory is
>>> still required after a GC cycle. If that number is reasonable
>>> then there’s no problem.
>>> 
>>> Walter:
>>> 
>>> Well, the expectation that one can keep adding docs without
>>> considering heap size is simply naive. The filterCache
>>> for instance grows linearly with the number of documents
>>> (OK, if it it stores the full bitset). Real Time Get requires
>>> on-heap structures to keep track of changed docs between
>>> commits. Etc.
>>> 
>>> The OP hasn’t even told us whether docValues are enabled
>>> appropriately, which if not set for fields needing it will also
>>> grow heap requirements linearly with the number of docs.
>>> 
>>> I’ll totally agree that the relationship between the size of
>>> the index on disk and heap is iffy at best. But if more heap is
>>> _not_ needed for bigger indexes then we’d never hit OOMs
>>> no matter how many docs we put in 4G.
>>> 
>>> Best,
>>> Erick
>>> 
>>> 
>>> 
 On Feb 2, 2020, at 11:18 AM, Walter Underwood 
>> wrote:
 
 We CANNOT diagnose anything until you tell us the error message!
 
 Erick, I strongly disagree that more heap is needed for bigger indexes.
 Except for faceting, Lucene was designed to stream index data and
 work regardless of the size of the index. Indexing is in RAM buffer
 sized chunks, so large updates also don’t need extra RAM.
 
 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/  (my blog)
 
> On Feb 2, 2020, at 7:52 AM, Rajdeep Sahoo 
>> wrote:
> 
> We have allocated 16 GB of heap space out of 24 GB.
> There are 3 Solr cores here; for one core, when the number of documents
> increases to around 4.5 lakh (450,000), this scenario happens.
> 
> 
> On Sun, 2 Feb, 2020, 9:02 PM Erick Erickson, 
> wrote:
> 
>> Allocate more heap and possibly add more RAM.
>> 
>> What are your expectations? You can't continue to
>> add documents to your Solr instance without regard to
>> how much heap you’ve allocated. You’ve put over 4x
>> the number of docs on the node. There’s no magic here.
>> You can’t continue to add docs to a Solr instance without
>> increasing the heap at some point.
>> 
>> And as far as I know, you’ve never told us how much heap you
>> _are_ allocating. The default for Java processes is 512M, which
>> is quite small, so perhaps it’s a simple matter of starting Solr
>> with the -Xmx parameter set to something larger.
>> 
>> Best,
>> Erick
>> 
>>> On Feb 2, 2020, at 10:19 AM, Rajdeep Sahoo <
>> rajdeepsahoo2...@gmail.com>
>> wrote:
>>> 
>>> What can we do in this scenario, as the Solr master node is going down and
>>> the indexing is failing?
>>> Please provide some workaround for this issue.
>>> 
>>> On Sat, 1 Feb, 2020, 11:51 PM Walter Underwood, <
>> wun...@wunderwood.org>
>>

Re: Storage/Volume type for Kubernetes Solr POD?

2020-02-07 Thread Nicolas PARIS
Hi all,

What about CephFS or Lustre distributed filesystems for such a purpose?


Karl Stoney  writes:

> We personally run Solr on Google Cloud Kubernetes Engine, and each node has a
> 512 GB persistent SSD (network-attached) storage volume, which gives roughly this
> performance (read/write):
>
> Sustained random IOPS limit: 15,360 read / 15,360 write
> Sustained throughput limit: 245.76 MB/s read / 245.76 MB/s write
>
> and we get very good performance.
>
> Ultimately, though, it's going to depend on your workload.
> 
> From: Susheel Kumar 
> Sent: 06 February 2020 13:43
> To: solr-user@lucene.apache.org 
> Subject: Storage/Volume type for Kubernetes Solr POD?
>
> Hello,
>
> What type of storage/volume is recommended to run Solr in a Kubernetes pod?
> I know that in the past Solr had issues with NFS storing its indexes and it was not
> recommended.
>
> https://kubernetes.io/docs/concepts/storage/volumes/
>
> Thanks,
> Susheel


-- 
nicolas paris


Re: Storage/Volume type for Kubernetes Solr POD?

2020-02-07 Thread Karl Stoney
We personally run Solr on Google Cloud Kubernetes Engine, and each node has a
512 GB persistent SSD (network-attached) storage volume, which gives roughly this
performance (read/write):

Sustained random IOPS limit: 15,360 read / 15,360 write
Sustained throughput limit: 245.76 MB/s read / 245.76 MB/s write

and we get very good performance.

Ultimately, though, it's going to depend on your workload.
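
For illustration only, a StorageClass/PVC along these lines would request that kind
of network-attached SSD on GKE (the names, size and provisioner here are assumptions,
not our actual manifests):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: solr-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ssd
  resources:
    requests:
      storage: 512Gi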

From: Susheel Kumar 
Sent: 06 February 2020 13:43
To: solr-user@lucene.apache.org 
Subject: Storage/Volume type for Kubernetes Solr POD?

Hello,

What type of storage/volume is recommended to run Solr in a Kubernetes pod?
I know that in the past Solr had issues with NFS storing its indexes and it was not
recommended.

https://kubernetes.io/docs/concepts/storage/volumes/

Thanks,
Susheel