Re: CursorMarks and 'end of results'

2018-06-21 Thread Chris Hostetter


: the documentation of 'cursorMarks' recommends to fetch until a query returns
: the cursorMark that was passed in to a request.
: 
: But that always requires an additional request at the end, so I wonder if I
: can stop earlier, if a request returns fewer results than requested (num rows).
: There won't be new documents added during the search in my use case, so could
: there ever be a non-empty 'page' after a non-full 'page'?

You could stop then -- if that fits your use case -- but the documentation 
(in particular the sentence you are referring to) is trying to be as 
straightforward and general as possible ... which includes the use case 
where someone is "tailing" an index and documents may be continually 
added.

When originally writing those docs, I did have a bit in there about 
*either* getting back fewer than "rows" docs *or* getting back the same 
cursor you passed in (to try to cover both use cases as efficiently as 
possible), but it seemed more confusing -- and I was worried people might 
be surprised/confused when the number of docs was perfectly divisible by 
"rows", so the "fewer than rows" check could still end with a final 
request that returned "0" docs.

the current docs seemed like a good balance between brevity & clarity, 
with the added bonus of being correct :)
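
For anyone who wants to see the two stopping conditions side by side, here is a 
minimal SolrJ sketch (assuming an existing SolrClient called "client", a 
collection called "collection", and "id" as the uniqueKey):

  // classes: org.apache.solr.client.solrj.SolrQuery,
  //          org.apache.solr.client.solrj.response.QueryResponse,
  //          org.apache.solr.common.params.CursorMarkParams
  SolrQuery q = new SolrQuery("*:*");
  q.setRows(100);
  q.setSort(SolrQuery.SortClause.asc("id"));   // cursors require a total sort on the uniqueKey
  String cursorMark = CursorMarkParams.CURSOR_MARK_START;
  while (true) {
    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse rsp = client.query("collection", q);
    // ... process rsp.getResults() ...
    String nextCursorMark = rsp.getNextCursorMark();
    if (cursorMark.equals(nextCursorMark)) {
      break;  // the documented condition: the cursor did not advance
    }
    // If you are certain no docs will ever be added, you could also break when
    // rsp.getResults().size() < 100 -- with the "divisible by rows" caveat above.
    cursorMark = nextCursorMark;
  }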

But as Anshum said: if you have suggested improvements for rewording, 
patches/PRs are certainly welcome.  It's hard to have a good perspective on 
what docs are helpful to new users when you have been working with the 
software for 14 years and wrote the code in question.



-Hoss
http://www.lucidworks.com/


Re: Applying streaming expression as a filter in graph traversal expression (gatherNodes)

2018-06-21 Thread Joel Bernstein
Currently the gatherNodes expression can only be filtered by a traditional
filter query. I'm curious about the type of expression you are thinking of
filtering by?

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Jun 20, 2018 at 1:54 PM, Pratik Patel  wrote:

> We can limit the scope of graph traversal by applying some filter along the
> way as follows.
>
> gatherNodes(emails,
> walk="john...@apache.org->from",
> fq="body:(solr rocks)",
> gather="to")
>
>
> Is it possible to replace "body:(solr rocks)" with some streaming expression,
> like the "search" function for example? Something like the following:
>
> gatherNodes(emails,
> walk="john...@apache.org->from",
> fq="search(...)", // use streaming expression as filter
> gather="to")
>
>
>
> In my case, it would improve performance significantly if that were possible.
> Another approach I can think of is to save the results of the "search" streaming
> expression in some variable in the pipeline and then use it in multiple places,
> including the "fq" clause of "gatherNodes". Is it possible to do something like
> this?
>


Re: Search streaming expressions returns rows times number of shards docs

2018-06-21 Thread Alfonso Muñoz-Pomer Fuentes
Yes, I was specifically addressing the /select handler, sorry about not 
mentioning it explicitly. My use case was, originally, with CloudSolrStream in 
SolrJ, where I could observe the same behaviour, and I created the streaming 
expression in the UI to test whether it was SolrJ-specific.

> On 21 Jun 2018, at 21:01, Aroop Ganguly  wrote:
> 
> So I think 2 things are being missed here. You should be specifying the 
> qt=“/export” to see all the results.
> If you do not do that, then the select handler is used by default which gives 
> the default 10-20 rows as result.
> 
>> On Jun 21, 2018, at 12:53 PM, Joel Bernstein  wrote:
>> 
>> That is actually the current behavior of the search expression. The initial
>> use cases from Streaming Expressions revolved around joins and rollups
>> which really require the entire result set. So the search expression just
>> merged the results from the shards and let the wrapping expression deal
>> with the results. Things have evolved quite a bit since then and having the
>> search expression respect the rows parameter is something that I've been
>> meaning to add. Feel free to create a ticket for this.
>> 
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>> 
>> On Thu, Jun 21, 2018 at 1:35 PM, Alfonso Muñoz-Pomer Fuentes <
>> amu...@ebi.ac.uk> wrote:
>> 
>>> I’m having a weird issue with the search streaming expressions and I’d
>>> like to share it before opening a ticket in Jira, just in case I’m missing
>>> something obvious.
>>> 
>>> I’m currently on Solr 7.1 and I have a collection named bioentities split
>>> into two shards and no replicas. Whenever I run a query such as this:
>>> search(
>>> bioentities,
>>> q="*:*",
>>> fl="bioentity_identifier,property_value,property_name",
>>> sort="bioentity_identifier asc")
>>> 
>>> I’m getting 20 documents. If I add e.g. rows=4 I get 8 results, and so on.
>>> 
>>> I have the same collection in another SolrCloud cluster, split into three
>>> shards and running the same queries I get 30 and 12 results, respectively.
>>> So it seems that the search expression distributes the query between shards
>>> and then aggregates the results. Is this the expected behaviour?
>>> 
>>> Thanks in advance.
>>> 
>>> --
>>> Alfonso Muñoz-Pomer Fuentes
>>> Senior Lead Software Engineer @ Expression Atlas Team
>>> European Bioinformatics Institute (EMBL-EBI)
>>> European Molecular Biology Laboratory
>>> Tel:+ 44 (0) 1223 49 2633
>>> Skype: amunozpomer
>>> 
>>> 
> 

--
Alfonso Muñoz-Pomer Fuentes
Senior Lead Software Engineer @ Expression Atlas Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Tel:+ 44 (0) 1223 49 2633
Skype: amunozpomer



Re: Search streaming expressions returns rows times number of shards docs

2018-06-21 Thread Alfonso Muñoz-Pomer Fuentes
Thanks a lot for the clarification. I created a Jira ticket not to lose track 
of this:
https://issues.apache.org/jira/browse/SOLR-12510


> On 21 Jun 2018, at 20:53, Joel Bernstein  wrote:
> 
> That is actually the current behavior of the search expression. The initial
> use cases from Streaming Expressions revolved around joins and rollups
> which really require the entire result set. So the search expression just
> merged the results from the shards and let the wrapping expression deal
> with the results. Things have evolved quite a bit since then and having the
> search expression respect the rows parameter is something that I've been
> meaning to add. Feel free to create a ticket for this.
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/
> 
> On Thu, Jun 21, 2018 at 1:35 PM, Alfonso Muñoz-Pomer Fuentes <
> amu...@ebi.ac.uk> wrote:
> 
>> I’m having a weird issue with the search streaming expressions and I’d
>> like to share it before opening a ticket in Jira, just in case I’m missing
>> something obvious.
>> 
>> I’m currently on Solr 7.1 and I have a collection named bioentities split
>> into two shards and no replicas. Whenever I run a query such as this:
>> search(
>>  bioentities,
>>  q="*:*",
>>  fl="bioentity_identifier,property_value,property_name",
>>  sort="bioentity_identifier asc")
>> 
>> I’m getting 20 documents. If I add e.g. rows=4 I get 8 results, and so on.
>> 
>> I have the same collection in another SolrCloud cluster, split into three
>> shards and running the same queries I get 30 and 12 results, respectively.
>> So it seems that the search expression distributes the query between shards
>> and then aggregates the results. Is this the expected behaviour?
>> 
>> Thanks in advance.
>> 
>> --
>> Alfonso Muñoz-Pomer Fuentes
>> Senior Lead Software Engineer @ Expression Atlas Team
>> European Bioinformatics Institute (EMBL-EBI)
>> European Molecular Biology Laboratory
>> Tel:+ 44 (0) 1223 49 2633
>> Skype: amunozpomer
>> 
>> 

--
Alfonso Muñoz-Pomer Fuentes
Senior Lead Software Engineer @ Expression Atlas Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Tel:+ 44 (0) 1223 49 2633
Skype: amunozpomer



Re: Search streaming expressions returns rows times number of shards docs

2018-06-21 Thread Aroop Ganguly
So I think two things are being missed here. You should be specifying 
qt="/export" to see all the results.
If you do not do that, then the /select handler is used by default, which returns 
only its default 10 rows per shard.
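
As a rough illustration (untested, and assuming the sorted/returned fields have 
docValues, which the /export handler requires), the expression from the original 
post would become:

search(bioentities,
       q="*:*",
       fl="bioentity_identifier,property_value,property_name",
       sort="bioentity_identifier asc",
       qt="/export")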

> On Jun 21, 2018, at 12:53 PM, Joel Bernstein  wrote:
> 
> That is actually the current behavior of the search expression. The initial
> use cases from Streaming Expressions revolved around joins and rollups
> which really require the entire result set. So the search expression just
> merged the results from the shards and let the wrapping expression deal
> with the results. Things have evolved quite a bit since then and having the
> search expression respect the rows parameter is something that I've been
> meaning to add. Feel free to create a ticket for this.
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/
> 
> On Thu, Jun 21, 2018 at 1:35 PM, Alfonso Muñoz-Pomer Fuentes <
> amu...@ebi.ac.uk> wrote:
> 
>> I’m having a weird issue with the search streaming expressions and I’d
>> like to share it before opening a ticket in Jira, just in case I’m missing
>> something obvious.
>> 
>> I’m currently on Solr 7.1 and I have a collection named bioentities split
>> into two shards and no replicas. Whenever I run a query such as this:
>> search(
>>  bioentities,
>>  q="*:*",
>>  fl="bioentity_identifier,property_value,property_name",
>>  sort="bioentity_identifier asc")
>> 
>> I’m getting 20 documents. If I add e.g. rows=4 I get 8 results, and so on.
>> 
>> I have the same collection in another SolrCloud cluster, split into three
>> shards and running the same queries I get 30 and 12 results, respectively.
>> So it seems that the search expression distributes the query between shards
>> and then aggregates the results. Is this the expected behaviour?
>> 
>> Thanks in advance.
>> 
>> --
>> Alfonso Muñoz-Pomer Fuentes
>> Senior Lead Software Engineer @ Expression Atlas Team
>> European Bioinformatics Institute (EMBL-EBI)
>> European Molecular Biology Laboratory
>> Tel:+ 44 (0) 1223 49 2633
>> Skype: amunozpomer
>> 
>> 



Re: Search streaming expressions returns rows times number of shards docs

2018-06-21 Thread Joel Bernstein
That is actually the current behavior of the search expression. The initial
use cases from Streaming Expressions revolved around joins and rollups
which really require the entire result set. So the search expression just
merged the results from the shards and let the wrapping expression deal
with the results. Things have evolved quite a bit since then and having the
search expression respect the rows parameter is something that I've been
meaning to add. Feel free to create a ticket for this.
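
In the meantime, a possible workaround (a sketch, not something I've benchmarked) 
is to wrap the search in the top() decorator, which re-sorts the merged stream 
and keeps only the first N tuples:

top(n=4,
    search(bioentities,
           q="*:*",
           fl="bioentity_identifier,property_value,property_name",
           sort="bioentity_identifier asc"),
    sort="bioentity_identifier asc")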

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Jun 21, 2018 at 1:35 PM, Alfonso Muñoz-Pomer Fuentes <
amu...@ebi.ac.uk> wrote:

> I’m having a weird issue with the search streaming expressions and I’d
> like to share it before opening a ticket in Jira, just in case I’m missing
> something obvious.
>
> I’m currently on Solr 7.1 and I have a collection named bioentities split
> into two shards and no replicas. Whenever I run a query such as this:
> search(
>   bioentities,
>   q="*:*",
>   fl="bioentity_identifier,property_value,property_name",
>   sort="bioentity_identifier asc")
>
> I’m getting 20 documents. If I add e.g. rows=4 I get 8 results, and so on.
>
> I have the same collection in another SolrCloud cluster, split into three
> shards and running the same queries I get 30 and 12 results, respectively.
> So it seems that the search expression distributes the query between shards
> and then aggregates the results. Is this the expected behaviour?
>
> Thanks in advance.
>
> --
> Alfonso Muñoz-Pomer Fuentes
> Senior Lead Software Engineer @ Expression Atlas Team
> European Bioinformatics Institute (EMBL-EBI)
> European Molecular Biology Laboratory
> Tel:+ 44 (0) 1223 49 2633
> Skype: amunozpomer
>
>


Re: CURL DELETE BLOB do not working in solr 7.3 cloud

2018-06-21 Thread Jason Gerlowski
Hi Maxence,

Yes, unfortunately that's the wrong API to delete an item from the
Blob Store.  Items in the blob store are deleted like any other Solr
document (i.e. either delete-by-id, or delete-by-query).  This is
mentioned quite obliquely in the Solr Ref Guide here:
https://lucene.apache.org/solr/guide/7_3/blob-store-api.html. (CTRL-F
for "delete").  We should really clarify that text a bit...

Anyway, to give you a concrete idea, you could delete that document
with a command like:

curl -X POST -H 'Content-Type: application/json' \
  'http://srv-formation-solr3:8983/solr/.system/' \
  --data-binary '{ "delete": "CityaUpdateProcessorJar/14" }'

Hope that helps,

Jason



On Wed, May 30, 2018 at 11:14 AM, msaunier  wrote:
> Hello,
>
>
>
> I want to delete a file in the blob store, but this command does not work:
>
> curl -X "DELETE"
> http://srv-formation-solr3:8983/solr/.system/blob/CityaUpdateProcessorJar/14
>
>
>
> This command just returns the file information:
>
> {
>   "responseHeader":{
>     "zkConnected":true,
>     "status":0,
>     "QTime":1},
>   "response":{"numFound":1,"start":0,"docs":[
>       {
>         "id":"CityaUpdateProcessorJar/14",
>         "md5":"45aeda5a01607fb668cec26a45cac9e6",
>         "blobName":"CityaUpdateProcessorJar",
>         "version":14,
>         "timestamp":"2018-05-30T12:59:40.419Z",
>         "size":22483}]
>   }}
>
>
>
> Is my command wrong?
>
> Thanks,
>
> Maxence,
>


Re: Delete By Query issue followed by Delete By Id Issues

2018-06-21 Thread sujatha sankaran
Thanks,Shawn.

Our use case is something like this: in a batch load of several thousand
documents, we do a delete first, followed by an update. For example, delete all
1000 docs and then send an update request for the same 1000.

What we see is that many docs go missing because DBQ re-orders the deletes and
the updates that follow. We also saw an issue with nodes going down,
similar to the issue described here:
http://lucene.472066.n3.nabble.com/SolrCloud-Nodes-going-to-recovery-state-during-indexing-td4369396.html

At the end of this batch process we see many (several thousand) missing
docs.

Because of this, and after reading the above thread, we decided to move to DBI
(delete-by-id) and are now facing issues due to the custom/implicit routing we
have in place. So I don't think DBQ was working for us, but we did have
several such processes (DBQ followed by updates) for different activities in
the collection happening at the same time.


Sujatha

On Thu, Jun 21, 2018 at 1:21 PM, Shawn Heisey  wrote:

> On 6/21/2018 9:59 AM, sujatha sankaran wrote:
> > Currently from our business perspective we find that we are left with no
> > options for deleting docs in a batch load as :
> >
> > DBQ+ batch does not work well together
> > DBI+ custom routing (batch load / normal)would not work as well.
>
> I would expect DBQ to work, just with the caveat that if you are trying
> to do other indexing operations at the same time, you may run into
> significant delays, and if there are timeouts configured anywhere that
> are shorter than those delays, requests may return failure responses or
> log failures.
>
> If you are using DBQ, you just need to be sure that there are no other
> operations happening at the same time, or that your error handling is
> bulletproof.  Making sure that no other operations are happening at the
> same time as the DBQ is in my opinion a better option.
>
> Thanks,
> Shawn
>
>


Re: Solr basic auth

2018-06-21 Thread Jan Høydahl
Hi,

As I said, there is no way to combine multiple authentication plugins at the 
moment.
So your best shot is probably to create your own custom auth plugin in which you
implement the logic that you need. You can fork the code from BasicAuthPlugin and
add whatever logic you need to whitelist the requests you care about.

It is very hard to understand from your initial email how Solr should see the
difference between a request to "the solr URL directly" vs requests done
indirectly, whatever that would mean. From Solr's standpoint ALL requests
are done to Solr directly :-) I suppose you mean that if a request originates
from a particular frontend server's IP address then it should be whitelisted?

You could also suggest in a new JIRA issue to extend Solr's auth feature
to allow a chain of AuthPlugins, and if the request passes any of them it
is let through.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 21. jun. 2018 kl. 17:47 skrev Dinesh Sundaram :
> 
> thanks for your valuable feedback. I really want to allow this domain
> without any credentials. i need basic auth only if anyone access the solr
> url directly. so no option in solr to do that?
> 
> On Sun, Jun 17, 2018 at 4:18 PM, Jan Høydahl  wrote:
> 
>> Of course, but Dinesh explicitly set blockUnknown=true below, so in this
>> case ALL requests must have credentials. There is currently no feature that
>> lets Solr accept any request by other rules, all requests are forwarded to
>> the chosen authentication plugin.
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> 
>>> 15. jun. 2018 kl. 19:12 skrev Terry Steichen :
>>> 
>>> "When authentication is enabled ALL requests must carry valid
>>> credentials."  I believe this behavior depends on the value you set for
>>> the *blockUnknown* authentication parameter.
>>> 
>>> 
>>> On 06/15/2018 06:25 AM, Jan Høydahl wrote:
 When authentication is enabled ALL requests must carry valid
>> credentials.
 
 Are you asking for a feature where a request is authenticated based on
>> IP address of the client, not username/password?
 
 Jan
 
 Sendt fra min iPhone
 
> 14. jun. 2018 kl. 22:24 skrev Dinesh Sundaram >> :
> 
> Hi,
> 
> I have configured basic auth for solrcloud. it works well when i
>> access the
> solr url directly. i have integrated this solr with test.com domain.
>> now if
> I access the solr url like test.com/solr it prompts the credentials
>> but I
> dont want to ask this time since it is known domain. is there any way
>> to
> achieve this. much appreciate your quick response.
> 
> my security json below. i'm using the default security, want to allow
>> my
> domain default without prompting any credentials.
> 
> {"authentication":{
> "blockUnknown": true,
> "class":"solr.BasicAuthPlugin",
> "credentials":{"solr":"IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0=
> Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="}
> },"authorization":{
> "class":"solr.RuleBasedAuthorizationPlugin",
> "permissions":[{"name":"security-edit",
>"role":"admin"}],
> "user-role":{"solr":"admin"}
> }}
>>> 
>> 
>> 



Search streaming expressions returns rows times number of shards docs

2018-06-21 Thread Alfonso Muñoz-Pomer Fuentes
I’m having a weird issue with the search streaming expressions and I’d like to 
share it before opening a ticket in Jira, just in case I’m missing something 
obvious.

I’m currently on Solr 7.1 and I have a collection named bioentities split into 
two shards and no replicas. Whenever I run a query such as this:
search(
  bioentities,
  q="*:*",
  fl="bioentity_identifier,property_value,property_name",
  sort="bioentity_identifier asc")

I’m getting 20 documents. If I add e.g. rows=4 I get 8 results, and so on.

I have the same collection in another SolrCloud cluster, split into three 
shards, and running the same queries I get 30 and 12 results, respectively. So 
it seems that the search expression distributes the query between shards and 
then aggregates the results. Is this the expected behaviour?

Thanks in advance.

--
Alfonso Muñoz-Pomer Fuentes
Senior Lead Software Engineer @ Expression Atlas Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Tel:+ 44 (0) 1223 49 2633
Skype: amunozpomer



Re: solr basic authentication

2018-06-21 Thread Christopher Schultz
Dinesh,

On 6/21/18 11:40 AM, Dinesh Sundaram wrote:
> is there any way to disable basic authentication for particular domain. i
> have proxy pass from a domain to solr which is always asking credentials so
> wanted to disable basic auth only for that domain. is there any way?

I wouldn't recommend this, in general, because it's not really all that
secure, but since you have a reverse-proxy in between the client and
Solr, why not have the proxy provide the HTTP BASIC authentication
information to Solr?

That may be a more straightforward solution.
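
Whatever proxy you use, the idea is just that it injects the header on the way 
through; the request Solr ends up seeing would be equivalent to this (the 
credentials here are the well-known solr:SolrRocks example, base64-encoded -- 
substitute your own):

  curl "http://localhost:8983/solr/admin/info/system" \
       -H "Authorization: Basic c29scjpTb2xyUm9ja3M="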

-chris





Re: Delete By Query issue followed by Delete By Id Issues

2018-06-21 Thread Shawn Heisey
On 6/21/2018 9:59 AM, sujatha sankaran wrote:
> Currently from our business perspective we find that we are left with no
> options for deleting docs in a batch load as :
>
> DBQ+ batch does not work well together
> DBI+ custom routing (batch load / normal)would not work as well.

I would expect DBQ to work, just with the caveat that if you are trying
to do other indexing operations at the same time, you may run into
significant delays, and if there are timeouts configured anywhere that
are shorter than those delays, requests may return failure responses or
log failures.

If you are using DBQ, you just need to be sure that there are no other
operations happening at the same time, or that your error handling is
bulletproof.  Making sure that no other operations are happening at the
same time as the DBQ is in my opinion a better option.
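
To make that concrete, a minimal SolrJ sketch of the "serialize the operations"
approach might look like this (collection and field names are made up, and it
assumes no other clients are indexing concurrently):

  // delete the old batch and make the delete durable before anything else runs
  client.deleteByQuery("mycollection", "batch_id:42");   // batch_id is a made-up field
  client.commit("mycollection");

  // only after the commit returns, send the replacement documents
  client.add("mycollection", replacementDocs);           // List<SolrInputDocument>
  client.commit("mycollection");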

Thanks,
Shawn



no stable results using morelikethis in distributed mode

2018-06-21 Thread guenterh.li...@bluewin.ch
Hi,
I'm seeing a weird behaviour I can't explain (until now we have been running in 
master/slave mode).
Requesting the collection, I see the requests randomly hitting the two available shards 
"green_shard1_replica_n1" and "green_shard2_replica_n2":
2018-06-21 15:35:40.970 INFO  (qtp1873653341-17) [c:green s:shard2 r:core_node4 
x:green_shard2_replica_n2] o.a.s.c.S.Request [green_shard2_replica_n2]  
webapp=/solr path=/select 
params={q=id:"508364329"=morelikethis=arrarr=*,score=40=5=json}
 status=0 QTime=1
2018-06-21 15:36:05.679 INFO  (qtp1873653341-70) [c:green s:shard1 r:core_node3 
x:green_shard1_replica_n1] o.a.s.c.S.Request [green_shard1_replica_n1]  
webapp=/solr path=/select 
params={q=id:"508364329"=morelikethis=arrarr=*,score=40=5=json}
 status=0 QTime=5
2018-06-21 15:36:11.022 INFO  (qtp1873653341-17) [c:green s:shard1 r:core_node3 
x:green_shard1_replica_n1] o.a.s.c.S.Request [green_shard1_replica_n1]  
webapp=/solr path=/select 
params={q=id:"508364329"=morelikethis=arrarr=*,score=40=5=json}
 status=0 QTime=2
2018-06-21 15:36:17.031 INFO  (qtp1873653341-70) [c:green s:shard1 r:core_node3 
x:green_shard1_replica_n1] o.a.s.c.S.Request [green_shard1_replica_n1]  
webapp=/solr path=/select 
params={q=id:"508364329"=morelikethis=arrarr=*,score=40=5=json}
 status=0 QTime=2
2018-06-21 15:36:21.800 INFO  (qtp1873653341-17) [c:green s:shard1 r:core_node3 
x:green_shard1_replica_n1] o.a.s.c.S.Request [green_shard1_replica_n1]  
webapp=/solr path=/select 
params={q=id:"508364329"=morelikethis=arrarr=*,score=40=5=json}
 status=0 QTime=2
2018-06-21 15:36:26.668 INFO  (qtp1873653341-70) [c:green s:shard2 r:core_node4 
x:green_shard2_replica_n2] o.a.s.c.S.Request [green_shard2_replica_n2]  
webapp=/solr path=/select 
params={q=id:"508364329"=morelikethis=arrarr=*,score=40=5=json}
 status=0 QTime=0
When the request is handled by "green_shard1_replica_n1" I get results, 
but I always get no morelikethis suggestions when the other shard is 
selected.
The test collection has 2 shards and a replication factor of 1 (it's 
only a test index with a small number of docs), but the behavior is the same for 
a huge collection (30 million docs).
What might be the reason for this behavior? Thanks for any hints.
Günter  


Re: Indexing part of Binary Documents and not the entire contents

2018-06-21 Thread Erick Erickson
This may help you get started:

https://lucidworks.com/2012/02/14/indexing-with-solrj/
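
To give a rough idea of the shape of such a program, here is a minimal, untested 
sketch along the same lines (Tika's AutoDetectParser plus SolrJ; the collection, 
field names and keyword filter are made up for illustration):

  import java.io.InputStream;
  import java.nio.file.*;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.sax.BodyContentHandler;

  public class KeywordIndexer {
    public static void main(String[] args) throws Exception {
      HttpSolrClient solr =
          new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();
      AutoDetectParser parser = new AutoDetectParser();

      try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get(args[0]))) {
        for (Path file : files) {
          BodyContentHandler handler = new BodyContentHandler(-1);  // -1 = no write limit
          Metadata metadata = new Metadata();
          try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, metadata);                    // Tika runs here, outside Solr
          }

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", file.toString());
          doc.addField("author_s", metadata.get("Author"));         // pick whatever metadata you need
          for (String para : handler.toString().split("\n\n")) {
            if (para.toLowerCase().contains("solr")) {              // your configurable keywords go here
              doc.addField("matched_paragraphs_txt", para);         // index only the matching parts
            }
          }
          solr.add(doc);
        }
      }
      solr.commit();
      solr.close();
    }
  }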

Best,
Erick

On Thu, Jun 21, 2018 at 8:11 AM, Shawn Heisey  wrote:
> On 6/20/2018 9:05 AM, neotorand wrote:
>>
>> I have a specific Requirement where i need to index below things
>>
>> Meta Data of any document
>> Some parts from the Document that matches some keywords that i configure
>>
>> The first part i am able to achieve through ERH or
>> FilelistEntityProcessor.
>>
>> I am struggling on second part.I am looking for an effective and smart
>> approach to handle this.
>> Can any one give me a pointer or help with this.
>
>
> Write a custom indexing program to compile precisely the information that
> you need and send that to Solr.
>
> Yes, that is a serious suggestion.  Solr itself is very capable, but it
> can't do everything that every user's specific business requirements
> dictate.  A large percentage of Solr users have written custom indexing
> programs.
>
> It is strongly recommended that the ExtractingRequestHandler never be used
> in production, because the Tika software it utilizes is prone to serious
> problems that might extend as far as an actual program crash.  If Tika
> crashes and it's running inside Solr, then Solr crashes too.  Running Tika
> in a custom indexing program instead is recommended, so that if it crashes,
> it's only the indexing program that dies, not Solr.
>
> Thanks,
> Shawn
>


Re: Delete By Query issue followed by Delete By Id Issues

2018-06-21 Thread sujatha sankaran
Thanks,Shawn.

Currently from our business perspective we find that we are left with no
options for deleting docs in a batch load as :

DBQ+ batch does not work well together
DBI+ custom routing (batch load / normal)would not work as well.

We are not sure how we can proceed unless we don't have to delete at all.

Thanks,
Sujatha



On Wed, Jun 20, 2018 at 8:31 PM, Shawn Heisey  wrote:

> On 6/20/2018 3:46 PM, sujatha sankaran wrote:
> > Thanks,Shawn.   Very useful information.
> >
> > Please find below the log details:-
>
> Is your collection using the implicit router?  You didn't say.  If it
> is, then I think you may not be able to use deleteById.  This is indeed
> a bug, one that has been reported at least once already, but hasn't been
> fixed yet.   I do not know why it hasn't been fixed yet.  Maybe the fix
> is very difficult, or maybe the reason for the problem is not yet fully
> understood.
>
> The log you shared shows an error trying to do an update -- the delete
> that failed.  This kind of error is indeed likely to cause SolrCloud to
> attempt index recovery, all in accordance with SolrCloud design goals.
>
> Thanks,
> Shawn
>
>


Re: Solr basic auth

2018-06-21 Thread Dinesh Sundaram
Thanks for your valuable feedback. I really want to allow this domain
without any credentials; I need basic auth only if someone accesses the Solr
URL directly. So there is no option in Solr to do that?

On Sun, Jun 17, 2018 at 4:18 PM, Jan Høydahl  wrote:

> Of course, but Dinesh explicitly set blockUnknown=true below, so in this
> case ALL requests must have credentials. There is currently no feature that
> lets Solr accept any request by other rules, all requests are forwarded to
> the chosen authentication plugin.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 15. jun. 2018 kl. 19:12 skrev Terry Steichen :
> >
> > "When authentication is enabled ALL requests must carry valid
> > credentials."  I believe this behavior depends on the value you set for
> > the *blockUnknown* authentication parameter.
> >
> >
> > On 06/15/2018 06:25 AM, Jan Høydahl wrote:
> >> When authentication is enabled ALL requests must carry valid
> credentials.
> >>
> >> Are you asking for a feature where a request is authenticated based on
> IP address of the client, not username/password?
> >>
> >> Jan
> >>
> >> Sendt fra min iPhone
> >>
> >>> 14. jun. 2018 kl. 22:24 skrev Dinesh Sundaram  >:
> >>>
> >>> Hi,
> >>>
> >>> I have configured basic auth for solrcloud. it works well when i
> access the
> >>> solr url directly. i have integrated this solr with test.com domain.
> now if
> >>> I access the solr url like test.com/solr it prompts the credentials
> but I
> >>> dont want to ask this time since it is known domain. is there any way
> to
> >>> achieve this. much appreciate your quick response.
> >>>
> >>> my security json below. i'm using the default security, want to allow
> my
> >>> domain default without prompting any credentials.
> >>>
> >>> {"authentication":{
> >>>  "blockUnknown": true,
> >>>  "class":"solr.BasicAuthPlugin",
> >>>  "credentials":{"solr":"IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0=
> >>> Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="}
> >>> },"authorization":{
> >>>  "class":"solr.RuleBasedAuthorizationPlugin",
> >>>  "permissions":[{"name":"security-edit",
> >>> "role":"admin"}],
> >>>  "user-role":{"solr":"admin"}
> >>> }}
> >
>
>


solr basic authentication

2018-06-21 Thread Dinesh Sundaram
Hi,

Is there any way to disable basic authentication for a particular domain? I
have a proxy pass from a domain to Solr which always asks for credentials, so I
wanted to disable basic auth only for that domain. Is there any way?


Thanks,
Dinesh Sundaram.


Re: Trouble using the MIGRATE command in the collections API on solr 7.3.1

2018-06-21 Thread Matthew Faw
Hi Shawn,

Thanks for your reply.  According to the MIGRATE documentation, the split.key 
parameter is required, and removing it returns a missing parameter exception.  
I’ve tried setting the “split.key=DERP_”, and after doing that I still see no 
documents in the destination collection.  Additionally, the CLUSTERSTATUS 
command indicates that the routeRanges using this split key are "routeRanges": 
"16f98178-16f98178", but when I use the split.key=DERP/0!, I get the route 
ranges I expect (8000- on one shard, and 0-7fff on the other).

So, to me, it seems like this particular API endpoint does not work.  I’d love 
for someone to prove me wrong.

Thanks,
Matthew

On 6/21/18, 11:02 AM, "Shawn Heisey"  wrote:

On 6/21/2018 7:08 AM, Matthew Faw wrote:
> For background, I’m using solr version 7.3.1 and lucene version 7.3.1
>
> I have a solr collection with 2 shards and 3 replicas using the 
compositeId router.  Each solr document has “id” as its unique key, where each 
id is of format DERP_${X}, where ${X} is some 24 character alphanumerical 
string.  I create this collection in the following way:
>
> curl 
"http://localhost:8983/solr/admin/collections?action=CREATE=derp=derp=2=3=0=true;
>
> Suppose I have some other collection named herp, created in the same 
fashion, and a collection named blurp, with 1 shard, but otherwise created in 
the same fashion.  Also suppose that there are 2000 documents in the derp 
collection, but none in the herp or blurp collections.
>
> I’ve been attempting to do two things with the MIGRATE Collections API:
>
>1.  Migrate all documents from the derp collection to the herp 
collection using the following command:
> curl 
"http://localhost:8983/solr/admin/collections?action=MIGRATE=derp=herp=DERP/0\!=30;
 | jq
>2.  Migrate all documents from the derp collection to the blurp 
collection using the same MIGRATE command, swapping herp for blurp.
>
> (I chose split.key=DERP/0! With the intent of capturing all documents in 
my source collection, since the /0 should tell the migrate command to only look 
at the hash of the id field, since I’m not using a shard key).

The Collections API documentation doesn't mention any ability to use /N
with split.key.  Which may mean that it is looking for the literal text
"DERP/0!" or "DERP/0\!" in your source documents, and since it's not
there, not choosing any documents to migrate.  The reason I have
mentioned two possible strings there is that the ! character doesn't
need escaping in a URL.  The URL encoded version of that string is this:

DERP%2f0!

Because you want to choose all documents, I don't think you need the
split.key parameter for this, or that you may need to use
split.key=DERP_ instead.  Because you're not using routing prefixes in
your indexing, I am leading more towards just removing the parameter
entirely.

I have never actually used the MIGRATE action.  So I'm basing all this
on the documentation.

Thanks,
Shawn





Re: Indexing part of Binary Documents and not the entire contents

2018-06-21 Thread Shawn Heisey

On 6/20/2018 9:05 AM, neotorand wrote:

I have a specific Requirement where i need to index below things

Meta Data of any document
Some parts from the Document that matches some keywords that i configure

The first part i am able to achieve through ERH or FilelistEntityProcessor.

I am struggling on second part.I am looking for an effective and smart
approach to handle this.
Can any one give me a pointer or help with this.


Write a custom indexing program to compile precisely the information 
that you need and send that to Solr.


Yes, that is a serious suggestion.  Solr itself is very capable, but it 
can't do everything that every user's specific business requirements 
dictate.  A large percentage of Solr users have written custom indexing 
programs.


It is strongly recommended that the ExtractingRequestHandler never be 
used in production, because the Tika software it utilizes is prone to 
serious problems that might extend as far as an actual program crash.  
If Tika crashes and it's running inside Solr, then Solr crashes too.  
Running Tika in a custom indexing program instead is recommended, so 
that if it crashes, it's only the indexing program that dies, not Solr.


Thanks,
Shawn



Re: CloudSolrClient - setDefaultCollection

2018-06-21 Thread Shawn Heisey

On 6/21/2018 5:04 AM, Greenhorn Techie wrote:

While indexing, is there going to be any performance benefit to setting the
collection name first using the setDefaultCollection(String collection)
method and then indexing the document
using cloudClient.add(solrInputDoc), instead of using the method
cloudClient.add(collectionName, solrInputDoc)?

Is this a performance benefit worth considering, or is it merely a convenience /
better looking code?


There should be no detectable performance difference between the two 
unless you've got an accurate way to measure microseconds or 
nanoseconds.  Even then, the difference should be extremely small.


If the code was written smartly, and usually Solr code is written very 
well, both will end up being executed by the same final code; only the 
source of the collection string will be different.
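
A small sketch of what the two variants look like side by side (collection name 
and ZooKeeper address are made up; both calls end up on the same request path):

  import java.util.Collections;
  import java.util.Optional;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  CloudSolrClient client = new CloudSolrClient.Builder(
      Collections.singletonList("zk1:2181"), Optional.empty()).build();
  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("id", "1");

  client.setDefaultCollection("mycollection");
  client.add(doc);                   // uses the default collection
  client.add("mycollection", doc);   // names the collection explicitly
  client.commit("mycollection");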


Thanks,
Shawn



Re: Trouble using the MIGRATE command in the collections API on solr 7.3.1

2018-06-21 Thread Shawn Heisey

On 6/21/2018 7:08 AM, Matthew Faw wrote:

For background, I’m using solr version 7.3.1 and lucene version 7.3.1

I have a solr collection with 2 shards and 3 replicas using the compositeId 
router.  Each solr document has “id” as its unique key, where each id is of 
format DERP_${X}, where ${X} is some 24 character alphanumerical string.  I 
create this collection in the following way:

curl 
"http://localhost:8983/solr/admin/collections?action=CREATE=derp=derp=2=3=0=true;

Suppose I have some other collection named herp, created in the same fashion, 
and a collection named blurp, with 1 shard, but otherwise created in the same 
fashion.  Also suppose that there are 2000 documents in the derp collection, 
but none in the herp or blurp collections.

I’ve been attempting to do two things with the MIGRATE Collections API:

   1.  Migrate all documents from the derp collection to the herp collection 
using the following command:
curl 
"http://localhost:8983/solr/admin/collections?action=MIGRATE=derp=herp=DERP/0\!=30;
 | jq
   2.  Migrate all documents from the derp collection to the blurp collection 
using the same MIGRATE command, swapping herp for blurp.

(I chose split.key=DERP/0! With the intent of capturing all documents in my 
source collection, since the /0 should tell the migrate command to only look at 
the hash of the id field, since I’m not using a shard key).


The Collections API documentation doesn't mention any ability to use /N 
with split.key.  Which may mean that it is looking for the literal text 
"DERP/0!" or "DERP/0\!" in your source documents, and since it's not 
there, not choosing any documents to migrate.  The reason I have 
mentioned two possible strings there is that the ! character doesn't 
need escaping in a URL.  The URL encoded version of that string is this:


DERP%2f0!

Because you want to choose all documents, I don't think you need the 
split.key parameter for this, or that you may need to use 
split.key=DERP_ instead.  Because you're not using routing prefixes in 
your indexing, I am leading more towards just removing the parameter 
entirely.
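
For what it's worth, a sketch of what I would try, with the parameter names as 
they appear in the Collections API reference (the host, timeout and async id are 
placeholders):

curl "http://localhost:8983/solr/admin/collections?action=MIGRATE&collection=derp&target.collection=herp&split.key=DERP_&forward.timeout=60&async=30"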


I have never actually used the MIGRATE action.  So I'm basing all this 
on the documentation.


Thanks,
Shawn



Trouble using the MIGRATE command in the collections API on solr 7.3.1

2018-06-21 Thread Matthew Faw
Hello,

For background, I’m using solr version 7.3.1 and lucene version 7.3.1

I have a solr collection with 2 shards and 3 replicas using the compositeId 
router.  Each solr document has “id” as its unique key, where each id is of 
format DERP_${X}, where ${X} is some 24 character alphanumerical string.  I 
create this collection in the following way:

curl 
"http://localhost:8983/solr/admin/collections?action=CREATE=derp=derp=2=3=0=true;

Suppose I have some other collection named herp, created in the same fashion, 
and a collection named blurp, with 1 shard, but otherwise created in the same 
fashion.  Also suppose that there are 2000 documents in the derp collection, 
but none in the herp or blurp collections.

I’ve been attempting to do two things with the MIGRATE Collections API:


  1.  Migrate all documents from the derp collection to the herp collection 
using the following command:
curl 
"http://localhost:8983/solr/admin/collections?action=MIGRATE=derp=herp=DERP/0\!=30;
 | jq
  2.  Migrate all documents from the derp collection to the blurp collection 
using the same MIGRATE command, swapping herp for blurp.

(I chose split.key=DERP/0! with the intent of capturing all documents in my 
source collection, since the /0 should tell the migrate command to only look at 
the hash of the id field, given that I’m not using a shard key.)

In both cases, the response of the corresponding REQUESTSTATUS indicates 
success.  For example:
╰─$ curl 
"localhost:8985/solr/admin/collections?action=REQUESTSTATUS=30"
{
  "responseHeader":{
"status":0,
"QTime":2},
  "success":{
"100.109.8.33:8983_solr":{
  "responseHeader":{
"status":0,
"QTime":10}},
"100.109.8.33:8983_solr":{
  "responseHeader":{
"status":0,
"QTime":0}},
"100.109.8.33:8983_solr":{
  "responseHeader":{
"status":0,
"QTime":0}},
"100.109.8.33:8983_solr":{
  "responseHeader":{
"status":0,
"QTime":0}},
"100.109.8.33:8983_solr":{
  "responseHeader":{
"status":0,
"QTime":0}},
"100.109.8.33:8983_solr":{
  "responseHeader":{
"status":0,
"QTime":0}},
"100.109.8.33:8983_solr":{
  "responseHeader":{
"status":0,
"QTime":94}},
"100.109.8.33:8983_solr":{
  "responseHeader":{
"status":0,
"QTime":0}},
"100.109.8.33:8983_solr":{
  "responseHeader":{
"status":0,
"QTime":82}},
"100.109.8.33:8983_solr":{
  "responseHeader":{
"status":0,
"QTime":85}}},
  "3023875288733778":{
"responseHeader":{
  "status":0,
  "QTime":0},
"STATUS":"completed",
"Response":"TaskId: 3023875288733778 webapp=null path=/admin/cores 
params={async=3023875288733778=/admin/cores=herp_shard2_replica_n8=REQUESTBUFFERUPDATES=javabin=2}
 status=0 QTime=10"},
  "302387540601230023875636935846":{
"responseHeader":{
  "status":0,
  "QTime":0},
"STATUS":"completed",
"Response":"TaskId: 302387540601230023875636935846 webapp=null 
path=/admin/cores 
params={qt=/admin/cores=derp=true=split_shard2_temp_shard2=2=NRT=302387540601230023875636935846=core_node2=split_shard2_temp_shard2_shard1_replica_n1=CREATE=1=shard1=javabin}
 status=0 QTime=0"},
  "3023878903291448":{
"responseHeader":{
  "status":0,
  "QTime":0},
"STATUS":"completed",
"Response":"TaskId: 3023878903291448 webapp=null path=/admin/cores 
params={core=derp_shard2_replica_n8=3023878903291448=Z!=/admin/cores=3dba-3dba=SPLIT=split_shard2_temp_shard2_shard1_replica_n1=javabin=2}
 status=0 QTime=0"},
  "3023880308944216":{
"responseHeader":{
  "status":0,
  "QTime":0},
"STATUS":"completed",
"Response":"TaskId: 3023880308944216 webapp=null path=/admin/cores 
params={async=3023880308944216=/admin/cores=core_node4=derp=split_shard2_temp_shard2_shard1_replica_n3=CREATE=split_shard2_temp_shard2=shard1=javabin=2=NRT}
 status=0 QTime=0"},
  "3023882401961074":{
"responseHeader":{
  "status":0,
  "QTime":0},
"STATUS":"completed",
"Response":"TaskId: 3023882401961074 webapp=null path=/admin/cores 
params={nodeName=100.109.8.33:8983_solr=split_shard2_temp_shard2_shard1_replica_n1=3023882401961074=/admin/cores=core_node4=PREPRECOVERY=true=active=true=javabin=2}
 status=0 QTime=0"},
  "3023885405877119":{
"responseHeader":{
  "status":0,
  "QTime":0},
"STATUS":"completed",
"Response":"TaskId: 3023885405877119 webapp=null path=/admin/cores 
params={core=herp_shard2_replica_n8=3023885405877119=/admin/cores=MERGEINDEXES=split_shard2_temp_shard2_shard1_replica_n3=javabin=2}
 status=0 QTime=94"},
  "3023885501282272":{
"responseHeader":{
  "status":0,
  "QTime":0},
"STATUS":"completed",
"Response":"TaskId: 3023885501282272 webapp=null path=/admin/cores 
params={async=3023885501282272=/admin/cores=herp_shard2_replica_n8=REQUESTAPPLYUPDATES=javabin=2}
 status=0 QTime=0"},
  "status":{
   

CloudSolrClient - setDefaultCollection

2018-06-21 Thread Greenhorn Techie
Hi,

While indexing, is there going to be any performance benefit to setting the
collection name first using the setDefaultCollection(String collection)
method and then indexing the document
using cloudClient.add(solrInputDoc), instead of using the method
cloudClient.add(collectionName, solrInputDoc)?

Is this a performance benefit worth considering, or is it merely a convenience /
better looking code?

Thanks


Re: Drive Change for Solr Setup

2018-06-21 Thread Rahul Singh
If it’s Windows, it may be using a tool called NSSM to manage the Solr service.

Look at Windows Services and Task Scheduler and work out whether the Solr services are 
being managed by Windows via services or the task scheduler, or just by .bat 
files.

Rahul
On Jun 20, 2018, 11:34 AM -0400, Shawn Heisey , wrote:
> On 6/20/2018 5:03 AM, Srinivas Muppu (US) wrote:
> > Hi Solr Team,My Solr project installation setup and instances(including
> > clustered solr, zk services and indexing jobs schedulers) is available in
> > Windows 'E:\ ' drive in production environment. As business needs to remove
> > the E:\ drive, going forward D:\ drive will be used and operational.Is
> > there any possible solution/steps for the moving solr installation setup
> > from 'E' drive to 'D' Drive without any impact to the existing
> > application(it should not create re indexing again)
>
> Exactly what needs to be done will be highly dependent on how you
> installed Solr on your system.  The project doesn't have any specific
> installation steps for Windows, so we have absolutely no idea what you
> have done.  Whoever set up your Solr install is going to know a LOT more
> about it than we ever can.
>
> At a high level, without any information specific to your setup, here's
> the steps you need:
>
>  * Stop Solr
>  * Move or copy files to the new location
>  * Change the solr home and possibly other config
>  * Start Solr.
>
> Thanks,
> Shawn
>


Indexing Approach

2018-06-21 Thread solrnoobie
We have to optimize our current indexing implementation. Our
current approach is to index per batch: each batch makes a
query call to the database that returns multiple result sets, and the
application is responsible for assembling the document from those
result sets.

This creates a problem, especially when new business requirements
force us to add another result set to the indexing query. It
sometimes slows down the database because it's hard to estimate the ideal
batch size, since the data grows unexpectedly in other result sets.

Another problem is that the documents being indexed are too big because of the
nested documents, which results in a heap error in SolrJ (a parent document
can have up to a thousand child documents).


So, to those who have ideas or experience with this: what can we do to
make this scalable and stable?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: How to split index more than 2GB in size

2018-06-21 Thread Michael Kuhlmann
Hi Sushant,

while this is true in general, it won't hold here. If you split your
index, searching each split shard might be a bit faster, but
you'll add search time elsewhere because Solr needs to send your
search queries to all shards and then combine the results. So instead of
having one medium-fast search request, you'll have several fast requests
plus the aggregation step.

Erick is totally right: splitting an index of that size has no
performance benefit. Sharding is not a technique to improve performance;
it's a technique for handling indexes of hundreds of gigabytes
in size, which won't fit onto an individual machine.

Best,
Michael


Am 20.06.2018 um 19:58 schrieb Sushant Vengurlekar:
> Thank you for the detailed response Eric. Very much appreciated. The reason
> I am looking into splitting the index into two is because it’s much faster
> to search across a smaller index than a larger one.
> 
> On Wed, Jun 20, 2018 at 10:46 AM Erick Erickson 
> wrote:
> 
>> You still haven't answered _why_ you think splitting even a 20G index
>> is desirable. We regularly see 200G+ indexes per replica in the field,
>> so what's the point? Have you measured different setups to see if it's
>> a good idea? A 200G index needs some beefy hardware admittedly
>>
>> If you have adequate response times with a 20G index and need to
>> increase the QPS rate, just add more replicas. Having more than one
>> shard inevitably adds overhead which may (or may not) be made up for
>> by parallelizing some of the work. It's nearly always better to use
>> only one shard if it meets your response time requirements.
>>
>> Best,
>> Erick
>>
>> On Wed, Jun 20, 2018 at 10:39 AM, Sushant Vengurlekar
>>  wrote:
>>> The index size is small because this is my local development copy.  The
>>> production index is more than 20GB. So I am working on getting the index
>>> split and replicated on different nodes. Our current instance on prod is
>>> single instance solr 6 which we are working on moving towards solrcloud 7
>>>
>>> On Wed, Jun 20, 2018 at 10:30 AM Erick Erickson >>
>>> wrote:
>>>
 Use the indexupgrader tool or optimize your index before using
>> splitshard.

 Since this is a small index (< 5G), optimizing will not create an
 overly-large segment, so that pitfall is avoided.

 You haven't yet explained why you think splitting the index would be
 beneficial. Splitting an index this small is unlikely to improve query
 performance appreciably. This feels a lot like an "XY" problem, you're
 asking how to do X thinking it will solve Y but not telling us what Y
 is.

 Best,
 Erick

 On Wed, Jun 20, 2018 at 9:40 AM, Sushant Vengurlekar
  wrote:
> How can I resolve this error?
>
> On Wed, Jun 20, 2018 at 9:11 AM, Alexandre Rafalovitch <
 arafa...@gmail.com>
> wrote:
>
>> This seems more related to an old index upgraded to latest Solr
>> rather
 than
>> the split itself.
>>
>> Regards,
>> Alex
>>
>> On Wed, Jun 20, 2018, 12:07 PM Sushant Vengurlekar, <
>> svengurle...@curvolabs.com> wrote:
>>
>>> Thanks for the reply Alessandro! Appreciate it.
>>>
>>> Below is the full request and the error received
>>>
>>> curl '
>>>
>>> http://localhost:8081/solr/admin/collections?action=
>> SPLITSHARD=dev-transactions=shard1
>>> '
>>>
>>> {
>>>
>>>   "responseHeader":{
>>>
>>> "status":500,
>>>
>>> "QTime":7920},
>>>
>>>   "success":{
>>>
>>> "solr-1:8081_solr":{
>>>
>>>   "responseHeader":{
>>>
>>> "status":0,
>>>
>>> "QTime":1190},
>>>
>>>   "core":"dev-transactions_shard1_0_replica_n3"},
>>>
>>> "solr-1:8081_solr":{
>>>
>>>   "responseHeader":{
>>>
>>> "status":0,
>>>
>>> "QTime":1047},
>>>
>>>   "core":"dev-transactions_shard1_1_replica_n4"},
>>>
>>> "solr-1:8081_solr":{
>>>
>>>   "responseHeader":{
>>>
>>> "status":0,
>>>
>>> "QTime":6}},
>>>
>>> "solr-1:8081_solr":{
>>>
>>>   "responseHeader":{
>>>
>>> "status":0,
>>>
>>> "QTime":1009}}},
>>>
>>>   "failure":{
>>>
>>>
>>>
>> "solr-1:8081_solr":"org.apache.solr.client.solrj.impl.HttpSolrClient$
>> RemoteSolrException:Error
>>> from server at http://solr-1:8081/solr:
>>> java.lang.IllegalArgumentException:
>>> Cannot merge a segment that has been created with major version 6
>> into
>> this
>>> index which has been created by major version 7"},
>>>
>>>   "Operation splitshard caused
>>>
>>> exception:":"org.apache.solr.common.SolrException:org.
>> apache.solr.common.SolrException:
>>> SPLITSHARD failed to invoke SPLIT core admin command",
>>>
>>>