Hi, how to deal with a shard in recovery_failed status?

2019-06-17 Thread zhenyuan wei
Hi all,
I'm using the solr-7.3.1 release. When splitting shard1 into shard1_0 & shard1_1,
I encountered an OOM error, and shard1_0 & shard1_1 then published a status of
recovery_failed.
How should I deal with a shard in recovery_failed status?
Remove shard1_0 & shard1_1 and then split shard1 again?
Or is there any other way to retry?

Best wishes,
tinswzy


Re: unix socket or D-Bus?

2019-06-17 Thread Felipe Gasper



> On Jun 17, 2019, at 1:17 PM, Shawn Heisey  wrote:
> 
>> Ideally I’d like Solr/Jetty to be able to white-list any connection from a 
>> root-owned socket.
> 
> Solr typically runs as a non-privileged user.  If the start script detects 
> that it's running as root, it will refuse to start without an option to force 
> it.  We strongly recommend not running as root. About the only legitimate 
> reason to run as root is to bind to a port number below 1025... and that is 
> discouraged because Solr should never be accessible by the open Internet.

Solr wouldn’t need to run as root; the process just needs to determine whom 
it’s talking to, which the kernel can answer regardless of the server’s 
privilege level.

I’m new to Java, but the jnr.unixsocket library--which Jetty uses for its UNIX 
socket logic--does provide this information:
https://github.com/jnr/jnr-unixsocket/blob/master/src/main/java/jnr/unixsocket/UnixSocket.java

On the Solr side, then, would it be a matter of creating a new plugin as an 
alternative to BasicAuthPlugin that reads the socket credentials from 
jnr.unixsocket via whatever control Jetty exposes (or would need to be altered 
to expose)?
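
For what it's worth, here is a rough Java sketch of the check I have in mind. The
getCredentials()/getUid() accessors are my assumption about the jnr.unixsocket API
(SO_PEERCRED under the hood) and would need verifying against the class linked above:

import java.io.IOException;

import jnr.unixsocket.Credentials;
import jnr.unixsocket.UnixSocketChannel;

public class PeerCredentialCheck {

    // White-list the connection only if the peer process on the UNIX socket
    // is running as root (uid 0).
    // NOTE: getCredentials()/getUid() are assumptions about the jnr.unixsocket
    // API; verify the exact accessors before relying on this.
    public static boolean isPeerRoot(UnixSocketChannel channel) throws IOException {
        Credentials peer = channel.getCredentials();
        return peer != null && peer.getUid() == 0;
    }
}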

-FG

Re: Error in last_modified for open documents formats

2019-06-17 Thread Erick Erickson
Solr requires a very precise date format; the one you’re sending has both
too many zeros to the right of the decimal point and is missing the
terminating ‘Z’ (something like 2019-06-14T16:59:47.610Z would be valid). See: 
https://lucene.apache.org/solr/guide/6_6/working-with-dates.html

The output you’re getting is from Tika, which is used by Solr. 
You’ll have to find a way to transform it into a proper Solr date format.

One option is to use the ParseDateFieldUpdateProcessorFactory on
the Solr side if you can’t correct it otherwise, see:
https://lucene.apache.org/solr/guide/7_7/schemaless-mode.html

NOTE: You do _not_ have to use schemaless mode to use this; that link
is just to show you how to configure it. You’ll just have to configure
it as part of your standard update chain in solrconfig.xml.
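
As a rough sketch only (chain name, processor selection, and the format pattern
below are placeholders you’d need to adapt; the pattern simply mirrors the
2019-06-14T16:59:47.61000 value from the error):

<updateRequestProcessorChain name="parse-last-modified" default="true">
  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <str name="defaultTimeZone">UTC</str>
    <arr name="format">
      <!-- matches values like 2019-06-14T16:59:47.61000 -->
      <str>yyyy-MM-dd'T'HH:mm:ss.SSSSS</str>
    </arr>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>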

Best,
Erick

> On Jun 17, 2019, at 7:13 AM, maguba  wrote:
> 
> Hello,
> 
> I installed Solr 8.1.1, and when I try indexing LibreOffice files (ods,
> odt, ...) it throws:
> 
> org.apache.solr.common.SolrException: ERROR:
> [doc=D42039220124097949-A100020965] Error adding field
> 'last_modified'='2019-06-14T16:59:47.61000' msg=Invalid Date
> String:'2019-06-14T16:59:47.61000'
> 
> Caused by: org.apache.solr.common.SolrException: Invalid Date
> String:'2019-06-14T16:59:47.61000'
>   at 
> org.apache.solr.util.DateMathParser.parseMath(DateMathParser.java:247)
>   at 
> org.apache.solr.util.DateMathParser.parseMath(DateMathParser.java:226)
>   at
> org.apache.solr.schema.DatePointField.createField(DatePointField.java:214)
>   at org.apache.solr.schema.PointField.createFields(PointField.java:250)
>   at 
> org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:65)
>   at
> org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:171)
>   ... 58 more
> 
> Other document formats (pdf, doc, xls, ...) work without problems.
> 
> schema.xml definition:
> 
> 
> ...
> 
> 
> Please, any idea? 
> 
> Thanks!
> 
> 
> 
> 
> 



Re: Question regarding negated block join queries

2019-06-17 Thread Erick Erickson
Bram:

Here’s a fuller explanation that you might be interested in:

https://lucidworks.com/2011/12/28/why-not-and-or-and-not/

Best,
Erick

> On Jun 17, 2019, at 11:32 AM, Bram Biesbrouck 
>  wrote:
> 
> On Mon, Jun 17, 2019 at 7:11 PM Shawn Heisey  wrote:
> 
>> On 6/17/2019 4:46 AM, Bram Biesbrouck wrote:
>>> q={!parent which=-(parentUri:*)}*:*
>> 
>> Pure negative queries do not work in Lucene.  Sometimes, when you do a
>> single-clause negative query, Solr is able to detect the problem and
>> automatically make an adjustment so the query works.  This happens
>> transparently so you never notice.
>> 
>> In essence, what your negative query tells Lucene is "start with
>> nothing, and then subtract docs that match this query."  Since you
>> started with nothing and then subtracted, you get nothing.
>> 
>> Also, that's a wildcard query.  Which could be very slow if the possible
>> number of values in parentUri is more than a few.  If that field can
>> only contain a very small number of values, then a wildcard query might
>> be fast.
>> 
>> The following query solves both problems -- starting with all docs and
>> then subtracting things that match the query clause after that:
>> 
>> *:* -parentUri:[* TO *]
>> 
>> This will return all documents that do not have the parentUri field
>> defined.  The [* TO *] syntax is an all-inclusive range query.
>> 
> 
> Hi Shawn,
> 
> Awesome elaborate explanation, thank you. Also thanks for the optimization
> hint. I found both approaches online, but didn't realize there was a
> performance difference.
> Digging deeper, I've found this SO post, basically explaining why it worked
> some of the time, but not in all cases:
> https://stackoverflow.com/questions/10651548/negation-in-solr-query
> 
> best,
> 
> b.



Re: Question regarding negated block join queries

2019-06-17 Thread Bram Biesbrouck
On Mon, Jun 17, 2019 at 7:11 PM Shawn Heisey  wrote:

> On 6/17/2019 4:46 AM, Bram Biesbrouck wrote:
> > q={!parent which=-(parentUri:*)}*:*
>
> Pure negative queries do not work in Lucene.  Sometimes, when you do a
> single-clause negative query, Solr is able to detect the problem and
> automatically make an adjustment so the query works.  This happens
> transparently so you never notice.
>
> In essence, what your negative query tells Lucene is "start with
> nothing, and then subtract docs that match this query."  Since you
> started with nothing and then subtracted, you get nothing.
>
> Also, that's a wildcard query.  Which could be very slow if the possible
> number of values in parentUri is more than a few.  If that field can
> only contain a very small number of values, then a wildcard query might
> be fast.
>
> The following query solves both problems -- starting with all docs and
> then subtracting things that match the query clause after that:
>
> *:* -parentUri:[* TO *]
>
> This will return all documents that do not have the parentUri field
> defined.  The [* TO *] syntax is an all-inclusive range query.
>

Hi Shawn,

Awesome elaborate explanation, thank you. Also thanks for the optimization
hint. I found both approaches online, but didn't realize there was a
performance difference.
Digging deeper, I've found this SO post, basically explaining why it worked
some of the time, but not in all cases:
https://stackoverflow.com/questions/10651548/negation-in-solr-query

best,

b.


Error in last_modified for open documents formats

2019-06-17 Thread maguba
Hello,

I installed Solr 8.1.1, and when I try indexing LibreOffice files (ods,
odt, ...) it throws:

org.apache.solr.common.SolrException: ERROR:
[doc=D42039220124097949-A100020965] Error adding field
'last_modified'='2019-06-14T16:59:47.61000' msg=Invalid Date
String:'2019-06-14T16:59:47.61000'

Caused by: org.apache.solr.common.SolrException: Invalid Date
String:'2019-06-14T16:59:47.61000'
at 
org.apache.solr.util.DateMathParser.parseMath(DateMathParser.java:247)
at 
org.apache.solr.util.DateMathParser.parseMath(DateMathParser.java:226)
at
org.apache.solr.schema.DatePointField.createField(DatePointField.java:214)
at org.apache.solr.schema.PointField.createFields(PointField.java:250)
at 
org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:65)
at
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:171)
... 58 more

Other document formats (pdf, doc, xls, ...) work without problems.

schema.xml definition:


...


Please, any idea? 

Thanks!







Re: unix socket or D-Bus?

2019-06-17 Thread Shawn Heisey

On 6/16/2019 10:43 PM, Felipe Gasper wrote:

Does Solr do its own authentication, or does Jetty do that? One of the benefits 
of UNIX sockets is that the socket exposes the peer’s credentials, so 
Solr/Jetty could implement logic that says, “ah, you’re root? Cool, you’re in.”


As far as I know, when authentication is configured in Solr, Solr takes 
that config and uses the Servlet API to configure authentication, and 
then that is handled by the container.  Which is Jetty, unless the user 
takes the webapp and installs it in another container.



Ideally I’d like Solr/Jetty to be able to white-list any connection from a 
root-owned socket.


Solr typically runs as a non-privileged user.  If the start script 
detects that it's running as root, it will refuse to start without an 
option to force it.  We strongly recommend not running as root. About 
the only legitimate reason to run as root is to bind to a port number 
below 1025... and that is discouraged because Solr should never be 
accessible by the open Internet.


I'm sure that configuring a socket would be outside of Solr entirely -- 
all in Jetty.  I don't know that any of the built-in Solr client stuff 
can use a socket, though -- that would likely need to be a custom client.



D-Bus is an IPC mechanism that most (if not all) Linux distros--and several 
other OSes--run as a standard daemon. Notable uses include systemd and X-based 
applications, but any service can expose an interface on D-Bus. It would be an 
alternative to REST, one advantage of which being that Solr could send messages 
itself rather than merely answering requests.


I don't think Jetty can do that.  Maybe another container can ... but 
you'd be in unsupported territory at that point.  And you'd need to have 
a custom client for this too.


https://wiki.apache.org/solr/WhyNoWar

Thanks,
Shawn


Re: Question regarding negated block join queries

2019-06-17 Thread Shawn Heisey

On 6/17/2019 4:46 AM, Bram Biesbrouck wrote:

q={!parent which=-(parentUri:*)}*:*


Pure negative queries do not work in Lucene.  Sometimes, when you do a 
single-clause negative query, Solr is able to detect the problem and 
automatically make an adjustment so the query works.  This happens 
transparently so you never notice.


In essence, what your negative query tells Lucene is "start with 
nothing, and then subtract docs that match this query."  Since you 
started with nothing and then subtracted, you get nothing.


Also, that's a wildcard query.  Which could be very slow if the possible 
number of values in parentUri is more than a few.  If that field can 
only contain a very small number of values, then a wildcard query might 
be fast.


The following query solves both problems -- starting with all docs and 
then subtracting things that match the query clause after that:


*:* -parentUri:[* TO *]

This will return all documents that do not have the parentUri field 
defined.  The [* TO *] syntax is an all-inclusive range query.
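
Applied to your block-join example, that filter would presumably go into the 
`which` clause, roughly like this (an untested sketch, adjust as needed):

q={!parent which="*:* -parentUri:[* TO *]"}*:*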


Thanks,
Shawn


Optimizing integer primary key lookup speed: optimal FieldType and Codec?

2019-06-17 Thread Gregg Donovan
Hello! We (Etsy) would like to optimize primary key lookup speed. Our
primary key is a 32-bit integer -- and we are wondering what the
state-of-the-art is for FieldType and Codec these days for maximizing the
throughput of 32-bit ID lookups.


Context:
Specifically, we're looking to optimize the loading loop of
ExternalFileField.
We are developing a specialized binary file version of the EFF that is
optimized for 32-bit int primary keys and their scores. Our motivation is
saving on storage, bandwidth, etc. via specializing the code for our
use-case -- we are heavy EFF users.

In pseudo-code, the inner EFF loading loop is:

for each primary_key, score pair in the external file:
termsEnum.seekExact(primary_key)
doc_id = postingsEnum.nextDoc()
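
To make the pseudo-code concrete, here is a rough Lucene-level sketch of a single
lookup; it assumes the primary key is indexed as a single-token string field named
"id" (field name and segment wiring are placeholders, not our actual code):

import java.io.IOException;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class KeyLookupSketch {

    // Resolve one external-file primary key to a Lucene doc id within a segment.
    // Assumes the key is indexed as a single-token (string) field named "id".
    public static int lookupDocId(LeafReader reader, String primaryKey) throws IOException {
        Terms terms = reader.terms("id");            // terms dictionary for the key field
        if (terms == null) {
            return -1;                               // field absent in this segment
        }
        TermsEnum termsEnum = terms.iterator();
        if (!termsEnum.seekExact(new BytesRef(primaryKey))) {
            return -1;                               // key not present in this segment
        }
        PostingsEnum postings = termsEnum.postings(null, PostingsEnum.NONE);
        return postings.nextDoc();                   // unique key => at most one doc
    }
}

In the real per-segment loop we would of course hoist the TermsEnum out and reuse
the PostingsEnum across keys rather than re-creating them for every lookup.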


Re: Codecs:
Is anything special needed to make ID lookups faster now that "pulsing" has
been incorporated into the default codec?
What about using IDVersionPostingsFormat?
Is that likely to be faster? Or is it the wrong choice if we don't need the
version support?


FieldType:
I see that EFFs do not currently support the new Points-based int fields,
but this does not appear to be due to any inherent limitation in the Points
field. At least, that's what I infer from the JIRA. Are the Point fields
the right choice for fast 32-bit int ID lookups?

Thanks!

Gregg


indexing MongoDB using DIH

2019-06-17 Thread Wendy2
Hi,

Has anyone tried the following project for indexing MongoDB via DIH?
I tried to use it, but could not add a filter in the find() method.

Any suggestions?   Thanks! 

https://github.com/james75/SolrMongoImporter





Dovecot integration

2019-06-17 Thread Felipe Gasper
Hi all,

https://wiki.dovecot.org/Plugins/FTS/Solr

^^ I’m looking at this documentation and am wondering if its discussion 
of the managed-schema and schema.xml files is inaccurate/incomplete/misleading.

Dovecot’s documentation implies that it’s normal operation for the 
managed-schema to be generated from schema.xml; however, going by the docs here:


https://lucene.apache.org/solr/guide/8_0/schema-factory-definition-in-solrconfig.html

… it appears that an installation that uses Dovecot’s provided 
configuration file, which lacks a <schemaFactory> directive, should, in fact, 
have a managed-schema file, and schema.xml is of no use; the generation of 
managed-schema from schema.xml is actually migration logic, not normal 
operation.

Is this correct? If so, it seems like Dovecot’s documentation could 
make this clearer.

Thank you!

-Felipe Gasper

Re: Losing data on Solr restart

2019-06-17 Thread Erick Erickson
There are a number of possibilities, but what this really sounds
like is that you aren’t committing your documents. There’s more
than you want to know here:
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

But here’s the short form: In the absence of a hard commit
(openSearcher=true or false is irrelevant), two things happen:
1> Segments are not closed, and if you shut down Solr
 hard (e.g. kill -9 or the like) everything indexed since
 the last commit will be lost. You can still search them
 before you restart if you have _soft_ commits set to
 something other than -1.
2> If you kill Solr hard, then everything since the last hard
 commit is replayed from the transaction log at startup, 
 which would account for the nodes not coming back up.

So here’s what I’d try to verify whether this is even close to correct:
1> change solrconfig to hard commit (openSearcher=false probably)
 more frequently, say every 15 seconds (see the sketch after this list).
2> Wait at least that long after your indexing is done before you 
 stop Solr.
3> Stop Solr gracefully, using the bin/solr script or the like. Pay
 attention to the termination message when you do: does it
 say anything like “forcefully killing Solr” or similar? If so,
 then the bin/solr script is un-gracefully killing Solr. There’s
 an environment variable in the script you can set to give
 Solr more time (see the note after this list).
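
For point 1>, a 15-second hard commit would look roughly like this inside
<updateHandler> in solrconfig.xml (the values are just illustrations):

<autoCommit>
  <maxTime>15000</maxTime>            <!-- hard commit every 15 seconds -->
  <openSearcher>false</openSearcher>  <!-- visibility comes from soft commits -->
</autoCommit>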

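For point 3>, if memory serves the variable is SOLR_STOP_WAIT; check your
bin/solr and solr.in.sh, since names and defaults can vary by version.
Something like:

# in solr.in.sh (or exported before running "bin/solr stop")
SOLR_STOP_WAIT=300   # seconds to wait before the script force-kills Solr
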
You do not need to issue commits from your client; that’s usually
a bad practice unless you can guarantee that there’s only a single
client and it only issues one commit at the very end of the run. To
troubleshoot, though, you can issue a commit from the browser, 
SOLR_ADDRESS:PORT/solr/collection/update?commit=true
will do it. It’d be instructive to see how long that takes to come back
as well.

So what this sounds like is that you have a massive number of 
uncommitted documents when Solr stops and it replays them
on startup. Whether that’s the real problem here or not you’ll have
to experiment to determine.

Best,
Erick

> On Jun 17, 2019, at 12:01 AM, स्वप्निल रोकडे  wrote:
> 
> I am a newbie to Apache Manifold (ver – 2.12) and Solr (ver – 7.6) with
> Zookeeper (ver 3.14). I have created three collections in Solr; data for two
> of them comes from Manifold, while the third is populated by manual data
> inserts through the plain Solr API. When I run jobs in Manifold I can see
> data being inserted into Solr, and it can be seen by querying Solr.
> 
> But when I restart Solr, all the shards and replicas go down and never
> recover. I am also unable to reload the collections, as the reload always
> gives a timeout error. I tried taking an index backup and then restoring it,
> but restoring also fails with a timeout error. I tried the reload and restore
> commands from the same server on which everything is installed, but they
> still fail. The problem seems to affect only the collections whose data comes
> from Manifold, as my other collection, where I insert data via the Solr API,
> starts properly after a restart. I also don’t see any errors being logged in
> the Solr logs.
> 
> I don’t know whether I missed something in the configuration, or whether
> there is some kind of lock on Solr due to which none of the reload or restore
> commands work properly and I lose everything on restarting Solr.
> 
> Please suggest.
> Regards,
> Swapnil



Re: Config API: delete-requesthandler

2019-06-17 Thread Noble Paul
Yes, this is the way to do it.
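
For reference, a combined command file like the one quoted below can be posted in
a single Config API call, e.g. (collection name and file name are placeholders):

curl http://localhost:8983/solr/mycollection/config \
  -H 'Content-type: application/json' \
  --data-binary @commands.json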

On Thu, May 26, 2016 at 6:54 AM Jan Høydahl  wrote:
>
> Hi
>
> Have you tried adding all your commands to the same file?
>
> {
>   "add-requesthandler":{"name":"/foo","class":"solr.SearchHandler"},
>   "add-requesthandler":{"name":"/bar","class":"solr.SearchHandler"},
>   "add-requesthandler":{"name":"/baz","class":"solr.SearchHandler"}
> }
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 20. mai 2016 kl. 16.19 skrev Steven White :
> >
> > Hi folks,
> >
> > The code I'm maintaining, uses Solr config API per
> > https://cwiki.apache.org/confluence/display/solr/Config+API to manage
> > request handlers.  My environment has many request handlers, up to 100 in
> > few extreme cases.  When that's the case, it means I will issue 100
> > "delete-requesthandler" followed with 100 "add-requesthandler" requests.
> > This end-to-end operation can be time consuming, especially if there is data
> > in the index (searchers are being reset leading to cache and stuff being
> > reloaded).  But even if there isn't data in the index, it will take about
> > 10 min. in my test environment.
> >
> > My question is, how can I optimize this?  Is there any kind of "bulk" or
> > "transaction-base" way of doing schema configuration?  I.e.: send all my
> > deletes, followed with all my adds, and then issue a commit for the change
> > to take effect.  If I switch to SolrJ (does it offer schema management?)
> > will that help?
> >
> > Thanks in advance.
> >
> > Steve
>


-- 
-
Noble Paul


Question regarding negated block join queries

2019-06-17 Thread Bram Biesbrouck
Dear all,

I'm new to this list, so let me introduce myself. I'm Bram, author of a
linked data framework called Stralo. We're working toward version 1.0, in
which we're integrating Solr indexing and querying of RDF triples (
https://github.com/republic-of-reinvention/com.stralo.framework/milestone/3)

I'm running into inconsistent results regarding block join queries and I
wondered if any of you could help me out. We're indexing our parent-child
relationships using a field called "parentUri". The field contains the URI
(the id of the document) of the parent document, and is simply omitted when
the document itself is a parent.

Here's an example of a child document:

{
"language":"en",
"resource":"/resource/1130494009577889453",
"parentUri":"/en/blah",
"uri":"/resource/1130494009577889453",
"label":"Label of the object",
"description":"Example of some sub text",
"typeOf":"ror:Page",
"rdf:type":["ror:Page"],
"rdfs:label":["Label of the object"],
"ror:text":["Example of some sub text"],
"ror:testNumber":[4],
"ror:testDate":["2019-05-10T00:00:00Z"],
"_version_":1636582287436939264
}

(Please ignore the CURIE syntax we're using as field names. We know it's
slightly illegal in Solr, but it works just fine and it makes our lives
indexing triples so much more convenient)

Here's its parent document:

{
"language":"en",
"resource":"/resource/1106177060466942658",
"uri":"/en/blah",
"label":"rdfs label test 3",
"description":"Hi, we are the Republic \nwe do video
technology",
"typeOf":"ror:BlogPost",
"rdf:type":["ror:BlogPost"],
"rdfs:label":["rdfs label test 3"],
"meta:created":["2019-04-04T09:08:35.736Z"],
"meta:creator":["/users/2"],
"meta:modified":["2019-06-17T10:14:54.134Z"],
"meta:contributor":["/users/2",
  "/users/1"],
"ror:testEditor":["Blah, dit is inhoud van test editor"],
"ror:testEnum":["af"],
"ror:testDate":["2019-05-31T00:00:00Z"],
"ror:testResource":["/resource/Page/800895161299715471"],
"ror:testObject":["/resource/1130494009577889453"],
"ror:text":["Hi, we are the Republic we do video technology"],
"_version_":1636582287436939264
}

As said, we're struggling with block joins, because we don't have a clear
field that contains "this" for parent documents and "that" for child
documents. Instead, it's omitted for parent documents. So, to fire a block
join child query, we use this approach (just an example):

q={!parent which=-(parentUri:*)}*:*

What we expect is that the allParents filter selects all those documents
where the "parentUri" field doesn't exist using a negated wildcard query
(which works just fine when used alone). The someParents filter just
selects everything since this is an example. Alas, this doesn't yield any
results.

Since the docs say:
When subordinate clause (someParents) is omitted, it’s parsed as a
segmented and cached filter for children documents. More precisely,
q={!child of=allParents} is equivalent to q=*:* -allParents.

I tried to run this query (assuming a double negation becomes a plus):

*:* +(parentUri:*)

And this yields correct results, so I'm assuming it's possible, but I'm
overlooking something in my block join children query syntax.

Could anyone point me in the right direction for using block join queries with
non-existent or existent fields?

all the best,

b.


Losing data on Solr restart

2019-06-17 Thread स्वप्निल रोकडे
I am a newbie to Apache Manifold (ver – 2.12) and Solr (ver – 7.6) with
Zookeeper (ver 3.14). I have created three collections in Solr; data for two
of them comes from Manifold, while the third is populated by manual data
inserts through the plain Solr API. When I run jobs in Manifold I can see
data being inserted into Solr, and it can be seen by querying Solr.

But when I restart Solr, all the shards and replicas go down and never
recover. I am also unable to reload the collections, as the reload always
gives a timeout error. I tried taking an index backup and then restoring it,
but restoring also fails with a timeout error. I tried the reload and restore
commands from the same server on which everything is installed, but they
still fail. The problem seems to affect only the collections whose data comes
from Manifold, as my other collection, where I insert data via the Solr API,
starts properly after a restart. I also don’t see any errors being logged in
the Solr logs.

I don’t know whether I missed something in the configuration, or whether
there is some kind of lock on Solr due to which none of the reload or restore
commands work properly and I lose everything on restarting Solr.

Please suggest.
Regards,
Swapnil