Re: Facet behavior

2016-10-25 Thread Bastien Latard | MDPI AG

Hi Guys,

Could any of you tell me if I'm right?
Thanks in advance.

kr,
Bast



-------- Forwarded Message --------
Subject:Re: Facet behavior
Date:   Thu, 20 Oct 2016 14:45:23 +0200
From:   Bastien Latard | MDPI AG <lat...@mdpi.com>
To: solr-user@lucene.apache.org



Hi Yonik,

Thanks for your answer!
I'm not quite sure I understood everything... please see my comments below.



On Wed, Oct 19, 2016 at 6:23 AM, Bastien Latard | MDPI AG
<lat...@mdpi.com.invalid> wrote:

I just had a question about facets.
==> Is the facet run on all documents (to pre-process/cache the data) or
only on returned documents?

Yes ;-)

There are sometimes per-field data structures that are cached to
support faceting.  This can make the first facet request after a new
searcher take longer.  Unless you're using docValues, then the cost is
much less.

So how to force it to use docValues? Simply:

Are there other advantages/disadvantages?


Then there are per-request data structures (like a count array) that
are O(field_cardinality) and not O(matching_docs).
But then for default field-cache faceting, the actual counting part is
O(matching_docs).
So yes, at the end of  the day we only facet on the matching
documents... but what the total field looks like certainly matters.

This would only be like that if I used docValues, right?

If I have such a field declaration (a dedicated facet field, without
stemming), what would be the best setting?


Kind regards,
Bastien



Re: Facet behavior

2016-10-20 Thread Bastien Latard | MDPI AG

Hi Yonik,

Thanks for your answer!
I'm not quite sure I understood everything... please see my comments below.



On Wed, Oct 19, 2016 at 6:23 AM, Bastien Latard | MDPI AG
<lat...@mdpi.com.invalid> wrote:

I just had a question about facets.
==> Is the facet run on all documents (to pre-process/cache the data) or
only on returned documents?

Yes ;-)

There are sometimes per-field data structures that are cached to
support faceting.  This can make the first facet request after a new
searcher take longer.  Unless you're using docValues, then the cost is
much less.

So how to force it to use docValues? Simply:
docValues="true" />
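For reference, a complete declaration with docValues enabled might look like the following; the field name and type are placeholders, since the original declaration was cut off by the archiver:

```xml
<!-- Hypothetical example: a facet-friendly field with docValues enabled.
     "journal" and the exact attribute choices are assumptions, not the
     original poster's schema. -->
<field name="journal" type="string" indexed="true" stored="true"
       docValues="true" multiValued="false" required="false" />
```

With docValues="true", faceting reads the column-oriented docValues files instead of uninverting the field into the FieldCache on the first facet request after a new searcher.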

Are there other advantages/disadvantages?


Then there are per-request data structures (like a count array) that
are O(field_cardinality) and not O(matching_docs).
But then for default field-cache faceting, the actual counting part is
O(matching_docs).
So yes, at the end of  the day we only facet on the matching
documents... but what the total field looks like certainly matters.

This would only be like that if I used docValues, right?

If I have such a field declaration (a dedicated facet field, without
stemming), what would be the best setting?
stored="true" required="false" multiValued="true" />
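A typical declaration for a dedicated, unstemmed facet field might look like this; the field names are placeholders, since the original line was truncated:

```xml
<!-- Hypothetical dedicated facet field: string type (no analysis, so no
     stemming), docValues for cheap faceting. Names are assumptions. -->
<field name="author_facet" type="string" indexed="true" stored="true"
       required="false" multiValued="true" docValues="true" />
<!-- Optionally populated from the searchable field at index time: -->
<copyField source="author" dest="author_facet" />
```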


Kind regards,
Bastien



Facet behavior

2016-10-19 Thread Bastien Latard | MDPI AG

Hi everybody,

I just had a question about facets.
==> Is the facet run on all documents (to pre-process/cache the data)
or only on returned documents?


Because I have exactly the same index locally and on the production server
(except that my dev index contains far fewer docs).


When I make a query and want the facets for it, it takes much longer on 
the production server, even though the query returns fewer 
documents...


e.g.:
q=nanoparticles AND 
gold=5=author=0=true=xml=0

- live : 4059 documents <=> 11 secs
- local: 22298 documents <=> 1 sec

Thanks in advance.

Kind regards,
Bastien



Re: How can I set the defaultOperator to be AND?

2016-09-02 Thread Bastien Latard | MDPI AG

Thanks Steve for your advice (i.e.: upgrade to Solr 6.2).
I finally had time to upgrade and can now use "q.op=AND" together with 
"q=a OR b", and this works as expected.


I even defined the following line in the defaults settings of the 
requestHandler, to override the default behavior:

<str name="q.op">AND</str>

Issue fixed :)
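In context, such an override sits in the request handler's defaults in solrconfig.xml; a sketch (the handler name is assumed to be the stock /select):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="q.op">AND</str>
    <!-- other defaults (rows, df, ...) go here -->
  </lst>
</requestHandler>
```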

Kind regards,
Bast

On 05/08/2016 14:57, Bastien Latard | MDPI AG wrote:

Hi Steve,

I read the thread you sent me (SOLR-8812) and it seems that the 6.1 
includes this fix, as you said.

I will upgrade.
Thank you!

Kind regards,
Bast

On 05/08/2016 14:37, Steve Rowe wrote:

Hi Bastien,

Have you tried upgrading to 6.1?  SOLR-8812, mentioned earlier in the 
thread, was released with 6.1, and is directly aimed at fixing the 
problem you are having in 6.0 (also a problem in 5.5): when mm is not 
explicitly provided and the query contains explicit operators (except 
for AND), edismax now sets mm=0.


--
Steve
www.lucidworks.com

On Aug 5, 2016, at 2:34 AM, Bastien Latard | MDPI AG 
<lat...@mdpi.com.INVALID> wrote:


Hi Eric & others,
Is there any way to overwrite the default OP when we use edismax?
Because adding the following line to solrconfig.xml doesn't solve 
the problem:



(Then if I do "q=black OR white", this always gives the results for 
"black AND white")


I did not find a way to define a default OP, which is automatically 
overwritten by the AND/OR from a query.



Example - Debug: defaultOP in solrconfig = AND / q=a or b


==> results for black AND white
The correct result should be the following (but I had to force the 
q.op):


==> I cannot do this in case I want to do "(a AND b) OR c"...


Kind regards,
Bastien

On 27/04/2016 05:30, Erick Erickson wrote:
Defaulting to "OR" has been the behavior since forever, so changing 
the behavior now is just not going to happen. Making it fit a new 
version of "correct" will change the behavior for every application 
out there that has not specified the default behavior.


There's no a-priori reason to expect "more words to equal fewer 
docs", I can just as easily argue that "more words should return 
more docs". Which you expect depends on your mental model.


And providing the default op in your solrconfig.xml request 
handlers allows you to implement whatever model your application 
chooses...


Best,
Erick

On Mon, Apr 25, 2016 at 11:32 PM, Bastien Latard - MDPI AG 
<lat...@mdpi.com.invalid> wrote:

Thank you Shawn, Jan and Georg for your answers.

Yes, it seems that if I simply remove the defaultOperator it works 
well for "composed queries" like '(a:x AND b:y) OR c:z'.

But I think that the default Operator should/could be the AND.

Because when I add an extra search word, I expect that the results 
get more accurate...

(It seems to be what google is also doing now)

Otherwise, if you make a search and apply another filter (e.g.: 
sort by publication date, facets, ...), the user can get the least 
relevant item (only 1 word in 4 matches) in first position, only 
because of its date...


What do you think?


Kind regards,
Bastien


On 25/04/2016 14:53, Shawn Heisey wrote:

On 4/25/2016 6:39 AM, Bastien Latard - MDPI AG wrote:


Remember:
If I add the following line to the schema.xml, even if I do a search
'title:"test" OR author:"me"', it will return documents matching
'title:"test" AND author:"me"':


The settings in the schema for default field and default operator were
deprecated a long time ago.  I actually have no idea whether they are
even supported in newer Solr versions.

The q.op parameter controls the default operator, and the df parameter
controls the default field.  These can be set in the request handler
definition in solrconfig.xml -- usually in "defaults" but there might be
reason to put them in "invariants" instead.

If you're using edismax, you'd be better off using the mm parameter
rather than the q.op parameter.  The behavior you have described above
sounds like a change in behavior (some call it a bug) introduced in the
5.5 version:


https://issues.apache.org/jira/browse/SOLR-8812


If you are using edismax, I suspect that if you set mm=100% instead of
q.op=AND (or the schema default operator) that the problem might go away
... but I am not sure.  Someone who is more familiar with SOLR-8812
probably should comment.
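Shawn's suggestion, expressed as request-handler defaults rather than schema settings, might be sketched like this (the handler name and default field are assumptions, not from the thread):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="df">title</str>
    <!-- With edismax, prefer mm over q.op; mm=100% requires all clauses
         to match unless explicit operators change it (see SOLR-8812). -->
    <str name="mm">100%</str>
  </lst>
</requestHandler>
```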

Thanks,
Shawn





Re: How can I set the defaultOperator to be AND?

2016-08-05 Thread Bastien Latard | MDPI AG

Hi Steve,

I read the thread you sent me (SOLR-8812) and it seems that the 6.1 
includes this fix, as you said.

I will upgrade.
Thank you!

Kind regards,
Bast

On 05/08/2016 14:37, Steve Rowe wrote:

Hi Bastien,

Have you tried upgrading to 6.1?  SOLR-8812, mentioned earlier in the thread, 
was released with 6.1, and is directly aimed at fixing the problem you are 
having in 6.0 (also a problem in 5.5): when mm is not explicitly provided and 
the query contains explicit operators (except for AND), edismax now sets mm=0.

--
Steve
www.lucidworks.com


On Aug 5, 2016, at 2:34 AM, Bastien Latard | MDPI AG <lat...@mdpi.com.INVALID> 
wrote:

Hi Eric & others,
Is there any way to overwrite the default OP when we use edismax?
Because adding the following line to solrconfig.xml doesn't solve the problem:


(Then if I do "q=black OR white", this always gives the results for "black AND 
white")

I did not find a way to define a default OP, which is automatically overwritten 
by the AND/OR from a query.


Example - Debug: defaultOP in solrconfig = AND / q=a or b


==> results for black AND white
The correct result should be the following (but I had to force the q.op):

==> I cannot do this in case I want to do "(a AND b) OR c"...


Kind regards,
Bastien

On 27/04/2016 05:30, Erick Erickson wrote:

Defaulting to "OR" has been the behavior since forever, so changing the behavior now is 
just not going to happen. Making it fit a new version of "correct" will change the 
behavior for every application out there that has not specified the default behavior.

There's no a-priori reason to expect "more words to equal fewer docs", I can just as 
easily argue that "more words should return more docs". Which you expect depends on your 
mental model.

And providing the default op in your solrconfig.xml request handlers allows you 
to implement whatever model your application chooses...

Best,
Erick

On Mon, Apr 25, 2016 at 11:32 PM, Bastien Latard - MDPI AG 
<lat...@mdpi.com.invalid> wrote:
Thank you Shawn, Jan and Georg for your answers.

Yes, it seems that if I simply remove the defaultOperator it works well for 
"composed queries" like '(a:x AND b:y) OR c:z'.
But I think that the default Operator should/could be the AND.

Because when I add an extra search word, I expect that the results get more 
accurate...
(It seems to be what google is also doing now)

Otherwise, if you make a search and apply another filter (e.g.: sort by 
publication date, facets, ...), the user can get the least relevant item 
(only 1 word in 4 matches) in first position, only because of its date...

What do you think?


Kind regards,
Bastien


On 25/04/2016 14:53, Shawn Heisey wrote:

On 4/25/2016 6:39 AM, Bastien Latard - MDPI AG wrote:


Remember:
If I add the following line to the schema.xml, even if I do a search
'title:"test" OR author:"me"', it will return documents matching
'title:"test" AND author:"me"':



The settings in the schema for default field and default operator were
deprecated a long time ago.  I actually have no idea whether they are
even supported in newer Solr versions.

The q.op parameter controls the default operator, and the df parameter
controls the default field.  These can be set in the request handler
definition in solrconfig.xml -- usually in "defaults" but there might be
reason to put them in "invariants" instead.

If you're using edismax, you'd be better off using the mm parameter
rather than the q.op parameter.  The behavior you have described above
sounds like a change in behavior (some call it a bug) introduced in the
5.5 version:


https://issues.apache.org/jira/browse/SOLR-8812


If you are using edismax, I suspect that if you set mm=100% instead of
q.op=AND (or the schema default operator) that the problem might go away
... but I am not sure.  Someone who is more familiar with SOLR-8812
probably should comment.

Thanks,
Shawn





Re: URL parameters combined with text param

2016-05-13 Thread Bastien Latard - MDPI AG

Thanks both!

I already tried "debugQuery=true", but it doesn't tell me that much... Or 
at least, I don't see any problem...

Below are the responses...

1. /select?q=hospital AND_query_:"{!q.op=AND 
v=$a}"=abstract,title=hospital Leapfrog=true




0
280

hospital AND_query_:"{!q.op=AND v=$a}"
hospital Leapfrog
true
abstract,title




hospital AND_query_:"{!q.op=AND v=$a}"
hospital AND_query_:"{!q.op=AND v=$a}"
(+(DisjunctionMaxQuery((abstract:hospit | 
title:hospit | authors:hospital | doi:hospital)) 
DisjunctionMaxQuery(((Synonym(abstract:and abstract:andqueri) 
abstract:queri) | (Synonym(title:and title:andqueri) title:queri) | 
(Synonym(authors:and authors:andquery) authors:query) | 
doi:and_query_:)) DisjunctionMaxQuery((abstract:"(q qopand) op and (v 
va) a" | title:"(q qopand) op and (v va) a" | authors:"(q qopand) op and 
(v va) a" | doi:"{!q.op=and v=$a}"/no_coord
+((abstract:hospit | title:hospit 
| authors:hospital | doi:hospital) ((Synonym(abstract:and 
abstract:andqueri) abstract:queri) | (Synonym(title:and title:andqueri) 
title:queri) | (Synonym(authors:and authors:andquery) authors:query) | 
doi:and_query_:) (abstract:"(q qopand) op and (v va) a" | title:"(q 
qopand) op and (v va) a" | authors:"(q qopand) op and (v va) a" | 
doi:"{!q.op=and v=$a}")


ExtendedDismaxQParser





   [...]





2. /select?q=_query_:"{!q.op=AND v='hospital'}"+_query_:"{!q.op=AND 
v=$a}"=hospital Leapfrog=true




  0
  2
  
_query_:"{!q.op=AND v='hospital'}" 
_query_:"{!q.op=AND v=$a}"

hospital Leapfrog
true
true
  




  _query_:"{!q.op=AND v='hospital'}" 
_query_:"{!q.op=AND v=$a}"
  _query_:"{!q.op=AND v='hospital'}" 
_query_:"{!q.op=AND v=$a}"

  (+())/no_coord
  +()
  
  ExtendedDismaxQParser
  
  
  
  
  
[...]
  



On 12/05/2016 17:06, Erick Erickson wrote:

Try adding debug=query to your query and look at the parsed results.
This shows you exactly what Solr sees rather than what you think
it should.

Best,
Erick

On Thu, May 12, 2016 at 6:24 AM, Ahmet Arslan <iori...@yahoo.com.invalid> wrote:

Hi,

Well, what happens

q=hospital={!lucene q.op=AND v=$a}=hospital Leapfrog

OR

q=+_query_:"{!lucene q.op=AND v='hospital'}" +_query_:"{!lucene q.op=AND 
v=$a}"=hospital Leapfrog


Ahmet


On Thursday, May 12, 2016 3:28 PM, Bastien Latard - MDPI AG 
<lat...@mdpi.com.INVALID> wrote:
Hi Ahmet,

Thanks for your answer, but this doesn't work on my local index.
q1 returns 2 results.

http://localhost:8983/solr/my_core/select?q=hospital AND
_query_:"{!q.op=AND%20v=$a}"=abstract,title=hospital Leapfrog
==> returns 254 results (the same as
http://localhost:8983/solr/my_core/select?q=hospital )

Kind regards,
Bastien

On 11/05/2016 16:06, Ahmet Arslan wrote:

Hi Bastien,

Please use magic _query_ field, q=hospital AND _query_:"{!q.op=AND v=$a}"

ahmet


On Wednesday, May 11, 2016 2:35 PM, Latard - MDPI AG <lat...@mdpi.com.INVALID> 
wrote:
Hi Everybody,

Is there a way to pass only some of the data by reference and some
others in the q param?

e.g.:

q1.   http://localhost:8983/solr/my_core/select?{!q.op=OR
v=$a}=abstract,title=hospital Leapfrog=true

q1a.  http://localhost:8983/solr/my_core/select?q=hospital AND
Leapfrog=abstract,title

q2.  http://localhost:8983/solr/my_core/select?q=hospital AND
({!q.op=AND v=$a})=abstract,title=hospital Leapfrog

q1 & q1a  are returning the same results, but q2 is somehow not
analyzing the $a parameter properly...

Am I missing anything?

Kind regards,
Bastien Latard
Web engineer


Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/





Re: URL parameters combined with text param

2016-05-12 Thread Bastien Latard - MDPI AG

Hi Ahmet,

Thanks for your answer, but this doesn't work on my local index.
q1 returns 2 results.

http://localhost:8983/solr/my_core/select?q=hospital AND 
_query_:"{!q.op=AND%20v=$a}"=abstract,title=hospital Leapfrog
==> returns 254 results (the same as 
http://localhost:8983/solr/my_core/select?q=hospital )


Kind regards,
Bastien

On 11/05/2016 16:06, Ahmet Arslan wrote:

Hi Bastien,

Please use magic _query_ field, q=hospital AND _query_:"{!q.op=AND v=$a}"

ahmet


On Wednesday, May 11, 2016 2:35 PM, Latard - MDPI AG  
wrote:
Hi Everybody,

Is there a way to pass only some of the data by reference and some
others in the q param?

e.g.:

q1.   http://localhost:8983/solr/my_core/select?{!q.op=OR
v=$a}=abstract,title=hospital Leapfrog=true

q1a.  http://localhost:8983/solr/my_core/select?q=hospital AND
Leapfrog=abstract,title

q2.  http://localhost:8983/solr/my_core/select?q=hospital AND
({!q.op=AND v=$a})=abstract,title=hospital Leapfrog

q1 & q1a  are returning the same results, but q2 is somehow not
analyzing the $a parameter properly...

Am I missing anything?

Kind regards,
Bastien Latard
Web engineer


Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



URL parameters combined with text param

2016-05-11 Thread Bastien Latard - MDPI AG

Hi Everybody,

Is there a way to pass only some of the data by reference and some 
others in the q param?


e.g.:

q1.   http://localhost:8983/solr/my_core/select?{!q.op=OR 
v=$a}=abstract,title=hospital Leapfrog=true


q1a.  http://localhost:8983/solr/my_core/select?q=hospital AND 
Leapfrog=abstract,title


q2.  http://localhost:8983/solr/my_core/select?q=hospital AND 
({!q.op=AND v=$a})=abstract,title=hospital Leapfrog


q1 & q1a  are returning the same results, but q2 is somehow not 
analyzing the $a parameter properly...


Am I missing anything?

Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



Re: fq behavior...

2016-05-08 Thread Bastien Latard - MDPI AG

Thank you guys!
I got it.

kr,
Bast


On 06/05/2016 17:27, Erick Erickson wrote:

From Yonik's blog:
"By default, Solr resolves all of the filters before the main query"

By definition, the non-cached fq clause _must_ be
executed over the entire data set in order to be
cached. Otherwise, how could the next query
that uses an identical fq clause make use of the
cached value?

If cache=false, it's  a different story as per Yonik's
blog.

On Fri, May 6, 2016 at 7:25 AM, Shawn Heisey <apa...@elyograg.org> wrote:

On 5/6/2016 12:07 AM, Bastien Latard - MDPI AG wrote:

Thank you Susmit, so the answer is:
fq queries are by default run before the main query.

Queries in fq parameters are normally executed in parallel with the main
query, unless they are a postfilter.  I am not sure that the standard
parser supports being run as a postfilter.  Some parsers (like geofilt)
do support that.

Susmit already gave you this link where some of that is explained:

http://yonik.com/advanced-filter-caching-in-solr/

Thanks,
Shawn



Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



Re: fq behavior...

2016-05-06 Thread Bastien Latard - MDPI AG

Thank you Susmit, so the answer is:
fq queries are by default run before the main query.

kr,
Bast

On 06/05/2016 07:57, Susmit Shukla wrote:

Please take a look at this blog, specifically "Leapfrog Anyone?" section-
http://yonik.com/advanced-filter-caching-in-solr/

Thanks,
Susmit

On Thu, May 5, 2016 at 10:54 PM, Bastien Latard - MDPI AG <
lat...@mdpi.com.invalid> wrote:


Hi guys,

Just a quick question that I did not find an easy answer to.

1.

Is the fq "executed" before or after the usual query (q)?

e.g.: select?q=title:"something really specific"&fq=bPublic:true&rows=10

Would it first:

  * get all the "specific" results, and then apply the filter
  * OR is it first getting all the docs matching the fq and then
running the "q" query

In other words, does it first check for "the best cardinality"?

Kind regards,
Bastien




Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



fq behavior...

2016-05-05 Thread Bastien Latard - MDPI AG

Hi guys,

Just a quick question that I did not find an easy answer to.

1.

   Is the fq "executed" before or after the usual query (q)?

   e.g.: select?q=title:"something really specific"&fq=bPublic:true&rows=10

   Would it first:

 * get all the "specific" results, and then apply the filter
 * OR is it first getting all the docs matching the fq and then
   running the "q" query

In other words, does it first check for "the best cardinality"?

Kind regards,
Bastien



Re: OOM script executed

2016-05-05 Thread Bastien Latard - MDPI AG

Thank you Shawn!

So if I run the two following requests, it will only store that 7.5 MB 
bitset once, right?

- select?q=*:*&fq=bPublic:true&rows=10
- select?q=field:my_search&fq=bPublic:true&rows=10

kr,

Bast

On 04/05/2016 16:22, Shawn Heisey wrote:

On 5/3/2016 11:58 PM, Bastien Latard - MDPI AG wrote:

Thank you for your email.
You said "have big caches or request big pages (e.g. 100k docs)"...
Does a fq cache all the potential results, or only the ones the query
returns?
e.g.: select?q=*:*&fq=bPublic:true&rows=10

=> with this query, if I have 60 million public documents, would
it cache 10 IDs or 60 million?
...and is the filter cache (from fq) held in the OS cache or
in the Java heap?

The result of a filter query is a bitset.  If the core contains 60
million documents, each bitset is 7.5 million bytes in length.  It is
not a list of IDs -- it's a large array of bits representing every
document in the Lucene index, including deleted documents (the Max Doc
value from the core overview).  There are two values for each bit - 0 or
1, depending on whether each document matches the filter or not.

Thanks,
Shawn




Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



Re: OOM script executed

2016-05-03 Thread Bastien Latard - MDPI AG

Hi Tomás,

Thank you for your email.
You said "have big caches or request big pages (e.g. 100k docs)"...
Does a fq cache all the potential results, or only the ones the query 
returns?

e.g.: select?q=*:*&fq=bPublic:true&rows=10

=> with this query, if I have 60 million public documents, would it 
cache 10 IDs or 60 million?
...and is the filter cache (from fq) held in the OS cache or in the 
Java heap?


kr,
Bastien

On 04/05/2016 02:31, Tomás Fernández Löbbe wrote:

You could use some memory analyzer tools (e.g. jmap), that could give you a
hint. But if you are migrating, I'd start to see if you changed something
from the previous version, including jvm settings, schema/solrconfig.
If nothing is different, I'd try to identify which feature is consuming
more memory. If you use faceting/stats/suggester, or you have big caches or
request big pages (e.g. 100k docs) or use Solr Cell for extracting content,
those are some usual suspects. Try to narrow it down, it could be many
things. Turn on/off features as you look at the memory (you could use
something like jconsole/jvisualvm/jstat) and see when it spikes, compare
with the previous version. That's that I would do at least.

If you get to narrow it down to a specific feature, then you can come back
to the users list and ask with some more specifics, that way someone could
point you to the solution, or maybe file a JIRA if it turns out to be a bug.

Tomás

On Mon, May 2, 2016 at 11:34 PM, Bastien Latard - MDPI AG <
lat...@mdpi.com.invalid> wrote:


Hi Tomás,

Thanks for your answer.
How could I see what's using memory?
I tried to add "-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/solr/logs/OOM_Heap_dump/"
...but this doesn't seem to be really helpful...

Kind regards,
Bastien


On 02/05/2016 22:55, Tomás Fernández Löbbe wrote:


You could, but before that I'd try to see what's using your memory and see
if you can decrease that. Maybe identify why you are running OOM now and
not with your previous Solr version (assuming you weren't, and that you
are
running with the same JVM settings). A bigger heap usually means more work
to the GC and less memory available for the OS cache.

Tomás

On Sun, May 1, 2016 at 11:20 PM, Bastien Latard - MDPI AG <
lat...@mdpi.com.invalid> wrote:

Hi Guys,

I got several times the OOM script executed since I upgraded to Solr6.0:

$ cat solr_oom_killer-8983-2016-04-29_15_16_51.log
Running OOM killer script for process 26044 for Solr on port 8983

Does it mean that I need to increase my JAVA Heap?
Or should I do anything else?

Here are some further logs:
$ cat solr_gc_log_20160502_0730:
}
{Heap before GC invocations=1674 (full 91):
   par new generation   total 1747648K, used 1747135K [0x0005c000,
0x00064000, 0x00064000)
eden space 1398144K, 100% used [0x0005c000,
0x00061556,
0x00061556)
from space 349504K,  99% used [0x00061556, 0x00062aa2fc30,
0x00062aab)
to   space 349504K,   0% used [0x00062aab, 0x00062aab,
0x00064000)
   concurrent mark-sweep generation total 6291456K, used 6291455K
[0x00064000, 0x0007c000, 0x0007c000)
   Metaspace   used 39845K, capacity 40346K, committed 40704K,
reserved
1085440K
class spaceused 4142K, capacity 4273K, committed 4368K, reserved
1048576K
2016-04-29T21:15:41.970+0200: 20356.359: [Full GC (Allocation Failure)
2016-04-29T21:15:41.970+0200: 20356.359: [CMS:
6291455K->6291456K(6291456K), 12.5694653 secs]
8038591K->8038590K(8039104K), [Metaspace: 39845K->39845K(1085440K)],
12.5695497 secs] [Times: user=12.57 sys=0.00, real=12.57 secs]


Kind regards,
Bastien




Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/




Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



Re: OOM script executed

2016-05-03 Thread Bastien Latard - MDPI AG

Hi Tomás,

Thanks for your answer.
How could I see what's using memory?
I tried to add "-XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/var/solr/logs/OOM_Heap_dump/"

...but this doesn't seem to be really helpful...

Kind regards,
Bastien

On 02/05/2016 22:55, Tomás Fernández Löbbe wrote:

You could, but before that I'd try to see what's using your memory and see
if you can decrease that. Maybe identify why you are running OOM now and
not with your previous Solr version (assuming you weren't, and that you are
running with the same JVM settings). A bigger heap usually means more work
to the GC and less memory available for the OS cache.

Tomás

On Sun, May 1, 2016 at 11:20 PM, Bastien Latard - MDPI AG <
lat...@mdpi.com.invalid> wrote:


Hi Guys,

I got several times the OOM script executed since I upgraded to Solr6.0:

$ cat solr_oom_killer-8983-2016-04-29_15_16_51.log
Running OOM killer script for process 26044 for Solr on port 8983

Does it mean that I need to increase my JAVA Heap?
Or should I do anything else?

Here are some further logs:
$ cat solr_gc_log_20160502_0730:
}
{Heap before GC invocations=1674 (full 91):
  par new generation   total 1747648K, used 1747135K [0x0005c000,
0x00064000, 0x00064000)
   eden space 1398144K, 100% used [0x0005c000, 0x00061556,
0x00061556)
   from space 349504K,  99% used [0x00061556, 0x00062aa2fc30,
0x00062aab)
   to   space 349504K,   0% used [0x00062aab, 0x00062aab,
0x00064000)
  concurrent mark-sweep generation total 6291456K, used 6291455K
[0x00064000, 0x0007c000, 0x0007c000)
  Metaspace   used 39845K, capacity 40346K, committed 40704K, reserved
1085440K
   class spaceused 4142K, capacity 4273K, committed 4368K, reserved
1048576K
2016-04-29T21:15:41.970+0200: 20356.359: [Full GC (Allocation Failure)
2016-04-29T21:15:41.970+0200: 20356.359: [CMS:
6291455K->6291456K(6291456K), 12.5694653 secs]
8038591K->8038590K(8039104K), [Metaspace: 39845K->39845K(1085440K)],
12.5695497 secs] [Times: user=12.57 sys=0.00, real=12.57 secs]


Kind regards,
Bastien




Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



What does "Max Doc" mean in the Admin interface?

2016-05-02 Thread Bastien Latard - MDPI AG

Hi All,

Everything is in the title...


Can this value be modified?
Or is it because of my environment?

Also, what does "Heap Memory Usage: -1" mean?

Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



OOM script executed

2016-05-02 Thread Bastien Latard - MDPI AG

Hi Guys,

I got several times the OOM script executed since I upgraded to Solr6.0:

$ cat solr_oom_killer-8983-2016-04-29_15_16_51.log
Running OOM killer script for process 26044 for Solr on port 8983

Does it mean that I need to increase my JAVA Heap?
Or should I do anything else?

Here are some further logs:
$ cat solr_gc_log_20160502_0730:
}
{Heap before GC invocations=1674 (full 91):
 par new generation   total 1747648K, used 1747135K 
[0x0005c000, 0x00064000, 0x00064000)
  eden space 1398144K, 100% used [0x0005c000, 
0x00061556, 0x00061556)
  from space 349504K,  99% used [0x00061556, 
0x00062aa2fc30, 0x00062aab)
  to   space 349504K,   0% used [0x00062aab, 
0x00062aab, 0x00064000)
 concurrent mark-sweep generation total 6291456K, used 6291455K 
[0x00064000, 0x0007c000, 0x0007c000)
 Metaspace   used 39845K, capacity 40346K, committed 40704K, 
reserved 1085440K
  class spaceused 4142K, capacity 4273K, committed 4368K, reserved 
1048576K
2016-04-29T21:15:41.970+0200: 20356.359: [Full GC (Allocation Failure) 
2016-04-29T21:15:41.970+0200: 20356.359: [CMS: 
6291455K->6291456K(6291456K), 12.5694653 secs] 
8038591K->8038590K(8039104K), [Metaspace: 39845K->39845K(1085440K)], 
12.5695497 secs] [Times: user=12.57 sys=0.00, real=12.57 secs]
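The log shows the CMS old generation (6291456K, roughly 6 GB) completely full at the failed allocation, so the OOM killer firing is consistent with an undersized heap. If more heap is genuinely needed (rather than trimming caches), it is set in solr.in.sh; the path and value below are illustrative assumptions, not recommendations:

```shell
# /etc/default/solr.in.sh (location varies by install)
# Raise the total heap; Solr's start script passes this as -Xms/-Xmx.
SOLR_HEAP="10g"
# Keep heap-dump flags (as already used earlier in this thread) for diagnosis:
GC_TUNE="$GC_TUNE -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/solr/logs/OOM_Heap_dump/"
```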



Kind regards,
Bastien



Re: How can I set the defaultOperator to be AND?

2016-04-26 Thread Bastien Latard - MDPI AG

Thank you Erick.
You're fully right that it can be an expected behavior to get more docs 
with more words...why not...


However, when I set the default OP to "AND" in solrconfig.xml, then a 
simple query "q=a OR b" doesn't work as expected... as described in the 
previous email:
-> a search 'title:"test" OR author:"me"' will return documents 
matching 'title:"test" AND author:"me"'


Kind regards,
Bastien

On 27/04/2016 05:30, Erick Erickson wrote:
Defaulting to "OR" has been the behavior since forever, so changing 
the behavior now is just not going to happen. Making it fit a new 
version of "correct" will change the behavior for every application 
out there that has not specified the default behavior.


There's no a-priori reason to expect "more words to equal fewer docs", 
I can just as easily argue that "more words should return more docs". 
Which you expect depends on your mental model.


And providing the default op in your solrconfig.xml request handlers 
allows you to implement whatever model your application chooses...


Best,
Erick

On Mon, Apr 25, 2016 at 11:32 PM, Bastien Latard - MDPI AG 
<lat...@mdpi.com.invalid> wrote:


Thank you Shawn, Jan and Georg for your answers.

Yes, it seems that if I simply remove the defaultOperator it works
well for "composed queries" like '(a:x AND b:y) OR c:z'.
But I think that the default Operator should/could be the AND.

Because when I add an extra search word, I expect that the results
get more accurate...
(It seems to be what google is also doing now)

Otherwise, if you make a search and apply another filter (e.g.:
sort by publication date, facets, ...), a user can get the least
relevant item (only 1 word in 4 matches) in first position only
because of its date...

What do you think?


    Kind regards,
Bastien


On 25/04/2016 14:53, Shawn Heisey wrote:

On 4/25/2016 6:39 AM, Bastien Latard - MDPI AG wrote:

Remember:
If I add the following line to the schema.xml, even if I do a search
'title:"test" OR author:"me"', it will return documents matching
'title:"test" AND author:"me"':


The settings in the schema for default field and default operator were
deprecated a long time ago.  I actually have no idea whether they are
even supported in newer Solr versions.

The q.op parameter controls the default operator, and the df parameter
controls the default field.  These can be set in the request handler
definition in solrconfig.xml -- usually in "defaults" but there might be
reason to put them in "invariants" instead.

If you're using edismax, you'd be better off using the mm parameter
rather than the q.op parameter.  The behavior you have described above
sounds like a change in behavior (some call it a bug) introduced in the
5.5 version:

https://issues.apache.org/jira/browse/SOLR-8812

If you are using edismax, I suspect that if you set mm=100% instead of
q.op=AND (or the schema default operator) that the problem might go away
... but I am not sure.  Someone who is more familiar with SOLR-8812
probably should comment.

Thanks,
Shawn
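[Editor's note: a minimal sketch of what Shawn describes could look like this in solrconfig.xml; the handler name and df field are illustrative, not taken from the thread:]

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- with edismax, prefer mm=100% over q.op=AND (see SOLR-8812) -->
    <str name="mm">100%</str>
    <!-- default search field, replacing the deprecated schema setting -->
    <str name="df">title</str>
  </lst>
</requestHandler>
```

Parameters in "defaults" can still be overridden per request; moving them to "invariants" locks them down.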




Kind regards,
Bastien Latard
Web engineer
-- 
MDPI AG

Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35  Fax: +41 61 302 89 18  E-mail: lat...@mdpi.com
http://www.mdpi.com/




Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



'batching when indexing is good' -> some questions

2016-04-26 Thread Bastien Latard - MDPI AG

Hi Eric (Erickson) & others,

I read your post 'batching when indexing is good'.
But I also read another post, which recommends using batchSize="-1".


So I have now some questions:
- when you speak about 'Packet Size', are you speaking about batchSize?
- where can I define the Integer.MIN_VALUE used by setFetchSize() on 
the JDBC connection? (I use the MySQL JDBC driver)
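[Editor's note: a hedged sketch of where batchSize lives — a DIH data-config.xml dataSource; the URL and credentials are placeholders. With the MySQL driver, batchSize="-1" is what makes DIH call setFetchSize(Integer.MIN_VALUE) on the statement, i.e. row-by-row streaming; you don't set Integer.MIN_VALUE anywhere yourself:]

```xml
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/mydb"
            user="solr" password="secret"
            batchSize="-1"/>
```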


Kind regards,
Bastien


Re: How can I set the defaultOperator to be AND?

2016-04-26 Thread Bastien Latard - MDPI AG

Thank you Shawn, Jan and Georg for your answers.

Yes, it seems that if I simply remove the defaultOperator it works well 
for "composed queries" like '(a:x AND b:y) OR c:z'.

But I think that the default Operator should/could be the AND.

Because when I add an extra search word, I expect that the results get 
more accurate...

(It seems to be what google is also doing now)

Otherwise, if you make a search and apply another filter (e.g.: sort by 
publication date, facets, ...), a user can get the least relevant item 
(only 1 word in 4 matches) in first position only because of its date...


What do you think?


Kind regards,
Bastien


On 25/04/2016 14:53, Shawn Heisey wrote:

On 4/25/2016 6:39 AM, Bastien Latard - MDPI AG wrote:

Remember:
If I add the following line to the schema.xml, even if I do a search
'title:"test" OR author:"me"', it will return documents matching
'title:"test" AND author:"me"':


The settings in the schema for default field and default operator were
deprecated a long time ago.  I actually have no idea whether they are
even supported in newer Solr versions.

The q.op parameter controls the default operator, and the df parameter
controls the default field.  These can be set in the request handler
definition in solrconfig.xml -- usually in "defaults" but there might be
reason to put them in "invariants" instead.

If you're using edismax, you'd be better off using the mm parameter
rather than the q.op parameter.  The behavior you have described above
sounds like a change in behavior (some call it a bug) introduced in the
5.5 version:

https://issues.apache.org/jira/browse/SOLR-8812

If you are using edismax, I suspect that if you set mm=100% instead of
q.op=AND (or the schema default operator) that the problem might go away
... but I am not sure.  Someone who is more familiar with SOLR-8812
probably should comment.

Thanks,
Shawn




Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



Re: How can I set the defaultOperator to be AND?

2016-04-25 Thread Bastien Latard - MDPI AG

Any news?

Remember:
If I add the following line to the schema.xml, even if I do a search 
'title:"test" OR author:"me"', it will return documents matching 
'title:"test" AND author:"me"':



kr,
Bast

On 22/04/2016 13:22, Bastien Latard - MDPI AG wrote:

Yes Jan, I'm using edismax.

This is (a part of) my requestHandler:


 
false
   explicit
   10
   title,abstract,authors,doi
   edismax
   title^1.0  author^1.0
[...]

Is there anything I should do to improve/fix it?

Kind regards,
Bastien

On 22/04/2016 12:42, Jan Høydahl wrote:

Hi

Which query parser are you using? If using edismax you may be hitting 
a recent bug concerning default operator and explicit boolean operators.


--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

22. apr. 2016 kl. 11.26 skrev Bastien Latard - MDPI AG 
<lat...@mdpi.com.INVALID>:


Hi guys,

How can I set the defaultOperator to be AND?
If I add the following line to the schema.xml, even if I do a search 
'title:"test" OR author:"me"', it will return documents matching 
'title:"test" AND author:"me"':



solr version: 6.0

I know that I can overwrite the query with q.op, but this is not 
that convenient...
I would need to write a complex query for a simple search '(a:x AND 
b:y) OR c:z'


Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/





Kind regards,
Bastien Latard
Web engineer


Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



Re: How can I set the defaultOperator to be AND?

2016-04-22 Thread Bastien Latard - MDPI AG

Yes Jan, I'm using edismax.

This is (a part of) my requestHandler:


 
false
   explicit
   10
   title,abstract,authors,doi
   edismax
   title^1.0  author^1.0
[...]

Is there anything I should do to improve/fix it?

Kind regards,
Bastien

On 22/04/2016 12:42, Jan Høydahl wrote:

Hi

Which query parser are you using? If using edismax you may be hitting a recent 
bug concerning default operator and explicit boolean operators.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com


22. apr. 2016 kl. 11.26 skrev Bastien Latard - MDPI AG 
<lat...@mdpi.com.INVALID>:

Hi guys,

How can I set the defaultOperator to be AND?
If I add the following line to the schema.xml, even if I do a search 'title:"test" OR author:"me"', 
it will return documents matching 'title:"test" AND author:"me"':


solr version: 6.0

I know that I can overwrite the query with q.op, but this is not that 
convenient...
I would need to write a complex query for a simple search '(a:x AND b:y) OR c:z'

Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/





Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



How can I set the defaultOperator to be AND?

2016-04-22 Thread Bastien Latard - MDPI AG

Hi guys,

How can I set the defaultOperator to be AND?
If I add the following line to the schema.xml, even if I do a search 
'title:"test" OR author:"me"', it will return documents matching 
'title:"test" AND author:"me"':



solr version: 6.0

I know that I can overwrite the query with q.op, but this is not that 
convenient...
I would need to write a complex query for a simple search '(a:x AND b:y) 
OR c:z'


Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



Re: Can a field be an array of fields?

2016-04-19 Thread Bastien Latard - MDPI AG

Thank you Jack and Daniel, I somehow missed your answers.

Yes, I already thought about the JSON possibility, but I was more 
concerned of having such structure in result:


"docs":[
   {
[...]
"authors_array":
 [  
[
"given_name":["Bastien"],
"last_name":["lastname1"]
 ],
[
"last_name":["lastname2"]
 ],
[
"given_name":["Matthieu"],
"last_name":["lastname2"]
 ],
[
"given_name":["Nino"],
"last_name":["lastname4"]
 ],
 ]
[...]


And being able to query like:
- q=authors_array.given_name:Nino
OR
- q=authors_array['given_name']:Nino

Is that possible?
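[Editor's note: not with flat multivalued fields — the flattening is exactly what loses the author grouping. A hedged sketch of the closest built-in alternative is to index each author as a nested (child) document and query with the block join parser; the id, doctype and field names below are assumptions, not from the thread:]

```xml
<add>
  <doc>
    <field name="id">article-1</field>
    <field name="doctype">article</field>
    <!-- each author becomes a child document with its own fields -->
    <doc>
      <field name="id">article-1-author-1</field>
      <field name="given_name">Nino</field>
      <field name="last_name">lastname4</field>
    </doc>
  </doc>
</add>
```

Articles whose authors match can then be found with a block join query such as q={!parent which="doctype:article"}given_name:Nino.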


Kind regards,
Bastien


On 15/04/2016 17:08, Jack Krupansky wrote:

It all depends on what your queries look like - what input data does your
application have and what data does it need to retrieve.

My recommendation is that you store first name and last name as separate,
multivalued fields if you indeed need to query by precisely a first or last
name, but also store the full name as a separate multivalued text field. If
you want to search by only first or last name, fine. If you want to search
by full name or wildcards, etc., you can use the full name field, using
phrase query. You can use an update request processor to combine first and
last name into that third field. You could also store the full name in a
fourth field as raw JSON if you really need structure in the result. The
third field might have first and last name with a special separator such as
"|", although a simple comma is typically sufficient.


-- Jack Krupansky

On Fri, Apr 15, 2016 at 10:58 AM, Davis, Daniel (NIH/NLM) [C] <
daniel.da...@nih.gov> wrote:


Short answer - JOINs, external query outside Solr, Elastic Search ;)
Alternatives:
   * You get back an id for each document when you query on "Nino".   You
look up the last names in some other system that has the full list.
   * You index the authors in another collection and use JOINs
   * You store the author_array as formatted, escaped JSON, stored, but not
indexed (or analyzed).   When you get the data back, you navigate the JSON
to the author_array, get the value, and parse that value as JSON.   Now you
have the full list.
   * This is a sweet spot for Elastic Search, to be perfectly honest.

-Original Message-
From: Bastien Latard - MDPI AG [mailto:lat...@mdpi.com.INVALID]
Sent: Friday, April 15, 2016 7:52 AM
To: solr-user@lucene.apache.org
Subject: Can a field be an array of fields?

Hi everybody!

/I described a bit what I found in another thread, but I prefer to create
a new thread for this specific question.../ *It's **possible to create an
array of string by doing (incomplete example):
- in the data-conf.xml:*


 
   
   
   
   
 



*- in schema.xml:
*




And this provides something like:

"docs":[
{
[...]
 "given_name":["Bastien",  "Matthieu",  "Nino"],
 "last_name":["lastname1", "lastname2",
  "lastname3",   "lastname4"],

[...]


*Note: there can be one author with only a last_name, and then we are
unable to tell which one it is...*

My goal would be to get this as a result:

"docs":[
{
[...]
 "authors_array":
  [
 [
 "given_name":["Bastien"],
 "last_name":["lastname1"]
  ],
 [
 "last_name":["lastname2"]
  ],
 [
 "given_name":["Matthieu"],
 "last_name":["lastname2"]
  ],
 [
 "given_name":["Nino"],
 "last_name":["lastname4"]
  ],
  ]
[...]


Is there any way to do this?
PS: I know that I could do 'select if(a.given_name is not null, 
a.given_name, '') as given_name, [...]' but I would like to get an 
array...

I tried to add something like that to the schema.xml, but this doesn't
work (well, it might be of type 'array'):


Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/




Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



Denormalization and data retrieval

2016-04-19 Thread Bastien Latard - MDPI AG

Hi,

What's the correct way to create index(es) using denormalization?

1. something like that?
   
   
   
   

OR even:
   


2. OR a different index for each SQL table?
 -> if yes, how can I then retrieve all the needed data (i.e.: 
intersection)?...JOIN/Streaming exp.?


I have more than 68 millions of articles, which are all linked to 1 
journal and 1 publisher...And I have 8 different services requesting the 
data (so I cannot really provide a specific use case, I'd like to know a 
more general answer).


But in general, would it be better/faster to query:
- a single normalized index with all the data at the same place (but 
larger index because of duplicated data)

- several indexes (smaller indexes, but need to make a solr "join")

I got good tips about using 'Streaming expressions' & 'Parallel SQL 
interface', but I first want to know the best way to store the data.


Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



Re: Solr best practices for many to many relations...

2016-04-18 Thread Bastien Latard - MDPI AG
[...] need to do something that SQL doesn't yet support, you should 
check out the Streaming Expressions to see if it can support it.

With these you could store your data in separate collections (or the 
same collection with different docType field values) and then during 
search perform a join (inner, outer, hash) across the collections. You 
could, if you wanted, even join with data NOT in solr using the jdbc 
streaming function.

- Dennis Gove


On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG 
<lat...@mdpi.com.invalid> wrote:


'would I then be able to query a specific field of articles or other 
"table" (with the same OR BETTER performances)?'
-> And especially, would I be able to get only 1 article in the result...

On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:

Thanks Jack.

I know that Solr is a search engine, but this replaces a search in my 
mysql DB with this model:

My goal is to improve my environment (and my performances at the same 
time).

Yes, I have a Solr data model... but atm I created 4 different indexes 
for "similar service usage".
So atm, for 70 millions of documents, I am duplicating journal data and 
publisher data all the time in 1 index (for all articles from the same 
journal/pub) in order to be able to retrieve all data in 1 query...

I found yesterday that there is the possibility to create like an array 
of entities in the data-conf.xml.
e.g. (pseudo code - incomplete):

<entity [...] query="select [...] from publishers">
  <entity [...] query="select [...] from journals WHERE publisher_id='${solr_publisher.id}'">
    <entity [...] query="select [...] from articles WHERE journal_id='${solr_journal.id}'">
      <entity [...] query="select [...] from authors WHERE article_id='${solr_article.id}'">

Would this be a good option? Is this the denormalization you were 
proposing?

If yes, would I then be able to query a specific field of articles or 
other "table" (with the same OR BETTER performances)? If yes, I might 
probably merge all the different indexes together.

I'm currently joining everything in mysql, so duplicating the fields in 
the solr (pseudo code):

<entity [...] query="select [...] from article [...] join journal on [...]">

So I have an index for authors query, a general one for articles (only 
needed info of other tables)...

Thanks in advance for the tips. :)

Kind regards,
Bastien

On 14/04/2016 16:23, Jack Krupansky wrote:

Solr is a search engine, not a database.

JOINs? Although Solr does have some limited JOIN capabilities, they are 
more for special situations, not the front-line go-to technique for 
data modeling for search.

Rather, denormalization is the front-line go-to technique for data 
modeling in Solr.

In any case, the first step in data modeling is always to focus on your 
queries - what information will be coming into your apps and what 
information will the apps want to access based on those inputs.

But wait... you say you are upgrading, which suggests that you have an 
existing Solr data model, and probably queries as well. So...

1. Share at least a summary of your existing Solr data model as well as 
at least a summary of the kinds of queries you perform today.
2. Tell us what exactly is driving your inquiry - are queries too slow, 
too cumbersome, not sufficiently powerful, or... what exactly is the 
problem you need to solve.

-- Jack Krupansky

On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG 
<lat...@mdpi.com.invalid> wrote:

Hi Guys,

I am upgrading from solr 4.2 to 6.0.
I successfully (after some time) migrated the config files and other 
parameters...

Now I'm just wondering if my indexes are following the best 
practices... (and they are probably not :-) )

What would be the best if we have this kind of sql data to write in 
Solr:

I have several different services which need (more or less) different 
data based on these JOINs...

e.g.:
Service A needs lots of data (but not all),
Service B needs a few data (some fields already included in A),
Service C needs a bit more data than B (some fields already included in 
A/B)...

1. Would it be better to create one single index?
-> i.e.: this will duplicate journal info for every single article

2. Would it be better to create several specific indexes for each 
similar services?
-> i.e.: this will use more space on the disks (and there are 
~70 millions of documents to join)

3. Would it be better to create an index per table and make a join?
-> if yes, how??

Kind regards,
Bastien



Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail: latard@mdpi.com  http://www.mdpi.com/


Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail: latard@mdpi.com  http://www.mdpi.com/




Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



Can a field be an array of fields?

2016-04-15 Thread Bastien Latard - MDPI AG

The same email, but with formatting...
(email below)

 Forwarded Message 
Subject:Can a field be an array of fields?
Date:   Fri, 15 Apr 2016 13:51:48 +0200
From:   Bastien Latard - MDPI AG <lat...@mdpi.com>
To: solr-user@lucene.apache.org



Hi everybody!

/I described a bit what I found in another thread, but I prefer to 
create a new thread for this specific question.../

*It's **possible to create an array of string by doing (incomplete example):
- in the data-conf.xml:*


   
 
 
 
 
   



*- in schema.xml:
*required="false" multiValued="true" />
required="false" multiValued="true" />
required="false" multiValued="true" />
required="false" multiValued="true" />


And this provides something like:

"docs":[
  {
[...]
"given_name":["Bastien",  "Matthieu",  "Nino"],
"last_name":["lastname1", "lastname2", "lastname3",   
"lastname4"],

[...]


*Note: there can be one author with only a last_name, and then we are 
unable to tell which one it is...*


My goal would be to get this as a result:

"docs":[
  {
[...]
   "authors_array":
[   
[
"given_name":["Bastien"],
"last_name":["lastname1"]
],
[
"last_name":["lastname2"]
],
[
"given_name":["Matthieu"],
"last_name":["lastname2"]
],
[
"given_name":["Nino"],
"last_name":["lastname4"]
],
]
[...]


Is there any way to do this?
PS: I know that I could do 'select if(a.given_name is not null, 
a.given_name, '') as given_name, [...]' but I would like to get an 
array...


I tried to add something like that to the schema.xml, but this doesn't 
work (well, it might be of type 'array'):
required="false" multiValued="true"/>


Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/





Can a field be an array of fields?

2016-04-15 Thread Bastien Latard - MDPI AG

Hi everybody!

/I described a bit what I found in another thread, but I prefer to 
create a new thread for this specific question.../

*It's **possible to create an array of string by doing (incomplete example):
- in the data-conf.xml:*


   
 
 
 
 
   



*- in schema.xml:
*required="false" multiValued="true" />
required="false" multiValued="true" />
required="false" multiValued="true" />
required="false" multiValued="true" />


And this provides something like:

"docs":[
  {
[...]
"given_name":["Bastien",  "Matthieu",  "Nino"],
"last_name":["lastname1", "lastname2", "lastname3",   
"lastname4"],

[...]


*Note: there can be one author with only a last_name, and then we are 
unable to tell which one it is...*


My goal would be to get this as a result:

"docs":[
  {
[...]
   "authors_array":
[   
[
"given_name":["Bastien"],
"last_name":["lastname1"]
],
[
"last_name":["lastname2"]
],
[
"given_name":["Matthieu"],
"last_name":["lastname2"]
],
[
"given_name":["Nino"],
"last_name":["lastname4"]
],
]
[...]


Is there any way to do this?
PS: I know that I could do 'select if(a.given_name is not null, 
a.given_name, '') as given_name, [...]' but I would like to get an 
array...


I tried to add something like that to the schema.xml, but this doesn't 
work (well, it might be of type 'array'):
required="false" multiValued="true"/>


Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



Re: Solr best practices for many to many relations...

2016-04-15 Thread Bastien Latard - MDPI AG
'would I then be able to query a specific field of articles or other 
"table" (with the same OR BETTER performances)?'

-> And especially, would I be able to get only 1 article in the result...

On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:

Thanks Jack.

I know that Solr is a search engine, but this replaces a search in my 
mysql DB with this model:



*My goal is to improve my environment (and my performances at the same 
time).*

Yes, I have a Solr data model... but atm I created 4 different 
indexes for "similar service usage".
So atm, for 70 millions of documents, I am duplicating journal data 
and publisher data all the time in 1 index (for all articles from the 
same journal/pub) in order to be able to retrieve all data in 1 query...


*I found yesterday that there is the possibility to create like an 
array of  in the data-conf.xml.*

e.g. (pseudo code - incomplete):





Would this be a good option? Is this the denormalization you were 
proposing?
If yes, would I then be able to query a specific field of articles 
or other "table" (with the same OR BETTER performances)?

If yes, I might probably merge all the different indexes together.
I'm currently joining everything in mysql, so duplicating the fields 
in the solr (pseudo code):

So I have an index for authors query, a general one for articles 
(only needed info of other tables)...

Thanks in advance for the tips. :)

Kind regards,
Bastien

On 14/04/2016 16:23, Jack Krupansky wrote:

Solr is a search engine, not a database.

JOINs? Although Solr does have some limited JOIN capabilities, they 
are more for special situations, not the front-line go-to technique 
for data modeling for search.


Rather, denormalization is the front-line go-to technique for data 
modeling in Solr.


In any case, the first step in data modeling is always to focus on 
your queries - what information will be coming into your apps and 
what information will the apps want to access based on those inputs.


But wait... you say you are upgrading, which suggests that you have 
an existing Solr data model, and probably queries as well. So...


1. Share at least a summary of your existing Solr data model as well 
as at least a summary of the kinds of queries you perform today.
2. Tell us what exactly is driving your inquiry - are queries too 
slow, too cumbersome, not sufficiently powerful, or... what exactly 
is the problem you need to solve.



-- Jack Krupansky

On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG 
<lat...@mdpi.com.invalid> wrote:


Hi Guys,

I am upgrading from solr 4.2 to 6.0.
I successfully (after some time) migrated the config files and
other parameters...

Now I'm just wondering if my indexes are following the best
practices...(and they are probably not :-) )

What would be the best if we have this kind of sql data to write
in Solr:


I have several different services which need (more or less),
different data based on these JOINs...

e.g.:
Service A needs lots of data (but not all),
Service B needs a few data (some fields already included in A),
Service C needs a bit more data than B(some fields already
included in A/B)...

1. Would it be better to create one single index?
-> i.e.: this will duplicate journal info for every single
article

2. Would it be better to create several specific indexes for
each similar services?
-> i.e.: this will use more space on the disks (and there are
~70 millions of documents to join)

3. Would it be better to create an index per table and make a join?
-> if yes, how??

Kind regards,
Bastien




Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/


Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



Re: Solr best practices for many to many relations...

2016-04-15 Thread Bastien Latard - MDPI AG

Thanks Jack.

I know that Solr is a search engine, but this replaces a search in my 
mysql DB with this model:



*My goal is to improve my environment (and my performances at the same 
time).*

Yes, I have a Solr data model... but atm I created 4 different indexes 
for "similar service usage".
So atm, for 70 millions of documents, I am duplicating journal data 
and publisher data all the time in 1 index (for all articles from the 
same journal/pub) in order to be able to retrieve all data in 1 query...


*I found yesterday that there is the possibility to create like an array 
of  in the data-conf.xml.*

e.g. (pseudo code - incomplete):





Would this be a good option? Is this the denormalization you were proposing?
If yes, would I then be able to query a specific field of articles or 
other "table" (with the same OR BETTER performances)?

If yes, I might probably merge all the different indexes together.
I'm currently joining everything in mysql, so duplicating the fields 
in the solr (pseudo code):

So I have an index for authors query, a general one for articles (only 
needed info of other tables)...

Thanks in advance for the tips. :)

Kind regards,
Bastien

On 14/04/2016 16:23, Jack Krupansky wrote:

Solr is a search engine, not a database.

JOINs? Although Solr does have some limited JOIN capabilities, they 
are more for special situations, not the front-line go-to technique 
for data modeling for search.


Rather, denormalization is the front-line go-to technique for data 
modeling in Solr.


In any case, the first step in data modeling is always to focus on 
your queries - what information will be coming into your apps and what 
information will the apps want to access based on those inputs.


But wait... you say you are upgrading, which suggests that you have an 
existing Solr data model, and probably queries as well. So...


1. Share at least a summary of your existing Solr data model as well 
as at least a summary of the kinds of queries you perform today.
2. Tell us what exactly is driving your inquiry - are queries too 
slow, too cumbersome, not sufficiently powerful, or... what exactly is 
the problem you need to solve.



-- Jack Krupansky

On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG 
<lat...@mdpi.com.invalid> wrote:


Hi Guys,

I am upgrading from solr 4.2 to 6.0.
I successfully (after some time) migrated the config files and
other parameters...

Now I'm just wondering if my indexes are following the best
practices...(and they are probably not :-) )

What would be the best if we have this kind of sql data to write
in Solr:


I have several different services which need (more or less),
different data based on these JOINs...

e.g.:
Service A needs lots of data (but not all),
Service B needs a few data (some fields already included in A),
Service C needs a bit more data than B(some fields already
included in A/B)...

1. Would it be better to create one single index?
-> i.e.: this will duplicate journal info for every single article

2. Would it be better to create several specific indexes for
each similar services?
-> i.e.: this will use more space on the disks (and there are
~70 millions of documents to join)

3. Would it be better to create an index per table and make a join?
-> if yes, how??

Kind regards,
Bastien




Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



Solr best practices for many to many relations...

2016-04-14 Thread Bastien Latard - MDPI AG

Hi Guys,

I am upgrading from solr 4.2 to 6.0.
I successfully (after some time) migrated the config files and other 
parameters...


Now I'm just wondering if my indexes are following the best 
practices...(and they are probably not :-) )


What would be the best if we have this kind of sql data to write in Solr:


I have several different services which need (more or less), different 
data based on these JOINs...


e.g.:
Service A needs lots of data (but not all),
Service B needs a few data (some fields already included in A),
Service C needs a bit more data than B(some fields already included in 
A/B)...


1. Would it be better to create one single index?
-> i.e.: this will duplicate journal info for every single article

2. Would it be better to create several specific indexes for each 
similar service?
-> i.e.: this will use more space on the disks (and there are 
~70 million documents to join)

3. Would it be better to create an index per table and make a join?
-> if yes, how?

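On option 3: plain Solr cannot JOIN the way SQL does, but it does offer a query-time join across cores on the same node. The join can only filter by the joined core, not return its fields, so it covers option 3 only partially. A hedged sketch, with hypothetical articles/journals cores linked by a journal_id field:

```text
# query against the articles core: keep only articles whose journal
# matches publisher:MDPI (core and field names are illustrative)
q={!join fromIndex=journals from=id to=journal_id}publisher:MDPI
```

Because the join cannot pull journal fields into article results, many installations end up denormalizing (option 1) and accept the duplicated journal info; disk space is usually the cheaper cost.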

Kind regards,
Bastien



Re: Cache problem

2016-04-13 Thread Bastien Latard - MDPI AG

Thank you all again for your good and detailed answers.
I will combine all of them to try to build a better environment.

Just a last question...
I don't remember exactly when I needed to increase the Java heap...
but is it possible that this was for the DataImport...

Would the DIH work if it cannot "load" the temporary index into the 
Java heap in full-import mode?
I thought that's why I needed to increase this value... but I might be 
confused!


kind regards,
Bastien

On 13/04/2016 09:54, Shawn Heisey wrote:

>Question #1:
> From the picture above, we see Physical memory: ~60Gb
>  -> is this because of -Xmx40960m AND -XX:MaxPermSize=20480m?

I don't actually know whether permgen is allocated from the heap, or *in
addition* to the heap.  Your current allocated heap size is 20GB, which
means that at most Java is taking up 30GB, but it might be just 20GB.
The other 30-40GB is used by the operating system -- for disk caching
(the page cache).  It's perfectly normal for physical memory to be
almost completely maxed out.  The physical memory graph is nearly
useless for troubleshooting.





Generic questions - increase performance

2016-04-13 Thread Bastien Latard - MDPI AG

Dear Folks, :-)

From this source, I read:
"Each incoming request requires a thread [...] If still more 
simultaneous requests (more than maxThreads) are received, they are 
stacked up inside the server socket"


I have a couple of generic questions.

1) How would an increase of maxThreads affect RAM usage?
e.g.: if I increase it by 2, would it use twice as much?
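For what it's worth, heap usage does not scale with maxThreads: each worker thread mainly costs one native stack (-Xss, commonly 512 KB to 1 MB), allocated outside the -Xmx heap. A rough back-of-the-envelope sketch (the 1 MB stack size is an assumption, not a measured value):

```python
def extra_thread_memory_mb(max_threads: int, stack_size_kb: int = 1024) -> float:
    """Rough native (non-heap) memory added by worker thread stacks.

    Assumes each thread costs about one -Xss stack (1 MB here, an
    assumption); real usage also includes per-request buffers, so
    treat this as a lower bound.
    """
    return max_threads * stack_size_kb / 1024

# Doubling maxThreads roughly doubles thread-stack memory, but it does
# NOT double the JVM heap: -Xmx stays whatever you set it to.
print(extra_thread_memory_mb(200))  # -> 200.0 (MB)
print(extra_thread_memory_mb(400))  # -> 400.0 (MB)
```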


2) What are the default values of maxThreads and maxConnections?
This post  says 
"maxConnections=10,000 and maxThreads=200"


3) Here is my config (/etc/tomcat7/server.xml):

Is there a way to kill the request if someone makes a big query (e.g.: 
50 seconds), but either close the connection or get a timeout after 5 
seconds? (Or is that the default behavior?)
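On the timeout question: Tomcat's connectionTimeout only covers idle connections, not long-running queries. On the Solr side, the usual tool is the timeAllowed search parameter, which aborts the search after a time budget and returns partial results. A sketch of setting it as a default in solrconfig.xml (the handler name and the 5000 ms value are illustrative):

```xml
<!-- solrconfig.xml: cap search time at 5 s for this handler.
     timeAllowed returns partial results instead of an error, and only
     bounds the main query phase, not every stage of the request. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <int name="timeAllowed">5000</int>
  </lst>
</requestHandler>
```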
Thanks!

Kind regards,
Bastien



Re: Cache problem

2016-04-13 Thread Bastien Latard - MDPI AG

Thank you Shawn & Reth!

So I have now some questions, again


Remind: I have only Solr running on this server (i.e.: java + tomcat).

BTW: I previously needed to increase the Java heap size because I ran 
out of memory. Actually, you only see 2 GB here (8 GB previously) for the 
JVM because I automatically restart Tomcat every 30 minutes, for better 
performance, if no DIH is running.

Question #1:
From the picture above, we see Physical memory: ~60Gb
  -> is this because of -Xmx40960m AND -XX:MaxPermSize=20480m?

Question #2:
"The OS caches the actual index files".

Does this mean that the OS will try to cache the 47.48 GB of this index?
(If not, how can I know the size of the cache?)
Or are you speaking about the page cache
<https://en.wikipedia.org/wiki/Page_cache>?

Question #3:
"documentCache does live in Java heap"
Is there a way to know the real size used/needed by this caching?

Thanks for your help.

Kind regards,
Bastien

On 13/04/2016 02:47, Shawn Heisey wrote:

On 4/12/2016 3:35 AM, Bastien Latard - MDPI AG wrote:

Thank you both, Bill and Reth!

Here is my current options from my command to launch java:
/usr/bin/java -Xms20480m -Xmx40960m -XX:PermSize=10240m
-XX:MaxPermSize=20480m [...]

So should I do -Xms20480m -Xmx20480m?
Why? What would it change?

You do *NOT* need a 10GB permsize.  That's a definite waste of memory --
most of it will never get used.  It's probably best to let Java handle
the permgen.  This generation is entirely eliminated in Java 8.  In Java
7, the permsize usually doesn't need adjusting ... but if it does, Solr
probably wouldn't even start without an adjustment.

Regarding something said in another reply on this thread:  The
documentCache *does* live in the Java heap, not the OS memory.  The OS
caches the actual index files, and documentCache is maintained by Solr
itself, separately from that.
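For reference, the documentCache is sized in entries (not bytes) in solrconfig.xml, so its heap cost is roughly entry count times average stored-document size; hit/eviction stats appear in the admin UI under Plugins / Stats > CACHE. A sketch with illustrative numbers:

```xml
<!-- solrconfig.xml: documentCache holds stored documents on the Java
     heap. size is an entry count; with large stored text fields, the
     memory cost per entry can be substantial. -->
<documentCache class="solr.LRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>
```

autowarmCount stays 0 because internal document IDs change on every new searcher, so this cache cannot be meaningfully autowarmed.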

It is highly unlikely that you will ever need a 40GB heap.  You might
not even need a 20GB heap.  As I said earlier:  Based on what I saw in
your screenshots, I think you can run with an 8g heap (-Xms8g -Xmx8g),
but you might need to try 12g instead.
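Following that advice, the launch line quoted above could shrink to something like the sketch below (heap sizes are the suggested starting point, not a measured value; `[...]` stands for the remaining unchanged options):

```text
# equal -Xms/-Xmx avoids heap resizing; no explicit permgen flags needed
/usr/bin/java -Xms8g -Xmx8g [...]
```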

Thanks,
Shawn







Re: Cache problem

2016-04-12 Thread Bastien Latard - MDPI AG

Thank you both, Bill and Reth!

Here is my current options from my command to launch java:
/usr/bin/java -Xms20480m -Xmx40960m -XX:PermSize=10240m 
-XX:MaxPermSize=20480m [...]


So should I do -Xms20480m -Xmx20480m?
Why? What would it change?

Reminder: the size of my main index is 46Gb... (80Gb all together)



BTW: what's the difference between dark and light grey in the JVM 
representation? (real/virtual memory?)



NOTE: I have only Tomcat running on this server (and this is my live 
website - i.e.: quite critical).


So if document cache is using the OS cache, this might be the problem, 
right?
(because it seems to cache every field ==> so all the data returned by 
the query)


kr,
Bast

On 12/04/2016 08:19, Reth RM wrote:

As per solr admin dashboard's memory report, solr jvm is not using memory
more than 20 gb, where as physical memory is almost full.  I'd set
xms=xmx=16 gb and let operating system use rest. And regarding caches:
  filter cache hit ratio looks good so it should not be concern. And afaik,
document cache actually uses OS cache. Overall, I'd reduce memory allocated
to jvm as said above and try.




On Mon, Apr 11, 2016 at 7:40 PM, <billnb...@gmail.com> wrote:


You do need to optimize to get rid of the deleted docs probably...

That is a lot of deleted docs
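For reference, a forced merge can be triggered through the update handler; expungeDeletes is a lighter option that only merges away segments containing deleted docs. Host and core name here are illustrative:

```text
# heavy: rewrite the whole index into a single segment
curl 'http://localhost:8983/solr/my_core/update?optimize=true'

# lighter: drop deleted docs during a commit without a full merge
curl 'http://localhost:8983/solr/my_core/update?commit=true&expungeDeletes=true'
```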

Bill Bell
Sent from mobile



On Apr 11, 2016, at 7:39 AM, Bastien Latard - MDPI AG
<lat...@mdpi.com.INVALID> wrote:

Dear Solr experts :),

I read this very interesting post 'Understanding and tuning your Solr
caches'!

This is the only good document that I was able to find after searching
for 1 day!

I was using Solr for 2 years without knowing in detail what it was
caching... (because I did not need to understand it before).
I had to take a look since I needed to restart my Tomcat (regularly) in
order to improve performance...

But I now have 2 questions:
1) How can I know how much RAM my Solr is really using (especially for
caching)?
2) Could you have a quick look at the following images and tell me if
I'm doing something wrong?

Note: my index contains 66 million articles with several text fields
stored.

My Solr contains several cores (all together ~80 GB big), but almost
only the one below is used.
I have the feeling that a lot of data is always kept in RAM... and
getting bigger and bigger all the time...




(after restart)
$ sudo tail -f /var/log/tomcat7/catalina.out | grep GC

[...] after a few minutes


Here are some images that can show you some stats about my Solr
performance...







Kind regards,
Bastien Latard



