RE: Debugging/scoring question

2018-05-23 Thread LOPEZ-CORTES Mariano-ext
Yes. This makes sense.

I guess you're talking about this doc:

https://lucene.apache.org/core/6_0_1/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

How can I decrease the effect of the IDF component in my query?

Thanks!!
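
For illustration, a minimal sketch of one way this could be done: a hypothetical custom
Similarity plugin that dampens BM25's idf. The class name and the square-root dampening
are assumptions, not something taken from this thread:

    import org.apache.lucene.search.similarities.BM25Similarity;

    // Hypothetical plugin: dampen the IDF factor of BM25 so rare terms dominate less.
    // It would be registered on a fieldType in the schema, e.g.:
    //   <similarity class="com.example.DampedIdfSimilarity"/>
    public class DampedIdfSimilarity extends BM25Similarity {
      @Override
      protected float idf(long docFreq, long docCount) {
        // square root of the standard BM25 idf flattens the differences between fields
        return (float) Math.sqrt(super.idf(docFreq, docCount));
      }
    }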

-----Original Message-----
From: Alessandro Benedetti [mailto:a.benede...@sease.io]
Sent: Wednesday, May 23, 2018 18:05
To: solr-user@lucene.apache.org
Subject: Re: Debugging/scoring question

Hi Mariano,
From the documentation:

docCount = total number of documents containing this field, in the range [1 .. 
{@link #maxDoc()}]

In your debug output the fields involved in the score computation are indeed different
(nomUsageE, prenomE).

Does this make sense ?

Cheers



-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
-----
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Debugging/scoring question

2018-05-23 Thread LOPEZ-CORTES Mariano-ext
Hi all

I have a 20-document collection. In the debug output, we have:

"100051":"
  20.794415 = max of:
    20.794415 = weight(nomUsageE:jean in 1) [SchemaSimilarity], result of:
      20.794415 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
        15.0 = boost
        1.3862944 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
          1.0 = docFreq
          5.0 = docCount
        1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.75 = parameter b
          1.0 = avgFieldLength
          1.0 = fieldLength

  "100053":"
  21.11246 = max of:
    21.11246 = weight(prenomE:jean in 3) [SchemaSimilarity], result of:
      21.11246 = score(doc=3,freq=1.0 = termFreq=1.0), product of:
        8.0 = boost
        2.6390574 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
          1.0 = docFreq
          20.0 = docCount
        1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.75 = parameter b
          1.0 = avgFieldLength
          1.0 = fieldLength

docCount = 5.0 for document 100051. Why? docCount is the total number of
documents, isn't it?
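
As a sanity check, the two idf values above can be reproduced from the formula shown in
the explain output (a throwaway snippet, not part of the original message):

    // Reproduces the idf values shown in the explain output above.
    public class IdfCheck {
      static double idf(double docFreq, double docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
      }
      public static void main(String[] args) {
        System.out.println(idf(1, 5));   // ~1.3862944 -> nomUsageE (only 5 documents have this field)
        System.out.println(idf(1, 20));  // ~2.6390574 -> prenomE (all 20 documents have this field)
      }
    }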

Thanks in advance!




Solr Dates TimeZone

2018-05-22 Thread LOPEZ-CORTES Mariano-ext
Hi

Is it possible to configure Solr with a time zone other than GMT?
Is it possible to configure the Solr Admin UI to display dates in a time zone other than
GMT?
What is the best way to store a birth date in Solr? We use the TrieDate type.
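
For what it's worth, a minimal SolrJ sketch of sending a birth date; the birthDate field
name is made up, and Solr itself stores date fields in UTC, so any time-zone handling
happens at display or query time:

    import java.time.LocalDate;
    import java.time.ZoneOffset;
    import java.util.Date;
    import org.apache.solr.common.SolrInputDocument;

    public class BirthDateDoc {
      static SolrInputDocument build(String id, LocalDate birthDate) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        // send the date as midnight UTC; the stored value is always UTC
        doc.addField("birthDate", Date.from(birthDate.atStartOfDay(ZoneOffset.UTC).toInstant()));
        return doc;
      }
    }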

Thanks!


Commit too slow?

2018-05-14 Thread LOPEZ-CORTES Mariano-ext
Hi

After injecting 200 documents into our Solr server, the commit operation at the end
of the process (using ConcurrentUpdateSolrClient) takes 10 minutes. Is that too slow?

Our auto-commit policy is the following:

 <autoCommit>
   <maxTime>15000</maxTime>
   <openSearcher>false</openSearcher>
 </autoCommit>
 <autoSoftCommit>
   <maxTime>15000</maxTime>
 </autoSoftCommit>
Thanks !



Solr doesn't import the whole data

2018-04-27 Thread LOPEZ-CORTES Mariano-ext
Hi

We've finished importing 40 million rows into a 3-node Solr cluster.

After injecting all the data via a Java program, we've noticed that the number of
documents was less than expected (in 10 rows).
No exception, no error.

Some config details:

   <autoCommit>
     <maxTime>15000</maxTime>
     <openSearcher>false</openSearcher>
   </autoCommit>
   <autoSoftCommit>
     <maxTime>15000</maxTime>
   </autoSoftCommit>

We have no commits in the client application.

Also, when checking via the Admin UI, we've noticed that the total number of rows in
Solr (numFound) increases slowly.

Is this normal behaviour? What's the problem?

Thanks!






Filter query question

2018-04-12 Thread LOPEZ-CORTES Mariano-ext
Hi

In our search application we have one facet filter (Status).

Each status value corresponds to multiple values in the Solr database.

Example: Status "Initialized" --> status in Solr = 11I, 12I, 13I, 14I, ...

When a status value is clicked, the search is re-fired with an fq filter:

fq: status:(11I OR 12I OR 13I)

This was very inefficient: the filter query response time was longer than the same
search without the filter!

We have changed the status values in the Solr database to match the visual filter
values, so there is no OR in the fq filter.
The performance is better now.
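
For reference, the two filter forms as a hypothetical SolrJ sketch; the {!terms}
variant is simply another way to express a many-valued filter, not something we
benchmarked in this thread:

    import org.apache.solr.client.solrj.SolrQuery;

    public class StatusFilters {
      static SolrQuery withOrFilter(String userQuery) {
        SolrQuery q = new SolrQuery(userQuery);
        q.addFilterQuery("status:(11I OR 12I OR 13I)");   // the multi-valued OR filter described above
        return q;
      }

      static SolrQuery withTermsFilter(String userQuery) {
        SolrQuery q = new SolrQuery(userQuery);
        q.addFilterQuery("{!terms f=status}11I,12I,13I"); // same set of values via the terms query parser
        return q;
      }
    }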

What is the reason?

Thanks!



RE: Question liste solr

2018-03-20 Thread LOPEZ-CORTES Mariano-ext
The CSV file is approx. 5GB for 29 million rows.

As you say, Christopher, at the beginning we thought that reading chunk by chunk
from Oracle and writing to Solr was the best strategy.

But from our tests we've noticed:

CSV creation via PL/SQL is really fast: 40 minutes for the full dataset
(with bulk collect).
Multiple SELECT calls from Java slow down the process. I think Oracle is the
bottleneck here.

Any other ideas/alternatives?

Some other points worth noting:

We are going to enable autoCommit every 10 minutes / 1 rows. No commit from the
client.
During indexing, we always go through a front-end load balancer that redirects
calls to the 3-node cluster.

Thanks in advance!!

==> Great mailing list and a really awesome tool!!

-----Original Message-----
From: Christopher Schultz [mailto:ch...@christopherschultz.net]
Sent: Monday, March 19, 2018 18:05
To: solr-user@lucene.apache.org
Subject: Re: Question liste solr


Mariano,

On 3/19/18 11:50 AM, LOPEZ-CORTES Mariano-ext wrote:
> Hello
> 
> We have a Solr index with 3 nodes, 1 shard and 2 replicas.
> 
> Our goal is to index 42 million rows. Indexing time is important.
> The data source is an Oracle database.
> 
> Our indexing strategy is:
> 
> * Reading from Oracle into a big CSV file.
> 
> * Reading from 4 files (the big file chunked) and injecting via
> ConcurrentUpdateSolrClient.
> 
> Is this the optimal way of injecting such a mass of data into Solr?
> 
> For information, the estimated time for our solution is 6h.

How big are the CSV files? If most of the time is taken performing the various 
SELECT operations, then it's probably a good strategy.

However, you may find that using the disk as a buffer slows everything down 
because disk-writes can be very slow.

Why not perform your SELECT(s) and write directly to Solr using one of the APIs 
(either a language-specific API, or through the HTTP API)?

Hope that helps,
-chris


RE: Question liste solr

2018-03-19 Thread LOPEZ-CORTES Mariano-ext
Sorry. Thanks in advance !!

From: LOPEZ-CORTES Mariano-ext
Sent: Monday, March 19, 2018 16:50
To: 'solr-user@lucene.apache.org'
Subject: RE: Question liste solr

Hello

We have a Solr index with 3 nodes, 1 shard and 2 replicas.

Our goal is to index 42 million rows. Indexing time is important. The data
source is an Oracle database.

Our indexing strategy is:

· Reading from Oracle into a big CSV file.

· Reading from 4 files (the big file chunked) and injecting via
ConcurrentUpdateSolrClient (see the sketch below).

Is this the optimal way of injecting such a mass of data into Solr?

For information, the estimated time for our solution is 6h.
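
A minimal sketch of the injection step for one CSV chunk, assuming the chunks are
already in Solr's CSV update format and using a plain HttpSolrClient for simplicity;
the file name, URL, and collection name are made up:

    import java.io.File;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class CsvChunkLoader {
      public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
          ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update");
          req.addFile(new File("chunk1.csv"), "text/csv");  // one of the 4 chunks
          client.request(req, "collection1");               // no explicit commit; rely on autoCommit
        }
      }
    }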


RE: Question liste solr

2018-03-19 Thread LOPEZ-CORTES Mariano-ext
Hello

We have a Solr index with 3 nodes, 1 shard and 2 replicas.

Our goal is to index 42 million rows. Indexing time is important. The data
source is an Oracle database.

Our indexing strategy is:

* Reading from Oracle into a big CSV file.

* Reading from 4 files (the big file chunked) and injecting via
ConcurrentUpdateSolrClient.

Is this the optimal way of injecting such a mass of data into Solr?

For information, the estimated time for our solution is 6h.


RE: Response time under 1 second?

2018-02-22 Thread LOPEZ-CORTES Mariano-ext
For the moment, I have the following information:

12GB is the max Java heap. I don't know the total memory; no direct access to the host.

2 replicas:
Size 1 = 11.51 GB
Size 2 = 11.82 GB
(Sizes shown in the Core Overview admin GUI)

Thanks very much!

-----Original Message-----
From: Shawn Heisey [mailto:elyog...@elyograg.org]
Sent: Thursday, February 22, 2018 17:06
To: solr-user@lucene.apache.org
Subject: Re: Response time under 1 second?

On 2/22/2018 8:53 AM, LOPEZ-CORTES Mariano-ext wrote:
> With a 3-node cluster (12GB each) and a corpus of 5GB (CSV format).
>
> Is it better to completely disable the Solr caches? There is enough RAM for the
> entire index.

The size of the input data will have an effect on how big the index is, but it 
is not a direct indication of the index size.  The size of the index is more 
important than the size of the data that you send to Solr to create the index.

You say 12GB ... but is this total system memory, or the max Java heap size for 
Solr?  What are these two numbers for your servers?

If you go to the admin UI for one of these servers and look at the Overview 
page for all of the index cores it contains, you will be able to see how many 
documents and what size each index is on disk.  What are these numbers?  If the 
numbers are similar for all the servers, then I will only need to see it for 
one of them.

If the machine is running an OS like Linux that has the gnu top program, then I 
can see a lot of useful information from that program.  Run "top" 
(not htop or other variants), press shift-M to sort the list by memory, and 
grab a screenshot.  This will probably be an image file, so you'll need to find 
a file sharing site and give us a URL to access the file. Attachments rarely 
make it to the mailing list.

Thanks,
Shawn



Response time under 1 second?

2018-02-22 Thread LOPEZ-CORTES Mariano-ext
Hello

With a 3-node cluster (12GB each) and a corpus of 5GB (CSV format).

Is it better to completely disable the Solr caches? There is enough RAM for the
entire index.

Is there a way to keep random queries under 1 second?

Thanks!





RE: Facet performance problem

2018-02-20 Thread LOPEZ-CORTES Mariano-ext
Our query looks like this:

...&facet=true&facet.field=motifPresence

We return a facet list of the values in the "motifPresence" field (person status).
Status:
[ ] status1
[x] status2
[x] status3

The user then selects one or more statuses (this is the step we called
"facet filtering").

The query is then re-executed with fq=motifPresence:(status2 OR status3)

We use fq in order not to alter the score of the main query.
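
For clarity, the same flow as a hypothetical SolrJ sketch (not taken from the actual
application code):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class MotifPresenceFacet {
      static QueryResponse search(SolrClient client, String userQuery, String... selectedStatuses)
          throws Exception {
        SolrQuery q = new SolrQuery(userQuery);
        q.setFacet(true);
        q.addFacetField("motifPresence");                 // facet list shown to the user
        if (selectedStatuses.length > 0) {
          // re-fire the search with an fq so the main query's score is unchanged
          q.addFilterQuery("motifPresence:(" + String.join(" OR ", selectedStatuses) + ")");
        }
        return client.query(q);
      }
    }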

We've read that docValues=true is recommended for facet fields.

Do we also need indexed=true?
Is there any other problem with our solution?

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, February 19, 2018 18:18
To: solr-user
Subject: Re: Facet performance problem

I'm confused here. What do you mean by "facet filtering"? Your examples have no 
facets at all, just a _filter query_.

I'll assume you want to use filter query (fq), and faceting has nothing to do 
with it. This is one of the tricky bits of docValues.
While it's _possible_ to search on a field that's defined as above, it's very 
inefficient since there's no "inverted index" for the field, you specified 
'indexed="false" '. So the docValues are searched, and it's essentially a table 
scan.

If you mean to search against this field, set indexed="true". You'll have to 
completely reindex your corpus of course.

If you intend to facet, group or sort on this field, you should _also_ have 
docValues="true".

Best,
Erick

On Mon, Feb 19, 2018 at 7:47 AM, MOUSSA MZE Oussama-ext wrote:
> Hi
>
> We have the following environment:
>
> 3-node cluster
> 1 shard
> Replication factor = 2
> 8GB per node
>
> 29 million documents
>
> We're faceting over the field "motifPresence", defined as follows:
>
> <field name="motifPresence" ... indexed="false" stored="true" required="false"/>
>
> Once the user selects a motifPresence filter, we execute the search again with:
>
> fq: (value1 OR value2 OR value3 OR ...)
>
> The problem is: during facet filtering the query is too slow and its response
> time is greater than the main search (without facet filtering).
>
> Thanks in advance!


RE: Reading data from Oracle

2018-02-15 Thread LOPEZ-CORTES Mariano-ext
Injecting too many rows into Solr throws a Java heap exception (more memory needed?
We have 8GB per node).

Does DIH have support for paging queries?

Thanks!

-----Original Message-----
From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de]
Sent: Thursday, February 15, 2018 10:13
To: solr-user@lucene.apache.org
Subject: Re: Reading data from Oracle

And where is the bottleneck?

Is it reading from Oracle or injecting to Solr?

Regards
Bernd


On 15.02.2018 at 08:34, LOPEZ-CORTES Mariano-ext wrote:
> Hello
> 
> We have to delete our Solr collection and feed it periodically from an Oracle
> database (up to 40M rows).
> 
> We've done the following test: from a Java program, we read chunks of data
> from Oracle and inject them into Solr (via SolrJ).
> 
> The problem: it is really slow (1.5 nights).
> 
> Is there a faster method to do that?
> 
> Thanks in advance.
> 


Reading data from Oracle

2018-02-14 Thread LOPEZ-CORTES Mariano-ext
Hello

We have to delete our Solr collection and feed it periodically from an Oracle 
database (up to 40M rows).

We've done the following test: from a Java program, we read chunks of data from
Oracle and inject them into Solr (via SolrJ).
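
Roughly, a simplified sketch of that kind of chunked loop; the table and column names
are invented and the real code differs:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class OracleToSolr {
      static void copy(SolrClient solr, String jdbcUrl, String user, String pass) throws Exception {
        try (Connection con = DriverManager.getConnection(jdbcUrl, user, pass);
             Statement st = con.createStatement()) {
          st.setFetchSize(5000);  // stream rows from Oracle instead of materialising the result set
          try (ResultSet rs = st.executeQuery("SELECT id, nom, prenom FROM personnes")) {
            List<SolrInputDocument> batch = new ArrayList<>();
            while (rs.next()) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", rs.getString("id"));
              doc.addField("lastName", rs.getString("nom"));
              doc.addField("surname", rs.getString("prenom"));
              batch.add(doc);
              if (batch.size() == 5000) { solr.add(batch); batch.clear(); }  // send one chunk
            }
            if (!batch.isEmpty()) solr.add(batch);
          }
        }
      }
    }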

The problem: it is really slow (1.5 nights).

Is there a faster method to do that?

Thanks in advance.


RE: Facets OutOfMemoryException

2018-02-08 Thread LOPEZ-CORTES Mariano-ext
We have just one field, "status", in facets, with a cardinality of 93.

We realize that increasing memory would work. But do you think it's necessary?

Thanks in advance.

-----Original Message-----
From: Zisis T. [mailto:zist...@runbox.com]
Sent: Thursday, February 8, 2018 13:14
To: solr-user@lucene.apache.org
Subject: Re: Facets OutOfMemoryException

I believe that things like the following will affect faceting memory requirements:
-> how many fields do you facet on
-> what is the cardinality of each one of them
-> what is your QPS rate

but 2GB for 27M documents seems too low. Did you try to increase the memory on 
Solr's JVM?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Facets OutOfMemoryException

2018-02-08 Thread LOPEZ-CORTES Mariano-ext
We are experiencing memory problems with facet filters (OutOfMemory: Java heap).
If we disable facets, it works OK.

Our infrastructure:

3 Solr nodes, 2048 MB RAM
3 ZooKeeper nodes, 1024 MB RAM

Size: 27 million documents

Any ideas?

Thanks in advance !



Highlighting over date fields

2018-02-07 Thread LOPEZ-CORTES Mariano-ext
Is it possible to use highlighting on date fields?

We've tried, but we get no highlighting response for the field.



Custom Solr function

2018-01-30 Thread LOPEZ-CORTES Mariano-ext
Can we create a custom function in Java?

Example :

sort = func([USER-ENTERED TEXT]) desc

func returns a numeric value.
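
A minimal sketch of the plugin hook this maps to (Solr's ValueSourceParser); the class
name and the placeholder scoring logic are made up:

    import org.apache.lucene.queries.function.ValueSource;
    import org.apache.lucene.queries.function.valuesource.DoubleConstValueSource;
    import org.apache.solr.search.FunctionQParser;
    import org.apache.solr.search.SyntaxError;
    import org.apache.solr.search.ValueSourceParser;

    // Registered in solrconfig.xml with:
    //   <valueSourceParser name="func" class="com.example.FuncValueSourceParser"/>
    // and then usable as sort=func("some text") desc
    public class FuncValueSourceParser extends ValueSourceParser {
      @Override
      public ValueSource parse(FunctionQParser fp) throws SyntaxError {
        String userText = fp.parseArg();                             // the user-entered argument
        double value = userText == null ? 0.0 : userText.length();   // placeholder numeric logic
        return new DoubleConstValueSource(value);                    // same value for every document
      }
    }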

Thanks in advance


Phonetic matching relevance

2018-01-29 Thread LOPEZ-CORTES Mariano-ext
Hello.

We work on a search application whose main goal is to find persons by name 
(surname and lastname).

The query text comes from a user-entered text field. The ordering of the text is not
defined (lastname-surname or surname-lastname), but some orderings are more important
than others. The ranking is:

1 Exact match
2 Inexact match (contains the entered words)
3 Inexact phonetic match (matches with the Beider-Morse filter, French version)

In addition, lastname+surname is prioritized over surname+lastname.

All words entered by the user have to match (in an exact or inexact way).

We have following fields :

lastNameE : WordTokenizer, LowerCaseFilter, ASCIIFoldingFilterFactory
lastName : StandardTokenizer, LowerCaseFilter, ASCIIFoldingFilterFactory
lastNameP : StandardTokenizer, LowerCaseFilter, ASCIIFoldingFilterFactory and 
BMF
surnameE : WordTokenizer, LowerCaseFilter, ASCIIFoldingFilterFactory
surname : StandardTokenizer, LowerCaseFilter, ASCIIFoldingFilterFactory
surnameP : StandardTokenizer, LowerCaseFilter, ASCIIFoldingFilterFactory and BMF

We use the eDisMax query parser and assign higher weights to the exact fields and
lower weights to the inexact fields.
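
For illustration, the query side as a hypothetical SolrJ sketch; the weights and the
mm value are placeholders, not the real tuning:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class NameSearch {
      static QueryResponse search(SolrClient client, String userText) throws Exception {
        SolrQuery q = new SolrQuery(userText);
        q.set("defType", "edismax");
        // exact fields weighted highest, plain inexact fields next, phonetic fields lowest
        q.set("qf", "lastNameE^20 surnameE^15 lastName^8 surname^6 lastNameP^2 surnameP^1");
        q.set("mm", "100%");  // all user-entered words must match in some field
        return client.query(q);
      }
    }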

However, among the phonetic matches, some are closer to the query text than others.
How can we boost those results?

Thanks in advance !