RE: Debugging/scoring question
Yes, this makes sense. I guess you are talking about this doc: https://lucene.apache.org/core/6_0_1/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html How can I decrease the effect of the IDF component in my query? Thanks!!

-Original message-
From: Alessandro Benedetti [mailto:a.benede...@sease.io]
Sent: Wednesday 23 May 2018 18:05
To: solr-user@lucene.apache.org
Subject: Re: Debugging/scoring question

Hi Mariano,
From the documentation: docCount = total number of documents containing this field, in the range [1 .. {@link #maxDoc()}]
In your debug output, the fields involved in the score computation are indeed different (nomUsageE, prenomE). Does this make sense?
Cheers
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
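For reference, lowering the idf contribution generally means plugging in a custom Similarity. Per-field-type similarity can be declared in the schema; a minimal sketch, where `com.example.FlatIdfSimilarity` is a hypothetical custom class (it does not ship with Solr) that would override the idf computation:

```xml
<!-- schema.xml: attach a custom Similarity to one field type.
     Requires <similarity class="solr.SchemaSimilarityFactory"/> at the
     global schema level so that per-field-type similarities are honored. -->
<fieldType name="text_flat_idf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="com.example.FlatIdfSimilarity"/>
</fieldType>
```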
Debugging/scoring question
Hi all. I have a 20-document collection. In the debug output, we have:

"100051":"
20.794415 = max of:
  20.794415 = weight(nomUsageE:jean in 1) [SchemaSimilarity], result of:
    20.794415 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
      15.0 = boost
      1.3862944 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
        1.0 = docFreq
        5.0 = docCount
      1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        1.0 = avgFieldLength
        1.0 = fieldLength

"100053":"
21.11246 = max of:
  21.11246 = weight(prenomE:jean in 3) [SchemaSimilarity], result of:
    21.11246 = score(doc=3,freq=1.0 = termFreq=1.0), product of:
      8.0 = boost
      2.6390574 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
        1.0 = docFreq
        20.0 = docCount
      1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        1.0 = avgFieldLength
        1.0 = fieldLength

docCount = 5.0 for document 100051. Why? docCount is the total number of documents, isn't it? Thanks in advance!
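As a sanity check, both idf values in the explain output can be reproduced from the formula it prints, using the per-field docCount/docFreq shown there (plain Java, values copied from the debug above):

```java
public class IdfCheck {
    // BM25 idf exactly as printed in the Solr explain output:
    // idf = log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    static double idf(double docFreq, double docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // nomUsageE: docFreq=1, docCount=5 -> the explain showed 1.3862944
        System.out.println(idf(1, 5));
        // prenomE:   docFreq=1, docCount=20 -> the explain showed 2.6390574
        System.out.println(idf(1, 20));
    }
}
```

Note that docCount here is per field (documents containing that field), which is why it differs between nomUsageE (5) and prenomE (20) even though the collection has 20 documents.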
Solr Dates TimeZone
Hi. Is it possible to configure Solr with a timezone other than GMT? Is it possible to configure the Solr Admin UI to display dates in a timezone other than GMT? What is the best way to store a birth date in Solr? We use the TrieDate type. Thanks!
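Solr stores and returns dates in UTC; converting for display is normally the client's job. For a pure birth date (no meaningful time component), one common convention is to index it as midnight UTC so it never shifts across timezones; a small sketch of that conversion with `java.time`:

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

public class BirthDates {
    // Render a birth date the way Solr expects it:
    // midnight UTC, ISO-8601 with a trailing 'Z', e.g. 1980-05-23T00:00:00Z
    static String toSolrDate(LocalDate birthDate) {
        return birthDate.atStartOfDay(ZoneOffset.UTC).toInstant().toString();
    }

    public static void main(String[] args) {
        System.out.println(toSolrDate(LocalDate.of(1980, 5, 23)));
    }
}
```

Going the other way (displaying a stored UTC date in a local timezone) is the mirror operation with `Instant.atZone(...)` in the front end.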
Commit too slow?
Hi. After injecting 200 documents into our Solr server, the commit operation at the end of the process (using ConcurrentUpdateSolrClient) takes 10 minutes. Is that too slow? Our auto-commit policy is the following: autoCommit maxTime=15000, openSearcher=false; autoSoftCommit maxTime=15000. Thanks!
Solr doesn't import the whole data
Hi. We've finished the import of 40 million rows into a 3-node Solr cluster. After injecting all the data via a Java program, we noticed that the number of documents was lower than expected (by 10 rows). No exception, no error. Some config details: autoCommit maxTime=15000, openSearcher=false; autoSoftCommit maxTime=15000. We have no commits in the client application. Also, when checking via the admin UI, we noticed that the total number of rows in Solr (numFound) increases slowly. Is this normal behaviour? What's the problem? Thanks!
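For reference, the three values quoted above (15000 / false / 15000) correspond to the standard commit settings in solrconfig.xml; a sketch of what that block usually looks like, assuming the stock layout:

```xml
<!-- solrconfig.xml: hard commit every 15s without opening a searcher,
     soft commit every 15s to make documents visible to searches -->
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>15000</maxTime>
</autoSoftCommit>
```

With openSearcher=false on the hard commit, documents only become searchable when the soft commit fires, which would explain numFound climbing gradually rather than jumping.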
Filter query question
Hi. In our search application we have one facet filter (Status). Each status value corresponds to multiple values in the Solr database. Example: Status: Initialized --> status in solr = 11I, 12I, 13I, 14I, ... When a status value is clicked, the search is re-fired with an fq filter: fq=status:(11I OR 12I OR 13I). This was very, very inefficient: the filter query response time was longer than the same search without the filter! We have changed the status values in the Solr database to match the visual filter values, so there is no longer any OR in the fq filter. Performance is better now. What is the reason? Thanks!
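For what it's worth, when a filter expands to many OR'd terms, the terms query parser is the usual cheaper alternative to a parsed Boolean fq; a sketch for the original status filter:

```
fq={!terms f=status}11I,12I,13I
```

Also note that each distinct fq string gets its own filterCache entry, so keeping the fq string stable across requests (same values, same order) improves cache hits regardless of which parser is used.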
RE: Solr list question
The CSV file is approx. 5GB for 29 million rows. As you say, Christopher, at the beginning we thought that reading chunk by chunk from Oracle and writing to Solr was the best strategy. But from our tests we've noticed: CSV creation via PL/SQL is really, really fast — 40 minutes for the full dataset (with bulk collect) — while multiple SELECT calls from Java slow the process down. I think Oracle is the bottleneck here. Any other ideas/alternatives?

Some other points worth noting: we are going to enable autoCommit every 10 minutes / 1 rows, with no commit from the client. During indexing, we always call a front-end load-balancer that redirects calls to the 3-node cluster. Thanks in advance!! ==> Great mailing list and a really awesome tool!!

-Original message-
From: Christopher Schultz [mailto:ch...@christopherschultz.net]
Sent: Monday 19 March 2018 18:05
To: solr-user@lucene.apache.org
Subject: Re: Solr list question

Mariano,

On 3/19/18 11:50 AM, LOPEZ-CORTES Mariano-ext wrote:
> Hello
> We have a Solr index with 3 nodes, 1 shard and 2 replicas.
> Our goal is to index 42 million rows. Indexing time is important. The data source is an Oracle database.
> Our indexing strategy is:
> * Reading from Oracle into a big CSV file.
> * Reading from 4 files (the big file, chunked) and injection via ConcurrentUpdateSolrClient
> Is this the optimal way of injecting such a mass of data into Solr?
> For information, the estimated time for our solution is 6h.

How big are the CSV files? If most of the time is taken performing the various SELECT operations, then it's probably a good strategy. However, you may find that using the disk as a buffer slows everything down, because disk writes can be very slow. Why not perform your SELECT(s) and write directly to Solr using one of the APIs (either a language-specific API, or through the HTTP API)?
Hope that helps,
-chris
RE: Solr list question
Sorry. Thanks in advance!!

From: LOPEZ-CORTES Mariano-ext
Sent: Monday 19 March 2018 16:50
To: 'solr-user@lucene.apache.org'
Subject: RE: Solr list question

Hello. We have a Solr index with 3 nodes, 1 shard and 2 replicas. Our goal is to index 42 million rows. Indexing time is important. The data source is an Oracle database. Our indexing strategy is:
· Reading from Oracle into a big CSV file.
· Reading from 4 files (the big file, chunked) and injection via ConcurrentUpdateSolrClient
Is this the optimal way of injecting such a mass of data into Solr? For information, the estimated time for our solution is 6h.
RE: Solr list question
Hello. We have a Solr index with 3 nodes, 1 shard and 2 replicas. Our goal is to index 42 million rows. Indexing time is important. The data source is an Oracle database. Our indexing strategy is:
* Reading from Oracle into a big CSV file.
* Reading from 4 files (the big file, chunked) and injection via ConcurrentUpdateSolrClient
Is this the optimal way of injecting such a mass of data into Solr? For information, the estimated time for our solution is 6h.
RE: Response time under 1 second?
For the moment, I have the following information: 12GB is the max Java heap; total memory I don't know (no direct access to the host). 2 replicas: size 1 = 11.51 GB, size 2 = 11.82 GB (sizes shown in the Core Overview admin GUI). Thanks very much!

-Original message-
From: Shawn Heisey [mailto:elyog...@elyograg.org]
Sent: Thursday 22 February 2018 17:06
To: solr-user@lucene.apache.org
Subject: Re: Response time under 1 second?

On 2/22/2018 8:53 AM, LOPEZ-CORTES Mariano-ext wrote:
> With a 3-node cluster, 12GB each, and a corpus of 5GB (CSV format).
> Is it better to disable the Solr caches completely? There is enough RAM for the entire index.

The size of the input data will have an effect on how big the index is, but it is not a direct indication of the index size. The size of the index is more important than the size of the data that you send to Solr to create the index.

You say 12GB ... but is this total system memory, or the max Java heap size for Solr? What are these two numbers for your servers?

If you go to the admin UI for one of these servers and look at the Overview page for all of the index cores it contains, you will be able to see how many documents and what size each index is on disk. What are these numbers? If the numbers are similar for all the servers, then I will only need to see it for one of them.

If the machine is running an OS like Linux that has the gnu top program, then I can see a lot of useful information from that program. Run "top" (not htop or other variants), press shift-M to sort the list by memory, and grab a screenshot. This will probably be an image file, so you'll need to find a file sharing site and give us a URL to access the file. Attachments rarely make it to the mailing list.

Thanks,
Shawn
Response time under 1 second?
Hello. With a 3-node cluster, 12GB each, and a corpus of 5GB (CSV format): is it better to disable the Solr caches completely? There is enough RAM for the entire index. Is there a way to keep random queries under 1 second? Thanks!
RE: Facet performance problem
Our query looks like this: ...facet=true&facet.field=motifPresence. We return a facet list of the values in the "motifPresence" field (person status):

Status:
[ ] status1
[x] status2
[x] status3

The user then selects one or multiple statuses (it's this step that we called "facet filtering"). The query is then re-executed with fq=motifPresence:(status2 OR status3). We use fq in order not to alter the score of the main query. We've read that docValues=true is recommended for facet fields. Do we also need indexed=true? Is there any other problem with our solution?

-Original message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday 19 February 2018 18:18
To: solr-user
Subject: Re: Facet performance problem

I'm confused here. What do you mean by "facet filtering"? Your examples have no facets at all, just a _filter query_. I'll assume you want to use a filter query (fq), and faceting has nothing to do with it.

This is one of the tricky bits of docValues. While it's _possible_ to search on a field that's defined as above, it's very inefficient, since there's no "inverted index" for the field: you specified 'indexed="false"'. So the docValues are searched, and it's essentially a table scan. If you mean to search against this field, set indexed="true". You'll have to completely reindex your corpus, of course. If you intend to facet, group or sort on this field, you should _also_ have docValues="true".

Best,
Erick

On Mon, Feb 19, 2018 at 7:47 AM, MOUSSA MZE Oussama-ext wrote:
> Hi
> We have the following environment:
> 3-node cluster, 1 shard, replication factor = 2, 8GB per node
> 29 million documents
> We facet over the field "motifPresence", defined as follows:
> <field name="motifPresence" ... indexed="false" stored="true" required="false"/>
> Once the user selects a motifPresence filter, we execute the search again with:
> fq: (value1 OR value2 OR value3 OR ...)
> The problem is: during facet filtering, the query is too slow and its response time is greater than the main search (without facet filtering).
> Thanks in advance!
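Following Erick's advice, a field used for both filtering (fq) and faceting would normally carry both flags; a sketch of the field definition, assuming a simple string type for the status codes:

```xml
<!-- schema.xml: indexed=true for efficient fq filtering,
     docValues=true for faceting/sorting/grouping -->
<field name="motifPresence" type="string" indexed="true" stored="true"
       docValues="true" required="false"/>
```

A full reindex is required after changing these attributes, as Erick notes.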
RE: Reading data from Oracle
Injecting too many rows into Solr throws a Java heap exception (more memory? We have 8GB per node). Does DIH support paging queries? Thanks!

-Original message-
From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de]
Sent: Thursday 15 February 2018 10:13
To: solr-user@lucene.apache.org
Subject: Re: Reading data from Oracle

And where is the bottleneck? Is it reading from Oracle or injecting into Solr?

Regards
Bernd

On 15.02.2018 at 08:34, LOPEZ-CORTES Mariano-ext wrote:
> Hello
> We have to delete our Solr collection and re-feed it periodically from an Oracle database (up to 40M rows).
> We've done the following test: from a Java program, we read chunks of data from Oracle and inject them into Solr (via SolrJ).
> The problem: it is really, really slow (1.5 nights).
> Is there a faster method to do this?
> Thanks in advance.
Reading data from Oracle
Hello. We have to delete our Solr collection and re-feed it periodically from an Oracle database (up to 40M rows). We've done the following test: from a Java program, we read chunks of data from Oracle and inject them into Solr (via SolrJ). The problem: it is really, really slow (1.5 nights). Is there a faster method to do this? Thanks in advance.
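Whatever the transport, sending documents in fixed-size batches rather than one by one usually helps both throughput and heap usage. A minimal batching sketch in plain Java — the sender here is a placeholder; in a real indexer it would call SolrClient.add(batch) and the batch size would be tuned:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class Batcher {
    // Split rows into fixed-size batches and hand each batch to a sender.
    // Returns the number of batches sent.
    static <T> int sendInBatches(List<T> rows, int batchSize, Consumer<List<T>> sender) {
        int batches = 0;
        for (int i = 0; i < rows.size(); i += batchSize) {
            sender.accept(rows.subList(i, Math.min(i + batchSize, rows.size())));
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) rows.add(i);
        // 10 rows with batchSize 3 -> batches of 3, 3, 3 and 1
        int n = sendInBatches(rows, 3, batch -> System.out.println(batch));
        System.out.println(n + " batches");
    }
}
```

Batching also bounds how many rows sit in client memory at once, which is relevant to the heap exception mentioned above.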
RE: Facets OutOfMemoryException
We have just one field ("status") in facets, with a cardinality of 93. We realize that increasing memory would work, but do you think it's necessary? Thanks in advance.

-Original message-
From: Zisis T. [mailto:zist...@runbox.com]
Sent: Thursday 8 February 2018 13:14
To: solr-user@lucene.apache.org
Subject: Re: Facets OutOfMemoryException

I believe that things like the following will affect faceting memory requirements:
-> how many fields do you facet on
-> what is the cardinality of each one of them

What is your QPS rate? In any case, 2GB for 27M documents seems too low. Did you try to increase the memory on Solr's JVM?

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
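If you do decide to raise the heap, it's normally done in the include script rather than on the command line; a sketch assuming the stock solr.in.sh layout (the 4g value is an illustration, not a recommendation for this exact setup):

```shell
# solr.in.sh (solr.in.cmd on Windows): sets both -Xms and -Xmx for the Solr JVM
SOLR_HEAP="4g"
```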
Facets OutOfMemoryException
We are experiencing memory problems with facet filters (OutOfMemory: Java heap). If we disable facets, it works fine. Our infrastructure: 3 Solr nodes with 2048 MB RAM, 3 Zookeeper nodes with 1024 MB RAM. Size: 27 million documents. Any ideas? Thanks in advance!
Highlighting over date fields
Is it possible to use highlighting on date fields? We've tried, but we get no highlighting response for the field.
Custom Solr function
Can we create a custom function in Java? Example: sort = func([USER-ENTERED TEXT]) desc, where func returns a numeric value. Thanks in advance
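Yes: custom functions are implemented as a ValueSourceParser plugin (a Java class extending org.apache.solr.search.ValueSourceParser that returns a numeric ValueSource) and registered in solrconfig.xml. A sketch of the registration, where the name and class are hypothetical placeholders:

```xml
<!-- solrconfig.xml: register a custom function usable in sort/fl/boost -->
<valueSourceParser name="func" class="com.example.MyFuncParser"/>
```

After the plugin jar is on the classpath, sort=func(...) desc works like any built-in function query.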
Phonetic matching relevance
Hello. We are working on a search application whose main goal is to find persons by name (surname and last name). The query text comes from a user-entered text field. The ordering of the text is not defined (lastname-surname or surname-lastname), but some orderings are more important than others. The ranking is:

1 Exact match
2 Inexact match (contains the entered words)
3 Inexact phonetic match (contains, with the Beider-Morse filter, French version)

In addition, lastname+surname is prioritized over surname+lastname, and all words entered by the user have to match (exactly or inexactly). We have the following fields:

lastNameE: WordTokenizer, LowerCaseFilter, ASCIIFoldingFilterFactory
lastName: StandardTokenizer, LowerCaseFilter, ASCIIFoldingFilterFactory
lastNameP: StandardTokenizer, LowerCaseFilter, ASCIIFoldingFilterFactory and BMF
surnameE: WordTokenizer, LowerCaseFilter, ASCIIFoldingFilterFactory
surname: StandardTokenizer, LowerCaseFilter, ASCIIFoldingFilterFactory
surnameP: StandardTokenizer, LowerCaseFilter, ASCIIFoldingFilterFactory and BMF

We use the edismax query parser and assign higher weights to the exact fields and lower weights to the inexact fields. However, among the phonetic matches, some are closer to the query text than others. How can we boost those results? Thanks in advance!
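The field tiering described above is typically wired into edismax via qf weights; a sketch of the request parameters, where the boost values are hypothetical illustrations rather than the original configuration:

```
defType=edismax
qf=lastNameE^50 surnameE^40 lastName^10 surname^8 lastNameP^2 surnameP^1
mm=100%
```

mm=100% expresses the "all entered words must match" requirement; a match in an exact (E) field then outranks one that only matches in a phonetic (P) field.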