Re: Multiple concurrent queries to Solr
On 8/23/2015 7:46 AM, Ashish Mukherjee wrote:
> I want to run a few Solr queries in parallel, which are being done in a multi-threaded model now. I was wondering if there are any client libraries to query Solr through a non-blocking I/O mechanism instead of a threaded model. Has anyone attempted something like this?

The only client library that the Solr project makes is SolrJ -- the client for Java. If you are not using the SolrJ client, then the Solr project did not write it, and you should contact the authors of the library directly.

SolrJ and Solr are both completely thread-safe, and multiple threads are recommended for highly concurrent usage.

SolrJ uses HttpClient for communication with Solr. I was not able to determine whether the default HttpClient settings will result in non-blocking I/O or not. As far as I am aware, nothing in SolrJ sets any explicit configuration for blocking or non-blocking I/O. You can create your own HttpClient object in a SolrJ program and have the SolrClient object use it.

HttpClient uses HttpCore. Here is the main web page for these components:

https://hc.apache.org/

On this web page, it says HttpCore supports two I/O models: a blocking I/O model based on classic Java I/O, and a non-blocking, event-driven I/O model based on Java NIO. There is no information there about which model is chosen by default.

Thanks,
Shawn
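As a concrete sketch of the threaded model Shawn describes, plain java.util.concurrent is enough; `runQuery` here is a stand-in for a real call such as `SolrClient.query()`, since SolrJ is thread-safe and one client can be shared across the pool (pool size and method names are illustrative, not from SolrJ):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelQueries {

    // Stand-in for a real SolrJ call, e.g. solrClient.query(new SolrQuery(q)).
    static String runQuery(String q) {
        return "results for " + q;
    }

    public static void main(String[] args) throws Exception {
        List<String> queries = Arrays.asList("q1", "q2", "q3");
        ExecutorService pool = Executors.newFixedThreadPool(queries.size());

        // Submit all queries; they run concurrently on the pool threads.
        List<Future<String>> futures = new ArrayList<>();
        for (String q : queries) {
            futures.add(pool.submit(() -> runQuery(q)));
        }

        // get() blocks until each individual query has finished.
        for (Future<String> f : futures) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```

This is still blocking I/O underneath; the concurrency comes from the threads, which is exactly the model the original question asks about replacing.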
Re: Multiple concurrent queries to Solr
The last time that I used the HTTPClient library, it was non-blocking. It doesn't try to read from the socket until you ask for data from the response object. That allows parallel requests without threads. Underneath, it has a pool of connections that can be reused. If the pool is exhausted, it can block.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On Aug 23, 2015, at 8:49 AM, Shawn Heisey apa...@elyograg.org wrote:
> The only client library that the Solr project makes is SolrJ -- the client for Java. [...]
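Walter's last two points (a reusable pool that only blocks when exhausted) can be sketched with a counting semaphore; this illustrates the pooling behaviour only, and is not HttpClient's actual implementation:

```java
import java.util.concurrent.Semaphore;

public class PoolSketch {

    // Two permits stand in for a pool of two reusable connections.
    static final Semaphore pool = new Semaphore(2);

    static String request(String q) {
        pool.acquireUninterruptibly(); // blocks only when the pool is exhausted
        try {
            // A real client would lease a connection here and read from the
            // socket lazily, when the response body is actually consumed.
            return "response to " + q;
        } finally {
            pool.release();            // connection goes back to the pool
        }
    }

    public static void main(String[] args) {
        System.out.println(request("q1"));
        System.out.println("permits left: " + pool.availablePermits());
    }
}
```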
Re: Solr performance is slow with just 1GB of data indexed
Be aware that the more terms in your documents, the slower clustering will be. So it isn't just the number of docs; their size counts in this instance. A simple test would be to build an index with just the first 1000 terms of your clustering fields, and see if that makes a difference to performance.

Upayavira

On Sun, Aug 23, 2015, at 05:32 PM, Erick Erickson wrote:
> You're confusing clustering with searching. Sure, Solr can index lots of data, but clustering is essentially finding ad-hoc similarities between arbitrary documents. [...]
Re: Multiple concurrent queries to Solr
Hello Ashish.

There is some unfinished work on this at https://issues.apache.org/jira/browse/SOLR-3383
Maybe you want to have a look and contribute?

Arcadius.

On 23 August 2015 at 17:02, Walter Underwood wun...@wunderwood.org wrote:
> The last time that I used the HTTPClient library, it was non-blocking. It doesn't try to read from the socket until you ask for data from the response object. [...]

--
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com
Re: Solr performance is slow with just 1GB of data indexed
Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:
> However, I find that clustering is exceedingly slow after I index this 1GB of data. It took almost 30 seconds to return the cluster results when I set it to cluster the top 1000 records, and still takes more than 3 seconds when I set it to cluster the top 100 records.

Your clustering uses Carrot2, which fetches the top documents and performs real-time clustering on them - that process is (nearly) independent of index size. The relevant numbers here are "top 1000" and "top 100", not 1GB.

The unknown part is whether it is the fetching of the top 1000 (the Solr part) or the clustering itself (the Carrot2 part) that is the bottleneck.

- Toke Eskildsen
Re: SOLR 5.3
Solr 5.3 is now available for download from http://mirror.catn.com/pub/apache/lucene/solr/5.3.0/
The redirection on the web site will probably be fixed before we get the official announcement.

Arcadius.

On 23 August 2015 at 09:00, William Bell billnb...@gmail.com wrote:
> At lucene.apache.org/solr it says Solr 5.3 is there, but when I click on downloads it shows Solr 5.2.1... ??
> "APACHE SOLR™ 5.3.0 - Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™."

--
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com
Re: Solr performance is slow with just 1GB of data indexed
Are you by any chance doing stored=true on the fields you want to search? If so, you may want to switch to just indexed=true. Of course, they will then not come back in the results, but do you really want to sling huge content fields around?

The other option is to set enableLazyFieldLoading=true and not request that field. This, as a test, you could actually do without needing to reindex, just with a restart. It would give you a way to test whether the stored field size is the issue.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:
> Hi Shawn and Toke,
>
> I only have 520 docs in my data, but each of the documents is quite big in size; in Solr, the index is using 221MB. So when I set it to read from the top 1000 rows, it should just be reading all the 520 docs that are indexed? [...]
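Alexandre's two suggestions correspond to one schema attribute and one solrconfig.xml setting; a minimal sketch, where the field name and type are illustrative but the attribute names are the standard Solr ones:

```xml
<!-- schema.xml: searchable but not stored, so large bodies are never fetched -->
<field name="content" type="text_general" indexed="true" stored="false"/>

<!-- solrconfig.xml (inside <query>): load stored fields lazily, only when requested -->
<enableLazyFieldLoading>true</enableLazyFieldLoading>
```

With lazy loading enabled, a query whose fl list omits the big field should avoid materializing it, which makes this a quick way to check whether stored-field size is the bottleneck before committing to a reindex.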
Re: Solr performance is slow with just 1GB of data indexed
You're confusing clustering with searching. Sure, Solr can index lots of data, but clustering is essentially finding ad-hoc similarities between arbitrary documents. It must take each of the documents in the result size you specify from your result set and try to find commonalities.

For perf issues in terms of clustering, you'd be better off talking to the folks at the Carrot2 project.

Best,
Erick

On Sun, Aug 23, 2015 at 8:51 AM, Alexandre Rafalovitch arafa...@gmail.com wrote:
> Are you by any chance doing stored=true on the fields you want to search? If so, you may want to switch to just indexed=true. [...]
Re: Solr performance is slow with just 1GB of data indexed
unsubscribe

On Sat, Aug 22, 2015 at 9:31 PM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:

Hi,

I'm using Solr 5.2.1, and I've indexed about 1GB of data into Solr. However, I find that clustering is exceedingly slow after I index this 1GB of data. It took almost 30 seconds to return the cluster results when I set it to cluster the top 1000 records, and still takes more than 3 seconds when I set it to cluster the top 100 records.

Is this speed normal? I understand Solr can index terabytes of data without performance being impacted so much, but now the collection is slowing down even with just 1GB of data.

Below is my clustering configuration in solrconfig.xml:

<requestHandler name="/clustering" startup="lazy" enable="${solr.clustering.enabled:true}" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">1000</int>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">text</str>
    <str name="fl">null</str>
    <bool name="clustering">true</bool>
    <bool name="clustering.results">true</bool>
    <str name="carrot.title">subject content tag</str>
    <bool name="carrot.produceSummary">true</bool>
    <int name="carrot.fragSize">20</int>
    <!-- the maximum number of labels per cluster -->
    <int name="carrot.numDescriptions">20</int>
    <!-- produce sub clusters -->
    <bool name="carrot.outputSubClusters">false</bool>
    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">7</str>
    <!-- Configure the remaining request handler parameters. -->
    <str name="defType">edismax</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>

Regards,
Edwin
Re: Can TrieDateField fields be null?
To be strict about it, I'd say that TrieDateField fields CANNOT be null, but they CAN be excluded from the document. You could then check whether or not a value exists for this field.

Upayavira

On Sun, Aug 23, 2015, at 02:55 AM, Erick Erickson wrote:
> TrieDateFields can be "null" -- actually, just absent from the document. I just verified with 4.10. How are you indexing? I suspect that somehow the program that's sending things to Solr is putting the default time in. What version of Solr?
>
> Best,
> Erick
>
> On Sat, Aug 22, 2015 at 4:04 PM, Henrique O. Santos hensan...@gmail.com wrote:
>> Hello,
>>
>> Just a simple question: can TrieDateField fields be null? I have a schema with the following field and type:
>>
>> <field name="started_at" type="date" indexed="true" stored="true" docValues="true"/>
>> <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
>>
>> Every time I index a document with no value for this field, the current time gets indexed and stored. Is there any way to make this field null? My use case for this collection requires that I check whether that date field is already filled or not.
>>
>> Thank you,
>> Henrique.
Re: solr add document
Thanks, I just need to call solr.commit().

--
View this message in context: http://lucene.472066.n3.nabble.com/solr-add-document-tp4224480p4224698.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Can TrieDateField fields be null?
Following up on Shawn's comment: this can be the result of some sort of serialization, or, if you're pulling info from a DB, the literal string "NULL" may be returned by the DB. Solr really has no concept of a distinct NULL value for a field; in Solr/Lucene terms that's just the total absence of the field from the document.

Best,
Erick

On Sun, Aug 23, 2015 at 8:15 AM, Shawn Heisey apa...@elyograg.org wrote:
> Unless the field is marked as required in your schema, TrieDateField will work if you have no value for the field. [...]
SOLR 5.3
At lucene.apache.org/solr it says Solr 5.3 is there, but when I click on downloads it shows Solr 5.2.1... ??

"APACHE SOLR™ 5.3.0 - Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™."

--
Bill Bell
billnb...@gmail.com
cell 720-256-8076
Re: Can TrieDateField fields be null?
Hi Erick and Upayavira, thanks for the reply.

I am using Solr 5.2.1 and the SolrJ 5.2.1 API with an annotated POJO to update the index. And you were right: somehow my Joda DateTime field was being filled with the current timestamp prior to the update.

Thanks for the clarification again.

On 08/22/2015 09:55 PM, Erick Erickson wrote:
> TrieDateFields can be "null" -- actually, just absent from the document. I just verified with 4.10. [...]
Re: Too many updates received since start
Indeed, I don't understand the caveat either, but I can imagine that it is related to some algorithm that triggers a full sync if necessary.

I will wait for 5.3 to do the upgrade and have this configuration available.

--
/Yago Riveiro

On Sun, Aug 23, 2015 at 3:37 AM, Shawn Heisey apa...@elyograg.org wrote:
> On 8/22/2015 3:50 PM, Yago Riveiro wrote:
>> I'm using Java 7u25 (Oracle version) with Solr 4.6.1. It works well at 98% of throughput, but on some full GCs the issue arises. A full sync for one shard is more than 50GB. Is there any configuration for the number of docs a replica can be behind the leader?
>
> It looks like the number of docs is configurable in 5.1 and later: https://issues.apache.org/jira/browse/SOLR-6359
>
> There is apparently a caveat related to SolrCloud recovery, which I am having trouble grasping: the 20% newest existing transaction log of the core to be recovered must be newer than the 20% oldest existing transaction log of the good core.
>
> Thanks,
> Shawn
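For reference, SOLR-6359 exposes the limit through the <updateLog> section of solrconfig.xml. A sketch of what that might look like on 5.1+; the numbers here are illustrative (the shipped defaults are much lower), and raising them trades disk space for fewer full index syncs:

```xml
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <!-- keep more update records in the transaction logs so a lagging
       replica can catch up via peersync instead of a full 50GB copy -->
  <int name="numRecordsToKeep">10000</int>
  <int name="maxNumLogsToKeep">100</int>
</updateLog>
```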
Re: Solr performance is slow with just 1GB of data indexed
Hi Shawn and Toke,

I only have 520 docs in my data, but each of the documents is quite big in size; in Solr, the index is using 221MB. So when I set it to read from the top 1000 rows, it should just be reading all the 520 docs that are indexed?

Regards,
Edwin

On 23 August 2015 at 22:52, Shawn Heisey apa...@elyograg.org wrote:
> Probably not, but I know nothing about your data. How many Solr docs were created by indexing 1GB of data? How much disk space is used by your Solr index(es)? [...]
Re: Can TrieDateField fields be null?
Hello again,

I am doing some manual indexing using the Solr Admin UI to be exactly sure how TrieDateFields and null values work. When I remove the TrieDateField from the document, I get the following when trying to index it:

msg: "Invalid Date String:'NULL'", code: 400

This is on Solr 5.2.1. Can I assume that TrieDateFields need to be specified for every document?

Thanks.

On 08/23/2015 09:48 AM, Henrique O. Santos wrote:
> Hi Erick and Upayavira, thanks for the reply. I am using Solr 5.2.1 and the SolrJ 5.2.1 API with an annotated POJO to update the index. [...]
Re: Solr performance is slow with just 1GB of data indexed
On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
> Hi Shawn,
>
> Yes, I've increased the heap size to 4GB already, and I'm using a machine with 32GB RAM. Is it recommended to further increase the heap size to like 8GB or 16GB?

Probably not, but I know nothing about your data. How many Solr docs were created by indexing 1GB of data? How much disk space is used by your Solr index(es)?

I know very little about clustering, but it looks like you've gotten a reply from Toke, who knows a lot more about that part of the code than I do.

Thanks,
Shawn
Re: DIH delta-import pk
Now I have set the db id as the unique field, plus a uuid field, which should be generated automatically as required. But when I add a document I get an error that my required uuid field is missing.

--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-delta-import-pk-tp4224342p4224701.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Remove duplicate suggestions in Solr
Hi Edwin.

What you are doing here is search, as Solr has separate components for doing suggestions.

About dedup:
- Have a look at the manual: https://cwiki.apache.org/confluence/display/solr/De-Duplication
- Or simply do your dedup upfront, before ingesting into Solr, by assigning the same id to all docs with the same textng (this may require a different index if you want to keep the existing data with duplicates for other purposes).
- Or you could use result grouping/field collapsing to group/dedup your results.

Hope this helps.

Arcadius.

On 21 August 2015 at 06:41, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:

Hi,

I would like to check, is there any way to remove duplicate suggestions in Solr? I have several documents that look very similar, and when I do a suggestion query, it comes back with all the same results. I'm using Solr 5.2.1.

This is my suggestion handler:

<requestHandler name="/suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Browse specific stuff -->
    <str name="echoParams">all</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <!-- Everything below should be identical to ac handler above -->
    <str name="defType">edismax</str>
    <str name="rows">10</str>
    <str name="fl">id, score</str>
    <!--<str name="qf">textsuggest^30 extrasearch^30.0 textng^50.0 phonetic^10</str>-->
    <!--<str name="qf">content^50 title^50 extrasearch^30.0 textng^1.0 textng2^200.0</str>-->
    <str name="qf">content^50 title^50 extrasearch^30.0</str>
    <str name="pf">textnge^50.0</str>
    <!--<str name="bf">product(log(sum(popularity,1)),100)^20</str>-->
    <!-- Define relative importance between types. May be overridden per request by e.g. personboost=120 -->
    <str name="boost">product(map(query($type1query),0,0,1,$type1boost),map(query($type2query),0,0,1,$type2boost),map(query($type3query),0,0,1,$type3boost),map(query($type4query),0,0,1,$type4boost),$typeboost)</str>
    <double name="typeboost">1.0</double>
    <str name="type1query">content_type:application/pdf</str>
    <double name="type1boost">0.9</double>
    <str name="type2query">content_type:application/msword</str>
    <double name="type2boost">0.5</double>
    <str name="type3query">content_type:NA</str>
    <double name="type3boost">0.0</double>
    <str name="type4query">content_type:NA</str>
    <double name="type4boost">0.0</double>
    <str name="hl">on</str>
    <str name="hl.fl">id, textng, textng2, language_s</str>
    <str name="hl.highlightMultiTerm">true</str>
    <str name="hl.preserveMulti">true</str>
    <str name="hl.encoder">html</str>
    <!--<str name="f.content.hl.fragsize">80</str>-->
    <str name="hl.fragsize">50</str>
    <str name="debugQuery">false</str>
  </lst>
</requestHandler>

This is my query:

http://localhost:8983/edm/chinese2/suggest?q=do our best&defType=edismax&qf=content^5 textng^5&pf=textnge^50&pf2=content^20 textnge^50&pf3=content^40%20textnge^50&ps2=2&ps3=2&stats.calcdistinct=true

This is the suggestion result:

"highlighting": {
  "responsibility001": {"id": ["responsibility001"], "textng": ["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility002": {"id": ["responsibility002"], "textng": ["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility003": {"id": ["responsibility003"], "textng": ["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility004": {"id": ["responsibility004"], "textng": ["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility005": {"id": ["responsibility005"], "textng": ["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility006": {"id": ["responsibility006"], "textng": ["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility007": {"id": ["responsibility007"], "textng": ["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility008": {"id": ["responsibility008"], "textng": ["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility009": {"id": ["responsibility009"], "textng": ["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility010": {"id": ["responsibility010"], "textng": ["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]}
}

Regards,
Edwin

--
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com
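The De-Duplication approach from the manual is configured as an update processor chain in solrconfig.xml; a sketch based on the reference guide, where the signature fields are an assumption (for Edwin's case the snippet source field textng would be the natural choice):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <!-- overwriteDupes=true makes docs with an identical signature replace
         each other, so only one copy survives in the index -->
    <bool name="overwriteDupes">true</bool>
    <str name="fields">textng</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain must then be attached to the update handler (e.g. with update.chain=dedupe) for it to take effect at indexing time.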
Multiple concurrent queries to Solr
Hello,

I want to run a few Solr queries in parallel, which are being done in a multi-threaded model now. I was wondering if there are any client libraries to query Solr through a non-blocking I/O mechanism instead of a threaded model. Has anyone attempted something like this?

Regards,
Ashish
Re: DIH delta-import pk
Send the SQL and schema.xml, and also the logs. Does it complain about _id_ or your field in the schema?

On Sun, Aug 23, 2015 at 4:55 AM, CrazyDiamond crazy_diam...@mail.ru wrote:
> Now I have set the db id as the unique field, plus a uuid field, which should be generated automatically as required. But when I add a document I get an error that my required uuid field is missing.

--
Bill Bell
billnb...@gmail.com
cell 720-256-8076
Re: Can TrieDateField fields be null?
On 8/23/2015 8:29 AM, Henrique O. Santos wrote:
> I am doing some manual indexing using the Solr Admin UI to be exactly sure how TrieDateFields and null values work. When I remove the TrieDateField from the document, I get the following when trying to index it:
>
> msg: "Invalid Date String:'NULL'", code: 400

Unless the field is marked as required in your schema, TrieDateField will work if you have no value for the field. That means the field is not present in the javabin, XML, or JSON data sent to Solr for indexing -- not that an empty string is present.

What you have here is literally the string NULL -- four letters. This will NOT work on any kind of Trie field. Sometimes you can run into a conversion glitch related to a Java null object, but in that case the value is usually lowercase -- null -- which wouldn't work either.

Thanks,
Shawn
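To illustrate Shawn's distinction, a minimal JSON update body (field names taken from Henrique's schema): the first document supplies a value, and the second simply omits the field, which is the only valid way to index a "null" date. Sending the string NULL, or an empty string, produces the 400 error above.

```json
[
  {"id": "doc1", "started_at": "2015-08-22T10:00:00Z"},
  {"id": "doc2"}
]
```

A query such as -started_at:[* TO *] can then be used to find the documents where the field is absent.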
Re: DIH delta-import pk
As far as I understand I can't use two uniqueKey fields. I need the db id and the uuid because I'm moving data from the database to the Solr index entirely. Temporarily I need it to be compatible with delta-import, but in future I will use only the new uuid.

--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-delta-import-pk-tp4224342p4224699.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: DIH delta-import pk
I don't use SQL now; I'm adding documents manually.

<fieldType name="uuid" class="solr.UUIDField" indexed="true"/>
<field name="uu_id" type="uuid" indexed="true" stored="true" required="true"/>
<field name="db_id_s" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<uniqueKey>db_id_s</uniqueKey>

--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-delta-import-pk-tp4224342p4224762.html
Sent from the Solr - User mailing list archive at Nabble.com.
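One way to have uu_id generated automatically is an update processor chain with UUIDUpdateProcessorFactory in solrconfig.xml; a sketch assuming the schema above (the chain name is arbitrary):

```xml
<updateRequestProcessorChain name="add-uuid">
  <!-- fills uu_id with a fresh UUID when the incoming document has none -->
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">uu_id</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Because the processor populates the field before the document is indexed, the required="true" constraint on uu_id should then be satisfied, provided the update request actually routes through this chain (update.chain=add-uuid or a default="true" on the chain).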
Re: Solr performance is slow with just 1GB of data indexed
We use 8GB to 10GB heaps for indexes of that size all the time.

Bill Bell
Sent from mobile

On Aug 23, 2015, at 8:52 AM, Shawn Heisey apa...@elyograg.org wrote:
> Probably not, but I know nothing about your data. How many Solr docs were created by indexing 1GB of data? How much disk space is used by your Solr index(es)? [...]
Re: Custom Solr caches in a FunctionQuery that emulates the ExternalFileField
Hello Upayavira, It's a long month ago! I just described this approach in http://blog.griddynamics.com/2015/08/scoring-join-party-in-solr-53.html Coming back to our discussion I think I miss {!func} which turn fieldname into function query. On Fri, Jul 24, 2015 at 3:41 PM, Upayavira u...@odoko.co.uk wrote: Mikhail, I've tried this out, but to be honest I can't work out what the score= parameter is supposed to add. I assume that if I do {!join fromIndex=other from=other_key to=key score=max}somefield:(abc dev) It will calculate the score for each document that has the same key value, and include that in the score for the main document? If this is the case, then I should be able to do: {!join fromIndex=other from=other_key to=key score=max}{!boost b=my_boost_value_field}*:* In which case, it'll take the value of my_boost_field in the other core, and include it in the score for my document that has the value of key? Upayavira On Fri, Jul 10, 2015, at 04:15 PM, Mikhail Khludnev wrote: I've heard that people use https://issues.apache.org/jira/browse/SOLR-6234 for such purpose - adding scores from fast moving core to the bigger slow moving one On Fri, Jul 10, 2015 at 4:54 PM, Upayavira u...@odoko.co.uk wrote: All, I have knocked up what I think could be a really cool function query - it allows you to retrieve a value from another core (much like a pseudo join) and use that value during scoring (much like an ExternalFileField). Examples: * Selective boosting of documents based upon a category based value * boost on aggregated popularity values * boost on fast moving data on your slow moving index It *works* but it does so very slowly (on 3m docs, milliseconds without, and 24s with it). 
There are two things that happen a lot: * locate a document with a unique ID value of X * retrieve the value of field Y for that doc. It seems to me now that I need to implement a cache that has a string value as the key and the (float) field value as the object, and that is warmed alongside the existing caches. Any pointers to examples of how I could do this, or other ways to make the conversion from a key value to a float value faster? NB. I hope to contribute this if I can make it perform. Thanks! Upayavira -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
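As a rough illustration of the string-key-to-float-value cache described in this thread: the sketch below is NOT Solr's SolrCache API -- the class and method names are invented purely to show the shape of the lookup (LRU map keyed by uniqueKey string, loading the field value on a miss).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Invented names, for illustration only: an LRU cache from a unique-key
// string to a float field value, standing in for a warmed Solr user cache.
public class KeyToFloatCache {
    private final Map<String, Float> cache;

    public KeyToFloatCache(final int maxSize) {
        // Access-ordered LinkedHashMap: evict the least-recently-used
        // entry once the cache grows past maxSize.
        this.cache = new LinkedHashMap<String, Float>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Float> eldest) {
                return size() > maxSize;
            }
        };
    }

    // On a miss, the loader stands in for "locate the doc whose uniqueKey
    // is X and read field Y"; on a hit, the lookup is a plain map get.
    public float get(String key, Function<String, Float> loader) {
        return cache.computeIfAbsent(key, loader);
    }
}
```

In a real Solr plugin this would more likely be a user cache declared in solrconfig.xml and warmed via a regenerator, but the key-to-float mapping is the same.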
Re: Remove duplicate suggestions in Solr
Hi Arcadius, Thank you for your reply. So this means that the de-duplication has to be done at indexing time, and not at query time? Yes, currently I'm building on the search to do my suggestions, as I faced some issues with the suggestion components in the Solr 5.1.0 version. Will the suggestion components solve this issue of giving duplicate suggestions? There might also be cases where about 1/2 to 3/4 of my indexed documents are the same, with only the remaining 1/4 to 1/2 being different. So this will probably lead to cases where the indexed documents are different, but a search may return the parts of the documents that are the same. Regards, Edwin On 23 August 2015 at 21:44, Arcadius Ahouansou arcad...@menelic.com wrote: Hi Edwin. What you are doing here is search, as Solr has separate components for doing suggestions. About dedup: - have a look at the manual https://cwiki.apache.org/confluence/display/solr/De-Duplication - or simply do your dedup upfront before ingesting into Solr, by assigning the same id to all docs with the same textng (this may require a different index if you want to keep the existing data with duplicates for other purposes) - or you could use result grouping/field collapsing to group/dedup your results Hope this helps, Arcadius. On 21 August 2015 at 06:41, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi, I would like to check: is there any way to remove duplicate suggestions in Solr? I have several documents that look very similar, and when I do a suggestion query, it comes back with all the same results. 
I'm using Solr 5.2.1. This is my suggestion pipeline:

<requestHandler name="/suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Browse specific stuff -->
    <str name="echoParams">all</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <!-- Everything below should be identical to ac handler above -->
    <str name="defType">edismax</str>
    <str name="rows">10</str>
    <str name="fl">id, score</str>
    <!--<str name="qf">textsuggest^30 extrasearch^30.0 textng^50.0 phonetic^10</str>-->
    <!--<str name="qf">content^50 title^50 extrasearch^30.0 textng^1.0 textng2^200.0</str>-->
    <str name="qf">content^50 title^50 extrasearch^30.0</str>
    <str name="pf">textnge^50.0</str>
    <!--<str name="bf">product(log(sum(popularity,1)),100)^20</str>-->
    <!-- Define relative importance between types. May be overridden per request by e.g. personboost=120 -->
    <str name="boost">product(map(query($type1query),0,0,1,$type1boost),map(query($type2query),0,0,1,$type2boost),map(query($type3query),0,0,1,$type3boost),map(query($type4query),0,0,1,$type4boost),$typeboost)</str>
    <double name="typeboost">1.0</double>
    <str name="type1query">content_type:application/pdf</str>
    <double name="type1boost">0.9</double>
    <str name="type2query">content_type:application/msword</str>
    <double name="type2boost">0.5</double>
    <str name="type3query">content_type:NA</str>
    <double name="type3boost">0.0</double>
    <str name="type4query">content_type:NA</str>
    <double name="type4boost">0.0</double>
    <str name="hl">on</str>
    <str name="hl.fl">id, textng, textng2, language_s</str>
    <str name="hl.highlightMultiTerm">true</str>
    <str name="hl.preserveMulti">true</str>
    <str name="hl.encoder">html</str>
    <!--<str name="f.content.hl.fragsize">80</str>-->
    <str name="hl.fragsize">50</str>
    <str name="debugQuery">false</str>
  </lst>
</requestHandler>

This is my query:

http://localhost:8983/edm/chinese2/suggest?q=do our best&defType=edismax&qf=content^5 textng^5&pf=textnge^50&pf2=content^20 textnge^50&pf3=content^40%20textnge^50&ps2=2&ps3=2&stats.calcdistinct=true

This is the suggestion result:

"highlighting":{
  "responsibility001":{
    "id":["responsibility001"],
    "textng":["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility002":{
    "id":["responsibility002"],
    "textng":["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility003":{
    "id":["responsibility003"],
    "textng":["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility004":{
    "id":["responsibility004"],
    "textng":["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility005":{
    "id":["responsibility005"],
    "textng":["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility006":{
    "id":["responsibility006"],
    "textng":["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility007":{
    "id":["responsibility007"],
    "textng":["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility008":{
    "id":["responsibility008"],
    "textng":["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility009":{
    "id":["responsibility009"],
    "textng":["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},
  "responsibility010":{
    "id":["responsibility010"],
    "textng":["We will strive to <em>do</em> <em>our</em> <em>best</em>. &lt;br&gt; "]},

Regards, Edwin -- Arcadius Ahouansou
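Until one of the Solr-side fixes Arcadius suggests (index-time dedup or result grouping) is in place, a client-side stopgap that collapses identical suggestion texts like the ten above could look like this. This is illustrative Java only, not a Solr API:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Client-side stopgap: collapse identical suggestion texts while keeping
// the original ranking order. Index-time dedup or grouping in Solr itself
// is the better fix at scale.
public class SuggestionDedup {
    public static List<String> dedupe(List<String> suggestions) {
        // LinkedHashSet drops repeats but preserves first-seen order.
        return new ArrayList<>(new LinkedHashSet<>(suggestions));
    }
}
```

Note this only catches byte-identical texts; near-duplicates would still need something like the signature-based de-duplication from the manual page linked above.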
Re: Solr performance is slow with just 1GB of data indexed
Hi Alexandre, I've tried to use just indexed=true, and the speed is still the same, not any faster. If I set stored=false, no results come back from the clustering. Is this because the fields are not stored, and the clustering requires fields that are stored? I've also increased my heap size to 16GB, as I'm using a machine with 32GB RAM, but there is no significant improvement in performance either. Regards, Edwin On 24 August 2015 at 10:16, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Yes, I'm using stored=true. <field name="content" type="text_general" indexed="true" stored="true" omitNorms="true" termVectors="true"/> However, this field needs to be stored, as my program requires this field to be returned during normal searching. I tried lazyLoading=true, but it's not working. Would you do a copyField for the content, and not set stored=true for that field? So that field would just be referenced for the clustering, and the normal search would reference the original content field? Regards, Edwin On 23 August 2015 at 23:51, Alexandre Rafalovitch arafa...@gmail.com wrote: Are you by any chance doing stored=true on the fields you want to search? If so, you may want to switch to just indexed=true. Of course, they will then not come back in the results, but do you really want to sling huge content fields around? The other option is to do lazyLoading=true and not request that field. As a test, you could actually do this without needing to reindex Solr, just with a restart. This could give you a way to test whether the stored field size is the issue. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi Shawn and Toke, I only have 520 docs in my data, but each of the documents is quite big in size; in Solr, it is using 221MB. 
So when I set it to read the top 1000 rows, it should just be reading all 520 docs that are indexed? Regards, Edwin On 23 August 2015 at 22:52, Shawn Heisey apa...@elyograg.org wrote: On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote: Hi Shawn, Yes, I've increased the heap size to 4GB already, and I'm using a machine with 32GB RAM. Is it recommended to further increase the heap size to like 8GB or 16GB? Probably not, but I know nothing about your data. How many Solr docs were created by indexing 1GB of data? How much disk space is used by your Solr index(es)? I know very little about clustering, but it looks like you've gotten a reply from Toke, who knows a lot more about that part of the code than I do. Thanks, Shawn
Re: Solr performance is slow with just 1GB of data indexed
Yes, I'm using stored=true. <field name="content" type="text_general" indexed="true" stored="true" omitNorms="true" termVectors="true"/> However, this field needs to be stored, as my program requires this field to be returned during normal searching. I tried lazyLoading=true, but it's not working. Would you do a copyField for the content, and not set stored=true for that field? So that field would just be referenced for the clustering, and the normal search would reference the original content field? Regards, Edwin On 23 August 2015 at 23:51, Alexandre Rafalovitch arafa...@gmail.com wrote: Are you by any chance doing stored=true on the fields you want to search? If so, you may want to switch to just indexed=true. Of course, they will then not come back in the results, but do you really want to sling huge content fields around? The other option is to do lazyLoading=true and not request that field. As a test, you could actually do this without needing to reindex Solr, just with a restart. This could give you a way to test whether the stored field size is the issue. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi Shawn and Toke, I only have 520 docs in my data, but each of the documents is quite big in size; in Solr, it is using 221MB. So when I set it to read the top 1000 rows, it should just be reading all 520 docs that are indexed? Regards, Edwin On 23 August 2015 at 22:52, Shawn Heisey apa...@elyograg.org wrote: On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote: Hi Shawn, Yes, I've increased the heap size to 4GB already, and I'm using a machine with 32GB RAM. Is it recommended to further increase the heap size to like 8GB or 16GB? Probably not, but I know nothing about your data. How many Solr docs were created by indexing 1GB of data? How much disk space is used by your Solr index(es)? 
I know very little about clustering, but it looks like you've gotten a reply from Toke, who knows a lot more about that part of the code than I do. Thanks, Shawn
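The copyField approach asked about in this thread could be sketched roughly as below. This is an untested sketch, not a verified fix: the field and type names are assumptions taken from the thread, and note that clustering may still require a stored field (as observed above, stored=false returned no clustering results).

```
<!-- schema.xml sketch: keep a stored original for retrieval, plus an
     index-only copy to search against -->
<field name="content"       type="text_general" indexed="false" stored="true"/>
<field name="content_index" type="text_general" indexed="true"  stored="false"/>
<copyField source="content" dest="content_index"/>

<!-- solrconfig.xml (inside the <query> section): only read large stored
     fields when they are actually requested -->
<enableLazyFieldLoading>true</enableLazyFieldLoading>
```

With this layout, queries would use qf=content_index while fl=content returns the stored original; any schema change like this requires reindexing.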
Re: Exception while using {!cardinality=1.0}.
- Did you have the exact same data in both fields? No, the data is not the same. - Did your real query actually compute stats on the same field you had : done your main term query on? The query field is different, and I failed to make that clear. I will modify the jira accordingly. So the query can be q=anyfield:query&stats=true&stats.field={!cardinality=1.0}field Can you please explain how having the same field for the query and the stats can cause an issue, for my better understanding of this feature? I haven't had a chance to review the jira in depth or actually run your code with those configs -- but if you get a chance before I do, please re-review the code and configs you posted and see if you can reproduce using the *exact* same data in two different fields, and whether the choice of query makes a difference in the behavior you see. Will try to reproduce the same as you have mentioned and revert with details. Thanks, Modassar On Sat, Aug 22, 2015 at 3:43 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : - Did you have the exact same data in both fields? : Both the fields are string type. that's not the question I asked. Is the data *in* these fields (ie: the actual value of each field for each document) the same for both of the fields? This is important for figuring out whether the root problem is that having docValues (or not having docValues) causes a problem, or that having certain kinds of *data* in a string field (regardless of docValues) can cause this problem. Skimming the sample code you posted to SOLR-7954, you are definitely putting different data into "field" than you put into "field1", so it's still not clear what the problem is. : - Did your real query actually compute stats on the same field you had : done your main term query on? : I did not get the question, but as far as I understood and verified in the : Solr log, the stats are computed on the field given with : stats.field={!cardinality=1.0}field. 
the question is specific to the example query you mentioned before and again in your description in SOLR-7954. They show that the same field name you are computing stats on (field) is also used in your main query as a constraint on the documents (q=field:query), which is an odd and very special edge case that may be pertinent to the problem you are seeing. Depending on what data you index, that might easily only match 1 document -- in the case of the test code you put in the jira, exactly 0 documents (since you never index the text "query" into the field "field" for any document). I haven't had a chance to review the jira in depth or actually run your code with those configs -- but if you get a chance before I do, please re-review the code and configs you posted and see if you can reproduce using the *exact* same data in two different fields, and whether the choice of query makes a difference in the behavior you see. : : Regards, : Modassar : : On Wed, Aug 19, 2015 at 10:24 AM, Modassar Ather modather1...@gmail.com : wrote: : : Ahmet/Chris! Thanks for your replies. : : Ahmet, I think net.agkn.hll.serialization is used by the hll() function : implementation of Solr. : : Chris, I will try to create sample data and create a jira ticket with : details. : : Regards, : Modassar : : : On Tue, Aug 18, 2015 at 9:58 PM, Chris Hostetter hossman_luc...@fucit.org : wrote: : : : : I am getting the following exception for the query : : : *q=field:query&stats=true&stats.field={!cardinality=1.0}field*. The : : exception is not seen once the cardinality is set to 0.9 or less. : : The field is *docValues enabled* and *indexed=false*. The same : exception : : I tried to reproduce on a non-docValues field but could not. Please : help me : : resolve the issue. : : Hmmm... this is a weird error ... but you haven't really given us enough : information to really guess what the root cause is : : - What was the datatype of the field(s)? : - Did you have the exact same data in both fields? : - Are these multivalued fields? 
: - Did your real query actually compute stats on the same field you had : done your main term query on? : : I know we have some tests of this basic situation, and I tried to do some : more manual testing to spot check, but I can't reproduce. : : If you can please provide a full copy of the data (as csv or xml or : whatever) to build your index, along with all solr configs and the exact : queries to reproduce, that would really help get to the bottom of this -- : if you can't provide all the data, then can you at least reproduce with a : small set of sample data? : : Either way: please file a new jira issue and attach as much detail as you : can -- this URL has a lot of great tips on the types of data we need to be : able to get to the bottom of bugs... : : https://wiki.apache.org/solr/UsingMailingLists : : : : : : : ERROR - 2015-08-11 12:24:00.222; [core] : :
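For reference, the stats request being debated in this thread can be assembled with plain JDK classes; the sketch below only builds the URL string (host, core, and field names are placeholders, and the local-params syntax matches the query quoted above).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch: build the stats/cardinality request URL from this thread.
// Base URL, q, and field name are placeholders, not a tested setup.
public class CardinalityQuery {
    public static String buildUrl(String baseUrl, String q,
                                  String statsField, double cardinality) {
        // {!cardinality=N} trades accuracy for memory: 1.0 is most
        // accurate; lower values use a smaller HyperLogLog sketch.
        String localParams = "{!cardinality=" + cardinality + "}" + statsField;
        return baseUrl + "/select"
                + "?q=" + URLEncoder.encode(q, StandardCharsets.UTF_8)
                + "&stats=true"
                + "&stats.field=" + URLEncoder.encode(localParams, StandardCharsets.UTF_8);
    }
}
```

Encoding the local-params prefix matters here, since `{`, `!`, and `=` are not safe in a query-string value.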