Re: Unable to query the spellchecker in a distributed way
I figured it out: in Solr 4.4, the component org.apache.solr.handler.component.SpellCheckComponent does not implement the method distributedProcess(ResponseBuilder rb), which org.apache.solr.handler.component.SearchHandler relies on to handle distributed searches correctly. It seems that the SpellCheckComponent in 4.10 does not implement it either.

Do you have a workaround for these versions?

--
Damien Picard
Expert GWT
<http://www.editions-eni.fr/livres/gwt-google-web-toolkit-developpez-des-applications-internet-riches-ria-en-java/.97a1a26e7d5be94763fc45ac2a1e961a.html>
Mob : 06 11 51 47 78
Re: Unable to query the spellchecker in a distributed way
(We use Solr 4.4.)

--
Damien Picard
Unable to query the spellchecker in a distributed way
Hi,

We are using SolrCloud (4 nodes) and we have defined a suggester using the spellcheck component.

The suggester is defined as:

  suggestOpeGes
    lookupImpl: org.apache.solr.spelling.suggest.tst.TSTLookup
    classname: org.apache.solr.spelling.suggest.Suggester
    ref_opegestion
    0
    true
    true

  suggestRefCre
    lookupImpl: org.apache.solr.spelling.suggest.tst.TSTLookup
    classname: org.apache.solr.spelling.suggest.Suggester
    ref_cre
    0
    true
    true

  suggestRefEcr
    lookupImpl: org.apache.solr.spelling.suggest.tst.TSTLookup
    classname: org.apache.solr.spelling.suggest.Suggester
    ref_ecriture
    0
    true
    true

  true
  suggestOpeGes
  20
  true
  false

  suggest

When I query this collection's suggest handler with the shards parameters:

  GET /solr/ppd_piste_audit_gsie_traite_001/suggest?q=GSIEBBA=json=true=true=suggestOpeGes=suggest/

I get no results:

  {
    "responseHeader":{
      "status":0,
      "QTime":0}}

But when I disable the distributed search:

  GET /solr/ppd_piste_audit_gsie_traite_001/suggest?q=GSIEMMA=json=true=true=suggestOpeGes=false

I get the results I expect:

  {
    "responseHeader":{
      "status":0,
      "QTime":28},
    "spellcheck":{
      "suggestions":[
        "GSIEBBA",{
          "numFound":20,
          "startOffset":0,
          "endOffset":7,
          "suggestion":["GSIEMMA44257700010010401",
            "GSIEBBA64257700010013501",
            "GSIEBBA70723503779040201",
            "GSIEBBA71257700030012101",
            "GSIEBBA71723503830023601",
            "GSIEBBA74001300670011701",
            "GSIEBBA74001300670011801",
            "GSIEBBA74772000136021201",
            "GSIEBBA76257700040010501",
            "GSIEBBA76600101133030501",
            "GSIEBBA76680400195030601",
            "GSIEBBA77692100093024401",
            "GSIEBBA77692100093024501",
            "GSIEBBA78450700227020701",
            "GSIEBBA78450700227020801",
            "GSIEBBA78854102439020301",
            "GSIEBBA78854102439020401",
            "GSIEBBA79441700201040401",
            "GSIEBBA79723504720012701",
            "GSIEBBA79763600779010501"]},
        "collation","GSIEBBA44257700010010401"]}}

I also tried to send a "manually" distributed search, without success:

  GET /solr/ppd_piste_audit_gsie_traite_001-03_shard1_replica2/suggest?q=GSIEMMA=suggest=json=true=true=suggestOpeGes=suggest/=dn330003.xxx.priv:8983/solr/ppd_piste_audit_gsie_traite_001-03_shard2_replica1/|dn330004.xxx.priv:8983/solr/ppd_piste_audit_gsie_traite_001-03_shard1_replica1/

What am I doing wrong?

Thank you.

--
Damien Picard
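For readers who want to reproduce this kind of request, here is a minimal sketch of how a distributed suggest URL could be assembled. The parameter names in the archived URLs were lost, so the names used here (wt, indent, spellcheck, spellcheck.dictionary, shards, shards.qt) are the standard Solr ones and are only an assumption about what the original requests contained. The function name build_suggest_url, the localhost base URL and the "/suggest" value for shards.qt are hypothetical.

```python
from urllib.parse import urlencode


def build_suggest_url(base, collection, query, dictionary, shards=None, shards_qt=None):
    """Build a /suggest request URL; shards and shards_qt are optional."""
    params = [
        ("q", query),
        ("wt", "json"),
        ("indent", "true"),
        ("spellcheck", "true"),
        ("spellcheck.dictionary", dictionary),
    ]
    if shards:
        # Distinct shards are separated by commas in the shards parameter.
        params.append(("shards", ",".join(shards)))
    if shards_qt:
        # Handler path that the per-shard sub-requests should be sent to.
        params.append(("shards.qt", shards_qt))
    return f"{base}/solr/{collection}/suggest?{urlencode(params)}"


url = build_suggest_url(
    "http://localhost:8983",  # hypothetical host
    "ppd_piste_audit_gsie_traite_001",
    "GSIEBBA",
    "suggestOpeGes",
    shards=[
        "dn330003.xxx.priv:8983/solr/ppd_piste_audit_gsie_traite_001-03_shard2_replica1",
        "dn330004.xxx.priv:8983/solr/ppd_piste_audit_gsie_traite_001-03_shard1_replica1",
    ],
    shards_qt="/suggest",  # assumed value
)
print(url)
```

Building the query string with urlencode keeps the colons, slashes and commas of the shard addresses properly escaped, which is easy to get wrong when the URL is typed by hand.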
Re: Solr QTime explanation
Thank you, you are right! It seems to be congestion caused by our test tool.

Regards,

2016-01-19 18:46 GMT+01:00 Toke Eskildsen <t...@statsbiblioteket.dk>:

> Damien Picard <picard.dam...@gmail.com> wrote:
> > Currently we have 4 Solr nodes with 12 GB of memory (heap); the collections are replicated (4 shards, 1 replica).
> > This query mostly returns a QTime of 4 and takes around 20 ms on the client side to get the result.
>
> > We have to handle around 200 simultaneous connections.
>
> You are probably experiencing congestion. JMeter can visualize throughput. Try experimenting with 10 to 100 concurrent threads in increments of 10 threads and look at the throughput as you go. My guess is that throughput will rise as you increase threads, until some point after which it will fall again as the Solr nodes exceed their peak performance point. You might end up getting better performance by rate-limiting outside of SolrCloud.
>
> Also, what does 200 simultaneous connections mean? Is that 200 requests per second?
>
> - Toke Eskildsen

--
Damien Picard
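The experiment Toke describes (10 to 100 concurrent threads in steps of 10, watching throughput at each level) can also be sketched outside JMeter. The following is only an illustration: send_query is a hypothetical stand-in that sleeps instead of calling Solr, and the function names and the number of requests per thread are assumptions, not part of the thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def send_query():
    """Hypothetical stand-in for one Solr request; replace with a real HTTP call."""
    time.sleep(0.005)


def measure_throughput(threads, requests_per_thread=20):
    """Run the workload with a fixed number of threads; return requests per second."""
    total = threads * requests_per_thread
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(send_query) for _ in range(total)]
        for future in futures:
            future.result()  # re-raise any error from the worker
    elapsed = time.perf_counter() - start
    return total / elapsed


def ramp(start=10, stop=100, step=10):
    """Measure throughput at each concurrency level from start to stop inclusive."""
    return {n: measure_throughput(n) for n in range(start, stop + 1, step)}


if __name__ == "__main__":
    for threads, rps in ramp().items():
        print(f"{threads:3d} threads: {rps:8.1f} requests/s")
```

With a real request in place of the sleep, the level at which requests per second stops rising is a reasonable first estimate of where to cap concurrency.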
Solr QTime explanation
Hi,

I'm currently testing Solr query execution performance (over HTTP and SolrJ). Using HTTP with JMeter, I see that the response time increases with the number of concurrent requests (100 simultaneous requests in my case).

To understand where Solr spends more time, I use the debug=timing parameter, and I see this kind of response:

  0
  3003
  uuid:FA2C9342381E3969E04456C8B4C639A9 timing true xml
  false P 3224149 FA2C9342381E3969E04456C8B4C639A9 FA2C9342381E3969E04456C8B4C639A9_P 1518391139884859416
  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

I see that the QTime is "3003", but I get nothing (0.0) for all the other times.

Do you know what this means?

Thank you in advance.

--
Damien Picard
Re: Solr QTime explanation
Thank you for your advice.

Currently we have 4 Solr nodes with 12 GB of memory (heap); the collections are replicated (4 shards, 1 replica). This query mostly returns a QTime of 4 and takes around 20 ms on the client side to get the result.

We have to handle around 200 simultaneous connections.

Currently I do not load-balance between our 4 nodes; I only send queries to one of them. I will test balancing across the 4 nodes to see what happens.

Thank you.

Regards,

2016-01-19 15:08 GMT+01:00 Shawn Heisey <apa...@elyograg.org>:

> On 1/19/2016 3:43 AM, Damien Picard wrote:
> > I'm currently testing Solr query execution performance (over HTTP and SolrJ). Using HTTP with JMeter, I see that the response time increases with the number of concurrent requests (100 simultaneous requests in my case).
> >
> > To understand where Solr spends more time, I use the debug=timing parameter, and I see this kind of response:
> >
> > 0
> > 3003
> > uuid:FA2C9342381E3969E04456C8B4C639A9 timing true xml
> > false P 3224149 FA2C9342381E3969E04456C8B4C639A9 FA2C9342381E3969E04456C8B4C639A9_P 1518391139884859416
> > 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>
> QTime is not part of the debug output. It is included in all results, whether debug is turned on or not. I am not sure why all your debug information says zero. Usually the total of all the debug timings does not add up to QTime, but this is an extreme imbalance.
>
> QTime is the amount of time spent gathering result information -- the internal Lucene identifiers of the documents counted for numFound. The response does not indicate the amount of time that was spent retrieving the actual search results from the stored fields and sending those results to the client, but with only one document found, those things would be extremely fast.
>
> Running 100 queries concurrently is usually enough to topple *any* single Solr server unless the index is very small. If you need to scale up this far, you will need multiple replicas of your index.
>
> If you are not running a performance benchmark, how long does a single query like this take? If this kind of QTime is normal even when not hammering the server, then you probably don't have enough memory.
>
> Thanks,
> Shawn

--
Damien Picard
Unable to extract images content (OCR) from PDF files using Solr
Hi,

I'm using Solr 5.3.0 on Red Hat EL 7 and I am trying to extract content from PDF, Word, LibreOffice, etc. documents using the ExtractingRequestHandler.

Everything works fine, except when I want to extract content from images embedded in PDF, Word, etc. documents.

I send an extract request like this:

  POST /update/extract?literal.id=ocrpdf8=attr_content=attr_

In attr_content, I get:

  \n \n date 2015-08-28T13:23:03Z \n pdf:PDFVersion 1.4 \n xmp:CreatorTool PDFCreator Version 1.2.3 \n stream_content_type application/pdf \n Keywords \n subject \n dc:creator S050735 \n dcterms:created 2015-08-28T13:23:03Z \n Last-Modified 2015-08-28T13:23:03Z \n dcterms:modified 2015-08-28T13:23:03Z \n dc:format application/pdf; version=1.4 \n Last-Save-Date 2015-08-28T13:23:03Z \n stream_name imagepdf.pdf \n meta:save-date 2015-08-28T13:23:03Z \n pdf:encrypted false \n dc:title imagepdf \n modified 2015-08-28T13:23:03Z \n cp:subject \n Content-Type application/pdf \n stream_size 423660 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n X-Parsed-By org.apache.tika.parser.pdf.PDFParser \n creator S050735 \n meta:author S050735 \n dc:subject \n meta:creation-date 2015-08-28T13:23:03Z \n stream_source_info the-file \n created Fri Aug 28 13:23:03 UTC 2015 \n xmpTPg:NPages 1 \n Creation-Date 2015-08-28T13:23:03Z \n meta:keyword \n Author S050735 \n producer GPL Ghostscript 9.04 \n imagepdf \n \n page \n Page 1 sur 1\n \n 28/08/2015 http://confluence/download/attachments/158471300/image2015-3-3+18%3A10%3A4... \n \n embedded:image0.jpg image0.jpg embedded:image1.jpg image1.jpg embedded:image2.jpg image2.jpg \n

So Tika works fine, but it does not apply OCR content extraction to the embedded images.

When I post an image (JPG) to /update/extract, its content is indexed through Tesseract OCR (attr_content field):

  \n \n stream_size 55422 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n X-Parsed-By org.apache.tika.parser.ocr.TesseractOCRParser \n stream_content_type image/jpeg \n stream_name OM_1.jpg \n stream_source_info the-file \n Content-Type image/jpeg \n \n \n ‘ '\"I“ \" \"' ./\nlrast. Shortly before the classes started I was visiting a.\ncertain public school, a school set in a typically English\ncountryside, which on the June clay of my visit was wonder-\nfully beauliful. The Head Master—-no less typical than his\nschool and the country-side—pointed out the charms of\nboth, and his pride came out in the final remark which he made\nbeforehe left me. He explained that he had a class to take\nin'I'heocritus. Then (with a. buoyant gesture); “ Can you\n\n, conceive anything more delightful than a class in Theocritus,\n\non such a day and in such a place?\"\n\n \n \n \n stream_size 55422 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n X-Parsed-By org.apache.tika.parser.ocr.TesseractOCRParser \n X-Parsed-By org.apache.tika.parser.jpeg.JpegParser \n stream_content_type image/jpeg \n Resolution Units inch \n stream_source_info the-file \n Compression Type Progressive, Huffman \n Data Precision 8 bits \n Number of Components 3 \n tiff:ImageLength 286 \n Component 2 Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert \n Component 1 Y component: Quantization table 0, Sampling factors 2 horiz/2 vert \n Image Height 286 pixels \n X Resolution 72 dots \n Image Width 690 pixels \n stream_name OM_1.jpg \n Component 3 Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert \n tiff:BitsPerSample 8 \n tiff:ImageWidth 690 \n Content-Type image/jpeg \n Y Resolution 72 dots

I saw on the Tika JIRA that I have to enable extractInlineImages in org/apache/tika/parser/pdf/PDFParser.properties to force image extraction from PDF files.

So I did that, and packaged a tika-app-1.7.jar containing a tika-parsers-1.7.jar in which this property is set to true. Then I tested my Tika JAR from the command line:

  # java -jar tika-app-1.7.jar -t /data/docs/imagepdf.pdf

In this case, I get the image content:

  Page 1 sur 1 28/08/2015 http://confluence/download/attachments/158471300/image2015-3-3+18%3A10%3A4. .. Simple Evan! Use Case Sdsedulet

So I replaced solr/contrib/extraction/lib/tika-parsers-1.7.jar with my modified one, but the images in my PDF are still not extracted.

Does anybody know what I'm doing wrong?

Thank you.

--
Damien Picard
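One thing that can be checked independently of Solr is whether a given jar really carries the modified property. The sketch below opens a jar as a zip archive and reads org/apache/tika/parser/pdf/PDFParser.properties from it. The function name read_pdf_parser_property is hypothetical, and the properties parsing is deliberately simplified (one key and one value per line, separated by whitespace, "=" or ":"), so treat it as a quick diagnostic rather than a full Java properties parser.

```python
import re
import zipfile

PROPERTIES_PATH = "org/apache/tika/parser/pdf/PDFParser.properties"


def read_pdf_parser_property(jar_path, key="extractInlineImages"):
    """Return the value of `key` in the jar's PDFParser.properties, or None."""
    with zipfile.ZipFile(jar_path) as jar:
        if PROPERTIES_PATH not in jar.namelist():
            return None
        text = jar.read(PROPERTIES_PATH).decode("utf-8", errors="replace")
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        # Simplified parsing: key, optional "=" or ":", then the value.
        match = re.match(r"([^\s=:]+)\s*[=:]?\s*(.*)", line)
        if match and match.group(1) == key:
            return match.group(2).strip()
    return None
```

Calling read_pdf_parser_property("solr/contrib/extraction/lib/tika-parsers-1.7.jar") on the jar that Solr actually loads should return "true" if the modified file is the one in place. If it does, the property file itself is not the problem and the cause lies elsewhere in how Solr invokes Tika.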