Re: Unable to query the spellchecker in a distributed way

2016-02-18 Thread Damien Picard
I got it: in Solr 4.4, the component
org.apache.solr.handler.component.SpellCheckComponent
doesn't implement the method
distributedProcess(ResponseBuilder rb), which
org.apache.solr.handler.component.SearchHandler needs in order to handle
distributed searches the right way.

And it seems that in 4.10 the SpellCheckComponent still doesn't...

Do you have a workaround for these versions?
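One client-side workaround would be to query every shard directly with distrib=false and merge the suggestion lists yourself. The merge step might look like the minimal sketch below (the SuggestMerge class, the count handling, and the sample values are illustrative, not from this thread):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class SuggestMerge {
    // Merge per-shard suggestion lists (each fetched with distrib=false),
    // de-duplicate them, and truncate to the requested spellcheck.count.
    static List<String> merge(int count, List<List<String>> perShard) {
        Set<String> merged = new TreeSet<String>();
        for (List<String> shard : perShard) {
            merged.addAll(shard);
        }
        List<String> out = new ArrayList<String>(merged);
        return out.subList(0, Math.min(count, out.size()));
    }

    public static void main(String[] args) {
        // Two hypothetical shard responses for the prefix "GSIEBBA"
        List<String> shard1 = Arrays.asList(
                "GSIEBBA64257700010013501", "GSIEBBA70723503779040201");
        List<String> shard2 = Arrays.asList(
                "GSIEBBA64257700010013501", "GSIEBBA71257700030012101");
        System.out.println(merge(20, Arrays.asList(shard1, shard2)));
    }
}
```

Each shard is queried with the suggest handler and distrib=false (not shown), so the per-shard lookup works exactly like the non-distributed case that already returns results.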


2016-01-28 14:20 GMT+01:00 Damien Picard <picard.dam...@gmail.com>:

> (we use Solr 4.4)
>
> 2016-01-28 11:07 GMT+01:00 Damien Picard <picard.dam...@gmail.com>:
>
>> Hi,
>>
>> We are using SolrCloud (4 nodes) and we have defined a suggester using
>> the spellcheck component. [...]
>



-- 
Damien Picard
Expert GWT
<http://www.editions-eni.fr/livres/gwt-google-web-toolkit-developpez-des-applications-internet-riches-ria-en-java/.97a1a26e7d5be94763fc45ac2a1e961a.html>
Mob : 06 11 51 47 78


Re: Unable to query the spellchecker in a distributed way

2016-01-28 Thread Damien Picard
(we use Solr 4.4)

2016-01-28 11:07 GMT+01:00 Damien Picard <picard.dam...@gmail.com>:

> Hi,
>
> We are using SolrCloud (4 nodes) and we have defined a suggester using the
> spellcheck component. [...]
>





Unable to query the spellchecker in a distributed way

2016-01-28 Thread Damien Picard
Hi,

We are using SolrCloud (4 nodes) and we have defined a suggester using the
spellcheck component.

The suggester is defined as:

<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggestOpeGes</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="field">ref_opegestion</str>
    <float name="threshold">0</float>
    <str name="buildOnCommit">true</str>
    <str name="buildOnStartup">true</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">suggestRefCre</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="field">ref_cre</str>
    <float name="threshold">0</float>
    <str name="buildOnCommit">true</str>
    <str name="buildOnStartup">true</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">suggestRefEcr</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="field">ref_ecriture</str>
    <float name="threshold">0</float>
    <str name="buildOnCommit">true</str>
    <str name="buildOnStartup">true</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest"
                class="org.apache.solr.handler.component.SearchHandler"
                startup="lazy">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggestOpeGes</str>
    <str name="spellcheck.count">20</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">false</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

When I query this collection's suggest handler with the shards parameters:
GET
/solr/ppd_piste_audit_gsie_traite_001/suggest?q=GSIEBBA&wt=json&indent=true&spellcheck=true&spellcheck.dictionary=suggestOpeGes&shards.qt=/suggest

I get no results :

{
  "responseHeader":{
"status":0,
"QTime":0}}

But when I disable the distributed search:
GET
/solr/ppd_piste_audit_gsie_traite_001/suggest?q=GSIEMMA&wt=json&indent=true&spellcheck=true&spellcheck.dictionary=suggestOpeGes&distrib=false

I get the results I expect :

{
  "responseHeader":{
"status":0,
"QTime":28},
  "spellcheck":{
"suggestions":[
  "GSIEBBA",{
"numFound":20,
"startOffset":0,
"endOffset":7,
"suggestion":["GSIEMMA44257700010010401",
  "GSIEBBA64257700010013501",
  "GSIEBBA70723503779040201",
  "GSIEBBA71257700030012101",
  "GSIEBBA71723503830023601",
  "GSIEBBA74001300670011701",
  "GSIEBBA74001300670011801",
  "GSIEBBA74772000136021201",
  "GSIEBBA76257700040010501",
  "GSIEBBA76600101133030501",
  "GSIEBBA76680400195030601",
  "GSIEBBA77692100093024401",
  "GSIEBBA77692100093024501",
  "GSIEBBA78450700227020701",
  "GSIEBBA78450700227020801",
  "GSIEBBA78854102439020301",
  "GSIEBBA78854102439020401",
  "GSIEBBA79441700201040401",
  "GSIEBBA79723504720012701",
  "GSIEBBA79763600779010501"]},
  "collation","GSIEBBA44257700010010401"]}}

I also tried to send a "manually" distributed search, without success:

GET
/solr/ppd_piste_audit_gsie_traite_001-03_shard1_replica2/suggest?q=GSIEMMA&qt=/suggest&wt=json&indent=true&spellcheck=true&spellcheck.dictionary=suggestOpeGes&shards.qt=/suggest&shards=dn330003.xxx.priv:8983/solr/ppd_piste_audit_gsie_traite_001-03_shard2_replica1/|dn330004.xxx.priv:8983/solr/ppd_piste_audit_gsie_traite_001-03_shard1_replica1/

What am I doing wrong?

Thank you.


Re: Solr QTime explanation

2016-01-26 Thread Damien Picard
Thank you, you are right! It seems to have been congestion in our test tool.

Regards,

2016-01-19 18:46 GMT+01:00 Toke Eskildsen <t...@statsbiblioteket.dk>:

> Damien Picard <picard.dam...@gmail.com> wrote:
> > Currently we have 4 Solr nodes, with 12Gb memory (heap) ; the collections
> > are replicated (4 shards, 1 replica).
> > This query mostly returns a QTime=4 and it takes around 20ms on the
> client
> > side to get the result.
>
> > We have to handle around 200 simultaneous connections.
>
> You are probably experiencing congestion. JMeter can visualize throughput.
> Try experimenting with 10 to 100 concurrent threads in increments of 10
> threads and look at throughput underway. My guess is that throughput will
> rise as you increase threads, until some point after which it will fall
> again as the Solrs exceeds their peak performance point. You might end up
> getting better performance by rate-limiting outside of SolrCloud.
>
> Also, what does 200 simultaneous connections mean? Is that 200 requests
> per second?
>
> - Toke Eskildsen
>
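The ramp experiment described above can be sketched with a small stand-alone harness (the sleeping task below is a stand-in for a real Solr query, and the thread and request counts are arbitrary):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RampTest {
    // Run `requests` copies of the same task at a fixed concurrency and
    // return the observed throughput in requests per second.
    static double throughput(int threads, int requests) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        final CountDownLatch done = new CountDownLatch(requests);
        long start = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        Thread.sleep(5); // stand-in for one Solr query
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    done.countDown();
                }
            });
        }
        done.await();
        pool.shutdown();
        return requests / ((System.nanoTime() - start) / 1e9);
    }

    public static void main(String[] args) throws InterruptedException {
        // Ramp concurrency in increments and watch where throughput peaks.
        for (int t = 10; t <= 100; t += 10) {
            System.out.printf("threads=%d throughput=%.0f req/s%n",
                    t, throughput(t, 200));
        }
    }
}
```

With a real query in place of the sleep, throughput typically rises with thread count up to the server's peak and then falls, which is the inflection point the ramp is meant to find.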





Solr QTime explanation

2016-01-19 Thread Damien Picard
Hi,

I'm currently testing Solr query execution performance (over HTTP and
SolrJ), and, using HTTP with JMeter, I see that the response time increases
with the number of concurrent requests (100 simultaneous requests in my case).

To understand where Solr takes more time, I use the debug=timing parameter,
and I see this kind of response:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">3003</int>
    <lst name="params">
      <str name="q">uuid:FA2C9342381E3969E04456C8B4C639A9</str>
      <str name="debug">timing</str>
      <str name="indent">true</str>
      <str name="wt">xml</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      false
      P
      3224149
      FA2C9342381E3969E04456C8B4C639A9
      FA2C9342381E3969E04456C8B4C639A9_P
      1518391139884859416
    </doc>
  </result>
  <lst name="debug">
    <lst name="timing">
      <double name="time">0.0</double>
      <lst name="prepare">
        <double name="time">0.0</double>
        <lst name="query"><double name="time">0.0</double></lst>
        <lst name="facet"><double name="time">0.0</double></lst>
        <lst name="mlt"><double name="time">0.0</double></lst>
        <lst name="highlight"><double name="time">0.0</double></lst>
        <lst name="stats"><double name="time">0.0</double></lst>
        <lst name="debug"><double name="time">0.0</double></lst>
      </lst>
      <lst name="process">
        <double name="time">0.0</double>
        <lst name="query"><double name="time">0.0</double></lst>
        <lst name="facet"><double name="time">0.0</double></lst>
        <lst name="mlt"><double name="time">0.0</double></lst>
        <lst name="highlight"><double name="time">0.0</double></lst>
        <lst name="stats"><double name="time">0.0</double></lst>
        <lst name="debug"><double name="time">0.0</double></lst>
      </lst>
    </lst>
  </lst>
</response>

I see that the QTime is "3003", but I get nothing (0.0) for all the other
times. Do you know what this means?

Thank you in advance.



Re: Solr QTime explanation

2016-01-19 Thread Damien Picard
Thank you for your advice.

Currently we have 4 Solr nodes, with 12 GB of memory (heap); the collections
are replicated (4 shards, 1 replica).
This query mostly returns a QTime=4 and it takes around 20ms on the client
side to get the result.

We have to handle around 200 simultaneous connections.

Currently I do not use load balancing between our 4 nodes; I only send
queries to one of them. I will try balancing across the 4 nodes to see
what happens.

Thank you.

Regards,

2016-01-19 15:08 GMT+01:00 Shawn Heisey <apa...@elyograg.org>:

> On 1/19/2016 3:43 AM, Damien Picard wrote:
> > I'm currently testing Solr query execution performance (over http and
> > SolrJ), and, using HTTP with JMeter, I see that the response time
> increases
> > with the number of concurrent request (100 simultaneous request in my
> case).
> >
> > To understand where Solr takes more time, I use the debug=timing
> > parameter. [...]
>
> QTime is not part of the debug.  It is included on all results, whether
> debug is turned on or not.  I am not sure why all your debug information
> says zero.  Usually the total of all the debugs does not add up to
> QTime, but this is an extreme imbalance.
>
> QTime is the amount of time spent gathering result information -- the
> internal Lucene identifiers of the documents counted for numFound.  The
> response does not indicate the amount of time that was spent retrieving
> the actual search results from the stored fields and sending those
> results to the client, but with only one document found, those things
> would be extremely fast.
>
> Running 100 queries concurrently is usually enough to topple *any*
> single Solr server unless the index is very small.  If you need to scale
> up this far, you will need multiple replicas of your index.
>
> If you are not running a performance benchmark, how long does a single
> query like this take?  If this kind of QTime is normal even when not
> hammering the server, then you probably don't have enough memory.
>
> Thanks,
> Shawn
>
>




Unable to extract images content (OCR) from PDF files using Solr

2015-10-22 Thread Damien Picard
Hi,

I'm using Solr 5.3.0 on Red Hat EL 7, and I'm trying to extract content from
PDF, Word, LibreOffice, etc. docs using the ExtractingRequestHandler.

Everything works fine, except when I want to extract content from images
embedded in PDF/Word etc. documents:

I send an extract request like this:
POST
/update/extract?literal.id=ocrpdf8&fmap.content=attr_content&uprefix=attr_

In attr_content, I get :
\n \n date 2015-08-28T13:23:03Z \n
pdf:PDFVersion 1.4 \n
xmp:CreatorTool PDFCreator Version 1.2.3 \n
 stream_content_type application/pdf \n
 Keywords \n
 subject \n
 dc:creator S050735 \n
 dcterms:created 2015-08-28T13:23:03Z \n
 Last-Modified 2015-08-28T13:23:03Z \n
 dcterms:modified 2015-08-28T13:23:03Z \n
 dc:format application/pdf; version=1.4 \n
 Last-Save-Date 2015-08-28T13:23:03Z \n
 stream_name imagepdf.pdf \n
 meta:save-date 2015-08-28T13:23:03Z \n
 pdf:encrypted false \n
 dc:title imagepdf \n
 modified 2015-08-28T13:23:03Z \n
 cp:subject \n
 Content-Type application/pdf \n
 stream_size 423660 \n
 X-Parsed-By org.apache.tika.parser.DefaultParser \n
 X-Parsed-By org.apache.tika.parser.pdf.PDFParser \n
 creator S050735 \n
 meta:author S050735 \n
 dc:subject \n
 meta:creation-date 2015-08-28T13:23:03Z \n
 stream_source_info the-file \n
 created Fri Aug 28 13:23:03 UTC 2015 \n
 xmpTPg:NPages 1 \n
 Creation-Date 2015-08-28T13:23:03Z \n
 meta:keyword \n
 Author S050735 \n
 producer GPL Ghostscript 9.04 \n
 imagepdf \n
 \n
 page \n
 Page 1 sur 1\n \n
 28/08/2015
http://confluence/download/attachments/158471300/image2015-3-3+18%3A10%3A4...
\n \n embedded:image0.jpg image0.jpg embedded:image1.jpg image1.jpg
embedded:image2.jpg image2.jpg \n

So, Tika works fine, but it doesn't apply OCR content extraction to the
embedded images.

When I post an image (JPG) to /update/extract, I get its content indexed
through Tesseract OCR in the attr_content field:
\n \n stream_size 55422 \n
 X-Parsed-By org.apache.tika.parser.DefaultParser \n
 X-Parsed-By org.apache.tika.parser.ocr.TesseractOCRParser \n
 stream_content_type image/jpeg \n
 stream_name OM_1.jpg \n
 stream_source_info the-file \n
 Content-Type image/jpeg \n \n \n
 ‘ '\"I“ \" \"' ./\nlrast. Shortly before the classes started I was
visiting a.\ncertain public school, a school set in a typically
English\ncountryside, which on the June clay of my visit was wonder-\nfully
beauliful. The Head Master—-no less typical than his\nschool and the
country-side—pointed out the charms of\nboth, and his pride came out in the
final remark which he made\nbeforehe left me. He explained that he had a
class to take\nin'I'heocritus. Then (with a. buoyant gesture); “ Can
you\n\n, conceive anything more delightful than a class in
Theocritus,\n\non such a day and in such a place?\"\n\n \n \n \n
stream_size 55422 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n
X-Parsed-By org.apache.tika.parser.ocr.TesseractOCRParser \n X-Parsed-By
org.apache.tika.parser.jpeg.JpegParser \n stream_content_type image/jpeg \n
Resolution Units inch \n stream_source_info the-file \n Compression Type
Progressive, Huffman \n Data Precision 8 bits \n Number of Components 3 \n
tiff:ImageLength 286 \n Component 2 Cb component: Quantization table 1,
Sampling factors 1 horiz/1 vert \n Component 1 Y component: Quantization
table 0, Sampling factors 2 horiz/2 vert \n Image Height 286 pixels \n X
Resolution 72 dots \n Image Width 690 pixels \n stream_name OM_1.jpg \n
Component 3 Cr component: Quantization table 1, Sampling factors 1 horiz/1
vert \n tiff:BitsPerSample 8 \n tiff:ImageWidth 690 \n Content-Type
image/jpeg \n Y Resolution 72 dots

I saw in the Tika JIRA that I have to enable extractInlineImages in
org/apache/tika/parser/pdf/PDFParser.properties to force image extraction on
PDFs. So I did that, and I packaged a tika-app-1.7.jar containing the
tika-parsers-1.7.jar with this file modified to set the property to true.
Then I tested my Tika JAR using the CLI:

# java -jar tika-app-1.7.jar -t /data/docs/imagepdf.pdf

In this case, I get the images content :


Page 1 sur 1

28/08/2015
http://confluence/download/attachments/158471300/image2015-3-3+18%3A10%3A4.
..

Simple Evan!
Use Case
Sdsedulet

So I replaced the solr/contrib/extraction/lib/tika-parsers-1.7.jar with my
modified one, but the images are still not extracted from my PDF.
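For reference, the property flip itself amounts to the sketch below. The key name comes from the Tika JIRA; the one-line file content is a stand-in for the real PDFParser.properties pulled out of the jar:

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class FlipInlineImages {
    public static void main(String[] args) throws IOException {
        Properties p = new Properties();
        // In the real workflow this is loaded from
        // org/apache/tika/parser/pdf/PDFParser.properties inside the jar;
        // a one-line stand-in is used here.
        p.load(new StringReader("extractInlineImages false\n"));
        p.setProperty("extractInlineImages", "true");
        // Write the modified copy, to be folded back into the jar
        // (e.g. with "jar uf tika-parsers-1.7.jar org/...").
        FileWriter out = new FileWriter("PDFParser.properties");
        p.store(out, null);
        out.close();
        System.out.println(p.getProperty("extractInlineImages"));
    }
}
```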

Does anybody know what I'm doing wrong?

Thank you.
