Re: mm being ignored by edismax

2016-10-06 Thread Alexandre Rafalovitch
I think it is the change in the OR and AND treatment that has been
confusing a number of people. There were discussions before on the
mailing list about it, for example
http://search-lucene.com/m/eHNlzBMAHdfxcv1
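
For illustration (a guess at the cause, not from the linked thread): if the
effective default operator is AND (q.op=AND, or a defaultOperator of AND in
the schema), newer edismax makes every clause mandatory and the mm value no
longer applies. Forcing OR alongside mm, e.g.

"params":{
  "mm":"3",
  "q.op":"OR",
  "q":"string1 string2 string3 string4 string5",
  "qf":"vehicle_string_t^1",
  "defType":"edismax"}

should bring the old ~3 behavior back.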

Regards,
   Alex.

Solr Example reading group is starting November 2016, join us at
http://j.mp/SolrERG
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 7 October 2016 at 10:24, Nick Hall  wrote:
> Hello,
>
> I'm working on upgrading a Solr installation from 4.0 to 6.2.1 and have
> everything mostly working but have hit a snag. I kept the schema basically
> the same, just made some minor changes to allow it to work with the new
> version, but one of my queries is working differently with the new version
> and I'm not sure why.
>
> In version 4.0 when I do a query with edismax like:
>
> "params":{
>   "mm":"3",
>   "debugQuery":"on",
>   "indent":"on",
>   "q":"string1 string2 string3 string4 string5",
>   "qf":"vehicle_string_t^1",
>   "wt":"json",
>   "defType":"edismax"}},
>
> I get the results I expect, and the debugQuery shows:
>
> "rawquerystring":"string1 string2 string3 string4 string5",
> "querystring":"string1 string2 string3 string4 string5",
> "parsedquery":"+((DisjunctionMaxQuery((vehicle_string_t:\"string 1\"))
> DisjunctionMaxQuery((vehicle_string_t:\"string 2\"))
> DisjunctionMaxQuery((vehicle_string_t:\"string 3\"))
> DisjunctionMaxQuery((vehicle_string_t:\"string 4\"))
> DisjunctionMaxQuery((vehicle_string_t:\"string 5\")))~3)",
> "parsedquery_toString":"+(((vehicle_string_t:\"string 1\")
> (vehicle_string_t:\"string 2\") (vehicle_string_t:\"string 3\")
> (vehicle_string_t:\"string 4\") (vehicle_string_t:\"string 5\"))~3)",
>
>
> But when I run the same query with version 6.2.1, debugQuery shows:
>
> "rawquerystring":"string1 string2 string3 string4 string5",
> "querystring":"string1 string2 string3 string4 string5",
> "parsedquery":"(+(+DisjunctionMaxQuery((vehicle_string_t:\"string 1\"))
> +DisjunctionMaxQuery((vehicle_string_t:\"string 2\"))
> +DisjunctionMaxQuery((vehicle_string_t:\"string 3\"))
> +DisjunctionMaxQuery((vehicle_string_t:\"string 4\"))
> +DisjunctionMaxQuery((vehicle_string_t:\"string 5\"/no_coord",
> "parsedquery_toString":"+(+(vehicle_string_t:\"string 1\")
> +(vehicle_string_t:\"string 2\") +(vehicle_string_t:\"string 3\")
> +(vehicle_string_t:\"string 4\") +(vehicle_string_t:\"string 5\"))",
>
>
> You can see that the key difference is that in version 4 it uses the "~3"
> to indicate the mm, but in 6.2.1 it doesn't matter what I have mm set to,
> it always ends with "/no_coord" and is trying to match all 5 strings even
> if mm is set to 1, so mm is being completely ignored.
>
> I imagine there is some behavior that changed between 4 and 6.2.1 that I
> need to adjust something in my configuration to account for, but I'm
> scratching my head right now. Has anyone else seen this and can point me in
> the right direction? Thanks,
>
> Nick


RE: Migrating to Solr 6.1.0 from 5.5.0

2016-10-06 Thread M, Arjun (Nokia - IN/Bangalore)
Thanks, David. I found the solution; the relevant information is below.


“Solr supports polygons via JTS Topology Suite, which does not come with 
Solr. It's a JAR file that you need to put on Solr's classpath (but not via the 
standard solrconfig.xml mechanisms). If you intend to use those shapes, set 
this attribute to 
org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory. (note: prior 
to Solr 6, the "org.locationtech.spatial4j" part was "com.spatial4j.core")”


More info in this link : 
https://cwiki.apache.org/confluence/display/solr/Spatial+Search
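
For example, in schema.xml (a sketch; the type name and the tuning
attributes are illustrative, while the class and spatialContextFactory
values are the ones from the docs quoted above):

<fieldType name="location_rpt"
           class="solr.SpatialRecursivePrefixTreeFieldType"
           spatialContextFactory="org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory"
           geo="true" distErrPct="0.025" maxDistErr="0.001"
           distanceUnits="kilometers"/>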

--Arjun





-Original Message-
From: David Smiley [mailto:david.w.smi...@gmail.com]
Sent: Thursday, September 29, 2016 8:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Migrating to Solr 6.1.0 from 5.5.0



Arjun,

Your input is a POLYGON -- as seen in the error message.  The "Try JTS" was
hopefully a clue -- on
https://cwiki.apache.org/confluence/display/solr/Spatial+Search search for
"JTS" and you should see how to set the spatialContextFactory to JTS, and a
mention of needing the JTS jar.  I'll try and add a bit more info suggesting
exactly where to put it and a download link.  I'll also mention a shortcut
so you don't have to type out the classname -- a recent feature in 6.2.

Since you said you were upgrading... presumably your spatialContextFactory
attribute was already set for this to work at all in 5.5?  The package
reference changed for this value -- I imagine you would have seen a
warning/error to this effect in Solr's logs.  Do you?

~ David

On Tue, Sep 27, 2016 at 10:29 AM William Bell wrote:

> the documentation is not good on this. Not sure how to fix it either.
>
> On Tue, Sep 27, 2016 at 3:41 AM, M, Arjun (Nokia - IN/Bangalore) <
> arju...@nokia.com> wrote:
>
> > Hi,
> >
> > We are getting the below errors when migrating Solr from 5.5.0 to
> > 6.1.0. Could anyone help in resolving the issue, if you have come across
> > this?
> >
> > org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> > Error from server at http://127.0.0.1:41569/solr/collection1: Unable to
> > parse shape given formats "lat,lon", "x y" or as WKT because
> > java.text.ParseException: java.lang.UnsupportedOperationException:
> > Unsupported shape of this SpatialContext. Try JTS or Geo3D. input:
> > POLYGON((-10 30, -40 40, -10 -20, 0 0, -10 30))
> >
> > Thanks in advance..
> >
> > Thanks & Regards,
> >    Arjun M
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076
>
--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


mm being ignored by edismax

2016-10-06 Thread Nick Hall
Hello,

I'm working on upgrading a Solr installation from 4.0 to 6.2.1 and have
everything mostly working but have hit a snag. I kept the schema basically
the same, just made some minor changes to allow it to work with the new
version, but one of my queries is working differently with the new version
and I'm not sure why.

In version 4.0 when I do a query with edismax like:

"params":{
  "mm":"3",
  "debugQuery":"on",
  "indent":"on",
  "q":"string1 string2 string3 string4 string5",
  "qf":"vehicle_string_t^1",
  "wt":"json",
  "defType":"edismax"}},

I get the results I expect, and the debugQuery shows:

"rawquerystring":"string1 string2 string3 string4 string5",
"querystring":"string1 string2 string3 string4 string5",
"parsedquery":"+((DisjunctionMaxQuery((vehicle_string_t:\"string 1\"))
DisjunctionMaxQuery((vehicle_string_t:\"string 2\"))
DisjunctionMaxQuery((vehicle_string_t:\"string 3\"))
DisjunctionMaxQuery((vehicle_string_t:\"string 4\"))
DisjunctionMaxQuery((vehicle_string_t:\"string 5\")))~3)",
"parsedquery_toString":"+(((vehicle_string_t:\"string 1\")
(vehicle_string_t:\"string 2\") (vehicle_string_t:\"string 3\")
(vehicle_string_t:\"string 4\") (vehicle_string_t:\"string 5\"))~3)",


But when I run the same query with version 6.2.1, debugQuery shows:

"rawquerystring":"string1 string2 string3 string4 string5",
"querystring":"string1 string2 string3 string4 string5",
"parsedquery":"(+(+DisjunctionMaxQuery((vehicle_string_t:\"string 1\"))
+DisjunctionMaxQuery((vehicle_string_t:\"string 2\"))
+DisjunctionMaxQuery((vehicle_string_t:\"string 3\"))
+DisjunctionMaxQuery((vehicle_string_t:\"string 4\"))
+DisjunctionMaxQuery((vehicle_string_t:\"string 5\"/no_coord",
"parsedquery_toString":"+(+(vehicle_string_t:\"string 1\")
+(vehicle_string_t:\"string 2\") +(vehicle_string_t:\"string 3\")
+(vehicle_string_t:\"string 4\") +(vehicle_string_t:\"string 5\"))",


You can see that the key difference is that in version 4 it uses the "~3"
to indicate the mm, but in 6.2.1 it doesn't matter what I have mm set to,
it always ends with "/no_coord" and is trying to match all 5 strings even
if mm is set to 1, so mm is being completely ignored.

I imagine there is some behavior that changed between 4 and 6.2.1 that I
need to adjust something in my configuration to account for, but I'm
scratching my head right now. Has anyone else seen this and can point me in
the right direction? Thanks,

Nick


Re: solr 5 leaving tomcat, will I be the only one fearing about this?

2016-10-06 Thread Alexandre Rafalovitch
Treat Solr as a blackbox standalone database. Your MySQL is running
standalone, right?

And try to go to Solr 6, if you can. 5 is not the latest anymore, and there
have been lots of scaling improvements in 6.

Regards,
Alex

On 7 Oct 2016 5:02 AM, "Renee Sun"  wrote:

> need some general advice please...
>
> our infra is built with multiple webapps on tomcat ... the scale layer is
> achieved on top of those webapps which work hand-in-hand with solr admin
> APIs / shard queries / commit or optimize / core management etc etc.
>
> While I have not had a chance to actually play with solr 5 yet, just by
> imagination, we will be facing some huge changes in our infra to be able to
> upgrade to solr 5, yes?
>
> Thanks
> Renee
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/solr-5-leaving-tomcat-will-I-be-the-only-one-
> fearing-about-this-tp4300065.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: [Solr-5-4-1] Why SolrCloud leader is putting all replicas in recovery at the same time ?

2016-10-06 Thread Pushkar Raste
A couple of questions/suggestions:

- This normally happens after a leader election: when a new leader gets
elected, it will force all the nodes to sync with itself.
Check the logs to see if the leader changed when this happens. If it did,
you will have to investigate why the leader change takes place.
I suspect the leader goes into a GC pause long enough that ZooKeeper
decides the leader is no longer available and initiates a leader election.

- What version of Solr are you using? SOLR-8586 introduced the
IndexFingerprint check; unfortunately it was broken, and hence a replica
would always do a full index replication. The issue is now fixed in
SOLR-9310, which should help replicas recover faster.

- You should also increase the ulog size (the default threshold is 100 docs
or 10 tlogs, whichever is hit first). This will again help replicas recover
faster from tlogs (of course, there is a threshold after which recovering
from the tlog would in fact take longer than copying over all the index
files from the leader).
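
For example, in solrconfig.xml (a sketch -- tune the numbers to your update
rate; numRecordsToKeep and maxNumLogsToKeep are the settings behind the
100-docs/10-tlogs defaults):

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numRecordsToKeep">500</int>
  <int name="maxNumLogsToKeep">20</int>
</updateLog>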


On Thu, Oct 6, 2016 at 5:23 AM, Gerald Reinhart 
wrote:

>
> Hello everyone,
>
> Our Solr Cloud  works very well for several months without any
> significant changes: the traffic to serve is stable, no major release
> deployed...
>
> But randomly, the Solr Cloud leader puts all the replicas in recovery
> at the same time for no obvious reason.
>
> Hence, we can not serve the queries any more and the leader is
> overloaded while replicating all the indexes on the replicas at the same
> time which eventually implies a downtime of approximately 30 minutes.
>
> Is there a way to prevent it ? Ideally, a configuration saying a
> percentage of replicas to be put in recovery at the same time?
>
> Thanks,
>
> Gérald, Elodie and Ludovic
>
>
> --
> [image: Kelkoo]
>
> *Gérald Reinhart *Software Engineer
>
> *E* gerald.reinh...@kelkoo.com    *Y!Messenger* gerald.reinhart
> *T* +33 (0)4 56 09 07 41
> *A* Parc Sud Galaxie - Le Calypso, 6 rue des Méridiens, 38130 Echirolles
>
>
>
> --
> Kelkoo SAS
> Société par Actions Simplifiée
> Au capital de € 4.168.964,30
> Siège social : 158 Ter Rue du Temple 75003 Paris
> 425 093 069 RCS Paris
>
> This message and its attachments are confidential and intended exclusively
> for their addressees. If you are not the intended recipient of this
> message, please delete it and notify the sender.
>


Re: newSearcher autowarming queries in solrconfig.xml run but does not appear to warm cache

2016-10-06 Thread Dalton Gooding
Erick,
Thanks for the response. After I run the initial query and get a long
response time, if I change the query to remove or add query statements, I
find the speed is good.
If I run the modified query after a new searcher has registered, the
response is slow, but after the modified query has completed, the warming
query sent from curl is much faster. I assume this is because the document
cache has been updated with the documents from the modified query. A large
number of our queries work with the same document set, so I am trying to get
a warming query to populate the document cache to be as big as feasible.
Should the firstSearcher and newSearcher warm the document cache?

On Friday, 7 October 2016, 9:31, Erick Erickson  
wrote:
 

 Submitting the exact same query twice will return results from the
queryResultCache. I'm not entirely
sure that the firstSearcher events get put into the cache.

So if you change the query even slightly, my guess is that you'll see
response times very close to your
original ones of over a second.

Best,
Erick

On Thu, Oct 6, 2016 at 2:56 PM, Dalton Gooding
 wrote:
> After setting a number of newSearcher and firstSearcher queries, I can see in 
> the console logs that the queries are run, but when I run the same query 
> against the new searcher (using curl), I get a slow response time for the 
> first run.
>
> Config:
>   <listener event="newSearcher" class="solr.QuerySenderListener">
>     <arr name="queries">
>       <lst>
>         <str name="fq">DataType_s:Product</str>
>         <str name="fq">WebSections_ms:house</str>
>         <str name="fq">VisibleOnline_ms:7</str>
>         <str name="fq">{!tag=current_group}GroupIds_ms:*</str>
>         <str name="facet">true</str>
>         <str name="facet.field">BrandID_s</str>
>         <str name="facet.query">Price_2_f:[* TO *]</str>
>         <str name="facet.query">Price_3_f:[* TO *]</str>
>         <str name="facet.query">Price_4_f:[* TO *]</str>
>         <str name="facet.query">Price_5_f:[* TO *]</str>
>         <str name="facet.query">Price_6_f:[* TO *]</str>
>         <str name="facet.query">Price_7_f:[* TO *]</str>
>         <str name="facet.query">Price_8_f:[* TO *]</str>
>         <str name="facet.mincount">1</str>
>         <str name="facet.method">fc</str>
>         <str name="wt">json</str>
>         <str name="json.nl">map</str>
>         <str name="q">(title:* OR text:*)</str>
>         <str name="start">0</str>
>         <str name="rows">20</str>
>       </lst>
>     </arr>
>   </listener>
>
> Console log:
> INFO  (searcherExecutor-7-thread-1-processing-x:core1) [  x:core1] 
> o.a.s.c.S.Request [core1] webapp=null path=null 
> params={facet=true&facet.mincount=1&start=0&facet.query=Price_2_f:[*+TO+*]&facet.query=Price_3_f:[*+TO+*]&facet.query=Price_4_f:[*+TO+*]&facet.query=Price_5_f:[*+TO+*]&facet.query=Price_6_f:[*+TO+*]&facet.query=Price_7_f:[*+TO+*]&facet.query=Price_8_f:[*+TO+*]&event=newSearcher&q=(title:*+OR+text:*)&distrib=false&json.nl=map&facet.field=BrandID_s&wt=json&facet.method=fc&fq=DataType_s:Product&fq=WebSections_ms:house&fq=VisibleOnline_ms:7&fq={!tag%3Dcurrent_group}GroupIds_ms:*&rows=20}
>  hits=2549 status=0 QTime=1263
>
>
> If I run the same query after the index has registered I see a QTime of over 
> a second, the second time I run the query I see around 80ms. This leads me to 
> believe the warming did not occur or the query was not committed to cache on 
> start up of the new searcher.
> Can someone please advise on how to use the newSearcher queries to 
> effectively warm Solr caches? Should I see an improved response for the first 
> time I run the query if the same query has been used as a newSearcher query?
> Cheers,
> Dalton

   

Count on Multivalued field using facet

2016-10-06 Thread Aswath Srinivasan (TMS)
Hello,

I have a result set something like the one below, from the query below. The
facet count for the LINE field is 1 (1); that is, the value "1" has a bucket
count of 1, since each document is only counted once per value.

However, I need to count the number of occurrences of each of the values in
the LINE field. Is there a way to do this?

Expecting something like: LINE 1 (10), 2 (2)

http://localhost:8983/solr/collection/select?facet.field=line&facet=on&fq=id:123456789&indent=on&q=*:*&wt=json

{
  "responseHeader":{
"status":0,
"QTime":25,
"params":{
  "q":"*:*",
  "facet.field":"line",
  "indent":"on",
  "fq":"id:123456789",
  "facet":"on",
  "wt":"json",
  "_":"1475711557126"}},
  "response":{"numFound":1,"start":0,"docs":[
  {
"id":"123456789",
" name":["abc"],
"year":["2016"],
"idno":[6009250200],
"issue":["Paint",
  "zTest",
  "zTest",
  "Paint",
  "zTest",
  "zTest",
  "zTest",
  "Paint",
  "Paint",
  "Paint",
  "Paint",
  "Paint"],
"line":["1",
  "1",
  "1",
  "2",
  "1",
  "1",
  "1",
  "1",
  "2",
  "1",
  "1",
  "1"],
"_version_":1547467907197304832}]
  },
  "facet_counts":{
"facet_queries":{},
"facet_fields":{
  "line":[
"1",1]
"2",1]
},
"facet_ranges":{},
"facet_intervals":{},
"facet_heatmaps":{}}}

Thank you,
Aswath NS



Re: Streaming api and multiValued fields

2016-10-06 Thread Joel Bernstein
Currently the joins in the Streaming API don't support joining on
multi-value fields. It will be difficult to support merge joins on
multi-value fields but hash joins would be possible in the future. Also the
gatherNodes graph expression will support multi-value fields in the future.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Oct 6, 2016 at 4:52 PM, chriseldredge 
wrote:

> Is there any documentation on the support (or lack thereof) for using join,
> hashJoin and other operations to combine streams on multiValued fields?
>
> I have a core with posts that can be written about multiple companies, and
> another core with info about those companies:
>
> {
>   'id': 'post-1234',
>   'body': 'the body text about IBM and Oracle',
>   'company_ids': [21, 43]
> }
>
> {
>   'id': 'company-21',
>   'company_id': 21,
>   'name': 'IBM',
> }
>
> {
>   'id': 'company-43',
>   'company_id': 43,
>   'name': 'Oracle'
> }
>
> I want to join posts to companies.
>
> As an aside, this might not be possible yet before SOLR-8395 is completed
> (query-time join (with scoring) for single value numeric fields) but I
> thought I'd ask about multiValued fields anyway.
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Streaming-api-and-multiValued-fields-tp4300058.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: newSearcher autowarming queries in solrconfig.xml run but does not appear to warm cache

2016-10-06 Thread Erick Erickson
Submitting the exact same query twice will return results from the
queryResultCache. I'm not entirely
sure that the firstSearcher events get put into the cache.

So if you change the query even slightly, my guess is that you'll see
response times very close to your
original ones of over a second.

Best,
Erick

On Thu, Oct 6, 2016 at 2:56 PM, Dalton Gooding
 wrote:
> After setting a number of newSearcher and firstSearcher queries, I can see in 
> the console logs that the queries are run, but when I run the same query 
> against the new searcher (using curl), I get a slow response time for the 
> first run.
>
> Config:
>   <listener event="newSearcher" class="solr.QuerySenderListener">
>     <arr name="queries">
>       <lst>
>         <str name="fq">DataType_s:Product</str>
>         <str name="fq">WebSections_ms:house</str>
>         <str name="fq">VisibleOnline_ms:7</str>
>         <str name="fq">{!tag=current_group}GroupIds_ms:*</str>
>         <str name="facet">true</str>
>         <str name="facet.field">BrandID_s</str>
>         <str name="facet.query">Price_2_f:[* TO *]</str>
>         <str name="facet.query">Price_3_f:[* TO *]</str>
>         <str name="facet.query">Price_4_f:[* TO *]</str>
>         <str name="facet.query">Price_5_f:[* TO *]</str>
>         <str name="facet.query">Price_6_f:[* TO *]</str>
>         <str name="facet.query">Price_7_f:[* TO *]</str>
>         <str name="facet.query">Price_8_f:[* TO *]</str>
>         <str name="facet.mincount">1</str>
>         <str name="facet.method">fc</str>
>         <str name="wt">json</str>
>         <str name="json.nl">map</str>
>         <str name="q">(title:* OR text:*)</str>
>         <str name="start">0</str>
>         <str name="rows">20</str>
>       </lst>
>     </arr>
>   </listener>
>
> Console log:
> INFO  (searcherExecutor-7-thread-1-processing-x:core1) [   x:core1] 
> o.a.s.c.S.Request [core1] webapp=null path=null 
> params={facet=true&facet.mincount=1&start=0&facet.query=Price_2_f:[*+TO+*]&facet.query=Price_3_f:[*+TO+*]&facet.query=Price_4_f:[*+TO+*]&facet.query=Price_5_f:[*+TO+*]&facet.query=Price_6_f:[*+TO+*]&facet.query=Price_7_f:[*+TO+*]&facet.query=Price_8_f:[*+TO+*]&event=newSearcher&q=(title:*+OR+text:*)&distrib=false&json.nl=map&facet.field=BrandID_s&wt=json&facet.method=fc&fq=DataType_s:Product&fq=WebSections_ms:house&fq=VisibleOnline_ms:7&fq={!tag%3Dcurrent_group}GroupIds_ms:*&rows=20}
>  hits=2549 status=0 QTime=1263
>
>
> If I run the same query after the index has registered I see a QTime of over 
> a second, the second time I run the query I see around 80ms. This leads me to 
> believe the warming did not occur or the query was not committed to cache on 
> start up of the new searcher.
> Can someone please advise on how to use the newSearcher queries to 
> effectively warm Solr caches? Should I see an improved response for the first 
> time I run the query if the same query has been used as a newSearcher query?
> Cheers,
> Dalton


solr 5 leaving tomcat, will I be the only one fearing about this?

2016-10-06 Thread Renee Sun
need some general advice please...

our infra is built with multiple webapps on tomcat ... the scale layer is
achieved on top of those webapps which work hand-in-hand with solr admin
APIs / shard queries / commit or optimize / core management etc etc.

While I have not had a chance to actually play with solr 5 yet, just by
imagination, we will be facing some huge changes in our infra to be able to
upgrade to solr 5, yes?

Thanks
Renee



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-5-leaving-tomcat-will-I-be-the-only-one-fearing-about-this-tp4300065.html
Sent from the Solr - User mailing list archive at Nabble.com.


newSearcher autowarming queries in solrconfig.xml run but does not appear to warm cache

2016-10-06 Thread Dalton Gooding
After setting a number of newSearcher and firstSearcher queries, I can see in 
the console logs that the queries are run, but when I run the same query 
against the new searcher (using curl), I get a slow response time for the first 
run. 

Config:
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="fq">DataType_s:Product</str>
        <str name="fq">WebSections_ms:house</str>
        <str name="fq">VisibleOnline_ms:7</str>
        <str name="fq">{!tag=current_group}GroupIds_ms:*</str>
        <str name="facet">true</str>
        <str name="facet.field">BrandID_s</str>
        <str name="facet.query">Price_2_f:[* TO *]</str>
        <str name="facet.query">Price_3_f:[* TO *]</str>
        <str name="facet.query">Price_4_f:[* TO *]</str>
        <str name="facet.query">Price_5_f:[* TO *]</str>
        <str name="facet.query">Price_6_f:[* TO *]</str>
        <str name="facet.query">Price_7_f:[* TO *]</str>
        <str name="facet.query">Price_8_f:[* TO *]</str>
        <str name="facet.mincount">1</str>
        <str name="facet.method">fc</str>
        <str name="wt">json</str>
        <str name="json.nl">map</str>
        <str name="q">(title:* OR text:*)</str>
        <str name="start">0</str>
        <str name="rows">20</str>
      </lst>
    </arr>
  </listener>

Console log:
INFO  (searcherExecutor-7-thread-1-processing-x:core1) [   x:core1] 
o.a.s.c.S.Request [core1] webapp=null path=null 
params={facet=true&facet.mincount=1&start=0&facet.query=Price_2_f:[*+TO+*]&facet.query=Price_3_f:[*+TO+*]&facet.query=Price_4_f:[*+TO+*]&facet.query=Price_5_f:[*+TO+*]&facet.query=Price_6_f:[*+TO+*]&facet.query=Price_7_f:[*+TO+*]&facet.query=Price_8_f:[*+TO+*]&event=newSearcher&q=(title:*+OR+text:*)&distrib=false&json.nl=map&facet.field=BrandID_s&wt=json&facet.method=fc&fq=DataType_s:Product&fq=WebSections_ms:house&fq=VisibleOnline_ms:7&fq={!tag%3Dcurrent_group}GroupIds_ms:*&rows=20}
 hits=2549 status=0 QTime=1263


If I run the same query after the index has registered, I see a QTime of over a 
second; the second time I run the query I see around 80ms. This leads me to 
believe the warming did not occur or the query was not committed to cache on 
start up of the new searcher.
Can someone please advise on how to use the newSearcher queries to effectively 
warm Solr caches? Should I see an improved response for the first time I run 
the query if the same query has been used as a newSearcher query?
Cheers,
Dalton

Re: Queries to help warm up (mmap)

2016-10-06 Thread Pushkar Raste
One of the tricks I have read about is to cat all the files in the index
directory so the OS will have them in its disk cache.
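
Something like this (a sketch; adjust the path to your core's index
directory):

cat /var/solr/data/mycore/data/index/* > /dev/null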

On Thu, Oct 6, 2016 at 11:55 AM, Rallavagu  wrote:

> Looking for clues/recommendations to help warm up during startup. Not
> necessarily Solr caches but mmap as well. I have used something like "q=<field name>:[* TO *]" for various fields and it seems to help with mmap
> population around 40-50%. Is there anything else that could help achieve
> 90% or more? Thanks.
>


Streaming api and multiValued fields

2016-10-06 Thread chriseldredge
Is there any documentation on the support (or lack thereof) for using join,
hashJoin and other operations to combine streams on multiValued fields?

I have a core with posts that can be written about multiple companies, and
another core with info about those companies:

{
  'id': 'post-1234',
  'body': 'the body text about IBM and Oracle',
  'company_ids': [21, 43]
}

{
  'id': 'company-21',
  'company_id': 21,
  'name': 'IBM',
}

{
  'id': 'company-43',
  'company_id': 43,
  'name': 'Oracle'
}

I want to join posts to companies.

As an aside, this might not be possible yet before SOLR-8395 is completed
(query-time join (with scoring) for single value numeric fields) but I
thought I'd ask about multiValued fields anyway.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Streaming-api-and-multiValued-fields-tp4300058.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Average of Averages in Solr

2016-10-06 Thread Susheel Kumar
Please look into streaming expressions.  I think that is what you are
looking for.
https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
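
For example, the per-code averages can be pulled as one stream (a sketch:
the collection and field names are assumed, and qt="/export" requires
docValues on the code and amount fields). The average of those averages is
then one cheap pass over the returned tuples:

rollup(
  search(mycollection,
         q="*:*",
         fl="code,amount",
         sort="code asc",
         qt="/export"),
  over="code",
  avg(amount))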

Thanks,
Susheel



On Thu, Oct 6, 2016 at 11:56 AM, John Bickerstaff 
wrote:

> This may help?  Note the "Bloomberg Analytics" at the bottom of the post...
>
> https://dzone.com/articles/solr-not-just-for-text-anymore
>
> Quote from article:
>
>
>- *Bloomberg Analytics Component for Solr*: Bloomberg Financial Services
>uses Solr extensively, and found the existing statistical packages
> woefully
>lacking. So, they developed a high-performance framework that can
> perform
>complex calculations and aggregations on time-series data, and then
>released it to OpenSource.
>
>
> On Thu, Oct 6, 2016 at 8:53 AM, Shawn Heisey  wrote:
>
> > On 10/6/2016 12:04 AM, Lewin Joy (TMS) wrote:
> > > There is a requirement to take an average on "Amount" field against
> > > each "code" field. And then calculate the averages on this averages.
> > > Since my "code" field has a very huge cardinality, which could be
> > > around 200,000 or even in millions ; It gets highly complex to
> > > calculate the average of averages through Java. Even Solr takes a huge
> > > time listing the averages. And the JSON response size becomes huge. Is
> > > there some way we can tackle this? Any way to do stats on stats?
> >
> > I wasn't sure what you meant with the first sentence I quoted above, but
> > in order to get statistics from your index that are relevant for the
> > results of a query, you probably want the stats component.
> >
> > https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: Problem with Password Decryption in Data Import Handler

2016-10-06 Thread Jamie Jackson
It happens to be ten characters.

On Thu, Oct 6, 2016 at 12:44 PM, Alexandre Rafalovitch 
wrote:

> How long is the encryption key (file content)? Because the code I am
> looking at seems to expect it to be at most 100 characters.
>
> Regards,
>Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 6 October 2016 at 23:26, Kevin Risden  wrote:
> > I haven't tried this but is it possible there is a new line at the end in
> > the file?
> >
> > If you did something like echo "" > file.txt then there would be a new
> > line. Use echo -n "" > file.txt
> >
> > Also you should be able to check how many characters are in the file.
> >
> > Kevin Risden
> >
> > On Wed, Oct 5, 2016 at 5:00 PM, Jamie Jackson 
> wrote:
> >
> >> Hi Folks,
> >>
> >> (Using Solr 5.5.3.)
> >>
> >> As far as I know, the only place where encrypted password use is
> documented
> >> is in
> >> https://cwiki.apache.org/confluence/display/solr/
> >> Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler,
> >> under the "Configuring the DIH Configuration File", in a comment in the
> >> sample XML file:
> >>
> >> 
> >>
> >> Anyway, I can encrypt just fine:
> >>
> >> $ openssl enc -aes-128-cbc -a -salt -in stgps.txt
> >> enter aes-128-cbc encryption password:
> >> Verifying - enter aes-128-cbc encryption password:
> >> U2FsdGVkX1+VtVoQtmEREvB5qZjn3131+N4jRXmjyIY=
> >>
> >>
> >> I can also decrypt just fine from the command line.
> >>
> >> However, if I use the encrypted password and encryptKeyFile in the
> config
> >> file, I end up with an error: "String length must be a multiple of
> four."
> >>
> >> https://gist.github.com/jamiejackson/3852dacb03432328ea187d43ade5e4d9
> >>
> >> How do I get this working?
> >>
> >> Thanks,
> >> Jamie
> >>
>


Re: Problem with Password Decryption in Data Import Handler

2016-10-06 Thread Jamie Jackson
I tried it both ways yesterday--with a newline and without.

On Thu, Oct 6, 2016 at 12:26 PM, Kevin Risden 
wrote:

> I haven't tried this but is it possible there is a new line at the end in
> the file?
>
> If you did something like echo "" > file.txt then there would be a new
> line. Use echo -n "" > file.txt
>
> Also you should be able to check how many characters are in the file.
>
> Kevin Risden
>
> On Wed, Oct 5, 2016 at 5:00 PM, Jamie Jackson 
> wrote:
>
> > Hi Folks,
> >
> > (Using Solr 5.5.3.)
> >
> > As far as I know, the only place where encrypted password use is
> documented
> > is in
> > https://cwiki.apache.org/confluence/display/solr/
> > Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler,
> > under the "Configuring the DIH Configuration File", in a comment in the
> > sample XML file:
> >
> > 
> >
> > Anyway, I can encrypt just fine:
> >
> > $ openssl enc -aes-128-cbc -a -salt -in stgps.txt
> > enter aes-128-cbc encryption password:
> > Verifying - enter aes-128-cbc encryption password:
> > U2FsdGVkX1+VtVoQtmEREvB5qZjn3131+N4jRXmjyIY=
> >
> >
> > I can also decrypt just fine from the command line.
> >
> > However, if I use the encrypted password and encryptKeyFile in the config
> > file, I end up with an error: "String length must be a multiple of four."
> >
> > https://gist.github.com/jamiejackson/3852dacb03432328ea187d43ade5e4d9
> >
> > How do I get this working?
> >
> > Thanks,
> > Jamie
> >
>


Re: Writing Solr Custom Components

2016-10-06 Thread John Bickerstaff
Thank you Otis!

On Thu, Oct 6, 2016 at 10:28 AM, Otis Gospodnetić <
otis.gospodne...@gmail.com> wrote:

> John, if it helps, here are a few examples of custom Solr SearchComponents:
>
> https://github.com/sematext/query-segmenter
> https://github.com/sematext/solr-researcher
> https://github.com/sematext/solr-autocomplete
>
> I hope this helps.
>
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
> On Wed, Oct 5, 2016 at 9:34 AM, John Bickerstaff  >
> wrote:
>
> > Thank you both!
> >
> > On Oct 5, 2016 2:32 AM, "Charlie Hull"  wrote:
> >
> > > On 04/10/2016 17:23, John Bickerstaff wrote:
> > >
> > >> All,
> > >>
> > >> I'm looking for information on writing custom Solr components.  A
> quick
> > >> search showed nothing really recent and before I dig deeper, I thought
> > I'd
> > >> ask the community for anything you are aware of.
> > >>
> > >> Thanks
> > >>
> > >> We wrote a few for the BioSolr project:
> https://github.com/flaxsearch/
> > > BioSolr - the ontology one might be useful
> > https://github.com/flaxsearch/
> > > BioSolr/tree/master/ontology Count yourself lucky, you could be doing
> it
> > > for Elasticsearch :) http://www.flax.co.uk/blog/201
> > > 6/01/27/fun-frustration-writing-plugin-elasticsearch-
> ontology-indexing/
> > >
> > > Charlie
> > >
> > > --
> > > Charlie Hull
> > > Flax - Open Source Enterprise Search
> > >
> > > tel/fax: +44 (0)8700 118334
> > > mobile:  +44 (0)7767 825828
> > > web: www.flax.co.uk
> > >
> >
>


Re: Writing Solr Custom Components

2016-10-06 Thread Otis Gospodnetić
John, if it helps, here are a few examples of custom Solr SearchComponents:

https://github.com/sematext/query-segmenter
https://github.com/sematext/solr-researcher
https://github.com/sematext/solr-autocomplete

I hope this helps.

Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


On Wed, Oct 5, 2016 at 9:34 AM, John Bickerstaff 
wrote:

> Thank you both!
>
> On Oct 5, 2016 2:32 AM, "Charlie Hull"  wrote:
>
> > On 04/10/2016 17:23, John Bickerstaff wrote:
> >
> >> All,
> >>
> >> I'm looking for information on writing custom Solr components.  A quick
> >> search showed nothing really recent and before I dig deeper, I thought
> I'd
> >> ask the community for anything you are aware of.
> >>
> >> Thanks
> >>
> >> We wrote a few for the BioSolr project: https://github.com/flaxsearch/
> > BioSolr - the ontology one might be useful
> https://github.com/flaxsearch/
> > BioSolr/tree/master/ontology Count yourself lucky, you could be doing it
> > for Elasticsearch :) http://www.flax.co.uk/blog/201
> > 6/01/27/fun-frustration-writing-plugin-elasticsearch-ontology-indexing/
> >
> > Charlie
> >
> > --
> > Charlie Hull
> > Flax - Open Source Enterprise Search
> >
> > tel/fax: +44 (0)8700 118334
> > mobile:  +44 (0)7767 825828
> > web: www.flax.co.uk
> >
>


Re: separate core from engine

2016-10-06 Thread Shawn Heisey
On 10/6/2016 11:07 AM, KRIS MUSSHORN wrote:
> Currently Solr ( 5.4.1 ) and its core data are all in one location.
> How would i set up Solr so that the core data could be stored
> somewhere else? Pointers to helpful instructions are appreciated 

If you use the service installation script on a UNIX or UNIX-like
operating system, Solr is installed in a different location than its
data by default.  The program defaults to /opt/solr and the data to
/var/solr.

I strongly recommend using a free operating system and the service
install script.  The service install script has a number of options that
let you customize many aspects of the installation.

https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production

At this time, there is no official way to automate installation on
Windows.  That may come in the future ... but Windows is fairly low
priority here.

If you are starting manually with the bin/solr script, there is a
commandline option to start with a different solr home, which would let
you put the data anywhere you want.
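
For example (the path is illustrative):

bin/solr start -s /path/to/solr/home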

Thanks,
Shawn



Re: separate core from engine

2016-10-06 Thread Alexandre Rafalovitch
You have the solr home property (solr.solr.home) to point to where all
your collections/cores are, and then you can set various directory
locations per core in the core.properties file.

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 7 October 2016 at 00:07, KRIS MUSSHORN  wrote:
> Currently Solr ( 5.4.1 ) and its core data are all in one location.
>
> How would i set up Solr so that the core data could be stored somewhere else?
>
> Pointers to helpful instructions are appreciated
>
> TIA
>
> Kris


separate core from engine

2016-10-06 Thread KRIS MUSSHORN
Currently Solr ( 5.4.1 ) and its core data are all in one location. 

How would i set up Solr so that the core data could be stored somewhere else? 

Pointers to helpful instructions are appreciated 

TIA 

Kris 


Re: Problem with Password Decryption in Data Import Handler

2016-10-06 Thread Alexandre Rafalovitch
How long is the encryption key (file content)? Because the code I am
looking at seems to expect it to be at most 100 characters.

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 6 October 2016 at 23:26, Kevin Risden  wrote:
> I haven't tried this but is it possible there is a new line at the end in
> the file?
>
> If you did something like echo "" > file.txt then there would be a new
> line. Use echo -n "" > file.txt
>
> Also you should be able to check how many characters are in the file.
>
> Kevin Risden
>
> On Wed, Oct 5, 2016 at 5:00 PM, Jamie Jackson  wrote:
>
>> Hi Folks,
>>
>> (Using Solr 5.5.3.)
>>
>> As far as I know, the only place where encrypted password use is documented
>> is in
>> https://cwiki.apache.org/confluence/display/solr/
>> Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler,
>> under the "Configuring the DIH Configuration File", in a comment in the
>> sample XML file:
>>
>> 
>>
>> Anyway, I can encrypt just fine:
>>
>> $ openssl enc -aes-128-cbc -a -salt -in stgps.txt
>> enter aes-128-cbc encryption password:
>> Verifying - enter aes-128-cbc encryption password:
>> U2FsdGVkX1+VtVoQtmEREvB5qZjn3131+N4jRXmjyIY=
>>
>>
>> I can also decrypt just fine from the command line.
>>
>> However, if I use the encrypted password and encryptKeyFile in the config
>> file, I end up with an error: "String length must be a multiple of four."
>>
>> https://gist.github.com/jamiejackson/3852dacb03432328ea187d43ade5e4d9
>>
>> How do I get this working?
>>
>> Thanks,
>> Jamie
>>


Re: Problem with Password Decryption in Data Import Handler

2016-10-06 Thread Kevin Risden
I haven't tried this but is it possible there is a new line at the end in
the file?

If you did something like echo "" > file.txt then there would be a new
line. Use echo -n "" > file.txt

Also you should be able to check how many characters are in the file.
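
For example (a sketch; the file name is illustrative):

printf '%s' "mykey" > encryptKeyFile.txt   # write the key without a trailing newline
wc -c encryptKeyFile.txt                   # byte count should equal the key length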

Kevin Risden

On Wed, Oct 5, 2016 at 5:00 PM, Jamie Jackson  wrote:

> Hi Folks,
>
> (Using Solr 5.5.3.)
>
> As far as I know, the only place where encrypted password use is documented
> is in
> https://cwiki.apache.org/confluence/display/solr/
> Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler,
> under the "Configuring the DIH Configuration File", in a comment in the
> sample XML file:
>
> 
>
> Anyway, I can encrypt just fine:
>
> $ openssl enc -aes-128-cbc -a -salt -in stgps.txt
> enter aes-128-cbc encryption password:
> Verifying - enter aes-128-cbc encryption password:
> U2FsdGVkX1+VtVoQtmEREvB5qZjn3131+N4jRXmjyIY=
>
>
> I can also decrypt just fine from the command line.
>
> However, if I use the encrypted password and encryptKeyFile in the config
> file, I end up with an error: "String length must be a multiple of four."
>
> https://gist.github.com/jamiejackson/3852dacb03432328ea187d43ade5e4d9
>
> How do I get this working?
>
> Thanks,
> Jamie
>


Re: SOLR Sizing

2016-10-06 Thread Walter Underwood
The square-root rule comes from a short paper draft (unpublished) that I can’t 
find right now. But this paper gets the same result:

http://nflrc.hawaii.edu/rfl/April2005/chujo/chujo.html 
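
(For reference, this is Heaps' law, which models the vocabulary of a text of
n tokens as

    V(n) = K n^{\beta}, \qquad \beta \approx 0.4\text{--}0.6,

so \beta near 0.5 gives the square-root rule; dirty OCR inflates it by
minting new "words" on almost every page.)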


Perfect OCR would follow this rule, but even great OCR has lots of errors. 95% 
accuracy is good OCR performance, but that makes a huge, pathological long tail 
of non-language terms.

I learned about the OCR problems from the Hathi Trust. They hit the Solr 
vocabulary limit of 2.4 billion terms; then, when that was raised, they hit 
memory management issues.

https://www.hathitrust.org/blogs/large-scale-search/too-many-words 

https://www.hathitrust.org/blogs/large-scale-search/too-many-words-again 


wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 6, 2016, at 8:05 AM, Rick Leir  wrote:
> 
> I am curious to know where the square-root assumption is from, and why OCR 
> (without errors) would break it. TIA
> 
> cheers - - Rick
> 
> On 2016-10-04 10:51 AM, Walter Underwood wrote:
>> No, we don’t have OCR’ed text. But if you do, it breaks the assumption that 
>> vocabulary size
>> is the square root of the text size.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Oct 4, 2016, at 7:14 AM, Rick Leir  wrote:
>>> 
>>> OCR’ed text can have large amounts of garbage such as '';,-d'." 
>>> particularly when there is poor image quality or embedded graphics. Is that 
>>> what is causing your huge vocabularies? I filtered the text, removing any 
>>> word with fewer than 3 alphanumerics or more than 2 non-alphas.
>>> 
>>> 
>>> On 2016-10-03 09:30 PM, Walter Underwood wrote:
 That approach doesn’t work very well for estimates.
 
 Some parts of the index size and speed scale with the vocabulary instead 
 of the number of documents.
 Vocabulary usually grows at about the square root of the total amount of 
 text in the index. OCR’ed text
 breaks that estimate badly, with huge vocabularies.
 
 
> 



Re: Upgrading to SolrCloud

2016-10-06 Thread Shawn Heisey
On 10/6/2016 9:02 AM, Steven White wrote:
> We currently have a component that uses SolrJ and Solr REST API to admin
> Solr (adding new fields, changing handlers, etc. to customize Solr's
> schema) based on customer's DB schema before we start indexing.
>
> If we switch over to SolrCloud:
>
> 1) Will our existing usage of SolrJ and REST API still work as-is?

Generally speaking, if you change from a variant like HttpSolrClient
connecting to a non-cloud install to CloudSolrClient connecting to a
cloud install, the rest of your SolrJ code will *probably* work with no
other changes.  That will largely depend on the config/schema being
similar between the cloud install and the non-cloud install.

> 2) Not all of our customers need that high availability of Solr.  For
> those, single server and single index will do just fine.  In this case, can
> I configure SolrCloud to single server with single core?  When I do so, am
> I impacting performance of Solr?

You can have collections in the cloud that have a single shard and a
single replica -- only one core in the entire collection.  These kinds
of collections are vulnerable to failures if the server with the single
core goes down, of course.  Aside from that, they work just like
collections with more shards and/or more replicas.

The "old" http API still works even in cloud mode, using collection
names instead of core names in the URL -- with the added advantage that
you can send such requests to ANY node in the cloud, and they will find
their way to the correct location.  Updates are more efficient if they
are sent to the correct shard leader, which CloudSolrClient does by default.
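
A minimal sketch of the client-side change (the zkHost string and collection
name are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

// Point the client at ZooKeeper instead of a fixed Solr URL
CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");
client.setDefaultCollection("mycollection");

// Requests are routed to live nodes / shard leaders automatically
QueryResponse rsp = client.query(new SolrQuery("*:*"));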

Thanks,
Shawn



Re: Upgrading to SolrCloud

2016-10-06 Thread Jan Høydahl
> On 6 Oct 2016 at 17:02, Steven White wrote:
> If we switch over to SolrCloud:
> 
> 1) Will our existing usage of SolrJ and REST API still work as-is?
Yes, probably

> 2) Not all of our customers need that high availability of Solr.  For
> those, single server and single index will do just fine.  In this case, can
> I configure SolrCloud to single server with single core?  When I do so, am
> I impacting performance of Solr?

If you have a collection with only one shard, there’s no overhead during 
indexing
or querying. You would use CloudSolrClient which will talk to ZK but that
does not happen for every request, so it will be smart enough to send the
requests directly to the node that should serve them.

> I'm thinking performance will be impacted because there is now an extra
> layer my requests will have to go through.

Nope. But note that when you create collections, Solr may assign you a 
node which is already used by other collections, and that may potentially
cause performance issues if the node is not powerful enough to drive both
collections. But you can also choose what node to use when creating the 
collection

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com



Re: Queries to help warm up (mmap)

2016-10-06 Thread Walter Underwood
I use the schema browser to find the 20 most common words. I use those, 
assuming  that they’ll be the most common in queries. Those are static warming 
queries in solrconfig.xml.

This works fairly well for book or movie titles. Not so well for free text.

You could do the same thing with query log analysis. Use your most frequent 
queries.
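
For example, in solrconfig.xml (a sketch; the field and terms are
placeholders for your own 20 most common words):

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">title:love</str></lst>
    <lst><str name="q">title:war</str></lst>
  </arr>
</listener>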

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 6, 2016, at 8:55 AM, Rallavagu  wrote:
> 
> Looking for clues/recommendations to help warm up during startup. Not 
> necessarily Solr caches but mmap as well. I have used something like "q=<field name>:[* TO *]" for various fields and it seems to help with mmap population 
> around 40-50%. Is there anything else that could help achieve 90% or more? 
> Thanks.



Upgrading to SolrCloud (take 2)

2016-10-06 Thread Steven White
(Sorry if this is a second post; the first one I posted an hour ago has yet to
make it to the mailing list!!)


Hi everyone,

Currently, we are on Solr 5.2 and use 1 core and none of the cloud
features.  We are planning to upgrade to Solr 6.2 and utilize SolrCloud not
because our data need to scale (single core with no cloud is doing just
fine on our index of 2 million records and about 15 gb index size) but
because some of our customers want high availability.

We currently have a component that uses SolrJ and Solr REST API to admin
Solr (adding new fields, changing handlers, etc. to customize Solr's
schema) based on customer's DB schema before we start indexing.

If we switch over to SolrCloud:

1) Will our existing usage of SolrJ and REST API still work as-is?
2) Not all of our customers need that high availability of Solr.  For
those, single server and single index will do just fine.  In this case, can
I configure SolrCloud to single server with single core?  When I do so, am
I impacting performance of Solr?

I'm thinking performance will be impacted because there is now an extra
layer my requests will have to go through.

Thanks in advance.

Steve


Re: Average of Averages in Solr

2016-10-06 Thread John Bickerstaff
This may help?  Note the "Bloomberg Analytics" at the bottom of the post...

https://dzone.com/articles/solr-not-just-for-text-anymore

Quote from article:


   - *Bloomberg Analytics Component for Solr*: Bloomberg Financial Services
   uses Solr extensively, and found the existing statistical packages woefully
   lacking. So, they developed a high-performance framework that can perform
   complex calculations and aggregations on time-series data, and then
   released it to OpenSource.


On Thu, Oct 6, 2016 at 8:53 AM, Shawn Heisey  wrote:

> On 10/6/2016 12:04 AM, Lewin Joy (TMS) wrote:
> > There is a requirement to take an average on "Amount" field against
> > each "code" field. And then calculate the averages on this averages.
> > Since my "code" field has a very huge cardinality, which could be
> > around 200,000 or even in millions ; It gets highly complex to
> > calculate the average of averages through Java. Even Solr takes a huge
> > time listing the averages. And the JSON response size becomes huge. Is
> > there some way we can tackle this? Any way to do stats on stats?
>
> I wasn't sure what you meant with the first sentence I quoted above, but
> in order to get statistics from your index that are relevant for the
> results of a query, you probably want the stats component.
>
> https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
>
> Thanks,
> Shawn
>
>


Queries to help warm up (mmap)

2016-10-06 Thread Rallavagu
Looking for clues/recommendations to help warm up during startup. Not 
necessarily Solr caches but mmap as well. I have used something like 
"q=<field name>:[* TO *]" for various fields and it seems to help with 
mmap population around 40-50%. Is there anything else that could help 
achieve 90% or more? Thanks.


Re: Average of Averages in Solr

2016-10-06 Thread Shawn Heisey
On 10/6/2016 12:04 AM, Lewin Joy (TMS) wrote:
> There is a requirement to take an average on "Amount" field against
> each "code" field. And then calculate the averages on this averages.
> Since my "code" field has a very huge cardinality, which could be
> around 200,000 or even in millions ; It gets highly complex to
> calculate the average of averages through Java. Even Solr takes a huge
> time listing the averages. And the JSON response size becomes huge. Is
> there some way we can tackle this? Any way to do stats on stats?

I wasn't sure what you meant with the first sentence I quoted above, but
in order to get statistics from your index that are relevant for the
results of a query, you probably want the stats component.

https://cwiki.apache.org/confluence/display/solr/The+Stats+Component

Thanks,
Shawn



Re: JSON Facet "allBuckets" behavior

2016-10-06 Thread prosens
Yonik,
Here is the requirement:
Get the sum of the size field for all the documents that have a duplicate
in the index. Duplicates are decided based on a string field. So, we are
looking for something like this:
{
"Statistics": {
"type": "terms",
"field": "filename",
"mincount": 2,
"numBuckets": true, 
*"sumBuckets": true*
}
}

Is there an alternate way to achieve this?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/JSON-Facet-allBuckets-behavior-tp4298289p4299980.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SOLR Sizing

2016-10-06 Thread Erick Erickson
OCR _without errors_ wouldn't break it. That comment assumed that the OCR
was dirty, I thought.

Honest, I once was trying to index an OCR'd image of a "family tree" that was a
stylized tree where the most remote ancestor was labeled in vertical text on the
trunk, and descendants at various angles as the trunk branched, the branches
branched and on and on

And as far as cleaning up the text is concerned, if it's dirty, anything
you do is wrong. For instance, again using the genealogy example, throwing
out unrecognized words removes data that's important when they're names.

But leaving nonsense characters in is wrong too

And hand-correcting all of the data is almost always far too expensive.

If your OCR is, indeed perfect, then I envy you ;)...

On a different note, I thought the captcha-image way of correcting OCR
text was brilliant.

Erick

On Thu, Oct 6, 2016 at 8:05 AM, Rick Leir  wrote:
> I am curious to know where the square-root assumption is from, and why OCR
> (without errors) would break it. TIA
>
> cheers - - Rick
>
> On 2016-10-04 10:51 AM, Walter Underwood wrote:
>>
>> No, we don’t have OCR’ed text. But if you do, it breaks the assumption
>> that vocabulary size
>> is the square root of the text size.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>>> On Oct 4, 2016, at 7:14 AM, Rick Leir  wrote:
>>>
>>> OCR’ed text can have large amounts of garbage such as '';,-d'."
>>> particularly when there is poor image quality or embedded graphics. Is that
>>> what is causing your huge vocabularies? I filtered the text, removing any
>>> word with fewer than 3 alphanumerics or more than 2 non-alphas.
>>>
>>>
>>> On 2016-10-03 09:30 PM, Walter Underwood wrote:

 That approach doesn’t work very well for estimates.

 Some parts of the index size and speed scale with the vocabulary instead
 of the number of documents.
 Vocabulary usually grows at about the square root of the total amount of
 text in the index. OCR’ed text
 breaks that estimate badly, with huge vocabularies.


>


Re: SOLR Sizing

2016-10-06 Thread Rick Leir
I am curious to know where the square-root assumption is from, and why 
OCR (without errors) would break it. TIA


cheers - - Rick

On 2016-10-04 10:51 AM, Walter Underwood wrote:

No, we don’t have OCR’ed text. But if you do, it breaks the assumption that 
vocabulary size
is the square root of the text size.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Oct 4, 2016, at 7:14 AM, Rick Leir  wrote:

OCR’ed text can have large amounts of garbage such as '';,-d'." particularly 
when there is poor image quality or embedded graphics. Is that what is causing your 
huge vocabularies? I filtered the text, removing any word with fewer than 3 
alphanumerics or more than 2 non-alphas.


On 2016-10-03 09:30 PM, Walter Underwood wrote:

That approach doesn’t work very well for estimates.

Some parts of the index size and speed scale with the vocabulary instead of the 
number of documents.
Vocabulary usually grows at about the square root of the total amount of text 
in the index. OCR’ed text
breaks that estimate badly, with huge vocabularies.






Upgrading to SolrCloud

2016-10-06 Thread Steven White
Hi everyone,

Currently, we are on Solr 5.2 and use 1 core and none of the cloud
features.  We are planning to upgrade to Solr 6.2 and utilize SolrCloud not
because our data need to scale (single core with no cloud is doing just
fine on our index of 2 million records and about 15 gb index size) but
because some of our customers want high availability.

We currently have a component that uses SolrJ and Solr REST API to admin
Solr (adding new fields, changing handlers, etc. to customize Solr's
schema) based on customer's DB schema before we start indexing.

If we switch over to SolrCloud:

1) Will our existing usage of SolrJ and REST API still work as-is?
2) Not all of our customers need that high availability of Solr.  For
those, single server and single index will do just fine.  In this case, can
I configure SolrCloud to single server with single core?  When I do so, am
I impacting performance of Solr?

I'm thinking performance will be impacted because there is now an extra
layer my requests will have to go through.

Thanks in advance.

Steve


Re: QuerySenderListener

2016-10-06 Thread Erick Erickson
Hmm, that JIRA looks like exactly what's going on. I suspect the
reason it's not generating that much interest is that restarting Solr
should be a rare enough event that opening two searchers isn't causing
enough difficulty for someone to break loose the time to create a
patch.

The patch attached just illustrated the case, it doesn't contain a fix.

So I'm afraid there's nothing OOB that'll fix this.

Best,
Erick

On Wed, Oct 5, 2016 at 7:40 PM, Rallavagu  wrote:
> Not sure if this is related.
>
> https://issues.apache.org/jira/browse/SOLR-7035
>
> firstSearcher has a few queries that run long (~3 min)
>
> On 10/5/16 6:58 PM, Erick Erickson wrote:
>>
>> How many cores? Is it possible you're seeing these from two different
>> cores?
>>
>> Erick
>>
>> On Wed, Oct 5, 2016 at 11:44 AM, Rallavagu  wrote:
>>>
>>> Solr Cloud 5.4.1 with embedded jetty, jdk8
>>>
>>> At the time of startup it appears that "QuerySenderListener" is run twice
>>> and this is causing "firstSearcher" and "newSearcher" to run twice as
>>> well.
>>> Any clues as to why QuerySenderListener is triggered twice? Thanks.


Rollback solrcloud

2016-10-06 Thread Pablo Anzorena
Hey,

I was trying to make a rollback under solrcloud and found that it's not
supported
https://issues.apache.org/jira/browse/SOLR-4895 (I have solr6.1.0)

So my question is, how can I simulate a rollback?
Actually what I'm doing is:

   1. prepareCommit
   2. add documents
   3. try to commit
   4. if success, then exit, else rollback.

I should point out that there are never multiple threads preparing commits
or adding documents; it's just a single thread.

Thanks.


Re: [Solr-5-4-1] Why SolrCloud leader is putting all replicas in recovery at the same time ?

2016-10-06 Thread Erick Erickson
There is no information here at all that would allow us to say anything
meaningful. You might review:
http://wiki.apache.org/solr/UsingMailingLists

What do the logs say? Are there any exceptions? What happens on your system
that's
unusual if anything? In short, what have you tried to do to diagnose the
cause and what
have you learned?

But here's a random guess. You didn't configure your log4j properties and
your logs, particularly
your console log have grown to huge sizes and occasionally you encounter
disk full issues. Or
your ZK nodes have the same problem, they accumulate snapshots (see the
Zookeeper admin page).

Best,
Erick

On Thu, Oct 6, 2016 at 2:23 AM, Gerald Reinhart 
wrote:

>
> Hello everyone,
>
> Our Solr Cloud  works very well for several months without any
> significant changes: the traffic to serve is stable, no major release
> deployed...
>
> But randomly, the Solr Cloud leader puts all the replicas in recovery
> at the same time for no obvious reason.
>
> Hence, we can not serve the queries any more and the leader is
> overloaded while replicating all the indexes on the replicas at the same
> time which eventually implies a downtime of approximately 30 minutes.
>
> Is there a way to prevent it ? Ideally, a configuration saying a
> percentage of replicas to be put in recovery at the same time?
>
> Thanks,
>
> Gérald, Elodie and Ludovic
>
>
> --
> [image: Kelkoo]
>
> *Gérald Reinhart *Software Engineer
>
> *E* gerald.reinh...@kelkoo.com    *Y!Messenger* gerald.reinhart
> *T* +33 (0)4 56 09 07 41
> *A* Parc Sud Galaxie - Le Calypso, 6 rue des Méridiens, 38130 Echirolles
>
>
>
> --
> Kelkoo SAS
> Société par Actions Simplifiée
> Au capital de € 4.168.964,30
> Siège social : 158 Ter Rue du Temple 75003 Paris
> 425 093 069 RCS Paris
>
> This message and its attachments are confidential and intended exclusively
> for their addressees. If you are not the intended recipient of this
> message, please delete it and notify the sender.
>


Best practice for Fuzzy Search combined with Phrase Queries

2016-10-06 Thread Markus Lang
Hi,
I am interested in best practices on how to handle phrase queries where
only a part of the phrase may match and / or the user made some typos.
Are there any papers on when to use only a part of the query phrase or how
many words of the phrase should rather be corrected before skipping them?
Does anyone know how e.g. Google or Amazon deals with these issues?

Best regards

Markus


[Solr-5-4-1] Why SolrCloud leader is putting all replicas in recovery at the same time ?

2016-10-06 Thread Gerald Reinhart


Hello everyone,

   Our Solr Cloud has worked very well for several months without any significant 
changes: the traffic to serve is stable, no major release deployed...

   But randomly, the Solr Cloud leader puts all the replicas in recovery at the 
same time for no obvious reason.

   Hence, we can not serve the queries any more and the leader is overloaded 
while replicating all the indexes on the replicas at the same time which 
eventually implies a downtime of approximately 30 minutes.

   Is there a way to prevent it ? Ideally, a configuration saying a percentage 
of replicas to be put in recovery at the same time?

Thanks,

Gérald, Elodie and Ludovic


--

Gérald Reinhart Software Engineer

E gerald.reinh...@kelkoo.com    Y!Messenger gerald.reinhart
T +33 (0)4 56 09 07 41
A Parc Sud Galaxie - Le Calypso, 6 rue des Méridiens, 38130 Echirolles






Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 158 Ter Rue du Temple 75003 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended exclusively 
for their addressees. If you are not the intended recipient of this message, 
please delete it and notify the sender.


Re: running solr 6.x in Eclipse for debugging

2016-10-06 Thread John Bickerstaff
Thank you very much Erick - I'll try that tomorrow.

On Wed, Oct 5, 2016 at 7:57 PM, Erick Erickson 
wrote:

> John:
>
> The simple answer is "cheat"
>
> It takes a little fiddling, but here's what I do in IntelliJ:
>
> 1> use IntelliJ to create an "artifact" that's just the jar WITHOUT
> the associated Solr jar dependencies, just the compiled output
> 1a> Find the bits in <1>. It's usually under my project somewhere
> ...out/artifacts/whatever/whatever.jar
> 1aa> use something like "jar -xvf yourjar.jar" to verify the classes
> are in it that you expect.
> 1b> Be sure when creating the "artifacts" to check the (non obvious)
> "build on make" checkbox OR be sure you use "build artifacts" from the
> build menu when you want to debug.
> 1c> execute "ant server dist" first. The "dist"directory (and possibly
> the solrj-lib below it) will contain all of the Solr jars you should
> need.
> 2> Now change solrconfig.xml to add a <lib> directive to point to <1a> (sketch below).
> 3> At this point, you don't have to copy anything around anywhere.
> Every time you rebuild your project/plugin the new jar is picked up
> when you restart Solr. All the Solr jars you depend on for your plugin
> are loaded and available when Solr starts. If you have suspend=y set
> in your start command, you can set breakpoints in your initialization
> (or anywhere else) in your plugin (or anywhere in Solr).
>
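> For example (a sketch; the jar path is illustrative):
>
>   <lib path="/path/to/project/out/artifacts/myplugin/myplugin.jar"/>
>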
> HINT: If at all possible and you can write a Junit test, it's easier
> to just debug _that_ than go all the stuff above, you can debug
> individual junit tests..
>
> FWIW,
> Erick
>
> On Wed, Oct 5, 2016 at 2:12 PM, John Bickerstaff
>  wrote:
> > I've found this much in build.xml...
> >
> >  I'm assuming Ant puts the compiled jars into the paths listed below.
> >
> > Further hints gratefully accepted if someone knows specifically how to
> set
> > this up from top to bottom. I assume the Eclipse build path puts the jars
> > into the referenced directories...?
> >
> >  >
> >   description="Creates the Webapp folder for distribution."
> >
> >   depends="dist-core, dist-solrj, lucene-jars-to-solr">
> >
> >  > "contribs-add-to-webapp"/>
> >
> > 
> >
> > 
> >
> >   
> >
> >
> >
> > 
> >
> > 
> >
> > 
> >
> >> "${exclude.from.webapp},${common.classpath.excludes}"/>
> >
> >> "${exclude.from.webapp},${common.classpath.excludes}"/>
> >
> >> "${exclude.from.webapp},${common.classpath.excludes}" />
> >
> >> "${exclude.from.webapp},${common.classpath.excludes}">
> >
> > 
> >
> > 
> >
> >   
> >
> > 
> >
> >   
> >
> > On Wed, Oct 5, 2016 at 1:30 PM, John Bickerstaff <
> j...@johnbickerstaff.com>
> > wrote:
> >
> >> OK - I'm running now in debug mode.  My intent is to add and test a
> "hello
> >> world" plugin to prove everything is wired up and that I can debug all
> the
> >> way into the plugin I wrote...
> >>
> >> I want to test plugins/addons which, as I understand it go here if
> you're
> >> adding them to an installed version of Solr.
> >>
> >> solr-6.x.x/server/solr-webapp/webapp/WEB-INF/lib/
> >>
> >> So - to get that all working (build and seeing source when in debug
> >> mode)...
> >>
> >> 1. Where, exactly, should I place the source code for a new plugin that
> is
> >> NOT a part of the Solr distribution if I want to be able to debug
> through
> >> that code as well?  (I understand that the appropriate entries will
> need to
> >> be in the correct xml files in the core/collection configurations)
> >>
> >> 2. Will I need to change the build.xml files in some way?  If yes,
> please
> >> tell me how, I'm unfamiliar with Ant.
> >>
> >> 3. In case I'm in "X/Y problem mode" here - My goal is: Add plugin
> source
> >> code, build, make config changes where needed, and see source when I
> put a
> >> breakpoint in code.
> >>
> >>
> >>
> >> On Wed, Oct 5, 2016 at 12:04 PM, John Bickerstaff <
> >> j...@johnbickerstaff.com> wrote:
> >>
> >>> Thanks Mikhail!
> >>>
> >>> On Wed, Oct 5, 2016 at 11:29 AM, Mikhail Khludnev 
> >>> wrote:
> >>>
>  ok. it's "run-example" [ ..@solr]$ant -p
>   run-example  Run Solr interactively, via Jetty.
>  -Dexample.debug=true to enable the JVM debugger
>  I have it in master and branch_6x
> 
>  On Wed, Oct 5, 2016 at 5:51 PM, John Bickerstaff <
>  j...@johnbickerstaff.com>
>  wrote:
> 
>  > Mikhail -- which version of Solr are you using to do this [ant
> example
>  > -Dexample.debug=true]
>  >
>  > I may be wrong, but it seems that "example" no longer works with
>  6.x...?
>  >
>  > On Wed, Oct 5, 2016 at 1:14 AM, Mikhail Khludnev 
>  wrote:
>  >
>  > > launching ant example -Dexample.debug=true from Eclipse works for me.
> me.
>  > > It takes a while for useless compile checks, then you can debug
>  remotely
>  > to
>  > > 5005.
>  > 

Average of Averages in Solr

2016-10-06 Thread Lewin Joy (TMS)
•• PROTECTED (Confidential - internal use only)

Hi,

I have a big collection with around 100 million records.
There is a requirement to take an average of the "Amount" field for each
"code" value, and then calculate the average of those averages.
Since my "code" field has very high cardinality (around 200,000 values, or
even millions), it gets highly complex to calculate the average of averages
through Java.
Even Solr takes a long time listing the averages, and the JSON response
size becomes huge.
Is there some way we can tackle this? Any way to do stats on stats?

Thanks,
Lewin