Status Collection Down

2016-08-26 Thread Hardika Catur S

Hi,

I am seeing a problem with collection status in Apache Solr: when Solr
restarts, some collections' status changes to "down". It happened on
servers 00 and 01.

How can I bring those collections back up?



Please help me to find a solution.

Thanks,
Hardika CS.


Why can't we get multiple payloadFields from suggester?

2016-08-26 Thread Siddharth Gargate
I need multiple payload fields from the suggester. I tried the following,
but it didn't work:



<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">BlendedInfixLookupFactory</str>
    <str name="blenderType">position_linear</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">longDesc_txt</str>
    <str name="suggestAnalyzerFieldType">text_en_splitting</str>
    <str name="buildOnStartup">false</str>
    <str name="payloadField">desc_txt</str>
    <str name="payloadField">code</str>
    <str name="payloadField">type</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

Also, if the payload field is multivalued, I get only one value back as the
payload.

The only option I see is to concatenate the multiple field values and
post-process them. Is there a better solution? Is there any plan to
implement this feature?


Solr for Multi Tenant architecture

2016-08-26 Thread Chamil Jeewantha
Dear Solr Members,

We are using SolrCloud as the search provider of a multi-tenant, cloud-based
application. We have one schema for all the tenants. The indexes will hold a
large number (millions) of documents.

Based on our research, we have two options:

   - One large collection for all the tenants and use Composite-ID routing
   - Collection per tenant
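
As we understand it, Composite-ID routing simply prefixes the tenant id to
the document id so that all of a tenant's documents hash to the same shard
range (names below are illustrative):

   { "id": "tenantA!doc123", ... }

and queries can then be limited to that tenant's shard(s):

   /solr/big_collection/select?q=*:*&_route_=tenantA!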

The mail below says:


https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201403.mbox/%3c5324cd4b.2020...@protulae.com%3E

SolrCloud is *more scalable in terms of index size*. Plus you get
redundancy which can't be underestimated in a hosted solution.


AND

The issue is management. 1000s of cores/collections require a level of
automation. On the other hand, having a single core/collection means if
you make one change to the schema or solrconfig, it affects everyone.


Based on the above facts, we think one large collection will be the way
to go.

Questions:

   1. Is that the right way to go?
   2. Will it be a hassle when we need to do reindexing?
   3. What is the chance of an entire-collection crash? (In that case all
   tenants would be affected and reindexing would be painful.)

Thank you in advance for your kind opinion.

Best Regards,
Chamil

-- 
http://kavimalla.blgospot.com
http://kdchamil.blogspot.com


Re: Get results from Solr facets

2016-08-26 Thread Mikhail Khludnev
Hello,

Have you checked the JSON Facet API (json.facet)? It lets you combine many
such instructions in one request.
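
A minimal sketch of your two facet queries in that syntax (untested):

  json.facet={
    tiger : {type: query, q: "dict1:\"#tiger#\""},
    lion  : {type: query, q: "dict1:\"#lion#\""}
  }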

On Fri, Aug 26, 2016 at 4:09 PM, Marta (motagirl2) wrote:

> Hi everybody,
> I am pretty new to Solr, so I don't know if what I'd like to achieve is
> actually feasible or not. Currently, I am querying my Solr to retrieve the
> amount of results that match the conditions in several facet queries. For
> example:
>
> localhost:8082/solr/dict/select?q=*:*&rows=0&wt=json&indent=true&facet=true
> &facet.query=dict1:"#tiger#"&facet.query=dict1:"#lion#"
>
> With this kind of query, I am getting the count of Solr docs containing
> "tiger" and the count of those cointaining "lion", in field "dict1":
>
>  {
>   "responseHeader": {
> "status": 0,
> "QTime": 239,
> "params": {
>   "facet.query": [
> "dict1:\"#tiger#\"",
> "dict1:\"#lion#\""
>   ],
>   "q": "*:*",
>   "indent": "true",
>   "rows": "0",
>   "wt": "json",
>   "facet": "true"
> }
>   },
>   "response": {
> "numFound": 37278987,
> "start": 0,
> "docs": [ ]
>   },
>   "facet_counts": {
> "facet_queries": {
>   "dict1:\"#tiger#\"": 6,
>   "dict1:\"#lion#\"": 10
> },
> [...]
>   }
> }
>
> The thing is that now I need to also get some results for each facet,
> alongside the count (for example, three results for "tiger" and three more
> for "lion").
>
> I have read some similar questions (Solr Facetting - Showing First 10
> results and Other, or SOLR - Querying Facets, return N results per Facet),
> but none of their answers seems to work for me, maybe because I am doing
> the facets on all docs (q=*:*).
>
> Any help will be welcome :)
>
>
> (I posted this issue also in Stackoverflow, you can see it here
> http://stackoverflow.com/questions/39164957/get-results-from-solr-facets )
> --
> marta - motagirl
> http://motagirl.net
>



-- 
Sincerely yours
Mikhail Khludnev


RE: solrcloud 6.0.1 any suggestions for fixing a replica that stubbornly remains down

2016-08-26 Thread Jon Hawkesworth
Many thanks for this too, I am digging into this now.

Jon
-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, August 26, 2016 5:01 PM
To: solr-user
Subject: Re: solrcloud 6.0.1 any suggestions for fixing a replica that 
stubbornly remains down

OK, this is starting to get worrisome. This _looks_ like your index is somehow 
corrupt. Let's back up a bit:

Before you do the ADDREPLICA, is your system healthy? That is, do you have at 
least one leader for each shard that's "active"? You say it looks OK for a 
while, can you successfully query the cluster? Can you successfully query every 
individual replica (use distrib=false and point at specific cores to verify 
this)?

I'd pull out the CheckIndex tool, here's a brief KB on it:
https://support.lucidworks.com/hc/en-us/articles/202091128-How-to-deal-with-Index-Corruption
to verify at least one replica for each shard.

Note the caveat there, the "-fix" option WILL DROP segments it doesn't think 
are OK, so don't use it yet.

So let's assume you can run this successfully on at least one replica for each 
shard. I'd disable all other replicas then and restart my Solrs. The "trick" to 
disabling them is to find the associated "core.properties" file and name it 
something else. Don't worry, this won't remove data or anything and you can get 
it back by going back and renaming it back to "core.properties" and restarting 
your Solrs. This is "core discovery", see:
https://cwiki.apache.org/confluence/display/solr/Defining+core.properties.
You've wisely stopped indexing so the cores won't be out of sync.

The goal here is to have a healthy, leader-only cluster that you totally know 
is OK, you can query it & etc. If you get this far, try the ADDREPLICA again. 
Let's assume that you can do this, then it's really up to you whether to just 
ADDREPLICA and build up your cluster again and just nuke all the old replicas 
or try to recover the old ones.

Mind you, this is largely shooting in the dark since this is very peculiar.
Did you have any weird errors, disk full and the like? Note that looking at 
your current disk is insufficient since you need up to as much free space on 
your disk as your indexes already occupy.

Best,
Erick

On Thu, Aug 25, 2016 at 10:48 PM, Jon Hawkesworth < 
jon.hawkeswo...@medquist.onmicrosoft.com> wrote:

> Thanks for your suggestion.  Here's a chunk of info from the logging 
> in the solr admin page below.  Is there somewhere else I should be looking 
> too?
>
>
>
> It looks to me like it's stuck in a never-ending loop of attempting 
> recovery that fails.
>
>
>
> I don't know if the warnings from IndexFetcher are relevant or not, 
> and if they are, what I can do about them?
>
>
>
> Our system has been feeding 150k docs a day into this cluster for 
> nearly two months now.  I have a backlog of approx. 45 million more 
> documents I need to get loaded but until I have a healthy looking 
> cluster it would be foolish to start loading even more.
>
>
>
> Jon
>
>
>
>
>
> Time (Local)           Level  Core   Logger        Message
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b.nvm did not
> match. expected checksum is 1754812894 and actual is checksum 3450541029.
> expected length is 108 and actual length is 108
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b.fnm did not
> match. expected checksum is 2714900770 and actual is checksum 1393668596.
> expected length is 1265 and actual length is 1265
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b_Lucene50_0.doc
> did not match. expected checksum is 1374818988 and actual is checksum
> 1039421217. expected length is 110 and actual length is 433
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b_Lucene50_0.tim
> did not match. expected checksum is 1001343351 and actual is checksum
> 3395571641. expected length is 2025 and actual length is 7662
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b_Lucene50_0.tip
> did not match. expected checksum is 814607015 and actual is checksum
> 1271109784. expected length is 301 and actual length is 421
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b_Lucene54_0.dvd
> did not match. expected checksum is 875968405 and actual is checksum
> 4024097898. expected length is 96 and actual length is 144
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b.si did not
> match. expected checksum is 2341973651 and actual is checksum 281320882.
> expected length is 535 and actual length is 535
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b.fdx did not
> match. expected checksum is 2874533507 and actual is checksum 3545673052.
> expected length is 84 and actual length is 84
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b.nvd did not
> match. expected checksum is 663721296 and actual is checksum 1107475498.
> expected length is 59 and actual length is 68

RE: solrcloud; collection reload, core Statistics 'optimize now'

2016-08-26 Thread Jon Hawkesworth
Many thanks for this, that's really useful.

We're feeding in documents all the time, so it makes sense that optimizing the 
index would just be overhead.

We just have one collection that we care about at the moment so I can't see us 
using Reload very often either.

Jon


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, August 26, 2016 4:46 PM
To: solr-user
Subject: Re: solrcloud; collection reload, core Statistics 'optimize now'

First of all, please have them pretty much ignore the cores admin page.
That's mostly a remnant of the non-SolrCloud days and largely is used for 
troubleshooting and the like. Most of all, assuming your index changes 
reasonably frequently (i.e. more often than once a day), optimizing is 
unnecessary and should be avoided.

As far as the reload command on a collection, it finds all of the cores that 
make up a collection and issues a core reload on all of them. This:
> reloads the config and schema files
> throws out all the cached data
> opens new searchers

There's no reason to reload unless you've changed the config files and pushed 
them to Zookeeper in the normal course of events. For your ops people, reload 
should be about on par with "restart Solr". Think of reloading a collection as 
bouncing the JVM except only for a single collection.

Best,
Erick

On Fri, Aug 26, 2016 at 12:47 AM, Jon Hawkesworth < 
jon.hawkeswo...@medquist.onmicrosoft.com> wrote:

> Hi,
>
>
>
> I'd like to understand a bit more about some of the admin options in 
> solrcloud admin interface.
>
>
>
> Can anyone point me at something which tells me what hitting Reload for a 
> given collection actually does, whether it is safe to do at any time 
> and/or under what circumstances it should/shouldn't be used?
>
>
>
> Also, poking around the UI I noticed that if you select a core, on the 
> Overview page there is a Statistics panel and in it a button entitled 
> 'optimize now'.  Again I'd like to understand what this does, when it 
> should/shouldn't be used and whether optimising statistics is 
> something that should scheduled.
>
>
>
> The background to this is that I'm trying to provide operations team 
> members with instructions about what, if anything, needs to be done to 
> keep our production clusters in good working order.  Obviously my 
> preference is for things to be automatic where possible but if things 
> can't be automated then I want to be able to provide operations team 
> members clear guidance about what needs to be done and when and why.
>
>
>
> Many thanks,
>
>
>
> Jon
>
>
>
>
>
> *Jon Hawkesworth*
> Software Developer
>
>
>
>
>
> Hanley Road, Malvern, WR13 6NP. UK
>
> O: +44 (0) 1684 312313
>
> *jon.hawkeswo...@mmodal.com  
> www.mmodal.com
> *
>
>
>
>
>
>


Re: How to update from Solr Cloud 5.4.1 to 5.5.1

2016-08-26 Thread D'agostino Victor

Hi Erick

Thanks for your reply.
That's what I thought, but I wasn't sure :)

Do you know in which version the index format changes, and whether I should 
update to a higher version?


And about ZooKeeper: is 3.4.8 fine, or should I update it too?

Have a good day
Victor


-------- Original Message --------
Subject: Re: How to update from Solr Cloud 5.4.1 to 5.5.1
From: Erick Erickson
To: solr-user
Date: 26/08/2016 17:40

First of course I would always back up my indexes, but then I'm paranoid.

But 5.5.1 should be drop-in for 5.4.1. There are no index format changes
you need to worry about. You can install 5.5.1 in a new directory on your
box and start it up with the same SOLR_HOME as your 5.4.1 setup (after
shutting down the Solr 5.4.1 of course) and you should be fine.

Best,
Erick

2016-08-26 4:26 GMT-07:00 D'agostino Victor :

Hi guys

I've got a three-node Solr Cloud 5.4.1 cluster with ZooKeeper 3.4.8 in production
serving 72,000,000 documents.
Document types are easy ones: string, date, text, boolean, and multi-valued
string, but reindexing would take two weeks.

I would like to upgrade Solr to (at least) version 5.5.1 for a bug fix
(https://issues.apache.org/jira/browse/SOLR-8779).

How can I do that ? Is there a safe procedure anywhere ?

Best regards
Victor d'Agostino











Re: solrcloud 6.0.1 any suggestions for fixing a replica that stubbornly remains down

2016-08-26 Thread Erick Erickson
OK, this is starting to get worrisome. This _looks_ like your index is
somehow corrupt. Let's back up a bit:

Before you do the ADDREPLICA, is your system healthy? That is, do you have
at least one leader for each shard that's "active"? You say it looks OK for
a while, can you successfully query the cluster? Can you successfully query
every individual replica (use distrib=false and point at specific cores to
verify this)?

I'd pull out the CheckIndex tool, here's a brief KB on it:
https://support.lucidworks.com/hc/en-us/articles/202091128-How-to-deal-with-Index-Corruption
to verify at least one replica for each shard.

Note the caveat there, the "-fix" option WILL DROP segments it doesn't
think are OK, so don't use it yet.
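
For reference, a typical invocation looks something like this (the jar
version and index path are illustrative; adjust them to your install):

  java -cp server/solr-webapp/webapp/WEB-INF/lib/lucene-core-6.0.1.jar \
       org.apache.lucene.index.CheckIndex \
       /var/solr/data/collection1_shard1_replica1/data/index

Run it without -fix first, and only add -fix once you accept losing the bad
segments.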

So let's assume you can run this successfully on at least one replica for
each shard. I'd disable all other replicas then and restart my Solrs. The
"trick" to disabling them is to find the associated "core.properties" file
and name it something else. Don't worry, this won't remove data or anything
and you can get it back by going back and renaming it back to
"core.properties" and restarting your Solrs. This is "core discovery", see:
https://cwiki.apache.org/confluence/display/solr/Defining+core.properties.
You've wisely stopped indexing so the cores won't be out of sync.
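E.g. (path illustrative):

  mv /var/solr/data/collection1_shard2_replica2/core.properties \
     /var/solr/data/collection1_shard2_replica2/core.properties.disabled

Rename it back and restart Solr to re-enable the core.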

The goal here is to have a healthy, leader-only cluster that you totally
know is OK, you can query it & etc. If you get this far, try the ADDREPLICA
again. Let's assume that you can do this, then it's really up to you
whether to just ADDREPLICA and build up your cluster again and just nuke
all the old replicas or try to recover the old ones.

Mind you, this is largely shooting in the dark since this is very peculiar.
Did you have any weird errors, disk full and the like? Note that looking at
your current disk is insufficient since you need up to as much free space
on your disk as your indexes already occupy.

Best,
Erick

On Thu, Aug 25, 2016 at 10:48 PM, Jon Hawkesworth <
jon.hawkeswo...@medquist.onmicrosoft.com> wrote:

> Thanks for your suggestion.  Here's a chunk of info from the logging in
> the solr admin page below.  Is there somewhere else I should be looking too?
>
>
>
> It looks to me like it's stuck in a never-ending loop of attempting
> recovery that fails.
>
>
>
> I don't know if the warnings from IndexFetcher are relevant or not, and if
> they are, what I can do about them?
>
>
>
> Our system has been feeding 150k docs a day into this cluster for nearly
> two months now.  I have a backlog of approx. 45 million more documents I need
> to get loaded but until I have a healthy looking cluster it would be
> foolish to start loading even more.
>
>
>
> Jon
>
>
>
>
>
> Time (Local)           Level  Core   Logger        Message
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b.nvm did not
> match. expected checksum is 1754812894 and actual is checksum 3450541029.
> expected length is 108 and actual length is 108
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b.fnm did not
> match. expected checksum is 2714900770 and actual is checksum 1393668596.
> expected length is 1265 and actual length is 1265
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b_Lucene50_0.doc
> did not match. expected checksum is 1374818988 and actual is checksum
> 1039421217. expected length is 110 and actual length is 433
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b_Lucene50_0.tim
> did not match. expected checksum is 1001343351 and actual is checksum
> 3395571641. expected length is 2025 and actual length is 7662
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b_Lucene50_0.tip
> did not match. expected checksum is 814607015 and actual is checksum
> 1271109784. expected length is 301 and actual length is 421
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b_Lucene54_0.dvd
> did not match. expected checksum is 875968405 and actual is checksum
> 4024097898. expected length is 96 and actual length is 144
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b.si did not
> match. expected checksum is 2341973651 and actual is checksum 281320882.
> expected length is 535 and actual length is 535
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b.fdx did not
> match. expected checksum is 2874533507 and actual is checksum 3545673052.
> expected length is 84 and actual length is 84
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b.nvd did not
> match. expected checksum is 663721296 and actual is checksum 1107475498.
> expected length is 59 and actual length is 68
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File _iu0b.fdt did not
> match. expected checksum is 2953417110 and actual is checksum 471758721.
> expected length is 1109 and actual length is 7185
>
> 8/26/2016, 6:17:52 AM  WARN   false  IndexFetcher  File segments_h7g8 did
> not match. 

Re: solrcloud; collection reload, core Statistics 'optimize now'

2016-08-26 Thread Erick Erickson
First of all, please have them pretty much ignore the cores admin page.
That's mostly a remnant of the non-SolrCloud days and largely is used for
troubleshooting and the like. Most of all, assuming your index changes
reasonably frequently (i.e. more often than once a day), optimizing is
unnecessary and should be avoided.

As far as the reload command on a collection, it finds all of the cores
that make up a collection and issues a core reload on all of them. This:
> reloads the config and schema files
> throws out all the cached data
> opens new searchers
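
For the record, a reload can be issued via the Collections API, e.g.:

  http://localhost:8983/solr/admin/collections?action=RELOAD&name=<collection>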

There's no reason to reload unless you've changed the config files and
pushed them to Zookeeper in the normal course of events. For your ops
people, reload should be about on par with "restart Solr". Think of
reloading a collection as bouncing the JVM except only for a single
collection.

Best,
Erick

On Fri, Aug 26, 2016 at 12:47 AM, Jon Hawkesworth <
jon.hawkeswo...@medquist.onmicrosoft.com> wrote:

> Hi,
>
>
>
> I'd like to understand a bit more about some of the admin options in
> solrcloud admin interface.
>
>
>
> Can anyone point me at something which tells me what hitting Reload for a
> given collection actually does, whether it is safe to do at any time and/or
> under what circumstances it should/shouldn't be used?
>
>
>
> Also, poking around the UI I noticed that if you select a core, on the
> Overview page there is a Statistics panel and in it a button entitled
> 'optimize now'.  Again I'd like to understand what this does, when it
> should/shouldn't be used and whether optimising statistics is something
> that should be scheduled.
>
>
>
> The background to this is that I'm trying to provide operations team
> members with instructions about what, if anything, needs to be done to keep
> our production clusters in good working order.  Obviously my preference is
> for things to be automatic where possible but if things can't be automated
> then I want to be able to provide operations team members clear guidance
> about what needs to be done and when and why.
>
>
>
> Many thanks,
>
>
>
> Jon
>
>
>
>
>
> *Jon Hawkesworth*
> Software Developer
>
>
>
>
>
> Hanley Road, Malvern, WR13 6NP. UK
>
> O: +44 (0) 1684 312313
>
> *jon.hawkeswo...@mmodal.com  www.mmodal.com
> *
>
>
>
>
>
>


Re: How to update from Solr Cloud 5.4.1 to 5.5.1

2016-08-26 Thread Erick Erickson
First of course I would always back up my indexes, but then I'm paranoid.

But 5.5.1 should be drop-in for 5.4.1. There are no index format changes
you need to worry about. You can install 5.5.1 in a new directory on your
box and start it up with the same SOLR_HOME as your 5.4.1 setup (after
shutting down the Solr 5.4.1 of course) and you should be fine.

Best,
Erick

2016-08-26 4:26 GMT-07:00 D'agostino Victor :
> Hi guys
>
> I've got a three-node Solr Cloud 5.4.1 cluster with ZooKeeper 3.4.8 in production
> serving 72,000,000 documents.
> Document types are easy ones: string, date, text, boolean, and multi-valued
> string, but reindexing would take two weeks.
>
> I would like to upgrade solr to (at least) version 5.5.1 for a bug fix
> (https://issues.apache.org/jira/browse/SOLR-8779).
>
> How can I do that ? Is there a safe procedure anywhere ?
>
> Best regards
> Victor d'Agostino
>
>
>
> 
> 


Re: Inventor-template vs Inventor template - issue with hyphen

2016-08-26 Thread Erick Erickson
This confuses a lot of people. The difference is at the top-level parser,
way before it gets to the analysis chain.

"Inventor-template" comes out of the top-level parser as a single token.
From there it goes through edismax etc., so it's a single token spread
across your fields by edismax. It's only during the field analysis that
it's broken into two tokens.

"Inventor template" is parsed as two distinct tokens and fed to edismax as
two tokens, which are spread across your fields as a pair of words.
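
A rough sketch of what debug=query shows for the two cases (f1/f2 are
placeholder field names):

  q="Inventor-template"  ->  +((f1:"inventor template" | f2:"inventor template"))
  q=Inventor template    ->  +((f1:inventor | f2:inventor) (f1:template | f2:template))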

Best,
Erick




On Fri, Aug 26, 2016 at 8:09 AM, shamik  wrote:

> Anyone ?
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Inventor-template-vs-Inventor-template-issue-
> with-hyphen-tp4293357p4293489.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Inventor-template vs Inventor template - issue with hyphen

2016-08-26 Thread shamik
Anyone ?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Inventor-template-vs-Inventor-template-issue-with-hyphen-tp4293357p4293489.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr.NRTCachingDirectoryFactory

2016-08-26 Thread Rallavagu

Thanks Mikhail.

I am unable to locate the bottleneck so far. Will try jstack and other tools.

On 8/25/16 11:40 PM, Mikhail Khludnev wrote:

Rough sampling under load makes sense as usual. JMC is one of the suitable
tools for this.
Sometimes even just jstack, or looking at SolrAdmin/Threads, is enough.
If only a small ratio of documents is updated and the bottleneck is the
filterCache, you can experiment with segmented filters, which suit NRT
better.
http://blog-archive.griddynamics.com/2014/01/segmented-filter-cache-in-solr.html


On Fri, Aug 26, 2016 at 2:56 AM, Rallavagu  wrote:


Follow up update ...

Set autowarmCount to zero on the caches for NRT, and I could negotiate
latency from 2 min to 5 min :)
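
(Concretely, a cache entry now looks something like this, sizes illustrative:

  <filterCache class="solr.FastLRUCache" size="5000" initialSize="5000" autowarmCount="0"/>
)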

However, I am still seeing high QTimes and wondering where else I can look.
Should I debug the code or run some tools to isolate bottlenecks (disk I/O,
CPU or the query itself)? Looking for some tuning advice. Thanks.


On 7/26/16 9:42 AM, Erick Erickson wrote:


And, I might add, you should look through your old logs
and see how long it takes to open a searcher. Let's
say Shawn's lower bound is what you see, i.e.
it takes a minute each to execute all the autowarming
in filterCache and queryResultCache... So your current
latency is _at least_ 2 minutes between the time something
is indexed and it's available for search just for autowarming.

Plus up to another 2 minutes for your soft commit interval
to expire.

So if your business people haven't noticed a 4 minute
latency yet, tell them they don't know what they're talking
about when they insist on the NRT interval being a few
seconds ;).

Best,
Erick

On Tue, Jul 26, 2016 at 7:20 AM, Rallavagu  wrote:




On 7/26/16 5:46 AM, Shawn Heisey wrote:



On 7/22/2016 10:15 AM, Rallavagu wrote:








 size="2"
 initialSize="2"
 autowarmCount="500"/>




As Erick indicated, these settings are incompatible with Near Real Time
updates.

With those settings, every time you commit and create a new searcher,
Solr will execute up to 1000 queries (potentially 500 for each of the
caches above) before that new searcher will begin returning new results.

I do not know how fast your filter queries execute when they aren't
cached... but even if they only take 100 milliseconds each, that could
take up to a minute for filterCache warming.  If each one takes two
seconds and there are 500 entries in the cache, then autowarming the
filterCache would take nearly 17 minutes. You would also need to wait
for the warming queries on queryResultCache.

The autowarmCount on my filterCache is 4, and warming that cache *still*
sometimes takes ten or more seconds to complete.

If you want true NRT, you need to set all your autowarmCount values to
zero.  The tradeoff with NRT is that your caches are ineffective
immediately after a new searcher is created.



Will look into this and make changes as suggested.



Looking at the "top" screenshot ... you have plenty of memory to cache
the entire index.  Unless your queries are extreme, this is usually
enough for good performance.

One possible problem is that cache warming is taking far longer than
your autoSoftCommit interval, and the server is constantly busy making
thousands of warming queries.  Reducing autowarmCount, possibly to zero,
*might* fix that. I would expect higher CPU load than what your
screenshot shows if this were happening, but it still might be the
problem.



Great point. Thanks for the help.



Thanks,
Shawn









Re: Get results from Solr facets

2016-08-26 Thread Alessandro Benedetti
What about simply using grouping ?

solr/hotels/search?q=*%3A*&wt=json&indent=true&group=true&group.query=query1&group.query=query2&group.limit=3
[1]

is this ok for you ?

[1] https://cwiki.apache.org/confluence/display/solr/Result+Grouping
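
For your concrete case, that would be something like (untested):

  /solr/dict/select?q=*:*&rows=0&group=true&group.query=dict1:"#tiger#"&group.query=dict1:"#lion#"&group.limit=3

Each group then returns its numFound (your count) plus the top 3 matching docs.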

Cheers

On Fri, Aug 26, 2016 at 2:09 PM, Marta (motagirl2) wrote:

> Hi everybody,
> I am pretty new to Solr, so I don't know if what I'd like to achieve is
> actually feasible or not. Currently, I am querying my Solr to retrieve the
> amount of results that match the conditions in several facet queries. For
> example:
>
> localhost:8082/solr/dict/select?q=*:*&rows=0&wt=json&indent=true&facet=true
> &facet.query=dict1:"#tiger#"&facet.query=dict1:"#lion#"
>
> With this kind of query, I am getting the count of Solr docs containing
> "tiger" and the count of those cointaining "lion", in field "dict1":
>
>  {
>   "responseHeader": {
> "status": 0,
> "QTime": 239,
> "params": {
>   "facet.query": [
> "dict1:\"#tiger#\"",
> "dict1:\"#lion#\""
>   ],
>   "q": "*:*",
>   "indent": "true",
>   "rows": "0",
>   "wt": "json",
>   "facet": "true"
> }
>   },
>   "response": {
> "numFound": 37278987,
> "start": 0,
> "docs": [ ]
>   },
>   "facet_counts": {
> "facet_queries": {
>   "dict1:\"#tiger#\"": 6,
>   "dict1:\"#lion#\"": 10
> },
> [...]
>   }
> }
>
> The thing is that now I need to also get some results for each facet,
> alongside the count (for example, three results for "tiger" and three more
> for "lion").
>
> I have read some similar questions (Solr Facetting - Showing First 10
> results and Other, or SOLR - Querying Facets, return N results per Facet),
> but none of their answers seems to work for me, maybe because I am doing
> the facets on all docs (q=*:*).
>
> Any help will be welcome :)
>
>
> (I posted this issue also in Stackoverflow, you can see it here
> http://stackoverflow.com/questions/39164957/get-results-from-solr-facets )
> --
> marta - motagirl
> http://motagirl.net
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


[ANNOUNCE] Apache Solr 6.2.0 released

2016-08-26 Thread Michael McCandless
26 August 2016, Apache Solr 6.2.0 available

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search and analytics, rich document
parsing, geospatial search, extensive REST APIs as well as parallel SQL.
Solr is enterprise grade, secure and highly scalable, providing fault
tolerant distributed search and indexing, and powers the search and
navigation features of many of the world's largest internet sites.

Solr 6.2.0 is available for immediate download at:

 * http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 6.2.0 Release Highlights:

DocValues, streaming, /export, machine learning
* DocValues can now be used with BoolFields
* Date and boolean support added to /export handler
* Add "scoreNodes" streaming graph expression
* Support parallel ETL with the "topic" expression
* Feature selection and logistic regression on text via new streaming
expressions: "features" and "train"

bin/solr script
* Add basic auth support to the bin/solr script
* File operations to/from Zookeeper are now supported

SolrCloud
* New tag 'role' in replica placement rules, e.g. rule=role:!overseer
keeps new replicas off overseer nodes
* CDCR: fall back to whole-index replication when tlogs are insufficient
* New REPLACENODE command to decommission an existing node and replace
it with another new node
* New DELETENODE command to delete all replicas on a node

Security
* Add Kerberos delegation token support
* Support secure impersonation / proxy user for Kerberos authentication

Misc changes
* A large number of regressions were fixed in the new Admin UI
* New boolean comparison function queries comparing numeric arguments:
gt, gte, lt, lte, eq
* Upgraded Extraction module to Apache Tika 1.13.
* Updated to Hadoop 2.7.2

Further details of changes are available in the change log available at:
http://lucene.apache.org/solr/6_2_0/changes/Changes.html

Please report any feedback to the mailing lists (
http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using may
not have replicated the release yet. If that is the case, please try
another mirror. This also applies to Maven access.

Happy searching,

Mike McCandless

http://blog.mikemccandless.com


Re: Integration Solr Cloudera with squirrel-sql

2016-08-26 Thread Joel Bernstein
You may want to try HUE as a front-end to Solr's SQL interface. A video demo
of this is here:

https://vimeo.com/174334432

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Aug 26, 2016 at 8:14 AM, Cahill, Trey 
wrote:

> Hardika,
>
> Parallel SQL and the accompanying JDBC connector only became available in
> Solr 6.x.  Since Cloudera's Solr is only at 4.10, it will not have this
> feature.
>
> Trey
>
> -Original Message-
> From: Hardika Catur S [mailto:hardika.sa...@solusi247.com.INVALID]
> Sent: Friday, August 26, 2016 4:31 AM
> To: solr-user@lucene.apache.org
> Subject: Integration Solr Cloudera with squirrel-sql
>
> Hi,
>
> I want to integrate Apache Solr with squirrel-sql. It works with Solr 6.0 and
> squirrel-sql 3.7.  But I have difficulty integrating Solr Cloudera
> 4.10, because its libs do not match what squirrel-sql needs.
>
> Can Solr Cloudera 4.10 be integrated with squirrel-sql?
> Or is there software to transform a Solr query into an SQL query, similar to
> squirrel-sql?
>
> Please help me to find a solution.
>
> Thanks,
> Hardika CS.
>


Regex search on Solr

2016-08-26 Thread Anil
HI,

I am indexing the text "abc17-logs.tgz/var/log/analyticsd" in Solr, and the
indexed term after all filters is "abc17-logs.tgz/var/log/analyticsd".

What is the regex to search for abc17-logs.tgz/var/log/analyticsd in Solr?
The query and index analyzers follow below.

I tried abc[0-9]+-logs\.tgz/var\/log\/analyticsd and it's not working.
Please advise.
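
For reference, my understanding is that a Lucene regexp query must be
wrapped in forward slashes and match the entire indexed term, e.g. (field
name is a placeholder):

  q=content_txt:/abc[0-9]+-logs\.tgz\/var\/log\/analyticsd/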

<filter class="solr.ASCIIFoldingFilterFactory"/>



Thanks


Re: Default stop word list

2016-08-26 Thread Steven White
But what about the current "default" list that comes with Solr?  How was
that list, for all supported languages, determined?

What I fear is this: when someone puts Solr into production, no one makes a
change to that list, so if the list is not "valid" it will impact
search. But if the list is valid, how was it determined? Just by the
development team of Solr / Lucene, or with input from linguistic experts?

Steve

On Fri, Aug 26, 2016 at 2:25 AM, Srinivasa Meenavalli  wrote:

> Hi Steven,
>
> The stop words of a language are not fixed; there is no single
> universal list of stop words used by all natural language processing tools.
> Ideally, stop words should be defined by search merchandisers based on their
> domain instead of relying on the default.
>
> https://en.wikipedia.org/wiki/Stop_words
>
> You are allowed to add lang/stopwords_<lang>.txt:
>
> <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPossessiveFilterFactory"/>
>     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.PorterStemFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
>     <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPossessiveFilterFactory"/>
>     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.PorterStemFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> Regards
> Srinivas Meenavalli
>
> -Original Message-
> From: Steven White [mailto:swhite4...@gmail.com]
> Sent: Friday, August 26, 2016 4:02 AM
> To: solr-user@lucene.apache.org
> Subject: Default stopword list
>
> Hi everyone,
>
> I'm curious, the current "default" stopword list, for English and other
> languages, how was it determined?  And for English, why is "I" not in the
> stopword list?
>
> Thanks in advanced.
>
> Steve
>


Get results from Solr facets

2016-08-26 Thread Marta (motagirl2)
Hi everybody,
I am pretty new to Solr, so I don't know if what I'd like to achieve is
actually feasible or not. Currently, I am querying my Solr to retrieve the
amount of results that match the conditions in several facet queries. For
example:

localhost:8082/solr/dict/select?q=*:*&rows=0&wt=json&indent=true&facet=true&facet.query=dict1:"#tiger#"&facet.query=dict1:"#lion#"

With this kind of query, I am getting the count of Solr docs containing
"tiger" and the count of those cointaining "lion", in field "dict1":

 {
  "responseHeader": {
"status": 0,
"QTime": 239,
"params": {
  "facet.query": [
"dict1:\"#tiger#\"",
"dict1:\"#lion#\""
  ],
  "q": "*:*",
  "indent": "true",
  "rows": "0",
  "wt": "json",
  "facet": "true"
}
  },
  "response": {
"numFound": 37278987,
"start": 0,
"docs": [ ]
  },
  "facet_counts": {
"facet_queries": {
  "dict1:\"#tiger#\"": 6,
  "dict1:\"#lion#\"": 10
},
[...]
  }
}

The thing is that now I need to also get some results for each facet,
alongside the count (for example, three results for "tiger" and three more
for "lion").

I have read some similar questions (Solr Facetting - Showing First 10
results and Other, or SOLR - Querying Facets, return N results per Facet),
but none of their answers seems to work for me, maybe because I am doing
the facets on all docs (q=*:*).

Any help will be welcome :)


(I posted this issue also in Stackoverflow, you can see it here
http://stackoverflow.com/questions/39164957/get-results-from-solr-facets )
-- 
marta - motagirl
http://motagirl.net


Re: Question about indexing PDFs

2016-08-26 Thread Betsey Benagh
Erick,

I’m not sure of anything.  I’m new to Solr and find the documentation
extremely confusing.  I’ve searched the web and found tutorials/advice,
but they generally refer to older versions of Solr, and refer to
methods/settings/whatever that no longer exist. That’s why I’m asking for
help here.

I looked at the list of fields in the schema browser, and ‘content' is not
there.  If that is not enough to ‘assume’ that the content is not being
indexed, then please enlighten me as to what is.

I inserted the docs in batches by posting them, following the ‘Quick
Start’ tutorial.  It seemed like a safe assumption that the tutorial on
the Solr site would be correct and produce desirable results.

What I really want to do is index the XML versions of the documents which
have been run through another system, but I cannot for the life of me
figure out how to do that.  I’ve tried, but the documentation about XML
makes no sense to me.  I thought indexing the PDF versions would be easier
and more straightforward, but perhaps that is not the case.

Thanks,

betsey

On 8/25/16, 5:39 PM, "Erick Erickson"  wrote:

>That is always a dangerous assumption. Are you sure
>you're searching on the proper field? Are you sure it's indexed? Are
>you sure it's ...
>
>The schema browser I indicated above will give you some
>idea what's actually in the field. You can not only see the
>fields Solr (actually Lucene) see in your index, but you can
>also see what some of the terms are.
>
>Adding debug=query and looking at the parsed query
>will show you what fields are being searched against. The
>most common causes of what you're describing are:
>
>> not searching against the field you think you are. This
>is very easy to do without knowing it.
>
>> not actually having 'indexed="true" set in your schema
>
>> not committing after inserting the doc
>
>Best,
>Erick
>
>On Thu, Aug 25, 2016 at 11:19 AM, Betsey Benagh <
>betsey.ben...@stresearch.com> wrote:
>
>> It looks like the metadata of the PDFs was indexed, but not the content
>> (which is what I was interested in).  Searches on terms I know exist in
>> the content come up empty.
>>
>> On 8/25/16, 2:16 PM, "Betsey Benagh" 
>>wrote:
>>
> >Right, that's where I looked.  No 'content'.  Which is what confused
>>me.
>> >
>> >
>> >On 8/25/16, 1:56 PM, "Erick Erickson"  wrote:
>> >
>> >>when you say "I don't see it in the schema for that collection" are
>>you
>> >>talking schema.xml? managed_schema? Or actual documents in the index?
>> >>Often
>> >>these are defined by dynamic fields and the like in the schema files.
>> >>
>> >>Take a look at the admin UI>>schema browser>>drop down and you'll see
>>all
>> >>the actual fields in your index...
>> >>
>> >>Best,
>> >>Erick
>> >>
>> >>On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh
>> >>> >>> wrote:
>> >>
>> >>> Following the instructions in the quick start guide, I imported a
>>bunch
>> >>>of
>> >>> PDF documents into my Solr 6.0 instance.  As far as I can tell from
>>the
>> >>> documentation, there should be a 'content' field indexing, well, the
>> >>> content, but I don't see it in the schema for that collection.  Is
>> >>>there
>> >>> something obvious I might have missed?
>> >>>
>> >>> Thanks!
>> >>>
>> >>>
>> >
>>
>>



RE: Integration Solr Cloudera with squirrel-sql

2016-08-26 Thread Cahill, Trey
Hardika,

Parallel SQL and the accompanying JDBC connector only became available in Solr 
6.x.  Since Cloudera's Solr is only at 4.10, it will not have this feature.

Trey

-Original Message-
From: Hardika Catur S [mailto:hardika.sa...@solusi247.com.INVALID] 
Sent: Friday, August 26, 2016 4:31 AM
To: solr-user@lucene.apache.org
Subject: Integration Solr Cloudera with squirrel-sql

Hi,

I want integrate apache solr with squirrel-sql. It work in solr 6.0 and 
squirrel-sql 3.7.  But I have difficulty in integrating solr Cloudera 4.10,  
because lib not in accordance with the needs of a squirrel-sql.

Does solr Cloudera 4.10 could be integrated with squirrel-sql??
Or there aresoftware to transform a solr query into sql query similar to 
squirrel ??

Please help me to find a solution.

Thanks,
Hardika CS.


How to update from Solr Cloud 5.4.1 to 5.5.1

2016-08-26 Thread D'agostino Victor

Hi guys

I've got a three-node Solr Cloud 5.4.1 cluster with ZooKeeper 3.4.8 in 
production serving 72,000,000 documents.
Document types are easy ones: string, date, text, boolean, and multi-valued 
string, but reindexing would take two weeks.


I would like to upgrade solr to (at least) version 5.5.1 for a bug fix 
(https://issues.apache.org/jira/browse/SOLR-8779).


How can I do that ? Is there a safe procedure anywhere ?

Best regards
Victor d'Agostino







Re: High load, frequent updates, low latency requirement use case

2016-08-26 Thread Emir Arnautovic

Hi Brent,
Please see inline comments.

Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 26.08.2016 04:51, Brent P wrote:

I'm trying to set up a Solr Cloud cluster to support a system with the
following characteristics:

It will be writing documents at a rate of approximately 500 docs/second,
and running search queries at about the same rate.
The documents are fairly small, with about 10 fields, most of which range
in size from a simple int to a string that holds a UUID. There's a date
field, and then three text fields that typically hold in the range of 350
to 500 chars.
There should be no problems with ingestion on 24 machines. Assuming one 
replica (replication factor 2), that is roughly 40 docs/sec/server. Make 
sure you batch docs when ingesting.
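
For example, send batched JSON updates (collection name and docs are
illustrative):

  curl -X POST 'http://localhost:8983/solr/tenants/update' \
       -H 'Content-Type: application/json' \
       --data-binary '[{"id":"1", ...}, {"id":"2", ...}]'

rather than one HTTP request per document.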

Documents should be available for searching within 30 seconds of being
added.

Make sure you don't do explicit commits and use only auto commit.

We need an average search latency of 50 ms or faster.
What is the number of documents you expect? What type of queries do you 
have? Make sure you use filters wherever possible.


We've been using DataStax Enterprise with decent results, but trying to
determine if we can get more out of the latest version of Solr Cloud, as we
originally chose DSE ~4 years ago *I believe* because its Cassandra-backed
Solr provided redundancy/high availability features that weren't currently
available with straight Solr (not even sure if Solr Cloud was available
then).

We have 24 fairly beefy servers (96 CPU cores, 256 GB RAM, SSDs) for the
task, and I'm trying to figure out the best way to distribute the documents
into collections, cores, and shards.

If I can categorize a document into one of 8 "types", should I create 8
collections? Is that going to provide better performance than putting them
all into one collection and then using a filter query with the type field
when doing a search?
If you don't need to share term frequencies between types and if you 
always search one type, I would split collections. Set up number of 
shards for each collection according to number of docs in it. 
Alternatively, you could use routing by type or, in case you need to 
split to more than one shard, composite hash routing 
(https://sematext.com/blog/2015/09/29/solrcloud-large-tenants-and-routing/).


What are the options/things to consider when deciding on the number of
shards for each collection?
Number of documents and query latency are the main factors in determining 
the number of shards. The smaller the shard, the faster the query; but the 
more shards are queried, the more data there is to merge, so at some point 
the benefits of a distributed query are eaten by the overhead of merging.

  As far as I know, I don't choose the number of
Solr cores; that is just determined based on the replication factor (and
shard count?).

Some of the settings I'm using in my solrconfig that seem important:
<lockType>${solr.lock.type:native}</lockType>

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:3}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

This is not how soon your changes will be visible (openSearcher=false). 
This is how frequently your modifications will be flushed.


<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:1000}</maxTime>
</autoSoftCommit>
This is how soon your changes will be visible; it should be set to 30s (or 
the highest value you can tolerate), since caches are invalidated whenever a 
new searcher is opened.
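
A sketch matching your 30-second visibility requirement (value illustrative):

  <autoSoftCommit>
    <maxTime>30000</maxTime>
  </autoSoftCommit>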


<useColdSearcher>true</useColdSearcher>
<maxWarmingSearchers>8</maxWarmingSearchers>
You can keep these settings, but they are hiding configuration errors. It 
is better to surface those errors and fix the warming configs than to use a 
cold searcher and allow a large number of warming searchers.


I've got the updateLog/transaction log enabled, as I think I read it's
required for Solr Cloud.

Are there any settings I should look at that affect performance
significantly, especially outside of the solrconfig.xml for each collection
(like jetty configs, logging properties, etc)?

How much impact do the <lib> directives in the solrconfig have on
performance? Do they only take effect if I have something configured that
requires them, and therefore if I'm missing one that I need, I'd get an
error if it's not defined?

Any help will be greatly appreciated. Thanks!
-Brent



Integration Solr Cloudera with squirrel-sql

2016-08-26 Thread Hardika Catur S

Hi,

I want to integrate Apache Solr with squirrel-sql. It works with Solr 6.0 and 
squirrel-sql 3.7.  But I have difficulty integrating Solr Cloudera 4.10, 
because its libs do not match what squirrel-sql needs.


Can Solr Cloudera 4.10 be integrated with squirrel-sql?
Or is there software to transform a Solr query into an SQL query, similar to 
squirrel-sql?


Please help me to find a solution.

Thanks,
Hardika CS.


RE: Question about indexing PDFs

2016-08-26 Thread Srinivasa Meenavalli
Hi Betsey,

I ran some examples in Solr 5.5 with the Apache Tika data import handler. 
The content/text was not stored by default.
I can see PDF contents in the documents when stored="true" is enabled.

solr start -e dih



/solr/tika/select?q=*%3A*&wt=json&indent=true
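
The relevant bit is the stored flag on the content field in the schema; a
sketch (field and type names follow the dih example, not verified):

  <field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>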












Regards
Srinivas Meenavalli

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, August 26, 2016 3:09 AM
To: solr-user
Subject: Re: Question about indexing PDFs

That is always a dangerous assumption. Are you sure you're searching on the 
proper field? Are you sure it's indexed? Are you sure it's ...

The schema browser I indicated above will give you some idea what's actually in 
the field. You can not only see the fields Solr (actually Lucene) see in your 
index, but you can also see what some of the terms are.

Adding debug=query and looking at the parsed query will show you what fields 
are being searched against. The most common causes of what you're describing 
are:

> not searching against the field you think you are. This
is very easy to do without knowing it.

> not actually having 'indexed="true" set in your schema

> not committing after inserting the doc

Best,
Erick

On Thu, Aug 25, 2016 at 11:19 AM, Betsey Benagh < betsey.ben...@stresearch.com> 
wrote:

> It looks like the metadata of the PDFs was indexed, but not the
> content (which is what I was interested in).  Searches on terms I know
> exist in the content come up empty.
>
> On 8/25/16, 2:16 PM, "Betsey Benagh"  wrote:
>
> >Right, that's where I looked.  No 'content'.  Which is what confused me.
> >
> >
> >On 8/25/16, 1:56 PM, "Erick Erickson"  wrote:
> >
> >>when you say "I don't see it in the schema for that collection" are
> >>you talking schema.xml? managed_schema? Or actual documents in the index?
> >>Often
> >>these are defined by dynamic fields and the like in the schema files.
> >>
> >>Take a look at the admin UI>>schema browser>>drop down and you'll
> >>see all the actual fields in your index...
> >>
> >>Best,
> >>Erick
> >>
> >>On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh
> >> >>> wrote:
> >>
> >>> Following the instructions in the quick start guide, I imported a
> >>>bunch of  PDF documents into my Solr 6.0 instance.  As far as I can
> >>>tell from the  documentation, there should be a 'content' field
> >>>indexing, well, the  content, but I don't see it in the schema for
> >>>that collection.  Is there  something obvious I might have missed?
> >>>
> >>> Thanks!
> >>>
> >>>
> >
>
>


BasicAuthentication & blockUnknown Issue

2016-08-26 Thread Susheel Kumar
Hello,

I configured Solr for Basic Authentication with blockUnknown:true.  It
works well and no issues are observed in ingesting into & querying the Solr
cloud cluster, but in the Logging tab I see the errors below.

I see SOLR-9188 and SOLR-8326 logged for a similar issue.  Is there any
workaround/fix so that I can avoid these errors in the log, even though the
cluster is working fine?

Thanks,
Susheel

 https://issues.apache.org/jira/browse/SOLR-9188
https://issues.apache.org/jira/browse/SOLR-8326


security.json
===
{
"authentication":{
"blockUnknown": true,
"class":"solr.BasicAuthPlugin",
"credentials":{"solr":"pv7VOv0Riny47Gg7B+dEbI6DZNx/2lP1ZRUkoU1zf+k=
po+GSKNNGfRmlgWPfo8fOphw8EzVP0F+YfKUBrfNzQA="}},
"authorization":{
"class":"solr.RuleBasedAuthorizationPlugin",
"permissions":[{
"name":"security-edit",
"role":"admin"}],
"user-role":{"solr":"admin"}
}}
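
(With blockUnknown:true every request must carry credentials; e.g. a query
looks like this, password being a placeholder:

curl -u solr:yourpassword "http://host1:8983/solr/collection1/select?q=*:*"
)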

Logging Errors
===
Time (Local)            Level  Core   Logger                   Message
8/25/2016, 11:50:45 PM  ERROR  false  PKIAuthenticationPlugin  Exception trying
    to get public key from : http://host1:8983/solr
8/25/2016, 11:50:45 PM  ERROR  false  PKIAuthenticationPlugin  Decryption
    failed , key must be wrong
8/25/2016, 11:50:45 PM  WARN   false  PKIAuthenticationPlugin  Failed to
    decrypt header, trying after refreshing the key
8/25/2016, 11:50:45 PM  ERROR  false  PKIAuthenticationPlugin  Exception trying
    to get public key from : http://host1:8983/solr
8/25/2016, 11:50:45 PM  ERROR  false  PKIAuthenticationPlugin  Decryption
    failed , key must be wrong
8/25/2016, 11:50:45 PM  ERROR  false  PKIAuthenticationPlugin  Could not
    decipher a header host1:8984_solr
    UE4VOAXkNYDmZGmMBe34VCPoWJgVoRU5IByP9TCS7bXP6QLTB37D9R6DNCWbeAzPOekQ3t7rB+8dS7YUxrGg==
    . No principal set


solrcloud; collection reload, core Statistics 'optimize now'

2016-08-26 Thread Jon Hawkesworth
Hi,

I'd like to understand a bit more about some of the admin options in solrcloud 
admin interface.

Can anyone point me at something which tells me what hitting Reload for a given 
collection actually does, whether it is safe to do at any time and/or under 
what circumstances it should/shouldn't be used?

Also, poking around the UI I noticed that if you select a core, on the Overview 
page there is a Statistics panel and in it a button entitled 'optimize now'.  
Again I'd like to understand what this does, when it should/shouldn't be used 
and whether optimising statistics is something that should be scheduled.

The background to this is that I'm trying to provide operations team members 
with instructions about what, if anything, needs to be done to keep our 
production clusters in good working order.  Obviously my preference is for 
things to be automatic where possible but if things can't be automated then I 
want to be able to provide operations team members clear guidance about what 
needs to be done and when and why.

Many thanks,

Jon


Jon Hawkesworth
Software Developer



Hanley Road, Malvern, WR13 6NP. UK
O: +44 (0) 1684 312313
jon.hawkeswo...@mmodal.com
www.mmodal.com




RE: Default stop word list

2016-08-26 Thread Srinivasa Meenavalli
Hi Steven,

The stop words of a language are not fixed; there is no single universal 
list of stop words used by all natural language processing tools.
Ideally, stop words should be defined by search merchandisers based on their 
domain instead of relying on the default.

https://en.wikipedia.org/wiki/Stop_words

You are allowed to add lang/stopwords_<lang>.txt:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

Regards
Srinivas Meenavalli

-Original Message-
From: Steven White [mailto:swhite4...@gmail.com]
Sent: Friday, August 26, 2016 4:02 AM
To: solr-user@lucene.apache.org
Subject: Default stopword list

Hi everyone,

I'm curious, the current "default" stopword list, for English and other 
languages, how was it determined?  And for English, why is "I" not in the 
stopword list?

Thanks in advanced.

Steve


Re: solr.NRTCachingDirectoryFactory

2016-08-26 Thread Mikhail Khludnev
Rough sampling under load makes sense as usual. JMC is one of the suitable
tools for this.
Sometimes even just jstack, or looking at SolrAdmin/Threads, is enough.
If only a small ratio of documents is updated and the bottleneck is the
filterCache, you can experiment with segmented filters, which suit NRT
better.
http://blog-archive.griddynamics.com/2014/01/segmented-filter-cache-in-solr.html


On Fri, Aug 26, 2016 at 2:56 AM, Rallavagu  wrote:

> Follow up update ...
>
> Set autowarmCount to zero on the caches for NRT, and I could negotiate
> latency from 2 min to 5 min :)
>
> However, I am still seeing high QTimes and wondering where else I can look.
> Should I debug the code or run some tools to isolate bottlenecks (disk I/O,
> CPU or the query itself)? Looking for some tuning advice. Thanks.
>
>
> On 7/26/16 9:42 AM, Erick Erickson wrote:
>
>> And, I might add, you should look through your old logs
>> and see how long it takes to open a searcher. Let's
>> say Shawn's lower bound is what you see, i.e.
>> it takes a minute each to execute all the autowarming
>> in filterCache and queryResultCache... So your current
>> latency is _at least_ 2 minutes between the time something
>> is indexed and it's available for search just for autowarming.
>>
>> Plus up to another 2 minutes for your soft commit interval
>> to expire.
>>
>> So if your business people haven't noticed a 4 minute
>> latency yet, tell them they don't know what they're talking
>> about when they insist on the NRT interval being a few
>> seconds ;).
>>
>> Best,
>> Erick
>>
>> On Tue, Jul 26, 2016 at 7:20 AM, Rallavagu  wrote:
>>
>>>
>>>
>>> On 7/26/16 5:46 AM, Shawn Heisey wrote:
>>>

 On 7/22/2016 10:15 AM, Rallavagu wrote:

>
>  <filterCache
>      size="5000"
>      initialSize="5000"
>      autowarmCount="500"/>
>
>  <queryResultCache
>      size="2"
>      initialSize="2"
>      autowarmCount="500"/>
>


 As Erick indicated, these settings are incompatible with Near Real Time
 updates.

 With those settings, every time you commit and create a new searcher,
 Solr will execute up to 1000 queries (potentially 500 for each of the
 caches above) before that new searcher will begin returning new results.

 I do not know how fast your filter queries execute when they aren't
 cached... but even if they only take 100 milliseconds each, that could
 take up to a minute for filterCache warming.  If each one takes two
 seconds and there are 500 entries in the cache, then autowarming the
 filterCache would take nearly 17 minutes. You would also need to wait
 for the warming queries on queryResultCache.

 The autowarmCount on my filterCache is 4, and warming that cache *still*
 sometimes takes ten or more seconds to complete.

 If you want true NRT, you need to set all your autowarmCount values to
 zero.  The tradeoff with NRT is that your caches are ineffective
 immediately after a new searcher is created.

>>>
>>> Will look into this and make changes as suggested.
>>>
>>>
 Looking at the "top" screenshot ... you have plenty of memory to cache
 the entire index.  Unless your queries are extreme, this is usually
 enough for good performance.

 One possible problem is that cache warming is taking far longer than
 your autoSoftCommit interval, and the server is constantly busy making
 thousands of warming queries.  Reducing autowarmCount, possibly to zero,
 *might* fix that. I would expect higher CPU load than what your
 screenshot shows if this were happening, but it still might be the
 problem.

>>>
>>> Great point. Thanks for the help.
>>>
>>>
 Thanks,
 Shawn


>>>


-- 
Sincerely yours
Mikhail Khludnev