TermsComponent/SolrCloud

2012-11-22 Thread Federico Méndez
Does anyone know if the TermsComponent supports distributed search through a
SolrCloud installation? I have a SolrCloud installation that works OK for
regular searches, but TermsComponent returns empty results when using
[collectionName]/terms?terms.fl=collector_name&terms.prefix=jo. The request
handler configuration is:

<!-- A request handler for demonstrating the terms component -->
<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <bool name="distrib">true</bool>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>


Re: SolrCloud and external Zookeeper ensemble

2012-11-22 Thread Luis Cappa Banda
Hello,

I've been dealing with the same question these days. In architectural terms,
it's always better to separate services (Solr and Zookeeper, in this case)
than to keep them on a single instance. However, when we have to deal with
cost issues, all of us are quite limited and we must pick the option with the
best trade-off between architecture, scalability and single points of
failure. As I see it, the options are:

1. Solr servers with Zookeeper embedded.
2. Solr servers with an external Zookeeper.
3. Solr servers with an external Zookeeper ensemble.

Note: as far as I know, the recommended number of Zookeeper services to
avoid single points of failure is ZkNum = 2 * NumShards - 1. If you have


The best option is the third one. Reasons:

1. If one of your Solr servers goes down, the Zookeeper services are still up.
2. If one of your Zookeeper services goes down, the Solr servers and the rest
of the Zookeeper services are still up.
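For what it's worth, ZooKeeper's own availability rule is about keeping a majority quorum: an ensemble of N servers stays available as long as more than half of them are up, so it tolerates floor((N - 1) / 2) failures, and surviving F failures requires 2F + 1 servers. A throwaway sketch of that arithmetic:

```python
def tolerated_failures(ensemble_size: int) -> int:
    """A ZooKeeper ensemble stays available while a majority of its
    servers are up, so it tolerates floor((N - 1) / 2) failures."""
    return (ensemble_size - 1) // 2


def ensemble_size_for(failures: int) -> int:
    """To survive F simultaneous server failures you need 2F + 1 servers."""
    return 2 * failures + 1


if __name__ == "__main__":
    for n in (1, 3, 5, 7):
        print(f"{n} ZK server(s) tolerate {tolerated_failures(n)} failure(s)")
```

This is why 3 is the usual minimum: a 2-node ensemble tolerates no failures at all, the same as a single node.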

Considering that option, we have two ways to implement it in production:

1. Each service (Solr and Zookeeper) on separate machines. Let's imagine
that we have 2 shards for a given collection, so we need at least 4 Solr
servers to complete the leader-replica configuration. The best option is to
deploy them on four Amazon instances, one per server. We also need at least
3 Zookeeper services in a Zookeeper ensemble configuration. The optimal way
to install them is on separate machines (a micro instance will be fine for
Zookeeper), so we would have 7 Amazon instances. The reason is that if one
machine goes down (whether a Solr or a Zookeeper one) the other services stay
up and your production environment will be safe. For me this is the best
case, but it's also the most expensive one, so in my case it is impossible
to realize.

2. As we need at least 4 Solr servers and 3 Zookeeper services up, I would
install three Amazon instances with Solr and Zookeeper, and one with only
Solr. So we'll have 3 complete Amazon instances (Solr + Zookeeper) and 1
Amazon instance with only Solr. If one of them goes down, the production
environment will be safe. This architecture is not the best one, as I told
you, but I think it is optimal in terms of robustness, single points of
failure and cost.


It would be a pleasure to hear suggestions from other people who have dealt
with this kind of issue.

Regards,


- Luis Cappa.


2012/11/21 Marcin Rzewucki mrzewu...@gmail.com

 Yes, I meant the same (not -zkRun). However, I was asking if it is safe to
 have zookeeper and solr processes running on the same node or better on
 different machines?

 On 21 November 2012 21:18, Rafał Kuć r@solr.pl wrote:

  Hello!
 
  As I told I wouldn't use the Zookeeper that is embedded into Solr, but
  rather setup a standalone one.
 
  --
  Regards,
   Rafał Kuć
   Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
 ElasticSearch
 
    First of all: thank you for your answers. Yes, I meant side-by-side
    configuration. I think the worst case for ZKs here is to lose two of them.
    However, I'm going to use 4 availability zones in the same region, so at
    least this will reduce the risk of losing both of them at the same time.
    Regards.
 
   On 21 November 2012 17:06, Rafał Kuć r@solr.pl wrote:
 
   Hello!
  
    Zookeeper by itself is not demanding, but if something happens to the
    nodes that have Solr on them, you'll lose ZooKeeper too if they are
    installed side by side. However, if you have 4 Solr nodes and
    3 ZK instances you can get them running side by side.
  
   --
   Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
  ElasticSearch
  
Separate is generally nice because then you can restart Solr nodes
without consideration for ZooKeeper.
  
Performance-wise, I doubt it's a big deal either way.
  
- Mark
  
On Nov 21, 2012, at 8:54 AM, Marcin Rzewucki mrzewu...@gmail.com
   wrote:
  
Hi,
   
I have 4 solr collections, 2-3mn documents per collection, up to
 100K
updates per collection daily (roughly). I'm going to create
  SolrCloud4x
   on
Amazon's m1.large instances (7GB mem,2x2.4GHz cpu each). The
  question is
what about zookeeper? It's going to be external ensemble, but is it
   better
to use same nodes as solr or dedicated micro instances? Zookeeper
  does
   not
seem to be resources demanding process, but what would be better in
  this
case ? To keep it inside of solrcloud or separately (micro
 instances
   seem
to be enough here) ?
   
Thanks in advance.
Regards.
  
  
 
 




-- 

- Luis Cappa


Re: SolrCloud and external Zookeeper ensemble

2012-11-22 Thread Marcin Rzewucki
Yes, this is exactly my case. I prefer the 3rd option too. As I have 2 more
instances to be used for my purposes (SolrCloud4x + 2 more instances for
loading), it will be easier to configure the zookeeper ensemble (I can use
those 2 additional machines + 1 from the SolrCloud) and avoid purchasing and
maintaining more instances.

Re: TermsComponent/SolrCloud

2012-11-22 Thread Tomás Fernández Löbbe
Hi Federico, it should work. Make sure you set the shards.qt parameter
too (in your case, it should be shards.qt=/terms)




Re: How to use eDismax query parser on a non tokenized field

2012-11-22 Thread Tomás Fernández Löbbe
You can either escape the whitespace with a backslash or search as a phrase.

fieldNonTokenized:foo\ bar
...or...
fieldNonTokenized:"foo bar"
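To make the two options concrete, here is a small sketch of how you might build those query strings from client code. The helper names are mine, not from any Solr client library:

```python
def escape_spaces(value: str) -> str:
    """Escape each space with a backslash so the query parser
    treats the value as a single token rather than separate words."""
    return value.replace(" ", "\\ ")


def phrase(value: str) -> str:
    """Quote the value so it is searched as a phrase, escaping any
    embedded double quotes."""
    return '"' + value.replace('"', '\\"') + '"'


print("fieldNonTokenized:" + escape_spaces("foo bar"))  # fieldNonTokenized:foo\ bar
print("fieldNonTokenized:" + phrase("foo bar"))         # fieldNonTokenized:"foo bar"
```

Either form keeps edismax from splitting the value on whitespace before it reaches the non-tokenized field.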


On Thu, Nov 22, 2012 at 9:08 AM, Varun Thacker
varunthacker1...@gmail.comwrote:

 I have indexed documents using a fieldType which does not break the word
 up. I confirmed this by looking up the index in luke. I can see that the
 words haven't been tokenized.

 I use a search handler which uses the edismax query parser for searching.
 According to the wiki
 (http://wiki.apache.org/solr/ExtendedDisMax#Query_Structure), Extended DisMax
 breaks up the query string into words before searching. Thus no results
 show up.

 Example for q=foo bar:
 In the index : fieldNonTokenized:foo bar

 And when searching this is the final query getting made
 is: ((fieldNonTokenized:foo:foo)~0.01 (fieldNonTokenized:foo:bar)~0.01)~1

 Thus no document matches and returns no result. I can understand why this
 is happening. Is there any way where I can say that the query string should
 not be broken up into words?

 --


 Regards,
 Varun Thacker
 http://www.vthacker.in/



Re: TermsComponent/SolrCloud

2012-11-22 Thread Federico Méndez
Thanks Tomas, your suggestion worked!!

<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <bool name="distrib">true</bool>
    <str name="shards.qt">/terms</str>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>




Re: Suggester for numbers

2012-11-22 Thread Gustav
Hello Illu,
Here you go:

<field name="autocomplete" type="text_auto" indexed="true" stored="true"
       multiValued="true"/>

<fieldType class="solr.TextField" name="text_auto">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


<field name="conteiner" type="text_general_like" indexed="true"
       stored="true" multiValued="false"/>

<fieldType class="solr.TextField" name="text_general_like"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" maxGramSize="25"
            minGramSize="1"/>
  </analyzer>

  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Suggester-for-numbers-tp4021672p4021828.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud and external file fields

2012-11-22 Thread Martin Koch
Mikhail

To avoid freezes we deployed the patches that are now on the 4.1 trunk (bug
3985). But this wasn't good enough, because SOLR would still take very long
to restart when that was necessary.

I don't see how we could throw more hardware at the problem without making
it worse, really - the only solution here would be *fewer* shards, not
more.

IMO it would be ideal if the lucene/solr community could come up with a
good way of updating fields in a document without reindexing. This could be
by linking to some external data store, or in the lucene/solr internals. If
it would make things easier, a good first step would be to have dynamically
updateable numerical fields only.
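For readers following along: as far as I remember, an external file field is declared in schema.xml and its per-document values live in a plain "key=value" text file in the index data directory, which Solr re-reads on reload rather than requiring reindexing. A sketch from memory (attribute names should be checked against the ExternalFileField documentation; the field and file names here are invented for illustration):

```xml
<!-- schema.xml: values for this field come from a file, not from the index -->
<fieldType name="extPopularity" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="float"/>
<field name="popularity" type="extPopularity"/>
```

The data file would then be named something like external_popularity in the index data directory and contain one line per document, e.g. doc1=4.5.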

/Martin

On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Martin,

 I don't think solrconfig.xml will shed any light on it. I've just found what
 I didn't get in your setup - how cores are explicitly assigned to a
 collection. Now I've understood most of the details after all!
 The ball is on your side; let us know whether you have managed to get your
 cores to commit one by one to avoid the freeze, or whether you could
 eliminate the pauses by allocating more hardware.
 Thanks in advance!


 On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch m...@issuu.com wrote:

  Mikhail,
 
  PSB
 
  On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev 
  mkhlud...@griddynamics.com wrote:
 
   On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch m...@issuu.com wrote:
  
   
I wasn't aware until now that it is possible to send a commit to one
  core
only. What we observed was the effect of curl
localhost:8080/solr/update?commit=true but perhaps we should
 experiment
with solr/coreN/update?commit=true. A quick trial run seems to
 indicate
that a commit to a single core causes commits on all cores.
   
   You should see something like this in the log:
   ... SolrCmdDistributor  Distrib commit to: ...
  
   Yup, a commit towards a single core results in a commit on all cores.
 
 
   
   
Perhaps I should clarify that we are using SOLR as a black box; we do
  not
touch the code at all - we only install the distribution WAR file and
proceed from there.
   
    I still don't understand how you deploy/launch Solr. How many jettys do
    you start? Do you use -DzkRun -DzkHost -DnumShards=2, or do you specify
    the shards= param on every request and distribute updates yourself? What
    collections do you create and with which settings?
  
    We let SOLR do the sharding, using one collection with 16 SOLR cores
   holding one shard each. We launch only one instance of jetty with the
   following arguments:
 
  -DnumShards=16
  -DzkHost=zookeeperhost:port
  -Xmx10G
  -Xms10G
  -Xmn2G
  -server
 
  Would you like to see the solrconfig.xml?
 
  /Martin
 
 
   
   
 Also from my POV such deployments should start at least from *16*
  4-way
 vboxes, it's more expensive, but should be much better available
  during
 cpu-consuming operations.

   
Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts
   with
16 cores? Or am I misunderstanding something :) ?
   
   I prefer to start from 16 hosts with 4 cores each.
  
  
   
   
 Other details, if you use single jetty for all of them, are you
 sure
   that
 jetty's threadpool doesn't limit requests? is it large enough?
 You have 60G and set -Xmx=10G. are you sure that total size of
 cores
index
 directories is less than 45G?

 The total index size is 230 GB, so it won't fit in RAM, but we're using
 an SSD disk to minimize disk access time. We have tried putting the EFF
 onto a RAM disk, but this didn't have a measurable effect.
   
Thanks,
/Martin
   
   
 Thanks


 On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com
 wrote:

  Mikhail
 
  PSB
 
  On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev 
  mkhlud...@griddynamics.com wrote:
 
   Martin,
  
   Please find additional question from me below.
  
   Simone,
  
   I'm sorry for hijacking your thread. The only what I've heard
  about
it
 at
   recent ApacheCon sessions is that Zookeeper is supposed to
   replicate
  those
   files as configs under solr home. And I'm really looking
 forward
  to
 know
   how it works with huge files in production.
  
   Thank You, Guys!
  
   20.11.2012 18:06 пользователь Martin Koch m...@issuu.com
   написал:
   
Hi Mikhail
   
Please see answers below.
   
On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:
   
 Martin,

 Thank you for telling your own war-story. It's really
  useful
for
 community.
 The first question might seems not really conscious, but
  would
you
  tell
   me
 what blocks searching during EFF reload, when it's
 triggered
  by

Performance improvement for solr faceting on large index

2012-11-22 Thread Pravin Agrawal
Hi All,

We are using solr 3.4 with following schema fields.

schema.xml---

<fieldType name="autosuggest_text" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="5"
            outputUnigrams="true"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^([0-9. ])*$" replacement="" replace="all"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="id" type="string" stored="true" indexed="true"/>
<field name="autoSuggestContent" type="autosuggest_text" stored="true"
       indexed="true" multiValued="true"/>
<copyField source="content" dest="autoSuggestContent"/>
<copyField source="original_title" dest="autoSuggestContent"/>

<field name="content" type="text" stored="true" indexed="true"/>
<field name="original_title" type="text" stored="true" indexed="true"/>
<field name="site" type="site" stored="false" indexed="true"/>
/schema.xml---

The index for the above schema is distributed across two Solr shards, each
holding about 1.2 million documents and about 195GB on disk.

We want to retrieve (site, autoSuggestContent term, frequency of the term)
information from our main Solr index. The site is a field in each document and
contains the name of the site to which that document belongs. The terms are
retrieved from the multivalued field autoSuggestContent, which is created
using shingles from the content and title of the web page.

As of now, we are using a facet query to retrieve (term, frequency of term)
for each site. Below is a sample query (you may ignore the initial part of
the query):

http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index
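For reference, a request like this can be assembled programmatically, which makes long parameter lists easier to read and keeps the separators and URL escaping correct. A small sketch, using the host and parameters from the example above:

```python
from urllib.parse import urlencode

# The facet request from the example, as a parameter dict.
params = {
    "indent": "on",
    "q": "*:*",
    "fq": "site:www.abc.com",
    "start": 0,
    "rows": 0,
    "fl": "id",
    "qt": "dismax",
    "facet": "true",
    "facet.field": "autoSuggestContent",
    "facet.mincount": 25,
    "facet.limit": -1,
    "facet.method": "enum",
    "facet.sort": "index",
}

# urlencode joins the pairs with '&' and percent-encodes reserved characters.
url = "http://localhost:8080/solr/select?" + urlencode(params)
print(url)
```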

The problem is that with the increase in index size, this method has started
taking a huge amount of time. It used to take 7 minutes per site with an
index size of 0.4 million docs, but takes around 60-90 minutes with an index
size of 2.5 million. At this speed, it will take around 5-6 days to process
all 1500 sites. We also expect the index to grow with more documents and more
sites, and the time to get the above information will increase further.

Please let us know if there is a better way to extract (site, term,
frequency) information compared to the current method.

Thanks,
Pravin Agrawal




DISCLAIMER
==
This e-mail may contain privileged and confidential information which is the 
property of Persistent Systems Ltd. It is intended only for the use of the 
individual or entity to which it is addressed. If you are not the intended 
recipient, you are not authorized to read, retain, copy, print, distribute or 
use this message. If you have received this communication in error, please 
notify the sender and delete all copies of this message. Persistent Systems 
Ltd. does not accept any liability for virus infected mails.


Re: Performance improvement for solr faceting on large index

2012-11-22 Thread Yuval Dotan
You could always try the fc facet method, and maybe increase the filterCache
size.



Re: SolrCloud and external Zookeeper ensemble

2012-11-22 Thread Jack Krupansky
That's a tradeoff for you to make based on your own requirements, but the 
point is that it is LESS SAFE to run zookeeper on the same machine as a Solr 
instance.


Also keep in mind that the goal is to have at least THREE zookeeper
instances running at any moment, so if you run zookeeper on the same machine
as a Solr instance, you will need more than three zookeepers. Figure three
plus the MAXIMUM number of Solr nodes that you expect could be down
simultaneously.


Also keep in mind that SolrCloud is about scaling, but the intention is NOT
to scale the zookeeper ensemble linearly with the number of Solr nodes. That
means you would have to deal with the messiness of sometimes running
zookeeper with Solr and sometimes not. So, unless you are running a very
small SolrCloud cluster, you are much better off keeping zookeeper off your
Solr machines.


The intent is that there will be a relatively small ensemble of zookeepers 
that service a large army or armada of Solr nodes.


-- Jack Krupansky









Re: From Solr3.1 to SolrCloud

2012-11-22 Thread roySolr
I run a separate Zookeeper instance right now. Works great, nodes are visible
in admin.

Two more questions:

- I change my synonyms.txt on a Solr node. How can I get zookeeper in sync
and the other Solr nodes updated without a restart?

- I read some more about the zookeeper ensemble. When I need to run 4 Solr
nodes (replicas), I need 3 zookeepers in the ensemble (a majority must stay
alive). When zookeeper and Solr are separated, it will take 7 servers to get
it live. In the past we only needed 4 servers. Are there other options,
because the costs will grow? 3 zookeeper servers sounds like overkill.

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/From-Solr3-1-to-SolrCloud-tp4021536p4021849.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Partial results with not enough hits

2012-11-22 Thread Otis Gospodnetic
Hi,

Maybe your goal should be to make your queries faster instead of fighting
with timeouts which are known not to work well.

What is your hardware like?
How about your queries?
What do you see in debugQuery=true output?

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm
On Nov 21, 2012 6:04 PM, Aleksey Vorona avor...@ea.com wrote:

 In all of my queries I have timeAllowed parameter. My application is ready
 for partial results. However, whenever Solr returns partial result it is a
 very bad result.

 For example, I have a test query and here its execution log with the
 strict time allowed:
 WARNING: Query: omitted; Elapsed time: 120. Exceeded allowed search
 time: 100 ms.
 INFO: [] webapp=/solr path=/select params={omitted&timeAllowed=100}
 hits=189 status=0 QTime=119
 Here it is without such a strict limitation:
 INFO: [] webapp=/solr path=/select params={omittedtimeAllowed=**1}
 hits=582 status=0 QTime=124

 The total execution time is different by mere 5 ms, but the partial result
 has only about 1/3 of the full result.

 Is it the expected behaviour? Does that mean I can never rely on the
 partial results?

 I added timeAllowed to protect from too expensive wide queries, but I
 still want to return something relevant to the user. This query returned
 30% of the full result, but I have other queries in the log where partial
 result is just empty. Am I doing something wrong?

 P.S. I am using Solr 3.6.1, index size is 3Gb and easily fits in memory.
 Load Average on the Solr box is very low.

 -- Aleksey



Re: SolrCloud and external Zookeeper ensemble

2012-11-22 Thread Otis Gospodnetic
If your Solr instances don't max out your ec2 instances you should be fine.
But maybe even micro instances will suffice. Or 1 on demand and 2 spot
ones. If cost is the concern, that is.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm
On Nov 21, 2012 5:07 PM, Marcin Rzewucki mrzewu...@gmail.com wrote:

 Yes, I meant the same (not -zkRun). However, I was asking if it is safe to
 have zookeeper and solr processes running on the same node or better on
 different machines?

 On 21 November 2012 21:18, Rafał Kuć r@solr.pl wrote:

  Hello!
 
  As I told I wouldn't use the Zookeeper that is embedded into Solr, but
  rather setup a standalone one.
 
  --
  Regards,
   Rafał Kuć
   Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
 ElasticSearch
 
   First of all: thank you for your answers. Yes, I meant side by side
   configuration. I think the worst case for ZKs here is to lose two of
  them.
   However, I'm going to use 4 availability zones in same region so at
 least
   this will reduce the risk of losing both of them at the same time.
   Regards.
 
   On 21 November 2012 17:06, Rafał Kuć r@solr.pl wrote:
 
   Hello!
  
   Zookeeper by itself is not demanding, but if something happens to your
   nodes that have Solr on it, you'll lose ZooKeeper too if you have
   them installed side by side. However if you will have 4 Solr nodes and
   3 ZK instances you can get them running side by side.
  
   --
   Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
  ElasticSearch
  
Separate is generally nice because then you can restart Solr nodes
without consideration for ZooKeeper.
  
Performance-wise, I doubt it's a big deal either way.
  
- Mark
  
On Nov 21, 2012, at 8:54 AM, Marcin Rzewucki mrzewu...@gmail.com
   wrote:
  
Hi,
   
I have 4 solr collections, 2-3mn documents per collection, up to
 100K
updates per collection daily (roughly). I'm going to create
  SolrCloud4x
   on
Amazon's m1.large instances (7GB mem,2x2.4GHz cpu each). The
  question is
what about zookeeper? It's going to be external ensemble, but is it
   better
to use same nodes as solr or dedicated micro instances? Zookeeper
  does
   not
seem to be resources demanding process, but what would be better in
  this
case ? To keep it inside of solrcloud or separately (micro
 instances
   seem
to be enough here) ?
   
Thanks in advance.
Regards.
  
  
 
 



Re: SolrCloud and exernal file fields

2012-11-22 Thread Yonik Seeley
On Tue, Nov 20, 2012 at 4:16 AM, Martin Koch m...@issuu.com wrote:
 around 7M documents in the index; each document has a 45 character ID.

7M documents isn't that large.  Is there a reason why you need so many
shards (16 in your case) on a single box?

-Yonik
http://lucidworks.com


Re: SolrCloud and external Zookeeper ensemble

2012-11-22 Thread Jack Krupansky
That is an interesting point - what size of instance is needed for a 
zookeeper. Can it run well in a micro?


Another issue I wanted to raise is that maybe questions, advice, and 
guidelines should be relative to the shirt size of your cluster - small, 
medium, or large. SolrCloud is clearly more optimized for medium to large 
clusters. Sure, you can use it for small clusters, but then some of the 
features and guidance do seem like overkill. Nonetheless, I would hate to 
see anybody take the compromised guidance for very small clusters (3 or 4 
machines) and apply it to even medium-size clusters (10 to 20 machines), let 
alone large clusters (dozens to 100 or more machines).


-- Jack Krupansky

-Original Message- 
From: Otis Gospodnetic

Sent: Thursday, November 22, 2012 9:37 AM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud and external Zookeeper ensemble

If your Solr instances don't max out your ec2 instances you should be fine.
But maybe even micro instances will suffice. Or 1 on demand and 2 spot
ones. If cost is the concern, that is.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm
On Nov 21, 2012 5:07 PM, Marcin Rzewucki mrzewu...@gmail.com wrote:


Yes, I meant the same (not -zkRun). However, I was asking if it is safe to
have zookeeper and solr processes running on the same node or better on
different machines?

On 21 November 2012 21:18, Rafał Kuć r@solr.pl wrote:

 Hello!

 As I told I wouldn't use the Zookeeper that is embedded into Solr, but
 rather setup a standalone one.

 --
 Regards,
  Rafał Kuć
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

  First of all: thank you for your answers. Yes, I meant side by side
  configuration. I think the worst case for ZKs here is to lose two of
 them.
  However, I'm going to use 4 availability zones in same region so at
least
  this will reduce the risk of losing both of them at the same time.
  Regards.

  On 21 November 2012 17:06, Rafał Kuć r@solr.pl wrote:

  Hello!
 
  Zookeeper by itself is not demanding, but if something happens to 
  your

  nodes that have Solr on it, you'll lose ZooKeeper too if you have
  them installed side by side. However if you will have 4 Solr nodes 
  and

  3 ZK instances you can get them running side by side.
 
  --
  Regards,
   Rafał Kuć
   Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
 ElasticSearch
 
   Separate is generally nice because then you can restart Solr nodes
   without consideration for ZooKeeper.
 
   Performance-wise, I doubt it's a big deal either way.
 
   - Mark
 
   On Nov 21, 2012, at 8:54 AM, Marcin Rzewucki mrzewu...@gmail.com
  wrote:
 
   Hi,
  
   I have 4 solr collections, 2-3mn documents per collection, up to
100K
   updates per collection daily (roughly). I'm going to create
 SolrCloud4x
  on
   Amazon's m1.large instances (7GB mem,2x2.4GHz cpu each). The
 question is
   what about zookeeper? It's going to be external ensemble, but is 
   it

  better
   to use same nodes as solr or dedicated micro instances? Zookeeper
 does
  not
   seem to be resources demanding process, but what would be better 
   in

 this
   case ? To keep it inside of solrcloud or separately (micro
instances
  seem
   to be enough here) ?
  
   Thanks in advance.
   Regards.
 
 







Re: SolrCloud and external Zookeeper ensemble

2012-11-22 Thread Shawn Heisey

On 11/22/2012 2:18 AM, Luis Cappa Banda wrote:

I've been dealing with the same question these days. In architectural terms,
it's always better to separate services (Solr and Zookeeper, in this case)
than to keep them in a single instance. However, when we have to deal
with cost issues, all of us are quite limited and we must choose the best
option in terms of architecture, scalability, and single points of failure.
As I see it, the options are:


*1.* Solr servers with Zookeeper embedded.
*2.* Solr servers with external Zookeeper.
*3.* Solr servers with an external Zookeeper ensemble.


I've never used SolrCloud, so this is all speculation based on what I've 
been reading.  That has been mostly on this list, but also on dev@l.o 
and the IRC channel.


I have a four-node Solr 3.5 deployment with about 80 million documents 
(130GB) in the distributed index.  I think of my installation as small.  
Others might disagree with my opinion, but I know there are a lot of 
indexes out there that make mine look tiny.


If I needed to set a similarly small setup with SolrCloud on four Solr 
servers, what I would pitch to management would be one extra machine 
(cheap, 1U, low-end processor, etc) to act as a standalone zookeeper 
node.  For the other two zookeper instances, I would run standalone 
zookeeper (separate JVM from Solr) on two of the Solr servers.  I might 
ask for a small boost in RAM and/or CPU on the two servers that serve 
double-duty.  I would not run zookeeper in the same JVM as Solr.


With a little bit of growth in the cluster, I would ask for a second 
standalone zookeeper node, pulling zookeeper off one of the Solr 
servers.  If it continued to grow, then I would ask for the third.  I 
would leave blank spots in the rack for those standalone servers.


Thanks,
Shawn



Re: How to get a list of servers per collection in sorlcloud using java api?

2012-11-22 Thread Luis Cappa Banda
Hello, Joe.

Try something like this using SolrJ library:

String[] endpoints = ...;         // your Solr server endpoints, e.g. http://localhost:8080/solr/core1
String zookeeperEndpoints = ...;  // your Zookeeper endpoint(s), e.g. localhost:9000
String collectionName = ...;      // your collection name, e.g. core1

LBHttpSolrServer lbSolrServer = new LBHttpSolrServer(endpoints);
CloudSolrServer cloudSolrServer = new CloudSolrServer(zookeeperEndpoints, lbSolrServer);
cloudSolrServer.setDefaultCollection(collectionName);


You have now created a CloudSolrServer instance which can manage Solr
server operations: add a new document, delete, update, etc.
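A hedged usage sketch of that last point (SolrJ 4.x; the field names here are made up for illustration): once built, the CloudSolrServer is used like any other SolrServer.

```java
// Sketch only: assumes the cloudSolrServer built above and a schema that
// has "id" and "collector_name" fields (hypothetical names).
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc-1");
doc.addField("collector_name", "john");
cloudSolrServer.add(doc);  // the cloud-aware client routes the update
cloudSolrServer.commit();  // make the document visible to searches
```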

Regards,


 - Luis Cappa.

2012/11/22 joe.cohe...@gmail.com joe.cohe...@gmail.com

 I want to write a function that will go through all the servers that store
 a specific collection and perform a task on it, say a core RELOAD.
 How can I get a list of all solr servers/urls that run a specific
 collection?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-get-a-list-of-servers-per-collection-in-sorlcloud-using-java-api-tp4021863.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 

- Luis Cappa


Re: From Solr3.1 to SolrCloud

2012-11-22 Thread Tomás Fernández Löbbe

 - I change my synonyms.txt on a solr node. How can i get zookeeper in sync
 and the other solr nodes without restart?


Well, you can upload the whole collection configuration again with zkCli
(included in the cloud-scripts directory). See
http://wiki.apache.org/solr/SolrCloud#Getting_your_Configuration_Files_into_ZooKeeper
Another option, if you only want to upload one file, is to write something
that communicates with ZK through any of its APIs. I did this before Solr's
zkClient was committed and it is quite simple. Then, you can reload the
collection, which is like reloading all the cores for the collection in the
different nodes. See
http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API


 - I read something more about the Zookeeper ensemble. When I run with 4
 Solr nodes (replicas), do I need 3 Zookeepers in the ensemble (50% live)? When
 Zookeeper and Solr are separated it will take 7 servers to get it live. In
 the past we only needed 4 servers. Are there other options, because the
 costs will grow? 3 Zookeeper servers sounds like overkill.


The number of Solr instances has nothing to do with the number of ZK
instances that you need to run. You can effectively run with only one ZK
instance; the problem is that if that instance dies, your whole cluster
will go down. So you can increase the number of ZK instances. When you
create your Zookeeper ensemble, you declare its size (the number of ZK
instances it will contain). When you run that ensemble, Zookeeper requires
that N/2+1 of the servers are connected. This means that if you want your
ZK ensemble to survive one instance dying, you'll need at least 3 ZK
instances (if you have 2 and one dies, you still need 2 to work, so it
won't).
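The N/2+1 arithmetic above can be sketched in a few lines (plain Java, nothing Solr-specific):

```java
// Quorum math for a ZooKeeper ensemble, as described above.
public class ZkQuorumMath {
    // Minimum number of servers that must be connected for the ensemble to work.
    static int quorum(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // How many servers can die while the ensemble stays available.
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - quorum(ensembleSize);
    }

    public static void main(String[] args) {
        System.out.println(quorum(3));            // 2
        System.out.println(toleratedFailures(3)); // 1: a 3-node ensemble survives one death
        System.out.println(toleratedFailures(2)); // 0: two nodes are no safer than one
    }
}
```

This is why 3 is the smallest useful ensemble size: it is the first value where toleratedFailures is non-zero.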

There has been some discussions these days in the list about this, but if
the number of physical servers is too much for you, you could run on the
same physical machine an instance of Solr and ZK.

Tomás


Re: Solr Cloud Zookeeper Namespace

2012-11-22 Thread Tomás Fernández Löbbe
You could use Zookeeper's chroot:
http://zookeeper.apache.org/doc/r3.2.2/zookeeperAdmin.html#sc_bestPractices

You can use chroot in Solr by specifying it in the zkHost parameter, for
example -DzkHost=localhost:2181/namespace1

In order for this to work, you need to first create the initial path (in
the example above, you should create /namespace1 in zookeeper before
starting Solr)
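For instance, with a three-node ensemble (hypothetical hosts), the pieces fit together like this:

```shell
# Create the chroot once with the ZooKeeper CLI before starting Solr
# (hypothetical hosts; any ZK client works for this one-time step):
#   bin/zkCli.sh -server zk1:2181
#   create /namespace1 ""
# Then point every Solr node of that cluster at the suffixed connect string:
#   -DzkHost=zk1:2181,zk2:2181,zk3:2181/namespace1
```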

Tomás


On Thu, Nov 22, 2012 at 2:08 PM, Sandopolus sandopo...@gmail.com wrote:

 Is it possible with Solr Cloud 4.0 to specify a namespace for zookeeper so
 that you can run completely isolated Solr Cloud Clusters.

 There is the collection.configName property, which puts specific items into
 sub-nodes for that collection, but certain things are still shared in the
 root directory in Zookeeper, like clusterstate.json.
 What I am looking for is a property which allows me to prepend a namespace to
 all nodes in Zookeeper that Solr Cloud inserts.

 Does anyone know if this exists?



Re: How to get a list of servers per collection in sorlcloud using java api?

2012-11-22 Thread Luis Cappa Banda
Hello,

As far as I know, you cannot do that at the moment, :-/

Regards,


 - Luis Cappa.


2012/11/22 joe.cohe...@gmail.com joe.cohe...@gmail.com

 Thanks Rakudten.
 I had my question mis-phrased.
 What I need is to get the Solr servers storing a collection, given the
 Zookeeper server as input.

 something like:

 // returns a list of solr servers in the zookeeper ensemble that store the
 given collection
 ListString getServers(String zkhost, String collectionName)




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-get-a-list-of-servers-per-collection-in-sorlcloud-using-java-api-tp4021863p4021883.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 

- Luis Cappa


Re: is there a way to prevent abusing rows parameter

2012-11-22 Thread solr-user
Thanks guys.  This is a problem with the front end not validating requests.
I was hoping there might be a simple config value I could enter/change,
rather than going through the long process of migrating a proper fix all the
way up to our production servers. Looks like not, but thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/is-there-a-way-to-prevent-abusing-rows-parameter-tp4021467p4021892.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Partial results with not enough hits

2012-11-22 Thread Aleksey Vorona

Thank you!

That seems to be the case: I tried executing queries without sorting and
with only one document in the response, and I got execution times in the
same range as before.


-- Aleksey

On 12-11-21 04:07 PM, Jack Krupansky wrote:

It could be that the time to get set up to return even the first result is
high and then each additional document is a minimal increment in time.

Do a query with rows=1 (or even 0) and see what the minimum query time is
for your query, index, and environment.

-- Jack Krupansky

-Original Message-
From: Aleksey Vorona
Sent: Wednesday, November 21, 2012 6:04 PM
To: solr-user@lucene.apache.org
Subject: Partial results with not enough hits

In all of my queries I have timeAllowed parameter. My application is
ready for partial results. However, whenever Solr returns partial result
it is a very bad result.

For example, I have a test query and here its execution log with the
strict time allowed:
  WARNING: Query: omitted; Elapsed time: 120. Exceeded allowed search
time: 100 ms.
  INFO: [] webapp=/solr path=/select
params={omitted&timeAllowed=100} hits=189 status=0 QTime=119
Here it is without such a strict limitation:
  INFO: [] webapp=/solr path=/select
params={omitted&timeAllowed=1} hits=582 status=0 QTime=124

The total execution time is different by mere 5 ms, but the partial
result has only about 1/3 of the full result.

Is it the expected behaviour? Does that mean I can never rely on the
partial results?

I added timeAllowed to protect from too expensive wide queries, but I
still want to return something relevant to the user. This query returned
30% of the full result, but I have other queries in the log where
partial result is just empty. Am I doing something wrong?

P.S. I am using Solr 3.6.1, index size is 3Gb and easily fits in memory.
Load Average on the Solr box is very low.

-- Aleksey






Re: Partial results with not enough hits

2012-11-22 Thread Aleksey Vorona

Thanks for the response.

I have increased the timeout and it did not increase execution time or 
system load. It is really that I misused the timeout.


Just to give you a bit of perspective: we added the timeout to guarantee
some level of QoS from the search engine. Our UI allows users to
construct very complex queries and (what is worse) the user does not always
really understand what she needs. That may become a problem if we have
lots of users doing it. In that case I do not want to run such a
complex query for seconds; I want to return some result with a warning
to the user that she is doing something wrong. But clearly I set the
timeout too low for that and started to harm even normal queries.
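For reference, a sketch of how such a request looks (hypothetical host and field names); timeAllowed is in milliseconds, and a truncated response is flagged in the header so the client can attach the warning:

```shell
# Hypothetical query; field names are made up.
curl 'http://localhost:8983/solr/select?q=name:john&rows=10&timeAllowed=500'
# If the limit was hit, the responseHeader carries partialResults=true,
# which is the signal to warn the user that the result set is incomplete.
```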


Anyway, thanks everyone for the replies. The issue is fixed and I now 
understand how timeout works much better (which was the reason to post 
to this list). Thanks!


-- Aleksey

On 12-11-22 06:37 AM, Otis Gospodnetic wrote:

Hi,

Maybe your goal should be to make your queries faster instead of fighting
with timeouts which are known not to work well.

What is your hardware like?
How about your queries?
What do you see in debugQuery=true output?

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm
On Nov 21, 2012 6:04 PM, Aleksey Vorona avor...@ea.com wrote:


In all of my queries I have timeAllowed parameter. My application is ready
for partial results. However, whenever Solr returns partial result it is a
very bad result.

For example, I have a test query and here its execution log with the
strict time allowed:
 WARNING: Query: omitted; Elapsed time: 120. Exceeded allowed search
time: 100 ms.
 INFO: [] webapp=/solr path=/select params={omitted&timeAllowed=100}
hits=189 status=0 QTime=119
Here it is without such a strict limitation:
 INFO: [] webapp=/solr path=/select params={omitted&timeAllowed=**1}
hits=582 status=0 QTime=124

The total execution time is different by mere 5 ms, but the partial result
has only about 1/3 of the full result.

Is it the expected behaviour? Does that mean I can never rely on the
partial results?

I added timeAllowed to protect from too expensive wide queries, but I
still want to return something relevant to the user. This query returned
30% of the full result, but I have other queries in the log where partial
result is just empty. Am I doing something wrong?

P.S. I am using Solr 3.6.1, index size is 3Gb and easily fits in memory.
Load Average on the Solr box is very low.

-- Aleksey





SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.

2012-11-22 Thread Luis Cappa Banda
Hello everyone.

I've started to seriously worry about SolrCloud due to some strange
behavior that I have detected. The situation is the following:

*1.* SolrCloud with one shard and two Solr instances.
*2.* Indexation via SolrJ with CloudSolrServer and a custom
BinaryLBHttpSolrServer that uses BinaryRequestWriter to execute atomic
updates correctly. Check
JIRA-4080https://issues.apache.org/jira/browse/SOLR-4080?focusedCommentId=13498055page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13498055
*3.* An asynchronous process partially updates some document fields. After
that operation I automatically execute a commit, so the index must be
reloaded.

What I have observed is that, whether using atomic updates or complete document
reindexation, *random documents are not updated*, *even though while debugging
I saw the add() and commit() operations execute correctly and without
errors*. Has anyone experienced similar behavior? Is it possible that if
an index update operation didn't finish and CloudSolrServer receives a new
one, this second update operation doesn't complete?

Thank you in advance.

Regards,

-- 

- Luis Cappa


Re: How to get a list of servers per collection in sorlcloud using java api?

2012-11-22 Thread Sami Siren
On Thu, Nov 22, 2012 at 7:20 PM, joe.cohe...@gmail.com 
joe.cohe...@gmail.com wrote:

 Thanks Rakudten.
 I had my question mis-phrased.
 What I need is to get the Solr servers storing a collection, given the
 Zookeeper server as input.

 something like:

 // returns a list of solr servers in the zookeeper ensemble that store the
 given collection
 ListString getServers(String zkhost, String collectionName)


You can use ZKStateReader (#getClusterState) to get this info.
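A hedged sketch of what that could look like against the SolrJ 4.x API (method names from memory and they may differ slightly between 4.0 and 4.1, so check your version; it needs a live cluster to run):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkNodeProps;
import org.apache.solr.common.cloud.ZkStateReader;

public class CollectionServers {
    // Collect the base URLs of every replica of the given collection.
    static List<String> getServers(String zkHost, String collection) throws Exception {
        CloudSolrServer server = new CloudSolrServer(zkHost);
        server.connect(); // populates the ZkStateReader from Zookeeper
        ClusterState state = server.getZkStateReader().getClusterState();
        List<String> urls = new ArrayList<String>();
        for (Slice slice : state.getSlices(collection).values()) {
            for (ZkNodeProps replica : slice.getShards().values()) {
                urls.add(replica.getStr(ZkStateReader.BASE_URL_PROP));
            }
        }
        return urls;
    }
}
```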

--
 Sami Siren


Reloading config to zookeeper

2012-11-22 Thread Cool Techi
When we make changes to our config files, how do we reload the files into
zookeeper?

Also, I understand that we would need to reload the collection. Would we need
to do this at a per-shard level or just at the cloud level?

Regards,
Ayush

  

Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.

2012-11-22 Thread Sami Siren
I think the problem is that even though you were able to work around the
bug in the client, Solr still uses the XML format internally, so the atomic
update (with a multivalued field) fails later down the stack. The bug you
filed needs to be fixed to get the problem solved.


On Thu, Nov 22, 2012 at 8:19 PM, Luis Cappa Banda luisca...@gmail.comwrote:

 Hello everyone.

 I´ve starting to seriously worry about with SolrCloud due an strange
 behavior that I have detected. The situation is this the following:

 *1.* SolrCloud with one shard and two Solr instances.
 *2.* Indexation via SolrJ with CloudServer and a custom
 BinaryLBHttpSolrServer that uses BinaryRequestWriter to execute correctly
 atomic updates. Check
 JIRA-4080
 https://issues.apache.org/jira/browse/SOLR-4080?focusedCommentId=13498055page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13498055
 
 *3.* An asynchronous proccess updates partially some document fields. After
 that operation I automatically execute a commit, so the index must be
 reloaded.

 What I have checked is that both using atomic updates or complete document
 reindexations* aleatory documents are not updated* *even if I saw debugging
 how the add() and commit() operations were executed correctly* *and without
 errors*. Has anyone experienced a similar behavior? Is it posible that if
 an index update operation didn´t finish and CloudSolrServer receives a new
 one this second update operation doesn´t complete?

 Thank you in advance.

 Regards,

 --

 - Luis Cappa



Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.

2012-11-22 Thread Luis Cappa Banda
Hi, Sami!

But isn't it strange that some documents were updated correctly (atomic
updates) and other ones were not? Couldn't it be a more serious problem, like
some kind of index writer lock?

Regards,

- Luis Cappa.

2012/11/22 Sami Siren ssi...@gmail.com

 I think the problem is that even though you were able to work around the
 bug in the client solr still uses the xml format internally so the atomic
 update (with multivalued field) fails later down the stack. The bug you
 filed needs to be fixed to get the problem solved.


 On Thu, Nov 22, 2012 at 8:19 PM, Luis Cappa Banda luisca...@gmail.com
 wrote:

  Hello everyone.
 
  I´ve starting to seriously worry about with SolrCloud due an strange
  behavior that I have detected. The situation is this the following:
 
  *1.* SolrCloud with one shard and two Solr instances.
  *2.* Indexation via SolrJ with CloudServer and a custom
  BinaryLBHttpSolrServer that uses BinaryRequestWriter to execute correctly
  atomic updates. Check
  JIRA-4080
 
 https://issues.apache.org/jira/browse/SOLR-4080?focusedCommentId=13498055page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13498055
  
  *3.* An asynchronous proccess updates partially some document fields.
 After
  that operation I automatically execute a commit, so the index must be
  reloaded.
 
  What I have checked is that both using atomic updates or complete
 document
  reindexations* aleatory documents are not updated* *even if I saw
 debugging
  how the add() and commit() operations were executed correctly* *and
 without
  errors*. Has anyone experienced a similar behavior? Is it posible that if
  an index update operation didn´t finish and CloudSolrServer receives a
 new
  one this second update operation doesn´t complete?
 
  Thank you in advance.
 
  Regards,
 
  --
 
  - Luis Cappa
 




-- 

- Luis Cappa


Re: Reloading config to zookeeper

2012-11-22 Thread Marcin Rzewucki
Hi,

I'm using cloud-scripts/zkcli.sh script for reloading configuration, for
example:
$ ./cloud-scripts/zkcli.sh -cmd upconfig -confdir config.dir -solrhome
solr.home -confname config.name -z zookeeper.host

Then I'm reloading collection on each node in cloud, but maybe someone
knows better solution.
Regards.

On 22 November 2012 19:23, Cool Techi cooltec...@outlook.com wrote:

 When we make changes to our config files, how do we reload the files into
 zookeeper.

 Also, I understand that we would need to reload the collection, would we
 need to do this at a per shard level or just at the cloud level.

 Regards,
 Ayush




RE: Reloading config to zookeeper

2012-11-22 Thread Cool Techi
Thanks, but why do we need to specify the -solrhome? 

I am using the following command to load new config,

java -classpath .:/Users/solr-cli-lib/* org.apache.solr.cloud.ZkCLI -cmd 
upconfig -zkhost 
localhost:2181,localhost:2182,localhost:2183,localhost:2184,localhost:2185 
-confdir /Users/config-files -confname myconf

So basically reloading is just uploading the configs back again?

Regards,
Ayush

 Date: Thu, 22 Nov 2012 19:32:27 +0100
 Subject: Re: Reloading config to zookeeper
 From: mrzewu...@gmail.com
 To: solr-user@lucene.apache.org
 
 Hi,
 
 I'm using cloud-scripts/zkcli.sh script for reloading configuration, for
 example:
 $ ./cloud-scripts/zkcli.sh -cmd upconfig -confdir config.dir -solrhome
 solr.home -confname config.name -z zookeeper.host
 
 Then I'm reloading collection on each node in cloud, but maybe someone
 knows better solution.
 Regards.
 
 On 22 November 2012 19:23, Cool Techi cooltec...@outlook.com wrote:
 
  When we make changes to our config files, how do we reload the files into
  zookeeper.
 
  Also, I understand that we would need to reload the collection, would we
  need to do this at a per shard level or just at the cloud level.
 
  Regards,
  Ayush
 
 
  

Re: Reloading config to zookeeper

2012-11-22 Thread Marcin Rzewucki
I think -solrhome is not mandatory.
Yes, reloading is uploading the config dir again. It's a pity we can't upload
just the modified files.
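Putting the two steps together, a sketch of the full cycle (hypothetical hosts, paths, and names):

```shell
# 1. Re-upload the config set to Zookeeper:
./cloud-scripts/zkcli.sh -cmd upconfig \
    -confdir /path/to/conf -confname myconf -z zk1:2181
# 2. Reload the collection once via the Collections API; it fans the
#    reload out to every core, so no per-node reloading is needed:
curl 'http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection'
```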
Regards.

On 22 November 2012 19:38, Cool Techi cooltec...@outlook.com wrote:

 Thanks, but why do we need to specify the -solrhome?

 I am using the following command to load new config,

 java -classpath .:/Users/solr-cli-lib/* org.apache.solr.cloud.ZkCLI -cmd
 upconfig -zkhost
 localhost:2181,localhost:2182,localhost:2183,localhost:2184,localhost:2185
 -confdir /Users/config-files -confname myconf

 So basically reloading is just uploading the configs back again?

 Regards,
 Ayush

  Date: Thu, 22 Nov 2012 19:32:27 +0100
  Subject: Re: Reloading config to zookeeper
  From: mrzewu...@gmail.com
  To: solr-user@lucene.apache.org
 
  Hi,
 
  I'm using cloud-scripts/zkcli.sh script for reloading configuration,
 for
  example:
  $ ./cloud-scripts/zkcli.sh -cmd upconfig -confdir config.dir -solrhome
  solr.home -confname config.name -z zookeeper.host
 
  Then I'm reloading collection on each node in cloud, but maybe someone
  knows better solution.
  Regards.
 
  On 22 November 2012 19:23, Cool Techi cooltec...@outlook.com wrote:
 
   When we make changes to our config files, how do we reload the files
 into
   zookeeper.
  
   Also, I understand that we would need to reload the collection, would
 we
   need to do this at a per shard level or just at the cloud level.
  
   Regards,
   Ayush
  
  




Re: Reload core via CoreAdminRequest doesnt work with solr cloud? (solrj)

2012-11-22 Thread Tomás Fernández Löbbe
If you need to reload all the cores from a given collection you can use the
Collections API:
http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection


On Thu, Nov 22, 2012 at 3:17 PM, joe.cohe...@gmail.com 
joe.cohe...@gmail.com wrote:

 Hi,
 I'm using solr-4.0.0
 I'm trying to reload all the cores of a given collection in my solr cloud.
 I use it like this:

 CloudSolrServer server = new CloudSolrServer (zkserver:port);
 server.setDefaultCollection(collection1);
 CoreAdminRequest req = new CoreAdminRequest();
 req.reloadCore(collection1, server)

 This throws an Exception telling me that no live solr servers are available,
 listing the servers like this:
 http://server/solr/collection1

 Of course doing other tasks like adding documnets to the CloudSolrServer
 above works fine.
 Using reloadCore on a HttpSolrServer also works fine.

 Any know issue with CloudSolrServer   and CoreAdminRequest ?


 Note that I moved to solr-4.0.0 from solr-4.0.0-beta after the same attempt
 there also failed, but with a different exception:
 it failed saying it cannot cast a String to a Map in ClusterState's load()
 method (line 300), because the key "range" gave some String value instead of
 a map object.




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Reload-core-via-CoreAdminRequest-doesnt-work-with-solr-cloud-solrj-tp4021882.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.

2012-11-22 Thread Sami Siren
It might even depend on the cluster layout! Let's say you have 2 shards (no
replicas): if the doc belongs to the node you send it to, so that it does not
get forwarded to another node, then the update should work; in the case where
the doc gets forwarded to another node, the problem occurs. With replicas it
could appear even stranger: the leader might have the doc right and the
replica not.

I only briefly looked at the bits that deal with this so perhaps there's
something more involved.
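A toy illustration of why the missing documents look random (this is not Solr's actual routing, which in 4.x hashes the uniqueKey over per-shard hash ranges, but the principle is the same: the target shard is a pure function of the document id):

```java
public class ShardRouting {
    // Toy router: bucket a document id onto one of numShards shards.
    static int shardFor(String docId, int numShards) {
        // floorMod keeps the bucket non-negative even for negative hashCodes.
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        // The same id always lands on the same shard, so the ids whose
        // updates must be forwarded to the other node (the broken path)
        // form a fixed but seemingly arbitrary subset.
        for (String id : new String[] {"a", "b", "c", "d"}) {
            System.out.println(id + " -> shard " + shardFor(id, 2));
        }
    }
}
```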


On Thu, Nov 22, 2012 at 8:29 PM, Luis Cappa Banda luisca...@gmail.comwrote:

 Hi, Sami!

 But isn´t strange that some documents were updated (atomic updates)
 correctly and other ones not? Can´t it be a more serious problem like some
 kind of index writer lock, or whatever?

 Regards,

 - Luis Cappa.

 2012/11/22 Sami Siren ssi...@gmail.com

  I think the problem is that even though you were able to work around the
  bug in the client solr still uses the xml format internally so the atomic
  update (with multivalued field) fails later down the stack. The bug you
  filed needs to be fixed to get the problem solved.
 
 
  On Thu, Nov 22, 2012 at 8:19 PM, Luis Cappa Banda luisca...@gmail.com
  wrote:
 
   Hello everyone.
  
   I´ve starting to seriously worry about with SolrCloud due an strange
   behavior that I have detected. The situation is this the following:
  
   *1.* SolrCloud with one shard and two Solr instances.
   *2.* Indexation via SolrJ with CloudServer and a custom
   BinaryLBHttpSolrServer that uses BinaryRequestWriter to execute
 correctly
   atomic updates. Check
   JIRA-4080
  
 
 https://issues.apache.org/jira/browse/SOLR-4080?focusedCommentId=13498055page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13498055
   
   *3.* An asynchronous proccess updates partially some document fields.
  After
   that operation I automatically execute a commit, so the index must be
   reloaded.
  
   What I have checked is that both using atomic updates or complete
  document
   reindexations* aleatory documents are not updated* *even if I saw
  debugging
   how the add() and commit() operations were executed correctly* *and
  without
   errors*. Has anyone experienced a similar behavior? Is it posible that
 if
   an index update operation didn´t finish and CloudSolrServer receives a
  new
   one this second update operation doesn´t complete?
  
   Thank you in advance.
  
   Regards,
  
   --
  
   - Luis Cappa
  
 



 --

 - Luis Cappa



upgrading from 4.0 to 4.1 causes CorruptIndexException: checksum mismatch in segments file

2012-11-22 Thread solr-user
hi all

I have been working on moving us from 4.0 to a newer build of 4.1

I am seeing a CorruptIndexException: checksum mismatch in segments file
error when I try to use the existing index files.

I did see something in the build log for #119 re LUCENE-4446 that mentions
flip file formats to point to 4.1 format

Do I just need to reindex, or is this some other issue (i.e., do I need to
configure something differently)?

or should I move back a few builds?

note, we are currently using:

solr-spec 4.0.0.2012.04.05.15.05.52
solr-impl 4.0-SNAPSHOT 1310094M - - 2012-04-05 15:05:52
lucene-spec 4.0-SNAPSHOT
lucene-impl 4.0-SNAPSHOT 1309921 - - 2012-04-05 10:25:27

and are considering moving to:

solr-spec 4.1.0.2012.11.03.18.08.42
solr-impl 4.1-2012-11-03_18-05-49 1405392 - hudson - 2012-11-03 18:08:42
lucene-spec 4.1-2012-11-03_18-05-49
lucene-impl 4.1-2012-11-03_18-05-49 1405392 - hudson - 2012-11-03 18:06:50
(aka apache-solr-4.1-2012-11-03_18-05-49)





--
View this message in context: 
http://lucene.472066.n3.nabble.com/upgrading-from-4-0-to-4-1-causes-CorruptIndexException-checksum-mismatch-in-segments-file-tp4021913.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Error: _version_field must exist in schema

2012-11-22 Thread Nick Zadrozny
On Wed, Oct 17, 2012 at 3:20 PM, Dotan Cohen dotanco...@gmail.com wrote:

 I do have a Solr 4 Beta index running on Websolr that does not have
 such a field. It works, but throws many Service Unavailable and
 Communication Error errors. Might the lack of the _version_ field be
 the reason?


Belated reply, but this is probably something you should let us know about
directly at supp...@onemorecloud.com if it happens again. Cheers.

-- 
Nick Zadrozny

Cofounder, One More Cloud

websolr.com https://websolr.com/home • bonsai.io http://bonsai.io/home

Hassle-free hosted full-text search,
powered by Apache Solr and ElasticSearch.


Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.

2012-11-22 Thread Luis Cappa Banda
Hello!

I'm using a simple test configuration with nShards=1 and no replicas.
CloudSolrServer is supposed to forward those index/update operations
properly, isn't it? I tested with a complete document reindexation, not
atomic updates, using the official LBHttpSolrServer rather than my custom
BinaryLBHttpSolrServer, and it doesn't work. I think this is not just a bug
related to atomic updates via CloudSolrServer, but a general bug that shows
up when an index changes frequently through reindexations/updates.

Regards,

- Luis Cappa.


2012/11/22 Sami Siren ssi...@gmail.com

 It might even depend on the cluster layout! Let's say you have 2 shards (no
 replicas) if the doc belongs to the node you send it to so that it does not
 get forwarded to another node then the update should work and in case where
 the doc gets forwarded to another node the problem occurs. With replicas it
 could appear even more strange: the leader might have the doc right and the
 replica not.

 I only briefly looked at the bits that deal with this so perhaps there's
 something more involved.


 On Thu, Nov 22, 2012 at 8:29 PM, Luis Cappa Banda luisca...@gmail.com
 wrote:

  Hi, Sami!
 
  But isn´t strange that some documents were updated (atomic updates)
  correctly and other ones not? Can´t it be a more serious problem like
 some
  kind of index writer lock, or whatever?
 
  Regards,
 
  - Luis Cappa.
 
  2012/11/22 Sami Siren ssi...@gmail.com
 
   I think the problem is that even though you were able to work around
 the
   bug in the client solr still uses the xml format internally so the
 atomic
   update (with multivalued field) fails later down the stack. The bug you
   filed needs to be fixed to get the problem solved.
  
  
   On Thu, Nov 22, 2012 at 8:19 PM, Luis Cappa Banda luisca...@gmail.com
   wrote:
  
Hello everyone.
   
I´ve starting to seriously worry about with SolrCloud due an strange
behavior that I have detected. The situation is this the following:
   
*1.* SolrCloud with one shard and two Solr instances.
*2.* Indexation via SolrJ with CloudServer and a custom
BinaryLBHttpSolrServer that uses BinaryRequestWriter to execute
  correctly
atomic updates. Check
JIRA-4080
   
  
 
 https://issues.apache.org/jira/browse/SOLR-4080?focusedCommentId=13498055&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13498055

*3.* An asynchronous proccess updates partially some document fields.
   After
that operation I automatically execute a commit, so the index must be
reloaded.
   
What I have checked is that both using atomic updates or complete
   document
reindexations* aleatory documents are not updated* *even if I saw
   debugging
how the add() and commit() operations were executed correctly* *and
   without
errors*. Has anyone experienced a similar behavior? Is it posible
 that
  if
an index update operation didn´t finish and CloudSolrServer receives
 a
   new
one this second update operation doesn´t complete?
   
Thank you in advance.
   
Regards,
   
--
   
- Luis Cappa
   
  
 
 
 
  --
 
  - Luis Cappa
 




-- 

- Luis Cappa


Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.

2012-11-22 Thread Luis Cappa Banda
For more details, my indexation App is:

1. Multithreaded.
2. NRT indexation.
3. It's a Web App with a REST API. It receives asynchronous requests that
produce the atomic updates / document reindexations I mentioned before.

I'm pretty sure the wrong behavior is related to CloudSolrServer and to the
fact that the index may be modified while another index update is still in
progress.

Regards,


- Luis Cappa.


2012/11/22 Luis Cappa Banda luisca...@gmail.com

 Hello!

 I´m using a simple test configuration with nShards=1 without any replica.
 SolrCloudServer is suposed to forward properly those index/update
 operations, isn´t it? I test with a complete document reindexation, not
 atomic updates, using the official LBHttpSolrServer, not my custom
 BinaryLBHttpSolrServer, and it dosn´t work. I think is not just a bug
 related with atomic updates via CloudSolrServer but a general bug when an
 index changes with reindexations/updates frequently.

 Regards,

 - Luis Cappa.


 2012/11/22 Sami Siren ssi...@gmail.com

 It might even depend on the cluster layout! Let's say you have 2 shards
 (no
 replicas) if the doc belongs to the node you send it to so that it does
 not
 get forwarded to another node then the update should work and in case
 where
 the doc gets forwarded to another node the problem occurs. With replicas
 it
 could appear even more strange: the leader might have the doc right and
 the
 replica not.

 I only briefly looked at the bits that deal with this so perhaps there's
 something more involved.


 On Thu, Nov 22, 2012 at 8:29 PM, Luis Cappa Banda luisca...@gmail.com
 wrote:

  Hi, Sami!
 
  But isn´t strange that some documents were updated (atomic updates)
  correctly and other ones not? Can´t it be a more serious problem like
 some
  kind of index writer lock, or whatever?
 
  Regards,
 
  - Luis Cappa.
 
  2012/11/22 Sami Siren ssi...@gmail.com
 
   I think the problem is that even though you were able to work around
 the
   bug in the client solr still uses the xml format internally so the
 atomic
   update (with multivalued field) fails later down the stack. The bug
 you
   filed needs to be fixed to get the problem solved.
  
  
   On Thu, Nov 22, 2012 at 8:19 PM, Luis Cappa Banda 
 luisca...@gmail.com
   wrote:
  
Hello everyone.
   
I´ve starting to seriously worry about with SolrCloud due an strange
behavior that I have detected. The situation is this the following:
   
*1.* SolrCloud with one shard and two Solr instances.
*2.* Indexation via SolrJ with CloudServer and a custom
BinaryLBHttpSolrServer that uses BinaryRequestWriter to execute
  correctly
atomic updates. Check
JIRA-4080
   
  
 
 https://issues.apache.org/jira/browse/SOLR-4080?focusedCommentId=13498055&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13498055

*3.* An asynchronous proccess updates partially some document
 fields.
   After
that operation I automatically execute a commit, so the index must
 be
reloaded.
   
What I have checked is that both using atomic updates or complete
   document
reindexations* aleatory documents are not updated* *even if I saw
   debugging
how the add() and commit() operations were executed correctly* *and
   without
errors*. Has anyone experienced a similar behavior? Is it posible
 that
  if
an index update operation didn´t finish and CloudSolrServer
 receives a
   new
one this second update operation doesn´t complete?
   
Thank you in advance.
   
Regards,
   
--
   
- Luis Cappa
   
  
 
 
 
  --
 
  - Luis Cappa
 




 --

 - Luis Cappa




-- 

- Luis Cappa


Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.

2012-11-22 Thread Luis Cappa Banda
More info:

- I'm trying to update the document by re-indexing the whole document
again: I first retrieve the document by querying on its id, then delete it
by its id, and re-index it including the new changes.
- At the same time there are other index writing operations.

*RESULT*: in most cases the document wasn't updated. Bad news... it smells
like a critical bug.
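For context, a Solr 4 atomic update can express such a change without the retrieve/delete/re-add cycle: the JSON request body wraps each changed field in a {"set": value} modifier. A minimal sketch of building that body (the document id and field name below are made up for illustration):

```python
import json

def atomic_update_payload(doc_id, changes):
    """Build a Solr 4 atomic-update body: each changed field is wrapped
    in a {"set": value} modifier instead of resending the whole document."""
    doc = {"id": doc_id}
    for field, value in changes.items():
        doc[field] = {"set": value}  # "set" replaces; "add"/"inc" also exist
    return json.dumps([doc])

# Hypothetical id/field, for illustration only.
payload = atomic_update_payload("doc-42", {"status": "reviewed"})
print(payload)  # [{"id": "doc-42", "status": {"set": "reviewed"}}]
```

This only sidesteps the read-modify-write race on the client side; whether it avoids the bug discussed in this thread (SOLR-4080) is a separate question.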

Regards,


- Luis Cappa.

2012/11/22 Luis Cappa Banda luisca...@gmail.com

 For more details, my indexation App is:

 1. Multithreaded.
 2. NRT indexation.
 3. It´s a Web App with a REST API. It receives asynchronous requests that
 produces those atomic updates / document reindexations I told before.

 I´m pretty sure that the wrong behavior is related with CloudSolrServer
 and with the fact that maybe you are trying to modify the index while an
 index update is in course.

 Regards,


 - Luis Cappa.


 2012/11/22 Luis Cappa Banda luisca...@gmail.com

 Hello!

 I´m using a simple test configuration with nShards=1 without any replica.
 SolrCloudServer is suposed to forward properly those index/update
 operations, isn´t it? I test with a complete document reindexation, not
 atomic updates, using the official LBHttpSolrServer, not my custom
 BinaryLBHttpSolrServer, and it dosn´t work. I think is not just a bug
 related with atomic updates via CloudSolrServer but a general bug when an
 index changes with reindexations/updates frequently.

 Regards,

 - Luis Cappa.


 2012/11/22 Sami Siren ssi...@gmail.com

 It might even depend on the cluster layout! Let's say you have 2 shards
 (no
 replicas) if the doc belongs to the node you send it to so that it does
 not
 get forwarded to another node then the update should work and in case
 where
 the doc gets forwarded to another node the problem occurs. With replicas
 it
 could appear even more strange: the leader might have the doc right and
 the
 replica not.

 I only briefly looked at the bits that deal with this so perhaps there's
 something more involved.


 On Thu, Nov 22, 2012 at 8:29 PM, Luis Cappa Banda luisca...@gmail.com
 wrote:

  Hi, Sami!
 
  But isn´t strange that some documents were updated (atomic updates)
  correctly and other ones not? Can´t it be a more serious problem like
 some
  kind of index writer lock, or whatever?
 
  Regards,
 
  - Luis Cappa.
 
  2012/11/22 Sami Siren ssi...@gmail.com
 
   I think the problem is that even though you were able to work around
 the
   bug in the client solr still uses the xml format internally so the
 atomic
   update (with multivalued field) fails later down the stack. The bug
 you
   filed needs to be fixed to get the problem solved.
  
  
   On Thu, Nov 22, 2012 at 8:19 PM, Luis Cappa Banda 
 luisca...@gmail.com
   wrote:
  
Hello everyone.
   
I´ve starting to seriously worry about with SolrCloud due an
 strange
behavior that I have detected. The situation is this the following:
   
*1.* SolrCloud with one shard and two Solr instances.
*2.* Indexation via SolrJ with CloudServer and a custom
BinaryLBHttpSolrServer that uses BinaryRequestWriter to execute
  correctly
atomic updates. Check
JIRA-4080
   
  
 
 https://issues.apache.org/jira/browse/SOLR-4080?focusedCommentId=13498055&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13498055

*3.* An asynchronous proccess updates partially some document
 fields.
   After
that operation I automatically execute a commit, so the index must
 be
reloaded.
   
What I have checked is that both using atomic updates or complete
   document
reindexations* aleatory documents are not updated* *even if I saw
   debugging
how the add() and commit() operations were executed correctly* *and
   without
errors*. Has anyone experienced a similar behavior? Is it posible
 that
  if
an index update operation didn´t finish and CloudSolrServer
 receives a
   new
one this second update operation doesn´t complete?
   
Thank you in advance.
   
Regards,
   
--
   
- Luis Cappa
   
  
 
 
 
  --
 
  - Luis Cappa
 




 --

 - Luis Cappa




 --

 - Luis Cappa




-- 

- Luis Cappa


Re: upgrading from 4.0 to 4.1 causes CorruptIndexException: checksum mismatch in segments file

2012-11-22 Thread Jack Krupansky
Moving from the final release of 4.0 to 4.1 should be fine, but you appear 
to be using a snapshot of 4.0 that is even older than the 4.0 ALPHA 
release, and a number of format changes occurred last spring. So, yes, you 
will have to re-index.


-- Jack Krupansky

-Original Message- 
From: solr-user

Sent: Thursday, November 22, 2012 2:03 PM
To: solr-user@lucene.apache.org
Subject: upgrading from 4.0 to 4.1 causes CorruptIndexException: checksum 
mismatch in segments file


hi all

I have been working on moving us from 4.0 to a newer build of 4.1

I am seeing a CorruptIndexException: checksum mismatch in segments file
error when I try to use the existing index files.

I did see something in the build log for #119 re LUCENE-4446 that mentions
flip file formats to point to 4.1 format

Do I just need to reindex or is this some other issue (ie do I need to
configure something differently)?

or should I move back a few builds?

note, we are currently using:

solr-spec 4.0.0.2012.04.05.15.05.52
solr-impl 4.0-SNAPSHOT 1310094M - - 2012-04-05 15:05:52
lucene-spec 4.0-SNAPSHOT
lucene-impl 4.0-SNAPSHOT 1309921 - - 2012-04-05 10:25:27

and are considering moving to:

solr-spec 4.1.0.2012.11.03.18.08.42
solr-impl 4.1-2012-11-03_18-05-49 1405392 - hudson - 2012-11-03 18:08:42
lucene-spec 4.1-2012-11-03_18-05-49
lucene-impl 4.1-2012-11-03_18-05-49 1405392 - hudson - 2012-11-03 18:06:50
(aka apache-solr-4.1-2012-11-03_18-05-49)





--
View this message in context: 
http://lucene.472066.n3.nabble.com/upgrading-from-4-0-to-4-1-causes-CorruptIndexException-checksum-mismatch-in-segments-file-tp4021913.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Find the matched field in each matched document

2012-11-22 Thread Jack Krupansky
No, not directly, but indirectly you can - add debugQuery=true to your 
request and the explain section will detail which terms matched in which 
fields.


You could probably also implement a custom search component which annotated 
each document with the matched field names. In that sense, Solr CAN do it.
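As a rough illustration of the debugQuery route: with debugQuery=true, each document's explain entry contains weight(field:term ...) fragments, and the matched field names can be scraped out of that text. The helper and sample string below are hypothetical, not a Solr API:

```python
import re

def matched_fields(explain_text):
    """Heuristically extract field names from a Lucene explain string,
    where term matches appear as weight(field:term in N) fragments."""
    return sorted(set(re.findall(r"weight\((\w+):", explain_text)))

# Abbreviated, made-up fragment of one document's explain entry.
sample = (
    "0.5 = (MATCH) sum of:\n"
    "  0.3 = (MATCH) weight(title:robert in 1)\n"
    "  0.2 = (MATCH) weight(actors:robert in 1)\n"
)
print(matched_fields(sample))  # ['actors', 'title']
```

With the schema from this thread, seeing only "actors" (or only "title") in that list answers the actor-or-movie-title question per document.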


-- Jack Krupansky

-Original Message- 
From: Alireza Salimi

Sent: Thursday, November 22, 2012 6:11 PM
To: solr-user@lucene.apache.org
Subject: Re: Find the matched field in each matched document

Maybe I should say it in a different way:

Given documents like the above, I want to know what "Robert De Niro" is:
an actor or a movie title.

You can just tell me whether Solr can do it or not; that will be enough.

Thanks



On Thu, Nov 22, 2012 at 1:57 PM, Alireza Salimi 
alireza.sal...@gmail.comwrote:



Hi,

I apologize if I'm asking a duplicate question, but I haven't found a
good answer for my problem. My question is: how can I find out which
fields matched the search criteria when I search over multiple fields?

Assume I have documents like this:
{title: "Robert De Niro", actors: []}
{title: "ronin", actors: ["robert de niro", "jean reno"]}
{title: "casino", actors: ["robert de niro", "Joe Pesci"]}

Here's is the schema:

<field name="actors"
       indexed="true"
       multiValued="true"
       stored="true"
       termPositions="true"
       termOffsets="true"
       termVectors="true"
       type="text_general" />

<field name="title"
       indexed="true"
       multiValued="false"
       stored="true"
       type="text_general" />

Now, after searching for "robert de niro" in both title and actors,
I will have some matches, but my question is: how can I find out what
"robert de niro" is? Is it an actor or a movie title?


Thanks in advance



--
Alireza Salimi
Java EE Developer






--
Alireza Salimi
Java EE Developer 



Re: Find the matched field in each matched document

2012-11-22 Thread Alireza Salimi
Hi Jack,

Thanks for the reply.

I'm not sure about the debug component; I thought it slows down query time.
Can you explain more about the custom search component?

Thanks


On Thu, Nov 22, 2012 at 7:02 PM, Jack Krupansky j...@basetechnology.comwrote:

 No, not directly, but indirectly you can - add debugQuery=true to your
 request and the explain section will detail which terms matched in which
 fields.

 You could probably also implement a custom search component which
 annotated each document with the matched field names. In that sense, Solr
 CAN do it.

 -- Jack Krupansky

 -Original Message- From: Alireza Salimi
 Sent: Thursday, November 22, 2012 6:11 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Find the matched field in each matched document


 Maybe I should say it in different way:

 By having documents like above, I want to know what Robert De Niro is?
 Is it an actor or a movie title.

 you can just tell me if Solr can do it or not, it will be enough.

 Thanks



 On Thu, Nov 22, 2012 at 1:57 PM, Alireza Salimi alireza.sal...@gmail.com
 **wrote:

  Hi,

 I apologize if i'm asking a duplicate question but I haven't found any
 good answer for my problem.
 My question is: How can I find out the type of fields that are matched to
 the search criteria,
 when I search over multip fields.

 Assume I have documents like this:
 {title: Robert De Niro, actors: []}
 {title: ronin, actors: [robert de niro, jean reno]}
 {title: casino, actors: [robert de niro, Joe Pesci]}

 Here's is the schema:

 field  name=actors
 indexed=true
 multiValued=true
 stored=true
 termPositions=true
 termOffsets=true
 termVectors=true
 type=text_general /

 field  name=title
 indexed=true
 multiValued=false
 stored=true
 type=text_general /

 Now after search for robert de niro in both title and Actors,
 I will have some matches, but my question is: How can I find out
 what robert de niro is? Is he an actor or a movie title?


 Thanks in advance



 --
 Alireza Salimi
 Java EE Developer





 --
 Alireza Salimi
 Java EE Developer




-- 
Alireza Salimi
Java EE Developer


Re: Performance improvement for solr faceting on large index

2012-11-22 Thread Otis Gospodnetic
Hi,

I don't quite follow what you are trying to do, but it almost sounds
like you may be better off using something other than Solr if all you are
doing is filtering by site and counting something.
I see unigrams in what looks like it could be a big field, and that's a red
flag.
Your index is quite big - how much memory have you got?  Do those queries
produce a lot of disk IO? I have a feeling they do. If so, your shards may
be too large for your hardware.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm
On Nov 22, 2012 7:53 AM, Pravin Agrawal pravin_agra...@persistent.co.in
wrote:

 Hi All,

 We are using Solr 3.4 with the following schema fields.


 schema.xml---

 <fieldType name="autosuggest_text" class="solr.TextField"
            positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ShingleFilterFactory"
             maxShingleSize="5" outputUnigrams="true"/>
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="^([0-9. ])*$" replacement=""
             replace="all"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

 <field name="id" type="string" stored="true" indexed="true"/>
 <field name="autoSuggestContent" type="autosuggest_text" stored="true"
        indexed="true" multiValued="true"/>
 <copyField source="content" dest="autoSuggestContent"/>
 <copyField source="original_title" dest="autoSuggestContent"/>

 <field name="content" type="text" stored="true" indexed="true"/>
 <field name="original_title" type="text" stored="true" indexed="true"/>
 <field name="site" type="site" stored="false" indexed="true"/>


 /schema.xml---

 The index on the above schema is distributed over two Solr shards, each
 with about 1.2 million documents and about 195GB of index on disk.

 We want to retrieve (site, autoSuggestContent term, frequency of the term)
 tuples from our main Solr index. The site field contains the name of the
 site to which each document belongs. The terms are retrieved from the
 multivalued field autoSuggestContent, which is built from shingles over
 the content and title of each web page.

 As of now, we are using a facet query to retrieve (term, frequency of
 term) for each site. Below is a sample query (you may ignore the initial
 part of the query):


 http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index
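 For reference, the facet_fields section of the JSON response to a query like the one above is a flat [term, count, term, count, ...] list; a sketch of flattening it into the desired (site, term, frequency) triples:

```python
def site_term_counts(site, facet_response, mincount=25):
    """Turn Solr's facet_fields output (a flat [term, count, ...] list)
    into (site, term, count) tuples for one site's facet query."""
    flat = facet_response["facet_counts"]["facet_fields"]["autoSuggestContent"]
    pairs = zip(flat[::2], flat[1::2])  # (term, count) pairs
    return [(site, term, n) for term, n in pairs if n >= mincount]

# Made-up response fragment, for illustration only.
resp = {"facet_counts": {"facet_fields":
        {"autoSuggestContent": ["solr cloud", 40, "jetty", 10]}}}
print(site_term_counts("www.abc.com", resp))  # [('www.abc.com', 'solr cloud', 40)]
```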

 The problem is that as the index grows, this method has started taking a
 huge amount of time. It used to take 7 minutes per site with an index of
 0.4 million docs, but takes around 60-90 minutes with an index of 2.5
 million. At this speed, it will take around 5-6 days to process all 1500
 sites. We also expect the index to grow with more documents and more
 sites, so the time to extract this information will increase further.

 Please let us know if there is a better way to extract the (site, term,
 frequency) information compared to the current method.

 Thanks,
 Pravin Agrawal




 DISCLAIMER
 ==
 This e-mail may contain privileged and confidential information which is
 the property of Persistent Systems Ltd. It is intended only for the use of
 the individual or entity to which it is addressed. If you are not the
 intended recipient, you are not authorized to read, retain, copy, print,
 distribute or use this message. If you have received this communication in
 error, please notify the sender and delete all copies of this message.
 Persistent Systems Ltd. does not accept any liability for virus infected
 mails.



Re: SolrCloud and external Zookeeper ensemble

2012-11-22 Thread Otis Gospodnetic
Note that the number of ZooKeeper nodes is independent of the number of
shards.
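For reference, ZooKeeper stays available while a majority of the ensemble is alive, so an ensemble of N nodes tolerates floor((N - 1) / 2) failures regardless of how many Solr shards it coordinates:

```python
def zk_failures_tolerated(ensemble_size):
    """A ZooKeeper ensemble needs a majority quorum, so N nodes
    tolerate (N - 1) // 2 simultaneous node failures."""
    return (ensemble_size - 1) // 2

for n in (1, 3, 5):
    print(n, zk_failures_tolerated(n))  # 1 -> 0, 3 -> 1, 5 -> 2
```

This is why 3 is the usual minimum for production: it is the smallest ensemble that survives one node failure.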

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm
On Nov 22, 2012 4:19 AM, Luis Cappa Banda luisca...@gmail.com wrote:

 Hello,

 I've been dealing with the same question these days. In architectural
 terms, it's always better to separate services (Solr and Zookeeper, in
 this case) than to keep them in a single instance. However, when we have
 to deal with cost issues, we are all quite limited and must choose the
 option that best balances architecture, scalability and single points of
 failure. As I see it, the options are:


 *1. *Solr servers with Zookeeper embeded.
 *2. *Solr servers with external Zookeeper.
 *3.* Solr servers with external Zookeeper ensemble.

 *Note*: as far as I know, the recommended number of Zookeeper services to
 avoid single points of failure is:* ZkNum = 2 * Numshards - 1*. If you have


 The best option is the third one. Reasons:

 *1. *If one of your Solr servers goes down, the Zookeeper services stay up.
 *2.* If one of your Zookeeper services goes down, the Solr servers and the
 rest of the Zookeeper services stay up.

 Considering that option, we have two ways to implement it in production:

 *1. *Each service (Solr and Zookeeper) on a separate machine. Let's
 imagine we have 2 shards for a given collection, so we need at least 4
 Solr servers to complete the leader-replica configuration. The best option
 is to deploy them on four Amazon instances, one per server. We also need
 at least 3 Zookeeper services in a Zookeeper ensemble. The optimal way is
 to install them on separate machines (a micro instance will do nicely for
 Zookeeper), so we will have 7 Amazon instances in total. The reason is
 that if one machine goes down (Solr or Zookeeper), the other services stay
 up and your production environment is safe. However, *for me this is the
 best case, but it's also the most expensive one*, so in my case it's
 impossible to realize.

 *2. *As we need at least 4 Solr servers and 3 Zookeeper services up, I
 would install three Amazon instances with both Solr and Zookeeper, and one
 with only Solr. So we'll have 3 complete Amazon instances (Solr +
 Zookeeper) and 1 single Amazon instance (only Solr). If one of them goes
 down, the production environment stays safe. This architecture is not the
 best one, as I said, but I think it is optimal in terms of robustness,
 single points of failure and cost.


 It would be a pleasure to hear suggestions from other people who have
 dealt with this kind of issue.

 Regards,


 - Luis Cappa.


 2012/11/21 Marcin Rzewucki mrzewu...@gmail.com

  Yes, I meant the same (not -zkRun). However, I was asking if it is safe
 to
  have zookeeper and solr processes running on the same node or better on
  different machines?
 
  On 21 November 2012 21:18, Rafał Kuć r@solr.pl wrote:
 
   Hello!
  
   As I told I wouldn't use the Zookeeper that is embedded into Solr, but
   rather setup a standalone one.
  
   --
   Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
  ElasticSearch
  
First of all: thank you for your answers. Yes, I meant side by side
configuration. I think the worst case for ZKs here is to loose two of
   them.
However, I'm going to use 4 availability zones in same region so at
  least
this will reduce the risk of loosing both of them at the same time.
Regards.
  
On 21 November 2012 17:06, Rafał Kuć r@solr.pl wrote:
  
Hello!
   
Zookeeper by itself is not demanding, but if something happens to
 your
nodes that have Solr on it, you'll loose ZooKeeper too if you have
them installed side by side. However if you will have 4 Solr nodes
 and
3 ZK instances you can get them running side by side.
   
--
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
   ElasticSearch
   
 Separate is generally nice because then you can restart Solr nodes
 without consideration for ZooKeeper.
   
 Performance-wise, I doubt it's a big deal either way.
   
 - Mark
   
 On Nov 21, 2012, at 8:54 AM, Marcin Rzewucki mrzewu...@gmail.com
 
wrote:
   
 Hi,

 I have 4 solr collections, 2-3mn documents per collection, up to
  100K
 updates per collection daily (roughly). I'm going to create
   SolrCloud4x
on
 Amazon's m1.large instances (7GB mem,2x2.4GHz cpu each). The
   question is
 what about zookeeper? It's going to be external ensemble, but is
 it
better
 to use same nodes as solr or dedicated micro instances? Zookeeper
   does
not
 seem to be resources demanding process, but what would be better
 in
   this
 case ? To keep it inside of solrcloud or separately (micro
  instances
seem
 to be enough here) ?

 Thanks in advance.
 Regards.
   
   
  
  
 



 --

 - Luis Cappa



User context based search in apache solr

2012-11-22 Thread sagarzond
In our application we provide product master data search with Solr. Now we
want to provide user-context-based search (i.e., rank top results using
each user's history).

For that I have created one score table with the following fields:

1) product_id

2) user_id

3) score_value

As soon as a user clicks on a product, an entry is created in this table,
or score_value is increased if an entry already exists for that user and
product. We are planning to use boost fields and eDisMax in Solr to improve
the search results, but for this I would need a one-to-many mapping between
the score and product tables (because one product has a different score
value for each user), and Solr does not provide one-to-many mappings.

We could solve this (the one-to-many mapping) by de-normalizing the
structure, i.e. having multiple entries per product with different score
values for different users, but that results in a huge amount of redundant
data.

Is this (de-normalized structure) the correct way to handle it, or is there
another way to implement such context-based search?

Please help me.
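One query-time alternative to de-normalizing, sketched below under the assumption that the score table described above is available at request time: keep a single product index and boost each user's top-scored products with eDisMax bq clauses. The field name product_id and the URL shape are illustrative only:

```python
from urllib.parse import urlencode

def personalized_query(text, user_product_scores, max_boosts=5):
    """Build an eDisMax request that boosts a user's top-clicked products
    via bq clauses, so per-user scores never enter the index.
    user_product_scores: {product_id: score_value} for one user."""
    top = sorted(user_product_scores.items(), key=lambda kv: -kv[1])[:max_boosts]
    params = [("q", text), ("defType", "edismax")]
    params += [("bq", "product_id:%s^%d" % (pid, score)) for pid, score in top]
    return "/select?" + urlencode(params)

url = personalized_query("laptop", {"p1": 7, "p2": 3})
print(url)  # /select?q=laptop&defType=edismax&bq=product_id%3Ap1%5E7&bq=product_id%3Ap2%5E3
```

The trade-off is one score-table lookup and a longer query string per request, instead of redundant per-user documents in the index.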



--
View this message in context: 
http://lucene.472066.n3.nabble.com/User-context-based-search-in-apache-solr-tp4021964.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Error: _version_field must exist in schema

2012-11-22 Thread Dotan Cohen
On Thu, Nov 22, 2012 at 9:26 PM, Nick Zadrozny n...@onemorecloud.com wrote:
 Belated reply, but this is probably something you should let us know about
 directly at supp...@onemorecloud.com if it happens again. Cheers.


Hi Nick. This particular issue was on a Solr 4 instance on AWS, not on
the Websolr account. But I commend you for taking notice and taking an
interest. Thank you!

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


RE: Solr UIMA with KEA

2012-11-22 Thread Markus Jelsma
See: 
http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html

 
-Original message-
 From:nutchsolruser nutchsolru...@gmail.com
 Sent: Fri 23-Nov-2012 06:53
 To: solr-user@lucene.apache.org
 Subject: Solr UIMA with KEA
 
 Is there any way we can extract tags or key phrases from a Solr document
 at index time?
 
 I know we can use the Solr UIMA library to enrich Solr documents with
 metadata, but it requires an Alchemy API key (which we would have to
 purchase for commercial use). Can we wrap the KEA key-phrase extractor in
 UIMA for this purpose? If yes, please let me know some useful pointers for
 doing this.
 
 Thank you ,
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-UIMA-with-KEA-tp4021962.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


RE: Solr UIMA with KEA

2012-11-22 Thread Markus Jelsma
Sorry, wrong list :) 
 
-Original message-
 From:Markus Jelsma markus.jel...@openindex.io
 Sent: Fri 23-Nov-2012 08:32
 To: solr-user@lucene.apache.org
 Subject: RE: Solr UIMA with KEA
 
 See: 
 http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
 
  
 -Original message-
  From:nutchsolruser nutchsolru...@gmail.com
  Sent: Fri 23-Nov-2012 06:53
  To: solr-user@lucene.apache.org
  Subject: Solr UIMA with KEA
  
  Is there any way we can extract tags or keyphrases from solr document at
  index time?
  
  I know we can use solr UIMA library  to enrich solr document with metadata
  but it require alchemy API key (which we have to purchase for commercial
  use) . Can we wrap KeyPhraseExtractor(KEA) in UIMA for this purpose  if yes
  then then let me know some useful pointers for doing this.
  
  Thank you ,
  
  
  
  --
  View this message in context: 
  http://lucene.472066.n3.nabble.com/Solr-UIMA-with-KEA-tp4021962.html
  Sent from the Solr - User mailing list archive at Nabble.com.