Re: Concat 2 fields in another field
Hi all, thanks for your replies. I managed to do this by writing a custom update processor and configuring it as below:

<processor class="com.test.solr.update.CustomConcatFieldUpdateprocessorFactory">
  <str name="field">firstName</str>
  <str name="field">lastName</str>
  <str name="dest">fullName</str>
  <str name="delimiter">_</str>
</processor>

Federico Chiacchiaretta, I tried the option you mentioned, but on frequent updates of the document it kept appending the value multiple times, which I don't want. In my custom component I check for an existing value, and only if it is empty do I set it to fN_lN. Thanks a lot for the quick replies.
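For illustration, here is a minimal sketch of what such a factory might look like against the Solr 4.x update-processor API - this is not the poster's actual code, and the parameter names simply mirror the config above:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class CustomConcatFieldUpdateprocessorFactory extends UpdateRequestProcessorFactory {

  private final List<String> sourceFields = new ArrayList<>();
  private String dest;
  private String delimiter;

  @Override
  public void init(@SuppressWarnings("rawtypes") NamedList args) {
    for (Object f : args.getAll("field")) {        // firstName, lastName
      sourceFields.add((String) f);
    }
    dest = (String) args.get("dest");              // fullName
    delimiter = (String) args.get("delimiter");    // "_"
  }

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        // Only fill the destination when it is empty, so repeated updates
        // of the same document don't append the value again and again.
        if (doc.getFieldValue(dest) == null) {
          StringBuilder sb = new StringBuilder();
          for (String field : sourceFields) {
            Object value = doc.getFieldValue(field);
            if (value == null) continue;
            if (sb.length() > 0) sb.append(delimiter);
            sb.append(value);
          }
          doc.setField(dest, sb.toString());
        }
        super.processAdd(cmd);
      }
    };
  }
}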
why does a node switch state?
hi, I have a SolrCloud with 8 JVMs hosting 4 shards (2 nodes per shard). 1,000,000 docs are indexed per day, with 10 query requests per second, sometimes spiking to around 100 per second. In each shard, one JVM has an 8G heap and the other 5G. The JVM args are:

-Xmx5000m -Xms5000m -Xmn2500m -Xss1m -XX:PermSize=128m -XX:MaxPermSize=128m -XX:SurvivorRatio=3 -XX:+UseParNewGC -XX:ParallelGCThreads=4 -XX:+UseConcMarkSweepGC -XX:CMSFullGCsBeforeCompaction=5 -XX:+UseCMSCompactAtFullCollection -XX:+PrintGCDateStamps -XX:+PrintGC -Xloggc:log/jvmsolr.log

or

-Xmx8000m -Xms8000m -Xmn2500m -Xss1m -XX:PermSize=128m -XX:MaxPermSize=128m -XX:SurvivorRatio=3 -XX:+UseParNewGC -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:CMSFullGCsBeforeCompaction=5 -XX:+UseCMSCompactAtFullCollection -XX:+PrintGC -XX:+PrintGCDateStamps -Xloggc:log/jvmsolr.log

The nodes work well, but they also switch state every day (and at the same time, GC becomes abnormal, like below):

2013-08-28T13:29:39.140+0800: 97180.866: [GC 3770296K->2232626K(4608000K), 0.0099250 secs]
2013-08-28T13:30:09.324+0800: 97211.050: [GC 3765732K->2241711K(4608000K), 0.0124890 secs]
2013-08-28T13:30:29.777+0800: 97231.504: [GC 3760694K->2736863K(4608000K), 0.0695530 secs]
2013-08-28T13:31:02.887+0800: 97264.613: [GC 4258337K->4354810K(4608000K), 0.1374600 secs]
97264.752: [Full GC 4354810K->2599431K(4608000K), 6.7833960 secs]
2013-08-28T13:31:09.884+0800: 97271.610: [GC 2750517K(4608000K), 0.0054320 secs]
2013-08-28T13:31:15.354+0800: 97277.080: [GC 3550474K(4608000K), 0.0871270 secs]
2013-08-28T13:31:31.258+0800: 97292.984: [GC 3877223K(4608000K), 0.1551870 secs]
2013-08-28T13:31:34.396+0800: 97296.123: [GC 3877223K(4608000K), 0.1220380 secs]
2013-08-28T13:31:38.102+0800: 97299.828: [GC 3877225K(4608000K), 0.1545500 secs]
2013-08-28T13:31:40.227+0800: 97303.019: [Full GC 4174941K->2127315K(4608000K), 6.3435150 secs]
2013-08-28T13:31:49.645+0800: 97311.371: [GC 2508466K(4608000K), 0.0355180 secs]
2013-08-28T13:31:57.645+0800: 97319.371: [GC 2967737K(4608000K), 0.0579650 secs]

Even worse, sometimes a whole shard is down (one node recovering, the other down), which is an absolute disaster... Please help me; any advice is welcome.
Re: Multiple replicas for specific shard
Thanks Keith! But can this be done dynamically? Let's take the following example: a SolrCloud cluster with sport event results split into three shards by category - a football shard, a golf shard and a baseball shard. Each of these shards has a replica on a machine. Then I realize that my football-related QPS has grown dramatically, so I decide to add 2 more replicas for the football shard, on two new machines. How can I proceed in this situation?
Re: Solr 4.2 Regular expression, returning only matched substring
hi Erick, I appreciate your reply. facet.query will give the count of matches, not the count per unique pattern match. If I give the regular expression [0-9]{3} to match a 3-digit number, it will return the total occurrences of three-digit numbers, but I want to know the occurrences of each unique 3-digit number. Let's say the number 100 occurs 10 times and 500 occurs 5 times; facet.query will return a count of 15, instead of giving the counts for 100 and 500 individually. I hope I made myself clear. Is there any way to do this? thanks and regards jai
Re: why does a node switch state?
Do you see anything in the Solr logs as to what the trigger for your nodes changing state was? You should see some kind of error/warning before the election is triggered. My gut feeling would be loss of communication between your leader and ZK (possibly from a GC event that locks the JVM for a while), but that's pure conjecture given you haven't given a lot of information. What is your ZK timeout? You are seeing ~6s GC events, so if that is locking the JVM for that long, and your ZK timeout is less than that, it is likely that ZK thinks the node has gone away, so it forces an election to find a new leader. But there should be evidence of that in the logs; you should see the ZK connection drop. On 28 August 2013 08:25, sling sling...@gmail.com wrote: [...]
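As a concrete (hedged) illustration of where that timeout lives in a 4.x setup - 30 seconds is an arbitrary example value chosen to sit above the ~7s full-GC pauses shown above:

java -DzkClientTimeout=30000 -jar start.jar

The stock old-style solr.xml reads this property via zkClientTimeout="${zkClientTimeout:15000}" on the <cores> element, so it can also be set there directly. Raising it only hides the symptom, though; the long full-GC pauses are worth attacking as well.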
Re: Solr 4.2 Regular expression, returning only matched substring
Ah, OK. Nothing springs to mind. Even faceting on the individual values of the field counts _documents_ that match, but doesn't give you which particular values matched. I suppose that in that case you could run your regex over the returned labels for the facets, but that's a really ugly solution. The problem is that for a field with 1M unique values you'd perhaps get a list 1M long, which wouldn't perform at all well. Depending, you could enumerate your terms (see TermsComponent) using terms.regex to get a list of all terms that match your regex up-front, then do some relatively painful facet querying on the long list of returned values - again not something I'd do in a high-query environment. It depends, I guess, on how busy your website is. Best Erick On Wed, Aug 28, 2013 at 4:18 AM, jai2 jai4l...@gmail.com wrote: [...]
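For illustration, such a TermsComponent request might look like this, assuming the /terms handler from the example solrconfig and an illustrative field name; each term in the response comes back with its document frequency, which is close to the per-value count being asked for:

http://localhost:8983/solr/collection1/terms?terms=true&terms.fl=myfield&terms.regex=[0-9]{3}&terms.limit=-1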
Help to figure out why query does not match
Hi, please help me figure out what's going on. I have the following field type:

<fieldType name="words_ngram" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\d\w]+" />
    <filter class="solr.StopFilterFactory" words="url_stopwords.txt" ignoreCase="true" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="20" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\d\w]+" />
    <filter class="solr.StopFilterFactory" words="url_stopwords.txt" ignoreCase="true" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

And the following string indexed: http://plus.google.com/111950520904110959061/profile Here is what the analyzer shows: http://img607.imageshack.us/img607/5074/fn1.png Then I run this query:

fq=type:Site sort=score desc q=https\\:\\/\\/plus.google.com\\/111950520904110959061\\/profile fl=* score qf=url_words_ngram defType=edismax start=0 rows=20 mm=1

and get no results. These queries do match: 1. https://plus.google 2. https://plus.google.com 3. 11195052090 And these do not: 1. https://plus.google.com/111950520904110959061/profile 2. 111950520904110959061/profile 3. 111950520904110959061 The reason is that 111950520904110959061 is 21 characters long while I have the max gram size set to 20. I tried increasing the max gram size to 200 and it works, but is there any way to match the given query without doing that? The query analyzer shows there are exact matches at PT, SF and LCF - or does it work in such a way that the index contains only the output of the last filter factory (ENGTF in my example)? If so, is there an option to preserve the original tokens as well? So that for maxGramSize=5 and the indexed string awesomeness I'd have: a, aw, awe, awes, aweso, awesomeness Best, Alex
Re: How to patch Solr4.2 for SolrEnityProcessor Sub-Enity issue
This is fixed in trunk and branch_4x and will be available in the next release (4.5). See https://issues.apache.org/jira/browse/SOLR-5190 On Mon, Aug 26, 2013 at 12:37 PM, harshchawla ha...@livecareer.com wrote: Thanks a lot in advance. I am eagerly waiting for your response. -- Regards, Shalin Shekhar Mangar.
Re: How to patch Solr4.2 for SolrEnityProcessor Sub-Enity issue
Thanks a lot for this fix. I am now eagerly waiting for Solr 4.5.
Re: Multiple replicas for specific shard
http://wiki.apache.org/solr/SolrCloud#Creating_cores_via_CoreAdmin. Essentially you create a core on a new machine and assign it a collection and shard. It'll register itself, replicate the data from the leader and join the cluster automatically. You could script this too, but be aware that the replication may take quite a while depending on the network speed and the size of your index. Best Erick On Wed, Aug 28, 2013 at 4:09 AM, maephisto my_sky...@yahoo.com wrote: [...]
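For illustration, the CoreAdmin call for the football example might look like this, repeated once per new machine (host, core and collection names are assumptions):

http://newmachine1:8983/solr/admin/cores?action=CREATE&name=football_replica3&collection=sports&shard=football

The new core registers in the cluster state straight away and starts serving once its initial replication from the shard leader completes.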
Solr 4.0 - Fuzzy query and Proximity query
Hi, with Solr 4.0 the fuzzy query syntax is like keyword~1 (or 2), and proximity search is like value~20. How does Solr differentiate between the two searches? My thought was that proximity would be on phrases and fuzzy on individual words. Is that correct? I wanted to do a proximity search on a text field and gave the query below: ip:port/collection1/select?q=trinity%20service~50&debugQuery=yes It gives me these results:

<result name="response" numFound="111" start="0" maxScore="4.1237307">
  <doc><str name="business_name">*Trinidad* Services</str></doc>
  <doc><str name="business_name">Trinity Services</str></doc>
  <doc><str name="business_name">Trinity Services</str></doc>
  <doc><str name="business_name">*Trinitee* Service</str></doc>

How do I differentiate between fuzzy and proximity? Thanks, Prasi
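For reference, the query parser tells the two apart by what the ~ is attached to; a sketch with illustrative field and terms:

business_name:trinity~2               fuzzy: bare term, maximum edit distance 2
business_name:"trinity service"~50    proximity: quoted phrase, slop of 50

An unquoted trinity service~50 is therefore not a proximity query at all - the ~ attaches only to the last term.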
Re: Help to figure out why query does not match
Hmmm. Certainly only the outputs of the last filter make it into the index - consider stopwords being the last filter: you'd expect the stopwords to be removed. There's nothing that I know of that'll do what you're asking; the code for ENGTF doesn't have any preserve original that I see. This seems like a useful addition though, and you've done a nice job of characterizing the problem. Want to raise a JIRA and/or do a patch? I'd guess your only real short-term workaround would be to increase the max gram size. I suppose you could do a copyField into a field that doesn't do the n-gramming and search against that too, but that feels kind of kludgy... Best, Erick On Wed, Aug 28, 2013 at 7:16 AM, heaven aheave...@gmail.com wrote: [...]
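For illustration, that workaround could be sketched in schema.xml as follows, where words_plain is assumed to be the same analysis chain as words_ngram minus the EdgeNGramFilterFactory, and url stands for whatever source field feeds url_words_ngram:

<field name="url_words_plain" type="words_plain" indexed="true" stored="false" />
<copyField source="url" dest="url_words_plain" />

Queries would then use qf=url_words_ngram url_words_plain so either field can produce the match.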
Re: Solr 4.0 - Fuzzy query and Proximity query
The first thing I'd recommend is to look at the admin/analysis page. I suspect you aren't seeing fuzzy query results at all; what you're seeing is the result of stemming. Stemming is algorithmic, so it sometimes produces very surprising results, i.e. Trinidad and Trinitee may stem to something like triniti. But you didn't provide the field definition, so it's just a guess. Best Erick On Wed, Aug 28, 2013 at 7:43 AM, Prasi S prasi1...@gmail.com wrote: [...]
Data Centre recovery/replication, does this seem plausible?
We have 2 separate data centers in our organisation, and in order to maintain the ZK quorum during any DC outage, we have 2 separate Solr clouds, one in each DC with separate ZK ensembles, but both fed with the same indexing data. Now in the event of a DC outage, all our Solr instances go down, and when they come back up, we need some way to recover the lost data. Our thought was to replicate from the working DC, but is there a way to do that whilst still maintaining an online presence for indexing purposes? In essence, we want to do what happens within SolrCloud's own recovery, so (as I understand cloud recovery, and assuming the worst case where peer sync has failed) a node starts up, buffers all updates into the transaction log, replicates from the leader, and replays the transaction log to get everything in sync. Is it conceivable to do the same by extending Solr, so that on the activation of some handler (user triggered), we initiate a replication from the other DC, which puts all the leaders into buffering updates, replicates from some other set of servers and then replays? Our goal is to minimize the downtime (beyond the initial outage), so we would ideally like to be able to start up indexing before this replicate/clone has finished; that's why I thought to enable buffering on the transaction log. Searches shouldn't be sent here, but if they are we have a valid (albeit old) index to serve them until the new one swaps in. Just curious how any other DC-aware setups handle this kind of scenario? Or other concerns/issues with this type of approach.
Re: Solr 4.0 - Fuzzy query and Proximity query
hi Erick, Yes, that's correct. These results are because of stemming + phonetic matching. Below is the analysis:

Index time
ST trinity services
SF trinity services
LCF trinity services
SF trinity services
SF trinity services
WDF trinity services

Query time
SF triniti servic
PF TRNT triniti SRFK servic
HWF TRNT triniti SRFK servic
PSF TRNT triniti SRFK servic

Apart from this, fuzzy would be for individual words and proximity would be for phrases - is this correct? Also, can we have fuzzy on phrases? On Wed, Aug 28, 2013 at 5:36 PM, Erick Erickson erickerick...@gmail.com wrote: [...]
Re: Solr 4.0 - Fuzzy query and Proximity query
Sorry, I copied it wrong. Below is the correct analysis.

Index time
ST trinity services
SF trinity services
LCF trinity services
SF trinity services
SF trinity services
WDF trinity services
SF triniti servic
PF TRNT triniti SRFK servic
HWF TRNT triniti SRFK servic
PSF TRNT triniti SRFK servic

Query time
ST trinity services
SF trinity services
LCF trinity services
WDF trinity services
SF triniti servic
PSF triniti servic
PF TRNT triniti SRFK servic

Apart from this, fuzzy would be for individual words and proximity would be for phrases - is this correct? Also, can we have fuzzy on phrases? On Wed, Aug 28, 2013 at 5:58 PM, Prasi S prasi1...@gmail.com wrote: [...]
Re: Solr 4.0 - Fuzzy query and Proximity query
No. ComplexPhraseQuery has been around for quite a while but never incorporated into the code base; it's pretty much what you need to do both fuzzy and phrase at once. But doesn't phonetic really incorporate at least a flavor of fuzzy? Is it close enough for your needs to just do phonetic matches? Best Erick On Wed, Aug 28, 2013 at 8:31 AM, Prasi S prasi1...@gmail.com wrote: [...]
Re: Data Centre recovery/replication, does this seem plausible?
The separate DC problem has been lurking for a while. But your understanding is a little off. When a replica discovers that it's too far out of date, it does an old-style replication; IOW, the tlog doesn't contain the entire delta. Eventually the old-style replications catch up to close enough, and _then_ the remaining docs in the tlog are replayed. The target number of updates in the tlog is 100, so it's a pretty small window that's actually replayed in the normal case. None of which helps your problem. The simplest way (on the expectation that DC outages are pretty rare!) would be to have your indexing process fire the missed updates at the DC after it came back up. Copying from one DC to another is tricky: you'd have to be very, very sure that you copied indexes to the right shard. Ditto for any process that tried to have, say, a single node from the recovering DC temporarily join the good DC, at least long enough to sync. Not a pretty problem; we don't really have any best practices yet that I know of. FWIW, Erick On Wed, Aug 28, 2013 at 8:13 AM, Daniel Collins danwcoll...@gmail.com wrote: [...]
NPE during distributed search
Solr 4.3.1
container: jetty 9 (jetty-distribution-9.0.4.v20130625)
shard sizes: between 10G and 15G
two cores per shard, non-SolrCloud mode

We have a frontend Solr and several shards. When searching in a smaller number of shards, the query runs OK. When asking for a larger number of shards, the query fails with an NPE. Looking into the corresponding code, we see the score comparison:

class: ShardDoc.java
method: static Comparator comparatorScore(final String fieldName)
code: final float f1 = e1.score;

It looks like e1 is null. What could be the reason? Is it at all possible to remove scoring altogether (because we don't need it)? What else should we look into? NPE stack trace:

ERROR org.apache.solr.servlet.SolrDispatchFilter - null:java.lang.NullPointerException
at org.apache.solr.handler.component.ShardFieldSortedHitQueue$1.compare(ShardDoc.java:234)
at org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardDoc.java:159)
at org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardDoc.java:101)
at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:231)
at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:140)
at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:156)
at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:863)
at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:625)
at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:604)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:311)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1094)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1028)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:317)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:445)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:267)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:224)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
at java.lang.Thread.run(Thread.java:722)
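For what it's worth, comparatorScore is only used when the merge is ordered by score; sorting explicitly on a field routes the merge through the field comparators instead. A hedged example, with the sort field name as an assumption:

...&q=*:*&sort=timestamp desc&fl=id,title

That sidesteps rather than explains the NPE - some shard apparently returned a hit without a score - but it matches the "we don't need scoring" requirement.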
how to sum a field grouping by more fields
Hello, can somebody tell me if Solr 4.4.0 supports *stats.pivot*, in order to sum a field grouped by several other fields? Are there other methods to sum a field grouped by more than one field? Thanks
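For what it's worth, the 4.x StatsComponent can at least break a sum down by one field at a time via stats.facet; a hedged sketch, with the field names (amount, category, region) as assumptions:

q=*:*&rows=0&stats=true&stats.field=amount&stats.facet=category&stats.facet=region

Each stats.facet yields an independent per-value breakdown (sum, count, etc.), not a cross-pivot of the two fields.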
Group/distinct
Hi I have a set of collections containing documents with the fields a, b and timestamp. There are A LOT of documents; many of them have the same value of a, and for each value of a there is only a very limited set of distinct values of b. The timestamp values are different for (almost) all documents. Can I make a group/distinct query to Solr returning all distinct values of a where timestamp is within a certain period of time? If yes, how? I guess this is just using grouping or faceting, but what is the difference, and which one is best? Do any of them require that the fields have been prepared for grouping/faceting by setting something up in the schema? Can I make a query to Solr returning all distinct values of a where timestamp is within a certain period of time, and also, for each distinct a, have the limited set of distinct b-values returned? I guess this would mean grouping/faceting on multiple fields, but can you do that? Other suggestions on how to achieve this? Regards, Per Steffensen
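For illustration, the two approaches might look like this (the date range is an assumption):

q=*:*&rows=0&fq=timestamp:[2013-08-01T00:00:00Z TO 2013-08-28T00:00:00Z]&facet=true&facet.field=a&facet.limit=-1&facet.mincount=1

q=*:*&fq=timestamp:[2013-08-01T00:00:00Z TO 2013-08-28T00:00:00Z]&group=true&group.field=a&group.limit=5&fl=a,b

The first returns only value/count pairs for a; the second returns one group per distinct a with up to 5 documents each, so the b values come along too. Neither needs special schema preparation beyond the field being indexed (and single-valued, in the case of group.field).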
Re: Data Centre recovery/replication, does this seem plausible?
I've been thinking about this one too and was curious about using the Solr entity support in the DIH to do the import from one DC to another (for the lost docs). In my mind, one configures the DIH to use the SolrEntityProcessor with a query to capture the docs in the DC that stayed online, most likely using a timestamp in the query (see: http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor). Would that work? If so, any downsides? I've only used DIH / SolrEntityProcessor to populate a staging/dev environment from prod, but have had good success with it. Thanks. Tim On Wed, Aug 28, 2013 at 6:59 AM, Erick Erickson erickerick...@gmail.com wrote: [...]
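For illustration, such a DIH config might look like the following (host name, collection and the timestamp field are assumptions):

<dataConfig>
  <document>
    <entity name="recovered" processor="SolrEntityProcessor"
            url="http://good-dc-host:8983/solr/collection1"
            query="timestamp:[2013-08-28T00:00:00Z TO *]"
            rows="500" fl="*" />
  </document>
</dataConfig>

As noted in the next reply, this only round-trips stored fields, so every indexed field would need to be stored (or derivable via copyField) in the source collection.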
Re: Data Centre recovery/replication, does this seem plausible?
If you can satisfy this statement then it seems possible (this is the same restriction as for atomic updates): The SolrEntityProcessor can only copy fields that are stored in the source index. On Wed, Aug 28, 2013 at 9:41 AM, Timothy Potter thelabd...@gmail.com wrote: [...]
RE: Data Centre recovery/replication, does this seem plausible?
Hi - you're going to miss unstored but indexed fields. We stop any indexing process, kill the servlets on the down DC, copy over the files using scp, then remove the lock file and start it up again. It always works, but it's a manual process at this point; it should be easy to automate with some simple bash scripting. -Original message- From: Timothy Potter thelabd...@gmail.com Sent: Wednesday 28th August 2013 15:41 To: solr-user@lucene.apache.org Subject: Re: Data Centre recovery/replication, does this seem plausible? [...]
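A minimal sketch of what that automation could look like - hosts, paths and the service name are assumptions, a real script would loop over all cores, and rsync stands in here for plain scp:

#!/bin/bash
GOOD_HOST=solr-good-dc
INDEX_DIR=/opt/solr/collection1/data/index

service jetty stop                                         # kill the servlet container on the recovered DC
rsync -a --delete "$GOOD_HOST:$INDEX_DIR/" "$INDEX_DIR/"   # copy the index files over from the good DC
rm -f "$INDEX_DIR/write.lock"                              # remove the stale lock file
service jetty start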
RE: SOLR 4.2.1 - High Resident Memory Usage
Hi - it's certainly not a rule of thumb, but RES usually grows higher than Xmx, so keep an eye on it. -Original message- From: vsilgalis vsilga...@gmail.com Sent: Wednesday 28th August 2013 2:53 To: solr-user@lucene.apache.org Subject: Re: SOLR 4.2.1 - High Resident Memory Usage http://lucene.472066.n3.nabble.com/file/n4086923/huge.png That doesn't seem to be a problem. Markus, are you saying that I should plan on resident memory being at least double my heap size? I haven't run into issues around this before, but then again I don't know everything. Is this a rule of thumb, or is there documentation I can look at? Thanks again.
Re: Multiple replicas for specific shard
Thanks Erick, I think this answers my question.
Re: Data Centre recovery/replication, does this seem plausible?
On 8/28/2013 6:13 AM, Daniel Collins wrote: [...] One way which would work (if your core name structures were identical between the two clouds) would be to shut down your indexing process, shut down the cloud that went down and has now come back up, and rsync from the good cloud. Depending on the index size that could take a long time, and index updates would be turned off while it's happening, which makes this idea less than ideal. I have a similar setup on a sharded index that's NOT using SolrCloud, with both copies in one location instead of two separate data centers. My general indexing method would work for your setup, though. The way that I handle this is that my indexing program tracks its update position for each copy of the index independently. If one copy is down, the tracked position for that index won't get updated, so the next time it comes up, all missed updates will get done for that copy. In the meantime, the program (Java, using SolrJ) happily uses a separate thread to continue updating the index copy that's still up. Thanks, Shawn
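As a minimal sketch of that pattern (not Shawn's actual code - the update source is stubbed out, and the in-memory map stands in for a persisted position file), each copy keeps its own cursor, which only advances when its updates succeed:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DualCopyIndexer {

  // One indexing cursor per index copy, tracked independently.
  private final Map<String, Long> positions = new ConcurrentHashMap<String, Long>();

  /** One indexing pass for a single copy; each copy runs this in its own thread. */
  void runOnce(String solrUrl, long latestUpdateId) {
    Long tracked = positions.get(solrUrl);
    long from = (tracked == null) ? 0L : tracked;
    try {
      sendUpdates(solrUrl, from, latestUpdateId);  // SolrJ add/delete calls
      positions.put(solrUrl, latestUpdateId);      // advance the cursor only on success
    } catch (Exception e) {
      // This copy is down: leave its cursor alone so the missed updates
      // are replayed automatically once it comes back up.
    }
  }

  private void sendUpdates(String url, long from, long to) throws Exception {
    // fetch rows (from, to] from the source database and push them via SolrJ
  }
}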
Re: Solr 4.0 - Fuzzy query and Proximity query
Mixing fuzzy with phonetic can give bizarre matches; I worked on a search engine that did that. You really don't want to mix stemming, phonetic, and fuzzy. They are distinct transformations of the surface word that do different things. Stemming: conflate different inflections of the same word, like car and cars. Phonetic: conflate words that sound similar, like moody and mudie. Fuzzy: conflate words with different spellings or misspellings, like smith, smyth, and smit. If you want all of these, make three fields with separate transformations. wunder On Aug 28, 2013, at 5:46 AM, Erick Erickson wrote: [...] -- Walter Underwood wun...@wunderwood.org
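For illustration, the three-field layout could be sketched like this in schema.xml - the field and type names are assumptions, each type carrying only its own transformation (a stemmer, a phonetic filter, or plain lowercased tokens for fuzzy matching):

<field name="name_stemmed" type="text_stemmed" indexed="true" stored="false" />
<field name="name_phonetic" type="text_phonetic" indexed="true" stored="false" />
<field name="name_exactish" type="text_general" indexed="true" stored="false" />
<copyField source="business_name" dest="name_stemmed" />
<copyField source="business_name" dest="name_phonetic" />
<copyField source="business_name" dest="name_exactish" />

Fuzzy queries would then be aimed at name_exactish, where no rewriting interferes with the edit-distance match.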
Re: ICUTokenizer class not found with Solr 4.4
Thanks Shawn and Naomi, I think I am running into the same bug, but the symptoms are a bit different, and I'm wondering if it makes sense to file a separate linked bug report. "The workaround is to remove sharedLib from solr.xml" - but the solr.xml that comes out-of-the-box does not have a sharedLib. I am using Solr 4.4 out-of-the-box, with the exception that I set up a lib directory in example/solr/collection1. I did not change solr.xml from the out-of-the-box version; there is no mention of lib in the out-of-the-box example/solr/solr.xml. I did not change the out-of-the-box solrconfig.xml either. According to the README.txt, all that needs to be done is create the collection1/lib directory and put the jars there. However, I am getting the class-not-found error. Should I open another bug report or comment on the existing report? Tom On Tue, Aug 27, 2013 at 6:48 PM, Shawn Heisey s...@elyograg.org wrote: On 8/27/2013 4:29 PM, Tom Burton-West wrote: According to the README.txt in solr-4.4.0/solr/example/solr/collection1, all we have to do is create a collection1/lib directory and put whatever jars we want in there: "lib. If it exists, Solr will load any Jars found in this directory and use them to resolve any plugins specified in your solrconfig.xml or schema.xml". I did so (see below). However, I keep getting a class-not-found error (see below). Has the default changed from what is documented in the README.txt file? Is there something I have to change in solrconfig.xml or solr.xml to make this work? I looked at SOLR-4852, but don't understand. It sounds like maybe there is a problem if the collection1/lib directory is also specified in solrconfig.xml. But I didn't do that (i.e. out-of-the-box solrconfig.xml). Does this mean that by following what it says in the README.txt, I am making some kind of configuration error? I also don't understand the workaround in SOLR-4852. That's my bug! :) If you have sharedLib set to lib (or explicitly the lib directory under solr.solr.home) in solr.xml, then ICUTokenizer cannot be found despite the fact that all the correct jars are there. The workaround is to remove sharedLib from solr.xml, or set it to some other directory that either doesn't exist or has no jars in it. The ${solr.solr.home}/lib directory is automatically added to the classpath regardless of config; there seems to be some kind of classloading bug when the sharedLib adds the same directory again. This all worked fine in 3.x and early 4.x releases, but due to classloader changes it seems to have broken. I think (based on the issue description) that it started being a problem with 4.3-SNAPSHOT. The same thing happens if you set sharedLib to foo and put some of your jars in lib and some in foo. It's quite mystifying. Thanks, Shawn
Re: ICUTokenizer class not found with Solr 4.4
My point in the previous e-mail was that following the instructions in the documentation does not seem to work. The workaround I found was to simply change the name of the collection1/lib directory to collection1/foobar and then include it in solrconfig.xml: <lib dir="./foobar" /> This works, but it does not explain why, out-of-the-box, simply creating a collection1/lib directory and putting the jars there does not work as documented in both the README.txt and solrconfig.xml. Shawn, should I add these comments to your JIRA issue, or should I open a separate related JIRA issue? Tom On Tue, Aug 27, 2013 at 7:18 PM, Shawn Heisey s...@elyograg.org wrote: On 8/27/2013 5:11 PM, Naomi Dushay wrote: Perhaps you are missing the following from your solrconfig: <lib dir="/home/blacklight/solr-home/lib" /> I ran into this issue (I'm the one that filed SOLR-4852) and I am not using Blacklight. I am only using what can be found in a Solr download, plus the MySQL JDBC driver for dataimport. I prefer not to load jars via solrconfig.xml. I have a lot of cores and every core needs to use the same jars. Rather than have the same jars loaded 18 times (once by each of the 18 solrconfig.xml files), I would rather have Solr load them once and make the libraries available to all cores. Using ${solr.solr.home}/lib accomplishes this goal. Thanks, Shawn
Re: ICUTokenizer class not found with Solr 4.4
On 8/28/2013 9:34 AM, Tom Burton-West wrote: I think I am running into the same bug, but the symptoms are a bit different. I'm wondering if it makes sense to file a separate linked bug report. The workaround is to remove sharedLib from solr.xml, The solr.xml that comes out-of-the-box does not have a sharedLib. I am using Solr 4.4. out-of-the-box, with the exception that I set up a lib directory in example/solr/collection1. I did not change solr.xml from the out-of-the-box. There is no mention of lib in the out-of-the-box example/solr/solr.xml. I did not change the out-of-the-box solrconfig.xml. According to the README.txt, all that needs to be done is create the collection1/lib directory and put the jars there. However, I am getting the class not found error. Should I open another bug report or comment on the existing report? I have never heard of using ${instanceDir}/lib for jars. That doesn't mean it won't work, but I have never seen it mentioned anywhere. I have only ever put the lib directory in solr.home, where solr.xml is. Did you try that? If you have seen documentation for collection1/lib, then there may be a doc bug, another dimension to the bug already filed, or a new bug. Do you see log entries saying your jars in collection/lib are loaded? If you do, then I think it's probably another dimension to the existing bug. Thanks, Shawn
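To make the distinction concrete, a sketch of the layout Shawn means (paths from the stock 4.4 example; only the lib/ directory is added):

example/solr/                <-- solr home, contains solr.xml
example/solr/lib/            <-- jars here are loaded automatically for all cores
example/solr/collection1/    <-- instanceDir; conf/ and data/ live here

Whether collection1/lib (inside the instanceDir) should also work is exactly the open question in this thread.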
Re: Data Centre recovery/replication, does this seem plausible?
Thanks Shawn/Erick for the suggestions. Unfortunately stopping indexing whilst we recover isn't a viable option; we are using Solr as an NRT search platform, so indexing must continue at least on the DC that is fine. If we could stop indexing on the broken DC, then recovery is relatively straightforward: it's an rsync/copy of a snapshot from the other data center followed by restarting indexing. The million dollar question is how to start up our existing Solr instances (once the data center has recovered from whatever broke it), realize that we have a gap in indexing (using a checkpointing mechanism similar to what Shawn describes), and recover from that (that's the tricky bit!), without having to interrupt indexing... I know that replication takes up to an hour (it's a rather large collection, but it's split into 8 shards currently, and we can replicate each shard in parallel). What ideally I would like to do is, at the point that I kick off recovery, divert the indexing feed for the broken DC into a transaction log on those machines, run the replication and swap the index in, then replay the transaction log to bring it all up to date. That process (conceptually) is the same as the org.apache.solr.cloud.RecoveryStrategy code. Yes, if I could divert that feed at the application level, then I can do what you suggest, but it feels like more work to do that (and build an external transaction log), whereas the code seems to already be in Solr itself; I just need to hook it all up (famous last words!) Our indexing pipeline does a lot of pre-processing work (it's not just pulling data from a database), and since we are only talking about the time taken to do the replication (should be an hour or less), it feels like we ought to be able to store that in a Solr transaction log (i.e. at the last point in the indexing pipeline). The plan would be to recover the leaders (1 of each shard) this way, and then use conventional replication/recovery to deal with the local replicas (blank their data area and then they will automatically sync from the local leader). On 28 August 2013 15:26, Shawn Heisey s...@elyograg.org wrote: On 8/28/2013 6:13 AM, Daniel Collins wrote: We have 2 separate data centers in our organisation, and in order to maintain the ZK quorum during any DC outage, we have 2 separate Solr clouds, one in each DC with separate ZK ensembles, but both are fed with the same indexing data. Now in the event of a DC outage, all our Solr instances go down, and when they come back up, we need some way to recover the lost data. Our thought was to replicate from the working DC, but is there a way to do that whilst still maintaining an online presence for indexing purposes? One way which would work (if your core name structures were identical between the two clouds) would be to shut down your indexing process, shut down the cloud that went down and has now come back up, and rsync from the good cloud. Depending on the index size, that could take a long time, and the index updates would be turned off while it's happening. That makes this idea less than ideal. I have a similar setup on a sharded index that's NOT using SolrCloud, and both copies are in one location instead of two separate data centers. My general indexing method would work for your setup, though. The way that I handle this is that my indexing program tracks its update position for each copy of the index independently.
If one copy is down, the tracked position for that index won't get updated, so the next time it comes up, all missed updates will get done for that copy. In the meantime, the program (Java, using SolrJ) is happily using a separate thread to continue updating the index copy that's still up. Thanks, Shawn
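As a rough illustration of the per-copy position tracking Shawn describes (a sketch only: Checkpoint and DocSource are hypothetical application-side abstractions, not Solr classes; real code would persist the checkpoint somewhere durable):

import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical application interfaces, invented for this sketch.
interface Checkpoint { long position(); void advanceTo(long p); }
interface DocSource { List<SolrInputDocument> fetchSince(long p); long latestPosition(); }

class CheckpointedUpdater {
    // Push everything a given index copy has missed; a copy that was down
    // keeps its old checkpoint and simply catches up on the next call.
    void push(SolrServer solr, Checkpoint cp, DocSource source) throws Exception {
        List<SolrInputDocument> docs = source.fetchSince(cp.position());
        if (docs.isEmpty()) return;
        solr.add(docs);
        solr.commit();
        cp.advanceTo(source.latestPosition()); // advance only after a successful commit
    }
}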
Question about SOLR-5017 - Allow sharding based on the value of a field
Hi, I'm looking into allowing query joins in SolrCloud. This has the limitation of having to index all the documents that are joinable together to the same shard. I'm wondering if SOLR-5017 https://issues.apache.org/jira/browse/SOLR-5017 would give me the ability to do so without implementing my own routing mechanism? If I add a field named parent_id and give that field the same value in all the documents that I want to join, it seems, theoretically, that it will be enough. Am I correct? Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Question-about-SOLR-5017-Allow-sharding-based-on-the-value-of-a-field-tp4087050.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Data Centre recovery/replication, does this seem plausible?
On 8/28/2013 10:48 AM, Daniel Collins wrote: What ideally I would like to do is, at the point that I kick off recovery, divert the indexing feed for the broken DC into a transaction log on those machines, run the replication and swap the index in, then replay the transaction log to bring it all up to date. That process (conceptually) is the same as the org.apache.solr.cloud.RecoveryStrategy code. I don't think any such mechanism exists currently. It would be extremely awesome if it did. If there's not an existing Jira issue, I recommend that you file one. Being able to set up a multi-datacenter cloud with automatic recovery would be awesome. Even if it took a long time, having it be fully automated would be exceptionally useful. Yes, if I could divert that feed at the application level, then I can do what you suggest, but it feels like more work to do that (and build an external transaction log), whereas the code seems to already be in Solr itself; I just need to hook it all up (famous last words!) Our indexing pipeline does a lot of pre-processing work (it's not just pulling data from a database), and since we are only talking about the time taken to do the replication (should be an hour or less), it feels like we ought to be able to store that in a Solr transaction log (i.e. at the last point in the indexing pipeline). I think it would have to be a separate transaction log. One problem with really big regular tlogs is that when Solr gets restarted, the entire transaction log that's currently on the disk gets replayed. If it were big enough to recover the last several hours to a duplicate cloud, it would take forever to replay on Solr restart. If the regular tlog were kept small but a second log with the last 24 hours were available, it could replay updates when the second cloud came back up. I do import from a database, so the application-level tracking works really well for me. Thanks, Shawn
Re: Question about SOLR-5017 - Allow sharding based on the value of a field
I don't know about SOLR-5017, but why don't you want to use parent_id as a shard key? So if you've got a doc with a key of abc123 and a parent_id of 456, just use a key of 456!abc123 and all docs with the same parent_id will go to the same shard. We're doing something similar and limiting queries to the single shard that hosts the relevant docs by setting shard.keys=456! on queries. -Greg On Wed, Aug 28, 2013 at 10:04 AM, adfel70 adfe...@gmail.com wrote: Hi I'm looking into allowing query joins in solr cloud. This has the limitation of having to index all the documents that are joineable together to the same shard. I'm wondering if SOLR-5017 https://issues.apache.org/jira/browse/SOLR-5017 would give me the ability to do so without implementing my own routing mechanism? If I add a field named parent_id and give that field the same value in all the documents that I want to join, it seems, theoretically, that it will be enough. Am I correct? Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Question-about-SOLR-5017-Allow-sharding-based-on-the-value-of-a-field-tp4087050.html Sent from the Solr - User mailing list archive at Nabble.com.
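To spell out Greg's scheme (the ids and field name here are made up): index the document as id=456!abc123 instead of id=abc123. The composite-id router hashes only the 456 prefix, so every document sharing parent_id 456 lands on the same shard. At query time, the shard.keys parameter restricts the request to that shard, e.g.:

/select?q=some_field:foo&shard.keys=456!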
SolrCloud Set up
What is the recommended way to set up Solr so it's HA and fault tolerant? I'm assuming it would be the SolrCloud set up. I'm guessing that Example C (http://wiki.apache.org/solr/SolrCloud) would be the optimum set up. If so, would one set up a load balancer (like f5 or whatever) to direct requests to the Zookeeper instances? Any issues that any of you have run into when setting this up? Suggestions, tips, tricks? -- Jared Griffith Linux Administrator, PICS Auditing, LLC P: (949) 936-4574 C: (909) 653-7814 http://www.picsauditing.com 17701 Cowan #140 | Irvine, CA | 92614 Join PICS on LinkedIn and Twitter! https://twitter.com/PICSAuditingLLC
Re: Storing query results
You could copy the existing core to a new core every once in a while, and then do your delta indexing into the new core once the copy is complete. If a Persistent URL for the search results included the name of the original core, the results you would get from a bookmark would be stable. However, if you went to the site and did a new search, you would be searching the newest core. This I think applies whether the site is Intranet or not. Older cores could be aged out gracefully, and the search handler for an old core could be replaced by a search on the new core via sharding. On Fri, Aug 23, 2013 at 11:57 AM, jfeist jfe...@llminc.com wrote: I completely agree. I would prefer to just rerun the search each time. However, we are going to be replacing our rdb-based search with something like Solr, and the application currently behaves this way. Our users understand that the search is essentially a snapshot (and I would guess many prefer this over changing results) and we don't want to change existing behavior and confuse anyone. Also, my boss told me it unequivocally has to be this way :p Thanks for your input though, looks like I'm going to have to do something like you've suggested within our application. -- View this message in context: http://lucene.472066.n3.nabble.com/Storing-query-results-tp4086182p4086349.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to Manage RAM Usage at Heavy Indexing
This could be an operating systems problem rather than a Solr problem. CentOS 6.4 (linux kernel 2.6.32) may have some issues with page flushing and I would read-up up on that. The VM parameters can be tuned in /etc/sysctl.conf On Sun, Aug 25, 2013 at 4:23 PM, Furkan KAMACI furkankam...@gmail.comwrote: Hi Erick; I wanted to get a quick answer that's why I asked my question as that way. Error is as follows: INFO - 2013-08-21 22:01:30.978; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/solr path=/update params={wt=javabinversion=2} {add=[com.deviantart.reachmeh ere:http/gallery/, com.deviantart.reachstereo:http/, com.deviantart.reachstereo:http/art/SE-mods-313298903, com.deviantart.reachtheclouds:http/, com.deviantart.reachthegoddess:http/, co m.deviantart.reachthegoddess:http/art/retouched-160219962, com.deviantart.reachthegoddess:http/badges/, com.deviantart.reachthegoddess:http/favourites/, com.deviantart.reachthetop:http/ art/Blue-Jean-Baby-82204657 (1444006227844530177), com.deviantart.reachurdreams:http/, ... (163 adds)]} 0 38790 ERROR - 2013-08-21 22:01:30.979; org.apache.solr.common.SolrException; java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException] early EOF at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18) at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731) at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657) at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809) at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393) at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:245) at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173) at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1812) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:365) at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:948) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:722) Caused by: org.eclipse.jetty.io.EofException: early EOF at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:65) at java.io.InputStream.read(InputStream.java:101) at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:365) at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:110) at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101) at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84) at
Re: More on topic of Meta-search/Federated Search with Solr
On Tue, Aug 27, 2013 at 2:03 AM, Paul Libbrecht p...@hoplahup.net wrote: Dan, if you're bound to federated search then I would say that you need to work on the service guarantees of each of the nodes and, maybe, create strategies to cope with bad nodes. paul +1 I'll think on that.
Re: More on topic of Meta-search/Federated Search with Solr
On Tue, Aug 27, 2013 at 3:33 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Years ago, when Federated Search was a buzzword, we did some development and testing with Lucene, FAST Search, Google and several other search engines regarding Federated Search in a library context. The results can be found here http://pub.uni-bielefeld.de/download/2516631/2516644 Some minor parts are in German; most is written in English. It also gives you an idea of what to keep an eye on, where the pitfalls are, and so on. We also had a tool called unity (written in Python) which did Federated Search on any search engine and database, like Google, Gigablast, FAST, Lucene, ... The trick with Federated Search is to combine the results. We offered three options in the user's search interface: RoundRobin, Relevancy, and PseudoRandom. Thanks much - Andrzej B. suggested I read Comparing top-k lists in addition to his Berlin Buzzwords presentation. I will know soon whether we are set on this direction; right now I'm still trying to gauge how hard it will be.
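For what it's worth, the RoundRobin option Bernd mentions reduces to something like this sketch (types are placeholders; ties, de-duplication, and score normalization are deliberately ignored):

import java.util.ArrayList;
import java.util.List;

public class RoundRobinMerge {
    // Take result #0 from every engine, then result #1, and so on,
    // until every per-engine list is exhausted.
    public static <T> List<T> merge(List<List<T>> perEngine) {
        List<T> merged = new ArrayList<T>();
        for (int i = 0; ; i++) {
            boolean any = false;
            for (List<T> results : perEngine) {
                if (i < results.size()) { merged.add(results.get(i)); any = true; }
            }
            if (!any) return merged;
        }
    }
}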
Re: More on topic of Meta-search/Federated Search with Solr
On Mon, Aug 26, 2013 at 9:06 PM, Amit Jha shanuu@gmail.com wrote: Would you like to create something like http://knimbus.com I work at the National Library of Medicine. We are moving our library catalog to a newer platform, and we will probably include articles. Article content and metadata are available from a number of web-scale discovery services such as PRIMO, Summon, EBSCO's EDS, and EBSCO's traditional API. Most libraries use open source solutions to avoid the cost of purchasing an expensive enterprise search platform. We are big; we already have a closed-source enterprise search engine (and our own home-grown Entrez search used for PubMed). Since we can already do Federated Search with the above, I am evaluating the effort of adding such to Apache Solr. Because NLM data is used in the Open Relevance Project, we actually have the relevancy judgments to decide whether we have done a good job of it. I obviously think it would be fun to add Federated Search to Apache Solr. Standard disclosure: my opinions do not represent the opinions of NIH or NLM. Fun is no reason to spend tax-payer money. Enhancing Apache Solr would reduce the risk of putting all our eggs in one basket, and there may be some other relevant benefits. We do use Apache Solr here for more than one other project... so keep up the good work even if my working group decides to go with the closed-source solution.
Re: SolrCloud Set up
On 8/28/2013 11:56 AM, Jared Griffith wrote: What is the recommended way to set up Solr so it's HA and fault tolerant? I'm assuming it would be the SolrCloud set up. I'm guessing that Example C (http://wiki.apache.org/solr/SolrCloud) would be the optimum set up. If so, would one set up a load balancer (like f5 or whatever) to direct requests to the Zookeeper instances? Example C has everything on localhost. That's not really redundant. If you put example C on separate hosts, then it would very likely be redundant. You do not need (or want) a load balancer for zookeeper. If your Solr client code is not written in Java, you might want a load balancer for Solr, though. The java client (SolrJ, specifically the CloudSolrServer class) doesn't require a load balancer for HA. For a SolrCloud setup with HA, you need at least three separate physical hosts. A bare minimum setup has two capable servers that will each run one copy of Solr and one copy of Zookeeper. The third can be less capable and run zookeeper only. If you want to run Solr on all three, you certainly can. You can also add additional nodes for Solr. Additional zookeeper nodes are not required, but if you want them, be sure you have an odd number. You would download zookeeper and follow the instructions to create a three-node replicated setup: http://zookeeper.apache.org/doc/r3.4.5/zookeeperStarted.html#sc_RunningReplicatedZooKeeper For Solr, it's best if you run the latest version, currently 4.4.0. You can put your zkHost parameter (and other solrcloud parameters) in solr.xml. Your zkHost parameter should look like the following, where you use the correct port(s) and a value for the chroot (/mysolr1) that names your cloud: server1:2181,server2:2181,server3:2181/mysolr1 A note on the chroot functionality: By using a different chroot value for each one, you can use one zookeeper ensemble for more than one SolrCloud. SolrCloud doesn't put much load on zookeeper. If you have hundreds of Solr nodes that go up and down a lot, the load would be higher. It's my opinion that you should not use the numShards parameter on the commandline or in solr.xml, or use the startup options for bootstrapping a config. I think it's better to use the zkCli upconfig option to upload config sets to zookeeper, and specify the collection.configName, numShards, and replicationFactor via the Collections API CREATE action. If you want to go to the freenode IRC system (www.freenode.net) and join the #solr channel, you can get more interactive help. I have no problem sticking with the mailing list either. Thanks, Shawn
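For reference, the upload-then-create sequence Shawn describes looks roughly like this (hostnames, config and collection names are placeholders; zkcli.sh ships under example/cloud-scripts in the 4.x download):

cloud-scripts/zkcli.sh -zkhost server1:2181,server2:2181,server3:2181/mysolr1 -cmd upconfig -confdir /path/to/conf -confname myconf
curl 'http://server1:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&collection.configName=myconf'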
Re: Different Responses for 4.4 and 3.5 solr index
We've been seeing changes in our rankings as well. I don't have a definite answer yet, since we're waiting on an index rebuild, but our current working theory is that the change to default omitNorms=true for primitive types may have had an effect, possibly due to follow-on confusion: our developers may have omitted norms from some other fields they shouldn't have. -Mike On 08/26/2013 09:46 AM, Stefan Matheis wrote: Did you check the scoring? (use fl=*,score to retrieve it) .. additionally debugQuery=true might provide more information about how the score was calculated. - Stefan On Monday, August 26, 2013 at 12:46 AM, Kuchekar wrote: Hi, The response from 4.4 and 3.5 in the current scenario differs in the sequence in which results are given back to us. For example: Response from 3.5 solr is: id:A, id:B, id:C, id:D ... Response from 4.4 solr is: id:C, id:A, id:D, id:B... Looking forward to your reply. Thanks. Kuchekar, Nilesh On Sun, Aug 25, 2013 at 11:32 AM, Stefan Matheis matheis.ste...@gmail.com wrote: Kuchekar (hope that's your first name?) you didn't tell us how they differ. Do you get an actual error? Or does the result contain documents you didn't expect? Or the other way round, that some are missing you'd expect to be there? - Stefan On Sunday, August 25, 2013 at 4:43 PM, Kuchekar wrote: Hi, We get a different response when we query 4.4 and 3.5 solr using the same query params. My query params are as follows: facet=true facet.mincount=1 facet.limit=25 qf=content^0.0+p_last_name^500.0+p_first_name^50.0+strong_topic^0.0+first_author_topic^0.0+last_author_topic^0.0+title_topic^0.0 wt=javabin version=2 rows=10 f.affiliation_org.facet.limit=150 fl=p_id,p_first_name,p_last_name start=0 q=Apple facet.field=affiliation_org fq=table:profile fq=num_content:[*+TO+1500] fq=name:Apple The content in both (solr 4.4 and solr 3.5) is the same. The solrconfig.xml from 3.5 and 4.4 are similarly constructed. Is there something I am missing that might have been changed in 4.4, which might be causing this issue? The qf params look the same. Looking forward to your reply. Thanks. Kuchekar, Nilesh
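For reference, Stefan's suggestion amounts to a request like this (host and core are placeholders): http://localhost:8983/solr/collection1/select?q=Apple&fl=*,score&debugQuery=true -- the explain output in the debug section then shows exactly which scoring factors (tf, idf, fieldNorm, coord) differ between the 3.5 and 4.4 results for the same document.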
Re: SolrCloud Set up
We are using Java here. Are you saying that the Solr java client would be aware of the multiple zookeepers and would thus do health / host checks on each zookeeper instance in turn until it got one that is working (assuming that you have one or more zookeepers down)? If that's the case, holy awesome. I'll probably jump in IRC when I actually tackle this set up later on today. On Wed, Aug 28, 2013 at 11:36 AM, Shawn Heisey s...@elyograg.org wrote: On 8/28/2013 11:56 AM, Jared Griffith wrote: What is the recommended way to set up Solr so it's HA and fault tolerant? I'm assuming it would be the SolrCloud set up. I'm guessing that Example C (http://wiki.apache.org/solr/SolrCloud) would be the optimum set up. If so, would one set up a load balancer (like f5 or whatever) to direct requests to the Zookeeper instances? Example C has everything on localhost. That's not really redundant. If you put example C on separate hosts, then it would very likely be redundant. You do not need (or want) a load balancer for zookeeper. If your Solr client code is not written in Java, you might want a load balancer for Solr, though. The java client (SolrJ, specifically the CloudSolrServer class) doesn't require a load balancer for HA. For a SolrCloud setup with HA, you need at least three separate physical hosts. A bare minimum setup has two capable servers that will each run one copy of Solr and one copy of Zookeeper. The third can be less capable and run zookeeper only. If you want to run Solr on all three, you certainly can. You can also add additional nodes for Solr. Additional zookeeper nodes are not required, but if you want them, be sure you have an odd number. You would download zookeeper and follow the instructions to create a three-node replicated setup: http://zookeeper.apache.org/doc/r3.4.5/zookeeperStarted.html#sc_RunningReplicatedZooKeeper For Solr, it's best if you run the latest version, currently 4.4.0. You can put your zkHost parameter (and other solrcloud parameters) in solr.xml. Your zkHost parameter should look like the following, where you use the correct port(s) and a value for the chroot (/mysolr1) that names your cloud: server1:2181,server2:2181,server3:2181/mysolr1 A note on the chroot functionality: By using a different chroot value for each one, you can use one zookeeper ensemble for more than one SolrCloud. SolrCloud doesn't put much load on zookeeper. If you have hundreds of Solr nodes that go up and down a lot, the load would be higher. It's my opinion that you should not use the numShards parameter on the commandline or in solr.xml, or use the startup options for bootstrapping a config. I think it's better to use the zkCli upconfig option to upload config sets to zookeeper, and specify the collection.configName, numShards, and replicationFactor via the Collections API CREATE action. If you want to go to the freenode IRC system (www.freenode.net) and join the #solr channel, you can get more interactive help. I have no problem sticking with the mailing list either. Thanks, Shawn -- Jared Griffith Linux Administrator, PICS Auditing, LLC P: (949) 936-4574 C: (909) 653-7814 http://www.picsauditing.com 17701 Cowan #140 | Irvine, CA | 92614 Join PICS on LinkedIn and Twitter! https://twitter.com/PICSAuditingLLC
Re: SolrCloud Set up
On 8/28/2013 1:36 PM, Jared Griffith wrote: We are using Java here. Are you saying that the Solr java client would be aware of the multiple zookeepers and would thus do health / host checks on each zookeeper instance in turn until it got one that is working (assuming that you have one or more zookeepers down)? If that's the case, holy awesome. I'll probably jump in IRC when I actually tackle this set up later on today. Yes, the Java client is completely aware of the cloud state in realtime. When you create a CloudSolrServer object, you don't tell it where Solr is, you tell it where zookeeper is - using the same (potentially multi-host and including a chroot) zkHost parameter that you give to Solr. Thanks, Shawn
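A minimal SolrJ sketch of that (the zkHost string reuses Shawn's example; the collection name is a placeholder):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudClientExample {
    public static void main(String[] args) throws Exception {
        // The client is given zookeeper, not a Solr URL; it watches the
        // cluster state and routes requests to live nodes on its own.
        CloudSolrServer solr = new CloudSolrServer("server1:2181,server2:2181,server3:2181/mysolr1");
        solr.setDefaultCollection("mycollection"); // placeholder name
        QueryResponse rsp = solr.query(new SolrQuery("*:*"));
        System.out.println("hits: " + rsp.getResults().getNumFound());
        solr.shutdown();
    }
}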
purge and optimize questions for solr 4.4.0
We have a SolrCloud cluster (5 shards and 2 replicas) on 10 boxes with 500 million documents. We're using custom sharding, where we direct all documents with a specific business date to a specific shard. With Solr 3.6 we used this command to optimize documents on the master and then let replication take care of updating documents on slave1 and slave2: curl --proxy 'http://prod-solr-master.xyz.com:8983/solr/core1/update?optimize=true&waitFlush=false&maxSegments=1' How do we optimize documents for all shards in SolrCloud? Do we have to fire five different optimize commands to all five leaders? Also, it looks like optimize will be going away and might no longer be necessary - see SOLR-3141 (https://issues.apache.org/jira/browse/SOLR-3141). Is that true? With Solr 3.6 we purge millions of documents every month and then run optimize. We're planning to do the same with the SolrCloud setup. With Solr 3.6 we used the following curl command to purge documents. Now with multiple shards can we still use the same command? We will definitely experiment with our QA setup of 500 million documents. curl --proxy http://prod-solr-master.xyz.com:8983/solr/core1/update?commit=true -H 'Content-Type: text/xml' --data-binary '<delete><query>busdate_i:[* TO 20130208]</query></delete>' Thanks!
coordination factor in between query terms
How can I specify a coordination factor between query terms? E.g. q="termA termB" doc1 = { field: termA } doc2 = { field: termA termB termC termD } I want doc2 scored higher than doc1 -- Anirudha P. Jadhav
RE: coordination factor in between query terms
Just boost the term you want to show up higher in your results. http://wiki.apache.org/solr/SolrRelevancyCookbook#Boosting_Ranking_Terms - Greg -----Original Message----- From: anirudh...@gmail.com [mailto:anirudh...@gmail.com] On Behalf Of Anirudha Jadhav Sent: Wednesday, August 28, 2013 3:36 PM To: solr-user@lucene.apache.org Subject: coordination factor in between query terms How can I specify a coordination factor between query terms? E.g. q="termA termB" doc1 = { field: termA } doc2 = { field: termA termB termC termD } I want doc2 scored higher than doc1 -- Anirudha P. Jadhav
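For illustration, the term-boost syntax from that cookbook page looks like q=termA termB^4 (the boost value is arbitrary) -- matches on termB then count four times as much toward the score as matches on termA.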
Re: coordination factor in between query terms
I don't know which term to boost. I just need documents that contain both terms to be ranked higher, but since doc1 is shorter and has an exact match on the term, per tf-idf it is ranked higher. On Wed, Aug 28, 2013 at 4:47 PM, Greg Walters gwalt...@sherpaanalytics.com wrote: Just boost the term you want to show up higher in your results. http://wiki.apache.org/solr/SolrRelevancyCookbook#Boosting_Ranking_Terms - Greg -----Original Message----- From: anirudh...@gmail.com [mailto:anirudh...@gmail.com] On Behalf Of Anirudha Jadhav Sent: Wednesday, August 28, 2013 3:36 PM To: solr-user@lucene.apache.org Subject: coordination factor in between query terms How can I specify a coordination factor between query terms? E.g. q="termA termB" doc1 = { field: termA } doc2 = { field: termA termB termC termD } I want doc2 scored higher than doc1 -- Anirudha P. Jadhav -- Anirudha P. Jadhav
Re: coordination factor in between query terms
1) Coordination factor is controlled by the Similarity you have configured -- there is no request-time option to affect the coordination function. The default Similarity already includes a simple ratio coord factor... https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/similarities/DefaultSimilarity.html#coord%28int,%20int%29 2) your example query includes quote characters, which makes it a phrase query, not a simple boolean query, so in that case both termA and termB will be required, and must be within the default slop number of term positions away from each other. If you instead used a query param of: q=termA termB ... then you'd see the coord factor come into play 3) in addition to the coord factor is the issue of fieldNorms -- by default, text fields include a norm factor that takes into account the length of a field, so in spite of the coord factor a very short field (ie: doc1) might score higher than a long field (ie: doc2) even if the long field has more matches -- if you don't want this, just use omitNorms=true on your field. : How can I specify a coordination factor between query terms : eg. q="termA termB" : : doc1 = { field: termA } : doc2 = { field: termA termB termC termD } : : I want doc2 scored higher than doc1 : : -- : Anirudha P. Jadhav : -Hoss
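For example, disabling length normalization per Hoss's point 3 is a one-attribute schema.xml change (the field name is from the example; text_general is assumed to be the stock 4.x field type):

<field name="field" type="text_general" indexed="true" stored="true" omitNorms="true" />

Note this is an index-time setting: documents have to be reindexed before it takes effect.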
Re: ICUTokenizer class not found with Solr 4.4
Hi Shawn, I'm going to add this to your JIRA unless you think that it would be good to open another issue. The issue for me is that making a ./lib in the instanceDir is documented as working in several places and has worked in previous versions of Solr, for example Solr 4.1.0. If I make a ./lib directory in Solr Home, all works just fine. However, according to the documentation, making a ./lib directory in the instanceDir should work, and in fact in Solr 4.1.0 it works just fine. So the question for me is whether making a ./lib directory as documented in collection1/conf/solrconfig.xml and collection1/README.txt is supposed to work in Solr 4.4, but due to a bug it is not working. If it is not supposed to work, then the documentation needs fixing and some note needs to be made about upgrading from previous versions of Solr. Do you think I should open another JIRA and link it to yours, or just add this information (i.e. other scenarios where class loading is not working) to your JIRA? Details below. Tom The documentation in the collection1/conf directory is confusing. For example, the collection1/conf/solrconfig.xml file says you should put a ./lib dir in your instanceDir. (Am I correct that an instanceDir refers to the core?) On the other hand, the documentation in the collection1/README.txt is confusing about whether it is talking about the instanceDir or Solr Home. For example, in collection1/conf/solrconfig.xml there is this comment: If a ./lib directory exists in your instanceDir, all files found in it are included as if you had used the following syntax... <lib dir="./lib" /> Also in collection1/conf/README.txt it is suggested that you use ./lib, but that README.txt file needs editing, as it is very confusing about whether it is talking about Solr Home or the instance directory in the text excerpted below. I would assume that the conf and data directories have to be subdirectories of the instanceDir, since I assume they are set per core. So in the excerpt below, the discussion of the sub-directories should apply to the instanceDir, not Solr Home. Example SolrCore Instance Directory: This directory is provided as an example of what an Instance Directory should look like for a SolrCore. It's not strictly necessary that you copy all of the files in this directory when setting up a new SolrCore, but it is recommended. Basic Directory Structure: The Solr Home directory typically contains the following sub-directories: conf/ This directory is mandatory and must contain your solrconfig.xml and schema.xml. Any other optional configuration files would also be kept here. data/ This directory is the default location where Solr will keep your ... lib/ On Wed, Aug 28, 2013 at 12:11 PM, Shawn Heisey s...@elyograg.org wrote: On 8/28/2013 9:34 AM, Tom Burton-West wrote: I think I am running into the same bug, but the symptoms are a bit different. I'm wondering if it makes sense to file a separate linked bug report. The workaround is to remove sharedLib from solr.xml, The solr.xml that comes out-of-the-box does not have a sharedLib. I am using Solr 4.4 out-of-the-box, with the exception that I set up a lib directory in example/solr/collection1. I did not change solr.xml from the out-of-the-box version. There is no mention of lib in the out-of-the-box example/solr/solr.xml. I did not change the out-of-the-box solrconfig.xml. According to the README.txt, all that needs to be done is create the collection1/lib directory and put the jars there. However, I am getting the class not found error.
Should I open another bug report or comment on the existing report? I have never heard of using ${instanceDir}/lib for jars. That doesn't mean it won't work, but I have never seen it mentioned anywhere. I have only ever put the lib directory in solr.home, where solr.xml is. Did you try that? If you have seen documentation for collection1/lib, then there may be a doc bug, another dimension to the bug already filed, or a new bug. Do you see log entries saying your jars in collection/lib are loaded? If you do, then I think it's probably another dimension to the existing bug. Thanks, Shawn
Re: coordination factor in between query terms
My bad, typo there -- I meant q=termA termB (without quotes). I know omitNorms is an index-time field option; can it be applied at query time also? Are there other solutions to this kind of problem? Curious. On Wed, Aug 28, 2013 at 4:52 PM, Chris Hostetter hossman_luc...@fucit.org wrote: 1) Coordination factor is controlled by the Similarity you have configured -- there is no request-time option to affect the coordination function. The default Similarity already includes a simple ratio coord factor... https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/similarities/DefaultSimilarity.html#coord%28int,%20int%29 2) your example query includes quote characters, which makes it a phrase query, not a simple boolean query, so in that case both termA and termB will be required, and must be within the default slop number of term positions away from each other. If you instead used a query param of: q=termA termB ... then you'd see the coord factor come into play 3) in addition to the coord factor is the issue of fieldNorms -- by default, text fields include a norm factor that takes into account the length of a field, so in spite of the coord factor a very short field (ie: doc1) might score higher than a long field (ie: doc2) even if the long field has more matches -- if you don't want this, just use omitNorms=true on your field. : How can I specify a coordination factor between query terms : eg. q="termA termB" : : doc1 = { field: termA } : doc2 = { field: termA termB termC termD } : : I want doc2 scored higher than doc1 : : -- : Anirudha P. Jadhav : -Hoss -- Anirudha P. Jadhav
Re: ICUTokenizer class not found with Solr 4.4
On 8/28/2013 2:59 PM, Tom Burton-West wrote: Do you think I should open another JIRA and link it to yours or just add this information (i.e. other scenarios where class loading not working) to your JIRA? The documentation does sound confused. My personal opinion (which may not be what ends up happening) is that ${instanceDir}/lib shouldn't continue to be supported, at least not implicitly without config, mostly because each instanceDir can be dynamically destroyed (and added, with SolrCloud) by the core and collection APIs. I am guessing that you are seeing the same issue that has already been documented. The little research I've done into this suggests that some classes (ICUTokenizer being the specific example here) don't like it when Solr replaces the classloader to add additional jars. This is probably the case no matter which part of the config (solr.xml or solrconfig.xml) tells Solr to replace the classloader. The safest thing I've found is to use the lib directory off solr.home (which gets automatically used) and don't specify any additional lib directories anywhere in the configuration. Thanks, Shawn
What does it mean when a shard is down in solr4.4?
I have a 3-node SolrCloud cluster with 3 shards for each collection/core. At times, when I rebuild the index (say on collectionA on nodeA, shard1) via UpdateCSV, the Cloud status page says that collectionA on nodeA (shard1) is down. Observations: 1. Other collections on nodeA work. 2. collectionA on nodeB and nodeC works. 3. nodeA's solr admin is accessible too. So my questions are: 1. What does it really mean when a shard goes down? 2. How can I recover from that state? Solr cloud screenshot: http://i.imgur.com/2TgKXiC.png -- Thanks, -Utkarsh
Re: Solr show total row count in response of full import
: It would be nice if you could receive a total row count like : : str name=Total Documents10100/str : : With this information we could add another information like : : str name=Imported in Percent 62.91/str : : This would make it easier to generate a progress bar for the end user. I don't think that's possible -- DIH has no way of knowing in advance the total number of documents that the DataSources are going to produce. -Hoss
Re: Filter cache pollution during sharded edismax queries
Ken ... I'm not really sure I'm understanding what you're trying to describe. Can you give the full details of a concrete example of what you are seeing? * full requestHandler config * example of query issued by client * every request logged on each shard * contents of filterCache and queryResultCache after client's query finishes -Hoss
Re: purge and optimize questions for solr 4.4.0
: We have a SolrCloud cluster (5 shards and 2 replicas) on 10 boxes with 500 : million documents. We're using custom sharding where we direct all : documents with specific business date to specific shard. ... : How do we optimize documents for all shards in Solr Cloud? Do we have to : fire five different optimize commands to all five leaders? Also, looks Commands like optimize and deleteByQuery are automatically propagated to all shards -- you only need to send that command to one node in the collection. : like optimize will be going away and might no longer be necessary - see : SOLR-3141 (https://issues.apache.org/jira/browse/SOLR-3141) Is that true? It's still up for debate, and as you can see from the comments it hasn't had much traction lately. Even if, at some point in the future, sending a command named optimize ceases to work, the underlying functionality of being able to say force merge down to N segments will always exist under some name, provided you don't go out of your way to use a MergePolicy that ignores that command. : With Solr 3.6 we used following curl command to purge documents. Now : with multiple shards can we still use the same command? We will As mentioned above, a deleteByQuery command can be sent to a single node and it will be propagated automatically. However: if you are already using custom sharding to shard by date, then a blanket deleteByQuery across all shards may not be necessary -- you may find it easier/faster/cleaner to just delete the shards you no longer need as the data in them expires ... https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-DeleteaShard -Hoss
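For the expiry case, dropping a whole date-based shard is one Collections API call, along these lines (the shard name 20130208 is made up; with a custom/implicit router the shard can be deleted directly, otherwise it must be inactive first):

curl 'http://prod-solr-master.xyz.com:8983/solr/admin/collections?action=DELETESHARD&collection=core1&shard=20130208'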
Feedback requested on design/implementation/extent of a proposed Solr configuration REST API
For mailing list participants on solr-user who aren't subscribed to the dev list: I've created a JIRA issue to discuss adding a Solr configuration REST API: https://issues.apache.org/jira/browse/SOLR-5200. I'm interested in feedback of any kind on this proposal, preferably on the above-linked JIRA issue, but here on the solr-user mailing list would also work. There are lots of details, so I don't expect quick resolution, but any input about rationale or use cases for inclusion or exclusion of configuration item runtime modifiability would be very useful. Thanks, Steve
Re: why does a node switch state ?
Hi Daniel, thank you very much for your reply. However, my zkClientTimeout in solr.xml is 30s: <cores adminPath="/admin/cores" defaultCoreName="doc" host="${host:215.lead.index.com}" hostPort="${jetty.port:9090}" hostContext="${hostContext:}" zkClientTimeout="${zkClientTimeout:3}" leaderVoteWait="${leaderVoteWait:2}"> ... </cores> -- View this message in context: http://lucene.472066.n3.nabble.com/why-does-a-node-switch-state-tp4086939p4087142.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: why does a node switch state ?
Kindly stop me from solr mail chain. Thanks and regards, Veena -- Regards, Veena Rani P N Banglore. 9538440458
RE: SOLR 4.2.1 - High Resident Memory Usage
So we actually had 3 of the 6 machines automatically restart the SOLR service as memory pressure was too high; 2 were by SIGABRT and one was the Java OOM killer. I dropped a pmap on one of the Solr services before it died. Basically I need to figure out what the other direct memory references outside of the heap are (marked with the arrows in the image below). Anyone have any insight? http://lucene.472066.n3.nabble.com/file/n4087148/pmap_03.png -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-4-2-1-High-Resident-Memory-Usage-tp4086866p4087148.html Sent from the Solr - User mailing list archive at Nabble.com.
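In case it helps anyone debugging the same thing, one way to pull out the biggest resident mappings (the start.jar match is an assumption about a stock Jetty launcher with a single Solr process on the box):

pmap -x $(pgrep -f start.jar) | sort -n -k3 | tail -20

Anonymous regions beyond the Java heap are typically thread stacks, DirectByteBuffer allocations, and the JVM's own code and metadata, while mappings backed by index files (MMapDirectory, the usual 4.x default on 64-bit Linux) show their file paths in the last column.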