Re: Field not available on edismax query

2013-07-10 Thread It-forum

Hello Alex,

You're right. But I faced a problem with the Solr engine: when I reload the 
dataimport configuration and update, the indexing goes wrong. I had to 
reload Solr and restart the import to get the values.


Tx for your help.

David

On 09/07/2013 17:19, Alexandre Rafalovitch wrote:

On Tue, Jul 9, 2013 at 6:29 AM, It-forum it-fo...@meseo.fr wrote:


However, when I use an edismax query with the following details, I'm not able
to retrieve the field tag. And it seems that it is not taken into account in
the match score either.


You seem to have two problems here: one not matching (use debug flags for
that) and one not retrieving. But what do you mean by not retrieving? By
default all the fields are returned regardless of the query. So if you are
getting it in one query but not in another, you might either be getting
different documents without that field populated, or you have explicitly
mis-defined which fields to return (with the 'fl' parameter).
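
A minimal SolrJ sketch of checking both at once (the field name 'tag', the
other qf fields and the core URL are placeholders, not your actual setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
SolrQuery q = new SolrQuery("your search terms");
q.set("defType", "edismax");
q.set("qf", "title tag");     // include 'tag' here or it won't participate in matching/scoring
q.setFields("*", "score");    // '*' returns every stored field; a narrower fl can hide 'tag'
q.set("debugQuery", "true");  // the explain output shows which fields actually matched
QueryResponse rsp = server.query(q);
System.out.println(rsp.getDebugMap().get("explain"));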

Regards,
Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- "Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working."  (Anonymous - via GTD book)





Re: Calculating Solr document score by ignoring the boost field.

2013-07-10 Thread Daniel Collins
Sorry to repeat Jack's previous answer, but x times zero is always zero :)

An index boost is just what the name suggests: a factor by which the
document score is boosted (multiplied). Since it is an index-time value,
it is stored alongside the document, so any future scoring of the document
by any query will take this value into account. If you take Solr's internal
document score and then multiply it by zero, the result is by definition
zero...

What you seem to be saying is that you are passing in an index-time boost
(which is incorrect, but that's an issue with Nutch) yet you want Solr to
ignore it; surely the correct approach then is *not* to pass it in?

Once the data is indexed, it is fixed unless you re-index the document,
so if that data is wrong, there is nothing Solr can do about it; you have
to re-index the documents that have incorrect data. If you want to just use
TF-IDF for scoring and not use boosting, don't supply any boosting; it's
that simple.  Sorry if this sounds repetitive, but I can't think of any
other way to say it.
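
As an illustration, a hedged SolrJ sketch (not what Nutch itself does;
'solrServer' stands for an existing SolrServer instance):

import org.apache.solr.common.SolrInputDocument;

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc-1");
doc.addField("title", "some crawled page");
// doc.setDocumentBoost(0.0f);  // this is the call that zeroes every future score for the doc
// Leaving the boost untouched keeps the default of 1.0, i.e. score * 1.0 = score.
solrServer.add(doc);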


On 10 July 2013 06:33, Tony Mullins tonymullins...@gmail.com wrote:

 Jack, due to 'some' reason my nutch is returning an index-time boost = 0.0;
 just for a moment suppose that nutch always will return boost = 0.

 Now my simple question was: why is Solr showing me a document score = 0?
 Why does it depend on the index-time boost value? And how can I make Solr
 calculate the score from TF-IDF only?

 Regards,
 Khan


 On Tue, Jul 9, 2013 at 6:31 PM, Jack Krupansky j...@basetechnology.com
 wrote:

  Simple math: x times zero equals zero.
 
  That's why the default document boost is 1.0 - score times 1.0 equals
  score.
 
  Any particular reason you wanted to zero out the document score from the
  document level?
 
  -- Jack Krupansky
 
  -Original Message- From: Tony Mullins
  Sent: Tuesday, July 09, 2013 9:23 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Calculating Solr document score by ignoring the boost field.
 
 
  I am passing a boost value (via nutch), i.e. boost = 0.0.
  But my question is why Solr is showing me score = 0.0 when my boost
  (index-time boost) = 0.0?
  Should not Solr calculate its document scores on the basis of TF-IDF? And
  if not, how can I make Solr consider only TF-IDF while calculating the
  document score?
 
  Regards,
  Khan
 
 
  On Tue, Jul 9, 2013 at 4:46 PM, Erick Erickson erickerick...@gmail.com
 **
  wrote:
 
   My guess is that you're not really passing on the boost field's value
  and getting the default. Don't quite know how I'd track that down
  though
 
  Best
  Erick
 
  On Tue, Jul 9, 2013 at 4:09 AM, imran khan imrankhan.x...@gmail.com
  wrote:
   Greetings,
  
   I am using nutch 2.x as my datasource for Solr 4.3.0. And nutch passes
   on
   its own boost field to my Solr schema
  
    <field name="boost" type="float" stored="true" indexed="false"/>
  
   Now due to some reason I always get boost = 0.0 and due to this my
  Solr's
   document score is also always 0.0.
  
   Is there any way in Solr that it ignores the boost field's value for
  its
   document's score calculation ?
  
   Regards,
   Khan
 
 
 



Re: Calculating Solr document score by ignoring the boost field.

2013-07-10 Thread Tony Mullins
Ok thanks, I just wanted to know whether it is possible to ignore the boost
value during score calculation, and as you said, it's not.
Now I will have to focus on nutch to fix the issue so that it doesn't send
boost=0 to Solr.

Regards,
Khan


On Wed, Jul 10, 2013 at 12:14 PM, Daniel Collins danwcoll...@gmail.com wrote:

 Sorry to repeat Jack's previous answer, but x times zero is always zero :)

 An index boost is just what the name suggests: a factor by which the
 document score is boosted (multiplied). Since it is an index-time value,
 it is stored alongside the document, so any future scoring of the document
 by any query will take this value into account. If you take Solr's internal
 document score and then multiply it by zero, the result is by definition
 zero...

 What you seem to be saying is that you are passing in an index-time boost
 (which is incorrect, but that's an issue with Nutch) yet you want Solr to
 ignore it; surely the correct approach then is *not* to pass it in?

 Once the data is indexed, it is fixed unless you re-index the document,
 so if that data is wrong, there is nothing Solr can do about it; you have
 to re-index the documents that have incorrect data. If you want to just use
 TF-IDF for scoring and not use boosting, don't supply any boosting; it's
 that simple.  Sorry if this sounds repetitive, but I can't think of any
 other way to say it.


 On 10 July 2013 06:33, Tony Mullins tonymullins...@gmail.com wrote:

  Jack, due to 'some' reason my nutch is returning an index-time boost = 0.0;
  just for a moment suppose that nutch always will return boost = 0.

  Now my simple question was: why is Solr showing me a document score = 0?
  Why does it depend on the index-time boost value? And how can I make Solr
  calculate the score from TF-IDF only?
 
  Regards,
  Khan
 
 
  On Tue, Jul 9, 2013 at 6:31 PM, Jack Krupansky j...@basetechnology.com
  wrote:
 
   Simple math: x times zero equals zero.
  
   That's why the default document boost is 1.0 - score times 1.0 equals
   score.
  
   Any particular reason you wanted to zero out the document score from
 the
   document level?
  
   -- Jack Krupansky
  
   -Original Message- From: Tony Mullins
   Sent: Tuesday, July 09, 2013 9:23 AM
   To: solr-user@lucene.apache.org
   Subject: Re: Calculating Solr document score by ignoring the boost field.
  
  
   I am passing a boost value (via nutch), i.e. boost = 0.0.
   But my question is why Solr is showing me score = 0.0 when my boost
   (index-time boost) = 0.0?
   Should not Solr calculate its document scores on the basis of TF-IDF? And
   if not, how can I make Solr consider only TF-IDF while calculating the
   document score?
  
   Regards,
   Khan
  
  
   On Tue, Jul 9, 2013 at 4:46 PM, Erick Erickson 
 erickerick...@gmail.com
  **
   wrote:
  
My guess is that you're not really passing on the boost field's value
   and getting the default. Don't quite know how I'd track that down
   though
  
   Best
   Erick
  
   On Tue, Jul 9, 2013 at 4:09 AM, imran khan imrankhan.x...@gmail.com
   wrote:
Greetings,
   
I am using nutch 2.x as my datasource for Solr 4.3.0. And nutch
 passes
on
its own boost field to my Solr schema
   
     <field name="boost" type="float" stored="true" indexed="false"/>
   
Now due to some reason I always get boost = 0.0 and due to this my
   Solr's
document score is also always 0.0.
   
Is there any way in Solr that it ignores the boost field's value
 for
   its
document's score calculation ?
   
Regards,
Khan
  
  
  
 



Re: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Jed Glazner
We are planning an upgrade to 4.4 but it's still weeks out. We offer a
high-availability search service, and there are a number of changes in 4.4
that are not backward compatible (e.g. clusterstate.json and no solr.xml),
so there must be lots of testing; additionally, this upgrade cannot be
performed without downtime.

Regardless, I need to find a band-aid right now.  Does anyone know if it's
possible to set the timeout for distributed update requests to/from the
leader?  Currently we see it's set to 0.  Maybe via a -D startup param, or
something?

Jed

On 7/10/13 1:23 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Hi Jed,

This is really with Solr 4.0?  If so, it may be wiser to jump on 4.4
that is about to be released.  We did not have fun working with 4.0 in
SolrCloud mode a few months ago.  You will save time, hair, and money
if you convince your manager to let you use Solr 4.4. :)

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Tue, Jul 9, 2013 at 4:44 PM, Jed Glazner jglaz...@adobe.com wrote:
 Hi Shawn,

 I have been trying to duplicate this problem without success for the
last 2 weeks, which is one reason I'm getting flustered.  It seems
reasonable to be able to duplicate it, but I can't.

  We do have a story to upgrade but that is still weeks if not months
before that gets rolled out to production.

 We have another cluster running the same version but with 8 shards and
8 replicas, with each shard at 100GB, more load, and more indexing
requests, without this problem; but we send docs in batches there and all
fields are stored.  Whereas the trouble index has only 1 or 2 stored
fields and we only send docs 1 at a time.

 Could that have anything to do with it?

 Jed


 Sent from Samsung Mobile



  Original message 
 From: Shawn Heisey s...@elyograg.org
 Date: 07.09.2013 18:33 (GMT+01:00)
 To: solr-user@lucene.apache.org
 Subject: Re: Solr Hangs During Updates for over 10 minutes


 On 7/9/2013 9:50 AM, Jed Glazner wrote:
 I'll give you the high level before delving deep into setup etc. I
have been struggling at work with a seemingly random problem where solr
will hang for 10-15 minutes during updates.  This outage always seems
to be immediately preceded by an EOF exception on the replica.  Then
10-15 minutes later we see an exception on the leader for a socket
timeout to the replica.  The leader will then tell the replica to
recover, which in most cases it does, and then the outage is over.

 Here are the setup details:

 We are currently using Solr 4.0.0 with an external ZK ensemble of 5
machines.

 After 4.0.0 was released, a *lot* of problems with SolrCloud surfaced
 and have since been fixed.  You're five releases and about nine months
 behind what's current.  My recommendation: Upgrade to 4.3.1, ensure your
 configuration is up to date with changes to the example config between
 4.0.0 and 4.3.1, and reindex.  Ideally, you should set up a 4.0.0
 testbed, duplicate your current problem, and upgrade the testbed to see
 if the problem goes away.  A testbed will also give you practice for a
 smooth upgrade of your production system.

 Thanks,
 Shawn




Re: Solr limitations

2013-07-10 Thread Ramkumar R. Aiyengar
I understand, thanks. I just wanted to check in case there were scalability
limitations with how SolrCloud operates.
On 9 Jul 2013 12:45, Erick Erickson erickerick...@gmail.com wrote:

 I think Jack was mostly thinking in slam-dunk terms. I know of
 SolrCloud demo clusters with 500+ nodes, and at that point
 people said "it's going to work for our situation, we don't need
 to push more."

 As you start getting into that kind of scale, though, you really
 have a bunch of ops considerations etc. Mostly when I get into
 larger scales I pretty much want to examine my assumptions
 and see if they're correct, perhaps start to trim my requirements
 etc.

 FWIW,
 Erick

 On Tue, Jul 9, 2013 at 4:07 AM, Ramkumar R. Aiyengar
 andyetitmo...@gmail.com wrote:
  5. No more than 32 nodes in your SolrCloud cluster.
 
  I hope this isn't too OT, but what tradeoffs is this based on? Would have
  thought it easy to hit this number for a big index and high load (hence
  with the view of both the number of shards and replicas horizontally
  scaling..)
 
  6. Don't return more than 250 results on a query.
 
  None of those is a hard limit, but don't go beyond them unless your
 Proof
  of Concept testing proves that performance is acceptable for your
 situation.
 
  Start with a simple 4-node, 2-shard, 2-replica cluster for preliminary
  tests and then scale as needed.
 
  Dynamic and multivalued fields? Try to stay away from them - except for
  the simplest cases, they are usually an indicator of a weak data model.
  Sure, it's fine to store a relatively small number of values in a
  multivalued field (say, dozens of values), but be aware that you can't
  directly access individual values, you can't tell which was matched on a
  query, and you can't coordinate values between multiple multivalued
 fields.
  Except for very simple cases, multivalued fields should be flattened into
  multiple documents with a parent ID.
 
  Since you brought up the topic of dynamic fields, I am curious how you
  got the impression that they were a good technique to use as a starting
  point. They're fine for prototyping and hacking, and fine when used in
  moderation, but not when used to excess. The whole point of Solr is
  searching and searching is optimized within fields, not across fields, so
  having lots of dynamic fields is counter to the primary strengths of
 Lucene
  and Solr. And... schemas with lots  of dynamic fields tend to be
 difficult
  to maintain. For example, if you wanted to ask a support question here,
 one
  of the first things we want to know is what your schema looks like, but
  with lots of dynamic fields it is not possible to have a simple
 discussion
  of what your schema looks like.
 
  Sure, there is something called schemaless design (and Solr supports
  that in 4.4), but that's very different from heavy reliance on dynamic
  fields in the traditional sense. Schemaless design is A-OK, but using
  dynamic fields for arrays of data in a single document is a poor match
  for the search features of Solr (e.g., Edismax searching across multiple
  fields.)
 
  One other tidbit: Although Solr does not enforce naming conventions for
  field names, and you can put special characters in them, there are plenty
  of features in Solr, such as the common fl parameter, where field names
  are expected to adhere to Java naming rules. When people start going
 wild
  with dynamic fields, it is common that they start going wild with their
  names as well, using spaces, colons, slashes, etc. that cannot be parsed
 in
  the fl and qf parameters, for example. Please don't go there!
 
  In short, put up a small cluster and start doing a Proof of Concept
  cluster. Stay within my suggested guidelines and you should do okay.
 
  -- Jack Krupansky
 
  -Original Message- From: Marcelo Elias Del Valle
  Sent: Monday, July 08, 2013 9:46 AM
  To: solr-user@lucene.apache.org
  Subject: Solr limitations
 
 
  Hello everyone,
 
  I am trying to search for information about possible Solr limitations I
  should consider in my architecture: things like max number of dynamic
  fields, max number of documents in SolrCloud, etc.
  Does anyone know where I can find this info?
 
  Best regards,
  --
  Marcelo Elias Del Valle
  http://mvalle.com - @mvallebr



Re: two types of answers in my query

2013-07-10 Thread Mysurf Mail
This will work.
Thanks.


On Tue, Jul 9, 2013 at 4:37 PM, Jack Krupansky j...@basetechnology.com wrote:

 Usually a car term and a car part term will look radically different. So,
 simply use the edismax query parser and set qf to be both the car and car
 part fields. If either matches, the document will be selected. And if you
 have a type field, you can check that to see if a car or part was matched
 in the results.

 -- Jack Krupansky

 -Original Message- From: Mysurf Mail
 Sent: Tuesday, July 09, 2013 2:38 AM
 To: solr-user@lucene.apache.org
 Subject: two types of answers in my query


 Hi,
 A general question:


 Let's say I have a Car and CarParts 1:n relation.

 And I have discovered that the user entered a part serial number (SKU) in
 the search field instead of a car name.
 (I discovered it using a regex.)

 Is there a way to fetch different types of answers in Solr?
 Is there a way to fetch mixed types in the answers?
 Is there something similar to that, and what is that feature called?

 Thank you.
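
 A hedged SolrJ sketch of the approach above (the field names car_name,
 part_sku and type are hypothetical, and 'server' stands for an existing
 SolrServer instance):

 import org.apache.solr.client.solrj.SolrQuery;
 import org.apache.solr.client.solrj.response.QueryResponse;
 import org.apache.solr.common.SolrDocument;

 SolrQuery q = new SolrQuery("PART-SKU-12345");
 q.set("defType", "edismax");
 q.set("qf", "car_name part_sku");   // match against both entity types at once
 q.setFields("id", "type", "score");
 QueryResponse rsp = server.query(q);
 for (SolrDocument d : rsp.getResults()) {
     // the stored type field says whether a car or a part matched
     System.out.println(d.getFieldValue("type") + " -> " + d.getFieldValue("id"));
 }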



Disabling word breaking for codes and SKUs

2013-07-10 Thread Mysurf Mail
Some of the data in my index consists of SKUs and barcodes, as follows:
ASDF3-DASDD-2133DD-21H44

I want to disable the word breaking for this type (maybe through a regex).
Is there a way to do this?


Re: Norms

2013-07-10 Thread Daniel Collins
I don't know the full answer to your question, but here's what I can offer.

Solr offers 2 types of normalisation, FieldNorm and QueryNorm.  FieldNorm
is, as the name suggests, field-level normalisation based on the length of
the field, and can be controlled by the omitNorms parameter on the field.  In
your example, fieldNorm is always 1.0 (see below), so that suggests you have
correctly turned off field normalisation on the name_edgy field.

1.0 = fieldNorm(field=name_edgy, doc=231378)

QueryNorm is what I'm still trying to get to the bottom of exactly :)  But
it's something that tries to normalise the results of different term queries
so they are broadly comparable. You haven't supplied the query you've run,
but based on the qf and bf, I'm assuming it breaks down into a DisMax query
on 3 fields (name_edgy, name_edge, name_word), so queryNorm is trying to
ensure that the results of those 3 queries can be compared.  The exact
details I'm still trying to get to the bottom of (any volunteers with more
info, chip in!)

From earlier answers to the list, queryNorm is calculated in the Similarity
object; I need to dig further, but that's probably a good place to start.
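
For what it's worth, a hedged sketch of one way to neutralise it: a custom
Similarity whose queryNorm always returns 1.0 (this assumes Lucene/Solr 4.x's
DefaultSimilarity), which you would then register with a <similarity> element
in schema.xml:

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class NoQueryNormSimilarity extends DefaultSimilarity {
    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return 1.0f; // leave the raw term scores untouched instead of normalising them
    }
}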



On 10 July 2013 04:57, William Bell billnb...@gmail.com wrote:

 I have a field that has omitNorms=true, but when I look at debugQuery I see
 that
 the field is being normalized for the score.

 What can I do to turn off normalization in the score?

 I want a simple way to do 2 things:

 boost geodist() highest at 1 mile and lowest at 100 miles.
 plus add a boost for a query=edgefield^5.

 I only want tf() and no queryNorm. I am not even sure I want idf() but I
 can probably live with rare names being boosted.



 The results are being normalized. See below. I tried dismax and edismax -
 bf, bq and boost.

 <requestHandler name="autoproviderdist" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="echoParams">none</str>
     <str name="defType">edismax</str>
     <float name="tie">0.01</float>
     <str name="fl">
       display_name,city_state,prov_url,pwid,city_state_alternative
     </str>
     <!--
     <str name="bq">_val_:"sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)"^10</str>
     -->
     <str name="boost">sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)</str>
     <int name="rows">5</int>
     <str name="q.alt">*:*</str>
     <str name="qf">name_edgy^.9 name_edge^.9 name_word</str>
     <str name="group">true</str>
     <str name="group.field">pwid</str>
     <str name="group.main">true</str>
     <!-- <str name="pf">name_edgy</str> do not turn on -->
     <str name="sort">score desc, last_name asc</str>
     <str name="d">100</str>
     <str name="pt">39.740112,-104.984856</str>
     <str name="sfield">store_geohash</str>
     <str name="hl">false</str>
     <str name="hl.fl">name_edgy</str>
     <str name="mm">2<-1 4<-2 6<-3</str>
   </lst>
 </requestHandler>

 0.058555886 = queryNorm

 product of:
   10.854807 = (MATCH) sum of:
     1.8391232 = (MATCH) max plus 0.01 times others of:
       1.8214592 = (MATCH) weight(name_edge:paul^0.9 in 231378), product of:
         0.30982485 = queryWeight(name_edge:paul^0.9), product of:
           0.9 = boost
           5.8789964 = idf(docFreq=26567, maxDocs=3493655)
           0.058555886 = queryNorm
         5.8789964 = (MATCH) fieldWeight(name_edge:paul in 231378), product of:
           1.0 = tf(termFreq(name_edge:paul)=1)
           5.8789964 = idf(docFreq=26567, maxDocs=3493655)
           1.0 = fieldNorm(field=name_edge, doc=231378)
       1.7664119 = (MATCH) weight(name_edgy:paul^0.9 in 231378), product of:
         0.30510724 = queryWeight(name_edgy:paul^0.9), product of:
           0.9 = boost
           5.789479 = idf(docFreq=29055, maxDocs=3493655)
           0.058555886 = queryNorm
         5.789479 = (MATCH) fieldWeight(name_edgy:paul in 231378), product of:
           1.0 = tf(termFreq(name_edgy:paul)=1)
           5.789479 = idf(docFreq=29055, maxDocs=3493655)
           1.0 = fieldNorm(field=name_edgy, doc=231378)
     9.015684 = (MATCH) max plus 0.01 times others of:
       8.9352665 = (MATCH) weight(name_word:nutting in 231378), product of:
         0.72333425 = queryWeight(name_word:nutting), product of:
           12.352887 = idf(docFreq=40, maxDocs=3493655)
           0.058555886 = queryNorm
         12.352887 = (MATCH) fieldWeight(name_word:nutting in 231378), product of:
           1.0 = tf(termFreq(name_word:nutting)=1)
           12.352887 = idf(docFreq=40, maxDocs=3493655)
           1.0 = fieldNorm(field=name_word, doc=231378)
       8.04174 = (MATCH) weight(name_edgy:nutting^0.9 in 231378), product of:
         0.65100086 = queryWeight(name_edgy:nutting^0.9), product of:
           0.9 = boost
           12.352887 = idf(docFreq=40, maxDocs=3493655)
           0.058555886 = queryNorm
         12.352887 = (MATCH) fieldWeight(name_edgy:nutting in 231378), product of:
           1.0 = tf(termFreq(name_edgy:nutting)=1)
           12.352887 = idf(docFreq=40, maxDocs=3493655)
           1.0 = fieldNorm(field=name_edgy, doc=231378)
   1.0855998 = sum(6.0/(0.5*float(geodist(39.74168747663498,-104.9849385023117,39.740112,-104.984856))+6.0),const(0.1))


 --
 Bill Bell
 billnb...@gmail.com
 cell 720-256-8076



Re: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Daniel Collins
We had something similar, in terms of update times suddenly spiking up for
no obvious reason.  We never got quite as bad as you in terms of the other
knock-on effects, but we certainly saw updates jumping from 10ms up to
3ms; all our external queues backed up and we rejected some updates,
then after a while things quietened down.

We were running Solr 4.3.0 but with Java 6 and the CMS GC.  We swapped to
Java 7 and the G1 GC (and increased the heap size from 8GB to 12GB) and the
problem went away.

Now, I admit it's not exactly the same as your case; we never had the
follow-on effects, but I'd consider Java 7 and the G1 GC, as it has certainly
reduced the spikes in our indexing times.

We run the following settings now (the usual caveats apply, it might not
work for you).

GC_OPTIONS="-XX:+AggressiveOpts -XX:+UseG1GC -XX:+UseStringCache
-XX:+OptimizeStringConcat -XX:-UseSplitVerifier -XX:+UseNUMA
-XX:MaxGCPauseMillis=50 -XX:GCPauseIntervalMillis=1000"

I set the MaxGCPauseMillis/GCPauseIntervalMillis to try to minimise
application pauses; that's our goal. If we have to use more memory in the
short term then so be it, but we can't afford application pauses,
because we are using NRT (soft commits every 1s, hard commits every 60s)
and we get a lot of updates.

I know there have been other discussions on G1 and it has received mixed
results overall, but for us, it seems to be a winner.

Hope that helps,


On 10 July 2013 08:32, Jed Glazner jglaz...@adobe.com wrote:

 We are planning an upgrade to 4.4 but it's still weeks out. We offer a
 high availability search service and there are a number of changes in 4.4
 that are not backward compatible. (i.e. Clusterstate.json and no solr.xml)
 So there must be lots of testing, additionally this upgrade cannot be
 performed without downtime.

 Regardless, I need to find a band-aid right now.  Does anyone know if it's
 possible to set the timeout for distributed update request to/from leader.
  Currently we see it's set to 0.  Maybe via -D startup param, or something?

 Jed

 On 7/10/13 1:23 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

 Hi Jed,
 
 This is really with Solr 4.0?  If so, it may be wiser to jump on 4.4
 that is about to be released.  We did not have fun working with 4.0 in
 SolrCloud mode a few months ago.  You will save time, hair, and money
 if you convince your manager to let you use Solr 4.4. :)
 
 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm
 
 
 
 On Tue, Jul 9, 2013 at 4:44 PM, Jed Glazner jglaz...@adobe.com wrote:
  Hi Shawn,
 
  I have been trying to duplicate this problem without success for the
 last 2 weeks which is one reason I'm getting flustered.   It seems
 reasonable to be able to duplicate it but I can't.
 
   We do have a story to upgrade but that is still weeks if not months
 before that gets rolled out to production.
 
  We have another cluster running the same version but with 8 shards and
 8 replicas with each shard at 100gb and more load and more indexing
 requests without this problem but we send docs in batches here and all
 fields are stored.   Where as the trouble index has only 1 or 2 stored
 fields and only send docs 1 at a time.
 
  Could that have anything to do with it?
 
  Jed
 
 
  Sent from Samsung Mobile
 
 
 
   Original message 
  From: Shawn Heisey s...@elyograg.org
  Date: 07.09.2013 18:33 (GMT+01:00)
  To: solr-user@lucene.apache.org
  Subject: Re: Solr Hangs During Updates for over 10 minutes
 
 
  On 7/9/2013 9:50 AM, Jed Glazner wrote:
  I'll give you the high level before delving deep into setup etc. I
 have been struggling at work with a seemingly random problem when solr
 will hang for 10-15 minutes during updates.  This outage always seems
 to be immediately preceded by an EOF exception on the replica.  Then
 10-15 minutes later we see an exception on the leader for a socket
 timeout to the replica.  The leader will then tell the replica to
 recover which in most cases it does and then the outage is over.
 
  Here are the setup details:
 
  We are currently using Solr 4.0.0 with an external ZK ensemble of 5
 machines.
 
  After 4.0.0 was released, a *lot* of problems with SolrCloud surfaced
  and have since been fixed.  You're five releases and about nine months
  behind what's current.  My recommendation: Upgrade to 4.3.1, ensure your
  configuration is up to date with changes to the example config between
  4.0.0 and 4.3.1, and reindex.  Ideally, you should set up a 4.0.0
  testbed, duplicate your current problem, and upgrade the testbed to see
  if the problem goes away.  A testbed will also give you practice for a
  smooth upgrade of your production system.
 
  Thanks,
  Shawn
 




Re: Solr 3.6 optimize and field cache question

2013-07-10 Thread Marc Sturlese
Not a solution for the short term but sounds like a good use case to migrate
to Solr 4.X and use DocValues instead of FieldCache for faceting.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-3-6-optimize-and-field-cache-question-tp4076398p4076822.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Live Nodes not updating immediately

2013-07-10 Thread Daniel Collins
What do you have your ZK Timeout set to (zkClientTimeout in solr.xml or
command line if you override it)?

A kill of the raw process is bad, but ZK should spot that using its
heartbeat mechanism, so unless your timeout is very large, it should be
detecting the node is no longer available, and then triggering a leadership
election.

We (still) use 4.3.0 (with some patches) and we do have some issues with
Solr shutdowns not causing an election quickly enough for us, but that's a
known issue within Solr/Jetty, and maybe causes 10-20s of outage, not 20
minutes!

You say you have 3 machines: how many shards, how many ZKs, and are they
embedded ZK or external? I think we need more info about the scenario.

If you are running embedded ZK, then you are losing both a shard/replica
and a ZK at the same time, which isn't ideal (we moved to external ZKs
quite quickly, embedded just caused too many issues) but shouldn't be that
catastrophic.

Also does it only happen with a kill -9, what about a normal kill, and/or a
normal shutdown of Jetty?



On 9 July 2013 16:18, Shawn Heisey s...@elyograg.org wrote:

  We are going to use Solr in production. There are chances that the
 machine itself might shut down due to power failure, or that the network
 gets disconnected due to manual intervention. We need to address those
 cases as well to build a robust system.

 The latest version of Solr is 4.3.1, and 4.4 is right around the corner.
 Any chance you can test a nightly 4.4 build or a checkout of the
 lucene_solr_4_4 branch, so we can know whether you are running into the
 same problems with what will be released soon? No sense in fixing a
 problem that no longer exists.

 Thanks,
 Shawn





Switch to new leader transparently?

2013-07-10 Thread Floyd Wu
Hi there,

I've built a SolrCloud cluster from the example, but I have a question.
When I send a query to one leader (say
http://xxx.xxx.xxx.xxx:8983/solr/collection1), everything is fine.

When I shut down that leader, the other replica
(http://xxx.xxx.xxx.xxx:9983/solr/collection1) in the same shard will become
the new leader. The problem is:

The application doesn't know the new leader's location and still sends
requests to http://xxx.xxx.xxx.xxx:8983/solr/collection1, and of course gets
no response.

How can I know the new leader in my application?
Is there any mechanism by which the application can send requests to one
fixed endpoint, no matter who the leader is?

For example, the application just sends to
http://xxx.xxx.xxx.xxx:8983/solr/collection1
even if the real leader runs on http://xxx.xxx.xxx.xxx:9983/solr/collection1

Please help with this, or give me some key information to google it.

Many thanks.

Floyd


Re: Switch to new leader transparently?

2013-07-10 Thread Anshum Gupta
You don't really need to direct any query specifically to a leader. It will
automatically be routed to the right leader.
You may put a load balancer on top to fix the problem of querying a
node that has gone away.

Also, there is a ZK-aware SolrJ Java client (CloudSolrServer) that
load-balances across all nodes in the cluster.


On Wed, Jul 10, 2013 at 2:52 PM, Floyd Wu floyd...@gmail.com wrote:

 Hi there,

 I've built a SolrCloud cluster from example, but I have some question.
 When I send query to one leader (say
 http://xxx.xxx.xxx.xxx:8983/solr/collection1) and no problem everything
 will be fine.

 When I shutdown that leader, the other replica(
 http://xxx.xxx.xxx.xxx:9983/solr/collection1) in the some shard will be
 new
 leader. The problem is:

 The application doesn't know new leader's location and still send request
 to http://xxx.xxx.xxx.xxx:8983/solr/collection1 and of course no response.

 How can I know new leader in my application?
 Are there any mechanism that application can send request to one fixed
 endpoint no matter who is leader?

 For example, application just send to
 http://xxx.xxx.xxx.xxx:8983/solr/collection1
 even the real leader run on http://xxx.xxx.xxx.xxx:9983/solr/collection1

 Please help on this or give me some key infomation to google it.

 Many thanks.

 Floyd




-- 

Anshum Gupta
http://www.anshumgupta.net


Re: Disabling word breaking for codes and SKUs

2013-07-10 Thread Gora Mohanty
On 10 July 2013 14:02, Mysurf Mail stammail...@gmail.com wrote:
 Some of the data in my index consists of SKUs and barcodes, as follows:
 ASDF3-DASDD-2133DD-21H44

 I want to disable the word breaking for this type (maybe through a regex).
 Is there a way to do this?

What fieldtype are you using for this in schema.xml?
Use a string field, or some non-analysed field that stores
the data as is.

Regards,
Gora


Re: Switch to new leader transparently?

2013-07-10 Thread Furkan KAMACI
You can define a CloudSolrServer like this:

private static CloudSolrServer solrServer;

and then define the address of your ZooKeeper host:

private static String zkHost = "localhost:9983";

initialize your variable:

solrServer = new CloudSolrServer(zkHost);

You can get the leader list like this:

ClusterState clusterState = solrServer.getZkStateReader().getClusterState();
List<Replica> leaderList = new ArrayList<Replica>();
for (Slice slice : clusterState.getSlices(collectionName)) {
    leaderList.add(slice.getLeader());
}

For querying you can try this:

SolrQuery solrQuery = new SolrQuery();
// fill your solrQuery variable here
QueryRequest queryRequest = new QueryRequest(solrQuery, SolrRequest.METHOD.POST);
queryRequest.process(solrServer);

CloudSolrServer uses LBHttpSolrServer by default. Its definition is:
"LBHttpSolrServer or 'Load Balanced HttpSolrServer' is just a wrapper to
CommonsHttpSolrServer. This is useful when you have multiple SolrServers
and query requests need to be load balanced among them. It offers automatic
failover when a server goes down and it detects when the server comes back
up."

2013/7/10 Anshum Gupta ans...@anshumgupta.net

 You don't really need to direct any query specifically to a leader. It will
 automatically be routed to the right leader.
 You may put a load balancer on top to just fix the problem with querying a
 node that has gone away.

 Also, ZK aware SolrJ Java client that load-balances across all nodes in
 cluster.


 On Wed, Jul 10, 2013 at 2:52 PM, Floyd Wu floyd...@gmail.com wrote:

  Hi there,
 
  I've built a SolrCloud cluster from example, but I have some question.
  When I send query to one leader (say
  http://xxx.xxx.xxx.xxx:8983/solr/collection1) and no problem everything
  will be fine.
 
  When I shutdown that leader, the other replica(
  http://xxx.xxx.xxx.xxx:9983/solr/collection1) in the some shard will be
  new
  leader. The problem is:
 
  The application doesn't know new leader's location and still send request
  to http://xxx.xxx.xxx.xxx:8983/solr/collection1 and of course no
 response.
 
  How can I know new leader in my application?
  Are there any mechanism that application can send request to one fixed
  endpoint no matter who is leader?
 
  For example, application just send to
  http://xxx.xxx.xxx.xxx:8983/solr/collection1
  even the real leader run on http://xxx.xxx.xxx.xxx:9983/solr/collection1
 
  Please help on this or give me some key infomation to google it.
 
  Many thanks.
 
  Floyd
 



 --

 Anshum Gupta
 http://www.anshumgupta.net



Re: Switch to new leader transparently?

2013-07-10 Thread Floyd Wu
Hi Anshum,
Thanks for your response.
My application is developed in C#, so I can't use CloudSolrServer with
SolrJ.

My problem is that there is a setting in my application:

SolrUrl = http://xxx.xxx.xxx.xxx:8983/solr/collection1

When this Solr instance shuts down or crashes, I have to change this setting.
I read the source code of CloudSolrServer.java in SolrJ just a few minutes
ago.

It seems that CloudSolrServer first reads the cluster state from ZK (or some
live node) to retrieve info, and then uses this info to decide which node to
send the request to.

Maybe I have to modify my application to mimic the CloudSolrServer impl.

Any idea?

Floyd




2013/7/10 Anshum Gupta ans...@anshumgupta.net

 You don't really need to direct any query specifically to a leader. It will
 automatically be routed to the right leader.
 You may put a load balancer on top to just fix the problem with querying a
 node that has gone away.

 Also, ZK aware SolrJ Java client that load-balances across all nodes in
 cluster.


 On Wed, Jul 10, 2013 at 2:52 PM, Floyd Wu floyd...@gmail.com wrote:

  Hi there,
 
  I've built a SolrCloud cluster from example, but I have some question.
  When I send query to one leader (say
  http://xxx.xxx.xxx.xxx:8983/solr/collection1) and no problem everything
  will be fine.
 
  When I shutdown that leader, the other replica(
  http://xxx.xxx.xxx.xxx:9983/solr/collection1) in the some shard will be
  new
  leader. The problem is:
 
  The application doesn't know new leader's location and still send request
  to http://xxx.xxx.xxx.xxx:8983/solr/collection1 and of course no
 response.
 
  How can I know new leader in my application?
  Are there any mechanism that application can send request to one fixed
  endpoint no matter who is leader?
 
  For example, application just send to
  http://xxx.xxx.xxx.xxx:8983/solr/collection1
  even the real leader run on http://xxx.xxx.xxx.xxx:9983/solr/collection1
 
  Please help on this or give me some key infomation to google it.
 
  Many thanks.
 
  Floyd
 



 --

 Anshum Gupta
 http://www.anshumgupta.net



Re: Switch to new leader transparently?

2013-07-10 Thread Furkan KAMACI
You can check the source code of LBHttpSolrServer and try to implement
something like it on your own.
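
A hedged sketch of the idea in Java (the same shape ports to C#): keep a
fixed list of node URLs and fail over to the next when one stops answering.
The names here are hypothetical; this is not the real LBHttpSolrServer logic:

import java.util.Arrays;
import java.util.List;

public class SimpleFailover {
    private final List<String> urls = Arrays.asList(
            "http://xxx.xxx.xxx.xxx:8983/solr/collection1",
            "http://xxx.xxx.xxx.xxx:9983/solr/collection1");
    private int current = 0;

    // Try each endpoint in turn; the first one that answers wins.
    public String query(String params) throws Exception {
        Exception last = null;
        for (int i = 0; i < urls.size(); i++) {
            int idx = (current + i) % urls.size();
            try {
                String result = doHttpGet(urls.get(idx) + "/select?" + params);
                current = idx; // stick with the live node for the next call
                return result;
            } catch (Exception e) {
                last = e; // node down or unreachable; try the next one
            }
        }
        throw last;
    }

    private String doHttpGet(String fullUrl) throws Exception {
        java.net.URLConnection conn = new java.net.URL(fullUrl).openConnection();
        conn.setConnectTimeout(2000); // fail fast so the failover is quick
        java.util.Scanner s = new java.util.Scanner(conn.getInputStream(), "UTF-8")
                .useDelimiter("\\A");
        try {
            return s.hasNext() ? s.next() : "";
        } finally {
            s.close();
        }
    }
}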

2013/7/10 Floyd Wu floyd...@gmail.com

 Hi anshum
 Thanks for your response.
 My application is developed using C#, so I can't use  CloudSolrServer with
 SolrJ.

 My problem is there is a setting in my application

 SolrUrl = http://xxx.xxx.xxx.xxx:8983/solr/collection1

 When this Solr instance shutdown or crash, I have to change this setting.
 I've read source code of CloudSolrServer.java in SolrJ just few minutes
 ago.

 It seems to that CloudSolrServer first read cluster state from zk ( or some
 live node)
 to retrieve info and then use this info to decide which node to send
 request.

 Maybe I have to modify my application to mimic CloudSolrServer impl.

 Any idea?

 Floyd




 2013/7/10 Anshum Gupta ans...@anshumgupta.net

  You don't really need to direct any query specifically to a leader. It
 will
  automatically be routed to the right leader.
  You may put a load balancer on top to just fix the problem with querying
 a
  node that has gone away.
 
  Also, ZK aware SolrJ Java client that load-balances across all nodes in
  cluster.
 
 
  On Wed, Jul 10, 2013 at 2:52 PM, Floyd Wu floyd...@gmail.com wrote:
 
   Hi there,
  
   I've built a SolrCloud cluster from example, but I have some question.
   When I send query to one leader (say
   http://xxx.xxx.xxx.xxx:8983/solr/collection1) and no problem
 everything
   will be fine.
  
   When I shutdown that leader, the other replica(
   http://xxx.xxx.xxx.xxx:9983/solr/collection1) in the some shard will
 be
   new
   leader. The problem is:
  
   The application doesn't know new leader's location and still send
 request
   to http://xxx.xxx.xxx.xxx:8983/solr/collection1 and of course no
  response.
  
   How can I know new leader in my application?
   Are there any mechanism that application can send request to one fixed
   endpoint no matter who is leader?
  
   For example, application just send to
   http://xxx.xxx.xxx.xxx:8983/solr/collection1
   even the real leader run on
 http://xxx.xxx.xxx.xxx:9983/solr/collection1
  
   Please help on this or give me some key infomation to google it.
  
   Many thanks.
  
   Floyd
  
 
 
 
  --
 
  Anshum Gupta
  http://www.anshumgupta.net
 



Re: Switch to new leader transparently?

2013-07-10 Thread Floyd Wu
Hi Furkan,
I'm using C#, so SolrJ won't help on this, but its impl is a good reference
for me. Thanks for your help.

By the way, how can I fetch the cluster state from ZK directly, over plain
HTTP or a TCP socket?
In my SolrCloud cluster, I'm using a standalone ZK to coordinate.

Floyd




2013/7/10 Furkan KAMACI furkankam...@gmail.com

 You can define a CloudSolrServer as like that:

 *private static CloudSolrServer solrServer;*

 and then define the addres of your zookeeper host:

 *private static String zkHost = localhost:9983;*

 initialize your variable:

 *solrServer = new CloudSolrServer(zkHost);*

 You can get leader list as like:

 *ClusterState clusterState =
 cloudSolrServer.getZkStateReader().getClusterState();
 ListReplica leaderList = new ArrayList();
   for (Slice slice : clusterState.getSlices(collectionName)) {
   leaderList.add(slice.getLeader()); /
 }*


 For querying you can try that:
 *
 *
 *SolrQuery solrQuery = new SolrQuery();*
 *//fill your **solrQuery variable here**
 *
 *QueryRequest queryRequest = new QueryRequest(solrQuery,
 SolrRequest.METHOD.POST);
 queryRequest.process(**solrServer**);*

 CloudSolrServer uses LBHttpSolrServer by default. It's definiton is like
 that: *LBHttpSolrServer or Load Balanced HttpSolrServer is just a wrapper
 to CommonsHttpSolrServer. This is useful when you have multiple SolrServers
 and query requests need to be Load Balanced among them. It offers automatic
 failover when a server goes down and it detects when the server comes back
 up.*
 *
 *
 *
 *

 2013/7/10 Anshum Gupta ans...@anshumgupta.net

  You don't really need to direct any query specifically to a leader. It
 will
  automatically be routed to the right leader.
  You may put a load balancer on top to just fix the problem with querying
 a
  node that has gone away.
 
  Also, ZK aware SolrJ Java client that load-balances across all nodes in
  cluster.
 
 
  On Wed, Jul 10, 2013 at 2:52 PM, Floyd Wu floyd...@gmail.com wrote:
 
   Hi there,
  
   I've built a SolrCloud cluster from example, but I have some question.
   When I send query to one leader (say
   http://xxx.xxx.xxx.xxx:8983/solr/collection1) and no problem
 everything
   will be fine.
  
   When I shutdown that leader, the other replica(
   http://xxx.xxx.xxx.xxx:9983/solr/collection1) in the some shard will
 be
   new
   leader. The problem is:
  
   The application doesn't know new leader's location and still send
 request
   to http://xxx.xxx.xxx.xxx:8983/solr/collection1 and of course no
  response.
  
   How can I know new leader in my application?
   Are there any mechanism that application can send request to one fixed
   endpoint no matter who is leader?
  
   For example, application just send to
   http://xxx.xxx.xxx.xxx:8983/solr/collection1
   even the real leader run on
 http://xxx.xxx.xxx.xxx:9983/solr/collection1
  
   Please help on this or give me some key infomation to google it.
  
   Many thanks.
  
   Floyd
  
 
 
 
  --
 
  Anshum Gupta
  http://www.anshumgupta.net
 



Re: Switch to new leader transparently?

2013-07-10 Thread Furkan KAMACI
By the way, this is not related to your question, but this may help you
connect to Solr from C#: http://solrsharp.codeplex.com/

2013/7/10 Floyd Wu floyd...@gmail.com

 Hi Furkan
 I'm using C#,  SolrJ won't help on this, but its impl is a good reference
 for me. Thanks for your help.

 by the way, how to fetch/get cluster state from zk directly in plain http
 or tcp socket?
 In my SolrCloud cluster, I'm using standalone zk to coordinate.

 Floyd




 2013/7/10 Furkan KAMACI furkankam...@gmail.com

  You can define a CloudSolrServer as like that:
 
  *private static CloudSolrServer solrServer;*
 
  and then define the addres of your zookeeper host:
 
  *private static String zkHost = localhost:9983;*
 
  initialize your variable:
 
  *solrServer = new CloudSolrServer(zkHost);*
 
  You can get leader list as like:
 
  *ClusterState clusterState =
  cloudSolrServer.getZkStateReader().getClusterState();
  ListReplica leaderList = new ArrayList();
for (Slice slice : clusterState.getSlices(collectionName)) {
leaderList.add(slice.getLeader()); /
  }*
 
 
  For querying you can try that:
  *
  *
  *SolrQuery solrQuery = new SolrQuery();*
  *//fill your **solrQuery variable here**
  *
  *QueryRequest queryRequest = new QueryRequest(solrQuery,
  SolrRequest.METHOD.POST);
  queryRequest.process(**solrServer**);*
 
  CloudSolrServer uses LBHttpSolrServer by default. It's definiton is like
  that: *LBHttpSolrServer or Load Balanced HttpSolrServer is just a
 wrapper
  to CommonsHttpSolrServer. This is useful when you have multiple
 SolrServers
  and query requests need to be Load Balanced among them. It offers
 automatic
  failover when a server goes down and it detects when the server comes
 back
  up.*
  *
  *
  *
  *
 
  2013/7/10 Anshum Gupta ans...@anshumgupta.net
 
   You don't really need to direct any query specifically to a leader. It
  will
   automatically be routed to the right leader.
   You may put a load balancer on top to just fix the problem with
 querying
  a
   node that has gone away.
  
   Also, ZK aware SolrJ Java client that load-balances across all nodes in
   cluster.
  
  
   On Wed, Jul 10, 2013 at 2:52 PM, Floyd Wu floyd...@gmail.com wrote:
  
Hi there,
   
I've built a SolrCloud cluster from example, but I have some
 question.
When I send query to one leader (say
http://xxx.xxx.xxx.xxx:8983/solr/collection1) and no problem
  everything
will be fine.
   
When I shutdown that leader, the other replica(
http://xxx.xxx.xxx.xxx:9983/solr/collection1) in the some shard will
  be
new
leader. The problem is:
   
The application doesn't know new leader's location and still send
  request
to http://xxx.xxx.xxx.xxx:8983/solr/collection1 and of course no
   response.
   
How can I know new leader in my application?
Are there any mechanism that application can send request to one
 fixed
endpoint no matter who is leader?
   
For example, application just send to
http://xxx.xxx.xxx.xxx:8983/solr/collection1
even the real leader run on
  http://xxx.xxx.xxx.xxx:9983/solr/collection1
   
Please help on this or give me some key infomation to google it.
   
Many thanks.
   
Floyd
   
  
  
  
   --
  
   Anshum Gupta
   http://www.anshumgupta.net
  
 



Re: Solr Live Nodes not updating immediately

2013-07-10 Thread Ranjith Venkatesan
My zkClientTimeout is set to 15000 by default.

I am using an external zookeeper-3.4.5 ensemble, which is also running on 3
machines. I am using only one shard with the replication factor set to 3.

A normal shutdown updates the Solr state as soon as the node goes down. I am
facing the issue with abrupt shutdowns (kill -9) or network problems.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Live-Nodes-not-updating-immediately-tp4076560p4076826.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Jed Glazner
Hey Daniel,

Thanks for the response.  I think we'll give this a try to see if this
helps.

Jed.

On 7/10/13 10:48 AM, Daniel Collins danwcoll...@gmail.com wrote:

We had something similar, in terms of update times suddenly spiking up for
no obvious reason.  We never got quite as bad as you in terms of the other
knock-on effects, but we certainly saw updates jumping from 10ms up to
3ms; all our external queues backed up and we rejected some updates,
then after a while things quietened down.

We were running Solr 4.3.0 but with Java 6 and the CMS GC.  We swapped to
Java 7 and the G1 GC (and increased the heap size from 8GB to 12GB) and the
problem went away.

Now, I admit it's not exactly the same as your case; we never had the
follow-on effects, but I'd consider Java 7 and the G1 GC, as it has certainly
reduced the spikes in our indexing times.

We run the following settings now (the usual caveats apply, it might not
work for you).

GC_OPTIONS="-XX:+AggressiveOpts -XX:+UseG1GC -XX:+UseStringCache
-XX:+OptimizeStringConcat -XX:-UseSplitVerifier -XX:+UseNUMA
-XX:MaxGCPauseMillis=50 -XX:GCPauseIntervalMillis=1000"

I set the MaxGCPauseMillis/GCPauseIntervalMillis to try to minimise
application pauses; that's our goal. If we have to use more memory in the
short term then so be it, but we can't afford application pauses,
because we are using NRT (soft commits every 1s, hard commits every 60s)
and we get a lot of updates.

I know there have been other discussions on G1 and it has received mixed
results overall, but for us, it seems to be a winner.

Hope that helps,


On 10 July 2013 08:32, Jed Glazner jglaz...@adobe.com wrote:

 We are planning an upgrade to 4.4 but it's still weeks out. We offer a
 high availability search service and there are a number of changes in
4.4
 that are not backward compatible. (i.e. Clusterstate.json and no
solr.xml)
 So there must be lots of testing, additionally this upgrade cannot be
 performed without downtime.

 Regardless, I need to find a band-aid right now.  Does anyone know if
it's
 possible to set the timeout for distributed update request to/from
leader.
  Currently we see it's set to 0.  Maybe via -D startup param, or
something?

 Jed

 On 7/10/13 1:23 AM, Otis Gospodnetic otis.gospodne...@gmail.com
wrote:

 Hi Jed,
 
 This is really with Solr 4.0?  If so, it may be wiser to jump on 4.4
 that is about to be released.  We did not have fun working with 4.0 in
 SolrCloud mode a few months ago.  You will save time, hair, and money
 if you convince your manager to let you use Solr 4.4. :)
 
 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm
 
 
 
 On Tue, Jul 9, 2013 at 4:44 PM, Jed Glazner jglaz...@adobe.com wrote:
  Hi Shawn,
 
  I have been trying to duplicate this problem without success for the
 last 2 weeks which is one reason I'm getting flustered.   It seems
 reasonable to be able to duplicate it but I can't.
 
   We do have a story to upgrade but that is still weeks if not months
 before that gets rolled out to production.
 
  We have another cluster running the same version but with 8 shards
and
 8 replicas with each shard at 100gb and more load and more indexing
 requests without this problem but we send docs in batches here and all
 fields are stored.   Where as the trouble index has only 1 or 2 stored
 fields and only send docs 1 at a time.
 
  Could that have anything to do with it?
 
  Jed
 
 
  Sent from Samsung Mobile
 
 
 
   Original message 
  From: Shawn Heisey s...@elyograg.org
  Date: 07.09.2013 18:33 (GMT+01:00)
  To: solr-user@lucene.apache.org
  Subject: Re: Solr Hangs During Updates for over 10 minutes
 
 
  On 7/9/2013 9:50 AM, Jed Glazner wrote:
  I'll give you the high level before delving deep into setup etc. I
 have been struggling at work with a seemingly random problem when
solr
 will hang for 10-15 minutes during updates.  This outage always seems
 to be immediately preceded by an EOF exception on the replica.
Then
 10-15 minutes later we see an exception on the leader for a socket
 timeout to the replica.  The leader will then tell the replica to
 recover which in most cases it does and then the outage is over.
 
  Here are the setup details:
 
  We are currently using Solr 4.0.0 with an external ZK ensemble of 5
 machines.
 
  After 4.0.0 was released, a *lot* of problems with SolrCloud surfaced
  and have since been fixed.  You're five releases and about nine
months
  behind what's current.  My recommendation: Upgrade to 4.3.1, ensure
your
  configuration is up to date with changes to the example config
between
  4.0.0 and 4.3.1, and reindex.  Ideally, you should set up a 4.0.0
  testbed, duplicate your current problem, and upgrade the testbed to
see
  if the problem goes away.  A testbed will also give you practice for
a
  smooth upgrade of your production system.
 
  Thanks,
  Shawn
 





Re: replication getting stuck on a file

2013-07-10 Thread Erick Erickson
Hmmm, that is kind of funny. I know this is ugly, but what happens if you:
1. stop the slave
2. completely delete the data/index directory (the directory too, not just its contents)
3. fire it back up?

inelegant at best, but if it cures your problem

Erick

On Tue, Jul 9, 2013 at 5:57 PM, Petersen, Robert
robert.peter...@mail.rakuten.com wrote:
 Look at the speed and time remaining on this one, pretty funny:


 Master: http://ssbuyma01:8983/solr/1/replication
   Latest Index Version: null, Generation: null
   Replicatable Index Version: 1276893670202, Generation: 127213
 Poll Interval: 00:05:00
 Local Index: Index Version: 1276893670108, Generation: 127204
   Location: /var/LucidWorks/lucidworks/solr/1/data/index
   Size: 23.13 GB
   Times Replicated Since Startup: 48874
   Previous Replication Done At: Tue Jul 09 13:12:05 PDT 2013
   Config Files Replicated At: null
   Config Files Replicated: null
   Times Config Files Replicated Since Startup: null
   Next Replication Cycle At: Tue Jul 09 13:17:04 PDT 2013
 Current Replication Status: Start Time: Tue Jul 09 13:12:04 PDT 2013
   Files Downloaded: 10 / 538
   Downloaded: 1.67 MB / 23.13 GB [0.0%]
   Downloading File: _34n2.prx, Downloaded: 140 bytes / 140 bytes [100.0%]
   Time Elapsed: 6203s, Estimated Time Remaining: 88091277s, Speed: 281 bytes/s


 -Original Message-
 From: Petersen, Robert [mailto:robert.peter...@mail.rakuten.com]
 Sent: Tuesday, July 09, 2013 1:22 PM
 To: solr-user@lucene.apache.org
 Subject: replication getting stuck on a file

 Hi

 My Solr 3.6.1 slave farm is suddenly getting stuck during replication.  It
 seems to stop on a random file on various slaves (not all) and not continue.
 I've tried stopping and restarting tomcat etc., but some slaves just can't get
 the index pulled down.  Note there is plenty of space on the hard drive.  I
 don't get it.  Everything else seems fine.  Does this ring a bell for anyone?
 I have the slaves set for five-minute polling intervals.

 Here is what I see in the admin page; it just stays on that one file and won't
 get past it while the speed steadily averages down to 0 KB/s:

 Master: http://ssbuyma01:8983/solr/1/replication
   Latest Index Version: null, Generation: null
   Replicatable Index Version: 1276893670111, Generation: 127205
 Poll Interval: 00:05:00
 Local Index: Index Version: 1276893670084, Generation: 127202
   Location: /var/LucidWorks/lucidworks/solr/1/data/index
   Size: 23.06 GB
   Times Replicated Since Startup: 48903
   Previous Replication Done At: Tue Jul 09 12:55:01 EDT 2013
   Config Files Replicated At: null
   Config Files Replicated: null
   Times Config Files Replicated Since Startup: null
   Next Replication Cycle At: Tue Jul 09 13:00:00 EDT 2013
 Current Replication Status: Start Time: Tue Jul 09 12:55:00 EDT 2013
   Files Downloaded: 59 / 486
   Downloaded: 88.73 MB / 23.06 GB [0.0%]
   Downloading File: _34mt.fnm, Downloaded: 1.35 MB / 1.35 MB [100.0%]
   Time Elapsed: 691s, Estimated Time Remaining: 183204s, Speed: 131.49 KB/s


 Robert (Robi) Petersen
 Senior Software Engineer
 Search Department











Re: join not working with UUIDs

2013-07-10 Thread Erick Erickson
What kind of field is root_id? If it's tokenized, or not the
same type as id, that could account for it: an analysed text field would
split the UUID on the hyphens, so the join terms would never line up with
the id values.

Best
Erick

On Tue, Jul 9, 2013 at 7:34 PM, Marcelo Elias Del Valle
mvall...@gmail.com wrote:
 Hello,

 I am trying to create a POC to test query joins. However, I was
 surprised to see that my test worked with some ids, but when my document ids
 are UUIDs, it doesn't work.
 An example follows, using SolrJ:

 SolrInputDocument doc = new SolrInputDocument();
 doc.addField("id", "bcbaf9eb-0da7-4225-be24-2b9472ad2c20");
 doc.addField("cor_parede", "branca");
 doc.addField("num_cadeiras", 34);
 solr.add(doc);

 // Add children
 SolrInputDocument doc2 = new SolrInputDocument();
 doc2.addField("id", "computador1");
 doc2.addField("acessorio1", "Teclado");
 doc2.addField("acessorio2", "Mouse");
 doc2.addField("root_id", "bcbaf9eb-0da7-4225-be24-2b9472ad2c20");
 solr.add(doc2);

  When I execute:

 // /select
 params={start=0&rows=10&q=cor_parede%3Abranca&fq=%7B%21join+from%3Droot_id+to%3Did%7Dacessorio1%3ATeclado}
 SolrQuery query = new SolrQuery();

 query.setStart(0);
 query.setRows(10);
 query.set("q", "cor_parede:branca");
 query.set("fq", "{!join from=root_id to=id}acessorio1:Teclado");

 QueryResponse response = DGSolrServer.get().query(query);
 long numFound = response.getResults().getNumFound();

it returns zero results. However, if I use "room1" for the first
 document's id and for the root_id field on the second document, it works.

Any idea why? What am I missing?

 Best regards,
 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr


Re: Switch to new leader transparently?

2013-07-10 Thread Erick Erickson
Floyd:

The Apache Zookeeper project should have the relevant info
on how to get the state from ZK directly.
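
A minimal sketch with the plain ZooKeeper Java client (assuming Solr 4.x,
which keeps the cluster layout at /clusterstate.json; the connect string
below is a placeholder for your own ensemble):

import org.apache.zookeeper.ZooKeeper;

ZooKeeper zk = new ZooKeeper("zkhost1:2181,zkhost2:2181,zkhost3:2181", 15000, null);
// SolrCloud stores the full cluster layout, including shard leaders, at this path
byte[] data = zk.getData("/clusterstate.json", false, null);
System.out.println(new String(data, "UTF-8"));  // JSON you can parse from any language, C# included
zk.close();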

FWIW,
Erick

On Wed, Jul 10, 2013 at 6:41 AM, Furkan KAMACI furkankam...@gmail.com wrote:
 By the way, this is not related to your question, but this may help you
 connect to Solr from C#: http://solrsharp.codeplex.com/

 2013/7/10 Floyd Wu floyd...@gmail.com

 Hi Furkan
 I'm using C#,  SolrJ won't help on this, but its impl is a good reference
 for me. Thanks for your help.

 by the way, how to fetch/get cluster state from zk directly in plain http
 or tcp socket?
 In my SolrCloud cluster, I'm using standalone zk to coordinate.

 Floyd




 2013/7/10 Furkan KAMACI furkankam...@gmail.com

  You can define a CloudSolrServer like this:
 
  private static CloudSolrServer solrServer;
 
  and then define the address of your ZooKeeper host:
 
  private static String zkHost = "localhost:9983";
 
  initialize your variable:
 
  solrServer = new CloudSolrServer(zkHost);
 
  You can get the leader list like this:
 
  ClusterState clusterState =
  cloudSolrServer.getZkStateReader().getClusterState();
  List<Replica> leaderList = new ArrayList<Replica>();
  for (Slice slice : clusterState.getSlices(collectionName)) {
      leaderList.add(slice.getLeader());
  }
 
  For querying you can try this:
 
  SolrQuery solrQuery = new SolrQuery();
  // fill your solrQuery variable here
  QueryRequest queryRequest = new QueryRequest(solrQuery,
      SolrRequest.METHOD.POST);
  queryRequest.process(solrServer);
 
  CloudSolrServer uses LBHttpSolrServer by default. Its definition is like
  this: LBHttpSolrServer or Load Balanced HttpSolrServer is just a wrapper
  to CommonsHttpSolrServer. This is useful when you have multiple SolrServers
  and query requests need to be load balanced among them. It offers automatic
  failover when a server goes down and it detects when the server comes back
  up.
 
  2013/7/10 Anshum Gupta ans...@anshumgupta.net
 
   You don't really need to direct any query specifically to a leader. It
  will
   automatically be routed to the right leader.
   You may put a load balancer on top to just fix the problem with
 querying
  a
   node that has gone away.
  
    Also, the ZK-aware SolrJ Java client load-balances across all nodes in the
    cluster.
  
  
   On Wed, Jul 10, 2013 at 2:52 PM, Floyd Wu floyd...@gmail.com wrote:
  
Hi there,
   
I've built a SolrCloud cluster from example, but I have some
 question.
When I send query to one leader (say
http://xxx.xxx.xxx.xxx:8983/solr/collection1) and no problem
  everything
will be fine.
   
When I shutdown that leader, the other replica(
 http://xxx.xxx.xxx.xxx:9983/solr/collection1) in the same shard will
  be
new
leader. The problem is:
   
The application doesn't know new leader's location and still send
  request
to http://xxx.xxx.xxx.xxx:8983/solr/collection1 and of course no
   response.
   
How can I know new leader in my application?
Are there any mechanism that application can send request to one
 fixed
endpoint no matter who is leader?
   
For example, application just send to
http://xxx.xxx.xxx.xxx:8983/solr/collection1
even the real leader run on
  http://xxx.xxx.xxx.xxx:9983/solr/collection1
   
     Please help on this or give me some key information to google it.
   
Many thanks.
   
Floyd
   
  
  
  
   --
  
   Anshum Gupta
   http://www.anshumgupta.net
  
 

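A minimal sketch of what Erick points to: reading the cluster state straight
from ZooKeeper with the plain Java client. The ensemble address and the
/clusterstate.json znode path are assumptions based on the Solr 4.x defaults,
not code from the thread:

    import org.apache.zookeeper.ZooKeeper;

    public class ClusterStateDump {
        public static void main(String[] args) throws Exception {
            // One-shot read, so no watcher is registered.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, null);
            // Solr 4.x keeps the whole cluster layout in this single znode;
            // the JSON payload includes each shard's current leader.
            byte[] data = zk.getData("/clusterstate.json", false, null);
            System.out.println(new String(data, "UTF-8"));
            zk.close();
        }
    }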


Re: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Erick Erickson
Jed:

I'm not sure changing Java runtime is any less scary than upgrading Solr

Wait, I know! Ask your manager if you can do both at once <evil smirk>. I have
a t-shirt that says "I don't test, but when I do it's in production..."

Erick

On Wed, Jul 10, 2013 at 8:08 AM, Jed Glazner jglaz...@adobe.com wrote:
 Hey Daniel,

 Thanks for the response.  I think we'll give this a try to see if this
 helps.

 Jed.

 On 7/10/13 10:48 AM, Daniel Collins danwcoll...@gmail.com wrote:

We had something similar in terms of update times suddenly spiking up for
no obvious reason.  We never got quite as bad as you in terms of the other
knock on effects, but we certainly saw updates jumping from 10ms up to
3ms, all our external queues backed up and we rejected some updates,
then after a while things quietened down.

We were running Solr 4.3.0 but with Java 6 and the CMS GC.  We swapped to
Java 7, G1 GC (and increased heap size from 8Gb to 12Gb) and the problem
went away.

Now, I admit its not exactly the same as your case, we never had the
follow-on effects, but I'd consider Java 7 and the G1 GC, it has certainly
reduced the spikes in our indexing times.

We run the following settings now (the usual caveats apply, it might not
work for you).

GC_OPTIONS=-XX:+AggressiveOpts -XX:+UseG1GC -XX:+UseStringCache
-XX:+OptimizeStringConcat -XX:-UseSplitVerifier -XX:+UseNUMA
-XX:MaxGCPauseMillis=50 -XX:GCPauseIntervalMillis=1000

I set the MaxGCPauseMillis/GCPauseIntervalMillis to try to minimise
application pauses, that's our goal, if we have to use more memory in the
short term then so be it, but we couldn't afford application pauses,
because we are using NRT (soft commits every 1s, hard commits every 60s)
and we get a lot of updates.

I know there have been other discussion on G1 and it has received mixed
results overall, but for us, it seems to be a winner.

Hope that helps,


On 10 July 2013 08:32, Jed Glazner jglaz...@adobe.com wrote:

 We are planning an upgrade to 4.4 but it's still weeks out. We offer a
 high availability search service and there are a number of changes in
4.4
 that are not backward compatible. (i.e. Clusterstate.json and no
solr.xml)
 So there must be lots of testing, additionally this upgrade cannot be
 performed without downtime.

 Regardless, I need to find a band-aid right now.  Does anyone know if
it's
 possible to set the timeout for distributed update request to/from
leader.
  Currently we see it's set to 0.  Maybe via -D startup param, or
something?

 Jed

 On 7/10/13 1:23 AM, Otis Gospodnetic otis.gospodne...@gmail.com
wrote:

 Hi Jed,
 
 This is really with Solr 4.0?  If so, it may be wiser to jump on 4.4
 that is about to be released.  We did not have fun working with 4.0 in
 SolrCloud mode a few months ago.  You will save time, hair, and money
 if you convince your manager to let you use Solr 4.4. :)
 
 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm
 
 
 
 On Tue, Jul 9, 2013 at 4:44 PM, Jed Glazner jglaz...@adobe.com wrote:
  Hi Shawn,
 
  I have been trying to duplicate this problem without success for the
 last 2 weeks which is one reason I'm getting flustered.   It seems
 reasonable to be able to duplicate it but I can't.
 
   We do have a story to upgrade but that is still weeks if not months
 before that gets rolled out to production.
 
  We have another cluster running the same version but with 8 shards
and
 8 replicas with each shard at 100gb and more load and more indexing
 requests without this problem but we send docs in batches here and all
 fields are stored.   Whereas the trouble index has only 1 or 2 stored
 fields and we only send docs 1 at a time.
 
  Could that have anything to do with it?
 
  Jed
 
 
  Sent from Samsung Mobile
 
 
 
   Original message 
  From: Shawn Heisey s...@elyograg.org
  Date: 07.09.2013 18:33 (GMT+01:00)
  To: solr-user@lucene.apache.org
  Subject: Re: Solr Hangs During Updates for over 10 minutes
 
 
  On 7/9/2013 9:50 AM, Jed Glazner wrote:
  I'll give you the high level before delving deep into setup etc. I
 have been struggling at work with a seemingly random problem when
solr
 will hang for 10-15 minutes during updates.  This outage always seems
 to immediately be preceded by an EOF exception on the replica.
Then
 10-15 minutes later we see an exception on the leader for a socket
 timeout to the replica.  The leader will then tell the replica to
 recover which in most cases it does and then the outage is over.
 
  Here are the setup details:
 
  We are currently using Solr 4.0.0 with an external ZK ensemble of 5
 machines.
 
  After 4.0.0 was released, a *lot* of problems with SolrCloud surfaced
  and have since been fixed.  You're five releases and about nine
months
  behind what's current.  My recommendation: Upgrade to 4.3.1, ensure
your
  configuration is up to date with changes to the example config
between
  4.0.0 and 4.3.1, and reindex.  

Re: Staggered Replication In Solr?

2013-07-10 Thread adityab
Thanks Shawn,
We do have repeaters setup to replicate index to the 8 Slaves.
We update documents on the Master every 2 hrs in a batch process. The hard
commit is then replicated to the repeaters and on to the slaves.
The concern is that during heavy traffic, when the slaves are busy serving
requests and a new index becomes available on the repeaters, all the slaves
start replicating at the same time. And that's when we see the spike across
the entire cluster.
In a single cluster we have 1 Master, 2 Repeaters, 8 Slaves.
We have currently implemented a cron job which performs staggered
replication, so that not all slaves spike at the same time and the cluster
stays in a state to serve traffic.





Re: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Jed Glazner
It is certainly 'more' possible, as we have additional code that revolves
around reading the clusterstate.json, and since Solr decided to change the
format of the clusterstate.json from 4.0 to 4.1, it requires additional
code changes to our service, since the solrj lib from 4.0 isn't compatible
with anything after 4.0 due to the clusterstate.json change.  I can
however run Java 7 with these GC settings in a dev env under load to see if they
blow up or if it's even possible, and then roll it out to the replica, and
then to the leader. I cannot however do this with a Solr upgrade
without significant coding changes to our service, which would require us
to roll out new code for our service, as well as new Solr instances.

So, while it's 'just as risky' as you say, it's 'less risky' than a new
version of java and is possible to implement without downtime.

It is actually something of a pain point that the upgrade path to
solrcloud seems to frequently require downtime. (clusterstate.json changes
in 4.1, and then again this big change in 4.4 with no solr.xml).

So we'll do what we can quickly to see if we can 'band-aid' the problem
until we can upgrade to solr 4.4  Speaking of band-aids - does anyone know
of a way to change the socket timeout/connection timeout for distributed
updates?

Jed.

On 7/10/13 2:38 PM, Erick Erickson erickerick...@gmail.com wrote:

Jed:

I'm not sure changing Java runtime is any less scary than upgrading
Solr

Wait, I know! Ask your manager if you can do both at once <evil smirk>. I have
a t-shirt that says "I don't test, but when I do it's in production..."

Erick

On Wed, Jul 10, 2013 at 8:08 AM, Jed Glazner jglaz...@adobe.com wrote:
 Hey Daniel,

 Thanks for the response.  I think we'll give this a try to see if this
 helps.

 Jed.

 On 7/10/13 10:48 AM, Daniel Collins danwcoll...@gmail.com wrote:

We had something similar in terms of update times suddenly spiking up
for
no obvious reason.  We never got quite as bad as you in terms of the
other
knock on effects, but we certainly saw updates jumping from 10ms up to
3ms, all our external queues backed up and we rejected some updates,
then after a while things quietened down.

We were running Solr 4.3.0 but with Java 6 and the CMS GC.  We swapped
to
Java 7, G1 GC (and increased heap size from 8Gb to 12Gb) and the problem
went away.

Now, I admit its not exactly the same as your case, we never had the
follow-on effects, but I'd consider Java 7 and the G1 GC, it has
certainly
reduced the spikes in our indexing times.

We run the following settings now (the usual caveats apply, it might not
work for you).

GC_OPTIONS=-XX:+AggressiveOpts -XX:+UseG1GC -XX:+UseStringCache
-XX:+OptimizeStringConcat -XX:-UseSplitVerifier -XX:+UseNUMA
-XX:MaxGCPauseMillis=50 -XX:GCPauseIntervalMillis=1000

I set the MaxGCPauseMillis/GCPauseIntervalMillis to try to minimise
application pauses, that's our goal, if we have to use more memory in
the
short term then so be it, but we couldn't afford application pauses,
because we are using NRT (soft commits every 1s, hard commits every 60s)
and we get a lot of updates.

I know there have been other discussion on G1 and it has received mixed
results overall, but for us, it seems to be a winner.

Hope that helps,


On 10 July 2013 08:32, Jed Glazner jglaz...@adobe.com wrote:

 We are planning an upgrade to 4.4 but it's still weeks out. We offer a
 high availability search service and there are a number of changes in
4.4
 that are not backward compatible. (i.e. Clusterstate.json and no
solr.xml)
 So there must be lots of testing, additionally this upgrade cannot be
 performed without downtime.

 Regardless, I need to find a band-aid right now.  Does anyone know if
it's
 possible to set the timeout for distributed update request to/from
leader.
  Currently we see it's set to 0.  Maybe via -D startup param, or
something?

 Jed

 On 7/10/13 1:23 AM, Otis Gospodnetic otis.gospodne...@gmail.com
wrote:

 Hi Jed,
 
 This is really with Solr 4.0?  If so, it may be wiser to jump on 4.4
 that is about to be released.  We did not have fun working with 4.0
in
 SolrCloud mode a few months ago.  You will save time, hair, and money
 if you convince your manager to let you use Solr 4.4. :)
 
 Otis
 --
  Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm
 
 
 
 On Tue, Jul 9, 2013 at 4:44 PM, Jed Glazner jglaz...@adobe.com
wrote:
  Hi Shawn,
 
  I have been trying to duplicate this problem without success for
the
 last 2 weeks which is one reason I'm getting flustered.   It seems
 reasonable to be able to duplicate it but I can't.
 
   We do have a story to upgrade but that is still weeks if not
months
 before that gets rolled out to production.
 
  We have another cluster running the same version but with 8 shards
and
 8 replicas with each shard at 100gb and more load and more indexing
 requests without this problem but we send docs in batches here and
all
 fields are stored.   Where 

Re: join not working with UUIDs

2013-07-10 Thread Marcelo Elias Del Valle
root_id is a dynamic field... But should the type of the field change
according to the values? Because using the same configuration but using
room1 as value, it works.

Let me compare the configurations:

<field name="id" type="string" indexed="true" stored="true"
required="true" multiValued="false" />

<dynamicField name="*" type="text_general" multiValued="true" />

Indeed, one is text_general and the other is string... I will try to create
a fixed field root_id and check if it works...

Thanks for the hint!



2013/7/10 Erick Erickson erickerick...@gmail.com

 What kind of field is root_id? If it's tokenized or not the
 same type as id, that could account for it.

 Best
 Erick

 On Tue, Jul 9, 2013 at 7:34 PM, Marcelo Elias Del Valle
 mvall...@gmail.com wrote:
  Hello,
 
  I am trying to create a POC to test query joins. However, I was
  surprised when I saw my test worked with some ids, but when my document
 ids
  are UUIDs, it doesn't work.
  Follows an example, using solrj:
 
  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("id", "bcbaf9eb-0da7-4225-be24-2b9472ad2c20");
  doc.addField("cor_parede", "branca");
  doc.addField("num_cadeiras", 34);
  solr.add(doc);
 
  // Add children
  SolrInputDocument doc2 = new SolrInputDocument();
  doc2.addField("id", "computador1");
  doc2.addField("acessorio1", "Teclado");
  doc2.addField("acessorio2", "Mouse");
  doc2.addField("root_id", "bcbaf9eb-0da7-4225-be24-2b9472ad2c20");
  solr.add(doc2);
 
   When I execute:
 
  ///select
 
 params={start=0&rows=10&q=cor_parede%3Abranca&fq=%7B%21join+from%3Droot_id+to%3Did%7Dacessorio1%3ATeclado}
  SolrQuery query = new SolrQuery();
 
  query.setStart(0);
  query.setRows(10);
  query.set("q", "cor_parede:branca");
  query.set("fq", "{!join from=root_id to=id}acessorio1:Teclado");
 
  QueryResponse response = DGSolrServer.get().query(query);
  long numFound = response.getResults().getNumFound();
 
 it returns zero results. However, if I use room1 for first
  document's id and for root_id field on second document, it works.
 
 Any idea why? What am I missing?
 
  Best regards,
  --
  Marcelo Elias Del Valle
  http://mvalle.com - @mvallebr




-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr



Re: Solr limitations

2013-07-10 Thread Jack Krupansky
Again, no hard limits, mostly performance-based limits and environmental 
factors of your own environment, as well as the fact that most people on 
this list will have deeper experience with smaller clusters, so if you 
decide to go big, you will be in uncharted and untested territory.


I would relax my number a little (actually, double it) to 64 nodes, to 
handle the 8-shard, 8-replica case, since just yesterday somebody on the 
list mentioned that they were using such a configuration.


In other words, with configurations up to 16 or 32 or even 64 nodes, you 
will readily find people here who might be able to help support you, but if 
you are thinking of a 16-shard, 16-replica cluster with 256 nodes or 
32-shard, 32-replica cluster with 1,024 nodes, it's not that that will hit 
any hard limit in Solr, but simply that not as many people will be able to 
provide support, answer questions, or simply confirm that yes, a cluster 
that big is a... slam-dunk. And if you do want to try a 1,024-node 
cluster, you absolutely should do a Proof of Concept implementation first.


I actually don't have any hard, empirical evidence to back up my 32/64-node 
guidance, but it seems reasonable and consistent with configurations people 
commonly talk about. Generally, people talk about smaller clusters, so I'm 
stretching a little to get up to my 32/64 guidance. And, to be clear, that's 
just a rough guide and not intended to guarantee that a 64-node cluster will 
perform really well, nor to imply that a 96-node or 128-node cluster won't 
perform well.


-- Jack Krupansky

-Original Message- 
From: Ramkumar R. Aiyengar

Sent: Wednesday, July 10, 2013 4:03 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr limitations

I understand, thanks. I just wanted to check in case there were scalability
limitations with how SolrCloud operates..
On 9 Jul 2013 12:45, Erick Erickson erickerick...@gmail.com wrote:


I think Jack was mostly thinking in slam dunk terms. I know of
SolrCloud demo clusters with 500+ nodes, and at that point
people said it's going to work for our situation, we don't need
to push more.

As you start getting into that kind of scale, though, you really
have a bunch of ops considerations etc. Mostly when I get into
larger scales I pretty much want to examine my assumptions
and see if they're correct, perhaps start to trim my requirements
etc.

FWIW,
Erick

On Tue, Jul 9, 2013 at 4:07 AM, Ramkumar R. Aiyengar
andyetitmo...@gmail.com wrote:
 5. No more than 32 nodes in your SolrCloud cluster.

 I hope this isn't too OT, but what tradeoffs is this based on? Would have
 thought it easy to hit this number for a big index and high load (hence
 with the view of both the number of shards and replicas horizontally
 scaling..)

 6. Don't return more than 250 results on a query.

 None of those is a hard limit, but don't go beyond them unless your
Proof
 of Concept testing proves that performance is acceptable for your
situation.

 Start with a simple 4-node, 2-shard, 2-replica cluster for preliminary
 tests and then scale as needed.

 Dynamic and multivalued fields? Try to stay away from them - except for
 the simplest cases, they are usually an indicator of a weak data model.
 Sure, it's fine to store a relatively small number of values in a
 multivalued field (say, dozens of values), but be aware that you can't
 directly access individual values, you can't tell which was matched on a
 query, and you can't coordinate values between multiple multivalued
fields.
 Except for very simple cases, multivalued fields should be flattened into
 multiple documents with a parent ID.

 Since you brought up the topic of dynamic fields, I am curious how you
 got the impression that they were a good technique to use as a starting
 point. They're fine for prototyping and hacking, and fine when used in
 moderation, but not when used to excess. The whole point of Solr is
 searching, and searching is optimized within fields, not across fields, so
 having lots of dynamic fields is counter to the primary strengths of
Lucene
 and Solr. And... schemas with lots  of dynamic fields tend to be
difficult
 to maintain. For example, if you wanted to ask a support question here,
one
 of the first things we want to know is what your schema looks like, but
 with lots of dynamic fields it is not possible to have a simple
discussion
 of what your schema looks like.

 Sure, there is something called schemaless design (and Solr supports
 that in 4.4), but that's very different from heavy reliance on dynamic
 fields in the traditional sense. Schemaless design is A-OK, but using
 dynamic fields for arrays of data in a single document is a poor match
 for the search features of Solr (e.g., Edismax searching across multiple
 fields.)

 One other tidbit: Although Solr does not enforce naming conventions for
 field names, and you can put special characters in them, there are 
 plenty
 of features in Solr, such as the common fl parameter, 

Re: Switch to new leader transparently?

2013-07-10 Thread Aloke Ghoshal
Hi Floyd,

We use SolrNet to connect to Solr from a C# application. Since SolrNet is
not aware of SolrCloud or ZK, we use an HTTP load balancer in front of
the Solr nodes & query via the load balancer URL. You could use something
like HAProxy or an Apache reverse proxy for load balancing.

On the other hand in order to write a ZK aware client in C# you could start
here: https://github.com/ewhauser/zookeeper/tree/trunk/src/dotnet

Regards,
Aloke


On Wed, Jul 10, 2013 at 4:11 PM, Furkan KAMACI furkankam...@gmail.comwrote:

 By the way, this is not related to your question, but it may help you with
 connecting to Solr via C#: http://solrsharp.codeplex.com/

 2013/7/10 Floyd Wu floyd...@gmail.com

  Hi Furkan
  I'm using C#,  SolrJ won't help on this, but its impl is a good reference
  for me. Thanks for your help.
 
  by the way, how to fetch/get cluster state from zk directly in plain http
  or tcp socket?
  In my SolrCloud cluster, I'm using standalone zk to coordinate.
 
  Floyd
 
 
 
 
  2013/7/10 Furkan KAMACI furkankam...@gmail.com
 
   You can define a CloudSolrServer like this:
  
   private static CloudSolrServer solrServer;
  
   and then define the address of your ZooKeeper host:
  
   private static String zkHost = "localhost:9983";
  
   initialize your variable:
  
   solrServer = new CloudSolrServer(zkHost);
  
   You can get the leader list like this:
  
   ClusterState clusterState =
   cloudSolrServer.getZkStateReader().getClusterState();
   List<Replica> leaderList = new ArrayList<Replica>();
   for (Slice slice : clusterState.getSlices(collectionName)) {
       leaderList.add(slice.getLeader());
   }
  
   For querying you can try this:
  
   SolrQuery solrQuery = new SolrQuery();
   // fill your solrQuery variable here
   QueryRequest queryRequest = new QueryRequest(solrQuery,
       SolrRequest.METHOD.POST);
   queryRequest.process(solrServer);
  
   CloudSolrServer uses LBHttpSolrServer by default. Its definition is like
   this: LBHttpSolrServer or Load Balanced HttpSolrServer is just a wrapper
   to CommonsHttpSolrServer. This is useful when you have multiple SolrServers
   and query requests need to be load balanced among them. It offers automatic
   failover when a server goes down and it detects when the server comes back
   up.
  
   2013/7/10 Anshum Gupta ans...@anshumgupta.net
  
You don't really need to direct any query specifically to a leader.
 It
   will
automatically be routed to the right leader.
You may put a load balancer on top to just fix the problem with
  querying
   a
node that has gone away.
   
Also, ZK aware SolrJ Java client that load-balances across all nodes
 in
cluster.
   
   
On Wed, Jul 10, 2013 at 2:52 PM, Floyd Wu floyd...@gmail.com
 wrote:
   
 Hi there,

 I've built a SolrCloud cluster from example, but I have some
  question.
 When I send query to one leader (say
 http://xxx.xxx.xxx.xxx:8983/solr/collection1) and no problem
   everything
 will be fine.

 When I shutdown that leader, the other replica(
  http://xxx.xxx.xxx.xxx:9983/solr/collection1) in the same shard
 will
   be
 new
 leader. The problem is:

 The application doesn't know new leader's location and still send
   request
 to http://xxx.xxx.xxx.xxx:8983/solr/collection1 and of course no
response.

 How can I know new leader in my application?
 Are there any mechanism that application can send request to one
  fixed
 endpoint no matter who is leader?

 For example, application just send to
 http://xxx.xxx.xxx.xxx:8983/solr/collection1
 even the real leader run on
   http://xxx.xxx.xxx.xxx:9983/solr/collection1

  Please help on this or give me some key information to google it.

 Many thanks.

 Floyd

   
   
   
--
   
Anshum Gupta
http://www.anshumgupta.net
   
  
 



simple date query

2013-07-10 Thread Marcos Mendez
Hi,

I'm trying to do something like startDate_tdt <= NOW <= endDate_tdt. Any ideas 
how I can implement this in a query? I don't think the normal range query will 
work.

Regards,
Marcos

RE: simple date query

2013-07-10 Thread Markus Jelsma
hi - check the examples for range queries and date math:

http://wiki.apache.org/solr/SolrQuerySyntax
http://lucene.apache.org/solr/4_3_1/solr-core/org/apache/solr/util/DateMathParser.html
 
 
-Original message-
 From:Marcos Mendez mar...@aimrecyclinggroup.com
 Sent: Wednesday 10th July 2013 15:47
 To: solr-user@lucene.apache.org
 Subject: simple date query
 
 Hi,
 
 I'm trying to do something like startDate_tdt <= NOW <= endDate_tdt. Any 
 ideas how I can implement this in a query? I don't think the normal range 
 query will work.
 
 Regards,
 Marcos


Re: join not working with UUIDs

2013-07-10 Thread Marcelo Elias Del Valle
Worked :D
Thanks a lot!


2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com

 root_id is a dynamic field... But should the type of the field change
 according to the values? Because using the same configuration but using
 room1 as value, it works.

 Let me compare the configurations:

 <field name="id" type="string" indexed="true" stored="true"
 required="true" multiValued="false" />

 <dynamicField name="*" type="text_general" multiValued="true" />

 Indeed, one is text_general and the other is string... I will try to
 create a fixed field root_id and check if it works...

 Thanks for the hint!



 2013/7/10 Erick Erickson erickerick...@gmail.com

 What kind of field is root_id? If it's tokenized or not the
 same type as id, that could account for it.

 Best
 Erick

 On Tue, Jul 9, 2013 at 7:34 PM, Marcelo Elias Del Valle
 mvall...@gmail.com wrote:
  Hello,
 
  I am trying to create a POC to test query joins. However, I was
  surprised when I saw my test worked with some ids, but when my document
 ids
  are UUIDs, it doesn't work.
  Follows an example, using solrj:
 
  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("id", "bcbaf9eb-0da7-4225-be24-2b9472ad2c20");
  doc.addField("cor_parede", "branca");
  doc.addField("num_cadeiras", 34);
  solr.add(doc);
 
  // Add children
  SolrInputDocument doc2 = new SolrInputDocument();
  doc2.addField("id", "computador1");
  doc2.addField("acessorio1", "Teclado");
  doc2.addField("acessorio2", "Mouse");
  doc2.addField("root_id", "bcbaf9eb-0da7-4225-be24-2b9472ad2c20");
  solr.add(doc2);
 
   When I execute:
 
  ///select
 
 params={start=0&rows=10&q=cor_parede%3Abranca&fq=%7B%21join+from%3Droot_id+to%3Did%7Dacessorio1%3ATeclado}
  SolrQuery query = new SolrQuery();
 
  query.setStart(0);
  query.setRows(10);
  query.set("q", "cor_parede:branca");
  query.set("fq", "{!join from=root_id to=id}acessorio1:Teclado");
 
  QueryResponse response = DGSolrServer.get().query(query);
  long numFound = response.getResults().getNumFound();
 
 it returns zero results. However, if I use room1 for first
  document's id and for root_id field on second document, it works.
 
 Any idea why? What am I missing?
 
  Best regards,
  --
  Marcelo Elias Del Valle
  http://mvalle.com - @mvallebr




 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr




-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
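
For reference, a minimal sketch of the fix that worked here: declare root_id
explicitly with the same untokenized string type as id, so the join's from/to
values compare exactly. The field and type names come from the thread; the
other attributes are illustrative:

    <field name="root_id" type="string" indexed="true" stored="true" multiValued="false" />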


Re: simple date query

2013-07-10 Thread Jack Krupansky
You can't use two fields in one range query, but you can combine two range 
queries:


startDate_tdt:[* TO NOW] AND endDate_tdt:[NOW TO *]

-- Jack Krupansky

-Original Message- 
From: Marcos Mendez

Sent: Wednesday, July 10, 2013 9:31 AM
To: solr-user@lucene.apache.org
Subject: simple date query

Hi,

I'm trying to do something like startDate_tdt <= NOW <= endDate_tdt. Any 
ideas how I can implement this in a query? I don't think the normal range 
query will work.


Regards,
Marcos
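
A minimal SolrJ sketch of Jack's two-range filter; the field names come from
the thread, while the server URL and scaffolding are illustrative:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ActiveNowQuery {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server =
                    new HttpSolrServer("http://localhost:8983/solr/collection1");
            SolrQuery query = new SolrQuery("*:*");
            // Start date in the past and end date in the future, i.e.
            // startDate_tdt <= NOW <= endDate_tdt.
            query.addFilterQuery("startDate_tdt:[* TO NOW] AND endDate_tdt:[NOW TO *]");
            QueryResponse response = server.query(query);
            System.out.println("found: " + response.getResults().getNumFound());
        }
    }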



Securing SOLR REST API

2013-07-10 Thread Pires, Guilherme
Hello Everyone,

I have been developing several solutions, mainly geospatial, that include solr.
The availability of the RESTful services seems to bother a lot of people, 
mainly IT security, of course.

How can I guarantee that Solr services are only 'called' from my web 
html5/jquery based application?
Any ideas?

Thanks
Guilherme
GIS Solution Specialist


Re: Securing SOLR REST API

2013-07-10 Thread Steve Rowe
Hi Guilherme, see http://wiki.apache.org/solr/SolrSecurity - Steve

On Jul 10, 2013, at 10:22 AM, Pires, Guilherme guilherme.pi...@cgi.com 
wrote:

 Hello Everyone,
 
 I have been developing several solutions, mainly geospatial, that include 
 solr.
 The availability of the restful services seem to bother a lot of people. 
 Mainly IT security, of course.
 
 How can I guarantee that Solr services are only 'called' from my web 
 html5/jquery based application?
 Any ideas?
 
 Thanks
 Guilherme
 GIS Solution Specialist



Re: Securing SOLR REST API

2013-07-10 Thread Nazik


Sent from my iPhone

On Jul 10, 2013, at 10:22 AM, Pires, Guilherme guilherme.pi...@cgi.com 
wrote:

 Hello Everyone,
 
 I have been developing several solutions, mainly geospatial, that include 
 solr.
 The availability of the restful services seem to bother a lot of people. 
 Mainly IT security, of course.
 
 How can I guarantee that Solr services are only 'called' from my web 
 html5/jquery based application?
 Any ideas?
 
 Thanks
 Guilherme
 GIS Solution Specialist


Commit different database rows to solr with same id value?

2013-07-10 Thread Jason Huang
Hello,

I am trying to use Solr to store fields from two different database tables,
where the primary keys are in the format of 1, 2, 3, 

In Java, we build different POJO classes for these two database tables:

table1.java

@SolrIndex(name="id")

private String idTable1




table2.java

@SolrIndex(name="id")

private String idTable2



And later we add these fields defined in the two different types of tables
and commit it to solrServer.


Here is the scenario where I am having issues:

(1) commit a row from table1 with primary key = 3, this generates a
document in Solr

(2) commit another row from table2 with the same value of primary key =
3, this overwrites the document generated in step (1).


What we really want to achieve is to keep both rows in (1) and (2) because
they are from different tables. I've read something from google search and
it appears that we might be able to do it via keeping multiple cores in
solr? Could anyone point at how to implement multiple core to achieve this?
To be more specific, when I commit the row as a document, I don't have a
place to pick a certain core and I am not sure if it makes any sense for me
to specify a core when I commit the document since the layer I am working
on should abstract it away from me.



The second question is - if we don't want to do a multicore (since we can't
easily search for related data between multiple cores), how can we resolve
this issue so both rows from different database tables which share the same
primary key still exist? We don't want to have to always change the primary
key format to ensure a uniqueness of the primary key among all different
types of database tables.


thanks!


Jason


RE: Commit different database rows to solr with same id value?

2013-07-10 Thread David Quarterman
Hi Jason,

Assuming you're using DIH, why not build a new, unique id within the query to 
use as  the 'doc_id' for SOLR? We do something like this in one of our 
collections. In MySQL, try this (don't know what it would be for any other db 
but there must be equivalents):

select @rownum:=@rownum+1 rowid, t.* from (main select query) t, (select 
@rownum:=0) s

Regards,

DQ

-Original Message-
From: Jason Huang [mailto:jason.hu...@icare.com] 
Sent: 10 July 2013 15:50
To: solr-user@lucene.apache.org
Subject: Commit different database rows to solr with same id value?

Hello,

I am trying to use Solr to store fields from two different database tables, 
where the primary keys are in the format of 1, 2, 3, 

In Java, we build different POJO classes for these two database tables:

table1.java

@SolrIndex(name="id")

private String idTable1




table2.java

@SolrIndex(name="id")

private String idTable2



And later we add these fields defined in the two different types of tables and 
commit it to solrServer.


Here is the scenario where I am having issues:

(1) commit a row from table1 with primary key = 3, this generates a document 
in Solr

(2) commit another row from table2 with the same value of primary key = 3, 
this overwrites the document generated in step (1).


What we really want to achieve is to keep both rows in (1) and (2) because they 
are from different tables. I've read something from google search and it 
appears that we might be able to do it via keeping multiple cores in solr? 
Could anyone point at how to implement multiple core to achieve this?
To be more specific, when I commit the row as a document, I don't have a place 
to pick a certain core and I am not sure if it makes any sense for me to 
specify a core when I commit the document since the layer I am working on 
should abstract it away from me.



The second question is - if we don't want to do a multicore (since we can't 
easily search for related data between multiple cores), how can we resolve this 
issue so both rows from different database table which shares the same primary 
key still exist? We don't want to have to always change the primary key format 
to ensure a uniqueness of the primary key among all different types of database 
tables.


thanks!


Jason
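
If DIH isn't in the picture, another option (a sketch, not from the thread)
is to make the Solr id unique in the application layer by prefixing the
source table, so rows from table1 and table2 with the same primary key no
longer collide:

    import org.apache.solr.common.SolrInputDocument;

    public class CompositeIdExample {
        static SolrInputDocument toDoc(String table, String primaryKey) {
            SolrInputDocument doc = new SolrInputDocument();
            // "table1-3" and "table2-3" become distinct Solr documents.
            doc.addField("id", table + "-" + primaryKey);
            // Keeping the source table as its own field makes it easy to
            // filter on later (e.g. fq=source_table:table1).
            doc.addField("source_table", table);
            return doc;
        }
    }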


Re: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Shawn Heisey

On 7/10/2013 6:57 AM, Jed Glazner wrote:

So we'll do what we can quickly to see if we can 'band-aid' the problem
until we can upgrade to solr 4.4  Speaking of band-aids - does anyone know
of a way to change the socket timeout/connection timeout for distributed
updates?


If you need to change HttpClient parameters for CloudSolrServer, here's 
how you can do it:


String zkHost = 
zk1.REDACTED.com:2181,zk2.REDACTED.com:2181,zk3.REDACTED.com:2181/chroot;

ModifiableSolrParams params = new ModifiableSolrParams();
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 1000);
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 200);
params.set(HttpClientUtil.PROP_SO_TIMEOUT, 30);
params.set(HttpClientUtil.PROP_CONNECTION_TIMEOUT, 5000);
HttpClient client = HttpClientUtil.createClient(params);
ResponseParser parser = new BinaryResponseParser();
LBHttpSolrServer lbServer = new LBHttpSolrServer(client, parser);
CloudSolrServer server = new CloudSolrServer(zkHost, lbServer);

Thanks,
Shawn



When not to use NRTCachingDirectory and what to use instead.

2013-07-10 Thread Tom Burton-West
Hello all,

The default directory implementation in Solr 4 is the NRTCachingDirectory
(in the example solrconfig.xml file , see below).

The Javadoc for NRTCachingDirectory (
http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/store/NRTCachingDirectory.html?is-external=true)
 says:

 This class is likely only useful in a near real-time context, where
indexing rate is lowish but reopen rate is highish, resulting in many tiny
files being written...

It seems like we have exactly the opposite use case, so we would like
advice on what directory implementation to use instead.

We are doing offline batch indexing, so no searches are being done.  So we
don't need NRT.  We also have a high indexing rate as we are trying to
index 3 billion pages as quickly as possible.

I am not clear what determines the reopen rate.   Is it only related to
searching or is it involved in indexing as well?

 Does the NRTCachingDirectory have any benefit for indexing under the use
case noted above?

I'm guessing we should just use solr.StandardDirectoryFactory instead.
 Is this correct?

Tom

---





<!-- The DirectoryFactory to use for indexes.

   solr.StandardDirectoryFactory is filesystem
   based and tries to pick the best implementation for the current
   JVM and platform.  solr.NRTCachingDirectoryFactory, the default,
   wraps solr.StandardDirectoryFactory and caches small files in memory
   for better NRT performance.

   One can force a particular implementation via
solr.MMapDirectoryFactory,
   solr.NIOFSDirectoryFactory, or solr.SimpleFSDirectoryFactory.

   solr.RAMDirectoryFactory is memory based, not
   persistent, and doesn't work with replication.
-->
  <directoryFactory name="DirectoryFactory"

class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>


Re: What are the options for obtaining IDF at interactive speeds?

2013-07-10 Thread Kathryn Mazaitis
I didn't try indexing each term as a separate document (and if I had I
probably would've just used tv.tf_idf instead of a functional query -- why
not?). The regular functional query which required sending a separate
request for each of thousands of terms was way dominated by the overhead
of each query, and far too slow.


On Mon, Jul 8, 2013 at 4:45 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Hi,
 I am curious about the functional query, did you try it and it didn't work?
  or was it too slow?

 idf(other_field,field(term))

 Thanks!

   roman


 On Mon, Jul 8, 2013 at 4:34 PM, Kathryn Mazaitis ka...@rivard.org wrote:

  Hi All,
 
  Resolution: I ended up cheating. :P Though now that I look at it, I think
  this was Roman's second suggestion. Thanks!
 
  Since the application that will be processing the IDF figures is located
 on
  the same machine as SOLR, I opened a second IndexReader on the lucene
 index
  and used
 
  reader.numDocs()
  reader.docFreq(field,term)
 
  to generate IDF by hand, ref:
 http://en.wikipedia.org/wiki/Tf%E2%80%93idf
 
  As it turns out, using this method to get IDF on all the terms mentioned
 in
  the set of relevant documents runs in time comparable to retrieving the
  documents in the first place (so, .1-1s). This makes it fast enough that
  it's no longer the slowest part of my algorithm by far. Problem solved!
 It
  is possible that IDFValueSource would be faster; I may swap that in at a
  later date.
 
  I will keep Mikhail's debugQuery=true in my pocket, too; that technique
  would never have occurred to me. Thank you too!
 
  Best,
  Katie
 
 
  On Wed, Jul 3, 2013 at 11:35 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
 
   Hi Kathryn,
   I wonder if you could index all your terms as separate documents and
 then
   construct a new query (2nd pass)
  
   q=term:term1 OR term:term2 OR term:term3
  
   and use func to score them
  
   *idf(other_field,field(term))*
   *
   *
   the 'term' index cannot be multi-valued, obviously.
  
   Other than that, if you could do it on server side, that weould be the
   fastest - the code is ready inside IDFValueSource:
  
  
 
 http://lucene.apache.org/core/4_3_0/queries/org/apache/lucene/queries/function/valuesource/IDFValueSource.html
  
   roman
  
  
   On Tue, Jul 2, 2013 at 5:06 PM, Kathryn Mazaitis
   kathryn.riv...@gmail.comwrote:
  
Hi,
   
I'm using SOLRJ to run a query, with the goal of obtaining:
   
(1) the retrieved documents,
(2) the TF of each term in each document,
(3) the IDF of each term in the set of retrieved documents (TF/IDF
  would
   be
fine too)
   
...all at interactive speeds, or <10s per query. This is a demo, so
 if
   all
else fails I can adjust the corpus, but I'd rather, y'know, actually
 do
   it.
   
(1) and (2) are working; I completed the patch posted in the
 following
issue:
https://issues.apache.org/jira/browse/SOLR-949
and am just setting tv=true&tv.tf=true for my query. This way I get
  the
documents and the tf information all in one go.
   
With (3) I'm running into trouble. I have found 2 ways to do it so
 far:
   
Option A: set tv.df=true or tv.tf_idf for my query, and get the idf
information along with the documents and tf information. Since each
  term
may appear in multiple documents, this means retrieving idf
 information
   for
each term about 20 times, and takes over a minute to do.
   
Option B: After I've gathered the tf information, run through the
 list
  of
terms used across the set of retrieved documents, and for each term,
  run
   a
query like:
{!func}idf(text,'the_term')&deftype=func&fl=score&rows=1
...while this retrieves idf information only once for each term, the
   added
latency for doing that many queries piles up to almost two minutes on
  my
current corpus.
   
Is there anything I didn't think of -- a way to construct a query to
  get
idf information for a set of terms all in one go, outside the bounds
 of
what terms happen to be in a document?
   
Failing that, does anyone have a sense for how far I'd have to scale
   down a
corpus to approach interactive speeds, if I want this sort of data?
   
Katie
   
  
 

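A sketch of the second-IndexReader approach Kathryn describes, using the
Lucene 4.x API (where docFreq takes a Term rather than separate field/term
arguments); the index path, field, and term here are illustrative:

    import java.io.File;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class HandRolledIdf {
        public static void main(String[] args) throws Exception {
            DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(new File("/path/to/solr/data/index")));
            int numDocs = reader.numDocs();
            int df = reader.docFreq(new Term("text", "the_term"));
            // Classic smoothed IDF, per the tf-idf article linked above.
            double idf = Math.log((double) numDocs / (1 + df));
            System.out.println("idf(text,'the_term') = " + idf);
            reader.close();
        }
    }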


Re: When not to use NRTCachingDirectory and what to use instead.

2013-07-10 Thread Shawn Heisey

On 7/10/2013 9:59 AM, Tom Burton-West wrote:

The Javadoc for NRTCachingDirectory (
http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/store/NRTCachingDirectory.html?is-external=true)
  says:

  This class is likely only useful in a near real-time context, where
indexing rate is lowish but reopen rate is highish, resulting in many tiny
files being written...

It seems like we have exactly the opposite use case, so we would like
advice on what directory implementation to use instead.

We are doing offline batch indexing, so no searches are being done.  So we
don't need NRT.  We also have a high indexing rate as we are trying to
index 3 billion pages as quickly as possible.

I am not clear what determines the reopen rate.   Is it only related to
searching or is it involved in indexing as well?

  Does the NRTCachingDirectory have any benefit for indexing under the use
case noted above?

I'm guessing we should just use solr.StandardDirectoryFactory instead.
  Is this correct?


The NRT directory object in Solr uses the MMap implementation as its 
default delegate.  I would use MMapDirectoryFactory (the default for 
most of the 3.x releases) for testing whether you can get any 
improvement from moving away from the default.  The advantages of memory 
mapping are not something you'd want to give up.


Thanks,
Shawn
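
Since the example solrconfig.xml above reads the factory class from the
solr.directoryFactory system property, a quick way to compare (a sketch,
assuming the stock Jetty example start) is to force MMapDirectoryFactory at
startup without editing the config:

    java -Dsolr.directoryFactory=solr.MMapDirectoryFactory -jar start.jar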



more than 1 join on the same query

2013-07-10 Thread Marcelo Elias Del Valle
Hello,

I am playing with joins here just to test what I can do with them. I
have been learning a lot, but I am still having some troubles with more
complex queries.
For example, suppose I have the following documents:


   - id = 1 - name = Humblebee - age = 1000
   - id = 2 - type = arm - attr1 = left - size = 45 - unit = cm - root_id = 1
   - id = 3 - type = arm - attr1 = right - size = 46 - unit = cm - root_id = 1
   - id = 4 - type = leg - attr1 = left - size = 50 - unit = cm - root_id = 1
   - id = 5 - type = leg - attr1 = right - size = 52 - unit = cm - root_id = 1

In my case, that would mean there is a body called humblebee with id 1
and 4 child, each one a member of the body.
What I am trying to do: select all bodies (root entities) that have a
left arm and a right leg.
 To select the body based on the left arm, I would do:

   - q = *:*
   - fq = {!join from=root_id to=id}type:arm&attr1=left

 To select the body based on the right leg:

   - q = *:*
   - fq = {!join from=root_id to=id}type:leg&attr1=right

But what if I need both left arm AND right leg? Should I do 2 joins?

Best regards,
-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


AW: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Jed Glazner
Hi Shawn, this code is for the solrj lib, which we already use.

I'm talking about Solr's internal communication from leader to replica via the 
DistributedCmdUpdate class.  I want to force the leader to time out after a 
fixed period instead of waiting for 15 minutes for the server to figure out the 
other end of the socket was closed. I don't know of any flags or settings in 
the solrconfig.xml to do this, or if it's even possible without modifying 
source code.

Jed

Sent from Samsung Mobile



 Original message 
From: Shawn Heisey s...@elyograg.org
Date: 07.10.2013 17:35 (GMT+01:00)
To: solr-user@lucene.apache.org
Subject: Re: Solr Hangs During Updates for over 10 minutes


On 7/10/2013 6:57 AM, Jed Glazner wrote:
 So we'll do what we can quickly to see if we can 'band-aid' the problem
 until we can upgrade to solr 4.4  Speaking of band-aids - does anyone know
 of a way to change the socket timeout/connection timeout for distributed
 updates?

If you need to change HttpClient parameters for CloudSolrServer, here's
how you can do it:

String zkHost =
zk1.REDACTED.com:2181,zk2.REDACTED.com:2181,zk3.REDACTED.com:2181/chroot;
ModifiableSolrParams params = new ModifiableSolrParams();
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 1000);
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 200);
params.set(HttpClientUtil.PROP_SO_TIMEOUT, 30);
params.set(HttpClientUtil.PROP_CONNECTION_TIMEOUT, 5000);
HttpClient client = HttpClientUtil.createClient(params);
ResponseParser parser = new BinaryResponseParser();
LBHttpSolrServer lbServer = new LBHttpSolrServer(client, parser);
CloudSolrServer server = new CloudSolrServer(zkHost, lbServer);

Thanks,
Shawn



How to create optional 'fq' plugin?

2013-07-10 Thread Learner
I am trying to suppress the error messages received when a value is not
passed to a query, 

Ex:

/select?first_name=peter&fq=$first_name&q=*:*

I don't want the above query to throw error or die whenever the variable
first_name is not passed to the query hence I came up with a plugin to
return null whenever the variable is not passed in the query. The below code
works fine for 'q' but doesn't work for 'fq' parameter.

Something like...

import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.*;

public class OptionalQParserPlugin extends QParserPlugin {
  public QParser createParser(String qstr, SolrParams localParams,
      SolrParams params, SolrQueryRequest req) {
    // Missing/blank input: return a parser whose parse() yields null, so
    // the filter becomes a no-op instead of a syntax error.
    if (qstr == null || qstr.trim().length() < 1) {
      return new QParser(qstr, localParams, params, req) {
        @Override
        public Query parse() throws SyntaxError {
          return null;
        }
      };
    }
    // Non-empty input: delegate to the standard Lucene parser (this branch
    // was missing from the original snippet, which would not compile).
    return new LuceneQParserPlugin().createParser(qstr, localParams, params, req);
  }
}

 Can someone let me know how to make fq variables optional?





How to make a variable in 'fq' optional?

2013-07-10 Thread Learner
I am trying to make a variable in fq optional, 

Ex: 

/select?first_name=peter&fq=$first_name&q=*:* 

I don't want the above query to throw an error or die whenever the variable
first_name is not passed to the query; instead it should return the results
corresponding to the rest of the query. I can use switch, but it's difficult to
handle each and every case using switch (as I need to handle switch for so many
variables)... Is there a way to resolve this some other way?




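
One possibility (a sketch, not from the thread): Solr 4.3+ ships a
SwitchQParserPlugin, registered as "switch", whose bare case matches a
missing or blank value. The parameter name user_fq below is illustrative:

    fq={!switch case='*:*' default=$user_fq v=$user_fq}

When the request supplies user_fq=first_name:peter, the default branch parses
it as the filter; when user_fq is absent or blank, the bare case matches and
the filter degenerates to *:*, so the query no longer errors out.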




Re: more than 1 join on the same query

2013-07-10 Thread Dominique Debailleux
try  fq = {!join from=root_id to=id}type:leg&attr1=right OR {!join
from=root_id to=id}type:arm&attr1=left

Dom


2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com

 Hello,

 I am playing with joins here just to test what I can do with them. I
 have been learning a lot, but I am still having some troubles with more
 complex queries.
 For example, suppose I have the following documents:


- id = 1 - name = Humblebee - age = 1000
- id = 2 - type = arm - attr1 = left - size = 45 - unit = cm - root_id =
1
- id = 3 - type = arm - attr1 = right - size = 46 - unit = cm - root_id
= 1
- id = 4 - type = leg - attr1 = left - size = 50 - unit = cm - root_id =
1
- id = 5 - type = leg - attr1 = right - size = 52 - unit = cm - root_id
= 1

 In my case, that would mean there is a body called humblebee with id 1
 and 4 child, each one a member of the body.
 What I am trying to do: select all bodies (root entities) that have a
 left arm and a right leg.
  To select the body based on the left arm, I would do:

- q = *:*
    - fq = {!join from=root_id to=id}type:arm&attr1=left

  To select the body based on the right leg:

- q = *:*
    - fq = {!join from=root_id to=id}type:leg&attr1=right

 But what if I need both left arm AND right leg? Should I do 2 joins?

 Best regards,
 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr




-- 
Dominique Debailleux
WoAnA - small.but.robust
http://www.linkedin.com/in/dominiquedebailleux


Re: more than 1 join on the same query

2013-07-10 Thread Marcelo Elias Del Valle
This
fq = {!join from=root_id to=id}type:leg&attr1=right OR {!join from=root_id
to=id}type:arm&attr1=left
works even if I have attr1=left1 in the second condition. My goal is to
select bodies that matches both conditions.

It's strange, but if I try
fq = {!join from=root_id to=id}type:leg&attr1=right AND {!join from=root_id
to=id}type:arm&attr1=left

it returns zero results, but the body exists. I am guessing it's trying to
query for childs which have type equals to both leg AND arm and attr1
equals to both right AND left...

Not sure...



2013/7/10 Dominique Debailleux dominique.debaill...@woana.net

 try  fq = {!join from=root_id to=id}type:leg&attr1=right OR {!join
 from=root_id to=id}type:arm&attr1=left

 Dom


 2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com

  Hello,
 
  I am playing with joins here just to test what I can do with them. I
  have been learning a lot, but I am still having some troubles with more
  complex queries.
  For example, suppose I have the following documents:
 
 
 - id = 1 - name = Humblebee - age = 1000
 - id = 2 - type = arm - attr1 = left - size = 45 - unit = cm -
 root_id =
 1
 - id = 3 - type = arm - attr1 = right - size = 46 - unit = cm -
 root_id
 = 1
 - id = 4 - type = leg - attr1 = left - size = 50 - unit = cm -
 root_id =
 1
 - id = 5 - type = leg - attr1 = right - size = 52 - unit = cm -
 root_id
 = 1
 
  In my case, that would mean there is a body called humblebee with id
 1
  and 4 child, each one a member of the body.
  What I am trying to do: select all bodies (root entities) that have a
  left arm and a right leg.
   To select the body based on the left arm, I would do:
 
 - q = *:*
  - fq = {!join from=root_id to=id}type:arm&attr1=left
 
   To select the body based on the right leg:
 
 - q = *:*
  - fq = {!join from=root_id to=id}type:leg&attr1=right
 
  But what if I need both left arm AND right leg? Should I do 2 joins?
 
  Best regards,
  --
  Marcelo Elias Del Valle
  http://mvalle.com - @mvallebr
 



 --
 Dominique Debailleux
 WoAnA - small.but.robust
 http://www.linkedin.com/in/dominiquedebailleux




-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: more than 1 join on the same query

2013-07-10 Thread Dominique Debailleux
Sorry, I didn't check precisely... I guess in your sample attr1 applies to
the body, not the legs; that could explain your problem


2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com

 This
 fq = {!join from=root_id to=id}type:leg&attr1=right OR {!join from=root_id
 to=id}type:arm&attr1=left
 works even if I have attr1=left1 in the second condition. My goal is to
 select bodies that matches both conditions.

 It's strange, but if I try
 fq = {!join from=root_id to=id}type:leg&attr1=right AND {!join from=root_id
 to=id}type:arm&attr1=left

 it returns zero results, but the body exists. I am guessing it's trying to
 query for childs which have type equals to both leg AND arm and attr1
 equals to both right AND left...

 Not sure...



 2013/7/10 Dominique Debailleux dominique.debaill...@woana.net

  try  fq = {!join from=root_id to=id}type:leg&attr1=right OR {!join
  from=root_id to=id}type:arm&attr1=left
 
  Dom
 
 
  2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com
 
   Hello,
  
   I am playing with joins here just to test what I can do with them.
 I
   have been learning a lot, but I am still having some troubles with more
   complex queries.
   For example, suppose I have the following documents:
  
  
  - id = 1 - name = Humblebee - age = 1000
  - id = 2 - type = arm - attr1 = left - size = 45 - unit = cm -
  root_id =
  1
  - id = 3 - type = arm - attr1 = right - size = 46 - unit = cm -
  root_id
  = 1
  - id = 4 - type = leg - attr1 = left - size = 50 - unit = cm -
  root_id =
  1
  - id = 5 - type = leg - attr1 = right - size = 52 - unit = cm -
  root_id
  = 1
  
   In my case, that would mean there is a body called humblebee with
 id
  1
   and 4 child, each one a member of the body.
   What I am trying to do: select all bodies (root entities) that
 have a
   left arm and a right leg.
To select the body based on the left arm, I would do:
  
  - q = *:*
   - fq = {!join from=root_id to=id}type:arm&attr1=left
  
To select the body based on the right leg:
  
  - q = *:*
   - fq = {!join from=root_id to=id}type:leg&attr1=right
  
   But what if I need both left arm AND right leg? Should I do 2
 joins?
  
   Best regards,
   --
   Marcelo Elias Del Valle
   http://mvalle.com - @mvallebr
  
 
 
 
  --
  Dominique Debailleux
  WoAnA - small.but.robust
   http://www.linkedin.com/in/dominiquedebailleux
 



 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr




-- 
Dominique Debailleux
WoAnA - small.but.robust
http://www.linkedin.com/in/dominiquedebailleux


Re: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Shawn Heisey

On 7/10/2013 6:57 AM, Jed Glazner wrote:

So, while it's 'just as risky' as you say, it's 'less risky' than a new
version of java and is possible to implement without downtime.


I believe that if you update one node at a time, there should be no 
downtime.  I've not actually tried this, so it would be a very good idea 
for you to try on a testbed.



It is actually something of a pain point that the upgrade path to
solrcloud seems to frequently require downtime. (clusterstate.json changes
in 4.1, and then again this big change in 4.4 with no solr.xml).


Looking through CHANGES.txt, I cannot see any issues mentioning a format 
change in clusterstate.json except for SOLR-3815, which was fixed in 
4.0, not 4.1.  I do see some commits on that issue after 4.0 was 
released, but they would have gone into 4.2.1, not 4.1, and the 
description for one of those later commits says that it adds information 
to clusterstate.json, it doesn't say anything about changing the format. 
 What documentation or issues are you seeing regarding a format change 
in 4.1?


As far as I know, elimination of solr.xml has not happened yet, and will 
not happen in the 4.x timeframe.  There is a new solr.xml format for 
core discovery that will be used in the 4.4 example, but it is 
completely optional - you will be able to continue to use the existing 
format in all 4.x releases.  Things are likely to be different in 5.0, 
but nobody is working on actual release plans for 5.0 yet.


Thanks,
Shawn



Re: more than 1 join on the same query

2013-07-10 Thread Marcelo Elias Del Valle
Dominique,
I tried also:
fq = {!join from=root_id to=id}type:leg AND {!join from=root_id
to=id}type:arm
If I understood what you said correctly, that should return something too,
right? It also got me 0 results...


2013/7/10 Dominique Debailleux dominique.debaill...@woana.net

 Sorry, I didn't check preciselyI guess in your sample attr1 applies to
 the body, not the legs, that could explain your problem


 2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com

  This
  fq = {!join from=root_id to=id}type:leg&attr1=right OR {!join from=root_id
  to=id}type:arm&attr1=left
  works even if I have attr1=left1 in the second condition. My goal is to
  select bodies that matches both conditions.
 
  It's strange, but if I try
  fq = {!join from=root_id to=id}type:legattr1=right AND {!join
 from=root_id
  to=id}type:armattr1=left
 
  it returns zero results, but the body exists. I am guessing it's trying
 to
  query for childs which have type equals to both leg AND arm and attr1
  equals to both right AND left...
 
  Not sure...
 
 
 
  2013/7/10 Dominique Debailleux dominique.debaill...@woana.net
 
   try  fq = {!join from=root_id to=id}type:leg&attr1=right OR {!join
   from=root_id to=id}type:arm&attr1=left
  
   Dom
  
  
   2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com
  
Hello,
   
I am playing with joins here just to test what I can do with
 them.
  I
have been learning a lot, but I am still having some troubles with
 more
complex queries.
For example, suppose I have the following documents:
   
   
   - id = 1 - name = Humblebee - age = 1000
   - id = 2 - type = arm - attr1 = left - size = 45 - unit = cm -
   root_id =
   1
   - id = 3 - type = arm - attr1 = right - size = 46 - unit = cm -
   root_id
   = 1
   - id = 4 - type = leg - attr1 = left - size = 50 - unit = cm -
   root_id =
   1
   - id = 5 - type = leg - attr1 = right - size = 52 - unit = cm -
   root_id
   = 1
   
In my case, that would mean there is a body called humblebee with
  id
   1
 and 4 children, each one a member of the body.
What I am trying to do: select all bodies (root entities) that
  have a
left arm and a right leg.
 To select the body based on the left arm, I would do:
   
   - q = *:*
   - fq = {!join from=root_id to=id}type:arm&attr1=left
   
 To select the body based on the right leg:
   
   - q = *:*
   - fq = {!join from=root_id to=id}type:leg&attr1=right
   
But what if I need both left arm AND right leg? Should I do 2
  joins?
   
Best regards,
--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
   
  
  
  
   --
   Dominique Debailleux
   WoAnA - small.but.robust
   http://www.linkedin.com/in/dominiquedebailleux
  
 
 
 
  --
  Marcelo Elias Del Valle
  http://mvalle.com - @mvallebr
 



 --
 Dominique Debailleux
 WoAnA - small.but.robust
 http://www.linkedin.com/in/dominiquedebailleux




-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: more than 1 join on the same query

2013-07-10 Thread Marcelo Elias Del Valle
Got puzzled now! If instead of AND I use &, it works:
fq = {!join from=root_id to=id}type:leg & {!join from=root_id to=id}type:arm
I am definitely missing something, I don't know what... Shouldn't both be
the same?


[]s



2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com

 Dominique,
 I tried also:
 fq = {!join from=root_id to=id}type:leg AND {!join from=root_id
 to=id}type:arm
 If I understood what you said correctly, that should return something too,
 right? It also got me 0 results...


 2013/7/10 Dominique Debailleux dominique.debaill...@woana.net

 Sorry, I didn't check precisely... I guess in your sample attr1 applies
 to
 the body, not the legs; that could explain your problem


 2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com

  This
  fq = {!join from=root_id to=id}type:leg&attr1=right OR
  {!join from=root_id to=id}type:arm&attr1=left
  works even if I have attr1=left1 in the second condition. My goal is to
  select bodies that match both conditions.
 
  It's strange, but if I try
  fq = {!join from=root_id to=id}type:leg&attr1=right AND
  {!join from=root_id to=id}type:arm&attr1=left
 
  it returns zero results, but the body exists. I am guessing it's trying
  to query for children which have type equal to both leg AND arm and
  attr1 equal to both right AND left...
 
  Not sure...
 
 
 
  2013/7/10 Dominique Debailleux dominique.debaill...@woana.net
 
   try  fq = {!join from=root_id to=id}type:leg&attr1=right OR {!join
   from=root_id to=id}type:arm&attr1=left
  
   Dom
  
  
   2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com
  
Hello,
   
I am playing with joins here just to test what I can do with
 them.
  I
have been learning a lot, but I am still having some troubles with
 more
complex queries.
For example, suppose I have the following documents:
   
   
   - id = 1 - name = Humblebee - age = 1000
   - id = 2 - type = arm - attr1 = left - size = 45 - unit = cm -
   root_id =
   1
   - id = 3 - type = arm - attr1 = right - size = 46 - unit = cm -
   root_id
   = 1
   - id = 4 - type = leg - attr1 = left - size = 50 - unit = cm -
   root_id =
   1
   - id = 5 - type = leg - attr1 = right - size = 52 - unit = cm -
   root_id
   = 1
   
In my case, that would mean there is a body called humblebee
 with
  id
   1
 and 4 children, each one a member of the body.
What I am trying to do: select all bodies (root entities) that
  have a
left arm and a right leg.
 To select the body based on the left arm, I would do:
   
   - q = *:*
   - fq = {!join from=root_id to=id}type:arm&attr1=left
   
 To select the body based on the right leg:
   
   - q = *:*
   - fq = {!join from=root_id to=id}type:leg&attr1=right
   
But what if I need both left arm AND right leg? Should I do 2
  joins?
   
Best regards,
--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
   
  
  
  
   --
   Dominique Debailleux
   WoAnA - small.but.robust
   http://www.linkedin.com/in/dominiquedebailleux
  
 
 
 
  --
  Marcelo Elias Del Valle
  http://mvalle.com - @mvallebr
 



 --
 Dominique Debailleux
 WoAnA - small.but.robust
 http://www.linkedin.com/in/dominiquedebailleux




 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr




-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: Solr Hangs During Updates for over 10 minutes

2013-07-10 Thread Otis Gospodnetic
+1 for G1.  We just had a happy client this week switch to G1 after
seeing stop-the-world (STW) pauses with CMS.  I can't share their JVM metrics from SPM,
but I can share ours:
http://blog.sematext.com/2013/06/24/g1-cms-java-garbage-collector/
(HBase, not Solr, but we've seen the same effect with ElasticSearch
for example, so I'm optimistic about seeing the same effects with
Solr, too).

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Wed, Jul 10, 2013 at 4:48 AM, Daniel Collins danwcoll...@gmail.com wrote:
 We had something similar in terms of update times suddenly spiking up for
 no obvious reason.  We never got quite as bad as you in terms of the other
 knock on effects, but we certainly saw updates jumping from 10ms up to
 3ms, all our external queues backed up and we rejected some updates,
 then after a while things quietened down.

 We were running Solr 4.3.0 but with Java 6 and the CMS GC.  We swapped to
 Java 7, G1 GC (and increased heap size from 8Gb to 12Gb) and the problem
 went away.

 Now, I admit its not exactly the same as your case, we never had the
 follow-on effects, but I'd consider Java 7 and the G1 GC, it has certainly
 reduced the spikes in our indexing times.

 We run the following settings now (the usual caveats apply, it might not
 work for you).

 GC_OPTIONS="-XX:+AggressiveOpts -XX:+UseG1GC -XX:+UseStringCache
 -XX:+OptimizeStringConcat -XX:-UseSplitVerifier -XX:+UseNUMA
 -XX:MaxGCPauseMillis=50 -XX:GCPauseIntervalMillis=1000"

 I set the MaxGCPauseMillis/GCPauseIntervalMillis to try to minimise
 application pauses, that's our goal, if we have to use more memory in the
 short term then so be it, but we couldn't afford application pauses,
 because we are using NRT (soft commits every 1s, hard commits every 60s)
 and we get a lot of updates.

 I know there have been other discussion on G1 and it has received mixed
 results overall, but for us, it seems to be a winner.

 Hope that helps,


 On 10 July 2013 08:32, Jed Glazner jglaz...@adobe.com wrote:

 We are planning an upgrade to 4.4 but it's still weeks out. We offer a
 high-availability search service and there are a number of changes in 4.4
 that are not backward compatible (i.e. clusterstate.json and no solr.xml),
 so there must be lots of testing; additionally, this upgrade cannot be
 performed without downtime.

 Regardless, I need to find a band-aid right now.  Does anyone know if it's
 possible to set the timeout for distributed update requests to/from the
 leader?  Currently we see it's set to 0.  Maybe via a -D startup param, or
 something?

 Jed

 On 7/10/13 1:23 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

 Hi Jed,
 
 This is really with Solr 4.0?  If so, it may be wiser to jump on 4.4
 that is about to be released.  We did not have fun working with 4.0 in
 SolrCloud mode a few months ago.  You will save time, hair, and money
 if you convince your manager to let you use Solr 4.4. :)
 
 Otis
 --
  Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm
 
 
 
 On Tue, Jul 9, 2013 at 4:44 PM, Jed Glazner jglaz...@adobe.com wrote:
  Hi Shawn,
 
  I have been trying to duplicate this problem without success for the
 last 2 weeks which is one reason I'm getting flustered.   It seems
 reasonable to be able to duplicate it but I can't.
 
   We do have a story to upgrade but that is still weeks if not months
 before that gets rolled out to production.
 
  We have another cluster running the same version but with 8 shards and
 8 replicas, with each shard at 100GB, and more load and more indexing
 requests, without this problem; but there we send docs in batches and all
 fields are stored.  Whereas the trouble index has only 1 or 2 stored
 fields and we only send docs 1 at a time.
 
  Could that have anything to do with it?
 
  Jed
 
 
  Sent from Samsung Mobile
 
 
 
   -------- Original message --------
  From: Shawn Heisey s...@elyograg.org
  Date: 07.09.2013 18:33 (GMT+01:00)
  To: solr-user@lucene.apache.org
  Subject: Re: Solr Hangs During Updates for over 10 minutes
 
 
  On 7/9/2013 9:50 AM, Jed Glazner wrote:
  I'll give you the high level before delving deep into setup etc. I
 have been struggling at work with a seemingly random problem where solr
 will hang for 10-15 minutes during updates.  This outage always seems
 to be immediately preceded by an EOF exception on the replica.  Then
 10-15 minutes later we see an exception on the leader for a socket
 timeout to the replica.  The leader will then tell the replica to
 recover, which in most cases it does, and then the outage is over.
 
  Here are the setup details:
 
  We are currently using Solr 4.0.0 with an external ZK ensemble of 5
 machines.
 
  After 4.0.0 was released, a *lot* of problems with SolrCloud surfaced
  and have since been fixed.  You're five releases and about nine months
  behind what's current.  My recommendation: Upgrade to 4.3.1, ensure 

Re: more than 1 join on the same query

2013-07-10 Thread Yonik Seeley
Be careful with URL encoding... that may be messing you up depending
on how you are trying to submit the query (and the single & you were
using as AND)

fq={!join from=root_id to=id}type:arm AND attr1=left
fq={!join from=root_id to=id}type:leg AND attr1=right
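
For instance, a minimal SolrJ sketch that builds the same two filters and
lets the client do the URL encoding (host and core name are assumptions;
also note a Solr field query is normally written attr1:left, not attr1=left):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class JoinFilterExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        // SolrJ URL-encodes parameter values, so spaces and '&' stay literal
        q.addFilterQuery("{!join from=root_id to=id}type:arm AND attr1:left");
        q.addFilterQuery("{!join from=root_id to=id}type:leg AND attr1:right");
        QueryResponse rsp = server.query(q);
        System.out.println("matches: " + rsp.getResults().getNumFound());
    }
}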

-Yonik
http://lucidworks.com


On Wed, Jul 10, 2013 at 12:56 PM, Marcelo Elias Del Valle
mvall...@gmail.com wrote:
 Hello,

 I am playing with joins here just to test what I can do with them. I
 have been learning a lot, but I am still having some troubles with more
 complex queries.
 For example, suppose I have the following documents:


- id = 1 - name = Humblebee - age = 1000
- id = 2 - type = arm - attr1 = left - size = 45 - unit = cm - root_id =
1
- id = 3 - type = arm - attr1 = right - size = 46 - unit = cm - root_id
= 1
- id = 4 - type = leg - attr1 = left - size = 50 - unit = cm - root_id =
1
- id = 5 - type = leg - attr1 = right - size = 52 - unit = cm - root_id
= 1

 In my case, that would mean there is a body called humblebee with id 1
 and 4 children, each one a member of the body.
 What I am trying to do: select all bodies (root entities) that have a
 left arm and a right leg.
  To select the body based on the left arm, I would do:

- q = *:*
- fq = {!join from=root_id to=id}type:arm&attr1=left

  To select the body based on the right leg:

- q = *:*
- fq = {!join from=root_id to=id}type:leg&attr1=right

 But what if I need both left arm AND right leg? Should I do 2 joins?

 Best regards,
 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr


How to make 'fq' optional?

2013-07-10 Thread Learner
I am trying to make a variable in fq optional, 

Ex: 

/select?first_name=peter&fq=$first_name&q=*:*

I don't want the above query to throw an error or die whenever the variable
first_name is not passed to the query; instead it should return the results
corresponding to the rest of the query. I can use switch but it's difficult
to handle each and every case using switch (as I need to handle switch for so
many variables)... Is there a way to resolve this via some other way?
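
A possible workaround (a client-side sketch; parameter names are
illustrative): add the fq only when a value was actually supplied, so the
query never references a missing variable.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.util.ClientUtils;

public class OptionalFqExample {
    // attach the filter only when a first_name value is present
    public static SolrQuery buildQuery(String firstName) {
        SolrQuery q = new SolrQuery("*:*");
        if (firstName != null && !firstName.trim().isEmpty()) {
            q.addFilterQuery("first_name:"
                + ClientUtils.escapeQueryChars(firstName.trim()));
        }
        return q; // with no first_name, the fq is simply omitted
    }
}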





How to form / return filter (Query object)?

2013-07-10 Thread Learner
I was able to form a TermQuery as below,

Query query = new TermQuery(new Term("id", "value"));

I am trying to form a filter query, something that returns just the filter
that can be used with any query type (q or fq).

if (qstr == null || qstr.trim().length() < 1) {
    return new QParser(qstr, localParams, params, req) {
        @Override
        public Query parse() throws SyntaxError {
            return null;
        }
    };
}

As of now I am returning null; instead I am trying to return a query object
with a match-all filter (ex: *:*). Can someone let me know how to implement that?
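
A minimal sketch of one way to do that, reusing the surrounding
createParser(...) context from the snippet above and assuming Lucene's
MatchAllDocsQuery is an acceptable stand-in for *:*:

// additional import needed: org.apache.lucene.search.MatchAllDocsQuery
if (qstr == null || qstr.trim().length() < 1) {
    return new QParser(qstr, localParams, params, req) {
        @Override
        public Query parse() throws SyntaxError {
            // behave like q=*:* when no query string was given
            return new MatchAllDocsQuery();
        }
    };
}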







Re: Solr limitations

2013-07-10 Thread Lance Norskog

Also, total index file size. At 200-300GB, managing an index becomes a pain.

Lance

On 07/08/2013 07:28 AM, Jack Krupansky wrote:
Other than the per-node/per-collection limit of 2 billion documents 
per Lucene index, most of the limits of Solr are performance-based 
limits - Solr can handle it, but the performance may not be 
acceptable. Dynamic fields are a great example. Nothing prevents you 
from creating a document with, say, 50,000 dynamic fields, but you are 
likely to find the performance less than acceptable. Or facets. Sure, 
Solr will let you have 5,000 faceted fields, but the performance is 
likely to be... you get the picture.


What is acceptable performance? That's for you to decide.

What will the performance of 5,000 dynamic fields or 500 faceted 
fields or 500 million documents on a node be? It all depends on your 
data, especially the cardinality (unique values) of each individual 
field.


How can you determine the performance? Only one way: Proof of concept. 
You need to do your own proof of concept implementation, with your own 
representative data, with your own representative data model, with 
your own representative hardware, with your own representative client 
software, with your own representative user query load. That testing 
will give you all the answers you need.


There are are no magic answers. Don't believe any magic spreadsheet or 
magic wizard. Flip a coin whether they will work for your situation.


Some simple, common sense limits:

1. No more than 50 to 100 million documents per node.
2. No more than 250 fields per document.
3. No more than 250K characters per document.
4. No more than 25 faceted fields.
5. No more than 32 nodes in your SolrCloud cluster.
6. Don't return more than 250 results on a query.

None of those is a hard limit, but don't go beyond them unless your 
Proof of Concept testing proves that performance is acceptable for 
your situation.


Start with a simple 4-node, 2-shard, 2-replica cluster for preliminary 
tests and then scale as needed.


Dynamic and multivalued fields? Try to stay away from them - except 
for the simplest cases, they are usually an indicator of a weak data 
model. Sure, it's fine to store a relatively small number of values in 
a multivalued field (say, dozens of values), but be aware that you 
can't directly access individual values, you can't tell which was 
matched on a query, and you can't coordinate values between multiple 
multivalued fields. Except for very simple cases, multivalued fields 
should be flattened into multiple documents with a parent ID.
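
As a hedged illustration of that flattening (field names, ids and the
server URL are made up for the example):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class FlattenExample {
    public static void main(String[] args) throws Exception {
        SolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1");
        // instead of one multivalued field on the parent, each value becomes
        // its own small document that points back via parent_id
        SolrInputDocument child = new SolrInputDocument();
        child.addField("id", "1-webpage-2");  // unique per child document
        child.addField("parent_id", "1");     // id of the parent document
        child.addField("url", "http://stackoverflow.com");
        child.addField("category", "Technical");
        server.add(child);
        server.commit();
    }
}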


Since you brought up the topic of dynamic fields, I am curious how you 
got the impression that they were a good technique to use as a 
starting point. They're fine for prototyping and hacking, and fine 
when used in moderation, but not when used to excess. The whole point 
of Solr is searching and searching is optimized within fields, not 
across fields, so having lots of dynamic fields is counter to the 
primary strengths of Lucene and Solr. And... schemas with lots  of 
dynamic fields tend to be difficult to maintain. For example, if you 
wanted to ask a support question here, one of the first things we want 
to know is what your schema looks like, but with lots of dynamic 
fields it is not possible to have a simple discussion of what your 
schema looks like.


Sure, there is something called schemaless design (and Solr supports 
that in 4.4), but that's very different from heavy reliance on dynamic 
fields in the traditional sense. Schemaless design is A-OK, but using 
dynamic fields for arrays of data in a single document is a poor 
match for the search features of Solr (e.g., Edismax searching across 
multiple fields.)


One other tidbit: Although Solr does not enforce naming conventions 
for field names, and you can put special characters in them, there are 
plenty of features in Solr, such as the common fl parameter, where 
field names are expected to adhere to Java naming rules. When people 
start going wild with dynamic fields, it is common that they start 
going wild with their names as well, using spaces, colons, slashes, 
etc. that cannot be parsed in the fl and qf parameters, for 
example. Please don't go there!


In short, put up a small cluster and start doing a Proof of Concept 
cluster. Stay within my suggested guidelines and you should do okay.


-- Jack Krupansky

-Original Message- From: Marcelo Elias Del Valle
Sent: Monday, July 08, 2013 9:46 AM
To: solr-user@lucene.apache.org
Subject: Solr limitations

Hello everyone,

   I am trying to search information about possible solr limitations I
should consider in my architecture. Things like max number of dynamic
fields, max number o documents in SolrCloud, etc.
   Does anyone know where I can find this info?

Best regards,




Re: Commit different database rows to solr with same id value?

2013-07-10 Thread Jason Huang
Thanks David.

I am actually trying to commit the database rows on the fly, not via DIH. :)

Anyway, if I understand you correctly, basically you are suggesting to
modify the value of the primary key and pass the new value to id before
committing to solr. This could probably be one solution.

What if I want to commit the data from table2 to a new core? Anyone knows
how I can do that?
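
A hedged sketch of both options (URLs, core names and field values are made
up; the cores themselves must already exist):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class TwoCoresExample {
    public static void main(String[] args) throws Exception {
        // option 1: one SolrServer per core; each core is a separate index,
        // so the same id value cannot collide across them
        SolrServer table2Core =
            new HttpSolrServer("http://localhost:8983/solr/table2");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "3");
        table2Core.add(doc);
        table2Core.commit();

        // option 2: keep one core but namespace the ids per table
        SolrServer single =
            new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrInputDocument doc2 = new SolrInputDocument();
        doc2.addField("id", "table2-3"); // "table1-3" no longer clashes
        single.add(doc2);
        single.commit();
    }
}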

thanks,

Jason

On Wed, Jul 10, 2013 at 11:18 AM, David Quarterman da...@corexe.com wrote:

 Hi Jason,

 Assuming you're using DIH, why not build a new, unique id within the query
 to use as  the 'doc_id' for SOLR? We do something like this in one of our
 collections. In MySQL, try this (don't know what it would be for any other
 db but there must be equivalents):

 select @rownum:=@rownum+1 rowid, t.* from (main select query) t, (select
 @rownum:=0) s

 Regards,

 DQ

 -Original Message-
 From: Jason Huang [mailto:jason.hu...@icare.com]
 Sent: 10 July 2013 15:50
 To: solr-user@lucene.apache.org
 Subject: Commit different database rows to solr with same id value?

 Hello,

 I am trying to use Solr to store fields from two different database
 tables, where the primary keys are in the format of "1", "2", "3", ...

 In Java, we build different POJO classes for these two database tables:

 table1.java

 @SolrIndex(name="id")

 private String idTable1

 


 table2.java

 @SolrIndex(name="id")

 private String idTable2



 And later we add these fields defined in the two different types of tables
 and commit it to solrServer.


 Here is the scenario where I am having issues:

 (1) commit a row from table1 with primary key = "3", this generates a
 document in Solr

 (2) commit another row from table2 with the same value of primary key =
 "3", this overwrites the document generated in step (1).


 What we really want to achieve is to keep both rows in (1) and (2) because
 they are from different tables. I've read something from a google search and
 it appears that we might be able to do it by keeping multiple cores in
 solr. Could anyone point out how to implement multiple cores to achieve this?
 To be more specific, when I commit the row as a document, I don't have a
 place to pick a certain core and I am not sure if it makes any sense for me
 to specify a core when I commit the document since the layer I am working
 on should abstract it away from me.



 The second question is - if we don't want to do multicore (since we
 can't easily search for related data between multiple cores), how can we
 resolve this issue so both rows from different database tables which share
 the same primary key still exist? We don't want to have to always change
 the primary key format to ensure the uniqueness of the primary key among all
 the different types of database tables.


 thanks!


 Jason



Re: Solr limitations

2013-07-10 Thread Otis Gospodnetic
For what it's worth, in SPM we keep track of nodes/server stats, of
course, and that metric has been going up for those using SPM to
monitor Solr clusters, which is a nice sign.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Solr Performance Monitoring -- http://sematext.com/spm



On Wed, Jul 10, 2013 at 9:29 AM, Jack Krupansky j...@basetechnology.com wrote:
 Again, no hard limits, mostly performance-based limits and environmental
 factors of your own environment, as well as the fact that most people on
 this list will have deeper experience with smaller clusters, so if you
 decide to go big, you will be in uncharted and untested territory.

 I would relax my number a little (actually, double it) to 64 nodes, to
 handle the 8-shard, 8-replica case, since just yesterday somebody on the
 list mentioned that they were using such a configuration.

 In other words, with configurations up to 16 or 32 or even 64 nodes, you
 will readily find people here who might be able to help support you, but if
 you are thinking of a 16-shard, 16-replica cluster with 256 nodes or
 32-shard, 32-replica cluster with 1,024 nodes, it's not that that will hit
 any hard limit in Solr, but simply that not as many people will be able to
 provide support, answer questions, or simply confirm that yes, a cluster
 that big is a... slam-dunk. And if you do want to try a 1,024-node
 cluster, you absolutely should do a Proof of Concept implementation first.

 I actually don't have any hard, empirical evidence to back up my 32/64-node
 guidance, but it seems reasonable and consistent with configurations people
 commonly talk about. Generally, people talk about smaller clusters, so I'm
 stretching a little to get up to my 32/64 guidance. And, to be clear, that's
 just a rough guide and not intended to guarantee that a 64-node cluster will
 perform really well, nor to imply that a 96-node or 128-node cluster won't
 perform well.

 -- Jack Krupansky

 -Original Message- From: Ramkumar R. Aiyengar
 Sent: Wednesday, July 10, 2013 4:03 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr limitations


 I understand, thanks. I just wanted to check in case there were scalability
 limitations with how SolrCloud operates..
 On 9 Jul 2013 12:45, Erick Erickson erickerick...@gmail.com wrote:

 I think Jack was mostly thinking in slam dunk terms. I know of
 SolrCloud demo clusters with 500+ nodes, and at that point
 people said it's going to work for our situation, we don't need
 to push more.

 As you start getting into that kind of scale, though, you really
 have a bunch of ops considerations etc. Mostly when I get into
 larger scales I pretty much want to examine my assumptions
 and see if they're correct, perhaps start to trim my requirements
 etc.

 FWIW,
 Erick

 On Tue, Jul 9, 2013 at 4:07 AM, Ramkumar R. Aiyengar
 andyetitmo...@gmail.com wrote:
  5. No more than 32 nodes in your SolrCloud cluster.
 
  I hope this isn't too OT, but what tradeoffs is this based on? Would 
  have
  thought it easy to hit this number for a big index and high load (hence
  with the view of both the number of shards and replicas horizontally
  scaling..)
 
  6. Don't return more than 250 results on a query.
 
  None of those is a hard limit, but don't go beyond them unless your
 Proof
  of Concept testing proves that performance is acceptable for your
 situation.
 
  Start with a simple 4-node, 2-shard, 2-replica cluster for preliminary
  tests and then scale as needed.
 
  Dynamic and multivalued fields? Try to stay away from them - except 
  for
  the simplest cases, they are usually an indicator of a weak data model.
  Sure, it's fine to store a relatively small number of values in a
  multivalued field (say, dozens of values), but be aware that you can't
  directly access individual values, you can't tell which was matched on a
  query, and you can't coordinate values between multiple multivalued
 fields.
  Except for very simple cases, multivalued fields should be flattened 
  into
  multiple documents with a parent ID.
 
  Since you brought up the topic of dynamic fields, I am curious how you
  got the impression that they were a good technique to use as a starting
  point. They're fine for prototyping and hacking, and fine when used in
  moderation, but not when used to excess. The whole point of Solr is
  searching and searching is optimized within fields, not across fields, 
  so
  having lots of dynamic fields is counter to the primary strengths of
 Lucene
  and Solr. And... schemas with lots  of dynamic fields tend to be
 difficult
  to maintain. For example, if you wanted to ask a support question here,
 one
  of the first things we want to know is what your schema looks like, but
  with lots of dynamic fields it is not possible to have a simple
 discussion
  of what your schema looks like.
 
  Sure, there is something called schemaless design (and Solr supports
  that in 4.4), but that's very different from heavy reliance on 

amount of values in a multi value field - is denormalization always the best option?

2013-07-10 Thread Marcelo Elias Del Valle
Hello,

I have asked a question recently about solr limitations and some about
joins. It comes that this question is about both at the same time.
I am trying to figure how to denormalize my data so I will need just 1
document in my index instead of performing a join. I figure one way of
doing this is storing an entity as a multivalued field, instead of storing
different fields.
Let me give an example. Consider the entities:

User:
id: 1
name: Joan of Arc
age: 27

Webpage:
id: 1
url: http://wiki.apache.org/solr/Join
category: Technical
user_id: 1

id: 2
url: http://stackoverflow.com
category: Technical
user_id: 1

Instead of creating 1 document for user, 1 for webpage 1 and 1 for
webpage 2 (1 parent and 2 children) I could store webpages in a user
multivalued field, as follows:

User:
id: 1
name: Joan of Arc
age: 27
webpage1: [id:1, url: "http://wiki.apache.org/solr/Join", category:
"Technical"]
webpage2: [id:2, url: "http://stackoverflow.com", category:
"Technical"]

It would probably perform better than the join, right? However, it made
me think about solr limitations again. What if I have 200 million webpages
(200 million fields) per user? Or imagine a case where I could have 200
million values on a field, like in the case I need to index every html DOM
element (div, a, etc.) for each web page the user visited.
I mean, if I need to do the query and this is a business requirement no
matter what, although denormalizing could be better than using query time
joins, I wonder if distributing the data present in this single document
across the cluster wouldn't give me better performance. And this is
something I won't get with block joins or multivalued fields...
I guess there is probably no right answer for this question (at least
not a known one), and I know I should create a POC to check how each
performs... But do you think such a large number of values in a single
document could make denormalization not possible in an extreme case like
this? Would you agree with me if I said denormalization is not always
the right option?

Best regards,
-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: amount of values in a multi value field - is denormalization always the best option?

2013-07-10 Thread Jack Krupansky
Simple answer: avoid large number of values in a single document. There 
should only be a modest to moderate number of fields in a single document.


Is the data relatively static, or subject to frequent updates? To update any 
field of a single document, even with atomic update, requires Solr to read 
and rewrite every field of the document. So, lots of smaller documents are 
best for a frequent update scenario.
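
For instance, a hedged SolrJ sketch of an atomic update (4.x; the URL,
field names and values are illustrative):

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {
    public static void main(String[] args) throws Exception {
        SolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        // only this one field is sent over the wire...
        Map<String, Object> op = new HashMap<String, Object>();
        op.put("set", 28);
        doc.addField("age", op);
        // ...but Solr still reads and rewrites the whole stored document
        server.add(doc);
        server.commit();
    }
}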


Multivalued fields are great for storing a relatively small list of values. 
You can add to the list easily, but under the hood, Solr must read and 
rewrite the full list as well as the full document. And, there is no way to 
address or synchronize individual elements of multivalued fields.


Joins are great... if used in moderation. Heavy use of joins is not a great 
idea.


-- Jack Krupansky

-Original Message- 
From: Marcelo Elias Del Valle

Sent: Wednesday, July 10, 2013 5:37 PM
To: solr-user@lucene.apache.org
Subject: amount of values in a multi value field - is denormalization always 
the best option?


Hello,

   I have asked a question recently about solr limitations and some about
joins. It comes that this question is about both at the same time.
   I am trying to figure how to denormalize my data so I will need just 1
document in my index instead of performing a join. I figure one way of
doing this is storing an entity as a multivalued field, instead of storing
different fields.
   Let me give an example. Consider the entities:

User:
   id: 1
   name: Joan of Arc
   age: 27

Webpage:
   id: 1
   url: http://wiki.apache.org/solr/Join
   category: Technical
   user_id: 1

   id: 2
   url: http://stackoverflow.com
   category: Technical
   user_id: 1

   Instead of creating 1 document for user, 1 for webpage 1 and 1 for
webpage 2 (1 parent and 2 children) I could store webpages in a user
multivalued field, as follows:

User:
   id: 1
   name: Joan of Arc
   age: 27
   webpage1: [id:1, url: "http://wiki.apache.org/solr/Join", category:
"Technical"]
   webpage2: [id:2, url: "http://stackoverflow.com", category:
"Technical"]

   It would probably perform better than the join, right? However, it made
me think about solr limitations again. What if I have 200 million webpages
(200 million fields) per user? Or imagine a case where I could have 200
million values on a field, like in the case I need to index every html DOM
element (div, a, etc.) for each web page user visited.
   I mean, if I need to do the query and this is a business requirement no
matter what, although denormalizing could be better than using query time
joins, I wonder if distributing the data present in this single document
across the cluster wouldn't give me better performance. And this is
something I won't get with block joins or multivalued fields...
   I guess there is probably no right answer for this question (at least
not a known one), and I know I should create a POC to check how each
performs... But do you think such a large number of values in a single
document could make denormalization not possible in an extreme case like
this? Would you agree with me if I said denormalization is not always
the right option?

Best regards,
--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr 



solr postfilter question

2013-07-10 Thread Rohit Harchandani
Hey,
I am trying to create a plugin which makes use of a post filter. I know that
the collect function is called for every document matched, but is there a
way I can access all the matched documents up to this point before collect
is called on each of them?
Thanks,
Rohit


Re: solr postfilter question

2013-07-10 Thread Yonik Seeley
On Wed, Jul 10, 2013 at 6:08 PM, Rohit Harchandani rhar...@gmail.com wrote:
 Hey,
 I am trying to create a plugin which makes use of postfilter. I know that
 the collect function is called for every document matched, but is there a
 way I can access all the matched documents up to this point before collect
 is called on each of them?

You would need to collect/cache that information yourself in the post filter.
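
For example, a hedged sketch of such a caching collector (Solr 4.x
PostFilter API): it records the global id of every matched doc during
collect(), so by finish() the full match set is known; pushing the cached
docs on to the delegate afterwards would need extra per-segment bookkeeping.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.solr.search.DelegatingCollector;

public class CachingCollector extends DelegatingCollector {
    private final List<Integer> matched = new ArrayList<Integer>();
    private int base;

    @Override
    public void setNextReader(AtomicReaderContext context) throws IOException {
        this.base = context.docBase;
        super.setNextReader(context);
    }

    @Override
    public void collect(int doc) throws IOException {
        matched.add(base + doc); // cache; nothing is delegated yet
    }

    @Override
    public void finish() throws IOException {
        // 'matched' now holds every doc id that reached this post filter
        super.finish();
    }
}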

-Yonik
http://lucidworks.com


Re: replication getting stuck on a file

2013-07-10 Thread adityab
I have seen this in 4.2.1 too. 
Once replication is finished, on the Admin UI we see 100%, and the time and
dlspeed information goes out of whack. The same is reflected in mbeans. But
what's actually happening in the background is auto-warmup of caches (in my
case). Maybe some minor stats bug.






Re: My latest solr blog post on Solr's PostFiltering

2013-07-10 Thread Rohit Harchandani
Hi Amit,

Great article. I tried it and it works well. I am new to developing in solr
and had a question: do you know if there is a way to access all the matched
ids before collect is called?

Thanks,
Rohit


On Sat, Nov 10, 2012 at 1:12 PM, Erick Erickson erickerick...@gmail.comwrote:

 That'll teach _me_ to look closely at the URL...

 Best
 Erick


 On Fri, Nov 9, 2012 at 12:03 PM, Amit Nithian anith...@gmail.com wrote:

  Oh weird. I'll post URLs on their own lines next time to clarify.
 
  Thanks guys and looking forward to any feedback!
 
  Cheers
  Amit
 
 
  On Fri, Nov 9, 2012 at 2:05 AM, Dmitry Kan dmitry@gmail.com wrote:
 
   I guess the url should have been:
  
  
  
 
 http://hokiesuns.blogspot.com/2012/11/using-solrs-postfiltering-to-collect.html
  
   i.e. without 'and' in the end of it.
  
   -- Dmitry
  
   On Fri, Nov 9, 2012 at 12:03 PM, Erick Erickson 
 erickerick...@gmail.com
   wrote:
  
It's always good when someone writes up their experiences!
   
But when I try to follow that link, I get to your Random Writings,
  but
   it
tells me that the blog post doesn't exist...
   
Erick
   
   
On Thu, Nov 8, 2012 at 4:21 PM, Amit Nithian anith...@gmail.com
  wrote:
   
 Hey all,

 I wanted to thank those who have helped in answering some of my
   esoteric
 questions and especially the one about using Solr's post filtering
feature
 to implement some score statistics gathering we had to do at
 Zvents.

 To show this appreciation and to help advance the knowledge of this
   space
 in a more codified fashion, I have written a blog post about this
  work
and
 open sourced the work as well.

 Please take a read by visiting


   
  
 
 http://hokiesuns.blogspot.com/2012/11/using-solrs-postfiltering-to-collect.html
 and please let me know if there are any inaccuracies or points of
 contention so I can address/correct them.

 Thanks!
 Amit

   
  
  
  
   --
   Regards,
  
   Dmitry Kan
  
 



Re: amount of values in a multi value field - is denormalization always the best option?

2013-07-10 Thread Marcelo Elias Del Valle
Jack,

 When you say "large number of values in a single document", you also
mean a block in a block join, right? Exactly the same thing, agree?
 In my case, I have just 1 insert and no updates. Even in this case, do
you think a large document or block would be a really bad idea? I am more
worried about the search time.

Best regards,
Marcelo.


2013/7/10 Jack Krupansky j...@basetechnology.com

 Simple answer: avoid large number of values in a single document. There
 should only be a modest to moderate number of fields in a single document.

 Is the data relatively static, or subject to frequent updates? To update
 any field of a single document, even with atomic update, requires Solr to
 read and rewrite every field of the document. So, lots of smaller documents
 are best for a frequent update scenario.

 Multivalued fields are great for storing a relatively small list of
 values. You can add to the list easily, but under the hood, Solr must read
 and rewrite the full list as well as the full document. And, there is no
 way to address or synchronize individual elements of multivalued fields.

 Joins are great... if used in moderation. Heavy use of joins is not a
 great idea.

 -- Jack Krupansky

 -Original Message- From: Marcelo Elias Del Valle
 Sent: Wednesday, July 10, 2013 5:37 PM
 To: solr-user@lucene.apache.org
 Subject: amount of values in a multi value field - is denormalization
 always the best option?


 Hello,

I have asked a question recently about solr limitations and some about
 joins. It comes that this question is about both at the same time.
I am trying to figure how to denormalize my data so I will need just 1
 document in my index instead of performing a join. I figure one way of
 doing this is storing an entity as a multivalued field, instead of storing
 different fields.
Let me give an example. Consider the entities:

 User:
id: 1
   name: Joan of Arc
age: 27

 Webpage:
id: 1
   url: http://wiki.apache.org/solr/Join
category: Technical
user_id: 1

id: 2
url: http://stackoverflow.com
category: Technical
user_id: 1

Instead of creating 1 document for user, 1 for webpage 1 and 1 for
 webpage 2 (1 parent and 2 children) I could store webpages in a user
 multivalued field, as follows:

 User:
id: 1
name: Joan of Arc
age: 27
   webpage1: [id:1, url: "http://wiki.apache.org/solr/Join", category:
 "Technical"]
   webpage2: [id:2, url: "http://stackoverflow.com", category:
 "Technical"]

It would probably perform better than the join, right? However, it made
 me think about solr limitations again. What if I have 200 million webpages
 (200 million fields) per user? Or imagine a case where I could have 200
 million values on a field, like in the case I need to index every html DOM
 element (div, a, etc.) for each web page user visited.
I mean, if I need to do the query and this is a business requirement no
 matter what, although denormalizing could be better than using query time
 joins, I wonder if distributing the data present in this single document
 across the cluster wouldn't give me better performance. And this is
 something I won't get with block joins or multivalued fields...
   I guess there is probably no right answer for this question (at least
 not a known one), and I know I should create a POC to check how each
 performs... But do you think such a large number of values in a single
 document could make denormalization not possible in an extreme case like
 this? Would you agree with me if I said denormalization is not always
 the right option?

 Best regards,
 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr




-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: amount of values in a multi value field - is denormalization always the best option?

2013-07-10 Thread Roman Chyla
On Wed, Jul 10, 2013 at 5:37 PM, Marcelo Elias Del Valle mvall...@gmail.com
 wrote:

 Hello,

 I have asked a question recently about solr limitations and some about
 joins. It comes that this question is about both at the same time.
 I am trying to figure how to denormalize my data so I will need just 1
 document in my index instead of performing a join. I figure one way of
 doing this is storing an entity as a multivalued field, instead of storing
 different fields.
 Let me give an example. Consider the entities:

 User:
 id: 1
 name: Joan of Arc
 age: 27

 Webpage:
 id: 1
 url: http://wiki.apache.org/solr/Join
 category: Technical
 user_id: 1

 id: 2
 url: http://stackoverflow.com
 category: Technical
 user_id: 1

 Instead of creating 1 document for user, 1 for webpage 1 and 1 for
 webpage 2 (1 parent and 2 children) I could store webpages in a user
 multivalued field, as follows:

 User:
 id: 1
 name: Joan of Arc
 age: 27
 webpage1: [id:1, url: "http://wiki.apache.org/solr/Join", category:
 "Technical"]
 webpage2: [id:2, url: "http://stackoverflow.com", category:
 "Technical"]

 It would probably perform better than the join, right? However, it made
 me think about solr limitations again. What if I have 200 million webpages
 (200 million fields) per user? Or imagine a case where I could have 200
 million values on a field, like in the case I need to index every html DOM
 element (div, a, etc.) for each web page user visited.
 I mean, if I need to do the query and this is a business requirement no
 matter what, although denormalizing could be better than using query time
 joins, I wonder if distributing the data present in this single document
 across the cluster wouldn't give me better performance. And this is
 something I won't get with block joins or multivalued fields...


Indeed, and when you think of it, then there are only (2?) alternatives

1. let you distributed search cluster have the knowledge of relations
2. denormalize & duplicate the data


 I guess there is probably no right answer for this question (at least
 not a known one), and I know I should create a POC to check how each
 performs... But do you think such a large number of values in a single
 document could make denormalization not possible in an extreme case like
 this? Would you agree with me if I said denormalization is not always
 the right option?


Aren't words of natural language (and whatever crap there comes with them
in the fulltext) similar? You may not want to retrieve relations between
every word that you indexed, but still you can index millions of unique
tokens (well, having 200 million seems too high). But if you were having
such a high number of unique values, you could think of indexing hash values
- a search for 'near-duplicates' could be acceptable too.

And so, with lucene, only denormalization will give you anywhere close
to acceptable search speed. If you look at the code that executes the join
search, you would see that values for the 1st order search are harvested,
then a new search (or lookup) is performed - so it has to be almost always
slower than the inverted index lookup.

roman



 Best regards,
 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr



Re: amount of values in a multi value field - is denormalization always the best option?

2013-07-10 Thread Jack Krupansky
Join is a query operation - it has nothing to do with the number of values 
(fields and multivalued fields) in a Solr/Lucene document.


Block insert isn't available yet anyway, so we don't have any clear 
assessments of its performance.


Generally, any kind of large block of data is not a great idea.

1. Break things down.
2. Keep things simple.
3. Join is not simple.
4. Only use non-simple features in careful moderation.

There is no reasonable short cut to doing a robust data model. Shortcuts may 
seem enticing in the short run, but will eat you alive in the long run.


-- Jack Krupansky

-Original Message- 
From: Marcelo Elias Del Valle

Sent: Wednesday, July 10, 2013 6:52 PM
To: solr-user@lucene.apache.org
Subject: Re: amount of values in a multi value field - is denormalization 
always the best option?


Jack,

When you say "large number of values in a single document", you also
mean a block in a block join, right? Exactly the same thing, agree?
In my case, I have just 1 insert and no updates. Even in this case, do
you think a large document or block would be a really bad idea? I am more
worried about the search time.

Best regards,
Marcelo.


2013/7/10 Jack Krupansky j...@basetechnology.com


Simple answer: avoid large number of values in a single document. There
should only be a modest to moderate number of fields in a single document.

Is the data relatively static, or subject to frequent updates? To update
any field of a single document, even with atomic update, requires Solr to
read and rewrite every field of the document. So, lots of smaller 
documents

are best for a frequent update scenario.

Multivalued fields are great for storing a relatively small list of
values. You can add to the list easily, but under the hood, Solr must read
and rewrite the full list as well as the full document. And, there is no
way to address or synchronize individual elements of multivalued fields.

Joins are great... if used in moderation. Heavy use of joins is not a
great idea.

-- Jack Krupansky

-Original Message- From: Marcelo Elias Del Valle
Sent: Wednesday, July 10, 2013 5:37 PM
To: solr-user@lucene.apache.org
Subject: amount of values in a multi value field - is denormalization
always the best option?


Hello,

   I have asked a question recently about solr limitations and some about
joins. It comes that this question is about both at the same time.
   I am trying to figure how to denormalize my data so I will need just 1
document in my index instead of performing a join. I figure one way of
doing this is storing an entity as a multivalued field, instead of storing
different fields.
   Let me give an example. Consider the entities:

User:
   id: 1
   name: Joan of Arc
   age: 27

Webpage:
   id: 1
   url: 
http://wiki.apache.org/solr/**Joinhttp://wiki.apache.org/solr/Join

   category: Technical
   user_id: 1

   id: 2
   url: http://stackoverflow.com
   category: Technical
   user_id: 1

   Instead of creating 1 document for user, 1 for webpage 1 and 1 for
webpage 2 (1 parent and 2 children) I could store webpages in a user
multivalued field, as follows:

User:
   id: 1
   name: Joan of Arc
   age: 27
   webpage1: [id:1, url: "http://wiki.apache.org/solr/Join", category:
"Technical"]
   webpage2: [id:2, url: "http://stackoverflow.com", category:
"Technical"]

   It would probably perform better than the join, right? However, it made
me think about solr limitations again. What if I have 200 million webpages
(200 million fields) per user? Or imagine a case where I could have 200
million values on a field, like in the case I need to index every html DOM
element (div, a, etc.) for each web page user visited.
   I mean, if I need to do the query and this is a business requirement no
matter what, although denormalizing could be better than using query time
joins, I wonder if distributing the data present in this single document
across the cluster wouldn't give me better performance. And this is
something I won't get with block joins or multivalued fields...
   I guess there is probably no right answer for this question (at least
not a known one), and I know I should create a POC to check how each
performs... But do you think such a large number of values in a single
document could make denormalization not possible in an extreme case like
this? Would you agree with me if I said denormalization is not always
the right option?

Best regards,
--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr





--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr 



RE: Overseer queues confused me

2013-07-10 Thread Illu.Y.Ying (mis.sh04.Newegg) 41417
Can someone answer my question?

Thanks in advance

Best Regards,
Illu Ying

-Original Message-
From: Illu.Y.Ying (mis.sh04.Newegg) 41417 [mailto:illu.y.y...@newegg.com] 
Sent: Wednesday, July 10, 2013 10:44 AM
To: solr-user@lucene.apache.org
Subject: Overseer queues confused me

Hi there:
 In the Solr 4.3 source code, I found the Overseer uses 3 queues to handle all
SolrCloud management requests:
 1: /overseer/queue
2: /overseer/queue-work
3: /overseer/collection-queue-work

 ClusterStateUpdater uses the 1st & 2nd queues to handle SolrCloud shard or
state requests.
 It peeks a request from the 1st queue, then offers it to the 2nd queue and
handles it.

 OverseerCollectionProcessor uses the 3rd queue to handle collection-related
requests.

 My question is: why does ClusterStateUpdater use 2 queues, while
OverseerCollectionProcessor can also handle requests correctly with just 1?
 Is there any additional design reason for ClusterStateUpdater?


 Thanks in advance:)
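
 One way to inspect these queues directly is with a plain ZooKeeper client
(a sketch; the connect string is an assumption):

import org.apache.zookeeper.ZooKeeper;

public class OverseerQueuePeek {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:9983", 10000, null);
        String[] queues = {"/overseer/queue", "/overseer/queue-work",
                "/overseer/collection-queue-work"};
        for (String path : queues) {
            // list the queued znodes under each queue
            System.out.println(path + " -> " + zk.getChildren(path, false));
        }
        zk.close();
    }
}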


Best Regards,
Illu Ying



expunging deletes

2013-07-10 Thread Petersen, Robert
Hi guys,

Using solr 3.6.1 and the following settings, I am trying to run without 
optimizes.  I used to optimize nightly, but sometimes the optimize took a very 
long time to complete and slowed down our indexing.  We are continuously 
indexing our new or changed data all day and night.  After a few days running 
without an optimize, the index size has nearly doubled and maxdocs is nearly 
twice the size of numdocs.  I understand deletes should be expunged on merges, 
but even after trying lots of different settings for our merge policy it seems 
this growth is somewhat unbounded.  I have tried sending an optimize with 
numSegments = 2 which is a lot lighter weight than a regular optimize and that 
does bring the number down but not by too much.  Does anyone have any ideas for 
better settings for my merge policy that would help?  Here is my current index 
snapshot too:

Location: /var/LucidWorks/lucidworks/solr/1/data/index
Size: 25.05 GB  (when the index is optimized it is around 15.5 GB)
searcherName : Searcher@6c3a3517 main 
caching : true 
numDocs : 16852155 
maxDoc : 24512617 
reader : 
SolrIndexReader{this=6e3b4ec8,r=ReadOnlyDirectoryReader@6e3b4ec8,refCnt=1,segments=61}
 


<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">35</int>
  <int name="segmentsPerTier">35</int>
  <int name="maxMergeAtOnceExplicit">105</int>
  <double name="maxMergedSegmentMB">6144.0</double>
  <double name="reclaimDeletesWeight">8.0</double>
</mergePolicy>

<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <int name="maxMergeCount">20</int>
  <int name="maxThreadCount">3</int>
</mergeScheduler>

Thanks,

Robert (Robi) Petersen
Senior Software Engineer
Search Department


   (formerly Buy.com)
85 enterprise, suite 100
aliso viejo, ca 92656
tel 949.389.2000 x5465
fax 949.448.5415


  





Re: expunging deletes

2013-07-10 Thread Shawn Heisey
On 7/10/2013 5:58 PM, Petersen, Robert wrote:
 Using solr 3.6.1 and the following settings, I am trying to run without 
 optimizes.  I used to optimize nightly, but sometimes the optimize took a 
 very long time to complete and slowed down our indexing.  We are continuously 
 indexing our new or changed data all day and night.  After a few days running 
 without an optimize, the index size has nearly doubled and maxdocs is nearly 
 twice the size of numdocs.  I understand deletes should be expunged on 
 merges, but even after trying lots of different settings for our merge policy 
 it seems this growth is somewhat unbounded.  I have tried sending an optimize 
 with numSegments = 2 which is a lot lighter weight than a regular optimize 
 and that does bring the number down but not by too much.  Does anyone have 
 any ideas for better settings for my merge policy that would help?  Here is 
 my current index snapshot too:

Your merge settings are the equivalent of the old mergeFactor set to 35,
and based on the fact that you have the Explicit set to 105, I'm
guessing your settings originally came from something I posted - these
are the numbers that I use.  These settings can result in a very large
number of segments on your disk.

Because you index a lot (and probably reindex existing documents often),
I can understand why you have high merge settings, but if you want to
eliminate optimizes, you'll need to go lower.  The default merge setting
of 10 (with an Explicit value of 30) is probably a good starting point,
but you might need to go even smaller.

On Solr 3.6, an optimize probably cannot take place at the same time as
index updates -- the optimize would probably delay updates until after
it's finished.  I remember running into problems on Solr 3.x, so I set
up my indexing program to stop updates while the index was optimizing.

Solr 4.x should lift any restriction where optimizes and updates can't
happen at the same time.

With an index size of 25GB, a six-drive RAID10 should be able to
optimize in 10-15 minutes, but if your I/O system is single disk, RAID1,
RAID5, or RAID6, the write performance may cause this to take longer.
If you went with SSD, optimizes would happen VERY fast.

Thanks,
Shawn



solr 4.3 solrj generating search terms that return no results

2013-07-10 Thread dboychuck
I'm having trouble with solrj generating a query like q=kohler%5C+k for the
search term 'Kohler k'

I am using Solr 4.3 in cloud mode. When I remove the %5C everything is fine.
I'm not sure why the %5C is being added when I call
solrQuery.setQuery('Kohler k');

Any help is appreciated.





Re: solr 4.3 solrj generating search terms that return no results

2013-07-10 Thread Shawn Heisey
On 7/10/2013 6:34 PM, dboychuck wrote:
 I'm having trouble with solrj generating a query like q=kohler%5C+k for the
 search term 'Kohler k'
 
 I am using Solr 4.3 in cloud mode. When I remove the %5C everything is fine.
 I'm not sure why the %5C is being added when I call
 solrQuery.setQuery('Kohler k');
 
 Any help is appreciated.

%5C is a backslash.  In order for a space to be a literal part of a
query string and not a tokenization point, it must be escaped, and the
character for doing that is a backslash.

I would not have expected this to be added, though.  I am in the process
of building a test app to try this.  Can you use http://apaste.info to
share more of your solrj code?  I should also be on IRC momentarily.

Thanks,
Shawn



Re: solr 4.3 solrj generating search terms that return no results

2013-07-10 Thread dboychuck
solrQuery.setQuery(ClientUtils.escapeQueryChars(keyword));

It looks like using the solrj ClientUtils.escapeQueryChars function is
escaping any spaces with %5C+ which returns 0 results at search time.
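
A hedged sketch of one workaround: escape each term separately so the
spaces remain tokenization points (the keyword value is just an example):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.util.ClientUtils;

public class EscapeTermsExample {
    public static void main(String[] args) {
        String keyword = "Kohler k";
        StringBuilder sb = new StringBuilder();
        for (String term : keyword.trim().split("\\s+")) {
            if (sb.length() > 0) sb.append(' ');
            // escape special characters inside each term, not the spaces
            sb.append(ClientUtils.escapeQueryChars(term));
        }
        SolrQuery solrQuery = new SolrQuery();
        solrQuery.setQuery(sb.toString()); // "Kohler k", not "Kohler\ k"
        System.out.println(solrQuery.getQuery());
    }
}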





Re: Moving replica from node to node?

2013-07-10 Thread Otis Gospodnetic
Thanks Mark.  I assume you are referring to using the Core Admin API -
CREATE and UNLOAD?

Added https://issues.apache.org/jira/browse/SOLR-5032
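
Until that sugar method exists, a hedged sketch of Mark's recipe via the
Core Admin API (host names, core names and the shard id are made up):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class MoveReplicaExample {
    public static void main(String[] args) throws Exception {
        // 1. create a new replica of shard1 on the TO node
        HttpSolrServer toNode = new HttpSolrServer("http://to-host:8983/solr");
        CoreAdminRequest.Create create = new CoreAdminRequest.Create();
        create.setCoreName("collection1_shard1_replica3");
        create.setCollection("collection1");
        create.setShardId("shard1");
        create.process(toNode);

        // 2. once the new replica is active, unload the one on the FROM node
        HttpSolrServer fromNode =
            new HttpSolrServer("http://from-host:8983/solr");
        CoreAdminRequest.unloadCore("collection1_shard1_replica1", fromNode);
    }
}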

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Mon, Jul 8, 2013 at 10:50 PM, Mark Miller markrmil...@gmail.com wrote:
 It's simply a sugar method that no one has gotten to yet. I almost have once 
 or twice, but I always have moved onto other things before even starting.

 It's fairly simple to just start another replica on the TO node and then 
 delete the replica on the FROM node, so not a lot of urgency.

 - Mark

 On Jul 8, 2013, at 10:18 PM, Otis Gospodnetic otis.gospodne...@gmail.com 
 wrote:

 Hi,

 Solr(Cloud) currently doesn't have any facility to move a specific
 replica from one node to the other.

 How come?  Is there a technical or philosophical reason, or just the
 24 hours/day reason?

 Thanks,
 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm



Re: Switch to new leader transparently?

2013-07-10 Thread Floyd Wu
Thanks Aloke, I will do some research.
On 2013/7/10 9:45 PM, Aloke Ghoshal alghos...@gmail.com wrote:

 Hi Floyd,

 We use SolrNet to connect to Solr from a C# application. Since SolrNet is
 not aware of SolrCloud or ZK, we use an HTTP load balancer in front of
 the Solr nodes & query via the load balancer URL. You could use something
 like HAProxy or an Apache reverse proxy for load balancing.

 On the other hand in order to write a ZK aware client in C# you could start
 here: https://github.com/ewhauser/zookeeper/tree/trunk/src/dotnet

 Regards,
 Aloke


 On Wed, Jul 10, 2013 at 4:11 PM, Furkan KAMACI furkankam...@gmail.com
 wrote:

  By the way, this is not related to your question but this may help you for
  connecting to Solr via C#: http://solrsharp.codeplex.com/
 
  2013/7/10 Floyd Wu floyd...@gmail.com
 
   Hi Furkan
   I'm using C#, so SolrJ won't help on this, but its impl is a good
 reference
   for me. Thanks for your help.
  
   By the way, how do I fetch/get the cluster state from ZK directly over
   plain HTTP or a TCP socket?
   In my SolrCloud cluster, I'm using standalone zk to coordinate.
  
   Floyd
  
  
  
  
   2013/7/10 Furkan KAMACI furkankam...@gmail.com
  
     You can define a CloudSolrServer like this:

         private static CloudSolrServer solrServer;

     and then define the address of your ZooKeeper host:

         private static String zkHost = "localhost:9983";

     initialize your variable:

         solrServer = new CloudSolrServer(zkHost);

     You can get the leader list like this:

         ClusterState clusterState =
             solrServer.getZkStateReader().getClusterState();
         List<Replica> leaderList = new ArrayList<Replica>();
         for (Slice slice : clusterState.getSlices(collectionName)) {
             leaderList.add(slice.getLeader());
         }

     For querying you can try this:

         SolrQuery solrQuery = new SolrQuery();
         // fill your solrQuery variable here
         QueryRequest queryRequest = new QueryRequest(solrQuery,
             SolrRequest.METHOD.POST);
         queryRequest.process(solrServer);

     CloudSolrServer uses LBHttpSolrServer by default. Its definition is
     like this: LBHttpSolrServer, or "Load Balanced HttpSolrServer", is
     just a wrapper to CommonsHttpSolrServer. This is useful when you have
     multiple SolrServers and query requests need to be load balanced among
     them. It offers automatic failover when a server goes down, and it
     detects when the server comes back up.
   
2013/7/10 Anshum Gupta ans...@anshumgupta.net
   
      You don't really need to direct any query specifically to a leader.
      It will automatically be routed to the right leader.
      You may put a load balancer on top just to fix the problem of
      querying a node that has gone away.

      Also, the ZK-aware SolrJ Java client (CloudSolrServer) load-balances
      across all nodes in the cluster.


 On Wed, Jul 10, 2013 at 2:52 PM, Floyd Wu floyd...@gmail.com
  wrote:

   Hi there,

   I've built a SolrCloud cluster from the example, but I have some
   questions. When I send a query to one leader (say
   http://xxx.xxx.xxx.xxx:8983/solr/collection1), everything works fine.

   When I shut down that leader, the other replica
   (http://xxx.xxx.xxx.xxx:9983/solr/collection1) in the same shard will
   become the new leader. The problem is:

   The application doesn't know the new leader's location and still sends
   requests to http://xxx.xxx.xxx.xxx:8983/solr/collection1, and of course
   gets no response.

   How can I know the new leader in my application?
   Is there any mechanism so the application can send requests to one
   fixed endpoint, no matter who the leader is?

   For example, the application just sends to
   http://xxx.xxx.xxx.xxx:8983/solr/collection1
   even if the real leader runs on
   http://xxx.xxx.xxx.xxx:9983/solr/collection1

   Please help with this or give me some key information to google it.

   Many thanks.

   Floyd
 



 --

 Anshum Gupta
 http://www.anshumgupta.net
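
On Floyd's side question about reading the cluster state over plain HTTP:
one option worth verifying is the /solr/zookeeper servlet that the Solr 4.x
admin UI's Cloud tab itself reads. A rough sketch (the node address is a
placeholder, and the exact parameters should be checked against your Solr
version):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class ClusterStateFetch {
        public static void main(String[] args) throws Exception {
            // The servlet returns JSON for the requested ZK path;
            // clusterstate.json holds the live leader/replica map.
            URL url = new URL("http://localhost:8983/solr/zookeeper"
                + "?detail=true&path=/clusterstate.json");
            BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
            StringBuilder json = new StringBuilder();
            for (String line; (line = in.readLine()) != null; ) {
                json.append(line).append('\n');
            }
            in.close();
            System.out.println(json);
        }
    }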

   
  
 



Indexing database in Solr using Data Import Handler

2013-07-10 Thread archit2112

I'm trying to index a MySQL database using the Data Import Handler in Solr.

I have made two tables. The first table holds the metadata of a file.

create table filemetadata (
id varchar(20) primary key ,
filename varchar(50),
path varchar(200),
size varchar(10),
author varchar(50)
) ;

The second table contains the favourite info about a particular file in
the above table.

create table filefav (
fid varchar(20) primary key ,
id varchar(20),
favouritedby varchar(300),
favouritedtime varchar(10),
FOREIGN KEY (id) REFERENCES filemetadata(id) 
) ;

As you can see, id is a foreign key.

To index this I have written the following data-config.xml:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/test" user="root" password="root" />
  <document name="filemetadata">

    <entity name="restaurant" query="select * from filemetadata">
      <field column="id" name="id" />

      <entity name="filefav"
              query="select favouritedby from filefav where id='${filemetadata.id}'">
        <field column="favouritedby" name="favouritedby1" />
      </entity>

      <field column="filename" name="name1" />
      <field column="path" name="path1" />
      <field column="size" name="size1" />
      <field column="author" name="author1" />

    </entity>

  </document>
</dataConfig>

Everything is working, but the favouritedby1 field is not getting indexed;
i.e., that field does not exist when I run the *:* query. Can you please
help me out?
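
One thing worth double-checking (an observation about how DIH resolves
variables, not something stated above): sub-entity queries reference parent
columns through the parent *entity's* name, not the table name. With the
parent entity named "restaurant", the reference would normally be
${restaurant.id}; as written, ${filemetadata.id} resolves to an empty string,
so the filefav query matches no rows. A minimal sketch of the inner entity
under that assumption:

    <entity name="restaurant" query="select * from filemetadata">
      <field column="id" name="id" />
      <entity name="filefav"
              query="select favouritedby from filefav where id='${restaurant.id}'">
        <field column="favouritedby" name="favouritedby1" />
      </entity>
    </entity>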



