Re: Swapping Cores

2013-11-19 Thread Shawn Heisey
On 11/19/2013 10:18 PM, Tirthankar Chatterjee wrote:
> I have a site that I crawl and host the index for. The web site changes every
> month, which requires a re-crawl, and a new Solr index is created. How can I
> effectively swap the previous one with the new one with minimal downtime for
> search?
> 
> We have tried swapping the core, but if Tomcat is restarted for any reason,
> the temp core data is gone after the restart. Is there a way we don't lose
> the new index after the swap?

The reply you received from Otis assumes that you're using SolrCloud.  I
looked back at previous messages that you have sent to the list, where
you were using version 3.6, but that was over a year ago, so I don't
know whether you've upgraded to 4.x yet, and I don't know if you've gone
with SolrCloud.

If you are not using SolrCloud, then you can do core swapping, and the
next paragraph will apply.  If you are using SolrCloud, then you can't;
you must use collection aliasing.

Do you have persistent set to true in your solr.xml?  This is required
for core swapping to work properly through restarts.
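
For reference, a minimal sketch of a legacy-style solr.xml with persistence
enabled, plus the CoreAdmin SWAP call (the core names here are made up):

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="live" instanceDir="live"/>
      <core name="ondeck" instanceDir="ondeck"/>
    </cores>
  </solr>

  http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=ondeck

With persistent="true", Solr rewrites solr.xml after the swap, so the
name-to-instanceDir mapping survives a Tomcat restart.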

Thanks,
Shawn



RE: facet method=enum and uninvertedfield limitations

2013-11-19 Thread Dmitry Kan
Since you are faceting on a text field (is this correct?), you are dealing with
a lot of unique values in it. So your best bet is the enum method. Also, if you
are on Solr 4.x, try building doc values in the index: this suits faceting
well.
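
As a sketch (the field name below is made up, and note that in 4.x docValues
requires a non-tokenized type such as string rather than a full-text field),
enabling doc values is just a schema attribute plus a reindex:

  <field name="category" type="string" indexed="true" stored="false" docValues="true"/>

Faceting on such a field then reads the uninverted structure from disk instead
of building it in the field cache.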

Otherwise start from your spec once again. Can you use shingles instead?
On 19 Nov 2013 17:44, "Lemke, Michael SZ/HZA-ZSW" 
wrote:

> On Friday, November 15, 2013 11:22 AM, Lemke, Michael SZ/HZA-ZSW wrote:
>
> Judging from numerous replies this seems to be a tough question.
> Nevertheless, I'd really appreciate any help as we are stuck.
> We'd really like to know what in our index causes the facet.method=fc
> query to fail.
>
> Thanks,
> Michael
>
> >On Thu, November 14, 2013 7:26 PM, Yonik Seeley wrote:
> >>On Thu, Nov 14, 2013 at 12:03 PM, Lemke, Michael  SZ/HZA-ZSW
> >> wrote:
> >>> I am running into performance problems with faceted queries.
> >>> If I do a
> >>>
> >>>
> q=word&facet.field=CONTENT&facet=true&facet.limit=10&facet.mincount=1&facet.method=fc&facet.prefix=a&rows=0
> >>>
> >>> I am getting an exception:
> >>> org.apache.solr.common.SolrException: Too many values for
> UnInvertedField faceting on field CONTENT
> >>> at
> org.apache.solr.request.UnInvertedField.uninvert(UnInvertedField.java:384)
> >>> at
> org.apache.solr.request.UnInvertedField.(UnInvertedField.java:178)
> >>> at
> org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:839)
> >>> ...
> >>>
> >>> I understand it's got something to do with a 24bit limit somewhere
> >>> in the code but I don't understand enough of it to be able to construct
> >>> a specialized index that can be queried with facet.method=enum.
> >>
> >>You shouldn't need to do anything differently to try facet.method=enum
> >>(just replace facet.method=fc with facet.method=enum)
> >
> >This is true and facet.method=enum does work indeed.  The problem is
> >runtime.  In particular queries with an empty facet.prefix= run many
> >seconds if not minutes.  I initially asked about this here:
> >
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201310.mbox/%3c33ec3398272fbe47b64ee3b3e98f69a761427...@de011521.schaeffler.com%3E
> >
> >It was suggested that fc is much faster than enum and I'd like to
> >test that.  We are still fairly free to design the index such that
> >it performs well.  But to do that we need to understand what is
> >killing it.
> >
> >>
> >>You may also want to add the parameter
> >>facet.enum.cache.minDf=10
> >>to lower memory usage by only using the filter cache for terms that
> >>match more than 100K docs.
> >
> >That helped a little, cut down my particular test from 10 sec to 5 sec.
> >But still too slow.  Mind you this is for an autosuggest feature.
> >
> >Thanks for your reply.
> >
> >Michael
> >
> >
>
>


Re: Swapping Cores

2013-11-19 Thread Otis Gospodnetic
Have a look at https://issues.apache.org/jira/browse/SOLR-4497 +
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-CreateormodifyanAliasforaCollection
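
As a rough sketch of the aliasing approach (collection and alias names are
hypothetical): index each crawl into a fresh collection, then atomically
re-point the alias that your search clients use via the Collections API:

  http://host:8983/solr/admin/collections?action=CREATEALIAS&name=site&collections=site_201312
  ... after the next crawl into site_201401 ...
  http://host:8983/solr/admin/collections?action=CREATEALIAS&name=site&collections=site_201401

Queries keep going to /solr/site/select and switch over without downtime; the
old collection can be deleted afterwards.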

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Nov 20, 2013 at 12:18 AM, Tirthankar Chatterjee <
tchatter...@commvault.com> wrote:

> Hi,
> I have a site that I crawl and host the index for. The web site changes
> every month, which requires a re-crawl, and a new Solr index is created.
> How can I effectively swap the previous one with the new one with minimal
> downtime for search?
>
> We have tried swapping the core, but if Tomcat is restarted for any reason,
> the temp core data is gone after the restart. Is there a way we don't lose
> the new index after the swap?
>
> Thanks,
> Tirthankar
>
>
>
>
> ***Legal Disclaimer***
> "This communication may contain confidential and privileged material for
> the
> sole use of the intended recipient. Any unauthorized review, use or
> distribution
> by others is strictly prohibited. If you have received the message by
> mistake,
> please advise the sender by reply email and delete the message. Thank you."
> **
>


Re: Solr spatial search within the polygon

2013-11-19 Thread Dhanesh Radhakrishnan
Hi David,
Thank you for your reply
This is my current schema: the field type "location_rpt" is a
SpatialRecursivePrefixTreeFieldType, and the field "location" is of type
"location_rpt" and is multiValued.
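
(The actual schema snippet was stripped from the archived message; a typical
Solr 4.x definition along these lines, with the JTS factory that polygon
support requires, would look roughly like this:)

  <fieldType name="location_rpt"
             class="solr.SpatialRecursivePrefixTreeFieldType"
             spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
             distErrPct="0.025" maxDistErr="0.000009" units="degrees"/>

  <field name="location" type="location_rpt" indexed="true" stored="true" multiValued="true"/>
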
Whenever I add a document to Solr, I collect the current latitude and
longitude of the particular business and index them in the field "location".
It's like:
$doc->setField('location', $business['latitude']." ".$business['longitude']);
This should look like "location":["9.445890 76.540970"] in Solr.

What I'm doing is that in the map view of the search results there is a
provision to draw a polygon on the map and fetch results based on the drawing.

http://localhost:8983/solr/poc/select?fl=id,name,locality&wt=json&json.nl=map&q=*:*&fq=state:Kerala&fq=location:"IsWithin(POLYGON((9.471920923238988
76.5496015548706,9.464174399734185 76.53947353363037,9.457232011740006
76.55457973480225,9.471920923238988 76.5496015548706)))
distErrPct=0"&debugQuery=true

I'm pretty sure that the coordinates are in the right position.
"9.445890,76.540970" is in India, precisely in Kerala state :)

I would appreciate it if you could kindly correct me if I'm going about this
the wrong way.

This is the response from Solr:

"responseHeader": {
"status": 0,
"QTime": 5
},
"response": {
"numFound": 3,
"start": 0,
"docs": [
{
"id": "192",
"name": "50 cents of ideal plot",
"locality": "Changanassery"
},
{
"id": "189",
"name": "new independent house for sale",
"locality": "Changanassery"
},
{
"id": "188",
"name": "Renovated Resort style home with 21 cent",
"locality": "Changanassery"
}
]
}



Here is the debug mode output of the query


"debug": {
"rawquerystring": "*:*",
"querystring": "*:*",
"parsedquery": "MatchAllDocsQuery(*:*)",
"parsedquery_toString": "*:*",
"explain": {
"188": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 =
queryNorm\n",
"189": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 =
queryNorm\n",
"192": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 =
queryNorm\n"
},
"QParser": "LuceneQParser",
"filter_queries": [
"state:Kerala",
"location:\"IsWithin(POLYGON((9.471920923238988
76.5496015548706,9.464174399734185 76.53947353363037,9.457232011740006
76.55457973480225,9.471920923238988 76.5496015548706))) distErrPct=0\""
],
"parsed_filter_queries": [
"state:Kerala",

"ConstantScore(org.apache.lucene.spatial.prefix.WithinPrefixTreeFilter@1ed6c279
)"
],
"timing": {
"time": 5,
"prepare": {
"time": 1,
"query": {
"time": 1
},
"facet": {
"time": 0
},
"mlt": {
"time": 0
},
"highlight": {
"time": 0
},
"stats": {
"time": 0
},
"debug": {
"time": 0
}
},
"process": {
"time": 4,
"query": {
"time": 3
},
"facet": {
"time": 0
},
"mlt": {
"time": 0
},
"highlight": {
"time": 0
},
"stats": {
"time": 0
},
"debug": {
"time": 1
}
}
}
}


On Tue, Nov 19, 2013 at 8:56 PM, Smiley, David W.  wrote:

>
>
> On 11/19/13 4:06 AM, "Dhanesh Radhakrishnan"  wrote:
>
> >Hi David,
> >Thank you so much for the detailed reply. I've checked each and every lat
> >lng coordinates and its a purely polygon.
> >After  some time I did one change in the lat lng indexing.
> >Changed the indexing format.
> >
> >Initially I indexed the latitude and longitude separated by comma  Eg:-
> >"location":["9.445890,76.540970"]
> >Instead I indexed with space.
> > "location":["9.445890 76.540970"]
>
> Just to be clear, if you use a space, it's "x y" order.  If you use a
> comma, it's "y, x" order.  If you use WKT, it's always a space in X Y
> order (and of course, the shape name and other stuff).  You may have
> gotten your search to work but I have a hunch you have all your latitudes
> and longitudes in the wrong position, but I can't possibly know for sure
> because your example datapoint is ambiguous.  76 degrees latitude is
> pretty far up there though, hence my hunch you've got it wrong.
>
> >
> >and it worked
> >
> >Also from your observation on IsWithIn predicate I tested with Intersects
> >and I found there is a  difference in the QTime.
> >
> >For IsWithin
> > "QTime": 9
> >ResponseHeader": {
> >"status": 0,
> >"QTime": 9
> >},
> >
> >When I used "Intersects"
> >responseHeader": {
> >"status": 0,
> >"QTime": 26
> >}
>
> Th

Re: {!cache=false} for regular queries?

2013-11-19 Thread Mikhail Khludnev
It should bypass cache for sure
https://github.com/apache/lucene-solr/blob/34a92d090ac4ff5c8382e1439827d678265ede0d/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L1263
On 20.11.2013 at 7:05, "Otis Gospodnetic" wrote:

> Hi,
>
> We have the ability to turn off caching for filter queries -
> http://wiki.apache.org/solr/CommonQueryParameters#Caching_of_filters
>
> I didn't try it, but I think one can't turn off caching for regular
> queries, a la:
>
> q={!cache=false}
>
> Is there a reason this could not be done?
>
> Thanks,
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>


Swapping Cores

2013-11-19 Thread Tirthankar Chatterjee
Hi,
I have a site that I crawl and host the index for. The web site changes every
month, which requires a re-crawl, and a new Solr index is created. How can I
effectively swap the previous one with the new one with minimal downtime for
search?

We have tried swapping the core, but if Tomcat is restarted for any reason, the
temp core data is gone after the restart. Is there a way we don't lose the new
index after the swap?

Thanks,
Tirthankar

 


***Legal Disclaimer***
"This communication may contain confidential and privileged material for the
sole use of the intended recipient. Any unauthorized review, use or distribution
by others is strictly prohibited. If you have received the message by mistake,
please advise the sender by reply email and delete the message. Thank you."
**


Re: JVM tuning?

2013-11-19 Thread Otis Gospodnetic
Quickly scanned this and from what I can tell it picks values for things
like Xmx based on memory found on the host.  This is a fine first guess, but
ultimately one wants control over that and adjusts it based on factors
beyond just available RAM, such as whether sorting is used, or faceting, on
how many fields, of which type, etc.

Just today we (Sematext) had a client with a rather large Solr index and
memory issues.  After about an hour of troubleshooting and looking at Solr
metrics in SPM we correlated high GC, old gen mem pool hitting 100%, a few
other things, and warmup queries that ended up throwing a node into
OOMland.  It turned out warmup queries using very non-selective filter
queries were creating massive entries in the filter cache.  In such
situations I wouldn't want a script to assume what Xmx is needed.  I'd want
to set this value based on all the information I have at my disposal.
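
For instance (a sketch only; the heap size and GC flags below are made up and
would need to be sized for the actual index and query mix), one might start the
example Jetty with explicit values instead:

  java -Xms4g -Xmx4g \
       -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled \
       -verbose:gc -Xloggc:logs/gc.log \
       -jar start.jar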

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Nov 12, 2013 at 12:59 PM, Scott Stults <
sstu...@opensourceconnections.com> wrote:

> We've been using a slightly older version of this script to start Solr in
> server environments:
>
> https://github.com/apache/cassandra/blob/trunk/conf/cassandra-env.sh
>
> The thing I especially like about it is its ability to dynamically cap
> memory usage, and the garbage collection log section is a great reference
> when we need to check gc times.
>
> My question is, does anyone else use a script like this to configure the
> JVM for Solr? Would it be useful to have this as a reference in
> solr/example/etc?
>
>
> Thanks!
> -Scott
>


Re: field collapsing performance in sharded environment

2013-11-19 Thread Otis Gospodnetic
Have a look at https://issues.apache.org/jira/browse/SOLR-5027 +
https://wiki.apache.org/solr/CollapsingQParserPlugin
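
As a sketch of what the SOLR-5027 approach looks like (available from Solr 4.6;
the field name below is made up), the group.* parameters are replaced by a
single post-filter, which tends to be much cheaper in distributed setups:

  fq={!collapse field=near_dup_signature}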

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Nov 13, 2013 at 2:46 PM, David Anthony Troiano <
dtroi...@basistech.com> wrote:

> Hello,
>
> I'm hitting a performance issue when using field collapsing in a
> distributed Solr setup and I'm wondering if others have seen it and if
> anyone has an idea to work around it.
>
> I'm using field collapsing to deduplicate documents that have the same near
> duplicate hash value, and deduplicating at query time (as opposed to
> filtering at index time) is a requirement.  I have a sharded setup with 10
> cores (not SolrCloud), each having ~1000 documents. Of the 10k docs,
> most have a unique near duplicate hash value, so there are about 10k unique
> values for the field that I'm grouping on.  The grouping parameters that
> I'm using are:
>
> group=true
> group.field=
> group.main=true
>
> I'm attempting distributed queries (&shards=s1,s2,...,s10) where the only
> difference is the absence or presence of these three grouping parameters
> and I'm consistently seeing a marked difference in performance (as a
> representative data point, 200ms latency without grouping and 1600ms with
> grouping).  Interestingly, if I put all 10k docs on the same core and query
> that core independently with and without grouping, I don't see much of a
> latency difference, so the performance degradation seems to exist only in
> the sharded setup.
>
> Is there a known performance issue when field collapsing in a sharded setup
> (perhaps only manifests when the grouping field has many unique values), or
> have other people observed this?  Any ideas for a workaround?  Note that
> docs in my sharded setup can only have the same signature if they're in the
> same shard, so perhaps that can be used to boost perf, though I don't see
> an exposed way to do so.
>
> A follow-on question is whether we're likely to see the same issue if /
> when we move to SolrCloud.
>
> Thanks,
> Dave
>


Re: Option to enforce a majority quorum approach to accepting updates in SolrCloud?

2013-11-19 Thread Otis Gospodnetic
Btw. isn't the situation Timothy is describing what hinted handoff is all
about?

http://wiki.apache.org/cassandra/HintedHandoff
http://www.datastax.com/dev/blog/modern-hinted-handoff

Check this:
http://www.jroller.com/otis/entry/common_distributed_computing_routines

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Nov 19, 2013 at 1:58 PM, Mark Miller  wrote:

> Mostly a lot of other systems already offer these types of things, so they
> were hard not to think about while building :) Just hard to get back to a
> lot of those things, even though a lot of them are fairly low hanging
> fruit. Hardening takes the priority :(
>
> - Mark
>
> On Nov 19, 2013, at 12:42 PM, Timothy Potter  wrote:
>
> > Your thinking is always one step ahead of me! I'll file the JIRA
> >
> > Thanks.
> > Tim
> >
> >
> > On Tue, Nov 19, 2013 at 10:38 AM, Mark Miller 
> wrote:
> >
> >> Yeah, this is kind of like one of many little features that we have just
> >> not gotten to yet. I’ve always planned for a param that let’s you say
> how
> >> many replicas an update must be verified on before responding success.
> >> Seems to make sense to fail that type of request early if you notice
> there
> >> are not enough replicas up to satisfy the param to begin with.
> >>
> >> I don’t think there is a JIRA issue yet, fire away if you want.
> >>
> >> - Mark
> >>
> >> On Nov 19, 2013, at 12:14 PM, Timothy Potter 
> wrote:
> >>
> >>> I've been thinking about how SolrCloud deals with write-availability
> >> using
> >>> in-sync replica sets, in which writes will continue to be accepted so
> >> long
> >>> as there is at least one healthy node per shard.
> >>>
> >>> For a little background (and to verify my understanding of the process
> is
> >>> correct), SolrCloud only considers active/healthy replicas when
> >>> acknowledging a write. Specifically, when a shard leader accepts an
> >> update
> >>> request, it forwards the request to all active/healthy replicas and
> only
> >>> considers the write successful if all active/healthy replicas ack the
> >>> write. Any down / gone replicas are not considered and will sync up
> with
> >>> the leader when they come back online using peer sync or snapshot
> >>> replication. For instance, if a shard has 3 nodes, A, B, C with A being
> >> the
> >>> current leader, then writes to the shard will continue to succeed even
> >> if B
> >>> & C are down.
> >>>
> >>> The issue is that if a shard leader continues to accept updates even if
> >> it
> >>> loses all of its replicas, then we have acknowledged updates on only 1
> >>> node. If that node, call it A, then fails and one of the previous
> >> replicas,
> >>> call it B, comes back online before A does, then any writes that A
> >> accepted
> >>> while the other replicas were offline are at risk of being lost.
> >>>
> >>> SolrCloud does provide a safe-guard mechanism for this problem with the
> >>> leaderVoteWait setting, which puts any replicas that come back online
> >>> before node A into a temporary wait state. If A comes back online
> within
> >>> the wait period, then all is well as it will become the leader again
> and
> >> no
> >>> writes will be lost. As a side note, sys admins definitely need to be
> >> made
> >>> more aware of this situation as when I first encountered it in my
> >> cluster,
> >>> I had no idea what it meant.
> >>>
> >>> My question is whether we want to consider an approach where SolrCloud
> >> will
> >>> not accept writes unless there is a majority of replicas available to
> >>> accept the write? For my example, under this approach, we wouldn't
> accept
> >>> writes if both B&C failed, but would if only C did, leaving A & B
> online.
> >>> Admittedly, this lowers the write-availability of the system, so may be
> >>> something that should be tunable? Just wanted to put this out there as
> >>> something I've been thinking about lately ...
> >>>
> >>> Cheers,
> >>> Tim
> >>
> >>
>
>


Re: How to index X™ as &#8482; (HTML decimal entity)

2013-11-19 Thread Walter Underwood
Why do you want to do this? You can always do this transformation on the 
presentation side. Doing this on the search server could be a really bad idea.

wunder

On Nov 19, 2013, at 8:19 PM, "Jack Krupansky"  wrote:

> You could use an update processor to map non-ASCII codes to SGML entities. 
> You could code it as a JavaScript script and use the stateless script update 
> processor.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Developer
> Sent: Tuesday, November 19, 2013 5:46 PM
> To: solr-user@lucene.apache.org
> Subject: How to index X™ as &#8482; (HTML decimal entity)
> 
> I have data coming in to Solr as below.
> 
> X™ - Black
> 
> I need to store the HTML Entity (decimal) equivalent value (i.e. &#8482;)
> in Solr rather than storing the original value.
> 
> Is there a way to do this?
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-index-X-as-8482-HTML-decimal-entity-tp4102002.html
> Sent from the Solr - User mailing list archive at Nabble.com. 

--
Walter Underwood
wun...@wunderwood.org





Re: How to index X™ as &#8482; (HTML decimal entity)

2013-11-19 Thread Jack Krupansky
You could use an update processor to map non-ASCII codes to SGML entities. 
You could code it as a JavaScript script and use the stateless script update 
processor.
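
A rough sketch of that wiring, with a hypothetical chain name, script file and
field (solrconfig.xml plus the script it references):

  <updateRequestProcessorChain name="map-entities">
    <processor class="solr.StatelessScriptUpdateProcessorFactory">
      <str name="script">map-entities.js</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

  // map-entities.js (hypothetical): replace U+2122 with its decimal entity
  function processAdd(cmd) {
    var doc = cmd.solrDoc;
    var v = doc.getFieldValue("name");
    if (v != null) {
      doc.setField("name", String(v).replace(/\u2122/g, "&#8482;"));
    }
  }
  function processDelete(cmd) { }
  function processCommit(cmd) { }
  function processRollback(cmd) { }
  function processMergeIndexes(cmd) { }
  function finish() { }

The chain is then selected per request (or per handler) with
update.chain=map-entities.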


-- Jack Krupansky

-Original Message- 
From: Developer

Sent: Tuesday, November 19, 2013 5:46 PM
To: solr-user@lucene.apache.org
Subject: How to index X™ as &#8482; (HTML decimal entity)

I have data coming in to Solr as below.

X™ - Black

I need to store the HTML Entity (decimal) equivalent value (i.e. &#8482;)
in Solr rather than storing the original value.

Is there a way to do this?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-index-X-as-8482-HTML-decimal-entity-tp4102002.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Indexing different customer customized field values

2013-11-19 Thread Otis Gospodnetic
Very fuzzy idea here, and maybe there are better approaches I'm not
thinking of right now, but would working with dynamic fields whose names
include customer ID work for you here?

e.g.
global field: booklevel=valueX
customer-specific field for customer 007: booklevel_007=valueY

Your query could then include both fields or maybe you can play with
function queries like http://wiki.apache.org/solr/FunctionQuery#exists to
make queries behave the way you want them to behave in situations like the
one above.
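
A rough sketch of the query side (field names, customer ID and range values
are all made up): def() falls back to its second argument when the
per-customer field is missing, so customized documents use their override and
everything else uses the global value:

  fq={!frange l=3 u=5}def(booklevel_007,booklevel)
  sort=def(booklevel_007,booklevel) asc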

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Nov 19, 2013 at 5:27 PM, kchellappa wrote:

> In our application, we index educational resources and allow searching for
> them.
> We allow our customers to change some of the non-textual metadata
> associated
> with a resource (like booklevel, interestlevel etc) to serve their users
> better.
> So for each resource, in theory it could have different set of metadata
> values for each customer, but in reality may be 10 - 25% of our customers
> customize a small portion of the resources.
>
> Our current solution uses SQL Server to manage the customizations (the
> database is sharded for other reasons as well) and also uses SQL Server's
> Full Text index for search.
> We are replacing this with Solr.
>
> There are a few approaches we have thought about, but none of them seems ideal:
>
> a) Duplicate the entries in Solr.  Each resource would be replicated for
> each customer and there would be an index entry/customer.
> The number of index entries is a big concern even though the text field
> values are the same.
> (We have about 300K resources and about 50K customers and both will grow)
>
> b) Use a dedicated solr core for each customer.  This wouldn't be using
> resources efficiently and we would be duplicating textual components
> which doesn't change from customer to customer.
>
> c) Use a Global index that has the resources with default values and then
> use a separate index for each customer that contains resources that are
> customized
> This requires managing lot of small cores/indexes.  Also this would require
> merging results from multiple cores, so don't think this will work
>
> d) Use Solr to do the text search and do post processing to filter based on
> metadata externally -- as you can imagine, this has all the
> challenges associated with post processing (pagination support, etc)
>
> e) Use Advanced/Post filtering Solr support --- Even if we can figure out a
> reasonable way to cache the lookup for metadata values for each customer,
> not sure if this would be efficient
>
> Any other recommendations on solutions?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-different-customer-customized-field-values-tp4102000.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Zookeeper down question

2013-11-19 Thread Otis Gospodnetic
Garth,

Here is something else related to help push the upgrade further:

http://search-lucene.com/m/gUajqxuETB1/&subj=Re+SolrCloud+and+split+brain

Monitor your beast keeper: http://search-lucene.com/m/R9vEg2JmiR91

Otis


On Tue, Nov 19, 2013 at 5:56 PM, Garth Grimm <
garthgr...@averyranchconsulting.com> wrote:

> Thanks Mark and Tim.  My understanding has been upgraded.
>
> -Original Message-
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Tuesday, November 19, 2013 1:59 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Zookeeper down question
>
>
> On Nov 19, 2013, at 2:24 PM, Timothy Potter  wrote:
>
> > Good questions ... From my understanding, queries will work if Zk goes
> > down but writes do not work w/o Zookeeper. This works because the
> > clusterstate is cached on each node so Zookeeper doesn't participate
> > directly in queries and indexing requests. Solr has to decide not to
> > allow writes if it loses its connection to Zookeeper, which is a safe
> > guard mechanism. In other words, Solr assumes it's pretty safe to
> > allow reads if the cluster doesn't have a healthy coordinator, but
> chooses to not allow writes to be safe.
>
> Right - we currently stop accepting writes when Solr cannot talk to
> ZooKeeper - this is because we can no longer count on knowing about any
> changes to the cluster and no new leaders can be elected, etc. It gets
> tricky fast if you consider allowing updates without ZooKeeper connectivity
> for very long.
>
> >
> > If a Solr nodes goes down while ZK is not available, since Solr no
> > longer accepts writes, leader / replica doesn't really matter. I'd
> > venture to guess there is some failover logic built in when executing
> > distributing queries but I'm not as familiar with that part of the
> > code (I'll brush up on it though as I'm now curious as well).
>
> Right - query requests will fail over to other replicas - this is
> important in general because the cluster state a Solr instance has can be a
> bit stale - so a request might hit something that has gone down and another
> replica in the shard can be tried. We use the load balancing solrj client
> for these internal requests. CloudSolrServer handles failover for the user
> (or non internal) requests. Or you can use your own external load balancer.
>
> - Mark
>
> >
> > Cheers,
> > Tim
> >
> >
> > On Tue, Nov 19, 2013 at 11:58 AM, Garth Grimm <
> > garthgr...@averyranchconsulting.com> wrote:
> >
> >> Given a 4 solr node instance (i.e. 2 shards, 2 replicas per shard),
> >> and a standalone zookeeper.
> >>
> >> Correct me if any of my understanding is incorrect on the following:
> >> If ZK goes down, most normal operations will still function, since my
> >> understanding is that ZK isn't involved on a transaction by
> >> transaction basis for each of these.
> >> Document adds, updates, and deletes on existing collection will still
> >> work as expected.
> >> Queries will still get processed as expected.
> >> Is the above correct?
> >>
> >> But adding new collections, changing configs, etc., will all fail
> >> while ZK is down (or at least, place things in an inconsistent
> >> state?) Is that correct?
> >>
> >> If, while ZK is down, one of the 4 solr nodes also goes down, will
> >> all normal operations fail?  Will they all continue to succeed?  I.e.
> >> will each of the nodes realize which node is down and route indexing
> >> and query requests around them, or is that impossible while ZK is
> >> down?  Will some queries succeed (because they were lucky enough to
> >> get routed to the one replica on the one shard that is still
> >> functional) while other queries fail (they aren't so lucky and get
> >> routed to the one replica that is down on the one shard)?
> >>
> >> Thanks,
> >> Garth Grimm
> >>
> >>
> >>
>
>


{!cache=false} for regular queries?

2013-11-19 Thread Otis Gospodnetic
Hi,

We have the ability to turn off caching for filter queries -
http://wiki.apache.org/solr/CommonQueryParameters#Caching_of_filters

I didn't try it, but I think one can't turn off caching for regular
queries, a la:

q={!cache=false}

Is there a reason this could not be done?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


Re: Question about upgrading Solr and DocValues

2013-11-19 Thread Yago Riveiro
Another question:

Can someone confirm that I can upgrade from 4.5.1 to 4.6 in a safe and clean
way (without optimizes and all that stuff)?

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, November 20, 2013 at 12:16 AM, Yago Riveiro wrote:

> Shawn, 
> 
> This setup has big implications, and I think this problem, and how it can be
> overcome (all the process that you describe), is not described properly in
> either the wiki or the ref guide.
> 
> +1 to finding a way to upgrade without reindexing the data; I don't have
> enough space to do an optimize of 3T and the respective replicas (not to
> mention the time it would take).
> 
> -- 
> Yago Riveiro
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> 
> 
> On Wednesday, November 20, 2013 at 12:02 AM, Shawn Heisey wrote:
> 
> > On 11/19/2013 4:10 PM, yriveiro wrote:
> > > After the reading this link about DocValues and be pointed by Mark Miller 
> > > to
> > > raise the question on the mailing list, I have some questions about the
> > > codec implementation note:
> > > 
> > > "Note that only the default implementation is supported by future version 
> > > of
> > > Lucene: if you try an alternative format, you may need to switch back to 
> > > the
> > > default and rewrite your index (e.g. forceMerge) before upgrading."
> > > 
> > > My questions is about how I can do this, either the wiki or the ref guide
> > > don't explain how this process can be done.
> > > 
> > > I'm using the per-field DocValues formats, therefore I'm not using the
> > > default implementation, and this in some way this scare me, because I have
> > > in some way the possibility of make Solr updates compromised.
> > > 
> > 
> > 
> > The way I understand what you've been told is this:
> > 
> > Remove all docValuesFormat attributes from your schema. Restart/Reload 
> > and optimize (forceMerge) your index. At this point you should be able 
> > to upgrade Solr without any problems. Once you're upgraded, re-add the 
> > docValuesFormat attributes and optimize again.
> > 
> > Mark and other experts - is this correct?
> > 
> > I do fully understand that your index is HUGE, so optimizing it is not 
> > trivial.
> > 
> > IMHO upgrades should be possible with the disk-based format. Having very 
> > large indexes is the primary reason that people choose the disk-based 
> > format. These are the people who are least likely to be able to either 
> > reindex or run an optimize.
> > 
> > Thanks,
> > Shawn
> > 
> > 
> > 
> 
> 



Re: Question about upgrading Solr and DocValues

2013-11-19 Thread Yago Riveiro
Shawn, 

This setup has big implications, and I think this problem, and how it can be
overcome (all the process that you describe), is not described properly in
either the wiki or the ref guide.

+1 to finding a way to upgrade without reindexing the data; I don't have enough
space to do an optimize of 3T and the respective replicas (not to mention the
time it would take).

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, November 20, 2013 at 12:02 AM, Shawn Heisey wrote:

> On 11/19/2013 4:10 PM, yriveiro wrote:
> > After the reading this link about DocValues and be pointed by Mark Miller to
> > raise the question on the mailing list, I have some questions about the
> > codec implementation note:
> > 
> > "Note that only the default implementation is supported by future version of
> > Lucene: if you try an alternative format, you may need to switch back to the
> > default and rewrite your index (e.g. forceMerge) before upgrading."
> > 
> > My questions is about how I can do this, either the wiki or the ref guide
> > don't explain how this process can be done.
> > 
> > I'm using the per-field DocValues formats, therefore I'm not using the
> > default implementation, and this in some way this scare me, because I have
> > in some way the possibility of make Solr updates compromised.
> > 
> 
> 
> The way I understand what you've been told is this:
> 
> Remove all docValuesFormat attributes from your schema. Restart/Reload 
> and optimize (forceMerge) your index. At this point you should be able 
> to upgrade Solr without any problems. Once you're upgraded, re-add the 
> docValuesFormat attributes and optimize again.
> 
> Mark and other experts - is this correct?
> 
> I do fully understand that your index is HUGE, so optimizing it is not 
> trivial.
> 
> IMHO upgrades should be possible with the disk-based format. Having very 
> large indexes is the primary reason that people choose the disk-based 
> format. These are the people who are least likely to be able to either 
> reindex or run an optimize.
> 
> Thanks,
> Shawn
> 
> 




Re: [ANNOUNCE] Apache Solr Reference Guide 4.5 Available

2013-11-19 Thread Uwe Reh

Thank you for opening the issue.

I'm not sure that my case is representative. I spend three hours every day
on the train (commuting to work). I like to use this time to have a closer
look at manuals. Printouts and laptops are horrible in this situation, so
the only choice is between my 10" tablet and my 6" e-reader. I prefer the
handier reader.

No, I can't afford a nice Nexus7. Not now ;-)

Uwe


On 19.11.2013 at 17:08, Cassandra Targett wrote:

I've often thought of possibly providing the reference guide in .epub
format, but wasn't sure of general interest. I also once tried to
convert the PDF version with calibre and it was a total mess - but
PDF is probably the least-flexible starting point for conversion.

Unfortunately, the Word export is only available on a per-page basis,
which would make it really tedious to try to make a .doc version of
the entire guide (there are ~150 pages). There are, however, options
for HTML export, which I believe could be converted to .epub - but
might take some fiddling.

I created an issue for this - for now just to track that it's
something that might be of interest - but not sure if/when I'd
personally be able to work on it:
https://issues.apache.org/jira/browse/SOLR-5467.

On Tue, Nov 19, 2013 at 6:34 AM, Uwe Reh  wrote:

On 18.11.2013 at 14:39, Furkan KAMACI wrote:


Atlassian Jira has two options by default: exporting to PDF and exporting
to Word.



I see, 'Word' isn't optimal for a reference guide. But OO can handle 'doc'
and has epub plugins.
Would it be possible to offer the documentation also as 'doc(x)'?

barefaced
Uwe





Re: Question about upgrading Solr and DocValues

2013-11-19 Thread Shawn Heisey

On 11/19/2013 4:10 PM, yriveiro wrote:

After the reading this link about DocValues and be pointed by Mark Miller to
raise the question on the mailing list, I have some questions about the
codec implementation note:

"Note that only the default implementation is supported by future version of
Lucene: if you try an alternative format, you may need to switch back to the
default and rewrite your index (e.g. forceMerge) before upgrading."

My questions is about how I can do this, either the wiki or the ref guide
don't explain how this process can be done.

I'm using the per-field DocValues formats, therefore I'm not using the
default implementation, and this in some way this scare me, because I have
in some way the possibility of make Solr updates compromised.


The way I understand what you've been told is this:

Remove all docValuesFormat attributes from your schema. Restart/Reload 
and optimize (forceMerge) your index.  At this point you should be able 
to upgrade Solr without any problems. Once you're upgraded, re-add the 
docValuesFormat attributes and optimize again.
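
For illustration, the attribute in question looks roughly like this (the field
is made up; "Disk" was one of the non-default per-field formats in 4.x):

  <!-- non-default, per-field docValues format: -->
  <field name="price" type="tdouble" indexed="true" stored="true"
         docValues="true" docValuesFormat="Disk"/>

  <!-- default codec: drop docValuesFormat, reload, then optimize before upgrading -->
  <field name="price" type="tdouble" indexed="true" stored="true" docValues="true"/>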


Mark and other experts - is this correct?

I do fully understand that your index is HUGE, so optimizing it is not 
trivial.


IMHO upgrades should be possible with the disk-based format. Having very 
large indexes is the primary reason that people choose the disk-based 
format.  These are the people who are least likely to be able to either 
reindex or run an optimize.


Thanks,
Shawn



Re: Unable to update dynamic _coordinate field

2013-11-19 Thread jfeist
Ah, I see where I went wrong.  I didn't define that dynamic field, it was in
the Solr default schema.xml file.  I thought that adding a dynamic field
called *_coordinate would basically do the same thing for latitudes and
longitudes as adding a dynamic field like *_i does for integers, i.e. index
it as lat and long.  But yeah, I see that field exists so the location
fieldType can store the two tdouble values comprising the lat and long. 
Thanks for the explanation.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unable-to-update-dynamic-coordinate-field-tp4101992p4102008.html
Sent from the Solr - User mailing list archive at Nabble.com.


Question about upgrading Solr and DocValues

2013-11-19 Thread yriveiro
Hi,

After reading this link about DocValues, and being pointed by Mark Miller to
raise the question on the mailing list, I have some questions about the
codec implementation note:

"Note that only the default implementation is supported by future version of
Lucene: if you try an alternative format, you may need to switch back to the
default and rewrite your index (e.g. forceMerge) before upgrading."

My question is about how I can do this; neither the wiki nor the ref guide
explains how this process can be done.

I'm using the per-field DocValues formats, therefore I'm not using the
default implementation, and this scares me in some way, because it could
leave Solr upgrades compromised.

/Yago



-
Best regards
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Question-about-upgrading-Sorl-and-DocValues-tp4102007.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Zookeeper down question

2013-11-19 Thread Garth Grimm
Thanks Mark and Tim.  My understanding has been upgraded.

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Tuesday, November 19, 2013 1:59 PM
To: solr-user@lucene.apache.org
Subject: Re: Zookeeper down question


On Nov 19, 2013, at 2:24 PM, Timothy Potter  wrote:

> Good questions ... From my understanding, queries will work if Zk goes 
> down but writes do not work w/o Zookeeper. This works because the 
> clusterstate is cached on each node so Zookeeper doesn't participate 
> directly in queries and indexing requests. Solr has to decide not to 
> allow writes if it loses its connection to Zookeeper, which is a safe 
> guard mechanism. In other words, Solr assumes it's pretty safe to 
> allow reads if the cluster doesn't have a healthy coordinator, but chooses to 
> not allow writes to be safe.

Right - we currently stop accepting writes when Solr cannot talk to ZooKeeper - 
this is because we can no longer count on knowing about any changes to the 
cluster and no new leaders can be elected, etc. It gets tricky fast if you 
consider allowing updates without ZooKeeper connectivity for very long.

> 
> If a Solr nodes goes down while ZK is not available, since Solr no 
> longer accepts writes, leader / replica doesn't really matter. I'd 
> venture to guess there is some failover logic built in when executing 
> distributing queries but I'm not as familiar with that part of the 
> code (I'll brush up on it though as I'm now curious as well).

Right - query requests will fail over to other replicas - this is important in 
general because the cluster state a Solr instance has can be a bit stale - so a 
request might hit something that has gone down and another replica in the shard 
can be tried. We use the load balancing solrj client for these internal 
requests. CloudSolrServer handles failover for the user (or non internal) 
requests. Or you can use your own external load balancer.

- Mark

> 
> Cheers,
> Tim
> 
> 
> On Tue, Nov 19, 2013 at 11:58 AM, Garth Grimm < 
> garthgr...@averyranchconsulting.com> wrote:
> 
>> Given a 4 solr node instance (i.e. 2 shards, 2 replicas per shard), 
>> and a standalone zookeeper.
>> 
>> Correct me if any of my understanding is incorrect on the following:
>> If ZK goes down, most normal operations will still function, since my 
>> understanding is that ZK isn't involved on a transaction by 
>> transaction basis for each of these.
>> Document adds, updates, and deletes on existing collection will still 
>> work as expected.
>> Queries will still get processed as expected.
>> Is the above correct?
>> 
>> But adding new collections, changing configs, etc., will all fail 
>> while ZK is down (or at least, place things in an inconsistent 
>> state?) Is that correct?
>> 
>> If, while ZK is down, one of the 4 solr nodes also goes down, will 
>> all normal operations fail?  Will they all continue to succeed?  I.e. 
>> will each of the nodes realize which node is down and route indexing 
>> and query requests around them, or is that impossible while ZK is 
>> down?  Will some queries succeed (because they were lucky enough to 
>> get routed to the one replica on the one shard that is still 
>> functional) while other queries fail (they aren't so lucky and get 
>> routed to the one replica that is down on the one shard)?
>> 
>> Thanks,
>> Garth Grimm
>> 
>> 
>> 



How to index X™ as &#8482; (HTML decimal entity)

2013-11-19 Thread Developer
I have data coming in to Solr as below.

X™ - Black 

I need to store the HTML Entity (decimal) equivalent value (i.e. &#8482;)
in Solr rather than storing the original value.

Is there a way to do this?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-index-X-as-8482-HTML-decimal-entity-tp4102002.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Unable to update dynamic _coordinate field

2013-11-19 Thread Chris Hostetter

I'm not entirely sure I understand what you are *trying* to do, but what
you are currently doing is:

1) defining a dynamic field of type "tdouble"
2) indexing a doc with a value that cannot be parsed as a double into a
field that uses this dynamic field

Forget that you've named it "coordinate", forget that it's a dynamic field
-- you've told Solr that "job_coordinate" must be a double precision
float (ie: "1.2", "89765.12345", "-56", etc...) and then you are trying to
shove a string containing a comma character into it.
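
For context, the relevant pieces of the default example schema look roughly
like the following; the *_coordinate dynamic field only exists to hold the
tdouble sub-fields that the "location" type creates, so the field you set from
SolrJ should itself be of the location type (the "job_location" name below is
made up):

  <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
  <dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
  <field name="job_location" type="location" indexed="true" stored="true"/>

With that in place, doc.setField("job_location", "40.7143,-74.006") works, and
Solr itself splits the pair into the job_location_0_coordinate and
job_location_1_coordinate sub-fields.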


: I'm using solrj and attempting to index some latitude and longitudes.  My
: schema.xml has this dynamicField definition:
: 
: 
: When I attempt to update a document like so
: doc.setField("job_coordinate","40.7143,-74.006");
: 
: I get the following error:
: Exception in thread "main"
: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR:
: [doc=1] Error adding field 'job_coordinate'='40.7143,-74.006' msg=For input
: string: "40.7143,-74.006"
: 
: However, if I change the field to a location field that is defined in my
: schema.xml, like "store", it works fine.  Does anyone know how to add/update
: a dynamic coordinate field to documents?  I'm using Solr 4.4.0.



-Hoss


Indexing different customer customized field values

2013-11-19 Thread kchellappa
In our application, we index educational resources and allow searching for
them.
We allow our customers to change some of the non-textual metadata associated
with a resource (like booklevel, interestlevel etc) to serve their users
better.
So for each resource, in theory it could have different set of metadata
values for each customer, but in reality may be 10 - 25% of our customers
customize a small portion of the resources.

Our current solution uses SQL Server to manage the customizations (the
database is sharded for other reasons as well) and also uses SQL Server's
Full Text index for search.
We are replacing this with Solr.

There are a few approaches we have thought about, but none of them seems ideal:

a) Duplicate the entries in Solr.  Each resource would be replicated for
each customer and there would be an index entry/customer.  
The number of index entries is a big concern even though the text field
values are the same.  
(We have about 300K resources and about 50K customers and both will grow)

b) Use a dedicated solr core for each customer.  This wouldn't be using
resources efficiently and we would be duplicating textual components 
which doesn't change from customer to customer.

c) Use a Global index that has the resources with default values and then
use a separate index for each customer that contains resources that are
customized
This requires managing lot of small cores/indexes.  Also this would require
merging results from multiple cores, so don't think this will work

d) Use Solr to do the text search and do post processing to filter based on
metadata externally -- as you can imagine, this has all the
challenges associated with post processing (pagination support, etc)

e) Use Advanced/Post filtering Solr support --- Even if we can figure out a
reasonable way to cache the lookup for metadata values for each customer, 
not sure if this would be efficient

Any other recommendations on solutions?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-different-customer-customized-field-values-tp4102000.html
Sent from the Solr - User mailing list archive at Nabble.com.


Unable to update dynamic _coordinate field

2013-11-19 Thread jfeist
I'm using SolrJ and attempting to index some latitudes and longitudes.  My
schema.xml has this dynamicField definition:


When I attempt to update a document like so
doc.setField("job_coordinate","40.7143,-74.006");

I get the following error:
Exception in thread "main"
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR:
[doc=1] Error adding field 'job_coordinate'='40.7143,-74.006' msg=For input
string: "40.7143,-74.006"

However, if I change the field to a location field that is defined in my
schema.xml, like "store", it works fine.  Does anyone know how to add/update
a dynamic coordinate field to documents?  I'm using Solr 4.4.0.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unable-to-update-dynamic-coordinate-field-tp4101992.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: What is the difference between "attorney:(Roger Miller)" and "attorney:Roger Miller"

2013-11-19 Thread Rafał Kuć
Hello!

Terms surrounded by " characters will be treated as a phrase query. So,
if your default query operator is OR, attorney:(Roger Miller) will
return documents with the first or second (or both) terms in the
attorney field. attorney:"Roger Miller" will return only those
documents that have the phrase Roger Miller in the attorney field.

You may want to look at Lucene query syntax to understand all the
differences: 
http://lucene.apache.org/core/4_5_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description
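
Roughly, the three forms parse as follows (assuming the default operator is OR,
a default search field named something like "text", and a typical lowercasing
analyzer):

  attorney:(Roger Miller)   ->  attorney:roger attorney:miller   (either term, both against attorney)
  attorney:Roger Miller     ->  attorney:roger text:miller       (second term hits the default field)
  attorney:"Roger Miller"   ->  attorney:"roger miller"          (phrase match in attorney)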

-- 
Regards,
 Rafał Kuć
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


> Also, attorney:(Roger Miller) is same as attorney:"Roger Miller" right? Or
> the term "Roger Miller" is run against attorney?

> Thanks,
> -Utkarsh


> On Tue, Nov 19, 2013 at 12:42 PM, Rafał Kuć  wrote:

>> Hello!
>>
>> In the first one, the two terms 'Roger' and 'Miller' are run against
>> the attorney field. In the second the 'Roger' term is run against the
>> attorney field and the 'Miller' term is run against the default search
>> field.
>>
>> --
>> Regards,
>>  Rafał Kuć
>> Performance Monitoring * Log Analytics * Search Analytics
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>> > We got different results for these two queries. The first one returned
>> 115
>> > records and the second returns 179 records.
>>
>> > Thanks,
>>
>> > Fudong
>>
>>




Re: What is the difference between "attorney:(Roger Miller)" and "attorney:Roger Miller"

2013-11-19 Thread Utkarsh Sengar
Also, attorney:(Roger Miller) is the same as attorney:"Roger Miller", right? Or
is the term "Roger Miller" run against attorney?

Thanks,
-Utkarsh


On Tue, Nov 19, 2013 at 12:42 PM, Rafał Kuć  wrote:

> Hello!
>
> In the first one, the two terms 'Roger' and 'Miller' are run against
> the attorney field. In the second the 'Roger' term is run against the
> attorney field and the 'Miller' term is run against the default search
> field.
>
> --
> Regards,
>  Rafał Kuć
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> > We got different results for these two queries. The first one returned
> 115
> > records and the second returns 179 records.
>
> > Thanks,
>
> > Fudong
>
>


-- 
Thanks,
-Utkarsh


Re: What is the difference between "attorney:(Roger Miller)" and "attorney:Roger Miller"

2013-11-19 Thread Rafał Kuć
Hello!

In the first one, the two terms 'Roger' and 'Miller' are run against
the attorney field. In the second the 'Roger' term is run against the
attorney field and the 'Miller' term is run against the default search
field.

-- 
Regards,
 Rafał Kuć
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


> We got different results for these two queries. The first one returned 115
> records and the second returns 179 records.

> Thanks,

> Fudong



What is the difference between "attorney:(Roger Miller)" and "attorney:Roger Miller"

2013-11-19 Thread fudong li
We got different results for these two queries. The first one returned 115
records and the second returned 179 records.

Thanks,

Fudong


Re: Zookeeper down question

2013-11-19 Thread Mark Miller

On Nov 19, 2013, at 2:24 PM, Timothy Potter  wrote:

> Good questions ... From my understanding, queries will work if Zk goes down
> but writes do not work w/o Zookeeper. This works because the clusterstate
> is cached on each node so Zookeeper doesn't participate directly in queries
> and indexing requests. Solr has to decide not to allow writes if it loses
> its connection to Zookeeper, which is a safe guard mechanism. In other
> words, Solr assumes it's pretty safe to allow reads if the cluster doesn't
> have a healthy coordinator, but chooses to not allow writes to be safe.

Right - we currently stop accepting writes when Solr cannot talk to ZooKeeper - 
this is because we can no longer count on knowing about any changes to the 
cluster and no new leaders can be elected, etc. It gets tricky fast if you 
consider allowing updates without ZooKeeper connectivity for very long.

> 
> If a Solr nodes goes down while ZK is not available, since Solr no longer
> accepts writes, leader / replica doesn't really matter. I'd venture to
> guess there is some failover logic built in when executing distributing
> queries but I'm not as familiar with that part of the code (I'll brush up
> on it though as I'm now curious as well).

Right - query requests will fail over to other replicas - this is important in 
general because the cluster state a Solr instance has can be a bit stale - so a 
request might hit something that has gone down and another replica in the shard 
can be tried. We use the load balancing solrj client for these internal 
requests. CloudSolrServer handles failover for the user (or non internal) 
requests. Or you can use your own external load balancer.

- Mark

> 
> Cheers,
> Tim
> 
> 
> On Tue, Nov 19, 2013 at 11:58 AM, Garth Grimm <
> garthgr...@averyranchconsulting.com> wrote:
> 
>> Given a 4 solr node instance (i.e. 2 shards, 2 replicas per shard), and a
>> standalone zookeeper.
>> 
>> Correct me if any of my understanding is incorrect on the following:
>> If ZK goes down, most normal operations will still function, since my
>> understanding is that ZK isn't involved on a transaction by transaction
>> basis for each of these.
>> Document adds, updates, and deletes on existing collection will still work
>> as expected.
>> Queries will still get processed as expected.
>> Is the above correct?
>> 
>> But adding new collections, changing configs, etc., will all fail while ZK
>> is down (or at least, place things in an inconsistent state?)
>> Is that correct?
>> 
>> If, while ZK is down, one of the 4 solr nodes also goes down, will all
>> normal operations fail?  Will they all continue to succeed?  I.e. will each
>> of the nodes realize which node is down and route indexing and query
>> requests around them, or is that impossible while ZK is down?  Will some
>> queries succeed (because they were lucky enough to get routed to the one
>> replica on the one shard that is still functional) while other queries fail
>> (they aren't so lucky and get routed to the one replica that is down on the
>> one shard)?
>> 
>> Thanks,
>> Garth Grimm
>> 
>> 
>> 



Split shard and stream sub-shards to remote nodes?

2013-11-19 Thread Otis Gospodnetic
Hi,

Is it possible to perform a shard split and stream data for the
new/sub-shards to remote nodes, avoiding persistence of new/sub-shards
on the local/source node first?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


Re: Zookeeper down question

2013-11-19 Thread Timothy Potter
Good questions ... From my understanding, queries will work if Zk goes down
but writes do not work w/o Zookeeper. This works because the clusterstate
is cached on each node so Zookeeper doesn't participate directly in queries
and indexing requests. Solr has to decide not to allow writes if it loses
its connection to Zookeeper, which is a safe guard mechanism. In other
words, Solr assumes it's pretty safe to allow reads if the cluster doesn't
have a healthy coordinator, but chooses to not allow writes to be safe.

If a Solr nodes goes down while ZK is not available, since Solr no longer
accepts writes, leader / replica doesn't really matter. I'd venture to
guess there is some failover logic built in when executing distributing
queries but I'm not as familiar with that part of the code (I'll brush up
on it though as I'm now curious as well).

Cheers,
Tim


On Tue, Nov 19, 2013 at 11:58 AM, Garth Grimm <
garthgr...@averyranchconsulting.com> wrote:

> Given a 4 solr node instance (i.e. 2 shards, 2 replicas per shard), and a
> standalone zookeeper.
>
> Correct me if any of my understanding is incorrect on the following:
> If ZK goes down, most normal operations will still function, since my
> understanding is that ZK isn't involved on a transaction by transaction
> basis for each of these.
> Document adds, updates, and deletes on existing collection will still work
> as expected.
> Queries will still get processed as expected.
> Is the above correct?
>
> But adding new collections, changing configs, etc., will all fail while ZK
> is down (or at least, place things in an inconsistent state?)
> Is that correct?
>
> If, while ZK is down, one of the 4 solr nodes also goes down, will all
> normal operations fail?  Will they all continue to succeed?  I.e. will each
> of the nodes realize which node is down and route indexing and query
> requests around them, or is that impossible while ZK is down?  Will some
> queries succeed (because they were lucky enough to get routed to the one
> replica on the one shard that is still functional) while other queries fail
> (they aren't so lucky and get routed to the one replica that is down on the
> one shard)?
>
> Thanks,
> Garth Grimm
>
>
>


Re: Option to enforce a majority quorum approach to accepting updates in SolrCloud?

2013-11-19 Thread Timothy Potter
I got to thinking about this particular question while watching this
presentation, which is well worth 45 minutes if you can spare it:

http://www.infoq.com/presentations/partitioning-comparison

I created SOLR-5468 for this.


On Tue, Nov 19, 2013 at 11:58 AM, Mark Miller  wrote:

> Mostly a lot of other systems already offer these types of things, so they
> were hard not to think about while building :) Just hard to get back to a
> lot of those things, even though a lot of them are fairly low hanging
> fruit. Hardening takes the priority :(
>
> - Mark
>
> On Nov 19, 2013, at 12:42 PM, Timothy Potter  wrote:
>
> > Your thinking is always one step ahead of me! I'll file the JIRA
> >
> > Thanks.
> > Tim
> >
> >
> > On Tue, Nov 19, 2013 at 10:38 AM, Mark Miller 
> wrote:
> >
> >> Yeah, this is kind of like one of many little features that we have just
> >> not gotten to yet. I’ve always planned for a param that let’s you say
> how
> >> many replicas an update must be verified on before responding success.
> >> Seems to make sense to fail that type of request early if you notice
> there
> >> are not enough replicas up to satisfy the param to begin with.
> >>
> >> I don’t think there is a JIRA issue yet, fire away if you want.
> >>
> >> - Mark
> >>
> >> On Nov 19, 2013, at 12:14 PM, Timothy Potter 
> wrote:
> >>
> >>> I've been thinking about how SolrCloud deals with write-availability
> >> using
> >>> in-sync replica sets, in which writes will continue to be accepted so
> >> long
> >>> as there is at least one healthy node per shard.
> >>>
> >>> For a little background (and to verify my understanding of the process
> is
> >>> correct), SolrCloud only considers active/healthy replicas when
> >>> acknowledging a write. Specifically, when a shard leader accepts an
> >> update
> >>> request, it forwards the request to all active/healthy replicas and
> only
> >>> considers the write successful if all active/healthy replicas ack the
> >>> write. Any down / gone replicas are not considered and will sync up
> with
> >>> the leader when they come back online using peer sync or snapshot
> >>> replication. For instance, if a shard has 3 nodes, A, B, C with A being
> >> the
> >>> current leader, then writes to the shard will continue to succeed even
> >> if B
> >>> & C are down.
> >>>
> >>> The issue is that if a shard leader continues to accept updates even if
> >> it
> >>> loses all of its replicas, then we have acknowledged updates on only 1
> >>> node. If that node, call it A, then fails and one of the previous
> >> replicas,
> >>> call it B, comes back online before A does, then any writes that A
> >> accepted
> >>> while the other replicas were offline are at risk to being lost.
> >>>
> >>> SolrCloud does provide a safe-guard mechanism for this problem with the
> >>> leaderVoteWait setting, which puts any replicas that come back online
> >>> before node A into a temporary wait state. If A comes back online
> within
> >>> the wait period, then all is well as it will become the leader again
> and
> >> no
> >>> writes will be lost. As a side note, sys admins definitely need to be
> >> made
> >>> more aware of this situation as when I first encountered it in my
> >> cluster,
> >>> I had no idea what it meant.
> >>>
> >>> My question is whether we want to consider an approach where SolrCloud
> >> will
> >>> not accept writes unless there is a majority of replicas available to
> >>> accept the write? For my example, under this approach, we wouldn't
> accept
> >>> writes if both B&C failed, but would if only C did, leaving A & B
> online.
> >>> Admittedly, this lowers the write-availability of the system, so may be
> >>> something that should be tunable? Just wanted to put this out there as
> >>> something I've been thinking about lately ...
> >>>
> >>> Cheers,
> >>> Tim
> >>
> >>
>
>


hung solr instance behavior

2013-11-19 Thread Garth Grimm
Given a 4 node Solr Cloud (i.e. 2 shards, 2 replicas per shard).

Let's say one node becomes 'nonresponsive', meaning sockets get created, but
transactions to them don't get handled (i.e. they time out).  We'll also assume
that means the Solr instance can't send information out to ZooKeeper or other
Solr instances.

Does ZK become aware of the issue at all?
Do normal indexing operations fail (I would assume so based on a timeout, but 
just checking)?
What would happen with query requests (let's assume the requests aren't sent 
directly to the 'hung' instance).  Do some queries succeed, but others fail 
(i.e. timeout) based upon whether the node in the shard asked to handle the 
query is the 'hung' one or not?  Is there an automatic timeout functionality 
where all queries will still succeed, but some will be much slower (i.e. if the 
'hung' one is asked to handle it, there'll be a timeout and then the other core 
on the shard will be asked to handle it)?

Thanks,
Garth


Zookeeper down question

2013-11-19 Thread Garth Grimm
Given a 4 solr node instance (i.e. 2 shards, 2 replicas per shard), and a 
standalone zookeeper.

Correct me if any of my understanding is incorrect on the following:
If ZK goes down, most normal operations will still function, since my 
understanding is that ZK isn't involved on a transaction by transaction basis 
for each of these.
Document adds, updates, and deletes on existing collections will still work as 
expected.
Queries will still get processed as expected.
Is the above correct?

But adding new collections, changing configs, etc., will all fail while ZK is 
down (or at least, place things in an inconsistent state?)
Is that correct?

If, while ZK is down, one of the 4 solr nodes also goes down, will all normal 
operations fail?  Will they all continue to succeed?  I.e. will each of the 
nodes realize which node is down and route indexing and query requests around 
them, or is that impossible while ZK is down?  Will some queries succeed 
(because they were lucky enough to get routed to the one replica on the one 
shard that is still functional) while other queries fail (they aren't so lucky 
and get routed to the one replica that is down on the one shard)?

Thanks,
Garth Grimm




Re: Option to enforce a majority quorum approach to accepting updates in SolrCloud?

2013-11-19 Thread Mark Miller
Mostly a lot of other systems already offer these types of things, so they were 
hard not to think about while building :) Just hard to get back to a lot of 
those things, even though a lot of them are fairly low hanging fruit. Hardening 
takes the priority :(

- Mark

On Nov 19, 2013, at 12:42 PM, Timothy Potter  wrote:

> You're thinking is always one-step ahead of me! I'll file the JIRA
> 
> Thanks.
> Tim
> 
> 
> On Tue, Nov 19, 2013 at 10:38 AM, Mark Miller  wrote:
> 
>> Yeah, this is kind of like one of many little features that we have just
>> not gotten to yet. I’ve always planned for a param that let’s you say how
>> many replicas an update must be verified on before responding success.
>> Seems to make sense to fail that type of request early if you notice there
>> are not enough replicas up to satisfy the param to begin with.
>> 
>> I don’t think there is a JIRA issue yet, fire away if you want.
>> 
>> - Mark
>> 
>> On Nov 19, 2013, at 12:14 PM, Timothy Potter  wrote:
>> 
>>> I've been thinking about how SolrCloud deals with write-availability
>> using
>>> in-sync replica sets, in which writes will continue to be accepted so
>> long
>>> as there is at least one healthy node per shard.
>>> 
>>> For a little background (and to verify my understanding of the process is
>>> correct), SolrCloud only considers active/healthy replicas when
>>> acknowledging a write. Specifically, when a shard leader accepts an
>> update
>>> request, it forwards the request to all active/healthy replicas and only
>>> considers the write successful if all active/healthy replicas ack the
>>> write. Any down / gone replicas are not considered and will sync up with
>>> the leader when they come back online using peer sync or snapshot
>>> replication. For instance, if a shard has 3 nodes, A, B, C with A being
>> the
>>> current leader, then writes to the shard will continue to succeed even
>> if B
>>> & C are down.
>>> 
>>> The issue is that if a shard leader continues to accept updates even if
>> it
>>> loses all of its replicas, then we have acknowledged updates on only 1
>>> node. If that node, call it A, then fails and one of the previous
>> replicas,
>>> call it B, comes back online before A does, then any writes that A
>> accepted
>>> while the other replicas were offline are at risk to being lost.
>>> 
>>> SolrCloud does provide a safe-guard mechanism for this problem with the
>>> leaderVoteWait setting, which puts any replicas that come back online
>>> before node A into a temporary wait state. If A comes back online within
>>> the wait period, then all is well as it will become the leader again and
>> no
>>> writes will be lost. As a side note, sys admins definitely need to be
>> made
>>> more aware of this situation as when I first encountered it in my
>> cluster,
>>> I had no idea what it meant.
>>> 
>>> My question is whether we want to consider an approach where SolrCloud
>> will
>>> not accept writes unless there is a majority of replicas available to
>>> accept the write? For my example, under this approach, we wouldn't accept
>>> writes if both B&C failed, but would if only C did, leaving A & B online.
>>> Admittedly, this lowers the write-availability of the system, so may be
>>> something that should be tunable? Just wanted to put this out there as
>>> something I've been thinking about lately ...
>>> 
>>> Cheers,
>>> Tim
>> 
>> 



Re: External File field Reload option

2013-11-19 Thread Mikhail Khludnev
Aditya,

If you commit, docnums are supposed to be changed, hence the file should be
reloaded.

There might be a few alternative approaches to address this problem, but they
are really bloody hacks, you know.

Hold on: if docs are pushed every few hours, but the file only changes daily,
can't you merge those boosting values from the files into the documents
themselves? It might force you to index more, and/or slightly delay the
boosting update, but it's the only easy win here.
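Just as a sketch of what I mean (the "id"/"rank" field names, the key=value
file layout and the class name are assumptions based on your description):
load the daily rank file on the indexing side and copy the value onto each
document as a regular field, then boost on that field instead of the eff.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.solr.common.SolrInputDocument;

public class RankFieldMixer {
  // Load the daily key=value rank file (same layout as the external file field).
  public static Map<String, Float> loadRanks(File f) throws IOException {
    Map<String, Float> ranks = new HashMap<String, Float>();
    BufferedReader r = new BufferedReader(new FileReader(f));
    String line;
    while ((line = r.readLine()) != null) {
      int eq = line.indexOf('=');
      if (eq > 0) {
        ranks.put(line.substring(0, eq), Float.valueOf(line.substring(eq + 1).trim()));
      }
    }
    r.close();
    return ranks;
  }

  // Copy the rank onto the document before it is sent to Solr.
  public static void applyRank(SolrInputDocument doc, Map<String, Float> ranks) {
    Float rank = ranks.get((String) doc.getFieldValue("id"));
    doc.setField("rank", rank != null ? rank : Float.valueOf(0.0f));
  }
}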


On Tue, Nov 19, 2013 at 6:55 PM, adityab  wrote:

> Hi,
> I have been using external file field (eff) for holding rank of the
> document
> which gets updated every day based on different stats collected by the
> system. Once the rank is computed the new files are pushed to Master which
> will eventually replicate to slaves on next commit.
>
> Our eff file has around 1.6M lines a simple key value pare. Its roughly
> about 16MB. Its been observed that loading this file at first takes around
> 192 sec. I agree this can be done at the start of the server and should not
> impact the performance while serving traffic. (We have 10 such fields, file
> per zone).
>
> Now documents are pushed to Master every 2 hrs in batches. Eff is just
> pushed once a day. As we apply commit every 2hrs, On slaves when new reader
> is opened after replication it takes a long time to warmup because it has
> to
> load the eff file again.
>
> Curious to know if the file has not changed and resides outside index, is
> there a way in solar to check if the eff file is actually modified before
> trying to reload it?
>
> Any other suggestions?
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/External-File-field-Reload-option-tp4101929.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: Question regarding possibility of data loss

2013-11-19 Thread Mark Miller
I’d recommend you start with the upcoming 4.6 release. Should be out this week 
or next.

- Mark

On Nov 19, 2013, at 8:18 AM, adfel70  wrote:

> Hi, we plan to establish an ensemble of solr with zookeeper. 
> We gonna have 6 solr servers with 2 instances on each server, also we'll
> have 6 shards with replication factor 2, in addition we'll have 3
> zookeepers. 
> 
> Our concern is that we will send documents to index and solr won't index
> them but won't send any error message and we will suffer a data loss
> 
> 1. Is there any situation that can cause this kind of problem? 
> 2. Can it happen if some of ZKs are down? or some of the solr instances? 
> 3. How can we monitor them? Can we do something to prevent these kind of
> errors? 
> 
> Thanks in advance 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Question-regarding-possibility-of-data-loss-tp4101915.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Question regarding possibility of data loss

2013-11-19 Thread Daniel Collins
Regarding data loss, Solr returns an error code to the calling app (either
HTTP error code, or equivalent in SolrJ), so if it fails to index for a
known reason, you'll know about it.

There are always edge cases though.

If Solr indexes the document (returns success), that means the document is
in the transaction log (and should be in the log for each replica).
If someone pulls the plug on the machines and the hard drives crash, then
the transaction log might not be re-playable when the system comes back
up...

Now Solr won't tell you what's trashed (since it can't possibly know). At
that point your whole collection might be corrupt, but *presumably* you
will have a backup available (onsite or off) and a checkpoint time of when
you took that backup, so you can replay any indexing work that might have
happened since then.

Admittedly that's extreme, but it depends how cast iron a guarantee you
want :)

But in all seriousness, Shawn is right, Solr is stable, and if it can't
index a doc it will tell you.
In the case of ALL ZK being down or all Solr servers for a particular
shard, you will generate an error when you try to index anything (HTTP
503/Service Is Unavailable or the SolrJ equivalent).
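A minimal SolrJ sketch of that contract (server URL, field values and class
name are placeholders): a successful add comes back with status 0, anything
else surfaces as an exception the calling app can act on.

import java.io.IOException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;

public class IndexWithErrorHandling {
  public static void main(String[] args) {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    try {
      UpdateResponse rsp = solr.add(doc);
      solr.commit();
      System.out.println("indexed, status=" + rsp.getStatus()); // 0 on success
    } catch (SolrServerException | IOException e) {
      // transport-level failure, e.g. the node is unreachable
      System.err.println("could not reach Solr: " + e.getMessage());
    } catch (SolrException e) {
      // the server answered with an error, e.g. 503 Service Unavailable
      System.err.println("update rejected: " + e.code() + " " + e.getMessage());
    }
  }
}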


On 19 November 2013 15:35, Shawn Heisey  wrote:

> On 11/19/2013 6:18 AM, adfel70 wrote:
> > Hi, we plan to establish an ensemble of solr with zookeeper.
> > We gonna have 6 solr servers with 2 instances on each server, also we'll
> > have 6 shards with replication factor 2, in addition we'll have 3
> > zookeepers.
>
> You'll want to do one Solr instance per machine.  Each Solr instance can
> house many cores (shard replicas).  More than one instance per machine
> will: 1) Add memory/CPU overhead.  2) Accidentally and easily result in
> a situation where multiple replicas for a single shard are located on
> the same machine.
>
> > Our concern is that we will send documents to index and solr won't index
> > them but won't send any error message and we will suffer a data loss
> >
> > 1. Is there any situation that can cause this kind of problem?
> > 2. Can it happen if some of ZKs are down? or some of the solr instances?
> > 3. How can we monitor them? Can we do something to prevent these kind of
> > errors?
>
> 1) If it does become possible for data loss to occur without notifying
> your application, it will be considered a very serious bug, and top
> priority will be given to fixing it.  A release with the fix will be
> made as quickly as possible.  Of course I cannot guarantee that such
> bugs don't exist, but I am not aware of any at the moment.
>
> 2) You must have a majority ([n/2] + 1) of zookeepers operational.  If
> you have three or four zookeepers, one zookeeper can be down and
> SolrCloud will continue to function perfectly.  With five or six
> zookeepers, two can be down.  With seven or eight, three can be down.
> As far as Solr itself, if one replica of each shard from a collection is
> working, then the entire collection will work.  That means you'll want
> to have at least replicationFactor=2, so there are two copies of each
> shard.
>
> 3) There are MANY options for monitoring.  Many of them are completely
> free, and it is always possible to write your own.  One high-level thing
> you can do is make sure the hosts are up and that they are running the
> proper number of java processes.  Solr offers a number of API entry
> points that will tell you how things are working, and more are added
> over time.  I don't think there are any zookeeper-specific informational
> capabilities at the moment, but I did file a bug report asking for the
> feature.  When I have some time, I will work on a fix for it.  One of
> the other committers may decide to work on it as well.
>
> If you want out-of-the-box Solr-specific monitoring and are willing to
> pay for it, Sematext offers SPM.  One of Sematext's employees is very
> active on this list, and they just added Zookeeper monitoring to their
> capabilities.  They do have a free version, but it has extremely limited
> monitoring history.
>
> http://sematext.com/
>
> Thanks,
> Shawn
>
>


Re: Option to enforce a majority quorum approach to accepting updates in SolrCloud?

2013-11-19 Thread Timothy Potter
Your thinking is always one step ahead of me! I'll file the JIRA

Thanks.
Tim


On Tue, Nov 19, 2013 at 10:38 AM, Mark Miller  wrote:

> Yeah, this is kind of like one of many little features that we have just
> not gotten to yet. I’ve always planned for a param that let’s you say how
> many replicas an update must be verified on before responding success.
> Seems to make sense to fail that type of request early if you notice there
> are not enough replicas up to satisfy the param to begin with.
>
> I don’t think there is a JIRA issue yet, fire away if you want.
>
> - Mark
>
> On Nov 19, 2013, at 12:14 PM, Timothy Potter  wrote:
>
> > I've been thinking about how SolrCloud deals with write-availability
> using
> > in-sync replica sets, in which writes will continue to be accepted so
> long
> > as there is at least one healthy node per shard.
> >
> > For a little background (and to verify my understanding of the process is
> > correct), SolrCloud only considers active/healthy replicas when
> > acknowledging a write. Specifically, when a shard leader accepts an
> update
> > request, it forwards the request to all active/healthy replicas and only
> > considers the write successful if all active/healthy replicas ack the
> > write. Any down / gone replicas are not considered and will sync up with
> > the leader when they come back online using peer sync or snapshot
> > replication. For instance, if a shard has 3 nodes, A, B, C with A being
> the
> > current leader, then writes to the shard will continue to succeed even
> if B
> > & C are down.
> >
> > The issue is that if a shard leader continues to accept updates even if
> it
> > loses all of its replicas, then we have acknowledged updates on only 1
> > node. If that node, call it A, then fails and one of the previous
> replicas,
> > call it B, comes back online before A does, then any writes that A
> accepted
> > while the other replicas were offline are at risk to being lost.
> >
> > SolrCloud does provide a safe-guard mechanism for this problem with the
> > leaderVoteWait setting, which puts any replicas that come back online
> > before node A into a temporary wait state. If A comes back online within
> > the wait period, then all is well as it will become the leader again and
> no
> > writes will be lost. As a side note, sys admins definitely need to be
> made
> > more aware of this situation as when I first encountered it in my
> cluster,
> > I had no idea what it meant.
> >
> > My question is whether we want to consider an approach where SolrCloud
> will
> > not accept writes unless there is a majority of replicas available to
> > accept the write? For my example, under this approach, we wouldn't accept
> > writes if both B&C failed, but would if only C did, leaving A & B online.
> > Admittedly, this lowers the write-availability of the system, so may be
> > something that should be tunable? Just wanted to put this out there as
> > something I've been thinking about lately ...
> >
> > Cheers,
> > Tim
>
>


Re: Option to enforce a majority quorum approach to accepting updates in SolrCloud?

2013-11-19 Thread Mark Miller
Yeah, this is kind of like one of many little features that we have just not 
gotten to yet. I’ve always planned for a param that lets you say how many 
replicas an update must be verified on before responding success. Seems to make 
sense to fail that type of request early if you notice there are not enough 
replicas up to satisfy the param to begin with.

I don’t think there is a JIRA issue yet, fire away if you want.

- Mark

On Nov 19, 2013, at 12:14 PM, Timothy Potter  wrote:

> I've been thinking about how SolrCloud deals with write-availability using
> in-sync replica sets, in which writes will continue to be accepted so long
> as there is at least one healthy node per shard.
> 
> For a little background (and to verify my understanding of the process is
> correct), SolrCloud only considers active/healthy replicas when
> acknowledging a write. Specifically, when a shard leader accepts an update
> request, it forwards the request to all active/healthy replicas and only
> considers the write successful if all active/healthy replicas ack the
> write. Any down / gone replicas are not considered and will sync up with
> the leader when they come back online using peer sync or snapshot
> replication. For instance, if a shard has 3 nodes, A, B, C with A being the
> current leader, then writes to the shard will continue to succeed even if B
> & C are down.
> 
> The issue is that if a shard leader continues to accept updates even if it
> loses all of its replicas, then we have acknowledged updates on only 1
> node. If that node, call it A, then fails and one of the previous replicas,
> call it B, comes back online before A does, then any writes that A accepted
> while the other replicas were offline are at risk to being lost.
> 
> SolrCloud does provide a safe-guard mechanism for this problem with the
> leaderVoteWait setting, which puts any replicas that come back online
> before node A into a temporary wait state. If A comes back online within
> the wait period, then all is well as it will become the leader again and no
> writes will be lost. As a side note, sys admins definitely need to be made
> more aware of this situation as when I first encountered it in my cluster,
> I had no idea what it meant.
> 
> My question is whether we want to consider an approach where SolrCloud will
> not accept writes unless there is a majority of replicas available to
> accept the write? For my example, under this approach, we wouldn't accept
> writes if both B&C failed, but would if only C did, leaving A & B online.
> Admittedly, this lowers the write-availability of the system, so may be
> something that should be tunable? Just wanted to put this out there as
> something I've been thinking about lately ...
> 
> Cheers,
> Tim



Option to enforce a majority quorum approach to accepting updates in SolrCloud?

2013-11-19 Thread Timothy Potter
I've been thinking about how SolrCloud deals with write-availability using
in-sync replica sets, in which writes will continue to be accepted so long
as there is at least one healthy node per shard.

For a little background (and to verify my understanding of the process is
correct), SolrCloud only considers active/healthy replicas when
acknowledging a write. Specifically, when a shard leader accepts an update
request, it forwards the request to all active/healthy replicas and only
considers the write successful if all active/healthy replicas ack the
write. Any down / gone replicas are not considered and will sync up with
the leader when they come back online using peer sync or snapshot
replication. For instance, if a shard has 3 nodes, A, B, C with A being the
current leader, then writes to the shard will continue to succeed even if B
& C are down.

The issue is that if a shard leader continues to accept updates even if it
loses all of its replicas, then we have acknowledged updates on only 1
node. If that node, call it A, then fails and one of the previous replicas,
call it B, comes back online before A does, then any writes that A accepted
while the other replicas were offline are at risk to being lost.

SolrCloud does provide a safe-guard mechanism for this problem with the
leaderVoteWait setting, which puts any replicas that come back online
before node A into a temporary wait state. If A comes back online within
the wait period, then all is well as it will become the leader again and no
writes will be lost. As a side note, sys admins definitely need to be made
more aware of this situation as when I first encountered it in my cluster,
I had no idea what it meant.

My question is whether we want to consider an approach where SolrCloud will
not accept writes unless there is a majority of replicas available to
accept the write? For my example, under this approach, we wouldn't accept
writes if both B&C failed, but would if only C did, leaving A & B online.
Admittedly, this lowers the write-availability of the system, so may be
something that should be tunable? Just wanted to put this out there as
something I've been thinking about lately ...
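To make that concrete, here is a minimal sketch of the kind of leader-side
guard I have in mind. It's purely illustrative - nothing like this exists in
Solr today, and the replica counts would have to come from the cached cluster
state:

import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrException.ErrorCode;

public class MajorityQuorumCheck {
  // Fail an update early unless a majority of the shard's replicas are active.
  public static void assertMajority(int activeReplicas, int totalReplicas) {
    int majority = (totalReplicas / 2) + 1;
    if (activeReplicas < majority) {
      throw new SolrException(ErrorCode.SERVICE_UNAVAILABLE,
          "Rejecting update: " + activeReplicas + " of " + totalReplicas
              + " replicas active, need at least " + majority);
    }
  }

  public static void main(String[] args) {
    assertMajority(2, 3); // A & B up, C down: accepted
    assertMajority(1, 3); // only A up: rejected with a 503
  }
}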

Cheers,
Tim


Re: Boosting documents by categorical preferences

2013-11-19 Thread Chris Hostetter
: My approach was something like:
: 1) Look at the categories that the user has preferred and compute the
: z-score
: 2) Pick the top 3 among those
: 3) Use those to boost search results.

I think that totally makes sense ... the additional bit I was suggesting 
that you consider is that instead of picking the "highest" 3 z-scores, 
pick the z-scores with the greatest absolute value ... that way if someone 
is a very boring person and their "positive interests" are all basically 
exactly the same as the mean for everyone else, but they have some very 
strong "dis-interests", you don't bother boosting on those minuscule 
interests and instead you negatively boost on the things they are 
antagonistic against.
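For example, a small sketch of that selection step (the "category" field name
and the boost-query workaround for the negative case are assumptions, not
anything specific to your setup): compute z per category, sort by |z|, keep
the top 3, and turn each into a bq clause.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class CategoryBoosts {
  // zScores: category -> (userShare - populationMean) / populationStdDev
  public static List<String> topBoosts(Map<String, Double> zScores, int k) {
    List<Map.Entry<String, Double>> entries =
        new ArrayList<Map.Entry<String, Double>>(zScores.entrySet());
    Collections.sort(entries, new Comparator<Map.Entry<String, Double>>() {
      public int compare(Map.Entry<String, Double> a, Map.Entry<String, Double> b) {
        return Double.compare(Math.abs(b.getValue()), Math.abs(a.getValue())); // largest |z| first
      }
    });
    List<String> bq = new ArrayList<String>();
    for (Map.Entry<String, Double> e : entries.subList(0, Math.min(k, entries.size()))) {
      double z = e.getValue();
      if (z >= 0) {
        bq.add("category:" + e.getKey() + "^" + String.format(Locale.ROOT, "%.2f", z));
      } else {
        // Solr has no true negative boost; the usual workaround is to boost
        // everything that does NOT match the disliked category.
        bq.add("(*:* -category:" + e.getKey() + ")^"
            + String.format(Locale.ROOT, "%.2f", Math.abs(z)));
      }
    }
    return bq; // pass each entry as a bq= parameter on an (e)dismax request
  }
}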


-Hoss


Re: Listing collection fields

2013-11-19 Thread youknow...@heroicefforts.net
Thanks.  I have an Xtext DSL doing some config and code generation downstream 
of the data ingestion.  It probably wouldn't be that hard to generate a 
solrconfig.xml, but for now I just want to build in some runtime reconciliation 
to aid in dynamic query generation.  It sounds like Luke is still the best 
approach.

Regards,

-Jess

Shalin Shekhar Mangar  wrote:

>You can use the ListFields method in the new Schema API:
>
>https://cwiki.apache.org/confluence/display/solr/Schema+API#SchemaAPI-ListFields
>
>Note that this will return all configured fields but it doesn't tell
>you the actual dynamic field names in the index. I don't know if we
>have anything better than a luke request for that yet.
>
>On Tue, Nov 19, 2013 at 5:56 AM, youknow...@heroicefforts.net
> wrote:
>> I'd like to get the complete field list for a collection, including
>dynamic fields.  Is issuing a Luke request still the recommended way
>for retrieving this data?
>>
>> --
>> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
>
>
>
>-- 
>Regards,
>Shalin Shekhar Mangar.

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Re: [ANNOUNCE] Apache Solr Reference Guide 4.5 Available

2013-11-19 Thread Cassandra Targett
I've often thought of possibly providing the reference guide in .epub
format, but wasn't sure of general interest. I also once tried to
convert the PDF version with Calibre and it was a total mess - but
PDF is probably the least flexible starting point for conversion.

Unfortunately, the Word export is only available on a per-page basis,
which would make it really tedious to try to make a .doc version of
the entire guide (there are ~150 pages). There are, however, options
for HTML export, which I believe could be converted to .epub - but
might take some fiddling.

I created an issue for this - for now just to track that it's
something that might be of interest - but not sure if/when I'd
personally be able to work on it:
https://issues.apache.org/jira/browse/SOLR-5467.

On Tue, Nov 19, 2013 at 6:34 AM, Uwe Reh  wrote:
> Am 18.11.2013 14:39, schrieb Furkan KAMACI:
>
>> Atlassian Jira has two options at default: exporting to PDF and exporting
>> to Word.
>
>
> I see, 'Word' isn't optimal for a reference guide. But OO can handle 'doc'
> and has epub plugins.
> Could it be possible, to offer the doku also as 'doc(x)'
>
> barefaced
> Uwe
>


RE: facet method=enum and uninvertedfield limitations

2013-11-19 Thread Lemke, Michael SZ/HZA-ZSW
On Friday, November 15, 2013 11:22 AM, Lemke, Michael SZ/HZA-ZSW wrote:

Judging from numerous replies this seems to be a tough question.
Nevertheless, I'd really appreciate any help as we are stuck.
We'd really like to know what in our index causes the facet.method=fc
query to fail.

Thanks,
Michael

>On Thu, November 14, 2013 7:26 PM, Yonik Seeley wrote:
>>On Thu, Nov 14, 2013 at 12:03 PM, Lemke, Michael  SZ/HZA-ZSW
>> wrote:
>>> I am running into performance problems with faceted queries.
>>> If I do a
>>>
>>> q=word&facet.field=CONTENT&facet=true&facet.limit=10&facet.mincount=1&facet.method=fc&facet.prefix=a&rows=0
>>>
>>> I am getting an exception:
>>> org.apache.solr.common.SolrException: Too many values for UnInvertedField 
>>> faceting on field CONTENT
>>> at 
>>> org.apache.solr.request.UnInvertedField.uninvert(UnInvertedField.java:384)
>>> at 
>>> org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:178)
>>> at 
>>> org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:839)
>>> ...
>>>
>>> I understand it's got something to do with a 24bit limit somewhere
>>> in the code but I don't understand enough of it to be able to construct
>>> a specialized index that can be queried with facet.method=enum.
>>
>>You shouldn't need to do anything differently to try facet.method=enum
>>(just replace facet.method=fc with facet.method=enum)
>
>This is true and facet.method=enum does work indeed.  The problem is
>runtime.  In particular queries with an empty facet.prefix= run many
>seconds if not minutes.  I initially asked about this here:
>http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201310.mbox/%3c33ec3398272fbe47b64ee3b3e98f69a761427...@de011521.schaeffler.com%3E
>
>It was suggested that fc is much faster than enum and I'd like to
>test that.  We are still fairly free to design the index such that
>it performs well.  But to do that we need to understand what is
>killing it.
>
>>
>>You may also want to add the parameter
>>facet.enum.cache.minDf=10
>>to lower memory usage by only usiing the filter cache for terms that
>>match more than 100K docs.
>
>That helped a little, cut down my particular test from 10 sec to 5 sec.
>But still too slow.  Mind you this is for an autosuggest feature.
>
>Thanks for your reply.
>
>Michael
>
>



Re: Problems bulk adding documents to Solr Cloud in 4.5.1

2013-11-19 Thread Mark Miller
4.6 no longer uses XML to send requests between nodes. It’s probably worth 
trying it and seeing if there is still a problem. Here is the RC we are voting 
on today: 
http://people.apache.org/~simonw/staging_area/lucene-solr-4.6.0-RC4-rev1543363/

Otherwise, I do plan on looking into this issue soon.

- Mark

On Nov 19, 2013, at 10:11 AM, Michael Tracey  wrote:

> Dave, that's the exact symptoms we all have had in SOLR-5402.  After many 
> attempted fixes (including upgrading jetty, switching to tomcat, messing with 
> buffer settings) my solution was to fall back to 4.4 and await a fix.
> 
> - Original Message -
> From: "Dave Seltzer" 
> To: solr-user@lucene.apache.org
> Sent: Monday, November 18, 2013 9:48:46 PM
> Subject: Problems bulk adding documents to Solr Cloud in 4.5.1
> 
> Hello,
> 
> I'm having quite a bit of trouble indexing content in Solr Cloud. I build a
> content indexer on top of the REST API designed to index my data quickly.
> It was working very well indexing about 100 documents per "<add>"
> instruction.
> 
> After some tweaking of the schema I switched on a few more servers. Set up
> a few shards and started indexing data. Everything was working perfectly,
> but as soon as I switched to "Cloud" I started getting
> RemoteServerExceptions "Illegal to have multiple roots."
> 
> I'm using the stock Jetty container on both servers.
> 
> To get things working I reduced the number of documents per add until it
> worked. Unfortunately that has limited me to adding a single document per
> add - which is quite slow.
> 
> I'm fairly sure it's not the size of the HTTP post because things were
> working just fine until I moved over to Solr Cloud.
> 
> Does anyone have any information about this problem? It sounds a lot like
> Sai Gadde's https://issues.apache.org/jira/browse/SOLR-5402
> 
> Thanks so much!
> 
> -Dave



Re: Question regarding possibility of data loss

2013-11-19 Thread Shawn Heisey
On 11/19/2013 6:18 AM, adfel70 wrote:
> Hi, we plan to establish an ensemble of solr with zookeeper. 
> We gonna have 6 solr servers with 2 instances on each server, also we'll
> have 6 shards with replication factor 2, in addition we'll have 3
> zookeepers. 

You'll want to do one Solr instance per machine.  Each Solr instance can
house many cores (shard replicas).  More than one instance per machine
will: 1) Add memory/CPU overhead.  2) Accidentally and easily result in
a situation where multiple replicas for a single shard are located on
the same machine.

> Our concern is that we will send documents to index and solr won't index
> them but won't send any error message and we will suffer a data loss
> 
> 1. Is there any situation that can cause this kind of problem? 
> 2. Can it happen if some of ZKs are down? or some of the solr instances? 
> 3. How can we monitor them? Can we do something to prevent these kind of
> errors? 

1) If it does become possible for data loss to occur without notifying
your application, it will be considered a very serious bug, and top
priority will be given to fixing it.  A release with the fix will be
made as quickly as possible.  Of course I cannot guarantee that such
bugs don't exist, but I am not aware of any at the moment.

2) You must have a majority ([n/2] + 1) of zookeepers operational.  If
you have three or four zookeepers, one zookeeper can be down and
SolrCloud will continue to function perfectly.  With five or six
zookeepers, two can be down.  With seven or eight, three can be down.
As far as Solr itself, if one replica of each shard from a collection is
working, then the entire collection will work.  That means you'll want
to have at least replicationFactor=2, so there are two copies of each shard.

3) There are MANY options for monitoring.  Many of them are completely
free, and it is always possible to write your own.  One high-level thing
you can do is make sure the hosts are up and that they are running the
proper number of java processes.  Solr offers a number of API entry
points that will tell you how things are working, and more are added
over time.  I don't think there are any zookeeper-specific informational
capabilities at the moment, but I did file a bug report asking for the
feature.  When I have some time, I will work on a fix for it.  One of
the other committers may decide to work on it as well.

If you want out-of-the-box Solr-specific monitoring and are willing to
pay for it, Sematext offers SPM.  One of Sematext's employees is very
active on this list, and they just added Zookeeper monitoring to their
capabilities.  They do have a free version, but it has extremely limited
monitoring history.

http://sematext.com/

Thanks,
Shawn



Leading and trailing wildcard with phrase query and positional ordering

2013-11-19 Thread GOYAL, ANKUR
Hi,

I am using Solr 4.2.1. I have a couple of questions regarding using leading and 
trailing wildcards with phrase queries and doing positional ordering.

*   I have a field called text which is defined as the text_general field. 
I downloaded the ComplexPhraseQuery plugin 
(https://issues.apache.org/jira/browse/SOLR-1604) and it works perfectly for 
trailing wildcards and wildcards within the phrase. However, if we use a 
leading wildcard, then it leads to an error saying that WildCard query does not 
permit usage of leading wildcard. So, is there any other way that we can use 
leading and trailing wildcards along with a phrase ?
*   I am using boosting (qf parameter in requestHandler in solrConfig.xml) 
to do ordering of results that are returned from Solr. However, the order is 
not correct. The fields that I am doing boosting on are "text_general" fields. 
So, is it possible that boosting does not occur when the wildcards are used ?

-Ankur



Re: Solr spatial search within the polygon

2013-11-19 Thread Smiley, David W.


On 11/19/13 4:06 AM, "Dhanesh Radhakrishnan"  wrote:

>Hi David,
>Thank you so much for the detailed reply. I've checked each and every lat
>lng coordinates and its a purely polygon.
>After  some time I did one change in the lat lng indexing.
>Changed the indexing format.
>
>Initially I indexed the latitude and longitude separated by comma  Eg:-
>"location":["9.445890,76.540970"]
>Instead I indexed with space.
> "location":["9.445890 76.540970"]

Just to be clear, if you use a space, it's "x y" order.  If you use a
comma, it's "y, x" order.  If you use WKT, it's always a space in X Y
order (and of course, the shape name and other stuff).  You may have
gotten your search to work but I have a hunch you have all your latitudes
and longitudes in the wrong position, but I can't possibly know for sure
because your example datapoint is ambiguous.  76 degrees latitude is
pretty far up there though, hence my hunch you've got it wrong.
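For example (assuming the field really is named "location" as in your
messages, and that your point is lat 9.445890 / lon 76.540970), these two
values describe the same point:

import org.apache.solr.common.SolrInputDocument;

public class PointOrderExample {
  public static void main(String[] args) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("location", "9.445890,76.540970");      // comma form: "lat, lon" (y, x)
    // the equivalent space form would be "76.540970 9.445890"  -- "lon lat" (x y)
    System.out.println(doc);                              // send with SolrJ as usual
  }
}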

>
>and it worked
>
>Also from your observation on IsWithIn predicate I tested with Intersects
>and I found there is a  difference in the QTime.
>
>For IsWithin
> "QTime": 9
>ResponseHeader": {
>"status": 0,
>"QTime": 9
>},
>
>When I used "Intersects"
>responseHeader": {
>"status": 0,
>"QTime": 26
>}

There's no way the Intersects code is slower than IsWithin; IsWithin needs
to visit many more grid tiles -- big ones covering lots of docs.  Perhaps
you have these times flipped as well ;-)   Any way, given you have
multi-valued point data, you should choose the spatial predicate that
matches what you intend (your requirements).  Maybe that's IsWithin, maybe
that's Intersects.  *if* your field was *not* multi-valued (most people
don't have multi-valued spatial data per doc), then these two predicates
become semantically equivalent for such data, and so most people should
always choose Intersects even if "within" is colloquially how one thinks
of it.

~ David



Re: Problems bulk adding documents to Solr Cloud in 4.5.1

2013-11-19 Thread Michael Tracey
Dave, that's the exact symptoms we all have had in SOLR-5402.  After many 
attempted fixes (including upgrading jetty, switching to tomcat, messing with 
buffer settings) my solution was to fall back to 4.4 and await a fix.

- Original Message -
From: "Dave Seltzer" 
To: solr-user@lucene.apache.org
Sent: Monday, November 18, 2013 9:48:46 PM
Subject: Problems bulk adding documents to Solr Cloud in 4.5.1

Hello,

I'm having quite a bit of trouble indexing content in Solr Cloud. I build a
content indexer on top of the REST API designed to index my data quickly.
It was working very well indexing about 100 documents per "<add>"
instruction.

After some tweaking of the schema I switched on a few more servers. Set up
a few shards and started indexing data. Everything was working perfectly,
but as soon as I switched to "Cloud" I started getting
RemoteServerExceptions "Illegal to have multiple roots."

I'm using the stock Jetty container on both servers.

To get things working I reduced the number of documents per add until it
worked. Unfortunately that has limited me to adding a single document per
add - which is quite slow.

I'm fairly sure it's not the size of the HTTP post because things were
working just fine until I moved over to Solr Cloud.

Does anyone have any information about this problem? It sounds a lot like
Sai Gadde's https://issues.apache.org/jira/browse/SOLR-5402

Thanks so much!

-Dave


External File field Reload option

2013-11-19 Thread adityab
Hi, 
I have been using external file field (eff) for holding rank of the document
which gets updated every day based on different stats collected by the
system. Once the rank is computed the new files are pushed to Master which
will eventually replicate to slaves on next commit. 

Our eff file has around 1.6M lines of simple key-value pairs. It's roughly
about 16MB. It's been observed that loading this file at first takes around
192 sec. I agree this can be done at the start of the server and should not
impact the performance while serving traffic. (We have 10 such fields, file
per zone).

Now documents are pushed to Master every 2 hrs in batches. Eff is just
pushed once a day. As we apply commit every 2hrs, On slaves when new reader
is opened after replication it takes a long time to warmup because it has to
load the eff file again. 

Curious to know: if the file has not changed and resides outside the index, is
there a way in Solr to check if the eff file is actually modified before
trying to reload it?

Any other suggestions?

 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/External-File-field-Reload-option-tp4101929.html
Sent from the Solr - User mailing list archive at Nabble.com.


ANNOUNCEMENT: ZooKeeper monitoring

2013-11-19 Thread Otis Gospodnetic
Hi,

A number of people/organizations use SPM for monitoring their Solr /
SolrCloud clusters, and since SolrCloud relies on ZooKeeper, we added
support for ZooKeeper monitoring and alerting to SPM.

This means you can now monitor ZooKeeper along your other clusters -
SolrCloud, Hadoop, HBase, Kafka and Elasticsearch, if you have it.
:).

Screenshot:
http://blog.sematext.com/2013/11/19/announcement-zookeeper-performance-monitoring-in-spm/

Demo:
https://apps.sematext.com/demo

Feedback welcome!

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


solr highlighting

2013-11-19 Thread javozzo
Hi,
I'm new to Solr. I use Solr 3.6.0 and I tried to use the highlighting
parameters to obtain a result like Google (if the searched word appears in
the title, this word must be bold).
Example:
searched word = solr

retrieved document title:
welcome in *Solr*
The highlighting parameters that I use are:
hl.fl=* and hl=true.
I have an array that contains some records with the term wrapped in
highlighting tags, but I'd like to get this directly in the title field.
Any ideas?
Thanks
Danilo 




--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-highlighting-tp4101919.html
Sent from the Solr - User mailing list archive at Nabble.com.

Question regarding possibility of data loss

2013-11-19 Thread adfel70
Hi, we plan to establish an ensemble of Solr with ZooKeeper.
We're going to have 6 Solr servers with 2 instances on each server; we'll also
have 6 shards with replication factor 2, and in addition we'll have 3
ZooKeepers.

Our concern is that we will send documents to index and Solr won't index
them, but also won't send any error message, and we will suffer data loss.

1. Is there any situation that can cause this kind of problem? 
2. Can it happen if some of ZKs are down? or some of the solr instances? 
3. How can we monitor them? Can we do something to prevent these kind of
errors? 

Thanks in advance 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Question-regarding-possibility-of-data-loss-tp4101915.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: [ANNOUNCE] Apache Solr Reference Guide 4.5 Available

2013-11-19 Thread Uwe Reh

Am 18.11.2013 14:39, schrieb Furkan KAMACI:

Atlassian Jira has two options at default: exporting to PDF and exporting
to Word.


I see, 'Word' isn't optimal for a reference guide. But OO can handle 
'doc' and has epub plugins.

Would it be possible to offer the documentation also as 'doc(x)'?

barefaced
Uwe



Re: How to get score with getDocList method Solr API

2013-11-19 Thread Amit Aggarwal
Hello Shekhar,
Thanks for answering. Do I have to set the GET_SCORES flag as the last parameter
of the getDocList method?

Thanks
On 19-Nov-2013 1:43 PM, "Shalin Shekhar Mangar" 
wrote:

> A few flags are supported:
> public static final int GET_DOCSET= 0x4000;
> public static final int TERMINATE_EARLY = 0x04;
> public static final int GET_DOCLIST   =0x02; // get
> the documents actually returned in a response
> public static final int GET_SCORES =   0x01;
>
> Use the GET_SCORES flag to get the score with each document.
>
> On Tue, Nov 19, 2013 at 8:08 AM, Amit Aggarwal
>  wrote:
> > Hello All,
> >
> > I am trying to develop a custom request handler.
> > Here is the snippet :
> >
> > // returnMe is nothing but a list of Document going to return
> >
> > try {
> >
> > // FLAG ???
> > DocList docList = searcher.getDocList(parsedQuery,
> > parsedFilterQueryList, Sort.RELEVANCE, 1, maxDocs , FLAG);
> >
> > // Now get DocIterator
> > DocIterator it = docList.iterator();
> >
> > // Now for each id get doc and put it in
> list
> >
> > int i =0;
> > while (it.hasNext()) {
> >
> > returnMe.add(searcher.doc(it.next()));
> >
> > }
> >
> >
> > Ques 1 - > My question is , what does FLAG represent in getDocList
> method ?
> > Ques 2 - > How can I ensure that searcher.getDocList method give me score
> > also with each document.
> >
> >
> > --
> > Amit Aggarwal
> > 8095552012
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


[Announcement] Forage 0.3, the search server for Node.js

2013-11-19 Thread Fergus McDowall
Hi!

As a reader of the Solr mailing list, you might be interested in an
experimental search server for node.js, Forage.

Forage is built on the levelDB library from Google.

Check it out here: http://www.foragejs.net/

As always, feedback, pull requests, comments, praise, criticism and beer
are most welcome.

Fergus


Re: Solr spatial search within the polygon

2013-11-19 Thread Dhanesh Radhakrishnan
Hi David,
Thank you so much for the detailed reply. I've checked each and every lat/lng
coordinate and it's purely a polygon.
After  some time I did one change in the lat lng indexing.
Changed the indexing format.

Initially I indexed the latitude and longitude separated by comma  Eg:-
"location":["9.445890,76.540970"]
Instead I indexed with space.
 "location":["9.445890 76.540970"]

and it worked

Also from your observation on IsWithIn predicate I tested with Intersects
and I found there is a  difference in the QTime.

For IsWithin
 "QTime": 9
ResponseHeader": {
"status": 0,
"QTime": 9
},

When I used "Intersects"
responseHeader": {
"status": 0,
"QTime": 26
}

Thank you so much

Regards
dhanesh s.r


Re: Listing collection fields

2013-11-19 Thread Shalin Shekhar Mangar
You can use the ListFields method in the new Schema API:

https://cwiki.apache.org/confluence/display/solr/Schema+API#SchemaAPI-ListFields

Note that this will return all configured fields but it doesn't tell
you the actual dynamic field names in the index. I don't know if we
have anything better than a luke request for that yet.
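If it helps, a bare-bones sketch for pulling the configured-field list over
HTTP (host, port and core name are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ListFields {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://localhost:8983/solr/collection1/schema/fields");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);  // JSON list of configured (not dynamic-instance) fields
    }
    in.close();
  }
}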

On Tue, Nov 19, 2013 at 5:56 AM, youknow...@heroicefforts.net
 wrote:
> I'd like to get the complete field list for a collection, including dynamic 
> fields.  Is issuing a Luke request still the recommended way for retrieving 
> this data?
>
> --
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.



-- 
Regards,
Shalin Shekhar Mangar.


Re: How to get score with getDocList method Solr API

2013-11-19 Thread Shalin Shekhar Mangar
A few flags are supported:
public static final int GET_DOCSET= 0x4000;
public static final int TERMINATE_EARLY = 0x04;
public static final int GET_DOCLIST   =0x02; // get
the documents actually returned in a response
public static final int GET_SCORES =   0x01;

Use the GET_SCORES flag to get the score with each document.
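A sketch of that call (the class wrapper is only there so it compiles on its
own; the argument names follow your snippet):

import java.io.IOException;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.SolrIndexSearcher;

public class ScoredDocListExample {
  // Same arguments as in your snippet, with GET_SCORES passed as the flags value.
  public static void collect(SolrIndexSearcher searcher, Query parsedQuery,
      List<Query> parsedFilterQueryList, int maxDocs) throws IOException {
    DocList docList = searcher.getDocList(parsedQuery, parsedFilterQueryList,
        Sort.RELEVANCE, 0, maxDocs, SolrIndexSearcher.GET_SCORES);
    DocIterator it = docList.iterator();
    while (it.hasNext()) {
      int docId = it.nextDoc();
      float score = it.score();   // valid because GET_SCORES was requested
      Document d = searcher.doc(docId);
      // ... add d (and its score) to whatever you are returning
    }
  }
}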

On Tue, Nov 19, 2013 at 8:08 AM, Amit Aggarwal
 wrote:
> Hello All,
>
> I am trying to develop a custom request handler.
> Here is the snippet :
>
> // returnMe is nothing but a list of Document going to return
>
> try {
>
> // FLAG ???
> DocList docList = searcher.getDocList(parsedQuery,
> parsedFilterQueryList, Sort.RELEVANCE, 1, maxDocs , FLAG);
>
> // Now get DocIterator
> DocIterator it = docList.iterator();
>
> // Now for each id get doc and put it in list
>
> int i =0;
> while (it.hasNext()) {
>
> returnMe.add(searcher.doc(it.next()));
>
> }
>
>
> Ques 1 - > My question is , what does FLAG represent in getDocList method ?
> Ques 2 - > How can I ensure that searcher.getDocList method give me score
> also with each document.
>
>
> --
> Amit Aggarwal
> 8095552012
>



-- 
Regards,
Shalin Shekhar Mangar.