Re: How to query against dynamic fields without listing them all?

2019-07-14 Thread David Santamauro
Hi Steven,

You can dump all the dynamic fields into a copyField




Then you can just set
  "qf":"CC_COMP_NAME_ALL"


On 7/14/19, 10:42 AM, "Steven White"  wrote:

Hi everyone,

In my schema, I have the following field:

  

When I index, I create dynamic fields and index into it like so:

  doc.addField("CC_COMP_NAME_" + componentName.toUpperCase(),
ccAllFieldsDataValue);

In my query handler, I have this:

  {"requestHandler":{"/select_hcl":{
  "class":"solr.SearchHandler",
  "name":"/select_hcl",
  "defaults":{
"defType":"edismax",
"echoParams":"explicit",
"fl":"CC_UNIQUE_FIELD,CC_FILE_PATH,score",
"indent":"true",
"qf":"CC_COMP_NAME_*",
"rows":"100",
"wt":"xml"

My expectation was that when I query using this handler, it would include all the
dynamic fields with the prefix "CC_COMP_NAME_". However, that is not
happening and I'm getting 0 hits.  But when I use the full field name, such
as CC_COMP_NAME_1 or CC_COMP_NAME_2, it works, so I know my data is
indexed; Solr is just not paying attention to the dynamic field
syntax in "qf".

I don't want to keep a list of those dynamic fields and pass them to my
handler, but if I must, then I must.  If so, how can I get the list of
those dynamic fields from Solr so that I don't have to maintain and sync up
the list myself?

Thanks

Steven



Re: Urgent help on solr optimisation issue !!

2019-06-07 Thread David Santamauro
I use the same algorithm and for me, initialMaxSegments is always the number of 
segments currently in the index (seen, e.g., in the SOLR admin UI). 
finalMaxSegments depends on what kind of updates have happened. If I know that 
"older" documents are untouched, then I'll usually use -60% or even -70%, 
depending on the initialMaxSegments. I have a few cores that I'll even go all 
the way down to 1.

If you are going to attempt this, I'd suggest testing with a small reduction, 
say 10 segments, and monitoring the index size and the difference between maxDoc and 
numDocs. I've shaved ~ 1TB off of an index optimizing from 75 down to 30 
segments (7TB index total) and reduced a significant % of deleted documents in 
the process. YMMV ...
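
A sketch of what one such step might look like (host, collection name, and
target segment count are placeholders):

  # request an optimize capped at a target segment count
  curl 'http://localhost:8983/solr/yourcollection/update?optimize=true&maxSegments=65'

  # then compare maxDoc vs numDocs to see how many deletes were purged
  curl -s 'http://localhost:8983/solr/yourcollection/admin/luke?numTerms=0&wt=json' \
    | grep -E '"numDocs"|"maxDoc"'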

If you are using a version of SOLR >=7.5 (see LUCENE-7976), this might all be 
moot.

//


On 6/7/19, 2:29 PM, "jena"  wrote:

Thanks @Michael Joyner, how did you decide on an initialMaxSegments of 256? Or
is it some random number I can use for my case? Can you guide me on how to
decide the initial & final max segments?

 
Michael Joyner wrote
> That is the way we do it here - also helps a lot with not needing x2 or 
> x3 disk space to handle the merge:
> 
> public void solrOptimize() {
>  int initialMaxSegments = 256;
>  int finalMaxSegments = 4;
>  if (isShowSegmentCounter()) {
>  log.info("Optimizing ...");
>  }
>  try (SolrClient solrServerInstance = getSolrClientInstance()) {
>  for (int segments = initialMaxSegments; segments >= 
> finalMaxSegments; segments--) {
>  if (isShowSegmentCounter()) {
>  System.out.println("Optimizing to a max of " + 
> segments + " segments.");
>  }
>  try {
>  solrServerInstance.optimize(true, true, segments);
>  } catch (RemoteSolrException | SolrServerException | 
> IOException e) {
>  log.severe(e.getMessage());
>  }
>  }
>  } catch (IOException e) {
>  throw new RuntimeException(e);
>  }
>  }
> 
> On 6/7/19 4:56 AM, Nicolas Franck wrote:
>> In that case, hard optimisation like that is out of the question.
>> Resort to automatic merge policies, specifying a maximum
>> number of segments. Solr is designed with multiple segments
>> in mind. Hard optimisation seems not worth the trouble.
>>
>> The problem is this: the fewer segments you specify during
>> an optimisation, the longer it will take, because it has to read
>> all of the segments to be merged, and redo the sorting. And a cluster
>> has a lot of housekeeping on top of it.
>>
>> If you really want to issue an optimisation, then you can
>> also do it in steps (max segments parameter)
>>
>> 10 -> 9 -> 8 -> 7 .. -> 1
>>
>> that way fewer segments need to be merged in one go.
>>
>> testing your index will show you what a good maximum
>> amount of segments is for your index.
>>
>>> On 7 Jun 2019, at 07:27, jena <sthita2010@...> wrote:
>>>
>>> Hello guys,
>>>
>>> We have 4 solr (version 4.4) instances on our production environment, which
>>> are linked/associated with zookeeper for replication. We do heavy delete &
>>> add operations. We have around 26 million records and the index size is
>>> around 70GB. We serve 100k+ requests per day.
>>>
>>>
>>> Because of heavy indexing & deletion, we optimise the solr instances every
>>> day; because of that our solr cloud is getting unstable, every solr instance
>>> goes into recovery mode & our search is getting affected & very slow because
>>> of that. Optimisation takes around 1hr 30 minutes.
>>> We are not able to fix this issue, please help.
>>>
>>> Thanks & Regards
>>>
>>>
>>>
>>> --
>>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Urgent help on solr optimisation issue !!

2019-06-07 Thread David Santamauro
So is this new optimize maxSegments / commit expungeDeletes behavior in 7.5? My 
experience, and I watch my optimize process very closely, is that using 
maxSegments does not touch every segment with a deleted document. 
expungeDeletes merges all segments with deleted documents that have been 
touched by said commit.

After reading LUCENE-7976, it seems this is, indeed, new behavior.


On 6/7/19, 10:31 AM, "Erick Erickson"  wrote:

Optimizing guarantees that there will be _no_ deleted documents in an index 
when done. If a segment has even one deleted document, it’s merged, no matter 
what you specify for maxSegments. 

Segments are write-once, so to remove deleted data from a segment it must 
be at least rewritten into a new segment, whether or not it’s merged with 
another segment on optimize.

expungeDeletes  does _not_ merge every segment that has deleted documents. 
It merges segments that have > 10% (the default) deleted documents. If your 
index happens to have all segments with > 10% deleted docs, then it will, 
indeed, merge all of them.
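
For reference, a sketch of issuing such a commit (host and collection name are
placeholders):

  curl 'http://localhost:8983/solr/yourcollection/update?commit=true&expungeDeletes=true'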

In your example, if you look closely you should find that all segments that 
had any deleted documents were written (merged) to new segments. I’d expect 
that segments with _no_ deleted documents might mostly be left alone. And two 
of the segments were chosen to merge together.

See LUCENE-7976 for a long discussion of how this changed starting  with 
SOLR 7.5.

Best,
Erick

> On Jun 7, 2019, at 7:07 AM, David Santamauro  
wrote:
> 
> Erick, on 6.0.1, optimize with maxSegments only merges down to the 
specified number. E.g., given an index with 75 segments, optimize with 
maxSegments=74 will only merge 2 segments leaving 74 segments. It will choose a 
segment to merge that has deleted documents, but does not merge every segment 
with deleted documents.
> 
> I think you are thinking about the expungeDeletes parameter on the commit 
request. That will merge every segment that has a deleted document.
> 
> 
> On 6/7/19, 10:00 AM, "Erick Erickson"  wrote:
> 
>This isn’t quite right. Solr will rewrite _all_ segments that have 
_any_ deleted documents in them when optimizing, even one. Given your 
description, I’d guess that all your segments will have deleted documents, so 
even if you do specify maxSegments on the optimize command, the entire index 
will be rewritten.
> 
>You’re in a bind, see: 
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/.
 You have this one massive segment and it will _not_ be merged until it’s 
almost all deleted documents, see the link above for a fuller explanation.
> 
>Prior to Solr 7.5 you don’t have many options except to re-index and 
_not_ optimize. So if possible I’d reindex from scratch into a new collection 
and do not optimize. Or restructure your process such that you can optimize in 
a quiet period when little indexing is going on.
> 
>Best,
>Erick
> 
>> On Jun 7, 2019, at 2:51 AM, jena  wrote:
>> 
>> Thanks @Nicolas Franck for the reply, I don't see any segment info for the
>> 4.4 version. Is there any API I can use to get my segment information? I will
>> try to use maxSegments and see if it can help us during optimization.
>> 
>> 
>> 
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 
> 




Re: Urgent help on solr optimisation issue !!

2019-06-07 Thread David Santamauro

/clarification/ ... expungeDeletes will merge every segment *touched by the 
current commit* that has a deleted document.


On 6/7/19, 10:07 AM, "David Santamauro"  wrote:

Erick, on 6.0.1, optimize with maxSegments only merges down to the 
specified number. E.g., given an index with 75 segments, optimize with 
maxSegments=74 will only merge 2 segments leaving 74 segments. It will choose a 
segment to merge that has deleted documents, but does not merge every segment 
with deleted documents.

I think you are thinking about the expungeDeletes parameter on the commit 
request. That will merge every segment that has a deleted document.


On 6/7/19, 10:00 AM, "Erick Erickson"  wrote:

This isn’t quite right. Solr will rewrite _all_ segments that have 
_any_ deleted documents in them when optimizing, even one. Given your 
description, I’d guess that all your segments will have deleted documents, so 
even if you do specify maxSegments on the optimize command, the entire index 
will be rewritten.

You’re in a bind, see: 
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/.
 You have this one massive segment and it will _not_ be merged until it’s 
almost all deleted documents, see the link above for a fuller explanation.

Prior to Solr 7.5 you don’t have many options except to re-index and 
_not_ optimize. So if possible I’d reindex from scratch into a new collection 
and do not optimize. Or restructure your process such that you can optimize in 
a quiet period when little indexing is going on.

Best,
Erick

> On Jun 7, 2019, at 2:51 AM, jena  wrote:
> 
> Thanks @Nicolas Franck for the reply, I don't see any segment info for the
> 4.4 version. Is there any API I can use to get my segment information? I will
> try to use maxSegments and see if it can help us during optimization.
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html





Re: Urgent help on solr optimisation issue !!

2019-06-07 Thread David Santamauro
Erick, on 6.0.1, optimize with maxSegments only merges down to the specified 
number. E.g., given an index with 75 segments, optimize with maxSegments=74 
will only merge 2 segments leaving 74 segments. It will choose a segment to 
merge that has deleted documents, but does not merge every segment with deleted 
documents.

I think you are thinking about the expungeDeletes parameter on the commit 
request. That will merge every segment that has a deleted document.


On 6/7/19, 10:00 AM, "Erick Erickson"  wrote:

This isn’t quite right. Solr will rewrite _all_ segments that have _any_ 
deleted documents in them when optimizing, even one. Given your description, 
I’d guess that all your segments will have deleted documents, so even if you do 
specify maxSegments on the optimize command, the entire index will be rewritten.

You’re in a bind, see: 
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/.
 You have this one massive segment and it will _not_ be merged until it’s 
almost all deleted documents, see the link above for a fuller explanation.

Prior to Solr 7.5 you don’t have many options except to re-index and _not_ 
optimize. So if possible I’d reindex from scratch into a new collection and do 
not optimize. Or restructure your process such that you can optimize in a quiet 
period when little indexing is going on.

Best,
Erick

> On Jun 7, 2019, at 2:51 AM, jena  wrote:
> 
> Thanks @Nicolas Franck for the reply, I don't see any segment info for the
> 4.4 version. Is there any API I can use to get my segment information? I will
> try to use maxSegments and see if it can help us during optimization.
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html




Re: Solr boolean query with phrase match

2019-03-25 Thread David Santamauro
Perhaps the Complex Phrase Query Parser might be what you are looking for.

https://lucene.apache.org/solr/guide/7_3/other-parsers.html
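
A hedged sketch of what that could look like for the example below, using the
field name from the parsed query shown there (host and collection name are
placeholders):

  curl 'http://localhost:8983/solr/yourcollection/select' \
    --data-urlencode 'q={!complexphrase}query:("gear cycle"~2 OR "black cycle"~2)' \
    --data-urlencode 'rows=10'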

//

On 3/25/19, 1:41 AM, "krishan goyal"  wrote:

Hi,

I want to execute a solr query with boolean clauses using the eDismax Query
Parser.

But the phrase match is executed on the complete query and not on the
individual queries which are created.

Is it possible to have both boolean conditions in query and phrase matches ?

Eg:
Query -
(gear AND cycle) OR (black AND cycle)

The parsed query for this is

"+((+(query:gear)~0.01 +(query:cycle)~0.01) (+(query:black)~0.01
+(query:cycle)~0.01)) (phrase:\"gear cycle black cycle\")~0.01"

As can be seen the query conditions are as expected but I want the phrase
match on "gear cycle" or "black cycle" .

Using boost/bq will not solve the use case because I also want to define
phrase slop. So that a phrase match for "black cycle" will match documents
like "black colour cycle".

Is it possible to either
1. Apply the phrase match on the individual queries produced ?
2. Apply the phrase match on a different attribute than 'q'. As a
workaround I can create the individual phrases to be matched and supply
that to this attribute.
3. Or any other solution for this use case ?

Thanks
Krishan



Re: Is it possible to force solr show all facet values for the field with an enum type?

2019-01-06 Thread David Santamauro
Seeing that the field is an enumeration, couldn't you just use a set of 
facet.query(s)?

  ?q=*:*
  &fq=user_s:Bar
  &facet=true
  &facet.query=enumfield:A
  &facet.query=enumfield:B
  &rows=0

//

On 1/5/19, 3:01 PM, "Arvydas Silanskas"  
wrote:

Thanks for your reply.

No, not exactly what I want.

Consider I have enum defined as


A
B


and correspondingly I have defined a fieldtype "enumType" that uses this
enum, and a field "enumfield" that is of type "enumType". Consider my index
is like this:

[
  {
"name_s":"Doc 1",
"enumfield":"A",
"user_s":"Foo",
"id":"2ebc0754-e7d8-405e-9962-99c6cd1d9275",
"_version_":1621850725207244800},
  {
"name_s":"Doc 2",
"user_s":"Bar",
"id":"0536827a-703a-456e-9087-71b85b63c58b",
"_version_":1621850725397037056}]

notice how there are no documents that have "enumfield":"B".
Now, if I execute query
"facet.field=enumfield=on=user_s:Bar=on=*:*=json",
my facet response's fields look like this:

"facet_fields":{
  "enumfield":[
"A",0]}

There is no "B" key -- and that's my problem. It tells me about other
facet values if they're filtered out by fq, but it tells me nothing
about facet values that aren't present in any doc.

My question is how to force the response to be
"facet_fields":{
  "enumfield":[
"A",0,
"B", 0]}


2019-01-05, Sat, 19:42 Erick Erickson wrote:

> So really the results you want are q=*:*=enumField right?
> You could fire that query in parallel and combine the two in your app,
> perhaps caching the result if the index isn't changing very rapidly.
>
> Facets were designed with the idea that they'd only count for docs
> that were hits, so there's no built-in way to do what you want. Which, 
BTW,
> could be _very_ expensive in the general case. The query would
> have to count up, say, the hits for 100M documents...
>
> Best,
> Erick
>
> On Sat, Jan 5, 2019 at 1:53 AM Arvydas Silanskas
>  wrote:
> >
> > Hello,
> > I have an enum solr fieldtype. When I do a facet search, I want that all
> > the enum values appear in the facet -- and setting field.mincount = 0 is
> > not enough. It only works, if there exist a document with the matching
> > value for the field, but it was filtered out by current query (and then
> I'm
> > returned that facet value with the count 0). But can I make it to also
> > return the values that literally none of the documents in the index 
have?
> > The values, that only appear in the enum declaration xml.
>



Re: ComplexPhraseQParser vs phrase slop

2018-10-10 Thread David Santamauro
Anyone have any insight here?

On 10/8/18, 3:34 PM, "David Santamauro"  wrote:

Hi, quick question. Should

  1) {!complexphrase inOrder=false}f: ( "cat jump"~2 )

... and

  2) f: ( "cat jump"~2 )

... yield the same results? I'm trying to diagnose a more complicated 
discrepancy that I've boiled down to this simple case. I understand #1 creates 
a SpanQuery and #2 a PhraseQuery but I would have thought without wildcards and 
with the attribute inOrder=false that both would/should yield the exact same 
results. If they should (and they aren't for me), what could the problem be? 
If they shouldn't, could someone explain why?

Thanks




ComplexPhraseQParser vs phrase slop

2018-10-08 Thread David Santamauro
Hi, quick question. Should

  1) {!complexphrase inOrder=false}f: ( "cat jump"~2 )

... and

  2) f: ( "cat jump"~2 )

... yield the same results? I'm trying to diagnose a more complicated 
discrepancy that I've boiled down to this simple case. I understand #1 creates 
a SpanQuery and #2 a PhraseQuery but I would have thought without wildcards and 
with the attribute inOrder=false that both would/should yield the exact same 
results. If they should (and they aren't for me), what could the problem be? 
If they shouldn't, could someone explain why?

Thanks



Re: how to access solr in solrcloud

2018-09-12 Thread David Santamauro
... or haproxy.

On 9/12/18, 10:23 AM, "Vadim Ivanov"  wrote:

Hi,  Steve
If you are using solr1:8983 to access solr and solr1 is down, IMHO nothing
will help you access a dead IP.
You should switch to any other live node in the cluster, or I'd propose
having nginx as a frontend to access
SolrCloud.

-- 
BR, Vadim



-Original Message-
From: Gu, Steve (CDC/OD/OADS) (CTR) [mailto:c...@cdc.gov] 
Sent: Wednesday, September 12, 2018 4:38 PM
To: 'solr-user@lucene.apache.org'
Subject: how to access solr in solrcloud

Hi, all

I am upgrading our solr to 7.4 and would like to set up solrcloud for
failover and load balancing.   There are three zookeeper servers (zk1:2181,
zk1:2182) and two solr instances, solr1:8983 and solr2:8983.  So what solr
url should the client use for access?  Will it be solr1:8983, the
leader?

If we use solr1:8983 to access solr, what happens if solr1:8983 is down?
Will the request be routed to solr2:8983 via zookeeper?  I understand
that zookeeper is doing all the coordination work but wanted to understand
how this works.

Any insight would be greatly appreciated.
Steve





Re: Overlapped Gap Facets

2016-11-17 Thread David Santamauro


I had a similar question a while back but it was regarding date 
differences. Perhaps that might give you some ideas.


http://lucene.472066.n3.nabble.com/date-difference-faceting-td4249364.html
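
A hedged sketch of the facet.query idea applied directly to overlapping date
buckets (field name, host, and collection name are placeholders):

  curl 'http://localhost:8983/solr/yourcollection/select' \
    --data-urlencode 'q=*:*' --data-urlencode 'rows=0' --data-urlencode 'facet=true' \
    --data-urlencode 'facet.query=datefield:[NOW/DAY-1DAY TO NOW]' \
    --data-urlencode 'facet.query=datefield:[NOW/DAY-7DAYS TO NOW]' \
    --data-urlencode 'facet.query=datefield:[NOW/DAY-1MONTH TO NOW]' \
    --data-urlencode 'facet.query=datefield:[NOW/DAY-6MONTHS TO NOW]' \
    --data-urlencode 'facet.query=datefield:[NOW/DAY-1YEAR TO NOW]' \
    --data-urlencode 'facet.query=datefield:[* TO NOW/DAY-1YEAR]'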

//



On 11/17/2016 09:49 AM, Furkan KAMACI wrote:

Is it possible to do such a facet on a date field:

  Last 1 Day
  Last 1 Week
  Last 1 Month
  Last 6 Month
  Last 1 Year
  Older than 1 Year

which has overlapped facet gaps?

Kind Regards,
Furkan KAMACI



Re: Aggregate Values Inside a Facet Range

2016-11-04 Thread David Santamauro


I believe your answer is in the subject
  => facet.range
https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-RangeFaceting
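
A hedged sketch of a range facet over the timestamp field from the question
below (host and collection name are placeholders; note this buckets document
counts, it does not sum the count field):

  curl 'http://localhost:8983/solr/yourcollection/select' \
    --data-urlencode 'q=*:*' --data-urlencode 'rows=0' --data-urlencode 'facet=true' \
    --data-urlencode 'facet.range=timestamp' \
    --data-urlencode 'facet.range.start=NOW/DAY-3DAYS' \
    --data-urlencode 'facet.range.end=NOW/DAY+1DAY' \
    --data-urlencode 'facet.range.gap=+1DAY'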

//

On 11/04/2016 02:25 PM, Furkan KAMACI wrote:

I have documents like that

id:5
timestamp:NOW //pseudo date representation
count:13

id:4
timestamp:NOW //pseudo date representation
count:3

id:3
timestamp:NOW-1DAY //pseudo date representation
count:21

id:2
timestamp:NOW-1DAY //pseudo date representation
count:29

id:1
timestamp:NOW-3DAY //pseudo date representation
count:4

When I want to facet last 3 days data by timestamp its OK. However my need
is that:

facets:
 TODAY: 16 //pseudo representation
 TODAY - 1: 50 //pseudo date representation
 TODAY - 2: 0 //pseudo date representation
 TODAY - 3: 4 //pseudo date representation

I mean, I have to facet by dates and aggregate values inside that facet
range. Is it possible to do that without multiple queries at Solr?

Kind Regards,
Furkan KAMACI



Re: how to remove duplicate from search result

2016-09-27 Thread David Santamauro

Have a look at

https://cwiki.apache.org/confluence/display/solr/Result+Grouping
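
A hedged sketch of grouping on the guid field described below (host and
collection name are placeholders; group.main flattens the result back into a
plain doc list with one document per guid):

  curl 'http://localhost:8983/solr/yourcollection/select' \
    --data-urlencode 'q=*:*' \
    --data-urlencode 'group=true' \
    --data-urlencode 'group.field=guid' \
    --data-urlencode 'group.limit=1' \
    --data-urlencode 'group.main=true'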


On 09/27/2016 11:03 AM, googoo wrote:

hi,

We want to provide remove duplicate from search result function.

like we have below documents.
id(uniqueKey)   guid
doc1            G1
doc2            G2
doc3            G3
doc4            G1

A user runs one query and hits doc1, doc2 and doc4.
The user wants to remove duplicates from the search result based on the guid field.
Since doc1 and doc4 have the same guid, one of them should be dropped from the search
result.

How can we address this requirement?

Thanks,
Yongtao





--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-remove-duplicate-from-search-result-tp4298272.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Removing SOLR fields from schema

2016-09-22 Thread David Santamauro



On 09/22/2016 08:55 AM, Shawn Heisey wrote:

On 9/21/2016 11:46 PM, Selvam wrote:

We use SOLR 5.x in cloud mode and have huge set of fields. We now want
to remove some 50 fields from Index/schema itself so that indexing &
querying will be faster. Is there a way to do that without losing
existing data on other fields? We don't want to do full re-indexing.


When you remove fields from your schema, you can continue to use Solr
with no problems even without a reindex.  But you won't see any benefit
to your query performance until you DO reindex.  Until the reindex is
done (ideally wiping the index first), all the data from the removed
fields will remain in the index and affect your query speeds.


Will an optimize remove those fields and corresponding data?





Re: script to get core num docs

2016-09-19 Thread David Santamauro


https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API

wget -O- -q \
  '/admin/cores?action=STATUS&core=coreName&wt=json&indent=true' \
  | grep numDocs

//


'/admin/cores?action=STATUS&core=alexandria_shard2_replica1&wt=json&indent=1' | grep numDocs | cut -f2 -d':'


On 09/19/2016 11:22 AM, KRIS MUSSHORN wrote:

How can I get the count of docs from a core with bash?
Seems like I have to call Admin/Luke but cant find any specifics.
Thanks
Kris



Re: analyzer for _text_ field

2016-07-15 Thread David Santamauro


The opening and closing single quotes don't match

--data-binary '{ ... }’

it should be:

--data-binary '{ ... }'


On 07/15/2016 02:59 PM, Steve Rowe wrote:

Waldyr, maybe it got mangled by my email client or yours?

Here’s the same command:

   

--
Steve
www.lucidworks.com


On Jul 15, 2016, at 2:16 PM, Waldyr Neto  wrote:

Hy Steves, tks for the help
unfortunately i'm making some mistake

when i try to run



curl -X POST -H 'Content-type: application/json’ \
http://localhost:8983/solr/gettingstarted/schema --data-binary
'{"add-field-type": { "name": "my_new_field_type", "class":
"solr.TextField","analyzer": {"charFilters": [{"class":
"solr.HTMLStripCharFilterFactory"}], "tokenizer": {"class":
"solr.StandardTokenizerFactory"},"filters":[{"class":
"solr.WordDelimiterFilterFactory"}, {"class":
"solr.LowerCaseFilterFactory"}]}},"replace-field": { "name":
"_text_","type": "my_new_field_type", "multiValued": "true","indexed":
"true","stored": "false"}}’

i receave the folow error msg from curl program
:

curl: (3) [globbing] unmatched brace in column 1

curl: (6) Could not resolve host: name

curl: (6) Could not resolve host: my_new_field_type,

curl: (6) Could not resolve host: class

curl: (6) Could not resolve host: solr.TextField,analyzer

curl: (3) [globbing] unmatched brace in column 1

curl: (3) [globbing] bad range specification in column 2

curl: (3) [globbing] unmatched close brace/bracket in column 32

curl: (6) Could not resolve host: tokenizer

curl: (3) [globbing] unmatched brace in column 1

curl: (3) [globbing] unmatched close brace/bracket in column 30

curl: (3) [globbing] unmatched close brace/bracket in column 32

curl: (3) [globbing] unmatched brace in column 1

curl: (3) [globbing] unmatched close brace/bracket in column 28

curl: (3) [globbing] unmatched brace in column 1

curl: (6) Could not resolve host: name

curl: (6) Could not resolve host: _text_,type

curl: (6) Could not resolve host: my_new_field_type,

curl: (6) Could not resolve host: multiValued

curl: (6) Could not resolve host: true,indexed

curl: (6) Could not resolve host: true,stored

curl: (3) [globbing] unmatched close brace/bracket in column 6

cvs1:~ vvisionphp1$

On Fri, Jul 15, 2016 at 2:45 PM, Steve Rowe  wrote:


Hi Waldyr,

An example of changing the _text_ analyzer by first creating a new field
type, and then changing the _text_ field to use the new field type (after
starting Solr 6.1 with “bin/solr start -e schemaless”):

-
PROMPT$ curl -X POST -H 'Content-type: application/json’ \
http://localhost:8983/solr/gettingstarted/schema --data-binary '{
  "add-field-type": {
"name": "my_new_field_type",
"class": "solr.TextField",
"analyzer": {
  "charFilters": [{
"class": "solr.HTMLStripCharFilterFactory"
  }],
  "tokenizer": {
"class": "solr.StandardTokenizerFactory"
  },
  "filters":[{
  "class": "solr.WordDelimiterFilterFactory"
}, {
  "class": "solr.LowerCaseFilterFactory"
  }]}},
  "replace-field": {
"name": "_text_",
"type": "my_new_field_type",
"multiValued": "true",
"indexed": "true",
"stored": "false"
  }}’
-

PROMPT$ curl
http://localhost:8983/solr/gettingstarted/schema/fields/_text_

-
{
  "responseHeader”:{ […] },
  "field":{
"name":"_text_",
"type":"my_new_field_type",
"multiValued":true,
"indexed":true,
"stored":false}}
-

--
Steve
www.lucidworks.com


On Jul 15, 2016, at 12:54 PM, Waldyr Neto  wrote:

Hy, How can i configure the analyzer for the _text_ field?







Re: json facet - date range & interval

2016-06-28 Thread David Santamauro


Have you tried %-escaping?

json.facet = {
  daterange : { type  : range,
field : datefield,
start : "NOW/DAY%2D10DAYS",
end   : "NOW/DAY",
gap   : "%2B1DAY"
  }
}
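
Alternatively, an untested sketch that lets curl do the URL encoding so the +
and - survive as-is (host and collection name are placeholders):

  curl 'http://localhost:8983/solr/yourcollection/select' \
    --data-urlencode 'q=*:*' --data-urlencode 'rows=0' \
    --data-urlencode 'json.facet={daterange:{type:range,field:datefield,start:"NOW/DAY-10DAYS",end:"NOW/DAY",gap:"+1DAY"}}'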


On 06/28/2016 01:19 PM, Jay Potharaju wrote:

json.facet={daterange : {type : range, field : datefield, start :
"NOW/DAY-10DAYS", end : "NOW/DAY",gap:"\+1DAY"} }

Escaping the plus sign also gives the same error. Any other suggestions how
can i make this work?
Thanks
Jay

On Mon, Jun 27, 2016 at 10:23 PM, Erick Erickson 
wrote:


First thing I'd do is escape the plus. It's probably being interpreted
as a space.

Best,
Erick

On Mon, Jun 27, 2016 at 9:24 AM, Jay Potharaju 
wrote:

Hi,
I am trying to use the json range facet with a tdate field. I tried the
following but get an error. Any suggestions on how to fix the following
error /examples for date range facets.

json.facet={daterange : {type : range, field : datefield, start
:"NOW-10DAYS", end : "NOW/DAY", gap : "+1DAY" } }

  msg": "Can't add gap 1DAY to value Fri Jun 17 15:49:36 UTC 2016 for

field:

datefield", "code": 400

--
Thanks
Jay








Re: Deleted documents and expungeDeletes

2016-04-01 Thread David Santamauro


The docs on reclaimDeletesWeight say:

"Controls how aggressively merges that reclaim more deletions are 
favored. Higher values favor selecting merges that reclaim deletions."


I can't imagine you would notice anything after only a few commits. I 
have many shards that size or larger and what I do occasionally is to 
loop an optimize, setting maxSegments with decremented values, e.g.,


for maxSegments in $( seq 40 -1 20 ); do
  # optimize down to $maxSegments segments; host/collection below are placeholders
  curl "http://localhost:8983/solr/yourcollection/update?optimize=true&maxSegments=$maxSegments"
done

It's definitely a poor-man's hack and is clearly not the most efficient 
way of optimizing, but it does remove deletes without requiring double 
or triple the disk space that a full optimize requires. I can usually 
reclaim 100-300GB of disk space in a collection that is currently ~ 2TB 
-- not inconsequential.


Seeing you only have 1.6M documents, perhaps an index rebuild isn't out 
of the question? I did just that on a test collection with 100M 
documents. Starting with 0 deleted docs, a reclaimDeletesWeight=5.0 and 
probably about 1-3% document turnover per week (updates) over the last 3 
months and my deleted percentage is staying below 10%.


If that's not an option, keeping reclaimDeletesWeight at 5.0 and using 
expungeDeletes=true on commit will get that percentage down over time.


//


On 04/01/2016 04:49 AM, Jostein Elvaker Haande wrote:

On 30 March 2016 at 17:46, Erick Erickson  wrote:

through a clever bit of reflection, you can set the
reclaimDeletesWeight variable from solrconfig by including something
like <double name="reclaimDeletesWeight">5</double> (going from memory
here, you'll get an error on startup if I've messed it up.)


I added the following to my solrconfig a couple of days ago:

 
   8
   8
   5.0
 

There has been several commits and the core is current according to
SOLR admin, however I'm still seeing a lot of deleted docs. These are
my current core statistics.

Last Modified:4 minutes ago
Num Docs:1 675 255
Max Doc:2 353 476
Heap Memory Usage:208 464 267
Deleted Docs:678 221
Version:1 870 539
Segment Count:39

Index size is close to 149GB.

So at the moment, I'm seeing a deleted docs to max docs percentage
ratio of 28.81%. With 'reclaimsWeight' set to 5, it doesn't seem to be
deleting away any deleted docs.

Anything obvious I'm missing?



Re: Deleted documents and expungeDeletes

2016-03-30 Thread David Santamauro



On 03/30/2016 08:23 AM, Jostein Elvaker Haande wrote:

On 30 March 2016 at 12:25, Markus Jelsma  wrote:

Hello - with TieredMergePolicy and default reclaimDeletesWeight of 2.0, and 
frequent updates, it is not uncommon to see a ratio of 25%. If you want deletes 
to be reclaimed more often, e.g. weight of 4.0, you will see very frequent 
merging of large segments, killing performance if you are on spinning disks.


Most of our installations are on spinning disks, so if I want a more
aggressive reclaim, this will impact performance. This is of course
something that I do not desire, so I'm wondering if scheduling a
commit with 'expungeDeletes' during off peak business hours is a
better approach than setting up a more aggressive merge policy.



As far as my experimentation with @expungeDeletes goes, if the data you 
indexed and committed using @expungeDeletes didn't touch segments with 
any deleted documents and wasn't enough data to cause merging with a 
segment containing deleted documents, no deleted documents will be 
removed. Basically, @expungeDeletes expunges deletes in segments 
affected by the commit. If you have a large update that touches many 
segments containing deleted documents and you use @expungeDeletes, it 
could be just as resource intensive as an optimize.


My setting for reclaimDeletesWeight:
  <double name="reclaimDeletesWeight">5.0</double>

It keeps the deleted documents down to ~ 10% without any noticable 
impact on resources or performance. But I'm still in the testing phase 
with this setting.




Re: docValues error

2016-02-29 Thread David Santamauro


thanks Shawn, that seems to be the error exactly.
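
A hedged sketch of the workaround Shawn describes below, via the Schema API
(assumes a managed schema; the field name, type, and collection name are
illustrative):

  curl -X POST -H 'Content-type:application/json' \
    http://localhost:8983/solr/yourcollection/schema --data-binary '{
      "add-field":      { "name":"f1_str", "type":"string", "docValues":true,
                          "indexed":true, "stored":false },
      "add-copy-field": { "source":"f1", "dest":"f1_str" }
    }'

... and then group on f1_str instead of f1.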

On 02/29/2016 09:22 AM, Shawn Heisey wrote:

On 2/28/2016 3:31 PM, David Santamauro wrote:


I'm porting a 4.8 schema to 5.3 and I came across this new error when
I tried to group.field=f1:

unexpected docvalues type SORTED_SET for field 'f1' (expected=SORTED).
Use UninvertingReader or index with docvalues.

f1 is defined as

 
   
 
 
 
   
 

   

Notice that I don't have docValues defined. I realize the field type
doesn't allow docValues so why does this group request fail with a
docValues error? It did work with 4.8

Any clue would be appreciated, thanks


It sounds like you are running into pretty much exactly what I did with 5.x.

https://issues.apache.org/jira/browse/SOLR-8088

I had to create a copyField that's a string (StrField) type and include
docValues on that field.  I still can't use my tokenized field like I
want to, as I do in 4.x.

Thanks,
Shawn



Re: docValues error

2016-02-29 Thread David Santamauro



On 02/29/2016 07:59 AM, Tom Evans wrote:

On Mon, Feb 29, 2016 at 11:43 AM, David Santamauro
<david.santama...@gmail.com> wrote:

You will have noticed below, the field definition does not contain
multiValued="true"


What version of the schema are you using? In pre 1.1 schemas,
multiValued="true" is the default if it is omitted.


1.5

Other single-value fields (tint, string) group correctly. The move from 
4.8 to 5.3 has rendered grouping on populated, single-value, 
solr.TextField fields crippled -- at least for me.


Re: docValues error

2016-02-29 Thread David Santamauro




On 02/29/2016 06:05 AM, Mikhail Khludnev wrote:

On Mon, Feb 29, 2016 at 12:43 PM, David Santamauro <
david.santama...@gmail.com> wrote:


unexpected docvalues type SORTED_SET for field 'f1' (expected=SORTED). Use
UninvertingReader or index with docvalues.


  DocValues is the first-class API for accessing the forward-view index, i.e. it
replaced FieldCache. The error is caused by an attempt to group by a
multivalued field, which is explicitly claimed as unsupported in the doc.



You will have noticed below, the field definition does not contain 
multiValued="true"




On 02/28/2016 05:31 PM, David Santamauro wrote:



f1 is defined as

  

  
  
  

  





Re: docValues error

2016-02-29 Thread David Santamauro


So I started over (deleted all documents), re-deployed configs to 
zookeeper and reloaded the collection.


This error still appears when I group.field=f1

unexpected docvalues type SORTED_SET for field 'f1' (expected=SORTED). 
Use UninvertingReader or index with docvalues.


What exactly does this error mean and why am I getting it with a field 
that doesn't even have docValues defined?


Why is the DocValues code being used when docValues are not defined 
anywhere in my schema.xml?



null:java.lang.IllegalStateException: unexpected docvalues type 
SORTED_SET for field 'f1' (expected=SORTED). Use UninvertingReader or 
index with docvalues.

at org.apache.lucene.index.DocValues.checkField(DocValues.java:208)
at org.apache.lucene.index.DocValues.getSorted(DocValues.java:264)
	at 
org.apache.lucene.search.grouping.term.TermFirstPassGroupingCollector.doSetNextReader(TermFirstPassGroupingCollector.java:92)
	at 
org.apache.lucene.search.SimpleCollector.getLeafCollector(SimpleCollector.java:33)
	at 
org.apache.lucene.search.MultiCollector.getLeafCollector(MultiCollector.java:117)
	at 
org.apache.lucene.search.TimeLimitingCollector.getLeafCollector(TimeLimitingCollector.java:144)
	at 
org.apache.lucene.search.MultiCollector.getLeafCollector(MultiCollector.java:117)

at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:763)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:486)
	at 
org.apache.solr.search.grouping.CommandHandler.searchWithTimeLimiter(CommandHandler.java:233)
	at 
org.apache.solr.search.grouping.CommandHandler.execute(CommandHandler.java:160)
	at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:398)


etc ...



On 02/28/2016 05:31 PM, David Santamauro wrote:


I'm porting a 4.8 schema to 5.3 and I came across this new error when I
tried to group.field=f1:

unexpected docvalues type SORTED_SET for field 'f1' (expected=SORTED).
Use UninvertingReader or index with docvalues.

f1 is defined as

 
   
 
 
 
   
 

   

Notice that I don't have docValues defined. I realize the field type
doesn't allow docValues so why does this group request fail with a
docValues error? It did work with 4.8

Any clue would be appreciated, thanks

David


docValues error

2016-02-28 Thread David Santamauro


I'm porting a 4.8 schema to 5.3 and I came across this new error when I 
tried to group.field=f1:


unexpected docvalues type SORTED_SET for field 'f1' (expected=SORTED). 
Use UninvertingReader or index with docvalues.


f1 is defined as

positionIncrementGap="100">

  



  


  required="true" />


Notice that I don't have docValues defined. I realize the field type 
doesn't allow docValues so why does this group request fail with a 
docValues error? It did work with 4.8


Any clue would be appreciated, thanks

David


Re: date difference faceting

2016-01-08 Thread David Santamauro


For anyone wanting to know an answer, I used

facet.query={!frange l=0 u=3110400}ms(d_b,d_a)
facet.query={!frange l=3110401 u=6220800}ms(d_b,d_a)
facet.query={!frange l=6220801 u=15552000}ms(d_b,d_a)

etc ...

Not the prettiest nor most efficient but accomplishes what I need 
without re-indexing TBs of data.


thanks.

On 01/08/2016 12:09 PM, Erick Erickson wrote:

I'm going to side-step your primary question and say that it's nearly
always best to do your calculations up-front during indexing to make
queries more efficient and thus serve more requests on the same
hardware. This assumes that the stat you're interested in is
predictable of course...

Best,
Erick

On Fri, Jan 8, 2016 at 2:23 AM, David Santamauro
<david.santama...@gmail.com> wrote:


Hi,

I have two date fields, d_a and d_b, both of type solr.TrieDateField, that
represent different events associated with a particular document. The
interval between these dates is relevant for corner-case statistics. The
interval is calculated as the difference: sub(d_b,d_a) and I've been able to

   stats=true&stats.field={!func}sub(d_b,d_a)

What I ultimately would like to report is the interval represented as a
range, which could be seen as facet.query

(pseudo code)
   facet.query=sub(d_b,d_a)[ * TO 8640 ] // day
   facet.query=sub(d_b,d_a)[ 8641 TO 60480 ] // week
   facet.query=sub(d_b,d_a)[ 60481 TO 259200 ] // month
etc.

Aside from actually indexing the difference in a separate field, is there
something obvious I'm missing? I'm on SOLR 5.2 in cloud mode.

thanks
David


date difference faceting

2016-01-08 Thread David Santamauro


Hi,

I have two date fields, d_a and d_b, both of type solr.TrieDateField, 
that represent different events associated with a particular document. 
The interval between these dates is relevant for corner-case statistics. 
The interval is calculated as the difference: sub(d_b,d_a) and I've been 
able to


  stats=true&stats.field={!func}sub(d_b,d_a)

What I ultimately would like to report is the interval represented as a 
range, which could be seen as facet.query


(pseudo code)
  facet.query=sub(d_b,d_a)[ * TO 8640 ] // day
  facet.query=sub(d_b,d_a)[ 8641 TO 60480 ] // week
  facet.query=sub(d_b,d_a)[ 60481 TO 259200 ] // month
etc.

Aside from actually indexing the difference in a separate field, is 
there something obvious I'm missing? I'm on SOLR 5.2 in cloud mode.


thanks
David


Re: How to check when a search exceeds the threshold of timeAllowed parameter

2015-12-23 Thread David Santamauro



On 12/23/2015 01:42 AM, William Bell wrote:

I agree that when using timeAllowed in the header info there should be an
entry that indicates timeAllowed triggered.


If I'm not mistaken, there is
 => partialResults:true

  "responseHeader":{ "partialResults":true }

//



This is the only reason why we have not used timeAllowed. So this is a
great suggestion. Something like: 1 ??
That would be great.


0
1
107

*:*
1000





On Tue, Dec 22, 2015 at 6:43 PM, Vincenzo D'Amore 
wrote:


Well... I can write everything, but really all this just to understand
when the timeAllowed
parameter triggers a partial answer? I mean, isn't there anything set in the
response when it is partial?

On Wed, Dec 23, 2015 at 2:38 AM, Walter Underwood 
wrote:


We need to know a LOT more about your site. Number of documents, size of
index, frequency of updates, length of queries approximate size of server
(CPUs, RAM, type of disk), version of Solr, version of Java, and features
you are using (faceting, highlighting, etc.).

After that, we’ll have more questions.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Dec 22, 2015, at 4:58 PM, Vincenzo D'Amore 

wrote:


Hi All,

my website is under pressure, there is a big number of concurrent

searches.

When the connected users are too many, the searches becomes so slow

that

in

some cases users have to wait many seconds.
The queue of searches becomes so long that, in same cases, servers are
blocked trying to serve all these requests.
As far as I know because some searches are very expensive, and when

many

expensive searches clog the queue server becomes unresponsive.

In order to quickly workaround this herd effect, I have added a
default timeAllowed to 15 seconds, and this seems help a lot.

But during stress tests but I'm unable to understand when and what

requests

are affected by timeAllowed parameter.

Just be clear, I have configure timeAllowed parameter in a SolrCloud
environment, given that partial results may be returned (if there are

any),

how can I know when this happens? When the timeAllowed parameter

trigger

a

partial answer?

Best regards,
Vincenzo



--
Vincenzo D'Amore
email: v.dam...@gmail.com
skype: free.dev
mobile: +39 349 8513251






--
Vincenzo D'Amore
email: v.dam...@gmail.com
skype: free.dev
mobile: +39 349 8513251







collection mbeans: requests

2015-08-04 Thread David Santamauro


I have a question about how the stat 'requests' is calculated. I would 
really appreciate it if anyone could shed some light on the figures below.


Assumptions:
  version: 5.2.0
  layout: 8 node solrcloud, no replicas (node71-node78)
  collection: col1
  handler: /search
  stats request: /col1/admin/mbeans?stats=true&cat=QUERYHANDLER&wt=json

I wrote a simple shell script that grabs the requests stats member from 
every node.


After collection reload
node 71 -- requests: 2
node 72 -- requests: 2
node 73 -- requests: 2
node 74 -- requests: 2
node 75 -- requests: 2
node 76 -- requests: 2
node 77 -- requests: 2
node 78 -- requests: 2
* I assume these are the auto-warm searches


After submitting 1 request (q=*:*)
node 71 -- requests: 4
node 72 -- requests: 3
node 73 -- requests: 3
node 74 -- requests: 3
node 75 -- requests: 3
node 76 -- requests: 4
node 77 -- requests: 3
node 78 -- requests: 3

After resubmitting the same request
node 71 -- requests: 6
node 72 -- requests: 4
node 73 -- requests: 4
node 74 -- requests: 4
node 75 -- requests: 4
node 76 -- requests: 5
node 77 -- requests: 5
node 78 -- requests: 4

If that wasn't strange enough, things get out of control if I add in 
facet.pivot parameter(s)


Fresh after reload (see above, 2 for every node)

Total after a facet.pivot on two fields
node 71 -- requests: 13
node 72 -- requests: 15
node 73 -- requests: 14
node 74 -- requests: 12
node 75 -- requests: 14
node 76 -- requests: 12
node 77 -- requests: 14
node 78 -- requests: 12

I imagine I'm seeing the internal cross-talk between nodes and if so, 
how can one reliably keep stats on the number of real requests?


thanks

David


Re: collection mbeans: requests

2015-08-04 Thread David Santamauro


I have your suggested shards.qt set up in another collection for another 
reason but I'll do that redirect here as well, thanks for the confirmation.


On 08/04/2015 10:45 AM, Shawn Heisey wrote:

On 8/4/2015 5:19 AM, David Santamauro wrote:


I have a question about how the stat 'requests' is calculated. I would
really appreciate it if anyone could shed some light on the figures below.

Assumptions:
   version: 5.2.0
   layout: 8 node solrcloud, no replicas (node71-node78)
   collection: col1
   handler: /search
   stats request: /col1/admin/mbeans?stats=true&cat=QUERYHANDLER&wt=json

I wrote a simple shell script that grabs the requests stats member from
every node.

After collection reload
node 71 -- requests: 2
node 72 -- requests: 2
node 73 -- requests: 2
node 74 -- requests: 2
node 75 -- requests: 2
node 76 -- requests: 2
node 77 -- requests: 2
node 78 -- requests: 2
* I assume these are the auto-warm searches


After submitting 1 request (q=*:*)
node 71 -- requests: 4
node 72 -- requests: 3
node 73 -- requests: 3
node 74 -- requests: 3
node 75 -- requests: 3
node 76 -- requests: 4
node 77 -- requests: 3
node 78 -- requests: 3

After resubmitting the same request
node 71 -- requests: 6
node 72 -- requests: 4
node 73 -- requests: 4
node 74 -- requests: 4
node 75 -- requests: 4
node 76 -- requests: 5
node 77 -- requests: 5
node 78 -- requests: 4

If that wasn't strange enough, things get out of control if I add in
facet.pivot parameter(s)

Fresh after reload (see above, 2 for every node)

Total after a facet.pivot on two fields
node 71 -- requests: 13
node 72 -- requests: 15
node 73 -- requests: 14
node 74 -- requests: 12
node 75 -- requests: 14
node 76 -- requests: 12
node 77 -- requests: 14
node 78 -- requests: 12

I imagine I'm seeing the internal cross-talk between nodes and if so,
how can one reliably keep stats on the number of real requests?


Queries on distributed indexes change from the one request that you make
into a request to every shard, to check for relevant documents.  If
relevant documents are found, a second call to those specific shards is
made to retrieve those documents.  So if you have 5 shards in your
index, there could be up to 11 requests counted for a single query.  If
all the shards are on separate nodes, then for that 11-request query,
one of those nodes would count three requests and the others would count
two.

I know what I'm going to say next would work on an index that is
distributed but *not* SolrCloud, and I think it will work in SolrCloud too.

If you add a shards.qt parameter to defaults in your main request
handler (usually /select) that points at another, identically configured
handler (perhaps named /shards) that is also in solrconfig.xml, then
that other handler should receive the distributed requests and the main
handler should only count the real requests.  You would be able to
track those numbers separately.

Thanks,
Shawn



Re: Frequent deletions

2015-01-11 Thread David Santamauro

[ disclaimer: this worked for me, ymmv ... ]

I just battled this. Turns out incrementally optimizing using the
maxSegments attribute was the most efficient solution for me. In
particular when you are actually running out of disk space. 

#!/bin/bash

# n-segments I started with
high=400
# n-segments I want to optimize down to
low=300

for i in $(seq $high -10 $low); do
  # your optimize call with maxSegments=$i
  sleep 2
done

I was able to shrink my +3TB index by about 300GB optimizing
from 400 segments down to 300 (10 at a time). It optimized out the .del
files for those segments that had one and, best of all, because you are only
rewriting 10 segments per loop, the disk space footprint stays tolerable ... 
at least compared to a commit with @expungeDeletes=true or, of course, an
optimize without @maxSegments, which basically rewrites the entire index.

NOTE: it wreaks havoc on the system, so expect search slowdown and best
not to index while this is going on either.

David


On Sun, 2015-01-11 at 06:46 -0700, ig01 wrote:
 Hi,
 
 It's not an option for us, all the documents in our index have same deletion
 probability.
 Is there any other solution to perform an optimization in order to reduce
 index size?
 
 Thanks in advance.
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689p4178720.html
 Sent from the Solr - User mailing list archive at Nabble.com.





Re: A bad idea to store core data directory over NAS?

2014-11-04 Thread David Santamauro


Interestingly enough, one of our installations has a 16-node cluster 
using 4 NAS devices (xen as virtualization backbone). The data drive for 
the individual node that holds the index is a stripe of 2x 500GB disks. 
Each disk of the stripe is on a different NAS device (scattered 
pattern). With a total index size (not including replicas) of over 2TB, 
performance is pretty snappy.


Indexing, of course, is resource intensive (disk I/O on the NAS as well 
as network bandwidth). Also, other activity on each NAS by other NFS 
clients could severely impact performance of search and index, so one 
needs to be aware of contentious activity.


David


On 11/4/2014 4:59 PM, Walter Underwood wrote:

I did that once by accident. It was 100X slower.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/

On Nov 4, 2014, at 1:57 PM, Gili Nachum gilinac...@gmail.com wrote:


My data center is out of SAN or local disk storage - is it a big no-no to
store Solr core data folder over NAS?
That means 1. Lucene index 2. Transaction log.

The NAS mount would be accessed by a single machine. I do care about
performance.

If I do go with NAS. Should I expect index corruption and other oddities?




moving to new core.properties setup

2014-06-11 Thread David Santamauro


I have configured many tomcat+solrCloud setups but I'm now trying to 
research the new core.properties configuration.


I have a functioning zookeeper to which I manually loaded a 
configuration using:


zkcli.sh -cmd upconfig \
  -zkhost xx.xx.xx.xx:2181 \
  -d /test/conf \
  -n test

My solr.xml looks like:

<solr>
  <str name="coreRootDirectory">/test/data</str>
  <bool name="sharedSchema">true</bool>
  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">8080</int>
    <str name="hostContext">${hostContext:/test}</str>
    <int name="zkClientTimeout">${zkClientTimeout:3}</int>
    <str name="zkhost">xx.xx.xx.xx:2181</str>
  </solrcloud>
  <shardHandlerFactory name="shardHandlerFactory"
                       class="HttpShardHandlerFactory">
    <int name="socketTimeout">${socketTimeout:0}</int>
    <int name="connTimeout">${connTimeout:0}</int>
  </shardHandlerFactory>
</solr>

... all fine. I start tomcat and I see

 Loading container configuration from /test/solr.xml
[...]
 Looking for core definitions underneath /test/data
 Found 0 core definitions

which is anticipated as I have not created any cores or collections.

Then, trying to create a collection

wget -O- \
  'http://xx.xx.xx.xx/test/admin/collections?action=CREATE&name=testCollection&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.config=test&property.dataDir=/test/data/testCollection&property.instanceDir=/test'

I get:

 org.apache.solr.common.SolrException: Solr instance is not running in 
SolrCloud mode.


Hrmmm, here I am confused. I have a working zookeeper, I have a loaded 
configuration, I have an empty data directory (no collections, cores, 
core.properties etc) and I have specified the zkHost configuration 
parameter in my solr.xml (yes, IP:port is correct)


What exactly am I missing?

thanks for the help.

David



Re: Stuck on SEVERE: Error filterStart

2014-04-16 Thread David Santamauro


You need to copy solr/example/lib/ext/*.jar into your tomcat lib 
directory (/usr/share/tomcat/lib)


Also make sure a /usr/share/tomcat/conf/log4j.properties is there as well.

... then restart.
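
A hedged sketch of those steps for the layout described below (the
log4j.properties source location and the service name are assumptions):

  cp /opt/solr/example/lib/ext/*.jar /usr/share/tomcat/lib/
  cp /opt/solr/example/resources/log4j.properties /usr/share/tomcat/conf/
  service tomcat restart    # service name may differ, e.g. tomcat7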

HTH

David


On 4/16/2014 11:47 AM, Arthur Pemberton wrote:

I am trying Solr for the first time, and I am stuck at the error SEVERE:
Error filterStart

My setup:
  - Centos 6.x
  - OpenJDK 1.7
  - Tomcat 7

 From reading [1] I believe the issue is missing JAR files, but I have no
idea where to put them, even the wiki is a bit vague on that.

Lib directories that I am aware of
  - /usr/share/tomcat/lib (for tomcat)
  - /opt/solr/example/solr/collection1/lib (for my instance)


This is the error I get:

Apr 15, 2014 11:35:36 PM org.apache.catalina.core.StandardContext
filterStart
SEVERE: Exception starting filter SolrRequestFilter
java.lang.NoClassDefFoundError: Failed to initialize Apache Solr: Could not
find necessary SLF4j logging jars. If using Jetty, the SLF4j logging jars
need to go in the jetty lib/ext directory. For other containers, the
corresponding directory should be used. For more information, see:
http://wiki.apache.org/solr/SolrLogging
 at
org.apache.solr.servlet.CheckLoggingConfiguration.check(CheckLoggingConfiguration.java:28)
 at
org.apache.solr.servlet.BaseSolrFilter.clinit(BaseSolrFilter.java:31)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
 at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at java.lang.Class.newInstance(Class.java:374)
 at
org.apache.catalina.core.DefaultInstanceManager.newInstance(DefaultInstanceManager.java:134)
 at
org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:256)
 at
org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
 at
org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:103)
 at
org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4650)
 at
org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5306)
 at
org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
 at
org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
 at
org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
 at
org.apache.catalina.core.StandardHost.addChild(StandardHost.java:633)
 at
org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:657)
 at
org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1637)
 at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)

I would like to get past this so I can try out Solr.

I have gone as far as putting `<lib
dir="/opt/solr/example/solr/collection1/lib/" regex="*\.jar" />`
into /opt/solr/example/solr/collection1/conf/solrconfig.xml but that did
not help.

I have used Java before, but purely for academic purposes, so I do not have
experience resolving these dependencies.



[1] https://wiki.apache.org/solr/SolrLogging



Re: Strange relevance scoring

2014-04-08 Thread David Santamauro


Is there any general setting that removes this punishment or must 
omitNorms=false be part of every field definition?



On 4/8/2014 7:04 AM, Ahmet Arslan wrote:

Hi,

The length norm is computed for every document at index time. I think it is 
1/sqrt(number of terms). Please see section 6, norm(t,d), at

https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html


If you don't care about length normalisation, you can set omitNorms=true in 
field declarations. http://wiki.apache.org/solr/SchemaXml#Common_field_options



On Tuesday, April 8, 2014 1:57 PM, John Nielsen j...@mcb.dk wrote:
Hi,

I couldn't find any occurrence of SpanFirstQuery in either the schema.xml
or solrconfig.xml files.

This is the query i used with debug=results.
http://pastebin.com/bWzUkjKz

And here is the answer.
http://pastebin.com/nCXFcuky

I am not sure what I am supposed to be looking for.



On Tue, Apr 8, 2014 at 11:34 AM, Markus Jelsma
markus.jel...@openindex.iowrote:


Hi - the thing you describe is possible when your set up uses
SpanFirstQuery. But to be sure what's going on you should post the debug
output.

-Original message-

From:John Nielsen j...@mcb.dk
Sent: Tuesday 8th April 2014 11:03
To: solr-user@lucene.apache.org
Subject: Strange relevance scoring

Hi,

We are seeing a strange phenomenon with our Solr setup which I have been
unable to answer.

My Google-fu is clearly not up to the task, so I am trying here.

It appears that if i do a freetext search for a single word, say

modellering

on a text field, the scoring is massively boosted if the first word of

the

text field is a hit.

For instance if there is only one occurrence of the word modellering in
the text field and that occurrence is the first word of the text, then

that

document gets a higher relevancy than if the word modelling occurs 5
times in the text and the first word of the text is any other word.

Is this normal behavior? Is special attention paid to the first word in a
text field? I would think that the latter case would get the highest

score.



--
Med venlig hilsen / Best regards

*John Nielsen*
Programmer



*MCB A/S*
Enghaven 15
DK-7500 Holstebro

Kundeservice: +45 9610 2824
p...@mcb.dk
www.mcb.dk












Re: Facetting by field then query

2014-03-27 Thread David Santamauro


For pivot facets in SolrCloud, see
  https://issues.apache.org/jira/browse/SOLR-2894

Resolution: Unresolved
Fix Version/s 4.8

I am waiting patiently ...

On 03/27/2014 05:04 AM, Alvaro Cabrerizo wrote:

I don't think you can do it, as pivot faceting
(http://wiki.apache.org/solr/SimpleFacetParameters#Pivot_.28ie_Decision_Tree.29_Faceting)
doesn't let you use facet queries.  The closest query I can imagine is:


- q=sentence:bar OR sentence:foo
- facet=true
- facet.pivot=media_id,sentence

At least the q will restrict the faceting to those documents containing foo
or bar, but depending on the size of the sentence field you can get a huge
response.
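
Assembled into a single request, that suggestion would look roughly like this
sketch (host, port and core name are placeholders; note the caveat above about
response size when pivoting on a tokenized field):

  http://localhost:8983/solr/collection1/select?q=sentence%3Afoo+OR+sentence%3Abar&rows=0&facet=true&facet.pivot=media_id,sentence&wt=json&indent=true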

Hope it helps.


On Wed, Mar 26, 2014 at 11:12 PM, David Larochelle 
dlaroche...@cyber.law.harvard.edu wrote:


I have the following schema

<field name="id" type="string" indexed="true" stored="true" required="true"
       multiValued="false" />
<field name="media_id" type="int" indexed="true" stored="true"
       required="false" multiValued="false" />
<field name="sentence" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true" />


I'd like to be able to facet by a field and then by queries. i.e.


facet_fields: {
  media_id: [
    1: { sentence:foo: 102410, sentence:bar: 29710 },
    2: { sentence:foo: 600,    sentence:bar: 220 },
    3: { sentence:foo: 80,     sentence:bar: 2330 }
  ]
}


However, when I try:

http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true&facet=true&facet.query=sentence%3Afoo&facet.query=sentence%3Abar&facet.field=media_id

the facet counts for the queries and media_id are listed separately rather
than hierarchically.

I realize that I could use 2 separate requests and programmatically combine
the results but would much prefer to use a single Solr request.

Is there any way to do this in Solr?

Thanks in advance,


David







Re: Facets, termvectors, relevancy and Multi word tokenizing

2014-02-28 Thread David Santamauro


Have you tried to just use a copyField? For example, I had a similar use 
case where I needed to have a particular field (f1) tokenized but also 
needed to facet on the complete contents.


For that, I created a copyField

  <copyField source="f1" dest="f2" />

f1 used tokenizers and filters but f2 was just a plain string. You then 
facet on f2


... just an idea
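
A minimal sketch of that idea, with hypothetical field names (f1 analyzed for
searching, f2 a plain string used only for faceting):

  <field name="f1" type="text_general" indexed="true" stored="true" />
  <field name="f2" type="string" indexed="true" stored="false" />
  <copyField source="f1" dest="f2" />

Queries then search against f1 and facet with facet.field=f2.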



On 02/28/2014 04:54 AM, epnRui wrote:

Hi Ahmet!!

I went ahead and did something I thought was not a clean solution, and then
when I read your post I found we thought of the same solution, including
European_Parliament with the _  :)

So I guess there would be no way to do this more cleanly, maybe only by
implementing my own Tokenizer and Filters, but I honestly couldn't find a
tutorial for implementing a customized Solr Tokenizer. If I end up needing to
do it I will write a tutorial.

So for now I'm using PatternReplaceCharFilterFactory to replace European
Parliament with MD5HashEuropean_Parliament (initially I didn't use the
md5hash, just European_Parliament).

Then, after the StandardTokenizerFactory has run, I replace it back to
European Parliament. Well, I guess I just found a way to make a 2-word token
:)

I had seen the ShingleFilterFactory, but the problem is I don't need the
whole phrase in tokens of 2 words, which I understand is what it does. Of
course I would need some filter that reads a .txt with the tokens to
merge, like European and Parliament.

I'm still having another problem, but maybe I'll find a solution after I
read the page you linked, which seems great. Solr is treating #European as
both #European and European, meaning it produces 2 facets for one token. I
want it to consider it only as #European. I ran the analyzer debugger in my
Solr admin console and I don't see how it can be doing that.
Would you know of a reason for this?

Thanks for your reply; the page you linked seems excellent and I'll read
it through.
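
A field type along the lines epnRui describes might look like this sketch
(the pattern/replacement pair is purely illustrative; a real setup would need
one rule, or a combined regex, per phrase to be merged):

  <fieldType name="text_phrases" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="European Parliament" replacement="European_Parliament" />
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
  </fieldType>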



--





boost group doclist members

2014-02-11 Thread David Santamauro


Without falling into the x/y problem area, I'll explain what I want to 
do: I would like to group my result set by a field, f1, and within each 
group, I'd like to boost the score of the most appropriate member of 
the group so it appears first in the doc list.


The most appropriate member is defined by the content of other fields 
(e.g., f2, f3). So basically, I'd like to boost based on the values in 
fields f2 and f3.


If there is a better way to achieve this, I'm all ears. But I was 
thinking this could be achieved by using a function query as the 
sortspec to group.sort.


Example content:

<doc>
  <field name="f1">4181770</field> <!-- integer -->
  <field name="f2">x_val</field>   <!-- text -->
  <field name="f3">100</field>     <!-- integer -->
</doc>
<doc>
  <field name="f1">4181770</field>
  <field name="f2">y_val</field>
  <field name="f3">100</field>
</doc>
<doc>
  <field name="f1">4181770</field>
  <field name="f2">z_val</field>
  <field name="f3">100</field>
</doc>

All 3 of the above documents will be grouped into a doclist with 
groupValue=4181770. My question is then: how do I make the document 
with f2=y_val appear first in the doclist? I've been playing with


group.field=f1
group.sort=query({!dismax qf=f2 bq=f2:y_val^100}) asc

... but I'm getting:
org.apache.solr.common.SolrException: Can't determine a Sort Order (asc 
or desc) in sort spec 'query({!dismax qf=f2 bq=f2:y_val^100.0}) asc', 
pos=14.


Can anyone point to a some examples of this?

thanks

David
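
One possible workaround for the sort-spec parsing error, sketched with the
same hypothetical field names (untested against this exact setup): move the
nested query into its own request parameter and dereference it, so that the
sort parser only sees query($bqq) followed by an explicit direction:

  group=true&group.field=f1
  &group.sort=query($bqq) desc
  &bqq={!dismax qf=f2}y_val

Documents whose f2 matches the boost query score higher on that function and
therefore sort to the top of each group's doclist.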



Re: UTF-8 encoding problems while replicating an index using SolrCloud

2014-02-05 Thread David Santamauro


I had that same error. I cleared it up by commenting out all the 
/update/xxx handlers and changing the class of /update to
solr.UpdateRequestHandler.


Hope that helps

David
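
In solrconfig.xml terms that roughly amounts to a single handler definition
like the following sketch (adjust to your own configuration):

  <requestHandler name="/update" class="solr.UpdateRequestHandler" />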


On 02/05/2014 01:37 PM, Ugo Matrangolo wrote:

Hi,

we are having problems with an installation of SolrCloud where a leader
node kicks off indexing and tries to replicate all the updates using
the UpdateHandler.

What we get instead is an error about a wrong UTF-8 encoding from the
leader trying to call the /update endpoint on the replica:

request:
http://10.40.0.25:9765/skus/update?update.chain=custom_version_=-1459207589104451584update.distrib=FROMLEADERupdate.from=http%3A%2
http://10.40.0.25:9765/gilt-by-sku/update?update.chain=custom_version_=-1459207589104451584update.distrib=FROMLEADERupdate.from=http%3A%2\F%2F10.40.0.24%3A9765%2Fskus%2Fwt=javabinversion=2
 at
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
 at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)

While on the replica we get this:

2014-02-05 14:00:00,226 [qtp-108] INFO
  org.apache.solr.update.processor.LogUpdateProcessor  - [skus] webapp=
path=/update
params={update.distrib=FROMLEADER_version_=-1459207589104451584update.from=http://10.40.0.24:9765/skus/wt=javabinversion=2update.chain=custom
http://10.40.0.24:9765/gilt-by-sku/wt=javabinversion=2update.chain=custom}
{} 0 71
2014-02-05 14:00:00,227 [qtp-108] ERROR org.apache.solr.core.SolrCore  -
org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe0
(at char #1, byte #-1)
 at
org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:176)
 at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
 at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)

I have tried to sanitize all my docs, making sure all the strings are in
UTF-8, but it does not work.

Also attached is the HTTP conversation that produces the error.

Would love to understand what is going on here :)

Thank you,
Ugo





Re: shard1 gone missing ... (upgrade to 4.6.1)

2014-02-03 Thread David Santamauro


Mark, I am testing the upgrade and indexing gives me this error:

914379 [http-apr-8080-exec-4] ERROR org.apache.solr.core.SolrCore  ? 
org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe0 (at 
char #1, byte #-1)


... and a bunch of these

request: 
http://xx.xx.xx.xx/col1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fxx.xx.xx.xx%3A8080%2Fcol1%2F&wt=javabin&version=2
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)
1581335 [updateExecutor-1-thread-7] ERROR 
org.apache.solr.update.StreamingSolrServers  ? error

org.apache.solr.common.SolrException: Bad Request


Nothing else in the process chain has changed. Does this have anything 
to do with the deprecation warnings:


WARN  org.apache.solr.handler.UpdateRequestHandler  ? Using deprecated 
class: XmlUpdateRequestHandler -- replace with UpdateRequestHandler


thanks

David


On 01/31/2014 11:22 AM, Mark Miller wrote:



On Jan 31, 2014, at 11:15 AM, David Santamauro david.santama...@gmail.com 
wrote:


On 01/31/2014 10:22 AM, Mark Miller wrote:


I’d also highly recommend you try moving to Solr 4.6.1 when you can though. We 
have fixed many, many, many bugs around SolrCloud in the 4 releases since 4.4. 
You can follow the progress in the CHANGES file we update for each release.


Can I do a drop-in replacement of 4.4.0 ?




It should be a drop-in replacement. For some that use deep APIs in plugins, 
you might have to make a couple of small changes to your code.

Always best to do a test with a copy of your index, but for most, it should be a 
drop-in replacement.

- Mark

http://about.me/markrmiller





Re: need help in understating solr cloud stats data

2014-02-03 Thread David Santamauro


Zabbix 2.2 has a JMX client built in, as well as a few JVM templates. I 
wrote my own templates for my Solr instance, and monitoring and graphing 
are wonderful.


David


On 02/03/2014 12:55 PM, Joel Cohen wrote:

I had to come up with some Solr stats monitoring for my Zabbix instance. I
found that using JMX was the easiest way for us.

There is a command line jmx client that works quite well for me.
http://crawler.archive.org/cmdline-jmxclient/

I wrote a shell script to wrap around that and shove the data back to
Zabbix for ingestion and monitoring. I've listed the stats that I am
gathering, and the mbean that is called. My shell script is rather
simplistic.

#!/bin/bash

cmdLineJMXJar=/usr/local/lib/cmdline-jmxclient.jar
jmxHost=$1
port=$2
query=$3
value=$4

java -jar ${cmdLineJMXJar} user:pass ${jmxHost}:${port} ${query} ${value} \
    2>&1 | awk '{print $NF}'

The script is called like so: jmxstats.sh <solr server name or IP> <jmx port>
<name of mbean> <value to query from mbean>
My collection name is productCatalog, so swap that with yours.
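
For example, pulling the request count for the /select handler might look
like this (host, port and credentials are placeholders):

  ./jmxstats.sh solr01.example.com 18983 \
      "solr/productCatalog:id=org.apache.solr.handler.component.SearchHandler,type=/select" \
      requests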

select requests:
solr/productCatalog:id=org.apache.solr.handler.component.SearchHandler,type=/select
requests
select errors:
solr/productCatalog:id=org.apache.solr.handler.component.SearchHandler,type=/select
errors
95th percentile request time:
solr/productCatalog:id=org.apache.solr.handler.component.SearchHandler,type=/select
95thPcRequestTime
update requests:
solr/productCatalog:id=org.apache.solr.handler.UpdateRequestHandler,type=/update
requests
update errors:
solr/productCatalog:id=org.apache.solr.handler.UpdateRequestHandler,type=/update
errors
95th percentile update time:
solr/productCatalog:id=org.apache.solr.handler.UpdateRequestHandler,type=/update
95thPcRequestTime

query result cache lookups:
solr/productCatalog:id=org.apache.solr.search.LRUCache,type=queryResultCache
cumulative_lookups
query result cache inserts:
solr/productCatalog:id=org.apache.solr.search.LRUCache,type=queryResultCache
cumulative_inserts
query result cache evictions:
solr/productCatalog:id=org.apache.solr.search.LRUCache,type=queryResultCache
cumulative_evictions
query result cache hit ratio:
solr/productCatalog:id=org.apache.solr.search.LRUCache,type=queryResultCache
cumulative_hitratio

document cache lookups:
solr/productCatalog:id=org.apache.solr.search.LRUCache,type=documentCache
cumulative_lookups
document cache inserts:
solr/productCatalog:id=org.apache.solr.search.LRUCache,type=documentCache
cumulative_inserts
document cache evictions:
solr/productCatalog:id=org.apache.solr.search.LRUCache,type=documentCache
cumulative_evictions
document cache hit ratio:
solr/productCatalog:id=org.apache.solr.search.LRUCache,type=documentCache
cumulative_hitratio

filter cache lookups:
solr/productCatalog:type=filterCache,id=org.apache.solr.search.FastLRUCache
cumulative_lookups
filter cache inserts:
solr/productCatalog:type=filterCache,id=org.apache.solr.search.FastLRUCache
cumulative_inserts
filter cache evictions:
solr/productCatalog:type=filterCache,id=org.apache.solr.search.FastLRUCache
cumulative_evictions
filter cache hit ratio:
solr/productCatalog:type=filterCache,id=org.apache.solr.search.FastLRUCache
cumulative_hitratio

field value cache lookups:
solr/productCatalog:type=fieldValueCache,id=org.apache.solr.search.FastLRUCache
cumulative_lookups
field value cache inserts:
solr/productCatalog:type=fieldValueCache,id=org.apache.solr.search.FastLRUCache
cumulative_inserts
field value cache evictions:
solr/productCatalog:type=fieldValueCache,id=org.apache.solr.search.FastLRUCache
cumulative_evictions
field value cache hit ratio:
solr/productCatalog:type=fieldValueCache,id=org.apache.solr.search.FastLRUCache
cumulative_hitratio

This set of stats gets me a pretty good idea of what's going on with my
SolrCloud at any time. Anyone have any thoughts or suggestions?

Joel Cohen
Senior System Engineer
Bluefly, Inc.


On Mon, Feb 3, 2014 at 11:25 AM, Greg Walters greg.walt...@answers.comwrote:


The code I wrote is currently a bit of an ugly hack, so I'm a bit reluctant
to share it, and there are some legal concerns with open-sourcing code within
my company. That being said, I wouldn't mind rewriting it on my own time.
Where can I find a starter kit for contributors, with coding guidelines and
the like? Spruced up some, I'd be OK with submitting a patch.

Thanks,
Greg

On Feb 3, 2014, at 10:08 AM, Mark Miller markrmil...@gmail.com wrote:


You should contribute that and spread the dev load with others :)

We need something like that at some point; it's just that no one has done it.

We currently expect you to aggregate in the monitoring layer, and that's a lot
to ask IMO.


- Mark

http://about.me/markrmiller

On Feb 3, 2014, at 10:49 AM, Greg Walters greg.walt...@answers.com

wrote:



I've had some issues monitoring Solr with the per-core mbeans and ended

up writing a custom request handler that gets loaded then registers
itself as an mbean. When called it polls all the 

shard1 gone missing ...

2014-01-31 Thread David Santamauro


Hi,

I have a strange situation. I created a collection with 4 nodes (separate 
servers, numShards=4), then proceeded to index data ... all had been 
seemingly well until this morning when I had to reboot one of the nodes.


After reboot, the node I rebooted went into recovery mode! This is 
completely illogical as there is 1 shard per node (no replicas).


What could have possibly happened to 1) trigger a recovery and; 2) have 
the node think it has a replica to even recover from?


Looking at the graph on the SOLR admin page, it shows that shard1 
disappeared and the server that was rebooted appears in a recovering 
state under the server that is home to shard2.


I then looked at clusterstate.json and it confirms that shard1 is 
completely missing and shard2 now has a replica. ... I'm baffled, 
confused, dismayed.


Versions:
Solr 4.4 (4 nodes with tomcat container)
zookeeper-3.4.5 (5-node ensemble)

Oh, and I'm assuming shard1 is completely corrupt.

I'd really appreciate any insight.

David

PS I have a copy of all the shards backed up. Is there a way to possibly 
rsync shard1 back into place and fix clusterstate.json manually?


Re: shard1 gone missing ...

2014-01-31 Thread David Santamauro

On 01/31/2014 10:35 AM, Mark Miller wrote:




On Jan 31, 2014, at 10:31 AM, Mark Miller markrmil...@gmail.com wrote:


Seems unlikely by the way. Sounds like what probably happened is that for some 
reason it thought when you restarted the shard that you were creating it with 
numShards=2 instead of 1.


No, that’s not right. Sorry.

It must have got assigned a new core node name. numShards would still have to 
be seen as 1 for it to try and be a replica. Brain lapse.

Are you using a custom coreNodeName or taking the default? Can you post your 
solr.xml so we can see your genericCoreNodeNames and coreNodeName settings?

One possibility is that you got assigned a coreNodeName, but for some reason it 
was not persisted in solr.xml.

- Mark

http://about.me/markrmiller



There is nothing of note in the zookeeper logs. My solr.xml (sanitized 
for privacy) is identical on all 4 nodes:


<solr persistent="false"
      zkHost="xx.xx.xx.xx:2181,xx.xx.xx.xx:2181,xx.xx.xx.xx:2181,xx.xx.xx.xx:2181,xx.xx.xx.xx:2181">

  <cores adminPath="/admin/cores"
         host="${host:}"
         hostPort="8080"
         hostContext="${hostContext:/x}"
         zkClientTimeout="${zkClientTimeout:15000}"
         defaultCoreName="c1"
         shareSchema="true">

    <core name="c1"
          collection="col1"
          instanceDir="/dir/x"
          config="solrconfig.xml"
          dataDir="/dir/x/data/y"
    />
  </cores>
</solr>

I don't specify coreNodeName nor a genericCoreNodeNames default value 
...  should I?


The tomcat log is basically just a replay of what happened.

16443 [coreLoadExecutor-4-thread-2] INFO 
org.apache.solr.core.CoreContainer  ? registering core: ...


# this is, I think, what you are talking about above with the new coreNodeName
16444 [coreLoadExecutor-4-thread-2] INFO 
org.apache.solr.cloud.ZkController  ? Register replica - core:c1 
address:http://xx.xx.xx.xx:8080/x collection: col1 shard:shard4


16453 [coreLoadExecutor-4-thread-2] INFO 
org.apache.solr.client.solrj.impl.HttpClientUtil  ? Creating new http 
client, 
config:maxConnections=1maxConnectionsPerHost=20connTimeout=3socketTimeout=3retry=false


16505 [coreLoadExecutor-4-thread-2] INFO 
org.apache.solr.cloud.ZkController  ? We are http://node1:8080/x and 
leader is http://node2:8080/x


Then it just starts replicating.

If there is anything specific I should be groking for in these logs, let 
me know.


Also, given that my clusterstate.json now looks like this:

assume:
  node1=xx.xx.xx.1
  node2=xx.xx.xx.2

"shard4":{
  "range":"2000-3fff",
  "state":"active",
  "replicas":{
    "node2:8080_x_col1":{
      "state":"active",
      "core":"c1",
      "node_name":"node2:8080_x",
      "base_url":"http://node2:8080/x",
      "leader":"true"},
    <-- this should not be a replica of shard2 but its own shard1
    "node1:8080_x_col1":{
      "state":"recovering",
      "core":"c1",
      "node_name":"node1:8080_x",
      "base_url":"http://node1:8080/x"}},

Can I just recreate shard1

"shard1":{
  * NOTE: range is assumed based on the ranges of the other nodes
  "range":"0-1fff",
  "state":"active",
  "replicas":{
    "node1:8080_x_col1":{
      "state":"active",
      "core":"c1",
      "node_name":"node1:8080_x",
      "base_url":"http://node1:8080/x",
      "leader":"true"}},

... and then remove the replica ..
"shard4":{
  "range":"2000-3fff",
  "state":"active",
  "replicas":{
    "node2:8080_x_col1":{
      "state":"active",
      "core":"c1",
      "node_name":"node2:8080_x",
      "base_url":"http://node2:8080/x",
      "leader":"true"}},

That would be great...

thanks for your help

David



Re: shard1 gone missing ...

2014-01-31 Thread David Santamauro

On 01/31/2014 10:22 AM, Mark Miller wrote:


I’d also highly recommend you try moving to Solr 4.6.1 when you can though. We 
have fixed many, many, many bugs around SolrCloud in the 4 releases since 4.4. 
You can follow the progress in the CHANGES file we update for each release.


Can I do a drop-in replacement of 4.4.0 ?




Re: Can I store only the index in Solr and not the actual data

2014-01-13 Thread David Santamauro

On 01/13/2014 06:16 AM, Bijoy Deb wrote:

Hi,

I have my data in HDFS, which I need to index using Solr. In that case, does
Solr always store both the data (the fields that need to be retrieved) as
well as the index, or can it be configured to store only the index, pointing
to the original data in HDFS?
Personally, I would prefer the latter, as the former will unnecessarily cause
data duplication and occupy more disk space.
In a word, I feel that, similar to database indexes, my data should not need
to be stored separately in any server (Solr server); only the index should be
created, pointing to that data.


The attribute you are looking for is @stored in your schema.xml[1].

[1] http://wiki.apache.org/solr/SchemaXml


David
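
As a minimal sketch (field names are illustrative, not from the original
question), the searchable content can be indexed without being stored while a
small pointer back to HDFS is kept:

  <field name="body"      type="text_general" indexed="true"  stored="false" />
  <field name="hdfs_path" type="string"       indexed="false" stored="true"  />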


Re: Perl Client for SolrCloud

2014-01-08 Thread David Santamauro

On 01/07/2014 04:41 PM, Saumitra Srivastav wrote:

Is there any Perl client for SolrCloud? There are some Solr clients in Perl,
but they are for single-node Solr.

I couldn't find one that can connect to SolrCloud similar to SolrJ's
CloudSolrServer.


Since I have a load balancer in front of 8 nodes, WebService::Solr[1] 
still works fine.


The haproxy[2] load balancer is a wonderful tool.

[1] 
http://search.cpan.org/~petdance/WebService-Solr-0.22/lib/WebService/Solr.pm

[2] http://haproxy.1wt.eu/




combining cores into a collection

2014-01-02 Thread David Santamauro


Hi,

I have a few cores on the same machine that share the schema.xml and 
solrconfig.xml from an earlier setup, basically from the older 
distribution method of using

  shards=localhost:1234/core1,localhost:1234/core2[,etc]

for searching.

They are unique sets of documents, i.e., no overlap of uniqueId between 
cores and they were indexed with SOLR 4.1.


Is there a way to combine those cores into a collection, maybe through 
the collections API? They are loaded with a lot of data so avoiding a 
reload is of the utmost importance.


thanks,

David


Re: combining cores into a collection

2014-01-02 Thread David Santamauro

On 01/02/2014 08:29 AM, michael.boom wrote:

Hi David,

They are loaded with a lot of data so avoiding a reload is of the utmost
importance.
Well, reloading a core won't cause any data loss. Is 100% availability
during the process what you need?


Not really ... uptime is irrelevant because they aren't in production. I 
just don't want to spend the time reloading 1TB of documents.


Basically, I have a bunch of (previously known as ... ) shards on one 
machine (I'd like them to stay on one machine) that aren't associated 
with a SolrCloud. I query them using


  shards=localhost:1234/core1,localhost:1234/core2[,etc...]

My current loading logic doesn't matter but rest assured, there are no 
duplicate uniqueIds across each shard.


I want to bring them all into a cloud collection. Assume I have 3 
cores/shards


  core1
  core2
  core3

as above, I currently query them as:

  /core1?q=*:*&shards=localhost:1234/core2,localhost:1234/core3

I want to be able to address all three as if they were shards of a 
collection, something like:


collection1
 = shard1 (was core1)
 = shard2 (was core2)
 = shard3 (was core3)

I want to be able to load to collection1, search collection1, etc.

I've tried

/collections?action=CREATE&name=collection1&shards=core1,core2,core3

.. but it doesn't actually recognize the existing cores.

thanks




Re: combining cores into a collection

2014-01-02 Thread David Santamauro

On 01/02/2014 12:44 PM, Chris Hostetter wrote:


: Not really ... uptime is irrelevant because they aren't in production. I just
: don't want to spend the time reloading 1TB of documents.

terminology confusion: you mean you don't want to *reindex* all of the
documents ... in Solr, reloading a core means something specific, 
different from what you are talking about, and is what michael.boom was
referring to.


quite correct, sorry. reindex the core(s), not reload the core(s).


: I want to bring them all into a cloud collection. Assume I have 3 cores/shards
:
:   core1
:   core2
:   core3

You can't convert arbitrary cores into shards of a new collection, because
the document routing logic (which dictates what shard a doc lives in based
on its uniqueKey) won't make sense.


I guess this is the heart of the issue.

I managed to assign the individual cores to a collection, using the 
collection API to create the collection and then solr.xml to define 
the core(s) and their collection. This *seemed* to work. I even test-indexed 
a set of documents, checking totals before and after as well as 
content. Again, this *seemed* to work.


Did I get lucky that all 5k documents were coincidentally found in the 
appropriate core(s)? Have I possibly corrupted one or more cores? They 
are a working copy so nothing would be lost.



: I want to be able to address all three as if they were shards of a collection,
: something like.

w/o reindexing, one thing you could do is create a single collection for
each of your cores, and then create a collection alias over all three of
these collections...

https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-CreateormodifyanAliasforaCollection


Yes, this works, but isn't this really just a convenient way to avoid the 
shards parameter on /select?
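
For reference, the alias route suggested above would look roughly like this
sketch, with one collection per former core and illustrative names:

  /admin/collections?action=CREATEALIAS&name=collection1&collections=col_core1,col_core2,col_core3

Queries against collection1 then fan out across the aliased collections,
which is why it behaves much like an explicit shards parameter.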



if you want to just be able to shove docs to a single collection in solr
cloud and have them replace docs with the same uniqueKey then you're


yes, this is what I was hoping I could do.


going to need to either: re-index using SolrCloud so the default document
routing is done properly up front; or implement a custom doc router that
knows about whatever rules you used to decide what would be in core1,
core2, core3.


I was afraid of that, but see question above about what I've done and 
index consistency.


Thanks for the insight.

David



Re: adding a node to SolrCloud

2013-12-26 Thread David Santamauro


On 12/23/2013 05:43 PM, Greg Preston wrote:

I believe you can just define multiple cores:

<core default="true" instanceDir="shard1/"
      name="collectionName_shard1" shard="shard1"/>
<core default="true" instanceDir="shard2/"
      name="collectionName_shard2" shard="shard2"/>
...

(this is the old style solr.xml.  I don't know how to do it in the newer style)


Yes, that is exactly what I did but somehow, the link between shards and 
collections gets lost and everything gets very confused.


I guess I should have read more carefully about the valid parameters on 
the core element. My problem was a missing attribute:


  @collection=collection-name

So the complete core definition that survives tomcat restarts:

 <core name="core_shard_1"
       collection="collection-name"
       instanceDir="/solr/instance/dir"
       config="solrconfig-standard.xml"
       dataDir="/solr/data/dir/core_shard_1"
 />


David


Re: adding a node to SolrCloud

2013-12-26 Thread David Santamauro

On 12/26/2013 02:29 PM, Shawn Heisey wrote:

On 12/24/2013 8:35 AM, David Santamauro wrote:

You may have one or more of the SolrCloud 'bootstrap' options on the
startup commandline.  The bootstrap options are intended to be used
once, in order to bootstrap from a non-SolrCloud setup to a SolrCloud
setup.


No, no unnecessary options. I manually bootstrapped a common config.


I have no idea what might be wrong here.


Between the Collections API and the CoreAdmin API, you should never need
to edit solr.xml (if using the pre-4.4 format) or core.properties files
(if using core discovery, available 4.4 and later) directly.


Now this I don't understand. If I have created cores through the
CoreAdmin API, how is solr.xml affected? If I don't edit it, how does
SOLR know what cores it has to expose to a distributed collection?


If you are using the old-style solr.xml (which will be supported through
all future 4.x versions, but not 5.0), then core definitions are stored
in solr.xml and the contents of the file are changed by many of the
CoreAdmin API actions.  The Collections API calls the CoreAdmin API on
servers throughout the cloud.


I have never seen tomcat or the SOLR webapp create, modify or otherwise 
touch the solr.xml file in any way. I have always had to add the necessary 
core definition manually.



http://wiki.apache.org/solr/Solr.xml%20%28supported%20through%204.x%29

If you are using the core discovery format, which was made available in
working form in version 4.4, then solr.xml does NOT contain core
definitions.  The main example in 4.4 and later uses the new format.
Cores are discovered at Solr startup by crawling the filesystem from a
root starting point looking for core.properties files.  In this mode,
solr.xml is fairly static.

http://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond
http://wiki.apache.org/solr/Core%20Discovery%20%284.4%20and%20beyond%29
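
A core.properties file in that discovery layout can be as small as the
following sketch (values are placeholders):

  name=col1_shard1_replica2
  collection=col1
  shard=shard1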


I'll begin exploring this new format, thanks for the help and links.

David



Re: adding a node to SolrCloud

2013-12-24 Thread David Santamauro

On 12/23/2013 08:42 PM, Shawn Heisey wrote:

On 12/23/2013 12:23 PM, David Santamauro wrote:

I managed to create 8 new cores and the Solr Admin cloud page showed
them wonderfully as active replicas.

The only issue I have is what goes into solr.xml (I'm using tomcat)?

Putting
   <core name="..." />

for each of the new cores I created seemed like the reasonable approach
but when I tested a tomcat restart, the distribution was messed up ...
for one thing, the cores on the new machine showed up as collections!
And tomcat never even made it to accept connections for some reason.

I cleaned everything up with zookeeper so my graph looks like it should
and I removed that new machine from the distribution (by removing zk
attributes) and restarted ... all is well again.

Any idea what could have went wrong on tomcat restart?


You may have one or more of the SolrCloud 'bootstrap' options on the
startup commandline.  The bootstrap options are intended to be used
once, in order to bootstrap from a non-SolrCloud setup to a SolrCloud setup.


No, no unnecessary options. I manually bootstrapped a common config.


Between the Collections API and the CoreAdmin API, you should never need
to edit solr.xml (if using the pre-4.4 format) or core.properties files
(if using core discovery, available 4.4 and later) directly.


Now this I don't understand. If I have created cores through the 
CoreAdmin API, how is solr.xml affected? If I don't edit it, how does 
SOLR know what cores it has to expose to a distributed collection?


thanks

David


Re: adding a node to SolrCloud

2013-12-23 Thread David Santamauro

On 12/22/2013 09:48 PM, Shawn Heisey wrote:

On 12/22/2013 2:10 PM, David Santamauro wrote:

My goal is to have a redundant copy of all 8 currently running, but
non-redundant shards. This setup (8 nodes with no replicas) was a test
and it has proven quite functional from a performance perspective.
Loading, though, takes almost 3 weeks so I'm really not in a position to
redesign the distribution, though I can add nodes.

I have acquired another resource, a very large machine that I'd like to
use to hold the replicas of the currently deployed 8-nodes.

I realize I can run 8 jetty/tomcats and accomplish my goal but that is a
maintenance headache and is really a last resort. I really would just
like to be able to deploy this big machine with 'numShards=8'.

Is that possible or do I really need to have 8 other nodes running?


You don't want to run more than one container or Solr instance per
machine.  Things can get very confused, and it's too much overhead.



With existing collections, you can simply run the CoreAdmin CREATE
action on the new node with more resources.

So you'd do something like this, once for each of the 8 existing parts:

http://newnode:port/solr/admin/cores?action=CREATE&name=collname_shard1_replica2&collection=collname&shard=shard1

It will automatically replicate the shard from its current leader.


Fantastic! Clearly my understanding of collection vs. core vs. shard was 
lacking, but now I see the relationship better.




One thing to be aware of: With 1.4TB of index data, it might be
impossible to keep enough of the index in RAM for good performance,
unless the machine has a terabyte or more of RAM.


Yes, I'm well aware of the performance implications, many of which are 
mitigated by 2TB of SSD and 512GB RAM.


Thanks for the nudge in the right direction. The first node/shard1 is 
replicating right now.


David





Re: adding a node to SolrCloud

2013-12-23 Thread David Santamauro


Shawn,

I managed to create 8 new cores and the Solr Admin cloud page showed 
them wonderfully as active replicas.


The only issue I have is what goes into solr.xml (I'm using tomcat)?

Putting
  <core name="..." />

for each of the new cores I created seemed like the reasonable approach 
but when I tested a tomcat restart, the distribution was messed up ... 
for one thing, the cores on the new machine showed up as collections! 
And tomcat never even made it to accept connections for some reason.


I cleaned everything up with zookeeper so my graph looks like it should 
and I removed that new machine from the distribution (by removing zk 
attributes) and restarted ... all is well again.


Any idea what could have went wrong on tomcat restart?

thanks.




On 12/22/2013 09:48 PM, Shawn Heisey wrote:

On 12/22/2013 2:10 PM, David Santamauro wrote:

My goal is to have a redundant copy of all 8 currently running, but
non-redundant shards. This setup (8 nodes with no replicas) was a test
and it has proven quite functional from a performance perspective.
Loading, though, takes almost 3 weeks so I'm really not in a position to
redesign the distribution, though I can add nodes.

I have acquired another resource, a very large machine that I'd like to
use to hold the replicas of the currently deployed 8-nodes.

I realize I can run 8 jetty/tomcats and accomplish my goal but that is a
maintenance headache and is really a last resort. I really would just
like to be able to deploy this big machine with 'numShards=8'.

Is that possible or do I really need to have 8 other nodes running?


You don't want to run more than one container or Solr instance per
machine.  Things can get very confused, and it's too much overhead.
Also, you shouldn't start Solr with the numShards parameter on the
commandline.  That should be given when you create each collection.

With existing collections, you can simply run the CoreAdmin CREATE
action on the new node with more resources.

http://wiki.apache.org/solr/SolrCloud#Creating_cores_via_CoreAdmin

So you'd do something like this, once for each of the 8 existing parts:

http://newnode:port/solr/admin/cores?action=CREATE&name=collname_shard1_replica2&collection=collname&shard=shard1

It will automatically replicate the shard from its current leader.

One thing to be aware of: With 1.4TB of index data, it might be
impossible to keep enough of the index in RAM for good performance,
unless the machine has a terabyte or more of RAM.

http://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache

Thanks,
Shawn





Re: adding a node to SolrCloud

2013-12-23 Thread David Santamauro

On 12/23/2013 05:03 PM, Greg Preston wrote:

Yes, I'm well aware of the performance implications, many of which are 
mitigated by 2TB of SSD and 512GB RAM


I've got a very similar setup in production.  2TB SSD, 256G RAM (128G
heaps), and 1 - 1.5 TB of index per node.  We're in the process of
splitting that to multiple JVMs per host.  GC pauses were causing ZK
timeouts (you should up that in solr.xml).  And resync's after the
timeouts took long enough that a large tlog built up (we have near
continuous indexing), and we couldn't replay the tlog fast enough to
catch up to current.


GC pauses are a huge issue in our current production environment 
(monolithic index) and general performance was meager, hence the move to 
a distributed design. We will have 8 nodes with ~ 200GB per node, one 
shard each and performance for single and most multi-term queries has 
become sub-second and throughput has increased 10-fold. Larger boolean 
queries can still take 2-3s but we can live with that.


At any rate, I still can't figure out what my solr.xml is supposed to 
look like on the node with all 8 redundant shards.


David



On Mon, Dec 23, 2013 at 2:31 AM, David Santamauro
david.santama...@gmail.com wrote:

On 12/22/2013 09:48 PM, Shawn Heisey wrote:


On 12/22/2013 2:10 PM, David Santamauro wrote:


My goal is to have a redundant copy of all 8 currently running, but
non-redundant shards. This setup (8 nodes with no replicas) was a test
and it has proven quite functional from a performance perspective.
Loading, though, takes almost 3 weeks so I'm really not in a position to
redesign the distribution, though I can add nodes.

I have acquired another resource, a very large machine that I'd like to
use to hold the replicas of the currently deployed 8-nodes.

I realize I can run 8 jetty/tomcats and accomplish my goal but that is a
maintenance headache and is really a last resort. I really would just
like to be able to deploy this big machine with 'numShards=8'.

Is that possible or do I really need to have 8 other nodes running?



You don't want to run more than one container or Solr instance per
machine.  Things can get very confused, and it's too much overhead.





With existing collections, you can simply run the CoreAdmin CREATE

action on the new node with more resources.

So you'd do something like this, once for each of the 8 existing parts:


http://newnode:port/solr/admin/cores?action=CREATE&name=collname_shard1_replica2&collection=collname&shard=shard1

It will automatically replicate the shard from its current leader.



Fantastic! Clearly my understanding of collection vs. core vs. shard
was lacking, but now I see the relationship better.




One thing to be aware of: With 1.4TB of index data, it might be
impossible to keep enough of the index in RAM for good performance,
unless the machine has a terabyte or more of RAM.



Yes, I'm well aware of the performance implications, many of which are
mitigated by 2TB of SSD and 512GB RAM.

Thanks for the nudge in the right direction. The first node/shard1 is
replicating right now.

David







adding a node to SolrCloud

2013-12-22 Thread David Santamauro


Hi,

I have an 8-node setup currently with 1 shard per node (no redundancy). 
These 8 nodes are smaller machines not capable of supporting the entire 
collection.


I have another machine resource that can act as another node, and this last 
node is capable of holding the entire collection. I'd like to make this 
large node hold a replica of all the other 8 nodes' shards, for 
redundancy as well as performance.


Is this possible? ... and what would be the configuration magic needed 
to accomplish this? This collection is very large (1.4TB) and as you can 
imagine, I don't want to fiddle around trying configuration alternatives 
assuming each try would lead to a replication of 1.4TB of data.


Any help would be greatly appreciated.

David


Re: adding a node to SolrCloud

2013-12-22 Thread David Santamauro


any hint?

On 12/22/2013 06:48 AM, David Santamauro wrote:


Hi,

I have an 8-node setup currently with 1 shard per node (no redundancy).
These 8 nodes are smaller machines not capable of supporting the entire
collection..

I have another machine resource that can act as other node and this last
node is capable of holding the entire collection. I'd like to make this
large node hold a replica of all the other 8 nodes' shards for
redundancy as well as performance.

Is this possible? ... and what would be the configuration magic needed
to accomplish this? This collection is very large (1.4TB) and as you can
imagine, I don't want to fiddle around trying configuration alternatives
assuming each try would lead to a replication of 1.4TB of data.

Any help would be greatly appreciated.

David




Re: adding a node to SolrCloud

2013-12-22 Thread David Santamauro


Thanks for the reply.

My goal is to have a redundant copy of all 8 currently running, but 
non-redundant shards. This setup (8 nodes with no replicas) was a test 
and it has proven quite functional from a performance perspective. 
Loading, though, takes almost 3 weeks so I'm really not in a position to 
redesign the distribution, though I can add nodes.


I have acquired another resource, a very large machine that I'd like to 
use to hold the replicas of the currently deployed 8-nodes.


I realize I can run 8 jetty/tomcats and accomplish my goal but that is a 
maintenance headache and is really a last resort. I really would just 
like to be able to deploy this big machine with 'numShards=8'.


Is that possible or do I really need to have 8 other nodes running?

David


On 12/22/2013 03:58 PM, Furkan KAMACI wrote:

Hi David;

When you start up 8 nodes within that machine they will be replicas of each
shard and you will accomplish what you want. However, if you can give more
detail about your hardware infrastructure and needs, I can offer you a
design.

Thanks;
Furkan KAMACI


On Sunday, December 22, 2013, David Santamauro david.santama...@gmail.com
wrote:


any hint?

On 12/22/2013 06:48 AM, David Santamauro wrote:


Hi,

I have an 8-node setup currently with 1 shard per node (no redundancy).
These 8 nodes are smaller machines not capable of supporting the entire
collection..

I have another machine resource that can act as other node and this last
node is capable of holding the entire collection. I'd like to make this
large node hold a replica of all the other 8 nodes' shards for
redundancy as well as performance.

Is this possible? ... and what would be the configuration magic needed
to accomplish this? This collection is very large (1.4TB) and as you can
imagine, I don't want to fiddle around trying configuration alternatives
assuming each try would lead to a replication of 1.4TB of data.

Any help would be greatly appreciated.

David









Re: AND query on multivalue text

2008-11-24 Thread David Santamauro


On Nov 24, 2008, at 8:52 AM, Erik Hatcher wrote:



On Nov 24, 2008, at 8:37 AM, David Santamauro wrote:

I need to search something like

  myText:billion AND guarantee

and I need to get back only the record where the words exist in the same
value (in this case only the first record), because in the 2nd record the
two words are in different values.

Is it possible?


It's not possible with a purely boolean query like this, but it is  
possible with a sloppy phrase query where the position increment  
gap (see example schema.xml) is greater than the slop factor.


Erik
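
A rough illustration of that technique (field and type names are
placeholders): if the field type carries a positionIncrementGap of 100, a
phrase query with a slop smaller than that gap cannot match across two values
of the multiValued field:

  <field name="myText" type="text_general" indexed="true" stored="true"
         multiValued="true" />
  <!-- text_general declared with positionIncrementGap="100" -->

  q=myText:"billion guarantee"~50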




I think what is needed here is the concept of SAME, i.e.,
myText:billion SAME guarantee. I know a few full-text engines that
can handle this operator one way or another. And without it, I
don't quite understand the usefulness of multiValued fields.


Yeah, multi-valued fields are a bit awkward to grasp fully in Lucene,
especially in this context where it's a full-text field. Basically, as far
as indexing goes, there's no such thing as a multi-valued field. An indexed
field gets split into terms, and terms have positional information attached
to them (thus a position increment gap can be used to put a big virtual gap
between the last term of one field instance and the first term of the next
one). A multi-valued field gets stored (if it is set to be stored, that is)
as separate strings, and is retrievable as the separate values.


Multi-valued fields are handy for facets where, say, a product can  
have multiple categories associated with it.  In this case it's a  
bit clearer.  It's the full-text multi-valued fields that seem a bit  
strange.


Erik




OK, it seems it is the multi-dimensional aspect that is missing

field[0]: A B C D
field[1]:   B   D

...and the concept of field array would need to be introduced  
(probably at the lucene level).


Do you know if there has been any serious thought given to this, i.e.,
the possibility of introducing a new SAME operator, or is this a corner
case not worth pursuing?


thanks
David