Re: Solr using a ridiculous amount of memory

2013-04-18 Thread John Nielsen
 That was strange. As you are using a multi-valued field with the new
setup, they should appear there.

Yes, the new field we use for faceting is a multi valued field.

 Can you find the facet fields in any of the other caches?

Yes, here it is, in the field cache:

http://screencast.com/t/mAwEnA21yL

 I hope you are not calling the facets with facet.method=enum? Could you
paste a typical facet-enabled search request?

Here is a typical example (I added newlines for readability):

http://172.22.51.111:8000/solr/default1_Danish/search
?defType=edismax
q=*%3a*
facet.field=%7b!ex%3dtagitemvariantoptions_int_mv_7+key%3ditemvariantoptions_int_mv_7%7ditemvariantoptions_int_mv
facet.field=%7b!ex%3dtagitemvariantoptions_int_mv_9+key%3ditemvariantoptions_int_mv_9%7ditemvariantoptions_int_mv
facet.field=%7b!ex%3dtagitemvariantoptions_int_mv_8+key%3ditemvariantoptions_int_mv_8%7ditemvariantoptions_int_mv
facet.field=%7b!ex%3dtagitemvariantoptions_int_mv_2+key%3ditemvariantoptions_int_mv_2%7ditemvariantoptions_int_mv
fq=site_guid%3a(10217)
fq=item_type%3a(PRODUCT)
fq=language_guid%3a(1)
fq=item_group_1522_combination%3a(*)
fq=is_searchable%3a(True)
sort=item_group_1522_name_int+asc, variant_of_item_guid+asc
querytype=Technical
fl=feed_item_serialized
facet=true
group=true
group.facet=true
group.ngroups=true
group.field=groupby_variant_of_item_guid
group.sort=name+asc
rows=0

 Are you warming all the sort- and facet-fields?

I'm sorry, I don't know. I have the field value cache commented out in my
config, so... Whatever is default?

Removing the custom sort fields is unfortunately quite a bit more difficult
than my other facet modification.

The problem is that each item can have several sort orders. The sort order
to use is defined by a group number which is known ahead of time. The group
number is included in the sort order field name. To solve it in the same
way i solved the facet problem, I would need to be able to sort on a
multi-valued field, and unless I'm wrong, I don't think that it's possible.

I am quite stumped on how to fix this.




On Wed, Apr 17, 2013 at 3:06 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

 John Nielsen [j...@mcb.dk]:
  I never seriously looked at my fieldValueCache. It never seemed to get
 used:

  http://screencast.com/t/YtKw7UQfU

 That was strange. As you are using a multi-valued field with the new
 setup, they should appear there. Can you find the facet fields in any of
 the other caches?

 ...I hope you are not calling the facets with facet.method=enum? Could you
 paste a typical facet-enabled search request?

  Yep. We still do a lot of sorting on dynamic field names, so the field
 cache
  has a lot of entries. (9.411 entries as we speak. This is considerably
 lower
  than before.). You mentioned in an earlier mail that faceting on a field
  shared between all facet queries would bring down the memory needed.
  Does the same thing go for sorting?

 More or less. Sorting stores the raw string representations (utf-8) in
 memory so the number of unique values has more to say than it does for
 faceting. Just as with faceting, a list of pointers from documents to
 values (1 value/document as we are sorting) is maintained, so the overhead
 is something like

 #documents*log2(#unique_terms*average_term_length) +
 #unique_terms*average_term_length
 (where average_term_length is in bits)
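 As a rough Java sketch of that estimate (a guess-driven helper; all sizes
 are in bits):

     // Heap estimate from the formula above. numDocs is the size of the
     // full index, not just the documents with a value in the sort field.
     static double sortFieldHeapBits(long numDocs, long numUniqueTerms, double avgTermLenBits) {
         double bitsPerDocRef = Math.log(numUniqueTerms * avgTermLenBits) / Math.log(2); // log2
         return numDocs * bitsPerDocRef + numUniqueTerms * avgTermLenBits;
     }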

 Caveat: This is with the index-wide sorting structure. I am fairly
 confident that this is what Solr uses, but I have not looked at it lately
 so it is possible that some memory-saving segment-based trickery has been
 implemented.

  Do those 9411 entries duplicate data between them?

 Sorry, I do not know. SOLR- discusses the problems with the field
 cache and duplication of data, but I cannot infer if it has been solved
 or not. I am not familiar with the stat breakdown of the fieldCache, but it
 _seems_ to me that there are 2 or 3 entries for each segment for each sort
 field. Guesstimating further, let's say you have 30 segments in your index.
 Going with the guesswork, that would bring the number of sort fields to
 9411/3/30 ~= 100. Looks like you use a custom sort field for each client?

 Extrapolating from 1.4M documents and 180 clients, let's say that there
 are 1.4M/180/5 unique terms for each sort-field and that their average
 length is 10. We thus have
 1.4M*log2(1500*10*8) + 1500*10*8 bit ~= 23MB
 per sort field or about 4GB for all the 180 fields.

 With this few unique values, the doc-value structure is by far the
 biggest, just as with facets. As opposed to the faceting structure, this is
 fairly close to the actual memory usage. Switching to a single sort field
 would reduce the memory usage from 4GB to about 55MB.

  I do commit a bit more often than i should. I get these in my log file
 from
  time to time: PERFORMANCE WARNING: Overlapping onDeckSearchers=2

 So 1 active searcher and 2 warming searchers. Ignoring that one of the
 warming searchers is highly likely to 

Solr 4.2 fl issue

2013-04-18 Thread William Bell
We are getting an issue when using a GUID for a field in Solr 4.2. Solr 3.6
is fine. Something like:

fl=098765-765-788558-7654_userid as a string stored.

The issue is when the GUID begins with a numeral and then a minus.

This is a bug.

-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Lucene Sorting

2013-04-18 Thread pankaj.pandey4
Hi,

We are facing a sorting issue with data indexed using Solr. Below is the sample
code. The problem is that the data returned by the code below is not properly
sorted, i.e. there is no ordering of the data. Can anyone assist me with this?

Directory directory = FSDirectory.open(indexDir);
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));
// Sort on a string field, optionally in reverse order
Sort column = new Sort(new SortField(sortColumn, SortField.STRING, reverse));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
QueryParser queryParser = new QueryParser(Version.LUCENE_36, fieldName, analyzer);
queryParser.setAllowLeadingWildcard(true);
queryParser.setDefaultOperator(Operator.AND);
TopDocs topDocs = searcher.search(queryParser.parse(queryStr), filter, maxHits, column);

Thanks!

Regards,
Pankaj



Re: facet.method enum vs fc

2013-04-18 Thread Toke Eskildsen
On Wed, 2013-04-17 at 20:06 +0200, Mingfeng Yang wrote:
 I am doing faceting on an index of 120M documents, 
 on the field of url[...]

I would guess that you would need 3-4GB for that.
How much memory do you allocate to Solr?

- Toke Eskildsen



SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread David Parks
Step 1: distribute processing

We have 2 servers on which we'll run 2 SolrCloud instances.

We'll define 2 shards so that both servers are busy for each request
(improving response time of the request).

 

Step 2: Failover

We would now like to ensure that if either of the servers goes down (we're
very unlucky with disks), that the other will be able to take over
automatically.

So we define 2 shards with a replication factor of 2.

 

So we have:

. Server 1: Shard 1, Replica 2

. Server 2: Shard 2, Replica 1

 

Question:

But in SolrCloud, replicas are active right? So isn't it now possible that
the load balancer will have Server 1 process *both* parts of a request,
after all, it has both shards due to the replication, right?



Re: Select Queries While Merging Indexes

2013-04-18 Thread Furkan KAMACI
Thanks for the explanations. I should read up on the lifecycle of Searcher
objects. Should I read about them in a Lucene book, or is there any Solr
documentation or book that covers it?

2013/4/18 Jack Krupansky j...@basetechnology.com

  "merging indexes"

  The proper terminology is "merging segments".

 Until the new, merged segment is complete, the existing segments remain
 untouched and readable.

 -- Jack Krupansky

 -Original Message- From: Furkan KAMACI
 Sent: Wednesday, April 17, 2013 6:28 PM
 To: solr-user@lucene.apache.org
  Subject: Select Queries While Merging Indexes


 I see that while merging indexes (I mean optimizing via the admin GUI), my Solr
 instance can still respond to select queries. How does that querying
 mechanism work (the merge is not finished yet, but my Solr instance
 can still return a consistent response)?



Re: Max http connections in CloudSolrServer

2013-04-18 Thread J Mohamed Zahoor

Thanks for this.
The reason I asked this was: when I fire 30 queries simultaneously from 30
threads using the same CloudSolrServer instance,
some queries get fired after a delay.. sometimes the delay is 30-50 seconds...

In the Solr logs I can see that 20+ queries get fired almost immediately... but some
of them get fired late..

I increased the connections per host from 32 to 200.. still no respite...

./zahoor

On 18-Apr-2013, at 12:20 AM, Shawn Heisey s...@elyograg.org wrote:

 ModifiableSolrParams params = new ModifiableSolrParams();
  params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 1000);
  params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 200);
  HttpClient client = HttpClientUtil.createClient(params);
  LBHttpSolrServer lbServer = new LBHttpSolrServer(client, "http://localhost/solr");
  lbServer.removeSolrServer("http://localhost/solr");
  SolrServer server = new CloudSolrServer(zkHost, lbServer);



Re: Solr using a ridiculous amount of memory

2013-04-18 Thread Toke Eskildsen
On Thu, 2013-04-18 at 08:34 +0200, John Nielsen wrote:

 
[Toke: Can you find the facet fields in any of the other caches?]

 Yes, here it is, in the field cache:

 http://screencast.com/t/mAwEnA21yL
 
Ah yes, mystery solved, my mistake.

 http://172.22.51.111:8000/solr/default1_Danish/search

[...]

 fq=site_guid%3a(10217)

This constrains the hits to a specific customer, right? Any search will
only be in a single customer's data?

 
[Toke: Are you warming all the sort- and facet-fields?]

 I'm sorry, I don't know. I have the field value cache commented out in
 my config, so... Whatever is default?

(a bit shaky here) I would say not warming. You could check simply by
starting solr and looking at the caches before you issue any searches.

This fits the description of your searchers gradually eating memory
until your JVM OOMs. Each time a new field is faceted or sorted upon, it
is added to the cache. As your index is relatively small and the number
of values in the single fields is small, the initialization time for a
field is so short that it is not a performance problem. Memory-wise it
is death by a thousand cuts.

If you did explicit warming of all the possible fields for sorting and
faceting, you would allocate it all up front and would be sure that
there would be enough memory available. But it would take much longer
than your current setup. You might want to try it out (no need to fiddle
with the Solr setup, just make a script and fire wgets as this has the same
effect).
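For reference, the in-config equivalent is a firstSearcher warming listener
in solrconfig.xml. A sketch, reusing field names from the request earlier in
this thread:

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.field">itemvariantoptions_int_mv</str>
        <str name="sort">item_group_1522_name_int asc</str>
        <str name="rows">0</str>
      </lst>
    </arr>
  </listener>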

 The problem is that each item can have several sort orders. The sort
 order to use is defined by a group number which is known ahead of
 time. The group number is included in the sort order field name. To
 solve it in the same way i solved the facet problem, I would need to
 be able to sort on a multi-valued field, and unless I'm wrong, I don't
 think that it's possible.

That is correct.

Three suggestions off the bat:

1) Reduce the number of sort fields by mapping names.
Count the maximum number of unique sort fields for any given customer.
That will be the total number of sort fields in the index. For each
group number for a customer, map that number to one of the index-wide
sort fields.
This only works if the maximum number of unique fields is low (let's say
a single field takes 50MB, so 20 fields should be okay).

2) Create a custom sorter for Solr.
Create a field with all the sort values, prefixed by group ID. Create a
structure (or reuse the one from Lucene) with a doc-terms map with all
the terms in-memory. When sorting, extract the relevant compare-string
for a document by iterating all the terms for the document and selecting
the one with the right prefix.
Memory wise this scales linear to the number of terms instead of the
number of fields, but it would require quite some coding.

3) Switch to a layout where each customer has a dedicated core.
The basic overhead is a lot larger than for a shared index, but it would
make your setup largely immune to the adverse effect of many documents
coupled with many facet- and sort-fields.
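
In code, suggestion 1 could be as simple as this hypothetical helper (names
invented; the group-to-slot assignment must be made at index time and reused
at query time):

  // Map a customer's group number to one of N shared, index-wide sort
  // fields, e.g. slot 7 -> "sort_07".
  static String sortFieldFor(java.util.Map<Integer, Integer> groupToSlot, int groupNo) {
      return String.format("sort_%02d", groupToSlot.get(groupNo));
  }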

- Toke Eskildsen, State and University Library, Denmark




Re: SolrCloud vs Solr master-slave replication

2013-04-18 Thread Victor Ruiz
Thank you again for your answer, Shawn.

The network card seems to work fine, but we've found segmentation faults, so now
our hosting provider is going to run a full hw check. Hopefully they'll
replace the server and the problem will be solved.

Regards,
Victor







Re: SolrCloud vs Solr master-slave replication

2013-04-18 Thread Victor Ruiz
Also, I forgot to say... the same error started to happen again.. the index
is again corrupted :(





Re: Solr using a ridiculous amount of memory

2013-04-18 Thread John Nielsen

  http://172.22.51.111:8000/solr/default1_Danish/search

 [...]

  fq=site_guid%3a(10217)

  This constrains the hits to a specific customer, right? Any search will
  only be in a single customer's data?


Yes, that's right. No search from any given client ever returns anything
from another client.


[Toke: Are you warming all the sort- and facet-fields?]

  I'm sorry, I don't know. I have the field value cache commented out in
  my config, so... Whatever is default?

 (a bit shaky here) I would say not warming. You could check simply by
 starting solr and looking at the caches before you issue any searches.


The field cache shows 0 entries at startup. On the running server, forcing
a commit (and thus opening a new searcher) does not change the number of
entries.


  The problem is that each item can have several sort orders. The sort
  order to use is defined by a group number which is known ahead of
  time. The group number is included in the sort order field name. To
  solve it in the same way i solved the facet problem, I would need to
  be able to sort on a multi-valued field, and unless I'm wrong, I don't
  think that it's possible.

 That is correct.

 Three suggestions off the bat:

 1) Reduce the number of sort fields by mapping names.
 Count the maximum number of unique sort fields for any given customer.
 That will be the total number of sort fields in the index. For each
 group number for a customer, map that number to one of the index-wide
 sort fields.
 This only works if the maximum number of unique fields is low (let's say
 a single field takes 50MB, so 20 fields should be okay).


I just checked our DB. Our worst-case-scenario client has over a thousand
groups for sorting. Granted, it may be, probably is, an error with the
data. It is an interesting idea though and I will look into this possibility.


 3) Switch to a layout where each customer has a dedicated core.
 The basic overhead is a lot larger than for a shared index, but it would
 make your setup largely immune to the adverse effect of many documents
 coupled with many facet- and sort-fields.


Now this is where my brain melts down.

If I understand the fieldCache mechanism correctly (which I can see that I
don't), the data used for faceting and sorting is saved in the fieldCache
using a key comprised of the fields used for said faceting/sorting. That
data only contains the data which is actually used for the operation. This
is what the fq queries are for.

So if I generate a core for each client, I would have a client-specific
fieldCache containing the data from that client. Wouldn't I just split up
the same data into several cores?

I'm afraid I don't understand how this would help.


-- 
Med venlig hilsen / Best regards

*John Nielsen*
Programmer



*MCB A/S*
Enghaven 15
DK-7500 Holstebro

Kundeservice: +45 9610 2824
p...@mcb.dk
www.mcb.dk


Re: Solr using a ridiculous amount of memory

2013-04-18 Thread Toke Eskildsen
On Thu, 2013-04-18 at 11:59 +0200, John Nielsen wrote:
 Yes, thats right. No search from any given client ever returns
 anything from another client.

Great. That makes the 1 core/client solution feasible.

[No sort  facet warmup is performed]

[Suggestion 1: Reduce the number of sort fields by mapping]

[Suggestion 3: 1 core/customer]

 If I understand the fieldCache mechanism correctly (which i can see
 that I don't), the data used for faceting and sorting is saved in the
 fieldCache using a key comprised of the fields used for said
 faceting/sorting. That data only contains the data which is actually
 used for the operation. This is what the fq queries are for.
 
You are missing an essential part: Both the facet and the sort
structures needs to hold one reference for each document
_in_the_full_index_, even when the document does not have any values in
the fields.

It might help to visualize the structures as arrays of values with docID
as index: String[] myValues = new String[1400000] takes up 1.4M * 32 bit
(or more for a 64 bit machine) = 5.6MB, even when it is empty.

Note: Neither String-objects, nor Java references are used for the real
facet- and sort-structures, but the principle is quite the same.

 So if i generate a core for each client, I would have a client
 specific fieldCache containing the data from that client. Wouldn't I
 just split up the same data into several cores?

The same terms, yes, but not the same references.

Let's say your customer has 10K documents in the index and that there
are 100 unique values, each 10 bytes long, in each group.

As each group holds its own separate structure, we use the old formula
to get the memory overhead:

#documents*log2(#unique_terms*average_term_length) +
#unique_terms*average_term_length
 
1.4M*log2(100*(10*8)) + 100*(10*8) bit = 1.2MB + 1KB.

Note how the values themselves are just 1KB, while the nearly empty
reference list takes 1.2MB.


Compare this to a dedicated core with just the 10K documents:
10K*log2(100*(10*8)) + 100*(10*8) bit = 8.5KB + 1KB.

The terms take up exactly the same space, but the heap requirement for
the references is reduced by 99%.

Now, 25GB for 180 clients means 140MB/client with your current setup.
I do not know the memory overhead of running a core, but since Solr can
run fine with 32MB for small indexes, it should be smaller than that.
You will of course have to experiment and to measure.


- Toke Eskildsen, State and University Library, Denmark




TooManyClauses: maxClauseCount is set to 1024

2013-04-18 Thread sawanverma
I am quite confused about this error.

I had a situation where I had to turn on highlighting. In some cases,
though the number of docs found for a particular query was, for example,
2, the highlighting was coming up only for 1. I did some checks and found that
the particular text searched for was in the bigger document, towards the
end of the document.

So I increased hl.maxAnalyzedChars from the default value of 51200 to a
bigger value, say 500. And then it started working; I mean the
highlighting was now working properly.

Now, I have encountered one more problem with the same error.

There is a document which returns the maxClauseCount error when I do a
search on content:*. The document is quite big in size and
hl.maxAnalyzedChars was the default, i.e. 51200.

I tried decreasing that and found that the error comes exactly at 31375
chars (I found this by setting hl.maxAnalyzedChars to 31375; it worked
fine up to 31374).

Solutions are most welcome as I am in great need of this.

A sample query is below:
http://localhost:8983/solr/test/select/?q=content:* AND
obs_date:[2010-01-01T00:00:00Z%20TO%202011-12-31T23:59:59Z]&fl=content&hl=true&hl.fl=content&hl.snippets=1&hl.fragsize=1500&hl.requireFieldMatch=true&hl.alternateField=content&hl.maxAlternateFieldLength=1500&hl.maxAnalyzedChars=31375&facet.limit=200&facet.mincount=1&start=64&rows=1&sort=obs_date%20desc

Regards,
Sawan





Re: TooManyClauses: maxClauseCount is set to 1024

2013-04-18 Thread pravesh
Just increase the value of maxClauseCount in your solrconfig.xml. Keep it
large enough.
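
In solrconfig.xml the setting is called maxBooleanClauses, e.g. (4096 is an
arbitrary choice):

  <maxBooleanClauses>4096</maxBooleanClauses>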

Best
Pravesh





Re: TooManyClauses: maxClauseCount is set to 1024

2013-04-18 Thread Yonik Seeley
Can you provide a full stack trace of the exception?

There's a maxClauseCount in solrconfig.xml that you can increase to
work around the issue.

-Yonik
http://lucidworks.com


On Thu, Apr 18, 2013 at 7:31 AM, sawanverma sawan.ve...@glassbeam.com wrote:
 Its quite confusing about this error.

 I had a situation where i have to turn on the highlighting. In some cases
 though the number of docs found for a particular query was for example say
 2, the highlighting was coming only for 1. I did some checks and found that
 that particular text searched was in the bigger document and was towards the
 end of the document.

 So i increased the hl.maxAnalyzedChar from default value of 51200 to a
 bigger value say 500. And then it started working, i mean now the
 highlighting was working properly.

 Now, i have encountered one more problem with the same error.

 There is a document which is returning the maxClauseCount error when i do
 search on content:*. The document is quite big big in size and the
 hl.maxAnalyzedChars was default i.e. 51200.

 I tried decreasing that i found that the error is coming exactly at 31375
 char (this is did my setting the hl.maxAnalyzedChars to 31375. it worked
 fine till 31374).

 Solutions are most welcome as i am in great need of this.

 Sample query is as below
 http://localhost:8983/solr/test/select/?q=content:* AND
  obs_date:[2010-01-01T00:00:00Z%20TO%202011-12-31T23:59:59Z]&fl=content&hl=true&hl.fl=content&hl.snippets=1&hl.fragsize=1500&hl.requireFieldMatch=true&hl.alternateField=content&hl.maxAlternateFieldLength=1500&hl.maxAnalyzedChars=31375&facet.limit=200&facet.mincount=1&start=64&rows=1&sort=obs_date%20desc

 Regards,
 Sawan





RE: TooManyClauses: maxClauseCount is set to 1024

2013-04-18 Thread sawanverma
Thanks Pravesh.

But won't that hit the query performance? And what would be the ideal value
to increase it to? Might this error come back even if we increase the value from 1024 to,
say, 5120?
I have tried increasing the value and it did hit the performance.

Regards,
Sawan

From: pravesh [via Lucene] [mailto:ml-node+s472066n4056966...@n3.nabble.com]
Sent: Thursday, April 18, 2013 5:06 PM
To: Sawan Verma
Subject: Re: TooManyClauses: maxClauseCount is set to 1024

Just increase the value of maxClauseCount in your solrconfig.xml. Keep it large 
enough.

Best
Pravesh





--
View this message in context: 
http://lucene.472066.n3.nabble.com/TooManyClauses-maxClauseCount-is-set-to-1024-tp4056965p4056968.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: TooManyClauses: maxClauseCount is set to 1024

2013-04-18 Thread pravesh
Update:

Also remove your range queries from the main query and specify it as a
filter query.
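
For the sample query in this thread, that would look something like:

http://localhost:8983/solr/test/select/?q=content:*&fq=obs_date:[2010-01-01T00:00:00Z%20TO%202011-12-31T23:59:59Z]&hl=true&hl.fl=content&hl.maxAnalyzedChars=31375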


Best
Pravesh





RE: TooManyClauses: maxClauseCount is set to 1024

2013-04-18 Thread sawanverma
Hi Yonik,

Thanks for your reply.

I tried increasing the maxClauseCount to a bigger value. But what would be the
ideal value, and won't that hit the performance? What are the chances that, if
we increase the value, we will not face this issue again?

As you asked, I am pasting below the full trace of the error:

Problem accessing /solr/ar/select/. Reason:
maxClauseCount is set to 1024

org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024
    at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136)
    at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:127)
    at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:51)
    at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:55)
    at org.apache.lucene.search.ScoringRewrite$3.collect(ScoringRewrite.java:95)
    at org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:38)
    at org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:93)
    at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:312)
    at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158)
    at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:98)
    at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:391)
    at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:216)
    at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:185)
    at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:205)
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:490)
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
    at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:131)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:186)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

Regards,
Sawan

From: Yonik Seeley-4 [via Lucene] 
[mailto:ml-node+s472066n4056967...@n3.nabble.com]
Sent: Thursday, April 18, 2013 5:09 PM
To: Sawan Verma
Subject: Re: TooManyClauses: maxClauseCount is set to 1024

Can you provide a full stack trace of the exception?

There's a maxClauseCount in solrconfig.xml that you can increase to
work around the issue.

-Yonik
http://lucidworks.com


On Thu, Apr 18, 2013 at 7:31 AM, sawanverma [hidden email] wrote:

 Its quite confusing about this error.

 I had a situation where i have to turn on the highlighting. In some cases
 though the number of docs found for a particular query was for example say
 2, the highlighting was coming only for 1. I did some checks and found that
 that particular text searched was in the bigger document and was towards the
 end of the document.

 So i increased the hl.maxAnalyzedChar from default value of 51200 to a
 

RE: TooManyClauses: maxClauseCount is set to 1024

2013-04-18 Thread sawanverma
Yonik,

When I remove the sort part from the query below, it works fine. But with
sort it throws the exception.

http://localhost:8983/solr/test/select/?q=content:*&fl=content&hl=true&hl.fl=content&hl.maxAnalyzedChars=31375&start=64&rows=1&sort=obs_date%20desc
--  Throws Exception



http://localhost:8983/solr/test/select/?q=content:*&fl=content&hl=true&hl.fl=content&hl.maxAnalyzedChars=31375&start=64&rows=1
-- Works fine.

From the above it's clear that sort is causing the problem. Any idea why this
is happening and how to fix it?

Regards,
sawan







Re: Solr using a ridiculous amount of memory

2013-04-18 Thread John Nielsen
 You are missing an essential part: Both the facet and the sort
 structures needs to hold one reference for each document
 _in_the_full_index_, even when the document does not have any values in
 the fields.


Wow, thank you for this awesome explanation! This is where the penny
dropped for me.

I will definitely move to a multi-core setup. It will take some time and a
lot of re-coding. As soon as I know the result, I will let you know!






-- 
Med venlig hilsen / Best regards

*John Nielsen*
Programmer



*MCB A/S*
Enghaven 15
DK-7500 Holstebro

Kundeservice: +45 9610 2824
p...@mcb.dk
www.mcb.dk


stats.facet not working for timestamp field

2013-04-18 Thread J Mohamed Zahoor
Hi

I am using Solr 4.1 with 6 shards.

I want to find out some price stats for all the days in my index.
I ended up using the stats component, like
stats=true&stats.field=price&stats.facet=timestamp.



but it throws an error like:

<str name="msg">Invalid Date String:' &#1;&#0;&#0;&#0;'[my(&#0;'</str>



My question is: is timestamp supported as stats.facet?

./zahoor




Re: zkState changes too often

2013-04-18 Thread jmozah


On 16-Apr-2013, at 11:16 PM, Mark Miller markrmil...@gmail.com wrote:

  Are you using the concurrent low-pause garbage collector or perhaps G1? 


I use the default one which comes in jdk 1.7.

 
 Are you able to use something like visualvm to pinpoint what the bottleneck 
 might be?

Unfortunately.. it is a prod machine and I could not replicate it locally.

 
 Otherwise, keep raising the timeout.


That's what I did now.. will see if it comes up in the next run..

./zahoor



Re: Max http connections in CloudSolrServer

2013-04-18 Thread J Mohamed Zahoor

I don't yet know if this is the reason...
I am looking into whether Jetty has some limit on accepting connections..

./zahoor


On 18-Apr-2013, at 12:52 PM, J Mohamed Zahoor zah...@indix.com wrote:

 
 Thanks for this.
 The reason i asked this was.. when i fire 30 queries simultaneously from 30 
 threads using the same CloudSolrServer instance, 
 some queries gets fired after a delay.. sometime the delay is 30-50 seconds...
 
 In solr logs i can see.. 20+ queries get fired almost immediately... but some 
 of them gets fired late..
 
 i increased the connections per host from 32 to 200.. still no respite...
 
 ./zahoor
 
 On 18-Apr-2013, at 12:20 AM, Shawn Heisey s...@elyograg.org wrote:
 
 ModifiableSolrParams params = new ModifiableSolrParams();
  params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 1000);
  params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 200);
  HttpClient client = HttpClientUtil.createClient(params);
  LBHttpSolrServer lbServer = new LBHttpSolrServer(client, "http://localhost/solr");
  lbServer.removeSolrServer("http://localhost/solr");
  SolrServer server = new CloudSolrServer(zkHost, lbServer);
 



Re: zkState changes too often

2013-04-18 Thread Mark Miller

On Apr 18, 2013, at 8:40 AM, jmozah jmo...@gmail.com wrote:

 
 
 On 16-Apr-2013, at 11:16 PM, Mark Miller markrmil...@gmail.com wrote:
 
  Are you using the concurrent low-pause garbage collector or perhaps G1? 
 
 
 I use the default one which comes in jdk 1.7.

It varies by platform, but 99% that means you are using the throughput 
collector and you should try the CMS collector instead. 
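
On JDK 7 that typically means adding the CMS switches to the start command,
e.g. (assuming the bundled Jetty start.jar):

  java -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -jar start.jar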

- Mark

 
 
 Are you able to use something like visualvm to pinpoint what the bottleneck 
 might be?
 
 Unfortunately..  it is prod machine and i could not replicate it locally.
 
 
 Otherwise, keep raising the timeout.
 
 
 Thats what i did now.. will see if it comes in the next run..
 
 ./zahoor
 



Re: Solr 4.2 fl issue

2013-04-18 Thread Otis Gospodnetic
Hi,
What is the issue though?  :)

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Apr 18, 2013 2:53 AM, William Bell billnb...@gmail.com wrote:

 We are getting an issue when using a GUID for a field in Solr 4.2. Solr 3.6
 is fine. Something like:

 fl=098765-765-788558-7654_userid as a string stored.

 The issue is when the GUID begins with a numeral and then a minus.

 This is a bug.

 --
 Bill Bell
 billnb...@gmail.com
 cell 720-256-8076



Re: SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread Otis Gospodnetic
Correct. This is what you want if server 2 goes down.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Apr 18, 2013 3:11 AM, David Parks davidpark...@yahoo.com wrote:

 Step 1: distribute processing

 We have 2 servers in which we'll run 2 SolrCloud instances on.

 We'll define 2 shards so that both servers are busy for each request
 (improving response time of the request).



 Step 2: Failover

 We would now like to ensure that if either of the servers goes down (we're
 very unlucky with disks), that the other will be able to take over
 automatically.

 So we define 2 shards with a replication factor of 2.



 So we have:

 . Server 1: Shard 1, Replica 2

 . Server 2: Shard 2, Replica 1



 Question:

 But in SolrCloud, replicas are active right? So isn't it now possible that
 the load balancer will have Server 1 process *both* parts of a request,
 after all, it has both shards due to the replication, right?




Re: Select Queries While Merging Indexes

2013-04-18 Thread Otis Gospodnetic
If you understand the underlying Lucene searcher, it will be easy to
understand what's happening at the Solr level.

Otis
Solr  ElasticSearch Support
http://sematext.com/
On Apr 18, 2013 3:22 AM, Furkan KAMACI furkankam...@gmail.com wrote:

 Thanks for explanations. I should read deep about the lifecycle of Searcher
 objects. Should I read them from a Lucene book or is there any Solr
 documentation or books covers it?

 2013/4/18 Jack Krupansky j...@basetechnology.com

  merging indexes
 
  The proper terminology is merging segments.
 
  Until the new, merged segment is complete, the existing segments remain
  untouched and readable.
 
  -- Jack Krupansky
 
  -Original Message- From: Furkan KAMACI
  Sent: Wednesday, April 17, 2013 6:28 PM
  To: solr-user@lucene.apache.org
   Subject: Select Queries While Merging Indexes
 
 
  I see that while merging indexes (I mean optimizing via admin gui), my
 Solr
  instance can still response select queries (as well). How that querying
  mechanism works (because merging not finished yet but my Solr instance
  still can return a consistent response)?
 



RE: Tokenize on paragraphs and sentences

2013-04-18 Thread Alex Cougarman
Thanks, Jack. Sorry, took me a while to reply :)
It sounds like sentence/paragraph level searches won't be easy.

Warm regards,
Alex 

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: 15 April 2013 5:09 PM
To: solr-user@lucene.apache.org
Subject: Re: Tokenize on paragraphs and sentences

Technically, yes, but you would have to do a lot of work yourself. Like, a 
sentence/paragraph recognizer that inserted sentence and paragraph markers, and 
a query parser that allows you to do SpanNear and SpanNot (to selectively 
exclude sentence or paragraph marks based on your granularity of
search.)

The LucidWorks Search query parser has SpanNot support (or at least did at one 
point in time), but no sentence/paragraph marking.

You could come up with some heuristic regular expressions for sentence and 
paragraph marks, like consecutive newlines for a paragraph and dot followed by 
white space for sentence (with some more heuristics for abbreviations.)
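
A rough Java sketch of such heuristics (the _PARA_/_SENT_ marker tokens are
invented, and the abbreviation problem is deliberately ignored here):

  static String markBreaks(String text) {
      return text
          .replaceAll("\\n\\s*\\n", " _PARA_ ")               // consecutive newlines -> paragraph mark
          .replaceAll("(?<=[.!?])\\s+(?=[A-Z])", " _SENT_ "); // dot + whitespace + capital -> sentence mark
  }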

Or you could have an update processor do the marking.

-- Jack Krupansky

-Original Message-
From: Alex Cougarman
Sent: Monday, April 15, 2013 9:48 AM
To: solr-user@lucene.apache.org
Subject: Tokenize on paragraphs and sentences

Hi. Is it possible to search within paragraphs or sentences in Solr? The
PatternTokenizerFactory uses regular expressions, but how can this be done with
plain ASCII docs that don't have <p> tags (HTML), yet are broken into
paragraphs? Thanks.

Warm regards,
Alex




RE: SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread David Parks
But my concern is this, when we have just 2 servers:
 - I want 1 to be able to take over in case the other fails, as you point
out.
 - But when *both* servers are up I don't want the SolrCloud load balancer
to have Shard1 and Replica2 do the work (as they would both reside on the
same physical server).

Does that make sense? I want *both* server1 & server2 sharing the processing
of every request, *and* I want the failover capability.

I'm probably missing some bit of logic here, but I want to be sure I
understand the architecture.

Dave



-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Thursday, April 18, 2013 8:13 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

Correct. This is what you want if server 2 goes down.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Apr 18, 2013 3:11 AM, David Parks davidpark...@yahoo.com wrote:

 Step 1: distribute processing

 We have 2 servers in which we'll run 2 SolrCloud instances on.

 We'll define 2 shards so that both servers are busy for each request 
 (improving response time of the request).



 Step 2: Failover

 We would now like to ensure that if either of the servers goes down 
 (we're very unlucky with disks), that the other will be able to take 
 over automatically.

 So we define 2 shards with a replication factor of 2.



 So we have:

 . Server 1: Shard 1, Replica 2

 . Server 2: Shard 2, Replica 1



 Question:

 But in SolrCloud, replicas are active right? So isn't it now possible 
 that the load balancer will have Server 1 process *both* parts of a 
 request, after all, it has both shards due to the replication, right?





Re: SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread Timothy Potter
Hi Dave,

This sounds more like a budget / deployment issue vs. anything
architectural. You want 2 shards with replication so you either need
sufficient capacity on each of your 2 servers to host 2 Solr instances or
you need 4 servers. You need to avoid starving Solr of necessary RAM, disk
performance, and CPU regardless of how you lay out the cluster otherwise
performance will suffer. My guess is if each Solr had sufficient resources,
you wouldn't actually notice much difference in query performance.

Tim


On Thu, Apr 18, 2013 at 8:03 AM, David Parks davidpark...@yahoo.com wrote:

 But my concern is this, when we have just 2 servers:
  - I want 1 to be able to take over in case the other fails, as you point
 out.
  - But when *both* servers are up I don't want the SolrCloud load balancer
 to have Shard1 and Replica2 do the work (as they would both reside on the
 same physical server).

 Does that make sense? I want *both* server1 & server2 sharing the
 processing
 of every request, *and* I want the failover capability.

 I'm probably missing some bit of logic here, but I want to be sure I
 understand the architecture.

 Dave



 -Original Message-
 From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
 Sent: Thursday, April 18, 2013 8:13 PM
 To: solr-user@lucene.apache.org
 Subject: Re: SolrCloud loadbalancing, replication, and failover

 Correct. This is what you want if server 2 goes down.

 Otis
 Solr & ElasticSearch Support
 http://sematext.com/
 On Apr 18, 2013 3:11 AM, David Parks davidpark...@yahoo.com wrote:

  Step 1: distribute processing
 
  We have 2 servers in which we'll run 2 SolrCloud instances on.
 
  We'll define 2 shards so that both servers are busy for each request
  (improving response time of the request).
 
 
 
  Step 2: Failover
 
  We would now like to ensure that if either of the servers goes down
  (we're very unlucky with disks), that the other will be able to take
  over automatically.
 
  So we define 2 shards with a replication factor of 2.
 
 
 
  So we have:
 
  . Server 1: Shard 1, Replica 2
 
  . Server 2: Shard 2, Replica 1
 
 
 
  Question:
 
  But in SolrCloud, replicas are active right? So isn't it now possible
  that the load balancer will have Server 1 process *both* parts of a
  request, after all, it has both shards due to the replication, right?
 
 




more results when adding more criteria

2013-04-18 Thread Kai Becker
Hi,
I have a field which has data like this:
"letters"
"letters numbers"
"letters numbers letters numbers"
where "letters" can be strings of 1 to 10 letters and "numbers" can have up
to 4 digits.

It is defined like this:
<field name="myField" type="myFieldType" indexed="true" stored="true"
       multiValued="true" />
<fieldType name="myFieldType" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

When the user enters "foo", I search for "foo" directly or something that starts
with "foo ".
I don't want to find "fool" or "foop" or anything like that.
I also allow users to enter terms that they don't want to find.

So a query for "foo" NOT "foo 123" is converted to this:

parsedquery_toString: +(+(myField:foo myField:foo *) +(-myField:foo 123 
-myField:foo 123 * +*:*)),

My problem is that this finds more entries than just "foo", which converts to
this:
parsedquery_toString: +(myField:foo myField:foo *),

I have read a bit about the internal Solr logic, using MUST, SHOULD and 
MUST_NOT, but still don't understand.

When I look at parsedquery_toString: +(+(myField:foo myField:foo *) 
+(-myField:foo 123 -myField:foo 123 * +*:*)),
then I see two criteria A and B and both MUST be satisfied.
Criteria A is the same as parsedquery_toString: +(myField:foo myField:foo *), 
so the number of results MUST be identical here.
Since the final results must match both A and B, the number must be equal or 
lower than just A, right?

Where is my thinking wrong?

Thanks,
Kai




solr4 : disable updateLog

2013-04-18 Thread Jamel ESSOUSSI
Hi,

If I disable (comment out) the updateLog block, will this affect the indexing result?








Paging and sorting in Solr

2013-04-18 Thread hassancrowdc
I have done paging using the Solr rows and start query attributes.

But now it shows me results that are sorted page-wise.
I mean, if I have the following scenario:

rows=25&start=0&sort=manufacturer asc

it gives me the first 25 matching results and then sorts only those.

I want it to sort all the results first and then apply rows and start. How
can I do that?





Re: Paging and sorting in Solr

2013-04-18 Thread Oussama Jilal

I am sure it does the sorting first (since I have always done it that way).

On 04/18/2013 02:49 PM, hassancrowdc wrote:

I have done paging using solr rows and start query attributes.

But now it shows me result with that is sorted page wise.
I meant if i have the following scenario:

rows=25&start=0&sort=manufacturer asc

It will give me first 25 matching results and then sort only those.

I want it to sort all the results first and then apply rows and start. How
can i do that?





--
Oussama Jilal



Re: Solr 4.2 fl issue

2013-04-18 Thread Yonik Seeley
When using a field name that doesn't follow conventions (basically like
Java identifiers), try this:

fl=field(098765-765-788558-7654_userid)

Or enclose it in quotes if it's a really whacky field name:

fl=field("098765-765-788558-7654_userid")

-Yonik
http://lucidworks.com


On Thu, Apr 18, 2013 at 2:52 AM, William Bell billnb...@gmail.com wrote:
 We are getting an issue when using a GUID for a field in Solr 4.2. Solr 3.6
 is fine. Something like:

 fl=098765-765-788558-7654_userid as a string stored.

 The issue is when the GUID begins with a numeral and then a minus.

 This is a bug.

 --
 Bill Bell
 billnb...@gmail.com
 cell 720-256-8076


Re: Paging and sorting in Solr

2013-04-18 Thread hassancrowdc
Hi,

I double-checked. It is the field. If I sort on the manufacturer field it
sorts, but if I sort on name it does not sort. Both fields have everything
the same. Is there any difference in sorting alphabetically or by the size
of the word?





Solr indexing

2013-04-18 Thread hassancrowdc
Solr is not showing the dates I have in the database. Is Solr following
any specific timezone? In my database my date is 2013-04-18 11:29:33 but
Solr shows me 2013-04-18T15:29:33Z. Any help?





Re: Solr indexing

2013-04-18 Thread Andy Lester

On Apr 18, 2013, at 10:49 AM, hassancrowdc hassancrowdc...@gmail.com wrote:

 Solr is not showing the dates i have in database. any help? is solr following
 any specific timezone? On my database my date is 2013-04-18 11:29:33 but
 solr shows me 2013-04-18T15:29:33Z.   Any help


Solr knows nothing of timezones.  Solr expects everything to be in UTC.  If you 
want time zone support, you'll have to convert local time to UTC before 
importing, and then convert back to local time from UTC when you read from Solr.
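
A minimal sketch of that conversion (pre-java.time style; needs
java.text.SimpleDateFormat, java.util.Date and java.util.TimeZone):

  // Format a local Date as the UTC string Solr expects.
  static String toSolrDate(Date localDate) {
      SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
      fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
      return fmt.format(localDate); // e.g. 2013-04-18T15:29:33Z
  }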

xoa

--
Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance



Re: Paging and sorting in Solr

2013-04-18 Thread Jack Krupansky
Maybe you have your name field as text rather than string. Don't try 
sorting text fields - make a copy (copyField) to a string field and sort 
the string field. So, for example, have name as text for keyword search, 
and name_s as string for sorting (and faceting.)
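
A schema.xml sketch of that setup (field names are examples; this assumes a
text_general type exists in the schema):

  <field name="name" type="text_general" indexed="true" stored="true"/>
  <field name="name_s" type="string" indexed="true" stored="false"/>
  <copyField source="name" dest="name_s"/>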


-- Jack Krupansky

-Original Message- 
From: hassancrowdc

Sent: Thursday, April 18, 2013 11:35 AM
To: solr-user@lucene.apache.org
Subject: Re: Paging and sorting in Solr

Hi,

I double checked. It is the field. if i sort through manufacturer field it
sorts but if i sort through name it does not sort. both the field has
everything same. Is there any difference in sorting alphabetically or size
of the word?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Paging-and-sorting-in-Solr-tp4057000p4057013.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: facet.method enum vs fc

2013-04-18 Thread Mingfeng Yang
20G is allocated to Solr already.

Ming


On Wed, Apr 17, 2013 at 11:56 PM, Toke Eskildsen 
t...@statsbiblioteket.dk wrote:

 On Wed, 2013-04-17 at 20:06 +0200, Mingfeng Yang wrote:
  I am doing faceting on an index of 120M documents,
  on the field of url[...]

 I would guess that you would need 3-4GB for that.
 How much memory do you allocate to Solr?

 - Toke Eskildsen




Re: Solr indexing

2013-04-18 Thread Jack Krupansky

Solr dates are always Z, GMT.

-- Jack Krupansky

-Original Message- 
From: hassancrowdc

Sent: Thursday, April 18, 2013 11:49 AM
To: solr-user@lucene.apache.org
Subject: Solr indexing

Solr is not showing the dates i have in database. any help? is solr 
following

any specific timezone? On my database my date is 2013-04-18 11:29:33 but
solr shows me 2013-04-18T15:29:33Z.   Any help






Re: TooManyClauses: maxClauseCount is set to 1024

2013-04-18 Thread Shawn Heisey

On 4/18/2013 6:02 AM, sawanverma wrote:

Hi Yonik,

Thanks for your reply.

I tried increasing the maxClauseCount to a bigger value. But what could be the 
ideal value and will not that hit the performance? What are the chances that if 
we increase the value we will not face this issue again?


Changing the maxBooleanClauses value does not affect performance.  It's 
just an arbitrary limit on query complexity.  You can make it as big as 
you want and Solr's performance will not change.  For most people, 1024 
is plenty.  For others, we have no idea how many clauses are needed.


The queries themselves with large numbers of clauses are what affects 
performance, and the only way to improve it is to decrease the query 
complexity.  Chances are good that you are already experiencing the 
performance hit associated with large queries.  Adding more clauses to a 
query will reduce performance.  If you find yourself in a situation 
where you continually need more boolean clauses, you may need to start 
over and create a better design.


The maxBooleanClauses value is just a safety net, created long ago when 
Lucene worked differently than it does now.  There is a discussion 
currently happening among committers about whether that limit even needs 
to exist.  Very likely the limit in Solr will be removed in the near future.


Thanks,
Shawn



Re: Max http connections in CloudSolrServer

2013-04-18 Thread Shawn Heisey

On 4/18/2013 6:42 AM, J Mohamed Zahoor wrote:


I dont yet know if this is the reason...
I am looking if jetty has some limit on accepting connections..


Are you using the Jetty included with Solr, or a Jetty installed 
separately?  The Jetty included with Solr has a maxThreads value of 
10000 in its config.  The default would be closer to 200, and a single 
request from a Cloud client likely uses multiple Jetty threads.
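
In the Jetty that ships with Solr, that setting lives in etc/jetty.xml,
roughly like this (the exact thread-pool class name varies with the Jetty
version):

  <Set name="ThreadPool">
    <New class="org.eclipse.jetty.util.thread.QueuedThreadPool">
      <Set name="maxThreads">10000</Set>
    </New>
  </Set>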


Thanks,
Shawn



Re: solr 3.5 core rename issue

2013-04-18 Thread Jie Sun
yeah, I realize using ${solr.core.name} for dataDir must be the cause of the
issue we see... it is fair to say the SWAP and RENAME just create an alias
that still points to the old dataDir.

if they cannot fix it then it is not a bug :-) at least we understand
exactly what is going on there.

thanks so much for your help!
Jie






shard query return 500 on large data set

2013-04-18 Thread Jie Sun
Hi -

when I execute a shard query like:

[myhost]:8080/solr/mycore/select?q=type:message&rows=14...&qt=standard&wt=standard&explainOther=&hl.fl=&shards=solrserver1:8080/solr/mycore,solrserver2:8080/solr/mycore,solrserver3:8080/solr/mycore

everything works fine until I query against a large set of data (> 100k
documents),
when the number of rows returned exceeds about 50k.

by the way I am using HttpClient GET method to send the solr shard query
over.

In the above scenario, the query fails with a 500 server error as returned
status code.

I am using solr 3.5.

I encountered a 404 before: when one of the shard servers does not have the
core, the whole shard query returns 404 to me. So I would expect that if one
of the servers encounters a timeout (408?), the shard query should return a
timeout status code?

I guess I am not sure what the shard query results will be in various
error scenarios... I guess I could look into the Solr code, but if you have any
input, it will be appreciated. Thanks

Renee





Sorting on alias fields

2013-04-18 Thread Stephane Gamard
Hi all, 

I am trying to sort results based on multiple fields aliased as one. Is that 
possible? While Solr does not complain (no error, results OK, etc.), it 
fails to sort the hits appropriately. I've attached the query, the relevant schema 
part and the result.

I am very curious to know if that is a feature that is currently supported 
(sorting on aliases)

Cheers, 

_Stephane

schema:
<dynamicField name="*_sort" required="false" type="date" indexed="true"
              stored="true" multiValued="false"/>
<field name="last_modified" type="date" indexed="true" stored="true"
       multiValued="false"/>
<field name="modified" type="date" indexed="true" stored="true"
       multiValued="false"/>

query: 
http://localhost:8983/select?q.alt=*%3A*&q.op=OR&rows=10&start=0&qt=%2Fselect&q=&qf=title_search%5E1.0&pf=title_search%5E1.0&fl=last_modified%2Cmodified%2Cmodified_sort%3Alast_modified%2Cmodified_sort%3Amodified&debugOther=1&debug=1&debugQuery=true&sort=modified_sort+desc

result:
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">9</int>
    <lst name="params">
      <str name="sort">modified_sort desc</str>
      <str name="qf">title_search^1.0</str>
      <str name="q.alt">*:*</str>
      <str name="debugOther">1</str>
      <str name="debug">1</str>
      <str name="rows">10</str>
      <str name="pf">title_search^1.0</str>
      <str name="fl">
        last_modified,modified,modified_sort:last_modified,modified_sort:modified
      </str>
      <str name="debugQuery">true</str>
      <str name="start">0</str>
      <str name="q"/>
      <str name="q.op">OR</str>
      <str name="qt">/select</str>
    </lst>
  </lst>
  <result name="response" numFound="17" start="0">
    <doc>
      <date name="last_modified">2013-04-12T00:00:00Z</date>
      <date name="modified_sort">2013-04-12T00:00:00Z</date>
    </doc>
    <doc>
      <date name="last_modified">2007-10-18T00:00:00Z</date>
      <date name="modified">2007-10-18T00:00:00Z</date>
      <date name="modified_sort">2007-10-18T00:00:00Z</date>
    </doc>
    <doc>
      <date name="last_modified">2013-04-12T00:00:00Z</date>
      <date name="modified_sort">2013-04-12T00:00:00Z</date>
    </doc>
    <doc>



Re: SolrCloud vs Solr master-slave replication

2013-04-18 Thread Lance Norskog
Run checksums on all files in both master and slave, and verify that 
they are the same.

TCP/IP has a checksum algorithm that was state-of-the-art in 1969.

On 04/18/2013 02:10 AM, Victor Ruiz wrote:

Also, I forgot to say... the same error started to happen again... the index
is again corrupted :(







RE: TooManyClauses: maxClauseCount is set to 1024

2013-04-18 Thread sawanverma
Shawn,

Thanks a lot for your reply. But I am confused again as to whether the 
following query counts as complex:
http://localhost:8983/solr/test/select/?q=content:*&fl=content&hl=true&hl.fl=content&hl.maxAnalyzedChars=31375&start=64&rows=1&sort=obs_date%20desc

Is that because of content:*? The only unusual thing is the size of the 
content field; in this particular case the content field holds enormously 
big data, and this problem comes up only when we search on * for the content 
field. Is there a way that we can split the doc size?

Regards,
Sawan

From: Shawn Heisey-4 [via Lucene] 
[mailto:ml-node+s472066n4057027...@n3.nabble.com]
Sent: 18 April 2013 PM 09:38
To: Sawan Verma
Subject: Re: TooManyClauses: maxClauseCount is set to 1024

On 4/18/2013 6:02 AM, sawanverma wrote:
 Hi Yonik,

 Thanks for your reply.

 I tried increasing the maxClauseCount to a bigger value. But what could be 
 the ideal value and will not that hit the performance? What are the chances 
 that if we increase the value we will not face this issue again?

Changing the maxBooleanClauses value does not affect performance.  It's
just an arbitrary limit on query complexity.  You can make it as big as
you want and Solr's performance will not change.  For most people, 1024
is plenty.  For others, we have no idea how many clauses are needed.

The queries themselves with large numbers of clauses are what affects
performance, and the only way to improve it is to decrease the query
complexity.  Chances are good that you are already experiencing the
performance hit associated with large queries.  Adding more clauses to a
query will reduce performance.  If you find yourself in a situation
where you continually need more boolean clauses, you may need to start
over and create a better design.

The maxBooleanClauses value is just a safety net, created long ago when
Lucene worked differently than it does now.  There is a discussion
currently happening among committers about whether that limit even needs
to exist.  Very likely the limit in Solr will be removed in the near future.
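
For reference, the limit is set in the <query> section of solrconfig.xml;
raising it is a one-line change (2048 here is just an example value):

  <!-- solrconfig.xml: cap on the number of boolean clauses per query -->
  <maxBooleanClauses>2048</maxBooleanClauses>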

Thanks,
Shawn







--
View this message in context: 
http://lucene.472066.n3.nabble.com/TooManyClauses-maxClauseCount-is-set-to-1024-tp4056965p4057060.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Query Elevation Component

2013-04-18 Thread davers
I want to elevate certain documents differently depending on a certain fq
parameter in the request. I've read of somebody coding Solr to do this, but
no code was shared. Where would I start looking to implement this feature
myself?





Re: TooManyClauses: maxClauseCount is set to 1024

2013-04-18 Thread Shawn Heisey

On 4/18/2013 11:53 AM, sawanverma wrote:

Shawn,

Thanks a lot for your reply. But I am confused again as to whether the 
following query counts as complex:
http://localhost:8983/solr/test/select/?q=content:*&fl=content&hl=true&hl.fl=content&hl.maxAnalyzedChars=31375&start=64&rows=1&sort=obs_date%20desc


I hardly know anything about highlighting, so nothing that I say here 
may have any relevance to your situation at all.


A query of content:* strikes me as an invalid query.  If you are 
shooting for all documents where content exists and excluding those 
where it doesn't exist, I would think that 'q=content:[* TO *]' (the TO 
must be uppercase) would be a better option.
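
Applied to the query above, that would look like this (same parameters, with
the brackets and spaces URL-encoded):

http://localhost:8983/solr/test/select/?q=content:%5B*%20TO%20*%5D&fl=content&hl=true&hl.fl=content&hl.maxAnalyzedChars=31375&start=64&rows=1&sort=obs_date%20desc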


Exactly how your query gets expanded into something that exceeds 
maxBooleanClauses is a complete mystery to me, and probably does have 
something to do with the highlighting.


Thanks,
Shawn



updating documents unintentionally adds extra values to certain fields

2013-04-18 Thread joyce chan
Hi

I am using solr 4.2, and have set up spatial search config as below

http://wiki.apache.org/solr/SpatialSearch#Schema_Configuration

But every time I make an update to a document,
http://wiki.apache.org/solr/UpdateJSON#Updating_a_Solr_Index_with_JSON

more values get inserted into the *_coordinates fields, even though they were
not set to multiValued, and this behavior doesn't happen to any of the other
fields.

Any ideas how to avoid adding extra values to the _coordinates fields on
updates?


Making fields unavailable for return to specific end points.

2013-04-18 Thread Andrew Lundgren
We have a few internal fields that we would like to restrict from being 
returned in result sets.

I have seen how fl is used to specify the fields that you do want returned; 
I am kind of looking for the opposite.  There are just a few fields that 
don't make sense to return to our clients.

Is there any functionality for a blocked-fl?

Thank you!

--
Andrew





RE: Making fields unavailable for return to specific end points.

2013-04-18 Thread Andrew Lundgren
Hmm...  Just found this JIRA:  https://issues.apache.org/jira/browse/SOLR-3191

I think I have answered my question.
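
Until something like that lands, one common workaround (a sketch, not taken
from the JIRA; the field list is made up) is to pin fl as an invariant on
the request handler, so clients cannot ask for the internal fields at all:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="invariants">
      <!-- clients can no longer override fl to pull internal-only fields -->
      <str name="fl">id,title,price</str>
    </lst>
  </requestHandler>

The trade-off is that every client of this handler gets the same fixed
field list.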

-Original Message-
From: Andrew Lundgren [mailto:lundg...@familysearch.org] 
Sent: Thursday, April 18, 2013 1:21 PM
To: solr-user@lucene.apache.org
Subject: Making fields unavailable for return to specific end points.

We have a few internal fields that we would like to restrict from being 
returned in result sets.

I have seen how fl is used to specify the fields that you do want returned; 
I am kind of looking for the opposite.  There are just a few fields that 
don't make sense to return to our clients.

Is there any functionality for a blocked-fl?

Thank you!

--
Andrew





Change the response of delta import

2013-04-18 Thread hassancrowdc
Is there any way I can change the response XML from the delta-import query:
localhost:8080/solr/devices/dataimport?command=delta-import&commit=true

I want to change the response.





Re: Change the response of delta import

2013-04-18 Thread Shawn Heisey

On 4/18/2013 1:59 PM, hassancrowdc wrote:

Is there any way I can change the response XML from the delta-import query:
localhost:8080/solr/devices/dataimport?command=delta-import&commit=true

I want to change the response.


The response is created by the dataimporthandler source code.  It's a 
contrib module included with Solr.  You can change that code and 
recompile, then replace your dataimporthandler jar with the new one.


Thanks,
Shawn



Re: Paging and sorting in Solr

2013-04-18 Thread hassancrowdc
thnx





PositionLengthAttribute - Does it do anything at all?

2013-04-18 Thread Hayden Muhl
I've been playing around with the PositionLengthAttribute for a few days,
and it doesn't seem to have any effect at all.

I'm aware that position length is not stored in the index, as explained in
this blog post.

http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

However, even when used at query time it doesn't seem to do anything. Let's
take the following token stream as an example.

text: he
posInc: 1
posLen: 1

text: cannot
posInc: 1
posLen: 2

text: can
posInc: 0
posLen: 1

text: not
posInc: 1
posLen: 1

text: help
posInc: 1
posLen: 1

If we were to construct this graph of tokens, it should match the phrases
"he can not help" and "he cannot help". According to my testing, it will
match the phrases "he can not help" and "he cannot not help", because the
position length is entirely ignored and treated as if it is always 1.
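
For concreteness, here is a minimal sketch of a TokenStream emitting that
graph with the Lucene 4.x attribute API (the class name is made up; this is
just the example stream above expressed in code):

  import java.io.IOException;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
  import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

  public final class CannedGraphStream extends TokenStream {
    private final String[] terms  = {"he", "cannot", "can", "not", "help"};
    private final int[]    posInc = { 1,    1,        0,     1,     1    };
    private final int[]    posLen = { 1,    2,        1,     1,     1    };
    private int upto = 0;

    private final CharTermAttribute termAtt =
        addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt =
        addAttribute(PositionIncrementAttribute.class);
    private final PositionLengthAttribute posLenAtt =
        addAttribute(PositionLengthAttribute.class);

    @Override
    public boolean incrementToken() throws IOException {
      if (upto == terms.length) {
        return false;
      }
      clearAttributes();
      termAtt.setEmpty().append(terms[upto]);
      posIncAtt.setPositionIncrement(posInc[upto]);
      // "cannot" gets posLen=2, so it spans the "can not" path in the graph
      posLenAtt.setPositionLength(posLen[upto]);
      upto++;
      return true;
    }

    @Override
    public void reset() throws IOException {
      super.reset();
      upto = 0;
    }
  }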

Am I misunderstanding how these attributes work?

- Hayden


Re: What are the pros and cons Having More Replica at SolrCloud

2013-04-18 Thread Timothy Potter
re: more replicas -

pro: you can scale your query processing workload because you have more
nodes available to service queries, eg 1,000 QPS sent to Solr with 5
replicas, then each is only processing roughly 200 QPS. If you need to
scale up to 10K QPS, then add more replicas to distribute the increased
workload

con: additional overhead (mostly network I/O) when indexing, shard leader
has to send N additional requests per update where N is the number of
replicas per shard. This seems minor unless you have many replicas per
shard. I can't think of any cons of having more replicas on the query side

As for your other question, when the leader receives an update request, it
forwards to all replicas in the active or recovering state in parallel and
waits for their response before responding to the client. All replicas must
accept the update for it to be considered successful, i.e. all replicas and
the leader must be in agreement on the status of a request. This is why you
hear people referring to Solr as favoring consistency over
write-availability. If you have 10 active replicas for a shard, then all 10
must accept the update or it fails, there's no concept of tunable
consistency on a write in Solr. Failed / offline replicas are obviously
ignored and they will sync up with the leader once they are back online.

Cheers,
Tim


On Thu, Apr 18, 2013 at 4:48 PM, Furkan KAMACI furkankam...@gmail.comwrote:

 What are the pros and cons of having more replicas at SolrCloud?

 Also, there is a point that I want to learn: when a request comes to a
 leader, does it forward it to a replica? And if it forwards it to a
 replica, does the replica work in parallel with the other replicas of the
 same leader to build up the index?



Updating clusterstate from the zookeeper

2013-04-18 Thread Manuel Le Normand
Hello,
After creating a distributed collection on several different servers I
sometimes have to deal with failing servers (cores appear not available =
grey) or failing cores (down / unable to recover = brown / red).
When I wish to delete such an erroneous collection (through the collection
API), only the green nodes get erased, leaving a meaningless unavailable
collection in the clusterstate.json.

Is there any way to edit the clusterstate.json explicitly? If not, how do I
update it so that a collection like the above gets deleted?

Cheers,
Manu


Re: What are the pros and cons Having More Replica at SolrCloud

2013-04-18 Thread Manuel Le Normand
On the query side, another downside I see would be that, for a given memory
pool, you'd have to share it with more cores, because every replica uses
its own cache.
This is true for the inner Solr caching (the JVM's heap) and for OS caching
as well. Adding a replicated core creates a new data set (index) that will
be accessed while queried.
If your replication adds a core of shard1 on a server that holds only
shard2, the OS caching and Solr caching would have to share the RAM between
totally different memory contents (as the files and query results for
different shards are different), so that case is clear.
In the second case, if you add a replicated core to a server that already
contains shard1, I'm not sure. There might be benefits if the JVM handled
its caches per shard and not per core, but the OS caching would still
differentiate between the different replications of the same index and try
to keep both sets of index files in memory.

Cheers,
Manu

So if you're short on memory, or your queries are alike (have a high hit
ratio), you may take better advantage of your RAM by not splitting it
across many replications.


On Fri, Apr 19, 2013 at 3:08 AM, Timothy Potter thelabd...@gmail.comwrote:

 re: more replicas -

 pro: you can scale your query processing workload because you have more
 nodes available to service queries, eg 1,000 QPS sent to Solr with 5
 replicas, then each is only processing roughly 200 QPS. If you need to
 scale up to 10K QPS, then add more replicas to distribute the increased
 workload

 con: additional overhead (mostly network I/O) when indexing, shard leader
 has to send N additional requests per update where N is the number of
 replicas per shard. This seems minor unless you have many replicas per
 shard. I can't think of any cons of having more replicas on the query side

 As for your other question, when the leader receives an update request, it
 forwards to all replicas in the active or recovering state in parallel and
 waits for their response before responding to the client. All replicas must
 accept the update for it to be considered successful, i.e. all replicas and
 the leader must be in agreement on the status of a request. This is why you
 hear people referring to Solr as favoring consistency over
 write-availability. If you have 10 active replicas for a shard, then all 10
 must accept the update or it fails, there's no concept of tunable
 consistency on a write in Solr. Failed / offline replicas are obviously
 ignored and they will sync up with the leader once they are back online.

 Cheers,
 Tim


 On Thu, Apr 18, 2013 at 4:48 PM, Furkan KAMACI furkankam...@gmail.com
 wrote:

  What are the pros and cons of having more replicas at SolrCloud?
 
  Also, there is a point that I want to learn: when a request comes to a
  leader, does it forward it to a replica? And if it forwards it to a
  replica, does the replica work in parallel with the other replicas of the
  same leader to build up the index?
 



Re: Solr system and numbers

2013-04-18 Thread uohzoaix
If I want to search on subsets of a number, what can I do?





Re: Solr system and numbers

2013-04-18 Thread Alexandre Rafalovitch
Do you mean a range (e.g. [4 TO 17]) or a prefix (e.g. 10*)? For a range
you need to index it as a number. For a prefix, a string is probably
better. Then, just use the standard query parameters.
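
For example (field names here are made up; num_field is a numeric/trie
field, str_field a string field):

  q=num_field:[4 TO 17]   -- range query, requires a numeric field type
  q=str_field:10*         -- prefix query, works well on a string field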

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Thu, Apr 18, 2013 at 9:29 PM, uohzoaix johncho...@gmail.com wrote:
 If I want to search on subsets of a number, what can I do?





Re: Solr indexing

2013-04-18 Thread uohzoaix
You can just change the date field type to string.





RE: SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread David Parks
I think I still don't understand something here. 

My concern right now is that query times are very slow for a 120GB index
(14s on avg), and I've seen a lot of disk activity when running queries.

I'm hoping that distributing that query across 2 servers is going to improve
the query time; specifically, I'm hoping that we can distribute that disk
activity, because we don't have great disks on there (yet).

So, with disk IO being a factor, running the query on one box vs. across 2
*should* matter, right?

Admittedly, this is the first step in what will probably be many to try to
work our query times down from 14s to what I want to be around 1s.

Dave


-Original Message-
From: Timothy Potter [mailto:thelabd...@gmail.com] 
Sent: Thursday, April 18, 2013 9:16 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

Hi Dave,

This sounds more like a budget / deployment issue vs. anything
architectural. You want 2 shards with replication so you either need
sufficient capacity on each of your 2 servers to host 2 Solr instances or
you need 4 servers. You need to avoid starving Solr of necessary RAM, disk
performance, and CPU regardless of how you lay out the cluster otherwise
performance will suffer. My guess is if each Solr had sufficient resources,
you wouldn't actually notice much difference in query performance.

Tim


On Thu, Apr 18, 2013 at 8:03 AM, David Parks davidpark...@yahoo.com wrote:

 But my concern is this, when we have just 2 servers:
  - I want 1 to be able to take over in case the other fails, as you 
 point out.
  - But when *both* servers are up I don't want the SolrCloud load 
 balancer to have Shard1 and Replica2 do the work (as they would both 
 reside on the same physical server).

 Does that make sense? I want *both* server1 & server2 sharing the 
 processing of every request, *and* I want the failover capability.

 I'm probably missing some bit of logic here, but I want to be sure I 
 understand the architecture.

 Dave



 -Original Message-
 From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
 Sent: Thursday, April 18, 2013 8:13 PM
 To: solr-user@lucene.apache.org
 Subject: Re: SolrCloud loadbalancing, replication, and failover

 Correct. This is what you want if server 2 goes down.

 Otis
 Solr & ElasticSearch Support
 http://sematext.com/
 On Apr 18, 2013 3:11 AM, David Parks davidpark...@yahoo.com wrote:

  Step 1: distribute processing
 
  We have 2 servers in which we'll run 2 SolrCloud instances on.
 
  We'll define 2 shards so that both servers are busy for each request 
  (improving response time of the request).
 
 
 
  Step 2: Failover
 
  We would now like to ensure that if either of the servers goes down 
  (we're very unlucky with disks), that the other will be able to take 
  over automatically.
 
  So we define 2 shards with a replication factor of 2.
 
 
 
  So we have:
 
  . Server 1: Shard 1, Replica 2
 
  . Server 2: Shard 2, Replica 1
 
 
 
  Question:
 
  But in SolrCloud, replicas are active right? So isn't it now 
  possible that the load balancer will have Server 1 process *both* 
  parts of a request, after all, it has both shards due to the
replication, right?
 
 





DirectSolrSpellChecker : vastly varying spellcheck QTime times.

2013-04-18 Thread SandeepM
Hi!

I am using SOLR 4.2.1.

My solrconfig.xml contains the following:

  <searchComponent name="MySpellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">text_spell</str>

    <lst name="spellchecker">
      <str name="name">MySpellchecker</str>
      <str name="field">spell</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <str name="distanceMeasure">internal</str>
      <float name="accuracy">0.5</float>
      <int name="maxEdits">2</int>
      <int name="minPrefix">1</int>
      <int name="maxInspections">5</int>
      <int name="minQueryLength">3</int>
      <float name="maxQueryFrequency">0.01</float>
    </lst>
  </searchComponent>

  <requestHandler name="/select" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <int name="rows">10</int>
      <str name="df">id</str>
      <str name="spellcheck.dictionary">MySpellchecker</str>
      <str name="spellcheck">on</str>
      <str name="spellcheck.extendedResults">false</str>
      <str name="spellcheck.count">10</str>
      <str name="spellcheck.alternativeTermCount">10</str>
      <str name="spellcheck.maxResultsForSuggest">35</str>
      <str name="spellcheck.onlyMorePopular">true</str>
      <str name="spellcheck.collate">true</str>
      <str name="spellcheck.collateExtendedResults">false</str>
      <str name="spellcheck.maxCollationTries">10</str>
      <str name="spellcheck.maxCollations">1</str>
      <str name="spellcheck.collateParam.q.op">AND</str>
    </lst>
    <arr name="last-components">
      <str>MySpellcheck</str>
    </arr>
  </requestHandler>

schema.xml with the spell field looks like:

  <fieldType name="text_spell" class="solr.TextField"
             positionIncrementGap="100" sortMissingLast="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="lang/stopwords_en.txt" enablePositionIncrements="true" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="lang/stopwords_en.txt" enablePositionIncrements="true" />
    </analyzer>
  </fieldType>

  <field name="spell" type="text_spell" indexed="true"
         stored="false" multiValued="true" />

  <copyField source="title" dest="spell" />
  <copyField source="artist" dest="spell" />

My query:
http://host/solr/select?q=&spellcheck.q=chocolat%20factry&spellcheck=true&df=spell&fl=&indent=on&wt=xml&rows=10&version=2.2&echoParams=explicit

In this case, the intent is to correct "chocolat factry" to "chocolate
factory", which exists in my spell field index. I see a QTime from the above
query of somewhere between 350-400ms.

I run a similar query replacing the spellcheck terms with "pursut hapyness",
where "pursuit happyness" actually exists in my spell field, and I see a
QTime of 15-17ms.

Both queries produce collations correctly, but there is an order of
magnitude difference in QTime. There is one edit per term, i.e. 2 edits per
query, in both cases, and the lengths of the words in the two queries are
about the same. I'd like to understand why there is this vast difference in
QTime. I would appreciate any help with this, since I am not sure how to get
meaningful performance numbers and attribute the slowness to anything in
particular.

I also see a vast difference in QTime in another case. Replace the search
terms in the above query with "over cuckoo's nest", "over cuccoo's nst",
etc. "over cuckoo's nest" exists in my indexed spell field, so it should be
found almost immediately; yet this query fails to produce any collation and
takes 10 seconds. Meanwhile the second query, "over cuccoo's nst", corrects
the phrase and returns in 24ms. Something does not sound right here.

I would appreciate help with these.

Thanks in advance.
Regards,
-- Sandeep





Re: SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread Shawn Heisey
On 4/18/2013 8:12 PM, David Parks wrote:
 I think I still don't understand something here. 
 
 My concern right now is that query times are very slow for 120GB index (14s
 on avg), I've seen a lot of disk activity when running queries.
 
 I'm hoping that distributing that query across 2 servers is going to improve
 the query time, specifically I'm hoping that we can distribute that disk
 activity because we don't have great disks on there (yet).
 
 So, with disk IO being a factor in mind, running the query on one box, vs.
 across 2 *should* be a concern right?
 
 Admittedly, this is the first step in what will probably be many to try to
 work our query times down from 14s to what I want to be around 1s.

I went through my mailing list archive to see what all you've said about
your setup.  One thing that I can't seem to find is a mention of how
much total RAM is in each of your servers.  I apologize if it was
actually there and I overlooked it.

In one email thread, you wanted to know whether Solr is CPU-bound or
IO-bound.  Solr is heavily reliant on the index on disk, and disk I/O is
the slowest piece of the puzzle. The way to get good performance out of
Solr is to have enough memory that you can take the disk mostly out of
the equation by having the operating system cache the index in RAM.  If
you don't have enough RAM for that, then Solr becomes IO-bound, and your
CPUs will be busy in iowait, unable to do much real work.  If you DO
have enough RAM to cache all (or most) of your index, then Solr will be
CPU-bound.

With 120GB of total index data on each server, you would want at least
128GB of RAM per server, assuming you are only giving 8-16GB of RAM to
Solr, and that Solr is the only thing running on the machine.  If you
have more servers and shards, you can reduce the per-server memory
requirement because the amount of index data on each server would go
down.  I am aware of the cost associated with this kind of requirement -
each of my Solr servers has 64GB.

If you are sharing the server with another program, then you want to
have enough RAM available for Solr's heap, Solr's data, the other
program's heap, and the other program's data.  Some programs (like
MySQL) completely skip the OS disk cache and instead do that caching
themselves with heap memory that's actually allocated to the program.
If you're using a program like that, then you wouldn't need to count its
data.

Using SSDs for storage can speed things up dramatically and may reduce
the total memory requirement to some degree, but even an SSD is slower
than RAM.  The transfer speed of RAM is faster, and from what I
understand, the latency is at least an order of magnitude quicker -
nanoseconds vs microseconds.

In another thread, you asked about how Google gets such good response
times.  Although Google's software probably works differently than
Solr/Lucene, when it comes right down to it, all search engines do
similar jobs and have similar requirements.  I would imagine that Google
gets incredible response time because they have incredible amounts of
RAM at their disposal that keep the important bits of their index
instantly available.  They have thousands of servers in each data
center.  I once got a look at the extent of Google's hardware in one
data center - it was HUGE.  I couldn't get in to examine things closely,
they keep that stuff very locked down.

Thanks,
Shawn



RE: TooManyClauses: maxClauseCount is set to 1024

2013-04-18 Thread sawanverma
Shawn,

Giving content:[* TO *] gives the same error, but when I give
content:[a TO z] it works fine. Can you please explain what it means when I
give content:[a TO z]? Can I use this as a workaround? The datatype of the
content field is text_en.

Thanks again for you replies and suggestions.

Regards,
Sawan

From: Shawn Heisey-4 [via Lucene] 
[mailto:ml-node+s472066n4057074...@n3.nabble.com]
Sent: Friday, April 19, 2013 12:33 AM
To: Sawan Verma
Subject: Re: TooManyClauses: maxClauseCount is set to 1024

On 4/18/2013 11:53 AM, sawanverma wrote:
 Shawn,

 Thanks a lot for your reply. But I am confused again as to whether the 
 following query counts as complex:
 http://localhost:8983/solr/test/select/?q=content:*&fl=content&hl=true&hl.fl=content&hl.maxAnalyzedChars=31375&start=64&rows=1&sort=obs_date%20desc

I hardly know anything about highlighting, so nothing that I say here
may have any relevance to your situation at all.

A query of content:* strikes me as an invalid query.  If you are
shooting for all documents where content exists and excluding those
where it doesn't exist, I would think that 'q=content:[* TO *]' (the TO
must be uppercase) would be a better option.

Exactly how your query gets expanded into something that exceeds
maxBooleanClauses is a complete mystery to me, and probably does have
something to do with the highlighting.

Thanks,
Shawn








Re: TooManyClauses: maxClauseCount is set to 1024

2013-04-18 Thread Shawn Heisey
On 4/18/2013 11:02 PM, sawanverma wrote:
 Giving content:[* TO *] gives the same error but when I give content:[a TO z] 
 it works fine. Can you please explain what does it mean when I give 
 content:[a TO z]? Can I use this as workaround? The datatype of content field 
 is text_en.

That syntax is a range query.  The [* TO *] basically means that you are
requesting all documents where the content field exists (has a value).
It's not very likely that [a TO z] will include all possible documents -
it would not include a value like "zap" for instance, because
alphabetically, that comes after "z".

I am a little bit confused - why would you want to do highlighting on a
query that matches all documents that contain the content field, or even
all documents?  The point of highlighting is to show the parts of the
text that matched your query text, but you don't have any query text.

I think it may be time to back up and tell us what you want to actually
accomplish, rather than trying to deal directly with the error message.
 Because it has to do with highlighting, I may not be able to help, but
there are plenty of very smart people here who do understand highlighting.

Thanks,
Shawn



RE: SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread David Parks
Wow! That was the most pointed, concise discussion of hardware requirements
I've seen to date, and it's fabulously helpful, thank you Shawn!  We
currently have 2 servers on which I can dedicate about 12GB of RAM to Solr
(we're moving to these 2 servers now). I can upgrade further if it's needed
& justified, and your discussion helps me justify that such an upgrade is
the right thing to do.

So... if I move to 3 servers with 50GB of RAM each, using 3 shards, I should
be in the free and clear then, right?  This seems reasonable and doable.

In this more extreme example the failover properties of SolrCloud become
clearer. I couldn't possibly run a replica shard without doubling the
memory, so replication really isn't reasonable until I have double the
hardware; then the load balancing scheme makes perfect sense. With 3
servers, 50GB of RAM each, and a 120GB index, I think I should just back up
the index directory.

My previous thought to run replication just for failover would have actually
resulted in LOWER performance, because I would have halved the memory
available to the master & replica. So the previous question is answered as
well now.

Question: if I had 1 server with 60GB of memory and a 120GB index, would
Solr make full use of the 60GB of memory, thus trimming disk access in half?
Or is it an all-or-nothing thing?  In a dev environment, I didn't notice
Solr consuming the full 5GB of RAM assigned to it with a 120GB index.

Dave


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Friday, April 19, 2013 11:51 AM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

On 4/18/2013 8:12 PM, David Parks wrote:
 I think I still don't understand something here. 
 
 My concern right now is that query times are very slow for 120GB index 
 (14s on avg), I've seen a lot of disk activity when running queries.
 
 I'm hoping that distributing that query across 2 servers is going to 
 improve the query time, specifically I'm hoping that we can distribute 
 that disk activity because we don't have great disks on there (yet).
 
 So, with disk IO being a factor in mind, running the query on one box, vs.
 across 2 *should* be a concern right?
 
 Admittedly, this is the first step in what will probably be many to 
 try to work our query times down from 14s to what I want to be around 1s.

I went through my mailing list archive to see what all you've said about
your setup.  One thing that I can't seem to find is a mention of how much
total RAM is in each of your servers.  I apologize if it was actually there
and I overlooked it.

In one email thread, you wanted to know whether Solr is CPU-bound or
IO-bound.  Solr is heavily reliant on the index on disk, and disk I/O is the
slowest piece of the puzzle. The way to get good performance out of Solr is
to have enough memory that you can take the disk mostly out of the equation
by having the operating system cache the index in RAM.  If you don't have
enough RAM for that, then Solr becomes IO-bound, and your CPUs will be busy
in iowait, unable to do much real work.  If you DO have enough RAM to cache
all (or most) of your index, then Solr will be CPU-bound.

With 120GB of total index data on each server, you would want at least 128GB
of RAM per server, assuming you are only giving 8-16GB of RAM to Solr, and
that Solr is the only thing running on the machine.  If you have more
servers and shards, you can reduce the per-server memory requirement because
the amount of index data on each server would go down.  I am aware of the
cost associated with this kind of requirement - each of my Solr servers has
64GB.

If you are sharing the server with another program, then you want to have
enough RAM available for Solr's heap, Solr's data, the other program's heap,
and the other program's data.  Some programs (like
MySQL) completely skip the OS disk cache and instead do that caching
themselves with heap memory that's actually allocated to the program.
If you're using a program like that, then you wouldn't need to count its
data.

Using SSDs for storage can speed things up dramatically and may reduce the
total memory requirement to some degree, but even an SSD is slower than RAM.
The transfer speed of RAM is faster, and from what I understand, the latency
is at least an order of magnitude quicker - nanoseconds vs microseconds.

In another thread, you asked about how Google gets such good response times.
Although Google's software probably works differently than Solr/Lucene, when
it comes right down to it, all search engines do similar jobs and have
similar requirements.  I would imagine that Google gets incredible response
time because they have incredible amounts of RAM at their disposal that keep
the important bits of their index instantly available.  They have thousands
of servers in each data center.  I once got a look at the extent of Google's
hardware in one data center - it was HUGE.  I couldn't get in to examine
things