Re: Avoid re indexing

2015-08-02 Thread Upayavira
You do not want to add a new shard. First, you want your docs evenly
spread; second, they are spread using hash ranges, and to add more
capacity you spread out those hash ranges using shard splitting.
Adding a new shard doesn't really make any sense here, unless you go
for implicit routing, where you decide for yourself which shard a doc
goes into, but it seems too late to make that decision in your case.
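
(For reference, and not something I would recommend at this stage: with
the implicit router you name the shards up front and route each document
yourself, via the _route_ parameter or a router.field. A minimal sketch,
with placeholder collection/shard/config names:

http://localhost:8983/solr/admin/collections?action=CREATE&name=mycoll&router.name=implicit&shards=shardA,shardB&collection.configName=myconf

Documents are then indexed with _route_=shardA or _route_=shardB, and
adding capacity later really does mean adding a shard, via CREATESHARD.)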

Upayavira

On Sun, Aug 2, 2015, at 12:40 AM, Nagasharath wrote:
 Yes, shard splitting will only help in managing large clusters and in
 improving query performance. In my case the index has grown to full
 capacity (no room left in the existing shards) across the collection,
 so adding a new shard would help, and for that I would have to re-index.
 
 
  On 01-Aug-2015, at 6:34 pm, Upayavira u...@odoko.co.uk wrote:
  
  Erm, that doesn't seem to make sense. Seems like you are talking about
  *merging* shards.
  
  Say you had two shards, 3m docs each:
  
  shard1: 3m docs
  shard2: 3m docs
  
  If you split shard1, you would have:
  
  shard1_0: 1.5m docs
  shard1_1: 1.5m docs
  shard2: 3m docs
  
  You could, of course, then split shard2. You could also split shard1
  into three parts instead, if you preferred:
  
  shard1_0: 1m docs
  shard1_1: 1m docs
  shard1_2: 1m docs
  shard2: 3m docs
  
  Upayavira
  
  On Sun, Aug 2, 2015, at 12:25 AM, Nagasharath wrote:
   If my current shard is holding 3 million documents, will each new
   sub-shard after splitting also be able to hold 3 million documents?
   If that is the case, then after shard splitting the sub-shards should
   hold 6 million documents when a shard is split into two. Am I right?
  
  On 01-Aug-2015, at 5:43 pm, Upayavira u...@odoko.co.uk wrote:
  
  
  
  On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote:
  I am using solrj to index documents
  
   I agree with you regarding the index update, but I should not see any
   deleted documents as it is a fresh index. Can we actually identify
   what those deleted documents are?
  
   If you post doc 1234, then post doc 1234 a second time, you will see
   a deletion in your index. If you don't want deletions to show in your
   index, be sure NEVER to update a document; only add new ones with
   absolutely distinct document IDs.
  
  You cannot see (via Solr) which docs are deleted. You could, I suppose,
  introspect the Lucene index, but that would most definitely be an expert
  task.
  
   If there is no option of adding shards to an existing collection, I do
   not like the idea of re-indexing the whole data set (it takes hours).
   We went with a good number of shards, but there has been a rapid
   increase in data size over the past few days. Do you think it is worth
   logging a ticket?
  
  You can split a shard. See the collections API:
  
  https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3
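
   As a rough sketch (collection and shard names below are placeholders),
   a split is a single call:

   http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1

   That produces shard1_0 and shard1_1, each covering half of shard1's hash
   range; once they are active, the now-inactive parent shard can be removed
   with DELETESHARD.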
  
  What would you want to log a ticket for? I'm not sure that there's
  anything that would require that.
  
  Upayavira


Re: solr multicore vs sharding vs 1 big collection

2015-08-02 Thread Jay Potharaju
The documents contain around 30 fields and have stored set to true for
almost 15 of them, and these stored fields are queried and updated all the
time. You will notice that the deleted documents are almost 30% of the
docs, and that percentage has stayed there and has not come down.
I did try an optimize, but that was disruptive as it caused search errors.
I have been playing with the merge factor to see if that helps with deleted
documents or not. It is currently set to 5.

The server has 24 GB of memory, of which consumption is normally around 23
GB, and the JVM is set to 6 GB. I have noticed that the available memory on
the server drops to around 100 MB at times during the day.
All the updates are run through DIH.

At least once every day I see the following error, which results in search
errors on the front end of the site.

ERROR org.apache.solr.servlet.SolrDispatchFilter -
null:org.eclipse.jetty.io.EofException

From what I have read these are mainly due to timeouts; my timeout is set
to 30 seconds and I can't set it to a higher number. I was thinking that
high memory usage sometimes leads to bad performance/errors.

My objective is to stop the errors. Adding more memory to the server is not
a good scaling strategy, which is why I was thinking there may be an issue
with the way things are set up that needs to be revisited.

Thanks


On Sat, Aug 1, 2015 at 7:06 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 8/1/2015 6:49 PM, Jay Potharaju wrote:
  I currently have a single collection with 40 million documents and an
  index size of 25 GB. The collection gets updated every n minutes, and as
  a result the number of deleted documents is constantly growing. The data
  in the collection is an amalgamation of records from 1000+ customers.
  The number of documents per customer is around 100,000 on average.
 
  That being said, I'm trying to get a handle on the growing number of
  deleted documents. Because of the growing index size, both disk space
  and memory are being used up, and I would like to reduce it to a
  manageable size.
 
  I have been thinking of splitting the data into multiple cores, one for
  each customer. This would let me manage the smaller collections easily
  and create/update them quickly. My concern is that the number of
  collections might become an issue. Any suggestions on how to address
  this problem? What are my other alternatives to moving to multicore
  collections?
 
  Solr: 4.9
  Index size:25 GB
  Max doc: 40 million
  Doc count:29 million
 
  Replication:4
 
  4 servers in solrcloud.

 Creating 1000+ collections in SolrCloud is definitely problematic.  If
 you need to choose between a lot of shards and a lot of collections, I
 would definitely go with a lot of shards.  I would also want a lot of
 servers for an index with that many pieces.

 https://issues.apache.org/jira/browse/SOLR-7191

 I don't think it would matter how many collections or shards you have
 when it comes to how many deleted documents are in your index.  If you
 want to clean up a large number of deletes in an index, the best option
 is an optimize.  An optimize requires a large amount of disk I/O, so it
 can be extremely disruptive if the query volume is high.  It should be
 done when the query volume is at its lowest.  For the index you
 describe, a nightly or weekly optimize seems like a good option.
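
 For what it's worth, a minimal sketch of what that nightly or weekly job
 could call (the collection name is a placeholder); an expungeDeletes
 commit is a lighter-weight alternative that only rewrites segments
 containing deleted documents:

 http://localhost:8983/solr/mycollection/update?optimize=true
 http://localhost:8983/solr/mycollection/update?commit=true&expungeDeletes=true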

 Aside from having a lot of deleted documents in your index, what kind of
 problems are you trying to solve?

 Thanks,
 Shawn




-- 
Thanks
Jay Potharaju


Are Solr releases predictable? Every 2 months?

2015-08-02 Thread Gili Nachum
When is 5.3 coming out?
When is SOLR-6273 https://issues.apache.org/jira/browse/SOLR-6273 (Cross
Data Center Replication) to be released?
Any way to tell?


Re: solr multicore vs sharding vs 1 big collection

2015-08-02 Thread Shawn Heisey
On 8/2/2015 8:29 AM, Jay Potharaju wrote:
 The documents contain around 30 fields and have stored set to true for
 almost 15 of them, and these stored fields are queried and updated all the
 time. You will notice that the deleted documents are almost 30% of the
 docs, and that percentage has stayed there and has not come down.
 I did try optimize but that was disruptive as it caused search errors.
 I have been playing with merge factor to see if that helps with deleted
 documents or not. It is currently set to 5.
 
 The server has 24 GB of memory out of which memory consumption is around 23
 GB normally and the jvm is set to 6 GB. And have noticed that the available
 memory on the server goes to 100 MB at times during a day.
 All the updates are run through DIH.

Using all available memory is completely normal operation for ANY
operating system.  If you hold up Windows as an example of one that
doesn't ... it lies to you about available memory.  All modern
operating systems will use any memory that is not explicitly allocated
to programs for the OS disk cache.

The disk cache will instantly give up any of the memory it is using for
programs that request it.  Linux doesn't try to hide the disk cache from
you, but older versions of Windows do.  In the newer versions of Windows
that have the Resource Monitor, you can go there to see the actual
memory usage including the cache.

 At least once every day I see the following error, which results in search
 errors on the front end of the site.
 
 ERROR org.apache.solr.servlet.SolrDispatchFilter -
 null:org.eclipse.jetty.io.EofException
 
 From what I have read these are mainly due to timeouts; my timeout is set
 to 30 seconds and I can't set it to a higher number. I was thinking that
 high memory usage sometimes leads to bad performance/errors.

Although this error can be caused by timeouts, it has a specific
meaning.  It means that the client disconnected before Solr responded to
the request, so when Solr tried to respond (through jetty), it found a
closed TCP connection.

Client timeouts need to either be completely removed, or set to a value
much longer than any request will take.  Five minutes is a good starting
value.

If all of your client timeouts are set to 30 seconds and you are seeing
EofExceptions, that means that your requests are taking longer than 30
seconds, and you likely have some performance issues.  It's also
possible that some of your client timeouts are set a lot shorter than 30
seconds.
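
If the front end happens to use SolrJ, the settings look roughly like this
(a sketch only; the URL is a placeholder, and other client libraries have
equivalent connection/read timeout options):

import org.apache.solr.client.solrj.impl.HttpSolrServer;

// configure the client used for queries, not Solr itself
HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/mycollection");
server.setConnectionTimeout(5000);  // time allowed to establish the connection
server.setSoTimeout(300000);        // socket read timeout: five minutes, as above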

 My objective is to stop the errors. Adding more memory to the server is not
 a good scaling strategy, which is why I was thinking there may be an issue
 with the way things are set up that needs to be revisited.

You're right that adding more memory to the servers is not a good
scaling strategy for the general case ... but in this situation, I think
it might be prudent.  For your index and heap sizes, I would want the
company to pay for at least 32GB of RAM.

Having said that ... I've seen Solr installs work well with a LOT less
memory than the ideal.  I don't know that adding more memory is
necessary, unless your system (CPU, storage, and memory speeds) is
particularly slow.  Based on your document count and index size, your
documents are quite small, so I think your memory size is probably good
-- if the CPU, memory bus, and storage are very fast.  If one or more of
those subsystems aren't fast, then make up the difference with lots of
memory.

Some light reading, where you will learn why I think 32GB is an ideal
memory size for your system:

https://wiki.apache.org/solr/SolrPerformanceProblems

It is possible that your 6GB heap is not quite big enough for good
performance, or that your GC is not well-tuned.  These topics are also
discussed on that wiki page.  If you increase your heap size, then the
likelihood of needing more memory in the system becomes greater, because
there will be less memory available for the disk cache.
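
If you do raise the heap, it is just the standard JVM options on whatever
starts Solr. A sketch only, assuming the stock Jetty start.jar setup from
the 4.x example directory, and 8g is purely illustrative:

java -Xms8g -Xmx8g -jar start.jar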

Thanks,
Shawn



Re: Are Solr releases predictable? Every 2 months?

2015-08-02 Thread Alexandre Rafalovitch
They are not that predictable. Somebody has to volunteer to be the
release manager, and then there is a flurry of cleanups, release
candidates, etc.

You can see all of that on the Lucene-Dev mailing list. For example, a
5.3 release has been proposed (as an idea) on July 30th, but not much
has happened since. Still, it must be fairly close.

The specific JIRA issue is so far trunk-only and has not been backported
to 5_x, so in my mind it is very unlikely to make 5.3.

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 2 August 2015 at 06:37, Gili Nachum gilinac...@gmail.com wrote:
 When is 5.3 coming out?
 When is SOLR-6273 https://issues.apache.org/jira/browse/SOLR-6273 (Cross
 Data Center Replication) to be released?
 Any way to tell?


How to use BitDocSet within a PostFilter

2015-08-02 Thread Stephen Weiss
Hi everyone,

I'm trying to write a PostFilter for Solr 5.1.0, which is meant to crawl 
through grandchild documents during a search through the parents and filter out 
documents based on statistics gathered from aggregating the grandchildren 
together.  I've been successful in getting the logic correct, but it does not 
perform so well - I'm grabbing too many documents from the index along the way. 
 I'm trying to filter out grandchild documents which are not relevant to the 
statistics I'm collecting, in order to reduce the number of document objects 
pulled from the IndexReader.

I've implemented the following code in my DelegatingCollector.collect:

if (inStockSkusBitSet == null) {
    // cast from IndexSearcher to SolrIndexSearcher to expose getDocSet
    SolrIndexSearcher SidxS = (SolrIndexSearcher) idxS;
    inStockSkusDocSet = SidxS.getDocSet(inStockSkusQuery);
    // cast from DocSet to BitDocSet to expose getBits
    inStockSkusBitDocSet = (BitDocSet) inStockSkusDocSet;
    inStockSkusBitSet = inStockSkusBitDocSet.getBits();
}


My BitDocSet reports a size which matches a standard query for the more
limited set of grandchildren, and the FixedBitSet (inStockSkusBitSet) also
reports this same cardinality. Based on that, it seems the getDocSet call
itself must be working properly and returning the right number of documents.
However, when I try to filter out grandchild documents using either
BitDocSet.exists or BitSet.get (passing over any grandchild document which
doesn't exist in the BitDocSet, or doesn't return true from the bitset), I
get about 1/3 fewer results than I'm supposed to. It seems many documents
that should match the filter are being excluded, and documents which should
not match the filter are being included.

I'm trying to use it in either of these ways:

if (!inStockSkusBitSet.get(currentChildDocNumber)) continue;
if (!inStockSkusBitDocSet.exists(currentChildDocNumber)) continue;

The currentChildDocNumber is simply the docNumber which is passed to 
DelegatingCollector.collect, decremented until I hit a document that doesn't 
belong to the parent document.
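
In case it helps, here is roughly what that walk looks like (the names are
mine, and parentDocNumber / belongsToCurrentParent are placeholders for my
actual block-join bookkeeping). One thing I'm unsure about is whether I need
to add the segment's docBase (captured in doSetNextReader) before checking
the bitset, since the DocSet came from the top-level searcher while collect()
gets per-segment ids, so I've sketched it that way:

int currentChildDocNumber = parentDocNumber - 1;  // walk backwards from the parent
while (belongsToCurrentParent(currentChildDocNumber)) {
    // inStockSkusBitSet was built from a whole-index DocSet, so the
    // per-segment id may need docBase added before the lookup
    if (inStockSkusBitSet.get(docBase + currentChildDocNumber)) {
        // ... aggregate stats from this grandchild ...
    }
    currentChildDocNumber--;
}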

I can't seem to figure out a way to actually use the BitDocSet (or its 
derivatives) to quickly eliminate document IDs.  It seems like this is how it's 
supposed to be used.  What am I getting wrong?

Sorry if this is a newbie question, I've never written a PostFilter before, and 
frankly, the documentation out there is a little sketchy (mostly for version 4) 
- so many classes have changed names and so many of the more well-documented 
techniques are deprecated or removed now, it's tough to follow what the current 
best practice actually is.  I'm using the block join functionality heavily so 
I'm trying to keep more current than that.  I would be happy to send along the 
full source privately if it would help figure this out, and plan to write up 
some more elaborate instructions (updated for Solr 5) for the next person who 
decides to write a PostFilter and work with block joins, if I ever manage to 
get this performing well enough.

Thanks for any pointers!  I'm totally open to doing this an entirely different
way.  I read that DocValues might be a more elegant approach, but currently
that would require reindexing, so I'm trying to avoid that.

Also, I've been wondering if the query above would read from the filter cache 
or not.  The query is constructed like this:


private Term inStockTrueTerm = new Term("sku_history.is_in_stock", "T");
private Term objectTypeSkuHistoryTerm = new Term("object_type", "sku_history");
...

inStockTrueTermQuery = new TermQuery(inStockTrueTerm);
objectTypeSkuHistoryTermQuery = new TermQuery(objectTypeSkuHistoryTerm);
inStockSkusQuery = new BooleanQuery();
inStockSkusQuery.add(inStockTrueTermQuery, BooleanClause.Occur.MUST);
inStockSkusQuery.add(objectTypeSkuHistoryTermQuery, BooleanClause.Occur.MUST);
--
Steve




Collection APIs to create collection and custom cores naming

2015-08-02 Thread davidphilip cherian
How do I use 'property.name=value' from the API example [1] to modify the
core.properties value of 'name'?

When creating the collection with the query below [2], the core names become
'aggregator_shard1_replica1' and 'aggregator_shard2_replica1'. I wanted to
have a specific/custom name for each of these cores. I tried passing the
params as property.name=name&name=aggregator_s1, but it did not work.

Editing the core.properties key-value pair to name=aggregator_s1 after the
collection is created works! But I was looking to set this property with the
create request itself.

[2]
http://example.com:8983/solr/admin/collections?action=CREATE&name=aggregator&numShards=1&replicationFactor=2&maxShardsPerNode=1&collection.configName=aggregator_config&property.name=name&name=aggregator_s1

[1]
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api1


Re: solr multicore vs sharding vs 1 big collection

2015-08-02 Thread Jay Potharaju
Shawn,
Thanks for the feedback. I agree that increasing the timeout might alleviate
the timeout issue. The main problem with increasing the timeout is the
detrimental effect it would have on the user experience, so I can't
increase it.
I have looked at the queries that threw errors, but when I try them again
everything seems to work fine. I am not sure how to reproduce the error.
My concern with increasing the memory to 32GB is what happens when the
index size grows over the next few months.
One of the other solutions I have been thinking about is to rebuild the
index (weekly), create a new collection, and switch over to it. Are there
any good references for doing that?
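
The approach I had in mind would look roughly like this (the collection
names are placeholders): build the new collection offline each week, then
repoint an alias that the application queries, e.g.

http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_20150802

Is that the recommended way, or is there something better?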
Thanks
Jay

On Sun, Aug 2, 2015 at 10:19 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 8/2/2015 8:29 AM, Jay Potharaju wrote:
  The documents contain around 30 fields and have stored set to true for
  almost 15 of them, and these stored fields are queried and updated all
  the time. You will notice that the deleted documents are almost 30% of
  the docs, and that percentage has stayed there and has not come down.
  I did try optimize but that was disruptive as it caused search errors.
  I have been playing with merge factor to see if that helps with deleted
  documents or not. It is currently set to 5.
 
  The server has 24 GB of memory out of which memory consumption is around
 23
  GB normally and the jvm is set to 6 GB. And have noticed that the
 available
  memory on the server goes to 100 MB at times during a day.
  All the updates are run through DIH.

 Using all available memory is completely normal operation for ANY
 operating system.  If you hold up Windows as an example of one that
 doesn't ... it lies to you about available memory.  All modern
 operating systems will utilize memory that is not explicitly allocated
 for the OS disk cache.

 The disk cache will instantly give up any of the memory it is using for
 programs that request it.  Linux doesn't try to hide the disk cache from
 you, but older versions of Windows do.  In the newer versions of Windows
 that have the Resource Monitor, you can go there to see the actual
 memory usage including the cache.

  At least once every day I see the following error, which results in search
  errors on the front end of the site.
 
  ERROR org.apache.solr.servlet.SolrDispatchFilter -
  null:org.eclipse.jetty.io.EofException
 
  From what I have read these are mainly due to timeouts; my timeout is set
  to 30 seconds and I can't set it to a higher number. I was thinking that
  high memory usage sometimes leads to bad performance/errors.

 Although this error can be caused by timeouts, it has a specific
 meaning.  It means that the client disconnected before Solr responded to
 the request, so when Solr tried to respond (through jetty), it found a
 closed TCP connection.

 Client timeouts need to either be completely removed, or set to a value
 much longer than any request will take.  Five minutes is a good starting
 value.

 If all of your client timeouts are set to 30 seconds and you are seeing
 EofExceptions, that means that your requests are taking longer than 30
 seconds, and you likely have some performance issues.  It's also
 possible that some of your client timeouts are set a lot shorter than 30
 seconds.

  My objective is to stop the errors. Adding more memory to the server is
  not a good scaling strategy, which is why I was thinking there may be an
  issue with the way things are set up that needs to be revisited.

 You're right that adding more memory to the servers is not a good
 scaling strategy for the general case ... but in this situation, I think
 it might be prudent.  For your index and heap sizes, I would want the
 company to pay for at least 32GB of RAM.

 Having said that ... I've seen Solr installs work well with a LOT less
 memory than the ideal.  I don't know that adding more memory is
 necessary, unless your system (CPU, storage, and memory speeds) is
 particularly slow.  Based on your document count and index size, your
 documents are quite small, so I think your memory size is probably good
 -- if the CPU, memory bus, and storage are very fast.  If one or more of
 those subsystems aren't fast, then make up the difference with lots of
 memory.

 Some light reading, where you will learn why I think 32GB is an ideal
 memory size for your system:

 https://wiki.apache.org/solr/SolrPerformanceProblems

 It is possible that your 6GB heap is not quite big enough for good
 performance, or that your GC is not well-tuned.  These topics are also
 discussed on that wiki page.  If you increase your heap size, then the
 likelihood of needing more memory in the system becomes greater, because
 there will be less memory available for the disk cache.

 Thanks,
 Shawn




-- 
Thanks
Jay Potharaju