Re: boosting words from specific list

2014-09-30 Thread Manuel Le Normand
I have not tried it but I would check the option of using the SynonymFilter
to duplicate certain query words. Another option - you can detect these words
at index time (e.g. in an UpdateProcessor) and give those documents a document
boost, if that fits your logic. Or even make a copyField that keeps only
whitelisted words and query both fields on each query - the original and the
copyField.
With debugQuery you'll be able to see the scores and adjust your boosts.
Small issue, many solutions. See what works for you.
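
For illustration, a rough sketch of the copyField option - the field, type and
file names (text, boosted_terms, whitelist.txt) are placeholders, not something
from an actual schema:

<fieldType name="whitelist_text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="whitelist.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
<field name="boosted_terms" type="whitelist_text" indexed="true" stored="false" multiValued="true"/>
<copyField source="text" dest="boosted_terms"/>

Querying both fields with edismax, e.g. qf=text boosted_terms^5, should then
boost documents containing the whitelisted words.
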
Manuel


Re: Searching and highlighting tens of fields

2014-07-31 Thread Manuel Le Normand
Right, it works!
I was not aware of this functionality, nor that it can be customized by the
hl.requireFieldMatch param.
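
For reference, the kind of request this enables (all field names here are just
placeholders):

q=some query&defType=edismax&qf=all_text&hl=true&hl.fl=title,body&hl.requireFieldMatch=false

i.e. the search runs against the catch-all copyField while
hl.requireFieldMatch=false lets the highlighter mark up the original stored
fields listed in hl.fl.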

Thanks


Searching and highlighting tens of fields

2014-07-30 Thread Manuel Le Normand
Hello,
I need to expose search and highlighting capabilities over a few tens of
fields. The edismax qf param makes this possible, but the query times for
searching tens of words over tens of fields are problematic.

I made a copyField (indexed, not stored) for these fields, which gives much
better search performance but does not allow highlighting the original
fields, which are stored.

Is there any way of searching this copyField and highlighting other fields
with any of the highlight components?

BTW, I need to keep the field structure so storing the copyField is not an
alternative.


Re: Searching and highlighting tens of fields

2014-07-30 Thread Manuel Le Normand
Currently I use the classic highlighter, but I can change my postings format
in order to work with another highlighting component if that leads to a solution.


Re: Searching and highlighting tens of fields

2014-07-30 Thread Manuel Le Normand
The slowdown occurs during search, not highlighting. A disjunctive query
with 50 terms running over 20 different fields' posting lists is a hard task -
harder than searching these 50 terms on a single (larger) posting list, as
in the copyField case.

With the edismax qf param, sure, hl.fl=* works as it should. In the
copyField case it does not, as it is a non-stored field. There are no
highlights on non-stored fields AFAIK.

Is there a way to search the global copyField but highlight the original
stored fields?


On Wed, Jul 30, 2014 at 5:54 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Doesn't hl.fl work in this case? Or is highlighting the 10 fields the
 slowdown?

 Best,
 Erick


 On Wed, Jul 30, 2014 at 2:55 AM, Manuel Le Normand 
 manuel.lenorm...@gmail.com wrote:

  Current I use the classic but I can change my posting format in order to
  work with another highlighting component if that leads to any solution
 



OCR - Saving multi-term position

2014-07-02 Thread Manuel Le Normand
Hello,
Many of our indexed documents are scanned and OCR'ed documents.
Unfortunately we were not able to improve the OCR quality much (less than
80% word accuracy) for various reasons, a fact which badly hurts the
retrieval quality.

As we use an open-source OCR, we are thinking of expanding every scanned term
output into its main possible variations to get a higher level of confidence.

Is there any analyser that supports this kind of need, or should I make up a
syntax and analyser of my own, e.g. the payload syntax?

The quick brown fox -- The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4
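
If the payload route is taken, a rough fieldType sketch for the delimiter part
(assuming a term|payload format where the payload carries a confidence value;
emitting several OCR variants at the same position would still need a custom
filter):

<fieldType name="text_ocr" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|" encoder="float"/>
  </analyzer>
</fieldType>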

Thanks,
Manuel


Re: OCR - Saving multi-term position

2014-07-02 Thread Manuel Le Normand
Thanks for your answers Erick and Michael.

The term confidence level is an OCR output metric that tells, for every
word, the odds that it is the actual scanned term. I want the OCR program to
output all the candidate words whose confidences sum up to above ~90%,
instead of outputting a single word as the default behaviour.

I'm happy to hear this approach has been used before; I will implement an
analyser that indexes these terms at the same position to enable positional
queries.
Hope it works out well. In case it does I will open a Jira ticket for it.

If anyone else has had experience with this use case I'd love to hear about it,

Manuel


On Wed, Jul 2, 2014 at 7:28 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Problem here is that you wind up with a zillion unique terms in your
 index, which may lead to performance issues, but you probably already
 know that :).

 I've seen situations where running it through a dictionary helps. That
 is, does each term in the OCR match some dictionary? Problem here is
 that it then de-values terms that don't happen to be in the
 dictionary, names for instance.

 But to answer your question: No, there really isn't a pre-built
 analysis chain that i know of that does this. Root issue is how to
 assign confidence? No clue for your specific domain.

 So payloads seem quite reasonable here. Happens there's a recent
 end-to-end example, see:
 http://searchhub.org/2014/06/13/end-to-end-payload-example-in-solr/

 Best,
 Erick

 On Wed, Jul 2, 2014 at 7:58 AM, Michael Della Bitta
 michael.della.bi...@appinions.com wrote:
  I don't have first hand knowledge of how you implement that, but I bet a
  look at the WordDelimiterFilter would help you understand how to emit
  multiple terms with the same positions pretty easily.
 
  I've heard of this bag of word variants approach to indexing
 poor-quality
  OCR output before for findability reasons and I heard it works out OK.
 
  Michael Della Bitta
 
 
 
  On Wed, Jul 2, 2014 at 10:19 AM, Manuel Le Normand 
  manuel.lenorm...@gmail.com wrote:
 
  Hello,
  Many of our indexed documents are scanned and OCR'ed documents.
  Unfortunately we were not able to improve much the OCR quality (less
 than
  80% word accuracy) for various reasons, a fact which badly hurts the
  retrieval quality.
 
  As we use an open-source OCR, we think of changing every scanned term
  output to it's main possible variations to get a higher level of
  confidence.
 
  Is there any analyser that supports this kind of need or should I make
 up a
  syntax and analyser of my own, i.e the payload syntax?
 
  The quick brown fox -- The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3
 fox|4
 
  Thanks,
  Manuel
 



Re: Compression vs FieldCache for doc ids retrieval

2014-05-30 Thread Manuel Le Normand
Is the issue SOLR-5478 what you were looking for?


Re: Application of different stemmers / stopword lists within a single field

2014-04-28 Thread Manuel Le Normand
Why wouldn't you take advantage of your use case - the chars belong to
different char classes.

You can index this text into a single Solr field (no copyField) and apply an
analysis chain that includes both languages' analysis - stopwords, stemmers
etc.
As every filter applies only to its specific language (e.g. an Arabic
stemmer should not stem a Latin word), you can run cross-language searches
on this single field.
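
A rough sketch of such a chain (the filter order and stopword file names are
illustrative only):

<fieldType name="text_mixed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" words="stopwords_ar.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>

The English stemmer should leave Arabic tokens untouched and vice versa, which
is what makes the single-field approach workable.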


On Mon, Apr 28, 2014 at 5:59 AM, Alexandre Rafalovitch
arafa...@gmail.comwrote:

 If you can throw money at the problem:
 http://www.basistech.com/text-analytics/rosette/language-identifier/ .
 Language Boundary Locator at the bottom of the page seems to be
 part/all of your solution.

 Otherwise, specifically for English and Arabic, you could play with
 Unicode ranges to try detecting text blocks:
 1) Create an UpdateRequestProcessor chain that
 a) clones text into field_EN and field_AR.
 b) applies regular expression transformations that strip English or
 Arabic unicode text range correspondingly, so field_EN only has
 English characters left, etc. Of course, you need to decide what you
 want to do with occasional EN or neutral characters happening in the
 middle of Arabic text (numbers: Arabic or Indic? brackets, dashes,
 etc). But if you just index text, it might be ok even if it is not
 perfect.
 c) deletes empty fields, just in case not all of them have mix language
 2) Use eDismax to search over both fields, each with its own processor.

 Regards,
Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr
 proficiency


 On Fri, Apr 25, 2014 at 5:34 PM, Timothy Hill timothy.d.h...@gmail.com
 wrote:
  This may not be a practically solvable problem, but the company I work
 for
  has a large number of lengthy mixed-language documents - for example,
  scholarly articles about Islam written in English but containing lengthy
  passages of Arabic. Ideally, we would like users to be able to search
 both
  the English and Arabic portions of the text, using the full complement of
  language-processing tools such as stemming and stopword removal.
 
  The problem, of course, is that these two languages co-occur in the same
  field. Is there any way to apply different processing to different words
 or
  paragraphs within a single field through language detection? Is this to
 all
  intents and purposes impossible within Solr? Or is another approach
 (using
  language detection to split the single large field into
  language-differentiated smaller fields, for example)
 possible/recommended?
 
  Thanks,
 
  Tim Hill



Indexing useful N-grams and adding payloads

2014-03-10 Thread Manuel Le Normand
Hi,
I have a performance and scoring problem for phrase queries

   1. Performance - phrase queries involving frequent terms are very slow
   due to the reading of large positions posting list.
   2. Scoring - I want to control the boost of phrase and entity (in
   gazetteers) matches

Indexing all terms as both bigrams and unigrams is out of the question in my
use case, so I plan to index only the useful bigrams. Part of this will be
achieved by the CommonGrams filter, in which I put the frequent words, but I am
thinking of going one step further and also indexing every phrase query I have
extracted from my query log and every entity from my gazetteers. To the latter
(which are N-grams) I will also add a payload to control the boost.

An example MappingCharFilter.txt would be:

#phrase-query
"term1 term2 term3" => "term1_term2_term3|1"
#entity
"firstName lastName" => "firstName_lastName|2"

One of the issues is that I have 100k-1M (depending on frequency)
phrases/entities as above. I saw that MappingCharFilter is implemented as
an FST; still, I'm concerned that iterating over the char buffer for long
documents might cause problems.

Has anyone faced a similar issue? Is this mapping implementation reasonable,
query-time performance wise?
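
For reference, a rough sketch of the analysis chain I have in mind (the mapping
and common-words file names are placeholders):

<fieldType name="text_phrases" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="MappingCharFilter.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="commongrams.txt" ignoreCase="true"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|" encoder="float"/>
  </analyzer>
</fieldType>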

Thanks in advance,
Manuel


Using payloads for expanded query terms

2014-02-18 Thread Manuel Le Normand
Hello,
I'm trying to handle a situation with taxonomy search - that is for each
taxonomy I have a list of words with their boosts. These taxonomies are
updated frequently so I retrieve these scored lists at query time from an
external service.

My expectation would be:
 q={!some_query_parser}Cities_France OR Cities_England, which should expand to
q=max(Paris^0.5, Lyon^0.4, La Defense^0.3) OR max(London^0.5, Oxford^4)

Implementations possibilities I thought about:

   1. An adapted synonym filter, where query term boosts are encoded as
   payloads.
   2. Query parser that handles the term expansion and weighting. The main
   drawback is the fact it forces me to stick to my own query parser.
   3. Building the query outside Solr.

What would you recommend?
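
As an illustration of option 3, the expansion could be done in the client and
sent as a plain boosted boolean query (the weights being whatever the external
service returns):

q=(Paris^0.5 OR Lyon^0.4 OR "La Defense"^0.3) OR (London^0.5 OR Oxford^0.4)

Note that a plain boolean query sums the clause scores rather than taking their
max, so this only approximates the expectation above.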

Thanks,
Manuel


Re: Solr 4.6.0: DocValues (distributed search)

2014-01-10 Thread Manuel Le Normand
In short, when running a distributed search every shard runs the query
separately. Each shard's collector returns the topN (rows param) internal
docId's of the matching documents.

These topN docIds are converted to their uniqueKey in the
BinaryResponseWriter and sent to the frontend core (the one that received
the query). This conversion is implemented by a StoredFieldVisitor, meaning
the uniqueKeys are read from their stored field and not from their
docValues.

As in our use case we have a high rows param, these conversions became a
performance bottleneck. We implemented a user cache that stores the shard's
uniqueKey docValues, i.e. a [docId, uniqueKey] mapping. This eliminates
the need to access the stored fields for these frequent conversions.

You can have a look at the patch. Feel free to comment:
https://issues.apache.org/jira/browse/SOLR-5478
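
For context, this implies docValues enabled on the uniqueKey - a rough schema
sketch (assuming a StrField-based id):

<field name="id" type="string" indexed="true" stored="true" docValues="true" required="true"/>

The cache then reads the uniqueKey from docValues instead of visiting the
stored fields for each of the topN docIds.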

Best,
Manuel


On Thu, Jan 9, 2014 at 7:33 PM, ku3ia dem...@gmail.com wrote:

 Today I setup a simple SolrCloud with tow shards. Seems the same. When I'm
 debugging a distributed search I can't catch a break-point at lucene codec
 file, but when I'm using faceted search everything looks fine - debugger
 stops.

 Can anyone help me with my question? Thanks.






Sudden Solr crash after commit

2013-12-12 Thread Manuel Le Normand
In the last few days one of my Tomcat servlets, running only a Solr instance,
crashed unexpectedly twice.

Low memory usage, nothing written in the Tomcat log, and the last thing
happening in the Solr log is 'end_commit_flush' followed by 'UnInverted
multi-valued field' for the fields faceted during the newSearcher run.
Right after this, Tomcat crashed leaving no trace.

Has anyone experienced a similar issue before?

Thanks,
Manu


Re: Updating shard range in Zookeeper

2013-12-12 Thread Manuel Le Normand
The ZooKeeper client for Eclipse is the tool you're looking for. You can edit
the clusterstate directly.
http://www.massedynamic.org/mediawiki/index.php?title=Eclipse_Plug-in_for_ZooKeeper

Another option is to use the bundled zkcli (distributed with Solr
4.5 and above) and upload a new clusterstate with the new shard range.
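
For reference, the upload could look roughly like this (the ZooKeeper host and
file path are placeholders; check the exact script options of your version):

sh zkcli.sh -zkhost zk1:2181 -cmd putfile /clusterstate.json /path/to/edited_clusterstate.json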

Good luck


Re: Sudden Solr crash after commit

2013-12-12 Thread Manuel Le Normand
Running Solr 4.3, sharded collection. Tomcat 7.0.39.
Faceting on multivalued fields works perfectly fine; I was describing this
log to emphasize the fact that the servlet failed right after a new searcher
was opened and the event listener finished running a warming faceting query.


Re: Bad fieldNorm when using morphologic synonyms

2013-12-09 Thread Manuel Le Normand
In order to set discountOverlaps to true you must have added the
<similarity class="solr.DefaultSimilarityFactory"/> element to the schema.xml,
which is commented out by default!

As this param defaults to false, the situation above is expected even with
correct positioning, as said.

In order to fix the field norms you'd have to reindex with the similarity
class configured to initialize the param to true.
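
For reference, the schema.xml entry would look something like this (setting the
param explicitly):

<similarity class="solr.DefaultSimilarityFactory">
  <bool name="discountOverlaps">true</bool>
</similarity>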

Cheers,
Manu


Re: Bad fieldNorm when using morphologic synonyms

2013-12-08 Thread Manuel Le Normand
Robert, your last reply is not accurate.
It's true that the field norms and termVectors are independent. But this
issue of higher norms is expected in this case even with well-assigned
positions. The length norm is computed from FieldInvertState.length, which is
the count of incrementToken() calls and not the number of positions! The same
holds for the WordDelimiterFilter or ReversedWildcardFilter, which do change
the norm when expanding a term.


Re: distributed search is significantly slower than direct search

2013-11-25 Thread Manuel Le Normand
https://issues.apache.org/jira/browse/SOLR-5478

There it goes

On Mon, Nov 18, 2013 at 5:44 PM, Manuel Le Normand 
manuel.lenorm...@gmail.com wrote:

 Sure, I am out of office till end of week. I reply after i upload the patch



Re: distributed search is significantly slower than direct search

2013-11-18 Thread Manuel Le Normand
Sure, I am out of office till the end of the week. I'll reply after I upload the patch.


Re: distributed search is significantly slower than direct search

2013-11-17 Thread Manuel Le Normand
In order to accelerate BinaryResponseWriter.write we extended this
writer class to implement the docId-to-id transformation via docValues (in
memory), with no need to access the stored fields for reading the id, nor lazy
loading of fields, which also has a cost. That should improve the read rate, as
docValues are sequential, and should avoid disk IO. This docValues
implementation is accessed during both query stages (as mentioned above) in
case you ask for ids only, or only once, during the distributed search stage,
in case you intend to ask for stored fields other than the id.

We just started testing it for performance. I would love to hear any
opinions or performance test results for this implementation.

Manu


Re: distributed search is significantly slower than direct search

2013-11-13 Thread Manuel Le Normand
It's surprising such a query takes a long time. I would assume that after
trying q=*:* consistently you should be getting cache hits and times should
be faster. Try to see in the admin UI how your query/document caches perform.
Moreover, the query in itself just asks for the first 5000 docs that were
indexed (returning the first [docid]), so it seems all this time is wasted on
transfer. Out of these 7 secs, how much is spent in the above method? What
do you return by default? How big is every doc you display in your results?
It might be a matter of both collections working on the same resources. Try
elaborating on your use case.

Anyway, it seems like you just made a test to see what the performance hit in
a distributed environment would be, so I'll try to explain some things we
encountered in our benchmarks, with a case that is at least similar in the
number of docs fetched.

We retrieve 2000 docs on every query, running over 40 shards. This means every
shard is actually transferring 2000 docs to our frontend on every
document-match request (the first stage you were referring to). Even if lazily
loaded, reading 2000 ids (on 40 servers) and lazy loading the fields is a
tough job. Waiting for the slowest shard to respond, then sorting the docs
and reloading (lazily or not) the top 2000 docs might take a long time.

Our times are 4-8 secs, but it's not really possible to compare cases. We've
taken a few steps that improved things along the way, steps that led to others.
These were our starting points:

   1. Profile these queries from different servers and Solr instances; try
   to put your finger on which collection is working hard and why. Check if
   you're stuck on components that have no added value for you but are
   used by default.
   2. Consider eliminating the doc cache. It loads lots of (partly) lazy
   documents whose probability of secondary usage is low. There's no such
   thing as popular docs when requesting so many docs. You may put that
   memory to better use.
   3. Bottleneck check - inner server metrics such as CPU user / iowait,
   packets transferred over the network, page faults etc. are excellent for
   understanding whether the disk/network/cpu is slowing you down. Then
   upgrade the hardware in one of the shards and check if it helps by
   comparing the upgraded shard's qTime to the others.
   4. Warm up the index after committing - try to benchmark how queries
   perform before and after some warm-up, say a few hundred queries (from
   your previous system), in order to warm up the OS cache (assuming you're
   using the NRTCachingDirectoryFactory). A warm-up listener sketch follows
   below.


Good luck,
Manu
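
Regarding point 4, a rough newSearcher warm-up sketch for solrconfig.xml (the
queries themselves are placeholders you'd take from your own logs):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">some frequent query</str><str name="rows">10</str></lst>
    <lst><str name="q">another frequent query</str><str name="rows">10</str></lst>
  </arr>
</listener>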


On Wed, Nov 13, 2013 at 2:38 PM, Erick Erickson erickerick...@gmail.comwrote:

 One thing you can try, and this is more diagnostic than a cure, is return
 just
 the id field (and insure that lazy field loading is true). That'll tell you
 whether
 the issue is actually fetching the document off disk and decompressing,
 although
 frankly that's unlikely since you can get your 5,000 rows from a single
 machine
 quickly.

 The code you found where Solr is spending its time, is that on the
 routing core
 or on the shards? I actually have a hard time understanding how that
 code could take a long time, doesn't seem right.

 You are transferring 5,000 docs across the network, so it's possible that
 your network is just slow, that's certainly a difference between the local
 and remote case, but that's a stab in the dark.

 Not much help I know,
 Erick



 On Wed, Nov 13, 2013 at 2:52 AM, Elran Dvir elr...@checkpoint.com wrote:

  Erick, Thanks for your response.
 
  We are upgrading our system using Solr.
  We need to preserve old functionality.  Our client displays 5K document
  and groups them.
 
  Is there a way to refactor code in order to improve distributed documents
  fetching?
 
  Thanks.
 
  -Original Message-
  From: Erick Erickson [mailto:erickerick...@gmail.com]
  Sent: Wednesday, October 30, 2013 3:17 AM
  To: solr-user@lucene.apache.org
  Subject: Re: distributed search is significantly slower than direct
 search
 
  You can't. There will inevitably be some overhead in the distributed
 case.
  That said, 7 seconds is quite long.
 
  5,000 rows is excessive, and probably where your issue is. You're having
  to go out and fetch the docs across the wire. Perhaps there is some
  batching that could be done there, I don't know whether this is one
  document per request or not.
 
  Why 5K docs?
 
  Best,
  Erick
 
 
  On Tue, Oct 29, 2013 at 2:54 AM, Elran Dvir elr...@checkpoint.com
 wrote:
 
   Hi all,
  
   I am using Solr 4.4 with multi cores. One core (called template) is my
   routing core.
  
   When I run
   http://127.0.0.1:8983/solr/template/select?rows=5000q=*:*shards=127.
   0.0.1:8983/solr/core1,
   it consistently takes about 7s.
   When I run http://127.0.0.1:8983/solr/core1/select?rows=5000q=*:*, it
   consistently takes about 40ms.
  
   I profiled the distributed query.
   This is the distributed query process (I hope the terms 

Basic query process question with fl=id

2013-10-24 Thread Manuel Le Normand
Hi

Any distributed lookup is basically composed of two stages: a first that
collects all the matching documents from every shard and a second that fetches
additional information about specific ids (e.g. stored fields, termVectors).

It can be seen in the logs of each shard (isShard=true), where the first
request logs the number of hits the specific shard got for the query and the
second contains the ids field (ids=...) for the additional fetch.
At the end of both I get the total QTime of the query and the total number of
hits.

My question is about the case where only ids are requested (fl=id). This query
should make only one request against a shard, while it actually makes both.

It looks like the response builder has to go through these two stages no
matter what kind of query it is.

My questions:
1. Is it normal that the response builder has to go through both stages?
2. Does the first request get internal Lucene docIds or the actual
uniqueKey id?
3. For a query as above (fl=id), where is the id read from? Is it fetched from
the stored fields file, or from the docValues file if it exists? Because if it
is fetched from the stored fields, a high rows param (say 1000 in my case)
would need 1000 lookups, which could badly hurt performance.

Thanks
Manuel


Re: Profiling Solr Lucene for query

2013-10-15 Thread Manuel Le Normand
I tried my last proposition: editing the clusterstate.json to add a dummy
frontend shard seems to work. I made sure the ranges were not overlapping.
Doesn't that resolve the SolrCloud issue specified above?


Re: Profiling Solr Lucene for query

2013-10-12 Thread Manuel Le Normand
Would adding a dummy shard instead of a dummy collection resolve the
situation? E.g. editing clusterstate.json from a ZooKeeper client and
adding a shard with a 0-range so no docs are routed to this core. This core
would be on a separate server and act as the collection gateway.


Re: Profiling Solr Lucene for query

2013-09-11 Thread Manuel Le Normand
Dmitry - currently we don't have such a frontend; creating one sounds like a
good idea. And yes, we do query all 36 shards on every query.

Mikhail - I do think 1 minute is enough data, as during this exact minute I
had a single query running (that took a qtime of 1 minute). I wanted to
isolate these hard queries. I repeated this profiling a few times.

I think I will take the term interval from 128 down to 32 and check the results.
I'm currently using the NRTCachingDirectoryFactory.




On Mon, Sep 9, 2013 at 11:29 PM, Dmitry Kan solrexp...@gmail.com wrote:

 Hi Manuel,

 The frontend solr instance is the one that does not have its own index and
 is doing merging of the results. Is this the case? If yes, are all 36
 shards always queried?

 Dmitry


 On Mon, Sep 9, 2013 at 10:11 PM, Manuel Le Normand 
 manuel.lenorm...@gmail.com wrote:

  Hi Dmitry,
 
  I have solr 4.3 and every query is distributed and merged back for
 ranking
  purpose.
 
  What do you mean by frontend solr?
 
 
  On Mon, Sep 9, 2013 at 2:12 PM, Dmitry Kan solrexp...@gmail.com wrote:
 
   are you querying your shards via a frontend solr? We have noticed, that
   querying becomes much faster if results merging can be avoided.
  
   Dmitry
  
  
   On Sun, Sep 8, 2013 at 6:56 PM, Manuel Le Normand 
   manuel.lenorm...@gmail.com wrote:
  
Hello all
Looking on the 10% slowest queries, I get very bad performances (~60
  sec
per query).
These queries have lots of conditions on my main field (more than a
hundred), including phrase queries and rows=1000. I do return only
 id's
though.
I can quite firmly say that this bad performance is due to slow
 storage
issue (that are beyond my control for now). Despite this I want to
   improve
my performances.
   
As tought in school, I started profiling these queries and the data
 of
  ~1
minute profile is located here:
http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg
   
Main observation: most of the time I do wait for readVInt, who's
   stacktrace
(2 out of 2 thread dumps) is:
   
     catalina-exec-3870 - Thread t@6615
      java.lang.Thread.State: RUNNABLE
      at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
      at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java:2357)
      at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
      at org.apache.lucene.index.TermContext.build(TermContext.java:95)
      at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
      at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
      at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
      at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
      at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
      at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
      at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
      at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
      at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675)
      at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
   
   
So I do actually wait for IO as expected, but I might be too many
 time
   page
faulting while looking for the TermBlocks (tim file), ie locating the
   term.
As I reindex now, would it be useful lowering down the termInterval
(default to 128)? As the FST (tip files) are that small (few 10-100
 MB)
   so
there are no memory contentions, could I lower down this param to 8
 for
example? The benefit from lowering down the term interval would be to
obligate the FST to get on memory (JVM - thanks to the
   NRTCachingDirectory)
as I do not control the term dictionary file (OS caching, loads an
   average
of 6% of it).
   
   
General configs:
solr 4.3
36 shards, each has few million docs
These 36 servers (each server has 2 replicas) are running virtual,
 16GB
memory each (4GB for JVM, 12GB remain for the OS caching),  consuming
   260GB
of disk mounted for the index files.
   
  
 



Re: Profiling Solr Lucene for query

2013-09-09 Thread Manuel Le Normand
Hi Dmitry,

I have solr 4.3 and every query is distributed and merged back for ranking
purpose.

What do you mean by frontend solr?


On Mon, Sep 9, 2013 at 2:12 PM, Dmitry Kan solrexp...@gmail.com wrote:

 are you querying your shards via a frontend solr? We have noticed, that
 querying becomes much faster if results merging can be avoided.

 Dmitry


 On Sun, Sep 8, 2013 at 6:56 PM, Manuel Le Normand 
 manuel.lenorm...@gmail.com wrote:

  Hello all
  Looking on the 10% slowest queries, I get very bad performances (~60 sec
  per query).
  These queries have lots of conditions on my main field (more than a
  hundred), including phrase queries and rows=1000. I do return only id's
  though.
  I can quite firmly say that this bad performance is due to slow storage
  issue (that are beyond my control for now). Despite this I want to
 improve
  my performances.
 
  As tought in school, I started profiling these queries and the data of ~1
  minute profile is located here:
  http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg
 
  Main observation: most of the time I do wait for readVInt, who's
 stacktrace
  (2 out of 2 thread dumps) is:
 
   catalina-exec-3870 - Thread t@6615
    java.lang.Thread.State: RUNNABLE
    at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
    at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java:2357)
    at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
    at org.apache.lucene.index.TermContext.build(TermContext.java:95)
    at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
    at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
 
 
  So I do actually wait for IO as expected, but I might be too many time
 page
  faulting while looking for the TermBlocks (tim file), ie locating the
 term.
  As I reindex now, would it be useful lowering down the termInterval
  (default to 128)? As the FST (tip files) are that small (few 10-100 MB)
 so
  there are no memory contentions, could I lower down this param to 8 for
  example? The benefit from lowering down the term interval would be to
  obligate the FST to get on memory (JVM - thanks to the
 NRTCachingDirectory)
  as I do not control the term dictionary file (OS caching, loads an
 average
  of 6% of it).
 
 
  General configs:
  solr 4.3
  36 shards, each has few million docs
  These 36 servers (each server has 2 replicas) are running virtual, 16GB
  memory each (4GB for JVM, 12GB remain for the OS caching),  consuming
 260GB
  of disk mounted for the index files.
 



Re: Expunge deleting using excessive transient disk space

2013-09-09 Thread Manuel Le Normand
I can only agree with the 50% free space recommendation. Unfortunately I do
not have that at the current time; I'm standing on 10% free disk (out of
300GB on each server). I'm aware it is very low.

Does it seem reasonable to adapt the current merge policy (or write a new one)
so that it frees the transient disk space after every merge instead of waiting
for all of the merges to finish? Where can I get such an answer (from the
people who wrote the code)?

Thanks


On Sun, Sep 8, 2013 at 9:30 PM, Erick Erickson erickerick...@gmail.comwrote:

 Right, but you should have at least as much free space as your total index
 size, and I don't see the total index size (but I'm just glancing).

 I'm not entirely sure you can precisely calculate the maximum free space
 you have relative to the amount needed for merging, some of the people who
 wrote that code can probably tell you more.

 I'd _really_ try to get more disk space. The amount of engineer time spent
 trying to tune this is way more expensive than a disk...

 Best,
 Erick


 On Sun, Sep 8, 2013 at 11:51 AM, Manuel Le Normand 
 manuel.lenorm...@gmail.com wrote:

Hi,
  In order to delete part of my index I run a delete by query that intends
 to
  erase 15% of the docs.
  I added this params to the solrconfig.xml
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
     <int name="maxMergeAtOnce">2</int>
     <int name="maxMergeAtOnceExplicit">2</int>
     <double name="maxMergedSegmentMB">5000.0</double>
     <double name="reclaimDeletesWeight">10.0</double>
     <double name="segmentsPerTier">15.0</double>
  </mergePolicy>
 
  The extra params were added in order to promote merge of old segments but
  with restriction on the transient disk that can be used (as I have only
  15GB per shard).
 
  This procedure failed on a no space left on device exception, although
  proper calculations show that these params should cause no usage excess
 of
  the transient free disk space I have.
   Looking on the infostream I can see that the first merges do succeed but
  older segments are kept in reference thus cannot be deleted until all the
  merging are done.
 
  Is there anyway of overcoming this?
 



Expunge deleting using excessive transient disk space

2013-09-08 Thread Manuel Le Normand
Hi,
In order to delete part of my index I run a delete-by-query that intends to
erase 15% of the docs.
I added these params to the solrconfig.xml:
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
   <int name="maxMergeAtOnce">2</int>
   <int name="maxMergeAtOnceExplicit">2</int>
   <double name="maxMergedSegmentMB">5000.0</double>
   <double name="reclaimDeletesWeight">10.0</double>
   <double name="segmentsPerTier">15.0</double>
</mergePolicy>

The extra params were added in order to promote merging of old segments, but
with a restriction on the transient disk space that can be used (as I have only
15GB per shard).

This procedure failed with a 'no space left on device' exception, although
proper calculations show that these params should not make usage exceed the
transient free disk space I have.
Looking at the infostream I can see that the first merges do succeed, but
older segments are kept referenced and thus cannot be deleted until all the
merges are done.

Is there any way of overcoming this?


Profiling Solr Lucene for query

2013-09-08 Thread Manuel Le Normand
Hello all,
Looking at the 10% slowest queries, I get very bad performance (~60 sec
per query).
These queries have lots of conditions on my main field (more than a
hundred), including phrase queries, and rows=1000. I do return only ids
though.
I can quite firmly say that this bad performance is due to a slow storage
issue (which is beyond my control for now). Despite this I want to improve
my performance.

As taught in school, I started profiling these queries; the data of a ~1
minute profile is located here:
http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg

Main observation: most of the time I wait in readVInt, whose stack trace
(2 out of 2 thread dumps) is:

catalina-exec-3870 - Thread t@6615
 java.lang.Thread.State: RUNNABLE
 at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
 at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java:2357)
 at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
 at org.apache.lucene.index.TermContext.build(TermContext.java:95)
 at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
 at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
 at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
 at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
 at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
 at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
 at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
 at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
 at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675)
 at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)


So I do actually wait for IO as expected, but I might be page faulting too
many times while looking for the term blocks (tim file), i.e. locating the
term. As I am reindexing now, would it be useful to lower the term interval
(default 128)? As the FSTs (tip files) are so small (a few 10-100 MB) that
there are no memory contentions, could I lower this param to 8, for example?
The benefit of lowering the term interval would be to force the FST into
memory (JVM - thanks to the NRTCachingDirectory), as I do not control the
term dictionary file (OS caching loads an average of 6% of it).


General configs:
Solr 4.3
36 shards, each with a few million docs.
These 36 servers (each server hosts 2 replicas) are running virtualized, with
16GB of memory each (4GB for the JVM, 12GB left for OS caching), consuming
260GB of disk mounted for the index files.


Wrong leader election leads to shard removal

2013-08-14 Thread Manuel Le Normand
Hello,
My solr cluster runs on RH Linux with tomcat7 servlet.
NumOfShards=40, replicationFactor=2, 40 servers each has 2 replicas. Solr
4.3

For experimental reasons I split my cluster into 2 sub-clusters, each
containing a single replica of each shard.
When connecting these sub-clusters back together the sync failed (more than 100
docs had been indexed per shard), so a replication process started on
sub-cluster #2. Due to the transient storage needed for the replication
process, I removed all the index from sub-cluster #2 before connecting it back,
then I connected sub-cluster #2's servers in 3-4 bulks to avoid high disk load.
The first bulk replications worked well, but after a while an internal
script pkilled all the Solr instances, some while replicating. After
restarting the servlet I discovered the disaster - on some of the
replicas that were in a replicating stage there was a wrong ZooKeeper
leader election - good-state replicas (sub-cluster 1) replicated from empty
replicas (sub-cluster 2), ending up removing all documents in these
shards!!

These are the logs from solr-prod32 (sub cluster #2 - bad state) - the
shard1_replica1 is elected to be leader although it was not before the
replication process (and shouldn't have the higher version number):

2013-08-13 13:39:15.838 [INFO ]
org.apache.solr.cloud.ShardLeaderElectionContext Enough replicas found to
continue.
2013-08-13 13:39:15.838 [INFO ]
org.apache.solr.cloud.ShardLeaderElectionContext I may be the new leader -
try and sync
2013-08-13 13:39:15.839 [INFO ] org.apache.solr.cloud.SyncStrategy Sync
replicas to http://solr-prod32:5050/solr/raw_shard1_replica1/
 2013-08-13 13:39:15.841 [INFO ]
org.apache.solr.client.solrj.impl.HttpClientUtil Creating new http client,
config:maxConnectionsPerHost=20maxConnections=1connTimeout=3socketTimeout=3retry=false
2013-08-13 13:39:15.844 [INFO ] org.apache.solr.update.PeerSync PeerSync:
core=raw_shard1_replica1 url=http://solr-prod32:8080/solr START replicas=[
http://solr-prod02:5080/solr/raw shard1_replica2/] nUpdates=100
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.update.PeerSync PeerSync:
core=raw_shard1_replica1 url=http://solr-prod32:8080/solr DONE. We have
no versions. sync failed.
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.cloud.SyncStrategy Leader's
attempt to sync with shard failed, moving to the next candidate
2013-08-13 13:39:15.847 [INFO ]
org.apache.solr.cloud.ShardLeaderElectionContext We failed sync, but we
have no versions - we can't sync in that case - we were active before, so
become leader anyway
2013-08-13 13:39:15.847 [INFO ]
org.apache.solr.cloud.ShardLeaderElectionContext I am the new leader:
http://solr-prod32:8080/solr/raw_shard1_replica1/
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.common.cloud.SolrZkClient
makePath: /collections/raw/leaders/shard1
2013-08-13 13:39:17.423 [INFO ] org.apache.solr.common.cloud.ZkStateReader
A cluster state change: WatchedEvent state:SyncConnected
type:NodeDataChanged path:/clusterstate.json, has occurred - updating...
(live nodes size: 40)

While in solr-prod02 (sub cluster #1 - good state) I get:
2013-08-13 13:39:15.671 [INFO ] org.apache.solr.cloud.ZkController
publishing core=raw_shard1_replica2 state=down
2013-08-13 13:39:15.671 [INFO ] org.apache.solr.cloud.ZkController
numShards not found on descriptor - reading it from system property
2013-08-13 13:39:15.673 [INFO ] org.apache.solr.core.CoreContainer
registering core: raw_shard1_replica2
2013-08-13 13:39:15.673 [INFO ] org.apache.solr.cloud.ZkController Register
replica - core:raw_shard1_replica2 address:
http://solr-prod02:8080/solr collection:raw shard:shard1
2013-08-13 13:39:17.423 [INFO ] org.apache.solr.common.cloud.ZkStateReader
A cluster state change: WatchedEvent state:SyncConnected
type:NodeDataChanged path:/clusterstate.json, has occurred - updating...
(live nodes size: 40)
2013-08-13 13:39:17.480 [INFO ] org.apache.solr.cloud.ZkController We are
http://solr-prod02:8080/solr/raw_shard1_replica2/ and leader is
http://solr-prod32:8080/solr/raw_shard1_replica1/
2013-08-13 13:39:17.481 [INFO ] org.apache.solr.cloud.ZkController No
LogReplay needed for core=raw_shard1_replica2
2013-08-13 13:39:17.481 [INFO ] org.apache.solr.cloud.ZkController Core
needs to recover:raw_shard1_replica2
2013-08-13 13:39:17.481 [INFO ] org.apache.solr.update.DefaultSolrCoreState
Running recovery - first canceling any ongoing recovery
2013-08-13 13:39:17.485 [INFO ] org.apache.solr.common.cloud.ZkStateReader
Updating cloud state from ZooKeeper...
2013-08-13 13:39:17.485 [INFO ] org.apache.solr.cloud.RecoveryStrategy
Starting recovery process. core=raw_shard1_replica2

Why was the leader elected wrongly??

Thanks


Re: Wrong leader election leads to shard removal

2013-08-14 Thread Manuel Le Normand
Does this sound like the scenario that happened?
By removing the index dir from replica 2 I also removed the tlog, from which
ZooKeeper extracts the version of the two replicas and decides which one
should be elected leader. As replica 2 had no tlog, ZooKeeper didn't have any
way to compare the 2 registered replicas, so it just picked one of the
replicas arbitrarily to lead, resulting in electing empty replicas.

How does ZooKeeper compare the 2 tlogs to know which one is more recent?
Does it not rely on the version number shown in the admin UI?


On Wed, Aug 14, 2013 at 11:00 AM, Manuel Le Normand 
manuel.lenorm...@gmail.com wrote:

 Hello,
 My solr cluster runs on RH Linux with tomcat7 servlet.
 NumOfShards=40, replicationFactor=2, 40 servers each has 2 replicas. Solr
 4.3

 For experimental reasons I splitted my cluster to 2 sub-clusters, each
 containing a single replica of each shard.
 When connecting back these sub-clusters the sync failed (more than 100
 docs indexed per shard) so a replication process started on sub-cluster #2.
 Due to transient storage limitations needed for the replication process, I
 removed all the index from sub-cluster #2 before connecting it back, then I
 connected sub-cluster #2's servers in 3-4 bulks to avoid high disk load.
 The first bulk replications worked well, but after a while an internal
 script pkilled all the solr instances, some while replicating. After
 starting back the servlet I discovered the disaster - on part of the
 replicas that were in a replicating stage there was a wrong zookeeper
 leader election - good state replicas (sub-cluster 1) replicated from empty
 replicas (sub-cluster 2) ending up in removing all documents in these
 shards!!

 These are the logs from solr-prod32 (sub cluster #2 - bad state) - the
 shard1_replica1 is elected to be leader although it was not before the
 replication process (and shouldn't have the higher version number):

 2013-08-13 13:39:15.838 [INFO ]
 org.apache.solr.cloud.ShardLeaderElectionContext Enough replicas found to
 continue.
 2013-08-13 13:39:15.838 [INFO ]
 org.apache.solr.cloud.ShardLeaderElectionContext I may be the new leader -
 try and sync
 2013-08-13 13:39:15.839 [INFO ] org.apache.solr.cloud.SyncStrategy Sync
 replicas to http://solr-prod32:5050/solr/raw shard1_replica1/
  2013-08-13 13:39:15.841 [INFO ]
 org.apache.solr.client.solrj.impl.HttpClientUtil Creating new http client,
 config:maxConnectionsPerHost=20maxConnections=1connTimeout=3socketTimeout=3retry=false
 2013-08-13 13:39:15.844 [INFO ] org.apache.solr.update.PeerSync PeerSync:
 core=raw_shard1_replica1 url=http://solr-prod32:8080/solr START replicas=[
 http://solr-prod02:5080/solr/raw shard1_replica2/] nUpdates=100
 2013-08-13 13:39:15.847 [INFO I org.apache.solr.update.PeerSync PeerSync:
 core=raw shard1_replica1 url=http://solr-prod32:8080/solr DONE. We have
 no versions. sync failed.
 2013-08-13 13:39:15.847 [INFO ] org.apache.solr.cloud.SyncStrategy
 Leader's attempt to sync with shard failed, moving to the next canidate
 2013-08-13 13:39:15.847 [INFO ]
 org.apache.solr.cloud.ShardLeaderElectionContext We failed sync, but we
 have no versions - we can't sync in that case - we were active before, so
 become leader anyway
 2013-08-13 13:39:15.847 [INFO ]
 org.apache.solr.cloud.ShardLeaderElectionContext I am the new leader:
 http://solr-prod32:8080/solr/raw_shard1_replica1/
 2013-08-13 13:39:15.847 [INFO ] org.apache.solr.common.cloud.SolrZkClient
 makePath: /collections/raw/leaders/shardl
 2013-08-13 13:39:17.423 [INFO ] org.apache.solr.common.cloud.ZkStateReader
 A cluster state change: WatchedEvent state:SyncConnected
 type:NodeDataChanged path:/clusterstate.json, has occurred - updating...
 (live nodes size: 40)

 While in solr-prod02 (sub cluster #1 - good state) I get:
 2013-08-13 13:39:15.671 [INFO ] org.apache.solr.cloud.ZkController
 publishing core=raw_shard1_replica2 state=down
 2013-08-13 13:39:15.671 [INFO ] org.apache.solr.cloud.ZkController
 numShards not found on descriptor - reading it from system property
 2013-08-13 13:39:15.673 [INFO ] org.apache.solr.core.CoreContainer
 registering core: raw_shard1_replica2
 2013-08-13 13:39:15.673 [INFO ] org.apache.solr.cloud.ZkController
 Register replica - core:raw_shard1_replica2 address:
 http://so1r-prod02:8080/solr collection:raw shard:shard1
 2013-08-13 13:39:17.423 [INFO ] org.apache.solr.common.cloud.ZkStateReader
 A cluster state change: WatchedEvent stare:SyncConnected
 type:NodeDataChanged path:/clusterstate.json, has occurred - updating...
 (live nodes size: 40)
 2013-08-13 13:39:17.480 [INFO ] org.apache.solr.cloud.ZkController We are
 httpL//solr-prod02:8080/solr/raw_shard1_replica2/ and leader is
 http://solr-prod32:8080/solr/raw_shard1_replica1/
 2013-08-13 13:39:17.481 [INFO ] org.apache.solr.cloud.ZkController No
 LogReplay needed for core=raw_shard1_replica2
 2013-08-13 13:39:17.481 [INFO ] org.apache.solr.cloud.ZkController Core
 needs

Merged segment warmer Solr 4.4

2013-07-29 Thread Manuel Le Normand
Hi,
I have a slow storage machine and not enough RAM to hold the whole index.
This causes the first queries (~5000) to be very slow (they read from disk and
my CPU is most of the time in iowait); after that, reads from the index become
very fast and come mainly from memory, as the OS cache has cached the most
used parts of the index.
My concern is about new segments that are committed to disk, either merged
segments or newly formed segments.
My first thought was to tune the Linux caching policy (to favor caching of
index files over uninverted files that are least frequently used) to encourage
the right OS caching without having to explicitly query the index for this to
happen.
Secondly I thought of initiating a newSearcher event listener that queries
docs that were inserted since the last hard commit.
A new ability of Solr 4.4 (SOLR-4761) is to configure a mergedSegmentWarmer -
how does this component work and is it good for my use case?

Are there any other ideas for dealing with this use case? What would you
propose as the most effective way to deal with it?
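
For reference, the solrconfig.xml entry I'm referring to looks roughly like
this (assuming the stock Lucene warmer is the one SOLR-4761 wires in):

<indexConfig>
  <mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer"/>
</indexConfig>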


Re: SolrEntityProcessor gets slower and slower

2013-07-22 Thread Manuel Le Normand
Mingfeng - this issue gets tougher as the number of shards rises; you can read
Erick Erickson's post:
http://grokbase.com/t/lucene/solr-user/131p75p833/how-distributed-queries-works.
If you have 100M docs I guess you are running into this issue.

The common way to deal with this issue is to filter on a value that returns
fewer results per query, such as a creation_date field, and to change this
field's range on every query. For your data-import use case you might want
to generate your data-config.xml with different entities, one per
creation_date range, so there is no need for deep paging (see the sketch below).
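
A rough sketch of what that data-config.xml could look like, reusing the
url/rows/fl attributes from your config and splitting on the date field
(the ranges themselves are placeholders):

<dataConfig>
  <document>
    <entity name="sep_2012" processor="SolrEntityProcessor"
            url="http://10.64.35.117:8995/solr/" rows="2000" fl="..."
            query="date:[2012-01-01T00:00:00Z TO 2012-12-31T23:59:59Z]"/>
    <entity name="sep_2013" processor="SolrEntityProcessor"
            url="http://10.64.35.117:8995/solr/" rows="2000" fl="..."
            query="date:[2013-01-01T00:00:00Z TO *]"/>
  </document>
</dataConfig>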

Another option is using
http://wiki.apache.org/solr/CommonQueryParameters#pageDoc_and_pageScore.
Implementing it in a multi-sharded environment is not possible to my
knowledge, as all your scores are 1.0 and thus results are ranked by shard
(according to the internal [docId] of each shard).

Caching all the query results in each shard (by raising the
queryResultWindowSize) should help, shouldn't it?


Best,

Manu


On Mon, Jun 10, 2013 at 8:56 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 SolrEntityProcessor is fine for small amounts of data but not useful for
 such a large index. The problem is that deep paging in search results is
 expensive. As the start value for a query increases so does the cost of
 the query. You are much better off just re-indexing the data.


 On Mon, Jun 10, 2013 at 11:19 PM, Mingfeng Yang mfy...@wisewindow.com
 wrote:

  I trying to migrate 100M documents from a solr index (v3.6) to a
 solrcloud
  index (v4.1, 4 shards) by using SolrEntityProcessor.  My data-config.xml
 is
  like
 
  dataConfig document entity name=sep
 processor=SolrEntityProcessor
  url=http://10.64.35.117:8995/solr/; query=*:* rows=2000 fl=
 
 
 author_class,authorlink,author_location_text,author_text,author,category,date,dimension,entity,id,language,md5_text,op_dimension,opinion_text,query_id,search_source,sentiment,source_domain_text,source_domain,text,textshingle,title,topic,topic_text,url
  / /document /dataConfig
 
  Initially, the data import rate is about 1K docs/second, but it
 eventually
  decrease to 20docs/second after running for tens of hours.
 
  Last time I tried data import with solorentityprocessor, the transfer
 rate
  can be as high as 3K docs/seconds.
 
  Anyone has any clues what can cause the slowdown?
 
  Thanks,
  Ming-
 



 --
 Regards,
 Shalin Shekhar Mangar.



Re: Regex in Stopword.xml

2013-07-22 Thread Manuel Le Normand
Use the pattern replace filter factory


<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
        replacement=""/>

This will do exactly what you asked for


http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternReplaceFilterFactory




On Mon, Jul 22, 2013 at 12:22 PM, Scatman alan.aron...@sfr.com wrote:

 Hi,

 I was looking for an issue, in order to put some regular expression in the
 StopWord.xml, but it seems that we can only have words in the file.
 I'm just wondering if there is a feature which will be done in this way or
 if someone got a tip it will help me a lot :)

 Best,
 Scatman.







Re: Solr caching clarifications

2013-07-15 Thread Manuel Le Normand
Great explanation and article.

Yes, this buffer for merges seems very small, and still optimized. That's
impressive.


Re: Solr caching clarifications

2013-07-14 Thread Manuel Le Normand
Alright, thanks Erick. For the question about the memory usage of merges,
taken from Mike McCandless' blog:

The big thing that stays in RAM is a logical int[] mapping old docIDs to
new docIDs, but in more recent versions of Lucene (4.x) we use a much more
efficient structure than a simple int[] ... see
https://issues.apache.org/jira/browse/LUCENE-2357

How much RAM is required is mostly a function of how many documents (lots
of tiny docs use more RAM than fewer huge docs).


A related clarification:
As my users are not aware of the fq option, I was wondering how to make the
best of this cache. Would it be efficient to implicitly transform parts of
their query into filter queries on fields that are boolean matches (date
ranges etc. that do not affect the score of a document)? Is this a good
practice? Is there a query parser plugin that does this?
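
For illustration, the kind of rewrite I mean (field names made up): a user
query like

q=ipod AND date:[NOW-1YEAR TO NOW] AND inStock:true

would implicitly become

q=ipod&fq=date:[NOW-1YEAR TO NOW]&fq=inStock:true

so the non-scoring clauses hit the filterCache instead of being scored every
time.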




 Inline

 On Thu, Jul 11, 2013 at 8:36 AM, Manuel Le Normand
 manuel.lenorm...@gmail.com wrote:
  Hello,
  As a result of frequent java OOM exceptions, I try to investigate more
into
  the solr jvm memory heap usage.
  Please correct me if I am mistaking, this is my understanding of usages
for
  the heap (per replica on a solr instance):
  1. Buffers for indexing - bounded by ramBufferSize
  2. Solr caches
  3. Segment merge
  4. Miscellaneous- buffers for Tlogs, servlet overhead etc.
 
  Particularly I'm concerned by Solr caches and segment merges.
  1. How much memory consuming (bytes per doc) are FilterCaches
(bitDocSet)
  and queryResultCaches (DocList)? I understand it is related to the skip
  spaces between doc id's that match (so it's not saved as a bitmap). But
  basically, is every id saved as a java int?

 Different beasts. filterCache consumes, essentially, maxDoc/8 bytes (you
 can get the maxDoc number from your Solr admin page). Plus some overhead
 for storing the fq text, but that's usually not much. This is for each
 entry up to Size.




 queryResultCache is usually trivial unless you've configured it
extravagantly.
 It's the query string length + queryResultWindowSize integers per entry
 (queryResultWindowSize is from solrconfig.xml).

  2. QueryResultMaxDocsCached - (for example = 100) means that any query
  resulting in more than 100 docs will not be cached (at all) in the
  queryResultCache? Or does it have to do with the documentCache?
 It's just a limit on the queryResultCache entry size as far as I can
 tell. But again
 this cache is relatively small, I'd be surprised if it used
 significant resources.

  3. DocumentCache - written on the wiki it should be greater than
  max_results*concurrent_queries. Max result is just the num of rows
  displayed (rows-start) param, right? Not the queryResultWindow.

 Yes. This a cache (I think) for the _contents_ of the documents you'll
 be returning to be manipulated by various components during the life
 of the query.

  4. LazyFieldLoading=true - when quering for id's only (fl=id) will this
  cache be used? (on the expense of eviction of docs that were already
loaded
  with stored fields)

 Not sure, but I don't think this will contribute much to memory pressure.
This
 is about now many fields are loaded to get a single value from a doc in
the
 results list, and since one is usually working with 20 or so docs this
 is usually
 a small amount of memory.

  5. How large is the heap used by mergings? Assuming we have a merge of
10
  segments of 500MB each (half inverted files - *.pos *.doc etc, half non
  inverted files - *.fdt, *.tvd), how much heap should be left unused for
  this merge?

 Again, I don't think this is much of a memory consumer, although I
 confess I don't
 know the internals. Merging is mostly about I/O.

 
  Thanks in advance,
  Manu

 But take a look at the admin page, you can see how much memory various
 caches are using by looking at the plugins/stats section.

 Best
 Erick


Solr caching clarifications

2013-07-11 Thread Manuel Le Normand
Hello,
As a result of frequent Java OOM exceptions, I am trying to investigate the
Solr JVM memory heap usage more closely.
Please correct me if I am mistaken; this is my understanding of the heap
usages (per replica on a Solr instance):
1. Buffers for indexing - bounded by ramBufferSize
2. Solr caches
3. Segment merge
4. Miscellaneous- buffers for Tlogs, servlet overhead etc.

Particularly I'm concerned by Solr caches and segment merges.
1. How much memory (in bytes per doc) do FilterCaches (bitDocSet) and
queryResultCaches (DocList) consume? I understand it is related to the skips
between matching doc ids (so it's not saved as a bitmap). But basically, is
every id saved as a Java int?
2. Does queryResultMaxDocsCached (for example = 100) mean that any query
resulting in more than 100 docs will not be cached (at all) in the
queryResultCache? Or does it have to do with the documentCache?
3. DocumentCache - the wiki says it should be greater than
max_results*concurrent_queries. Max results is just the number of rows
displayed (rows-start), right? Not the queryResultWindow.
4. LazyFieldLoading=true - when querying for ids only (fl=id), will this
cache be used? (at the expense of evicting docs that were already loaded
with stored fields)
5. How large is the heap used by merges? Assuming we have a merge of 10
segments of 500MB each (half inverted files - *.pos, *.doc etc., half non
inverted files - *.fdt, *.tvd), how much heap should be left unused for
this merge?
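
For reference, these are the solrconfig.xml entries I'm asking about - the
values below are only placeholders, not my real settings:

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512"/>
  <queryResultWindowSize>20</queryResultWindowSize>
  <queryResultMaxDocsCached>100</queryResultMaxDocsCached>
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
  <ramBufferSizeMB>100</ramBufferSizeMB>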

Thanks in advance,
Manu


Common practice for free text field

2013-06-25 Thread Manuel Le Normand
My schema contains about a hundred fields of various types (int,
strings, plain text, emails).
I was wondering what the common practice is for searching free text over
the index. Assuming there are no boosts related to field matching, these
are the options I see (a rough sketch of both follows the list):

   1. Index and query an all_fields copyField with source=*
      - advantage: only one query flow against a single index
      - disadvantage: the tokenizing is not necessarily adapted to this
        kind of field, and it requires more storage and memory
   2. Field aliasing (f.myalias.qf=realfield)
      - advantage: the opposite of the above
      - disadvantage: a single query term would query 100 different
        fields; a multi-term query might be a serious performance issue
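
As an illustration only (field and type names here are placeholders), option 1
would look roughly like this in schema.xml:

  <field name="all_fields" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <copyField source="*" dest="all_fields"/>

and option 2 would be a query-time alias on an (e)dismax handler, roughly:

  q=some text&defType=edismax&qf=all&f.all.qf=field1 field2 ... field100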

Any common practices?


Re: Common practice for free text field

2013-06-25 Thread Manuel Le Normand
By field aliasing I meant something like: f.all_fields.qf=*_txt+*_s+*_int
that would sum up to 100 fields


On Wed, Jun 26, 2013 at 12:00 AM, Manuel Le Normand 
manuel.lenorm...@gmail.com wrote:

 My schema contains about a hundred of fields of various types (int,
 strings, plain text, emails).
 I was concerned what is the common practice for searching free text over
 the index. Assuming there are not boosts related to field matching, these
 are the options I see:

1. Index and query a all_fields copyField source=*
1.  advantages - only one query flow against a single index.
   2. disadvantage - the tokenizing is not necessarily adapted to this
   kind of field, this requires more storage and memory
2. Field aliasing ( f.myalias.qf=realfield)
   1. advantages - opposite from the above
   2. disadvantages - a single query term would query 100 different
   fields. Multi term query might be a serious performance issue.

 Any common practices?




Parallel queries on a single core

2013-06-17 Thread Manuel Le Normand
Hello all,
Assuming I have a single shard with a single core, how do I run
multi-threaded queries on Solr 4.x?

Specifically, if one user sends a heavy query (a legitimate wildcard query
running for 10 sec), what happens to all the other users querying during this period?

If the response is that simultaneous queries (say 2) run multi-threaded, a
single CPU would switch between those two query threads, and in the case of 2
CPUs each CPU would run its own thread. But then repFactor > 1 gives no
performance advantage, as two replicas running with 1 CPU each is close to the
same as a single replica running with 2 CPUs. So I am a bit confused about this,

Thanks,

Manu


Avoiding OOM fatal crash

2013-06-17 Thread Manuel Le Normand
Hello again,

After a heavy query on my index (returning 100K docs in a single query) my
JVM heap floods and I get a Java OOM exception, after which the GC cannot
collect anything (GC overhead limit exceeded) as these memory chunks are not
disposable.

I want to be able to afford queries like this; my concern is that this case
provokes a total Solr crash, returning a 503 Internal Server Error while
trying to *index*.

Is there any way to separate these two logics? I'm fine with Solr not being
able to return any response after this OOM, but I don't see the
justification for the query flooding the JVM's internal (bounded) buffers
for index writes.

Thanks,
Manuel


Re: Avoiding OOM fatal crash

2013-06-17 Thread Manuel Le Normand
One of my users requested it; they are less aware of what's allowed, and I
don't want to block them a priori for specific long requests (there are other
params that might end up OOMing me anyway).

I thought of the timeAllowed restriction, but even this solution cannot
guarantee that the JVM heap would not get flooded within that delay (for
example when everything is already cached and my RAM IO is very fast).
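
Just to make it concrete, what I understand by fetching in batches is a series
of requests like the following (collection name and page size are placeholders):

  /solr/collection1/select?q=...&fl=id&start=0&rows=1000
  /solr/collection1/select?q=...&fl=id&start=1000&rows=1000
  ...

and the timeAllowed restriction is just a query param (e.g. &timeAllowed=5000,
in milliseconds), which as far as I understand only limits the search phase,
not the building of the response.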


On Mon, Jun 17, 2013 at 11:47 PM, Walter Underwood wun...@wunderwood.org wrote:

 Don't request 100K docs in a single query. Fetch them in smaller batches.

 wunder

 On Jun 17, 2013, at 1:44 PM, Manuel Le Normand wrote:

  Hello again,
 
  After a heavy query on my index (returning 100K docs in a single query)
 my
  JVM heap's floods and I get an JAVA OOM exception, and then that my
  GCcannot collect anything (GC
  overhead limit exceeded) as these memory chunks are not disposable.
 
  I want to afford queries like this, my concern is that this case
 provokes a
  total Solr crash, returning a 503 Internal Server Error while trying to *
  index.*
 
  Is there anyway to separate these two logics? I'm fine with solr not
 being
  able to return any response after returning this OOM, but I don't see the
  justification the query to flood JVM's internal (bounded) buffers for
  writings.
 
  Thanks,
  Manuel








Re: Parallel queries on a single core

2013-06-17 Thread Manuel Le Normand
Yes, that answers the first part of my question, thanks.

So N (equally heavy) queries against N CPUs would run simultaneously,
right?

Previous postings suggest that a high qps rate can be handled, performance-wise,
by a high replicationFactor. But what's the benefit (performance-wise)
compared to having a single replica served by many CPUs?




On Tue, Jun 18, 2013 at 12:14 AM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 If I understand your question correctly - what happens with Solr and N
 parallel queries is not much different from what happens with N
 processes running in the OS - they all get a slice of the CPU time to
 do their work.  Not sure if that answers your question...?

 Otis
 --
 Solr  ElasticSearch Support
 http://sematext.com/





 On Mon, Jun 17, 2013 at 4:32 PM, Manuel Le Normand
 manuel.lenorm...@gmail.com wrote:
  Hello all,
  Assuming I have a single shard with a single core, how do run
  multi-threaded queries on Solr 4.x?
 
  Specifically, if one user sends a heavy query (legitimate wildcard query
  for 10 sec), what happens to all other users quering during this period?
 
  If the repsonse is that simultaneous queries (say 2) run multi-threaded,
 a
  single CPU would switch between those two query-threads, and in case of 2
  CPU's each CPU would run his own thread. But the latter case does not
 give
  any advantage to repFactor  1 perfomance speaking, as it's close to same
  as a single replica running wth 1 CPU's. So I am bit confused about
 this,
 
  Thanks,
 
  Manu



Re: Avoiding OOM fatal crash

2013-06-17 Thread Manuel Le Normand
Unfortunately my organisation is too big to control or teach every employee
what the limits are, and the limits themselves vary (many facets - how many
are OK? asking for too many fields combined with too many rows, etc.)

Don't you think it is preferable to reserve the indexing buffer
(ramBufferSize) in the JVM heap for indexing only?
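
To clarify what I mean, the indexing side is already bounded in
solrconfig.xml, roughly like this (the value is just an example):

  <indexConfig>
    <ramBufferSizeMB>100</ramBufferSizeMB>
  </indexConfig>

so my question is whether that part of the heap can be kept safe from a huge
query response.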


On Tue, Jun 18, 2013 at 12:11 AM, Walter Underwood wun...@wunderwood.org wrote:

 Make them aware of what is required. Solr is not designed to return huge
 requests.

 If you need to do this, you will need to run the JVM with a big enough
 heap to build the request. You are getting OOM because the JVM does not
 have enough memory to build a response with 100K documents.

 wunder

 On Jun 17, 2013, at 1:57 PM, Manuel Le Normand wrote:

  One of my users requested it, they are less aware of what's allowed and I
  don't want apriori blocking them for long specific request (there are
 other
  params that might end up OOMing me).
 
  I thought of timeAllowed restriction, but also this solution cannot
  guarantee during this delay I would not get the JVM heap flooded (for
  example I already have all cashed and my RAM io's are very fast)
 
 
  On Mon, Jun 17, 2013 at 11:47 PM, Walter Underwood 
 wun...@wunderwood.orgwrote:
 
  Don't request 100K docs in a single query. Fetch them in smaller
 batches.
 
  wunder
 
  On Jun 17, 2013, at 1:44 PM, Manuel Le Normand wrote:
 
  Hello again,
 
  After a heavy query on my index (returning 100K docs in a single query)
  my
  JVM heap's floods and I get an JAVA OOM exception, and then that my
  GCcannot collect anything (GC
  overhead limit exceeded) as these memory chunks are not disposable.
 
  I want to afford queries like this, my concern is that this case
  provokes a
  total Solr crash, returning a 503 Internal Server Error while trying
 to *
  index.*
 
  Is there anyway to separate these two logics? I'm fine with solr not
  being
  able to return any response after returning this OOM, but I don't see
 the
  justification the query to flood JVM's internal (bounded) buffers for
  writings.
 
  Thanks,
  Manuel
 







Re: Exceptions on startup shutdown for solr 4.3 on Tomcat 7

2013-05-12 Thread Manuel Le Normand
Ok! I will eventually check whether it's an ACE issue and will upload the stack
trace in case something else is throwing these exceptions...

Thanks meanwhile


On Mon, May 13, 2013 at 12:11 AM, Shawn Heisey s...@elyograg.org wrote:

 On 5/12/2013 2:37 PM, Manuel Le Normand wrote:
  The upgrade from 4.2.1 to 4.3 on Tomcat 7 didn't go successfully, and I
 get
  many exceptions I didn't see in the earlier version. The services on
  different servers are up, I can access admin UI, create collections etc.
  but service startup and shutdown seem quite buggy. I tried reseting early
  configs but got back to the same situation. The given situation happens
  even on instances without any cores.
 
  On startup I get:
  org.apache.catalina.core.StandardWrapperValve invoke
  SEVERE: Servlet.service() for servlet [default] in context with path
  [/solr] threw exception
  java.lang.IllegalStateException: Cannot call sendError() after the
 response
  has been committed at
 
 org.apache.catalina.connector.ResponseFacade.sendError(ResponseFacade.java:451)
  at
 
 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)

 Things like this are not usually a problem in Solr, it's likely to be in
 tomcat settings.  It's always possible that it might be a problem in
 Solr, but it's not likely.

 The following question/answer page may provide some insight.  The
 settings that are mentioned in the answer on this page are likely just
 changing tomcat settings.

 http://forums.adobe.com/thread/1042921

 To get more specific answers here, we'll need more info from your logs -
 ideally an entire fresh log.

 The best thing to do would be to shut tomcat down, move or delete your
 existing log, then start it back up.  Once a new log is created that
 shows the problem, copy the entire file and make it available on the
 Internet.  If it's relatively small (100k or so), use a paste website
 (pastie.org or your favorite).  If it's pretty big, use a file sharing
 site like dropbox.

 If you need to sanitize your log to remove identifying info, do a
 consistent search/replace with a harmless string - don't delete entire
 lines, or it will be difficult to tell what's happening.

 Thanks,
 Shawn




Too many unique terms

2013-04-23 Thread Manuel Le Normand
Hi there,
Looking at one of my shards (about 1M docs) I see a lot of unique terms, more
than 8M, which is a significant part of my total term count. These are very
likely useless terms, binaries or other meaningless numbers that come with a
few of my docs.
I am totally fine with deleting them so that these terms become unsearchable.
Thinking about it I realize that
1. It is impossible to know a priori whether a term is unique or not, so I
cannot add them to my stop words.
2. I get a performance decrease because my cached chunks contain useless
data, and I'm short on memory.

Assuming a constant index, is there a way of deleting all the unique terms
from at least the dictionary (tim and tip) files? Would I get a significant
query-time performance increase? Does anybody know a class of regexes that
identifies meaningless terms, which I could add to my updateProcessor?
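
As a sketch only (the field type and the thresholds are arbitrary), a length
filter in the analysis chain would at least drop the very long binary-looking
tokens, though it won't catch short garbage nor know whether a term ends up
unique in the whole index:

  <fieldType name="text_general" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LengthFilterFactory" min="2" max="25"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>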

Thanks
Manu


Query specific replica

2013-04-23 Thread Manuel Le Normand
Hello,
Since I replicated my shards (I have 2 cores per shard now), I see a
noticeable degradation in qTime. I assume it happens because my memory now
has to be split between twice as many cores as before.

In my low-qps use-case, I use replication as a shard backup only (in
case one of my servers goes down) and not for the ability of serving
parallel requests. In this case I lose performance because both cores of the
shard are active.

I was wondering whether it is possible to query the same core on every request,
instead of load balancing between the different replicas, so that the second
replica would start serving requests only if the leader replica goes down.
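
The workaround I can think of is addressing the cores explicitly with the
shards parameter, something like (hosts and core names below are placeholders):

  /solr/collection1/select?q=...&shards=host1:8983/solr/collection1_shard1_replica1,host2:8983/solr/collection1_shard2_replica1

but that pins the query to those cores with no automatic failover, which is
exactly what I'd rather not handle on the client side.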

Cheers,
Manu


Re: solr-cloud performance decrease day by day

2013-04-19 Thread Manuel Le Normand
Can happen for various reasons.

Can you recreate the situation, meaning that restarting the servlet or server
would start with a good qTime that then degrades from that point? How fast
does this happen?

Start by monitoring the JVM process, with Oracle VisualVM for example.
Watch for frequent garbage collections, unreasonable memory peaks or a
growing number of threads.
Then monitor your system to see whether disk IO latency or disk usage
increases over time, the disk write queue explodes, the CPU load
becomes heavier or the network usage exceeds its limit.

If you can recreate the degradation and monitor it well, one of the above
parameters should pop up. Fixing it after defining the problem will be easier.
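
For the GC part, turning on GC logging in the JVM that runs Tomcat makes this
much easier to see, e.g. with the standard HotSpot flags (the log path is a
placeholder):

  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/solr_gc.log

You can then check whether full collections become more frequent as qTime degrades.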

Good day,
Manu
On Apr 19, 2013 10:26 AM, qibaoyuan qibaoy...@gmail.com wrote:


Updating clusterstate from the zookeeper

2013-04-18 Thread Manuel Le Normand
Hello,
After creating a distributed collection on several different servers I
sometimes have to deal with failing servers (cores appear not available =
grey) or failing cores (down / unable to recover = brown / red).
In case I wish to delete such an erroneous collection (through the Collections
API), only the green nodes get erased, leaving a meaningless unavailable
collection in clusterstate.json.

Is there any way to explicitly edit clusterstate.json? If not, how do I
update it so that a collection like the one above gets deleted?
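
If hand editing is the only way, I guess something like the zkcli script
shipped with Solr could do it - untested on my side, the zkhost and the paths
are placeholders, and I'm not even sure the overseer wouldn't just overwrite
the change:

  cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd getfile /clusterstate.json /tmp/clusterstate.json
  (edit /tmp/clusterstate.json and remove the dead collection)
  cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd clear /clusterstate.json
  cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd putfile /clusterstate.json /tmp/clusterstate.json

I'd rather know whether there is a supported way before touching it by hand.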

Cheers,
Manu


Re: What are the pros and cons Having More Replica at SolrCloud

2013-04-18 Thread Manuel Le Normand
On the query side, another downside I see is that for a given memory
pool you'd have to share it with more cores, because every replica uses
its own cache.
This is true for the internal Solr caches (JVM heap) and for the OS cache as well.
Adding a replicated core creates a new data set (index) that will be
accessed while querying.
If your replication adds a core of shard1 on a server that so far held only
shard2, the OS cache and the Solr caches have to share
the RAM between totally different memory regions (as the files and query
results of different shards are different), so that case is clear.
In the second case, if you add a replicated core to a server that already
contains shard1, I'm not sure. There might be benefits if the JVM handled its
caches per shard and not per core, but the OS cache would still treat
the different replicas of the same index as distinct and try to keep both
copies of the index files in memory.

Cheers,
Manu

So if you're short on memory, or your queries are alike (high cache hit
ratio), you may make better use of your RAM than by splitting it across many
replicas.


On Fri, Apr 19, 2013 at 3:08 AM, Timothy Potter thelabd...@gmail.com wrote:

 re: more replicas -

 pro: you can scale your query processing workload because you have more
 nodes available to service queries, eg 1,000 QPS sent to Solr with 5
 replicas, then each is only processing roughly 200 QPS. If you need to
 scale up to 10K QPS, then add more replicas to distribute the increased
 workload

 con: additional overhead (mostly network I/O) when indexing, shard leader
 has to send N additional requests per update where N is the number of
 replicas per shard. This seems minor unless you have many replicas per
 shard. I can't think of any cons of having more replicas on the query side

 As for your other question, when the leader receives an update request, it
 forwards to all replicas in the active or recovering state in parallel and
 waits for their response before responding to the client. All replicas must
 accept the update for it to be considered successful, i.e. all replicas and
 the leader must be in agreement on the status of a request. This is why you
 hear people referring to Solr as favoring consistency over
 write-availability. If you have 10 active replicas for a shard, then all 10
 must accept the update or it fails, there's no concept of tunable
 consistency on a write in Solr. Failed / offline replicas are obviously
 ignored and they will sync up with the leader once they are back online.

 Cheers,
 Tim


 On Thu, Apr 18, 2013 at 4:48 PM, Furkan KAMACI furkankam...@gmail.com
 wrote:

  What are the pros and cons Having More Replica at SolrCloud?
 
  Also there is a point that I want to learn. When a request come to a
  leader. Does it forwards it to a replica. And if forwards it to replica,
  does replica works parallel to build up the index with other replicas of
  its same leader?
 



Re: Slow qTime for distributed search

2013-04-11 Thread Manuel Le Normand
Hi,
We have different working hours, sorry for the reply delay. Your assumed
numbers are right, about 25-30 KB per doc, giving a total of 15 GB per shard;
there are two shards per server (+2 slaves that should normally do no work).
An average query has about 30 conditions (OR and AND mixed), most of them
textual, a small part on dateTime fields. They are simple queries only (no
facets, filters etc.), as the set is taken from the actual query log of my
enterprise, which works with an old search engine.

As we said, if the shards in collection1 and collection2 have the same
number of docs each (and the same RAM & CPU per shard), it is apparently not a
slow-IO issue, right? So the fact of not having the whole index cached doesn't
seem to be the bottleneck. Moreover, I do store the fields, but my query set
requests only the ids and rarely snippets, so I'd assume that the extra
RAM I'd give the OS wouldn't make any difference, as those *.fdt files don't
need to be cached.

The conclusion I come to is that the merging of shard responses is the problem,
and the only way of outsmarting it is to distribute across far fewer shards,
meaning I'd get back to a few million docs per shard, where qTime grows
roughly linearly with the number of docs per shard. The latter should improve
if I give much more RAM per server.

I'll try tweaking my schema a bit and making better use of the Solr caches
(filter queries for example), but something tells me the problem
might be elsewhere. My main clue is that merging seems like a simple CPU
task, yet tests show that even with a small number of responses it takes a
long time (and clearly the merging task on few docs should be very short).
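
By better use of the caches I mean moving the conditions that repeat across
queries out of the main query and into filter queries, roughly (field names
are placeholders):

  q=the 30 or so textual conditions&fq=date:[NOW-1YEAR TO NOW]&fq=type:email

so the repeated parts hit the filterCache instead of being scored again on
every request.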


On Wed, Apr 10, 2013 at 2:50 AM, Shawn Heisey s...@elyograg.org wrote:

 On 4/9/2013 3:50 PM, Furkan KAMACI wrote:

 Hi Shawn;

 You say that:

 *... your documents are about 50KB each.  That would translate to an index
 that's at least 25GB*

 I know we can not say an exact size but what is the approximately ratio of
 document size / index size according to your experiences?


 If you store the fields, that is actual size plus a small amount of
 overhead.  Starting with Solr 4.1, stored fields are compressed.  I believe
 that it uses LZ4 compression.  Some people store all fields, some people
 store only a few or one - an ID field.  The size of stored fields does have
 an impact on how much OS disk cache you need, but not as much as the other
 parts of an index.

 It's been my experience that termvectors take up almost as much space as
 stored data for the same fields, and sometimes more.  Starting with Solr
 4.2, termvectors are also compressed.

 Adding docValues (new in 4.2) to the schema will also make the index
 larger.  The requirements here are similar to stored fields.  I do not know
 whether this data gets compressed, but I don't think it does.

 As for the indexed data, this is where I am less clear about the storage
 ratios, but I think you can count on it needing almost as much space as the
 original data.  If the schema uses types or filters that produce a lot of
 information, the indexed data might be larger than the original input.
  Examples of data explosions in a schema: trie fields with a non-zero
 precisionStep, the edgengram filter, the shingle filter.

 Thanks,
 Shawn




Re: Slow qTime for distributed search

2013-04-09 Thread Manuel Le Normand
Thanks for replying.
My config:

   - 40 dedicated servers, dual-core each
   - Running a Tomcat servlet container on Linux
   - 12 GB RAM per server, split half and half between the OS and Solr
   - Complex queries (up to 30 conditions on different fields), 1 qps rate

Sharding my index was done for two reasons, based on tests with 2 servers (4
shards):

   1. As the index grew above a few million docs, qTime rose greatly, while
   sharding the index into smaller pieces (about 0.5M docs each) gave much
   better results, so I bound every shard to 0.5M docs.
   2. Tests showed I was CPU-bound during queries. As I have a low qps rate
   (to emphasize: a low qps rate, yet higher-than-expected qTime) and as a
   query runs single-threaded on each shard, it made sense to dedicate a CPU
   to each shard.

For the same number of docs per shard I do expect a rise in total qTime,
for these reasons:

   1. The response has to wait for the slowest shard
   2. Merging the responses from 40 different shards takes time

What I understand from your explanation is that it's the merging that takes
time, and as qTime ends only after the second retrieval phase, the qTime on
each shard gets longer. Meaning that during a significant proportion of the
first query phase (right after the [id,score] pairs are retrieved), all CPUs
are idle except the response-merger thread running on a single CPU. I thought
of the merge as a simple sort of [id,score] pairs, far simpler than an
additional 300 ms of CPU time.

Why would a RAM increase improve my performance, if the bottleneck is the
response merge (a CPU resource)?

Thanks in advance,
Manu


On Mon, Apr 8, 2013 at 10:19 PM, Shawn Heisey s...@elyograg.org wrote:

 On 4/8/2013 12:19 PM, Manuel Le Normand wrote:

 It seems that sharding my collection to many shards slowed down
 unreasonably, and I'm trying to investigate why.

 First, I created collection1 - 4 shards*replicationFactor=1 collection
 on
 2 servers. Second I created collection2 - 48 shards*replicationFactor=2
 collection on 24 servers, keeping same config and same num of documents
 per
 shard.


 The primary reason to use shards is for index size, when your index is so
 big that a single index cannot give you reasonable performance. There are
 also sometimes performance gains when you break a smaller index into
 shards, but there is a limit.

 Going from 2 shards to 3 shards will have more of an impact that going
 from 8 shards to 9 shards.  At some point, adding shards makes things
 slower, not faster, because of the extra work required for combining
 multiple queries into one result response.  There is no reasonable way to
 predict when that will happen.

  Observations showed the following:

 1. Total qTime for the same query set is 5 time higher in collection2
 (150ms-700 ms)
 2. Adding to colleciton2 the *shard.info=true* param in the query
 shows

 that each shard is much slower than each shard was in collection1
 (about 4
 times slower)
 3.  Querying only specific shards on collection2 (by adding the

 shards=shard1,shard2...shard12 param) gave me much better qTime per
 shard
 (only 2 times higher than in collection1)
 4. I have a low qps rate, thus i don't suspect the replication factor

 for being the major cause of this.
 5. The avg. cpu load on servers during querying was much higher in

 collection1 than in collection2 and i didn't catch any other
 bottlekneck.


 A distributed query actually consists of up to two queries per shard. The
 first query just requests the uniqueKey field, not the entire document.  If
 you are sorting the results, then the sort field(s) are also requested,
 otherwise the only additional information requested is the relevance score.
  The results are compiled into a set of unique keys, then a second query is
 sent to the proper shards requesting specific documents.


  Q:
 1. Why does the amount of shards affect the qTime of each shard?
 2. How can I overcome to reduce back the qTime of each shard?


 With more shards, it takes longer for the first phase to compile the
 results, so the second phase (document retrieval) gets delayed, and the
 QTime goes up.

 One way to reduce the total time is to reduce the number of shards.

 You haven't said anything about how complex your queries are, your index
 size(s), or how much RAM you have on each server and how it is allocated.
  Can you provide this information?

 Getting good performance out of Solr requires plenty of RAM in your OS
 disk cache.  Query times of 150 to 700 milliseconds seem very high, which
 could be due to query complexity or a lack of server resources (especially
 RAM), or possibly both.

 Thanks,
 Shawn




Re: Slow qTime for distributed search

2013-04-08 Thread Manuel Le Normand
After taking a look at what I wrote earlier, I will try to rephrase it more
clearly.

It seems that sharding my collection into many shards slowed it down
unreasonably, and I'm trying to investigate why.

First, I created collection1 - a 4 shards * replicationFactor=1 collection on
2 servers. Second, I created collection2 - a 48 shards * replicationFactor=2
collection on 24 servers, keeping the same config and the same number of
documents per shard.
Observations showed the following:

   1. Total qTime for the same query set is 5 times higher in collection2
   (150 ms - 700 ms)
   2. Adding the *shards.info=true* param to queries on collection2 shows
   that each shard is much slower than each shard was in collection1 (about 4
   times slower)
   3. Querying only specific shards of collection2 (by adding the
   shards=shard1,shard2...shard12 param) gave a much better qTime per shard
   (only 2 times higher than in collection1)
   4. I have a low qps rate, thus I don't suspect the replication factor
   of being the major cause of this.
   5. The avg. CPU load on the servers during querying was much higher in
   collection1 than in collection2, and I didn't catch any other bottleneck.

Q:
1. Why does the number of shards affect the qTime of each shard?
2. How can I overcome this and bring the qTime of each shard back down?

Thanks,
Manu


Slow qTime for distributed search

2013-04-07 Thread Manuel Le Normand
Hello
After performing a benchmark session at small scale I moved to full scale
on 16 quad-core servers.
Observations at small scale gave me excellent qTime (about 150 ms) with up
to 2 servers, showing my search thread was mainly CPU-bound. My query
set is not faceted.
Growing to full scale (with the same config & schema & number of docs per
shard) I sharded my collection into 48 shards and added a replica for each.
Since then I have had a major performance deterioration: my qTime went up to
700 msec. The servers have a much smaller load, and the network does not show
any difficulties. I understand that the response merging and waiting for the
slowest shard's response should increase my small-scale qTime, so I checked
with shards.info=true and observed that each shard was taking much longer,
while querying specific shards only (shards=shard1,shard2...shard12) gives
much better results for each shard's qTime and for the total qTime.
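
For reference, these are the two kinds of requests I used for the comparison
(collection and shard names are placeholders for my real ones):

  /solr/collection1/select?q=...&shards.info=true
  /solr/collection1/select?q=...&shards=shard1,shard2,...,shard12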

Keeping the same config, how come the number of shards affects the qTime of
each shard?

How can I overcome this issue?

Thanks,
Manu


Re: Is Solr more CPU bound or IO bound?

2013-03-17 Thread Manuel Le Normand
Your question is typically use-case dependent; the bottleneck will change
from user to user.

These are the two main issues that will affect the answer:
1. How do you index: what is your indexing rate (how many docs a day)? How
big is a typical document? How many documents do you plan on indexing in
total? Do you store fields? Do you calculate their term vectors?
2. What does your retrieval process look like: what query rate do you expect?
Are there common queries (taking advantage of the cache)? How complex are the
queries (faceted / highlighted / filtered / how many conditions, NRT)? Do you
plan to retrieve stored fields or only ids?

After answering all that, there's an iterative game between hardware
configuration and software configuration (how you split your shards, use
your caches, tune your merges and flushes etc.) that will also affect the
IO-bound vs CPU-bound answer.

In my use-case, for example, the indexing part is IO-bound, but as my
indexing rate is far below the rate my machines could initially provide, it
didn't affect my hardware spec.
After fine-tuning my configuration I discovered my retrieval process was
CPU-bound and directly affected my avg response time, while the IO rate,
given the cache usage, was quite low.

Try describing your use case in more detail along the above questions so
we can give you guidelines.

Best,
Manu


On Mon, Mar 18, 2013 at 3:55 AM, David Parks davidpark...@yahoo.com wrote:

 I'm spec'ing out some hardware for a first go at our production Solr
 instance, but I haven't spent enough time loadtesting it yet.



 What I want to ask if how IO intensive solr is vs. CPU intensive,
 typically.



 Specifically I'm considering whether to dual-purpose the Solr servers to
 run
 Solr and another CPU-only application we have. I know Solr uses a fair
 amount of CPU, but if it also is very disk intensive it might be a net
 benefit to have more instances running Solr and share the CPU resources
 with
 the other app than to run Solr separate from the other CPU app that
 wouldn't
 otherwise use the disk.



 Thoughts on this?



 Thanks,

 David






Re: Optimization storage issue

2013-03-02 Thread Manuel Le Normand
Hi Tim - thanks for the answer.
About your assumption: my documents are about 50 KB each in the index, but
after two weeks of updating without removing I have about 40% of
unused (deleted) docs in my index, and that has an impact on query performance.
1) My incentive for optimizing rather than just merging was to take advantage
of the engine's dead hours - local night hours with a low qps rate.
That way I would control the hours during which these operations occur, and
the merging and query threads wouldn't have to compete for the same
resources - correct me if I'm mistaken.

2) Using the expungeDeletes attribute might be an interesting option, as the
segments that contain deleted docs should be only a few, since they were
created in the same time range (a month earlier); but in my case I have a few
deleted docs even in new segments, for various reasons. If I use this
suggested commit and all my segments contain deleted docs, it would amount
to an optimize, wouldn't it? Is there an option to restrict expungeDeletes
to segments with more than N deleted docs, so I could avoid this
pseudo-optimize?
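
For the record, the commit you suggest can be sent either as an XML update
message or as request parameters, something like (the collection name is a
placeholder):

  <commit expungeDeletes="true"/>
or
  /solr/collection1/update?commit=true&expungeDeletes=true

What I'm missing is a knob to restrict it to segments with more than N
deleted docs.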

Manu


On Sat, Mar 2, 2013 at 8:54 PM, Timothy Potter thelabd...@gmail.com wrote:

 Hi Manuel,

 If you search optimize on this mailing list, you'll see that one of
 the common suggestions is to avoid optimizing and fine-tune segment
 merging instead. So to begin, take a look at your solrconfig.xml and
 find out what your merge policy and mergeFactor are set to (note: they
 may be commented out which implies segment merging is still enabled
 with the default settings). You can experiment with changing the
 mergeFactor.

 Based on your description of adding and removing a few thousand
 documents each day, I'm going to assume your documents are very large
 otherwise I can't see how you'd ever notice an impact on query
 performance. Is my assumption about the document size correct?

 One thing you can try is to use the expungeDeletes attribute set to
 true when you commit, i.e. <commit expungeDeletes="true"/>. This
 triggers Solr to merge any segments with deletes.

 Lastly, I'm not sure about your specific questions related to
 optimizations, but I think it's worth trying the suggestions above and
 avoid optimizations altogether. I'm pretty sure the answer to #1 is no
 and for #2 is it optimizes independently.

 Cheers,
 Tim


 On Sat, Mar 2, 2013 at 10:24 AM, Manuel Le Normand
 manuel.lenorm...@gmail.com wrote:
  My use-case is a casi-monthly changing index. Everyday i index few
  thousands of docs and erase a similar number of older documents, whilst
 few
  documents last in the index for ever (about 20 % of my index). After few
  experiments, i get that leaving the older documents in the index (mostly
 in
  the *.tim file) slows down significally my avg qTime and got to the
  conclusion i need to optimize the index once every few days to get ride
 of
  the older documents.
 
  Optimization requires about 2 times more the index storage. As i have
 many
  shards and one replica for each, and the optimization occurs
 simultaneously
  for all, i need twice the amount of storage of my initial index size,
 while
  half of it is used very unfrequently (optimization takes about an hour).
 
  1) Is there a possibility of using a storage pool for all shards, so
 every
  shard uses the spare storage in series, forcing the optimization to run
  unsimultaneously. In this case all the storage i'd use would be (total
  index storage + shard storage) instead of twice the total index storage.
 
  2) When i run optimization for a replicated core, does it copy from its
  leader or does it optimize independenly?
 
  Thanks,
  Manu



Re: Threads running while querrying

2013-02-20 Thread Manuel Le Normand
Yes, I made a single-threaded script which sends a query via a POST request to
the shard's URL, gets back the response and posts the next query.
Why would that matter?
Manuel

On Wednesday, February 20, 2013, Erick Erickson wrote:

 Silly question perhaps, but are you feeding queries  at Solr with a single
 thread? Because Solr uses multiple threads to search AFAIK.

 Best
 Erick


 On Wed, Feb 20, 2013 at 4:01 AM, Manuel Le Normand 
 manuel.lenorm...@gmail.com wrote:

  More to it, i do see 75 more threads under the process of tomcat6, but
 only
  a single one is working while querrying
 
  On Wednesday, February 20, 2013, Manuel Le Normand wrote:
 
   Hello,
   I created a single collection on a linux server with 8m docs. Solr 4.1
   While making performance tests, i see that my quad core server makes a
   full use of a single core while the 3 others are idle.
   Is there a possibility of making a single sharded collection available
  for
   multi-threaded querry?
   P.s: im not indexing while querrying