Re: boosting words from specific list

2014-09-30 Thread Manuel Le Normand
I have not tried it, but I would check the option of using the SynonymFilter to duplicate certain query words. Another option: you can detect these words at index time (e.g. in an UpdateProcessor) to give these documents a document boost in case it fits your logic. Or even make a copy field that contains a

Re: Searching and highlighting ten's of fields

2014-07-31 Thread Manuel Le Normand
Right, it works! I was not aware of this functionality and being able to customize it by hl.requireFieldMatch param. Thanks

Searching and highlighting ten's of fields

2014-07-30 Thread Manuel Le Normand
Hello, I need to expose the search and highlighting capabilities over a few tens of fields. The edismax qf param makes it possible, but the time performance for searching tens of words over tens of fields is problematic. I made a copyField (indexed, not stored) for these fields, which gives way

Re: Searching and highlighting ten's of fields

2014-07-30 Thread Manuel Le Normand
Currently I use the classic, but I can change my posting format in order to work with another highlighting component if that leads to any solution

Re: Searching and highlighting ten's of fields

2014-07-30 Thread Manuel Le Normand
hl.fl work in this case? Or is highlighting the 10 fields the slowdown? Best, Erick On Wed, Jul 30, 2014 at 2:55 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Currently I use the classic, but I can change my posting format in order to work with another highlighting component

OCR - Saving multi-term position

2014-07-02 Thread Manuel Le Normand
Hello, Many of our indexed documents are scanned and OCR'ed documents. Unfortunately we were not able to improve the OCR quality much (less than 80% word accuracy) for various reasons, a fact which badly hurts the retrieval quality. As we use an open-source OCR, we think of changing every scanned

Re: OCR - Saving multi-term position

2014-07-02 Thread Manuel Le Normand
New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts w: appinions.com http://www.appinions.com/ On Wed, Jul 2, 2014 at 10:19 AM, Manuel Le Normand

Re: Compression vs FieldCache for doc ids retrieval

2014-05-30 Thread Manuel Le Normand
Is the issue SOLR-5478 what you were looking for?

Re: Application of different stemmers / stopword lists within a single field

2014-04-28 Thread Manuel Le Normand
Why wouldn't you take advantage of your use case - the chars belong to different char classes. You can index this field to a single Solr field (no copyField) and apply an analysis chain that includes both languages' analysis - stopwords, stemmers etc. As every filter should apply to its specific

Indexing useful N-grams and adding payloads

2014-03-10 Thread Manuel Le Normand
Hi, I have a performance and scoring problem for phrase queries: 1. Performance - phrase queries involving frequent terms are very slow due to the reading of large position posting lists. 2. Scoring - I want to control the boost of phrase and entity (in gazetteers) matches. Indexing

Using payloads for expanded query terms

2014-02-18 Thread Manuel Le Normand
Hello, I'm trying to handle a situation with taxonomy search - that is for each taxonomy I have a list of words with their boosts. These taxonomies are updated frequently so I retrieve these scored lists at query time from an external service. My expectation would be:

Re: Solr 4.6.0: DocValues (distributed search)

2014-01-10 Thread Manuel Le Normand
In short, when running a distributed search every shard runs the query separately. Each shard's collector returns the topN (rows param) internal docId's of the matching documents. These topN docId's are converted to their uniqueKey in the BinaryResponseWriter and sent to the frontend core (the
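
The merge step described in this message can be sketched roughly as follows. This is a minimal illustration in Python, not Solr's actual implementation: it assumes each shard has already returned its own top-N hits as (score, uniqueKey) pairs sorted best-first, and the function name merge_top_n and the sample keys are hypothetical.

```python
import heapq
import itertools
from typing import List, Tuple

# One hit is (score, unique_key); each shard's list is sorted best-first.
ShardHits = List[Tuple[float, str]]


def merge_top_n(shard_results: List[ShardHits], rows: int) -> ShardHits:
    """Merge already-sorted per-shard top-N lists into a global top-N."""
    # heapq.merge expects ascending keys, so sort on the negated score.
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    # Only the first `rows` hits overall are kept by the frontend.
    return list(itertools.islice(merged, rows))


shard_a = [(9.1, "doc-17"), (4.0, "doc-3")]
shard_b = [(7.5, "doc-42"), (6.2, "doc-8")]
top = merge_top_n([shard_a, shard_b], rows=3)
```

Each shard has to return a full top-N because the frontend cannot know in advance which shard holds the best hits, which is why the per-shard docId to uniqueKey conversion is paid N times per shard.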

Sudden Solr crush after commit

2013-12-12 Thread Manuel Le Normand
In the last few days one of my Tomcat servlets, running only a Solr instance, crashed unexpectedly twice. Low memory usage, nothing written in the Tomcat log, and the last thing happening in the Solr log is 'end_commit_flush' followed by 'UnInverted multi-valued field' for the fields faceted during the

Re: Updating shard range in Zookeeper

2013-12-12 Thread Manuel Le Normand
Zookeeper client for Eclipse is the tool you're looking for. You can edit the clusterstate directly. http://www.massedynamic.org/mediawiki/index.php?title=Eclipse_Plug-in_for_ZooKeeper Another option is to use the delivered zkclient (distributed with Solr 4.5 and above) and upload a new

Re: Sudden Solr crush after commit

2013-12-12 Thread Manuel Le Normand
Running Solr 4.3, sharded collection, Tomcat 7.0.39. Faceting on multivalued fields works perfectly fine; I was describing this log to emphasize the fact that the servlet failed right after a new searcher was opened and the event listener finished running a warming faceting query.

Re: Bad fieldNorm when using morphologic synonyms

2013-12-09 Thread Manuel Le Normand
In order to set discountOverlaps to true you must have added the <similarity class="solr.DefaultSimilarityFactory"> to the schema.xml, which is commented out by default! Since this param is false by default, the above situation is expected with correct positioning, as said. In order to fix the field

Re: Bad fieldNorm when using morphologic synonyms

2013-12-08 Thread Manuel Le Normand
Robert, your last reply is not accurate. It's true that the field norms and termVectors are independent. But this issue of higher norms for this case is expected with well-assigned positions. The LengthNorm is assigned as FieldInvertState.length, which is the count of incrementToken calls and not the num of
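
To make the point about token counts concrete, here is a small Python sketch of how a length norm reacts to stacked synonyms. It assumes the classic 1/sqrt(numTerms) length norm with an optional discount for tokens whose position increment is 0; the function name length_norm is hypothetical, the index-time boost is left out, and the lossy single-byte encoding that Lucene applies to stored norms is not modeled.

```python
import math
from typing import Sequence


def length_norm(position_increments: Sequence[int], discount_overlaps: bool) -> float:
    """Length norm for one field, given the position increment of each emitted token."""
    if not position_increments:
        raise ValueError("field produced no tokens")
    # Every emitted token counts towards the field length.
    length = len(position_increments)
    # Tokens stacked on the previous position (e.g. injected synonyms) are overlaps.
    overlaps = sum(1 for inc in position_increments if inc == 0)
    num_terms = length - overlaps if discount_overlaps else length
    return 1.0 / math.sqrt(num_terms)


# Two original words, each with one synonym stacked at the same position.
increments = [1, 0, 1, 0]
norm_counting_overlaps = length_norm(increments, discount_overlaps=False)
norm_discounting_overlaps = length_norm(increments, discount_overlaps=True)
```

With overlaps counted, the field looks twice as long as the original text, so its norm differs from that of the same text indexed without synonyms even though positions are assigned correctly.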

Re: distributed search is significantly slower than direct search

2013-11-25 Thread Manuel Le Normand
https://issues.apache.org/jira/browse/SOLR-5478 There it goes On Mon, Nov 18, 2013 at 5:44 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Sure, I am out of office till the end of the week. I will reply after I upload the patch

Re: distributed search is significantly slower than direct search

2013-11-18 Thread Manuel Le Normand
Sure, I am out of office till the end of the week. I will reply after I upload the patch

Re: distributed search is significantly slower than direct search

2013-11-17 Thread Manuel Le Normand
In order to accelerate BinaryResponseWriter.write we extended this writer class to implement the docid to id transformation via docValues (in memory), with no need to access the stored field for id reading nor lazy loading of fields, which also has a cost. That should improve read rate as docValues are

Re: distributed search is significantly slower than direct search

2013-11-13 Thread Manuel Le Normand
It's surprising such a query takes a long time; I would assume that after consistently trying q=*:* you should be getting cache hits and times should be faster. Try checking in the admin UI how your query/doc caches perform. Moreover, the query in itself is just asking for the first 5000 docs that were

Basic query process question with fl=id

2013-10-24 Thread Manuel Le Normand
Hi, Any distributed lookup is basically composed of two stages: the first collects all the matching documents from every shard, and the second fetches additional information about specific ids (i.e. stored fields, termVectors). It can be seen in the logs of each shard (isShard=true), where first

Re: Profiling Solr Lucene for query

2013-10-15 Thread Manuel Le Normand
I tried my last proposition: editing the clusterstate.json to add a dummy frontend shard seems to work. I made sure the ranges were not overlapping. Doesn't it resolve the SolrCloud issue as specified above?

Re: Profiling Solr Lucene for query

2013-10-12 Thread Manuel Le Normand
Would adding a dummy shard instead of a dummy collection resolve the situation? - e.g. editing clusterstate.json from a Zookeeper client and adding a shard with a 0-range so no docs are routed to this core. This core would be on a separate server and act as the collection gateway.

Re: Profiling Solr Lucene for query

2013-09-11 Thread Manuel Le Normand
is the one that does not have its own index and is doing merging of the results. Is this the case? If yes, are all 36 shards always queried? Dmitry On Mon, Sep 9, 2013 at 10:11 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hi Dmitry, I have solr 4.3 and every query

Re: Profiling Solr Lucene for query

2013-09-09 Thread Manuel Le Normand
much faster if results merging can be avoided. Dmitry On Sun, Sep 8, 2013 at 6:56 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello all, Looking at the slowest 10% of queries, I get very bad performance (~60 sec per query). These queries have lots of conditions on my main

Re: Expunge deleting using excessive transient disk space

2013-09-09 Thread Manuel Le Normand
to get more disk space. The amount of engineer time spent trying to tune this is way more expensive than a disk... Best, Erick On Sun, Sep 8, 2013 at 11:51 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hi, In order to delete part of my index I run a delete by query

Expunge deleting using excessive transient disk space

2013-09-08 Thread Manuel Le Normand
Hi, In order to delete part of my index I run a delete by query that intends to erase 15% of the docs. I added these params to the solrconfig.xml: <mergePolicy class="org.apache.lucene.index.TieredMergePolicy"> <int name="maxMergeAtOnce">2</int> <int name="maxMergeAtOnceExplicit">2</int> <double

Profiling Solr Lucene for query

2013-09-08 Thread Manuel Le Normand
Hello all, Looking at the slowest 10% of queries, I get very bad performance (~60 sec per query). These queries have lots of conditions on my main field (more than a hundred), including phrase queries and rows=1000. I do return only ids though. I can quite firmly say that this bad performance is due

Wrong leader election leads to shard removal

2013-08-14 Thread Manuel Le Normand
Hello, My Solr cluster runs on RH Linux with a tomcat7 servlet. NumOfShards=40, replicationFactor=2, 40 servers, each with 2 replicas. Solr 4.3. For experimental reasons I split my cluster into 2 sub-clusters, each containing a single replica of each shard. When connecting back these sub-clusters the

Re: Wrong leader election leads to shard removal

2013-08-14 Thread Manuel Le Normand
, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello, My Solr cluster runs on RH Linux with a tomcat7 servlet. NumOfShards=40, replicationFactor=2, 40 servers, each with 2 replicas. Solr 4.3. For experimental reasons I split my cluster into 2 sub-clusters, each containing a single replica

Merged segment warmer Solr 4.4

2013-07-29 Thread Manuel Le Normand
Hi, I have a slow storage machine and insufficient RAM to hold the whole index. This causes the first queries (~5000) to be very slow (they are read from disk and my CPU is in iowait most of the time), and after that the readings from the index become very fast and read mainly

Re: SolrEntityProcessor gets slower and slower

2013-07-22 Thread Manuel Le Normand
Minfeng- This issue gets tougher as the number of shards you have rises; you can read Erick Erickson's post: http://grokbase.com/t/lucene/solr-user/131p75p833/how-distributed-queries-works. If you have 100M docs I guess you are running into this issue. The common way to deal with this issue is by

Re: Regex in Stopword.xml

2013-07-22 Thread Manuel Le Normand
Use the pattern replace filter factory: <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement=""/> This will do exactly what you asked for. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternReplaceFilterFactory On Mon, Jul 22, 2013 at 12:22 PM,
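
As a rough illustration of what this pattern does to a token, the same regular expression can be tried outside Solr. The Python sketch below only mimics the replacement step, assuming every match is replaced by the empty string; it is not the Solr filter itself, and the helper name strip_non_lowercase is hypothetical.

```python
import re

# Same character class as in the filter's pattern attribute.
NON_LOWERCASE = re.compile(r"([^a-z])")


def strip_non_lowercase(token: str) -> str:
    """Remove every character that is not a lowercase ASCII letter."""
    return NON_LOWERCASE.sub("", token)


cleaned = [strip_non_lowercase(t) for t in ["solr-4.3", "hello!", "2013"]]
```

Uppercase letters are removed as well, so a lowercasing step would need to come before this one in the analysis chain if they should be kept.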

Re: Solr caching clarifications

2013-07-15 Thread Manuel Le Normand
Great explanation and article. Yes, this buffer for merges seems very small, and still optimized. That's impressive.

Re: Solr caching clarifications

2013-07-14 Thread Manuel Le Normand
, 2013 at 8:36 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello, As a result of frequent Java OOM exceptions, I am trying to investigate the Solr JVM heap memory usage further. Please correct me if I am mistaken; this is my understanding of the usages for the heap (per replica on a solr

Solr caching clarifications

2013-07-11 Thread Manuel Le Normand
Hello, As a result of frequent Java OOM exceptions, I am trying to investigate the Solr JVM heap memory usage further. Please correct me if I am mistaken; this is my understanding of the usages for the heap (per replica on a Solr instance): 1. Buffers for indexing - bounded by ramBufferSize 2. Solr

Common practice for free text field

2013-06-25 Thread Manuel Le Normand
My schema contains about a hundred fields of various types (int, strings, plain text, emails). I was wondering what the common practice is for searching free text over the index. Assuming there are no boosts related to field matching, these are the options I see: 1. Index and query a

Re: Common practice for free text field

2013-06-25 Thread Manuel Le Normand
By field aliasing I meant something like: f.all_fields.qf=*_txt+*_s+*_int that would sum up to 100 fields On Wed, Jun 26, 2013 at 12:00 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: My schema contains about a hundred fields of various types (int, strings, plain text, emails). I

Parallel queries on a single core

2013-06-17 Thread Manuel Le Normand
Hello all, Assuming I have a single shard with a single core, how do I run multi-threaded queries on Solr 4.x? Specifically, if one user sends a heavy query (legitimate wildcard query for 10 sec), what happens to all other users querying during this period? If the response is that simultaneous

Avoiding OOM fatal crash

2013-06-17 Thread Manuel Le Normand
Hello again, After a heavy query on my index (returning 100K docs in a single query) my JVM heap floods and I get a Java OOM exception, and then my GC cannot collect anything (GC overhead limit exceeded) as these memory chunks are not disposable. I want to afford queries like this, my

Re: Avoiding OOM fatal crash

2013-06-17 Thread Manuel Le Normand
not get the JVM heap flooded (for example I already have everything cached and my RAM I/O is very fast) On Mon, Jun 17, 2013 at 11:47 PM, Walter Underwood wun...@wunderwood.org wrote: Don't request 100K docs in a single query. Fetch them in smaller batches. wunder On Jun 17, 2013, at 1:44 PM, Manuel Le
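
The batching advice quoted above could look roughly like this on the client side. This is a minimal Python sketch that only computes the start/rows paging parameters for each request; sending the requests is left out, the helper name batch_params is hypothetical, and the batch size of 1000 is an arbitrary choice.

```python
from typing import Dict, Iterator


def batch_params(total: int, batch_size: int) -> Iterator[Dict[str, int]]:
    """Yield start/rows pairs that cover `total` docs in chunks of `batch_size`."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        # The last batch may be smaller than batch_size.
        yield {"start": start, "rows": min(batch_size, total - start)}


# 100K docs fetched as 100 requests of 1000 docs each.
pages = list(batch_params(100_000, 1_000))
```

Each response then only needs to hold one batch in the heap at a time, at the price of more round trips.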

Re: Parallel queries on a single core

2013-06-17 Thread Manuel Le Normand
in the OS - they all get a slice of the CPU time to do their work. Not sure if that answers your question...? Otis -- Solr ElasticSearch Support http://sematext.com/ On Mon, Jun 17, 2013 at 4:32 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello all, Assuming I have a single

Re: Avoiding OOM fatal crash

2013-06-17 Thread Manuel Le Normand
OOM because the JVM does not have enough memory to build a response with 100K documents. wunder On Jun 17, 2013, at 1:57 PM, Manuel Le Normand wrote: One of my users requested it, they are less aware of what's allowed and I don't want to block them a priori for long specific request

Re: Exceptions on startup shutdown for solr 4.3 on Tomcat 7

2013-05-12 Thread Manuel Le Normand
Ok! Will check eventually if it's an ACE issue and will upload the stack trace in case something else is throwing these exceptions... Thanks meanwhile On Mon, May 13, 2013 at 12:11 AM, Shawn Heisey s...@elyograg.org wrote: On 5/12/2013 2:37 PM, Manuel Le Normand wrote: The upgrade from

Too many unique terms

2013-04-23 Thread Manuel Le Normand
Hi there, Looking at one of my shards (about 1M docs) I see a lot of unique terms, more than 8M, which is a significant part of my total term count. These are very likely useless terms, binaries or other meaningless numbers that come with a few of my docs. I am totally fine with deleting them so these

Query specific replica

2013-04-23 Thread Manuel Le Normand
Hello, Since I replicated my shards (I have 2 cores per shard now), I get a remarkable decrease in qTime. I assume it happens since my memory has to be split between twice as many cores as it used to. In my low-qps use case, I use replication as shard backup only (in case one of my servers goes

Re: solr-cloud performance decrease day by day

2013-04-19 Thread Manuel Le Normand
Can happen for various reasons. Can you recreate the situation, meaning restarting the servlet or server would start with good qTime and decrease from that point? How fast does this happen? Start by monitoring the JVM process, with Oracle VisualVM for example. Monitor for frequent garbage

Updating clusterstate from the zookeeper

2013-04-18 Thread Manuel Le Normand
Hello, After creating a distributed collection on several different servers I sometimes get to deal with failing servers (cores appear not available = grey) or failing cores (Down / unable to recover = brown / red). In case I wish to delete this erroneous collection (through the collection API) only

Re: What are the pros and cons Having More Replica at SolrCloud

2013-04-18 Thread Manuel Le Normand
On the query side, another downside I see would be that for a given memory pool, you'd have to share it with more cores because every replica uses its own cache. True for the inner Solr caching (JVM's heap) and OS caching as well. Adding a replicated core creates a new data set (index) that will

Re: Slow qTime for distributed search

2013-04-11 Thread Manuel Le Normand
Hi, We have different working hours, sorry for the reply delay. Your assumed numbers are right, about 25-30 KB per doc, giving a total of 15G per shard; there are two shards per server (+2 slaves that should do no work normally). An average query has about 30 conditions (OR and AND mixed), most of them

Re: Slow qTime for distributed search

2013-04-09 Thread Manuel Le Normand
a response-merge (CPU resource) bottleneck? Thanks in advance, Manu On Mon, Apr 8, 2013 at 10:19 PM, Shawn Heisey s...@elyograg.org wrote: On 4/8/2013 12:19 PM, Manuel Le Normand wrote: It seems that sharding my collection to many shards slowed down unreasonably, and I'm trying to investigate why

Re: Slow qTime for distributed search

2013-04-08 Thread Manuel Le Normand
After taking a look at what I wrote earlier, I will try to rephrase in a clear manner. It seems that sharding my collection into many shards slowed it down unreasonably, and I'm trying to investigate why. First, I created collection1 - 4 shards*replicationFactor=1 collection on 2 servers. Second I

Slow qTime for distributed search

2013-04-07 Thread Manuel Le Normand
Hello, After performing a benchmark session on a small scale I moved to full scale on 16 quad-core servers. Observations at small scale gave me excellent qTime (about 150 ms) with up to 2 servers, showing my searching thread was mainly CPU bound. My query set is not faceted. Growing to full scale

Re: Is Solr more CPU bound or IO bound?

2013-03-17 Thread Manuel Le Normand
Your question is typically use-case dependent; the bottleneck will change from user to user. These are two main issues that will affect the answer: 1. How do you index: what is your indexing rate (how many docs a day)? How big is a typical document? How many documents do you plan on indexing in

Re: Optimization storage issue

2013-03-02 Thread Manuel Le Normand
specific questions related to optimizations, but I think it's worth trying the suggestions above and avoid optimizations altogether. I'm pretty sure the answer to #1 is no and for #2 is it optimizes independently. Cheers, Tim On Sat, Mar 2, 2013 at 10:24 AM, Manuel Le Normand

Re: Threads running while querrying

2013-02-20 Thread Manuel Le Normand
with a single thread? Because Solr uses multiple threads to search AFAIK. Best Erick On Wed, Feb 20, 2013 at 4:01 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: More to it, I do see 75 more threads under the process of tomcat6, but only a single one is working while