from:"rishi"

Thanks Shawn.. yeah regular optimize might be the route we take, if this 
becomes a recurring issue.
 I remember in our old multicore deployment CPU used to spike and the core 
almost became non responsive. 

My guess with solr cloud architecture, any slack by leader while optimizing is 
picked up by the replica.
I was searching around for optimize behaviour of solr cloud and could not find 
much information.

Does anyone have experience running optimize for solr cloud in a loaded 
production env?

Thanks,
Rishi.

-Original Message-
From: Shawn Heisey 
To: solr-user 
Sent: Mon, May 4, 2015 9:11 am
Subject: Re: Solr Cloud reclaiming disk space from deleted documents

On 5/4/2015 4:55 AM, Rishi Easwaran wrote:
> Sadly with the size of our
complex, spiting and adding more HW is not a viable long term solution. 
>  I
guess the options we have are to run optimize regularly and/or become aggressive
in our merges proactively even before solr cloud gets into this situation.

If
you are regularly deleting most of your index, or reindexing large
parts of it,
which effectively does the same thing, then regular
optimizes may be required
to keep the index size down, although you must
remember that you need enough
room for the core to grow in order to
actually complete the optimize.  If the
core is 75-90 percent deleted
docs, then you will not need 2x the core size to
optimize it, because
the new index will be much smaller.

Currently,
SolrCloud will always optimize the entire collection when you
ask for an
optimize on any core, but it will NOT optimize all the
replicas (cores) at the
same time.  It will go through the cores that
make up the collection and
optimize each one one in sequence.  If your
index is sharded and replicated
enough, hopefully that will make it
possible for the optimize to complete even
though the amount of disk
space available may be low.

We have at least one
issue in Jira where users have asked for optimize
to honor distrib=false, which
would allow the user to be in complete
control of all optimizing, but so far
that hasn't been implemented.  The
volunteers that maintain Solr can only
accomplish so much in the limited
time they have
available.

Thanks,
Shawn

Re: Multiple index.timestamp directories using up disk space

Walter,

Unless I am missing something here.. I completely get that, when a few segment 
merges solr requires 2x space of segments to accomplish this.
Usually any index has multiple segments files so this fragmented 2x space 
consumption is not an issue, even as merged segments grow bigger. 

But what I am talking about is copy of a whole index as is into a new 
directory.  The new directory has no relation to the older index directory or 
its segments, so not sure what merges are going on across directories/indexes, 
and why solr needs the older index.

Thanks,
Rishi.

 

 

 

-Original Message-
From: Walter Underwood 
To: solr-user 
Sent: Mon, May 4, 2015 9:50 am
Subject: Re: Multiple index.timestamp directories using up disk space


One segment is in-use, being searched. That segment (and others) are merged into
a new segment. After the new segment is ready, searches are directed to the new
copy and the old copies are deleted.

That is how two copies are needed.

If
you cannot provide 2X the disk space, you will not have a stable Solr
installation. You should consider a different search engine.

“Optimizing”
(forced merges) will not help. It will probably cause failures more often
because it always merges the larges segment.

Walter
Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my
blog)


On May 4, 2015, at 3:53 AM, Rishi Easwaran 
wrote:

> Thanks for the responses Mark and Ramkumar.
> 
> The question I
had was, why does Solr need 2 copies at any given time, leading to 2x disk space
usage. 
> Not sure if this information is not published anywhere, and makes HW
estimation almost impossible for large scale deployment. Even if the copies are
temporary, this becomes really expensive, especially when using SSD in
production, when the complex size is over 400TB indexes, running 1000's of solr
cloud shards. 
> 
> If a solr follower has decided that it needs to do
replication from leader and capture full copy snapshot. Why can't it delete the
old information and replicate from scratch, not requiring more disk space.
> Is
the concern data loss (a case when both leader and follower lose data)?.
> 
>
Thanks,
> Rishi.   
> 
> 
> 
> 
> 
> 
> 
> -Original
Message-
> From: Mark Miller 
> To: solr-user

> Sent: Tue, Apr 28, 2015 10:52 am
> Subject:
Re: Multiple index.timestamp directories using up disk space
> 
> 
> If
copies of the index are not eventually cleaned up, I'd fill a JIRA
> to
>
address the issue. Those directories should be removed over time. At
> times
>
there will have to be a couple around at the same time and others may
> take
>
a while to clean up.
> 
> - Mark
> 
> On Tue, Apr 28, 2015 at 3:27 AM
Ramkumar
> R. Aiyengar <
> andyetitmo...@gmail.com> wrote:
> 
>> SolrCloud
does need up to
> twice the amount of disk space as your usual
>> index size
during replication.
> Amongst other things, this ensures you have
>> a full
copy of the index at any
> point. There's no way around this, I would
>>
suggest you provision the
> additional disk space needed.
>> On 20 Apr 2015
23:21, "Rishi Easwaran"
>  wrote:
>> 
>>> Hi
All,
>>> 
>>> We are seeing this
> problem with solr 4.6 and solr
4.10.3.
>>> For some reason, solr cloud tries to
> recover and creates a new
index
>>> directory - (ex:index.20150420181214550),
> while keeping the older
index
>> as
>>> is. This creates an issues where the
> disk space fills up
and the shard
>>> never ends up recovering.
>>> Usually
> this requires a
manual intervention of  bouncing the instance and
>>> wiping
> the disk clean
to allow for a clean recovery.
>>> 
>>> Any ideas on how to
> prevent solr
from creating multiple copies of index
>>> directory.
>>> 
>>> 
>
Thanks,
>>> Rishi.
>>> 
>> 
> 
>

Re: Multiple index.timestamp directories using up disk space

Thanks for the responses Mark and Ramkumar.

 The question I had was, why does Solr need 2 copies at any given time, leading 
to 2x disk space usage. 
 Not sure if this information is not published anywhere, and makes HW 
estimation almost impossible for large scale deployment. Even if the copies are 
temporary, this becomes really expensive, especially when using SSD in 
production, when the complex size is over 400TB indexes, running 1000's of solr 
cloud shards. 

 If a solr follower has decided that it needs to do replication from leader and 
capture full copy snapshot. Why can't it delete the old information and 
replicate from scratch, not requiring more disk space.
 Is the concern data loss (a case when both leader and follower lose data)?.

 Thanks,
 Rishi.   

-Original Message-
From: Mark Miller 
To: solr-user 
Sent: Tue, Apr 28, 2015 10:52 am
Subject: Re: Multiple index.timestamp directories using up disk space

If copies of the index are not eventually cleaned up, I'd fill a JIRA
to
address the issue. Those directories should be removed over time. At
times
there will have to be a couple around at the same time and others may
take
a while to clean up.

- Mark

On Tue, Apr 28, 2015 at 3:27 AM Ramkumar
R. Aiyengar <
andyetitmo...@gmail.com> wrote:

> SolrCloud does need up to
twice the amount of disk space as your usual
> index size during replication.
Amongst other things, this ensures you have
> a full copy of the index at any
point. There's no way around this, I would
> suggest you provision the
additional disk space needed.
> On 20 Apr 2015 23:21, "Rishi Easwaran"
 wrote:
>
> > Hi All,
> >
> > We are seeing this
problem with solr 4.6 and solr 4.10.3.
> > For some reason, solr cloud tries to
recover and creates a new index
> > directory - (ex:index.20150420181214550),
while keeping the older index
> as
> > is. This creates an issues where the
disk space fills up and the shard
> > never ends up recovering.
> > Usually
this requires a manual intervention of  bouncing the instance and
> > wiping
the disk clean to allow for a clean recovery.
> >
> > Any ideas on how to
prevent solr from creating multiple copies of index
> > directory.
> >
> >
Thanks,
> > Rishi.
> >
>

Re: Solr Cloud reclaiming disk space from deleted documents

Sadly with the size of our complex, spiting and adding more HW is not a viable 
long term solution. 
 I guess the options we have are to run optimize regularly and/or become 
aggressive in our merges proactively even before solr cloud gets into this 
situation.
 
 Thanks,
 Rishi.
 

 

 

-Original Message-
From: Gili Nachum 
To: solr-user 
Sent: Mon, Apr 27, 2015 4:18 pm
Subject: Re: Solr Cloud reclaiming disk space from deleted documents


To prevent it from re occurring you could monitor index size and once above
a
certain size threshold add another machine and split the shard between
existing
and new machine.
On Apr 20, 2015 9:10 PM, "Rishi Easwaran"
 wrote:

> So is there anything that can be done from
a tuning perspective, to
> recover a shard that is 75%-90% full, other that get
rid of the index and
> rebuild the data?
>  Also to prevent this issue from
re-occurring, looks like we need make our
> system aggressive with segment
merges using lower merge factor
>
>
> Thanks,
> Rishi.
>
>
>
>
-Original Message-
> From: Shawn Heisey 
> To:
solr-user 
> Sent: Mon, Apr 20, 2015 11:25 am
>
Subject: Re: Solr Cloud reclaiming disk space from deleted documents
>
>
> On
4/20/2015 8:44 AM, Rishi Easwaran wrote:
> > Yeah I noticed that. Looks like
>
optimize won't work since on some disks we are already pretty full.
> > Any
>
thoughts on increasing/decreasing 10  or
>
ConcurrentMergeScheduler to make solr do merges faster.
>
> You don't have to
do
> an optimize to need 2x disk space.  Even normal
> merging, if it happens
just
> right, can require the same disk space as a
> full optimize.  Normal
Solr
> operation requires that you have enough
> space for your index to reach
at least
> double size on occasion.
>
> Higher merge factors are better for
indexing speed,
> because merging
> happens less frequently.  Lower merge
factors are better for
> query
> speed, at least after the merging finishes,
because merging happens
> more
> frequently and there are fewer total segments
at any given moment.
>
> During a merge, there is so much I/O that query speed
is often
> negatively
> affected.
>
> Thanks,
> Shawn
>
>
>
>

Multiple index.timestamp directories using up disk space

2015-04-20 Thread Rishi Easwaran

Hi All,

We are seeing this problem with solr 4.6 and solr 4.10.3. 
For some reason, solr cloud tries to recover and creates a new index directory 
- (ex:index.20150420181214550), while keeping the older index as is. This 
creates an issues where the disk space fills up and the shard never ends up 
recovering.
Usually this requires a manual intervention of  bouncing the instance and 
wiping the disk clean to allow for a clean recovery. 

Any ideas on how to prevent solr from creating multiple copies of index 
directory.

Thanks,
Rishi.

Re: Solr Cloud reclaiming disk space from deleted documents

2015-04-20 Thread Rishi Easwaran

So is there anything that can be done from a tuning perspective, to recover a 
shard that is 75%-90% full, other that get rid of the index and rebuild the 
data?
 Also to prevent this issue from re-occurring, looks like we need make our 
system aggressive with segment merges using lower merge factor  

Thanks,
Rishi.

-Original Message-
From: Shawn Heisey 
To: solr-user 
Sent: Mon, Apr 20, 2015 11:25 am
Subject: Re: Solr Cloud reclaiming disk space from deleted documents

On 4/20/2015 8:44 AM, Rishi Easwaran wrote:
> Yeah I noticed that. Looks like
optimize won't work since on some disks we are already pretty full.
> Any
thoughts on increasing/decreasing 10  or
ConcurrentMergeScheduler to make solr do merges faster.

You don't have to do
an optimize to need 2x disk space.  Even normal
merging, if it happens just
right, can require the same disk space as a
full optimize.  Normal Solr
operation requires that you have enough
space for your index to reach at least
double size on occasion.

Higher merge factors are better for indexing speed,
because merging
happens less frequently.  Lower merge factors are better for
query
speed, at least after the merging finishes, because merging happens
more
frequently and there are fewer total segments at any given moment.

During a merge, there is so much I/O that query speed is often
negatively
affected.

Thanks,
Shawn

Re: Solr Cloud reclaiming disk space from deleted documents

2015-04-20 Thread Rishi Easwaran

Yeah I noticed that. Looks like optimize won't work since on some disks we are 
already pretty full.
Any thoughts on increasing/decreasing 10  or 
ConcurrentMergeScheduler to make solr do merges faster.   


 

 

 

-Original Message-
From: Gili Nachum 
To: solr-user 
Sent: Sun, Apr 19, 2015 12:34 pm
Subject: Re: Solr Cloud reclaiming disk space from deleted documents


I assume you don't have much free space available in your disk. Notice
that
during optimization (merge into a single segment) your shard replica
space
usage may peak to 2x-3x of it's normal size until optimization
completes.
Is it a problem? Not if optimization occurs over shards serially and
your
index is broken to many small shards.
On Apr 18, 2015 1:54 AM, "Rishi
Easwaran"  wrote:

> Thanks Shawn for the quick
reply.
> Our indexes are running on SSD, so 3 should be ok.
> Any
recommendation on bumping it up?
>
> I guess will have to run optimize for
entire solr cloud and see if we can
> reclaim space.
>
> Thanks,
>
Rishi.
>
>
>
>
>
>
>
>
> -Original Message-
> From: Shawn
Heisey 
> To: solr-user 
>
Sent: Fri, Apr 17, 2015 6:22 pm
> Subject: Re: Solr Cloud reclaiming disk space
from deleted documents
>
>
> On 4/17/2015 2:15 PM, Rishi Easwaran wrote:
> >
Running into an issue and wanted
> to see if anyone had some suggestions.
> >
We are seeing this with both solr 4.6
> and 4.10.3 code.
> > We are running an
extremely update heavy application, with
> millions of writes and deletes
happening to our indexes constantly.  An
> issue we
> are seeing is that solr
cloud reclaiming the disk space that can be used
> for new
> inserts, by
cleanup up deletes.
> >
> > We used to run optimize periodically with
> our
old multicore set up, not sure if that works for solr cloud.
> >
> > Num
>
Docs:28762340
> > Max Doc:48079586
> > Deleted Docs:19317246
> >
> >
Version
> 1429299216227
> > Gen 16525463
> > Size 109.92 GB
> >
> > In our
solrconfig.xml we
> use the following configs.
> >
> > 
> >

> >
> false
> >
>
1000
> >
>
2147483647
> >
>
1
> >
> >
>
10
> > 
class="org.apache.lucene.index.TieredMergePolicy"/>
> >
 class="org.apache.lucene.index.ConcurrentMergeScheduler">
>
>  name="maxThreadCount">3
> > 
name="maxMergeCount">15
> > 
> >
>
64
> >
> > 
>
> This
>
part of my response won't help the issue you wrote about, but it
> can
affect
> performance, so I'm going to mention it.  If your indexes are
>
stored on regular
> spinning disks, reduce mergeScheduler/maxThreadCount
> to
1.  If they are stored
> on SSD, then a value of 3 is OK.  Spinning
> disks
cannot do seeks (read/write
> head moves) fast enough to handle
> multiple
merging threads properly.  All the
> seek activity required will
> really slow
down merging, which is a very bad thing
> when your indexing
> load is high. 
SSD disks do not have to seek, so multiple
> threads are OK
> there.
>
> An
optimize is the only way to reclaim all of the disk
> space held by
> deleted
documents.  Over time, as segments are merged
> automatically,
> deleted doc
space will be automatically recovered, but it won't
> be
> perfect, especially
as segments are merged multiple times into very
> large
> segments.
>
> If
you send an optimize command to a core/collection in SolrCloud,
> the
> entire
collection will be optimized ... the cloud will do one
> shard
> replica
(core) at a time until the entire collection has been
> optimized.
> There is
no way (currently) to ask it to only optimize a
> single core, or to do
>
multiple cores simultaneously, even if they are on
> different
>
servers.
>
> Thanks,
> Shawn
>
>
>
>

Re: Solr Cloud reclaiming disk space from deleted documents

2015-04-17 Thread Rishi Easwaran

Thanks Shawn for the quick reply.
Our indexes are running on SSD, so 3 should be ok.
Any recommendation on bumping it up?

I guess will have to run optimize for entire solr cloud and see if we can 
reclaim space.

Thanks,
Rishi. 

-Original Message-
From: Shawn Heisey 
To: solr-user 
Sent: Fri, Apr 17, 2015 6:22 pm
Subject: Re: Solr Cloud reclaiming disk space from deleted documents

On 4/17/2015 2:15 PM, Rishi Easwaran wrote:
> Running into an issue and wanted
to see if anyone had some suggestions.
> We are seeing this with both solr 4.6
and 4.10.3 code.
> We are running an extremely update heavy application, with
millions of writes and deletes happening to our indexes constantly.  An issue we
are seeing is that solr cloud reclaiming the disk space that can be used for new
inserts, by cleanup up deletes. 
>
> We used to run optimize periodically with
our old multicore set up, not sure if that works for solr cloud.
>
> Num
Docs:28762340
> Max Doc:48079586
> Deleted Docs:19317246
>
> Version
1429299216227
> Gen 16525463
> Size 109.92 GB
>
> In our solrconfig.xml we
use the following configs.
>
> 
> 
>
false
>
1000
>
2147483647
>
1
>
>
10
> 
> 
> 3
> 15
> 
>
64
> 
> 

This
part of my response won't help the issue you wrote about, but it
can affect
performance, so I'm going to mention it.  If your indexes are
stored on regular
spinning disks, reduce mergeScheduler/maxThreadCount
to 1.  If they are stored
on SSD, then a value of 3 is OK.  Spinning
disks cannot do seeks (read/write
head moves) fast enough to handle
multiple merging threads properly.  All the
seek activity required will
really slow down merging, which is a very bad thing
when your indexing
load is high.  SSD disks do not have to seek, so multiple
threads are OK
there.

An optimize is the only way to reclaim all of the disk
space held by
deleted documents.  Over time, as segments are merged
automatically,
deleted doc space will be automatically recovered, but it won't
be
perfect, especially as segments are merged multiple times into very
large
segments.

If you send an optimize command to a core/collection in SolrCloud,
the
entire collection will be optimized ... the cloud will do one
shard
replica (core) at a time until the entire collection has been
optimized.
There is no way (currently) to ask it to only optimize a
single core, or to do
multiple cores simultaneously, even if they are on
different
servers.

Thanks,
Shawn

Solr Cloud reclaiming disk space from deleted documents

2015-04-17 Thread Rishi Easwaran

Hi All,

Running into an issue and wanted to see if anyone had some suggestions.
We are seeing this with both solr 4.6 and 4.10.3 code.
We are running an extremely update heavy application, with millions of writes 
and deletes happening to our indexes constantly.  An issue we are seeing is 
that solr cloud reclaiming the disk space that can be used for new inserts, by 
cleanup up deletes. 

We used to run optimize periodically with our old multicore set up, not sure if 
that works for solr cloud.

Num Docs:28762340
Max Doc:48079586
Deleted Docs:19317246

Version 1429299216227
Gen 16525463
Size 109.92 GB

In our solrconfig.xml we use the following configs.



false
1000
2147483647
1

10


3
15

64




Any suggestions on which which tunable to adjust, mergeFactor, mergeScheduler 
thread counts etc would be great.

Thanks,
Rishi.

Re: Basic Multilingual search capability

2015-02-26 Thread Rishi Easwaran

Hi Tom,

Thanks for your inputs. 
I was planning to use stopword filter, but will definitely make sure they are 
unique and not to step over each other.  I think for our system even going with 
length of 50-75 should be fine, will definitely up that number after doing some 
analysis on our input.
Just one clarification, when you say ICUFilterFactory am I correct in thinking 
its ICUFodingFilterFactory.

Thanks,
Rishi.

-Original Message-
From: Tom Burton-West 
To: solr-user 
Sent: Wed, Feb 25, 2015 4:33 pm
Subject: Re: Basic Multilingual search capability

Hi Rishi,

As others have indicated Multilingual search is very difficult to do well.

At HathiTrust we've been using the ICUTokenizer and ICUFilterFactory to
deal with having materials in 400 languages.  We also added the
CJKBigramFilter to get better precision on CJK queries.  We don't use stop
words because stop words in one language are content words in another.  For
example "die" in German is a stopword but it is a content word in English.

Putting multiple languages in one index can affect word frequency
statistics which make relevance ranking less accurate.  So for example for
the English query "Die Hard" the word "die" would get a low idf score
because it occurs so frequently in German.  We realize that our  approach
does not produce the best results, but given the 400 languages, and limited
resources, we do our best to make search "not suck" for non-English
languages.   When we have the resources we are thinking about doing special
processing for a small fraction of the top 20 languages.  We plan to select
those languages  that most need special processing and relatively easy to
disambiguate from other languages.

If you plan on identifying languages (rather than scripts), you should be
aware that most language detection libraries don't work well on short texts
such as queries.

If you know that you have scripts for which you have content in only one
language, you can use script detection instead of language detection.

If you have German, a filter length of 25 might be too low (Because of
compounding). You might want to analyze a sample of your German text to
find a good length.

Tom

http://www.hathitrust.org/blogs/Large-scale-Search

On Wed, Feb 25, 2015 at 10:31 AM, Rishi Easwaran 
wrote:

> Hi Alex,
>
> Thanks for the suggestions. These steps will definitely help out with our
> use case.
> Thanks for the idea about the lengthFilter to protect our system.
>
> Thanks,
> Rishi.
>
>
>
>
>
>
>
> -Original Message-
> From: Alexandre Rafalovitch 
> To: solr-user 
> Sent: Tue, Feb 24, 2015 8:50 am
> Subject: Re: Basic Multilingual search capability
>
>
> Given the limited needs, I would probably do something like this:
>
> 1) Put a language identifier in the UpdateRequestProcessor chain
> during indexing and route out at least known problematic languages,
> such as Chinese, Japanese, Arabic into individual fields
> 2) Put everything else together into one field with ICUTokenizer,
> maybe also ICUFoldingFilter
> 3) At the very end of that joint filter, stick in LengthFilter with
> some high number, e.g. 25 characters max. This will ensure that
> super-long words from non-space languages and edge conditions do not
> break the rest of your system.
>
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 23 February 2015 at 23:14, Walter Underwood 
> wrote:
> >> I understand relevancy, stemming etc becomes extremely complicated with
> multilingual support, but our first goal is to be able to tokenize and
> provide
> basic search capability for any language. Ex: When the document contains
> hello
> or здравствуйте, the analyzer creates tokens and provides exact match
> search
> results.
>
>
>

Re: Basic Multilingual search capability

2015-02-25 Thread Rishi Easwaran

Hi Alex,

Thanks for the suggestions. These steps will definitely help out with our use 
case.
Thanks for the idea about the lengthFilter to protect our system.

Thanks,
Rishi.

 

 

 

-Original Message-
From: Alexandre Rafalovitch 
To: solr-user 
Sent: Tue, Feb 24, 2015 8:50 am
Subject: Re: Basic Multilingual search capability


Given the limited needs, I would probably do something like this:

1) Put a language identifier in the UpdateRequestProcessor chain
during indexing and route out at least known problematic languages,
such as Chinese, Japanese, Arabic into individual fields
2) Put everything else together into one field with ICUTokenizer,
maybe also ICUFoldingFilter
3) At the very end of that joint filter, stick in LengthFilter with
some high number, e.g. 25 characters max. This will ensure that
super-long words from non-space languages and edge conditions do not
break the rest of your system.


Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 23 February 2015 at 23:14, Walter Underwood  wrote:
>> I understand relevancy, stemming etc becomes extremely complicated with 
multilingual support, but our first goal is to be able to tokenize and provide 
basic search capability for any language. Ex: When the document contains hello 
or здравствуйте, the analyzer creates tokens and provides exact match search 
results.

Re: Basic Multilingual search capability

2015-02-25 Thread Rishi Easwaran

Hi Trey,

Thanks for the detailed response and the link to the talk, it was very 
informative.
Yes looking at the current system requirements ICUTokenizer might be the best 
bet for our use case.
MultiTextField mentioned in the jira SOLR-6492 has some cool features and 
definitely looking forward to trying out once its integrated to main.

Thanks,
Rishi.

-Original Message-
From: Trey Grainger 
To: solr-user 
Sent: Tue, Feb 24, 2015 1:40 am
Subject: Re: Basic Multilingual search capability

Hi Rishi,

I don't generally recommend a language-insensitive approach except for
really simple multilingual use cases (for most of the reasons Walter
mentioned), but the ICUTokenizer is probably the best bet you're going to
have if you really want to go that route and only need exact-match on the
tokens that are parsed. It won't work that well for all languages (CJK
languages, for example), but it will work fine for many.

It is also possible to handle multi-lingual content in a more intelligent
(i.e. per-language configuration) way in your search index, of course.
There are three primary strategies (i.e. ways that actually work in the
real world) to do this:
1) create a separate field for each language and search across all of them
at query time
2) create a separate core per language-combination and search across all of
them at query time
3) invoke multiple language-specific analyzers within a single field's
analyzer and index/query using one or more of those language's analyzers
for each document/query.

These are listed in ascending order of complexity, and each can be valid
based upon your use case. For at least the first and third cases, you can
use index-time language detection to map to the appropriate
fields/analyzers if you are otherwise unaware of the languages of the
content from your application layer. The third option requires custom code
(included in the large Multilingual Search chapter of Solr in Action
<http://solrinaction.com> and soon to be contributed back to Solr via
SOLR-6492 <https://issues.apache.org/jira/browse/SOLR-6492>), but it
enables you to index an arbitrarily large number of languages into the same
field if needed, while preserving language-specific analysis for each
language.

I presented in detail on the above strategies at Lucene/Solr Revolution
last November, so you may consider checking out the presentation and/or
slides to asses if one of these strategies will work for your use case:
http://www.treygrainger.com/posts/presentations/semantic-multilingual-strategies-in-lucenesolr/

For the record, I'd highly recommend going with the first strategy (a
separate field per language) if you can, as it is certainly the simplest of
the approaches (albeit the one that scales the least well after you add
more than a few languages to your queries). If you want to stay simple and
stick with the ICUTokenizer then it will work to a point, but some of the
problems Walter mentioned may eventually bite you if you are supporting
certain groups of languages.

All the best,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Recommendations @ CareerBuilder

On Mon, Feb 23, 2015 at 11:14 PM, Walter Underwood 
wrote:

> It isn’t just complicated, it can be impossible.
>
> Do you have content in Chinese or Japanese? Those languages (and some
> others) do not separate words with spaces. You cannot even do word search
> without a language-specific, dictionary-based parser.
>
> German is space separated, except many noun compounds are not
> space-separated.
>
> Do you have Finnish content? Entire prepositional phrases turn into word
> endings.
>
> Do you have Arabic content? That is even harder.
>
> If all your content is in space-separated languages that are not heavily
> inflected, you can kind of do OK with a language-insensitive approach. But
> it hits the wall pretty fast.
>
> One thing that does work pretty well is trademarked names (LaserJet, Coke,
> etc). Those are spelled the same in all languages and usually not inflected.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Feb 23, 2015, at 8:00 PM, Rishi Easwaran 
> wrote:
>
> > Hi Alex,
> >
> > There is no specific language list.
> > For example: the documents that needs to be indexed are emails or any
> messages for a global customer base. The messages back and forth could be
> in any language or mix of languages.
> >
> > I understand relevancy, stemming etc becomes extremely complicated with
> multilingual support, but our first goal is to be able to tokenize and
> provide basic search capability for any language. Ex: When the document
> contains hello or здравствуйте, the analyzer creates tokens and provides
> exact match search results.
> >
> > No

Re: Basic Multilingual search capability

Hi Wunder,

Yes we do expect incoming documents to contain Chinese/Japanese/Arabic 
languages.

From what you have mentioned, it looks like we need to auto detect the incoming 
content language and tokenize/filter after that.
But I thought the ICU tokenizer had capability to do that  
(https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-ICUTokenizer)
"This tokenizer processes multilingual text and tokenizes it appropriately 
based on its script attribute." 
or am I missing something? 

Thanks,
Rishi.

-Original Message-
From: Walter Underwood 
To: solr-user 
Sent: Mon, Feb 23, 2015 11:17 pm
Subject: Re: Basic Multilingual search capability

It isn’t just complicated, it can be impossible.

Do you have content in Chinese or Japanese? Those languages (and some others) 
do 
not separate words with spaces. You cannot even do word search without a 
language-specific, dictionary-based parser.

German is space separated, except many noun compounds are not space-separated.

Do you have Finnish content? Entire prepositional phrases turn into word 
endings.

Do you have Arabic content? That is even harder.

If all your content is in space-separated languages that are not heavily 
inflected, you can kind of do OK with a language-insensitive approach. But it 
hits the wall pretty fast.

One thing that does work pretty well is trademarked names (LaserJet, Coke, 
etc). 
Those are spelled the same in all languages and usually not inflected.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Feb 23, 2015, at 8:00 PM, Rishi Easwaran  wrote:

> Hi Alex,
> 
> There is no specific language list.  
> For example: the documents that needs to be indexed are emails or any 
> messages 
for a global customer base. The messages back and forth could be in any 
language 
or mix of languages.
> 
> I understand relevancy, stemming etc becomes extremely complicated with 
multilingual support, but our first goal is to be able to tokenize and provide 
basic search capability for any language. Ex: When the document contains hello 
or здравствуйте, the analyzer creates tokens and provides exact match search 
results.
> 
> Now it would be great if it had capability to tokenize email addresses 
(ex:he...@aol.com- i think standardTokenizer already does this),  filenames 
(здравствуйте.pdf), but maybe we can use filters to accomplish that. 
> 
> Thanks,
> Rishi.
> 
> -Original Message-
> From: Alexandre Rafalovitch 
> To: solr-user 
> Sent: Mon, Feb 23, 2015 5:49 pm
> Subject: Re: Basic Multilingual search capability
> 
> 
> Which languages are you expecting to deal with? Multilingual support
> is a complex issue. Even if you think you don't need much, it is
> usually a lot more complex than expected, especially around relevancy.
> 
> Regards,
>   Alex.
> 
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
> 
> 
> On 23 February 2015 at 16:19, Rishi Easwaran  wrote:
>> Hi All,
>> 
>> For our use case we don't really need to do a lot of manipulation of 
>> incoming 

> text during index time. At most removal of common stop words, tokenize 
> emails/ 

> filenames etc if possible. We get text documents from our end users, which 
> can 

> be in any language (sometimes combination) and we cannot determine the 
language 
> of the incoming text. Language detection at index time is not necessary.
>> 
>> Which analyzer is recommended to achive basic multilingual search capability 
> for a use case like this.
>> I have read a bunch of posts about using a combination standardtokenizer or 
> ICUtokenizer, lowercasefilter and reverwildcardfilter factory, but looking 
> for 

> ideas, suggestions, best practices.
>> 
>> http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
>> http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
>> https://issues.apache.org/jira/browse/SOLR-6492
>> 
>> 
>> Thanks,
>> Rishi.
>> 
> 
>

Re: Basic Multilingual search capability

Hi Alex,

There is no specific language list.  
For example: the documents that needs to be indexed are emails or any messages 
for a global customer base. The messages back and forth could be in any 
language or mix of languages.

I understand relevancy, stemming etc becomes extremely complicated with 
multilingual support, but our first goal is to be able to tokenize and provide 
basic search capability for any language. Ex: When the document contains hello 
or здравствуйте, the analyzer creates tokens and provides exact match search 
results.

Now it would be great if it had capability to tokenize email addresses 
(ex:he...@aol.com- i think standardTokenizer already does this),  filenames 
(здравствуйте.pdf), but maybe we can use filters to accomplish that. 

Thanks,
Rishi.

-Original Message-
From: Alexandre Rafalovitch 
To: solr-user 
Sent: Mon, Feb 23, 2015 5:49 pm
Subject: Re: Basic Multilingual search capability

Which languages are you expecting to deal with? Multilingual support
is a complex issue. Even if you think you don't need much, it is
usually a lot more complex than expected, especially around relevancy.

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/

On 23 February 2015 at 16:19, Rishi Easwaran  wrote:
> Hi All,
>
> For our use case we don't really need to do a lot of manipulation of incoming 
text during index time. At most removal of common stop words, tokenize emails/ 
filenames etc if possible. We get text documents from our end users, which can 
be in any language (sometimes combination) and we cannot determine the language 
of the incoming text. Language detection at index time is not necessary.
>
> Which analyzer is recommended to achive basic multilingual search capability 
for a use case like this.
> I have read a bunch of posts about using a combination standardtokenizer or 
ICUtokenizer, lowercasefilter and reverwildcardfilter factory, but looking for 
ideas, suggestions, best practices.
>
> http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
> http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
> https://issues.apache.org/jira/browse/SOLR-6492
>
>
> Thanks,
> Rishi.
>

Basic Multilingual search capability

Hi All,

For our use case we don't really need to do a lot of manipulation of incoming 
text during index time. At most removal of common stop words, tokenize emails/ 
filenames etc if possible. We get text documents from our end users, which can 
be in any language (sometimes combination) and we cannot determine the language 
of the incoming text. Language detection at index time is not necessary.

Which analyzer is recommended to achive basic multilingual search capability 
for a use case like this.
I have read a bunch of posts about using a combination standardtokenizer or 
ICUtokenizer, lowercasefilter and reverwildcardfilter factory, but looking for 
ideas, suggestions, best practices.

http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
https://issues.apache.org/jira/browse/SOLR-6492  

 
Thanks,
Rishi.

Re: Strange search behaviour when upgrading to 4.10.3

Thanks Shawn.
Just ran the analysis between 4.6 and 4.10, there seems to be only difference 
between the outputs positionLength value is set in 4.10. Does that mean 
anything.

Version 4.10

SF

text

raw_bytes

start

end

positionLength

type

position

message

[6d 65 73 73 61 67 65]

0

7

1

ALNUM

1

 Version 4.6

SF

text

raw_bytes

type

start

end

position

message

[6d 65 73 73 61 67 65]

ALNUM

0

7

1

Thanks,
Rishi.

-Original Message-
From: Shawn Heisey 
To: solr-user 
Sent: Fri, Feb 20, 2015 6:51 pm
Subject: Re: Strange search behaviour when upgrading to 4.10.3

On 2/20/2015 4:24 PM, Rishi Easwaran wrote:
> Also, the tokenizer we use is very similar to the following.
> ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalTokenizer.java
> ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalLexer.jflex
>
>
> From the looks of it the text is being indexed as a single token and not 
broken across whitespace. 

I can't claim to know how analyzer code works.  I did manage to see the
code, but it doesn't mean much to me.

I would suggest using the analysis tab in the Solr admin interface.  On
that page, select the field or fieldType, set the "verbose" flag and
type the actual field contents into the "index" side of the page.  When
you click the Analyze Values button, it will show you what Solr does
with the input at index time.

Do you still have access to any machines (dev or otherwise) running the
old version with the custom component? If so, do the same things on the
analysis page for that version that you did on the new version, and see
whether it does something different.  If it does do something different,
then you will need to track down the problem in the code for your custom
analyzer.

Thanks,
Shawn

Re: Strange search behaviour when upgrading to 4.10.3

2015-02-20 Thread Rishi Easwaran

Hi Shawn,
Also, the tokenizer we use is very similar to the following.
ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalTokenizer.java
ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalLexer.jflex

From the looks of it the text is being indexed as a single token and not broken 
across whitespace. 

Thanks,
Rishi. 

-Original Message-
From: Shawn Heisey 
To: solr-user 
Sent: Fri, Feb 20, 2015 11:52 am
Subject: Re: Strange search behaviour when upgrading to 4.10.3

On 2/20/2015 9:37 AM, Rishi Easwaran wrote:
> We are trying to upgrade from Solr 4.6 to 4.10.3. When testing search 4.10.3 
search results are not being returned, actually looks like only the first word 
in a sentence is getting indexed. 
> Ex: inserting "This is a test message" only returns results when searching 
> for 
content:this*. searching for content:test* or content:message* does not work 
with 4.10. Only searching for content:*message* works. This leads to me to 
believe there is something wrong with behaviour of our analyzer and tokenizers 

>  
> 
>   
>
> 
> 
> 
>  
> Looking at the release notes from solr and lucene
> http://lucene.apache.org/solr/4_10_1/changes/Changes.html
> http://lucene.apache.org/core/4_10_1/changes/Changes.html
> Nothing really sticks out, atleast to me.  Any help to get it working with 
4.10 would be great.

The links you provided lead to zero-byte files when I try them, so I
could not look deeper.

Have you recompiled your custom analysis components against the newer
versions of the Solr/Lucene libraries?  Anytime you're dealing with
custom components, you cannot assume that a component compiled to work
with one version of Solr will work with another version.  The internal
API does change, and there is less emphasis on avoiding API breaks in
minor Solr releases than there is with Lucene, because the vast majority
of Solr users are not writing their own code that uses the Solr API. 
Recompiling against the newer libraries may cause compiler errors that
reveal places in your code that require changes.

Thanks,
Shawn

Re: Strange search behaviour when upgrading to 4.10.3

2015-02-20 Thread Rishi Easwaran

Yes, The analyzers and tokenizers were recompiled with new version of 
solr/lucene and there were some errors, most of them were related to using 
BytesRefBuilder, which i did. 

Can you try these links.
ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/ZimbraAnalyzer.java
ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalAnalyzer.java

-Original Message-
From: Shawn Heisey 
To: solr-user 
Sent: Fri, Feb 20, 2015 11:52 am
Subject: Re: Strange search behaviour when upgrading to 4.10.3

On 2/20/2015 9:37 AM, Rishi Easwaran wrote:
> We are trying to upgrade from Solr 4.6 to 4.10.3. When testing search 4.10.3 
search results are not being returned, actually looks like only the first word 
in a sentence is getting indexed. 
> Ex: inserting "This is a test message" only returns results when searching 
> for 
content:this*. searching for content:test* or content:message* does not work 
with 4.10. Only searching for content:*message* works. This leads to me to 
believe there is something wrong with behaviour of our analyzer and tokenizers 

>  
> 
>   
>
> 
> 
> 
>  
> Looking at the release notes from solr and lucene
> http://lucene.apache.org/solr/4_10_1/changes/Changes.html
> http://lucene.apache.org/core/4_10_1/changes/Changes.html
> Nothing really sticks out, atleast to me.  Any help to get it working with 
4.10 would be great.

The links you provided lead to zero-byte files when I try them, so I
could not look deeper.

Have you recompiled your custom analysis components against the newer
versions of the Solr/Lucene libraries?  Anytime you're dealing with
custom components, you cannot assume that a component compiled to work
with one version of Solr will work with another version.  The internal
API does change, and there is less emphasis on avoiding API breaks in
minor Solr releases than there is with Lucene, because the vast majority
of Solr users are not writing their own code that uses the Solr API. 
Recompiling against the newer libraries may cause compiler errors that
reveal places in your code that require changes.

Thanks,
Shawn

Strange search behaviour when upgrading to 4.10.3

2015-02-20 Thread Rishi Easwaran

Hi,

We are trying to upgrade from Solr 4.6 to 4.10.3. When testing search 4.10.3
search results are not being returned, actually looks like only the first word
in a sentence is getting indexed.
Ex: inserting "This is a test message" only returns results when searching for
content:this*. searching for content:test* or content:message* does not work
with 4.10. Only searching for content:*message* works. This leads to me to
believe there is something wrong with behaviour of our analyzer and tokenizers

A little bit of background.

We have our own analyzer and tokenizer since pre solr 1.4 and its been
regularly updated. The analyzer works with solr 4.6 we have it running in
production (I also tested that search works with solr 4.9.1).
It is very similar to the tokenizers and analyzers located here.
ftp://193.87.16.77/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/ZimbraAnalyzer.java
ftp://193.87.16.77/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalAnalyzer.java
ftp://193.87.16.77/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/
But with modifications to work with latest solr/lucene code ex: override-
createComponents

The schema of the filed being analyzed is as follows

Looking at the release notes from solr and lucene
http://lucene.apache.org/solr/4_10_1/changes/Changes.html
http://lucene.apache.org/core/4_10_1/changes/Changes.html
Nothing really sticks out, atleast to me. Any help to get it working with 4.10
would be great.

Thanks,
Rishi.

SOLR Talk at AOL Dulles Campus.

2014-07-08 Thread Rishi Easwaran

All, 
There is a tech talk on AOL Dulles campus tomorrow. Do swing by if you can and 
share it with your colleagues and friends. 
www.meetup.com/Code-Brew/events/192361672/
There will be free food and beer served at this event :)

Thanks,
Rishi.

Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2

2014-03-31 Thread Rishi Easwaran

The SSD is separated into logical volumes..each instance gets 100 GB SSD disk 
space to write its index.
If I add them all up its ~45GB in 1TB SSD disk space. 
Not sure I get " You should not be running more than one instance of Solr per 
machine.One instance of Solr can run multiple indexes. "   
Yeah I know that, we have been running 6-8 instances of SOLR using multicore 
ability since ~2008, supporting millions of small indexes. 
Now we are looking at SOLR cloud with large indexes to see if we can leverage 
some of its benefits.
As many folks have experienced, JVM with its stop the world pauses, cannot GC 
using CMS within acceptable limits with very large heaps. 
To utilize the H/W to its full potential, multiple instances on a single host 
is pretty common practice for us. 

-Original Message-
From: Shawn Heisey 
To: solr-user 
Sent: Sun, Mar 30, 2014 5:51 pm
Subject: Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2

On 3/30/2014 2:59 PM, Rishi Easwaran wrote:
> RAM shouldn't be a problem. 
> I have a box with 144GB RAM, running 12 instances with 4GB Java heap each.
> There are 9 instances wrting to 1TB of SSD disk space. 
>  Other 3 are writing to SATA drives, and have autosoftcommit disabled.

This brought up more questions than it answered.  I was assuming that
you only had a total of 4GB of index data, but after reading this, I
think my assumption may be incorrect.  If you add up all the Solr index
data on the SSD, how much disk space does it take?

You should not be running more than one instance of Solr per machine.
One instance of Solr can run multiple indexes.  Running more than one
results in quite a lot of overhead, and it seems unlikely that you would
need to dedicate 48GB of total RAM to the Java heap.

Thanks,
Shawn

Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2

2014-03-30 Thread Rishi Easwaran

RAM shouldn't be a problem. 
I have a box with 144GB RAM, running 12 instances with 4GB Java heap each.
There are 9 instances wrting to 1TB of SSD disk space. 
 Other 3 are writing to SATA drives, and have autosoftcommit disabled.

 

 

-Original Message-
From: Shawn Heisey 
To: solr-user 
Sent: Fri, Mar 28, 2014 8:35 pm
Subject: Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2


On 3/28/2014 4:07 PM, Rishi Easwaran wrote:
> 
>  Shawn,
> 
> I changed the autoSoftCommit value to 15000 (15 sec). 
> My index size is pretty small ~4GB and its running on a SSD drive with ~100 
> GB 
space on it. 
> Now I see the warn message every 15 seconds.
> 
> The caches I think are minimal
> 
> 
> 
> 
initialSize="512" autowarmCount="0"/>
>  
initialSize="512"   autowarmCount="0"/>
> 
> 200
> 
> I think still something is going on. I mean 15s on SSD drives is a long time 
to handle a 4GB index.

How much RAM do you have and what size is your max java heap?

https://wiki.apache.org/solr/SolrPerformanceProblems#RAM

Thanks,
Shawn

Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2

2014-03-28 Thread Rishi Easwaran


 Shawn,

I changed the autoSoftCommit value to 15000 (15 sec). 
My index size is pretty small ~4GB and its running on a SSD drive with ~100 GB 
space on it. 
Now I see the warn message every 15 seconds.

The caches I think are minimal



 
 

200

I think still something is going on. I mean 15s on SSD drives is a long time to 
handle a 4GB index.


Thanks,
Rishi.

 

-Original Message-
From: Shawn Heisey 
To: solr-user 
Sent: Fri, Mar 28, 2014 3:28 pm
Subject: Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2


On 3/28/2014 1:03 PM, Rishi Easwaran wrote:
> I thought auto soft commit was for NRT search (shouldn't it be optimized for 
search performance), if i have to wait 10 mins how is it NRT? or am I missing 
something?

You are correct, but once a second is REALLY often.  If the rest of your 
config is not set up properly, that's far too frequent.  With commits 
happening once a second, they must complete in less than a second, and 
that can be difficult to achieve.

A typical extreme NRT config requires small (or disabled) Solr caches, 
no cache autowarming, and enough free RAM (not allocated to programs) to 
cache all of the index data on the server.  If the index is very big, it 
may not be possible to get the commit time below one second, so you may 
need to go with something like 10 to 60 seconds.

Thanks,
Shawn

Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2

2014-03-28 Thread Rishi Easwaran

Hi Dmitry,

I thought auto soft commit was for NRT search (shouldn't it be optimized for 
search performance), if i have to wait 10 mins how is it NRT? or am I missing 
something?

 Thanks,
Rishi.

-Original Message-
From: Dmitry Kan 
To: solr-user 
Sent: Fri, Mar 28, 2014 1:02 pm
Subject: Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2

Hi Rishi,

Do you really need soft-commit every second? Can you make it 10 mins, for
example?

What is happening (conditional on checking your logs) is that several
commits (looks like 2 in your case) are arriving in a quick succession.
Then system is starting to warmup the searchers, one per each commit. This
is a waste of resources, because only one searcher will be used in the end,
so one of them is warming in vain.

Just rethink your commit strategy with regards to the update frequency and
warming up time to avoid issues with this in the future.

Dmitry

On Thu, Mar 27, 2014 at 11:16 PM, Rishi Easwaran wrote:

> All,
>
> I am running SOLR Cloud 4.6, everything looks ok, except for this warn
> message constantly in the logs.
>
>
> 2014-03-27 17:09:03,982 WARN  [commitScheduler-15-thread-1] [] SolrCore -
> [index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> 2014-03-27 17:09:05,517 WARN  [commitScheduler-15-thread-1] [] SolrCore -
> [index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> 2014-03-27 17:09:06,774 WARN  [commitScheduler-15-thread-1] [] SolrCore -
> [index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> 2014-03-27 17:09:08,085 WARN  [commitScheduler-15-thread-1] [] SolrCore -
> [index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> 2014-03-27 17:09:09,114 WARN  [commitScheduler-15-thread-1] [] SolrCore -
> [index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> 2014-03-27 17:09:10,238 WARN  [commitScheduler-15-thread-1] [] SolrCore -
> [index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
>
> Searched around a bit, looks like my solrconfig.xml is configured fine and
> verified there are no explicit commits sent by our clients.
>
> My solrconfig.xml
>  
> 1
> 6
> false
> 
>
>
>  1000
>
>
>
> Any idea why its warning every second, the only config that has 1 second
> is softcommit.
>
> Thanks,
> Rishi.
>
>

-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan

SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2

2014-03-27 Thread Rishi Easwaran

All,

I am running SOLR Cloud 4.6, everything looks ok, except for this warn message 
constantly in the logs.


2014-03-27 17:09:03,982 WARN  [commitScheduler-15-thread-1] [] SolrCore - 
[index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
2014-03-27 17:09:05,517 WARN  [commitScheduler-15-thread-1] [] SolrCore - 
[index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
2014-03-27 17:09:06,774 WARN  [commitScheduler-15-thread-1] [] SolrCore - 
[index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
2014-03-27 17:09:08,085 WARN  [commitScheduler-15-thread-1] [] SolrCore - 
[index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
2014-03-27 17:09:09,114 WARN  [commitScheduler-15-thread-1] [] SolrCore - 
[index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
2014-03-27 17:09:10,238 WARN  [commitScheduler-15-thread-1] [] SolrCore - 
[index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2

Searched around a bit, looks like my solrconfig.xml is configured fine and 
verified there are no explicit commits sent by our clients.

My solrconfig.xml 
 
1
6
false


   
 1000
   


Any idea why its warning every second, the only config that has 1 second is 
softcommit.

Thanks,
Rishi.

Re: Solr Cloud Hangs consistently .

2013-06-19 Thread Rishi Easwaran

Update!!

Got SOLR cloud working, was able to do 90k document inserts with 
replicationFactor=2, with my jmeter script, previously was getting stuck with 
3k inserts or less.
After some investigation, figured out that ulimits for my process were not 
being set properly, OS defaults were kicking in, which is very small for a 
server app.
One of our install script had changed.
I had to up the ulimits - -n,-u,-v and for now no other issues seen.


 

 

-Original Message-
From: Rishi Easwaran 
To: solr-user 
Sent: Tue, Jun 18, 2013 10:40 am
Subject: Re: Solr Cloud Hangs consistently .


Mark,

All I am doing are inserts, afaik search side deadlocks should not be an issue.

I am using Jmeter, standard test driver we use for most of our benchmarks and 
stats collection.
My jmeter.jmx file- http://apaste.info/79IS , maybe i overlooked something

 
Is there a benchmark script that solr community uses (preferably with jmeter), 
we are write heavy so at the moment focusing on inserts only.

Thanks,

Rishi.

 

 

-Original Message-
From: Yago Riveiro 
To: solr-user 
Sent: Mon, Jun 17, 2013 6:19 pm
Subject: Re: Solr Cloud Hangs consistently .


I do all the indexing through a HTTP POST, with replicationFactor=1 no problem, 
if is higher deadlock problems can appear

A stack trace like this 
http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067862
 

is that I get

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Monday, June 17, 2013 at 11:03 PM, Mark Miller wrote:

> If it actually happens with replicationFactor=1, it doesn't likely have 
anything to do with the update handler issue I'm referring to. In some cases 
like these, people have better luck with Jetty than Tomcat - we test it much 
more. For instance, it's setup to help avoid search side distributed deadlocks.
> 
> In any case, there is something special about it - I do and have seen a lot 
> of 

heavy indexing to SolrCloud by me and others without running into this. Both 
with replicationFacotor=1 and greater. So there is something specific in how 
the 

load is being done or what features/methods are being used that likely causes 
it 

or makes it easier to cause.
> 
> But again, the issue I know about involves threads that are not even created 
in the replicationFactor = 1 case, so that could be a first report afaik.
> 
> - Mark
> 
> On Jun 17, 2013, at 5:52 PM, Rishi Easwaran mailto:rishi.easwa...@aol.com)> wrote:
> 
> > Update!!
> > 
> > This happens with replicationFactor=1
> > Just for kicks I created a collection with a 24 shards, replicationfactor=1 
cluster on my exisiting benchmark env.
> > Same behaviour, SOLR cloud just hangs. Nothing in the logs, top/heap/cpu 
most metrics looks fine.
> > Only indication seems to be netstat showing incoming request not being read 
in.
> > 
> > Yago,
> > 
> > I saw your previous post 
> > (http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067631)
> > Following it, Last week, I upgraded to SOLR 4.3, to see if the issue gets 
fixed, but no luck.
> > Looks like this is a dominant and easily reproducible issue on SOLR cloud.
> > 
> > 
> > Thanks,
> > 
> > Rishi. 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > -Original Message-
> > From: Yago Riveiro mailto:yago.rive...@gmail.com)>
> > To: solr-user  > (mailto:solr-user@lucene.apache.org)>
> > Sent: Mon, Jun 17, 2013 5:15 pm
> > Subject: Re: Solr Cloud Hangs consistently .
> > 
> > 
> > I can confirm that the deadlock happen with only 2 replicas by shard. I 
> > need 


> > shutdown one node that host a replica of the shard to recover the 
> > indexation 


> > capability.
> > 
> > -- 
> > Yago Riveiro
> > Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> > 
> > 
> > On Monday, June 17, 2013 at 6:44 PM, Rishi Easwaran wrote:
> > 
> > > 
> > > 
> > > Hi All,
> > > 
> > > I am trying to benchmark SOLR Cloud and it consistently hangs. 
> > > Nothing in the logs, no stack trace, no errors, no warnings, just seems 
stuck.
> > > 
> > > A little bit about my set up. 
> > > I have 3 benchmark hosts, each with 96GB RAM, 24 CPU's and 1TB SSD. Each 
host 
> > > 
> > 
> > is configured to have 8 SOLR cloud nodes running at 4GB each.
> > > JVM configs: http://apaste.info/57Ai
> > > 
> > > My cluster has 12 shards with replication factor 2- 
> > > http://apaste.info/09sA
> > > 
> > > I originally stated with SOLR 4.2., tomcat

Re: SOLR Cloud - Disable Transaction Logs

2013-06-18 Thread Rishi Easwaran

Erick,

We at AOL mail have been using SOLR for quiet a while and our system is pretty 
write heavy and disk I/O is one of our bottlenecks. At present we use regular 
SOLR in the lotsOfCore configuration and I am in  the process of benchmarking 
SOLR cloud for our use case. I don't have concrete data that tLogs are placing 
lot of load on the system, but for a large scale system like ours even minimal 
load gets magnified. 

>From the Cloud design, for a properly set up cluster, usually you have 
>replicas at different availability zones . Probablity of losing more than 1 
>availability zone at any given time should be pretty low. Why have tLogs if 
>all replicas on an update get the request anyway, In theory 1 replica must be 
>able to commit eventually.

NRT is an optional feature and probably not tied to Cloud, correct?

Thanks,

Rishi.

-Original Message-
From: Erick Erickson 
To: solr-user 
Sent: Tue, Jun 18, 2013 4:07 pm
Subject: Re: SOLR Cloud - Disable Transaction Logs

bq: the replica can take over and maintain a durable
state of my index

This is not true. On an update, all the nodes in a slice
have already written the data to the tlog, not just the
leader. So if a leader goes down, the replicas have
enough local info to insure that data is not lost. Without
tlogs this would not be true since documents are not
durably saved until a hard commit.

tlogs save data between hard commits. As Yonik
explained to me once, "soft commits are about
visibility, hard commits are about durability" and
tlogs fill up the gap between hard commits.

So to reinforce Shalin's comment yes, you can disable tlogs
if
1> you don't want any of SolrCloud's HA/DR capabilities
2> NRT is unimportant

IOW if you're using 4.x just like you would 3.x in terms
of replication, HA/DR, etc. This is perfectly reasonable,
but don't get hung up on disabling tlogs.

And you haven't told us _why_ you want to do this. They
don't consume much memory or disk space unless you
have configured your hard commits (with openSearcher
true or false) to be quite long. Do you have any proof at
all that the tlogs are placing enough load on the system
to go down this road?

Best
Erick

On Tue, Jun 18, 2013 at 10:49 AM, Rishi Easwaran  wrote:
> SolrJ already has access to zookeeper cluster state. Network I/O bottleneck 
can be avoided by parallel requests.
> You are only as slow as your slowest responding server, which could be your 
single leader with the current set up.
>
> Wouldn't this lessen the burden of the leader, as he does not have to 
> maintain 
transaction logs or distribute to replicas?
>
>
>
>
>
>
>
> -Original Message-
> From: Shalin Shekhar Mangar 
> To: solr-user 
> Sent: Tue, Jun 18, 2013 2:05 am
> Subject: Re: SOLR Cloud - Disable Transaction Logs
>
>
> Yes, but at what cost? You are thinking of replacing disk IO with even more
> slower network IO. The transaction log is a append-only log -- it is not
> pretty cheap especially so if you compare it with the indexing process.
> Plus your write request/sec will drop a lot once you start doing
> synchronous replication.
>
>
> On Tue, Jun 18, 2013 at 2:18 AM, Rishi Easwaran wrote:
>
>> Shalin,
>>
>> Just some thoughts.
>>
>> Near Real time replication- don't we use solrCmdDistributor, which send
>> requests immediately to replicas with a clonedRequest, as an option can't
>> we achieve something similar form CloudSolrserver in Solrj instead of
>> leader doing it. As long as 2 nodes receive writes and acknowledge.
>> durability should be high.
>> Peer-Sync and Recovery - Can we achieve that merging indexes from leader
>> as needed, instead of replaying the transaction logs?
>>
>> Rishi.
>>
>>
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Shalin Shekhar Mangar 
>> To: solr-user 
>> Sent: Mon, Jun 17, 2013 3:43 pm
>> Subject: Re: SOLR Cloud - Disable Transaction Logs
>>
>>
>> It is also necessary for near real-time replication, peer sync and
>> recovery.
>>
>>
>> On Tue, Jun 18, 2013 at 1:04 AM, Rishi Easwaran > >wrote:
>>
>> > Hi,
>> >
>> > Is there a way to disable transaction logs in SOLR cloud. As far as I can
>> > tell no.
>> > Just curious why do we need transaction logs, seems like an I/O intensive
>> > operation.
>> > As long as I have replicatonFactor >1, if a node (leader) goes down, the
>> > replica can take over and maintain a durable state of my index.
>> >
>> > I understand from the previous discussions, that it was intended for
>> > update durability and realtime get.
>> > But, unless I am missing something an ability to disable it in SOLR cloud
>> > if not needed would be good.
>> >
>> > Thanks,
>> >
>> > Rishi.
>> >
>> >
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>>
>>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
>

Re: SOLR Cloud - Disable Transaction Logs

2013-06-18 Thread Rishi Easwaran

SolrJ already has access to zookeeper cluster state. Network I/O bottleneck can 
be avoided by parallel requests. 
You are only as slow as your slowest responding server, which could be your 
single leader with the current set up.

Wouldn't this lessen the burden of the leader, as he does not have to maintain 
transaction logs or distribute to replicas? 

 

 

 

-Original Message-
From: Shalin Shekhar Mangar 
To: solr-user 
Sent: Tue, Jun 18, 2013 2:05 am
Subject: Re: SOLR Cloud - Disable Transaction Logs


Yes, but at what cost? You are thinking of replacing disk IO with even more
slower network IO. The transaction log is a append-only log -- it is not
pretty cheap especially so if you compare it with the indexing process.
Plus your write request/sec will drop a lot once you start doing
synchronous replication.


On Tue, Jun 18, 2013 at 2:18 AM, Rishi Easwaran wrote:

> Shalin,
>
> Just some thoughts.
>
> Near Real time replication- don't we use solrCmdDistributor, which send
> requests immediately to replicas with a clonedRequest, as an option can't
> we achieve something similar form CloudSolrserver in Solrj instead of
> leader doing it. As long as 2 nodes receive writes and acknowledge.
> durability should be high.
> Peer-Sync and Recovery - Can we achieve that merging indexes from leader
> as needed, instead of replaying the transaction logs?
>
> Rishi.
>
>
>
>
>
>
>
> -Original Message-
> From: Shalin Shekhar Mangar 
> To: solr-user 
> Sent: Mon, Jun 17, 2013 3:43 pm
> Subject: Re: SOLR Cloud - Disable Transaction Logs
>
>
> It is also necessary for near real-time replication, peer sync and
> recovery.
>
>
> On Tue, Jun 18, 2013 at 1:04 AM, Rishi Easwaran  >wrote:
>
> > Hi,
> >
> > Is there a way to disable transaction logs in SOLR cloud. As far as I can
> > tell no.
> > Just curious why do we need transaction logs, seems like an I/O intensive
> > operation.
> > As long as I have replicatonFactor >1, if a node (leader) goes down, the
> > replica can take over and maintain a durable state of my index.
> >
> > I understand from the previous discussions, that it was intended for
> > update durability and realtime get.
> > But, unless I am missing something an ability to disable it in SOLR cloud
> > if not needed would be good.
> >
> > Thanks,
> >
> > Rishi.
> >
> >
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
>
>


-- 
Regards,
Shalin Shekhar Mangar.

Re: Solr Cloud Hangs consistently .

2013-06-18 Thread Rishi Easwaran

Mark,

All I am doing are inserts, afaik search side deadlocks should not be an issue.

I am using Jmeter, standard test driver we use for most of our benchmarks and 
stats collection.
My jmeter.jmx file- http://apaste.info/79IS , maybe i overlooked something

 
Is there a benchmark script that solr community uses (preferably with jmeter), 
we are write heavy so at the moment focusing on inserts only.

Thanks,

Rishi.

 

 

-Original Message-
From: Yago Riveiro 
To: solr-user 
Sent: Mon, Jun 17, 2013 6:19 pm
Subject: Re: Solr Cloud Hangs consistently .


I do all the indexing through a HTTP POST, with replicationFactor=1 no problem, 
if is higher deadlock problems can appear

A stack trace like this 
http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067862
 
is that I get

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Monday, June 17, 2013 at 11:03 PM, Mark Miller wrote:

> If it actually happens with replicationFactor=1, it doesn't likely have 
anything to do with the update handler issue I'm referring to. In some cases 
like these, people have better luck with Jetty than Tomcat - we test it much 
more. For instance, it's setup to help avoid search side distributed deadlocks.
> 
> In any case, there is something special about it - I do and have seen a lot 
> of 
heavy indexing to SolrCloud by me and others without running into this. Both 
with replicationFacotor=1 and greater. So there is something specific in how 
the 
load is being done or what features/methods are being used that likely causes 
it 
or makes it easier to cause.
> 
> But again, the issue I know about involves threads that are not even created 
in the replicationFactor = 1 case, so that could be a first report afaik.
> 
> - Mark
> 
> On Jun 17, 2013, at 5:52 PM, Rishi Easwaran mailto:rishi.easwa...@aol.com)> wrote:
> 
> > Update!!
> > 
> > This happens with replicationFactor=1
> > Just for kicks I created a collection with a 24 shards, replicationfactor=1 
cluster on my exisiting benchmark env.
> > Same behaviour, SOLR cloud just hangs. Nothing in the logs, top/heap/cpu 
most metrics looks fine.
> > Only indication seems to be netstat showing incoming request not being read 
in.
> > 
> > Yago,
> > 
> > I saw your previous post 
> > (http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067631)
> > Following it, Last week, I upgraded to SOLR 4.3, to see if the issue gets 
fixed, but no luck.
> > Looks like this is a dominant and easily reproducible issue on SOLR cloud.
> > 
> > 
> > Thanks,
> > 
> > Rishi. 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > -Original Message-
> > From: Yago Riveiro mailto:yago.rive...@gmail.com)>
> > To: solr-user  > (mailto:solr-user@lucene.apache.org)>
> > Sent: Mon, Jun 17, 2013 5:15 pm
> > Subject: Re: Solr Cloud Hangs consistently .
> > 
> > 
> > I can confirm that the deadlock happen with only 2 replicas by shard. I 
> > need 

> > shutdown one node that host a replica of the shard to recover the 
> > indexation 

> > capability.
> > 
> > -- 
> > Yago Riveiro
> > Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> > 
> > 
> > On Monday, June 17, 2013 at 6:44 PM, Rishi Easwaran wrote:
> > 
> > > 
> > > 
> > > Hi All,
> > > 
> > > I am trying to benchmark SOLR Cloud and it consistently hangs. 
> > > Nothing in the logs, no stack trace, no errors, no warnings, just seems 
stuck.
> > > 
> > > A little bit about my set up. 
> > > I have 3 benchmark hosts, each with 96GB RAM, 24 CPU's and 1TB SSD. Each 
host 
> > > 
> > 
> > is configured to have 8 SOLR cloud nodes running at 4GB each.
> > > JVM configs: http://apaste.info/57Ai
> > > 
> > > My cluster has 12 shards with replication factor 2- 
> > > http://apaste.info/09sA
> > > 
> > > I originally stated with SOLR 4.2., tomcat 5 and jdk 6, as we are already 
> > running this configuration in production in Non-Cloud form. 
> > > It got stuck repeatedly.
> > > 
> > > I decided to upgrade to the latest and greatest of everything, SOLR 4.3, 
JDK7 
> > and tomcat7. 
> > > It still shows same behaviour and hangs through the test.
> > > 
> > > My test schema and config.
> > > Schema.xml - http://apaste.info/imah
> > > SolrConfig.xml - http://apaste.info/ku4F
> > > 
> > > The test is pretty simple. its a jmeter test with upda

Re: Solr Cloud Hangs consistently .

Update!!

This happens with replicationFactor=1
Just for kicks I created a collection with a 24 shards, replicationfactor=1 
cluster on my exisiting benchmark env.
Same behaviour, SOLR cloud just hangs. Nothing in the logs, top/heap/cpu most 
metrics looks fine.
Only indication seems to be netstat showing incoming request not being read in.

Yago,

I saw your previous post 
(http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067631)
Following it, Last week, I upgraded to SOLR 4.3, to see if the issue gets 
fixed, but no luck.
Looks like this is a dominant and easily reproducible issue on SOLR cloud.

Thanks,

Rishi. 

-Original Message-
From: Yago Riveiro 
To: solr-user 
Sent: Mon, Jun 17, 2013 5:15 pm
Subject: Re: Solr Cloud Hangs consistently .

I can confirm that the deadlock happen with only 2 replicas by shard. I need 
shutdown one node that host a replica of the shard to recover the indexation 
capability.

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Monday, June 17, 2013 at 6:44 PM, Rishi Easwaran wrote:

> 
> 
> Hi All,
> 
> I am trying to benchmark SOLR Cloud and it consistently hangs. 
> Nothing in the logs, no stack trace, no errors, no warnings, just seems stuck.
> 
> A little bit about my set up. 
> I have 3 benchmark hosts, each with 96GB RAM, 24 CPU's and 1TB SSD. Each host 
is configured to have 8 SOLR cloud nodes running at 4GB each.
> JVM configs: http://apaste.info/57Ai
> 
> My cluster has 12 shards with replication factor 2- http://apaste.info/09sA
> 
> I originally stated with SOLR 4.2., tomcat 5 and jdk 6, as we are already 
running this configuration in production in Non-Cloud form. 
> It got stuck repeatedly.
> 
> I decided to upgrade to the latest and greatest of everything, SOLR 4.3, JDK7 
and tomcat7. 
> It still shows same behaviour and hangs through the test.
> 
> My test schema and config.
> Schema.xml - http://apaste.info/imah
> SolrConfig.xml - http://apaste.info/ku4F
> 
> The test is pretty simple. its a jmeter test with update command via SOAP rpc 
(round robin request across every node), adding in 5 fields from a csv file - 
id, guid, subject, body, compositeID (guid!id).
> number of jmeter threads = 150. loop count = 20, num of messages to add/per 
guid = 3; total 150*3*20 = 9000 documents. 
> 
> When cloud gets stuck, i don't get anything in the logs, but when i run 
netstat i see the following.
> Sample netstat on a stuck run. http://apaste.info/hr0O 
> hycl-d20 is my jmeter host. ssd-d01/2/3 are my cloud hosts.
> 
> At the moment my benchmarking efforts are at a stand still.
> 
> Any help from the community would be great, I got some heap dumps and stack 
dumps, but haven't found a smoking gun yet.
> If I can provide anything else to diagnose this issue. just let me know.
> 
> Thanks,
> 
> Rishi.

Spread the word - Opening at AOL Mail Team in Dulles VA

Hi All,

With the economy the way it is and many folks still looking. 
Figured this is a good place as any to publish this. 

Just today, we got an opening for mid-senior level Software Engineer in our 
team.
Experience with SOLR is a big+.
Feel free to have a look at this position.
http://www.linkedin.com/jobs?viewJob=&jobId=6073910

If interested, send your current resume to rishi.easwa...@aol.com.
I will take it to my Director.   

This position is in Dulles, VA.

Thanks,

Rishi.

Re: SOLR Cloud - Disable Transaction Logs

Shalin,

Just some thoughts.

Near Real time replication- don't we use solrCmdDistributor, which send 
requests immediately to replicas with a clonedRequest, as an option can't we 
achieve something similar form CloudSolrserver in Solrj instead of leader doing 
it. As long as 2 nodes receive writes and acknowledge. durability should be 
high.
Peer-Sync and Recovery - Can we achieve that merging indexes from leader as 
needed, instead of replaying the transaction logs?

Rishi.

-Original Message-
From: Shalin Shekhar Mangar 
To: solr-user 
Sent: Mon, Jun 17, 2013 3:43 pm
Subject: Re: SOLR Cloud - Disable Transaction Logs

It is also necessary for near real-time replication, peer sync and recovery.

On Tue, Jun 18, 2013 at 1:04 AM, Rishi Easwaran wrote:

> Hi,
>
> Is there a way to disable transaction logs in SOLR cloud. As far as I can
> tell no.
> Just curious why do we need transaction logs, seems like an I/O intensive
> operation.
> As long as I have replicatonFactor >1, if a node (leader) goes down, the
> replica can take over and maintain a durable state of my index.
>
> I understand from the previous discussions, that it was intended for
> update durability and realtime get.
> But, unless I am missing something an ability to disable it in SOLR cloud
> if not needed would be good.
>
> Thanks,
>
> Rishi.
>
>

-- 
Regards,
Shalin Shekhar Mangar.

SOLR Cloud - Disable Transaction Logs

Hi,

Is there a way to disable transaction logs in SOLR cloud. As far as I can tell 
no.
Just curious why do we need transaction logs, seems like an I/O intensive 
operation.
As long as I have replicatonFactor >1, if a node (leader) goes down, the 
replica can take over and maintain a durable state of my index.

I understand from the previous discussions, that it was intended for update 
durability and realtime get.
But, unless I am missing something an ability to disable it in SOLR cloud if 
not needed would be good.

Thanks,

Rishi.

Re: Solr Cloud Hangs consistently .

FYI..you can ignore  http4ClientExpiryService thread in the stack dump.
Its a dummy executor service, i created to test out something, unrelated to 
this issue.  
 

 

 

-Original Message-
From: Rishi Easwaran 
To: solr-user 
Sent: Mon, Jun 17, 2013 2:54 pm
Subject: Re: Solr Cloud Hangs consistently .


Mark,

I got a few stack dumps of the instance that was stuck ssdtest-d03:8011

http://apaste.info/cofK
http://apaste.info/sv4M
http://apaste.info/cxUf

 


 I can get dumps of others if needed.

Thanks,

Rishi.

 

-Original Message-
From: Mark Miller 
To: solr-user 
Sent: Mon, Jun 17, 2013 1:57 pm
Subject: Re: Solr Cloud Hangs consistently .


Could you give a simple stack trace dump as well?

It's likely the distributed update deadlock that has been reported a few times 
now - I think usually with a replication factor greater than 2, but I can't be 
sure. The deadlock involves sending docs concurrently to replicas and I 
wouldn't 

have expected it to be so easily hit with only 2 replicas per shard. I should 
be 

able to tell from a stack trace though.

If it is that, it's on my short list to investigate (been there a long time now 
though - but I still hope to look at it soon).

- Mark

On Jun 17, 2013, at 1:44 PM, Rishi Easwaran  wrote:

> 
> 
> Hi All,
> 
> I am trying to benchmark SOLR Cloud and it consistently hangs. 
> Nothing in the logs, no stack trace, no errors, no warnings, just seems stuck.
> 
> A little bit about my set up. 
> I have 3 benchmark hosts, each with 96GB RAM, 24 CPU's and 1TB SSD. Each host 
is configured to have 8 SOLR cloud nodes running at 4GB each.
> JVM configs: http://apaste.info/57Ai
> 
> My cluster has 12 shards with replication factor 2- http://apaste.info/09sA
> 
> I originally stated with SOLR 4.2., tomcat 5 and jdk 6, as we are already 
running this configuration in production in Non-Cloud form. 
> It got stuck repeatedly.
> 
> I decided to upgrade to the latest and greatest of everything, SOLR 4.3, JDK7 
and tomcat7. 
> It still shows same behaviour and hangs through the test.
> 
> My test schema and config.
> Schema.xml - http://apaste.info/imah
> SolrConfig.xml - http://apaste.info/ku4F
> 
> The test is pretty simple. its a jmeter test with update command via SOAP rpc 
(round robin request across every node), adding in 5 fields from a csv file - 
id, guid, subject, body, compositeID (guid!id).
> number of jmeter threads = 150. loop count = 20, num of messages to add/per 
guid = 3; total 150*3*20 = 9000 documents.  
> 
> When cloud gets stuck, i don't get anything in the logs, but when i run 
netstat i see the following.
> Sample netstat on a stuck run. http://apaste.info/hr0O 
> hycl-d20 is my jmeter host. ssd-d01/2/3 are my cloud hosts.
> 
> 
> At the moment my benchmarking efforts are at a stand still.
> 
> Any help from the community would be great, I got some heap dumps and stack 
dumps, but haven't found a smoking gun yet.
> If I can provide anything else to diagnose this issue. just let me know.
> 
> Thanks,
> 
> Rishi.
> 
> 
> 
> 
> 
> 
> 
>

Re: Solr Cloud Hangs consistently .

Mark,

I got a few stack dumps of the instance that was stuck ssdtest-d03:8011

http://apaste.info/cofK
http://apaste.info/sv4M
http://apaste.info/cxUf

 


 I can get dumps of others if needed.

Thanks,

Rishi.

 

-Original Message-
From: Mark Miller 
To: solr-user 
Sent: Mon, Jun 17, 2013 1:57 pm
Subject: Re: Solr Cloud Hangs consistently .


Could you give a simple stack trace dump as well?

It's likely the distributed update deadlock that has been reported a few times 
now - I think usually with a replication factor greater than 2, but I can't be 
sure. The deadlock involves sending docs concurrently to replicas and I 
wouldn't 
have expected it to be so easily hit with only 2 replicas per shard. I should 
be 
able to tell from a stack trace though.

If it is that, it's on my short list to investigate (been there a long time now 
though - but I still hope to look at it soon).

- Mark

On Jun 17, 2013, at 1:44 PM, Rishi Easwaran  wrote:

> 
> 
> Hi All,
> 
> I am trying to benchmark SOLR Cloud and it consistently hangs. 
> Nothing in the logs, no stack trace, no errors, no warnings, just seems stuck.
> 
> A little bit about my set up. 
> I have 3 benchmark hosts, each with 96GB RAM, 24 CPU's and 1TB SSD. Each host 
is configured to have 8 SOLR cloud nodes running at 4GB each.
> JVM configs: http://apaste.info/57Ai
> 
> My cluster has 12 shards with replication factor 2- http://apaste.info/09sA
> 
> I originally stated with SOLR 4.2., tomcat 5 and jdk 6, as we are already 
running this configuration in production in Non-Cloud form. 
> It got stuck repeatedly.
> 
> I decided to upgrade to the latest and greatest of everything, SOLR 4.3, JDK7 
and tomcat7. 
> It still shows same behaviour and hangs through the test.
> 
> My test schema and config.
> Schema.xml - http://apaste.info/imah
> SolrConfig.xml - http://apaste.info/ku4F
> 
> The test is pretty simple. its a jmeter test with update command via SOAP rpc 
(round robin request across every node), adding in 5 fields from a csv file - 
id, guid, subject, body, compositeID (guid!id).
> number of jmeter threads = 150. loop count = 20, num of messages to add/per 
guid = 3; total 150*3*20 = 9000 documents.  
> 
> When cloud gets stuck, i don't get anything in the logs, but when i run 
netstat i see the following.
> Sample netstat on a stuck run. http://apaste.info/hr0O 
> hycl-d20 is my jmeter host. ssd-d01/2/3 are my cloud hosts.
> 
> 
> At the moment my benchmarking efforts are at a stand still.
> 
> Any help from the community would be great, I got some heap dumps and stack 
dumps, but haven't found a smoking gun yet.
> If I can provide anything else to diagnose this issue. just let me know.
> 
> Thanks,
> 
> Rishi.
> 
> 
> 
> 
> 
> 
> 
>

Solr Cloud Hangs consistently .



Hi All,

I am trying to benchmark SOLR Cloud and it consistently hangs. 
Nothing in the logs, no stack trace, no errors, no warnings, just seems stuck.

A little bit about my set up. 
I have 3 benchmark hosts, each with 96GB RAM, 24 CPU's and 1TB SSD. Each host 
is configured to have 8 SOLR cloud nodes running at 4GB each.
JVM configs: http://apaste.info/57Ai

My cluster has 12 shards with replication factor 2- http://apaste.info/09sA

I originally stated with SOLR 4.2., tomcat 5 and jdk 6, as we are already 
running this configuration in production in Non-Cloud form. 
It got stuck repeatedly.

I decided to upgrade to the latest and greatest of everything, SOLR 4.3, JDK7 
and tomcat7. 
 It still shows same behaviour and hangs through the test.

My test schema and config.
Schema.xml - http://apaste.info/imah
SolrConfig.xml - http://apaste.info/ku4F

The test is pretty simple. its a jmeter test with update command via SOAP rpc 
(round robin request across every node), adding in 5 fields from a csv file - 
id, guid, subject, body, compositeID (guid!id).
number of jmeter threads = 150. loop count = 20, num of messages to add/per 
guid = 3; total 150*3*20 = 9000 documents.  

When cloud gets stuck, i don't get anything in the logs, but when i run netstat 
i see the following.
Sample netstat on a stuck run. http://apaste.info/hr0O 
hycl-d20 is my jmeter host. ssd-d01/2/3 are my cloud hosts.

 
At the moment my benchmarking efforts are at a stand still.

Any help from the community would be great, I got some heap dumps and stack 
dumps, but haven't found a smoking gun yet.
If I can provide anything else to diagnose this issue. just let me know.

Thanks,

Rishi.

Re: shardkey

2013-06-12 Thread Rishi Easwaran

>From my understanding.
In SOLR cloud the CompositeIdDocRouter uses HashbasedDocRouter.
CompositeId router is default if your numShards>1 on collection creation.
CompositeId router generates an hash using the uniqueKey defined in your 
schema.xml to route your documents to a dedicated shard.

You can use select?q=xyz&shard.keys=uniquekey to focus your search to hit only 
the shard that has your shard.key  

 

 Thanks,

Rishi.

 

-Original Message-
From: Joshi, Shital 
To: 'solr-user@lucene.apache.org' 
Sent: Wed, Jun 12, 2013 10:01 am
Subject: shardkey


Hi,

We are using Solr 4.3.0 SolrCloud (5 shards, 10 replicas). I have couple 
questions on shard key. 

1. Looking at the admin GUI, how do I know which field is being used 
for shard 
key.
2. What is the default shard key used?
3. How do I override the default shard key?

Thanks.

Re: Solr Composite Unique key from existing fields in schema

Thanks Jack, That fixed it and guarantees the order.

As far as I can tell SOLR cloud 4.2.1 needs a uniquekey defined in its schema, 
or I get an exception.
SolrCore Initialization Failures
 * testCloud2_shard1_replica1: 
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
QueryElevationComponent requires the schema to have a uniqueKeyField. 

Now that I have an autogenerated composite-id, it has to become a part of my 
schema as uniquekey for SOLR cloud to work. 
  
  
  
compositeId

Is there a way to avoid compositeId field being defined in my schema.xml, would 
like to avoid the overhead of storing this field in my index.

Thanks,

Rishi.


 

 

 

-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 4:33 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


The TL;DR response: Try this:


  
userid_s
id
  
  
docid_s
id
  
  
id
--
  
  
  


That will assure that the userid gets processed before the docid.

I'll have to review the contract for CloneFieldUpdateProcessorFactory to see 
what is or ain't guaranteed when there are multiple input fields - whether 
this is a bug or a feature or simply undefined.

-- Jack Krupansky

-Original Message----- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 3:54 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Composite Unique key from existing fields in schema

I thought the same, but that doesn't seem to be the case.








-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 3:32 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


The order in the ID should be purely dependent on the order of the field
names in the processor configuration:

docid_s
userid_s

-- Jack Krupansky

-Original Message----- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 2:54 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Composite Unique key from existing fields in schema

Jack,

No sure if this is the correct behaviour.
I set up updateRequestorPorcess chain as mentioned below, but looks like the
compositeId that is generated is based on input order.

For example:
If my input comes in as
1
12345

I get the following compositeId1-12345.

If I reverse the input

12345

1
I get the following compositeId 12345-1 .


In this case the compositeId is not unique and I am getting duplicates.

Thanks,

Rishi.



-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 12:07 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


You can do this by combining the builtin update processors.

Add this to your solrconfig:


  
docid_s
userid_s
id
  
  
id
--
  
  
  


Add documents such as:

curl
"http://localhost:8983/solr/update?commit=true&update.chain=composite-id"; \
-H 'Content-type:application/json' -d '
[{"title": "Hello World",
  "docid_s": "doc-1",
  "userid_s": "user-1",
  "comments_ss": ["Easy", "Fast"]}]'

And get results like:

"title":["Hello World"],
"docid_s":"doc-1",
"userid_s":"user-1",
"comments_ss":["Easy",
  "Fast"],
"id":"doc-1--user-1",

Add as many fields in whatever order you want using "source" in the clone
update processor, and pick your composite key field name as well. And set
the delimiter string as well in the concat update processor.

I managed to reverse the field order from what you requested (userid,
docid).

I used the standard Solr example schema, so I used dynamic fields for the
two ids, but use your own field names.

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 11:12 AM
To: solr-user@lucene.apache.org
Subject: Solr Composite Unique key from existing fields in schema

Hi All,

Historically we have used a single field in our schema as a uniqueKey.

  
  
docid

Wanted to change this to a composite key something like
userid-docid.
I know I can auto generate compositekey at document insert time, using
custom code to generate a new field, but wanted to know if there was an
inbuilt SOLR mechanism of doing this. That would prevent us from creating
and storing an extra field.

Thanks,

Rishi.

Re: Solr Composite Unique key from existing fields in schema

I thought the same, but that doesn't seem to be the case.


 

 

 

-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 3:32 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


The order in the ID should be purely dependent on the order of the field 
names in the processor configuration:

docid_s
userid_s

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 2:54 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Composite Unique key from existing fields in schema

Jack,

No sure if this is the correct behaviour.
I set up updateRequestorPorcess chain as mentioned below, but looks like the 
compositeId that is generated is based on input order.

For example:
If my input comes in as
1
12345

I get the following compositeId1-12345.

If I reverse the input

12345

1
I get the following compositeId 12345-1 .


In this case the compositeId is not unique and I am getting duplicates.

Thanks,

Rishi.



-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 12:07 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


You can do this by combining the builtin update processors.

Add this to your solrconfig:


  
docid_s
userid_s
id
  
  
id
--
  
  
  


Add documents such as:

curl
"http://localhost:8983/solr/update?commit=true&update.chain=composite-id"; \
-H 'Content-type:application/json' -d '
[{"title": "Hello World",
  "docid_s": "doc-1",
  "userid_s": "user-1",
  "comments_ss": ["Easy", "Fast"]}]'

And get results like:

"title":["Hello World"],
"docid_s":"doc-1",
"userid_s":"user-1",
"comments_ss":["Easy",
  "Fast"],
"id":"doc-1--user-1",

Add as many fields in whatever order you want using "source" in the clone
update processor, and pick your composite key field name as well. And set
the delimiter string as well in the concat update processor.

I managed to reverse the field order from what you requested (userid,
docid).

I used the standard Solr example schema, so I used dynamic fields for the
two ids, but use your own field names.

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 11:12 AM
To: solr-user@lucene.apache.org
Subject: Solr Composite Unique key from existing fields in schema

Hi All,

Historically we have used a single field in our schema as a uniqueKey.

  
  
docid

Wanted to change this to a composite key something like
userid-docid.
I know I can auto generate compositekey at document insert time, using
custom code to generate a new field, but wanted to know if there was an
inbuilt SOLR mechanism of doing this. That would prevent us from creating
and storing an extra field.

Thanks,

Rishi.

Re: Solr Composite Unique key from existing fields in schema

Jack,

No sure if this is the correct behaviour.
I set up updateRequestorPorcess chain as mentioned below, but looks like the 
compositeId that is generated is based on input order.

For example: 
If my input comes in as 
1
12345

 I get the following compositeId1-12345. 

If I reverse the input 

12345

1
I get the following compositeId 12345-1 . 
 

In this case the compositeId is not unique and I am getting duplicates.

Thanks,

Rishi.



-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 12:07 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


You can do this by combining the builtin update processors.

Add this to your solrconfig:


  
docid_s
userid_s
id
  
  
id
--
  
  
  


Add documents such as:

curl 
"http://localhost:8983/solr/update?commit=true&update.chain=composite-id"; \
-H 'Content-type:application/json' -d '
[{"title": "Hello World",
  "docid_s": "doc-1",
  "userid_s": "user-1",
  "comments_ss": ["Easy", "Fast"]}]'

And get results like:

"title":["Hello World"],
"docid_s":"doc-1",
"userid_s":"user-1",
"comments_ss":["Easy",
  "Fast"],
"id":"doc-1--user-1",

Add as many fields in whatever order you want using "source" in the clone 
update processor, and pick your composite key field name as well. And set 
the delimiter string as well in the concat update processor.

I managed to reverse the field order from what you requested (userid, 
docid).

I used the standard Solr example schema, so I used dynamic fields for the 
two ids, but use your own field names.

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 11:12 AM
To: solr-user@lucene.apache.org
Subject: Solr Composite Unique key from existing fields in schema

Hi All,

Historically we have used a single field in our schema as a uniqueKey.

  
  
docid

Wanted to change this to a composite key something like 
userid-docid.
I know I can auto generate compositekey at document insert time, using 
custom code to generate a new field, but wanted to know if there was an 
inbuilt SOLR mechanism of doing this. That would prevent us from creating 
and storing an extra field.

Thanks,

Rishi.

Re: Solr Composite Unique key from existing fields in schema

Thanks Jack, looks like that will do the trick from me. I will try it out. 

 

 

 

-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 12:07 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


You can do this by combining the builtin update processors.

Add this to your solrconfig:


  
docid_s
userid_s
id
  
  
id
--
  
  
  


Add documents such as:

curl 
"http://localhost:8983/solr/update?commit=true&update.chain=composite-id"; \
-H 'Content-type:application/json' -d '
[{"title": "Hello World",
  "docid_s": "doc-1",
  "userid_s": "user-1",
  "comments_ss": ["Easy", "Fast"]}]'

And get results like:

"title":["Hello World"],
"docid_s":"doc-1",
"userid_s":"user-1",
"comments_ss":["Easy",
  "Fast"],
"id":"doc-1--user-1",

Add as many fields in whatever order you want using "source" in the clone 
update processor, and pick your composite key field name as well. And set 
the delimiter string as well in the concat update processor.

I managed to reverse the field order from what you requested (userid, 
docid).

I used the standard Solr example schema, so I used dynamic fields for the 
two ids, but use your own field names.

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 11:12 AM
To: solr-user@lucene.apache.org
Subject: Solr Composite Unique key from existing fields in schema

Hi All,

Historically we have used a single field in our schema as a uniqueKey.

  
  
docid

Wanted to change this to a composite key something like 
userid-docid.
I know I can auto generate compositekey at document insert time, using 
custom code to generate a new field, but wanted to know if there was an 
inbuilt SOLR mechanism of doing this. That would prevent us from creating 
and storing an extra field.

Thanks,

Rishi.

Solr Composite Unique key from existing fields in schema