Re: Solr performance issues

2014-12-29 Thread Mahmoud Almokadem
Thanks all.

I have the same index with a slightly different schema and 200M documents,
installed on 3 r3.xlarge instances (30GB RAM and 600GB General Purpose SSD each). The
size of the index is about 1.5TB; it gets many updates every 5 minutes, with complex
queries and faceting, and a response time of 100ms, which is acceptable for us.

Toke Eskildsen,

Is the index updated while you are searching? *No*
Do you do any faceting or other heavy processing as part of a search? *No*
How many hits does a search typically have and how many documents are
returned? *The test measures QTime only, with no documents returned; the number
of hits varies from 50,000 to 50,000,000.*
How many concurrent searches do you need to support? How fast should the
response time be? *Maybe 100 concurrent searches, with 100ms response times including facets.*

Would splitting the shard into two shards on the same node, so that each shard
sits on a single EBS volume, be better than using LVM?

Thanks

On Mon, Dec 29, 2014 at 2:00 AM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 Mahmoud Almokadem [prog.mahm...@gmail.com] wrote:
  We've installed a cluster of one collection of 350M documents on 3
  r3.2xlarge (60GB RAM) Amazon servers. The size of index on each shard is
  about 1.1TB and maximum storage on Amazon is 1 TB so we add 2 SSD EBS
  General purpose (1x1TB + 1x500GB) on each instance. Then we create
 logical
  volume using LVM of 1.5TB to fit our index.

 Your search speed will be limited by the slowest storage in your group,
 which would be your 500GB EBS. The General Purpose SSD option means (as far
 as I can read at http://aws.amazon.com/ebs/details/#piops) that your
 baseline is 3 IOPS/GB = 1500 IOPS for the 500GB volume, with bursts of 3000 IOPS.
 Unfortunately they do not say anything about latency.

 For comparison, I checked the system logs from a local test with our 21TB
 / 7 billion documents index. It used ~27,000 IOPS during the test, with
 mean search time a bit below 1 second. That was with ~100GB RAM for disk
 cache, which is about ½% of index size. The test was with simple term
 queries (1-3 terms) and some faceting. Back of the envelope: 27,000 IOPS
 for 21TB is ~1300 IOPS/TB. Your indexes are 1.1TB, so 1.1*1300 IOPS ~= 1400
 IOPS.

 All else being equal (which is never the case), 1-3 second response times for
 a 1.1TB index do not seem unrealistic when one link in the storage chain is
 capped at a few thousand IOPS, the storage is networked, and there is little
 RAM for caching. If possible, you could try temporarily boosting performance
 of the EBS to see if raw IO is the bottleneck.

  The response time is between 1 and 3 seconds for simple queries (1 token).

 Is the index updated while you are searching?
 Do you do any faceting or other heavy processing as part of a search?
 How many hits does a search typically have and how many documents are
 returned?
 How many concurrent searches do you need to support? How fast should the
 response time be?

 - Toke Eskildsen



Re: SolrCloud Paging on large indexes

2014-12-29 Thread Bram Van Dam

On 12/23/2014 04:07 PM, Toke Eskildsen wrote:

The beauty of the cursor is that it has little to no overhead, relative to a 
standard top-X sorted search. A standard search uses a sliding window over the 
full result set, as does a cursor-search. Same amount of work. It is just a 
question of limits for the window.


That is very good to hear. Thanks.


Nobody will hit next 499 times, but a lot of our users skip to the last
page quite often. Maybe I should make *that* as hard as possible. Hmm.


Issue a search with sort in reverse order, then reverse the returned list of 
documents?


Sneaky. I like it. But in the end we're simply getting rid of the 
last-button. Solves a lot of issues. If you have a billion search results, 
you might as well refine your criteria!


 - Bram



How large is your solr index?

2014-12-29 Thread Bram Van Dam

Hi folks,

I'm trying to get a feel of how large Solr can grow without slowing down 
too much. We're looking into a use-case with up to 100 billion documents 
(SolrCloud), and we're a little afraid that we'll end up requiring 100 
servers to pull it off.


The largest index we currently have is ~2billion documents in a single 
Solr instance. Documents are smallish (5k each) and we have ~50 fields 
in the schema, with an index size of about 2TB. Performance is mostly 
OK. Cold searchers take a while, but most queries are alright after 
warming up. I wish I could provide more statistics, but I only have very 
limited access to the data (...banks...).


I'd be very grateful to anyone sharing statistics, especially on the larger 
end of the spectrum -- with or without SolrCloud.


Thanks,

 - Bram


Re: Loading data to FieldValueCache

2014-12-29 Thread Yonik Seeley
On Fri, Dec 26, 2014 at 12:26 PM, Erick Erickson
erickerick...@gmail.com wrote:
 I don't know the complete algorithm, but if the number of docs that
 satisfy the fq is small enough,
 then just the internal Lucene doc IDs are stored rather than a bitset.

If fewer than maxDoc/64 ids are collected, a sorted int set is used
instead of a bitset.
Also, the enum method can skip caching for the smaller terms:

facet.enum.cache.minDf=100
might be a good general-purpose value.
Or set the value really high to not use the filter cache at all.
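
(For example, a facet request forcing the enum method with that threshold might
look roughly like this -- the field name "category" is made up:)

    facet=true&facet.field=category&facet.method=enum&facet.enum.cache.minDf=100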

-Yonik


Highlighting does not show for some Solr results

2014-12-29 Thread Volel, Andre
Hello,

I turned on highlighting and some records do not have highlight text (See image 
below):

[cid:image001.png@01D02358.A0E23D60]


Does anyone know why this is happening and how I can fix it?

Here is the query string I am using:
wt=json&json.wrf=?&indent=true&hl=true&hl.fl=title,content&hl.tag.pre=<em>&hl.tag.post=</em>&hl.snippets=2

Thanks



Re: Highlighting does not show for some Solr results

2014-12-29 Thread Erick Erickson
Two things:

1> Attachments rarely make it through the e-mail system; you have to put
things like screenshots on different servers and provide a link.
2> I did see the attachment in my moderator role, and it's not clear what
your problem really is. I'm _guessing_ that your complaint is that the top
few returns are just the file names, with no text. In that case, you're
probably matching some other field than text but highlighting on the text
field. Do you perhaps have your request handler configured to use edismax
and are searching across multiple fields?

Best,
Erick

On Mon, Dec 29, 2014 at 8:14 AM, Volel, Andre avo...@bklynlibrary.org
wrote:

  Hello,



 I turned on highlighting and some records do not have highlight text (See
 image below):







 Does anyone know why this is happening and how I can fix it?



 Here is the query string I am using:
 wt=json&json.wrf=?&indent=true&hl=true&hl.fl=title,content&hl.tag.pre=<em>&hl.tag.post=</em>&hl.snippets=2



 Thanks





Re: How large is your solr index?

2014-12-29 Thread Erick Erickson
When you say 2B docs on a single Solr instance, are you talking only one shard?
Because if you are, you're very close to the absolute upper limit of a shard:
internally the doc id is an int, so the limit is 2^31 docs, and 2^31 + 1 will
cause all sorts of problems.

But yeah, your 100B documents are going to use up a lot of servers...

Best,
Erick

On Mon, Dec 29, 2014 at 7:24 AM, Bram Van Dam bram.van...@intix.eu wrote:
 Hi folks,

 I'm trying to get a feel of how large Solr can grow without slowing down too
 much. We're looking into a use-case with up to 100 billion documents
 (SolrCloud), and we're a little afraid that we'll end up requiring 100
 servers to pull it off.

 The largest index we currently have is ~2billion documents in a single Solr
 instance. Documents are smallish (5k each) and we have ~50 fields in the
 schema, with an index size of about 2TB. Performance is mostly OK. Cold
 searchers take a while, but most queries are alright after warming up. I
 wish I could provide more statistics, but I only have very limited access to
 the data (...banks...).

 I'd very grateful to anyone sharing statistics, especially on the larger end
 of the spectrum -- with or without SolrCloud.

 Thanks,

  - Bram


Re: Loading data to FieldValueCache

2014-12-29 Thread Erick Erickson
bq: There will be no updates to my index. So, no worries about ageing
out or garbage collection

This is irrelevant to aging out filterCache entries, this is purely query time.

bq: Each having 64 GB of RAM, out of which I am allocating 45 GB to Solr.

It's usually a mistake to give Solr so much ram relative to the OS, see Uwe's
excellent blog here:

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

That said, you know your system best. And the fact that you have so many
shards may well mean that memory considerations aren't relevant.

Personally, though, I think you've massively over-sharded your
collection and are
incurring significant overhead, but again you know your requirements much better
than I do.

Best,
Erick

On Mon, Dec 29, 2014 at 7:43 AM, Yonik Seeley yo...@heliosearch.com wrote:
 On Fri, Dec 26, 2014 at 12:26 PM, Erick Erickson
 erickerick...@gmail.com wrote:
 I don't know the complete algorithm, but if the number of docs that
 satisfy the fq is small enough,
 then just the internal Lucene doc IDs are stored rather than a bitset.

 If smaller than maxDoc/64 ids are collected, a sorted int set is used
 instead of a bitset.
 Also, the enum method can skip caching for the smaller terms:

 facet.enum.cache.minDf=100
 might be good for general purpose.
 Or set the value really high to not use the filter cache at all.

 -Yonik


Re: Solr performance issues

2014-12-29 Thread Shawn Heisey
On 12/29/2014 2:36 AM, Mahmoud Almokadem wrote:
 I've the same index with a bit different schema and 200M documents,
 installed on 3 r3.xlarge (30GB RAM, and 600 General Purpose SSD). The size
 of index is about 1.5TB, have many updates every 5 minutes, complex queries
 and faceting with response time of 100ms that is acceptable for us.
 
 Toke Eskildsen,
 
 Is the index updated while you are searching? *No*
 Do you do any faceting or other heavy processing as part of a search? *No*
 How many hits does a search typically have and how many documents are
 returned? *The test for QTime only with no documents returned and No. of
 hits varying from 50,000 to 50,000,000.*
 How many concurrent searches do you need to support? How fast should the
 response time be? *May be 100 concurrent searches with 100ms with facets.*
 
 Does splitting the shard to two shards on the same node so every shard will
 be on a single EBS Volume better than using LVM?

The basic problem is simply that the system has so little memory that it
must read large amounts of data from the disk when it does a query.
There is not enough RAM to cache the important parts of the index.  RAM
is much faster than disk, even SSD.

Typical consumer-grade DDR3-1600 memory has a data transfer rate of
about 12800 megabytes per second.  If it's ECC memory (which I would say
is a requirement) then the transfer rate is probably a little bit slower
than that.  Figuring 9 bits for every byte gets us about 11377 MB/s.
That's only an estimate, and it could be wrong in either direction, but
I'll go ahead and use it.

http://en.wikipedia.org/wiki/DDR3_SDRAM#JEDEC_standard_modules

If your SSD is SATA, the transfer rate will be limited to approximately
600MB/s -- the 6 gigabit per second transfer rate of the newest SATA
standard.  That makes memory about 18 times as fast as SATA SSD.  I saw
one PCI express SSD that claimed a transfer rate of 2900 MB/s.  Even
that is only about one fourth of the estimated speed of DDR3-1600 with
ECC.  I don't know what interface technology Amazon uses for their SSD
volumes, but I would bet on it being the cheaper version, which would
mean SATA.  The networking between the EC2 instance and the EBS storage
is unknown to me and may be a further bottleneck.

http://ocz.com/enterprise/z-drive-4500/specifications

Bottom line -- you need a lot more memory.  Speeding up the disk may
*help* ... but it will not replace that simple requirement.  With EC2 as
the platform, you may need more instances and more shards.

Your 200 million document index that works well with only 90GB of total
memory ... that's surprising to me.  That means that the important parts
of that index *do* fit in memory ... but if the index gets much larger,
performance is likely to drop off sharply.

Thanks,
Shawn



Re: Solr performance issues

2014-12-29 Thread Mahmoud Almokadem
Thanks Shawn.

What do you mean by the "important parts" of the index, and how do I calculate
their size?

Thanks,
Mahmoud

Sent from my iPhone

 On Dec 29, 2014, at 8:19 PM, Shawn Heisey apa...@elyograg.org wrote:
 
 On 12/29/2014 2:36 AM, Mahmoud Almokadem wrote:
 I've the same index with a bit different schema and 200M documents,
 installed on 3 r3.xlarge (30GB RAM, and 600 General Purpose SSD). The size
 of index is about 1.5TB, have many updates every 5 minutes, complex queries
 and faceting with response time of 100ms that is acceptable for us.
 
 Toke Eskildsen,
 
 Is the index updated while you are searching? *No*
 Do you do any faceting or other heavy processing as part of a search? *No*
 How many hits does a search typically have and how many documents are
 returned? *The test for QTime only with no documents returned and No. of
 hits varying from 50,000 to 50,000,000.*
 How many concurrent searches do you need to support? How fast should the
 response time be? *May be 100 concurrent searches with 100ms with facets.*
 
 Does splitting the shard to two shards on the same node so every shard will
 be on a single EBS Volume better than using LVM?
 
 The basic problem is simply that the system has so little memory that it
 must read large amounts of data from the disk when it does a query.
 There is not enough RAM to cache the important parts of the index.  RAM
 is much faster than disk, even SSD.
 
 Typical consumer-grade DDR3-1600 memory has a data transfer rate of
 about 12800 megabytes per second.  If it's ECC memory (which I would say
 is a requirement) then the transfer rate is probably a little bit slower
 than that.  Figuring 9 bits for every byte gets us about 11377 MB/s.
 That's only an estimate, and it could be wrong in either direction, but
 I'll go ahead and use it.
 
 http://en.wikipedia.org/wiki/DDR3_SDRAM#JEDEC_standard_modules
 
 If your SSD is SATA, the transfer rate will be limited to approximately
 600MB/s -- the 6 gigabit per second transfer rate of the newest SATA
 standard.  That makes memory about 18 times as fast as SATA SSD.  I saw
 one PCI express SSD that claimed a transfer rate of 2900 MB/s.  Even
 that is only about one fourth of the estimated speed of DDR3-1600 with
 ECC.  I don't know what interface technology Amazon uses for their SSD
 volumes, but I would bet on it being the cheaper version, which would
 mean SATA.  The networking between the EC2 instance and the EBS storage
 is unknown to me and may be a further bottleneck.
 
 http://ocz.com/enterprise/z-drive-4500/specifications
 
 Bottom line -- you need a lot more memory.  Speeding up the disk may
 *help* ... but it will not replace that simple requirement.  With EC2 as
 the platform, you may need more instances and more shards.
 
 Your 200 million document index that works well with only 90GB of total
 memory ... that's surprising to me.  That means that the important parts
 of that index *do* fit in memory ... but if the index gets much larger,
 performance is likely to drop off sharply.
 
 Thanks,
 Shawn
 


Re: How large is your solr index?

2014-12-29 Thread ralph tice
Like all things it really depends on your use case.  We have 160B
documents in our largest SolrCloud and doing a *:* to get that count takes
~13-14 seconds.  Doing a text:happy query only takes ~3.5-3.6 seconds cold,
subsequent queries for the same terms take 500ms.  We have a little over
3TB of RAM in the cluster, which is around 1/10th of the index size on disk; the
disks are fast SSDs (rated 300K IOPS per machine). More importantly, we are using
12-13 large machines rather than dozens or hundreds of small machines, and
if your use case is primarily full text search you could probably get away
with even fewer machines depending on query patterns.  We run several JVMs
per machine and many shards per JVM, but are careful to order shards so
that queries get dispersed across multiple JVMs across multiple machines
wherever possible.

Facets over high cardinality fields are going to be painful.  We currently
programmatically limit the range to around 1/12th or 1/13th of the data set
for facet queries, but plan on evaluating Heliosearch (initial results
didn't look promising) and Toke's sparse faceting patch (SOLR-5894) to help
out there.

If any given JVM goes OOM that also becomes a rough time operationally.  If
your indexing rate spikes past what your sharding strategy can handle, that
sucks too.

There could be more support / ease-of-use enhancements for moving shards
across SolrClouds, moving shards across physical nodes within a
SolrCloud, and snapshot/restore of a SolrCloud, but there has also been a
lot of recent work in these areas that is starting to provide the
underlying infrastructure for more advanced shard management.

I think there are more people getting into the space of 100B documents but
I only ran into or discovered a handful during my time at Lucene/Solr
Revolution this November.  The majority of large scale SolrCloud users seem
to have many collections (collections per logical user) rather than many
documents in one/few collections.

Regards,
--Ralph

On Mon Dec 29 2014 at 11:55:41 AM Erick Erickson erickerick...@gmail.com
wrote:

 When you say 2B docs on a single Solr instance, are you talking only one
 shard?
 Because if you are, you're very close to the absolute upper limit of a
 shard, internally
 the doc id is an int or 2^31. 2^31 + 1 will cause all sorts of problems.

 But yeah, your 100B documents are going to use up a lot of servers...

 Best,
 Erick

 On Mon, Dec 29, 2014 at 7:24 AM, Bram Van Dam bram.van...@intix.eu
 wrote:
  Hi folks,
 
  I'm trying to get a feel of how large Solr can grow without slowing down
 too
  much. We're looking into a use-case with up to 100 billion documents
  (SolrCloud), and we're a little afraid that we'll end up requiring 100
  servers to pull it off.
 
  The largest index we currently have is ~2billion documents in a single
 Solr
  instance. Documents are smallish (5k each) and we have ~50 fields in the
  schema, with an index size of about 2TB. Performance is mostly OK. Cold
  searchers take a while, but most queries are alright after warming up. I
  wish I could provide more statistics, but I only have very limited
 access to
  the data (...banks...).
 
  I'd very grateful to anyone sharing statistics, especially on the larger
 end
  of the spectrum -- with or without SolrCloud.
 
  Thanks,
 
   - Bram



[ANNOUNCE] Apache Solr 4.10.3 released

2014-12-29 Thread Mark Miller
December 2014, Apache Solr™ 4.10.3 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.10.3

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.10.3 is available for immediate download at:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.10.3 includes 21 bug fixes, as well as Lucene 4.10.3 and its 12
bug fixes.

This release fixes the following security vulnerability that has
affected Solr since the Solr 4.0 Alpha release.

CVE-2014-3628: Stored XSS vulnerability in Solr Admin UI.

Information disclosure: The Solr Admin UI Plugin / Stats page does not
escape data values which allows an attacker to execute javascript by
executing a query that will be stored and displayed via the
'fieldvaluecache' object.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Happy Holidays,

Mark Miller

http://www.about.me/markrmiller


Re: Solr performance issues

2014-12-29 Thread Shawn Heisey
On 12/29/2014 12:07 PM, Mahmoud Almokadem wrote:
 What do you mean with important parts of index? and how to calculate their 
 size?

I have no formal education in what's important when it comes to doing a
query, but I can make some educated guesses.

Starting with this as a reference:

http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/codecs/lucene410/package-summary.html#file-names

I would guess that the segment info (.si) files and the term index
(*.tip) files would be supremely important to *always* have in memory,
and they are fairly small.  Next would be the term dictionary (*.tim)
files.  The term dictionary is pretty big, and would be very important
for fast queries.

Frequencies, positions, and norms may also be important, depending on
exactly what kind of query you have.  Frequencies and positions are
quite large.  Frequencies are critical for relevance ranking (the
default sort by score), and positions are important for phrase queries.
 Position data may also be used by relevance ranking, but I am not
familiar enough with it to say for sure.

If you have docvalues defined, then *.dvm and *.dvd files would be used
for facets and sorting on those specific fields.  The *.dvd files can be
very big, depending on your schema.

The *.fdx and *.fdt files become important when actually retrieving
results after the matching documents have been determined.  The stored
data is compressed, so additional CPU power is required to uncompress
that data before it is sent to the client.  Stored data may be large or
small, depending on your schema.  Stored data does not directly affect
search speed, but if memory space is limited, every block of stored data
that gets retrieved will result in some other part of the index being
removed from the OS disk cache, which means that it might need to be
re-read from the disk on the next query.
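
(One rough way to see how big each of those pieces is on disk -- the path below
is a placeholder for your core's data directory:)

    cd /path/to/solr/core/data/index
    du -ch *.tim *.tip | tail -1   # term dictionary + term index
    du -ch *.dvd *.dvm | tail -1   # docvalues
    du -ch *.fdt *.fdx | tail -1   # stored fields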

Thanks,
Shawn



Re: How large is your solr index?

2014-12-29 Thread Jack Krupansky
And that Lucene index document limit includes deleted and updated
documents, so even if your actual document count stays under 2^31-1,
deleting and updating documents can push the apparent document count over
the limit unless you very aggressively merge segments to expunge deleted
documents.

-- Jack Krupansky


On Mon, Dec 29, 2014 at 12:54 PM, Erick Erickson erickerick...@gmail.com
wrote:

 When you say 2B docs on a single Solr instance, are you talking only one
 shard?
 Because if you are, you're very close to the absolute upper limit of a
 shard, internally
 the doc id is an int or 2^31. 2^31 + 1 will cause all sorts of problems.

 But yeah, your 100B documents are going to use up a lot of servers...

 Best,
 Erick

 On Mon, Dec 29, 2014 at 7:24 AM, Bram Van Dam bram.van...@intix.eu
 wrote:
  Hi folks,
 
  I'm trying to get a feel of how large Solr can grow without slowing down
 too
  much. We're looking into a use-case with up to 100 billion documents
  (SolrCloud), and we're a little afraid that we'll end up requiring 100
  servers to pull it off.
 
  The largest index we currently have is ~2billion documents in a single
 Solr
  instance. Documents are smallish (5k each) and we have ~50 fields in the
  schema, with an index size of about 2TB. Performance is mostly OK. Cold
  searchers take a while, but most queries are alright after warming up. I
  wish I could provide more statistics, but I only have very limited
 access to
  the data (...banks...).
 
  I'd very grateful to anyone sharing statistics, especially on the larger
 end
  of the spectrum -- with or without SolrCloud.
 
  Thanks,
 
   - Bram



RE: How large is your solr index?

2014-12-29 Thread Toke Eskildsen
Bram Van Dam [bram.van...@intix.eu] wrote:
 I'm trying to get a feel of how large Solr can grow without slowing down
 too much. We're looking into a use-case with up to 100 billion documents
 (SolrCloud), and we're a little afraid that we'll end up requiring 100
 servers to pull it off.

One recurring theme on this list is that it is very hard to compare indexes. 
Even if the data structure happens to be the same, performance will vary 
drastically depending on the types of queries and the processing requested. 
That being said, I acknowledge that it helps to hear stories to get a feel of what 
can be done.

A second caveat is that I find it an exercise in futility to talk about scale 
without an idea of expected response times as well as the expected number of 
concurrent users. If you are just doing some nightly batch processing, you 
could probably run your (scaling up from your description) 100TB index off 
spinning drives on a couple of boxes. If you expect to be hammered with 
millions of requests per day, you would have to put a zero or two behind that 
number.

End of sermon.

At Lucene/Solr Revolution 2014, Grant Ingersoll also asked for user stories and 
pointed to https://wiki.apache.org/solr/SolrUseCases - sadly it has not caught 
on. The only entry is for our (State and University Library, Denmark) setup 
with 21TB / 7 billion documents on a single machine. To follow my own advice, I 
can elaborate that we have 1-3 concurrent users and a design goal of median 
response times below 2 seconds for faceted search. I guess that is at the 
larger end of the spectrum for pure size, but at the very low end for usage.

- Toke Eskildsen


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-29 Thread Jonathan Rochkind
Okay, some months later I've come back to this with an isolated 
reproduction case. Thanks very much for any advice or debugging help you 
can give.


The WordDelimiter filter is making a mixed-case query NOT match the 
single-case source, when it ought to.


I am in Solr 4.3 (sorry, that's what we run; let me know if it makes no 
sense to debug here, and I need to install and try to reproduce on a 
more recent version).


I have an index that includes ONE document (deleted and reindexed after 
index change), with content in only one field (text) other than 'id', 
and that content is one word: delalain.


My analysis (both index and query, I don't have different ones) for the 
'text' field is simply:


<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
  autoGeneratePhraseQueries="true">

  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" />

    <filter class="solr.WordDelimiterFilterFactory"
      generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>

    <filter class="solr.ICUFoldingFilterFactory" />
  </analyzer>
</fieldType>

I am querying simply with e.g. /select?defType=lucene&q=text%3Adelalain

Querying for delalain finds this document, as expected. Querying for 
DELALAIN finds this document, as expected (note the ICUFoldingFactory).


However, querying for deLALAIN does not find this document, which is 
unexpected.


INDEX analysis of the source, delalain, ends in this in the index, 
which seems pretty straightforward, so I'll only bother pasting in the 
final index analysis:


##
textdelalain
raw_bytes   [64 65 6c 61 6c 61 69 6e]
position1
start   0
end 8
typeALPHANUM
script  Latin
###




QUERY analysis of the problematic query, deLALAIN, looks like this:

#
ICUTtextdeLALAIN
raw_bytes   [64 65 4c 41 4c 41 49 4e]   
start   0   
end 8   
typeALPHANUM
script  Latin   
position1   


WDF textde  LALAIN  deLALAIN
raw_bytes   [64 65] [4c 41 4c 41 49 4e] [64 65 4c 41 4c 41 49 
4e]
start   0   2   0
end 2   8   8
typeALPHANUMALPHANUMALPHANUM
position1   2   2
script  Common  Common  Common


ICUFF   textde  lalain  delalain
raw_bytes   [64 65] [6c 61 6c 61 69 6e] [64 65 6c 61 6c 61 69 
6e]
position1   2   2
start   0   2   0
end 2   8   8
typeALPHANUMALPHANUMALPHANUM
script  Common  Common  Common
###



It's obviously the WordDelimiterFilter that is messing things up -- but 
how/why, and is it a bug?


It wants to search for both "de lalain" as a phrase, as well as, 
alternately, "delalain" as one word -- that's the intended, supported 
behavior of the WDF with this configuration, right? And it should work?


The problem is that it is not successfully matching "delalain" as one word 
-- so, how to figure out why not and what to do about it?


Previously, Erick and Diego asked for the info from debug=query, so 
here is that as well:



<lst name="debug">
  <str name="rawquerystring">text:deLALAIN</str>
  <str name="querystring">text:deLALAIN</str>
  <str name="parsedquery">MultiPhraseQuery(text:"de (lalain delalain)")</str>

  <str name="parsedquery_toString">text:"de (lalain delalain)"</str>
  <str name="QParser">LuceneQParser</str>
</lst>


Hmm, that does not quite look like what I expected. If I interpret 
that correctly, it's looking for "de" followed by either "lalain" or 
"delalain".  I.e., it would match "de delalain"?  But that's not right at 
all.


So, what's gone wrong? Something with WDF with configuration to 
generateWords/catenateWords/splitOnCaseChange? Is it a bug? (And if it's 
a bug, one that might be fixed in a more recent Solr?).


Thanks!

Jonathan




On 9/3/14 7:15 PM, Erick Erickson wrote:

Jonathan:

If at all possible, delete your collection/data directory (the whole
directory, including data) between runs after you've changed
your schema (at least any of your analysis that pertains to indexing).
Mixing old and new schema definitions can add to the confusion!

Good luck!
Erick

On Wed, Sep 3, 2014 at 8:48 AM, Jonathan Rochkind rochk...@jhu.edu wrote:

Thanks Erick and Diego. Yes, I noticed in my last message I'm not actually
using defaults, not sure why I chose non-defaults originally.

I still need to find time to make a smaller isolation/reproduction case, I'm
getting confusing results that suggest some other part of my field def may
be pertinent.

I'll come back when I've done that (hopefully next week), and include the
_parsed_ from debug=query then. Thanks!

Jonathan



On 9/2/14 4:26 PM, Erick Erickson wrote:


What happens if you append 

Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-29 Thread Jack Krupansky
WDF is powerful, but it is not magic. In general, the indexed data is
expected to be clean while the query might be sloppy. You need to separate
the index and query analyzers and they need to respect that distinction -
the index analyzer would index as you have indicated, indexing both the
unitary term and the multi-term phrase, while the query analyzer would NOT
do the split on case, so that the query could be a unitary term (possibly
with mixed case, but that would not split the term) or could be a two-word
phrase.

-- Jack Krupansky



On Mon, Dec 29, 2014 at 5:12 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 Okay, some months later I've come back to this with an isolated
 reproduction case. Thanks very much for any advice or debugging help you
 can give.

 The WordDelimiter filter is making a mixed-case query NOT match the
 single-case source, when it ought to.

 I am in Solr 4.3 (sorry, that's what we run; let me know if it makes no
 sense to debug here, and I need to install and try to reproduce on a more
 recent version).

 I have an index that includes ONE document (deleted and reindexed after
 index change), with content in only one field (text) other than 'id', and
 that content is one word: delalain.

 My analysis (both index and query, I don't have different ones) for the
 'text' field is simply:

 fieldType name=text class=solr.TextField positionIncrementGap=100
 autoGeneratePhraseQueries=true
   analyzer
 tokenizer class=solr.ICUTokenizerFactory /

 filter class=solr.WordDelimiterFilterFactory
 generateWordParts=1 catenateWords=1 splitOnCaseChange=1/

 filter class=solr.ICUFoldingFilterFactory /
   /analyzer
 /fieldType

 I am querying simply with eg /select?defType=luceneq=text%3Adelalain

 Querying for delalain finds this document, as expected. Querying for
 DELALAIN finds this document, as expected (note the ICUFoldingFactory).

 However, querying for deLALAIN does not find this document, which is
 unexpected.

 INDEX analysis of the source, delalain, ends in this in the index, which
 seems pretty straightforward, so I'll only bother pasting in the final
 index analysis:

 ##
 textdelalain
 raw_bytes   [64 65 6c 61 6c 61 69 6e]
 position1
 start   0
 end 8
 typeALPHANUM
 script  Latin
 ###




 QUERY analysis of the problematic query, deLALAIN, looks like this:

 #
 ICUTtextdeLALAIN
 raw_bytes   [64 65 4c 41 4c 41 49 4e]
 start   0
 end 8
 typeALPHANUM
 script  Latin
 position1


 WDF textde  LALAIN  deLALAIN
 raw_bytes   [64 65] [4c 41 4c 41 49 4e] [64 65 4c 41 4c 41
 49 4e]
 start   0   2   0
 end 2   8   8
 typeALPHANUM  ALPHANUM  ALPHANUM
 position1   2   2
 script  Common  Common  Common


 ICUFF   textde  lalain  delalain
 raw_bytes   [64 65] [6c 61 6c 61 69 6e] [64 65 6c 61 6c 61
 69 6e]
 position1   2   2
 start   0   2   0
 end 2   8   8
 typeALPHANUM  ALPHANUM  ALPHANUM
 script  Common  Common  Common
 ###



 It's obviously the WordDelimiterFilter that is messing things up -- but
 how/why, and is it a bug?

 It wants to search for both de lalain as a phrase, as well as
 alternately delalain as one word -- that's the intended supported point
 of the WDF with this configuration, right? And should work?

 The problem is that is not succesfully matching delalain as one word --
 so, how to figure out why not and what to do about it?

 Previously, Erick and Diego asked for the info from debug=query, so here
 is that as well:

 
 lst name=debug
   str name=rawquerystringtext:deLALAIN/str
   str name=querystringtext:deLALAIN/str
   str name=parsedqueryMultiPhraseQuery(text:de (lalain
 delalain))/str
   str name=parsedquery_toStringtext:de (lalain delalain)/str
   str name=QParserLuceneQParser/str
 /lst
 

 Hmm, that does not seem to quite look like neccesarily, if I interpret
 that correctly, it's looking for de followed by either lalain or
 delalain.  Ie, it would match de delalain?  But that's not right at all.

 So, what's gone wrong? Something with WDF with configuration to
 generateWords/catenateWords/splitOnCaseChange? Is it a bug? (And if it's
 a bug, one that might be fixed in a more recent Solr?).

 Thanks!

 Jonathan




 On 9/3/14 7:15 PM, Erick Erickson wrote:

 Jonathan:

 If at all possible, delete your collection/data directory (the whole
 directory, including data) between runs after you've changed
 your schema (at least any of your analysis that pertains to indexing).
 Mixing old and new schema definitions can add to the confusion!

 Good luck!
 Erick

 On Wed, Sep 3, 2014 at 8:48 AM, Jonathan Rochkind rochk...@jhu.edu
 wrote:

 Thanks Erick and Diego. Yes, I noticed in my 

RE: Solr performance issues

2014-12-29 Thread Toke Eskildsen
Mahmoud Almokadem [prog.mahm...@gmail.com] wrote:
 I've the same index with a bit different schema and 200M documents,
 installed on 3 r3.xlarge (30GB RAM, and 600 General Purpose SSD). The size
 of index is about 1.5TB, have many updates every 5 minutes, complex queries
 and faceting with response time of 100ms that is acceptable for us.

So you have
Setup 1: 3 * (30GB RAM + 600GB SSD) for a total of 1.5TB index / 200M docs. 
Acceptable performance.
Setup 2: 3 * (60GB RAM + 1TB SSD + 500GB SSD) for a total of 3.3TB index / 350M docs. 
Poor performance.

The only real difference, besides doubling everything, is the LVM? I understand 
why you find that to be the culprit, but from what I can read, the overhead 
should not be anywhere near enough to result in the performance drop you are 
describing. Could it be that some snapshotting or backup was running when you 
tested?

Splitting your shards and doubling the number of machines, as you suggest, 
would result in
Setup 3: 6 * (60GB RAM + 600GB SSD) for a total of 3.3TB index / 350M docs,
which would be remarkably similar to your setup 1. I think that would be the 
next logical step, unless you can easily do a temporary boost of your IOPS.

BTW: You are getting dangerously close to your storage limits here - it seems 
that a single large merge could make you run out of space.

- Toke Eskildsen


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-29 Thread Alexandre Rafalovitch
 splitOnCaseChange=1

So, it does not get split during indexing because there is no case
change. But does get split during search and now you are looking for
partial tokens against a combined single-token in the index. And not
matching.

The WordDelimiterFilterFactory is more for product IDs that have
multitudes of spellings. Your use-case seems to be a lot more of just
matching with ignoring case (looking at last email only).

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 29 December 2014 at 17:12, Jonathan Rochkind rochk...@jhu.edu wrote:
 Okay, some months later I've come back to this with an isolated reproduction
 case. Thanks very much for any advice or debugging help you can give.

 The WordDelimiter filter is making a mixed-case query NOT match the
 single-case source, when it ought to.

 I am in Solr 4.3 (sorry, that's what we run; let me know if it makes no
 sense to debug here, and I need to install and try to reproduce on a more
 recent version).

 I have an index that includes ONE document (deleted and reindexed after
 index change), with content in only one field (text) other than 'id', and
 that content is one word: delalain.

 My analysis (both index and query, I don't have different ones) for the
 'text' field is simply:

 fieldType name=text class=solr.TextField positionIncrementGap=100
 autoGeneratePhraseQueries=true
   analyzer
 tokenizer class=solr.ICUTokenizerFactory /

 filter class=solr.WordDelimiterFilterFactory
 generateWordParts=1 catenateWords=1 splitOnCaseChange=1/

 filter class=solr.ICUFoldingFilterFactory /
   /analyzer
 /fieldType

 I am querying simply with eg /select?defType=luceneq=text%3Adelalain

 Querying for delalain finds this document, as expected. Querying for
 DELALAIN finds this document, as expected (note the ICUFoldingFactory).

 However, querying for deLALAIN does not find this document, which is
 unexpected.

 INDEX analysis of the source, delalain, ends in this in the index, which
 seems pretty straightforward, so I'll only bother pasting in the final index
 analysis:

 ##
 textdelalain
 raw_bytes   [64 65 6c 61 6c 61 69 6e]
 position1
 start   0
 end 8
 typeALPHANUM
 script  Latin
 ###




 QUERY analysis of the problematic query, deLALAIN, looks like this:

 #
 ICUTtextdeLALAIN
 raw_bytes   [64 65 4c 41 4c 41 49 4e]
 start   0
 end 8
 typeALPHANUM
 script  Latin
 position1


 WDF textde  LALAIN  deLALAIN
 raw_bytes   [64 65] [4c 41 4c 41 49 4e] [64 65 4c 41 4c 41
 49 4e]
 start   0   2   0
 end 2   8   8
 typeALPHANUM  ALPHANUM  ALPHANUM
 position1   2   2
 script  Common  Common  Common


 ICUFF   textde  lalain  delalain
 raw_bytes   [64 65] [6c 61 6c 61 69 6e] [64 65 6c 61 6c 61
 69 6e]
 position1   2   2
 start   0   2   0
 end 2   8   8
 typeALPHANUM  ALPHANUM  ALPHANUM
 script  Common  Common  Common
 ###



 It's obviously the WordDelimiterFilter that is messing things up -- but
 how/why, and is it a bug?

 It wants to search for both de lalain as a phrase, as well as alternately
 delalain as one word -- that's the intended supported point of the WDF
 with this configuration, right? And should work?

 The problem is that is not succesfully matching delalain as one word --
 so, how to figure out why not and what to do about it?

 Previously, Erick and Diego asked for the info from debug=query, so here is
 that as well:

 
 lst name=debug
   str name=rawquerystringtext:deLALAIN/str
   str name=querystringtext:deLALAIN/str
   str name=parsedqueryMultiPhraseQuery(text:de (lalain
 delalain))/str
   str name=parsedquery_toStringtext:de (lalain delalain)/str
   str name=QParserLuceneQParser/str
 /lst
 

 Hmm, that does not seem to quite look like neccesarily, if I interpret that
 correctly, it's looking for de followed by either lalain or delalain.
 Ie, it would match de delalain?  But that's not right at all.

 So, what's gone wrong? Something with WDF with configuration to
 generateWords/catenateWords/splitOnCaseChange? Is it a bug? (And if it's a
 bug, one that might be fixed in a more recent Solr?).

 Thanks!

 Jonathan





 On 9/3/14 7:15 PM, Erick Erickson wrote:

 Jonathan:

 If at all possible, delete your collection/data directory (the whole
 directory, including data) between runs after you've changed
 your schema (at least any of your analysis that pertains to indexing).
 Mixing old and new schema definitions can add to the confusion!

 Good luck!
 Erick

 On Wed, Sep 3, 2014 at 8:48 AM, Jonathan Rochkind rochk...@jhu.edu
 wrote:

 Thanks Erick and Diego. Yes, I noticed in my last message I'm not
 actually
 using 

Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-29 Thread Jonathan Rochkind

On 12/29/14 5:24 PM, Jack Krupansky wrote:

WDF is powerful, but it is not magic. In general, the indexed data is
expected to be clean while the query might be sloppy. You need to separate
the index and query analyzers and they need to respect that distinction


I do not understand what separate query/index analysis you are 
suggesting to accomplish what I wanted.


I understand the WDF, like all software, is not magic, of course. But I 
thought this was an intended use case of the WDF, with those settings:


A mixedCase query would match mixedCase in the index; and the same 
query mixedCase would also match two separate words mixed Case in 
index.  (Case insensitively since I apply an ICUFoldingFilter on top of 
that).


Was I wrong, is this not an intended thing for the WDF to do? Or do I 
just have the wrong configuration options for it to do it? Or is it a bug?


When I started this thread a few months ago, I think Erick Erickson 
agreed this was an intended use case for the WDF, but maybe I explained 
it poorly. Erick if you're around and want to at least confirm whether 
WDF is supposed to do this in your understanding, that would be great!


Jonathan


no replication using commitWithin via curl?

2014-12-29 Thread Brendan Humphreys
Hi,

We've noticed that when we send deletes to our SolrCloud cluster via curl
with the param commitWithin=1 specified, the deletes are applied and
are visible to the leader node, but aren't replicated to other nodes.
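
(For illustration, the shape of the request is roughly the following -- host,
collection name, and document id are placeholders:)

    curl "http://localhost:8983/solr/mycollection/update?commitWithin=1" \
      -H "Content-Type: application/xml" \
      --data-binary "<delete><id>doc-123</id></delete>"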

The problem can be worked around by issuing an explicit (hard) commit.

Is this expected behaviour? Can anyone shed light on what is going on here?

Thanks,
-Brendan


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-29 Thread Erick Erickson
Jonathan:

Well, it works if you set splitOnCaseChange=0 in just the query part
of the analysis chain. I probably misled you a bit months ago: WDFF
is intended for this case iff you expect the case change to generate
_tokens_ that are individually meaningful. And unfortunately, what is
significant in one case will not be significant in others.
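
(Roughly, splitting the analyzer from your earlier mail into index and query
variants -- an untested sketch, adjust to your schema:)

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
      autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.ICUTokenizerFactory" />
        <!-- index time: split on case change and also keep the catenated term -->
        <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
        <filter class="solr.ICUFoldingFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.ICUTokenizerFactory" />
        <!-- query time: do NOT split on case change, so deLALAIN stays one token -->
        <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" catenateWords="1" splitOnCaseChange="0"/>
        <filter class="solr.ICUFoldingFilterFactory" />
      </analyzer>
    </fieldType>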

So what kinds of things do you want WDFF to handle? Case changes?
Letter/non-letter transitions? All of the above?

Best,
Erick



On Mon, Dec 29, 2014 at 3:07 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
 On 12/29/14 5:24 PM, Jack Krupansky wrote:

 WDF is powerful, but it is not magic. In general, the indexed data is
 expected to be clean while the query might be sloppy. You need to separate
 the index and query analyzers and they need to respect that distinction


 I do not understand what separate query/index analysis you are suggesting to
 accomplish what I wanted.

 I understand the WDF, like all software, is not magic, of course. But I
 thought this was an intended use case of the WDF, with those settings:

 A mixedCase query would match mixedCase in the index; and the same query
 mixedCase would also match two separate words mixed Case in index.
 (Case insensitively since I apply an ICUFoldingFilter on top of that).

 Was I wrong, is this not an intended thing for the WDF to do? Or do I just
 have the wrong configuration options for it to do it? Or is it a bug?

 When I started this thread a few months ago, I think Erick Erickson agreed
 this was an intended use case for the WDF, but maybe I explained it poorly.
 Erick if you're around and want to at least confirm whether WDF is supposed
 to do this in your understanding, that would be great!

 Jonathan


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-29 Thread Alexandre Rafalovitch
On 29 December 2014 at 18:07, Jonathan Rochkind rochk...@jhu.edu wrote:
 I do not understand what separate query/index analysis you are suggesting to
 accomplish what I wanted.

I am sure you do know that, but just in case. At the moment, you have
only one analyzer chain, so it applies at both index and query time.
You can split those and have separate treatment during indexing and
during search. Useful with synonyms, etc. The example schema has both
versions shown.

But I would start by just removing splitOnCaseChange attribute and
reindexing. I don't think that flag means what you want it to mean.

Regards,
Alex.


Sign up for my Solr resources newsletter at http://www.solr-start.com/


Re: How to implement multi-set in a Solr schema.

2014-12-29 Thread Meraj A. Khan
Thanks Jack. In order to not affect query time, what are the options
available to handle this at index time? So that I group all the similar
books at index time by placing them in some kind of a set, and retrieve all
the contents of the set at query time if any one of them matches the query.
On Dec 29, 2014 12:49 AM, Jack Krupansky jack.krupan...@gmail.com wrote:

 You can also use group.query or group.func to group documents matching a
 query or unique values of a function query. For the latter you could
 implement an NLP algorithm.


 -- Jack Krupansky

 On Sun, Dec 28, 2014 at 5:56 PM, Meraj A. Khan mera...@gmail.com wrote:

  Thanks Aman, the thing is the bookName field values are not exactly
  identical , but nearly identical , so at the time of indexing I need to
  figure out which other book name field this is similar to using NLP
  techniques and then put it in the appropriate bag, so that at the
 retrieval
  time I only retrieve all the elements from that bag if any one of the
  element matches with the search query.
 
  Thanks.
  On Dec 28, 2014 1:54 PM, Aman Tandon amantandon...@gmail.com wrote:
 
   HI,
  
   You can use the grouping in the solr. You can does this by via query or
  via
   solrconfig.xml.
  
   *A) via query*
  
  
 
  http://localhost:8983?your_query_params&group=true&group.field=bookName
  
   You can limit the size of group (how many documents you wants to show),
   suppose you want to show 5 documents per group on this bookName field
  then
   you can specify the parameter *group.limit=5.*
  
   *B) via solrconfig*
    <str name="group">true</str> <str name="group.field">bookName</str>
    <str name="group.ngroups">true</str>
    <str name="group.truncate">true</str>
  
   With Regards
   Aman Tandon
  
   On Sun, Dec 28, 2014 at 10:29 PM, S.L simpleliving...@gmail.com
 wrote:
  
Hi All,
   
I have a use case where I need to group documents that have a same
  field
called bookName , meaning if there are a multiple documents with the
  same
bookName value and if the user input is searched by a query on
  bookName
   ,
I need to be able to group all the documents by the same bookName
   together,
so that I could display them as a group in the UI.
   
What kind of support does Solr provide for such a scenario , and how
   should
I look at changing my schema.xml which as bookName as single valued
  text
field ?
   
Thanks.
   
  
 



Re: no replication using commitWithin via curl?

2014-12-29 Thread Brendan Humphreys
I've confirmed this also happens with deletes via SolrJ with
commitWithin - the document is deleted from the leader but the delete is
not replicated to other nodes. Document updates are replicated fine.
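
(Roughly the SolrJ call in question -- zk hosts, collection name, id, and the
commitWithin value are placeholders:)

    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("mycollection");
    // delete by id with commitWithin (milliseconds); the delete shows up on the
    // leader but not on the replicas until a hard commit is issued
    server.deleteById("doc-123", 10000);
    server.shutdown();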

Any help in debugging this behaviour would be much appreciated.

Cheers,
-Brendan

On 30 December 2014 at 10:11, Brendan Humphreys bren...@canva.com wrote:

 Hi,

 We've noticed that when we send deletes to our SolrCloud cluster via curl
 with the param commitWithin=1 specified, the deletes are applied and
 are visible to the leader node, but aren't replicated to other nodes.

 The problem can be worked around by issuing an explicit (hard) commit.

 Is this expected behaviour? Can anyone shed light on what is going on here?

 Thanks,
 -Brendan



poor performance when connecting to CloudSolrServer(zkHosts) using solrJ

2014-12-29 Thread zhangjianad

hi,
I set up a SolrCloud cluster and wrote a simple SolrJ program to query Solr
data as below, but it takes about 40 seconds to create a new CloudSolrServer
instance; less than 100 milliseconds would be acceptable. What is going on when
creating a new CloudSolrServer, and how can I fix this issue?

String zkHost = "bicenter1.dcc:2181,datanode2.dcc:2181";
String defaultCollection = "hdfsCollection";

long startms = System.currentTimeMillis();
CloudSolrServer server = new CloudSolrServer(zkHost);
server.setDefaultCollection(defaultCollection);
server.setZkConnectTimeout(3000);
server.setZkClientTimeout(6000);
long endms = System.currentTimeMillis();
System.out.println(endms - startms);

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("q", "id:*hbase*");
params.set("sort", "price desc");
params.set("start", 0);
params.set("rows", 10);

try {
    QueryResponse response = server.query(params);
    SolrDocumentList results = response.getResults();
    for (SolrDocument doc : results) {
        String rowkey = doc.getFieldValue("id").toString();
    }

} catch (SolrServerException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

server.shutdown();

thanks for any responses.

jan





Re: no replication using commitWithin via curl?

2014-12-29 Thread Shawn Heisey
On 12/29/2014 4:11 PM, Brendan Humphreys wrote:
 We've noticed that when we send deletes to our SolrCloud cluster via curl
 with the param commitWithin=1 specified, the deletes are applied and
 are visible to the leader node, but aren't replicated to other nodes.
 
 The problem can be worked around by issuing an explicit (hard) commit.
 
 Is this expected behaviour? Can anyone shed light on what is going on here?

Another of your messages mentions 4.10.2, which should have the fix for
a similar problem reported with a much earlier version, fixed in 4.6.1.

https://issues.apache.org/jira/browse/SOLR-5658

There's some confusion around another problem introduced by SOLR-5658 --
SOLR-5762 -- but if you use the latest version, that shouldn't be a problem.

If you are running 4.10.2, perhaps SOLR-5658 has come back, or maybe you
have multiple versions of the solr jars on your classpath?

Thanks,
Shawn



Re: poor performance when connecting to CloudSolrServer(zkHosts) using solrJ

2014-12-29 Thread Shawn Heisey
On 12/29/2014 6:52 PM, zhangjia...@dcits.com wrote:
   I setups a SolrCloud, and code a simple solrJ program to query solr
 data as below, but it takes about 40 seconds to new CloudSolrServer
 instance,less than 100 miliseconds is acceptable. what is going on when new
 CloudSolrServer? and how to fix this issue?
 
    String zkHost = "bicenter1.dcc:2181,datanode2.dcc:2181";
    String defaultCollection = "hdfsCollection";
  
    long startms = System.currentTimeMillis();
    CloudSolrServer server = new CloudSolrServer(zkHost);
    server.setDefaultCollection(defaultCollection);
    server.setZkConnectTimeout(3000);
    server.setZkClientTimeout(6000);
    long endms = System.currentTimeMillis();
    System.out.println(endms - startms);
  
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("q", "id:*hbase*");
    params.set("sort", "price desc");
    params.set("start", 0);
    params.set("rows", 10);
  
    try {
    QueryResponse response = server.query(params);
    SolrDocumentList results = response.getResults();
    for (SolrDocument doc : results) {
    String rowkey = doc.getFieldValue("id").toString();
    }
  
    } catch (SolrServerException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
    }
  
    server.shutdown();

The only part of the constructor for CloudSolrServer that I cannot
easily look at is the part that creates the httpclient, because
ultimately that calls code outside of Solr, in the HttpComponents
project.  Everything that I *can* see is code that should happen
extremely quickly, and the httpclient creation code is something that I
have used myself and never had any noticeable delay.  The constructor
for CloudSolrServer does *NOT* contact zookeeper or Solr, it merely sets
up the instance.  Nothing is contacted until a request is made.  I
examined the CloudSolrServer code from branch_5x.

I tried out your code (with SolrJ 4.6.0 against a SolrCloud 4.2.1
cluster).  Although the query itself encountered an exception in
zookeeper (probably from the version discrepancy between Solr and
SolrJ), the elapsed time printed out from the CloudSolrServer
initialization was 240 milliseconds on the first run, 60 milliseconds on
a second run, and 64 milliseconds on a third run.  Those are all MUCH
less than the 1000 milliseconds that would represent one second, and
incredibly less than the 40000 milliseconds that would represent 40 seconds.

Side issue:  I hope that you have more than two zookeeper servers in
your ensemble.  A two-node zookeeper ensemble is actually *less*
reliable than a single node, because a failure of EITHER of those two
nodes will result in a loss of quorum.  Three nodes is the minimum
required for a redundant zookeeper ensemble.

Thanks,
Shawn



Re: How large is your solr index?

2014-12-29 Thread Shawn Heisey
On 12/29/2014 2:30 PM, Toke Eskildsen wrote:
 At Lucene/Solr Revolution 2014, Grant Ingersoll also asked for user stories 
 and pointed to https://wiki.apache.org/solr/SolrUseCases - sadly it has not 
 caught on. The only entry is for our (State and University Library, Denmark) 
 setup with 21TB / 7 billion documents on a single machine. To follow my own 
 advice, I can elaborate that we have 1-3 concurrent users and a design goal 
 of median response times below 2 seconds for faceted search. I guess that is 
 at the larger end at the spectrum for pure size, but at the very low end for 
 usage.

Off-Topic tangent:

I believe it would be useful to organize a session at Lucene Revolution,
possibly more interactive than a straight presentation, where users with
very large indexes are encouraged to attend.  The point of this session
would be to exchange war stories, configuration requirements, hardware
requirements, and observations.

Bringing people with similar goals together to discuss their solutions
should be beneficial.  The discussions could pinpoint areas where Solr
and SolrCloud are weak on scalability, and hopefully lead to issues in
Jira and fixes for those problems.  Better documentation for extreme
scaling is also a possible outcome.

Another idea, not sure if it would be good as an alternate idea or
supplemental, is a less formal gathering, perhaps over a meal or three.

My index is hardly large enough to mention, but I would be interested in
attending such a gathering to learn more about the topic.

Thanks,
Shawn



Re: no replication using commitWithin via curl?

2014-12-29 Thread Brendan Humphreys
Thanks for the reply Shawn.

Yes I am using 4.10.2 - I should have mentioned that in my original post. I
can confirm there are not multiple versions of solr in the classpath; Our
SolrCloud nodes are built programmatically in AWS using the download
package of a specific Solr version as a starting point.

I should add that document adds/updates are visible on all nodes very
quickly. It's only the deletes that are problematic. Reloading the core on a
node brings it back into alignment with the leader.

I'll dig into the JIRAs you linked to see if there are any hints as to
what's going on.

Cheers,
-Brendan



On 30 December 2014 at 12:57, Shawn Heisey apa...@elyograg.org wrote:

 On 12/29/2014 4:11 PM, Brendan Humphreys wrote:
  We've noticed that when we send deletes to our SolrCloud cluster via curl
  with the param commitWithin=1 specified, the deletes are applied and
  are visible to the leader node, but aren't replicated to other nodes.
 
  The problem can be worked around by issuing an explicit (hard) commit.
 
  Is this expected behaviour? Can anyone shed light on what is going on
 here?

 Another of your messages mentions 4.10.2, which should have the fix for
 a similar problem reported with a much earlier version, fixed in 4.6.1.

 https://issues.apache.org/jira/browse/SOLR-5658

 There's some confusion around another problem introduced by SOLR-5658 --
 SOLR-5762 -- but if you use the latest version, that shouldn't be a
 problem.

 If you are running 4.10.2, perhaps SOLR-5658 has come back, or maybe you
 have multiple versions of the solr jars on your classpath?

 Thanks,
 Shawn




Re: How large is your solr index?

2014-12-29 Thread Alexandre Rafalovitch
On 29 December 2014 at 21:42, Shawn Heisey apa...@elyograg.org wrote:
 I believe it would be useful to organize a session at Lucene Revolution,
 possibly more interactive than a straight presentation, where users with
 very large indexes are encouraged to attend.  The point of this session
 would be to exchange war stories, configuration requirements, hardware
 requirements, and observations.

+1

And have a scribe to take notes with whom to follow-up later :-) And
interview separately for Solr podcast too.

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/