RE: CJKBigram filter questions: single character queries, bigrams created across script/character types

2012-04-30 Thread Burton-West, Tom
Thanks wunder and Lance,

In the discussions I've seen of Japanese IR in the English language IR 
literature, Hiragana is either removed or strings are segmented first by 
character class.  I'm interested in finding out more about why bigramming 
across classes is desirable.
Based on my limited understanding of Japanese, I can see how perhaps bigramming 
a Han and Hiragana character might make sense but what about Han and Katakana?

Lance, how did you weight the unigram vs. bigram fields for CJK?  Or did you just 
OR them together, assuming that idf will give the bigrams more weight?

Tom



RE: CJKBigram filter questions: single character queries, bigrams created across script/character types

2012-04-30 Thread Burton-West, Tom
Thanks wunder,

I really appreciate the help.

Tom



CJKBigram filter questions: single character queries, bigrams created across script/character types

2012-04-27 Thread Burton-West, Tom
I have a few questions about the CJKBigram filter.

About 10% of our queries that contain Han characters are single character 
queries.   It looks like the CJKBigram filter only outputs single characters 
when there are no adjacent bigrammable characters in the input.   This means we 
would have to create a separate field to index Han unigrams in order to address 
single character queries.  Is this correct?
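
To make the question concrete, the sort of two-field setup I have in mind looks 
roughly like this (field and type names are made up, and I have not verified the 
exact factory attributes against 3.6):

  <!-- bigram field, as produced by CJKBigramFilter -->
  <fieldType name="text_cjk_bi" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true"/>
    </analyzer>
  </fieldType>
  <!-- companion unigram field for single-character queries: same chain minus the bigram filter -->
  <fieldType name="text_cjk_uni" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.CJKWidthFilterFactory"/>
    </analyzer>
  </fieldType>

with a copyField from the bigram field's source into the unigram field.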

For Japanese, the default settings form bigrams across character types.  So for 
a string containing Hiragana and Han characters, bigrams containing a mixture of 
Hiragana and Han characters are formed:
いろは革命歌  =  “いろ”  “ろは”  “は革”  “革命”  “命歌”

Is there a way to specify that you don’t want bigrams across character types?

Tom

Tom Burton-West
Digital Library Production Service
University of Michigan Library

http://www.hathitrust.org/blogs/large-scale-search



maxMergeDocs in Solr 3.6

2012-04-19 Thread Burton-West, Tom
Hello all,

I'm getting ready to upgrade from Solr 3.4 to Solr 3.6 and I noticed that 
maxMergeDocs is no longer in the example solrconfig.xml.
Has maxMergeDocs been deprecated, or does the TieredMergePolicy ignore it?

Since our docs are about 800K or larger and the setting in the old example 
solrconfig was 2,147,483,647, we would never hit this limit, but I was wondering 
why it is no longer in the example.



Tom



RE: autoGeneratePhraseQueries sort of silently set to false

2012-02-23 Thread Burton-West, Tom
Seems like a change in default behavior like this should be included in the 
changes.txt for Solr 3.5.
Not sure how to do that.

Tom

-Original Message-
From: Naomi Dushay [mailto:ndus...@stanford.edu] 
Sent: Thursday, February 23, 2012 1:57 PM
To: solr-user@lucene.apache.org
Subject: autoGeneratePhraseQueries sort of silently set to false 

Another thing I noticed when upgrading from Solr 1.4 to Solr 3.5 had to do with 
results when there were hyphenated words:   aaa-bbb.   Erik Hatcher pointed me 
to the autoGeneratePhraseQueries attribute now available on fieldtype 
definitions in schema.xml.  This is a great feature, and everything is peachy 
if you start with Solr 3.4.   But many of us started earlier and are upgrading, 
and that's a different story.

It was surprising to me that

a.  the default for this new feature caused different search results than Solr 
1.4 

b.  it wasn't documented clearly, IMO

http://wiki.apache.org/solr/SchemaXml   makes no mention of it


In the schema.xml example, there is this at the top:

<!-- attribute "name" is the name of this schema and is only used for display 
purposes.
   Applications should change this to reflect the nature of the search 
collection.
   version="1.4" is Solr's version number for the schema syntax and 
semantics.  It should
   not normally be changed by applications.
   1.0: multiValued attribute did not exist, all fields are multiValued by 
nature
   1.1: multiValued attribute introduced, false by default 
   1.2: omitTermFreqAndPositions attribute introduced, true by default 
except for text fields.
   1.3: removed optional field compress feature
   1.4: default auto-phrase (QueryParser feature) to off
 -->

And there was this in a couple of field definitions:

<fieldType name="text_en_splitting" class="solr.TextField" 
positionIncrementGap="100" autoGeneratePhraseQueries="true">
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" 
autoGeneratePhraseQueries="false">

But that was it.



RE: autoGeneratePhraseQueries sort of silently set to false

2012-02-23 Thread Burton-West, Tom
Thanks Erik,

The 3.1 changes document the ability to set this, and the default being set to 
"true".
However, apparently in the change between 3.4 and 3.5 the default was set to 
"false".
Since this will change the behavior of any field where 
autoGeneratePhraseQueries is not explicitly set, it could easily surprise users 
who update to 3.5. 

 That's why I think the changing of the default behavior (i.e. when not 
explicitly set) should be called out explicitly in the changes.txt for 3.5.   

True, everyone should read the notes in the example schema.xml, but I think it 
would help if the change was also noted in changes.txt.  

Is it possible to revise the changes.txt for 3.5?

Do you by any chance know where the change in the default behavior was 
discussed?  I know it has been a contentious issue.

Tom

-Original Message-
From: Erik Hatcher [mailto:erik.hatc...@gmail.com] 
Sent: Thursday, February 23, 2012 2:53 PM
To: solr-user@lucene.apache.org
Subject: Re: autoGeneratePhraseQueries sort of silently set to false

there's this (for 3.1, but in the 3.x CHANGES.txt):

* SOLR-2015: Add a boolean attribute autoGeneratePhraseQueries to TextField.
  autoGeneratePhraseQueries=true (the default) causes the query parser to
  generate phrase queries if multiple tokens are generated from a single
  non-quoted analysis string.  For example WordDelimiterFilter splitting text:pdp-11
  will cause the parser to generate text:"pdp 11" rather than (text:PDP OR text:11).
  Note that autoGeneratePhraseQueries=true tends to not work well for non 
whitespace
  delimited languages. (yonik)

with a ton of useful, though back and forth, commentary here: 
https://issues.apache.org/jira/browse/SOLR-2015

Note that the behavior, as Naomi pointed out so succinctly, is adjustable based 
off the *schema* version setting.  (look at your schema line in schema.xml).  
The code is simply this:

if (schema.getVersion() > 1.3f) {
  autoGeneratePhraseQueries = false;
} else {
  autoGeneratePhraseQueries = true;
}

on TextField.  Specifying autoGeneratePhraseQueries explicitly on a field type 
overrides whatever the default may be.
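
That is, the version attribute on the schema element at the top of schema.xml, 
something like:

  <schema name="example" version="1.4">

With version 1.4 or later the TextField default is false; with 1.3 or earlier it 
stays true, and (as above) setting autoGeneratePhraseQueries explicitly on the 
field type wins either way.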

Erik



On Feb 23, 2012, at 14:45 , Burton-West, Tom wrote:

 Seems like a change in default behavior like this should be included in the 
 changes.txt for Solr 3.5.
 Not sure how to do that.
 
 Tom
 
 -Original Message-
 From: Naomi Dushay [mailto:ndus...@stanford.edu] 
 Sent: Thursday, February 23, 2012 1:57 PM
 To: solr-user@lucene.apache.org
 Subject: autoGeneratePhraseQueries sort of silently set to false 
 
 Another thing I noticed when upgrading from Solr 1.4 to Solr 3.5 had to do 
 with results when there were hyphenated words:   aaa-bbb.   Erik Hatcher 
 pointed me to the autoGeneratePhraseQueries attribute now available on 
 fieldtype definitions in schema.xml.  This is a great feature, and everything 
 is peachy if you start with Solr 3.4.   But many of us started earlier and 
 are upgrading, and that's a different story.
 
 It was surprising to me that
 
 a.  the default for this new feature caused different search results than 
 Solr 1.4 
 
 b.  it wasn't documented clearly, IMO
 
 http://wiki.apache.org/solr/SchemaXml   makes no mention of it
 
 
 In the schema.xml example, there is this at the top:
 
 <!-- attribute "name" is the name of this schema and is only used for display 
 purposes.
   Applications should change this to reflect the nature of the search 
 collection.
   version="1.4" is Solr's version number for the schema syntax and 
 semantics.  It should
   not normally be changed by applications.
   1.0: multiValued attribute did not exist, all fields are multiValued by 
 nature
   1.1: multiValued attribute introduced, false by default 
   1.2: omitTermFreqAndPositions attribute introduced, true by default 
 except for text fields.
   1.3: removed optional field compress feature
   1.4: default auto-phrase (QueryParser feature) to off
 -->
 
 And there was this in a couple of field definitions:
 
 <fieldType name="text_en_splitting" class="solr.TextField" 
 positionIncrementGap="100" autoGeneratePhraseQueries="true">
 <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" 
 autoGeneratePhraseQueries="false">
 
 But that was it.
 



RE: Can Apache Solr Handle TeraByte Large Data

2012-01-16 Thread Burton-West, Tom
Hello ,

Searching real-time sounds difficult with that amount of data. With large 
documents, 3 million documents, and 5TB of data the index will be very large. 
With indexes that large your performance will probably be I/O bound.  

Do you plan on allowing phrase or proximity searches? If so, your performance 
will be even more I/O bound as documents that large will have huge positions 
indexes that will need to be read into memory for processing phrase queries. To 
reduce I/O you need as much of the index in memory (Lucene/Solr caches, and 
operating system disk cache).  Every commit invalidates the Solr/Lucene caches 
(unless the newer nrt code has solved this for Solr).  

If you index and serve on the same server, you are also going to get terrible 
response time whenever your commits trigger a large merge.

If you need to service 10-100 qps or more, you may need to look at putting your 
index on SSDs or spreading it over enough machines so it can stay in memory.

What kind of response times are you looking for and what query rate?

We have somewhat smaller documents. We have 10 million documents and about 
6-8TB of data in HathiTrust and have spread the index over 12 shards on 4 
machines (i.e. 3 shards per machine).   We get an average of around 200-300ms 
response time but our 95th percentile times are about 800ms and 99th percentile 
are around 2 seconds.  This is with an average load of less than 1 query/second.

As Otis suggested, you may want to implement a strategy that allows users to 
search within the large documents by breaking the documents up into smaller 
units. What we do is have two Solr indexes.  The first indexes complete 
documents.  When the user clicks on a result, we index the entire document on a 
page level in a small Solr index on-the-fly.  That way they can search within 
the document and get page level results.
 
More details about our setup: http://www.hathitrust.org/blogs/large-scale-search

Tom Burton-West
University of Michigan Library
www.hathitrust.org



RE: Getting facet counts for 10,000 most relevant hits

2011-10-03 Thread Burton-West, Tom
Thanks so much for your reply Hoss,

I didn't realize how much more complicated this gets with distributed search. 
Do you think it's worth opening a JIRA issue for this?
Is there already some ongoing work on the faceting code that this might fit in 
with?

In the meantime, I think I'll go ahead and do some performance tests on my 
kludge.  That might work for us as an interim measure until I have time to dive 
into the Solr/Lucene distributed faceting code.

Tom

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Friday, September 30, 2011 9:20 PM
To: solr-user@lucene.apache.org
Subject: RE: Getting facet counts for 10,000 most relevant hits


: I figured out how to do this in a kludgey way on the client side but it 
: seems this could be implemented much more efficiently at the Solr/Lucene 
: level.  I described my kludge and posted a question about this to the 

It can, and I have -- but only for the case of a single node...

In general the faceting code in Solr just needs a DocSet.  The default 
impl uses the DocSet computed as a side effect when executing the main 
search, but a custom SearchComponent could pick any DocSet it wants.

A few years back I wrote a custom faceting plugin that computed a score 
for each constraint based on:
 * Editorially assigned weights from a config file
 * the number of matching documents (ie: normal constraint count)
 * the number of matching documents from the first N results

...where the last number was determined by internally executing the search 
with rows of N, to generate a DocList object, and then converting that 
DocList into a DocSet, and using that as the input to SimpleFacetCounts.
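
Roughly (from memory, so the exact method signatures may have drifted 
between versions -- treat this as a sketch, not working code):

  // inside a custom SearchComponent, with the SolrIndexSearcher in hand
  // (DocList, DocIterator, DocSet, HashDocSet live in org.apache.solr.search)
  DocList top = searcher.getDocList(query, (DocSet) null, Sort.RELEVANCE, 0, n);
  int[] ids = new int[top.size()];
  DocIterator it = top.iterator();
  for (int i = 0; it.hasNext(); i++) {
    ids[i] = it.nextDoc();
  }
  DocSet topSet = new HashDocSet(ids, 0, ids.length);
  // hand topSet to SimpleFacets in place of the full result DocSet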

Ignoring the Editorial weights part of the above, the logic for 
scoring constraints based on the other two factors is general enough 
that it could be implemented in Solr; we just need a way to configure N 
and what kind of function should be applied to the two counts.

...But...

This approach really breaks down in a distributed model.  You can't do the 
same quick and easy DocList->DocSet transformation on each node; you have 
to do more complicated federating logic like the existing FacetComponent 
code does, and even there we don't have anything that would help with the 
"only the first N" type logic.  My best idea would be to do the same thing 
you describe in your kludge approach to solving this in the client...

: 
(http://lucene.472066.n3.nabble.com/Solr-should-provide-an-option-to-show-only-most-relevant-facet-values-tc3374285.html).
  

...the coordinator would have to query all of the shards for their top N, 
and then tell each one exactly which of those docs to include in the 
weighted facets constraints count ... which would make for some really 
big requests if N is large.

the only sane way to do this type of thing efficiently in a distributed 
setup would probably be to treat the "top N" part of the goal as a 
guideline for a sampling problem, telling each shard to consider only 
*their* top N results when computing the top facets in shardReq #1, and 
then do the same "give me an exact count" type logic in shardReq #2 
that we already do.  So the constraints picked may not actually be 
the top constraints for the first N docs across the whole collection (just 
like right now they aren't guaranteed to be the top constraints for all 
docs in the collection in a "long tail" situation), but they would be 
representative of the first-ish docs across the whole collection.

-Hoss


RE: Getting facet counts for 10,000 most relevant hits

2011-09-30 Thread Burton-West, Tom
Hi Lan,

I figured out how to do this in  a kludgey way on the client side but it seems 
this could be implemented much more efficiently at the Solr/Lucene level.  I 
described my kludge and posted a question about this to the dev list, but so 
far have not received any replies 
(http://lucene.472066.n3.nabble.com/Solr-should-provide-an-option-to-show-only-most-relevant-facet-values-tc3374285.html).
  I also found SOLR-385, but I don't understand how grouping solves the 
problem.  It looks like a much different issue to me.

The problem I am trying to solve is that I only have room in the interface to 
show at most 30 facet values.  Whether these are ordered by facet counts 
against the entire result set or by the highest ranking score of a member of a 
facet-value group, we want to base the facet counts/ranking 
on only the top N hits rather than the entire result set -- in my use case the 
top 10,000 hits versus all 170,000.

Tom

-Original Message-
From: Lan [mailto:dung@gmail.com] 
Sent: Thursday, September 29, 2011 7:40 PM
To: solr-user@lucene.apache.org
Subject: Re: Getting facet counts for 10,000 most relevant hits

I implemented a similar feature for a categorization suggestion service. I
did the faceting in the client code, which is not exactly the best
performing but it worked very well.

It would be nice to have the Solr server do the faceting for performance.


Burton-West, Tom wrote:
 
 If relevance ranking is working well, in theory it doesn't matter how many
 hits you get as long as the best results show up in the first page of
 results.  However, the default in choosing which facet values to show is
 to show the facets with the highest count in the entire result set.  Is
 there a way to issue some kind of a filter query or facet query that would
 show only the facet counts for the 10,000 most relevant search results?
 
 As an example, if you search in our full-text collection for "jaguar" you
 get 170,000 hits.  If I am looking for the car rather than the OS or the
 animal, I might expect to be able to click on a facet and limit my results
 to the car.  However, facets containing the word "car" or "automobile" are not
 in the top 5 facets that we show.  If you click on "more" you will see
 "automobile periodicals" but not the rest of the facets containing the
 word "automobile".  This occurs because the facet counts are for all
 170,000 hits.  The facet counts for at least 160,000 irrelevant hits are
 included (assuming only the top 10,000 hits are relevant).
 
 What we would like to do is get the facet counts for the N most relevant
 documents and select the 5 or 30 facet values with the highest counts for
 those relevant documents.
 
 Is this possible or would it require writing some lucene or Solr code?
 
 Tom Burton-West
 http://www.hathitrust.org/blogs/large-scale-search
 




Getting facet counts for 10,000 most relevant hits

2011-09-23 Thread Burton-West, Tom
If relevance ranking is working well, in theory it doesn't matter how many hits 
you get as long as the best results show up in the first page of results.  
However, the default in choosing which facet values to show is to show the 
facets with the highest count in the entire result set.  Is there a way to 
issue some kind of a filter query or facet query that would show only the facet 
counts for the 10,000 most relevant search results?

As an example, if you search in our full-text collection for "jaguar" you get 
170,000 hits.  If I am looking for the car rather than the OS or the animal, I 
might expect to be able to click on a facet and limit my results to the car.  
However, facets containing the word "car" or "automobile" are not in the top 5 
facets that we show.  If you click on "more" you will see "automobile 
periodicals" but not the rest of the facets containing the word "automobile".  
This occurs because the facet counts are for all 170,000 hits.  The facet 
counts for at least 160,000 irrelevant hits are included (assuming only the 
top 10,000 hits are relevant).

What we would like to do is get the facet counts for the N most relevant 
documents and select the 5 or 30 facet values with the highest counts for those 
relevant documents.

Is this possible or would it require writing some lucene or Solr code?

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search


RE: Example setting TieredMergePolicy for Solr 3.3 or 3.4?

2011-09-19 Thread Burton-West, Tom
Thanks Robert,

Removing "set" from "setMaxMergedSegmentMB" and using "maxMergedSegmentMB" 
fixed the problem.
( Sorry about the multiple posts.  Our mail server was being flaky and the 
client lied to me about whether the message had been sent.)

I'm still confused about the mergeFactor=10 setting in the example 
configuration.  Took a quick look at the code, but I'm obviously looking in the 
wrong place. Is mergeFactor=10 interpreted by TieredMergePolicy as
segmentsPerTier=10 and maxMergeAtOnce=10?   If I specify values for these is 
the mergeFactor setting ignored?
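
For the archives, here is the snippet from my earlier message with your fix 
applied:

  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">20</int>
    <int name="segmentsPerTier">40</int>
    <!-- 400GB / 20 = 20GB, or 2MB -->
    <double name="maxMergedSegmentMB">2</double>
  </mergePolicy>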

Tom



-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Friday, September 16, 2011 7:09 PM
To: solr-user@lucene.apache.org
Subject: Re: Example setting TieredMergePolicy for Solr 3.3 or 3.4?

On Fri, Sep 16, 2011 at 6:53 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Hello,

 The TieredMergePolicy has become the default with Solr 3.3, but the 
 configuration in the example uses the mergeFactor setting which applies to the 
 LogByteSizeMergePolicy.

 How is the mergeFactor interpreted by the TieredMergePolicy?

 Is there an example somewhere showing how to configure the Solr 
 TieredMergePolicy to set the parameters:
 setMaxMergeAtOnce, setSegmentsPerTier, and setMaxMergedSegmentMB?

an example is here:
http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/solrconfig-mergepolicy.xml


 I tried setting setMaxMergedSegmentMB in Solr 3.3
 <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">20</int>
      <int name="segmentsPerTier">40</int>
      <!-- 400GB / 20 = 20GB, or 2MB -->
      <double name="setMaxMergedSegmentMB">2</double>
 </mergePolicy>


 and got this error message
 SEVERE: java.lang.RuntimeException: no setter corrresponding to 
 'setMaxMergedSegmentMB' in org.apache.lucene.index.TieredMergePolicy

Right, I think it should be:

<double name="maxMergedSegmentMB">2</double>


-- 
lucidimagination.com


Example setting TieredMergePolicy for Solr 3.3 or 3.4?

2011-09-16 Thread Burton-West, Tom
Hello,

The TieredMergePolicy has become the default with Solr 3.3, but the 
configuration in the example uses the mergeFactor setting which applies to the 
LogByteSizeMergePolicy.

How is the mergeFactor interpreted by the TieredMergePolicy?

Is there an example somewhere showing how to configure the Solr 
TieredMergePolicy to set the parameters:
setMaxMergeAtOnce, setSegmentsPerTier, and setMaxMergedSegmentMB?

I tried setting setMaxMergedSegmentMB in Solr 3.3
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">20</int>
  <int name="segmentsPerTier">40</int>
  <!-- 400GB / 20 = 20GB, or 2MB -->
  <double name="setMaxMergedSegmentMB">2</double>
</mergePolicy>


and got this error message
SEVERE: java.lang.RuntimeException: no setter corrresponding to 
'setMaxMergedSegmentMB' in org.apache.lucene.index.TieredMergePolicy


Tom Burton-West



RE: performance crossover between single index and sharding

2011-08-02 Thread Burton-West, Tom
Hi Markus,

Just as a data point for a very large sharded index, we have the full text of 
9.3 million books with an index size of about 6+ TB spread over 12 shards on 4 
machines. Each machine has 3 shards. The size of each shard ranges between 
475GB and 550GB.  We are definitely I/O bound. Our machines have 144GB of 
memory with about 16GB dedicated to the tomcat instance running the 3 Solr 
instances, which leaves about 120 GB (or 40GB per shard) for the OS disk cache. 
 We release a new index every morning and then warm the caches with several 
thousand queries.  I probably should add that our disk storage is a very high 
performance Isilon appliance that has over 500 drives and every block of every 
file is striped over no less than 14 different drives. (See blog for details *)

We have a very low number of queries per second (0.3-2 qps) and our modest 
response time goal is to keep 99th percentile response time for our application 
(i.e. Solr + application) under 10 seconds.

Our current performance statistics are:

average response time    300 ms
median response time     113 ms
90th percentile          663 ms
95th percentile        1,691 ms

We had plans to do some performance testing to determine the optimum shard size 
and optimum number of shards per machine, but that has remained on the back 
burner for a long time as other higher priority items keep pushing it down on 
the todo list.

We would be really interested to hear about the experiences of people who have 
so many shards that the overhead of distributing the queries, and 
consolidating/merging the responses becomes a serious issue.


Tom Burton-West

http://www.hathitrust.org/blogs/large-scale-search

* 
http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-50-volumes-5-million-volumes-and-beyond

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Tuesday, August 02, 2011 12:33 PM
To: solr-user@lucene.apache.org
Subject: Re: performance crossover between single index and sharding

Actually, I do worry about it. Would be marvelous if someone could provide 
some metrics for an index of many terabytes.

 [..] At some extreme point there will be diminishing
 returns and a performance decrease, but I wouldn't worry about that at all
 until you've got many terabytes -- I don't know how many but don't worry
 about it.
 
 ~ David
 
 -
  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book


RE: performance crossover between single index and sharding

2011-08-02 Thread Burton-West, Tom
Hi Jonothan and Markus,

Why 3 shards on one machine instead of one larger shard per machine?

Good question!

We made this architectural decision several years ago and I'm not remembering 
the rationale at the moment. I believe we originally made the decision due to 
some tests showing a sweetspot for I/O performance for shards with 
500,000-600,000 documents, but those tests were made before we implemented 
CommonGrams and when we were still using attached storage.  I think we also 
might have had concerns about Java OOM errors with a really large shard/index, 
but we now know that we can keep memory usage under control by tweaking the 
amount of the terms index that gets read into memory.

We should probably do some tests and revisit the question.

The reason we don't have 12 shards on 12 machines is that current performance 
is good enough that we can't justify buying 8 more machines:)

Tom

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Tuesday, August 02, 2011 2:12 PM
To: solr-user@lucene.apache.org
Subject: Re: performance crossover between single index and sharding

Hi Tom,

Very interesting indeed! But I keep wondering why some engineers choose to 
store multiple shards of the same index on the same machine; there must be 
significant overhead. The only reason I can think of is ease of maintenance in 
moving shards to a separate physical machine.
I know that rearranging the shard topology can be a real pain in a large 
existing cluster (e.g. consistent hashing is not consistent anymore and docs 
have to be shuffled to their new shards). Is this the reason you chose this 
approach?

Cheers,


RE: what's the optimum size of SOLR indexes

2011-07-05 Thread Burton-West, Tom
Hello,

On Mon, 2011-07-04 at 13:51 +0200, Jame Vaalet wrote:
 What would be the maximum size of a single SOLR index file for resulting in 
 optimum search time ?

How do you define "optimum"?   Do you want the fastest possible response time 
at any cost or do you have a specific response time goal? 

Can you give us more details on your use case?   What kind of load are you 
expecting?  What kind of queries do you need to support?
Some of the trade-offs depend if you are CPU bound or I/O bound.

Assuming a fairly large index, if you *absolutely need* the fastest possible 
search response time and you can *afford the hardware*, you probably want to 
shard your index and size your indexes so they can all fit in memory (and do 
some work to make sure the index data is always in memory).  If you can't 
afford that much memory, but still need very fast response times, you might 
want to size your indexes so they all fit on SSD's.  As an example of a use 
case on the opposite side of the spectrum, here at HathiTrust, we have a very 
low number of queries per second and we are running an index that totals 6 TB 
in size with shards of about 500GB and average response times of 200ms (but 
99th percentile times of about 2 seconds).

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search



RE: Garbage Collection: I have given bad advice in the past!

2011-06-24 Thread Burton-West, Tom
Hi Shawn,

Thanks for sharing this information.  I also found that in our use case, for 
some reason the default settings for the concurrent garbage collector seem to 
size the young generation way too small (at least for heap sizes of 1GB or 
larger).  Can you also let us know what version of the JVM you are using?
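
(For anyone following along: explicitly sizing the young generation means 
passing flags along the lines of -XX:NewSize=1g -XX:MaxNewSize=1g alongside 
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC; the 1g here is only illustrative, 
not a recommendation.)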

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search


RE: huge shards (300GB each) and load balancing

2011-06-15 Thread Burton-West, Tom
Hi Dimitry,

The parameters you have mentioned -- termInfosIndexDivisor and
termIndexInterval -- are not found in the solr 1.4.1 config|schema. Are you 
using SOLR 3.1?

I'm pretty sure that the termIndexInterval (ratio of tii file to tis file) is 
in the 1.4.1 example solrconfig.xml file, although I don't have a copy to check 
at the moment.  We are using a 3.1 dev version.  As for the 
termInfosIndexDivisor, I'm also pretty sure it works with 1.4.1, but you 
might have to ask the list to be sure.  As you can see from the blog posts, 
those settings really reduced our memory requirements.  We haven't been doing 
faceting so we expect memory use to go up again once we add faceting, but at 
least we are starting at a 4GB baseline instead of a 20-32GB baseline.
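
For reference, the settings in question look roughly like this in 
solrconfig.xml (check the commented-out examples in your version's 
solrconfig.xml for the exact element names):

  <!-- read-time: only load every Nth entry of the tii terms index -->
  <indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory">
    <int name="setTermIndexDivisor">8</int>
  </indexReaderFactory>

  <!-- index-time: inside the <indexDefaults>/<mainIndex> section -->
  <termIndexInterval>1024</termIndexInterval>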

Did you do logical sharding or document hash based?

On the indexing side we just assign documents to a particular shard on a round 
robin basis and use a database to keep track of which document is in which 
shard, so if we need to update it we update the right shard (see the "Forty 
days" article on the blog for a more detailed description and some diagrams).  
We hope that this distributes the documents evenly enough to avoid problems 
with Solr's lack of global idf.

Do you have a load balancer between the front SOLR (or front entity) and shards?

As far as load balancing which shard is the head shard/front shard, again, our 
app layer just randomly picks one of the shards to be the head shard.  We 
originally were going to do tests to determine if it was better to have one 
dedicated machine configured to be the head shard, but never got around to 
that.  We have a very low query request rate, so haven't had to seriously look 
at load balancing.

do you do merging? 

I'm not sure what you mean by "do you do merging".  We are just using the 
default Solr distributed search.  In theory our documents should be randomly 
distributed among the shards so the lack of global idf should not hurt the 
merging process.  Andrzej Bialecki gave a recent presentation on Solr 
distributed search that talks about less than optimal results merging and some 
ideas for dealing with it:
http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/AndrzejBialecki-Buzzwords-2011_0.pdf

Each shard currently is allocated max 12GB memory. 
I'm curious about how much memory you leave to the OS for disk caching.  Can 
you give any details about the number of shards per machine and the total 
memory on the machine?


Tom Burton-West
 http://www.hathitrust.org/blogs/large-scale-search




From: Dmitry Kan [dmitry@gmail.com]
Sent: Tuesday, June 14, 2011 2:15 PM
To: solr-user@lucene.apache.org
Subject: Re: huge shards (300GB each) and load balancing

Hi Tom,

Thanks a lot for sharing this. We have about half a terabyte total index
size, and we have split our index over 10 shards (horizontal scaling, no
replication). Each shard currently is allocated max 12GB memory. We use
facet search a lot and non-facet search with parameter values generated by
facet search (hence more focused search that hits small portion of solr
documents).

The parameters you have mentioned -- termInfosIndexDivisor and
termIndexInterval -- are not found in the solr 1.4.1 config|schema. Are you
using SOLR 3.1? Did you do logical sharding or document hash based? Do
you have a load balancer between the front SOLR (or front entity) and shards,
and do you do merging?





RE: FastVectorHighlighter and hl.fragsize parameter set to zero causes exception

2011-06-11 Thread Burton-West, Tom
Thank you Koji,

I'll take a look at SingleFragListBuilder, LUCENE-2464,  and SOLR-1985, and I 
will update the wiki on Monday.
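
From a quick look at SOLR-1985, wiring it in appears to be a matter of 
registering it as a fragListBuilder in solrconfig.xml and selecting it per 
request -- something like the following (untested, so I will verify it before 
touching the wiki):

  <fragListBuilder name="single" class="org.apache.solr.highlight.SingleFragListBuilder"/>

and then hl.fragListBuilder=single along with hl.useFastVectorHighlighter=true 
on the request.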

Tom


There is SingleFragListBuilder for this purpose. Please see:

https://issues.apache.org/jira/browse/LUCENE-2464

 3)  Are there other parameters listed in the wiki that should have a note 
 indicating whether they apply to only the regular highlighter or the FVH?

FVH doesn't support (or makes no sense):

fragsize=0, mergeContiguous, maxAnalyzedChars, formatter, 
simple.pre/simple.post,
fragmenter, highlightMultiTerm, regex.*

and FVH supports requireFieldMatch but doesn't support per-field override for it.

If you would update the wiki, it would definitely be very helpful.

Thank you!

koji
--
http://www.rondhuit.com/en/


FastVectorHighlighter and hl.fragsize parameter set to zero causes exception

2011-06-10 Thread Burton-West, Tom
According to the documentation on the Solr wiki page, setting the hl.fragsize 
parameter to  0 indicates that the whole field value should be used (no 
fragmenting).   However, the FastVectorHighlighter throws an exception:

message "fragCharSize(0) is too small. It must be 18 or higher." 
java.lang.IllegalArgumentException: fragCharSize(0) is too small. It must be 18 
or higher. at 
org.apache.lucene.search.vectorhighlight.SimpleFragListBuilder.createFieldFragList(SimpleFragListBuilder.java:36)
 at 
org.apache.lucene.search.vectorhighlight.FastVectorHighlighter.getFieldFragList(FastVectorHighlighter.java:177)
 at 
org.apache.lucene.search.vectorhighlight.FastVectorHighlighter.getBestFragments(FastVectorHighlighter.java:166)
 at

I noticed that in SOLR-1268 there was an attempt to fix this problem but it 
apparently did not work.

1)  Is there any plan to implement a feature in FastVectorHighlighter that 
would behave the same as the regular Highlighter, i.e. return the whole field 
value when hl.fragsize=0?
2)  Should I edit the wiki entry to indicate that the hl.fragsize=0 does 
not work with FVH?
3)  Are there other parameters listed in the wiki that should have a note 
indicating whether they apply to only the regular highlighter or the FVH?

Tom Burton-West



RE: Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-09 Thread Burton-West, Tom
Hi Koji,


Thank you for your reply.

 It is the feature of FVH. FVH supports TermQuery, PhraseQuery, BooleanQuery 
 and DisjunctionMaxQuery
 and Query constructed by those queries.

Sorry, I'm not sure I understand.  Are you saying that FVH supports MultiTerm 
highlighting?  

Tom



RE: huge shards (300GB each) and load balancing

2011-06-08 Thread Burton-West, Tom
Hi Dmitry,

I am assuming you are splitting one very large index over multiple shards 
rather than replicating an index multiple times.

Just for a point of comparison, I thought I would describe our experience with 
large shards. At HathiTrust, we run a 6 terabyte index over 12 shards.  This is 
split over 4 machines with 3 shards per machine and our shards are about 
400-500GB.  We get average response times of around 200 ms with the 99th 
percentile queries up around 1-2 seconds. We have a very low qps rate, i.e. 
less than 1 qps.  We also index offline on a separate machine and update the 
indexes nightly.

Some of the issues we have found with very large shards are:
1) Because of the very large shard size, I/O tends to be the bottleneck, with 
phrase queries containing common words being the slowest.
2) Because of the I/O issues, running cache-warming queries to get postings into 
the OS disk cache is important, as is leaving significant free memory for the OS 
to use for disk caching.
3) Because of the I/O issues, using stop words or CommonGrams produces a 
significant performance increase (see the schema sketch below).
4) We have a huge number of unique terms in our indexes.  In order to reduce 
the amount of memory needed by the in-memory terms index we set the 
termInfosIndexDivisor to 8, which causes Solr to only load every 8th term from 
the tii file into memory. This reduced memory use from over 18GB to below 3GB 
and got rid of 30-second stop-the-world Java garbage collections. (See 
http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again for 
details.)  We later ran into memory problems when indexing, so we instead changed 
the index-time parameter termIndexInterval from 128 to 1024.
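
A rough sketch of the kind of field type involved for point 3 (the words file 
and the rest of the analyzer chain here are only illustrative; ours is described 
on the blog):

  <fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>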

(More details here: http://www.hathitrust.org/blogs/large-scale-search)

Tom Burton-West



Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-08 Thread Burton-West, Tom
We are trying to implement highlighting for wildcard (MultiTerm) queries.  This 
seems to work fine with the regular highlighter, but when we try to use the 
fastVectorHighlighter we don't see any results in the  highlighting section of 
the response.  Appended below are the parameters we are using.

Tom Burton-West

query:
<str name="q">ocr:tink*</str>

highlighting params:
<str name="hl.highlightMultiTerm">true</str>
<str name="hl.fragsize">200</str>
<str name="hl.useFastVectorHighlighter">true</str>
<str name="hl.snippets">200</str>
<str name="hl.fragmentsBuilder">colored</str>
<str name="hl.fragListBuilder">simple</str>
<str name="hl.fl">ocr</str>
<str name="hl.usePhraseHighlighter">true</str>
<str name="hl">true</str>



RE: Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-08 Thread Burton-West, Tom
Hi Erick,

Thanks for asking, yes we have termVectors=true set:

<fieldType name="FullText" class="solr.TextField" positionIncrementGap="100" 
autoGeneratePhraseQueries="false" stored="true" termVectors="true" 
termPositions="true" termOffsets="true">

I guess I should also mention that highlighting works fine using the 
FastVectorHighlighter as long as we don't do a MultiTerm query.  For example, 
see the query and results appended below (using the same hl parameters listed 
in the previous email).


Tom

<str name="q">ocr:tinkham</str>

<lst name="highlighting">
  <lst name="mdp.39015015394847_24">
    <arr name="ocr">
      <str>
 John <b style="background:#00">Tinkham</b>, who
married Miss Mallie Kingsbury; Mr. William Ash-
ley, and Mr. Leavitt, who, I believe, built the big
stone house, now left high and dry by itself, on
the top of Lyon street hill. As 
      </str>
    </arr>
  </lst>
</lst>

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, June 08, 2011 4:56 PM
To: solr-user@lucene.apache.org
Subject: Re: Does MultiTerm highlighting work with the fastVectorHighlighter?

Just to check, does the field have termVectors=true set?
I think it's required for FVH to work.
Best
Erick



RE: 400 MB Fields

2011-06-07 Thread Burton-West, Tom
Hi Otis, 

Our OCR fields average around 800 KB.  My guess is that the largest docs we 
index (in a single OCR field) are somewhere between 2 and 10MB.  We have had 
issues where the in-memory representation of the document (the in-memory index 
structures being built) is several times the size of the text, so I would 
suspect that even with the largest ramBufferSizeMB you might run into problems.  
(This is with the 3.x branch.  Trunk might not have this problem since it's 
much more memory efficient when indexing.)

Tom Burton-West
www.hathitrust.org/blogs

From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
Sent: Tuesday, June 07, 2011 6:59 PM
To: solr-user@lucene.apache.org
Subject: 400 MB Fields

Hello,

What are the biggest document fields that you've ever indexed in Solr or that
you've heard of?  Ah, it must be Tom's Hathi trust. :)

I'm asking because I just heard of a case of an index where some documents
having a field that can be around 400 MB in size!  I'm curious if anyone has any
experience with such monster fields?
Crazy?  Yes, sure.
Doable?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



filter cache and negative filter query

2011-05-17 Thread Burton-West, Tom
If I have a query with a filter query such as q=art&fq=history and then 
run a second query q=art&fq=-history, will Solr realize that it can use the 
cached results of the previous filter query "history" (in the filter cache), or 
will it not realize this and have to actually run a second filter query against 
the index for "not history"?

Tom



RE: CommonGrams indexing very slow!

2011-04-27 Thread Burton-West, Tom
Hi Salman,

Sounds like somehow you are triggering merges or optimizes.  What is your 
mergeFactor?  

Have you turned on the IndexWriter log?

In solrconfig.xml:
<infoStream file="${solr.indexwriter.log.dir}">true</infoStream>

In our case we feed the directory name as a Java property in our java startup 
script, but you can also hard code where you want the log written, like in the 
current example Solr config:

<infoStream file="INFOSTREAM.txt">false</infoStream>

That should provide some clues.  For example you can see how many segments of 
each level there are just before you do the commit that triggers the problem.   
My first guess is that you have enough segments so that adding the documents 
and committing triggers a cascading merge. (But this is a WAG without seeing 
what's in your indexwriter log)

Can you also send your solrconfig so we can see your mergeFactor and 
ramBufferSizeMB settings?

Tom

  All,
 
  We have created index with CommonGrams and the final size is around
 370GB.
  Everything is working fine but now when we add more documents into index
 it
  takes forever (almost 12 hours)...seems to change all the segments file
 in a
  commit.
 
  The same commit used to take few mins with normal index.
 
  Any idea whats going on?
 
  --
  Regards,
 
  Salman Akram
  Principal Software Engineer - Tech Lead
  NorthBay Solutions
  410-G4 Johar Town, Lahore
  Off: +92-42-35290152
 
  Cell: +92-321-4391210 -- +92-300-4009941
 




-- 
Regards,

Salman Akram


RE: CommonGrams indexing very slow!

2011-04-27 Thread Burton-West, Tom
Hi Salman,

We had a similar problem with the IndexMergeTool in Lucene contrib.
I seem to remember having to hack the IndexMergeTool code so that it wouldn't 
create the CFF automatically.
Let me know if you need it and I'll dig up the modified code.



Tom.

-Original Message-
From: Salman Akram [mailto:salman.ak...@northbaysolutions.net] 
Sent: Wednesday, April 27, 2011 1:43 PM
To: solr-user@lucene.apache.org
Subject: Re: CommonGrams indexing very slow!

Thanks for the response. We got it resolved!

We made small indexes in bulk using SOLR with Standard File Format and then
merged it with a Lucene app which for some reason made it CFS. Now when we
started adding real time documents using SOLR (with Compound File Format set
to false) it was merging with every commit!

We just set the CFF to true and now it's normal. Weird but that's how it got
resolved.

BTW, any idea why this is happening, and if we now optimize it using SFF will it
be fine in future with CFF=false?

P.S: Increasing the MergeFactor didn't even work.

On Wed, Apr 27, 2011 at 10:09 PM, Burton-West, Tom tburt...@umich.eduwrote:

 Hi Salman,

 Sounds like somehow you are triggering merges or optimizes.  What is your
 mergeFactor?

 Have you turned on the IndexWriter log?

 In solrconfig.xml
  <infoStream file="${solr.indexwriter.log.dir}">true</infoStream>

  In our case we feed the directory name as a Java property in our java
 startup script , but you can also hard code where you want the log written
 like in the current example Solr config:

  <infoStream file="INFOSTREAM.txt">false</infoStream>

 That should provide some clues.  For example you can see how many segments
 of each level there are just before you do the commit that triggers the
 problem.   My first guess is that you have enough segments so that adding
 the documents and committing triggers a cascading merge. (But this is a WAG
 without seeing what's in your indexwriter log)

 Can you also send your solrconfig so we can see your mergeFactor and
 ramBufferSizeMB settings?

 Tom

   All,
  
   We have created index with CommonGrams and the final size is around
  370GB.
   Everything is working fine but now when we add more documents into
 index
  it
   takes forever (almost 12 hours)...seems to change all the segments file
  in a
   commit.
  
   The same commit used to take few mins with normal index.
  
   Any idea whats going on?
  
   --
   Regards,
  
   Salman Akram
   Principal Software Engineer - Tech Lead
   NorthBay Solutions
   410-G4 Johar Town, Lahore
   Off: +92-42-35290152
  
   Cell: +92-321-4391210 -- +92-300-4009941
  
 



 --
 Regards,

 Salman Akram




-- 
Regards,

Salman Akram


RE: TermsCompoment + Dist. Search + Large Index + HEAP SPACE

2011-04-26 Thread Burton-West, Tom
Don't know your use case, but if you just want a list of the 400 most common 
words you can use the Lucene contrib HighFreqTerms.java with the -t flag.  
You have to point it at your lucene index.  You also probably don't want Solr 
to be running and want to give the JVM running HighFreqTerms a lot of memory.

http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java?view=log
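
An invocation looks roughly like this (the jar names, index path, and field 
name are placeholders -- check the usage message the class prints):

  java -Xmx4g -cp lucene-core.jar:lucene-misc.jar org.apache.lucene.misc.HighFreqTerms /path/to/index -t 400 ocr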

Tom
http://www.hathitrust.org/blogs/large-scale-search

-Original Message-
From: mdz-munich [mailto:sebastian.lu...@bsb-muenchen.de] 
Sent: Tuesday, April 26, 2011 9:29 AM
To: solr-user@lucene.apache.org
Subject: TermsCompoment + Dist. Search + Large Index + HEAP SPACE

Hi!

We've got one index split into 4 shards á 70.000 records of large
full-text data from (very dirty) OCR. Thus we got a lot of unique terms. 
Now we try to obtain the first 400 most common words for the CommonGramsFilter
via TermsComponent, but the request always runs out of memory. The VM is
equipped with 32 GB of RAM, 16-26 GB allocated to the Java VM. 

Any ideas how to get the most common terms without increasing the VM's memory?
 
Thanks  best regards,

Sebastian 



RE: QUESTION: SOLR INDEX BIG FILE SIZES

2011-04-18 Thread Burton-West, Tom
 As far as I know, Solr will never arrive to a segment file greater than 2GB,
so this shouldn't be a problem.

Solr can easily create a file size over 2GB; it just depends on how much data 
you index and your particular Solr configuration, including your 
ramBufferSizeMB, your mergeFactor, and whether you optimize.  For example we 
index about a terabyte of full text and optimize our indexes so we have a 300GB 
*prx file.  If you really have a filesystem limit of 2GB, there is a parameter 
called maxMergeMB in Solr 3.1 that you can set.  Unfortunately it is the 
maximum size of a segment that will be merged rather than the maximum size of 
the resulting segment.  So if you have a mergeFactor of 10 you could probably 
set it somewhere around (2GB / 10)= 200.  Just to be cautious, you might want 
to set it to 100.  

<mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
  <double name="maxMergeMB">200</double>
</mergePolicy>

In the flexible indexing branch/trunk there is a new merge policy and parameter 
that allows you to set the maximum size of the merged segment: 
https://issues.apache.org/jira/browse/LUCENE-854. 


Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

-Original Message-
From: Juan Grande [mailto:juan.gra...@gmail.com] 
Sent: Friday, April 15, 2011 5:15 PM
To: solr-user@lucene.apache.org
Subject: Re: QUESTION: SOLR INDEX BIG FILE SIZES

Hi John,

How can I split the file of the Solr index into multiple files?


Actually, the index is organized in a set of files called segments. It's not
just a single file, unless you tell Solr to do so.

That's because some file systems only support a maximum amount
 of space in a single file; for example, some UNIX file systems only support
 a maximum of 2GB per file.


As far as I know, Solr will never arrive to a segment file greater than 2GB,
so this shouldn't be a problem.

What is the recommended storage strategy for big Solr index files?


I guess that it depends on the indexing/querying performance that you're
having, the performance that you want, and what "big" exactly means for you.
If your index is so big that individual queries take too long, sharding may
be what you're looking for.

To better understand the index format, you can see
http://lucene.apache.org/java/3_1_0/fileformats.html

Also, you can take a look at my blog (http://juanggrande.wordpress.com), in
my last post I speak about segments merging.

Regards,

*Juan*


2011/4/15 JOHN JAIRO GÓMEZ LAVERDE jjai...@hotmail.com


 SOLR
 USER SUPPORT TEAM

 I have a question about the maximum file size of the Solr index,
 when I have a lot of data in the Solr index:

 - How can I split the file of the Solr index into multiple files?

 That's because some file systems only support a maximum amount
 of space in a single file; for example, some UNIX file systems only support
 a maximum of 2GB per file.

 - What is the recommended storage strategy for big Solr index files?

 Thanks for the reply.

 JOHN JAIRO GÓMEZ LAVERDE
 Bogotá - Colombia - South America


RE: Understanding the DisMax tie parameter

2011-04-15 Thread Burton-West, Tom
Thanks everyone.

I updated the wiki.  If you have a chance please take a look and check to make 
sure I got it right on the wiki.

http://wiki.apache.org/solr/DisMaxQParserPlugin#tie_.28Tie_breaker.29

Tom



-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Thursday, April 14, 2011 5:41 PM
To: solr-user@lucene.apache.org; yo...@lucidimagination.com
Cc: Burton-West, Tom
Subject: Re: Understanding the DisMax tie parameter


: Perhaps the parameter could have had a better name.  It's essentially
: max(score of matching clauses) + tie * (score of matching clauses that
: are not the max)
: 
: So it can be used and thought of as a tiebreak only in the sense that
: if two docs match a clause (with essentially the same score), then a
: small tie value will act as a tiebreaker *if* one of those docs also
: matches some other fields.

correct.  w/o a tiebreaker value, a dismax query will only look at the 
maximum scoring clause for each doc -- the tie param is named for its 
ability to help break ties when multiple documents have the same score 
from the max scoring clause -- by adding in a small portion of the scores 
(based on the 0-1 ratio of the tie param) from the other clauses.
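
A quick made-up example: if a doc's best matching clause scores 4.0 and two 
other clauses match at 1.0 and 0.5, then with tie=0.0 the doc scores 4.0, with 
tie=0.1 it scores 4.0 + 0.1*(1.0 + 0.5) = 4.15, and with tie=1.0 it scores the 
full sum, 5.5.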


-Hoss


Understanding the DisMax tie parameter

2011-04-14 Thread Burton-West, Tom
Hello,

I'm having trouble understanding the relationship of the words "tie" and 
"tiebreaker" to the explanation of this parameter on the wiki.
What two (or more) things are in a tie, and how does the number in the range 
from 0 to 1 break the tie?

http://wiki.apache.org/solr/DisMaxQParserPlugin#tie_.28Tie_breaker.29

A value of 0.0 makes the query a pure disjunction max query -- only the 
maximum scoring sub query contributes to the final score. A value of 1.0 
makes the query a pure disjunction sum query where it doesn't matter what the 
maximum scoring sub query is, the final score is the sum of the sub scores. 
Typically a low value (ie: 0.1) is useful.

Tom Burton-West



RE: ArrayIndexOutOfBoundsException with facet query

2011-04-11 Thread Burton-West, Tom
Thanks Mike,

At first I thought this couldn't be related to the 2.1 Billion terms issue 
since the only place we have tons of terms is in the OCR field and this is not 
the OCR field. But then I remembered that the total number of terms in all 
fields is what matters. We've had no problems with regular searches against the 
index or with other facet queries.  Only with this facet.   Is TermInfoAndOrd 
only used for faceting?

I'll go ahead and build the patch and let you know.


Tom

p.s. Here is the field definition:
<field name="topicStr" type="string" indexed="true" stored="false" 
multiValued="true"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" 
omitNorms="true"/>


-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Monday, April 11, 2011 8:40 AM
To: solr-user@lucene.apache.org
Cc: Burton-West, Tom
Subject: Re: ArrayIndexOutOfBoundsException with facet query

Tom,

I think I see where this may be -- it looks like another > 2B terms
bug in Lucene (we are using an int instead of a long in the
TermInfoAndOrd class inside TermInfosReader.java), only present in
3.1.

I'm also mad that Test2BTerms fails to catch this!!  I will go fix
that test and confirm it sees this bug.

Can you build from source?  If so, try this patch:

Index: lucene/src/java/org/apache/lucene/index/TermInfosReader.java
===
--- lucene/src/java/org/apache/lucene/index/TermInfosReader.java
(revision
1089906)
+++ lucene/src/java/org/apache/lucene/index/TermInfosReader.java
(working copy)
@@ -46,8 +46,8 @@

   // Just adds term's ord to TermInfo
   private final static class TermInfoAndOrd extends TermInfo {
-final int termOrd;
-public TermInfoAndOrd(TermInfo ti, int termOrd) {
+final long termOrd;
+public TermInfoAndOrd(TermInfo ti, long termOrd) {
   super(ti);
   this.termOrd = termOrd;
 }
@@ -245,7 +245,7 @@
 // wipe out the cache when they iterate over a large numbers
 // of terms in order
 if (tiOrd == null) {
-  termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
enumerator.position));
+  termsCache.put(cacheKey, new TermInfoAndOrd(ti,
enumerator.position));
 } else {
   assert sameTermInfo(ti, tiOrd, enumerator);
   assert (int) enumerator.position == tiOrd.termOrd;
@@ -262,7 +262,7 @@
 // random-access: must seek
 final int indexPos;
 if (tiOrd != null) {
-  indexPos = tiOrd.termOrd / totalIndexInterval;
+  indexPos = (int) (tiOrd.termOrd / totalIndexInterval);
 } else {
   // Must do binary search:
   indexPos = getIndexOffset(term);
@@ -274,7 +274,7 @@
 if (enumerator.term() != null  term.compareTo(enumerator.term()) == 0) {
   ti = enumerator.termInfo();
   if (tiOrd == null) {
-termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
enumerator.position));
+termsCache.put(cacheKey, new TermInfoAndOrd(ti, enumerator.position));
   } else {
 assert sameTermInfo(ti, tiOrd, enumerator);
 assert (int) enumerator.position == tiOrd.termOrd;

Mike

http://blog.mikemccandless.com

On Fri, Apr 8, 2011 at 4:53 PM, Burton-West, Tom tburt...@umich.edu wrote:
 The query below results in an array out of bounds exception:
 select/?q=solr&version=2.2&start=0&rows=0&facet=true&facet.field=topicStr

 Here is the exception:
  Exception during facet.field of 
 topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
        at 
 org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)

 We are using a dev version of Solr/Lucene:

 Solr Specification Version: 3.0.0.2010.11.19.16.00.54
 Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
 Lucene Specification Version: 3.1-SNAPSHOT
 Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10

 Just before the exception we see this entry in our tomcat logs:

 Apr 8, 2011 2:01:58 PM org.apache.solr.request.UnInvertedField uninvert
 INFO: UnInverted multi-valued field 
 {field=topicStr,memSize=7675174,tindexSize=289102,time=2577,phase1=2537,nTerms=498975,bigTerms=0,termInstances=1368694,uses=0}
 Apr 8, 2011 2:01:58 PM org.apache.solr.core.SolrCore execute

 Is this a known bug?  Can anyone provide a clue as to how we can determine 
 what the problem is?

 Tom Burton-West


 Appended Below is the exception stack trace:

 SEVERE: Exception during facet.field of 
 topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
        at 
 org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)
        at 
 org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:271)
        at 
 org.apache.lucene.index.TermInfosReader.terms(TermInfosReader.java:338)
        at org.apache.lucene.index.SegmentReader.terms(SegmentReader.java:928)
        at 
 org.apache.lucene.index.DirectoryReader$MultiTermEnum.init

RE: ArrayIndexOutOfBoundsException with facet query

2011-04-11 Thread Burton-West, Tom
Thanks Mike,

With the unpatched version, the first time I run the facet query on topicStr it 
works fine, but the second time I get the ArrayIndexOutOfBoundsException.   If 
I try different facets such as language, I don't see the same symptoms.  Maybe 
the number of facet values needs to exceed some number to trigger the bug?

I rebuilt lucene-core-3.1-SNAPSHOT.jar  with your patch and it fixes the 
problem. 


Tom

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Monday, April 11, 2011 1:00 PM
To: Burton-West, Tom
Cc: solr-user@lucene.apache.org
Subject: Re: ArrayIndexOutOfBoundsException with facet query

Right, it's the total number of terms across all fields... unfortunately.

This class is used to enroll a term into the terms cache that wraps
the terms dictionary, so in theory you could also hit this issue
during normal searching when a term is looked up once,  and then
looked up again (the 2nd time will pull from the cache).

I've mod'd Test2BTerms and am running it now...

Mike

http://blog.mikemccandless.com

On Mon, Apr 11, 2011 at 12:51 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Thanks Mike,

 At first I thought this couldn't be related to the 2.1 Billion terms issue 
 since the only place we have tons of terms is in the OCR field and this is 
 not the OCR field. But then I remembered that the total number of terms in 
 all fields is what matters. We've had no problems with regular searches 
 against the index or with other facet queries.  Only with this facet.   Is 
 TermInfoAndOrd only used for faceting?

 I'll go ahead and build the patch and let you know.


 Tom

 p.s. Here is the field definition:
 <field name="topicStr" type="string" indexed="true" stored="false" multiValued="true"/>
 <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>


 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Monday, April 11, 2011 8:40 AM
 To: solr-user@lucene.apache.org
 Cc: Burton-West, Tom
 Subject: Re: ArrayIndexOutOfBoundsException with facet query

 Tom,

 I think I see where this may be -- it looks like another > 2B terms
 bug in Lucene (we are using an int instead of a long in the
 TermInfoAndOrd class inside TermInfosReader.java), only present in
 3.1.

 I'm also mad that Test2BTerms fails to catch this!!  I will go fix
 that test and confirm it sees this bug.

 Can you build from source?  If so, try this patch:

 Index: lucene/src/java/org/apache/lucene/index/TermInfosReader.java
 ===
 --- lucene/src/java/org/apache/lucene/index/TermInfosReader.java        
 (revision
 1089906)
 +++ lucene/src/java/org/apache/lucene/index/TermInfosReader.java        
 (working copy)
 @@ -46,8 +46,8 @@

   // Just adds term's ord to TermInfo
   private final static class TermInfoAndOrd extends TermInfo {
 -    final int termOrd;
 -    public TermInfoAndOrd(TermInfo ti, int termOrd) {
 +    final long termOrd;
 +    public TermInfoAndOrd(TermInfo ti, long termOrd) {
       super(ti);
       this.termOrd = termOrd;
     }
 @@ -245,7 +245,7 @@
             // wipe out the cache when they iterate over a large numbers
             // of terms in order
             if (tiOrd == null) {
 -              termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
 enumerator.position));
 +              termsCache.put(cacheKey, new TermInfoAndOrd(ti,
 enumerator.position));
             } else {
               assert sameTermInfo(ti, tiOrd, enumerator);
               assert (int) enumerator.position == tiOrd.termOrd;
 @@ -262,7 +262,7 @@
     // random-access: must seek
     final int indexPos;
     if (tiOrd != null) {
 -      indexPos = tiOrd.termOrd / totalIndexInterval;
 +      indexPos = (int) (tiOrd.termOrd / totalIndexInterval);
     } else {
       // Must do binary search:
       indexPos = getIndexOffset(term);
 @@ -274,7 +274,7 @@
      if (enumerator.term() != null && term.compareTo(enumerator.term()) == 0) {
       ti = enumerator.termInfo();
       if (tiOrd == null) {
 -        termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
 enumerator.position));
 +        termsCache.put(cacheKey, new TermInfoAndOrd(ti, 
 enumerator.position));
       } else {
         assert sameTermInfo(ti, tiOrd, enumerator);
         assert (int) enumerator.position == tiOrd.termOrd;

 Mike

 http://blog.mikemccandless.com

 On Fri, Apr 8, 2011 at 4:53 PM, Burton-West, Tom tburt...@umich.edu wrote:
 The query below results in an array out of bounds exception:
 select/?q=solr&version=2.2&start=0&rows=0&facet=true&facet.field=topicStr

 Here is the exception:
  Exception during facet.field of 
 topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
        at 
 org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)

 We are using a dev version of Solr/Lucene:

 Solr Specification Version: 3.0.0.2010.11.19.16.00.54
 Solr Implementation Version: 3.1

ArrayIndexOutOfBoundsException with facet query

2011-04-08 Thread Burton-West, Tom
The query below results in an array out of bounds exception:
select/?q=solr&version=2.2&start=0&rows=0&facet=true&facet.field=topicStr

Here is the exception:
 Exception during facet.field of 
topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
at 
org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)

We are using a dev version of Solr/Lucene:

Solr Specification Version: 3.0.0.2010.11.19.16.00.54
Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
Lucene Specification Version: 3.1-SNAPSHOT
Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10

Just before the exception we see this entry in our tomcat logs:

Apr 8, 2011 2:01:58 PM org.apache.solr.request.UnInvertedField uninvert
INFO: UnInverted multi-valued field 
{field=topicStr,memSize=7675174,tindexSize=289102,time=2577,phase1=2537,nTerms=498975,bigTerms=0,termInstances=1368694,uses=0}
Apr 8, 2011 2:01:58 PM org.apache.solr.core.SolrCore execute

Is this a known bug?  Can anyone provide a clue as to how we can determine what 
the problem is?

Tom Burton-West


Appended Below is the exception stack trace:

SEVERE: Exception during facet.field of 
topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
at 
org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:271)
at 
org.apache.lucene.index.TermInfosReader.terms(TermInfosReader.java:338)
at org.apache.lucene.index.SegmentReader.terms(SegmentReader.java:928)
at 
org.apache.lucene.index.DirectoryReader$MultiTermEnum.<init>(DirectoryReader.java:1055)
at 
org.apache.lucene.index.DirectoryReader.terms(DirectoryReader.java:659)
at 
org.apache.solr.search.SolrIndexReader.terms(SolrIndexReader.java:302)
at 
org.apache.solr.request.NumberedTermEnum.skipTo(UnInvertedField.java:1018)
at 
org.apache.solr.request.UnInvertedField.getTermText(UnInvertedField.java:838)
at 
org.apache.solr.request.UnInvertedField.getCounts(UnInvertedField.java:617)
at 
org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:279)
at 
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:312)
at 
org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:174)
at 
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1354)



RE: Using Solr over Lucene effects performance?

2011-03-14 Thread Burton-West, Tom
+1 on some kind of simple performance framework that would allow comparing Solr 
vs Lucene.  Any chance the Lucene benchmark programs in contrib could be 
adapted to read Solr config information?
BTW: You probably want to empty the OS cache in addition to restarting Solr 
between each run if the index is large enough so disk I/O is a factor.

Tom


-Original Message-
From: Glen Newton [mailto:glen.new...@gmail.com] 
Sent: Friday, March 11, 2011 5:28 PM
To: solr-user@lucene.apache.org; yo...@lucidimagination.com
Cc: sivaram
Subject: Re: Using Solr over Lucene effects performance?

I have seen little repeatable empirical evidence for the usual answer
"mostly no".

With respect: everyone in the Solr universe seems to answer this
question in the way Yonik has.
However, with a large number of requests the XML
serialization/deserialization must have some, likely significant,
impact.

Yonik makes a valid point, which I will generalize to: for some
combinations of #docs, #queries, doc size, network, hardware, disk, etc.
it will have an impact, and for others it will be less important.

Is there any chance that a simple performance framework could be
created in Solr, which runs queries directly against Solr, as well as
against the underlying Lucene index directly?
1 - Text file with one query per line (isn't there a tool out there
that will generate random queries based on a given index? Sorry, my
google fails me...)
2 - Test application: Configuration file that defines the max#
parallel queries per second. The queries are run multiple times:
1,2,4,8,16,32...max# queries. Solr is restarted between each run.
These tests are run against:
   a) Solr local
   b) Solr across the network
   c) Lucene index directly, local
   d) Lucene index directly, across the network using RMI (RemoteSearchable)
3 - Generates a report showing the results

It should perhaps also allow a second file with fewer queries that is
used to warm the caches and is not included in the reporting.
Oh, the configuration file should also include the network information
for remote indexes.
The configuration file could also include a parameter for the
probability that a query will be paged into a random 1..n pages, where
n is also a settable parameter.
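
As a rough sketch of the test-application piece (assuming a plain-text
queries.txt with one query per line and a local Solr select URL; the
real framework would add warm-up queries, paging, concurrency and the
Lucene-direct runs):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: read one query per line and time sequential requests against Solr.
public class SimpleSolrTimer {
  public static void main(String[] args) throws Exception {
    String solrSelect = "http://localhost:8983/solr/select?rows=10&q=";  // assumed URL
    List<String> queries = new ArrayList<String>();
    BufferedReader in = new BufferedReader(new FileReader("queries.txt"));
    for (String line; (line = in.readLine()) != null; ) {
      if (line.trim().length() > 0) queries.add(line.trim());
    }
    in.close();

    long start = System.nanoTime();
    for (String q : queries) {
      URL url = new URL(solrSelect + URLEncoder.encode(q, "UTF-8"));
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      InputStream body = conn.getInputStream();
      while (body.read() != -1) { /* drain the response so transfer time is included */ }
      body.close();
    }
    long elapsedMs = (System.nanoTime() - start) / 1000000L;
    System.out.println(queries.size() + " queries in " + elapsedMs + " ms");
  }
}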

Just thought a more empirical framework would help all of us, as
opposed to anecdotal evidence.

Thanks,
Glen
http://zzzoot.blogspot.com/

PS. If there is a good analysis of the performance cost in large scale
instances (many documents, many queries in parallel) of the XML
marshaling/demarshaling in Solr, please share it. -g

On Fri, Mar 11, 2011 at 4:48 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Fri, Mar 11, 2011 at 4:21 PM, sivaram yogendra.bopp...@gmail.com wrote:
 I searched for this but couldn't find a convincing answer.
 I'm planning to use Lucene/Solr in a tool for indexing and searching
 documents. I'm thinking of if I use Lucene directly instead of Solr, will it
 improves the performance of the search?(in terms of time taken for indexing
 or returning search results or if Solr slows down my application when
 compared to Lucene). I have worked with Solr in small scale before but this
 time I have to use for an index with over a million docs to get indexed and
 searched.

 On a small scale (hundreds of docs or so), Solr's overhead (parsing
 parameters, etc) could matter.
 When you scale up to larger indexes, it's in the noise (i.e. the
 actual computation of searching, faceting, highlighting, etc,
 dominate).

 -Yonik
 http://lucidimagination.com




-- 

-


RE: How to handle searches across traditional and simplifies Chinese?

2011-03-08 Thread Burton-West, Tom
This page discusses the reasons why it's not a simple one to one mapping

http://www.kanji.org/cjk/c2c/c2cbasis.htm
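
As a rough starting point (and only for the forms that do map cleanly),
ICU4J ships a Traditional-Simplified transliterator; a sketch like the
one below could fold query and index text to simplified forms, with the
caveat that it cannot resolve the ambiguous mappings the page above
describes:

import com.ibm.icu.text.Transliterator;

// Sketch: fold traditional characters to simplified before indexing/querying.
// Only handles the straightforward mappings; see the link above for the hard cases.
public class TradToSimp {
  public static void main(String[] args) {
    Transliterator toSimplified = Transliterator.getInstance("Traditional-Simplified");
    System.out.println(toSimplified.transliterate("類"));  // expected to print 类
  }
}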

Tom
-Original Message-
 I have documents that contain both simplified and traditional Chinese 
 characters. Is there any way to search across them? For example, if someone 
 searches for 类 (simplified Chinese), I'd like to be able to recognize that 
 the equivalent character is 類 in traditional Chinese and search for 类 or 類 in 
 the documents


Solr indexing socket timeout errors

2011-01-07 Thread Burton-West, Tom
Hello all,

We are getting intermittent socket timeout errors (see below).  Out of about 
600,000 indexing requests, 30 returned these socket timeout errors.  We haven't 
been able to correlate these with large merges, which tends to slow down the 
indexing response rate.

Does anyone know where we might look to determine the cause?

Tom

Tom Burton-West

Jan 7, 2011 2:31:07 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: [was class java.net.SocketTimeoutException] 
Read timed out
at 
com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
at 
com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:279)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:138)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1354)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
at 
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:548)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
at 
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
at 
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
at 
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
at 
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at 
org.apache.coyote.http11.InternalInputBuffer.fill(InternalInputBuffer.java:777)
at 
org.apache.coyote.http11.InternalInputBuffer$InputStreamInputBuffer.doRead(InternalInputBuffer.java:807)
at 
org.apache.coyote.http11.filters.IdentityInputFilter.doRead(IdentityInputFilter.java:116)
at 
org.apache.coyote.http11.InternalInputBuffer.doRead(InternalInputBuffer.java:742)
at org.apache.coyote.Request.doRead(Request.java:419)
at 
org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:270)
at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:403)
  at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:293)
at 
org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:193)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
at 
com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
at 
com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
at 
com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
at 
com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
at 
com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
... 24 more




RE: Memory use during merges (OOM)

2010-12-18 Thread Burton-West, Tom
Thanks Robert, 

We will try the termsIndexInterval as a workaround.   I have also opened a JIRA 
issue: https://issues.apache.org/jira/browse/SOLR-2290.
Hope I found the right sections of the Lucene code.  I'm just now in the 
process of looking at the Solr IndexReaderFactory and SolrIndexWriter and 
SolrIndexConfig  trying to better understand how solrconfig.xml gets 
instantiated and how it affects the readers and writers.

Tom

From: Robert Muir [rcm...@gmail.com]

On Thu, Dec 16, 2010 at 4:03 PM, Burton-West, Tom tburt...@umich.edu wrote:
Your setting isn't being applied to the reader IW uses during
merging... its only for readers Solr opens from directories
explicitly.
I think you should open a jira issue!

 Do I understand correctly that this setting in theory could be applied to the 
 reader IW uses during merging but is not currently being applied?

yes, i'm not really sure (especially given the name=) if you can/or
it was planned to have multiple IR factories in solr, e.g. a separate
one for spellchecking.
so i'm not sure if we should (hackishly) steal this parameter from the
IR factory (it is common to all IRFactories, not just
StandardIRFactory) and apply it to to IW..

but we could at least expose the divisor param separately to the IW
config so you have some way of setting it.


 <indexReaderFactory name="IndexReaderFactory"
     class="org.apache.solr.core.StandardIndexReaderFactory">
   <int name="termInfosIndexDivisor">8</int>
 </indexReaderFactory>

 I understand the tradeoffs for doing this during searching, but not the 
 trade-offs for doing this during merging.  Is the use during merging the 
 similar to the use during searching?

  i.e. Some process has to look up data for a particular term as opposed to 
 having to iterate through all the terms?
  (Haven't yet dug into the merging/indexing code).

it needs it for applying deletes...

as a workaround (if you are reindexing), maybe instead of using the
Terms Index Divisor=8 you could set the Terms Index Interval = 1024 (8
* 128) ?

this will solve your merging problem, and have the same perf
characteristics of divisor=8, except you cant go back down like you
can with the divisor without reindexing with a smaller interval...

if you've already tested that performance with the divisor of 8 is
acceptable, or in your case maybe necessary!, it sort of makes sense
to 'bake it in' by setting your divisor back to 1 and your interval =
1024 instead...
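
The arithmetic behind that, roughly (a back-of-the-envelope sketch with
a made-up term count; the write-time default interval is 128): the
number of index terms held in RAM is about totalTerms / (interval *
divisor), so interval=1024 with divisor=1 loads about as many as
interval=128 with divisor=8.

// Back-of-the-envelope sketch; the term count is a made-up round number.
public class TermsIndexEstimate {
  public static void main(String[] args) {
    long uniqueTerms = 2400000000L;  // hypothetical total across all fields
    System.out.println("interval=128,  divisor=8 -> ~" + uniqueTerms / (128L * 8) + " terms in RAM");
    System.out.println("interval=1024, divisor=1 -> ~" + uniqueTerms / (1024L * 1) + " terms in RAM");
  }
}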


RE: Memory use during merges (OOM)

2010-12-16 Thread Burton-West, Tom
Thanks Mike,

But, if you are doing deletions (or updateDocument, which is just a
delete + add under-the-hood), then this will force the terms index of
the segment readers to be loaded, thus consuming more RAM.

Out of 700,000 docs, by the time we get to doc 600,000, there is a good chance 
a few documents have been updated, which would cause a delete +add.  


One workaround for large terms index is to set the terms index divisor
.that IndexWriter should use whenever it loads a terms index (this is
IndexWriter.setReaderTermsIndexDivisor).

I always get confused about the two different divisors and their names in the 
solrconfig.xml file

We are setting  termInfosIndexDivisor, which I think translates to the Lucene 
IndexWriter.setReaderTermsIndexDivisor

<indexReaderFactory name="IndexReaderFactory"
    class="org.apache.solr.core.StandardIndexReaderFactory">
  <int name="termInfosIndexDivisor">8</int>
</indexReaderFactory>

The other one is termIndexInterval which is set on the writer and determines 
what gets written to the tii file.  I don't remember how to set this in Solr.

Are we setting the right one to reduce RAM usage during merging?


 So I think the gist is... the RAM usage will be in proportion to the
 net size of the merge (mergeFactor + how big each merged segment is),
 how many merges you allow concurrently, and whether you do false or
 true deletions

Does an optimize do something differently?  

Tom



RE: Memory use during merges (OOM)

2010-12-16 Thread Burton-West, Tom
Your setting isn't being applied to the reader IW uses during
merging... its only for readers Solr opens from directories
explicitly.
I think you should open a jira issue!

Do I understand correctly that this setting in theory could be applied to the 
reader IW uses during merging but is not currently being applied?   

<indexReaderFactory name="IndexReaderFactory"
    class="org.apache.solr.core.StandardIndexReaderFactory">
  <int name="termInfosIndexDivisor">8</int>
</indexReaderFactory>

I understand the tradeoffs for doing this during searching, but not the 
trade-offs for doing this during merging.  Is the use during merging the 
similar to the use during searching? 

 i.e. Some process has to look up data for a particular term as opposed to 
having to iterate through all the terms?  
 (Haven't yet dug into the merging/indexing code).   

Tom


-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 

 We are setting  termInfosIndexDivisor, which I think translates to the Lucene 
 IndexWriter.setReaderTermsIndexDivisor




Memory use during merges (OOM)

2010-12-15 Thread Burton-West, Tom
Hello all,

Are there any general guidelines for determining the main factors in memory use 
during merges?

We recently changed our indexing configuration to speed up indexing but in the 
process of doing a very large merge we are running out of memory.
Below is a list of the changes and part of the indexwriter log.  The changes 
increased the indexing though-put by almost an order of magnitude.
(about 600 documents per hour to about 6000 documents per hour.  Our documents 
are about 800K)

We are trying to determine which of the changes to tweak to avoid the OOM, but 
still keep the benefit of the increased indexing throughput

Is it likely that the changes to ramBufferSizeMB are the culprit or could it be 
the mergeFactor change from 10-20?

 Is there any obvious relationship between ramBufferSizeMB and the memory 
consumed by Solr?
 Are there rules of thumb for the memory needed in terms of the number or size 
of segments?

Our largest segments prior to the failed merge attempt were between 5GB and 
30GB.  The memory allocated to the Solr/tomcat JVM is 10GB.

Tom Burton-West
-

Changes to indexing configuration:
mergeScheduler
  before: serialMergeScheduler
  after:  concurrentMergeScheduler
mergeFactor
  before: 10
  after:  20
ramBufferSizeMB
  before: 32
  after:  320

excerpt from indexWriter.log

Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
http-8091-Processor70]: LMP: findMerges: 40 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
http-8091-Processor70]: LMP:   level 7.23609 to 7.98609: 20 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
http-8091-Processor70]: LMP: 0 to 20: add this merge
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
http-8091-Processor70]: LMP:   level 5.44878 to 6.19878: 20 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
http-8091-Processor70]: LMP: 20 to 40: add this merge

...
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
http-8091-Processor70]: applyDeletes
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
http-8091-Processor70]: DW: apply 1320 buffered deleted terms and 0 deleted 
docIDs and 0 deleted queries on 40 segments.
Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; 
http-8091-Processor70]: hit exception flushing deletes
Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; 
http-8091-Processor70]: hit OutOfMemoryError inside updateDocument
tom



access to environment variables in solrconfig.xml and/or schema.xml?

2010-12-13 Thread Burton-West, Tom
I see variables used to access java system properties in solrconfig.xml and 
schema.xml:

http://wiki.apache.org/solr/SolrConfigXml#System_property_substitution
<dataDir>${solr.data.dir:}</dataDir>
or
${solr.abortOnConfigurationError:true}

Is there a way to access environment variables or does everything have to be 
stuffed into a java system property?

Tom Burton-West





RE: ramBufferSizeMB not reflected in segment sizes in index

2010-12-02 Thread Burton-West, Tom
Hi Mike,

We turned on infostream.   Is there documentation about how to interpret it, or 
should I just grep through the codebase?

Is the excerpt below what I am looking for as far as understanding the 
relationship between ramBufferSize and size on disk?
is newFlushedSize the size on disk in bytes?


DW:   ramUsed=329.782 MB newFlushedSize=74520060 docs/MB=0.943 new/old=21.55%

RAM: now balance allocations: usedMB=325.997 vs trigger=320 deletesMB=0.048 byteBlockFree=0.125 perDocFree=0.006 charBlockFree=0
...
DW: after free: freedMB=0.225 usedMB=325.82
Dec 1, 2010 5:40:22 PM IW 0 [Wed Dec 01 17:40:22 EST 2010; 
http-8091-Processor12]: flush: now pause all indexing threads
Dec 1, 2010 5:40:22 PM IW 0 [Wed Dec 01 17:40:22 EST 2010; 
http-8091-Processor12]:   flush: segment=_5h docStoreSegment=_5e 
docStoreOffset=266 flushDocs=true flushDeletes=false 
flushDocStores=false numDocs=40 numBufDelTerms=40
... Dec 1, 2010 5:40:22 PM   purge field=geographic
Dec 1, 2010 5:40:22 PM   purge field=serialTitle_ab
Dec 1, 2010 5:40:33 PM IW 0 [Wed Dec 01 17:40:33 EST 2010; 
http-8091-Processor12]: DW:   ramUsed=325.772 MB newFlushedSize=69848046 
docs/MB=0.6 new/old=20.447%
Dec 1, 2010 5:40:33 PM IW 0 [Wed Dec 01 17:40:33 EST 2010; 
http-8091-Processor12]: flushedFiles=[_5h.frq, _5h.tis, _5h.prx, _5h.nrm, 
_5h.fnm, _5h.tii]



Tom


-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, December 01, 2010 3:43 PM
To: solr-user@lucene.apache.org
Subject: Re: ramBufferSizeMB not reflected in segment sizes in index

On Wed, Dec 1, 2010 at 3:16 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Thanks Mike,

 Yes we have many unique terms due to dirty OCR and 400 languages and probably 
 lots of low doc freq terms as well (although with the ICUTokenizer and 
 ICUFoldingFilter we should get fewer terms due to bad tokenization and 
 normalization.)

OK likely this explains the lowish RAM efficiency.

 Is this additional overhead because each unique term takes a certain amount 
 of space compared to adding entries to a list for an existing term?

Exactly.  There's a highish startup cost for each term but then
appending docs/positions to that term is more efficient especially for
higher frequency terms.  In the limit, a single unique term  across
all docs will have very high RAM efficiency...

 Does turning on IndexWriters infostream have a significant impact on memory 
 use or indexing speed?

I don't believe so

Mike


Solr 3x segments file and deleting index

2010-12-01 Thread Burton-West, Tom
If I want to delete an entire index and start over, in previous versions of 
Solr, you could stop Solr, delete all files in the index directory and restart 
Solr.  Solr would then create empty segments files and you could start 
indexing.   In Solr 3x if I delete all the files in the index  directory I get 
a large stack trace with this error:

org.apache.lucene.index.IndexNotFoundException: no segments* file found

As a workaround, whenever I delete an index (by deleting all files in the index 
directory), I copy the segments files that come with the Solr example to the 
index directory and then restart Solr.

Is this a feature or a bug?   What is the rationale?

Tom

Tom Burton-West



ramBufferSizeMB not reflected in segment sizes in index

2010-12-01 Thread Burton-West, Tom
We are using a recent Solr 3.x (See below for exact version).

We have set the ramBufferSizeMB to 320 in both the indexDefaults and the 
mainIndex sections of our solrconfig.xml:

<ramBufferSizeMB>320</ramBufferSizeMB>
<mergeFactor>20</mergeFactor>

We expected that this would mean that the index would not write to disk until 
it reached somewhere approximately over 300MB in size.
However, we see many small segments that look to be around 80MB in size.

We have not yet issued a single commit so nothing else should force a write to 
disk.

With a merge factor of 20 we also expected to see larger segments somewhere 
around 320 * 20 = 6GB in size, however we see several around 1GB.

We understand that the sizes are approximate, but these seem nowhere near what 
we expected.

Can anyone explain what is going on?

BTW
maxBufferedDocs is commented out, so this should not be affecting the buffer 
flushes
<!--<maxBufferedDocs>1000</maxBufferedDocs>-->


Solr Specification Version: 3.0.0.2010.11.19.16.00.54
Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
Lucene Specification Version: 3.1-SNAPSHOT
Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10

Tom Burton-West



RE: ramBufferSizeMB not reflected in segment sizes in index

2010-12-01 Thread Burton-West, Tom
Thanks Mike,

Yes we have many unique terms due to dirty OCR and 400 languages and probably 
lots of low doc freq terms as well (although with the ICUTokenizer and 
ICUFoldingFilter we should get fewer terms due to bad tokenization and 
normalization.)

Is this additional overhead because each unique term takes a certain amount of 
space compared to adding entries to a list for an existing term?

Does turning on IndexWriters infostream have a significant impact on memory use 
or indexing speed?  

If it does, I'll reproduce this on our test server rather than turning it on 
for a bit on the production indexer.  If it doesn't I'll turn it on and post 
here.

Tom

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, December 01, 2010 2:43 PM
To: solr-user@lucene.apache.org
Subject: Re: ramBufferSizeMB not reflected in segment sizes in index

The ram efficiency (= size of segment once flushed divided by size of
RAM buffer) can vary drastically.

Because the in-RAM data structures must be growable (to append new
docs to the postings as they are encountered), the efficiency is never
100%.  I think 50% is actually a good ram efficiency, and lower than
that (even down to 27%) I think is still normal.

Do you have many unique or low-doc-freq terms?  That brings the efficiency down.

If you turn on IndexWriter's infoStream and post the output we can see
if anything odd is going on...

80 * 20 = ~1.6 GB so I'm not sure why you're getting 1 GB segments.
Do you do any deletions in this run?  A merged segment size will often
be less than the sum of the parts, especially if there are many terms
that are shared across segments; the infoStream will
also show what merges are taking place.

Mike

On Wed, Dec 1, 2010 at 2:13 PM, Burton-West, Tom tburt...@umich.edu wrote:
 We are using a recent Solr 3.x (See below for exact version).

 We have set the ramBufferSizeMB to 320 in both the indexDefaults and the 
 mainIndex sections of our solrconfig.xml:

 <ramBufferSizeMB>320</ramBufferSizeMB>
 <mergeFactor>20</mergeFactor>

 We expected that this would mean that the index would not write to disk until 
 it reached somewhere approximately over 300MB in size.
 However, we see many small segments that look to be around 80MB in size.

 We have not yet issued a single commit so nothing else should force a write 
 to disk.

 With a merge factor of 20 we also expected to see larger segments somewhere 
 around 320 * 20 = 6GB in size, however we see several around 1GB.

 We understand that the sizes are approximate, but these seem nowhere near 
 what we expected.

 Can anyone explain what is going on?

 BTW
 maxBufferedDocs is commented out, so this should not be affecting the buffer 
 flushes
 <!--<maxBufferedDocs>1000</maxBufferedDocs>-->


 Solr Specification Version: 3.0.0.2010.11.19.16.00.54
 Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
 Lucene Specification Version: 3.1-SNAPSHOT
 Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10

 Tom Burton-West




RE: Doubt about index size

2010-11-12 Thread Burton-West, Tom
Hi Claudio,

What's happening when you re-index the documents is that Solr/Lucene implements 
an update as a delete plus a new index.  Because of the nature of inverted 
indexes, deleting documents requires a rewrite of the entire index. In order to 
avoid rewriting the entire index each time one document is deleted, deletes are 
implemented as a list of deleted  internal lucene ids. Documents aren't 
actually removed from the indexes until the index segment is merged or an 
optimize occurs.

maxDoc's is the total number of documents indexed without taking into 
consideration that some of them are marked as deleted
numDocs is the actual number of undeleted documents

If you run an optimize the index will be rewritten, the index size will go down 
 and numDocs will equal maxDocs 

Tom Burton-West

-Original Message-
From: Claudio Devecchi [mailto:cdevec...@gmail.com] 
Sent: Friday, November 12, 2010 10:50 AM
To: Lista Solr
Subject: Doubt about index size

Hi everybody,

I'm doing some indexing testing on solr 1.4.1 and I'm not understanding one
thing, let me try to explain.

I have 1.2 million xml files and I'm indexing then, when I do it for first
time my index size is around 3 GB and in my statistics on
http://localhost:8983/solr/admin/stats.jsp I have two entries that is:

numDocs : 1120171
maxDoc : 1120171

Until here is all right, but if I make a index update reindexing all the
same 1120171 documents I have the stats bellow:

numDocs : 1120171
maxDoc : 2240342

... and my index size goes around 6GB.

Why this happen? What happens on index size if I have the same number of
searcheable docs?

Somebody knows?

Tks


RE: Doubt about index size

2010-11-12 Thread Burton-West, Tom
An optimize takes lots of cpu and I/O since it has to rewrite your indexes, so 
only do it when necessary.

You can just use curl to send an optimize message to Solr when you are ready.

See:
http://wiki.apache.org/solr/UpdateXmlMessages#Passing_commit_parameters_as_part_of_the_URL
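
If you would rather trigger it from code than from curl, a minimal SolrJ
sketch (assuming the default single-core URL) does the same thing:

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

// Sketch: send an optimize to Solr via SolrJ instead of curl.
public class OptimizeIndex {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    server.optimize();  // rewrites the segments; deleted docs are dropped and numDocs == maxDocs
  }
}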

Tom
-Original Message-
From: Claudio Devecchi [mailto:cdevec...@gmail.com] 
Sent: Friday, November 12, 2010 12:13 PM
To: solr-user@lucene.apache.org
Subject: Re: Doubt about index size

Hi Tom, thanks for your explanation,

Do you recommend the index continues this way? Or can I configure it to make
optimize automatically?

tks

On Fri, Nov 12, 2010 at 2:39 PM, Burton-West, Tom tburt...@umich.eduwrote:

 Hi Claudio,

 What's happening when you re-index the documents is that Solr/Lucene
 implements an update as a delete plus a new index.  Because of the nature of
 inverted indexes, deleting documents requires a rewrite of the entire index.
 In order to avoid rewriting the entire index each time one document is
 deleted, deletes are implemented as a list of deleted  internal lucene ids.
 Documents aren't actually removed from the indexes until the index segment
 is merged or an optimize occurs.

 maxDoc's is the total number of documents indexed without taking into
 consideration that some of them are marked as deleted
 numDocs is the actual number of undeleted documents

 If you run an optimize the index will be rewritten, the index size will go
 down  and numDocs will equal maxDocs

 Tom Burton-West

 -Original Message-
 From: Claudio Devecchi [mailto:cdevec...@gmail.com]
 Sent: Friday, November 12, 2010 10:50 AM
 To: Lista Solr
 Subject: Doubt about index size

 Hi everybody,

 I'm doing some indexing testing on solr 1.4.1 and I'm not understanding one
 thing, let me try to explain.

 I have 1.2 million xml files and I'm indexing then, when I do it for first
 time my index size is around 3 GB and in my statistics on
 http://localhost:8983/solr/admin/stats.jsp I have two entries that is:

 numDocs : 1120171
 maxDoc : 1120171

 Until here is all right, but if I make a index update reindexing all the
 same 1120171 documents I have the stats bellow:

 numDocs : 1120171
 maxDoc : 2240342

 ... and my index size goes around 6GB.

 Why this happen? What happens on index size if I have the same number of
 searcheable docs?

 Somebody knows?

 Tks




-- 
Claudio Devecchi
flickr.com/cdevecchi


Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr

2010-11-01 Thread Burton-West, Tom
We are trying to solve some multilingual issues with our Solr analysis filter 
chain and would like to use the new Lucene 3.x filters that are Unicode 
compliant.

Is it possible to use the Lucene ICUTokenizerFilter or StandardAnalyzer with 
UAX#29 support from Solr?

Is it just a matter of writing the appropriate Solr filter factories?  Are 
there any tricky gotchas in writing such a filter?

If so, should I open a JIRA issue or two JIRA issues so the filter factories 
can be contributed to the Solr code base?

Tom



RE: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr

2010-11-01 Thread Burton-West, Tom
Thanks Robert,

I'll use the workaround for now (using StandardTokenizerFactory and specifying 
version 3.1), but I suspect that I don't want the added URL/IP address 
recognition due to my use case.  I've also talked to a couple people who 
recommended using the ICUTokenFilter with some rule modifications, but haven't 
had a chance to investigate that yet.

  I opened two JIRA issues (https://issues.apache.org/jira/browse/SOLR-2210) 
and https://issues.apache.org/jira/browse/SOLR-2211.  Sometime later this week 
I'll try writing the FilterFactories and upload patches. (Unless someone beats 
me to it :)
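
For what it's worth, the tokenizer factory itself should be pretty thin;
here is a sketch of what I have in mind (class and package names are just
placeholders, and the actual patches on SOLR-2210/SOLR-2211 may end up
looking different):

import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

// Sketch only: wrap the Lucene ICUTokenizer (default config) in a Solr factory.
public class ICUTokenizerFactory extends BaseTokenizerFactory {
  public Tokenizer create(Reader input) {
    return new ICUTokenizer(input);
  }
}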

Tom

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Monday, November 01, 2010 12:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support 
from Solr

On Mon, Nov 1, 2010 at 12:24 PM, Burton-West, Tom tburt...@umich.edu wrote:
 We are trying to solve some multilingual issues with our Solr analysis filter 
 chain and would like to use the new Lucene 3.x filters that are Unicode 
 compliant.

 Is it possible to use the Lucene ICUTokenizerFilter or StandardAnalyzer with 
 UAX#29 support from Solr?

right now, you can use the StandardTokenizerFactory (which is UAX#29 +
URL and IP address recognition) from Solr.
just make sure you set the Version to 3.1 in your solrconfig.xml with
branch_3x, otherwise it will use the old standardtokenizer for
backwards compatibility.

  <!--
    Controls what version of Lucene various components of Solr adhere to. Generally, you want
    to use the latest version to get all bug fixes and improvements. It is highly recommended
    that you fully re-index after changing this setting as it can affect both how text is indexed
    and queried.
  -->
  <luceneMatchVersion>LUCENE_31</luceneMatchVersion>

But if you want the pure UAX#29 Tokenizer without this, there isn't a
factory. Also if you want customization/supplementary character
support, there is no factory for ICUTokenizer at the moment.

 If so, should I open a JIRA issue or two JIRA issues so the filter factories 
 can be contributed to the Solr code base?

Please open issues for a factory for the pure UAX#29 Tokenizer, and
for the ICU factories (maybe we can just put this into a contrib for
now?) !


filter query from external list of Solr unique IDs

2010-10-15 Thread Burton-West, Tom
At the Lucene Revolution conference I asked about efficiently building a filter 
query from an external list of Solr unique ids.

Some use cases I can think of are:
1)  personal sub-collections (in our case a user can create a small subset 
of our 6.5 million doc collection and then run filter queries against it)
2)  tagging documents
3)  access control lists
4)  anything that needs complex relational joins
5)  a sort of alternative to incremental field updating (i.e. update in an 
external database or kv store)
6)  Grant's clustering cluster points and similar apps.

Grant pointed to SOLR 1715, but when I looked on JIRA, there doesn't seem to be 
any work on it yet.

Hoss  mentioned a couple of ideas:
1) sub-classing query parser
2) Having the app query a database and somehow passing something to 
Solr or lucene for the filter query

Can Hoss or someone else point me to more detailed information on what might be 
involved in the two ideas listed above?

Is somehow keeping an up-to-date map of unique Solr ids to internal Lucene ids 
needed to implement this or is that a separate issue?


Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search






RE: filter query from external list of Solr unique IDs

2010-10-15 Thread Burton-West, Tom
Hi Jonathan,

The advantages of the obvious approach you outline are that it is simple, it 
fits in to the existing Solr model, it doesn't require any customization or 
modification to Solr/Lucene java code.  Unfortunately, it does not scale well.  
We originally tried just what you suggest for our implementation of Collection 
Builder.  For a user's personal collection we had a table that maps the 
collection id to the unique Solr ids.
Then when they wanted to search their collection, we just took their search and 
added a filter query with the fq=(id:1 OR id:2 OR).   I seem to remember 
running in to a limit on the number of OR clauses allowed. Even if you can set 
that limit larger, there are a  number of efficiency issues.  

We ended up constructing a separate Solr index where we have a multi-valued 
collection number field. Unfortunately, until incremental field updating gets 
implemented, this means that every time someone adds a document to a 
collection, the entire document (including 700KB of OCR) needs to be re-indexed 
just to update the collection number field. This approach has allowed us to 
scale up to a total of something under 100,000 documents, but we don't think we 
can scale it much beyond that for various reasons.

I was actually thinking of some kind of custom Lucene/Solr component that would 
for example take a query parameter such as lookitUp=123 and the component 
might do a JDBC query against a database or kv store and return results in some 
form that would be efficient for Solr/Lucene to process. (Of course this 
assumes that a JDBC query would be more efficient than just sending a long list 
of ids to Solr).  The other part of the equation is mapping the unique Solr ids 
to internal Lucene ids in order to implement a filter query.   I was wondering 
if something like the unique id to Lucene id mapper in zoie might be useful or 
if that is too specific to zoie. SoThis may be totally off-base, since I 
haven't looked at the zoie code at all yet.

In our particular use case, we might be able to build some kind of in-memory 
map after we optimize an index and before we mount it in production. In our 
workflow, we update the index and optimize it before we release it and once it 
is released to production there is no indexing/merging taking place on the 
production index (so the internal Lucene ids don't change.)  

Tom



-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Friday, October 15, 2010 1:07 PM
To: solr-user@lucene.apache.org
Subject: RE: filter query from external list of Solr unique IDs

Definitely interested in this. 

The naive obvious approach would be just putting all the ID's in the query. 
Like fq=(id:1 OR id:2 OR).  Or making it another clause in the 'q'.  

Can you outline what's wrong with this approach, to make it more clear what's 
needed in a solution?



RE: filter query from external list of Solr unique IDs

2010-10-15 Thread Burton-West, Tom
Thanks Yonik,

Is this something you might have time to throw together, or an outline of what 
needs to be thrown together?
Is this something that should be asked on the developer's list or discussed in 
SOLR 1715 or does it make the most sense to keep the discussion in this thread?

Tom

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Friday, October 15, 2010 1:19 PM
To: solr-user@lucene.apache.org
Subject: Re: filter query from external list of Solr unique IDs

On Fri, Oct 15, 2010 at 11:49 AM, Burton-West, Tom tburt...@umich.edu wrote:
 At the Lucene Revolution conference I asked about efficiently building a 
 filter query from an external list of Solr unique ids.
Yeah, I've thought about a special query parser and query to deal with
this (relatively) efficiently, both from a query perspective and a
memory perspective.

Should be pretty quick to throw together:
- comma separated list of terms (unique ids are a special case of this)
- in the query, store as a single byte array for efficiency
- sort the ids if they aren't already sorted
- do lookups with a term enumerator and skip weighting or anything
else like that
- configurable caching... may, or may not want to cache this big query

That's only part of the stuff you mention, but seems like it would be
useful to a number of people.
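
To make the lookup part concrete, here is a rough Lucene-level sketch (this
is not SOLR-1715; the "id" field name and the pre-sorted input are
assumptions): seek each unique id in the terms dictionary and set bits
for the matching docs, skipping scoring entirely:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.util.OpenBitSet;

// Sketch: turn an external list of unique ids into a bitset usable as a filter.
public class ExternalIdFilter {
  public static OpenBitSet bitsForIds(IndexReader reader, String[] sortedIds) throws IOException {
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    TermDocs termDocs = reader.termDocs();
    try {
      for (String id : sortedIds) {           // sorted ids keep term dictionary access mostly forward-only
        termDocs.seek(new Term("id", id));    // "id" is the unique key field (assumption)
        while (termDocs.next()) {
          bits.set(termDocs.doc());
        }
      }
    } finally {
      termDocs.close();
    }
    return bits;
  }
}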

-Yonik
http://www.lucidimagination.com


RE: Experience with large merge factors

2010-10-06 Thread Burton-West, Tom
Hi Mike,

.Do you use multiple threads for indexing?  Large RAM buffer size is
also good, but I think perf peaks out maybe around 512 MB (at least
based on past tests)?

We are using Solr, I'm not sure if Solr uses multiple threads for indexing.  We 
have 30 producers each sending documents to 1 of 12 Solr shards on a round 
robin basis.  So each shard will get multiple requests.

Believe it or not, merging is typically compute bound.  It's costly to
decode  re-encode all the vInts.

Sounds like we need to do some monitoring during merging to see what the cpu 
use is and also the io wait during large merges.

Larger merge factor is good because it means the postings are copied 
fewer times, but, it's bad because you could risk running out of
descriptors, and, if the OS doesn't have enough RAM, you'll start to
thin out the readahead that the OS can do (which makes the merge less
efficient since the disk heads are seeking more).

Is there a way to estimate the amount of RAM for the readahead?   Once we start 
the re-indexing we will be running 12 shards on a 16 processor box with 144 GB 
of memory.

Do you do any deleting?
Deletes would happen as a byproduct of updating a record.  This shouldn't 
happen too frequently during re-indexing, but we update records when a document 
gets re-scanned and re-OCR'd.  This would probably amount to a few thousand.


Do you use stored fields and/or term vectors?  If so, try to make
your docs uniform if possible, ie add the same fields in the same
order.  This enables lucene to use bulk byte copy merging under the hood.

We use 4 or 5 stored fields.  They are very small compared to our huge OCR 
field.  Since we construct our Solr documents programattically, I'm fairly 
certain that they are always in the same order.  I'll have to look at the code 
when I get back to make sure.

We aren't using term vectors now, but we plan to add them as well as a number 
of fields based on MARC (cataloging) metadata in the future.

Tom

Experience with large merge factors

2010-10-05 Thread Burton-West, Tom
Hi all,

At some point we will need to re-build an index that totals about 3 terabytes 
in size (split over 12 shards).  At our current indexing speed we estimate that 
this will take about 4 weeks.  We would like to reduce that time.  It appears 
that our main bottleneck is disk I/O during index merging.

Each index is somewhere between 250 and 350GB.  We are currently using a 
mergeFactor of 10 and a ramBufferSizeMB of 32MB.  What this means is that for 
every approximately 320 MB, 3.2GB,  and 32GB we get merges.  We are doing this 
offline and will run an optimize at the end.  What we would like to do is 
reduce the number of intermediate merges.   We thought about just using a 
nomerge merge policy and then optimizing at the end, but suspect we would run 
out of filehandles and that merging 10,000 segments during an optimize might 
not be efficient.

We would like to find some optimum mergeFactor somewhere between 0 (noMerge 
merge policy) and 1,000.  (We are also planning to raise the ramBufferSizeMB 
significantly).

What experience do others have using a large mergeFactor?

Tom





Estimating memory use for Solr caches

2010-10-01 Thread Burton-West, Tom
We are having some memory and GC issues.  I'm trying to get a handle on the 
contribution of the Solr caches.  Is there a way to estimate the amount of 
memory used by  the documentCache and the queryResultCache?

I assume if we know the average size of our stored fields we can just multiply 
the size of the documentCache by the average size of the stored fields.
We store author, title, date, and id.  For each document this is likely to be 
less than 1 KB so if we set documentCache size=50,000 that should be 50MB.  Is 
that about right?

Alternatively, could we calculate this based on the size of the fdt file?  Our 
on disk fdt file that stores the stored field data is 16MB so in theory if the 
stored data for all the documents in our index were in the documentCache, it 
shouldn't exceed 16MB, plus some overhead.  Does this make sense?

queryResultCache stores a list of docIDs.  I assume these are Java ints but the 
number depends on the number of hits. Is there a good way to estimate (or 
measure:)  the size of this in memory?
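
Here is the back-of-the-envelope version of what I am assuming (the
queryResultCache numbers are guesses, just to show the shape of the
calculation):

// Rough estimate only; the sizes are assumptions, not measurements.
public class CacheEstimate {
  public static void main(String[] args) {
    long docCacheEntries   = 50000;  // documentCache size
    long avgStoredBytes    = 1024;   // ~1 KB of stored fields per doc
    long queryCacheEntries = 512;    // queryResultCache size (guess)
    long docIdsPerEntry    = 50;     // doc ids kept per cached result window (guess)

    long docCacheBytes   = docCacheEntries * avgStoredBytes;
    long queryCacheBytes = queryCacheEntries * docIdsPerEntry * 4;  // 4-byte ints

    System.out.println("documentCache    ~ " + docCacheBytes / (1024 * 1024) + " MB");
    System.out.println("queryResultCache ~ " + queryCacheBytes / 1024 + " KB");
  }
}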


Tom Burton-West





RE: bi-grams for common terms - any analyzers do that?

2010-09-27 Thread Burton-West, Tom
Hi Jonathan,

 I'm afraid I'm having trouble understanding   if the analyzer returns more 
 than one position back from a queryparser token

I'm not sure if the queryparser forms a phrase query without explicit phrase 
quotes is a problem for me, I had no idea it happened until now, never 
noticed, and still don't really understand in what circumstances it happens.

The problem I had was for a Boolean query l'art AND historie that the 
WordDelimiterFilter tokenized l'art  as two tokens l at position 1 and 
art at position 2.   So the queryparser decided this means a phrase query for 
l followed immediately by art.  See
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance 
for details.  

This would happen whenever any token filter split a token into more than one 
token.  For example a filter that splits foo-bar into foo bar.  The 
exception is  SynonymFilter or something like it.  In the case of 
SynonymFilter, its not really a case of splitting one token into multiple 
tokens, but given one token of input, it outputs all the synonyms of the term.  
However all the tokens have the same position attribute. (see: 
http://www.lucidimagination.com/search/document/CDRG_ch05_5.6.19?q=synonym%20filter)

 So for example, for the string "the small thing", if you had a synonym list for 
"small":
small=tiny,teeny

input:
position |1   |2     |3
token    |the |small |thing

would output:

position |1   |2     |2    |2     |3
token    |the |small |tiny |teeny |thing

In this case, when the queryParser gets back "small tiny teeny", since they have 
the same position, they are not turned into a phrase query.

For "l'art":

input:
position |1
token    |l'art

output:
position |1 |2
token    |l |art

In this case there are two tokens with different positions, so it treats them as 
a phrase query.

Tom Burton-West


RE: bi-grams for common terms - any analyzers do that?

2010-09-27 Thread Burton-West, Tom
Hi Yonik,

If the new autoGeneratePhraseQueries is off, position doesn't matter, and 
the query will 
be treated as index OR reader.

Just wanted to make sure, in Solr does autoGeneratePhraseQueries = off treat 
the query with the *default* query operator as set in SolrConfig rather than 
necessarily using the Boolean OR operator?

i.e.  if <solrQueryParser defaultOperator="AND"/>
 and autoGeneratePhraseQueries = off 

then IndexReader -> index reader -> index AND reader

Tom




RE: bi-grams for common terms - any analyzers do that?

2010-09-23 Thread Burton-West, Tom
Hi all,

The CommonGrams filter is designed to only work on phrase queries.  It is 
designed to solve the problem of slow phrase queries with phrases containing 
common words, when you don't want to use stop words.  It would not make sense 
for Boolean queries. Boolean queries just get passed through unchanged. 

For background on the CommonGramsFilter please see: 
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

There are two filters,  CommonGramsFilter and CommonGramsQueryFilter you use 
CommonGramsFilter on indexing and CommonGramsQueryFilter for query processing.  
CommonGramsFilter outputs both CommonGrams and Unigrams so that Boolean queries 
(i.e. non-phrase queries)  will work.  For example the rain would produce 3 
tokens:
the  position 1
rain position 2
the-rain position 1
When you have a phrase query, you want Solr to search for the token the-rain 
so you don't want the unigrams.
When you have a Boolean query, the CommonGramsQueryFilter only gets one token 
as input and simply outputs it.

Appended below is a sample config from our schema.xml.

For background on the problem with l'art please see: 
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance We 
used a custom filter to change all punctuation to spaces.   You could probably 
use one of the other filters to do this. (See the comments from David Smiley at 
the end of the blog post regarding possible approaches.)At the time, I just 
couldn't get WordDelimiterFilter to behave as documented with various 
combinations of parameters and was not aware of the other filters David 
mentions.

The problem with l'art is actually due to a bug or feature in the 
QueryParser.  Currently the QueryParser interacts with the token chain and 
decides whether the tokens coming back from a tokenfilter should be treated as 
a phrase query based on whether or not more than one non-synonym token comes 
back from the tokestream for a single 'queryparser token'.
It also splits on whitespace which causes all CJK queries to be treated as 
phrase queries regardless of the CJK tokenizer you use. This is a contentious 
issue.  See https://issues.apache.org/jira/browse/LUCENE-2458.  There is a 
semi-workaround using PositionFilter, but it has many undesirable side effects. 
 I believe Robert Muir, who is an expert on the various problems involved and  
opened Lucene-2458 is working on a better fix.

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search



<fieldType name="CommonGramTest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="ISOLatin1AccentFilterFactory"/>
    <filter class="solr.PunctuationFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="new400common.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="ISOLatin1AccentFilterFactory"/>
    <filter class="solr.PunctuationFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="new400common.txt"/>
  </analyzer>
</fieldType>


RE: Solr memory use, jmap and TermInfos/tii

2010-09-13 Thread Burton-West, Tom
Thanks Robert and everyone!

I'm working on changing our JVM settings today, since putting Solr 1.4.1 into 
production will take a bit more work and testing.  Hopefully, I'll be able to 
test the setTermIndexDivisor on our test server tomorrow.

Mike, I've started the process to see if we can provide you with our tii/tis 
data.  I'll let you know as soon as I hear anything.  


Tom

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Sunday, September 12, 2010 10:48 AM
To: solr-user@lucene.apache.org; simon.willna...@gmail.com
Subject: Re: Solr memory use, jmap and TermInfos/tii

On Sun, Sep 12, 2010 at 9:57 AM, Simon Willnauer 
simon.willna...@googlemail.com wrote:

  To change the divisor in your solrconfig, for example to 4, it looks like
  you need to do this.
 
   <indexReaderFactory name="IndexReaderFactory"
      class="org.apache.solr.core.StandardIndexReaderFactory">
     <int name="setTermIndexInterval">4</int>
   </indexReaderFactory>

 Ah, thanks robert! I didn't know about that one either!

 simon


actually I'm wrong, for solr 1.4, use setTermIndexDivisor.

i was looking at 3.1/trunk and there is a bug in the name of this parameter:
https://issues.apache.org/jira/browse/SOLR-2118
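
so for solr 1.4 the snippet above would look something like this instead (same 
config, just with the corrected parameter name):

   <indexReaderFactory name="IndexReaderFactory"
        class="org.apache.solr.core.StandardIndexReaderFactory">
     <int name="setTermIndexDivisor">4</int>
   </indexReaderFactory>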

-- 
Robert Muir
rcm...@gmail.com


RE: Solr and jvm Garbage Collection tuning

2010-09-13 Thread Burton-West, Tom
Thanks Kent for your info.  

We are not doing any faceting, sorting, or much else.  My guess is that most of 
the memory increase is just the data structures created when parts of the frq 
and prx files get read into memory.  Our frq files are about 77GB  and the prx 
files are about 260GB per shard and we are running 3 shards per machine.   I 
suspect that the document cache and query result cache don't take up that much 
space, but will try a run with those caches set to 0, just to see.

We have dual 4 core processors and 74GB total memory.  We want to leave a 
significant amount of memory free for OS disk caching. 

We tried increasing the memory from 20GB to 28GB and adding the 
-XX:MaxGCPauseMillis=1000 flag, but that seemed to have no effect.  

Currently I'm testing using the ConcurrentMarkSweep and that's looking much 
better although I don't understand why it has sized the Eden space down into 
the 20MB range. However, I am very new to Java memory management.

Anyone know if when using ConcurrentMarkSweep its better to let the JVM size 
the Eden space or better to give it some hints?
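
In case it helps frame the question, the sort of invocation I've been trying 
looks roughly like the following (the sizes are illustrative, not a 
recommendation; explicitly setting NewSize/MaxNewSize is the kind of hint I mean):

  java -Xms28g -Xmx28g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
       -XX:NewSize=2g -XX:MaxNewSize=2g -XX:CMSInitiatingOccupancyFraction=70 ...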


Once we get some decent JVM settings we can put into production I'll be testing 
using termIndexInterval with Solr 1.4.1 on our test server.

Tom

-Original Message-
From: Grant Ingersoll [mailto:gsing...@apache.org] 

What are your current GC settings?  Also, I guess I'd look at ways you can 
reduce the heap size needed. 
 Caching, field type choices, faceting choices.  
Also could try playing with the termIndexInterval which will load fewer terms 
into memory at the cost of longer seeks. 

 At some point, though, you just may need more shards and the resulting 
 smaller indexes.  How many CPU cores do you have on each machine?


RE: Solr memory use, jmap and TermInfos/tii

2010-09-11 Thread Burton-West, Tom
Thanks Mike,

Do you use a terms index divisor?  Setting that to 2 would halve the
amount of RAM required but double (on average) the seek time to locate
a given term (but, depending on your queries, that seek time may still
be a negligible part of overall query time, ie the tradeoff could be very 
worth it).

On Monday I plan to switch to Solr 1.4.1 on our test machine and experiment 
with the index divisor.  Is there an example of how to set up the divisor 
parameter in solrconfig.xml somewhere?

In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use large 
parallel arrays instead of separate objects, and, 
we hold much less in RAM.  Simply upgrading to 4.0 and re-indexing will show 
this gain...; 

I'm looking forward to a number of the developments in 4.0, but am a bit wary 
of using it in production.   I've wanted to work in some tests with 4.0, but 
other more pressing issues have so far prevented this.

What about Lucene 2205?  Would that be a way to get some of the benefit similar 
to the changes in flex without the rest of the changes in flex and 4.0?

I'd be really curious to test the RAM reduction in 4.0 on your terms  
dict/index -- 
is there any way I could get a copy of just the tii/tis  files in your index? 
 Your index is a great test for Lucene!

We haven't been able to make much data available due to copyright and other 
legal issues.  However, since there is absolutely no way anyone could 
reconstruct copyrighted works from the tii/tis index alone, that should be ok 
on that front.  On Monday I'll try to get legal/administrative clearance to 
provide the data and also ask around and see if I can get the ok to either find 
a spare hard drive to ship, or make some kind of sftp arrangement.  Hopefully 
we will find a way to be able to do this.

BTW  Most of the terms are probably the result of  dirty OCR and the impact is 
probably increased by our present punctuation filter.  When we re-index we 
plan to use a more intelligent filter that will truncate extremely long tokens 
on punctuation and we also plan to do some minimal prefiltering prior to 
sending documents to Solr for indexing.  However, since we now have over 400 
languages, we will have to be conservative in our filtering since we would 
rather index dirty OCR than risk not indexing legitimate content.  

Tom



Solr memory use, jmap and TermInfos/tii

2010-09-10 Thread Burton-West, Tom
Hi all,

When we run the first query after starting up Solr, memory use goes up from 
about 1GB to 15GB and never goes below that level.  In debugging a recent OOM 
problem I ran jmap with the output appended below.  Not surprisingly, given the 
size of our indexes, it looks like the TermInfo and Term data structures which 
are the in-memory representation of the tii file are taking up most of the 
memory. This is running Solr under Tomcat with 16GB allocated to the jvm and 3 
shards each with a tii file of about 600MB.
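
(For reference, a class histogram like the one appended below is the sort of 
output you get from something along the lines of jmap -histo <pid> against the 
running JVM.)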

Total index size is about 400GB for each shard (we are indexing about 600,000 
full-text books in each shard).

In interpreting the jmap output, can we assume that the listings for character 
arrays ([C), java.lang.String, long arrays ([J), and int 
arrays ([I) are all part of the data structures involved in representing the 
tii file in memory?

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

(jmap output, commas in numbers added)

num #instances #bytes  class name
--
   1:  82,496,803 4,273,137,904  [C
   2:  82,498,673 3,299,946,920  java.lang.String
   3:  27,810,887 1,112,435,480  org.apache.lucene.index.TermInfo
   4:  27,533,080 1,101,323,200  org.apache.lucene.index.TermInfo
   5:  27,115,577 1,084,623,080  org.apache.lucene.index.TermInfo
   6:  27,810,894  889,948,608  org.apache.lucene.index.Term
   7:  27,533,088  881,058,816  org.apache.lucene.index.Term
   8:  27,115,589  867,698,848  org.apache.lucene.index.Term
   9:   148  659,685,520  [J
  10: 2  222,487,072  [Lorg.apache.lucene.index.Term;
  11: 2  222,487,072  [Lorg.apache.lucene.index.TermInfo;
  12: 2  220,264,600  [Lorg.apache.lucene.index.Term;
  13: 2  220,264,600  [Lorg.apache.lucene.index.TermInfo;
  14: 2  216,924,560  [Lorg.apache.lucene.index.Term;
  15: 2  216,924,560  [Lorg.apache.lucene.index.TermInfo;
  16:737,060  155,114,960  [I
  17:627,793   35,156,408  java.lang.ref.SoftReference






Solr and jvm Garbage Collection tuning

2010-09-10 Thread Burton-West, Tom
We have noticed that when the first query hits Solr after starting it up, 
memory use increases significantly, from about 1GB to about 16GB, and then as 
queries are received it goes up to about 19GB at which point there is a Full 
Garbage Collection which takes about 30 seconds and then memory use drops back 
down to 16GB.  Under a relatively heavy load, the full GC happens about every 
10-20 minutes.

 We are running 3 Solr shards under one Tomcat with 20GB allocated to the jvm.  
Each shard has a total index size of about 400GB on and a tii size of about 
600MB and indexes about 650,000 full-text books. (The server has a total of 
72GB of memory, so we are leaving quite a bit of memory for the OS disk cache).

Is there some argument we could give the jvm so that it would collect garbage 
more frequently? Or some other JVM tuning action that might reduce the amount 
of time where Solr is waiting on GC?

If we could get the time for each GC to take under a second, with the trade-off 
being that GC  would occur much more frequently, that would help us avoid the 
occasional query taking more than 30 seconds at the cost of a larger number of 
queries taking at least a second.


Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search




RE: analysis tool vs. reality

2010-08-13 Thread Burton-West, Tom
+1
I just had occasion to debug something where the interaction between the 
queryparser and the analyzer produced *interesting* results.  Having a separate 
jsp that includes the whole chain (i.e. analyzer/tokenizer/filter and qp) would 
be great!

Tom

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Friday, August 13, 2010 5:19 AM
To: solr-user@lucene.apache.org
Subject: Re: analysis tool vs. reality

Maybe, separate from analysis.jsp (showing only how text is analyzed),
Solr needs a debug page showing the steps the field's QueryParser goes
through on a given query, to debug such tricky QueryParser/Analyzer
interactions?

We could make a wrapper around the analyzer that records each text
fragment sent to it by the QueryParser, as a start.  It'd be great to
also see it spelled out how that then resulted in a particular part of
the query.  So for query ABC12 FOO you'd see that ABC12 was sent to
analyzer, it returned two tokens (ABC, 12), and then QueryParser made
a PhraseQuery from that, and then FOO was sent, and that turned into
TermQuery, and default op was AND and so a toplevel BooleanQuery with
2 MUST terms was created...

Mike

On Thu, Aug 12, 2010 at 8:39 PM, Robert Muir rcm...@gmail.com wrote:
 On Thu, Aug 12, 2010 at 8:07 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:


 :  You say it's bogus because the qp will divide on whitesapce first --
 but
 :  you're assuming you know what query parser will be used ... the field
 :  query parser (to name one) doesn't split on whitespace first.  That's
 my
 :  point: analysis.jsp doesn't make any assumptions about what query
 parser
 :  *might* be used, it just tells you what your analyzers do with strings.
 : 
 :
 : you're right, we should just fix the bug that the queryparser tokenizes
 on
 : whitespace first. then analysis.jsp will be significantly less confusing.

 dude .. not trying to get into a holy war here

 actually I'm suggesting the practical solution: that we fix the primary
 problem that makes it confusing.


 even if you change the Lucene QUeryParser so that whitespace isn't a meta
 character it doens't affect the underlying issue: analysis.jsp is agnostic
 about QueryParsers.


 analysis.jsp isn't agnostic about queryparsers, its ignorant of them, and
 your default queryparser is actually a de-facto whitespace tokenizer, don't
 try to sugarcoat it.

 --
 Robert Muir
 rcm...@gmail.com



RE: Improve Query Time For Large Index

2010-08-12 Thread Burton-West, Tom
Hi Peter,

If hits aren't showing up, and you aren't getting any queryResultCache hits 
even with the exact query being repeated, something is very wrong.  I'd suggest 
first getting the query result cache working, and then moving on to look at 
other possible bottlenecks.  

What are your settings for queryResultWindowSize and queryResultMaxDocsCached?

Following up on Robert's point, you might also try to run a few queries in the 
admin interface with the debug flag on to see if the query parser is creating 
phrase queries (assuming you have queries like http://foo.bar.baz).  The 
debug/explain will indicate whether the parsed query is a PhraseQuery.

Tom



-Original Message-
From: Peter Karich [mailto:peat...@yahoo.de] 
Sent: Thursday, August 12, 2010 5:36 AM
To: solr-user@lucene.apache.org
Subject: Re: Improve Query Time For Large Index

Hi Tom,

I tried again with:
  <queryResultCache class="solr.LRUCache" size="1" initialSize="1"
                    autowarmCount="1"/>

and even now the hitratio is still 0. What could be wrong with my setup?

('free -m' shows that the cache has over 2 GB free.)

Regards,
Peter.

 Hi Peter,

 Can you give a few more examples of slow queries?  
 Are they phrase queries? Boolean queries? prefix or wildcard queries?
 If one word queries are your slow queries, than CommonGrams won't help.  
 CommonGrams will only help with phrase queries.

 How are you using termvectors?  That may be slowing things down.  I don't 
 have experience with termvectors, so someone else on the list might speak to 
 that.

 When you say the query time for common terms stays slow, do you mean if you 
 re-issue the exact query, the second query is not faster?  That seems very 
 strange.  You might restart Solr, and send a first query (the first query 
 always takes a relatively long time.)  Then pick one of your slow queries and 
 send it 2 times.  The second time you send the query it should be much faster 
 due to the Solr caches and you should be able to see the cache hit in the 
 Solr admin panel.  If you send the exact query a second time (without enough 
 intervening queries to evict data from the cache, ) the Solr queryResultCache 
 should get hit and you should see a response time in the .01-5 millisecond 
 range.

 What settings are you using for your Solr caches?

 How much memory is on the machine?  If your bottleneck is disk i/o for 
 frequent terms, then you want to make sure you have enough memory for the OS 
 disk cache.  

 I assume that http is not in your stopwords.  CommonGrams will only help with 
 phrase queries
 CommonGrams was committed and is in Solr 1.4.  If you decide to use 
 CommonGrams you definitely need to re-index and you also need to use both the 
 index time filter and the query time filter.  Your index will be larger.

 <fieldType name="foo" ...>
   <analyzer type="index">
     <filter class="solr.CommonGramsFilterFactory" words="new400common.txt"/>
   </analyzer>

   <analyzer type="query">
     <filter class="solr.CommonGramsQueryFilterFactory" words="new400common.txt"/>
   </analyzer>
 </fieldType>



 Tom
 -Original Message-
 From: Peter Karich [mailto:peat...@yahoo.de] 
 Sent: Tuesday, August 10, 2010 3:32 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Improve Query Time For Large Index

 Hi Tom,

 my index is around 3GB large and I am using 2GB RAM for the JVM although
 some more is available.
 If I am looking into the RAM usage while a slow query runs (via
 jvisualvm) I see that only 750MB of the JVM RAM is used.

   
 Can you give us some examples of the slow queries?
 
 for example the empty query solr/select?q=
 takes very long or solr/select?q=http
 where 'http' is the most common term

   
 Are you using stop words?  
 
 yes, a lot. I stored them into stopwords.txt

   
 http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
 
 this looks interesting. I read through
 https://issues.apache.org/jira/browse/SOLR-908 and it seems to be in 1.4.
 I only need to enable it via:

 <filter class="solr.CommonGramsFilterFactory" ignoreCase="true" words="stopwords.txt"/>

 right? Do I need to reindex?

 Regards,
 Peter.

   
 Hi Peter,

 A few more details about your setup would help list members to answer your 
 questions.
 How large is your index?  
 How much memory is on the machine and how much is allocated to the JVM?
 Besides the Solr caches, Solr and Lucene depend on the operating system's 
 disk caching for caching of postings lists.  So you need to leave some 
 memory for the OS.  On the other hand if you are optimizing and refreshing 
 every 10-15 minutes, that will invalidate all the caches, since an optimized 
 index is essentially a set of new files.

 Can you give us some examples of the slow queries?  Are you using stop 
 words?  

 If your slow queries are phrase queries, then you might try either adding 
 the most frequent terms in your index to the stopwords list  or try 
 CommonGrams and add them to the common words list.  (Details on 

RE: Improve Query Time For Large Index

2010-08-11 Thread Burton-West, Tom
Hi Peter,

Can you give a few more examples of slow queries?  
Are they phrase queries? Boolean queries? prefix or wildcard queries?
If one word queries are your slow queries, than CommonGrams won't help.  
CommonGrams will only help with phrase queries.

How are you using termvectors?  That may be slowing things down.  I don't have 
experience with termvectors, so someone else on the list might speak to that.

When you say the query time for common terms stays slow, do you mean if you 
re-issue the exact query, the second query is not faster?  That seems very 
strange.  You might restart Solr, and send a first query (the first query 
always takes a relatively long time.)  Then pick one of your slow queries and 
send it 2 times.  The second time you send the query it should be much faster 
due to the Solr caches and you should be able to see the cache hit in the Solr 
admin panel.  If you send the exact query a second time (without enough 
intervening queries to evict data from the cache, ) the Solr queryResultCache 
should get hit and you should see a response time in the .01-5 millisecond 
range.

What settings are you using for your Solr caches?

How much memory is on the machine?  If your bottleneck is disk i/o for frequent 
terms, then you want to make sure you have enough memory for the OS disk cache. 
 

I assume that http is not in your stopwords.  CommonGrams will only help with 
phrase queries
CommonGrams was committed and is in Solr 1.4.  If you decide to use CommonGrams 
you definitely need to re-index and you also need to use both the index time 
filter and the query time filter.  Your index will be larger.

<fieldType name="foo" ...>
  <analyzer type="index">
    <filter class="solr.CommonGramsFilterFactory" words="new400common.txt"/>
  </analyzer>

  <analyzer type="query">
    <filter class="solr.CommonGramsQueryFilterFactory" words="new400common.txt"/>
  </analyzer>
</fieldType>



Tom
-Original Message-
From: Peter Karich [mailto:peat...@yahoo.de] 
Sent: Tuesday, August 10, 2010 3:32 PM
To: solr-user@lucene.apache.org
Subject: Re: Improve Query Time For Large Index

Hi Tom,

my index is around 3GB large and I am using 2GB RAM for the JVM although
some more is available.
If I am looking into the RAM usage while a slow query runs (via
jvisualvm) I see that only 750MB of the JVM RAM is used.

 Can you give us some examples of the slow queries?

for example the empty query solr/select?q=
takes very long or solr/select?q=http
where 'http' is the most common term

 Are you using stop words?  

yes, a lot. I stored them into stopwords.txt

 http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

this looks interesting. I read through
https://issues.apache.org/jira/browse/SOLR-908 and it seems to be in 1.4.
I only need to enable it via:

<filter class="solr.CommonGramsFilterFactory" ignoreCase="true" words="stopwords.txt"/>

right? Do I need to reindex?

Regards,
Peter.

 Hi Peter,

 A few more details about your setup would help list members to answer your 
 questions.
 How large is your index?  
 How much memory is on the machine and how much is allocated to the JVM?
 Besides the Solr caches, Solr and Lucene depend on the operating system's 
 disk caching for caching of postings lists.  So you need to leave some memory 
 for the OS.  On the other hand if you are optimizing and refreshing every 
 10-15 minutes, that will invalidate all the caches, since an optimized index 
 is essentially a set of new files.

 Can you give us some examples of the slow queries?  Are you using stop words? 
  

 If your slow queries are phrase queries, then you might try either adding the 
 most frequent terms in your index to the stopwords list  or try CommonGrams 
 and add them to the common words list.  (Details on CommonGrams here: 
 http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2)

 Tom Burton-West

 -Original Message-
 From: Peter Karich [mailto:peat...@yahoo.de] 
 Sent: Tuesday, August 10, 2010 9:54 AM
 To: solr-user@lucene.apache.org
 Subject: Improve Query Time For Large Index

 Hi,

 I have 5 Million small documents/tweets (= ~3GB) and the slave index
 replicates itself from master every 10-15 minutes, so the index is
 optimized before querying. We are using solr 1.4.1 (patched with
 SOLR-1624) via SolrJ.

 Now the search speed is slow 2s for common terms which hits more than 2
 mio docs and acceptable for others: 0.5s. For those numbers I don't use
 highlighting or facets. I am using the following schema [1] and from
 luke handler I know that numTerms =~20 mio. The query for common terms
 stays slow if I retry again and again (no cache improvements).

 How can I improve the query time for the common terms without using
 Distributed Search [2] ?

 Regards,
 Peter.


 [1]
 <field name="id" type="tlong" indexed="true" stored="true"
        required="true" />
 <field name="date" type="tdate" indexed="true" stored="true" />
 <!-- term* attributes to prepare faster highlighting. -->
 <field name="txt" 

RE: Improve Query Time For Large Index

2010-08-10 Thread Burton-West, Tom
Hi Peter,

A few more details about your setup would help list members to answer your 
questions.
How large is your index?  
How much memory is on the machine and how much is allocated to the JVM?
Besides the Solr caches, Solr and Lucene depend on the operating system's disk 
caching for caching of postings lists.  So you need to leave some memory for 
the OS.  On the other hand if you are optimizing and refreshing every 10-15 
minutes, that will invalidate all the caches, since an optimized index is 
essentially a set of new files.

Can you give us some examples of the slow queries?  Are you using stop words?  

If your slow queries are phrase queries, then you might try either adding the 
most frequent terms in your index to the stopwords list  or try CommonGrams and 
add them to the common words list.  (Details on CommonGrams here: 
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2)

Tom Burton-West

-Original Message-
From: Peter Karich [mailto:peat...@yahoo.de] 
Sent: Tuesday, August 10, 2010 9:54 AM
To: solr-user@lucene.apache.org
Subject: Improve Query Time For Large Index

Hi,

I have 5 Million small documents/tweets (= ~3GB) and the slave index
replicates itself from master every 10-15 minutes, so the index is
optimized before querying. We are using solr 1.4.1 (patched with
SOLR-1624) via SolrJ.

Now the search speed is slow 2s for common terms which hits more than 2
mio docs and acceptable for others: 0.5s. For those numbers I don't use
highlighting or facets. I am using the following schema [1] and from
luke handler I know that numTerms =~20 mio. The query for common terms
stays slow if I retry again and again (no cache improvements).

How can I improve the query time for the common terms without using
Distributed Search [2] ?

Regards,
Peter.


[1]
<field name="id" type="tlong" indexed="true" stored="true"
       required="true" />
<field name="date" type="tdate" indexed="true" stored="true" />
<!-- term* attributes to prepare faster highlighting. -->
<field name="txt" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

[2]
http://wiki.apache.org/solr/DistributedSearch



RE: Good list of English words that get butchered by Porter Stemmer

2010-07-30 Thread Burton-West, Tom
A good starting place might be the list of stemming errors for the original 
Porter stemmer in this article that describes k-stem:

Krovetz, R. (1993). Viewing morphology as an inference process. In Proceedings 
of the 16th annual international ACM SIGIR conference on Research and 
development in information retrieval (pp. 191-202). Pittsburgh, Pennsylvania, 
United States: ACM. doi:10.1145/160688.160718

I don't know if the current porter stemmer is different.  I do see that on the 
snowball page there is a porter and a porter2 stemmer and this explanation is 
linked from the porter2 stemmer page: 
http://snowball.tartarus.org/algorithms/english/stemmer.html


Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Friday, July 30, 2010 4:42 PM
To: solr-user@lucene.apache.org
Subject: Good list of English words that get butchered by Porter Stemmer

Hello,

I'm looking for a list of English  words that, when stemmed by Porter stemmer, 
end up in the same stem as  some similar, but unrelated words.  Below are some 
examples:

# this gets stemmed to iron, so if you search for ironic, you'll get iron 
matches
ironic

# same stem as animal
anime
animated 
animation
animations

I imagine such a list could be added to the example protwords.txt

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



RE: Total number of terms in an index?

2010-07-27 Thread Burton-West, Tom
Hi Jason,

Are you looking for the total number of unique terms or total number of term 
occurrences?

Checkindex reports both, but does a bunch of other work so is probably not the 
fastest.

If you are looking for total number of term occurrences, you might look at 
contrib/org/apache/lucene/misc/HighFreqTerms.java.
 
If you are just looking for the total number of unique terms, I wonder if there 
is some low level API that would allow you to just access the in-memory 
representation of the tii file and then multiply the number of terms in it by 
your indexDivisor (default 128). I haven't dug in to the code so I don't 
actually know how the tii file gets loaded into a data structure in memory.  If 
there is api access, it seems like this might be the quickest way to get the 
number of unique terms.  (Of course you would have to do this for each segment).
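
For the brute-force count Hoss describes below, a minimal sketch against the 
Lucene 2.9/3.x API of that era might look like the following (the index path is 
just a placeholder, and my understanding is that the top-level TermEnum already 
merges terms across segments):

  import java.io.File;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.TermEnum;
  import org.apache.lucene.store.FSDirectory;

  public class CountUniqueTerms {
    public static void main(String[] args) throws Exception {
      // Open the index read-only; the path is a placeholder.
      IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")), true);
      try {
        TermEnum terms = reader.terms();  // all terms, in order
        long count = 0;
        while (terms.next()) {
          count++;
        }
        terms.close();
        System.out.println("unique terms: " + count);
      } finally {
        reader.close();
      }
    }
  }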

Tom
-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Monday, July 26, 2010 8:39 PM
To: solr-user@lucene.apache.org
Subject: Re: Total number of terms in an index?


: Sorry, like the subject, I mean the total number of terms.

it's not stored anywhere, so the only way to fetch it is to actually 
iteate all of the terms and count them (that's why LukeRequestHandler is 
slow slow to compute this particular value)

If i remember right, someone mentioned at one point that flex would let 
you store data about stuff like this in your index as part of the segment 
writing, but frankly i'm still not sure how that iwll help -- because you 
unless your index is fully optimized, you still have to iterate the terms 
in each segment to 'de-dup' them.


-Hoss



RE: indexing best practices

2010-07-19 Thread Burton-West, Tom
Hi Ken,

This is all very dependent on your documents, your indexing setup and your 
hardware. Just as an extreme data point, I'll describe our experience.  

We run 5 clients on each of 6 machines to send documents to Solr using the 
standard http xml process.  Our documents contain about 10 fields, but one 
field contains OCR for the full text of a book.  The documents are about 700KB 
in size.

Each client sends solr documents to one of 10 solr shards on a round-robin 
basis.  We are running 5 shards on each of two dedicated indexing machines each 
with 144GB of memory and 2 x Quad Core Intel Xeon E5540 2.53GHz processors 
(Nehalem).  What we generally see is that once the index gets large enough for 
significant merging, our producers can send documents to solr faster than it 
can index them.

We suspect that our bottleneck is simply disk I/O for index merging on the Solr 
build machines.  We are currently experimenting with changing the 
maxRAMBufferSize settings and various merge policies/merge factors to see if we 
can speed up the Solr end of the indexing process.   Since we optimize our 
index down to two segments, we are also planning to experiment with using the 
nomerge merge policy. I hope to have some results to report on our blog 
sometime in the next  month or so.

Tom Burton-West
www.hathitrust.org/blogs

-Original Message-
From: kenf_nc [mailto:ken.fos...@realestate.com] 
Sent: Sunday, July 18, 2010 8:18 AM
To: solr-user@lucene.apache.org
Subject: Re: indexing best practices


No one has done performance analysis? Or has a link to anywhere where it's
been done?

basically fastest way to get documents into Solr. So many options available,
what's the fastest:
1) file import (xml, csv)  vs  DIH  vs POSTing
2) number of concurrent clients   1   vs 10 vs 100 ...is there a diminishing
returns number?

I have 16 million small (8 to 10 fields, no large text fields) docs that get
updated monthly and 2.5 million largish (20 to 30 fields, a couple html text
fields) that get updated monthly. It currently takes about 20 hours to do a
full import. I would like to cut that down as much as possible.
Thanks,
Ken
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/indexing-best-practices-tp973274p976313.html
Sent from the Solr - User mailing list archive at Nabble.com.


benchmarking indexing :Use Solr or use Lucene Benchmark?

2010-06-08 Thread Burton-West, Tom
Hi all,

We are about to test out various factors to try to speed up our indexing 
process.  One set of experiments will try various maxRamBufferSizeMB  settings. 
  Since the factors we will be varying are at the Lucene level, we are 
considering using the Lucene Benchmark utilities in Lucene/contrib..   Have 
other Solr users used Lucene Benchmark?  Can anyone provide any hints for 
adapting it to Solr? (Are there any common gotchas etc?).

Tom

Tom Burton-West
University of Michigan Libraries
http://www.hathitrust.org/blogs/large-scale-search



RE: Using NoOpMergePolicy (Lucene 2331) from Solr

2010-04-29 Thread Burton-West, Tom
Thanks Koji, 

That was the information I was looking for.  I'll be sure to post the test 
results to the list.  It may be a few weeks before we can schedule the tests 
for our test server.

Tom


I've never tried it but NoMergePolicy and NoMergeScheduler
can be specified in solrconfig.xml:

  <ramBufferSizeMB>1000</ramBufferSizeMB>
  <mergePolicy class="org.apache.lucene.index.NoMergePolicy"/>
  <mergeScheduler class="org.apache.lucene.index.NoMergeScheduler"/>

Koji

-- 
http://www.rondhuit.com/en/



Using NoOpMergePolicy (Lucene 2331) from Solr

2010-04-27 Thread Burton-West, Tom
Is it possible to use the NoOpMergePolicy ( 
https://issues.apache.org/jira/browse/LUCENE-2331   ) from Solr?

We have very large indexes and always optimize, so we are thinking about using 
a very large ramBufferSizeMB
and a NoOpMergePolicy and then running an optimize to avoid extra disk reads 
and writes.

Tom Burton-West



RE: nfs vs sas in production

2010-04-27 Thread Burton-West, Tom
Hi Kallin,

Given the previous postings on the list about terrible NFS performance we were 
pleasantly surprised when we did some tests against a well tuned NFS RAID array 
on a private network.  We got reasonably good results (given our large index 
sizes.) See 
http://www.hathitrust.org/blogs/large-scale-search/current-hardware-used-testing
  and 
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance.   

Just prior to going into production we moved from direct attached storage to a 
very high performance NAS in production for a number of reasons including ease 
of management as we scale out.  One of the reasons was to reduce contention 
between indexing/optimizing and search instances for disk I/O.  See 
http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-50-volumes-5-million-volumes-and-beyond
 for details.

Tom

-Original Message-
From: Nagelberg, Kallin [mailto:knagelb...@globeandmail.com] 
Sent: Tuesday, April 27, 2010 4:13 PM
To: 'solr-user@lucene.apache.org'
Subject: nfs vs sas in production

Hey,

A question was raised during a meeting about our new Solr based search 
projects. We're getting 4 cutting edge servers each with something like 24 Gigs 
of ram dedicated to search. However there is some problem with the amount of 
SAS based storage each machine can handle, and people wonder if we might have 
to use a NFS based drive instead. Does anyone have any experience using SAS vs. 
NFS drives for Solr? Any feedback would be appreciated!

Thanks,
-Kallin Nagelberg


Experience with indexing billions of documents?

2010-04-02 Thread Burton-West, Tom
We are currently indexing 5 million books in Solr, scaling up over the next few 
years to 20 million.  However we are using the entire book as a Solr document.  
We are evaluating the possibility of indexing individual pages as there are 
some use cases where users want the most relevant pages regardless of what book 
they occur in.  However, we estimate that we are talking about somewhere 
between 1 and 6 billion pages and have concerns over whether Solr will scale to 
this level.

Does anyone have experience using Solr with 1-6 billion Solr documents?

The lucene file format document 
(http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)  mentions a 
limit of about 2 billion document ids.   I assume this is the lucene internal 
document id and would therefore be a per index/per shard limit.  Is this 
correct?


Tom Burton-West.





Experience with Solr and JVM heap sizes over 2 GB

2010-03-31 Thread Burton-West, Tom
Hello all,

We have been running a configuration in production with 3 solr instances under 
one  tomcat with 16GB allocated to the JVM.  (java -Xmx16384m -Xms16384m)  I 
just noticed the warning in the LucidWorks Certified Distribution Reference 
Guide that warns against using more than 2GB (see below).
Are other people using systems with over 2GB allocated to the JVM?

What steps can we take to determine if performance is being adversely affected 
by the large heap size?
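
One simple step we plan to try is turning on the standard HotSpot GC logging 
and watching the pause times, e.g. (the log path is just a placeholder):

  java ... -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/solr-gc.log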

“The larger the heap the longer it takes to do garbage collection. This can 
mean minor, random pauses or, in extreme cases, “freeze the world” pauses of a 
minute or more. As a practical matter, this can become a serious problem for 
heap sizes that exceed about two gigabytes, even if far more physical memory is 
available.”
http://www.lucidimagination.com/search/document/CDRG_ch08_8.4.1?q=memory%20caching

Tom Burton-West
--
<lst name="jvm">
  <str name="version">14.2-b01</str>
  <str name="name">Java HotSpot(TM) 64-Bit Server VM</str>
  <int name="processors">16</int>
  <lst name="memory">
    <str name="free">2.3 GB</str>
    <str name="total">15.3 GB</str>
    <str name="max">15.3 GB</str>
    <str name="used">13.1 GB (%85.3)</str>
  </lst>
</lst>




RE: Cleaning up dirty OCR

2010-03-11 Thread Burton-West, Tom
Thanks Robert,

I've been thinking about this since you suggested it on another thread.  One 
problem is that it would also remove real words. Apparently 40-60% of the words 
in large corpora occur only once 
(http://en.wikipedia.org/wiki/Hapax_legomenon.)  

There are a couple of use cases where removing words that occur only once might 
be a problem.  

One is for genealogical searches where a user might want to retrieve a document 
if their relative is only mentioned once in the document.  We have quite a few 
government documents and other resources such as the Lineage Book of the 
Daughters of the American Revolution.  

Another use case is humanities researchers doing phrase searching for quotes.  
In this case, if we remove one of the words in the quote because it occurs only 
once in a document, then the phrase search would fail.  For example if someone 
were searching Macbeth and entered the phrase query "Eye of newt and toe of 
frog", it would fail if we had removed "newt" from the index because "newt" 
occurs only once in Macbeth.

I ran a quick check against a couple of our copies of Macbeth and found out of 
about 5,000 unique words about 3,000 occurred only once.  Of these about 1,800 
were in the unix dictionary, so at least 1800 words that would be removed would 
be real words as opposed to OCR errors (a spot check of the words not in the 
unix /usr/share/dict/words file revealed most of them also as real words rather 
than OCR errors.)

I also ran a quick check against a document with bad OCR and out of about 
30,000 unique words, 20,000 occurred only once.  Of those 20,000 only about 300 
were in the unix dictionary so your intuition that a lot of OCR errors will 
occur only once seems spot on.  A quick look at the words not in the dictionary 
revealed a mix of technical terms, common names, and obvious OCR nonsense such 
as ffll.lj'slall'lm 

I guess the question I need to determine is whether the benefit of removing 
words that occur only once outweighs the costs in terms of the two use cases 
outlined above.   When we get our new test server set up, sometime in the next 
month, I think I will go ahead and prune a test index of 500K docs and do some 
performance testing just to get an idea of the potential performance gains of 
pruning the index.

I have some other questions about index pruning, but I want to do a bit more 
reading and then I'll post a question to either the Solr or Lucene list.  Can 
you suggest which list I should post an index pruning question to?

Tom








-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Tuesday, March 09, 2010 2:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Cleaning up dirty OCR

 Can anyone suggest any practical solutions to removing some fraction of the 
 tokens containing OCR errors from our input stream?

one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812

and filter terms that only appear once in the document.


-- 
Robert Muir
rcm...@gmail.com


Cleaning up dirty OCR

2010-03-09 Thread Burton-West, Tom
Hello all,

We have been indexing a large collection of OCR'd text. About 5 million books 
in over 200 languages.  With 1.5 billion OCR'd pages, even a small OCR error 
rate creates a relatively large number of meaningless unique terms.  (See  
http://www.hathitrust.org/blogs/large-scale-search/too-many-words )

We would like to remove some *fraction* of these nonsense words caused by OCR 
errors prior to indexing. ( We don't want to remove real words, so we need 
some method with very few false positives.)

A dictionary based approach does not seem feasible given the number of 
languages and the inclusion of proper names, place names, and technical terms.  
 We are considering using some heuristics, such as looking for strings over a 
certain length or strings containing more than some number of punctuation 
characters.

This paper has a few such heuristics:
Kazem Taghva, Tom Nartker, Allen Condit, and Julie Borsack. Automatic Removal 
of ``Garbage Strings'' in OCR Text: An Implementation. In The 5th World 
Multi-Conference on Systemics, Cybernetics and Informatics, Orlando, Florida, 
July 2001. http://www.isri.unlv.edu/publications/isripub/Taghva01b.pdf

Can anyone suggest any practical solutions to removing some fraction of the 
tokens containing OCR errors from our input stream?

Tom Burton-West
University of Michigan Library
www.hathitrust.org



What is largest reasonable setting for ramBufferSizeMB?

2010-02-17 Thread Burton-West, Tom
Hello all,

At some point we will need to re-build an index that totals about 2 terrabytes 
in size (split over 10 shards).  At our current indexing speed we estimate that 
this will take about 3 weeks.  We would like to reduce that time.  It appears 
that our main bottleneck is disk I/O.
 We currently have ramBufferSizeMB set to 32 and our merge factor is 10.  If we 
increase ramBufferSizeMB to 320, we avoid a merge and the 9 disk writes and 
reads to merge 9+1 32MB segments into a 320MB segment.

 Assuming we allocate enough memory to the JVM, would it make sense to increase 
ramBufferSize to 3200MB?   What are people's experiences with very large 
ramBufferSizeMB sizes?
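
For concreteness, the knobs in question live in the indexDefaults section of 
solrconfig.xml; with the hypothetical larger buffer it would look something like:

  <indexDefaults>
    <ramBufferSizeMB>3200</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
    ...
  </indexDefaults>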

Tom Burton-West
University of Michigan Library
www.hathitrust.org



TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-08 Thread Burton-West, Tom
Hello all,

After optimizing rather large indexes on 10 shards (each index holds about 
500,000 documents and is  about 270-300 GB in size) we started getting  
intermittent TermInfosReader.get()  ArrayIndexOutOfBounds exceptions.  The 
exceptions sometimes seem to occur on all 10 shards at the same time and 
sometimes on one shard but not the others.   We also sometimes get an Internal 
Server Error but that might be either a cause or an effect of the array index 
out of bounds.  Here is the top part of the message:


java.lang.ArrayIndexOutOfBoundsException: -14127432
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:246)

Any suggestions for troubleshooting would be appreciated.

Trace from tomcat logs appended below.

Tom Burton-West

---

Feb 5, 2010 8:09:02 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.ArrayIndexOutOfBoundsException: -14127432
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:246)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:218)
at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:943)
at 
org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:308)
at 
org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:144)
at org.apache.lucene.search.Similarity.idf(Similarity.java:481)
at 
org.apache.lucene.search.TermQuery$TermWeight.<init>(TermQuery.java:44)
at org.apache.lucene.search.TermQuery.createWeight(TermQuery.java:146)
at 
org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:186)
at 
org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:366)
at org.apache.lucene.search.Query.weight(Query.java:95)
at org.apache.lucene.search.Searcher.createWeight(Searcher.java:230)
at org.apache.lucene.search.Searcher.search(Searcher.java:171)
at 
org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:651)
at 
org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:545)
at 
org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:581)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:903)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884)
at 
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:176)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
at 
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:548)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
at 
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
at 
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
at 
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
at 
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
at java.lang.Thread.run(Thread.java:619)

Feb 5, 2010 8:09:02 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: Internal Server Error

Internal Server Error

request: http://solr-sdr-search-10:8081/serve-10/select
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:423)
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:242)
at 

IndexWriter InfoStream in solrconfig not working

2009-10-07 Thread Burton-West, Tom
Hello,

We are trying to debug an indexing/optimizing problem and have tried setting 
the infoStream  file in solrconf.xml so that the SolrIndexWriter will write a 
log file.  Here is our setting:

<!--
  To aid in advanced debugging, you may turn on IndexWriter debug logging.
  Uncommenting this and setting to true will set the file that the underlying
  Lucene IndexWriter will write its debug infostream to.
-->
<infoStream file="/tmp/LuceneIndexWriterDebug.log">true</infoStream>

After making that change to solrconfig.xml, restarting Solr, we see a message 
in the tomcat logs saying that the log is enabled:

build-2_log.2009-10-06.txt:INFO: IndexWriter infoStream debug log is enabled: 
/tmp/LuceneIndexWriterDebug.log

However, if we then run an optimize we can't see any log file being written.

I also looked at the patch for  http://issues.apache.org/jira/browse/SOLR-1145, 
but did not see a unit test that I might try to run in our system.


Do others have this logging working successfully ?
Is there something else that needs to be set up?

Tom



Solr admin url for example gives 404

2009-08-26 Thread Burton-West, Tom
Hello all,

When I start up Solr from the example directory using start.jar, it seems to 
start up, but when I go to the localhost admin url 
(http://localhost:8983/solr/admin) I get a 404 (See message appended below).  
Has the url for the Solr admin changed?


Tom
Tom Burton-West
---
Here is the message I get with the 404:


HTTP ERROR: 404 NOT_FOUND RequestURI=/solr/admin Powered by 
jetty://http://jetty.mortbay.org
Steps to reproduce the problems:

1 get the latest Solr from svn (R 808058)
2 run ant clean test   (all tests pass)
3 cd ./example
4. start solr
$ java -jar start.jar
2009-08-26 12:08:08.300::INFO:  Logging to STDERR via org.mortbay.log.StdErrLog
2009-08-26 12:08:08.472::INFO:  jetty-6.1.3
2009-08-26 12:08:08.519::INFO:  Started SocketConnector @ 0.0.0.0:8983
5. go to browser and try to look at admin panel: 
http://localhost:8983/solr/admin



WordDelimiterFilterFactory removes words when options set to 0

2009-04-17 Thread Burton-West, Tom
In trying to understand the various options for WordDelimiterFilterFactory, I 
tried setting all options to 0.
This seems to prevent a number of words from being output at all. In particular, 
"can't" and "99dxl" don't get output, nor do any words containing hyphens. Is 
this correct behavior?


Here is what the Solr Analyzer output

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position    1      2        3      4          5         6        7      8      9
term text        ca-55  99_3_a9  55-67  powerShot  ca999x15  foo-bar  can't  joe's  99dxl

org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=0, 
generateNumberParts=0, catenateWords=0, generateWordParts=0, catenateAll=0, 
catenateNumbers=0}

term position     1          5
term text         powerShot  joe
term type         word       word
source start,end  20,29      53,56

Here is the schema
<fieldtype name="mbooksOcrXPatLike" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            generateWordParts="0"
            generateNumberParts="0"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

Tom

Can TermIndexInterval be set in Solr?

2009-03-25 Thread Burton-West, Tom
Hello all,

We are experimenting with the ShingleFilter with a very large document set (1 
million full-text books). Because the ShingleFilter indexes every word pair as 
a token, the number of unique terms increases tremendously.  In our experiments 
so far the tii and tis files are getting very large and the tii file will 
eventually be too large to fit into memory.  If we set the TermIndexInterval to 
a larger number than the default 128, the tii file size should go down.  Is it 
possible to set this somehow through Solr configuration or do we need to modify 
the code somewhere and call IndexWriter.setTermIndexInterval?


Tom

Tom Burton-West
Digital Library Production Services
University of Michigan Library

 

RE: NIO not working yet

2008-12-02 Thread Burton-West, Tom
Thanks Yonik,

-The next nightly build (Dec-01-2008) should have the changes.

The latest nightly build seems to be 30-Nov-2008 08:20,
http://people.apache.org/builds/lucene/solr/nightly/ 
has the version with the NIO fix been built?  Are we looking in the
wrong place?

Tom

Tom Burton-West
Information Retrieval Progammer
Digital Library Production Services
University of Michigan Library

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Sunday, November 30, 2008 8:43 PM
To: solr-user@lucene.apache.org
Subject: Re: NIO not working yet

OK, the development version of Solr should now be fixed (i.e. NIO should
be the default for non-Windows platforms).  The next nightly build
(Dec-01-2008) should have the changes.

-Yonik

On Wed, Nov 12, 2008 at 2:59 PM, Yonik Seeley [EMAIL PROTECTED] wrote:
 NIO support in the latest Solr development versions does not work yet 
 (I previously advised that some people with possible lock contention 
 problems try it out).  We'll let you know when it's fixed, but in the 
 meantime you can always set the system property 
 org.apache.lucene.FSDirectory.class to 
 org.apache.lucene.store.NIOFSDirectory to try it out.

 for example:

 java -Dorg.apache.lucene.FSDirectory.class=org.apache.lucene.store.NIOFSDirectory -jar start.jar

 -Yonik


Load balancing for distributed Solr

2008-12-02 Thread Burton-West, Tom
Hello all,

As I understand distributed Solr, a request for a distributed search
goes to a particular Solr instance with a list of arguments specifying
the addresses of the shards to search.  The Solr instance to which the
request is first directed is responsible for distributing the query to
the other shards and pulling together the results.  My questions are:

1 Does it make sense to 
 A.  Always have the same Solr instance responsible for distributing the
query to the other shards
   or 
 B.   Rotate which shard does the distributing/result aggregating?  

2. For scenario A, are there different requirements (memory,cpu,
processors etc) for the machine doing the distribution versus the
machines hosting the shards responding to the distributed requests?

3. For scenario B, are people using some kind of load balancing to
distribute which Solr instance acts as the query distributor/response
aggregator? 
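
(For reference, the shard list travels on the query itself, e.g.
http://anyhost:8983/solr/select?q=dog&shards=host1:8983/solr,host2:8983/solr
where the host names are placeholders, so whichever instance receives the
request plays the distributor/aggregator role.)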

Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Services
University of Michigan


 


port of Nutch CommonGrams to Solr for help with slow phrase queries

2008-11-24 Thread Burton-West, Tom
Hello all,

We are having problems with extremely slow phrase queries when the
phrase query contains a common words. We are reluctant to just use stop
words due to various problems with false hits and some things becoming
impossible to search with stop words turned on. (For example "to be or
not to be", "the who", "man in the moon" vs "man on the moon", etc.)

The approach to this problem used by Nutch looks promising.  Has anyone
ported the Nutch CommonGrams filter to Solr?

Construct n-grams for frequently occuring terms and phrases while
indexing. Optimize phrase queries to use the n-grams. Single terms are
still indexed too, with n-grams overlaid.
http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html


Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Services
University of Michigan Library


Processing of prx file for phrase queries: Whole position list for term read?

2008-11-18 Thread Burton-West, Tom
Hello,

We are working with a very large index and with large documents (300+
page books.)  It appears that the bottleneck on our system is the disk
IO involved in reading position information from the prx file for
commonly occuring terms. 

An example slow query is "the new economics".

To process the above phrase query for the word "the", does the entire
part of the .prx file for the word "the" need to be read into memory, or
only the fragments of the entries for the word "the" that contain
specific doc ids?

In reading the lucene index file formats document
(http://lucene.apache.org/java/2_4_0/fileformats.html) its not clear
whether the .tis file stores a pointer into the .prx file for a term
(and therefore the entire list of doc_ids and positions for that term
needs to be read into memory), or if the .tis file stores a pointer to
the term **and doc id** in the prx file, in which case only the
positions for a given doc id would need to be read. Or if somehow the
.frq file has information on where to find the doc id in the .prx file.


The documentation for the .tis file says that it stores ProxDelta which
is based on the term (rather than the term/doc id).  On the other hand
the documentation for the .prx file states that Positions entries are
ordered by increasing document number (the document number is implicit
from the .frq file)


Tom



RE: Solr locking issue? BLOCKED on lock=org.apache.lucene.store.FSDirectory

2008-11-11 Thread Burton-West, Tom
Hi Yonik,

Thanks for the NIO suggestion. We are using Linux, but our indexes are
NFS mounted.  I thought I saw something about problems with NIO and NFS,
but am fuzzy on the details. 

These results are with Solr 1.2 and I'm wondering if even without the
NIO change, upgrading to Solr 1.3 might help.  What confuses me is why
multiple searchers are locking the prx index file.  I would think that
searching is a read-only operation.   

Perhaps we need to change something to tell Solr we aren't updating the
index?

Tom 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Friday, November 07, 2008 8:25 PM
To: solr-user@lucene.apache.org
Cc: Farber, Phillip
Subject: Re: Solr locking issue? BLOCKED on
lock=org.apache.lucene.store.FSDirectory

Hi Tom, if you're on a non Windows box, could you perhaps try your test
on the latest Solr nightly build?  We've recently improved this through
the use of NIO.

-Yonik

On Fri, Nov 7, 2008 at 4:23 PM, Burton-West, Tom [EMAIL PROTECTED]
wrote:
 Hello,

 We are testing Solr with a simulation of 30 concurrent users.  We are 
 getting socket timeouts and the thread dump from the admin tool shows 
 about 100+ threads with a similar message about a lock. (Message 
 appended below).

 We supsect this may have something to do with one or more phrase 
 queries containing common terms since our index is very large and we 
 suspect one or more very large segments of the position index need to 
 be read into memory.

 Can someone point us to either the possible cause of this problem or 
 what we might change to reduce/eliminate it?

 Tom

 Tom Burton-West
 Information Retrieval Programmer
 Digital Library Production Services
 University of Michigan Library
 [EMAIL PROTECTED]

 --

  'http-8080-Processor54' Id=71, BLOCKED on [EMAIL PROTECTED]47, total cpu time=2070.ms user time=1460.ms
 at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:532)
 at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:93)
 at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34)
 at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:57)
 at org.apache.lucene.index.SegmentTermPositions.readDeltaPosition(SegmentTermPositions.java:70)
 at org.apache.lucene.index.SegmentTermPositions.nextPosition(SegmentTermPositions.java:66)
 at org.apache.lucene.search.PhrasePositions.nextPosition(PhrasePositions.java:76)
 at org.apache.lucene.search.ExactPhraseScorer.phraseFreq(ExactPhraseScorer.java:45)
 at org.apache.lucene.search.PhraseScorer.doNext(PhraseScorer.java:94)
 at org.apache.lucene.search.PhraseScorer.next(PhraseScorer.java:81)
 at org.apache.lucene.search.Scorer.score(Scorer.java:48)
 at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
 at org.apache.lucene.search.Searcher.search(Searcher.java:118)
 at org.apache.lucene.search.Searcher.search(Searcher.java:97)
 at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:888)
 at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:805)
 at org.apache.solr.search.SolrIndexSearcher.getDocList(SolrIndexSearcher.java:698)
 at org.apache.solr.request.StandardRequestHandler.handleRequestBody(StandardRequestHandler.java:122)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
 at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
 at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
 at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
 at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:548)
 at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
 at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
 at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
 at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
 at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
 at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
 at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
 at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81

Solr locking issue? BLOCKED on lock=org.apache.lucene.store.FSDirectory

2008-11-07 Thread Burton-West, Tom
Hello,

We are testing Solr with a simulation of 30 concurrent users.  We are
getting socket timeouts and the thread dump from the admin tool shows
about 100+ threads with a similar message about a lock. (Message
appended below).

We suspect this may have something to do with one or more phrase queries
containing common terms since our index is very large and we suspect one
or more very large segments of the position index need to be read into
memory.
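
For concreteness, a minimal sketch of the kind of query meant here, written against the Lucene API of that era (2.x-style; the field name and terms are hypothetical):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class CommonTermPhraseSketch {
    public static void main(String[] args) {
        // Exact phrase (slop 0) over two very frequent terms.
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("ocr", "of"));
        phrase.add(new Term("ocr", "the"));
        // Scoring this query drives ExactPhraseScorer to read a position entry
        // for every occurrence of both terms in every candidate document, which
        // for a large index can mean scanning a very large slice of the .prx
        // (positions) file.
        System.out.println(phrase);
    }
}

The ExactPhraseScorer.phraseFreq and SegmentTermPositions.nextPosition frames in the thread dump below are the code path such a query exercises.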

Can someone point us to either the possible cause of this problem or
what we might change to reduce/eliminate it?

Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Services
University of Michigan Library
[EMAIL PROTECTED]

--

 'http-8080-Processor54' Id=71, BLOCKED on [EMAIL PROTECTED], total cpu time=2070.ms user time=1460.ms
at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:532)
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:93)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34)
at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:57)
at org.apache.lucene.index.SegmentTermPositions.readDeltaPosition(SegmentTermPositions.java:70)
at org.apache.lucene.index.SegmentTermPositions.nextPosition(SegmentTermPositions.java:66)
at org.apache.lucene.search.PhrasePositions.nextPosition(PhrasePositions.java:76)
at org.apache.lucene.search.ExactPhraseScorer.phraseFreq(ExactPhraseScorer.java:45)
at org.apache.lucene.search.PhraseScorer.doNext(PhraseScorer.java:94)
at org.apache.lucene.search.PhraseScorer.next(PhraseScorer.java:81)
at org.apache.lucene.search.Scorer.score(Scorer.java:48)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
at org.apache.lucene.search.Searcher.search(Searcher.java:118)
at org.apache.lucene.search.Searcher.search(Searcher.java:97)
at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:888)
at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:805)
at org.apache.solr.search.SolrIndexSearcher.getDocList(SolrIndexSearcher.java:698)
at org.apache.solr.request.StandardRequestHandler.handleRequestBody(StandardRequestHandler.java:122)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:548)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
at java.lang.Thread.run(Thread.java:619)