Outdated information on JVM heap sizes in Solr 8.3 documentation?

2020-02-14 Thread Tom Burton-West
Hello,

In the section on JVM tuning in the  Solr 8.3 documentation (
https://lucene.apache.org/solr/guide/8_3/jvm-settings.html#jvm-settings)
there is a paragraph which cautions about setting heap sizes over 2 GB:

"The larger the heap the longer it takes to do garbage collection. This can
mean minor, random pauses or, in extreme cases, "freeze the world" pauses
of a minute or more. As a practical matter, this can become a serious
problem for heap sizes that exceed about **two gigabytes**, even if far
more physical memory is available. On robust hardware, you may get better
results running multiple JVMs, rather than just one with a large memory
heap. "  (** added by me)

I suspect this paragraph is severely outdated, but am not a Java expert.
It seems to be contradicted by this statement in
https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#memory-and-gc-settings:
"...values between 10 and 20 gigabytes are not uncommon for production
servers"

Are "freeze the world" pauses still an issue with modern JVM's?
Is it still advisable to avoid heap sizes over 2GB?

Tom
https://www.hathitrust.org/blogs/large-scale-search


Re: loadOnStartup=false doesn't appear to work for Solr 6.6

2018-08-17 Thread Tom Burton-West
Thanks Erick,

Silly oversight on my part.  I went into the admin panel and used the core
selector to view information about the core, and it was running.
I did some more thinking about it, restarted Solr, and looked at the core
admin panel, where I could see that the startTime was "-".

So the problem is operator error.  I didn't think about how the core
selector actually sends a query to the core to get stats, which of course
starts the core.


Tom

On Fri, Aug 17, 2018 at 12:18 PM, Erick Erickson 
wrote:

> Tom:
>
> That hasn't been _intentionally_ changed. However, any request that
> comes in (update or query) will permanently load the core (assuming no
> transient cores), and any request to the core will autoload it. How
> are you determining that the core hasn't been loaded? And are there
> any background tasks that could be causing them to load (autowarming
> in solrconfig doesn't count).
>
> On Fri, Aug 17, 2018 at 8:57 AM, Tom Burton-West 
> wrote:
> > Hello,
> >
> > I'm not using SolrCloud and want to have some cores not load when Solr
> > starts up.
> > I tried loadOnStartup=false, but the cores seem to start up anyway.
> >
> > Is the loadOnStartup parameter still usable with Solr 6.6 or does the
> > documentation need updating?
> >  Or  Is there something else I need to do/set?
> >
> > Tom
>


loadOnStartup=false doesn't appear to work for Solr 6.6

2018-08-17 Thread Tom Burton-West
Hello,

I'm not using SolrCloud and want to have some cores not load when Solr
starts up.
I tried loadOnStartup=false, but the cores seem to start up anyway.

Is the loadOnStartup parameter still usable with Solr 6.6, or does the
documentation need updating?
Or is there something else I need to do or set?

Tom


Re: Can the export handler be used with the edismax or dismax query handler

2018-07-29 Thread Tom Burton-West
Thanks Mikhail and Erick,

I don't need ranks or scores, I just need the full set of results.  Will
the export handler work with an fq that uses edismax? (I'm not at work
today, but I can try it out tomorrow.)

Using a simple (non-edismax) query, I compared the export handler against
cursorMark with rows = 50K to 200K.  The export handler took about 8 ms to
export all 1.9 million results and had minimal impact on server CPU and
memory.  With cursorMark it took about 1 minute 20 seconds, CPU use
increased by about 25%, and there were many more garbage collections,
although the time for GC totaled only a few seconds.
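
For reference, a rough SolrJ sketch of the cursorMark deep-paging approach
being compared here; the URL, page size, and the edismax filter query (the
"put edismax in an fq" idea discussed in this thread) are illustrative
values, not tested settings:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorMarkDump {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build();
    SolrQuery q = new SolrQuery("*:*");
    // No ranking needed, so the edismax query goes in an fq (Mikhail's suggestion)
    q.addFilterQuery("{!edismax qf='ocr^5 title^10' mm='100%'}European Art History");
    q.setFields("id");
    q.setRows(50000);                             // docs fetched per request
    q.setSort(SolrQuery.SortClause.asc("id"));    // cursorMark requires a sort on the uniqueKey
    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    boolean done = false;
    while (!done) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
      QueryResponse rsp = client.query(q);
      // ... write out rsp.getResults() ...
      String next = rsp.getNextCursorMark();
      done = cursorMark.equals(next);             // unchanged mark means all results were read
      cursorMark = next;
    }
    client.close();
  }
}

The export handler avoids this request-per-page loop entirely, which is
presumably where the CPU and GC difference described above comes from.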

Tom



On Sat, Jul 28, 2018 at 4:25 AM, Mikhail Khludnev  wrote:

> Tom,
> Do you say you don't need rank results or you don't need to export score?
> If the former is true, you can just put edismax to fq.
> Just a note: using cursor mark with the score may cause some kind of hit
> dupes and probably missing some ones.
>
> On Sat, Jul 28, 2018 at 5:20 AM Erick Erickson 
> wrote:
>
> > What about cursorMark? That's designed to handle repeated calls with
> > increasing "start" parameters without bogging down.
> >
> > https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html
> >
> > Best,
> > Erick
> >
> > On Fri, Jul 27, 2018 at 9:47 AM, Tom Burton-West 
> > wrote:
> > > Thanks Joel,
> > >
> > > My use case is that I have a complex edismax query (example below)  and
> > the
> > > user wants to download the set of *all* search results (ids and some
> > small
> > > metadata fields).   So they don't need the relevance ranking.
> However, I
> > > need to somehow get the exact set that the complex edismax query
> matched.
> > >
> > > Should I try to write some code to rewrite  the logic of the edismax
> > query
> > > with a complex boolean query or would it make sense for me to look at
> > > possibly modifying the export handler for my use case?
> > >
> > > Tom
> > >
> > > "q= _query_:"{!edismax
> > >
> > qf='ocr^5+allfieldsProper^2+allfields^1+titleProper^50+
> title_topProper^30+title_restProper^15+title^10+title_
> top^5+title_rest^2+series^5+series2^5+author^80+author2^
> 50+issn^1+isbn^1+oclc^1+sdrnum^1+ctrlnum^1+id^1+
> rptnum^1+topicProper^2+topic^1+hlb3^1+fullgeographic^1+fullgenre^1+era^1+'
> > >
> > pf='title_ab^1+titleProper^1500+title_topProper^1000+title_
> restProper^800+series^100+series2^100+author^1600+
> author2^800+topicProper^200+fullgenre^200+hlb3^200+allfieldsProper^100+'
> > > mm='100%25' tie='0.9' } European Art History"
> > >
> > >
> > > On Thu, Jul 26, 2018 at 6:02 PM, Joel Bernstein 
> > wrote:
> > >
> > >> The export handler doesn't allow sorting by score at this time. It
> only
> > >> supports sorting on fields. So the edismax qparser won't currently
> > work
> > >> with the export handler.
> > >>
> > >> Joel Bernstein
> > >> http://joelsolr.blogspot.com/
> > >>
> > >> On Thu, Jul 26, 2018 at 5:52 PM, Tom Burton-West 
> > >> wrote:
> > >>
> > >> > Hello all,
> > >> >
> > >> > I am completely new to the export handler.
> > >> >
> > >> > Can the export handler be used with the edismax or dismax query
> > handler?
> > >> >
> > >> > I tried using local params :
> > >> >
> > >> > q= _query_:"{!edismax qf='ocr^5+allfields^1+titleProper^50'
> > >> > mm='100%25'
> > >> > tie='0.9' } art"
> > >> >
> > >> > which does not seem to be working.
> > >> >
> > >> > Tom
> > >> >
> > >>
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Re: Can the export handler be used with the edismax or dismax query handler

2018-07-27 Thread Tom Burton-West
Thanks Joel,

My use case is that I have a complex edismax query (example below)  and the
user wants to download the set of *all* search results (ids and some small
metadata fields).   So they don't need the relevance ranking.  However, I
need to somehow get the exact set that the complex edismax query matched.

Should I try to write some code to rewrite the logic of the edismax query
as a complex boolean query, or would it make sense for me to look at
possibly modifying the export handler for my use case?

Tom

"q= _query_:"{!edismax
qf='ocr^5+allfieldsProper^2+allfields^1+titleProper^50+title_topProper^30+title_restProper^15+title^10+title_top^5+title_rest^2+series^5+series2^5+author^80+author2^50+issn^1+isbn^1+oclc^1+sdrnum^1+ctrlnum^1+id^1+rptnum^1+topicProper^2+topic^1+hlb3^1+fullgeographic^1+fullgenre^1+era^1+'
pf='title_ab^1+titleProper^1500+title_topProper^1000+title_restProper^800+series^100+series2^100+author^1600+author2^800+topicProper^200+fullgenre^200+hlb3^200+allfieldsProper^100+'
mm='100%25' tie='0.9' } European Art History"


On Thu, Jul 26, 2018 at 6:02 PM, Joel Bernstein  wrote:

> The export handler doesn't allow sorting by score at this time. It only
> supports sorting on fields. So the edismax qparser won't currently work
> with the export handler.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Jul 26, 2018 at 5:52 PM, Tom Burton-West 
> wrote:
>
> > Hello all,
> >
> > I am completely new to the export handler.
> >
> > Can the export handler be used with the edismax or dismax query handler?
> >
> > I tried using local params :
> >
> > q= _query_:"{!edismax qf='ocr^5+allfields^1+titleProper^50'
> > mm='100%25'
> > tie='0.9' } art"
> >
> > which does not seem to be working.
> >
> > Tom
> >
>


Can the export handler be used with the edismax or dismax query handler

2018-07-26 Thread Tom Burton-West
Hello all,

I am completely new to the export handler.

Can the export handler be used with the edismax or dismax query handler?

I tried using local params:

q= _query_:"{!edismax qf='ocr^5+allfields^1+titleProper^50' mm='100%25'
tie='0.9' } art"

which does not seem to be working.

Tom


Error in Solr 6.6 Example schemas re: DocValues for StrField type must be single-valued?

2017-08-15 Thread Tom Burton-West
Hello,

The comments in the example schemas for Solr 6.6 state that the
StrField type must be single-valued to support doc values.

For example
Solr-6.6.0/server/solr/configsets/basic_configs/conf/managed-schema:

216  

However, on line 221 a StrField is declared with docValues that is
multiValued:
221  

Also note that the comments above say that the field must either be
required or have a default value, but line 221 appears to satisfy neither
condition.

The JavaDocs indicate that StrField can be multi-valued:
https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/index/DocValuesType.html

Is the comment in the example schema file completely wrong, or is there
some issue with using docValues with a multiValued StrField?

Tom Burton-West

https://www.hathitrust.org/blogs/large-scale-search


Re: Solr Support for BM25F

2016-04-18 Thread Tom Burton-West
Hi David,

It may not matter for your use case, but just in case you really are
interested in the "real BM25F": there is a difference between configuring k1
and b for different fields in Solr and a "real" BM25F implementation.  This
has to do with Solr's model of fields being mini-documents (i.e. each field
has its own length, idf and tf).  See the discussion in
https://issues.apache.org/jira/browse/LUCENE-2959, particularly these
comments by Robert Muir:

"Actually as far as BM25f, this one presents a few challenges (some already
discussed on LUCENE-2091).

To summarize:

   - for any field, Lucene has a per-field terms dictionary that contains
   that term's docFreq. To compute BM25f's IDF method would be challenging,
   because it wants a docFreq "across all the fields". (its not clear to me at
   a glance either from the original paper, if this should be across only the
   fields in the query, across all the fields in the document, and if a
   "static" schema is implied in this scoring system (in lucene document 1 can
   have 3 fields and document 2 can have 40 different ones, even with
   different properties).
   - the same issue applies to length normalization, lucene has a "field
   length" but really no concept of document length."

Tom

On Thu, Apr 14, 2016 at 12:41 PM, David Cawley 
wrote:

> Hello,
> I am developing an enterprise search engine for a project and I was hoping
> to implement BM25F ranking algorithm to configure the tuning parameters on
> a per field basis. I understand BM25 similarity is now supported in Solr
> but I was hoping to be able to configure k1 and b for different fields such
> as title, description, anchor etc, as they are structured documents.
> I am fairly new to Solr so any help would be appreciated. If this is
> possible or any steps as to how I can go about implementing this it would
> be greatly appreciated.
>
> Regards,
>
> David
>
> Current Solr Version 5.4.1
>


Changing Similarity without re-indexing (for example from default to BM25)

2015-08-19 Thread Tom Burton-West
Hello all,

The last time I worked with changing Similarities was with Solr 4.1, and at
that time it was possible to simply change the schema to specify the use
of a different Similarity without re-indexing.  This allowed me to
experiment with several different ranking algorithms without having to
re-index.

Currently the documentation states that doing this is theoretically
possible but not well defined:

To change Similarity
(http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/search/similarities/Similarity.html),
one must do so for both indexing and searching, and the changes must happen
before either of these actions take place. Although in theory there is
nothing stopping you from changing mid-stream, it just isn't well-defined
what is going to happen.

http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/search/similarities/package-summary.html#changingSimilarity

Has something changed between 4.1 and 5.2 that actually will prevent
changing Similarity without re-indexing from working, or is this just a
warning in case at some future point someone contributes code so that a
particular similarity takes advantage of a different index format?

Tom


Re: How to configure Solr PostingsFormat block size

2015-03-12 Thread Tom Burton-West
Hi Hoss,

I created a wrapper class, compiled a jar, and included an
org.apache.lucene.codecs.Codec file in META-INF/services in the jar file
with an entry for the wrapper class, HTPostingsFormatWrapper.  I created a
collection1/lib directory and put the jar there (see below).

I'm getting the dreaded ClassCastException at Class.asSubclass(Unknown
Source) (see below).

This is looking like a complex classloader issue.  Should I put the file
somewhere else and/or declare a lib directory in solrconfig.xml?

Any suggestions on how to troubleshoot this?

Tom



error:
by: java.lang.ClassCastException: class
org.apache.lucene.codecs.HTPostingsFormatWrapper
 at java.lang.Class.asSubclass(Unknown Source)
 at org.apache.lucene.util.SPIClassIterator.next(SPIClassIterator.java:141)


---
Contents of the jar file:

C:\d\solr\lucene_solr_4_10_2\solr\example\solr\collection1\lib>jar -tvf
HTPostingsFormatWrapper.jar
25 Thu Mar 12 10:37:04 EDT 2015 META-INF/MANIFEST.MF
  1253 Thu Mar 12 10:37:04 EDT 2015
org/apache/lucene/codecs/HTPostingsFormatWrapper.class
  1276 Thu Mar 12 10:49:06 EDT 2015
META-INF/services/org.apache.lucene.codecs.Codec




Contents of  META-INF/services/org.apache.lucene.codecs.Codec in the jar
file:
org.apache.lucene.codecs.lucene49.Lucene49Codec
org.apache.lucene.codecs.lucene410.Lucene410Codec
# tbw adds custom wrapper here per Hoss e-mail
org.apache.lucene.codecs.HTPostingsFormatWrapper

-
log file excerpt with stack trace:

12821 [main] INFO  org.apache.solr.core.CoresLocator  – Looking for core
definitions underneath C:\d\solr\lucene_solr_4_10_2\solr\example\solr
12838 [main] INFO  org.apache.solr.core.CoresLocator  – Found core
collection1 in C:\d\solr\lucene_solr_4_10_2\solr\example\solr\collection1\
12839 [main] INFO  org.apache.solr.core.CoresLocator  – Found 1 core
definitions
12841 [coreLoadExecutor-5-thread-1] INFO
 org.apache.solr.core.SolrResourceLoader  – new SolrResourceLoader for
directory: 'C:\d\solr\lucene_solr_4_10_2\solr\example\solr\collection1\'
12842 [coreLoadExecutor-5-thread-1] INFO
 org.apache.solr.core.SolrResourceLoader  – Adding
'file:/C:/d/solr/lucene_solr_4_10_2/solr/example/solr/collection1/lib/HTPostingsFormatWrapper.jar'
to classloader
12870 [coreLoadExecutor-5-thread-1] ERROR
org.apache.solr.core.CoreContainer  – Error creating core [collection1]:
class org.apache.lucene.codecs.HTPostingsFormatWrapper
java.lang.ClassCastException: class
org.apache.lucene.codecs.HTPostingsFormatWrapper
at java.lang.Class.asSubclass(Unknown Source)
at org.apache.lucene.util.SPIClassIterator.next(SPIClassIterator.java:141)
at org.apache.lucene.util.NamedSPILoader.reload(NamedSPILoader.java:65)
at org.apache.lucene.codecs.Codec.reloadCodecs(Codec.java:119)
at
org.apache.solr.core.SolrResourceLoader.reloadLuceneSPI(SolrResourceLoader.java:206)
at
org.apache.solr.core.SolrResourceLoader.init(SolrResourceLoader.java:142)
at
org.apache.solr.core.ConfigSetService$Default.createCoreResourceLoader(ConfigSetService.java:144)
at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:58)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:489)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:255)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:249)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

On Wed, Jan 14, 2015 at 6:05 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : As a foolish dev (not malicious I hope!), I did mess around with something
 : like this once; I was writing my own Codec.  I found I had to create a file
 : called META-INF/services/org.apache.lucene.codecs.Codec in my solr plugin jar
 : that contained the fully-qualified class name of my codec: I guess this
 : registers it with the SPI framework so it can be found by name?  I'm not

 Yep, that's how SPI works - the important bits are mentioned/linked in the
 PostingsFormat (and other SPI related classes in lucene) javadocs...


 https://lucene.apache.org/core/4_10_2/core/org/apache/lucene/codecs/PostingsFormat.html


 https://docs.oracle.com/javase/7/docs/api/java/util/ServiceLoader.html?is-external=true





 -Hoss
 http://www.lucidworks.com/



Re: Basic Multilingual search capability

2015-02-25 Thread Tom Burton-West
Hi Rishi,

As others have indicated, multilingual search is very difficult to do well.

At HathiTrust we've been using the ICUTokenizer and ICUFoldingFilterFactory
to deal with having materials in 400 languages.  We also added the
CJKBigramFilter to get better precision on CJK queries.  We don't use stop
words because stop words in one language are content words in another: for
example, "die" is a stopword in German but a content word in English.

Putting multiple languages in one index can also affect word frequency
statistics, which makes relevance ranking less accurate.  For example, for
the English query "Die Hard" the word "die" would get a low idf score
because it occurs so frequently in German.  We realize that our approach
does not produce the best results, but given the 400 languages and limited
resources, we do our best to make search not suck for non-English
languages.  When we have the resources we are thinking about doing special
processing for a small fraction of the top 20 languages.  We plan to select
those languages that most need special processing and are relatively easy to
disambiguate from other languages.
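
For concreteness, a rough Lucene-level sketch of the kind of analysis chain
described above (ICU tokenization and folding, CJK bigrams, and a length cap
like the one suggested elsewhere in this thread).  It assumes a recent Lucene
(5+) where createComponents takes only the field name, and the 25-character
limit is just an example value:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cjk.CJKBigramFilter;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;

// Script-aware tokenization plus folding, CJK bigrams, and a guard against
// absurdly long tokens; no language-specific stemming or stop words.
public class MultilingualAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new ICUTokenizer();
    TokenStream stream = new ICUFoldingFilter(tokenizer);
    stream = new CJKBigramFilter(stream);
    stream = new LengthFilter(stream, 1, 25);  // drop tokens longer than 25 characters
    return new TokenStreamComponents(tokenizer, stream);
  }
}

In Solr itself the equivalent chain is expressed as a fieldType using
ICUTokenizerFactory, ICUFoldingFilterFactory, CJKBigramFilterFactory, and
LengthFilterFactory.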


If you plan on identifying languages (rather than scripts), you should be
aware that most language detection libraries don't work well on short texts
such as queries.

If you know that you have scripts for which you have content in only one
language, you can use script detection instead of language detection.


If you have German, a filter length of 25 might be too low (because of
compounding). You might want to analyze a sample of your German text to
find a good length.

Tom

http://www.hathitrust.org/blogs/Large-scale-Search


On Wed, Feb 25, 2015 at 10:31 AM, Rishi Easwaran rishi.easwa...@aol.com
wrote:

 Hi Alex,

 Thanks for the suggestions. These steps will definitely help out with our
 use case.
 Thanks for the idea about the lengthFilter to protect our system.

 Thanks,
 Rishi.







 -Original Message-
 From: Alexandre Rafalovitch arafa...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Sent: Tue, Feb 24, 2015 8:50 am
 Subject: Re: Basic Multilingual search capability


 Given the limited needs, I would probably do something like this:

 1) Put a language identifier in the UpdateRequestProcessor chain
 during indexing and route out at least known problematic languages,
 such as Chinese, Japanese, Arabic into individual fields
 2) Put everything else together into one field with ICUTokenizer,
 maybe also ICUFoldingFilter
 3) At the very end of that joint filter, stick in LengthFilter with
 some high number, e.g. 25 characters max. This will ensure that
 super-long words from non-space languages and edge conditions do not
 break the rest of your system.


 Regards,
Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


 On 23 February 2015 at 23:14, Walter Underwood wun...@wunderwood.org
 wrote:
  I understand relevancy, stemming etc becomes extremely complicated with
 multilingual support, but our first goal is to be able to tokenize and
 provide
 basic search capability for any language. Ex: When the document contains
 hello
 or здравствуйте, the analyzer creates tokens and provides exact match
 search
 results.





Optimize maxSegments=2 not working right with Solr 4.10.2

2015-02-23 Thread Tom Burton-West
Hello,

We normally run an optimize with maxSegments=2  after our daily indexing.
This has worked without problem on Solr 3.6.  We recently moved to Solr
4.10.2 and on several shards the optimize completed with no errors in the
logs, but left more than 2 segments.

We send this XML to Solr:
<optimize maxSegments="2"/>

I've attached a copy of the indexwriter log for one of the shards where
there were 4 segments rather than the requested number (i.e. there should
have been only 2 segments) at the end of the optimize.  It looks like a
merge was done down to two segments and then somehow another process
flushed some postings to disk, creating two more segments.  Then there are
messages about 2 of the remaining 4 segments being too big (see below).

What we expected is that the remaining 2 small segments (about 40MB) would
get merged with the smaller of the two large segments, i.e. with the 56GB
segment, since we gave the argument maxSegments=2.  This didn't happen.


Any suggestions about how to troubleshoot this issue would be appreciated.

Tom

---
Excerpt from indexwriter log:

[TMP][http-8091-Processor5]: findForcedMerges maxSegmentCount=2  ...
...
[IW][Lucene Merge Thread #0]: merge time 3842310 msec for 65236 docs
...
[TMP][http-8091-Processor5]: findMerges: 4 segments
 [TMP][http-8091-Processor5]:   seg=_1fzb(4.10.2):C1081559/24089:delGen=9
size=672402.066 MB [skip: too large]
 [TMP][http-8091-Processor5]:   seg=_1gj2(4.10.2):C65236/2:delGen=1
size=56179.245 MB [skip: too large]
 [TMP][http-8091-Processor5]:   seg=_1gj0(4.10.2):C16 size=44.280 MB
 [TMP][http-8091-Processor5]:   seg=_1gj1(4.10.2):C8 size=40.442 MB
 [TMP][http-8091-Processor5]:   allowedSegmentCount=3 vs count=4 (eligible
count=2) tooBigCount=2


build-1.iw.2015-02-23.txt.gz
Description: GNU Zip compressed data


Re: Clarification of locktype=single and implications of use

2015-02-20 Thread Tom Burton-West
Thanks Hoss,

Protection from misconfiguration and/or starting separate Solr instances
pointing to the same index dir I can understand.

The current documentation on the wiki and in the ref guide (along with just
enough understanding of Solr/Lucene indexing to be dangerous) left me
wondering if maybe somehow a correctly configured Solr might have multiple
processes writing to the same file.
I'm wondering if your explanation above might be added to the
documentation.

Tom

On Fri, Feb 20, 2015 at 1:25 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


  : We are using Solr.  We would not configure two different Solr instances to
  : write to the same index.  So why would a normal Solr set-up possibly end
  : up having more than one process writing to the same index?

  The risk here is that if you configure lockType=single, and then have some
  unintended user error such that two distinct java processes both attempt
  to use the same index dir, the lockType will not protect you in that
  situation.

  For example: you normally run Solr on port 8983, but someone accidentally
  starts a second instance of Solr on port 7574 using the exact same configs
  with the exact same index dir -- lockType=single won't help you spot this
  error.  lockType=native will (assuming your FileSystem can handle it).

  lockType=single should protect you, however, if (for example) multiple
  SolrCores w/in the same Solr java process attempted to refer to the same
  index dir because you accidentally put an absolute path in a solrconfig.xml
  that gets shared by multiple cores.


 -Hoss
 http://www.lucidworks.com/



Clarification of locktype=single and implications of use

2015-02-20 Thread Tom Burton-West
Hello,

We don't want to use locktype=native (we are using NFS) or locktype=simple
(we mount a read-only snapshot of the index on our search servers, and with
locktype=simple, Solr refuses to start up because it sees the lock file).

However, we don't quite understand the warnings about using locktype=single
in the context of normal Solr operation.  The ref guide and the wiki (
http://wiki.apache.org/lucene-java/AvailableLockFactories)
 seem to indicate there is some danger in using locktype=single.

The wiki says:
locktype=single:
Uses an object instance to represent the lock, so this is useful when you
are certain that all modifications to a given index are running against
a single shared in-process Directory instance. This is currently the
default locking for RAMDirectory, but it could also make sense on a FSDirectory
provided the other processes use the index in read-only.

We are using Solr.  We would not configure two different Solr instances to
write to the same index.  So why would a normal Solr set-up possibly end
up having more than one process writing to the same index?

At the Lucene level there are multiple index writer threads, but they
each write to their own segments, and (I think) all the threads are in the
same Solr process.

Are we safe using locktype=single?

Tom


Solr example for Solr 4.10.2 gives warning about Multiple request handlers with same name

2015-01-16 Thread Tom Burton-West
Hello,

I'm running Solr 4.10.2 out of the box with the Solr example.

i.e. ant example
cd solr/example
java -jar start.jar

in /example/log

At start-up the example gives this message in the log:

WARN  - 2015-01-16 12:31:40.895; org.apache.solr.core.RequestHandlers;
Multiple requestHandler registered to the same name: /update ignoring:
org.apache.solr.handler.UpdateRequestHandler

Is this a bug?   Is there something wrong with the out of the box example
configuration?

Tom


Re: How to configure Solr PostingsFormat block size

2015-01-13 Thread Tom Burton-West
Thanks Michael and Hoss,

assuming I've written the subclass of the postings format, I need to tell
Solr to use it.

Do I just do something like:

<fieldType name="ocr" class="solr.TextField" postingsFormat="MySubclass"/>

Is there a way to set this for all fieldtypes or would that require writing
a custom CodecFactory?

Tom


On Mon, Jan 12, 2015 at 4:46 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : It looks like this is a good starting point:
 :
 : http://wiki.apache.org/solr/SolrConfigXml#codecFactory

 The default SchemaCodecFactory already supports defining a diff posting
 format per fieldType - but there isn't much in solr to let you tweak
 individual options on specific posting formats via configuration.

 So what you'd need to do is write a small subclass of
 Lucene41PostingsFormat that called super(yourMin, yourMax) in it's
 constructor.






Re: How to configure Solr PostingsFormat block size

2015-01-13 Thread Tom Burton-West
Thanks Hoss,

This is starting to sound pretty complicated.  Are you saying this is not
doable with Solr 4.10?
"...or at least: that's how it *should* work :)" makes me a bit nervous
about trying this on my own.

Should I open a JIRA issue, or am I probably the only person with a use case
for replacing a termIndexInterval setting with changing the min and max
block size on the Lucene41 postings format?



Tom








On Tue, Jan 13, 2015 at 3:16 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : ...the nuts & bolts of it is that the PostingsFormat baseclass should take
 : care of all the SPI name registration that you need based on what you
 : pass to the super() construction ... although now that i think about it,
 : i'm not sure how you'd go about specifying your own name for the
 : PostingsFormat when also doing something like subclassing
 : Lucene41PostingsFormat ... there's no Lucene41PostingsFormat constructor
 : you can call from your subclass to override the name.
 :
 : not sure what the expectation is there in the java API.

 ok, so i talked this through with mikemccand on IRC...

 in 4x, the API is actually really dangerous - you can subclass things like
 Lucene41PostingsFormat w/o overriding the name used in SPI, and might
 really screw things up as far as what class is used to read back your
 files later.

 in the 5.0 APIs, these non-abstract codec related classes are all final to
 prevent exactly this type of behavior - but you can still use the
 constructor args to change behavior related to *writing* the index, and
 the classes all are designed to be smart enough that when they are loaded
 by SPI at search time, they can make sense of what's on disk (regardless of
 whether non-default constructor args were used at index time)

 but the question remains: where does that leave you as a solr user who
 wants to write a plugin, since Solr only allows you to configure the SPI
 name (no constructor args) via 'postingsFormat=foo'

 the answer is that instead of writing a subclass, you would have to write
 a small proxy class, something like...


 public final class MyPfWrapper extends PostingsFormat {
   // delegate to a Lucene50PostingsFormat with non-default (illustrative) min/max block sizes
   private final PostingsFormat pf = new Lucene50PostingsFormat(9, 42);

   public MyPfWrapper() {
     super("MyPfWrapper");   // the SPI name referenced from the schema
   }

   @Override
   public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
     return pf.fieldsConsumer(state);
   }

   @Override
   public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException {
     return pf.fieldsProducer(state);
   }
 }

 ..and then refer to it with postingsFormat="MyPfWrapper"

 at index time, Solr will use SPI to find your MyPfWrapper class, which
 will delegate to an instance of Lucene50PostingsFormat constructed with
 the overridden constants, and then at query time the SegmentReader code
 paths will use SPI to find MyPfWrapper by name as well, and it will again
 delegate to Lucene50PostingsFormat for reading back the index.


 or at least: that's how it *should* work :)




 -Hoss
 http://www.lucidworks.com/



How to configure Solr PostingsFormat block size

2015-01-12 Thread Tom Burton-West
Hello all,

Our indexes have around 3 billion unique terms, so for Solr 3 we set
termIndexInterval to about 8 times the default.  The net effect of this is
to reduce the size of the in-memory terms index to about 1/8th of its
original size.  (For background see
http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again.)

We would like to do something similar for Solr 4.

The Lucene 4.10.2 JavaDoc for setTermIndexInterval suggests how this can be
done by setting the minimum and maximum size for a block in Lucene code
(http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29):
For example, Lucene41PostingsFormat
(http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html)
implements the term index instead based upon how terms share prefixes. To
configure its parameters (the minimum and maximum size for a block), you
would instead use Lucene41PostingsFormat.Lucene41PostingsFormat(int, int)
(http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Lucene41PostingsFormat%28int,%20int%29),
which can also be configured on a per-field basis.

How can we configure Solr to use different (i.e. non-default) minimum and
maximum block sizes?

Tom


Re: Details on why ConccurentUpdateSolrServer is reccommended for maximum index performance

2014-12-12 Thread Tom Burton-West
Thanks everybody for the information.

Shawn, thanks for bringing up the issues around making sure each document
is indexed ok.  With our current architecture, that is important for us.

Yonik's clarification about streaming really helped me to understand one of
the main advantages of CUSS:

When you add a document, it immediately writes it to a stream where
solr can read it off and index it.  When you add a second document,
it's immediately written to the same stream (or at least one of the
open streams), as part of the same update request.  No separate HTTP
request, no separate update request.

In our use case, where documents are in the 700KB-2MB range, I suspect that
the overhead of opening/closing new requests is dwarfed by the time it
takes to just send the data over the wire and parse it.  However,
I'm starting to think about whether I can find some time to do some testing.

Mikhail, thanks for suggesting looking at DIH.  I haven't looked at it in
several years and didn't realize there is now functionality to deal with
XML documents.

When I asked about being able to read XML files from the filesystem, it was
for the purposes of running some benchmark tests to see if CUSS offers
enough advantages to re-architect our system.

Currently the main bottleneck in our system is constructing Solr documents.
We use multiple document producers which are responsible both for
creating a document and for sending it to Solr.  Although each producer
waits until it gets a response from Solr before sending the next document
to be indexed, we run 20-100 producers, so this is similar to CUSS running
multiple threads. (although of course we open a new http request and Solr
update request each time)

As far as using DIH or something like it, we might be able to use it for
testing with already created documents.

Creating the documents requires assembling (and massaging) data from
several sources including a few database queries, unzipping files on our
filesystem and contatenating them, and querying another Solr instance which
has metadata.

I'm now thinking that for testing purposes it  might be sufficient to
construct dummy documents as in the examples rather than trying to use our
actual documents.  If the speed improvements look significant enough, then
I'd need to figure out how to test with real documents.

Thanks again for all the input.

Tom


Re: Details on why ConccurentUpdateSolrServer is reccommended for maximum index performance

2014-12-11 Thread Tom Burton-West
Thanks Erick,

That is helpful.  We already have a process that works similarly.  Each
thread/process that sends a document to Solr waits until it gets a response
in order to make sure that the document was indexed successfully (we log
errors and retry docs that don't get indexed successfully); however, we run
20-100 of these processes, depending on throughput (i.e. we send documents
to Solr for indexing as fast as we can until they start queuing up on the
Solr end).

Is there a way to use CUSS with XML documents?

ie my second question:
 A related question, is how to use ConcurrentUpdateSolrServer with XML
 documents

 I have very large XML documents, and the examples I see all build
documents
 by adding fields in Java code.  Is there an example that actually reads
XML
 files from the file system?

Tom


Details on why ConccurentUpdateSolrServer is reccommended for maximum index performance

2014-12-10 Thread Tom Burton-West
Hello all,

In the example schema.xml for Solr 4.10.2 this comment is listed under the
PERFORMANCE NOTE

For maximum indexing performance, use the ConcurrentUpdateSolrServer
java client.

Is there some documentation somewhere that explains why this will maximize
indexing performance?

In particular, I have very large documents on the order of 700KB, so I'm
interested in determining whether there is a significant advantage to using
the ConcurrentUpdateSolrServer in my use case.
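
For reference, a minimal SolrJ sketch of what using this client looks like;
the URL, queue size, thread count, and field names are illustrative and
error handling is omitted:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CussExample {
  public static void main(String[] args) throws Exception {
    // queue up to 20 docs, with 4 background threads streaming them to Solr
    ConcurrentUpdateSolrServer server =
        new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 20, 4);

    for (int i = 0; i < 1000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-" + i);
      doc.addField("ocr", "... large document text ...");
      server.add(doc);  // returns quickly; the doc is streamed in the background
    }

    server.blockUntilFinished();  // wait for the background queue to drain
    server.commit();
    server.shutdown();
  }
}

One caveat: because documents are streamed asynchronously, per-document
error responses are not returned to the caller the way they are with a
plain HttpSolrServer; failures surface through the client's handleError
callback instead.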

A related question, is how to use ConcurrentUpdateSolrServer with XML
documents

I have very large XML documents, and the examples I see all build documents
by adding fields in Java code.  Is there an example that actually reads XML
files from the file system?

Tom


Re: Solr 4.10 termsIndexInterval and termsIndexDivisor not supported with default PostingsFormat?

2014-09-24 Thread Tom Burton-West
Thanks Hoss,

Just opened SOLR-6560 and attached a patch which removes the offending
section from the example solrconfig.xml file.

We suspect that with the much more efficient block- and FST-based Solr 4
default postings format, the need to mess with the parameters in order
to reduce memory usage has gone away.  We haven't really tested this yet.

If there is still a use case for configuring the Solr default
PostingsFormat and the ability to set the parameters currently exists,
then maybe someone who understands this could put an example in the
solrconfig.xml file and documentation.  On the other hand, if the use case
still exists and Solr doesn't have the ability to configure the parameters,
maybe another issue should be opened.  It looks like all that would be
needed is a mechanism to pass a couple of ints to the Lucene postings format:

 For example, Lucene41PostingsFormat
(http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html)
implements the term index instead based upon how terms share prefixes. To
configure its parameters (the minimum and maximum size for a block), you
would instead use Lucene41PostingsFormat.Lucene41PostingsFormat(int, int)
(http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Lucene41PostingsFormat(int,%20int)),
which can also be configured on a per-field basis:

Tom

On Thu, Sep 18, 2014 at 1:42 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : I think the documentation and example files for Solr 4.x need to be
 : updated.  If someone will let me know I'll be happy to fix the example
 : and perhaps someone with edit rights could fix the reference guide.

 I think you're correct - can you open a Jira with suggested improvements
 for the configs?  (i see you commented on the ref guide page which is
  helpful - but the jira issue will also help serve as a reminder to audit
  *all* the pages for references to these options, ie: in config snippets,
 etc...)

 : According to the JavaDocs for IndexWriterConfig, the Lucene level
 : implementations of these do not apply to the default PostingsFormat
 : implementation.
 :
 http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/index/IndexWriterConfig.html#setReaderTermsIndexDivisor%28int%29
 :
 : Despite this statement in the Lucene JavaDocs, in the
 : example/solrconfig.xml there is the following:

 Yeah ... I'm not sure what (if anything?) we should say about these in the
 example configs -- the *setting* is valid and supported by
 IndexWriterConfig no matter what posting format you use, so it's not an
 error to configure this, but it can be ignored in many cases.

 : Can someone please confirm that these two parameter settings
 : termIndexInterval and termsIndexDivisor, do not apply to the default
 : PostingsFormat for Solr 4.10?

 I was taking your word for it :)


 -Hoss
 http://www.lucidworks.com/



queryResultMaxDocsCached vs queryResultWindowSize

2014-09-23 Thread Tom Burton-West
Hello,

queryResultWindowSize sets the number of documents to cache for each
query in the queryResultCache.  So if you normally output 10 results per
page, and users don't go beyond page 3 of results, you could set
queryResultWindowSize to 30 and the second and third page requests will
read from cache, not from disk.  This is well documented in both the Solr
example solrconfig.xml file and the Solr documentation.

However, the example in solrconfig.xml and the documentation in the
reference manual for Solr 4.10 say that queryResultMaxDocsCached:

sets the maximum number of documents to cache for any entry in the
queryResultCache.

If this were true, then it would seem that this parameter duplicates
queryResultWindowSize.

Looking at the code, it appears that the queryResultMaxDocsCached parameter
actually tells Solr not to cache any results list whose size is over
queryResultMaxDocsCached:

From: SolrIndexSearcher.getDocListC
// lastly, put the superset in the cache if the size is less than or equal
// to queryResultMaxDocsCached
if (key != null && superset.size() <= queryResultMaxDocsCached
    && !qr.isPartialResults()) {
  queryResultCache.put(key, superset);
}

Deciding whether or not to cache a DocList if its size is over N (where N =
queryResultMaxDocsCached) is very different from caching only N items from
the DocList, which is what the current documentation (and the variable name)
implies.

Looking at the JIRA issue https://issues.apache.org/jira/browse/SOLR-291
the original intent was to control memory use and the variable name
originally suggested was  noCacheIfLarger

Can someone please let me know if it is true that the
queryResultMaxDocsCached parameter actually tells Solr not to cache any
results list that contains more than queryResultMaxDocsCached documents?

If so, I will add a comment to the Cwiki doc and open a JIRA and submit a
patch to the example file.

Tom





---

http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_10/solr/example/solr/collection1/conf/solrconfig.xml?revision=1624269view=markup

635 <!-- Maximum number of documents to cache for any entry in the
636      queryResultCache.
637   -->
638 <queryResultMaxDocsCached>200</queryResultMaxDocsCached>


How does KeywordRepeatFilterFactory help giving a higher score to an original term vs a stemmed term

2014-09-17 Thread Tom Burton-West
The Solr wiki says: "A repeated question is how can I have the
original term contribute more to the score than the stemmed version? In
Solr 4.3, the KeywordRepeatFilterFactory has been added to assist this
functionality."

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming

(Full section reproduced below.)
I can see how, in the example from the wiki reproduced below, both
the stemmed and original term get indexed, but I don't see how the
original term gets more weight than the stemmed term.  Wouldn't this
require a filter that gives terms with the keyword attribute more
weight?

What am I missing?

Tom



-
A repeated question is how can I have the original term contribute
more to the score than the stemmed version? In Solr 4.3, the
KeywordRepeatFilterFactory has been added to assist this
functionality. This filter emits two tokens for each input token, one
of them is marked with the Keyword attribute. Stemmers that respect
keyword attributes will pass through the token so marked without
change. So the effect of this filter would be to index both the
original word and the stemmed version. The 4 stemmers listed above all
respect the keyword attribute.

For terms that are not changed by stemming, this will result in
duplicate, identical tokens in the document. This can be alleviated by
adding the RemoveDuplicatesTokenFilterFactory.

<fieldType name="text_keyword" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
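
To see what that chain actually emits, here is a rough stand-alone Lucene
sketch (assuming a recent Lucene where WhitespaceTokenizer has a no-argument
constructor, and with an input string chosen just for illustration); it
prints each token with its keyword flag and position increment, which shows
the original and stemmed forms landing at the same position, while the
keyword flag itself does not add any scoring weight in this chain:

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class KeywordRepeatDemo {
  public static void main(String[] args) throws Exception {
    Tokenizer tok = new WhitespaceTokenizer();
    tok.setReader(new StringReader("running runs run"));
    TokenStream ts = new RemoveDuplicatesTokenFilter(
        new PorterStemFilter(new KeywordRepeatFilter(tok)));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    KeywordAttribute kw = ts.addAttribute(KeywordAttribute.class);
    PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // e.g. "running" (keyword=true, posInc=1) followed by "run" (keyword=false, posInc=0)
      System.out.println(term + " keyword=" + kw.isKeyword()
          + " posInc=" + posInc.getPositionIncrement());
    }
    ts.end();
    ts.close();
  }
}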


Solr 4.10 termsIndexInterval and termsIndexDivisor not supported with default PostingsFormat?

2014-09-16 Thread Tom Burton-West
Hello,

I think the documentation and example files for Solr 4.x need to be
updated.  If someone will let me know I'll be happy to fix the example
and perhaps someone with edit rights could fix the reference guide.

Due to dirty OCR and over 400 languages we have over 2 billion unique
terms in our index.  In Solr 3.6 we set termIndexInterval to 1024 (8
times the default of 128) to reduce the size of the in-memory index.
Previously we used termIndexDivisor for a similar purpose.

We suspect that in Solr 4.10 (and probably previous Solr 4.x versions)
termIndexInterval and termIndexDivisor do not apply to the default
codec and are probably unnecessary (since the default terms index now
uses a much more efficient representation).

According to the JavaDocs for IndexWriterConfig, the Lucene level
implementations of these do not apply to the default PostingsFormat
implementation.
http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/index/IndexWriterConfig.html#setReaderTermsIndexDivisor%28int%29

Despite this statement in the Lucene JavaDocs, in the
example/solrconfig.xml there is the following:

<!-- Expert: Controls how often Lucene loads terms into memory
278      Default is 128 and is likely good for most everyone.
279   -->
280 <!-- <termIndexInterval>128</termIndexInterval> -->

In the 4.10 reference manual page 365 there is also an example showing
the termIndexInterval.

Can someone please confirm that these two parameter settings
termIndexInterval and termsIndexDivisor, do not apply to the default
PostingsFormat for Solr 4.10?

Tom


spam detection issue on sending legitimate mail to Solr list

2014-09-15 Thread Tom Burton-West
HI all,

I just sent a very long post with 5 or 6 links to relevant articles in
response to a thread on the Solr users list and got a message that my mail
was rejected due to a spam score.  Can anyone tell me what I need to do to
change the message so I can send it to the list?  (Is there some reference
I need to look at to understand the message?)  Below is the message I
received:
... tried to deliver your message, but it was rejected by the server for
the recipient domain lucene.apache.org bymx1.eu.apache.org. [192.87.106.230
].

The error that the other server returned was:
552 spam score (6.2) exceeded threshold (HTML_MESSAGE,RCVD_IN_DNSWL_
LOW,SPF_NEUTRAL,URIBL_SBL

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
tburt...@umich.edu
http://www.hathitrust.org/blogs/large-scale-search


Re: How to implement multilingual word components fields schema?

2014-09-15 Thread Tom Burton-West
Hi Ilia,

I see that Trey answered your question about how you might stack
language specific filters in one field.  If I remember correctly, his
approach assumes you have identified the language of the query.  That
is not the same as detecting the script of the query and is much
harder.

Trying to do language-specific processing on multiple languages,
especially a large number such as the 200 you mention or the 400 in
HathiTrust is a very difficult problem.  Detecting language (rather
than script) in short queries is an open problem in the research
literature.  As others have suggested, you might want to start with
something less ambitious that meets most of your business needs.

You also might want to consider whether the errors a stemmer might
make on some queries will be worth the increase in recall that you
will get on others.  Concern about getting results that can confuse
users is one of the main reasons we haven't seriously pursued
stemming in HathiTrust full-text search.

Regarding the papers listed in my previous e-mail, you can get the
first paper at the link I gave and the second paper (although on
re-reading it, I don't think it will be very useful) is available if
you go to the link for the code and follow the link on that page for
the paper.

I suspect you might want to think about the differences between
scripts and  languages.  Most of the Solr/Lucene stemmers either
assume you are only giving them the language they are designed for, or
work on the basis of script.  This works well when there is only one
language per script, but breaks if you have many languages using the
same script such as the Latin-1 languages.

(Because of an issue with the Solr-user spam filter and an issue with
my e-mail client all the URLs except the one below have
http[s] removed and/or spaces added.  See this gist for all the URLS
with context:  https://gist.github.com/anonymous/2e1233d80f37354001a3)

That PolyGlotStemming filter uses the ICUTokenizer's script
identification, but there are at least 12 different languages that use
the Arabic script (www omniglot com writing arabic)  and over 100 that
use Latin-1.  Please see the list of languages and scripts at
aspell. net/ man-html /Supported.   html#Supported. or www. omniglot
.com /writing/langalph .htm#latin

As a simple example, German and English both use the Latin-1 character
set.  Using an English stemmer for German or a German stemmer for
English is unlikely to work very well.  If you try to use stop words
for multiple languages you will run into difficulties where a stop
word in one language is a content word in another.  For example, if
you use German stop words such as "die", you will eliminate the
English content word "die".

Identifying languages in short texts such as queries is a hard
problem. About half the papers looking at query language
identification cheat, and look at things such as the language of the
pages that a user has clicked on.  If all you have to make a guess is
the text of the query, language identification is very difficult.  I
suspect that mixed script queries are even harder  (see www .transacl.
org/wp-content/uploads/2014/02/38.pdf).

 See the papers by Marco Lui and Tim Baldwin on Marco's web page:
ww2  .cs. mu. oz. au /~mlui/
In this paper they explain why short text language identification is a
hard problem Language Identification: the Long and the Short of the
Matter  www  .aclweb.  org/anthology/N10-1027

Other papers available on Marco's page describe the design and
implementation of langid.py which is a state-of-the-art language
identification program.

 I've tried several  language guessers  designed for short texts and
at least on queries from our query logs,  the results were unusable.
Both langid.py  with the defaults (noEnglish.langid.gz  pipe
delimited) and ldig with the most recent latin.model
(NonEnglish.ldig.gz tab delimited) did not work well for our queries.

However, both of these have parameters that can be tweaked and also
facilities for training if you have labeled data.

ldig is specifically designed to run on short text like queries or twitter.
It can be configured to spit out the scores for each language instead
of only the highest score (default).  Also we didn't try to limit the
list of languages it looks for, and that might give better results.

github .com  /shuyo/ldig
langdetect looks like its by the same programmer and is in Java, but I
haven't tried it:
code .google. com/p/language-detection/

langid is designed by linguistic experts, but may need to be trained
on short queries.
github .com/saffsd/langid.py

There is also Mike McCandless' port of the Google CLD

blog. mikemccandless .com/2013/08/a-new-version-of-compact-language  .html
code .google .com/p/chromium-compact-language-detector/source/browse/README
However here is this comment:
Indeed I see the same results as you; I think CLD2 is just not
designed for short text.
and a similar comment was made in this talk:
videolectures 

Re: How to implement multilingual word components fields schema?

2014-09-05 Thread Tom Burton-West
Hi Ilia,

I don't know if it would be helpful but below I've listed  some academic
papers on this issue of how best to deal with mixed language/mixed script
queries and documents.  They are probably taking a more complex approach
than you will want to use, but perhaps they will help to think about the
various ways of approaching the problem.

We haven't tackled this problem yet. We have over 200 languages.  Currently
we are using the ICUTokenizer and ICUFolding filter but don't do any
stemming due to a concern with overstemming (we have very high recall, so
don't want to hurt precision by stemming)  and the difficulty of correct
language identification of short queries.

If you have languages where there is only one language per script however,
you might be able to do much more.  I'm not sure if I'm remembering
correctly but I believe some of the stemmers such as the Greek stemmer will
pass through any strings that don't contain characters in the Greek script.
  So it might be possible to at least do stemming on some of your
languages/scripts.

 I'll be very interested to learn what approach you end up using.

Tom

--

Some papers:

Mohammed Mustafa, Izzedin Osman, and Hussein Suleman. 2011. Indexing and
weighting of multilingual and mixed documents. In *Proceedings of the South
African Institute of Computer Scientists and Information Technologists
Conference on Knowledge, Innovation and Leadership in a Diverse,
Multidisciplinary Environment* (SAICSIT '11). ACM, New York, NY, USA,
161-170. DOI=10.1145/2072221.2072240
http://doi.acm.org/10.1145/2072221.2072240

That paper and some others are here:
http://www.husseinsspace.com/research/students/mohammedmustafaali.html

There is also some code from this article:

Parth Gupta, Kalika Bali, Rafael E. Banchs, Monojit Choudhury, and Paolo
Rosso. 2014. Query expansion for mixed-script information retrieval.
In *Proceedings
of the 37th international ACM SIGIR conference on Research  development in
information retrieval* (SIGIR '14). ACM, New York, NY, USA, 677-686.
DOI=10.1145/2600428.2609622 http://doi.acm.org/10.1145/2600428.2609622

Code:
http://users.dsic.upv.es/~pgupta/mixed-script-ir.html

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
tburt...@umich.edu
http://www.hathitrust.org/blogs/large-scale-search


On Fri, Sep 5, 2014 at 10:06 AM, Ilia Sretenskii sreten...@multivi.ru
wrote:

 Hello.
 We have documents with multilingual words which consist of different
 languages parts and seach queries of the same complexity, and it is a
 worldwide used online application, so users generate content in all the
 possible world languages.

 For example:
 言語-aware
 Løgismose-alike
 ຄໍາຮ້ອງສະຫມັກ-dependent

 So I guess our schema requires a single field with universal analyzers.

 Luckily, there exist ICUTokenizer and ICUFoldingFilter for that.

 But then it requires stemming and lemmatization.

 How to implement a schema with universal stemming/lemmatization which would
 probably utilize the ICU generated token script attribute?

 http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html

 By the way, I have already examined the Basistech schema of their
 commercial plugins and it defines tokenizer/filter language per field type,
 which is not a universal solution for such complex multilingual texts.

 Please advise how to address this task.

 Sincerely, Ilia Sretenskii.



Re: When not to use NRTCachingDirectory and what to use instead.

2014-04-21 Thread Tom Burton-West
Hi Ken,

Given the comments, which seemed to describe using NRT for the opposite of
our use case, I just set our Solr 4 to use solr.MMapDirectoryFactory.
I didn't bother to test whether NRT would be better for our use case, mostly
because it didn't sound like there was an advantage and I've been focused
on other things relating to Solr 4.  I'd love to hear any results from
someone who is testing a batch indexing use case and has tested
various xxxDirectoryFactory implementations.  Please let me know your
results if you do end up doing some testing.

Tom


On Sat, Apr 19, 2014 at 9:51 AM, Ken Krugler kkrugler_li...@transpac.comwrote:


 Tom - did you ever get any useful results from testing here? I'm also
 curious about the impact of various xxxDirectoryFactory implementations for
 batch indexing.

 Thanks,

 -- Ken

 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
 custom big data solutions  training
 Hadoop, Cascading, Cassandra  Solr








Re: tf and very short text fields

2014-04-04 Thread Tom Burton-West
Thanks Markus,

I was thinking about normalization and was absolutely wrong about setting
k1 to zero.  I should have taken a look at the algorithm and walked
through setting k1 = 0.  (This is easier to do looking at the formula on
Wikipedia, http://en.wikipedia.org/wiki/Okapi_BM25, than walking through the
code.)
When you set k1 to 0 it does just what you said, i.e. it provides binary tf:
that part of the formula returns 1 if the term is present and 0 if not,
which is I think what Wunder was trying to accomplish.
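
For anyone who prefers the other route mentioned in this thread (overriding
tf() in the similarity rather than setting k1 to 0), a rough sketch of what
such a subclass might look like; the class name is made up and this is
untested:

import org.apache.lucene.search.similarities.DefaultSimilarity;

// Binary term frequency: any non-zero count scores the same as one, so
// "New York, New York" is not scored twice as high for the query "new york".
public class BinaryTfSimilarity extends DefaultSimilarity {
  @Override
  public float tf(float freq) {
    return freq > 0 ? 1.0f : 0.0f;
  }
}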

Sorry about jumping in without double checking things first.

Tom


On Fri, Apr 4, 2014 at 7:38 AM, Markus Jelsma markus.jel...@openindex.iowrote:

  Hi - In this case Walter, iirc, was looking for two things: no
  normalization and a flat TF (1f for tf(float freq) > 0). We know that k1
  controls TF saturation, but in BM25Similarity you can see that k1 is
  multiplied by the encoded norm value, taking b also into account. So
  setting k1 to zero effectively disables length normalization and results in
  flat or binary TF.

 Here's an example output of k1 = 0 and k1 = 0.2. Norms or enabled on the
 field, term occurs three times in the field:

 28.203003 = score(doc=0,freq=1.5 = phraseFreq=1.5
 ), product of:
   6.4 = boost
   4.406719 = idf(docFreq=1, docCount=122)
   1.0 = tfNorm, computed from:
 1.5 = phraseFreq=1.5
 0.0 = parameter k1
 0.75 = parameter b
 8.721312 = avgFieldLength
 16.0 = fieldLength




 27.813797 = score(doc=0,freq=1.5 = phraseFreq=1.5
 ), product of:
   6.4 = boost
   4.406719 = idf(docFreq=1, docCount=122)
   0.98619986 = tfNorm, computed from:
 1.5 = phraseFreq=1.5
 0.2 = parameter k1
 0.75 = parameter b
 8.721312 = avgFieldLength
 16.0 = fieldLength


 You can clearly see the final TF norm being 1, despite the term frequency
 and length. Please correct my wrongs :)
 Markus



 -Original message-
  From:Tom Burton-West tburt...@umich.edu
  Sent: Thursday 3rd April 2014 20:18
  To: solr-user@lucene.apache.org
  Subject: Re: tf and very short text fields
 
  Hi Markus and Wunder,
 
  I'm  missing the original context, but I don't think BM25 will solve this
  particular problem.
 
  The k1 parameter sets how quickly the contribution of tf to the score
 falls
  off with increasing tf.   It would be helpful for making sure really long
  documents don't get too high a score, but I don't think it would help for
  very short documents without messing up its original design purpose.
 
  For BM25, if you want to turn off length normalization, you set b to 0.
   However, I don't think that will do what you want, since turning off
  normalization will mean that the score for "new york, new york" will be
  twice that of the score for "new york", since without normalization the tf
  in "new york, new york" is twice that of "new york".
 
  I think the earlier suggestion to override tfidfsimilarity and emit 1f
 in
  tf() is probably the best way to eliminate the use of tf counts,
  assuming that is really what you want.
 
  Tom
 
 
 
 
 
 
 
 
  On Tue, Apr 1, 2014 at 4:17 PM, Walter Underwood wun...@wunderwood.org
 wrote:
 
   Thanks! We'll try that out and report back. I keep forgetting that I
 want
   to try BM25, so this is a good excuse.
  
   wunder
  
   On Apr 1, 2014, at 12:30 PM, Markus Jelsma markus.jel...@openindex.io
 
   wrote:
  
Also, if i remember correctly, k1 set to zero for bm25 automatically
   omits norms in the calculation. So thats easy to play with without
   reindexing.
   
   
Markus Jelsma markus.jel...@openindex.io schreef:Yes, override
   tfidfsimilarity and emit 1f in tf(). You can also use bm25 with k1 set
 to
   zero in your schema.
   
   
Walter Underwood wun...@wunderwood.org schreef:And here is another
   peculiarity of short text fields.
   
The movie New York, New York should not be twice as relevant for
 the
   query new york. Is there a way to use a binary term frequency rather
 than
   a count?
   
wunder
--
Walter Underwood
wun...@wunderwood.org
   
   
   
  
   --
   Walter Underwood
   wun...@wunderwood.org
  
  
  
  
 



Re: Analysis of Japanese characters

2014-04-03 Thread Tom Burton-West
Hi Shawn,

For an input of 田中角栄 the bigram filter works like you described, and what
I would expect.  If I add a space at the point where the ICU tokenizer
would have split them anyway, the bigram filter output is very different.

If I'm understanding what you are reporting, I suspect this is behavior as
designed.   My guess is that the bigram filter figures that if there was
space in the original input (to the whole filter chain), it should not
create a bigram across it.

Tom

BTW: if you can show a few examples of Japanese queries that show the
original problem and the reason it's a problem (without, of course, showing
anything proprietary), I'd love to see them.  I'm always interested in
learning more about Japanese query processing.


Re: tf and very short text fields

2014-04-03 Thread Tom Burton-West
Hi Markus and Wunder,

I'm  missing the original context, but I don't think BM25 will solve this
particular problem.

The k1 parameter sets how quickly the contribution of tf to the score falls
off with increasing tf.   It would be helpful for making sure really long
documents don't get too high a score, but I don't think it would help for
very short documents without messing up its original design purpose.

For BM25, if you want to turn off length normalization, you set b to 0.
 However, I don't think that will do what you want, since turning off
normalization will mean that the score for "new york, new york" will be
twice that of the score for "new york", since without normalization the tf
in "new york, new york" is twice that of "new york".

I think the earlier suggestion to override TFIDFSimilarity and emit 1f in
tf() is probably the best way to eliminate the use of tf counts,
assuming that is really what you want.
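For anyone who wants to go that route, here is a minimal sketch of what such
an override might look like (this assumes the Lucene/Solr 4.x packages; the
class name is made up, and you would still need to point your field type at
it with a <similarity> element or a small SimilarityFactory in schema.xml):

import org.apache.lucene.search.similarities.DefaultSimilarity;

// Hypothetical example: flatten tf so any number of occurrences of a term
// contributes the same amount to the score.
public class BinaryTfSimilarity extends DefaultSimilarity {
  @Override
  public float tf(float freq) {
    // 1.0 for any match instead of the default sqrt(freq), 0.0 for no match
    return freq > 0f ? 1.0f : 0.0f;
  }
}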

Tom








On Tue, Apr 1, 2014 at 4:17 PM, Walter Underwood wun...@wunderwood.orgwrote:

 Thanks! We'll try that out and report back. I keep forgetting that I want
 to try BM25, so this is a good excuse.

 wunder

 On Apr 1, 2014, at 12:30 PM, Markus Jelsma markus.jel...@openindex.io
 wrote:

  Also, if I remember correctly, k1 set to zero for BM25 automatically
 omits norms in the calculation. So that's easy to play with without
 reindexing.
 
 
  Markus Jelsma markus.jel...@openindex.io wrote: Yes, override
 TFIDFSimilarity and emit 1f in tf(). You can also use BM25 with k1 set to
 zero in your schema.
 
 
  Walter Underwood wun...@wunderwood.org wrote: And here is another
 peculiarity of short text fields.

  The movie "New York, New York" should not be twice as relevant for the
 query "new york". Is there a way to use a binary term frequency rather than
 a count?
 
  wunder
  --
  Walter Underwood
  wun...@wunderwood.org
 
 
 

 --
 Walter Underwood
 wun...@wunderwood.org






Re: Analysis of Japanese characters

2014-04-02 Thread Tom Burton-West
Hi Shawn,

I'm not sure I understand the problem, or why you need to solve it at the
ICUTokenizer level rather than in the CJKBigramFilter.
Can you perhaps give a few examples of the problem?

Have you looked at the flags for the CJKBigramfilter?
You can tell it to make bigrams of different Japanese character sets.  For
example the config given in the JavaDocs tells it to make bigrams across 3
of the different Japanese character sets.  (Is the issue related to Romaji?)

 <filter class="solr.CJKBigramFilterFactory"
   han="true" hiragana="true"
   katakana="true" hangul="true" outputUnigrams="false" />



http://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilterFactory.html

Tom


On Wed, Apr 2, 2014 at 1:19 PM, Shawn Heisey s...@elyograg.org wrote:

 My company is setting up a system for a customer from Japan.  We have an
 existing system that handles primarily English.

 Here's my general text analysis chain:

 http://apaste.info/xa5

 After talking to the customer about problems they are encountering with
 search, we have determined that some of the problems are caused because
 ICUTokenizer splits on *any* character set change, including changes
 between different Japanese character sets.

 Knowing the risk of this being an XY problem, here's my question: Can
 someone help me develop a rule file for the ICU Tokenizer that will *not*
 split when the character set changes from one of the japanese character
 sets to another japanese character set, but still split on other character
 set changes?

 Thanks,
 Shawn




Re: Analysis of Japanese characters

2014-04-02 Thread Tom Burton-West
Hi Shawn,

I may still be missing your point.  Below is an example of where the
ICUTokenizer splits.
Now, I'm beginning to wonder if I really understand what those flags on the
CJKBigramFilter do.
The ICUTokenizer spits out unigrams and the CJKBigramFilter will put them
back together into bigrams.

I thought that if you set han=true, hiragana=true
you would get this kind of result, where the third bigram is composed of a
hiragana and a han character:

いろは革命歌   =“いろ” ”ろは“  “は革”   ”革命” “命歌”

Hopefully the e-mail hasn't munged the output of the Solr analysis panel
below:

I can see this in our query processing where outputUnigrams=false:
org.apache.solr.analysis.ICUTokenizerFactory {luceneMatchVersion=LUCENE_36}
splits into unigrams:
term text: い ろ は 革 命 歌
org.apache.solr.analysis.CJKBigramFilterFactory {hangul=false,
outputUnigrams=false, katakana=false, han=true, hiragana=true,
luceneMatchVersion=LUCENE_36}
makes bigrams, including the middle one which is one character hiragana and
one han:
term text: いろ ろは は革 革命 命歌

It appears that if you include outputUnigrams=true (as we both do in the
indexing configuration), this doesn't happen.
org.apache.solr.analysis.CJKBigramFilterFactory {hangul=false,
outputUnigrams=true, katakana=false, han=true, hiragana=true,
luceneMatchVersion=LUCENE_36}
term text: いろは革命歌 革命 命歌
type: HIRAGANA HIRAGANA HIRAGANA SINGLE SINGLE SINGLE DOUBLE DOUBLE

Not sure what happens for katakana as the ICUTokenizer doesn't convert it
to unigrams and our configuration is set to katakana=false.   I'll play
around on the test machine when I have time.

Tom


Re: Indexing large documents

2014-03-19 Thread Tom Burton-West
Hi Stephen,

We regularly index documents in the range of 500KB-8GB on machines that
have about 10GB devoted to Solr.  In order to avoid OOM's on Solr versions
prior to Solr 4.0, we use a separate indexing machine(s) from the search
server machine(s) and also set the termIndexInterval to 8 times that of the
default 128:
<termIndexInterval>1024</termIndexInterval> (See
http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again for
a description of the problem, although the solution we are using is
different, termIndexInterval rather than termInfosDivisor)

I would like to second Otis' suggestion that you consider breaking large
documents into smaller sub-documents.   We are currently not doing that and
we believe that relevance ranking is not working well at all.

 If you consider that most relevance ranking algorithms were designed,
tested, and tuned on TREC newswire-size documents (average 300 words) or
truncated web documents (average 1,000-3,000 words), it seems likely that
they may not work well with book size documents (average 100,000 words).
 Ranking algorithms that use IDF will be particularly affected.


We are currently investigating grouping and block-join options.
Unfortunately, our data does not have good mark-up or metadata to allow
splitting books by chapter.  We have investigated indexing pages of books,
but  due to many issues including performance and scalability  (We index
the full-text of 11 million books and indexing on the page level it would
result in 3.3 billion solr documents), we haven't arrived at a workable
solution for our use case.   At the moment the main bottleneck is memory
use for faceting, but we intend to experiment with docValues to see if the
increase in index size is worth the reduction in memory use.

Presently block-join indexing does not implement scoring, although we hope
that will change in the near future and the relevance ranking for grouping
will rank the group by the highest ranking member.   So if you split a book
into chapters, it would rank the book by the highest ranking chapter.
 This may be appropriate for your use case as Otis suggested.  In our use
case sometimes this is appropriate, but we are investigating the
possibility of other methods of scoring the group based on a more flexible
function of the scores of the members (i.e scoring book based on function
of scores of chapters).
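For what it's worth, the mechanics of the split itself are simple; here is a
rough SolrJ sketch (the field names, core URL, and the chapter-splitting step
are all placeholders, and it assumes SolrJ 4.x):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical example: index one Solr document per chapter of a book,
// keeping a shared book_id so results can be grouped back to the book level.
public class ChapterIndexer {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    String bookId = "exampleBook001";                       // placeholder volume id
    String[] chapters = { "text of chapter one ...", "text of chapter two ..." };
    List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
    for (int i = 0; i < chapters.length; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", bookId + "_" + i);                 // unique key per sub-document
      doc.addField("book_id", bookId);                      // grouping key
      doc.addField("chapter", i);
      doc.addField("ocr", chapters[i]);
      docs.add(doc);
    }
    server.add(docs);
    server.commit();
    server.shutdown();
  }
}

At query time, group=true&group.field=book_id (or a block join, once scoring
is supported) collapses the chapter hits back to one entry per book.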

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search



On Tue, Mar 18, 2014 at 11:17 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hi,

 I think you probably want to split giant documents because you / your users
 probably want to be able to find smaller sections of those big docs that
 are best matches to their queries.  Imagine querying War and Peace.  Almost
 any regular word your query for will produce a match.  Yes, you may want to
 enable field collapsing aka grouping.  I've seen facet counts get messed up
 when grouping is turned on, but have not confirmed if this is a (known) bug
 or not.

 Otis
 --
 Performance Monitoring * Log Analytics * Search Analytics
 Solr  Elasticsearch Support * http://sematext.com/


 On Tue, Mar 18, 2014 at 10:52 PM, Stephen Kottmann 
 stephen_kottm...@h3biomedicine.com wrote:

  Hi Solr Users,
 
  I'm looking for advice on best practices when indexing large documents
  (100's of MB or even 1 to 2 GB text files). I've been hunting around on
  google and the mailing list, and have found some suggestions of splitting
  the logical document up into multiple solr documents. However, I haven't
  been able to find anything that seems like conclusive advice.
 
  Some background...
 
  We've been using solr with great success for some time on a project that
 is
  mostly indexing very structured data - ie. mainly based on ingesting
  through DIH.
 
  I've now started a new project and we're trying to make use of solr
 again -
  however, in this project we are indexing mostly unstructured data - pdfs,
  powerpoint, word, etc. I've not done much configuration - my solr
 instance
  is very close to the example provided in the distribution aside from some
  minor schema changes. Our index is relatively small at this point ( ~3k
  documents ), and for initial indexing I am pulling documents from a http
  data source, running them through Tika, and then pushing to solr using
  solrj. For the most part this is working great... until I hit one of
 these
  huge text files and then OOM on indexing.
 
  I've got a modest JVM - 4GB allocated. Obviously I can throw more memory
 at
  it, but it seems like maybe there's a more robust solution that would
 scale
  better.
 
  Is splitting the logical document into multiple solr documents best
  practice here? If so, what are the considerations or pitfalls of doing
 this
  that I should be paying attention to. I guess when querying I always need
  to use a group by field to prevent multiple hits for the same document.
 Are
  there issues with term frequency, etc that you need

Default core for updates in multicore setup

2014-02-05 Thread Tom Burton-West
Hello,

I'm running the example setup for Solr 4.6.1.

In the ../example/solr/  directory, I set up a second core.  I  wanted to
send updates to that core.

  I looked at  .../exampledocs/post.sh and expected to see the URL as:  URL=
http://localhost:8983/solr/collection1/update
However it does not have the core name:
URL=http://localhost:8983/solr/update
Solr however accepts updates with that url in the core named collection1.

I then tried to locate some config somewhere that would specify that the
default core would be collection1, but could not find it.

1) Is there somewhere where the default core for  the xx/solr/update URL is
configured?

2) I ran across SOLR-545 which seems to imply that the current behavior
(dispatching the update requests to the core named collection1) is a bug
which was fixed in Solr 1.3.   Is this a new bug or a change in design?

https://issues.apache.org/jira/browse/SOLR-545

Tom


Re: Default core for updates in multicore setup

2014-02-05 Thread Tom Burton-West
Thanks Hoss,

"...the hardcoded default of collection1 is still used for
backcompat when there is no defaultCoreName configured by the user."

Aha, it's hardcoded if there is nothing set in a config.  No wonder I
couldn't find it by grepping around the config files.

I'm still trying to sort out the old and new style solr.xml/core
configuration stuff.  Thanks for your help.

Tom




On Wed, Feb 5, 2014 at 4:31 PM, Chris Hostetter hossman_luc...@fucit.orgwrote:


 : I then tried to locate some config somewhere that would specify that the
 : default core would be collection1, but could not find it.

 in the older style solr.xml, you can specify a defaultCoreName.  Moving
 forward, relying on the default core name is discouraged (and will
 hopefully be removed before 5.0) so it's not possible to configure it in
 the new core discovry style of solr.xml...

 https://cwiki.apache.org/confluence/display/solr/Solr+Cores+and+solr.xml

 For now owever, the hardcoded default of collection1 is still used for
 backcompat when there is no defaultCoreName configured by the user.
 and things like post.sh, post.jar, and the tutorial have not really been
 updated yet to reflect that the use of the default core name is
 deprecated.

 : 2) I ran across SOLR-545 which seems to imply that the current behavior
 : (dispatching the update requests to the core named collection1) is a bug

 Yeah, a lot of things have changed since 1.3 ... not sure when exactly the
 configurable defaultCoreName was added, but it was sometime after that
 issue, I believe.


 -Hoss
 http://www.lucidworks.com/



Re: Evaluating a SOLR index with trec_eval

2013-10-30 Thread Tom Burton-West
Hi Michael,

I know you are asking about Solr, but in case you haven't seen it, Ian
Soboroff has a nice little demo for Lucene:

https://github.com/isoboroff/trec-demo.

There is also the lucene benchmark code:
http://lucene.apache.org/core/4_5_1/benchmark/org/apache/lucene/benchmark/quality/package-summary.html

Otherwise, all I can think of is writing an app layer that keeps track of
the query id, sends the query to Solr, parses the search results, and spits out
results in the TREC format.  I'd love to find some open-source code that
does what you ask.
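The app layer really is just a loop; here is a bare-bones SolrJ sketch (the
topic list, field names, row depth, and run tag are all placeholders, and it
assumes your unique key field matches the TREC docnos):

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

// Hypothetical example: run a set of topics against Solr and print a
// trec_eval results file (qid Q0 docno rank score run_tag) to stdout.
public class TrecRunWriter {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    Map<String, String> topics = new LinkedHashMap<String, String>();
    topics.put("301", "international organized crime");     // topicId -> query text
    for (Map.Entry<String, String> topic : topics.entrySet()) {
      SolrQuery q = new SolrQuery(topic.getValue());
      q.setRows(1000);                                      // typical TREC run depth
      q.setFields("id", "score");
      QueryResponse rsp = server.query(q);
      int rank = 1;
      for (SolrDocument doc : rsp.getResults()) {
        System.out.printf("%s Q0 %s %d %s myRun%n",
            topic.getKey(), doc.getFieldValue("id"), rank++, doc.getFieldValue("score"));
      }
    }
    server.shutdown();
  }
}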

I did a quick and dirty version of something like that for the INEX book
track.  I'll see if I can find the code and if it is in any shape to share.

Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
tburt...@umich.edu
http://www.hathitrust.org/blogs/large-scale-search


On Wed, Oct 30, 2013 at 10:52 AM, Michael Preminger 
michael.premin...@hioa.no wrote:

 Hello!

 Is there a simple way to evaluate a SOLR index with TREC_EVAL?
 I mean:
 *  preparing a query file in some format Solr will understand, but where
 each query has an ID
 * getting results out in trec format, with these query IDs attached

 Thanks

 Michael



Re: ICUTokenizer class not found with Solr 4.4

2013-08-28 Thread Tom Burton-West
Thanks Shawn and Naomi,

I think I am running into the same bug, but the symptoms are a bit
different.
I'm wondering if it makes sense to file a separate linked bug report.

"The workaround is to remove sharedLib from solr.xml"
The solr.xml that comes out-of-the-box does not have a sharedLib.

I am using Solr 4.4 out-of-the-box, with the exception that I set up a
lib directory in example/solr/collection1.   I did not change solr.xml from
the out-of-the-box.  There is no  mention of lib in the out-of-the-box
example/solr/solr.xml.

I did not change the out-of-the-box solrconfig.xml.

 According to the README.txt, all that needs to be done is create the
collection1/lib directory and put the jars there.
However, I am getting the class not found error.

Should I open another bug report or comment on the existing report?

Tom




On Tue, Aug 27, 2013 at 6:48 PM, Shawn Heisey s...@elyograg.org wrote:

 On 8/27/2013 4:29 PM, Tom Burton-West wrote:

  According to the README.txt in solr-4.4.0/solr/example/solr/collection1,
 all we have to do is create a collection1/lib directory and put whatever
 jars we want in there.

 .. /lib.
 If it exists, Solr will load any Jars
 found in this directory and use them to resolve any plugins
  specified in your solrconfig.xml or schema.xml 


I did so  (see below).  However, I keep getting a class not found error
 (see below).

 Has the default changed from what is documented in the README.txt file?
 Is there something I have to change in solrconfig.xml or solr.xml to make
 this work?

 I looked at SOLR-4852, but don't understand.   It sounds like maybe there
 is a problem if the collection1/lib directory is also specified in
 solrconfig.xml.  But I didn't do that. (i.e. out of the box
 solrconfig.xml)
   Does this mean that by following what it says in the README.txt, I am
 making some kind of a configuration error.  I also don't understand the
 workaround in SOLR-4852.


 That's my bug! :)  If you have sharedLib set to lib (or explicitly the
 lib directory under solr.solr.home) in solr.xml, then ICUTokenizer cannot
 be found despite the fact that all the correct jars are there.

 The workaround is to remove sharedLib from solr.xml, or set it to some
 other directory that either doesn't exist or has no jars in it.  The
 ${solr.solr.home}/lib directory is automatically added to the classpath
 regardless of config, there seems to be some kind of classloading bug when
 the sharedLib adds the same directory again.  This all worked fine in 3.x,
 and early 4.x releases, but due to classloader changes, it seems to have
 broken.  I think (based on the issue description) that it started being a
 problem with 4.3-SNAPSHOT.

 The same thing happens if you set sharedLib to foo and put some of your
 jars in lib and some in foo.  It's quite mystifying.

 Thanks,
 Shawn




Re: ICUTokenizer class not found with Solr 4.4

2013-08-28 Thread Tom Burton-West
My point in the previous e-mail was that following the instructions in the
documentation does not seem to work.
The workaround I found was to simply change the name of the collection1/lib
directory to collection1/foobar and then include it in solrconfig.xml.
  <lib dir="./foobar" />

This works, but does not explain why out-of-the-box, simply creating a
collection1/lib directory and putting the jars there does not work as
documented in both  the README.txt and in solrconfig.xml.

Shawn, should I add these comments to your JIRA issue?
Should I open a separate related JIRA issue?

Tom

Tom


On Tue, Aug 27, 2013 at 7:18 PM, Shawn Heisey s...@elyograg.org wrote:

 On 8/27/2013 5:11 PM, Naomi Dushay wrote:

 Perhaps you are missing the following from your solrconfig

 lib dir=/home/blacklight/solr-**home/lib /


 I ran into this issue (I'm the one that filed SOLR-4852) and I am not
 using blacklight.  I am only using what can be found in a Solr download,
 plus the MySQL JDBC driver for dataimport.

 I prefer not to load jars via solrconfig.xml.  I have a lot of cores and
 every core needs to use the same jars.  Rather than have the same jars
 loaded 18 times (once by each of the 18 solrconfig.xml files), I would
 rather have Solr load them once and make the libraries available to all
 cores.  Using ${solr.solr.home}/lib accomplishes this goal.

 Thanks,
 Shawn




Re: ICUTokenizer class not found with Solr 4.4

2013-08-28 Thread Tom Burton-West
Hi Shawn,

I'm going to add this to the your JIRA unless you think that it would be
good to open another issue.
The issue for me is that making a ./lib in the instanceDir is documented as
working in several places and has worked in previous versions of Solr, for
example solr 4.1.0.

 If I make a ./lib directory in Solr Home, all works just fine. However
according to the documentation making a ./lib directory in the instanceDir
should work, and in fact in Solr 4.1.0 it works just fine.

 So the question for me is whether making a ./lib directory as documented
in collections1/conf/solrconfig.xml and collections1/README.txt is supposed
to work in Solr 4.4 , but due to a bug it is not working.

   If it is not supposed to work, then the documentation needs fixing and
some note needs to be made about upgrading from previous versions of Solr.

Do you think I should open another JIRA and link it to yours or just add
this information (i.e. other scenarios where class loading not working) to
your JIRA?

Details below:

Tom

The documentation in the collections1/conf  directory is confusing.   For
example the collections1/conf/solrconfig.xml file says you should put a
./lib dir in your instanceDir.  (Am I correct that an instanceDir refers to
the core? )   On the other hand the documentation in the
collections1/README.txt is confusing about whether it is talking about the
instanceDir or Solr Home:

  For example, In collections1/conf/solrconfig.xml there is this comment:

   If a ./lib directory exists in your instanceDir, all files
   found in it are included as if you had used the following
   syntax...

  <lib dir="./lib" />

Also in collections1/conf/README.txt  it is suggested that you use ./lib
 but that README.txt file needs editing as it is very confusing about
whether it is talking about Solr Home or the Instance Directory in the text
excerpted below.

I would assume that the conf and data directories have to be subdirectories
of the instanceDir, since I assume they are set per core.   So in the
excerpt below the discussion of the sub-directories should apply to the
instanceDir not Solr Home.



Example SolrCore Instance Directory
=
This directory is provided as an example of what an Instance Directory
should look like for a SolrCore

It's not strictly necessary that you copy all of the files in this
directory when setting up a new SolrCores, but it is recommended.

Basic Directory Structure
-

The Solr Home directory typically contains the following sub-directories:

  conf/
This directory is mandatory and must contain your solrconfig.xml
and schema.xml.  Any other optional configuration files would also
be kept here.

   data/
This directory is the default location where Solr will keep your
...
lib/



On Wed, Aug 28, 2013 at 12:11 PM, Shawn Heisey s...@elyograg.org wrote:

 On 8/28/2013 9:34 AM, Tom Burton-West wrote:

 I think I am running into the same bug, but the symptoms are a bit
 different.
 I'm wondering if it makes sense to file a separate linked bug report.

  The workaround is to remove sharedLib from solr.xml,

 The solr.xml that comes out-of-the-box does not have a sharedLib.

  I am using Solr 4.4. out-of-the-box, with the exception that I set
 up a
 lib directory in example/solr/collection1.   I did not change solr.xml
 from
 the out-of-the-box.  There is no  mention of lib in the out-of-the-box
 example/solr/solr.xml.

 I did not change the out-of-the-box solrconfig.xml.

   According to the README.txt, all that needs to be done is create the
 collection1/lib directory and put the jars there.
 However, I am getting the class not found error.

 Should I open another bug report or comment on the existing report?


 I have never heard of using ${instanceDir}/lib for jars.  That doesn't
 mean it won't work, but I have never seen it mentioned anywhere.

 I have only ever put the lib directory in solr.home, where solr.xml is.
  Did you try that?

 If you have seen documentation for collection1/lib, then there may be a
 doc bug, another dimension to the bug already filed, or a new bug.  Do you
 see log entries saying your jars in collection/lib are loaded?  If you do,
 then I think it's probably another dimension to the existing bug.

 Thanks,
 Shawn




ICUTokenizer class not found with Solr 4.4

2013-08-27 Thread Tom Burton-West
Hello all,

According to the README.txt in solr-4.4.0/solr/example/solr/collection1,
all we have to do is create a collection1/lib directory and put whatever
jars we want in there.

.. /lib.
   If it exists, Solr will load any Jars
   found in this directory and use them to resolve any plugins
specified in your solrconfig.xml or schema.xml 


  I did so  (see below).  However, I keep getting a class not found error
(see below).

Has the default changed from what is documented in the README.txt file?
Is there something I have to change in solrconfig.xml or solr.xml to make
this work?

I looked at SOLR-4852, but don't understand.   It sounds like maybe there
is a problem if the collection1/lib directory is also specified in
solrconfig.xml.  But I didn't do that. (i.e. out of the box solrconfig.xml)
 Does this mean that by following what it says in the README.txt, I am
making some kind of a configuration error.  I also don't understand the
workaround in SOLR-4852.

Is this an ICU issue?  A java 7 issue?  a Solr 4.4 issue,  or did I simply
not understand the README.txt?



Tom

--


org.apache.solr.common.SolrException; null:java.lang.NoClassDefFoundError:
org/apache/lucene/analysis/icu/segmentation/ICUTokenizer

 ls collection1/lib
icu4j-49.1.jar
lucene-analyzers-icu-4.4-SNAPSHOT.jar
solr-analysis-extras-4.4-SNAPSHOT.jar

https://issues.apache.org/jira/browse/SOLR-4852

Collection1/README.txt excerpt:

 lib/
This directory is optional.  If it exists, Solr will load any Jars
found in this directory and use them to resolve any plugins
specified in your solrconfig.xml or schema.xml (ie: Analyzers,
Request Handlers, etc...).  Alternatively you can use the lib
syntax in conf/solrconfig.xml to direct Solr to your plugins.  See
the example conf/solrconfig.xml file for details.


How to set discountOverlaps=true in Solr 4x schema.xml

2013-08-22 Thread Tom Burton-West
If I am using solr.SchemaSimilarityFactory to allow different similarities
for different fields, do I set discountOverlaps=true on the factory or
per field?

What is the syntax?   The below does not seem to work

<similarity class="solr.BM25SimilarityFactory" discountOverlaps="true" />
<similarity class="solr.SchemaSimilarityFactory" discountOverlaps="true" />

Tom


Re: How to set discountOverlaps=true in Solr 4x schema.xml

2013-08-22 Thread Tom Burton-West
Thanks Markus,

I set it , but it seems to make no difference in the score or statistics
listed in the debugQuery or in the ranking.   I'm using a field with
CommonGrams and a huge list of common words, so there should be a huge
difference in the document length with and without discountOverlaps.

Is the default for Solr 4 true?

 <similarity class="solr.BM25SimilarityFactory">
   <float name="k1">1.2</float>
   <float name="b">0.75</float>
   <bool name="discountOverlaps">false</bool>
 </similarity>



On Thu, Aug 22, 2013 at 4:58 PM, Markus Jelsma
markus.jel...@openindex.iowrote:

 Hi Tom,

 Don't set it as attributes but as lists as Solr uses everywhere:
  <similarity class="solr.SchemaSimilarityFactory">
    <bool name="discountOverlaps">true</bool>
  </similarity>

 For BM25 you can also set k1 and b which is very convenient!

 Cheers


 -Original message-
  From:Tom Burton-West tburt...@umich.edu
  Sent: Thursday 22nd August 2013 22:42
  To: solr-user@lucene.apache.org
  Subject: How to set discountOverlaps="true" in Solr 4x
 schema.xml
 
  If I am using solr.SchemaSimilarityFactory to allow different
 similarities
  for different fields, do I set discountOverlaps=true on the factory or
  per field?
 
  What is the syntax?   The below does not seem to work
 
   <similarity class="solr.BM25SimilarityFactory" discountOverlaps="true" />
   <similarity class="solr.SchemaSimilarityFactory" discountOverlaps="true" />
 
  Tom
 



Re: How to set discountOverlaps=true in Solr 4x schema.xml

2013-08-22 Thread Tom Burton-West
I should have said that I have set it both to true and to false and
restarted Solr each time and the rankings and info in the debug query
showed no change.

Does this have to be set at index time?

Tom






Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
Hello,

I am running solr 4.2.1 on 3 shards and have about 365 million documents in
the index total.
I sent a query asking for 1 million rows at a time,  but I keep getting an
error claiming that there is an invalid version or data not in javabin
format (see below)

If I lower the number of rows requested to 100,000, I have no problems.

Does Solr have  a limit on number of rows that can be requested or is this
a bug?


Tom

INFO: [core] webapp=/dev-1 path=/select
params={shards=XXX:8111/dev-1/core,XXX:8111/dev-2/core,XXX:8111/dev-3/corefl=vol_idindent=onstart=0q=*:*rows=100}
hits=365488789 status=500 QTime=132743
Jul 25, 2013 1:26:00 PM org.apache.solr.common.SolrException log
SEVERE: null:org.apache.solr.common.SolrException:
java.lang.RuntimeException: Invalid version (expected 2, but 60) or the
data in not in 'javabin' format
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:302)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:548)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
at
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
at
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
at
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
at
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.RuntimeException: Invalid version (expected 2, but 60)
or the data in not in 'javabin' format
at
org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:109)
at
org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
:


Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
Hi Jack,

I should have pointed out our use case.  In any reasonable case where
actual end users will be looking at search results, paging 1,000 at a time
is reasonable.  But what we are doing is a dump of the unique ids with a
*:* query.   This allows us to verify that what our system thinks has
been indexed is actually indexed.   Since we need to dump out results in
the hundreds of millions,  requesting 1,000 at a time is not scalable.

The other context of this is that we currently index 10 million books with
each book as a Solr document.  We are looking at indexing at the page
level, which would result in about 3 billion pages.  So testing the
scalability of queries used by our current production system, such as the
query against the index that is not released to production to get a list of
the unique ids that are actually indexed in Solr, is part of that testing
process.

Tom


On Thu, Jul 25, 2013 at 2:13 PM, Jack Krupansky j...@basetechnology.comwrote:

 As usual, there is no published hard limit per se, but I would urge
 caution about requesting more than 1,000 rows at a time or even 250. Sure,
 in a fair number of cases 5,000 or 10,000 or even 100,000 MAY work (at
 least sometimes), but Solr and Lucene are more appropriate for paged
 results, where page size is 10, 20, 50, 100 or something in that range. So,
 my recommendation is to use 250 to 1,000 as the limit for rows. And
 certainly do a proof of concept implementation for anything above 1,000.

 So, if rows=10 works for you, consider yourself lucky!

 That said, there is sometimes talk of supporting streaming, which
 presumably would allow access to all results, but chunked/paged in some way.

 -- Jack Krupansky

 -Original Message- From: Tom Burton-West
 Sent: Thursday, July 25, 2013 1:39 PM
 To: solr-user@lucene.apache.org
 Subject: Solr 4.2.1 limit on number of rows or number of hits per shard?

 Hello,

 I am running solr 4.2.1 on 3 shards and have about 365 million documents in
 the index total.
 I sent a query asking for 1 million rows at a time,  but I keep getting an
 error claiming that there is an invalid version or data not in javabin
 format (see below)

 If I lower the number of rows requested to 100,000, I have no problems.

 Does Solr have  a limit on number of rows that can be requested or is this
 a bug?


 Tom

 INFO: [core] webapp=/dev-1 path=/select
 params={shards=XXX:8111/dev-1/**core,XXX:8111/dev-2/core,XXX:**
 8111/dev-3/corefl=vol_id**indent=onstart=0q=*:*rows=**100}
 hits=365488789 status=500 QTime=132743
 Jul 25, 2013 1:26:00 PM org.apache.solr.common.**SolrException log
 SEVERE: null:org.apache.solr.common.**SolrException:
 java.lang.RuntimeException: Invalid version (expected 2, but 60) or the
 data in not in 'javabin' format
at
 org.apache.solr.handler.**component.SearchHandler.**handleRequestBody(**
 SearchHandler.java:302)
at
 org.apache.solr.handler.**RequestHandlerBase.**handleRequest(**
 RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.**execute(SolrCore.java:1817)
at
 org.apache.solr.servlet.**SolrDispatchFilter.execute(**
 SolrDispatchFilter.java:639)
at
 org.apache.solr.servlet.**SolrDispatchFilter.doFilter(**
 SolrDispatchFilter.java:345)
at
 org.apache.solr.servlet.**SolrDispatchFilter.doFilter(**
 SolrDispatchFilter.java:141)
at
 org.apache.catalina.core.**ApplicationFilterChain.**internalDoFilter(**
 ApplicationFilterChain.java:**215)
at
 org.apache.catalina.core.**ApplicationFilterChain.**doFilter(**
 ApplicationFilterChain.java:**188)
at
 org.apache.catalina.core.**StandardWrapperValve.invoke(**
 StandardWrapperValve.java:213)
at
 org.apache.catalina.core.**StandardContextValve.invoke(**
 StandardContextValve.java:172)
at
 org.apache.catalina.valves.**AccessLogValve.invoke(**
 AccessLogValve.java:548)
at
 org.apache.catalina.core.**StandardHostValve.invoke(**
 StandardHostValve.java:127)
at
 org.apache.catalina.valves.**ErrorReportValve.invoke(**
 ErrorReportValve.java:117)
at
 org.apache.catalina.core.**StandardEngineValve.invoke(**
 StandardEngineValve.java:108)
at
 org.apache.catalina.connector.**CoyoteAdapter.service(**
 CoyoteAdapter.java:174)
at
 org.apache.coyote.http11.**Http11Processor.process(**
 Http11Processor.java:875)
at
 org.apache.coyote.http11.**Http11BaseProtocol$**Http11ConnectionHandler.**
 processConnection(**Http11BaseProtocol.java:665)
at
 org.apache.tomcat.util.net.**PoolTcpEndpoint.processSocket(**
 PoolTcpEndpoint.java:528)
at
 org.apache.tomcat.util.net.**LeaderFollowerWorkerThread.**runIt(**
 LeaderFollowerWorkerThread.**java:81)
at
 org.apache.tomcat.util.**threads.ThreadPool$**ControlRunnable.run(**
 ThreadPool.java:689)
at java.lang.Thread.run(Thread.**java:619)
 Caused by: java.lang.RuntimeException: Invalid version (expected 2, but 60)
 or the data in not in 'javabin

Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
Thanks Shawn,

I was confused by the error message: Invalid version (expected 2, but 60)
or the data in not in 'javabin' format

Your explanation makes sense.  I didn't think about what the shards have to
send back to the head shard.
Now that I look in my logs, I can see the posts that  the shards are
sending to the head shard and actually get a good measure of how many bytes
are being sent around.

I'll poke around and look at multipartUploadLimitInKB, and also see if
there is some servlet container limit config I might need to mess with.

Tom




On Thu, Jul 25, 2013 at 2:46 PM, Shawn Heisey s...@elyograg.org wrote:

 On 7/25/2013 12:26 PM, Shawn Heisey wrote:

 Either multipartUploadLimitInKB doesn't work properly, or there may be
 some hard limits built into the servlet container, because I set
 multipartUploadLimitInKB in the requestDispatcher config to 32768 and it
 still didn't work.  I wonder, perhaps there is a client-side POST buffer
 limit as well as the servlet container limit, which comes in to play
 because the Solr server is acting as a client for the distributed
 requests?


 Followup:

 I should probably add that I used a different version (and got some
 different errors) because what I've got on my dev server is an old
 branch_4x version:

 4.4-SNAPSHOT 1497605 - ncindex - 2013-06-27 17:12:30

 My online production system is 4.2.1, but I am not going to run this query
 on that system because of the potential to break things.  I did try it
 against my backup production system running 3.5.0 with a 1MB server-side
 POST buffer and got an error that seems to at least partially confirm my
 suspicions.  Here's an excerpt:

 HTTP ERROR 500

 Problem accessing /solr/ncmain/select. Reason:

  Form too large18425104>1048576

  java.lang.IllegalStateException: Form too large18425104>1048576
      at org.mortbay.jetty.Request.extractParameters(Request.java:1561)
      at org.mortbay.jetty.Request.getParameterMap(Request.java:870)
      at org.apache.solr.request.ServletSolrParams.<init>(ServletSolrParams.java:29)
      at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:394)
      at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:223)
      at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
      at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
      at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
      at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
      at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
      at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
      at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
      at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
      at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
      at org.mortbay.jetty.Server.handle(Server.java:326)
      at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
      at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
      at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
      at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
      at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
      at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
      at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

 Thanks,
 Shawn




Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
Hi Shawn,

Thanks for your help.   I found a workaround for this use case, which is to
avoid using a shards query and just ask each shard for a dump of the
unique ids, i.e. run a *:* query and ask for 1 million rows at a time.
This should be a non-scoring query, so I would think that it doesn't have to
do any ranking or sorting.   What I am now seeing is that qtimes have gone
up from about 5 seconds per request to nearly a minute as the start
parameter gets higher.  I don't know if this is actually because of the
start parameter or if something is happening with memory use and/or caching
that is just causing things to take longer.  I'm at around 35 million out of
119 million for this shard, and queries have gone from taking 5 seconds to
taking almost a minute.
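One thing I may try instead of the ever-deeper start offsets is the
unique-key walk Shawn suggests below: keep start at 0, sort by the id field,
and filter each batch to ids above the last one seen. A rough SolrJ sketch
(the field name, rows value, and core URL are placeholders, and it assumes
the 4.x query parser's mixed range bounds):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

// Hypothetical example: dump every unique id from one shard by walking the
// key range instead of paging with an ever-larger start parameter.
public class IdDumper {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8111/dev-1/core");
    String lastId = null;
    while (true) {
      SolrQuery q = new SolrQuery();
      q.setQuery(lastId == null ? "*:*"
          : "vol_id:{" + ClientUtils.escapeQueryChars(lastId) + " TO *]");
      q.set("fl", "vol_id");
      q.set("sort", "vol_id asc");
      q.set("rows", 1000000);
      q.set("start", 0);                // start stays at 0; the range filter moves instead
      QueryResponse rsp = server.query(q);
      SolrDocumentList docs = rsp.getResults();
      if (docs.isEmpty()) {
        break;
      }
      for (SolrDocument doc : docs) {
        lastId = (String) doc.getFieldValue("vol_id");
        System.out.println(lastId);
      }
    }
    server.shutdown();
  }
}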

INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_idindent=onstart=3600q=*:*rows=100}
hits=119220943 status=0 QTime=52952


Tom


INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_idindent=onstart=700q=*:*rows=100}
hits=119220943 status=0 QTime=9772
Jul 25, 2013 5:39:43 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_idindent=onstart=800q=*:*rows=100}
hits=119220943 status=0 QTime=11274
Jul 25, 2013 5:41:44 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_idindent=onstart=900q=*:*rows=100}
hits=119220943 status=0 QTime=13104
Jul 25, 2013 5:43:39 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_idindent=onstart=1000q=*:*rows=100}
hits=119220943 status=0 QTime=13568
...
...
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_idindent=onstart=1300q=*:*rows=100}
hits=119220943 status=0 QTime=26703

Jul 25, 2013 5:58:20 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_idindent=onstart=1700q=*:*rows=100}
hits=119220943 status=0 QTime=22607
Jul 25, 2013 6:00:31 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_idindent=onstart=1800q=*:*rows=100}
hits=119220943 status=0 QTime=24109
...
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_idindent=onstart=3000q=*:*rows=100}
hits=119220943 status=0 QTime=41034
Jul 25, 2013 6:31:36 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_idindent=onstart=3100q=*:*rows=100}
hits=119220943 status=0 QTime=42844
Jul 25, 2013 6:34:16 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_idindent=onstart=3200q=*:*rows=100}
hits=119220943 status=0 QTime=45046
Jul 25, 2013 6:36:57 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_idindent=onstart=3300q=*:*rows=100}
hits=119220943 status=0 QTime=49792
Jul 25, 2013 6:39:43 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_idindent=onstart=3400q=*:*rows=100}
hits=119220943 status=0 QTime=58699




On Thu, Jul 25, 2013 at 6:18 PM, Shawn Heisey s...@elyograg.org wrote:

 On 7/25/2013 3:09 PM, Tom Burton-West wrote:

 Thanks Shawn,

 I was confused by the error message: Invalid version (expected 2, but 60)
 or the data in not in 'javabin' format

 Your explanation makes sense.  I didn't think about what the shards have
 to
 send back to the head shard.
 Now that I look in my logs, I can see the posts that  the shards are
 sending to the head shard and actually get a good measure of how many
 bytes
 are being sent around.

 I'll poke around and look at multipartUploadLimitInKB, and also see if
 there is some servlet container limit config I might need to mess with.


 I think I figured it out, after a peek at the source code.  I upgraded to
 Solr 4.4 first, my 100,000 row query still didn't work.  By setting
 formdataUploadLimitInKB (in addition to multipartUploadLimitInKB, not sure
 if both are required), I was able to get a 100,000 row query to work.

 A query for one million rows did finally respond to my browser query, but
 it took a REALLY REALLY long time (82 million docs in several shards, only
 16GB RAM on the dev server) and it crashed firefox due to the size of the
 response.  It also seemed to error out on some of the shard responses.  My
 handler has shards.tolerant=true, so that didn't seem to kill the whole
 query ... but because the response crashed firefox, I couldn't tell.

 I repeated the query using curl so I could save the response.  It's been
 running for several minutes without any server-side errors, but I still
 don't have any results.

 Your servers are much more robust than my little dev server, so this might
 work for you - if you aren't using the start parameter in addition to the
 rows parameter.  You might need to sort ascending by your unique key field
 and use a range query ([* TO *] for the first one), find the highest

Re: What does too many merges...stalling in indexwriter log mean?

2013-07-12 Thread Tom Burton-West
Thanks Shawn,

Do you have any feeling for what gets traded off if we increase the
maxMergeCount?

This is completely new for us because we are experimenting with indexing
pages instead of whole documents.  Since our average document is about 370
pages, this means that we have increased the number of documents we are
asking Solr to index by a couple of orders of magnitude. (on the other hand
the size of the document decreases by a couple of orders of magnitude).
I'm not sure why increasing the number of documents (and reducing their
size) is causing more merges.  I'll have to investigate.

Tom


On Thu, Jul 11, 2013 at 5:29 PM, Shawn Heisey s...@elyograg.org wrote:

 On 7/11/2013 1:47 PM, Tom Burton-West wrote:

 We are seeing the message too many merges...stalling  in our indexwriter
 log.   Is this something to be concerned about?  Does it mean we need to
 tune something in our indexing configuration?


 It sounds like you've run into the maximum number of simultaneous merges,
 which I believe defaults to two, or maybe three.  The following config
 section in indexConfig will likely take care of the issue. This assumes
 3.6 or later, I believe that on older versions, this goes in
 indexDefaults.

    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxThreadCount">1</int>
      <int name="maxMergeCount">6</int>
    </mergeScheduler>

 Looking through the source code to confirm, this definitely seems like the
 case.  Increasing maxMergeCount is likely going to speed up your indexing,
 at least by a little bit.  A value of 6 is probably high enough for mere
 mortals, but you guys don't do anything small, so I won't begin to
 speculate what you'll need.

 If you are using spinning disks, you'll want maxThreadCount at 1.  If
 you're using SSD, then you can likely increase that value.

 Thanks,
 Shawn




What does too many merges...stalling in indexwriter log mean?

2013-07-11 Thread Tom Burton-West
Hello,

We are seeing the message too many merges...stalling  in our indexwriter
log.   Is this something to be concerned about?  Does it mean we need to
tune something in our indexing configuration?

Tom


When not to use NRTCachingDirectory and what to use instead.

2013-07-10 Thread Tom Burton-West
Hello all,

The default directory implementation in Solr 4 is the NRTCachingDirectory
(in the example solrconfig.xml file, see below).

The Javadoc for NRTCachingDirectory (
http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/store/NRTCachingDirectory.html?is-external=true)
 says:

 This class is likely only useful in a near real-time context, where
indexing rate is lowish but reopen rate is highish, resulting in many tiny
files being written...

It seems like we have exactly the opposite use case, so we would like
advice on what directory implementation to use instead.

We are doing offline batch indexing, so no searches are being done.  So we
don't need NRT.  We also have a high indexing rate as we are trying to
index 3 billion pages as quickly as possible.

I am not clear what determines the reopen rate.   Is it only related to
searching or is it involved in indexing as well?

 Does the NRTCachingDirectory have any benefit for indexing under the use
case noted above?

I'm guessing we should just use the solr.StandardDirectoryFactory instead.
 Is this correct?

Tom

---





<!-- The DirectoryFactory to use for indexes.

     solr.StandardDirectoryFactory is filesystem
     based and tries to pick the best implementation for the current
     JVM and platform.  solr.NRTCachingDirectoryFactory, the default,
     wraps solr.StandardDirectoryFactory and caches small files in memory
     for better NRT performance.

     One can force a particular implementation via solr.MMapDirectoryFactory,
     solr.NIOFSDirectoryFactory, or solr.SimpleFSDirectoryFactory.

     solr.RAMDirectoryFactory is memory based, not
     persistent, and doesn't work with replication.
-->
<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>


Solr 4.x replacement for termsIndexDivisor

2013-05-21 Thread Tom Burton-West
Due to multiple languages and dirty OCR, our indexes have over 2 billion
unique terms
 ( http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again
 ).

In Solr 3.6 and previous we needed to reduce the memory used for storing
the in-memory representation of the tii file.   We originally used the
termInfosIndexDivisor which affects the sampling of the tii file when read
into memory.  Later we used the termIndexInterval.
Please see
http://lucene.472066.n3.nabble.com/Solr-4-0-Beta-termIndexInterval-vs-termIndexDivisor-vs-termInfosIndexDivisor-tt4006182.htmlfor
more background.

Neither of these work with the default posting format in Solr4.x.  However
in the latest Solr 4.x example/solrconfig.xml file there is commented out
text that implies that you can still use setTermIndexDivisor (appended
below).  That should probably be removed from the example if it does not
work in Solr 4.x.

At the Lucene level there are parameters that affect the size of the
in-memory representation of the terms index (the .tip file).
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html

In the Javadoc for IndexWriterConfig.setTermIndexInterval, there is the
following statement:

"This parameter does not apply to all PostingsFormat implementations,
including the default one in this release. It only makes sense for term
indexes that are implemented as a fixed gap between terms. For example,
Lucene41PostingsFormat
(http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html)
implements the term index instead based upon how terms share prefixes. To
configure its parameters (the minimum and maximum size for a block), you
would instead use Lucene41PostingsFormat.Lucene41PostingsFormat(int, int)
(http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Lucene41PostingsFormat%28int,%20int%29),
which can also be configured on a per-field basis."
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29

This is followed by an example of how to set the min and max block size in
Lucene.
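For reference, at the Lucene level that per-field configuration looks roughly
like the following (a sketch against the 4.3-era API; the block sizes are
arbitrary, and hooking it into Solr would still need some Solr-side wiring
such as a codec factory, which is really what I'm asking about):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
import org.apache.lucene.codecs.lucene42.Lucene42Codec;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

// Sketch: use larger term-dictionary blocks for every field, which shrinks the
// in-memory terms index (.tip) at the cost of somewhat slower term lookups.
public class BigBlockCodecExample {
  public static IndexWriterConfig makeConfig() {
    Codec codec = new Lucene42Codec() {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        return new Lucene41PostingsFormat(64, 128);   // min/max term block size
      }
    };
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43,
        new StandardAnalyzer(Version.LUCENE_43));
    iwc.setCodec(codec);
    return iwc;
  }
}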

Is the ability to set the min and max block size available in Solr?

If not, should I open a JIRA?


Tom
--
Exceprt from the Solr 4.3 latest rev of the example/solrconfig.xml file:

http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/solr/example/solr/collection1/conf/solrconfig.xml?revision=1470617view=co

<!-- By explicitly declaring the Factory, the termIndexDivisor can
     be specified.
-->
<!--
  <indexReaderFactory name="IndexReaderFactory"
                      class="solr.StandardIndexReaderFactory">
    <int name="setTermIndexDivisor">12</int>
  </indexReaderFactory >
-->


Re: Slow queries for common terms

2013-03-22 Thread Tom Burton-West
Hi David and Jan,

I wrote the blog post, and David, you are right, the problem we had was
with phrase queries because our positions lists are so huge.  Boolean
queries don't need to read the positions lists.   I think you need to
determine whether you are CPU bound or I/O bound.It is possible that
you are I/O bound and reading the term frequency postings for 90 million
docs is taking a long time.  In that case, More memory in the machine (but
not dedicated to Solr) might help because Solr relies on OS disk caching
for caching the postings lists.  You would still need to do some cache
warming with your most common terms.

On the other hand as Jan pointed out, you may be cpu bound because Solr
doesn't have early termination and has to rank all 90 million docs in order
to show the top 10 or 25.

Did you try the OR search to see if your CPU is at 100%?

Tom

On Fri, Mar 22, 2013 at 10:14 AM, Jan Høydahl jan@cominvent.com wrote:

 Hi

 There might not be a final cure with more RAM if you are CPU bound.
 Scoring 90M docs is some work. Can you check what's going on during those
 15 seconds? Is your CPU at 100%? Try a (foo OR bar OR baz) search which
 generates 100 million hits and see if that is slow too, even if you don't use
 frequent words.

 I'm sure you can find other frequent terms in your corpus which display
 similar behaviour, words which are even more frequent than book. Are you
 using AND as default operator? You will benefit from limiting the number
 of results as much as possible.

 The real solution is to shard across N number of servers, until you reach
 the desired performance for the desired indexing/querying load.

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Solr Training - www.solrtraining.com




ngrams or truncation for multilingual searching in Solr

2013-02-05 Thread Tom Burton-West
Hello all,

We have a large number of languages which we currently index all in one
index.  The paper below uses ngrams as a substitute for language-specific
stemming and got good results with a number of complex languages.Has
anyone tried doing this with Solr?

 They also got fairly good results (at least for the more complex
languages) by simply truncating words.
We would be very interested to hear about any experience using either of
these approaches for multiple languages
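To make the question concrete, the sort of field I am imagining is roughly the
following (untested sketch; the gram sizes are only a guess at something close
to the paper's setup):

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="5"/>
  </analyzer>
</fieldType>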


The test collections were pretty much short newswire stories, so the other
question is whether similar results might be expected for longer documents.

Paul McNamee, Charles Nicholas, and James Mayfield. 2009. Addressing
morphological variation in alphabetic languages. In *Proceedings of the
32nd international ACM SIGIR conference on Research and development in
information retrieval* (SIGIR '09). ACM, New York, NY, USA, 75-82.
DOI=10.1145/1571941.1571957 http://doi.acm.org/10.1145/1571941.1571957

Tom Burton-West

http://www.hathitrust.org/blogs/large-scale-search


Why does debugQuery/explain output sometimes include queryNorm and sometimes not for same query?

2013-01-25 Thread Tom Burton-West
Hello all,

I have a one term query:  ocr:aardvark   When I look at the explain
output, for some matches the queryNorm and fieldWeight are shown and for
some matches only the weight is shown with no query norm.  (See below)

What explains the difference?  Shouldn't the queryNorm be applied to each
result (and show up in each explain from the debugQuery?)

This is Solr 3.6.

Tom
-

<str name="parsedquery">ocr:aardvark</str>

<lst name="explain">

<str name="mdp.39015059168313">0.4395488 = (MATCH)
fieldWeight(ocr:aardvark in 504374), product of:
  7.5498343 = tf(termFreq(ocr:aardvark)=57)
  7.4521165 = idf(docFreq=1328, maxDocs=842643)
  0.0078125 = fieldNorm(field=ocr, doc=504374)</str>
<str name="mdp.39015050480766">0.43645293 = (MATCH)
weight(ocr:aardvark in 380212), product of:
  0.9994 = queryWeight(ocr:aardvark), product of:
    7.3996296 = idf(docFreq=1550, maxDocs=933116)
    0.1351419 = queryNorm
  0.43645296 = (MATCH) fieldWeight(ocr:aardvark in 380212), product of:
    7.5498343 = tf(termFreq(ocr:aardvark)=57)
    7.3996296 = idf(docFreq=1550, maxDocs=933116)
    0.0078125 = fieldNorm(field=ocr, doc=380212)</str>
<str name="mdp.39015057528567">0.3501519 = (MATCH)
fieldWeight(ocr:aardvark in 200365), product of:
  1.0 = tf(termFreq(ocr:aardvark)=1)
  7.4699073 = idf(docFreq=1436, maxDocs=927474)
  0.046875 = fieldNorm(field=ocr, doc=200365)</str>


Re: Why does debugQuery/explain output sometimes include queryNorm and sometimes not for same query?

2013-01-25 Thread Tom Burton-West
Thanks Hoss,

Yes it is a distributed query.

Tom

On Fri, Jan 25, 2013 at 2:32 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : I have a one term query:  ocr:aardvark   When I look at the explain
 : output, for some matches the queryNorm and fieldWeight are shown and for
 : some matches only the weight is shown with no query norm.  (See below)

 It looks like this is from a distributed query, correct?

 Explanation's generally exlude things that don't wind up impacting the
 query -- so for example, no query boost section exists when he boost is
 1.0 and doesn't affect the multiplication.

 In your specific case i suspect the queryWeight section (which includes
 the queryNorm) is being excluded for some documents if/when it's value is
 1.0

 The reason the queryWeight is probably 1.0 for some documents, but clearly
 not all, goes back to my question about this being a distributed query --
 note the difference in maxDoc and docFreq numbers used in computing the
 idf for the same term in two differnet documents...

 0.4395488 = (MATCH) fieldWeight(ocr:aardvark in 504374), product of:
...
7.4521165 = idf(docFreq=1328, maxDocs=842643)

 0.43645293 = (MATCH) weight(ocr:aardvark in 380212), product of:
...
7.3996296 = idf(docFreq=1550, maxDocs=933116)

 ...so in one shard the queryWeight winds up being 1.0 and gets left out
 of the explanation for conciseness, but in another shard the queryWeight
 winds up being 0.9994 and is included in the explanation.


 -Hoss



coord missing from debugQuery explain?

2013-01-08 Thread Tom Burton-West
Hello,

I'm trying to understand some Solr relevance issues using debugQuery=on,
but I don't see the coord factor listed anywhere in the explain output.
My understanding is that the coord factor is not included in either the
querynorm or the fieldnorm.
What am I missing?

Tom


Best practices for Solr highlighter for CJK

2013-01-02 Thread Tom Burton-West
Hello all,

What are the best practices for setting up the highlighter to work with CJK?
We are using the ICUTokenizer with the CJKBigramFilter, so overlapping
bigrams are what are actually being searched. However the highlighter seems
to only highlight the first of any two overlapping bigrams, i.e. ABC is
searched as AB BC, but only AB gets highlighted even if the matching string is
ABC. (Where ABC are Chinese characters such as 大亚湾, searched as 大亚 亚湾,
but only   大亚 is highlighted rather than 大亚湾)

Is there some highlighting parameter that might fix this?
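One variable I am unsure about is which highlighter is in play: the standard
Highlighter versus the FastVectorHighlighter (hl.useFastVectorHighlighter=true),
the latter of which needs the field indexed with term vectors, e.g. (the field
name here is just an example):

<field name="ocr" type="CJKFullText" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

Does either of them treat overlapping bigram positions differently?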

Tom Burton-West


ICUTokenizer labels number as Han character?

2012-12-19 Thread Tom Burton-West
Hello,

Don't know if the Solr admin panel is lying, or if this is a weird bug.
The string: 1986年  gets analyzed by the ICUTokenizer with 1986 being
identified as type:NUM and script:Han.  Then the CJKBigram filter
identifies 1986 as type:Num and script:Han and 年 as type:Single and
script: Common.

This doesn't seem right.   Couldn't fit the whole analysis output on one
screen so there are two screenshots attached.

Any clues as to what is going on and whether it is a problem?

Tom


configuring per-field similarity in Solr 4: the global similarity does not support it

2012-12-17 Thread Tom Burton-West
Hello,

I have Solr 4 configured with several fields using different similarity
classes according to:
http://wiki.apache.org/solr/SchemaXml#Similarity

However, I get this error message:
 FieldType 'DFR' is configured with a similarity, but the global
similarity does not support it: class
org.apache.solr.search.similarities.DefaultSimilarityFactory

Excerpt from schema.xml below.

What I am trying to do is have any field that doesn't specify a similarity
to use the default, but to set up 3 specific fields to use the DFR, IB, and
BM25 similarities respectively.

I think I'm missing something here.  Can someone point me to documentation
or examples?
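(My current guess is that the top-level similarity declared in schema.xml has
to be one that delegates per field, i.e. something like

<similarity class="solr.SchemaSimilarityFactory"/>

at the global level, outside any fieldType -- but I would appreciate
confirmation that that is the intended setup.)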

Tom


Simplified schema.xml excerpt:
<fieldType name="CJKFullText" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!--###-->
<!--  relevance rank testing -->


<fieldType name="DFR" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>

  <similarity class="solr.DFRSimilarityFactory">
    <str name="basicModel">I(F)</str>
    <str name="afterEffect">B</str>
    <str name="normalization">H2</str>
  </similarity>
</fieldType>


<fieldType name="IB" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>

  <similarity class="solr.IBSimilarityFactory">
    <str name="distribution">SPL</str>
    <str name="lambda">DF</str>
    <str name="normalization">H2</str>
  </similarity>
</fieldType>


<fieldType name="BM25" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>

  <similarity class="solr.BM25SimilarityFactory">
    <!-- start with the defaults -->
    <float name="k1">1.2</float>
    <float name="b">0.75</float>
  </similarity>
</fieldType>










===-
Excerpt from actual schema.xml
<fieldType name="CJKFullText" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"
            han="true" hiragana="true"
            katakana="false" hangul="false"/>
    <filter class="solr.CommonGramsFilterFactory" words="1000common.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"
            han="true" hiragana="true"
            katakana="false" hangul="false"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="1000common.txt"/>
  </analyzer>
</fieldType>

<!--###-->
<!--  relevance rank testing -->


<fieldType name="DFR" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"
            han="true" hiragana="true"
            katakana="false" hangul="false"/>
    <filter class="solr.CommonGramsFilterFactory" words="1000common.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"
            han="true" hiragana="true"
            katakana="false" hangul="false"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="1000common.txt"/>
  </analyzer>

  <similarity class="solr.DFRSimilarityFactory">
    <str name="basicModel">I(F)</str>
    <str name="afterEffect">B</str>
    <str name="normalization">H2</str>
  </similarity>
</fieldType>


 fieldType name=IB class=solr.TextField positionIncrementGap=100
 

How to configure termvectors to not store positions/offsets

2012-12-13 Thread Tom Burton-West
Hello,

As I understand it, MoreLikeThis only requires term frequencies, not
positions or offsets.  So in order to save disk space I would like to store
termvectors, but without positions and offsets.  Is there documentation
somewhere that
1) would confirm that MoreLikeThis only needs term frequencies
2) Shows how to configure termvectors in Solr schema.xml to only store term
frequencies, and not positions and offsets?
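In other words, is something like the following the right way to ask for
frequencies only (my assumption being that termPositions and termOffsets can
simply be left false; the field name is just an example)?

<field name="ocr" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="false" termOffsets="false"/>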

Tom


Re: BM25 model for solr 4?

2012-11-15 Thread Tom Burton-West
Hello Floyd,

There is a ton of research literature out there comparing BM25 to vector
space.  But you have to be careful interpreting it.

BM25 originally beat the SMART vector space model in the early  TRECs
 because it did better tf and length normalization.  Pivoted Document
Length normalization was invented to get the vector space model to catch up
to BM25.   (Just Google for Singhal length normalization.  Amit Singhal,
now chief of Google Search, did his doctoral thesis on this and it is
available.  Similarly, Stephen Robertson, now at Microsoft Research,
published a ton of studies of BM25.)

The default Solr/Lucene similarity class doesn't provide the length or tf
normalization tuning params that BM25 does.  There is the SweetSpot
similarity, but that doesn't quite work the same way that the BM25
normalizations do.

Document length normalization needs and parameter tuning all depends on
your data.  So if you are reading a comparison, you need to determine:
1) When comparing recall/precision etc. between vector space and Bm25, did
the experimenter tune both the vector space and the BM25 parameters
2) Are the documents (and queries) they are using in the test, similar in
 length characteristics to your documents and
queries.

We are planning to do some testing in the next few months for our use case,
which is 10 million books where we index the entire book.  These are
extremely long documents compared to most IR research.
I'd love to hear about actual (non-research) production implementations
that have tested the new ranking models available in Solr.

Tom



On Wed, Nov 14, 2012 at 9:16 PM, Floyd Wu floyd...@gmail.com wrote:

 Hi there,
 Does anybody can kindly tell me how to setup solr to use BM25?
 By the way, are there any experiment or research shows BM25 and classical
 VSM model comparison in recall/precision rate?

 Thanks in advanced.



URL parameters to use FieldAnalysisRequestHandler

2012-11-13 Thread Tom Burton-West
Hello,

I  would like to send a request to the FieldAnalysisRequestHandler.  The
javadoc lists the parameter names such as analysis.field, but sending those
as URL parameters does not seem to work:

mysolr.umich.edu/analysis/field?analysis.name=title&q=fire-fly

leaving out the analysis doesn't work either:

mysolr.umich.edu/analysis/field?name=title&q=fire-fly

No matter what field I specify, the analysis returned is for the default
field. (See repsonse excerpt below)

Is there a page somewhere that shows the correct syntax for sending get
requests to the FieldAnalysisRequestHandler?

Tom


<lst name="analysis">
<lst name="field_types"/>
<lst name="field_names">
<lst name="ocr">


Re: URL parameters to use FieldAnalysisRequestHandler

2012-11-13 Thread Tom Burton-West
Thanks Robert,

Somehow I read the doc but still entered the params wrong.  Should have
been analysis.fieldname instead of analysis.name.  Works fine now.
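i.e. this form works:

mysolr.umich.edu/analysis/field?analysis.fieldname=title&q=fire-fly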

Tom

On Tue, Nov 13, 2012 at 2:11 PM, Robert Muir rcm...@gmail.com wrote:

 I think the UI uses this behind the scenes, as in no more
 analysis.jsp like before?

 So maybe try using something like burpsuite and just using the
 analysis UI in your browser to see what requests its sending.

 On Tue, Nov 13, 2012 at 11:00 AM, Tom Burton-West tburt...@umich.edu
 wrote:
  Hello,
 
  I  would like to send a request to the FieldAnalysisRequestHandler.  The
  javadoc lists the parameter names such as analysis.field, but sending
 those
  as URL parameters does not seem to work:
 
  mysolr.umich.edu/analysis/field?analysis.name=titleq=fire-fly
 
  leaving out the analysis doesn't work either:
 
  mysolr.umich.edu/analysis/field?name=titleq=fire-fly
 
  No matter what field I specify, the analysis returned is for the default
  field. (See repsonse excerpt below)
 
  Is there a page somewhere that shows the correct syntax for sending get
  requests to the FieldAnalysisRequestHandler?
 
  Tom
 
  
  lst name=analysis
  lst name=field_types/
  lst name=field_names
  lst name=ocr



Re: Skewed IDF in multi lingual index

2012-11-08 Thread Tom Burton-West
Hi Markus,

No answers, but I am very interested in what you find out.  We currently
index all languages in one index, which presents different IDF issues, but
are interested in exploring alternatives such as the one you describe.

Tom Burton-West

http://www.hathitrust.org/blogs/large-scale-search

On Thu, Nov 8, 2012 at 11:13 AM, Markus Jelsma
markus.jel...@openindex.iowrote:

 Hi,

 We're testing a large multi lingual index with _LANG fields for each
 language and using dismax to query them all. Users provide, explicit or
 implicit, language preferences that we use for either additive or
 multiplicative boosting on the language of the document. However, additive
 boosting is not adequate because it cannot overcome the extremely high IDF
 values for the same word in another language so regardless of the the
 preference, foreign documents are returned. Multiplicative boosting solves
 this problem but has the other downside as it doesn't allow us with
 standard qf=field^boost to prefer documents in another language above the
 preferred language because the multiplicative is so strong. We do use the
 def function (boost=def(query($qq),.3)) to prevent one boost query to
 return 0 and thus a product of 0 for all boost queries. But it doesn't help
 that much

 This all comes down to IDF differences between the languages, even common
 words such as country names like `india` show large differences in IDF. Is
 here anyone with some hints or experiences to share about skewed IDF in
 such an index?

 Thanks,
 Markus



Solr 4.0 error message: Unsupported ContentType: Content-type:text/xml

2012-11-02 Thread Tom Burton-West
Hello all,

Trying to get Solr 4.0 up and running with a port of our production 3.6
schema and documents.

We are getting the following error message in the logs:

org.apache.solr.common.SolrException: Unsupported ContentType:
Content-type:text/xml  Not in: [app
lication/xml, text/csv, text/json, application/csv, application/javabin,
text/xml, application/json]


We use exactly the same code without problem with Solr 3.6.


We are sending a ContentType 'text/xml'.

Is it likely that there is some other problem and this is just not quite
the right error message?

Tom


Re: Solr 4.0 error message: Unsupported ContentType: Content-type:text/xml

2012-11-02 Thread Tom Burton-West
Thanks Jack,

That is exactly the problem.  Apparently earlier versions of Solr ignored
the extra text, which is why we didn't catch the bug in our code earlier.

Thanks for the quick response.

Tom

On Fri, Nov 2, 2012 at 5:34 PM, Jack Krupansky j...@basetechnology.comwrote:

 That message makes it sounds as if the literal text Content-type: is
 included in your content type. How exactly are you setting/sending the
 content type?

 -- Jack Krupansky

 -Original Message- From: Tom Burton-West
 Sent: Friday, November 02, 2012 5:30 PM
 To: solr-user@lucene.apache.org
 Subject: Solr 4.0 error message: Unsupported ContentType:
 Content-type:text/xml

 Hello all,

 Trying to get Solr 4.0 up and running with a port of our production 3.6
 schema and documents.

 We are getting the following error message in the logs:

 org.apache.solr.common.**SolrException: Unsupported ContentType:
 Content-type:text/xml  Not in: [app
 lication/xml, text/csv, text/json, application/csv, application/javabin,
 text/xml, application/json]


 We use exactly the same code without problem with Solr 3.6.


 We are sending a ContentType 'text/xml'.

 Is it likely that there is some other problem and this is just not quite
 the right error message?

 Tom



Solr 4.0 Beta: Admin UI does not correctly implement dismax/edismax query

2012-09-13 Thread Tom Burton-West
Just want to check I am not doing something obviously wrong before I file a
bug ticket.

In Solr 4.0 Beta, in the admin UI in the Query panel, there is a checkbox
option to check dismax or edismax.
When you check one of those, text boxes for the dismax parameters are
exposed.  However, the query that gets sent to Solr is not actually a
dismax query.


<lst name="params">
<str name="mm">3</str>
<str name="pf">ocr^200</str>
<str name="debugQuery">true</str>
<str name="dismax">true</str>
<str name="tie">0.1</str>
<str name="q">fire-fly</str>
<str name="wt">xml</str>

<str name="rawquerystring">fire-fly</str>
<str name="querystring">fire-fly</str>
<str name="parsedquery">text:fire text:fly</str>

If a correct dismax query was being sent to Solr the parsedquery would have
something like the following:
<str name="parsedquery">(+DisjunctionMaxQuery(((text:fire text:fly)))</str>
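(For comparison, sending the parameters by hand with defType=dismax, e.g.
something like
/solr/select?defType=dismax&q=fire-fly&qf=text&mm=3&pf=ocr^200&tie=0.1&debugQuery=true
does produce a DisjunctionMaxQuery in the parsed query.)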

Tom Burton-West


Re: Solr 4.0 Beta: Admin UI does not correctly implement dismax/edismax query

2012-09-13 Thread Tom Burton-West
Thanks Erik,

Just found out that there is already a bug report for this open as
https://issues.apache.org/jira/browse/SOLR-3811.

Tom

On Thu, Sep 13, 2012 at 12:52 PM, Erik Hatcher erik.hatc...@gmail.comwrote:

 That's definitely a bug.  dismax=true is not the correct parameter to
 send.  Should be defType=dismax

 Erik

 On Sep 13, 2012, at 12:22 , Tom Burton-West wrote:

  Just want to check I am not doing something obviously wrong before I
 file a
  bug ticket.
 
  In Solr 4.0Beta, in the admin UI in the Query panel,, there is a checkbox
  option to check dismax or edismax
  When you check one of those, text boxes for the dismax parameters are
  exposed.  However, the query that gets sent to Solr is not actually a
  dismax query.
 
 
  lst name=params
  str name=mm3/str
  str name=pfocr^200/str
  str name=debugQuerytrue/str
  str name=dismaxtrue/str
  str name=tie0.1/str
  str name=qfire-fly/str
  str name=wtxml/str
 
  str name=rawquerystringfire-fly/str
  str name=querystringfire-fly/str
  str name=parsedquerytext:fire text:fly/str
 
  If a correct dismax query was being sent to Solr the parsedquery would
 have
  something like the following:
  str name=parsedquery(+DisjunctionMaxQuery(((text:fire text:fly)))
 
  Tom Burton-West




Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Tom Burton-West
Hello all,

Due to multiple languages and dirty OCR, our indexes have over 2 billion
unique terms (
http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again).
In Solr 3.6 and previous we needed to reduce the memory used for storing
the in-memory representation of the tii file.   We originally used the
termInfosIndexDivisor which affects the sampling of the tii file when read
into memory.   While this solved our problem for searching, unfortunately
the termInfosIndexDivisor was not read during indexing and caused OOM
problems once our indexes grew beyond a certain size.  See:
https://issues.apache.org/jira/browse/SOLR-2290.

Has this been changed in Solr 4.0?

The advantage of using the termInfosIndexDivisor is that it can be changed
without re-indexing, so we were able to experiment with different settings
to determine a good setting without re-indexing several terabytes of data.

When we ran into problems with the memory use for the in-memory
representation of the tii file during indexing, we changed the
termIndexInterval.  The termIndexInterval is an indexing-time setting
 which affects the size of the tii file.  It sets the sampling of the tis
file that gets written to the tii file.
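For reference, the knob I mean is the solrconfig.xml setting (3.x syntax shown;
I am assuming the 4.0 equivalent lives under <indexConfig>, and the value here
is only illustrative):

<indexDefaults>
  <termIndexInterval>1024</termIndexInterval>
</indexDefaults>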

In Solr 4.0 termInfosIndexDivisor has been replaced with
termIndexDivisor.  The documentation for these two features, the
index-time termIndexInterval and the run-time termIndexDivisor, no longer
seems to be on the solr config page of the wiki, and the documentation in the
example file does not explain what the termIndexDivisor does.

Would it be appropriate to add these back to the wiki page?  If not, could
someone add a line or two to the comments in the Solr 4.0 example file
explaining what the termIndexDivisor does?


Tom


Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Tom Burton-West
Thanks Robert,

I'll have to spend some time understanding the default codec for Solr 4.0.
Did I miss something in the changes file?

 I'll be digging into the default codec docs and testing sometime in next
week  or two (with a 2 billion term index)  If I understand it well enough,
I'll be happy to draft some changes up for either the wiki or Solr the
example solrconfig.xml  file.

Does this mean that the default codec will reduce memory use for the terms
index enough so I don't need to use either of these settings to deal with
my  2 billion term indexes?

If both of these parameters don't make sense for the default codec, then
maybe they need to be commented out or removed from the solr example
solrconfig.xml.

Tom

On Fri, Sep 7, 2012 at 1:33 PM, Robert Muir rcm...@gmail.com wrote:

 Hi Tom: I already enhanced the javadocs about this for Lucene, putting
 warnings everywhere in bold:

 NOTE: This parameter does not apply to all PostingsFormat
 implementations, including the default one in this release. It only
 makes sense for term indexes that are implemented as a fixed gap
 between terms.
 NOTE: divisor settings  1 do not apply to all PostingsFormat
 implementations, including the default one in this release. It only
 makes sense for terms indexes that can efficiently re-sample terms at
 load time.
 etc


 http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29

 http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/DirectoryReader.html#open%28org.apache.lucene.store.Directory,%20int%29

 In the future I expect these parameters ill be removed completely:
 anything like this is specific to the codec/implementation.

 In Lucene 4.0 the terms index works completely differently: these
 parameters don't make sense for it.

 On Fri, Sep 7, 2012 at 12:43 PM, Tom Burton-West tburt...@umich.edu
 wrote:
  Hello all,
 
  Due to multiple languages and dirty OCR, our indexes have over 2 billion
  unique terms (
  http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again
 ).
  In Solr 3.6 and previous we needed to reduce the memory used for storing
  the in-memory representation of the tii file.   We originally used the
  termInfosIndexDivisor which affects the sampling of the tii file when
 read
  into memory.   While this solved our problem for searching, unfortunately
  the termInfosIndexDivisor was not read during indexing and caused OOM
  problems once our indexes grew beyond a certain size.  See:
  https://issues.apache.org/jira/browse/SOLR-2290.
 
  Has this been changed in Solr 4.0?
 
  The advantage of using the termInfosIndexDivisor is that it can be
 changed
  without re-indexing, so we were able to experiment with different
 settings
  to determine a good setting without re-indexing several terabytes of
 data.
 
  When we ran into problems with the memory use for the in-memory
  representation of the tii file during indexing, we changed the
  termIndexInterval.  The termIndexInterval is an indexing-time setting
   which affects the size of the tii file.  It sets the sampling of the tis
  file that gets written to the tii file.
 
  In Solr 4.0 termInfosIndexDivisor has been replaced with
  termIndexDivisor.The documentation for these two features, the
  index-time termIndexInterval and the run-time  termIndexDivisor no longer
  seems to be on the solr config page of the wiki and the docmentation in
 the
  example file does not exlain what the termIndexDivisor does.
 
  Would it be appropriate to add these back to the wiki page?  If not,
 could
  someone add a line or two to the comments in the Solr 4.0 example file
  explaining what the termIndexDivisor doe?
 
 
  Tom



 --
 lucidworks.com



Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Tom Burton-West
Thanks Robert,

if not, just customize blocktree's params with a CodecFactory in solr,
or even pick another implementation (FixedGap, VariableGap, whatever).

Still trying to get my head around 4.0 and flexible indexing.  I'll take
another look at Mike's and your presentations.  I'm trying to figure out
how to get from the Lucene JavaDocs you pointed out  to how to specify
things in Solr and it's config files..

Is there an example CodecFactory somewhere I could look at?  Also is
Is there an example somewhere of how to specify a CodecFactory/Codec in
Solr using the schema.xml or solrconfig.xml?

Is there some simple way to specify minBlockSize and maxBlockSize in
schema.xml?

Once I get this all working and understand it, I'll be happy to draft some
documentation.

I'm really looking forward to experimenting with 4.0!

Tom



Tom
On Fri, Sep 7, 2012 at 2:58 PM, Robert Muir rcm...@gmail.com wrote:

 On Fri, Sep 7, 2012 at 2:19 PM, Tom Burton-West tburt...@umich.edu
 wrote:
  Thanks Robert,
 
  I'll have to spend some time understanding the default codec for Solr
 4.0.
  Did I miss something in the changes file?

 http://lucene.apache.org/core/4_0_0-BETA/

 see the file formats section, especially

 http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termdictionary

 (since blocktree covers term dictionary and terms index)

 
   I'll be digging into the default codec docs and testing sometime in next
  week  or two (with a 2 billion term index)  If I understand it well
 enough,
  I'll be happy to draft some changes up for either the wiki or Solr the
  example solrconfig.xml  file.

 right i think we should remove these parameters.

 
  Does this mean that the default codec will reduce memory use for the
 terms
  index enough so I don't need to use either of these settings to deal with
  my  2 billion term indexes?

 probably. i dont know enough about your terms or how much RAM you have
 to say for sure.

 if not, just customize blocktree's params with a CodecFactory in solr,
 or even pick another implementation (FixedGap, VariableGap, whatever).

 the interval/divisor stuff is mostly only useful if you are not
 reindexing from scratch: e.g. if you are gonna plop your 3.x index
 into 4.x then you should set
 those to whatever you were using before, since it will be using
 PreflexCodec to read those.

 --
 lucidworks.com



Solr 4.0 beta : Is collection1 hard coded somewhere?

2012-08-23 Thread Tom Burton-West
I removed the string collection1 from my solr.xml file in solr home and
modified my solr.xml file as follows:

  <cores adminPath="/admin/cores" defaultCoreName="foobar1" host="${host:}"
         hostPort="${jetty.port:}" zkClientTimeout="${zkClientTimeout:15000}">
    <core name="foobarcorename" instanceDir="." />
  </cores>
Then I restarted Solr.

However, I keep getting messages about
Can't find resource 'solrconfig.xml' in classpath or
'/l/solrs/dev/solrs/4.0/1/collection1/conf/'
And the log messages show that Solr is trying to create the collection1
instance

Aug 23, 2012 12:06:02 PM org.apache.solr.core.CoreContainer create
INFO: Creating SolrCore 'collection1' using instanceDir:
/l/solrs/dev/solrs/4.0/3/collection1
Aug 23, 2012 12:06:02 PM org.apache.solr.core.SolrResourceLoader init
I think somehow the previous solr.xml configuration is being stored on disk
somewhere and loaded.

Any clues?
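(For completeness, the rest of the file is just the usual solr.xml wrapper
around that cores element, i.e. roughly:

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="foobar1" host="${host:}"
         hostPort="${jetty.port:}" zkClientTimeout="${zkClientTimeout:15000}">
    <core name="foobarcorename" instanceDir="." />
  </cores>
</solr>

in case the surrounding elements matter; the persistent flag is whatever the
stock file ships with.)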

Tom


Re: Solr 4.0 beta : Is collection1 hard coded somewhere?

2012-08-23 Thread Tom Burton-West
I did not describe the problems correctly.

I have 3 solr shards with solr homes .../solrs/4.0/1, .../solrs/4.0/2 and
.../solrs/4.0/3

For shard 1 I have a solr.xml file with the modifications described in the
previous message.  For that instance, it appears that the problem is that
the semantics of specifying the instanceDir have changed between 3.6 and
4.0.

I specified the instancedir as  instanceDir=.

However, I get this error in the log:

Cannot create directory: /l/solrs/dev/solrs/4.0/1/./data/index

Note that instead of using the Solr home /l/solrs/dev/solrs/4.0/1 (what I would
expect for the relative path .), Solr appears to be appending . to the
Solr home path.
The solr.xml file says that paths are relative to the installation
directory.  Perhaps that needs to be clarified in the file.


For shards 2 and 3, I tried not using a solr.xml file and I did not create
a collection1 subdirectory.  For these solr instances, I got the messages
about collection1 and files not being found in the $SOLRHOME/collection1
path

 Can't find resource 'solrconfig.xml' in classpath or
'/l/solrs/dev/solrs/4.0/3/collection1/conf/',
cwd=/l/local/apache-tomcat-dev
Looking at the logs it appears that collection1 is specified as the
default core somewhere:

Aug 23, 2012 12:42:47 PM org.apache.solr.core.CoreContainer$Initializer
initialize
INFO: looking for solr.xml: /l/solrs/dev/solrs/4.0/3/solr.xml
Aug 23, 2012 12:42:47 PM org.apache.solr.core.CoreContainer init
INFO: New CoreContainer 1281149250
Aug 23, 2012 12:42:47 PM org.apache.solr.core.CoreContainer$Initializer
initialize
INFO: no solr.xml file found - using default
Aug 23, 2012 12:42:47 PM org.apache.solr.core.CoreContainer load
INFO: Loading CoreContainer using Solr Home: '/l/solrs/dev/solrs/4.0/3/'
Aug 23, 2012 12:42:47 PM org.apache.solr.core.SolrResourceLoader init
INFO: Creating SolrCore 'collection1' using instanceDir:
/l/solrs/dev/solrs/4.0/3/collection1

Is this default of collection1 specified in some other config file or
hardcoded into Solr somewhere?

If using a core is mandatory with Solr 4.0 , the CoreAdmin wiki page and
the release notes should point this out.


Tom


Re: Solr 4.0 beta : Is collection1 hard coded somewhere?

2012-08-23 Thread Tom Burton-West
The answer is yes.   collection1 is defined as the default core name in
CoreContainer.java on line 94 or so.   I have opened a jira issue for this
and other issues related to the documentation of solr.xml and Solr core
configuration issues for Solr 4.0

https://issues.apache.org/jira/browse/SOLR-3753

On Thu, Aug 23, 2012 at 1:04 PM, Tom Burton-West tburt...@umich.edu wrote:

 I did not describe the problems correctly.

 I have 3 solr shards with solr homes .../solrs/4.0/1  .../solrs/4.0/2 and
 .../solrs/4.0/2solrs/3

 For shard 1 I have a solr.xml file with the modifications described in the
 previous message.  For that instance, it appears that the problem is that
 the semantics of specifing the instancedir have changed between 3.6 and
 4.0.

 I specified the instancedir as  instanceDir=.

 However, I get this error in the log:

 Cannot create directory: /l/solrs/dev/solrs/4.0/1/./data/index

 Note that instead of using Solr home /l/solrs/dev/solrs/4.0/1 (what I
 would expect for the relative path .), that Solr appears to be appending
 . to Solr home.
 The solr.xml file says that paths are relative to the installation
 directory.  Perhaps that needs to be clarified in the file.


 For shards 2 and 3, I tried not using a solr.xml file and I did not create
 a collection1 subdirectory.  For these solr instances, I got the messages
 about collection1 and files not being found in the $SOLRHOME/collection1
 path

  Can't find resource 'solrconfig.xml' in classpath or
 '/l/solrs/dev/solrs/4.0/3/collection1/conf/',
 cwd=/l/local/apache-tomcat-dev
 Looking at the logs it appears that collection1 is specified as the
 default core somewhere:

 Aug 23, 2012 12:42:47 PM org.apache.solr.core.CoreContainer$Initializer
 initialize
 INFO: looking for solr.xml: /l/solrs/dev/solrs/4.0/3/solr.xml
 Aug 23, 2012 12:42:47 PM org.apache.solr.core.CoreContainer init
 INFO: New CoreContainer 1281149250
 Aug 23, 2012 12:42:47 PM org.apache.solr.core.CoreContainer$Initializer
 initialize
 INFO: no solr.xml file found - using default
 Aug 23, 2012 12:42:47 PM org.apache.solr.core.CoreContainer load
 INFO: Loading CoreContainer using Solr Home: '/l/solrs/dev/solrs/4.0/3/'
 Aug 23, 2012 12:42:47 PM org.apache.solr.core.SolrResourceLoader init
 INFO: Creating SolrCore 'collection1' using instanceDir:
 /l/solrs/dev/solrs/4.0/3/collection1

 Is this default of collection1 specified in some other config file or
 hardcoded into Solr somewhere?

 If using a core is mandatory with Solr 4.0 , the CoreAdmin wiki page and
 the release notes should point this out.


 Tom






Re: Solr 4.0 Beta missing example/conf files?

2012-08-23 Thread Tom Burton-West
Thanks Erik!

What confused me in the README is that it wasn't clear what
files/directories need to be in Solr home and what files/directories need to
be in SolrHome/corename.  For example the /conf and /data directories are
now under the core subdirectory.  What about /lib and /bin?   Will a core
use a conf file in SolrHome/conf if there is no Solrhome/collection1/conf
directory?
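The layout I am assuming is now expected is roughly:

solr-home/
  solr.xml
  collection1/
    conf/
      solrconfig.xml
      schema.xml
    data/

with lib/ and bin/ being the pieces I am unsure about.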

Also when upgrading from a previous Solr setup that doesn't use a core,  I
was definitely confused about whether or not it is mandatory to have a core
with Solr 4.0.  And when I tried not using a solr.xml file, it was very
weird to still get a message about a missing collection1 core directory.

See this JIRA issue:https://issues.apache.org/jira/browse/SOLR-3753

Tom


On Thu, Aug 23, 2012 at 7:56 PM, Erik Hatcher erik.hatc...@gmail.comwrote:

 Tom -

 I corrected, on both trunk and 4_x, a reference to solr/conf (to
 solr/collection1/conf) in tutorial.html.  I didn't see anything in
 example/README that needed fixing.  Was there something that is awry there
 that needs correcting that I missed?   If so, feel free to file a JIRA
 marked for 4.0 so we can be sure to fix it before final release.

 Thanks,
 Erik

 On Aug 22, 2012, at 16:32 , Tom Burton-West wrote:

  Thanks Markus!
 
  Should the README.txt file in solr/example be updated to reflect this?
  Is that something I need to enter a JIRA issue for?
 
  Tom
 
  On Wed, Aug 22, 2012 at 3:12 PM, Markus Jelsma
  markus.jel...@openindex.iowrote:
 
  Hi - The example has been moved to collection1/
 
 
 
  -Original message-
  From:Tom Burton-West tburt...@umich.edu
  Sent: Wed 22-Aug-2012 20:59
  To: solr-user@lucene.apache.org
  Subject: Solr 4.0 Beta missing example/conf files?
 
  Hello,
 
  Usually in the example/solr file in Solr distributions there is a
  populated
  conf file.  However in the distribution I downloaded of solr
 4.0.0-BETA,
  there is no /conf directory.   Has this been moved somewhere?
 
  Tom
 
  ls -l apache-solr-4.0.0-BETA/example/solr
  total 107
  drwxr-sr-x 2 tburtonw dlps0 May 29 13:02 bin
  drwxr-sr-x 3 tburtonw dlps   22 Jun 28 09:21 collection1
  -rw-r--r-- 1 tburtonw dlps 2259 May 29 13:02 README.txt
  -rw-r--r-- 1 tburtonw dlps 2171 Jul 31 19:35 solr.xml
  -rw-r--r-- 1 tburtonw dlps  501 May 29 13:02 zoo.cfg
 
 




Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Hi Lance,

I don't understand enough of how the field collapsing is implemented, but I
thought it worked with distributed search.  Are you saying it only works if
everything that needs collapsing is on the same shard?

Tom

On Wed, Aug 22, 2012 at 2:41 AM, Lance Norskog goks...@gmail.com wrote:

 How do you separate the documents among the shards? Can you set up the
 shards such that one collapse group is only on a single shard? That
 you never have to do distributed grouping?

 On Tue, Aug 21, 2012 at 4:10 PM, Tirthankar Chatterjee
 tchatter...@commvault.com wrote:
  This wont work, see my thread on Solr3.6 Field collapsing
  Thanks,
  Tirthankar
 
  -Original Message-
  From: Tom Burton-West tburt...@umich.edu
  Date: Tue, 21 Aug 2012 18:39:25
  To: solr-user@lucene.apache.orgsolr-user@lucene.apache.org
  Reply-To: solr-user@lucene.apache.org solr-user@lucene.apache.org
  Cc: William Dueberdueb...@umich.edu; Phillip Farberpfar...@umich.edu
  Subject: Scalability of Solr Result Grouping/Field Collapsing:
   Millions/Billions of documents?
 
  Hello all,
 
  We are thinking about using Solr Field Collapsing on a rather large scale
  and wonder if anyone has experience with performance when doing Field
  Collapsing on millions of or billions of documents (details below. )  Are
  there performance issues with grouping large result sets?
 
  Details:
  We have a collection of the full text of 10 million books/journals.  This
  is spread across 12 shards with each shard holding about 800,000
  documents.  When a query matches a journal article, we would like to
 group
  all the matching articles from the same journal together. (there is a
  unique id field identifying the journal).  Similarly when there is a
 match
  in multiple copies of the same book we would like to group all results
 for
  the same book together (again we have a unique id field we can group on).
  Sometimes a short query against the OCR field will result in over one
  million hits.  Are there known performance issues when field collapsing
  result sets containing a million hits?
 
  We currently index the entire book as one Solr document.  We would like
 to
  investigate the feasibility of indexing each page as a Solr document
 with a
  field indicating the book id.  We could then offer our users the choice
 of
  a list of the most relevant pages, or a list of the books containing the
  most relevant pages.  We have approximately 3 billion pages.   Does
 anyone
  have experience using field collapsing on this sort of scale?
 
  Tom
 
  Tom Burton-West
  Information Retrieval Programmer
  Digital Library Production Service
  Univerity of Michigan Library
  http://www.hathitrust.org/blogs/large-scale-search



 --
 Lance Norskog
 goks...@gmail.com



Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Hi Tirthankar,

Can you give me a quick summary of what   won't work and why?
I couldn't figure it out from looking at your thread.  You seem to have a
different issue, but maybe I'm missing something here.

Tom

On Tue, Aug 21, 2012 at 7:10 PM, Tirthankar Chatterjee 
tchatter...@commvault.com wrote:

 This wont work, see my thread on Solr3.6 Field collapsing
 Thanks,
 Tirthankar




Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Hi Lance and Tirthankar,

We are currently using Solr 3.6.  I tried a search across our current 12
shards grouping by book id (record_no in our schema) and it seems to work
fine (the query with the actual urls for the shards changed is appended
below.)

I then searched for the record_no of the second group in the results to
confirm that the number of records being folded is correct. In both cases
the numFound is 505 so it seems as though the record counts for the group
are correct.  Then I tried the same search but changed the shards parameter
to limit the search to 1/2 of the shards and got numFound = 325.  This
shows that the items in the group are distributed between different shards.

What am I missing here?   What is it that you are saying does not work?

Tom
Field Collapse query ( IP address changed, and newlines added and  shard
urls simplified  for readability)


http://solr-myhost.edu/serve-9/select?indent=on&version=2.2
&shards=shard1,shard2,shard3,shard4,shard5,shard6,...shard12
&q=title:nature&fq=&start=0&rows=10&fl=id,author,title,volume_enumcron,score
&group=true&group.field=record_no&group.limit=2


Solr 4.0 Beta missing example/conf files?

2012-08-22 Thread Tom Burton-West
Hello,

Usually in the example/solr file in Solr distributions there is a populated
conf file.  However in the distribution I downloaded of solr 4.0.0-BETA,
there is no /conf directory.   Has this been moved somewhere?

Tom

ls -l apache-solr-4.0.0-BETA/example/solr
total 107
drwxr-sr-x 2 tburtonw dlps0 May 29 13:02 bin
drwxr-sr-x 3 tburtonw dlps   22 Jun 28 09:21 collection1
-rw-r--r-- 1 tburtonw dlps 2259 May 29 13:02 README.txt
-rw-r--r-- 1 tburtonw dlps 2171 Jul 31 19:35 solr.xml
-rw-r--r-- 1 tburtonw dlps  501 May 29 13:02 zoo.cfg


Re: Solr 4.0 Beta missing example/conf files?

2012-08-22 Thread Tom Burton-West
Thanks Markus!

Should the README.txt file in solr/example be updated to reflect this?
Is that something I need to enter a JIRA issue for?

Tom

On Wed, Aug 22, 2012 at 3:12 PM, Markus Jelsma
markus.jel...@openindex.iowrote:

 Hi - The example has been moved to collection1/



 -Original message-
  From:Tom Burton-West tburt...@umich.edu
  Sent: Wed 22-Aug-2012 20:59
  To: solr-user@lucene.apache.org
  Subject: Solr 4.0 Beta missing example/conf files?
 
  Hello,
 
  Usually in the example/solr file in Solr distributions there is a
 populated
  conf file.  However in the distribution I downloaded of solr 4.0.0-BETA,
  there is no /conf directory.   Has this been moved somewhere?
 
  Tom
 
  ls -l apache-solr-4.0.0-BETA/example/solr
  total 107
  drwxr-sr-x 2 tburtonw dlps0 May 29 13:02 bin
  drwxr-sr-x 3 tburtonw dlps   22 Jun 28 09:21 collection1
  -rw-r--r-- 1 tburtonw dlps 2259 May 29 13:02 README.txt
  -rw-r--r-- 1 tburtonw dlps 2171 Jul 31 19:35 solr.xml
  -rw-r--r-- 1 tburtonw dlps  501 May 29 13:02 zoo.cfg
 



Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Thanks Tirthankar,

So the issue in memory use for sorting.  I'm not sure I understand how
sorting of grouping fields  is involved with the defaults and field
collapsing, since the default sorts by relevance not grouping field.  On
the other hand I don't know much about how field collapsing is implemented.

So far the few tests I've made haven't revealed any memory problems.  We
are using very small string fields for grouping and I think that we
probably only have a couple of cases where we are grouping more than a few
thousand docs.   I will try to find a query with a lot of docs per group
and take a look at the memory use using JConsole.

Tom


On Wed, Aug 22, 2012 at 4:02 PM, Tirthankar Chatterjee 
tchatter...@commvault.com wrote:

  Hi Tom,

 We had an issue where we are keeping millions of docs in a single node and
 we were trying to group them on a string field which is nothing but full
 file path… that caused SOLR to go out of memory…


 Erick has explained nicely in the thread as to why it won’t work and I had
 to find another way of architecting it. 


 How do you think this is different in your case. If you want to group by a
 string field with thousands of similar entries I am guessing you will face
 the same issue. 


 Thanks,

 Tirthankar



Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-21 Thread Tom Burton-West
Hello all,

We are thinking about using Solr Field Collapsing on a rather large scale
and wonder if anyone has experience with performance when doing Field
Collapsing on millions of or billions of documents (details below. )  Are
there performance issues with grouping large result sets?

Details:
We have a collection of the full text of 10 million books/journals.  This
is spread across 12 shards with each shard holding about 800,000
documents.  When a query matches a journal article, we would like to group
all the matching articles from the same journal together. (there is a
unique id field identifying the journal).  Similarly when there is a match
in multiple copies of the same book we would like to group all results for
the same book together (again we have a unique id field we can group on).
Sometimes a short query against the OCR field will result in over one
million hits.  Are there known performance issues when field collapsing
result sets containing a million hits?

We currently index the entire book as one Solr document.  We would like to
investigate the feasibility of indexing each page as a Solr document with a
field indicating the book id.  We could then offer our users the choice of
a list of the most relevant pages, or a list of the books containing the
most relevant pages.  We have approximately 3 billion pages.   Does anyone
have experience using field collapsing on this sort of scale?

Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
Univerity of Michigan Library
http://www.hathitrust.org/blogs/large-scale-search


Re: edismax parser ignores mm parameter when tokenizer splits tokens (hyphenated words, WDF splitting etc)

2012-07-02 Thread Tom Burton-West
Opened a JIRA issue: https://issues.apache.org/jira/browse/SOLR-3589, which
also lists a couple other related mailing list posts.




On Thu, Jun 28, 2012 at 12:18 PM, Tom Burton-West tburt...@umich.eduwrote:

 Hello,

 My previous e-mail with a CJK example has received no replies.   I
 verified that this problem also occurs for English.  For example in the
 case of the word fire-fly , The ICUTokenizer and the WordDelimeterFilter
 both split this into two tokens fire and fly.

 With an edismax query and a must match of 2 :  q={!edsmax mm=2} if the
 words are entered separately at [fire fly], the edismax parser honors the
 mm parameter and does the equivalent of a Boolean AND query.  However if
 the words are entered as a hypenated word [fire-fly], the tokenizer splits
 these into two tokens fire and fly and the edismax parser does the
 equivalent of a Boolean OR query.

 I'm not sure I understand the output of the debugQuery, but judging by the
 number of hits returned it appears that edismax is not honoring the mm
 parameter. Am I missing something, or is this a bug?

  I'd like to file a JIRA issue, but want to find out if I am missing
 something here.

 Details of several queries are appended below.

 Tom Burton-West

 edismax query mm=2   query with hypenated word [fire-fly]

 lst name=debug
 str name=rawquerystring{!edismax mm=2}fire-fly/str
 str name=querystring{!edismax mm=2}fire-fly/str
 str name=parsedquery+DisjunctionMaxQuery(((ocr:fire ocr:fly)))/str
 str name=parsedquery_toString+((ocr:fire ocr:fly))/str


 Entered as separate words [fire fly]  numFound=184962
  edismax mm=2
 lst name=debug
 str name=rawquerystring{!edismax mm=2}fire fly/str
 str name=querystring{!edismax mm=2}fire fly/str
 str name=parsedquery
 +((DisjunctionMaxQuery((ocr:fire)) DisjunctionMaxQuery((ocr:fly)))~2)
 /str


 Regular Boolean AND query:   [fire AND fly] numFound=184962
 str name=rawquerystringfire AND fly/str
 str name=querystringfire AND fly/str
 str name=parsedquery+ocr:fire +ocr:fly/str
 str name=parsedquery_toString+ocr:fire +ocr:fly/str

 Regular Boolean OR query: fire OR fly 366047  numFound=366047
 lst name=debug
 str name=rawquerystringfire OR fly/str
 str name=querystringfire OR fly/str
 str name=parsedqueryocr:fire ocr:fly/str
 str name=parsedquery_toStringocr:fire ocr:fly/str



edismax parser ignores mm parameter when tokenizer splits tokens (hyphenated words, WDF splitting etc)

2012-06-28 Thread Tom Burton-West
Hello,

My previous e-mail with a CJK example has received no replies.   I verified
that this problem also occurs for English.  For example in the case of the
word fire-fly, the ICUTokenizer and the WordDelimiterFilter both split
this into two tokens fire and fly.

With an edismax query and a must-match of 2 (q={!edismax mm=2}), if the
words are entered separately as [fire fly], the edismax parser honors the
mm parameter and does the equivalent of a Boolean AND query.  However, if
the words are entered as a hyphenated word [fire-fly], the tokenizer splits
these into two tokens fire and fly and the edismax parser does the
equivalent of a Boolean OR query.

I'm not sure I understand the output of the debugQuery, but judging by the
number of hits returned it appears that edismax is not honoring the mm
parameter. Am I missing something, or is this a bug?

 I'd like to file a JIRA issue, but want to find out if I am missing
something here.

Details of several queries are appended below.

Tom Burton-West

edismax query mm=2, query with hyphenated word [fire-fly]

<lst name="debug">
<str name="rawquerystring">{!edismax mm=2}fire-fly</str>
<str name="querystring">{!edismax mm=2}fire-fly</str>
<str name="parsedquery">+DisjunctionMaxQuery(((ocr:fire ocr:fly)))</str>
<str name="parsedquery_toString">+((ocr:fire ocr:fly))</str>


Entered as separate words [fire fly]  numFound=184962
 edismax mm=2
<lst name="debug">
<str name="rawquerystring">{!edismax mm=2}fire fly</str>
<str name="querystring">{!edismax mm=2}fire fly</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((ocr:fire)) DisjunctionMaxQuery((ocr:fly)))~2)
</str>


Regular Boolean AND query:   [fire AND fly]  numFound=184962
<str name="rawquerystring">fire AND fly</str>
<str name="querystring">fire AND fly</str>
<str name="parsedquery">+ocr:fire +ocr:fly</str>
<str name="parsedquery_toString">+ocr:fire +ocr:fly</str>

Regular Boolean OR query:   [fire OR fly]  numFound=366047
<lst name="debug">
<str name="rawquerystring">fire OR fly</str>
<str name="querystring">fire OR fly</str>
<str name="parsedquery">ocr:fire ocr:fly</str>
<str name="parsedquery_toString">ocr:fire ocr:fly</str>


edismax parser ignores mm parameter when tokenizer splits tokens (i.e. CJK)

2012-06-26 Thread Tom Burton-West
We are using the edismax query parser with an mm=100%.  However, when a CJK
query ( ABC) gets tokenized by the CJKBigramFilter ([AB] [BC]),  instead of
a Boolean AND for [AB] AND [BC], which is what we expect with mm=100%, this
gets searched as a Boolean OR query.

For example searching for Daya Bay 大亚湾 (which gets tokenized to 大亚 亚湾) we
get about 10,000 results.
If instead we manually segment the Chinese characters for Daya Bay and
enter the query [大亚  亚湾] we get 5,000 results.
(Our default Boolean operator is also AND)

This problem also occurs with non-CJK queries for example [two-thirds]
turns into a Boolean OR query for ( [two] OR [thirds] ).

Is there some way to tell the edismax query parser to stick with mm =100%?
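One workaround I have been wondering about but have not tested: flipping
autoGeneratePhraseQueries to true on the fieldType, e.g.

<fieldType name="CJKFullText" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">

which I believe makes the parser turn the two overlapping bigrams into a phrase
query instead of an OR -- though that changes the semantics to adjacency rather
than an mm-style AND.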

Appended below is the debugQuery output for these two queries and an
exceprt from our schema.xml.


Tom

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search


Entered as [大亚湾] in Just Full Text

<str name="rawquerystring">
 _query_:"{!edismax qf='ocr^50 ' pf='' mm='100%' tie='0.9' } 大亚湾"
</str>
<str name="querystring">
 _query_:"{!edismax qf='ocr^50 ' pf='' mm='100%' tie='0.9' } 大亚湾"
</str>
<str name="parsedquery">
+DisjunctionMaxQuery((((ocr:大亚 ocr:亚湾)^50.0))~0.9)
</str>


Entered as two phrases ["大亚" "亚湾"] in Just Full Text
We get 4909 hits.  This is what I was expecting with the bigrams above.

<lst name="debug">
<str name="rawquerystring">
 _query_:"{!edismax qf='ocr^50 ' pf='' mm='100%' tie='0.9' } \"大亚\" \"亚湾\""
</str>
<str name="querystring">
 _query_:"{!edismax qf='ocr^50 ' pf='' mm='100%' tie='0.9' } \"大亚\" \"亚湾\""
</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((ocr:大亚^50.0)~0.9)
DisjunctionMaxQuery((ocr:亚湾^50.0)~0.9))~2)
</str>


---
<fieldType name="CJKFullText" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
            katakana="false" hangul="false"/>


What is the docs number in Solr explain query results for fieldnorm?

2012-05-25 Thread Tom Burton-West
Hello all,

I am trying to understand the output of Solr explain for a one word query.
I am querying on the ocr field with no stemming/synonyms or stopwords.
And no query or index time boosting.

The query is ocr:the

The document (result below), which contains the two words The Aeroplane, scores
higher than documents with 50 or more occurrences of the word the.
Since the idf is the same, I am assuming this is a result of length norms.

The explain (debugQuery) shows the following for fieldnorm:
 0.625 = fieldNorm(field=ocr, doc=16624)
What does the doc=16624 mean?  It certainly cannot represent the length of the
field (as an integer), since there are only two terms in the field.
It can't represent the number of docs with the query term either (the idf
output shows the word the occurs in 16,219 docs).
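(Thinking about it a bit more: with the default similarity the length norm is
1/sqrt(numTerms), so for the two-term field

  1 / sqrt(2) = 0.7071...  ->  encoded into a single byte  ->  decodes to 0.625

which matches the fieldNorm shown, and doc=16624 would then just be Lucene's
internal document id rather than any kind of count.  Does that sound right?)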

I have appended below the explain scoring for a couple of documents with tf
50 and 67.


<float name="score">0.6798219</float>
<str name="ID">DF9199B7049F8DFE-220</str>
<str name="doc_ID">DF9199B7049F8DFE</str>
<str name="ocr">The Aeroplane
</str>
<str name="DF9199B7049F8DFE-220">
0.6798219 = (MATCH) fieldWeight(ocr:the in 16624), product of:
  1.0 = tf(termFreq(ocr:the)=1)
  1.087715 = idf(docFreq=16219, maxDocs=17707)
  0.625 = fieldNorm(field=ocr, doc=16624)
</str>

Tom Burton-West

-

<str name="78562575E066497D-518">
0.42061833 = (MATCH) fieldWeight(ocr:the in 8396), product of:
  7.071068 = tf(termFreq(ocr:the)=50)
  1.087715 = idf(docFreq=16219, maxDocs=17707)
  0.0546875 = fieldNorm(field=ocr, doc=8396)
</str>



<str name="18881D8AE8B1576E-120">
0.41734362 = (MATCH) fieldWeight(ocr:the in 2782), product of:
  8.185352 = tf(termFreq(ocr:the)=67)
  1.087715 = idf(docFreq=16219, maxDocs=17707)
  0.046875 = fieldNorm(field=ocr, doc=2782)
</str>


boost not showing up in Solr 3.6 debugQueries?

2012-05-17 Thread Tom Burton-West
Hello all,

In Solr 3.4, the boost factor is explicitly shown in debugQueries:

str name=coo.31924100321193
0.37087926 = (MATCH) sum of:
  0.3708323 = (MATCH) weight(ocr:dog^1000.0 in 215624), product of:
0.995 = queryWeight(ocr:dog^1000.0), product of:
  1000.0 = boost
  2.32497 = idf(docFreq=237626, maxDocs=893970)
  4.3011288E-4 = queryNorm
0.37083247 = (MATCH) fieldWeight(ocr:dog in 215624), product of:
  27.221315 = tf(termFreq(ocr:dog)=741)
...

But in Solr 3.6 I am not seeing the boost factor called out.

On the other hand it looks like it may now be incorporated in the
queryNorm (please see example below).

Is there a bug in Solr 3.6 debugQueries?  Is there some new behavior
regarding boosts and queryNorms? or am I missing something obvious?

(Apologies for the Japanese query, but right now the only index I have
in Solr 3.6 is for CJK and this is one of the queries from our log.)

Tom Burton-West



<lst name="debug">
  <str name="rawquerystring"> 兵にな^1000 OR hanUnigrams:兵にな</str>
  <str name="querystring"> 兵にな^1000 OR hanUnigrams:兵にな</str>
  <str name="parsedquery">((+ocr:兵に +ocr:にな)^1000.0) hanUnigrams:兵</str>
  <str name="parsedquery_toString">((+ocr:兵に +ocr:にな)^1000.0) hanUnigrams:兵</str>

  <lst name="explain">
    <str name="mdp.39015021911386">
0.15685473 = (MATCH) sum of:
  0.15684697 = (MATCH) sum of:
0.0067602023 = (MATCH) weight(ocr:兵に in 213594), product of:
  0.81443477 = queryWeight(ocr:兵に), product of:
3.3998778 = idf(docFreq=70130, maxDocs=772972)
0.23954825 = queryNorm
  0.008300483 = (MATCH) fieldWeight(ocr:兵に in 213594), product of:
1.0 = tf(termFreq(ocr:兵に)=1)
3.3998778 = idf(docFreq=70130, maxDocs=772972)
0.0024414062 = fieldNorm(field=ocr, doc=213594)
0.15008678 = (MATCH) weight(ocr:にな in 213594), product of:
  0.5802551 = queryWeight(ocr:にな), product of:
2.422289 = idf(docFreq=186410, maxDocs=772972)
0.23954825 = queryNorm
  0.25865653 = (MATCH) fieldWeight(ocr:にな in 213594), product of:
43.737854 = tf(termFreq(ocr:にな)=1913)
2.422289 = idf(docFreq=186410, maxDocs=772972)
0.0024414062 = fieldNorm(field=ocr, doc=213594)
  7.76674E-6 = (MATCH) weight(hanUnigrams:兵 in 213594), product of:
2.9968342E-4 = queryWeight(hanUnigrams:兵), product of:
  1.2510358 = idf(docFreq=601367, maxDocs=772972)
  2.3954824E-4 = queryNorm
0.025916481 = (MATCH) fieldWeight(hanUnigrams:兵 in 213594), product of:
  4.2426405 = tf(termFreq(hanUnigrams:兵)=18)
  1.2510358 = idf(docFreq=601367, maxDocs=772972)
  0.0048828125 = fieldNorm(field=hanUnigrams, doc=213594)
</str>
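
A rough check on the numbers above is consistent with that reading (assuming nothing else differs between the clauses): the queryNorm printed for the two boosted ocr clauses is 0.23954825, while the queryNorm printed for the unboosted hanUnigrams clause is 2.3954824E-4, and

  0.23954825 / 2.3954824E-4 ≈ 1000

which is exactly the ^1000 boost. So the boost seems to have been folded into the displayed queryNorm/queryWeight rather than printed on its own line.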


Re: Solr RAM Requirements

2010-03-17 Thread Tom Burton-West

Hi Chak

Rather than comparing the overall size of your index to the RAM available
for the OS disk cache, you might want to look at particular files. For
example, if you allow phrase queries, then the size of the *prx files is
relevant; if you don't, you can look at the size of your *frq files.  You
also might want to take a look at the free memory when you start up Solr and
then watch it fill up as you get more queries (or send cache-warming
queries).

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search





KaktuChakarabati wrote:
 
 My question was mainly about the fact there seems to be two different
 aspects to the solr RAM usage: in-process and out-process. 
 By that I mean, yes i know the many different parameters/caches to do with
 solr in-process memory usage and related culprits, however I also
 understand that as for actual index access (posting list, positional index
 etc), solr mostly delegates the access/caching of this to the OS/disk
 cache. 
 So I guess my question is more about that: namely, what would be a good
 way to calculate an overall ram requirement profile for a server running
 solr? 
 



Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West

Thanks Simon,

We can probably implement your suggestion about runs of punctuation and
unlikely mixes of alpha/numeric/punctuation.  I'm also thinking about
looking for unlikely mixes of Unicode character blocks.  For example, some of
the CJK material ends up with Cyrillic characters (though we would have to
watch out for any Russian-Chinese dictionaries :).
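
As a rough illustration of the kind of per-token check we have in mind (the thresholds and the block-mixing rule here are just placeholders, and a real version would need to group related blocks such as Latin and Latin-1 Supplement):

import java.lang.Character.UnicodeBlock;
import java.util.HashSet;
import java.util.Set;

public class SuspectTokenCheck {

  // Flag a token as suspect OCR if its letters come from more than one
  // Unicode block, or if it is made up mostly of punctuation.
  public static boolean isSuspect(String token) {
    Set<UnicodeBlock> blocks = new HashSet<UnicodeBlock>();
    int letters = 0, digits = 0, punct = 0;
    for (int i = 0; i < token.length(); ) {
      int cp = token.codePointAt(i);
      if (Character.isLetter(cp)) {
        letters++;
        blocks.add(UnicodeBlock.of(cp));
      } else if (Character.isDigit(cp)) {
        digits++;
      } else if (!Character.isWhitespace(cp)) {
        punct++;
      }
      i += Character.charCount(cp);
    }
    return blocks.size() > 1 || (punct > 0 && punct >= letters + digits);
  }

  public static void main(String[] args) {
    System.out.println(isSuspect("Aeroplane"));  // false
    System.out.println(isSuspect("d0g~~;;''"));  // true
  }
}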

Tom



 
 
 There wasn't any completely satisfactory solution; there were a large
 number
 of two and three letter n-grams so we were able to use a dictionary
 approach
 to eliminate those (names tend to be longer).  We also looked for runs of
 punctuation,  unlikely mixes of alpha/numeric/punctuation, and also
 eliminated longer words which consisted of runs of not-ocurring-in-English
 bigrams.
 
 Hope this helps
 
 -Simon
 





Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West

Interesting.  I wonder, though, if we have 4 million English documents and 250
in Urdu, whether the Urdu words would score badly when compared to n-gram
statistics for the entire corpus.


hossman wrote:
 
 
 
 Since you are dealing with multiple langugaes, and multiple varient usages 
 of langauges (ie: olde english) I wonder if one way to try and generalize 
 the idea of unlikely letter combinations into a math problem (instead of 
 grammer/spelling problem) would be to score all the hapax legomenon 
 words in your index based on the frequency of (character) N-grams in 
 each of those words, relative the entire corpus, and then eliminate any of 
 the hapax legomenon words whose score is below some cut off threshold 
 (that you'd have to pick arbitrarily, probably by eyeballing the sorted 
 list of words and their contexts to deide if they are legitimate)
 
   ?
 
 
 -Hoss
 
 
 




Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West

We've been thinking about running some kind of a classifier against each book
to select books with a high percentage of dirty OCR for some kind of special
processing.  Haven't quite figured out a multilingual feature set yet other
than the punctuation/alphanumeric and character block ideas mentioned above.   

I'm not sure I understand your suggestion. Since real-word hapax legomena
are generally pretty common (maybe 40-60% of unique words), wouldn't using
them as the no set provide mixed signals to the classifier?

Tom


Walter Underwood-2 wrote:
 
 
 Hmm, how about a classifier? Common words are the yes training set,
 hapax legomenons are the no set, and n-grams are the features.
 
 But why isn't the OCR program already doing this?
 
 wunder
 
 
 
 
 
 




Re: What is largest reasonable setting for ramBufferSizeMB?

2010-02-19 Thread Tom Burton-West

Hi Glen,

I'd love to use LuSql, but our data is not in a db.  It's 6-8TB of files
containing OCR (one file per page for about 1.5 billion pages), gzipped on
disk, which are gunzipped, concatenated, and converted to Solr documents
on-the-fly.  We have multiple instances of our Solr document producer script
running.  At this point we can run enough producers that the rate at which
Solr can ingest and index documents is our current bottleneck, and so far
that bottleneck appears to be disk I/O for Solr/Lucene during merges.

Is there any obvious relationship between the size of the ramBuffer and how
much heap you need to give the JVM, or is there some reasonable method of
finding this out by experimentation?
We would rather not find out by decreasing the amount of memory allocated to
the JVM until we get an OOM.
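
(The buffer itself is set in the indexDefaults section of solrconfig.xml, along these lines; the value here is only illustrative, not a recommendation:

  <indexDefaults>
    <ramBufferSizeMB>960</ramBufferSizeMB>
  </indexDefaults>

Whatever it is set to has to fit inside the JVM heap on top of the Solr caches and everything else, so presumably the heap needs to be comfortably larger than the buffer.)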

Tom



I've run Lucene with heap sizes as large as 28GB of RAM (on a 32GB
machine, 64bit, Linux) and a ramBufferSize of 3GB. While I haven't
noticed the GC issues mark mentioned in this configuration, I have
seen them in the ranges he discusses (on 1.6 update 18).

You may consider using LuSql[1] to create the indexes, if your source
content is in a JDBC accessible db. It is quite a bit faster than
Solr, as it is a tool specifically created and tuned for Lucene
indexing. But it is command-line, not RESTful like Solr. The released
version of LuSql only runs single machine (though designed for many
threads), the new release will allow distributing indexing across any
number of machines (with each machine building a shard). The new
release also has plugable sources, so it is not restricted to JDBC.

-Glen
[1]http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql





Re: What is largest reasonable setting for ramBufferSizeMB?

2010-02-18 Thread Tom Burton-West

Thanks Otis,

I don't know enough about Hadoop to understand the advantage of using Hadoop
in this use case.  How would using Hadoop differ from distributing the
indexing over 10 shards on 10 machines with Solr?

Tom



Otis Gospodnetic wrote:
 
 Hi Tom,
 
 32MB is very low, 320MB is medium, and I think you could go higher, just
 pick whichever garbage collector is good for throughput.  I know Java 1.6
 update 18 also has some Hotspot and maybe also GC fixes, so I'd use that. 
 Finally, this sounds like a good use case for reindexing with Hadoop!
 
  Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Hadoop ecosystem search :: http://search-hadoop.com/
 
 




Re: persistent cache

2010-02-15 Thread Tom Burton-West

Hi Tim,

Due to our performance needs we optimize the index early in the morning and
then run the cache-warming queries once we mount the optimized index on our
servers.  If you are indexing and serving using the same Solr instance, you
shouldn't have to re-run the cache-warming queries when you add documents:
I believe the disk writes caused by adding the documents to the index
should put that data in the OS cache.  Actually, 1600 queries are not a lot
of queries; if you are using actual user queries from your logs you may
need more.  We used some tools based on Luke to analyze our index and
determine which words would benefit most from being in the OS cache
(assuming users entered a phrase query containing those words).  You can
experiment to see how many queries you need to fill memory by emptying the
OS cache, sending queries, and using top to watch memory usage.
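
If it helps, warming queries can also be wired directly into solrconfig.xml as a firstSearcher (or newSearcher) listener instead of being sent from an external script; a minimal sketch, with made-up query values, looks like this:

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">some common phrase query</str>
        <str name="rows">10</str>
      </lst>
    </arr>
  </listener>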

Your options (assuming performance with current hardware does not meet your
needs) are using SSDs, increasing memory on the machine, or splitting the
index using Solr shards.  If you either increase memory on the machine or
split the index, you will still have to run cache-warming queries.

One other thing you might consider is using stop words or CommonGrams to
reduce disk I/O requirements for phrase queries containing common words.
(Our experiments with CommonGrams and cache-warming are described on our
blog: http://www.hathitrust.org/blogs/large-scale-search )
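
For reference, a minimal sketch of the analysis chains involved (the field type and word list names are placeholders for whatever you use):

  <fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>

At index time the CommonGrams filter emits word-pair tokens for the listed common words alongside the single terms; at query time the query-side filter keeps just those pairs for phrases containing common words, which is where the disk I/O savings for phrase queries come from.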

Tom




Hi Tom,

1600 warming queries, that's quite many. Do you run them every time a
document is added to the index? Do you have any tips on warming?

If the index size is more than you can have in RAM, do you recommend
to split the index to several servers so it can all be in RAM?

I do expect phrase queries. Total index size is 107 GB. *prx files are
total 65GB and *frq files 38GB. It's probably worth buying more RAM.

/Tim





Re: persistent cache

2010-02-12 Thread Tom Burton-West

Hi Tim,

We generally run about 1600 cache-warming queries to warm up the OS disk
cache and the Solr caches when we mount a new index.

Do you have/expect phrase queries?   If you don't, then you don't need to
get any position information into your OS disk cache.  Our position
information takes about 85% of the total index size (*prx files).  So with a
100GB index, your *frq files might only be 15-20GB and you could probably
get more than half of that in 16GB of memory.

If you have limited memory and a large index, then you need to choose cache
warming queries carefully as once the cache is full, further queries will
start evicting older data from the cache.  The tradeoff is to populate the
cache with data that would require the most disk access if the data was not
in the cache versus populating the cache based on your best guess of what
queries your users will execute.  A good overview of the issues is the paper
by Baeza-Yates ( http://doi.acm.org/10.1145/1277741.125 The Impact of
Caching on Search Engines )


Tom Burton-West
Digital Library Production Service
University of Michigan Library



Re: TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-09 Thread Tom Burton-West

Thanks Lance and Michael,


We are running Solr 1.3.0.2009.09.03.11.14.39  (Complete version info from
Solr admin panel appended below)

I tried running CheckIndex (with the -ea: switch) on one of the shards.
CheckIndex also produced an ArrayIndexOutOfBoundsException on the larger
segment containing 500K+ documents. (Complete CheckIndex output appended
below)

Is it likely that all 10 shards are corrupted?  Is it possible that we have
simply exceeded some Lucene limit?

I'm wondering if we could have exceeded the Lucene limit of 2.1 billion
unique terms mentioned towards the end of the Lucene Index File Formats
document.  If the small 731-document index has nine million unique terms as
reported by CheckIndex, then even though many terms are repeated, it is
conceivable that the 500,000-document index could have more than 2.1 billion
terms.

Do you know if  the number of terms reported by CheckIndex is the number of
unique terms?

On the other hand, we previously optimized a 1 million document index down
to 1 segment and had no problems.  That was with an earlier version of Solr
and did not include CommonGrams which could conceivably increase the number
of terms in the index by 2 or 3 times.


Tom
---

Solr Specification Version: 1.3.0.2009.09.03.11.14.39
Solr Implementation Version: 1.4-dev 793569 - root - 2009-09-03 11:14:39
Lucene Specification Version: 2.9-dev
Lucene Implementation Version: 2.9-dev 779312 - 2009-05-27 17:19:55


[tburt...@slurm-4 ~]$  java -Xmx4096m  -Xms4096m -cp
/l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib/lucene-core-2.9-dev.jar:/l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib
-ea:org.apache.lucene... org.apache.lucene.index.CheckIndex
/l/solrs/1/.snapshot/serve-2010-02-07/data/index 

Opening index @ /l/solrs/1/.snapshot/serve-2010-02-07/data/index

Segments file=segments_zo numSegments=2 version=FORMAT_DIAGNOSTICS [Lucene
2.9]
  1 of 2: name=_29dn docCount=554799
compound=false
hasProx=true
numFiles=9
size (MB)=267,131.261
diagnostics = {optimize=true, mergeFactor=2,
os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true,
lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge,
os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_29dn_7.del]
test: open reader.OK [184 deleted docs]
test: fields, norms...OK [6 fields]
test: terms, freq, prox...FAILED
WARNING: fixIndex() would remove reference to this segment; full
exception:
java.lang.ArrayIndexOutOfBoundsException: -16777214
at
org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:246)
at
org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:218)
at
org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:57)
at
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:474)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:715)

  2 of 2: name=_29im docCount=731
compound=false
hasProx=true
numFiles=8
size (MB)=421.261
diagnostics = {optimize=true, mergeFactor=3,
os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true,
lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge,
os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.OK
test: fields, norms...OK [6 fields]
test: terms, freq, prox...OK [9504552 terms; 34864047 terms/docs pairs;
144869629 tokens]
test: stored fields...OK [3550 total field count; avg 4.856 fields
per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq
vector fields per doc]

WARNING: 1 broken segments (containing 554615 documents) detected
WARNING: would write new segments file, and 554615 documents would be lost,
if -fix were specified


[tburt...@slurm-4 ~]$ 


The index is corrupted. In some places ArrayIndex and NPE are not
wrapped as CorruptIndexException.

Try running your code with the Lucene assertions on. Add this to the
JVM arguments:  -ea:org.apache.lucene...





Re: TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-09 Thread Tom Burton-West

Thanks Michael,

I'm not sure I understand.  CheckIndex reported a negative number:
-16777214.

But in any case we can certainly try running CheckIndex from a patched
Lucene.  We could also run a patched Lucene on our dev server.

Tom



Yes, the term count reported by CheckIndex is the total number of unique
terms.

It indeed looks like you are exceeding the unique term count limit --
16777214 * 128 (= the default term index interval) is 2147483392 which
is mighty close to max/min 32 bit int value.  This makes sense,
because CheckIndex steps through the terms in order, one by one.  So
the first term just over the limit triggered the exception.

Hmm -- can you try a patched Lucene in your area?  I have one small
change to try that may increase the limit to termIndexInterval
(default 128) * 2.1 billion.

Mike





Re: Thanks Robert!

2010-02-05 Thread Tom Burton-West

+1
And thanks to you both for all your work on CommonGrams!

Tom Burton-West


Jason Rutherglen-2 wrote:
 
 Robert, thanks for redoing all the Solr analyzers to the new API!  It
 helps to have many examples to work from, best practices so to speak.
 
 



