Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-31 Thread Dotan Cohen
On Wed, Jul 31, 2013 at 4:56 AM, Bill Bell billnb...@gmail.com wrote:
 On Jul 30, 2013, at 12:34 PM, Dotan Cohen dotanco...@gmail.com wrote:
 On Tue, Jul 30, 2013 at 9:21 PM, Aloke Ghoshal alghos...@gmail.com wrote:
 Does adding facet.mincount=2 help?

 In fact, when adding facet.mincount=20 (I know that some dupes are in
 the hundreds) I got the OutOfMemoryError in seconds instead of
 minutes.

 Dotan Cohen

 This seems like a fairly large issue. Can you create a Jira issue?

 Bill Bell

I'll file an issue, but on what? What information should I include?
How is this different than what you would expect?

Thanks.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 9:21 PM, Aloke Ghoshal alghos...@gmail.com wrote:
 Does adding facet.mincount=2 help?



In fact, when adding facet.mincount=20 (I know that some dupes are in
the hundreds) I got the OutOfMemoryError in seconds instead of
minutes.
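
(For anyone following along, a minimal form of the faceting approach
under discussion would be something like:

select?q=*:*&rows=0&facet=true&facet.field=id&facet.mincount=2&facet.limit=-1

with facet.limit=-1 lifting the default cap on returned facet values,
which is presumably also what makes it so memory-hungry.)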

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 9:23 PM, Michael Della Bitta
michael.della.bi...@appinions.com wrote:
 Are you talking about the document's ID field?

 If so, you can't have duplicates... the latter document would overwrite the
 earlier.

 If not, sorry for asking irrelevant questions. :)


In Solr 4.1 we were using overwrite=false&allowDups=false in order to
discard the new document, not overwrite the extant document. We knew
at the time that the features were deprecated, and apparently
allowDups=false stopped working in 4.3. We are testing new solutions,
but we need to identify the dupes to get them out.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 9:43 PM, Michael Della Bitta
michael.della.bi...@appinions.com wrote:
 Since this is a one-time problem, Have you thought of just dumping all the
 IDs and looking for dupes using sort and awk or something similar to that?


All 100,000,000 of them :) That would take even longer! Also, I fear
that this is not a one-time problem; rather, I should learn now how to
tune Solr for intensive queries such as this. I learn by the problems
encountered!
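
(For the record, if I ever do go that route, I assume the pipeline
would be something along these lines, with the IDs first dumped one
per line to a file:

$ sort ids.txt | uniq -d > dupe-ids.txt

where uniq -d prints only the repeated lines.)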

Thanks.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 9:56 PM, Shawn Heisey s...@elyograg.org wrote:
 On 7/30/2013 12:49 PM, Dotan Cohen wrote:

 Thanks, the query ran for almost 2 full minutes but it returned
 results! I'll google for how to increase the disk cache for queries
 like this. Other than the Qtime, is there no way to judge the amount
 of memory required for a particular query to run?


 The way you increase disk cache is to add memory to the server.  Any memory
 that's not being used by programs (OS, Solr, or anything else) is
 automatically part of the disk cache.

 Thanks,
 Shawn


I see, thanks. I thought that 'disk cache' was something on disk, such
as swap space. The server is already maxed out on RAM:
$ free -m
             total       used       free     shared    buffers     cached
Mem:         14980      14906         73          0        167       5293
-/+ buffers/cache:       9444       5535
Swap:            0          0          0

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 11:00 PM, Mikhail Khludnev
mkhlud...@griddynamics.com wrote:
 Dotan,

 Could you please provide more lines of the stack trace?

Sure, thanks:
<response><lst name="error"><str name="msg">java.lang.OutOfMemoryError: Java heap space</str><str
name="trace">java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
	at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:365)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
	at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
	at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:926)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:988)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:679)
Caused by: java.lang.OutOfMemoryError: Java heap space
</str><int name="code">500</int></lst></response>


 I have no idea why it got worse in 4.3. I know that 4.3 can use facets
 backed by DocValues, which are easier on the heap. But from what I saw
 (I may be wrong), that is disabled for numeric facets. Hence, I can
 suggest reindexing the id as string DocValues and hoping for the best.
 However, it's doubtful you'd reindex everything without strong guarantees.

We also had issues with 4.2, though I really don't remember the
details. Some simple queries such as 'q=ubuntu' would take tens of
seconds, whereas on 4.1 they were almost instantaneous. In fact, even
in 4.3 I feel that things have slowed down terribly (3000 ms on simple
queries, whereas 4.1 would do them in tens or at most a few hundred
milliseconds). Of course, the index is constantly growing, so that may
be a factor. Note that in both cases the index and configuration were
carried over from 4.1, so that may have been an issue. Moving back from
4.2 to 4.1 I bit the bullet and deleted the extant documents. I no
longer have that luxury now.


 Also, I checked the source code of
 http://wiki.apache.org/solr/TermsComponent and found that it can be
 really modest on memory (i.e. without sort or limit).
 Be aware that the document frequencies returned by that component are
 unaware of deleted documents, hence run expungeDeletes before.


Thank you, I will look into that.
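
(A sketch of what I understand such a request would look like, assuming
the /terms handler from the example solrconfig.xml is enabled:

terms?terms.fl=id&terms.mincount=2&terms.sort=index&terms.limit=-1

i.e. every term in the id field appearing in two or more documents,
returned in index order with no limit.)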

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 11:14 PM, Jack Krupansky
j...@basetechnology.com wrote:
 The Solr SignatureUpdateProcessorFactory is designed to facilitate dedupe...
 any particular reason you did not use it?

 See:
 http://wiki.apache.org/solr/Deduplication

 and

 https://cwiki.apache.org/confluence/display/solr/De-Duplication


Actually, the guy who made the changes (a coworker) did in fact write
an alternative UpdateHandler. I've just noticed that there are a bunch
of dupes right now, though.

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.solr.core.SolrCore;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.DirectUpdateHandler2;
import org.apache.solr.util.RefCounted;

public class DiscoAPIUpdateHandler extends DirectUpdateHandler2 {

    public DiscoAPIUpdateHandler(SolrCore core) {
        super(core);
    }

    @Override
    public int addDoc(AddUpdateCommand cmd) throws IOException {

        // if overwrite is set to false we fall through to
        // DirectUpdateHandler2; this is done for debugging, to insert
        // duplicates into Solr
        if (!cmd.overwrite) return super.addDoc(cmd);

        // when using ref-counted objects you have!! to decrement the
        // ref count when you're done
        RefCounted<SolrIndexSearcher> indexSearcher =
                this.core.getNewestSearcher(false);

        // the idea: run an internal Lucene query and check whether
        // that id already exists in the index
        Term updateTerm;
        if (cmd.updateTerm != null) {
            updateTerm = cmd.updateTerm;
        } else {
            updateTerm = new Term("id", cmd.getIndexedId());
        }

        Query query = new TermQuery(updateTerm);
        TopDocs docs = indexSearcher.get().search(query, 2);

        if (docs.totalHits > 0) {
            // index searcher is no longer needed
            indexSearcher.decref();
            // don't add the new document
            return 0;
        }

        // index searcher is no longer needed
        indexSearcher.decref();

        // if we got here then it's a new document
        return super.addDoc(cmd);
    }
}


 And I give a bunch of examples in my book.


I'm eagerly anticipating the book!

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Find related words

2013-07-04 Thread Dotan Cohen
How might one find the top related words for a given word in a Solr index?

For instance, given the following single-field documents:
1: I love chocolate
2: I love Solr
3: I eat chocolate cake
4: You will eat chocolate candy

Thus, given the word "chocolate", Solr might find these top words:
I (2 times matched)
eat (2 times matched)
love, cake, you, will, candy (1 time each)

Thanks!

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Find related words

2013-07-04 Thread Dotan Cohen
Thank you Jack and Koji. I will take a look at MLT and also at the
.zip files from LUCENE-474. Koji, did you have to modify the code for
the latest Solr?

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: How to improve the Solr OR query performance

2013-07-03 Thread Dotan Cohen
On Wed, Jul 3, 2013 at 6:48 AM, huasanyelao huasanye...@163.com wrote:
 Nowadays, I've got an urgent task to improve the OR query performance
 with Solr.
 I have deployed 9 shards with SolrCloud on two servers (each server: 16
 cores, 32G RAM).
 Total document count: 60,000,000; total index size: 9G.
 According to the requirements, I have to use OR queries to get results.
 The average number of query terms is about 15.
 The response time for an OR query is around 1-2 seconds (an AND query
 takes just 30ms-40ms).
 Our target: improve by 50%, that is, at most 500ms-1s per query.
 The document count will soar to 80,000,000; however, performance should
 stay within 500ms-1s per query.
 Any advice or approach is appreciated. Thanks in advance.


What size documents? I've currently got stats like this, only a few
more documents but 5s searches on 15 ORs:
q=love%20OR%20hate%20OR%20beer%20OR%20sex%20OR%20peace%20OR%20war%20OR%20up%20OR%20down%20OR%20this%20OR%20that%20OR%20left%20OR%20right%20OR%20north%20OR%20south%20OR%20east%20OR%20west
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">5604</int><lst name="params"><str name="q">love OR hate
OR beer OR sex OR peace OR war OR up OR down OR this OR that OR left
OR right OR north OR south OR east OR west</str></lst></lst>
<result name="response" numFound="22495012" start="0">

My index currently has 77461952 documents, most under 1 KiB each but
with upwards of ten fields each.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: No date.gap on pivoted facets

2013-07-02 Thread Dotan Cohen
On Sun, Jun 30, 2013 at 5:33 PM, Jack Krupansky j...@basetechnology.com wrote:
 Sorry, but Solr pivot faceting is based solely on field facets, not
 range (or date) facets.


Thank you. I tried adding that information to the
SimpleFacetParameters wiki page, but that page seems to be defined as
an "Immutable Page".


 You can approximate date gaps by making a copy of your raw date field and
 then manually gap (truncate) the date values so that the their discrete
 values correspond to your date gap.


Thank you, this is what I have done.
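
(To illustrate with the 7DAYS gap from my query: with week-sized
buckets anchored at 2013-04-01T00:00:00Z, a raw value of
2013-04-09T17:22:31Z is stored in the truncated copy field as
2013-04-08T00:00:00Z, and the pivot then runs on that copy.)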


 In the next release of my book I have a script for a
 StatelessScriptUpdateProccessor (with examples) that supports truncation of
 dates to a desired resolution, copying or modifying the input date as
 desired.


Terrific, I anticipate the release. But "next" release? Did I miss one?
http://www.amazon.com/Lucene-Solr-Definitive-comprehensive-realtime/dp/1449359957/

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


No date.gap on pivoted facets

2013-06-30 Thread Dotan Cohen
Consider the following query:
select?q=*:*
&facet=true
&facet.date=added
&facet.date.start=2013-04-01T00:00:00Z
&facet.date.end=2013-06-30T00:00:00Z
&facet.date.gap=%2b7DAYS
&rows=0
&facet.pivot=added,provider

In this query, the facet.date.gap is ignored and each individual
second is faceted on. The issue remains the same even when reversing
the order of the pivot:
facet.pivot=provider,added

Is this a Solr bug, or am I pivoting wrong? This is on Solr 4.1.0
running on OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode) on
Ubuntu Server 12.04. Thank you!


--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Filter queries taking a long time, even with cache disabled

2013-06-27 Thread Dotan Cohen
On a Solr 4.1 install I see that queries which use the fq parameter
take a long time (upwards of 120 seconds), both with the standard
Lucene query parser and with edismax. I have added the {!cache=false}
local param to the filter query, but this does not speed up the query.
Putting all the search terms in the main query returns results in
milliseconds.

Note that I am not using any wildcard queries, in each case I am
specifying the field to search and the terms to search on. Where
should I start to debug?

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Filter queries taking a long time, even with cache disabled

2013-06-27 Thread Dotan Cohen
On Thu, Jun 27, 2013 at 12:14 PM, Upayavira u...@odoko.co.uk wrote:
 can you give an example?


Thank you. This is an example query:
select
?q=search_field:iraq
&fq={!cache=false}search_field:love%20obama
&defType=edismax

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Filtering on results with more than N words.

2013-06-06 Thread Dotan Cohen
Is there any way to restrict the search results to only those
documents with more than N words / tokens in the searched field? I
thought that this would be an easy one to Google for, but I cannot
figure it out or find any references. There are many references to
word size in characters, but not to field size in words.

Thank you.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Receiving unexpected Faceting results.

2013-06-05 Thread Dotan Cohen
Consider the following Solr query:
select?q=*:*&fq=tags:dotan-*&facet=true&facet.field=tags&rows=0

The 'tags' field is a multivalue field. I would expect the previous
query to return only tags that begin with the string 'dotan-' such as:
dotan-home
dotan-work
...but not strings which do not begin with (or even contain) the
string in question.

However, I am getting these results:
<lst name="discoapi_tags">
  <int name="dotan-home">14</int>
  <int name="dotan-work">13</int>
  <int name="beer">0</int>
  <int name="beatles">0</int>
</lst>

It _may_ be that the 'beer' and 'beatles' tags were once attached to
the same documents as 'dotan-home' and/or 'dotan-work' are attached
to. I've done a bit of experimenting on this Solr install, so I cannot
be sure. However, considering that there are in fact 0 results for
those two, I would not expect them to show up at all, even if they
were ever attached to (i.e. once a value in the multiValued field of)
any of the documents that match the filter query.

So, the questions are:
1) How can I check whether the multiValued field of a particular
document (given its uniqueKey id) ever contained a specific value?
Alternatively, how can I see all the values that the document ever had
for the field? I don't expect this to actually be possible, but I ask
in case it is, e.g. by examining certain aspects of the Solr index
with a text editor.

2) If those spurious results are appearing, does that necessarily mean
that those values were in fact once in the multiValued field of
documents matching the filter query? If so, the answer to the previous
question would be to simply run a query for the id of the document in
question and facet on the multiValued field with a large limit.

3) How can I have Solr return only those facet values for the field
that in fact begin with 'dotan-', even if a document has other tags
such as 'beatles'?

4) How can I have Solr return only those facet values whose counts are
larger than 0?

Thank you!

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Receiving unexpected Faceting results.

2013-06-05 Thread Dotan Cohen
On Wed, Jun 5, 2013 at 3:38 PM, Raymond Wiker rwi...@gmail.com wrote:
 3) Use the parameter facet.prefix, e.g. facet.prefix=dotan-. Note: this
 particular case will not work if the field you're faceting on is tokenised
 (with - being used as a token separator).

 4) Use the parameter facet.mincount - looks like you want to set it to 1,
 instead of the default which is 0.

Perfect, thank you Raymond!

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Receiving unexpected Faceting results.

2013-06-05 Thread Dotan Cohen
On Wed, Jun 5, 2013 at 3:41 PM, Brendan Grainger
brendan.grain...@gmail.com wrote:
 Hi Dotan,

 I think all you need to do is add:

 facet.mincount=1

 i.e.

 select?q=*:*&fq=tags:dotan-*&facet=true&facet.field=tags
 &rows=0&facet.mincount=1

 Note that you can do it per field as well:

 select?q=*:*&fq=tags:dotan-*&facet=true&facet.field=tags
 &rows=0&f.tags.facet.mincount=1

 http://wiki.apache.org/solr/SimpleFacetParameters#facet.mincount


Thanks, Brendan. I will review the available Facet Parameters, which I
really should have thought to do before posting as it is already
bookmarked!


Phrase matching with set union as opposed to set intersection on query terms

2013-06-05 Thread Dotan Cohen
How would one write a query which should perform set union on the
search terms (term1 OR term2 OR term3), and yet also perform phrase
matching if both terms are found? I tried a few variants of the
following, but in every case I am getting set intersection on the
search terms:

select?q={!q.op=OR}text:"term1 term2"~10

Thus, if term1 matches 10 documents and term2 matches 20 documents,
then SET UNION would include all of the documents that have either
term1 and/or term2. That means that between 20-30 results should be
returned. Conversely, SET INTERSECTION would return only results with
_both_ term1 _and_ term2, which could be between 0-10 documents.

Note that in the application, users will be searching for any
arbitrary number of terms, in fact they will be entering phrases. I
can limit these phrases to 140 characters if needed.

Thank you in advance!

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Phrase matching with set union as opposed to set intersection on query terms

2013-06-05 Thread Dotan Cohen
On Wed, Jun 5, 2013 at 6:10 PM, Shawn Heisey s...@elyograg.org wrote:
 On 6/5/2013 9:03 AM, Dotan Cohen wrote:
 How would one write a query which should perform set union on the
 search terms (term1 OR term2 OR term3), and yet also perform phrase
 matching if both terms are found? I tried a few variants of the
 following, but in every case I am getting set intersection on the
 search terms:

 select?q={!q.op=OR}text:"term1 term2"~10

 A phrase search by definition will require all terms to be present.
 Even though it is multiple terms, conceptually it is treated as a single
 term.

 It sounds like what you are after is what edismax can do.  If you define
 the pf field in addition to the qf field, Solr will do something pretty
 amazing - it will automatically construct a phrase query from a
 non-phrase query and search with it against multiple fields.  Done
 correctly, this means that an exact match will be listed first in the
 results.

 http://wiki.apache.org/solr/ExtendedDisMax#pf_.28Phrase_Fields.29

 Thanks,
 Shawn


Thank you Shawn, this pretty much does what I need it to do:
select?defType=edismax&q={!q.op=OR}search_field:"term1 term2"&pf=search_field

I'm reviewing the Edismax page now. Is there any other documentation
that I should review? I have found the Edismax page at the wonderful
lucidworks site, and if there is any other documentation that would
help me squeeze the most out of Edismax then I would love to know
about it.
http://docs.lucidworks.com/display/solr/The+Extended+DisMax+Query+Parser

Thank you very much!


--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Phrase matching with set union as opposed to set intersection on query terms

2013-06-05 Thread Dotan Cohen
On Wed, Jun 5, 2013 at 6:23 PM, Jack Krupansky j...@basetechnology.com wrote:
 term1 OR term2 OR "term1 term2"^2

 term1 OR term2 OR "term1 term2"~10^2

 The latter would rank documents with the terms nearby higher, and the
 adjacent terms highest.

 term1 OR term2 OR "term1 term2"~10^2 OR "term1 term2"^20 OR "term2 term1"^20

 To further boost adjacent terms.

 But the edismax pf/pf2/pf3 options might be good enough for you.


Thank you Jack. I suppose that I could write a script in PHP to create
such a query string from an arbitrary-length phrase, but it wouldn't
be pretty! Edismax does in fact meet my need, though.

Thanks!

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Phrase matching with set union as opposed to set intersection on query terms

2013-06-05 Thread Dotan Cohen
 select?defType=edismax&q={!q.op=OR}search_field:"term1 term2"&pf=search_field


Is there any way to perform a fuzzy search with this method? I have
tried appending ~1 to every term in the search like so:
select?defType=edismax&q={!q.op=OR}search_field:term1~1%20term2~1&pf=search_field

However, two issues:
1) It doesn't work! The results are identical to the results given
when not appending ~1 to every term (or ~3).

2) If at all possible, I would rather define the 'fuzziness'
elsewhere. Right now I would have to mangle the user input in order to
add the ~1 to the end of each term.

Note that the ExtendedDisMax page does in fact mention that fuzziness
is supported:
http://wiki.apache.org/solr/ExtendedDisMax#Query_Syntax

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Phrase matching with set union as opposed to set intersection on query terms

2013-06-05 Thread Dotan Cohen
On Wed, Jun 5, 2013 at 9:04 PM, Eustache Felenc
eustache.fel...@idilia.com wrote:
 There is also http://wiki.apache.org/solr/SolrRelevancyCookbook with nice
 examples.


Thank you.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Removing a single value from a multiValue field

2013-06-03 Thread Dotan Cohen
On Thu, May 30, 2013 at 5:01 PM, Jack Krupansky j...@basetechnology.com wrote:
 You gave an XML example, so I assumed you were working with XML!


Right, I did give the output as XML. I find XML to be a great document
markup language, but a terrible command format! Mostly, due to
(mis-)use of the attributes.


 In JSON...

 [{"id": "doc-id", "tags": {"add": ["a", "b"]}}]

 and

 [{"id": "doc-id", "tags": {"set": null}}]


Thank you! That is quite a bit more intuitive and less ambiguous than
the XML, would you not agree?

 BTW, this kind of stuff is covered in the book, separate chapters for XML
 and JSON, each with dozens of examples like this.


I have not posted on the book threads, but I will definitely order a
copy. My vote is for spiral bound, though I know that perfect binding
will look more professional on a bookshelf. I don't even care what the
book costs, within reason. Any resource that compiles into a single
package the wonderful methods that you and other contributors mention
here and elsewhere online will pay for itself in short order. Apache
Solr is an amazing product, but it is often obtuse and unintuitive.
Other times one does not even know what Solr is capable of, as was the
case in this thread, where I was parsing entire documents just to
change a multiValued field.

Thank you very much!

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Reindexing strategy

2013-06-03 Thread Dotan Cohen
On Fri, May 31, 2013 at 3:57 AM, Michael Sokolov
msoko...@safaribooksonline.com wrote:
 On UNIX platforms, take a look at vmstat for basic I/O measurement, and
 iostat for more detailed stats.  One coarse measurement is the number of
 blocked/waiting processes - usually this is due to I/O contention, and you
 will want to look at the paging and swapping numbers - you don't want any
 swapping at all.  But the best single number to look at is overall disk
 activity, which is the I/O percentage utilized number Shaun was mentioning.

 -Mike

Great, thanks! I've got some terms to google. For those who follow in
my footsteps: on Ubuntu the package 'sysstat' needs to be installed to
use iostat. Here are my reference stats before starting to experiment,
both for my own later comparison, and in case anybody sees anything
amiss here, in which case I would love to know about it. If there is a
fine manual that is particularly urgent for me to read, please do
mention it. Thanks!


--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Removing a single value from a multiValue field

2013-05-30 Thread Dotan Cohen
I have a Solr application with a multiValued field 'tags'. All fields
are indexed in this application. There exists a uniqueKey field 'id'
and a '_version_' field. This is running on Solr 4.x.

In order to add a tag, the application retrieves the full document,
creates a PHP array from the document structure, removes the
'_version_' field, and then adds the appropriate tag to the 'tags'
array. This is all then sent to Solr's update method via HTTP with
'overwrite=true'. Solr correctly replaces the extant document with the
new document, which is identical with the exception of a new value for
the '_version_' field and an additional value in the multiValued field
'tags'. This all works correctly.

I am now adding a feature where one can remove tags. I am using the
same business logic; however, instead of adding a value to the 'tags'
array I am removing one. I can confirm that the data being sent to
Solr does not contain the removed tag. However, it seems that the old
value of the multiValued field is persisted, that is, the old tag
stays. I can see that the '_version_' field has a new value, so I know
that the change was properly committed.

Is there a known bug whereby overwriting such a doc...:
<doc>
  <arr name="tags">
    <str>a</str>
    <str>b</str>
  </arr>
</doc>

...with this doc...:
<doc>
  <arr name="tags">
    <str>a</str>
  </arr>
</doc>

...has no effect? Can values be added to a multiValued field, but never removed?

Thanks.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Reindexing strategy

2013-05-30 Thread Dotan Cohen
On Wed, May 29, 2013 at 5:37 PM, Shawn Heisey s...@elyograg.org wrote:
 It's impossible for us to give you hard numbers.  You'll have to
 experiment to know how fast you can reindex without killing your
 servers.  A basic tenet for such experimentation, and something you
 hopefully already know: You'll want to get baseline measurements before
 you begin testing for comparison.


Thanks. I wasn't looking for hard numbers, but rather for the signs of
problems. I know to keep my eye on memory and CPU, but I have no idea
how to check disk I/O, and I'm not sure how to determine whether it
becomes saturated.

 One of the most reliable Solr-specific indicators of pushing your
 hardware too hard is that the QTime on your queries will start to
 increase dramatically.  Solr 4.1 and later has more granular query time
 statistics in the UI - the median and 95% numbers are much more
 important than the average.


Thank you, this will help. At least I now have a hard metric to see
when Solr is getting overburdened (QTime).


 Outside of that, if your overall IOwait CPU percentage starts getting
 near (or above) 30-50%, your server is struggling.  If all of your CPU
 cores are staying near 100% usage, then it's REALLY struggling.


I see, thanks.


 Assuming you have plenty of CPU cores, using fast storage and having
 plenty of extra RAM will alleviate much of the I/O bottleneck.  The
 usual rule of thumb for good query performance is that you need enough
 RAM to put 50-100% of your index in the OS disk cache.  For blazing
 performance during a rebuild, that becomes 100-200%.  If you had 150%,
 that would probably keep most indexes well-cached even during a rebuild.

 A rebuild will always lower performance, even with lots of RAM.


Considering that the Solr index is the only place that the data is
stored, and that users are actively using the system, I was not
planning on a rebuild but rather to iteratively reindex the extant
documents, even as new documents are being pushed in.


 My earlier reply to your other message has some other ideas that will
 hopefully help.


Thank you Shawn!

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: What exactly happens to extant documents when the schema changes?

2013-05-30 Thread Dotan Cohen
On Wed, May 29, 2013 at 5:09 PM, Shawn Heisey s...@elyograg.org wrote:
 I handle this in a very specific way with my sharded index.  This won't
 work for all designs, and the precise procedure won't work for SolrCloud.

 There is a 'live' and a 'build' core for each of my shards.  When I want
 to reindex, the program makes a note of my current position for deletes,
 reinserts, and new documents.  Then I use a DIH full-import from mysql
 into the build cores.  Once the import is done, I run the update cycle
 of deletes, reinserts, and new documents on those build cores, using the
 position information noted earlier.  Then I swap the cores so the new
 index is online.


I do need to examine sharding and multiple cores. I'll look into that,
thank you. By the way, don't google for DIH! It took me some time to
figure out that it is DataImportHandler, as some people use the
acronym for something completely different.


 To adapt this for SolrCloud, I would need to use two collections, and
 update a collection alias for what is considered live.

 To control the I/O and CPU usage, you might need some kind of throttling
 in your update/rebuild application.

 I don't need any throttling in my design.  Because I'm using DIH, the
 import only uses a single thread for each shard on the server.  I've got
 RAID10 for storage and half of the CPU cores are still available for
 queries, so it doesn't overwhelm the server.

 The rebuild does lower performance, so I have the other copy of the
 index handle queries while the rebuild is underway.  When the rebuild is
 done on one copy, I run it again on the other copy.  Right now I'm
 half-upgraded -- one copy of my index is version 3.5.0, the other is
 4.2.1.  Switching to SolrCloud with sharding and replication would
 eliminate this flexibility, unless I maintained two separate clouds.


Thank you. I am not using Solr Cloud but if I ever consider it, then I
will keep this in mind.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Removing a single value from a multiValue field

2013-05-30 Thread Dotan Cohen
On Thu, May 30, 2013 at 3:42 PM, Jack Krupansky j...@basetechnology.com wrote:
 First, you cannot do any internal editing of a multi-valued list, other
 than:

 1. Replace the entire list.
 2. Add values on to the end of the list.


Thank you. I meant that I am actually editing the entire document.
Reading it, changing the values that I need, and then 'updating' it. I
will look into updating only the single multiValued field.


 But you can do both of those operations on a single multivalued field with
 atomic update, without reading and writing the entire document.

 Second, there is no "arr" element in the Solr Update XML format. Only
 "field".

 To simply replace the full, current value of one multi-valued field:

 <add>
   <doc>
     <field name="id">doc-id</field>
     <field name="tags" update="set">a</field>
     <field name="tags" update="set">b</field>
   </doc>
 </add>

 If you simply want to append a couple of values:

 <add>
   <doc>
     <field name="id">doc-id</field>
     <field name="tags" update="add">a</field>
     <field name="tags" update="add">b</field>
   </doc>
 </add>

 To empty out a multivalued field:

 <add>
   <doc>
     <field name="id">doc-id</field>
     <field name="tags" update="set" null="true" />
   </doc>
 </add>


Thank you. I will see about translating that into the JSON format that
I work with.
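
(Untested, but going by the XML above I assume the JSON equivalents
are along the lines of:

[{"id": "doc-id", "tags": {"add": ["a", "b"]}}]

for appending, and "tags": {"set": null} to empty the field.)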

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: What exactly happens to extant documents when the schema changes?

2013-05-29 Thread Dotan Cohen
On Tue, May 28, 2013 at 2:20 PM, Upayavira u...@odoko.co.uk wrote:
 The schema provides Solr with a description of what it will find in the
 Lucene indexes. If you, for example, changed a string field to an
 integer in your schema, that'd mess things up bigtime. I recently had to
 upgrade a date field from the 1.4.1 date field format to the newer
 TrieDateField. Given I had to do it on a live index, I had to add a new
 field (just using copyfield) and re-index over the top, as the old field
 was still in use. I guess, given my app now uses the new date field
 only, I could presumably reindex the old date field with the new
 TrieDateField format, but I'd want to try that before I do it for real.


Thank you for the insight. Unfortunately, with 20 million records and
growing by hundreds each minute (social media posts) I don't see that
I could ever reindex the data in a timely way.


 However, if you changed a single valued field to a multi-valued one,
 that's not an issue, as a field with a single value is still valid for a
 multi-valued field.

 Also, if you add a new field, existing documents will be considered to
 have no value in that field. If that is acceptable, then you're fine.

 I guess if you remove a field, then those fields will be ignored by
 Solr, and thus not impact anything. But I have to say, I've never tried
 that.

 Thus - changing the schema will only impact on future indexing. Whether
 your existing index will still be valid depends upon the changes you are
 making.

 Upayavira

Thanks.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: What exactly happens to extant documents when the schema changes?

2013-05-29 Thread Dotan Cohen
On Tue, May 28, 2013 at 3:58 PM, Jack Krupansky j...@basetechnology.com wrote:
 The technical answer: Undefined and not guaranteed.


I was afraid of that!

 Sure, you can experiment and see what the effects happen to be in any
 given release, and maybe they don't tend to change (too much) between most
 releases, but there is no guarantee that any given "change schema but keep
 existing data" scenario, without a delete of the directory contents and a
 full reindex, will actually be benign or what you expect.

 As a general proposition, when it comes to changing the schema and not
 deleting the directory and doing a full reindex, don't do it! Of course, we
 all know not to try to walk on thin ice, but a lot of people will try to do
 it anyway - and maybe it happens that most of the time the results are
 benign.


In the case of this particular application, reindexing really is
overly burdensome as the application is performing hundreds of writes
to the index per minute. How might I gauge how much spare I/O Solr
could commit to a reindex? All the data that I need is in fact in
stored fields.

Note that because the social media application that feeds our Solr
index is global, there are no 'off hours'.


 OTOH, you could file a Jira to propose that the effects of changing the
 schema but keeping the existing data should be precisely defined and
 documented, but, that could still change from release to release.


Seems like a lot of effort to document, for little benefit. I'm not
going to file it. I would like to know, though, is the schema
consulted at index time, query time, or both?


 From a practical perspective for your original question: If you suddenly add
 a field, there is no guarantee what will happen when you try to access that
 field for existing documents, or what will happen if you update existing
 documents. Sure, people can talk about what happens to be true today, but
 there is no guarantee for the future. Similarly for deleting a field from
 the schema, there is no guarantee about the status of existing data, even
 though people can chatter about what it seems to do today.

 Generally, you should design your application around contracts and what is
 guaranteed to be true, not what happens to be true from experiments or even
 experience. Granted, that is the theory and sometimes you do need to rely on
 experimentation and folklore and spotty or ambiguous documentation, but to
 the extent possible, it is best to avoid explicitly trying to rely on
 undocumented, uncontracted behavior.


Thanks. The application does change (added features) and we do not
want to lose old data.


 One question I asked long ago and never received an answer: what is the best
 practice for doing a full reindex - is it sufficient to first do a delete of
 *:*, or does the Solr index directory contents or even the directory
 itself need to be explicitly deleted first? I believe it is the latter, but
 the former seems to work, most of the time. Deleting the directory itself
 seems to be the best answer, to date - but no guarantees!


I don't have an answer for that, sorry!

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Reindexing strategy

2013-05-29 Thread Dotan Cohen
I see that I do need to reindex my Solr index. The index consists of
20 million documents with a few hundred new documents added per minute
(social media data). The documents are mostly smaller than 1KiB of
data, but some may go as large as 10 KiB. All the data is text, and
all indexed fields are stored.

To reindex, I am considering adding a 'last_indexed' field, and having
a Python or Java application pull out N results every T seconds,
sorting on last_indexed asc. How might I determine good values for N
and T? I would like to know when the Solr index is 'overloaded', or
whatever happens to Solr when it is being pushed beyond the limits of
its hardware. What should I be looking at to know if Solr is
overstressed? Is looking at CPU and memory good enough? Is there a way
to measure I/O to the disk on which the Solr index is stored? Bear in
mind that while the reindex is happening, clients will be performing
searches and a few hundred documents will be written per minute. Note
that the machine running Solr is an EC2 instance running on Amazon Web
Services, and that the 'disk' on which the Solr index is stored is an
EBS volume.
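
For concreteness, the pull query I have in mind is something like this
(rows=1000 is just a placeholder for N):

select?q=*:*&sort=last_indexed%20asc&rows=1000

issued every T seconds, with each returned document then re-posted so
that its last_indexed is updated.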

Thank you.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Reindexing strategy

2013-05-29 Thread Dotan Cohen
On Wed, May 29, 2013 at 2:41 PM, Upayavira u...@odoko.co.uk wrote:
 I presume you are running Solr on a multi-core/CPU server. If you kept a
 single process hitting Solr to re-index, you'd be using just one of
 those cores. It would take as long as it takes, I can't see how you
 would 'overload' it that way.


I mean 'overload' Solr in the sense that it cannot read, process, and
write data fast enough because too much data is being handled. I
remind you that this system is writing hundreds of documents per
minute. Certainly there is a limit to what Solr can handle. I ask how
to know how close I am to this limit.


 I guess you could have a strategy that pulls 100 documents with an old
 last_indexed, and push them for re-indexing. If you get the full 100
 docs, you make a subsequent request immediately. If you get less than
 100 back, you know you're up-to-date and can wait, say, 30s before
 making another request.


Actually, I would add a filter query for documents whose last_indexed
value is before the last schema change, and stop when fewer documents
are returned than were requested.
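
Something like this, with a hypothetical schema-change date as the
cutoff:

select?q=*:*&fq=last_indexed:[*%20TO%202013-05-28T00:00:00Z]&sort=last_indexed%20asc&rows=100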

Thanks.


--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


What exactly happens to extant documents when the schema changes?

2013-05-28 Thread Dotan Cohen
When adding or removing a text field to/from the schema and then
restarting Solr, what exactly happens to extant documents? Is the
schema only consulted when Solr writes a document, therefore extant
documents are unaffected?

Considering that Solr supports dynamic fields, my experimentation with
removing and adding fields to the schema has shown almost no change in
the extant index results returned.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Why would one not use RemoveDuplicatesTokenFilterFactory?

2013-05-27 Thread Dotan Cohen
On Sun, May 26, 2013 at 8:16 PM, Jack Krupansky j...@basetechnology.com wrote:
 The only comment I was trying to make here is the relationship between the
 RemoveDuplicatesTokenFilterFactory and the KeywordRepeatFilterFactory.

 No, stemmed terms are not considered the same text as the original word. By
 definition, they are a new value for the term text.



I see, for some reason I did not concentrate on this key quote of yours:
...to remove the tokens that did not produce a stem ...

Now it makes perfect sense.
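
(For anyone who finds this thread later, the chain in question would,
I gather, look something like this in schema.xml:

<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.KeywordRepeatFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

Each token is emitted twice at the same position, the stemmer changes
one copy where it can, and the remove-duplicates filter drops the
second copy only where stemming changed nothing.)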

Thank you, Jack!


--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Why would one not use RemoveDuplicatesTokenFilterFactory?

2013-05-26 Thread Dotan Cohen
On Fri, May 24, 2013 at 4:04 PM, Jack Krupansky j...@basetechnology.com wrote:
 The primary purpose of this filter is in conjunction with the
 KeywordRepeatFilterFactory and a stemmer, to remove the tokens that did not
 produce a stem from the original token, so the keyword duplicate is no
 longer needed. The goal is to index both the stemmed and unstemmed terms at
 the same position.

 Whether your app is using the filter for that purpose remains to be seen.

 Removing duplicates from the raw input token stream would impact the term
 frequency.

 -- Jack Krupansky


Thank you Jack. I thought that the filter only removed tokens with
both identical position and identical text:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.RemoveDuplicatesTokenFilterFactory

Are stemmed terms considered the same text as the original word, such
that they will show as a dupe to the
RemoveDuplicatesTokenFilterFactory? That seems odd.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Why would one not use RemoveDuplicatesTokenFilterFactory?

2013-05-24 Thread Dotan Cohen
I am looking through the schema of a Solr installation that I
inherited last year. The original dev, who is unavailable for comment,
has two types of text fields: one with
RemoveDuplicatesTokenFilterFactory and one without. These fields are
intended for full-text search.

Why would someone _not_ use RemoveDuplicatesTokenFilterFactory on a
field intended for full-text search? What are the drawbacks to using
it? This application is very, very write heavy (hundreds of writes per
minute) if that matters. It was running on websolr.com at the time,
I've now moved it to Amazon Web Services.

Thanks.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Understanding the Solr Admin page

2013-04-08 Thread Dotan Cohen
I am expanding my Solr skills and would like to understand the Admin
page better. I realize that learning Java memory management and the
Java memory options will help me, and I am reading and experimenting
on that front, but if there are any concise resources that are
especially pertinent to Solr I would love to know about them.
Everything that I've found is either a "do this" one-liner or expects
Java experience, which I don't have, and I don't know what I need to
learn.

I notice that some of the Args presented are in black text, and others
in grey. Why are they presented differently? Where would I have found
this information in the fine manual?

When I start Solr with nohup, the resulting nohup.out file is _huge_.
How might I start Solr such that INFO is not output, and only WARNINGs
and SEVEREs are? In particular, I'd rather not log every query, not
even the invalid queries, which log as SEVERE. I thought that this
would be easy to Google for, but it is not! If there is a concise
document that examines this issue, I would love to know where on the
wild wild web it exists.
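
(My working assumption, not yet verified: the stock Jetty setup logs
through java.util.logging, so starting Solr with something like

sudo nohup java -Djava.util.logging.config.file=logging.properties ... -jar start.jar &

where logging.properties contains just

.level = WARNING

should suppress the INFO lines. If someone knows better, please
correct me.)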

Thank you.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Out of memory on some faceting queries

2013-04-08 Thread Dotan Cohen
On Wed, Apr 3, 2013 at 8:47 PM, Shawn Heisey s...@elyograg.org wrote:
 On 4/2/2013 3:09 AM, Dotan Cohen wrote:
 I notice that this only occurs on queries that run facets. I start
 Solr with the following command:
 sudo nohup java -XX:NewRatio=1 -XX:+UseParNewGC
 -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
 -Dsolr.solr.home=/mnt/SolrFiles100/solr -jar
 /opt/solr-4.1.0/example/start.jar &

 It looks like you've followed some advice that I gave previously on how
 to tune java.  I have since learned that this advice is bad, it results
 in long GC pauses, even with heaps that aren't huge.


I see, thanks.

 As others have pointed out, you don't have a max heap setting, which
 would mean that you're using whatever Java chooses for its default,
 which might not be enough.  If you can get Solr to successfully run for
 a while with queries and updates happening, the heap should eventually
 max out and the admin UI will show you what Java is choosing by default.

 Here is what I would now recommend for a beginning point on your Solr
 startup command.  You may need to increase the heap beyond 4GB, but be
 careful that you still have enough free memory to be able to do
 effective caching of your index.

 sudo nohup java -Xms4096M -Xmx4096M -XX:+UseConcMarkSweepGC
 -XX:CMSInitiatingOccupancyFraction=75 -XX:NewRatio=3
 -XX:MaxTenuringThreshold=8 -XX:+CMSParallelRemarkEnabled
 -XX:+ParallelRefProcEnabled -XX:+UseLargePages -XX:+AggressiveOpts
 -Dsolr.solr.home=/mnt/SolrFiles100/solr -jar
 /opt/solr-4.1.0/example/start.jar &


Thank you, I will experiment with that.

 If you are running a really old build of java (latest versions on
 Oracle's website are 1.6 build 43 and 1.7 build 17), you might want to
 leave AggressiveOpts out.  Some people would argue that you should never
 use that option.


Great, thank for the warning. This is what we're running, I'll see
about updating it through my distro's package manager:
$ java -version
java version 1.6.0_27
OpenJDK Runtime Environment (IcedTea6 1.12.3) (6b27-1.12.3-0ubuntu1~12.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: maxWarmingSearchers in Solr 4.

2013-04-08 Thread Dotan Cohen
On Thu, Apr 4, 2013 at 10:54 PM, Shawn Heisey s...@elyograg.org wrote:
 You'll want to ensure that your autowarmCount value on Solr's caches is low
 enough that each commit happens quickly.  If it takes 5000 milliseconds to
 warm the caches when you commit, then you want to be sure that you are
 committing less often than that, or you'll quickly reach your
 maxWarmingSearchers config value.  If the commits are happening VERY
 quickly, you may need to set autowarmCount to 0, and possibly disable caches
 entirely.


I see. This seems to be the opposite of the approach that I was taking.


 I went poking in the code, and it seems that maxWarmingSearchers
 defaults to Integer.MAX_VALUE.  I'm not sure whether this is a bad
 default or not.  It does mean that a pathological setup without
 maxWarmingSearchers in the config will probably blow up with an
 OutOfMemory exception, but is that better or worse than commits that
 don't make new documents searchable?  I can see arguments either way.


 This is interesting, what you found is that the value in the stock
 solrconfig.xml file differs from the Solr default value. I think that
 this is bad practice: a single default should be decided upon and Solr
 should use this value when nothing is specified in solrconfig.xml, and
 that _same_value_ should be specified in the stock solrconfig.xml. Is
 it not a reasonable assumption that this would be the case?


 That was directed more at the other committers.  I would argue that either a
 low number or a relatively high number should be the default, but not
 MAX_VALUE.  The example config should have a commented out section for
 maxWarmingSearchers that mentions the default.  I'm having the same
 discussion about maxBooleanClauses on SOLR-4586.


Right.


 It's possible that this has already been discussed, and that everyone
 prefers that a badly configured setup will eventually have a spectacular
 blow up with OutOfMemory, rather than semi-silently ignoring commits.  A
 searcher object contains caches and uses a lot of memory, so having lots of
 them around will eventually use up the entire heap.


Silently dropping data is by far the worse choice, I agree, especially
as a default setting.


--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: maxWarmingSearchers in Solr 4.

2013-04-04 Thread Dotan Cohen
On Wed, Apr 3, 2013 at 7:55 PM, Shawn Heisey s...@elyograg.org wrote:
 In situations where I don't want to change the default value, I prefer
 to leave config elements out of the solrconfig.  It makes the config
 smaller, and it also makes it so that I will automatically see benefits
 from the default changing in new versions.


Thanks. This makes sense. I take it, then, that you update (or at
least review) solrconfig for each new Solr version. As I become more
familiar with that file I will begin doing the same.

 In the case of maxWarmingSearchers, I would hope that you have your
 system set up so that you would never need more than 1 warming searcher
 at a time.  If you do a commit while a previous commit is still warming,
 Solr will try to create a second warming searcher.


How would I set the system up for that? We have very many commits
(every few seconds) and each commit contains a few tens of documents
(mostly smaller than 1 KiB per document). Right now we get about
200-300 searches per minute.

Note that I expect both the commit rate and the search rate to
increase 2-3 times in the next month, and ideally I should be able to
scale it beyond that. I'm right now looking into sharding as a
possible solution.

 I went poking in the code, and it seems that maxWarmingSearchers
 defaults to Integer.MAX_VALUE.  I'm not sure whether this is a bad
 default or not.  It does mean that a pathological setup without
 maxWarmingSearchers in the config will probably blow up with an
 OutOfMemory exception, but is that better or worse than commits that
 don't make new documents searchable?  I can see arguments either way.


This is interesting, what you found is that the value in the stock
solrconfig.xml file differs from the Solr default value. I think that
this is bad practice: a single default should be decided upon and Solr
should use this value when nothing is specified in solrconfig.xml, and
that _same_value_ should be specified in the stock solrconfig.xml. Is
it not a reasonable assumption that this would be the case?

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


maxWarmingSearchers in Solr 4.

2013-04-03 Thread Dotan Cohen
I have been dragging the same solrconfig.xml from Solr 3.x to 4.0 to
4.1, with no customization (bad, bad me!). I'm now looking into
customizing it and I see that the Solr 4.1 solrconfig.xml is much
simpler and shorter. Is this simply because many of the examples have
been removed?

In particular, I notice that there is no mention of
maxWarmingSearchers in the Solr 4.1 solrconfig.xml. I assume that I
can simply add it in, are there any other critical config options that
are missing that I should be looking into as well? Would I be better
off using the old Solr 3.x solrconfig.xml in Solr 4.1 as it contains
so many examples?
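
(My assumption is that adding it back is simply a matter of placing
something like:

<maxWarmingSearchers>2</maxWarmingSearchers>

inside the <query> section of solrconfig.xml, as in the old 3.x file.)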

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Out of memory on some faceting queries

2013-04-03 Thread Dotan Cohen
On Tue, Apr 2, 2013 at 6:26 PM, Andre Bois-Crettez
andre.b...@kelkoo.com wrote:
 warmupTime is available on the admin page for each type of cache (in
 milliseconds) :
 http://solr-box:8983/solr/#/core1/plugins/cache

 Or if you are only interested in the total :
 http://solr-box:8983/solr/core1/admin/mbeans?stats=true&key=searcher


Thanks.


 Batches of 20-50 results are added to solr a few times a minute, and a
 commit is done after each batch since I'm calling Solr as such:
 http://127.0.0.1:8983/solr/core/update/json?commit=true Should I
 remove commit=true and run a cron job to commit once per minute?


 Even better, it sounds like a job for CommitWithin :
 http://wiki.apache.org/solr/CommitWithin



I'll look into that. Thank you!
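
(If I read the CommitWithin wiki correctly, that would mean dropping
commit=true and instead calling something like:

http://127.0.0.1:8983/solr/core/update/json?commitWithin=60000

so that each batch is guaranteed to become visible within 60 seconds,
without forcing a commit per batch.)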


--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Out of memory on some faceting queries

2013-04-03 Thread Dotan Cohen
On Wed, Apr 3, 2013 at 10:11 AM, Toke Eskildsen t...@statsbiblioteket.dk 
wrote:
 However, once per day I would like to facet on the text field,
 which is a free-text field usually around 1 KiB (about 100 words), in
 order to determine what the top keywords / topics are. That query
 would take up to 200 seconds to run, [...]

 If that query is somehow part of your warming, then I am surprised that
 search has worked at all with your commit frequency. That would however
 explain your OOM if you have multiple warmups running at the same time.


No, the 'heavy facet' is not part of the warming. I run it at most
once per day, at the end of the day. Solr is not shut down daily.

 It sounds like TermsComponent would be a better fit for getting top
 topics: https://wiki.apache.org/solr/TermsComponent


I had once looked at TermsComponent, but I think that I eliminated it
as a possibility because I actually need the top keywords related to a
specific keyword. For instance, I need to know which words are most
commonly used with the word coffee.
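
(In other words, the 'heavy facet' boils down to something like:

select?q=text:coffee&facet=true&facet.field=text&facet.limit=20&rows=0

i.e. faceting on the whole free-text field over the documents matching
the seed keyword, which is exactly the expensive part.)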


--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Out of memory on some faceting queries

2013-04-02 Thread Dotan Cohen
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:926)\n\tat
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:988)\n\tat
org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635)\n\tat
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)\n,code:500}}

I notice that this only occurs on queries that run facets. I start
Solr with the following command:
sudo nohup java -XX:NewRatio=1 -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-Dsolr.solr.home=/mnt/SolrFiles100/solr -jar
/opt/solr-4.1.0/example/start.jar &

The server seems to have enough memory:
$ free -m
             total       used       free     shared    buffers     cached
Mem:         14980      10604       4375          0        472       8078
-/+ buffers/cache:       2054      12925
Swap:            0          0          0

The server is 64-bit Ubuntu Server 12.04 LTS running Solr 4.1 and the
following Java:
$ java -version
java version 1.6.0_27
OpenJDK Runtime Environment (IcedTea6 1.12.3) (6b27-1.12.3-0ubuntu1~12.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)


Thanks.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Out of memory on some faceting queries

2013-04-02 Thread Dotan Cohen
On Tue, Apr 2, 2013 at 12:59 PM, Toke Eskildsen t...@statsbiblioteket.dk 
wrote:
 How many documents does your index have, how many fields do you facet on
 and approximately how many unique values does your facet fields have?


8971763 documents, growing at a rate of about 500 per minute. We
actually expect that to be ~5 per minute once we get out of
testing. Most documents are less than a KiB in the 'text' field, and
they have a few other fields which store short strings, dates, or
ints. You can think of these documents like tweets: short general
purpose text messages.

 I notice that this only occurs on queries that run facets. I start
 Solr with the following command:
 sudo nohup java -XX:NewRatio=1 -XX:+UseParNewGC
 -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
 -Dsolr.solr.home=/mnt/SolrFiles100/solr -jar
  /opt/solr-4.1.0/example/start.jar &

 You are not specifying any maximum heap size (-Xmx), which you should do
 in order to avoid unpleasant surprises. Facets and sorting are often
 memory hungry, but your system seems to have 13GB free RAM so the easy
 solution attempt would be to increase the heap until Solr serves the
 facets without OOM.


Thanks, I will start with -Xmx8g and test.
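
That would make the start command from earlier in this thread look like the following; the only change is the two heap flags, everything else is unchanged:

sudo nohup java -Xmx8g -Xms8g -XX:NewRatio=1 -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-Dsolr.solr.home=/mnt/SolrFiles100/solr -jar
/opt/solr-4.1.0/example/start.jar &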

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Out of memory on some faceting queries

2013-04-02 Thread Dotan Cohen
On Tue, Apr 2, 2013 at 2:41 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:
 9M documents in a heavily updated index with faceting. Maybe you are
 committing faster than the faceting can be prepared?
 https://wiki.apache.org/solr/FAQ#What_does_.22exceeded_limit_of_maxWarmingSearchers.3DX.22_mean.3F


Thank you Toke, this is exactly on my list of things to learn about
Solr. We do get the error mentioned and we cannot reduce the amount
of commits. Also, I do believe that we have the necessary server
resources (16 GiB RAM).

I have increased maxWarmingSearchers to 4, let's see how this goes.
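
For reference, that is a one-line change in solrconfig.xml (the element sits under the query section in the stock config):

<maxWarmingSearchers>4</maxWarmingSearchers>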

Thank you.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Out of memory on some faceting queries

2013-04-02 Thread Dotan Cohen
On Tue, Apr 2, 2013 at 5:33 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:
 On Tue, 2013-04-02 at 15:55 +0200, Dotan Cohen wrote:

 [Tokd: maxWarmingSearchers limit exceeded?]

 Thank you Toke, this is exactly on my list of things to learn about
 Solr. We do get the error mentioned and we cannot reduce the amount
 of commits. Also, I do believe that we have the necessary server
 resources (16 GiB RAM).

 Memory does not help you if you commit too frequently. If you commit
 each X seconds and warming takes X+Y seconds, then you will run out of
 memory at some point.

 I have increased maxWarmingSearchers to 4, let's see how this goes.

 If you still get the error with 4 concurrent searchers, you will have to
 either speed up warmup time or commit less frequently. You should be
 able to reduce facet startup time by switching to segment based faceting
 (at the cost of worse search-time performance) or maybe by using
 DocValues. Some of the current threads on the solr-user list is about
 these topics.

 How often do you commit and how many unique values does your facet
 fields have?

 Regards,
 Toke Eskildsen




-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Out of memory on some faceting queries

2013-04-02 Thread Dotan Cohen
On Tue, Apr 2, 2013 at 5:33 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:
 Memory does not help you if you commit too frequently. If you commit
 each X seconds and warming takes X+Y seconds, then you will run out of
 memory at some point.


How might I time the warming? I've been googling warming since your
earlier message but there does not seem to be any really good
documentation on the subject. If there is anything that you feel I
should be reading I would appreciate a link or a keyword to search on.
I've read the Solr wiki on caching and performance, but other than
that I don't see the issue addressed.
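
One place to read it, assuming the default core paths: each cache and the searcher expose a warmupTime statistic (in milliseconds) through the stats API, e.g.

http://127.0.0.1:8983/solr/core/admin/mbeans?stats=true

The same numbers appear in the admin UI under Plugins / Stats.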


 I have increased maxWarmingSearchers to 4, let's see how this goes.

 If you still get the error with 4 concurrent searchers, you will have to
 either speed up warmup time or commit less frequently. You should be
 able to reduce facet startup time by switching to segment based faceting
 (at the cost of worse search-time performance) or maybe by using
 DocValues. Some of the current threads on the solr-user list is about
 these topics.

 How often do you commit and how many unique values does your facet
 fields have?


Batches of 20-50 results are added to solr a few times a minute, and a
commit is done after each batch since I'm calling Solr as such:
http://127.0.0.1:8983/solr/core/update/json?commit=true

Should I remove commit=true and run a cron job to commit once per minute?

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Out of memory on some faceting queries

2013-04-02 Thread Dotan Cohen
 How often do you commit and how many unique values does your facet
 fields have?


Most of the time I facet on one field that has about twenty unique
values. However, once per day I would like to facet on the text field,
which is a free-text field usually around 1 KiB (about 100 words), in
order to determine what the top keywords / topics are. That query
would take up to 200 seconds to run, but it does not have to return
the results in real-time (the output goes to another process, not to a
waiting user).

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Don't cache filter queries

2013-03-22 Thread Dotan Cohen
On Thu, Mar 21, 2013 at 6:22 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : Just add {!cache=false} to the filter in your query
 : (http://wiki.apache.org/solr/SolrCaching#filterCache).
 ...
 :  I need to use the filter query feature to filter my results, but I
 :  don't want the results cached as documents are added to the index
  :  several times per second and the results will be stale immediately. Is
 :  there any way to disable filter query caching?

 Or remove the filterCache config option from your solrconfig.xml if you
 really don't want any caching of any filter queries.

  Frankly though: that's throwing the baby out with the bath water -- just
  because you are updating your index super-fast-like doesn't mean you
  aren't getting benefits from the caches, particularly from commonly
  reused filters which are applied to many queries which might get
  executed concurrently -- not to mention that a single filter might be
  reused multiple times within a single request to solr.

 disabling cache *warming* can make a lot of sense in NRT cases, but
 eliminating caching alltogether rarely does.


Thanks. The problem is that the queries with filter queries are taking
much longer to run (~60-80 ms) than the queries without (~1-4 ms). I
figured that the problem may have been with the caching.

In fact, running a query with a filter query and caching disabled is
running in the range of 16-30 ms, which is quite an improvement.
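
For reference, the query-time form looks like this; the field names here are placeholders, not from the thread:

http://127.0.0.1:8983/solr/core/select?q=text:coffee&fq={!cache=false}user_id:12345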

Thanks.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Don't cache filter queries

2013-03-21 Thread Dotan Cohen
I need to use the filter query feature to filter my results, but I
don't want the results cached as documents are added to the index
several times per second and the results will be stale immediately. Is
there any way to disable filter query caching?

This is on Solr 4.1 running in Jetty on Ubuntu Server. Thanks.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Returning to Solr 4.0 from 4.1

2013-03-03 Thread Dotan Cohen
On Sat, Mar 2, 2013 at 9:32 PM, Upayavira u...@odoko.co.uk wrote:
 What I'm questioning is whether the issue you see in 4.1 has been
 resolved in Subversion. While I would not expect 4.0 to read a 4.1
 index, the SVN branch/4.2 should be able to do so effortlessly.

 Upayavira


I see, thanks. Actually, running a clean 4.1 with no previous index
does not have the issues.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Returning to Solr 4.0 from 4.1

2013-03-02 Thread Dotan Cohen
On Fri, Mar 1, 2013 at 1:37 PM, Upayavira u...@odoko.co.uk wrote:
 Can you use a checkout from SVN? Does that resolve your issues? That is
 what will become 4.2 when it is released soon:

 https://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/

 Upayavira


Thank you. Which feature of 4.2 are you suggesting for this issue? Can
Solr 4.2 natively import from a Solr index?


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Returning to Solr 4.0 from 4.1

2013-03-01 Thread Dotan Cohen
Solr 4.1 has been giving us much trouble, rejecting documents at index
time. While I work my way through this, I would like to move our
application back to Solr 4.0. However, when I try to start Solr with
the same index that was created with Solr 4.0 but has been running on
4.1 for a few days, I get this error chain:

org.apache.solr.common.SolrException: Error opening new searcher
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
Caused by: java.lang.IllegalArgumentException: A SPI class of type
org.apache.lucene.codecs.Codec with name 'Lucene41' does not exist.
You need to add the corresponding JAR file supporting this SPI to your
classpath.The current classpath supports the following names:
[Lucene40, Lucene3x]

Obviously I'll not be installing Lucene41 in Solr 4.0, but is there
any way to work around this? Note that neither solrconfig.xml nor
schema.xml has changed. Thanks.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Returning to Solr 4.0 from 4.1

2013-03-01 Thread Dotan Cohen
On Fri, Mar 1, 2013 at 11:28 AM, Rafał Kuć r@solr.pl wrote:
 Hello!

 I suppose the only way to make this work will be reindexing the data.
 Solr 4.1 uses Lucene 4.1 as you know, which introduced new default
 codec with stored fields compression and this is one of the reasons
 you can't read that index with 4.0.


Thank you. My first inclination is to reindex the documents, but the
only store of these documents is the Solr index itself. I am trying to
find solutions to create a new core and to index the data in the old
core into the new core. I'm not finding any good ways of going about
this.

Note that we are talking about ~18,000,000 (yes, 18 million) small
documents similar to 'tweets' (mostly under 1 KiB each, very very few
over 5 KiB).


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Returning to Solr 4.0 from 4.1

2013-03-01 Thread Dotan Cohen
On Fri, Mar 1, 2013 at 11:59 AM, Rafał Kuć r@solr.pl wrote:
 Hello!

 I assumed that re-indexing can be painful in your case, if it wouldn't
 you probably would re-index by now :) I guess (didn't test it myself),
 that you can create another collection inside your cluster, use the
 old codec for Lucene 4.0 (setting the version in solrconfig.xml should
 be enough) and re-indexing, but still re-indexing will have to be
 done. Or maybe someone knows a better way ?


Will I have to reindex via an external bridging script, such as a
Python script which requests N documents at a time, indexes them into
Solr 4.1, then requests another N documents to index? Or is there an
internal Solr / Lucene facility for this? I've actually looked for
such a facility, but I am unable to find one, hence my asking.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Returning to Solr 4.0 from 4.1

2013-03-01 Thread Dotan Cohen
On Fri, Mar 1, 2013 at 12:22 PM, Rafał Kuć r@solr.pl wrote:
 Hello!

 As far as I know you have to re-index using external tool.


Thank you Rafał. That is what I figured.
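
A rough sketch of such a bridging script, in Python as suggested above. This is an illustration, not a tested tool: the core names and batch size are assumptions, it only carries over stored fields, it drops the internal _version_ field, and start/rows paging gets slow at large offsets on an 18-million-document index.

import json, urllib, urllib2

OLD = 'http://127.0.0.1:8983/solr/oldcore'
NEW = 'http://127.0.0.1:8983/solr/newcore'
ROWS = 1000

start = 0
while True:
    # Fetch the next batch from the old core, in a stable order.
    params = urllib.urlencode({'q': '*:*', 'sort': 'id asc',
                               'start': start, 'rows': ROWS, 'wt': 'json'})
    docs = json.load(urllib2.urlopen(OLD + '/select?' + params))['response']['docs']
    if not docs:
        break
    for doc in docs:
        doc.pop('_version_', None)  # let the new core assign its own versions
    # Post the batch as a JSON array to the new core.
    req = urllib2.Request(NEW + '/update/json', json.dumps(docs),
                          {'Content-Type': 'application/json'})
    urllib2.urlopen(req)
    start += ROWS

urllib2.urlopen(NEW + '/update?commit=true')  # single commit at the end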



-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Solr 4 Spatial: NoClassDefFoundError: com/vividsolutions/jts/geom/Geometry

2013-02-27 Thread Dotan Cohen
On Wed, Feb 27, 2013 at 10:24 AM, Smiley, David W. dsmi...@mitre.org wrote:
 Dotan,

 http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4#Configuration
 You need to put the its jar within Solr's WEB-INF/lib; unfortunately you
 can't simply reference it via a lib entry and put it wherever.  FWIW you
 can find the same question and my response on Stackoverflow.

 ~ David


Thank you David. In fact I do frequent Stack Overflow.
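
For anyone else hitting this with the stock Jetty example, the copy step is roughly the following; the exploded-webapp path is an assumption based on the 4.x example layout (it exists after the first start, and Solr must be restarted afterwards):

cp jts-1.12.jar /opt/solr-4.1.0/example/solr-webapp/webapp/WEB-INF/lib/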

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Can't search words in quotes

2013-02-27 Thread Dotan Cohen
On Thu, Feb 28, 2013 at 8:14 AM, Alex Cougarman acoug...@bwc.org wrote:
 Thanks, Oussama. That was very useful information and we have added the 
 double quotes. One interesting trick: we had to change the way we did it to 
 wrap the pattern value in single quotes so we could have double quotes inside.


Hi Alex. Would you mind posting the new analyzers?

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Solr 4 Spatial: NoClassDefFoundError: com/vividsolutions/jts/geom/Geometry

2013-02-20 Thread Dotan Cohen
Note that the issue is present in Solr 4.1 as well.

I did find this post, which is not very encouraging:
http://grokbase.com/t/lucene/solr-user/128sz03jdk/recursiveprefixtreestrategy-class-not-found

Might the name of the class be simply a typo that is easily rectified?
How might one go about checking which classes are available?

Thank you.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: What should focus be on hardware for solr servers?

2013-02-19 Thread Dotan Cohen
On Thu, Feb 14, 2013 at 5:54 PM, Michael Della Bitta
michael.della.bi...@appinions.com wrote:
 My dual-core, HT-enabled Dell Latitude from last year has this CPU:
 model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
 bogomips: 4988.65

 An m3.xlarge reports:
 model name : Intel(R) Xeon(R) CPU   E5645  @ 2.40GHz
 bogomips : 4000.14

  I tried running geekbench and phoronix-test-suite and failed at both...
 Anybody have a favorite, free, CLI benchmarking suite?


I'll suggest to the Phoronix team that they include some Solr tests in
their suite. Solr does seem to be a perfect test for Phoronix, and much
more relevant for some readers than John the Ripper or Quake.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Error: _version_field must exist in schema

2012-11-22 Thread Dotan Cohen
On Thu, Nov 22, 2012 at 9:26 PM, Nick Zadrozny n...@onemorecloud.com wrote:
 Belated reply, but this is probably something you should let us know about
 directly at supp...@onemorecloud.com if it happens again. Cheers.


Hi Nick. This particular issue was on a Solr 4 instance on AWS, not on
the Websolr account. But I commend you taking notice and taking an
interest. Thank you!

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Best way to retrieve 20 specific documents

2012-11-21 Thread Dotan Cohen
On Tue, Nov 20, 2012 at 12:45 AM, Shawn Heisey s...@elyograg.org wrote:
 You can also use this query format:

 id:(123 OR 456 OR 789)

 This does get expanded internally by the query parser to the format that has
 the field name on every clause, but it is sometimes easier to write code
 that produces the above form.


Thank you Shawn, that is much cleaner and will be easier to debug when
/ if things go wrong.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Exponential omitNorms

2012-11-08 Thread Dotan Cohen
On Wed, Nov 7, 2012 at 5:16 PM, Walter Underwood wun...@wunderwood.org wrote:
 You are probably thinking of SweetSpotSimilarity. You might also want to look 
 at pivoted document normalization.


Thanks, I'll take a look at that.
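
If I'm reading the 4.0 docs right, wiring it in is a one-line schema.xml change (this sets it globally; it can also be set per fieldType):

<similarity class="solr.SweetSpotSimilarityFactory"/>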

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Exponential omitNorms

2012-11-07 Thread Dotan Cohen
On Wed, Nov 7, 2012 at 2:01 PM, Otis Gospodnetic
otis.gospodne...@gmail.com wrote:
 The name escapes me but Lucene used to have a special Similarity impl for
 this sort of stuff. I think it's still there.

 We implemented a slightly better Similarity that used Gaussian distribution
 and was thus smoother. Try doing that.

 Otis

Thanks, Otis. I'll start googling for Solr and Lucene Similarity.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: After adding field to schema, the field is not being returned in results.

2012-11-04 Thread Dotan Cohen
On Fri, Nov 2, 2012 at 4:32 PM, Erick Erickson erickerick...@gmail.com wrote:
  Well, I'm at my wit's end. I tried your field definitions (using the
  exampledocs XML) and they work just fine. As for messing up the date
  on the way in: you should be seeing stack traces in your log files.


Please don't go to wit's end on this! I'm a bit frustrated too, but I
really don't want to bring the frustration to the mailing list!


 The only way I see not getting the Sorry, no Term Info available :(
 message is if you don't have any values in the field. So, my guess is that
 you're not getting the format right and the docs aren't getting indexed,
 but that's just a guess. You can freely sort even if there are no values at
 all in a particular field. This can be indicated if you sort asc and desc
 and the order doesn't change. It just means the field is defined in the
 schema, not necessarily that there are any values in it.

 So, I claim you have no date values in your index. The fact that you can
 sort is just an artifact of sortMissingFirst/Last doing something sensible.

 Next question, are you absolutely sure that your indexing program and your
 searching program are pointing at the same server?

 So what I'd do next is
 1 create a simple XML doc that conforms to your schema and use the
 post.jar tool to send it to your server. Watch the output log for any date
 format exceptions.
 2 Use the admin UI to insure that you can see terms in docs added this way.
 3 from there back up and see what step in the indexing process isn't
 working (assuming that's the problem). Solr logs help here.

 Note I'm completely PHP-ignorant, I have no clue whether the formatting
 you're doing is OK or not. You might try logging the value somewhere in
 your php so you an post that and/or include it in your sample XML file...


Yes, it seems that there is a contradiction. On one hand, by appending
the value of the created_iso8601 field to another field, I can ensure
that the value is legal and does exist! On the other hand, it seems
that there is no such value being stored in the index, but new
documents are being added that ostensibly should have that value.

I'll try adding a document with post.jar and see what happens. I'll
update the thread.

Thanks!

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: After adding field to schema, the field is not being returned in results.

2012-11-04 Thread Dotan Cohen
On Sat, Nov 3, 2012 at 4:23 AM, Lance Norskog goks...@gmail.com wrote:
  If any value is in a bogus format, the entire document batch in that HTTP
  request fails. That is the right timestamp format.
  The index may be corrupted somehow. Can you try removing all of the files
  in data/ and trying again?


Thanks. Seeing as all the documents are being added, the
created_iso8601 field must either hold a valid format or be empty.
I've pretty much ruled out empty in the code, but there is still
nothing in the index.

I'll play around some more and update the list. At least I am learning.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: After adding field to schema, the field is not being returned in results.

2012-11-02 Thread Dotan Cohen
On Thu, Nov 1, 2012 at 9:09 PM, Erick Erickson erickerick...@gmail.com wrote:
 What happens if you sort ascending rather than descending? Depending on
 what (if anything) you've done with sortMissingFirst/Last on that field,
 it's possible that you're just seeing the results of the sort and docs with
 your new field are somewhere down the list. If you've done nothing, you
 should be seeing the docs with the new field at the top of the list  with
 the query you posted, so this is grasping at straws a bit.


Thanks.  Sorting ASC or DESC still does not show the field, even in
documents for which the field should exist based on the time that it
was created. However, I am starting to wonder that perhaps my
application is creating the wrong field values and perhaps that is why
the field don't exist. This is the field in question:
<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
precisionStep="6" positionIncrementGap="0"/>
<field name="created_iso8601" type="tdate" stored="true"
multiValued="false" indexed="true"/>

My application is writing dates in this format (ISO 8601):
2012-11-02T13:48:10Z

Here is the PHP code:
date('Y-m-d\TH:i:s\Z')

I am setting the server timezone, as PHP = 5.1 requires:
date_default_timezone_set('UTC');



 The solr admin page, try going to collectionschema browser and choose the
 field in question from your drop-down. see if it looks like it is stored
 and indexed, and see what some of the values are. This is getting the vals
 from the indexed terms, but at least it should say stored in the schema
 and index sections. If it doesn't, then you somehow have a mismatch between
 your schema and what's actually in your index. This really shouldn't be the
 case since it's a brand-new field


Sorry, no Term Info available :(

Alright, so it is an issue with the data that I'm feeding it. Would
Solr include the fields with good data and reject the fields with bad
data, but update the Document anyway? I can confirm that the variable
used to populate the field in question is not empty.



 Two other things I'd try.
 1 If you have the ID of the document you're _sure_ has a date field in it
 try your query just on that, with fl=*. This would avoid any possible
 sortMissingFirst/Last issues.


Yes, I've done that, with and without fl.



 2 Another way to restrict this would be to add an fq clause to the query
 so docs without the field would not be displayed, something like
 fq=[NOW-1YEAR TO NOW] assuming your dates are in the last year.

 But I guess we're down to needing to see the schema definition etc. if that
 doesn't work.

 Best
 Erick

Thanks, Erick. It does look like the issue is that the field remains
empty. Perhaps I'm writing ISO 8601 wrong; I'll get to looking at that
now. I'm surprised that Solr accepts documents with bad data in some
of the fields, so I will look into that as well.

Have a peaceful Saturday.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: After adding field to schema, the field is not being returned in results.

2012-11-02 Thread Dotan Cohen
On Thu, Nov 1, 2012 at 9:28 PM, Lance Norskog goks...@gmail.com wrote:
 Have you uploaded data with that field populated? Solr is not like a 
 relational database. It does not automatically populate a new field when you 
 add it to the schema. If you sort on a field, a document with no data in that 
 field comes first or last (I don't know which).


Thank you. In fact, I am being careful to try to pull up records after
the date in which the application was updated to populate the field.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: After adding field to schema, the field is not being returned in results.

2012-11-01 Thread Dotan Cohen
On Thu, Nov 1, 2012 at 3:00 PM, Erick Erickson erickerick...@gmail.com wrote:
 I'd try several things

 1 just because you an sort has nothing to do with whether the field is
 returned. Sorting uses the indexed data, returning it is the stored data.
 So it's a bit of a red herring when you can sort on a field but not see it,
 although it is a good test that your schema knows about the field.


Right, the only thing that I was testing for is that the field is
recognised by Solr. If I had a typo in the field name in the query or
in the schema, Solr would have complained.


 2 Try fl=* just for yucks.


yuck, yuck, but didn't work!


 3 Check your schema.xml for typos. stroed=true for instance?

No, tried that. Good thinking, though.


 4 Why are you restricting your returns to 1 and only one document? Are you
 absolutely sure that that document has the new field? Solr happily sorts
 documents that do not have a value for a field, that's the purpose of
 sortMissingFirst/Last.


All the newest documents have the field, and I'm sorting by time
descending. In fact, I did test with more rows, but for the mailing
list I wanted the output to be concise.

Thanks.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Eliminate or reduce fieldNorm as a consideration.

2012-10-31 Thread Dotan Cohen
On Wed, Oct 31, 2012 at 11:50 PM, Ahmet Arslan iori...@yahoo.com wrote:
 omitNorms=true|false

 This is arguably an advanced option. Set to true to omit the norms associated 
 with this field (this disables length normalization and index-time boosting 
 for the field, and saves some memory). Only full-text fields or fields that 
 need an index-time boost need norms. 

 http://wiki.apache.org/solr/SchemaXml

Thank you, but I am looking for a query-time modifier. I do need the
fieldNorm enabled in the general sense.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Eliminate or reduce fieldNorm as a consideration.

2012-10-31 Thread Dotan Cohen
On Wed, Oct 31, 2012 at 11:44 PM, Jack Krupansky
j...@basetechnology.com wrote:
 Scoring or ranking of document relevancy is called similarity. You can
 create your own similarity class, or even have a field-specific similarity
 class.

 See, for example:
 http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
 http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/Similarity.html

 and

 http://wiki.apache.org/solr/SchemaXml#Similarity


Thank you Jack. That seems extraordinarily rigid, in the sense that
one could not adjust score-component coefficients on the fly. Surely
I'm not the first dev to run into an issue with the default scoring
algorithm and want to tweak it only for specific queries!

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Eliminate or reduce fieldNorm as a consideration.

2012-10-31 Thread Dotan Cohen
On Thu, Nov 1, 2012 at 12:16 AM, Jack Krupansky j...@basetechnology.com wrote:
 You could write a custom search component that checked for your desired
 request parameters, and then it could set them for a custom similarity
 class, which you would also have to write.


Perhaps, but if I'm going that route I would have it recognize some
LocalParams (such as omitNorms=true right there) to be flexible at
query time. I'm actually surprised that this doesn't yet exist.

Thanks.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: After adding field to schema, the field is not being returned in results.

2012-10-31 Thread Dotan Cohen
On Thu, Nov 1, 2012 at 2:52 AM, Otis Gospodnetic
otis.gospodne...@gmail.com wrote:
 Hi,

  That should work just fine.  It's either a bug or you are doing something
  you didn't mention.  Maybe you can provide a small, self-contained unit test
  and stick it in JIRA?


I would assume that it's me doing something wrong! How does this look:

/solr/select?q=*:*&rows=1&sort=created_iso8601%20desc&fl=created_iso8601,created

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="q">*:*</str>
      <str name="rows">1</str>
      <str name="fl">created_iso8601,created</str>
    </lst>
  </lst>
  <result name="response" numFound="1037937" start="0">
    <doc>
      <int name="created">1350854389</int>
    </doc>
  </result>
</response>

Surely the sort parameter would throw an error if the
created_iso8601 field did not exist. That field is indexed and stored,
and no parameters are defined on the handlers that might limit the
fields returned, as Alexandre had mentioned.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Occasional Solr performance issues

2012-10-29 Thread Dotan Cohen
On Mon, Oct 29, 2012 at 7:04 AM, Shawn Heisey s...@elyograg.org wrote:
 They are indeed Java options.  The first two control the maximum and
 starting heap sizes.  NewRatio controls the relative size of the young and
 old generations, making the young generation considerably larger than it is
 by default.  The others are garbage collector options.  This seems to be a
 good summary:

 http://www.petefreitag.com/articles/gctuning/

 Here's the official Sun (Oracle) documentation on GC tuning:

 http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html


Thank you Shawn! Those are exactly the documents that I need. Google
should hire you to fill in the pages when someone searches for java
garbage collection. Interestingly, I just checked and bing.com does
list the Oracle page on the first page of results. I shudder to think
that I might have to switch search engines!

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Occasional Solr performance issues

2012-10-28 Thread Dotan Cohen
On Fri, Oct 26, 2012 at 11:04 PM, Shawn Heisey s...@elyograg.org wrote:
 Warming doesn't seem to be a problem here -- all your warm times are zero,
 so I am going to take a guess that it may be a heap/GC issue.  I would
 recommend starting with the following additional arguments to your JVM.
 Since I have no idea how solr gets started on your server, I don't know
 where you would add these:

 -Xmx4096M -Xms4096M -XX:NewRatio=1 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
 -XX:+CMSParallelRemarkEnabled


Thanks. I've added those flags to the Solr line that I use to start
Solr. Those are Java flags, not Solr, correct? I'm googling the flags
now, but I find it interesting that I cannot find a canonical
reference for them.


 This allocates 4GB of RAM to java, sets up a larger than normal Eden space
 in the heap, and uses garbage collection options that usually fare better in
 a server environment than the default. Java memory management options are
 like religion to some people ... I may start a flamewar with these
 recommendations. ;)  The best I can tell you about these choices: They made
 a big difference for me.


Thanks. I will experiment with them empirically. The first step is to
learn to read the debug info, though. I've been googing for days, but
I must be missing something. Where is the information that I pasted in
pastebin documented?


 I would also recommend switching to a Sun/Oracle jvm.  I have heard that
 previous versions of Solr were not happy on variants like OpenJDK, I have no
 idea whether that might still be the case with 4.0.  If you choose to do
 this, you probably have package choices in Ubuntu.  I know that in Debian,
 the package is called sun-java6-jre ... Ubuntu is probably something
 similar. Debian has a CLI command 'update-java-alternatives' that will
 quickly switch between different java implementations that are installed.
 Hopefully Ubuntu also has this.  If not, you might need the following
 command instead to switch the main java executable:

 update-alternatives --config java


Thanks, I will take a look at the current Oracle JVM.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Occasional Solr performance issues

2012-10-26 Thread Dotan Cohen
On Wed, Oct 24, 2012 at 4:33 PM, Walter Underwood wun...@wunderwood.org wrote:
 Please consider never running optimize. That should be called force merge.


Thanks. I have been letting the system run for about two days already
without an optimize. I will let it run a week, then merge to see the
effect.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Occasional Solr performance issues

2012-10-26 Thread Dotan Cohen
I spoke too soon! Whereas three days ago, when the index was new, 500
records could be written to it in 3 seconds, now that operation is
taking a minute and a half, sometimes longer. I ran optimize() but
that did not help the writes. What can I do to improve the write
performance?

Even opening the Logging tab of the Solr instance is taking quite a
long time. In fact, I just left it for 20 minutes and it still hasn't
come back with anything. I do have an SSH window open on the server
hosting Solr and it doesn't look overloaded at all:

$ date && du -sh data/ && uptime && free -m
Fri Oct 26 13:15:59 UTC 2012
578M    data/
 13:15:59 up 4 days, 17:59,  1 user,  load average: 0.06, 0.12, 0.22
             total       used       free     shared    buffers     cached
Mem:         14980       3237      11743          0        284
-/+ buffers/cache:        729      14250
Swap:            0          0          0


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Occasional Solr performance issues

2012-10-26 Thread Dotan Cohen
On Fri, Oct 26, 2012 at 4:02 PM, Shawn Heisey s...@elyograg.org wrote:

 Taking all the information I've seen so far, my bet is on either cache
 warming or heap/GC trouble as the source of your problem.  It's now specific
 information gathering time.  Can you gather all the following information
 and put it into a web paste page, such as pastie.org, and reply with the
 link?  I have gathered the same information from my test server and created
 a pastie example. http://pastie.org/5118979

 On the dashboard of the GUI, it lists all the jvm arguments. Include those.

 Click Java Properties and gather the java.runtime.version and
 java.specification.vendor information.

 After one of the long update times, pause/stop your indexing application.
 Click on your core in the GUI, open Plugins/Stats, and paste the following
 bits with a header to indicate what each section is:
 CACHE-filterCache
 CACHE-queryResultCache
 CORE-searcher

 Thanks,
 Shawn


Thank you Shawn. The information is here:
http://pastebin.com/aqEfeYVA

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Occasional Solr performance issues

2012-10-24 Thread Dotan Cohen
On Tue, Oct 23, 2012 at 3:07 PM, Erick Erickson erickerick...@gmail.com wrote:
 Maybe you've been looking at it but one thing that I didn't see on a fast
 scan was that maybe the commit bit is the problem. When you commit,
 eventually the segments will be merged and a new searcher will be opened
 (this is true even if you're NOT optimizing). So you're effectively committing
 every 1-2 seconds, creating many segments which get merged, but more
 importantly opening new searchers (which you are getting since you pasted
 the message: Overlapping onDeckSearchers=2).

 You could pinpoint this by NOT committing explicitly, just set your autocommit
 parameters (or specify commitWithin in your indexing program, which is
 preferred). Try setting it at a minute or so and see if your problem goes away
 perhaps?

 The NRT stuff happens on soft commits, so you have that option to have the
 documents immediately available for search.



Thanks, Erick. I'll play around with different configurations. So far
just removing the periodic optimize command worked wonders. I'll see
how much it helps or hurts to run that daily or more or less frequent.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Occasional Solr performance issues

2012-10-22 Thread Dotan Cohen
When Solr is slow, I'm seeing these in the logs:
[collection1] Error opening new searcher. exceeded limit of
maxWarmingSearchers=2, try again later.
[collection1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2

Googling, I found this in the FAQ:
Typically the way to avoid this error is to either reduce the
frequency of commits, or reduce the amount of warming a searcher does
while it's on deck (by reducing the work in newSearcher listeners,
and/or reducing the autowarmCount on your caches)
http://wiki.apache.org/solr/FAQ#What_does_.22PERFORMANCE_WARNING:_Overlapping_onDeckSearchers.3DX.22_mean_in_my_logs.3F

I happen to know that the script will try to commit once every 60
seconds. How does one reduce the work in newSearcher listeners? What
effect will this have? What effect will reducing the autowarmCount on
caches have?
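
For reference, both knobs live in solrconfig.xml; the values below are illustrative, not recommendations. The warming work is the list of queries under the newSearcher listener, and autowarmCount is a per-cache attribute:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- fewer and cheaper queries here means less warming work -->
  </arr>
</listener>

<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="0"/>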

Thanks.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Occasional Solr performance issues

2012-10-22 Thread Dotan Cohen
On Mon, Oct 22, 2012 at 5:02 PM, Rafał Kuć r@solr.pl wrote:
 Hello!

 You can check if the long warming is causing the overlapping
 searchers. Check Solr admin panel and look at cache statistics, there
 should be warmupTime property.


Thank you, I have gone over the Solr admin panel twice and I cannot
find the cache statistics. Where are they?


 Lowering the autowarmCount should lower the time needed to warm up,
 howere you can also look at your warming queries (if you have such)
 and see how long they take.


Thank you, I will look at that!

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Occasional Solr performance issues

2012-10-22 Thread Dotan Cohen
On Mon, Oct 22, 2012 at 5:27 PM, Mark Miller markrmil...@gmail.com wrote:
 Are you using Solr 3X? The occasional long commit should no longer
 show up in Solr 4.


Thank you Mark. In fact, this is the production release of Solr 4.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Occasional Solr performance issues

2012-10-22 Thread Dotan Cohen
On Mon, Oct 22, 2012 at 7:29 PM, Shawn Heisey s...@elyograg.org wrote:
 On 10/22/2012 9:58 AM, Dotan Cohen wrote:

 Thank you, I have gone over the Solr admin panel twice and I cannot find
 the cache statistics. Where are they?


 If you are running Solr4, you can see individual cache autowarming times
 here, assuming your core is named collection1:

 http://server:port/solr/#/collection1/plugins/cache?entry=queryResultCache
 http://server:port/solr/#/collection1/plugins/cache?entry=filterCache

 The warmup time for the entire searcher can be found here:

 http://server:port/solr/#/collection1/plugins/core?entry=searcher



Thank you Shawn! I can see how I missed that data. I'm reviewing it
now. Solr has a low barrier to entry, but quite a learning curve. I'm
loving it!

I see that the server is using less than 2 GiB of memory, whereas it
is a dedicated Solr server with 16 GiB of memory. I understand that I
can increase the query and document caches to increase performance,
but I worry that this will increase the warm-up time to unacceptable
levels. What is a good strategy for increasing the caches yet
preserving performance after an optimize operation?

Thanks.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Occasional Solr performance issues

2012-10-22 Thread Dotan Cohen
On Mon, Oct 22, 2012 at 9:22 PM, Mark Miller markrmil...@gmail.com wrote:
 Perhaps you can grab a snapshot of the stack traces when the 60 second
 delay is occurring?

 You can get the stack traces right in the admin ui, or you can use
 another tool (jconsole, visualvm, jstack cmd line, etc)

Thanks. I've refactored so that the index is optimized once per hour,
instead after each dump of commits. But when I will need to increase
the optmize frequency in the future I will go through the stack
traces. Thanks!

In any case, the server has an extra 14 GiB of memory available, how
might I make the best use of that for Solr assuming both heavy reads
and writes?

Thanks.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Occasional Solr performance issues

2012-10-22 Thread Dotan Cohen
On Mon, Oct 22, 2012 at 10:01 PM, Walter Underwood
wun...@wunderwood.org wrote:
 First, stop optimizing. You do not need to manually force merges. The system 
 does a great job. Forcing merges (optimize) uses a lot of CPU and disk IO and 
 might be the cause of your problem.


Thanks. Looking at the index statistics, I see that within minutes
after running optimize that the stats say the index needs to be
reoptimized. Though, the index still reads and writes fine even in
that state.


 Second, the OS will use the extra memory for file buffers, which really 
 helps performance, so you might not need to do anything. This will work 
 better after you stop forcing merges. A forced merge replaces every file, so 
 the OS needs to reload everything into file buffers.


I don't see that the memory is being used:

$ free -g
             total       used       free     shared    buffers     cached
Mem:            14          2         12          0          0          1
-/+ buffers/cache:           0         14
Swap:            0          0          0

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Occasional Solr performance issues

2012-10-22 Thread Dotan Cohen
On Mon, Oct 22, 2012 at 10:44 PM, Walter Underwood
wun...@wunderwood.org wrote:
 Lucene already did that:

 https://issues.apache.org/jira/browse/LUCENE-3454

 Here is the Solr issue:

 https://issues.apache.org/jira/browse/SOLR-3141

 People over-use this regardless of the name. In Ultraseek Server, it was 
 called force merge and we had to tell people to stop doing that nearly 
 every month.


Thank you for those links. I commented on the Solr bug. There are some
very insightful comments in there.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Occasional Solr performance issues

2012-10-22 Thread Dotan Cohen
On Tue, Oct 23, 2012 at 3:52 AM, Shawn Heisey s...@elyograg.org wrote:
 As soon as you make any change at all to an index, it's no longer
 optimized.  Delete one document, add one document, anything.  Most of the
 time you will not see a performance increase from optimizing an index that
 consists of one large segment and a bunch of very tiny segments or deleted
 documents.


I've since realized that by experimentation. I've probably saved quite
a few minutes of reading time by investing hours of experiment time!


  How big is your index, and did you run this right after a reboot?  If you
  did, then the cache will be fairly empty, and Solr has only read enough from
  the index files to open the searcher. The number is probably too small to
  show up on a gigabyte scale.  As you issue queries, the cached amount will
  get bigger.  If your index is small enough to fit in the 14GB of free RAM
  that you have, you can manually populate the disk cache by going to your
  index directory and doing 'cat * > /dev/null' from the commandline or a
  script.  The first time you do it, it may go slowly, but if you immediately
  do it again, it will complete VERY fast -- the data will all be in RAM.


The cat trick to get the files in RAM is great. I would not have
thought that would work for binary files.

The index is small, much less than the available RAM, for the time
being. Therefore, there was nothing to fill it with I now understand.
Both 'free' outputs were after the system had been running for some
time.


 The 'free -m' command in your first email shows cache usage of 1243MB, which
 suggests that maybe your index is considerably smaller than your available
 RAM.  Having loads of free RAM is a good thing for just about any workload,
  but especially for Solr. Try running the free command without the -g so you
 can see those numbers in kilobytes.

 I have seen a tendency towards creating huge caches in Solr because people
 have lots of memory.  It's important to realize that the OS is far better at
 the overall job of caching the index files than Solr itself is.  Solr caches
 are meant to cache result sets from queries and filters, not large sections
 of the actual index contents.  Make the caches big enough that you see some
 benefit, but not big enough to suck up all your RAM.


I see, thanks.


 If you are having warm time problems, make the autowarm counts low.  I have
 run into problems with warming on my filter cache, because we have filters
 that are extremely hairy and slow to run. I had to reduce my autowarm count
 on the filter cache to FOUR, with a cache size of 512.  When it is 8 or
 higher, it can take over a minute to autowarm.


I will have to experiment with the warming. Thank you for the tips.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Error: _version_field must exist in schema

2012-10-18 Thread Dotan Cohen
On Thu, Oct 18, 2012 at 12:25 AM, Rafał Kuć r@solr.pl wrote:
 Hello!

 You can some find information about requirements of SolrCloud at
 http://wiki.apache.org/solr/SolrCloud . I don't know if _version_ is
 mentioned elsewhere.

 As for Websolr - I'm afraid I can't say anything about the cause of
 those errors without seeing the exception.


I see, thanks. I don't think that I'm using the SolrCloud feature. Is
it enabled because both solr/collection1 and multicore/core0 exist?

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Error: _version_field must exist in schema

2012-10-18 Thread Dotan Cohen
On Thu, Oct 18, 2012 at 9:21 AM, Rafał Kuć r@solr.pl wrote:
 Hello!

 Look at your solrconfig.xml file, you should see something like that:

  <updateLog>
    <str name="dir">${solr.data.dir:}</str>
  </updateLog>

 Just remove it and Solr shouldn't bother you with the version field
 information. However remember that some features won't work (like the
 real time get or partial documents update).


Thank you. Is there any place where this is documented? It certainly
does not appear in the relevant wiki page:
http://wiki.apache.org/solr/SolrConfigXml


 You can also add _version_ field to your schema and forget about it.
 You don't need to do anything with it as it is used internally by
 Solr.


That is exactly my plan, but I would also like to understand more
about what is going on. I don't like cut-and-paste programming.

Thank you very much!
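
For reference, the definition from the stock 4.0 example schema is:

<field name="_version_" type="long" indexed="true" stored="true"/>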


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Error: _version_field must exist in schema

2012-10-18 Thread Dotan Cohen
On Thu, Oct 18, 2012 at 1:06 PM, Erick Erickson erickerick...@gmail.com wrote:
 I've updated the schema.xml page, see
 http://wiki.apache.org/solr/SchemaXml#Recommended_fields


Great, thanks!


 Care to change the schema.xml file to warn about this too and
 submit a patch?


If you are referring to the example schema.xml file provided with
Solr, then I'd love to. I'm signing up for the dev list now. Thanks.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Error: _version_field must exist in schema

2012-10-17 Thread Dotan Cohen
On Thu, Oct 18, 2012 at 12:09 AM, Rafał Kuć r@solr.pl wrote:
 Hello!

 The _version_ field is needed by some of Solr 4.0 functionality like
 transaction log or partial documents update. If you want to use them,
 just update your schema.xml and put the _version_ field definition
 there.

 However if you don't want those, you can remove the transaction log
 configuration in your solrconfig.xml. However please remember that
 when using SolrCloud you'll need that field.


Thanks. Where is that bit documented? I don't see it on the Solr wiki:
http://wiki.apache.org/solr/SchemaXml

I do have a Solr 4 Beta index running on Websolr that does not have
such a field. It works, but throws many Service Unavailable and
Communication Error errors. Might the lack of the _version_ field be
the reason?

Thanks.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


How does Solr know which relative paths to use?

2012-10-16 Thread Dotan Cohen
I have just installed Solr 4.0 on a test server. I start it like so:
$ pwd
/some/dir
$ java -jar start.jar

The Solr Instance now looks like this:
CWD
/some/dir
Instance
/some/dir/solr/collection1
Data
/some/dir/solr/collection1/data
Index
/some/dir/solr/collection1/data/index

Where did the additional relative paths 'collection1',
'collection1/data', and 'collection1/data/index' come from? I know
that I can change the value of CWD with the -Dsolr.solr.home flag, but
what determines the relative paths mentioned?

Thanks.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: How does Solr know which relative paths to use?

2012-10-16 Thread Dotan Cohen
On Wed, Oct 17, 2012 at 12:16 AM, P Williams
williams.tricia.l...@gmail.com wrote:
 Hi Dotan,

 It seems that the examples now use Multiple
 Coreshttp://wiki.apache.org/solr/CoreAdminby default.  If your test
 server is based on the stock example, you should
 see a solr.xml file in your CWD path which is how Solr knows about the
 relative paths.  There should also be a README.txt file that will tell you
 more about how the directory is expected to be organized.

 Cheers,
 Tricia


Thanks. I read the top-level README.txt but now I see that the answer
is in the solr/README.txt file.
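
For anyone following along, the stock solr.xml looks roughly like this, which is where the collection1 name (and hence the collection1/data and collection1/data/index paths) comes from:

<solr persistent="false">
  <cores adminPath="/admin/cores" defaultCoreName="collection1">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>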

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Return only matched multiValued field

2012-09-24 Thread Dotan Cohen
  <field name="doctest" type="textmulti" stored="true" indexed="true"
  multiValued="true" />
  </fields>
  <defaultSearchField>doctest</defaultSearchField>

Note that in anonymizing the information, I introduced a typo. The
above doctest should be doctext. In any case, the field names in
the production application and in production schema do in fact match!


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Return only matched multiValued field

2012-09-24 Thread Dotan Cohen
On Mon, Sep 24, 2012 at 2:16 PM, Erick Erickson erickerick...@gmail.com wrote:
 Hmmm, works for me. What is your entire response packet?

 And you've covered the bases with indexed and stored so this
 seems like it _should_ work.


I'm sorry, reducing the output to rows=1 helped me notice that the
highlighted sections come after the main results. The highlighting
feature works as expected.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Return only matched multiValued field

2012-09-24 Thread Dotan Cohen
On Mon, Sep 24, 2012 at 9:47 AM, Mikhail Khludnev
mkhlud...@griddynamics.com wrote:
 Hi
 It seems like highlighting feature.

Thank you Mikhail. I actually do need the entire matched single entry,
not a snippet of it. Looking at the example in the OP, with
highlighting on gold I would get

<em>glitters is gold</em>

Whereas I need:
<str>all that glitters is gold</str>
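
For reference, highlighting can return whole field values rather than snippets: hl.fragsize=0 uses the entire field content as the fragment, and hl.snippets caps how many matching values come back from the multivalued field. Something like (core name assumed):

http://127.0.0.1:8983/solr/core/select?q=comment:gold&hl=true&hl.fl=comment&hl.fragsize=0&hl.snippets=10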

Thanks.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Return only matched multiValued field

2012-09-23 Thread Dotan Cohen
Assuming a multivalued, stored and indexed field with name comment.
When performing a search, I would like to return only the values of
comment which contain the match. For example:

When searching for gold instead of getting this result:

<doc>
<arr name="comment">
<str>Theres a lady whos sure</str>
<str>all that glitters is gold</str>
<str>and shes buying a stairway to heaven</str>
</arr>
</doc>

I would prefer to get this result:

<doc>
<arr name="comment">
<str>all that glitters is gold</str>
</arr>
</doc>

(pseudo-XML from memory; it may not be accurate but it illustrates the point)

Is there any way to do this with a Solr 4 index? The client accessing
Solr is on a dial-up connection (no provision for DSL or other high
speed internet) so I'd like to move as little data over the wire as
possible. In reality, the array will have tens of fields so returning
only the relevant fields may reduce the data transferred by an order
of magnitude.

Thanks.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Cannot insert text into solr.StrField field

2012-09-13 Thread Dotan Cohen
On Fri, Sep 14, 2012 at 1:00 AM, Jack Krupansky j...@basetechnology.com wrote:
 Did you check the log file?

 How are you adding data to Solr? Show us the actual input document or code.


The Solr instance is on Websolr, so I cannot check the log file. I
will put in a feature request for that, though. I am adding the
documents with Solr-PHP-Client. In fact, casting the variable with
(int) does resolve the issue I found. This looks like an issue with
PHP being weakly typed.



-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com

