RE: How to 'filter' facet results

2010-07-28 Thread Nagelberg, Kallin
ManBearPig is still a threat.

-Kallin Nagelberg

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Tuesday, July 27, 2010 7:44 PM
To: solr-user@lucene.apache.org
Subject: RE: How to 'filter' facet results

 Is there a way to tell Solr to only return a specific set of facet values? I
 feel like the facet query must be able to do this, but I'm not really
 understanding the facet query. In my specific case, I'd like to only see facet
 values for the same values I pass in as query filters, i.e. if I run this query:
fq=keyword:man OR keyword:bear OR keyword:pig
facet=on
facet.field=keyword

 then I only want it to return the facet counts for man, bear, and pig. The
 resulting docs might have a number of different values for keyword, in addition

For the general case of filtering facet values, I've wanted to do that too in 
more complex situations, and there is no good way I've found. 

For your very specific use case though, yeah, you can do it with facet.query.  
Leave out the facet.field, but instead:

facet.query=keyword:man
facet.query=keyword:bear
facet.query=keyword:pig

You'll get three facet.query results in the response, one each for man, bear, 
pig. 
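
For concreteness, the whole request might look like this (host, port, and the q
parameter are placeholders):

http://localhost:8983/solr/select?q=*:*
  &fq=keyword:man OR keyword:bear OR keyword:pig
  &facet=on
  &facet.query=keyword:man
  &facet.query=keyword:bear
  &facet.query=keyword:pig

The counts come back under facet_counts/facet_queries in the response, keyed by
each facet.query string.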

Solr behind the scenes will kind of do three separate 'sub-queries', one for 
each facet.query, but since the query itself should be cached, you shouldn't 
notice much difference. Especially if you have a warming query that facets on 
the keyword field (I'm never entirely sure when caches created by warming 
queries will be used by a facet.query, or if it depends on the facet method in 
use, but it can't hurt). 

Jonathan



solrj occasional timeout on commit

2010-07-23 Thread Nagelberg, Kallin
Hey,

I recently moved a solr app from a testing environment into a production 
environment, and I'm seeing a brand new error which never occurred during 
testing. I'm seeing this in the solrJ-based app logs:


org.apache.solr.common.SolrException: com.caucho.vfs.SocketTimeoutException: 
client timeout

com.caucho.vfs.SocketTimeoutException: client timeout

request: http://somehost:8080/solr/live/update?wt=javabin&version=1

at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)

at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)

at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)




This occurs in a service that periodically adds new documents to solr. There 
are 4 boxes that could be doing updates in parallel. In testing there were 2.





We're running on a new Resin 4 based install in production, whereas we were
using Resin 3 in testing. Does anyone have any ideas? Help would be greatly
appreciated!



Thanks,

-Kallin Nagelberg







RE: help with a schema design problem

2010-07-23 Thread Nagelberg, Kallin
I think you just want something like:

p_value:Pramod AND p_type:Supplier

no?
-Kallin Nagelberg

-Original Message-
From: Pramod Goyal [mailto:pramod.go...@gmail.com] 
Sent: Friday, July 23, 2010 2:17 PM
To: solr-user@lucene.apache.org
Subject: help with a schema design problem

Hi,

Let's say I have a table with 3 columns: document id, Party Value, and Party Type.
In this table I have 3 rows. 1st row: Document id: 1, Party Value: Pramod,
Party Type: Client. 2nd row: Document id: 1, Party Value: Raj, Party Type:
Supplier. 3rd row: Document id: 2, Party Value: Pramod, Party Type: Supplier.
Now in this table, if I use SQL, it's easy to find all documents with Party
Value as Pramod and Party Type as Client.

I need to design a solr schema so that I can do the same in Solr. If I create
2 fields in the solr schema, Party Value and Party Type, both of them multi-valued,
and try to query +Pramod +Supplier, then solr will return the first
document, even though in the first document Pramod is a client and not a
supplier.
Thanks,
Pramod Goyal


RE: help with a schema design problem

2010-07-23 Thread Nagelberg, Kallin
   When i search
   p_value:Pramod AND p_type:Supplier
  
   it would give me result as document 1. Which is incorrect, since in
   document
   1 Pramod is a Client and not a Supplier.

Would it? I would expect it to give you nothing.

-Kal



-Original Message-
From: Geert-Jan Brits [mailto:gbr...@gmail.com] 
Sent: Friday, July 23, 2010 5:05 PM
To: solr-user@lucene.apache.org
Subject: Re: help with a schema design problem

 Is there any way in solr to say p_value[someIndex]=pramod
And p_type[someIndex]=client.
No, I'm 99% sure there is not.

 One way would be to define a single field in the schema as p_value_type =
client pramod, i.e. combine the values from both fields and store them in a
single field.
yep, for the use-case you mentioned that would definitely work. Multivalued
of course, so it can contain Supplier Raj as well.
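
To make that concrete, a sketch of document 1 with such a combined field (the
field name is just an example; use a string field, or a text field with a
positionIncrementGap large enough that phrases can't match across values):

p_value_type: [Client Pramod, Supplier Raj]

The query then becomes the phrase p_value_type:"Supplier Pramod", which
correctly matches nothing for document 1, while p_value_type:"Client Pramod"
matches it.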


2010/7/23 Pramod Goyal pramod.go...@gmail.com

In my case the document id is the unique key (each row is not a unique
 document). So a single document has multiple Party Values and Party Types.
 Hence I need to define both Party Value and Party Type as multi-valued. Is
 there any way in solr to say p_value[someIndex]=pramod And
 p_type[someIndex]=client.
Is there any other way i can design my schema ? I have some solutions
 but none seems to be a good solution. One way would be to define a single
 field in the schema as p_value_type = client pramod i.e. combine the
 value
 from both the field and store it in a single field.


 On Sat, Jul 24, 2010 at 12:18 AM, Geert-Jan Brits gbr...@gmail.com
 wrote:

  With the usecase you specified it should work to just index each row as
  you described in your initial post as a separate document.
  This way p_value and p_type both get single-valued and you get a correct
  combination of p_value and p_type.
 
  However, this may not go so well with other use-cases you have in mind,
  e.g.: requiring that no multiple results are returned with the same
  document
  id.
 
 
 
  2010/7/23 Pramod Goyal pramod.go...@gmail.com
 
   I want to do that. But if i understand correctly in solr it would store
  the
   field like this:
  
   p_value: Pramod  Raj
   p_type:  Client Supplier
  
   When i search
   p_value:Pramod AND p_type:Supplier
  
   it would give me result as document 1. Which is incorrect, since in
   document
   1 Pramod is a Client and not a Supplier.
  
  
  
  
   On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin 
   knagelb...@globeandmail.com wrote:
  
I think you just want something like:
   
p_value:Pramod AND p_type:Supplier
   
no?
-Kallin Nagelberg
   
-Original Message-
From: Pramod Goyal [mailto:pramod.go...@gmail.com]
Sent: Friday, July 23, 2010 2:17 PM
To: solr-user@lucene.apache.org
Subject: help with a schema design problem
   
Hi,
   
Lets say i have table with 3 columns document id Party Value and
 Party
Type.
In this table i have 3 rows. 1st row Document id: 1 Party Value:
 Pramod
Party Type: Client. 2nd row: Document id: 1 Party Value: Raj Party
  Type:
Supplier. 3rd row Document id:2 Party Value: Pramod Party Type:
  Supplier.
Now in this table if i use SQL its easy for me find all document with
   Party
Value as Pramod and Party Type as Client.
   
I need to design solr schema so that i can do the same in Solr. If i
   create
2 fields in solr schema Party value and Party type both of them multi
valued
and try to query +Pramod +Supplier then solr will return me the first
document, even though in the first document Pramod is a client and
 not
  a
supplier
Thanks,
Pramod Goyal
   
  
 



RE: faceted search with job title

2010-07-21 Thread Nagelberg, Kallin
Yeah, you should definitely just set up a custom parser for each site. It should be
easy to extract the title using Groovy's XML parsing along with TagSoup for sloppy
HTML. If you can't find the pattern on each site leading to the job title, how can
you expect solr to? Humans have the advantage here :P

-Kallin Nagelberg

-Original Message-
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] 
Sent: Wednesday, July 21, 2010 12:20 PM
To: solr-user@lucene.apache.org
Cc: dave.sea...@magicalia.com
Subject: Re: faceted search with job title

mmm...there must be a better way...each job board has a different format. If there
are constantly new job boards being crawled, I don't think I can manually look
for the specific sequence of tags that leads to the job title. Most of them don't
even have a class or id. There is no guarantee that the job title will be in the
title tag, or header tag. Something else can be in the title. Should I do this in
a class that extends IndexFilter in Nutch?
Thanks.





From: Dave Searle dave.sea...@magicalia.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Wed, July 21, 2010 8:42:55 AM
Subject: RE: faceted search with job title

You'd probably need to do some post processing on the pages and set up rules for
each website to grab that specific bit of data. You could load the html into an
xml parser, then use xpath to grab content from a particular tag with a class or
id, based on the particular website



-Original Message-
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] 
Sent: 21 July 2010 16:38
To: solr-user@lucene.apache.org
Subject: faceted search with job title

Hi,
  I am currently using nutch to crawl some job pages from job boards. They are
in my solr index now. I want to do faceted search with the job titles. How?
The job titles can be in any locations of the page, e.g. title, header,
content... If I use an indexfilter in Nutch to search the content for job titles,
there are hundreds of thousands of job titles; I can't hard code them all. Do
you have a better idea? I think I need the job title in a separate field in the
index to make it work with solr faceted search, am I right?
Thanks.


  


RE: how to eliminating scoring from a query?

2010-07-15 Thread Nagelberg, Kallin
How about:

1. Create a date field to indicate indextime.

2. Use a date filter to restrict articles to today and yesterday, such as 
myindexdate:[NOW/DAY-1DAY TO NOW/DAY+1DAY]

3. sort on that field.
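
Put together, the request could look something like this (the field name is
just the example from step 1):

q=yourquery&fq=myindexdate:[NOW/DAY-1DAY TO NOW/DAY+1DAY]&sort=myindexdate asc

Since the sort no longer mentions score, scoring drops out of the ordering, and
the fq is cached separately from the main query.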

-Kallin Nagelberg

-Original Message-
From: oferiko [mailto:ofer...@gmail.com] 
Sent: Thursday, July 15, 2010 1:38 PM
To: solr-user@lucene.apache.org
Subject: Re: how to eliminating scoring from a query?


thanks,

I want it to be the indexing order, but with a limit: something like
everything that matches my query and was indexed since yesterday, in
ascending order.

Ofer

On Thu, Jul 15, 2010 at 8:25 PM, Erick Erickson [via Lucene]
ml-node+970139-889457701-316...@n3.nabble.com
 wrote:

 By specifying a sort that doesn't include score. I think it's just
 automatic
 then.

 It wouldn't make sense to eliminate scoring *without* sorting by some other
 field, you'd essentially get a random ordering.


 Best
 Erick

 On Thu, Jul 15, 2010 at 1:43 AM, oferiko [hidden email]
 wrote:

 
  in http://www.lucidimagination.com/files/file/LIWP_WhatsNew_Solr1.4.pdf under
  the performance section it mentions:
  Queries that don't sort by score can eliminate scoring, which speeds up
  queries
  how exactly can i do that? If i don't mention which sort i want, it
  automatically sorts by score desc.
 
  thanks
  --
  View this message in context:
  http://lucene.472066.n3.nabble.com/how-to-eliminating-scoring-from-a-query-tp968581p968581.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 


 --
  View message @
 http://lucene.472066.n3.nabble.com/how-to-eliminating-scoring-from-a-query-tp968581p970139.html




-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-eliminating-scoring-from-a-query-tp968581p970180.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: limiting the total number of documents matched

2010-07-14 Thread Nagelberg, Kallin
So you want to take the top 1000 sorted by score, then sort those by another 
field. It's a strange case, and I can't think of a clean way to accomplish it. 
You could do it in two queries, where the first is by score and you only 
request your IDs to keep it snappy, then do a second query against the IDs and 
sort by your other field. 1000 seems like a lot for that approach, but who 
knows until you try it on your data.
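
A sketch of the two-query idea in SolrJ (the 1.4-era API; the URL, the query
string, and the field names "id" and "title" are assumptions, and there is no
error handling):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class TopThousandByTitle {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Query 1: top 1000 by relevance, fetching only the unique key.
        SolrQuery byScore = new SolrQuery("your common term");
        byScore.setRows(1000);
        byScore.setFields("id");
        QueryResponse top = server.query(byScore);

        // Query 2: restrict to exactly those ids and sort by title.
        // 1000 clauses stays just under the default maxBooleanClauses of 1024.
        StringBuilder ids = new StringBuilder();
        for (SolrDocument doc : top.getResults()) {
            if (ids.length() > 0) ids.append(" OR ");
            ids.append("id:").append(doc.getFieldValue("id"));
        }
        SolrQuery byTitle = new SolrQuery(ids.toString());
        byTitle.setRows(1000);
        byTitle.addSortField("title", SolrQuery.ORDER.asc);
        QueryResponse sorted = server.query(byTitle);
        System.out.println(sorted.getResults().size() + " docs, sorted by title");
    }
}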

-Kallin Nagelberg 


-Original Message-
From: Paul [mailto:p...@nines.org] 
Sent: Wednesday, July 14, 2010 4:16 PM
To: solr-user
Subject: limiting the total number of documents matched

I'd like to limit the total number of documents that are returned for
a search, particularly when the sort order is not based on relevancy.

In other words, if the user searches for a very common term, they
might get tens of thousands of hits, and if they sort by title, then
very high relevancy documents will be interspersed with very low
relevancy documents. I'd like to set a limit to the 1000 most relevant
documents, then sort those by title.

Is there a way to do this?

I guess I could always retrieve the top 1000 documents and sort them
in the client, but that seems particularly inefficient. I can't find
any other way to do this, though.

Thanks,
Paul


RE: Help patching Solr

2010-06-15 Thread Nagelberg, Kallin
I'm pretty sure you need to be running the patch against a checkout of the 
trunk sources, not a generated .war file. Once you've done that you can use the 
build scripts to make a new war.
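
For reference, the rough sequence on a unix-like shell (or Cygwin on Windows)
would be something like the following; the trunk URL is from memory and may
have moved since the Lucene/Solr merge, and the -p level depends on how the
patch was generated:

svn checkout http://svn.apache.org/repos/asf/lucene/solr/trunk solr-trunk
cd solr-trunk
patch -p0 < SOLR-236-trunk.patch
ant dist

If it works, ant dist leaves a freshly built war under dist/.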

-Kallin Nagelberg

-Original Message-
From: Moazzam Khan [mailto:moazz...@gmail.com] 
Sent: Tuesday, June 15, 2010 1:53 PM
To: solr-user@lucene.apache.org
Subject: Help patching Solr

Hey guys,

Does anyone know how to patch stuff in Windows? I am trying to patch
Solr with patch 238 but it keeps erroring out with this message:



C:\solr\example\webapps>patch solr.war ..\..\SOLR-236-trunk.patch
patching file solr.war
Assertion failed: hunk, file ../patch-2.5.9-src/patch.c, line 354

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

Thanks in advance

Moazzam


RE: index growing with updates

2010-06-04 Thread Nagelberg, Kallin
Ok so I think that Solr (lucene) will only remove deleted/updated documents 
from the disk after an optimize or after an 'expungeDeletes' request. Is there 
a way to trigger the expunsion (new word) across the entire index? I tried:

final UpdateRequest request = new UpdateRequest();
request.setParam("expungeDeletes", "true");
request.add(someOfMyDocs);   // the documents being re-added
request.process(server);


But that didn't seem to do the trick as I know I have about 7 Gigs of documents 
that should be removed from the disk and the index size hasn't really budged.

Any ideas?
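
One thing I'd double-check (an educated guess, not verified against your
setup): expungeDeletes is a parameter of the commit command, so attaching it
to an add may simply be ignored. A sketch of sending it with the commit
instead, using the same SolrJ API:

final UpdateRequest commit = new UpdateRequest();
commit.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true); // waitFlush, waitSearcher
commit.setParam("expungeDeletes", "true");
commit.process(server);

Failing that, an explicit optimize will always merge away the deleted
documents, at the cost of rewriting the whole index.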

Thanks,
Kallin Nagelberg





-Original Message-
From: Nagelberg, Kallin 
Sent: Thursday, June 03, 2010 1:36 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: index growing with updates

Is there a way to trigger a purge, or under what conditions does it occur?

-Kallin Nagelberg

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, June 03, 2010 12:40 PM
To: solr-user@lucene.apache.org
Subject: Re: index growing with updates

Assuming your config is set up to replace unique keys, you're really
doing a delete and an add (under the covers). It could very well be that
the deleted version of the document is still in your index taking up
space and will be until it is purged.

HTH
Erick

On Thu, Jun 3, 2010 at 10:22 AM, Nagelberg, Kallin 
knagelb...@globeandmail.com wrote:

 Hey,

 If I add a document to the index that already exists (same uniquekey) what
 is the expected behavior? I would imagine that if the document is the same
 then the index should not grow, but mine appears to be growing. Any ideas?

 Thanks,
 -Kallin Nagelberg




RE: index growing with updates

2010-06-03 Thread Nagelberg, Kallin
Is there a way to trigger a purge, or under what conditions does it occur?

-Kallin Nagelberg

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, June 03, 2010 12:40 PM
To: solr-user@lucene.apache.org
Subject: Re: index growing with updates

Assuming your config is set up to replace unique keys, you're really
doing a delete and an add (under the covers). It could very well be that
the deleted version of the document is still in your index taking up
space and will be until it is purged.

HTH
Erick

On Thu, Jun 3, 2010 at 10:22 AM, Nagelberg, Kallin 
knagelb...@globeandmail.com wrote:

 Hey,

 If I add a document to the index that already exists (same uniquekey) what
 is the expected behavior? I would imagine that if the document is the same
 then the index should not grow, but mine appears to be growing. Any ideas?

 Thanks,
 -Kallin Nagelberg




RE: general debugging techniques?

2010-06-03 Thread Nagelberg, Kallin
How much memory have you given tomcat? The default is 64M which is going to be 
really small for 5MB documents. 

-Original Message-
From: jim.bl...@pbwiki.com [mailto:jim.bl...@pbwiki.com] On Behalf Of Jim Blomo
Sent: Thursday, June 03, 2010 2:05 PM
To: solr-user@lucene.apache.org
Subject: general debugging techniques?

I am new to debugging Java services, so I'm wondering what the best
practices are for debugging solr on tomcat.  I'm running into a few
issues while building up my index, using the ExtractingRequestHandler
to format the data from my sources.  I can read through the catalina
log, but this seems to just log requests; not much info is given about
errors or when the service hangs.  Here are some examples:

Some zip or Office formats uploaded to the extract requestHandler
simply hang with the jsvc process spinning at 100% CPU.  I'm unclear
where in the process the request is hanging.  Did it make it through
Tika?  Is it attempting to index?  The problem is often not
reproducible after restarting tomcat and starting with the last failed
document.

Although I am keeping document size under 5MB, I regularly see
"SEVERE: java.lang.OutOfMemoryError: Java heap space" errors.  How can
I find what component had this problem?

After the above error, I often see this followup error on the next
document: "SEVERE: org.apache.lucene.store.LockObtainFailedException:
Lock obtain timed out:
NativeFSLock@/var/lib/solr/data/index/lucene-d6f7b3bf6fe64f362b4d45bfd4924f54-write.lock".  This has
a backtrace, so I could dive directly into the code.  Is this the best
way to track down the problem, or are there debugging settings that
could help show why the lock is being held elsewhere?

I attempted to turn on indexing logging with the line

<infoStream file="INFOSTREAM.txt">true</infoStream>

but I can't seem to find this file in either the tomcat or the index directory.

I am using solr 3.1 with the patch to work with Tika 0.7.  Thanks for any tips,

Jim


RE: general debugging techniques?

2010-06-03 Thread Nagelberg, Kallin
That is still really small for 5MB documents. I think the default solr document 
cache is 512 items, so you would need at least 3 GB of memory if you didn't 
change that and the cache filled up.

Try disabling the document cache by removing the 

<documentCache
  class="solr.LRUCache"
  size="512"
  initialSize="512"
  autowarmCount="0"/>

block from your solrconfig, or at least turn it down to like 5 documents.

-Kal



-Original Message-
From: jim.bl...@pbwiki.com [mailto:jim.bl...@pbwiki.com] On Behalf Of Jim Blomo
Sent: Thursday, June 03, 2010 2:29 PM
To: solr-user@lucene.apache.org
Subject: Re: general debugging techniques?

On Thu, Jun 3, 2010 at 11:17 AM, Nagelberg, Kallin
knagelb...@globeandmail.com wrote:
 How much memory have you given tomcat? The default is 64M which is going to 
 be really small for 5MB documents.

-Xmx128M - my understanding is that this bumps heap size to 128M.
What is a reasonable size?  Are there other memory flags I should
specify?

Jim


RE: Storing different entities in Solr

2010-05-28 Thread Nagelberg, Kallin
Good read here: http://mysolr.com/tips/denormalized-data-structure/ .

Are consultation requests unique to each consultant? In that case you could
represent the request as a JSON string and store it as a multi-valued string
field for each consultant, though that makes querying against requests
trickier. If you need to search against specific fields in the consultant
requests then you could try a schema where the consultant is your primary
entity and have fields like

consultantrequests-field1,
consultantrequests-field2,
consultantrequests-field3

and then one
consultantrequests-fulljson

all multi-valued. You could query against the specific fields, then associate
them to the whole request by searching the JSON object. It's an approach I've
used with success.
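
As a sketch, the schema entries could look something like this (the field
names from above are examples; whether the full-JSON field needs indexed=true
depends on whether you match against it in Solr or client-side):

<field name="consultantrequests-field1" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="consultantrequests-field2" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="consultantrequests-fulljson" type="string" indexed="true" stored="true" multiValued="true"/>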

-Kallin Nagelberg

-Original Message-
From: Moazzam Khan [mailto:moazz...@gmail.com] 
Sent: Friday, May 28, 2010 12:17 PM
To: solr-user@lucene.apache.org
Subject: Storing different entities in Solr

Hi Guys,

Is there a way to store 2 types of things in Solr? We have a list of
consultants and a list of consultation requests. and I want to store
them as separate documents. Can I do this with one instance of Solr or
do I have to have two instances?

Thanks,

MOazzam


RE: Storing different entities in Solr

2010-05-28 Thread Nagelberg, Kallin
Multi-core is an option, but keep in mind if you go that route you will need to 
do two searches to correlate data between the two. 

-Kallin Nagelberg

-Original Message-
From: Robert Zotter [mailto:robertzot...@gmail.com] 
Sent: Friday, May 28, 2010 12:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Storing different entities in Solr


Sounds like you'll want to use a multiple core setup. One core for each type
of document

http://wiki.apache.org/solr/CoreAdmin
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Storing-different-entities-in-Solr-tp852299p852346.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Storing different entities in Solr

2010-05-28 Thread Nagelberg, Kallin
I agree with Erick,

Could you show us what these two entities look like, and the total count of 
each? That might shed some light on the appropriate approach.

-Kallin Nagelberg

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, May 28, 2010 2:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Storing different entities in Solr

You most certainly *can* store the many-many relationship, you
are just denormalizing your data. I know it goes against the grain
of any good database admin, but it's very often a good solution
for a search application.

You've gotta forget almost everything you learned about how data
*should* be stored in databases when working with a search app.
Well, perhaps I'm overstating a bit, but you get the idea

When I see messages about primary keys and foreign keys etc, I
break out in hives. It's almost always a mistake to try to force
lucene/solr to behave like a database. Whenever you find yourself
trying, stop, take a deep breath, and think about searching G...

A lot depends on how much data we're talking about here. If
fully denormalizing things would cost you 10M, who cares? If it
would cost you 100G, it's a different story

Best
Erick


On Fri, May 28, 2010 at 1:12 PM, Moazzam Khan moazz...@gmail.com wrote:

 Thanks for all your answers guys. Requests and consultants have a many
 to many relationship so I can't store request info in a document with
 advisorID as the primary key.

 Bill's solution and multicore solutions might be what I am looking
 for. Bill, will I be able to have 2 primary keys (so I can update and
  delete documents)? If yes, can you please give me a link or something
 where I can get more info on this?

 Thanks,
 Moazzam



 On Fri, May 28, 2010 at 11:50 AM, Bill Au bill.w...@gmail.com wrote:
  You can keep different type of documents in the same index.  If each
  document has a type field.  You can restrict your searches to specific
  type(s) of document by using a filter query, which is very fast and
  efficient.
 
  Bill
 
  On Fri, May 28, 2010 at 12:28 PM, Nagelberg, Kallin 
  knagelb...@globeandmail.com wrote:
 
  Multi-core is an option, but keep in mind if you go that route you will
  need to do two searches to correlate data between the two.
 
  -Kallin Nagelberg
 
  -Original Message-
  From: Robert Zotter [mailto:robertzot...@gmail.com]
  Sent: Friday, May 28, 2010 12:26 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Storing different entities in Solr
 
 
  Sounds like you'll want to use a multiple core setup. One core fore each
  type
  of document
 
  http://wiki.apache.org/solr/CoreAdmin
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Storing-different-entities-in-Solr-tp852299p852346.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 



RE: Any realtime indexing plugin available for SOLR

2010-05-26 Thread Nagelberg, Kallin
I'm afraid nothing is completely 'real-time'. Even when doing your inserts on 
the database there is time taken for those operations to complete. Right now I 
have my solr server autocommitting every 30 seconds, which is 'real-time' enough 
for me. You need to figure out what your threshold is, and then tune your 
index, hardware, caching to achieve it. If you don't want the results to show 
up in the database before the search you could store an 'indexed' value in the 
DB which you flip after you've indexed the new data.
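
For reference, the 30-second autocommit above is just this in solrconfig.xml,
inside the updateHandler section (maxTime is in milliseconds):

<autoCommit>
  <maxTime>30000</maxTime>
</autoCommit>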

-Kallin Nagelberg

-Original Message-
From: bbarani [mailto:bbar...@gmail.com] 
Sent: Wednesday, May 26, 2010 10:39 AM
To: solr-user@lucene.apache.org
Subject: Any realtime indexing plugin available for SOLR


Hi,

Sorry if I am asking this question again in this forum..

Is there any plugin which I can use to do a realtime indexing?

I have a requirement where we have an application which sits on top of SQL
server DB and updates happen on day to day basis. Users would like to see
the changes made to the DB immediately in the search results. I am thinking
of using JMS queue for achieving this, but before that I just want to check
if anyone has implemented similar kind of requirement before?

Any help / suggestions would be greatly appreciated.

Thanks,
bb
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Any-realtime-indexing-plugin-available-for-SOLR-tp845026p845026.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: How real-time are Solr/Lucene queries?

2010-05-26 Thread Nagelberg, Kallin
Searching is very fast with Solr, but nowhere near as fast as keying into a map.
There is possibly disk I/O if your document isn't cached. Your situation sounds
unique enough that I think you're going to need to prototype to see if it meets your 
demands. Figure out how 'fast' is 'fast' for your application, and then see if 
you can hit your targets. Once you have some real numbers and queries you'll be 
able to get more meaningful feedback from the community I imagine.

-Kallin Nagelberg

-Original Message-
From: Thomas J. Buhr [mailto:t...@superstringmedia.com] 
Sent: Wednesday, May 26, 2010 11:30 AM
To: solr-user@lucene.apache.org
Subject: Re: How real-time are Solr/Lucene queries?

What about my situation? 

My renderers need to query the index for fast access to layout and style info 
as I already described about 3 messages ago on this thread. Another scenario is 
having automatic queries triggered as my midi player iterates through the 
model. As the player encounters trigger tags it needs to make a query quickly 
so that the next notes played will have the context they are meant to have.

Basically, I need to know that issuing searches to a local index will not be 
slower than searching a hashmap or array. How different or similar will the 
performance be?

Thom


On 2010-05-26, at 9:41 AM, Walter Underwood wrote:

 On May 25, 2010, at 11:24 PM, Amit Nithian wrote:
 
 2) What are typical/accepted definitions of Real Time vs Near Real Time?
 
 Real time means that an update is available in the next query after it 
 commits. Near real time means that the delay is small, but not zero.
 
 This is within a single server. In a cluster, there will be some 
 communication delay. 
 
 3) I could understand POSTing a document to a server and then turning around
 and searching for it on the same server but what about a replicated
 environment and how do you prevent caches from being blown and constantly
 re-warmed (hence performance degradation)?
 
 You need a different caching design, with transaction-aware caches that are 
 at a lower level, closer to the indexes.
 
 wunder
 --
 Walter Underwood
 Lead Engineer
 MarkLogic
 
 
 



RE: seemingly impossible query

2010-05-26 Thread Nagelberg, Kallin
I developed a solution to this problem and I thought I should share it in case 
someone encounters a similar problem.

Recap: My problem was that for every document in my index I needed to know if
it was the most recent that contained an ID in a multi-valued field. Doing this
for one ID was simple (q=id:${myId}&sort=date asc&rows=1). It is much more
difficult to do this for a set of ids at the same time, in my case up to 100.
If I try 'id:id1 OR id:id2 OR id:id3...&sort=date asc&rows=11' I may not get
a match for every ID in my query. IE, with a query of 100 unique IDs, 100 rows,
I might only find 75 of those unique IDs in the response.

My solution is to pre-calculate this information. 

I created a new multi-valued field, mostRecentForIds, and store in that field
all of the IDs for which this document is the most recent. Each ID will only
appear once in the index in this field, allowing me to obtain my 100-unique-ID
response when querying with 100 unique IDs. I also created a boolean field,
'isPostProcessed', which is set to false when a new doc is added.

Then, on a cron, I select all documents with isPostProcessed:false, perform the
precalculation logic on all the ids stored in the resultset, and update
isPostProcessed to true.
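
The cron query itself is nothing exotic; something like the following, with the
rows value picked to suit your batch size:

q=isPostProcessed:false&rows=500

Then for each hit compute mostRecentForIds, re-add the document with
isPostProcessed set to true, and commit.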

The downside to this approach is that every document must be indexed twice. I 
could not perform the logic before the initial index since there could be other 
unindexed documents in a forthcoming commit that would conflict.

Hopefully someone finds this useful eventually!

-Kallin Nagelberg 





-Original Message-
From: Nagelberg, Kallin [mailto:knagelb...@globeandmail.com] 
Sent: Friday, May 21, 2010 4:44 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: seemingly impossible query

I just realized something that may make the fieldcollapsing strategy 
insufficient. My 'ids' field is multi-valued. From what I've read you cannot 
field collapse on a multi-valued field. Any other ideas?

Thanks,
-Kallin Nagelberg

-Original Message-
From: Geert-Jan Brits [mailto:gbr...@gmail.com] 
Sent: Thursday, May 20, 2010 1:03 PM
To: solr-user@lucene.apache.org
Subject: Re: seemingly impossible query

Hi Kallin,

again please look at
FieldCollapsinghttp://wiki.apache.org/solr/FieldCollapsing ,
that should do the trick.
basically: first you constrain the field: 'listOfIds' to only contain docs
that contain any of the (up to) 100 random ids as you know how to do

Next, in the same query, specify to collapse on field 'listOfIds '
basically:
q=listOfIds:1 OR listOfIds:10 OR listOfIds:24
&collapse.threshold=1&collapse.field=listOfIds&collapse.type=normal

this would return the top-matching doc for each id left in listOfIds. Since
you constrained this field by the ids specified you are left with 1 matching
doc for each id.

Again it is not guaranteed that all docs returned are different. Since you
didn't specify this as a requirement I think this will suffice.

Cheers,
Geert-Jan

2010/5/20 Nagelberg, Kallin knagelb...@globeandmail.com

 Yeah I need something like:
 (id:1 and maxhits:1) OR (id:2 and maxits:1).. something crazy like that..

 I'm not sure how I can hit solr once. If I do try and do them all in one
 big OR query then I'm probably not going to get a hit for each ID. I would
 need to request probably 1000 documents to find all 100 and even then
 there's no guarantee and no way of knowing how deep to go.

 -Kallin Nagelberg

 -Original Message-
 From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
 Sent: Thursday, May 20, 2010 12:27 PM
 To: solr-user@lucene.apache.org
 Subject: RE: seemingly impossible query

 I see. Well, now you're asking Solr to ignore its prime directive of
 returning hits that match a query. Hehe.

 I'm not sure if Solr has a unique attribute.

 But this sounds, to me, like you will have to filter the results yourself.
 But at least you hit Solr only once before doing so.

 Good luck!

  Thanks Darren,
 
  The problem with that is that it may not return one document per id,
 which
  is what I need.  IE, I could give 100 ids in that OR query and retrieve
  100 documents, all containing just 1 of the IDs.
 
  -Kallin Nagelberg
 
  -Original Message-
  From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
  Sent: Thursday, May 20, 2010 12:21 PM
  To: solr-user@lucene.apache.org
  Subject: Re: seemingly impossible query
 
  Ok. I think I understand. What's impossible about this?
 
  If you have a single field name called id that is multivalued
  then you can retrieved the documents with something like:
 
  id:1 OR id:2 OR id:56 ... id:100
 
  then add limit 100.
 
  There's probably a more succinct way to do this, but I'll leave that to
  the experts.
 
  If you also only want the documents within a certain time, then you also
  create a time field and use a conjunction (id:0 ...) AND time:NOW-1H
  or something similar to this. Check the query syntax wiki for specifics.
 
  Darren
 
 
  Hey everyone,
 
  I've recently been

RE: seemingly impossible query

2010-05-21 Thread Nagelberg, Kallin
I just realized something that may make the fieldcollapsing strategy 
insufficient. My 'ids' field is multi-valued. From what I've read you cannot 
field collapse on a multi-valued field. Any other ideas?

Thanks,
-Kallin Nagelberg

-Original Message-
From: Geert-Jan Brits [mailto:gbr...@gmail.com] 
Sent: Thursday, May 20, 2010 1:03 PM
To: solr-user@lucene.apache.org
Subject: Re: seemingly impossible query

Hi Kallin,

again please look at
FieldCollapsinghttp://wiki.apache.org/solr/FieldCollapsing ,
that should do the trick.
basically: first you constrain the field: 'listOfIds' to only contain docs
that contain any of the (up to) 100 random ids as you know how to do

Next, in the same query, specify to collapse on field 'listOfIds '
basically:
q=listOfIds:1 OR listOfIds:10 OR listOfIds:24
&collapse.threshold=1&collapse.field=listOfIds&collapse.type=normal

this would return the top-matching doc for each id left in listOfIds. Since
you constrained this field by the ids specified you are left with 1 matching
doc for each id.

Again it is not guaranteed that all docs returned are different. Since you
didn't specify this as a requirement I think this will suffice.

Cheers,
Geert-Jan

2010/5/20 Nagelberg, Kallin knagelb...@globeandmail.com

 Yeah I need something like:
 (id:1 and maxhits:1) OR (id:2 and maxits:1).. something crazy like that..

 I'm not sure how I can hit solr once. If I do try and do them all in one
 big OR query then I'm probably not going to get a hit for each ID. I would
 need to request probably 1000 documents to find all 100 and even then
 there's no guarantee and no way of knowing how deep to go.

 -Kallin Nagelberg

 -Original Message-
 From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
 Sent: Thursday, May 20, 2010 12:27 PM
 To: solr-user@lucene.apache.org
 Subject: RE: seemingly impossible query

 I see. Well, now you're asking Solr to ignore its prime directive of
 returning hits that match a query. Hehe.

 I'm not sure if Solr has a unique attribute.

 But this sounds, to me, like you will have to filter the results yourself.
 But at least you hit Solr only once before doing so.

 Good luck!

  Thanks Darren,
 
  The problem with that is that it may not return one document per id,
 which
  is what I need.  IE, I could give 100 ids in that OR query and retrieve
  100 documents, all containing just 1 of the IDs.
 
  -Kallin Nagelberg
 
  -Original Message-
  From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
  Sent: Thursday, May 20, 2010 12:21 PM
  To: solr-user@lucene.apache.org
  Subject: Re: seemingly impossible query
 
  Ok. I think I understand. What's impossible about this?
 
  If you have a single field name called id that is multivalued
  then you can retrieved the documents with something like:
 
  id:1 OR id:2 OR id:56 ... id:100
 
  then add limit 100.
 
  There's probably a more succinct way to do this, but I'll leave that to
  the experts.
 
  If you also only want the documents within a certain time, then you also
  create a time field and use a conjunction (id:0 ...) AND time:NOW-1H
  or something similar to this. Check the query syntax wiki for specifics.
 
  Darren
 
 
  Hey everyone,
 
  I've recently been given a requirement that is giving me some trouble. I
  need to retrieve up to 100 documents, but I can't see a way to do it
  without making 100 different queries.
 
  My schema has a multi-valued field like 'listOfIds'. Each document has
  between 0 and N of these ids associated to them.
 
  My input is up to 100 of these ids at random, and I need to retrieve the
  most recent document for each id (N Ids as input, N docs returned). I'm
  currently planning on doing a single query for each id, requesting 1
  row,
  and caching the result. This could work OK since some of these ids
  should
  repeat quite often. Of course I would prefer to find a way to do this in
  Solr, but I'm not sure it's capable.
 
  Any ideas?
 
  Thanks,
  -Kallin Nagelberg
 
 
 




field collapsing on multi-valued field

2010-05-21 Thread Nagelberg, Kallin
As I understand from looking at
https://issues.apache.org/jira/browse/SOLR-236, field
collapsing has been disabled on multi-valued fields. Is this really necessary?

Let's say I have a multi-valued field, 'my-mv-field'. I have a query like 
(my-mv-field:1 OR my-mv-field:5) that returns docs with the following values 
for 'my-mv-field':

Doc1: 1, 2, 3,
Doc2: 1, 3
Doc3: 2, 4, 5, 6
Doc4: 1

If I collapse on that field with that query I imagine it should mean 'collect 
the docs, starting from the top, so that I find 1 and 5'. In this case if it 
returned Doc1 and Doc3 I would be happy.

There must be some ambiguity or implementation detail I am unaware of that is
preventing this. It may be a critical piece of functionality for an application
I'm working on, so I'm curious if there is a point in pursuing development of
this functionality or if I am missing something.

Thanks,
Kallin Nagelberg


seemingly impossible query

2010-05-20 Thread Nagelberg, Kallin
Hey everyone,

I've recently been given a requirement that is giving me some trouble. I need 
to retrieve up to 100 documents, but I can't see a way to do it without making 
100 different queries.

My schema has a multi-valued field like 'listOfIds'. Each document has between 
0 and N of these ids associated to them.

My input is up to 100 of these ids at random, and I need to retrieve the most 
recent document for each id (N Ids as input, N docs returned). I'm currently 
planning on doing a single query for each id, requesting 1 row, and caching the 
result. This could work OK since some of these ids should repeat quite often. 
Of course I would prefer to find a way to do this in Solr, but I'm not sure 
it's capable.

Any ideas?

Thanks,
-Kallin Nagelberg


RE: Machine utilization while indexing

2010-05-20 Thread Nagelberg, Kallin
How about throwing a blockingqueue, 
http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html,
 between your document-creator and solrserver? Give it a size of 10,000 or 
something, with one thread trying to feed it, and one thread waiting for it to 
get near full then draining it. Take the drained results and add them to the 
server (maybe try not using streamingsolrserver). Something like that worked 
well for me with about 5,000,000 documents each ~5k taking about 8 hours.
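
A minimal sketch of that setup (not the exact code I used): one producer thread
building documents, the main thread draining batches into a plain
CommonsHttpSolrServer. The URL, the counts, and the dummy id field are
placeholders.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class QueuedIndexer {
    public static void main(String[] args) throws Exception {
        final int total = 100000;
        final BlockingQueue<SolrInputDocument> queue =
                new ArrayBlockingQueue<SolrInputDocument>(10000);
        final SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");

        // Producer: blocks on put() whenever the queue is full, so document
        // creation is throttled to whatever rate Solr can absorb.
        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 0; i < total; i++) {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", Integer.toString(i)); // stand-in for real doc building
                        queue.put(doc);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.start();

        // Consumer: wait for at least one document, then drain whatever else
        // has accumulated and send the whole batch in a single add() call.
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        int sent = 0;
        while (sent < total) {
            batch.add(queue.take());
            queue.drainTo(batch, 999);
            server.add(batch);
            sent += batch.size();
            batch.clear();
        }
        server.commit();
        producer.join();
    }
}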

-Kallin Nagelberg

-Original Message-
From: Thijs [mailto:vonk.th...@gmail.com] 
Sent: Thursday, May 20, 2010 11:02 AM
To: solr-user@lucene.apache.org
Subject: Machine utilization while indexing

Hi.

I have a question about how I can get solr to index quicker than it does
at the moment.

I have to index (and re-index) some 3-5 million documents. These 
documents are preprocessed by a java application that effectively 
combines multiple database tables with each-other to form the 
SolrInputDocument.

What I'm seeing however is that the queue of documents that are ready to
be sent to the solr server exceeds my preset limit. Telling me that Solr
somehow can't process the documents fast enough.

(I have created my own queue in front of Solrj.StreamingUpdateSolrServer 
as it would not process the documents fast enough causing 
OutOfMemoryExceptions due to the large amount of documents building up 
in it's queue)

I have an index that for 95% consist of ID's (Long). We don't do any 
analysis on the fields that are being indexed. The schema is rather 
straight forward.

most fields look like
<fieldType name="long" class="solr.LongField" omitNorms="true"/>
<field name="objectId" type="long" stored="true" indexed="true" required="true"/>
<field name="listId" type="long" stored="false" indexed="true" multiValued="true"/>

the relevant solrconfig.xml
<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>100</mergeFactor>
  <RAMBufferSizeMB>256</RAMBufferSizeMB>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>1</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>
  <lockType>single</lockType>
</indexDefaults>


The machines I'm testing on have a:
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
With 4GB of ram.
Running on linux java version 1.6.0_17, tomcat 6 and solr version 1.4

What I'm seeing is that the network almost never reaches more than 10%
of the 1GB/s connection.
That the CPU utilization is always below 25% (1 core is used, not the 
others)
I don't see heavy disk-io.
Also while indexing the memory consumption is:
Free memory: 212.15 MB Total memory: 509.12 MB Max memory: 2730.68 MB

And that in the beginning (with a empty index) I get 2ms per insert but 
this slows to 18-19ms per insert.

Are there any tips/tricks I can use to speed up my indexing? Because I
have a feeling that my machine is capable of doing more (using more
CPUs). I just can't figure out how.

Thijs


RE: Machine utilization while indexing

2010-05-20 Thread Nagelberg, Kallin
Well to be fair I'm indexing on a modest virtualized machine with only 2 gigs 
ram, and a doc size of 5-10k maybe substantially larger than what you have. 
They could be substantially smaller too. As another point of reference my index 
ends up being about 20Gigs with the 5 million docs. 

I should also point out I only need to do this once.. I'm not constantly 
reindexing everything. My indexed documents rarely change, and when they do we 
have a process that selectively updates those few that need it. Combine that 
with a constant trickle of new documents and indexing performance isn't much of 
a concern.

You should be able to experiment with a small subset of your documents to 
speedily test new schemas, etc. In my case I selected a representative sample 
and store them in my project for unit testing.

-Kallin Nagelberg


-Original Message-
From: Dennis Gearon [mailto:gear...@sbcglobal.net] 
Sent: Thursday, May 20, 2010 11:25 AM
To: solr-user@lucene.apache.org
Subject: RE: Machine utilization while indexing

It takes that long to do indexing? I'm HOPING to have a site that has low 10's 
of millions of documents to billions. 

Sounds to me like I will DEFINITELY need a cloud account at indexing time. For 
the original author of this thread, that's what I'd recommend.

1/ Optimize as best as you can on one machine.
2/ Set up an Amazon EC (Elastic Cloud) account. Spawn/shard the indexing over 
to 5-10 machines during indexing. Combine the index, shut down the EC 
instances. Probably could get it down to 1/2 hour, without impacting your 
current queries.


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 5/20/10, Nagelberg, Kallin knagelb...@globeandmail.com wrote:

 From: Nagelberg, Kallin knagelb...@globeandmail.com
 Subject: RE: Machine utilization while indexing
 To: 'solr-user@lucene.apache.org' solr-user@lucene.apache.org
 Date: Thursday, May 20, 2010, 8:16 AM
 How about throwing a blockingqueue,
 http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html,
 between your document-creator and solrserver? Give it a size
 of 10,000 or something, with one thread trying to feed it,
 and one thread waiting for it to get near full then draining
 it. Take the drained results and add them to the server
 (maybe try not using streamingsolrserver). Something like
 that worked well for me with about 5,000,000 documents each
 ~5k taking about 8 hours.
 
 -Kallin Nagelberg
 
 -Original Message-
 From: Thijs [mailto:vonk.th...@gmail.com]
 
 Sent: Thursday, May 20, 2010 11:02 AM
 To: solr-user@lucene.apache.org
 Subject: Machine utilization while indexing
 
 Hi.
 
 I have a question about how I can get solr to index quicker
 then it does 
 at the moment.
 
 I have to index (and re-index) some 3-5 million documents.
 These 
 documents are preprocessed by a java application that
 effectively 
 combines multiple database tables with each-other to form
 the 
 SolrInputDocument.
 
 What I'm seeing however is that the queue of documents that
 are ready to 
 be send to the solr server exceeds my preset limit. Telling
 me that Solr 
 somehow can't process the documents fast enough.
 
 (I have created my own queue in front of
 Solrj.StreamingUpdateSolrServer 
 as it would not process the documents fast enough causing 
 OutOfMemoryExceptions due to the large amount of documents
 building up 
 in it's queue)
 
 I have an index that for 95% consist of ID's (Long). We
 don't do any 
 analysis on the fields that are being indexed. The schema
 is rather 
 straight forward.
 
  most fields look like
  <fieldType name="long" class="solr.LongField" omitNorms="true"/>
  <field name="objectId" type="long" stored="true" indexed="true" required="true"/>
  <field name="listId" type="long" stored="false" indexed="true" multiValued="true"/>
 
  the relevant solrconfig.xml
  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>100</mergeFactor>
    <RAMBufferSizeMB>256</RAMBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>1</commitLockTimeout>
    <lockType>single</lockType>
  </indexDefaults>
 
 
 The machines I'm testing on have a:
  Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
 With 4GB of ram.
 Running on linux java version 1.6.0_17, tomcat 6 and solr
 version 1.4
 
 What I'm seeing is that the network almost never reaches
 more then 10% 
 of the 1GB/s connection.
 That the CPU utilization is always below 25% (1 core is
 used, not the 
 others)
 I don't see heavy disk-io.
 Also while indexing the memory consumption is:
 Free memory: 212.15 MB Total memory: 509.12 MB Max memory:
 2730.68 MB
 
 And that in the beginning (with a empty index) I get 2ms
 per insert but 
 this slows to 18-19ms per insert.
 
 Are there any tips/tricks I can use to speed up my
 indexing

RE: Machine utilization while indexing

2010-05-20 Thread Nagelberg, Kallin
You're sure it's not blocking on indexing IO? If not then I guess it must be a 
thread waiting unnecessarily in solr or your loading program. To get my loader 
running at full speed I hooked it up to jprofiler's thread views to see where 
the stalls were and optimized from there. 

-Kallin Nagelberg

-Original Message-
From: Thijs [mailto:vonk.th...@gmail.com] 
Sent: Thursday, May 20, 2010 11:25 AM
To: solr-user@lucene.apache.org
Subject: Re: Machine utilization while indexing

I already have a blockingqueue in place (that's my custom queue) and
luckily I'm indexing faster than what you're doing. Currently it takes
about 2 hours to index the 5m documents I'm talking about. But I still
feel as if my machine is under utilized.

Thijs


On 20-5-2010 17:16, Nagelberg, Kallin wrote:
 How about throwing a blockingqueue, 
 http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html,
  between your document-creator and solrserver? Give it a size of 10,000 or 
 something, with one thread trying to feed it, and one thread waiting for it 
 to get near full then draining it. Take the drained results and add them to 
 the server (maybe try not using streamingsolrserver). Something like that 
 worked well for me with about 5,000,000 documents each ~5k taking about 8 
 hours.

 -Kallin Nagelberg

 -Original Message-
 From: Thijs [mailto:vonk.th...@gmail.com]
 Sent: Thursday, May 20, 2010 11:02 AM
 To: solr-user@lucene.apache.org
 Subject: Machine utilization while indexing

 Hi.

 I have a question about how I can get solr to index quicker then it does
 at the moment.

 I have to index (and re-index) some 3-5 million documents. These
 documents are preprocessed by a java application that effectively
 combines multiple database tables with each-other to form the
 SolrInputDocument.

 What I'm seeing however is that the queue of documents that are ready to
 be send to the solr server exceeds my preset limit. Telling me that Solr
 somehow can't process the documents fast enough.

 (I have created my own queue in front of Solrj.StreamingUpdateSolrServer
 as it would not process the documents fast enough causing
 OutOfMemoryExceptions due to the large amount of documents building up
 in it's queue)

 I have an index that for 95% consist of ID's (Long). We don't do any
 analysis on the fields that are being indexed. The schema is rather
 straight forward.

 most fields look like
 <fieldType name="long" class="solr.LongField" omitNorms="true"/>
 <field name="objectId" type="long" stored="true" indexed="true" required="true"/>
 <field name="listId" type="long" stored="false" indexed="true" multiValued="true"/>

 the relevant solrconfig.xml
 <indexDefaults>
   <useCompoundFile>false</useCompoundFile>
   <mergeFactor>100</mergeFactor>
   <RAMBufferSizeMB>256</RAMBufferSizeMB>
   <maxMergeDocs>2147483647</maxMergeDocs>
   <maxFieldLength>1</maxFieldLength>
   <writeLockTimeout>1000</writeLockTimeout>
   <commitLockTimeout>1</commitLockTimeout>
   <lockType>single</lockType>
 </indexDefaults>


 The machines I'm testing on have a:
 Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
 With 4GB of ram.
 Running on linux java version 1.6.0_17, tomcat 6 and solr version 1.4

 What I'm seeing is that the network almost never reaches more then 10%
 of the 1GB/s connection.
 That the CPU utilization is always below 25% (1 core is used, not the
 others)
 I don't see heavy disk-io.
 Also while indexing the memory consumption is:
 Free memory: 212.15 MB Total memory: 509.12 MB Max memory: 2730.68 MB

 And that in the beginning (with a empty index) I get 2ms per insert but
 this slows to 18-19ms per insert.

 Are there any tips/tricks I can use to speed up my indexing? Because I
 have a feeling that my machine is capable of doing more (use more
 cpu's). I just can't figure-out how.

 Thijs



RE: seemingly impossible query

2010-05-20 Thread Nagelberg, Kallin
Thanks Darren,

The problem with that is that it may not return one document per id, which is 
what I need.  IE, I could give 100 ids in that OR query and retrieve 100 
documents, all containing just 1 of the IDs. 

-Kallin Nagelberg

-Original Message-
From: dar...@ontrenet.com [mailto:dar...@ontrenet.com] 
Sent: Thursday, May 20, 2010 12:21 PM
To: solr-user@lucene.apache.org
Subject: Re: seemingly impossible query

Ok. I think I understand. What's impossible about this?

If you have a single field name called id that is multivalued
then you can retrieve the documents with something like:

id:1 OR id:2 OR id:56 ... id:100

then add limit 100.

There's probably a more succinct way to do this, but I'll leave that to
the experts.

If you also only want the documents within a certain time, then you also
create a time field and use a conjunction (id:0 ...) AND time:NOW-1H
or something similar to this. Check the query syntax wiki for specifics.

Darren


 Hey everyone,

 I've recently been given a requirement that is giving me some trouble. I
 need to retrieve up to 100 documents, but I can't see a way to do it
 without making 100 different queries.

 My schema has a multi-valued field like 'listOfIds'. Each document has
 between 0 and N of these ids associated to them.

 My input is up to 100 of these ids at random, and I need to retrieve the
 most recent document for each id (N Ids as input, N docs returned). I'm
 currently planning on doing a single query for each id, requesting 1 row,
 and caching the result. This could work OK since some of these ids should
 repeat quite often. Of course I would prefer to find a way to do this in
 Solr, but I'm not sure it's capable.

 Any ideas?

 Thanks,
 -Kallin Nagelberg




RE: seemingly impossible query

2010-05-20 Thread Nagelberg, Kallin
Yeah I need something like:
(id:1 and maxhits:1) OR (id:2 and maxhits:1).. something crazy like that..

I'm not sure how I can hit solr once. If I do try and do them all in one big OR 
query then I'm probably not going to get a hit for each ID. I would need to 
request probably 1000 documents to find all 100 and even then there's no 
guarantee and no way of knowing how deep to go.

-Kallin Nagelberg

-Original Message-
From: dar...@ontrenet.com [mailto:dar...@ontrenet.com] 
Sent: Thursday, May 20, 2010 12:27 PM
To: solr-user@lucene.apache.org
Subject: RE: seemingly impossible query

I see. Well, now you're asking Solr to ignore its prime directive of
returning hits that match a query. Hehe.

I'm not sure if Solr has a unique attribute.

But this sounds, to me, like you will have to filter the results yourself.
But at least you hit Solr only once before doing so.

Good luck!

 Thanks Darren,

 The problem with that is that it may not return one document per id, which
 is what I need.  IE, I could give 100 ids in that OR query and retrieve
 100 documents, all containing just 1 of the IDs.

 -Kallin Nagelberg

 -Original Message-
 From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
 Sent: Thursday, May 20, 2010 12:21 PM
 To: solr-user@lucene.apache.org
 Subject: Re: seemingly impossible query

 Ok. I think I understand. What's impossible about this?

 If you have a single field name called id that is multivalued
 then you can retrieved the documents with something like:

 id:1 OR id:2 OR id:56 ... id:100

 then add limit 100.

 There's probably a more succinct way to do this, but I'll leave that to
 the experts.

 If you also only want the documents within a certain time, then you also
 create a time field and use a conjunction (id:0 ...) AND time:NOW-1H
 or something similar to this. Check the query syntax wiki for specifics.

 Darren


 Hey everyone,

 I've recently been given a requirement that is giving me some trouble. I
 need to retrieve up to 100 documents, but I can't see a way to do it
 without making 100 different queries.

 My schema has a multi-valued field like 'listOfIds'. Each document has
 between 0 and N of these ids associated to them.

 My input is up to 100 of these ids at random, and I need to retrieve the
 most recent document for each id (N Ids as input, N docs returned). I'm
 currently planning on doing a single query for each id, requesting 1
 row,
 and caching the result. This could work OK since some of these ids
 should
 repeat quite often. Of course I would prefer to find a way to do this in
 Solr, but I'm not sure it's capable.

 Any ideas?

 Thanks,
 -Kallin Nagelberg






RE: seemingly impossible query

2010-05-20 Thread Nagelberg, Kallin
Thanks, I'm going to take a look at field collapsing as it seems like it
should do the trick!

-Kallin Nagelberg

-Original Message-
From: Geert-Jan Brits [mailto:gbr...@gmail.com] 
Sent: Thursday, May 20, 2010 1:03 PM
To: solr-user@lucene.apache.org
Subject: Re: seemingly impossible query

Hi Kallin,

again please look at
FieldCollapsinghttp://wiki.apache.org/solr/FieldCollapsing ,
that should do the trick.
basically: first you constrain the field: 'listOfIds' to only contain docs
that contain any of the (up to) 100 random ids as you know how to do

Next, in the same query, specify to collapse on field 'listOfIds '
basically:
q=listOfIds:1 OR listOfIds:10 OR listOfIds:24
&collapse.threshold=1&collapse.field=listOfIds&collapse.type=normal

this would return the top-matching doc for each id left in listOfIds. Since
you constrained this field by the ids specified you are left with 1 matching
doc for each id.

Again it is not guaranteed that all docs returned are different. Since you
didn't specify this as a requirement I think this will suffice.

Cheers,
Geert-Jan

2010/5/20 Nagelberg, Kallin knagelb...@globeandmail.com

 Yeah I need something like:
 (id:1 and maxhits:1) OR (id:2 and maxhits:1).. something crazy like that..

 I'm not sure how I can hit solr once. If I do try and do them all in one
 big OR query then I'm probably not going to get a hit for each ID. I would
 need to request probably 1000 documents to find all 100 and even then
 there's no guarantee and no way of knowing how deep to go.

 -Kallin Nagelberg

 -Original Message-
 From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
 Sent: Thursday, May 20, 2010 12:27 PM
 To: solr-user@lucene.apache.org
 Subject: RE: seemingly impossible query

 I see. Well, now you're asking Solr to ignore its prime directive of
 returning hits that match a query. Hehe.

 I'm not sure if Solr has a unique attribute.

 But this sounds, to me, like you will have to filter the results yourself.
 But at least you hit Solr only once before doing so.

 Good luck!

  Thanks Darren,
 
  The problem with that is that it may not return one document per id,
 which
  is what I need.  IE, I could give 100 ids in that OR query and retrieve
  100 documents, all containing just 1 of the IDs.
 
  -Kallin Nagelberg
 
  -Original Message-
  From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
  Sent: Thursday, May 20, 2010 12:21 PM
  To: solr-user@lucene.apache.org
  Subject: Re: seemingly impossible query
 
  Ok. I think I understand. What's impossible about this?
 
  If you have a single field name called id that is multivalued
  then you can retrieved the documents with something like:
 
  id:1 OR id:2 OR id:56 ... id:100
 
  then add limit 100.
 
  There's probably a more succinct way to do this, but I'll leave that to
  the experts.
 
  If you also only want the documents within a certain time, then you also
  create a time field and use a conjunction (id:0 ...) AND time:NOW-1H
  or something similar to this. Check the query syntax wiki for specifics.
 
  Darren
 
 
  Hey everyone,
 
  I've recently been given a requirement that is giving me some trouble. I
  need to retrieve up to 100 documents, but I can't see a way to do it
  without making 100 different queries.
 
  My schema has a multi-valued field like 'listOfIds'. Each document has
  between 0 and N of these ids associated to them.
 
  My input is up to 100 of these ids at random, and I need to retrieve the
  most recent document for each id (N Ids as input, N docs returned). I'm
  currently planning on doing a single query for each id, requesting 1
  row,
  and caching the result. This could work OK since some of these ids
  should
  repeat quite often. Of course I would prefer to find a way to do this in
  Solr, but I'm not sure it's capable.
 
  Any ideas?
 
  Thanks,
  -Kallin Nagelberg
 
 
 




RE: Machine utilization while indexing

2010-05-20 Thread Nagelberg, Kallin
StreamingUpdateSolrServer already has multiple threads and uses multiple 
connections under the covers. At least the api says ' Uses an internal 
MultiThreadedHttpConnectionManager to manage http connections'. The constructor 
allows you to specify the number of threads used, 
http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html#StreamingUpdateSolrServer(java.lang.String, int, int).
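
A minimal sketch (the URL, queue size and thread count here are made-up values):

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

// queue up to 20 docs and drain them with 4 background threads;
// note the constructor can throw MalformedURLException
StreamingUpdateSolrServer server =
    new StreamingUpdateSolrServer("http://somehost:8080/solr/live", 20, 4);
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "1");
server.add(doc);
server.commit();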

-Kallin Nagelberg

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Thursday, May 20, 2010 3:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Machine utilization while indexing


I'm really only guessing here, but based on your description of what you 
are doing it sounds like you only have one thread streaming documents to 
solr (via a single StreamingUpdateSolrServer instance which creates a 
single HTTP connection)

Have you at all attempted to have parallel threads in your client initiate 
parallel connections to Solr via multiple instances of 
StreamingUpdateSolrServer objects?)


-Hoss



RE: seemingly impossible query

2010-05-20 Thread Nagelberg, Kallin
Yeah this looks perfect. Too bad it's not in 1.4, I guess I can build from 
trunk and patch it. This is probably a stupid question but is there any feeling 
as to when 1.5 might come out? 

Thanks,
-Kallin Nagelberg

-Original Message-
From: Geert-Jan Brits [mailto:gbr...@gmail.com] 
Sent: Thursday, May 20, 2010 1:03 PM
To: solr-user@lucene.apache.org
Subject: Re: seemingly impossible query

Hi Kallin,

again please look at FieldCollapsing (http://wiki.apache.org/solr/FieldCollapsing),
that should do the trick.
basically: first you constrain the field: 'listOfIds' to only contain docs
that contain any of the (up to) 100 random ids as you know how to do

Next, in the same query, specify to collapse on field 'listOfIds '
basically:
q=listOfIds:1 OR listOfIds:10 OR listOfIds:24
&collapse.threshold=1&collapse.field=listOfIds&collapse.type=normal

this would return the top-matching doc for each id left in listOfIds. Since
you constrained this field by the ids specified you are left with 1 matching
doc for each id.

Again it is not guaranteed that all docs returned are different. Since you
didn't specify this as a requirement I think this will suffice.

Cheers,
Geert-Jan

2010/5/20 Nagelberg, Kallin knagelb...@globeandmail.com

 Yeah I need something like:
 (id:1 and maxhits:1) OR (id:2 and maxhits:1).. something crazy like that..

 I'm not sure how I can hit solr once. If I do try and do them all in one
 big OR query then I'm probably not going to get a hit for each ID. I would
 need to request probably 1000 documents to find all 100 and even then
 there's no guarantee and no way of knowing how deep to go.

 -Kallin Nagelberg

 -Original Message-
 From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
 Sent: Thursday, May 20, 2010 12:27 PM
 To: solr-user@lucene.apache.org
 Subject: RE: seemingly impossible query

 I see. Well, now you're asking Solr to ignore its prime directive of
 returning hits that match a query. Hehe.

 I'm not sure if Solr has a unique attribute.

 But this sounds, to me, like you will have to filter the results yourself.
 But at least you hit Solr only once before doing so.

 Good luck!

  Thanks Darren,
 
  The problem with that is that it may not return one document per id,
 which
  is what I need.  IE, I could give 100 ids in that OR query and retrieve
  100 documents, all containing just 1 of the IDs.
 
  -Kallin Nagelberg
 
  -Original Message-
  From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
  Sent: Thursday, May 20, 2010 12:21 PM
  To: solr-user@lucene.apache.org
  Subject: Re: seemingly impossible query
 
  Ok. I think I understand. What's impossible about this?
 
  If you have a single field name called id that is multivalued
  then you can retrieved the documents with something like:
 
  id:1 OR id:2 OR id:56 ... id:100
 
  then add limit 100.
 
  There's probably a more succinct way to do this, but I'll leave that to
  the experts.
 
  If you also only want the documents within a certain time, then you also
  create a time field and use a conjunction (id:0 ...) AND time:NOW-1H
  or something similar to this. Check the query syntax wiki for specifics.
 
  Darren
 
 
  Hey everyone,
 
  I've recently been given a requirement that is giving me some trouble. I
  need to retrieve up to 100 documents, but I can't see a way to do it
  without making 100 different queries.
 
  My schema has a multi-valued field like 'listOfIds'. Each document has
  between 0 and N of these ids associated to them.
 
  My input is up to 100 of these ids at random, and I need to retrieve the
  most recent document for each id (N Ids as input, N docs returned). I'm
  currently planning on doing a single query for each id, requesting 1
  row,
  and caching the result. This could work OK since some of these ids
  should
  repeat quite often. Of course I would prefer to find a way to do this in
  Solr, but I'm not sure it's capable.
 
  Any ideas?
 
  Thanks,
  -Kallin Nagelberg
 
 
 




RE: Challenge: Searching for variant products and get basic products in result set

2010-05-19 Thread Nagelberg, Kallin
I agree that pulling all attributes into the parent sku during indexing could 
work well. Define a Boolean field like 'isVirtual' to identify the non-leaf 
skus, and use a multi-valued field for each of the attributes. For now you can 
do a search like (isVirtual:true AND doorType:screen). If at a later date you 
want the actual variants just search for isVirtual:false.
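
In schema terms, something like this (the field names here are just for
illustration):

<field name="isVirtual" type="boolean" indexed="true" stored="true"/>
<field name="doorType" type="string" indexed="true" stored="true" multiValued="true"/>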

Does that work?

-Kallin Nagelberg

-Original Message-
From: Leonardo Menezes [mailto:leonardo.menez...@googlemail.com] 
Sent: Wednesday, May 19, 2010 11:13 AM
To: solr-user@lucene.apache.org
Subject: Re: Challenge: Searching for variant products and get basic products 
in result set

If that is so, then suppose, for example, you have two variants of cars with
automatic transmission: what would define which variant the hit was on? Or do
the fields not share common information across variants? If they do share, you
wouldn't be able to tell which variant the hit was on (because it was on both
of them) and would either have to pick one randomly or retrieve both. If they
don't share that info, you have that covered, since only one would match any
given query.

On Wed, May 19, 2010 at 5:04 PM, hkmortensen ko...@yahoo.com wrote:


 thanks. Currently not, but requirements change all the time as always ;-)
 If we get a requirement, that a facet shall be material of doors, we will
 need to know which variant was the hit. I would like to be prepared for
 that.




 Leonardo Menezes wrote:
 
  would you then need to know in which variant was your match produced?
  because if not, you can just index the whole thing as one single
  document...
 
  On Wed, May 19, 2010 at 4:23 PM, hkmortensen ko...@yahoo.com wrote:
 
 
  I do searching for products. Each base product exists in variants as
 well.
  One
  variant has a glass door, another a steel door etc. The variants can
 have
  different prices. The base product does not really exist, only the
 variants
  exists IRL. The case corresponds to cars: the car model is the base
  product,
  with color variants  or with automatic/manual etc.
 
  I want to search for variants, but I only want to have base products in
  the
  result. Ie when one or more variants from the same base product are
  found,
  only the base product shall be in the search result.
 
  Does somebody have an idea how this could be done?
 
  Best regards
 
  Henning
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Challenge-Searching-for-variant-products-and-get-basic-products-in-result-set-tp829218p829218.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Challenge-Searching-for-variant-products-and-get-basic-products-in-result-set-tp829218p829319.html
 Sent from the Solr - User mailing list archive at Nabble.com.



RE: Challenge: Searching for variant products and get basic products in result set

2010-05-19 Thread Nagelberg, Kallin
Sorry, in North America 'sku' (stock keeping unit) is the common term in 
business to specifically identify a particular product, 
http://lmgtfy.com/?q=sku. 

And yes, I think you understand me. I am imagining you can structure your 
products in a hierarchy. For each node in the tree you traverse all children, 
collecting their attributes into the current node.

-Kallin Nagelberg

-Original Message-
From: hkmortensen [mailto:ko...@yahoo.com] 
Sent: Wednesday, May 19, 2010 11:39 AM
To: solr-user@lucene.apache.org
Subject: RE: Challenge: Searching for variant products and get basic products 
in result set


sorry, what does sku mean?

I understand you like this: indexing base and variants, and including all
attributes (for one base and its variants) in each document. I think that
would work. Thanks.


Nagelberg, Kallin wrote:
 
 I agree that pulling all attributes into the parent sku during indexing
 could work well. Define a Boolean field like 'isVirtual' to identify the
 non-leaf skus, and use a multi-valued field for each of the attributes.
 For now you can do a search like (isVirtual:true AND doorType:screen). If
 at a later date you want the actual variants just search for
 isVirtual:false.
 
 Does that work?
 
 -Kallin Nagelberg
 
 -Original Message-
 From: Leonardo Menezes [mailto:leonardo.menez...@googlemail.com] 
 Sent: Wednesday, May 19, 2010 11:13 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Challenge: Searching for variant products and get basic
 products in result set
 
 if that is so, and maybe, you have for example, two variants of cars with
 automatic, what would define on which one was the hit? or field dont share
 common information across variants? if they do share, you wouldnt be able
 to
 define in which one was the hit(because it was on both of them) and would
 either have to pick one randomly, or retrieve both. if they dont share
 that
 info, you would have that covered, since only one would match any given
 query.
 
 On Wed, May 19, 2010 at 5:04 PM, hkmortensen ko...@yahoo.com wrote:
 

 thanks. Currently not, but requirements change all the time as always ;-)
 If we get a requirement, that a facet shall be material of doors, we
 will
 need to know which variant was the hit. I would like to be prepared for
 that.




 Leonardo Menezes wrote:
 
  would you then need to know in which variant was your match produced?
  because if not, you can just index the whole thing as one single
  document...
 
  On Wed, May 19, 2010 at 4:23 PM, hkmortensen ko...@yahoo.com wrote:
 
 
  I do searching for products. Each base product exist in variants as
 well.
  One
  variant has a glass door, another a steel door etc. The variants can
 have
  diffent prices. The base product does not really exist, only the
 variants
  exists IRL. The case corresponds to cars: the car model is the base
  product,
  with color variants  or with automatic/manual etc.
 
  I want to search for variants, but I only want to have base products
 in
  the
  result. Ie when one or more variants from the same base product are
  found,
  only the base product shall be in the search result.
 
  Does somebody have an idea how this could be done?
 
  Best regards
 
  Henning
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Challenge-Searching-for-variant-products-and-get-basic-products-in-result-set-tp829218p829218.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Challenge-Searching-for-variant-products-and-get-basic-products-in-result-set-tp829218p829319.html
 Sent from the Solr - User mailing list archive at Nabble.com.

 
 

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Challenge-Searching-for-variant-products-and-get-basic-products-in-result-set-tp829218p829435.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: disable caches in real time

2010-05-19 Thread Nagelberg, Kallin
I suppose you are still losing some performance on the replicated box, since it 
needs to use some resources to warm the cache. It would be nice if a warmed 
cache could be replicated from the master, though perhaps that's not practical. 
Chris is right though: the newly updated index created by a commit is not seen 
by users until it has been warmed, at which point it is atomically swapped in.

-Kallin Nagelberg



-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Wednesday, May 19, 2010 2:38 PM
To: solr-user@lucene.apache.org
Subject: Re: disable caches in real time


: I've always understood that if you do a commit (replication does it), a new
: searcher is opened, and you lose performance (queries per second) while the
: caches are regenerated. I think I didn't explain my situation correctly

not if you configure your caches with autowarming -- then solr will warm 
up the new caches (on the new index) while the old index still serves 
requests -- this is all managed for you by the SolrCore, no need for core 
swapping.
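
e.g. in solrconfig.xml (the sizes here are only illustrative):

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>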


-Hoss



confused by simple OR

2010-05-13 Thread Nagelberg, Kallin
I must be missing something very obvious here. I have a filter query like so:

(-rootdir:somevalue)

I get results for that filter

However, when I OR it with another term like so I get nothing:

((-rootdir:somevalue) OR (rootdir:somevalue AND someboolean:true))

How is this possible? Have I gone mad?

Thanks,
Kallin Nagelberg




RE: confused by simple OR

2010-05-13 Thread Nagelberg, Kallin
Awesome that works, thanks Ahmet. 

-Kallin Nagelberg

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Thursday, May 13, 2010 12:24 PM
To: solr-user@lucene.apache.org
Subject: Re: confused by simple OR


 I must be missing something very
 obvious here. I have a filter query like so:
 
 (-rootdir:somevalue)
 
 I get results for that filter
 
 However, when I OR it with another term like so I get
 nothing:
 
 ((-rootdir:somevalue) OR (rootdir:somevalue AND
 someboolean:true))
 

Put simply, you cannot combine NOT and OR clauses like you did. It should be 
something like: 

((+*:* -rootdir:somevalue) OR (rootdir:somevalue AND someboolean:true))


  


maximum recommended document cache size

2010-05-13 Thread Nagelberg, Kallin
I am trying to tune my Solr setup so that the caches are well warmed after the 
index is updated. My documents are quite small, usually under 10k. I currently 
have a document cache size of about 15,000, and am warming up 5,000 with a 
query after each indexing. Autocommit is set at 30 seconds, and my caches are 
warming up easily in just a couple of seconds. I've read of concerns regarding 
garbage collection when your cache is too large. Does anyone have experience 
with this? Ideally I would like to get 90% of all documents from the last month 
in memory after each index, which would be around 25,000. I'm doing extensive 
load testing, but if someone has recommendations I'd love to hear them.
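
(For reference, the relevant solrconfig.xml pieces look roughly like this --
the warming query is only a sketch of the kind of thing I mean:)

<documentCache class="solr.LRUCache" size="15000" initialSize="15000"/>
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="sort">datetime desc</str>
      <str name="rows">5000</str>
    </lst>
  </arr>
</listener>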

Thanks,
-Kallin Nagelberg


RE: strange behaviour when sorting, fields are missing in result

2010-05-12 Thread Nagelberg, Kallin
I'm not sure I understand how your results are truncated. They both find 21502 
documents. The fact that you are sorting on '_erstelldatum' ascending and not 
seeing any results for that field on the first page leads me to think that you 
have 'sortMissingLast=false' on that field's fieldType. In that case it would 
put all the documents missing the '_erstelldatum' first. 

-Kallin Nagelberg



-Original Message-
From: markus.rietz...@rzf.fin-nrw.de [mailto:markus.rietz...@rzf.fin-nrw.de] 
Sent: Wednesday, May 12, 2010 9:00 AM
To: solr-user@lucene.apache.org
Subject: strange behaviour when sorting, fields are missing in result

When I do a search, e.g.

http://xxx:8983/solr/select?q=steuer&fl=score,id,__intern,title,__source,_dienststelle,_erstelldatum,__cyear,_stelle

I get a normal result, like

<result name="response" numFound="21502" start="0" maxScore="1.3633566">
<doc>
<float name="score">1.3633566</float>
<int name="__cyear">2009</int>
<str name="__intern">0</str>
<str name="__source">zzz</str>
<str name="_dienststelle">xyz</str>
<long name="_erstelldatum">2009020200</long>
<str name="_stelle">Presse- u. Informationsreferat</str>
<str name="id">34931684</str>
<str name="title">Merkblatt Vereine und Steuern</str>
</doc>

When I do a search with the sort param, my result is suddenly truncated:

http://xxx:8983/solr/select?q=steuer&fl=score,id,__intern,title,__source,_dienststelle,_erstelldatum,__cyear,_stelle&sort=_erstelldatum+asc

<result name="response" numFound="21502" start="0" maxScore="1.3633566">
<doc>
<float name="score">0.14290115</float>
<str name="__intern">0</str>
<str name="__source">isys</str>
<str name="id">18205520</str>
<str name="title">Amtsübersicht </str>
</doc>

So, not all of the fields from the fl param are displayed. This is what the
admin schema browser says about _erstelldatum:

Field Type: long
Properties: Indexed, Tokenized, Stored, Omit Norms, undefined
Schema: Indexed, Tokenized, Stored, Omit Norms, undefined
Index: Stored, Omit Norms, Binary
Index Analyzer: org.apache.solr.analysis.TokenizerChain Details
Tokenizer Class: org.apache.solr.analysis.TrieTokenizerFactory
Query Analyzer: org.apache.solr.analysis.TokenizerChain Details
Tokenizer Class: org.apache.solr.analysis.TrieTokenizerFactory
Docs: 118496
Distinct: 11477
Top Terms
term            frequency
2009072100  655
2003111000  500
2006110800  428
2010012900  412
2006032000  356
2003062000  354
2010041500  313
2010043000  310
2010030100  296
2008110112  260

and this for validFrom

Field Type: long
Properties: Indexed, Tokenized, Stored, Omit Norms, undefined
Schema: Indexed, Tokenized, Stored, Omit Norms, undefined
Index: Stored, Omit Norms, Binary
Index Analyzer: org.apache.solr.analysis.TokenizerChain Details
Tokenizer Class: org.apache.solr.analysis.TrieTokenizerFactory
Query Analyzer: org.apache.solr.analysis.TokenizerChain Details
Tokenizer Class: org.apache.solr.analysis.TrieTokenizerFactory
Docs: 111762
Distinct: 66649
Top Terms
term            frequency
2002101700  315
2003111000  309
2002102100  293
2009042312  258
20060320152000  229
2005060700  227
2010041500  215
2007010100  207
2005061000  205
2010012900  200



I have checked all of our fields; sort works with some of them and not with
others. This is our finding:

(+ means it works, - it doesn't work)

_aktenzeichen   +
_autor  -
_dienststelle   +
_dokumententyp  +
_erstelldatum   -
_hauptthema -
_kurzbeschreibung -
kurzinfoGruppe  -
lastChanged +
objClass-
objType -
publicationHinweis  -
publicationNavigationstitel -
publicationStichwort-
_stelle +
_stichwort  -
_unterthema -
title   +
validFrom   +
validUntil  -
_verteiler  - 
_vertraulich-
_zielgruppen+
__dst   + (all fields but not _stelle)
__intern-
__lokal + (all fields but not _stelle)
__cdate -
__cyear -
__source-
__doctype   +
__mikronav  -

What can lead to this problem?


We have the following fields defined in our schema.xml:

<!-- RZF isys -->
   <field name="_aktenzeichen" type="string" indexed="true" stored="true" />
   <field name="_anlagedoc" type="string" indexed="false" stored="false" />
   <field name="_autor" type="textgen" indexed="true" stored="true" />
   <field name="_dienststelle" type="string" indexed="true" stored="true" />
   <field name="_dokumententyp" type="string" indexed="true" stored="true" />
   <field name="_erstelldatum" type="long" indexed="true" stored="true" />
   <field name="_hauptthema" type="text_de" indexed="true" stored="true" />
   <field name="_kurzbeschreibung" type="text_de" indexed="true" stored="true" />
   <field name="kurzinfoGruppe" type="long" indexed="true" stored="true" mulitValued="true"/>
   <field name="lastChanged" type="long" indexed="true" stored="true" />
   <field name="objClass" type="string" indexed="true" stored="true" />
   <field name="objType" type="string" indexed="true"

caching repeated OR'd terms

2010-05-06 Thread Nagelberg, Kallin
Hey everyone,

I'm having some difficulty figuring out the best way to optimize for a certain 
query situation. My documents have a many-valued field that stores lists of 
IDs. All in all there are probably about 10,000 distinct IDs throughout my 
index. I need to be able to query and find all documents that contain a given 
set of IDs. Ie, I want to find all documents that contain IDs 3, 202, 3030 or 
505. Currently I'm implementing this like so:

q= (myfield:3) OR (myfield:202) OR (myfield:3030) OR (myfield:505).

It's possible that there could be upwards of hundreds of terms, although 90% of 
the time it will be under 10. Ideally I would like to do this with a filter 
query, but I have read that it is impossible to cache OR'd terms in a fq, 
though this feature may come soon. The problem is that the combinations of OR'd 
terms will almost always be unique, so the query cache will have a very low hit 
rate. It would be great if the individual terms could be cached individually, 
but I'm not sure how to accomplish that.

Any suggestions would be welcome!
-Kallin Nagelberg



cache control per-request

2010-05-06 Thread Nagelberg, Kallin
Hey everyone,

Does anyone know if it is possible to control cache behavior on a per-request 
basis? I would like to be able to use the queryResultCache for certain queries, 
but have it bypassed for others. IE, I know at query time if there is 0 chance 
of a hit and would like to avoid the cache on those. If I can do that it leaves 
space in the cache for those that may actually hit.

Thanks,
-Kallin Nagelberg


nstein and 3S

2010-05-05 Thread Nagelberg, Kallin
Hey everyone,

I'm curious if anyone has experiencing working with the company NStein and 
their Solr based search solution S3. Any comments on performance, usability, 
support etc. would be really appreciated.

Thanks,
-Kallin Nagelberg


RE: benefits of float vs. string

2010-04-30 Thread Nagelberg, Kallin
When using numerical types you can do range queries like 3 < myfield <= 10, as well 
as a lot of other interesting mathematical functions that would not be possible 
with a string type.
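
For example (with a made-up field name), an inclusive numeric range looks like:

q=myfield:[3 TO 10]

and on a trie-encoded numeric field that range is evaluated efficiently.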

Thanks for the info Yonik,
-Kallin Nagelberg

-Original Message-
From: Dennis Gearon [mailto:gear...@sbcglobal.net] 
Sent: Friday, April 30, 2010 1:27 AM
To: solr-user@lucene.apache.org; yo...@lucidimagination.com
Subject: Re: benefits of float vs. string

Please explain a range query? 

tia :-)

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 4/29/10, Yonik Seeley yo...@lucidimagination.com wrote:

 From: Yonik Seeley yo...@lucidimagination.com
 Subject: Re: benefits of float vs. string
 To: solr-user@lucene.apache.org
 Date: Thursday, April 29, 2010, 1:01 PM
 On Wed, Apr 28, 2010 at 11:22 AM,
 Nagelberg, Kallin
 knagelb...@globeandmail.com
 wrote:
  Does anyone have an idea about the performance
 benefits of searching across floats compared to strings? I
 have one multi-valued field that contains about 3000
 distinct IDs across 5 million documents. I am going to be a
 lot of queries like q=id:102 OR id:303 OR id:305, etc. Right
 now it is a String but I am going to switch to a float as
 intuitively it ought to be easier to filter a number than a
 string.
 
 
 There won't be any difference in search speed for term
 queries as you
 show above.
 If you don't need to do sorting or range queries on that
 field, I'd
 leave it as a String.
 
 
 -Yonik
 Apache Lucene Eurocon 2010
 18-21 May 2010 | Prague
 


prefixing with dismax

2010-04-30 Thread Nagelberg, Kallin
Hey,

I've been using the dismax query parser so that I can pass a user created 
search string directly to Solr. Now I'm getting the requirement that something 
like 'Bo' must match 'Bob', or 'Bob Jo' must match 'Bob Jones'. I can't think 
of a way to make this happen with Dismax, though it's pretty simple with 
standard syntax. I guess I would just split on space and create ANDed terms 
like 'myfield:token*' . This doesn't feel like a great approach though, since 
I'm losing all of the escaping magic of Dismax. Does anyone have any cleaner 
solutions to this sort of problem? I imagine it's quite common.
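
(The workaround I mean, roughly -- just a sketch; ClientUtils does the
escaping, and you may still need to lowercase tokens to match the index
analysis:)

import org.apache.solr.client.solrj.util.ClientUtils;

public class PrefixQueryBuilder {
    // Build ANDed prefix terms from whitespace-separated user input,
    // e.g. "Bob Jo" -> myfield:Bob* AND myfield:Jo*
    public static String build(String userInput) {
        StringBuilder q = new StringBuilder();
        for (String t : userInput.trim().split("\\s+")) {
            if (q.length() > 0) q.append(" AND ");
            q.append("myfield:").append(ClientUtils.escapeQueryChars(t)).append("*");
        }
        return q.toString();
    }
}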

Thanks,
Kallin Nagelberg


RE: Slow Date-Range Queries

2010-04-29 Thread Nagelberg, Kallin
You might want to look at DateMath, 
http://lucene.apache.org/solr/api/org/apache/solr/util/DateMathParser.html. I 
believe the default precision is to the millisecond, so if you can afford to round 
to the nearest second or even minute you might see some performance gains.
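
For example (a sketch, using the field names from your query below), rounding
to the minute:

valid_from:[* TO NOW/MINUTE] AND valid_till:[NOW/MINUTE TO *]

Since the rounded value stays constant for a whole minute, identical query
strings repeat and the caches can actually get hits.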

-Kallin Nagelberg

-Original Message-
From: Jan Simon Winkelmann [mailto:winkelm...@newsfactory.de] 
Sent: Thursday, April 29, 2010 4:36 AM
To: solr-user@lucene.apache.org
Subject: Slow Date-Range Queries

Hi,

I am currently having serious performance problems with date range queries. 
What I am doing, is validating a datasets published status by a valid_from and 
a valid_till date field.

I did get a performance boost of ~ 100% by switching from a normal 
solr.DateField to a solr.TrieDateField with precisionStep=8, however my query 
still takes about 1,3 seconds.

My field defintion looks like this:

<fieldType name="date" class="solr.TrieDateField" precisionStep="8" sortMissingLast="true" omitNorms="true"/>

<field name="valid_from" type="date" indexed="true" stored="false" required="false" />
<field name="valid_till" type="date" indexed="true" stored="false" required="false" />


And the query looks like this:
(((valid_from:[* TO 2010-04-29T10:34:12Z]) AND (valid_till:[2010-04-29T10:34:12Z 
TO *])) OR ((*:* -valid_from:[* TO *]) AND (*:* -valid_till:[* TO *])))

I use the empty checks for datasets which do not have a valid from/till range.


Is there any way to get this any faster? Would it be faster using 
unix-timestamps with int fields?

I would appreciate any insight and help on this.

regards,
Jan-Simon



RE: Evangelism

2010-04-29 Thread Nagelberg, Kallin
I had a very hard time selling Solr to business folks. Most are of the mind 
that if you're not paying for something it can't be any good. That might also 
be why they refrain from posting 'powered by solr' on their website, as if it 
might show them to be cheap. They are also fearful of lack of support should 
you get hit by a bus. This might be remedied by recommending professional 
services from a company such as Lucid Imagination.

I think your best bet is to create a working demo with your data and show them 
the performance. 

Cheers,
-Kallin Nagelberg



-Original Message-
From: Israel Ekpo [mailto:israele...@gmail.com] 
Sent: Thursday, April 29, 2010 2:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Evangelism

Their main search page has the Powered by Solr logo

http://www.lucidimagination.com/search/



On Thu, Apr 29, 2010 at 2:18 PM, Israel Ekpo israele...@gmail.com wrote:

 Checkout Lucid Imagination

 http://www.lucidimagination.com/About-Search

 This should convince you.


 On Thu, Apr 29, 2010 at 2:10 PM, Daniel Baughman da...@hostworks.comwrote:

 Hi I'm new to the list here,



 I'd like to steer someone in the direction of Solr, and I see the list of
 companies using solr, but none have a powered by solr logo or anything.



 Does anyone have any great links with evidence to majorly successful solr
 projects?



 Thanks in advance,



 Dan B.






 --
 Good Enough is not good enough.
 To give anything less than your best is to sacrifice the gift.
 Quality First. Measure Twice. Cut Once.
 http://www.israelekpo.com/




-- 
Good Enough is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/


RE: nfs vs sas in production

2010-04-28 Thread Nagelberg, Kallin
Thanks all,

Tom, your results are interesting. We both have about 5 million documents, but 
my index is 20 gigs vs. yours 2 TB. I imagine we'll have a much easier time 
getting quick responses against these small documents compared to your 
multi-second queries. As for index/search disk contention we're planning to 
have independent indexing and searching machines, probably following some of 
the guidelines in this great article, 
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr#resources.haproxy
 .

-Kallin Nagelberg

-Original Message-
From: Burton-West, Tom [mailto:tburt...@umich.edu] 
Sent: Tuesday, April 27, 2010 6:03 PM
To: solr-user@lucene.apache.org
Subject: RE: nfs vs sas in production

Hi Kallin,

Given the previous postings on the list about terrible NFS performance we were 
pleasantly surprised when we did some tests against a well tuned NFS RAID array 
on a private network.  We got reasonably good results (given our large index 
sizes.) See 
http://www.hathitrust.org/blogs/large-scale-search/current-hardware-used-testing
  and 
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance.   

Just prior to going into production we moved from direct attached storage to a 
very high performance NAS in production for a number of reasons including ease 
of management as we scale out.  One of the reasons was to reduce contention 
between indexing/optimizing and search instances for disk I/O.  See 
http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-50-volumes-5-million-volumes-and-beyond
 for details.

Tom

-Original Message-
From: Nagelberg, Kallin [mailto:knagelb...@globeandmail.com] 
Sent: Tuesday, April 27, 2010 4:13 PM
To: 'solr-user@lucene.apache.org'
Subject: nfs vs sas in production

Hey,

A question was raised during a meeting about our new Solr based search 
projects. We're getting 4 cutting edge servers each with something like 24 Gigs 
of ram dedicated to search. However there is some problem with the amount of 
SAS based storage each machine can handle, and people wonder if we might have 
to use a NFS based drive instead. Does anyone have any experience using SAS vs. 
NFS drives for Solr? Any feedback would be appreciated!

Thanks,
-Kallin Nagelberg


benefits of float vs. string

2010-04-28 Thread Nagelberg, Kallin
Hi,

Does anyone have an idea about the performance benefits of searching across 
floats compared to strings? I have one multi-valued field that contains about 
3000 distinct IDs across 5 million documents. I am going to be doing a lot of queries 
like q=id:102 OR id:303 OR id:305, etc. Right now it is a String but I am going 
to switch to a float as intuitively it ought to be easier to filter a number 
than a string. I'm just curious if this should in fact bring a benefit, and 
more generally what the benefits/penalties to using numerical over string field 
types is.

Thanks,
Kallin Nagelberg


nfs vs sas in production

2010-04-27 Thread Nagelberg, Kallin
Hey,

A question was raised during a meeting about our new Solr based search 
projects. We're getting 4 cutting edge servers each with something like 24 Gigs 
of ram dedicated to search. However there is some problem with the amount of 
SAS based storage each machine can handle, and people wonder if we might have 
to use a NFS based drive instead. Does anyone have any experience using SAS vs. 
NFS drives for Solr? Any feedback would be appreciated!

Thanks,
-Kallin Nagelberg


RE: Benchmarking Solr

2010-04-12 Thread Nagelberg, Kallin
I have been using Jmeter to perform some load testing. In your case you might 
like to take a look at 
http://jakarta.apache.org/jmeter/usermanual/component_reference.html#CSV_Data_Set_Config
 . This will allow you to use a random item from your query list.
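
For example (a sketch -- the file and variable names are arbitrary): put one
query per line in queries.csv, point the CSV Data Set Config at that file with
the variable name 'query', and reference it from your HTTP sampler path as
/solr/select?q=${query}. Each sampled request then picks up the next line from
the file.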

Regards,
Kallin Nagelberg

-Original Message-
From: Blargy [mailto:zman...@hotmail.com] 
Sent: Friday, April 09, 2010 9:47 PM
To: solr-user@lucene.apache.org
Subject: Benchmarking Solr


I am about to deploy Solr into our production environment and I would like to
do some benchmarking to determine how many slaves I will need to set up.
Currently the only way I know how to benchmark is to use Apache Benchmark
but I would like to be able to send random requests to the Solr... not just
one request over and over.

I have a sample data set of 5000 user entered queries and I would like to be
able to use AB to benchmark against all these random queries. Is this
possible?

FYI our current index is ~1.5 gigs with ~5m documents and we will be using
faceting quite extensively. Our average requests per day are ~2m. We will be
running RHEL with about 8-12g ram. Any idea how many slaves might be
required to handle our load?

Thanks
-- 
View this message in context: 
http://n3.nabble.com/Benchmarking-Solr-tp709561p709561.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: index corruption / deployment strategy

2010-04-09 Thread Nagelberg, Kallin
Thanks Erik,

I forwarded your thoughts to management and put in a good word for Lucid 
Imagination.

Regards,
Kallin Nagelberg

-Original Message-
From: Erik Hatcher [mailto:erik.hatc...@gmail.com] 
Sent: Thursday, April 08, 2010 2:18 PM
To: solr-user@lucene.apache.org
Subject: Re: index corruption / deployment strategy

Kallin,

It's a very rare report, and practically impossible I'm told, to  
corrupt the index these days thanks to Lucene's improvements over the  
last several releases (ignoring hardware malfunctions).

A single index is the best way to go, in my opinion - though at your  
scale you're probably looking at sharding it and using distributed  
search.  So you'll have multiple physical indexes, one for each shard,  
and a single virtual index in the eyes of your searching clients.

Backups, of course, are sensible, and Solr's replication capabilities  
can help here by requesting them periodically.  You'll be using  
replication anyway to scale to your query volume.

As for hardware scaling considerations, there are variables to  
consider like how faceting, sorting, and querying speed across a  
single large index versus sharding.  I'm guessing you'll be best with  
at least two shards, though possibly more considering these variables.

Erik
 @ Lucid Imagination

p.s. have your higher-ups give us a call if they'd like to discuss  
their concerns and consider commercial support for your mission  
critical big scale use of Solr :)



On Apr 8, 2010, at 1:33 PM, Nagelberg, Kallin wrote:
 I've been doing work evaluating Solr for use on a high-traffic  
 website for some time and things are looking positive. I have some  
 concerns from my higher-ups that I need to address. I have suggested  
 that we use a single index in order to keep things simple, but there  
 are suggestions to split our documents amongst different indexes.

 The primary motivation for this split is a worry about potential  
 index corruption. IE, if we only have one index and it becomes  
 corrupt what do we do? I never considered this to be an issue since  
 we would have backups etc., but I think they have had issues with  
 other search technology in the past where one big index resulted in  
 frequent and difficult to recover from corruption. Do you think this  
 is a concern with Solr? If so, what would you suggest to mitigate  
 the risk?

 My second question involves general deployment strategy. We will  
 expect about 50 million documents, each on average a few paragraphs,  
 and our website receives maybe 10 million hits a day. Can anyone  
 provide an idea of # of servers, clustering/replication setup etc.  
 that might be appropriate for this scenario? I'm interested to hear  
 what other's experience is with similar situations.

 Thanks,
 -Kallin Nagelberg




RE: multicore embedded swap / reload etc.

2010-03-26 Thread Nagelberg, Kallin
Thanks everyone,
I was following the solrj wiki which says:



If you want to use MultiCore features, then you should use this:


File home = new File( "/path/to/solr/home" );
File f = new File( home, "solr.xml" );
CoreContainer container = new CoreContainer();
container.load( "/path/to/solr/home", f );

EmbeddedSolrServer server = new EmbeddedSolrServer( container, "core name as defined in solr.xml" );
...


I'm just a little confused with the disconnect between that and what I see 
about managing multiple cores here: http://wiki.apache.org/solr/CoreAdmin . If 
someone could provide some high-level directions it would be greatly 
appreciated.

Thanks,
-Kallin Nagelberg


-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Friday, March 26, 2010 7:54 AM
To: solr-user@lucene.apache.org
Subject: Re: multicore embedded swap / reload etc.

Embedded supports MultiCore  - it's the direct core connection thing  
that supports one.
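
For the swap itself, something like this against the CoreContainer you
already built should do it (a sketch - the core names have to match
solr.xml):

// atomically exchange which index the 'live' name points at
container.swap("live", "staging");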

- Mark

http://www.lucidimagination.com (mobile)

On Mar 26, 2010, at 7:38 AM, Erik Hatcher erik.hatc...@gmail.com  
wrote:

 But wait... embedded Solr doesn't support multicore, does it?  Just  
 off memory, I think it's fixed to a single core.

Erik

 On Mar 25, 2010, at 10:31 PM, Lance Norskog wrote:

  All operations through SolrJ work exactly the same against the
 Solr web app and embedded Solr. You code the calls to update cores
 with the same SolrJ APIs either way.

 On Wed, Mar 24, 2010 at 2:19 PM, Nagelberg, Kallin
 knagelb...@globeandmail.com wrote:
 Hi,

 I've got a situation where I need to reindex a core once a day. To  
 do this I was thinking of having two cores, one 'live' and one  
 'staging'. The app is always serving 'live', but when the daily  
 index happens it goes into 'staging', then staging is swapped into  
 'live'. I can see how to do this sort of thing over http, but I'm  
 using an embedded solr setup via solrJ. Any suggestions on how to  
 proceed? I could just have two solrServer's built from different  
 coreContainers, and then swap the references when I'm ready, but I  
 wonder if there is a better approach. Maybe grab a hold of the  
 CoreAdminHandler?

 Thanks,
 Kallin Nagelberg




 -- 
 Lance Norskog
 goks...@gmail.com



multicore embedded swap / reload etc.

2010-03-24 Thread Nagelberg, Kallin
Hi,

I've got a situation where I need to reindex a core once a day. To do this I 
was thinking of having two cores, one 'live' and one 'staging'. The app is 
always serving 'live', but when the daily index happens it goes into 'staging', 
then staging is swapped into 'live'. I can see how to do this sort of thing 
over HTTP, but I'm using an embedded Solr setup via SolrJ. Any suggestions on 
how to proceed? I could just have two SolrServers built from different 
CoreContainers, and then swap the references when I'm ready, but I wonder if 
there is a better approach. Maybe grab a hold of the CoreAdminHandler?

Thanks,
Kallin Nagelberg


RE: lowercasing for sorting

2010-03-23 Thread Nagelberg, Kallin
Thanks, and my cover is apparently blown :P

We're looking at solr for a number of applications, from taking the load off 
the database, to user searching etc. I don't think I'll get fired for saying 
that :P

Thanks,
Kallin Nagelberg

-Original Message-
From: Binkley, Peter [mailto:peter.bink...@ualberta.ca] 
Sent: Tuesday, March 23, 2010 2:09 PM
To: solr-user@lucene.apache.org
Subject: RE: lowercasing for sorting

Solr makes this easy:

<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>

You can populate this field from another field using copyField, if you
also need to be able to search or display the original values.
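
Putting it together, a sketch (the type and field names are just examples):

<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="title_sort" type="alphaOnlySort" indexed="true" stored="false"/>
<copyField source="title" dest="title_sort"/>

Then sort with sort=title_sort+asc.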

Just out of curiosity, can you tell us anything about what the Globe and
Mail is using Solr for? (assuming the question is work-related)

Peter


 -Original Message-
 From: Nagelberg, Kallin [mailto:knagelb...@globeandmail.com] 
 Sent: Tuesday, March 23, 2010 11:07 AM
 To: 'solr-user@lucene.apache.org'
 Subject: lowercasing for sorting
 
 I'm trying to perform a case-insensitive sort on a field in 
 my index that contains values like
 
 aaa
 bbb
 AA
 BB
 
 And I get them sorted like:
 
 aaa
 bbb
 AA
 BB
 
 When I would like them:
 
 aa
 aaa
 bb
 bbb
 
 To do this I'm trying to set up a fieldType whose sole purpose 
 is to lowercase a value on query and index. I don't want to 
 tokenize the value, just lowercase it. Any ideas?
 
 Thanks,
 Kallin Nagelberg
 


RE: How to use dismax and boosting properly?

2010-02-25 Thread Nagelberg, Kallin
Try setting the boost to 0 for the fields you don't want to contribute to the 
score.

Kallin Nagelberg

-Original Message-
From: Jason Chaffee [mailto:jchaf...@ebates.com] 
Sent: Thursday, February 25, 2010 4:03 PM
To: solr-user@lucene.apache.org
Subject: How to use dismax and boosting properly?

I am using dismax and I have configured it to search 3 different fields,
with one field getting an extra boost so that the results of that
field are at the top of the result set.  Then, I sort the results by another
field to get the ordering.

 

My problem is that the scores are being skewed by adding the scores
from the different fields.  What I really want is to have all matches in
the boost field have an equal score and take precedence over matches
from other fields.  I want them to have the same score so that the
sorting will sort them alphabetically.   Therefore, the scores must be
the same.  Because the query is being found in all three fields with
different numbers of occurrences, some scores are being skewed in the
boosted matches, which puts them at the top of my results when,
alphabetically, they should be near the bottom.

 

Here is an example, in case my explanation isn't clear:

 

I have dismax with the following config:

<str name="qf">Field1^3.0 Field2^0.1 Field3^0.1</str>

<str name="sort">score desc, sortField asc</str>

 

Where sortField is the original keyword token, without any processing
except for lowercase.

 

Field1 (the boosted field)

 a

at

at

att

 
a

ab

abe

abeb

abebo

abeboo

abebook

abebooks




 

 

Field2 

a

at

at

att

a

at

att

a

at

at

at 

at 

at  t

 

att

att

att

at  t

att

att

at  t

att

att

at  t

att

att

at  t

att

att

at  t

att

att

at  t

att

att

at  t

 

abebooks

 

a

ab

abe

abeb

abebo

abeboo

abebook

abebooks

 

 

Field3

a

at

at

att

a

at

att

a

at

at

at 

at 

at  t

 

att

att

att

at  t

att

att

at  t

att

att

at  t

att

att

at  t

att

att

at  t

att

att

at  t

att

att

at  t

att

att

at  t

 

 

abebooks

 

a

ab

abe

abeb

abebo

abeboo

abebook

abebooks

 

The user types in the query 'a':  

 

Here is the debugQuery:

 

  <str name="ATT">

5.4186125 = (MATCH) sum of:

  2.7147598 = (MATCH) max plus 0.1 times others of:

0.10907243 = (MATCH) weight(Field2:a^0.1 in 80), product of:

  0.01970195 = queryWeight(Field2:a^0.1), product of:

0.1 = boost

3.1962826 = idf(docFreq=117, maxDocs=1061)

0.0616402 = queryNorm

  5.5361238 = (MATCH) fieldWeight(Field2:a in 80), product of:

1.7320508 = tf(termFreq(Field2:a)=3)

3.1962826 = idf(docFreq=117, maxDocs=1061)

1.0 = fieldNorm(field=Field2, doc=80)

2.7038527 = (MATCH) weight(Field1:a^3.0 in 80), product of:

  0.7071054 = queryWeight(Field1:a^3.0), product of:

3.0 = boost

3.8238325 = idf(docFreq=62, maxDocs=1061)

0.0616402 = queryNorm

  3.8238325 = (MATCH) fieldWeight(Field1:a in 80), product of:

1.0 = tf(termFreq(Field1:a)=1)

3.8238325 = idf(docFreq=62, maxDocs=1061)

1.0 = fieldNorm(field= Field1, doc=80)

  2.7038527 = (MATCH) weight(Field1:a^3.0 in 80), product of:

0.7071054 = queryWeight(Field1:a^3.0), product of:

  3.0 = boost

  3.8238325 = idf(docFreq=62, maxDocs=1061)

  0.0616402 = queryNorm

3.8238325 = (MATCH) fieldWeight(Field1:a in 80), product of:

  1.0 = tf(termFreq(Field1:a)=1)

  3.8238325 = idf(docFreq=62, maxDocs=1061)

  1.0 = fieldNorm(field= Field1, doc=80)

</str>

 

  <str name="Abebooks">

5.4140024 = (MATCH) sum of:

  2.71015 = (MATCH) max plus 0.1 times others of:

0.062973 = (MATCH) weight(edgeNGramStandardField:a^0.1 in 138),
product of:

  0.01970195 = queryWeight(edgeNGramStandardField:a^0.1), product
of:

0.1 = boost

3.1962826 = idf(docFreq=117, maxDocs=1061)

0.0616402 = queryNorm

  3.1962826 = (MATCH) fieldWeight(edgeNGramStandardField:a in 138),
product of:

1.0 = tf(termFreq(edgeNGramStandardField:a)=1)

3.1962826 = idf(docFreq=117, maxDocs=1061)

1.0 = fieldNorm(field=edgeNGramStandardField, doc=138)

2.7038527 = (MATCH) weight(edgeNGramKeywordField:a^3.0 in 138),
product of:

  0.7071054 = queryWeight(edgeNGramKeywordField:a^3.0), product of:

3.0 = boost

3.8238325 = idf(docFreq=62, maxDocs=1061)

0.0616402 = queryNorm

  3.8238325 = (MATCH) fieldWeight(edgeNGramKeywordField:a in 138),
product of:

1.0 = tf(termFreq(edgeNGramKeywordField:a)=1)

3.8238325 = idf(docFreq=62, maxDocs=1061)

1.0 = fieldNorm(field=edgeNGramKeywordField, doc=138)

  2.7038527 = (MATCH) weight(edgeNGramKeywordField:a^3.0 in 138),
product of:

0.7071054 = queryWeight(edgeNGramKeywordField:a^3.0), product of:

  3.0 = boost

  3.8238325 = idf(docFreq=62, maxDocs=1061)

  0.0616402 = queryNorm

3.8238325 = (MATCH) 

stop words make dismax fail

2010-02-24 Thread Nagelberg, Kallin
I'm having a problem when users enter stopwords in their query. I'm using a 
dismax request handler against a field setup  like:

<fieldType name="simpleText" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
  </analyzer>
</fieldType>



The problem is that when a user enters a query like 'meet the president', zero 
results are returned. I imagine it has something to do with 'the' being 
stripped out, then only 2 of the 3 terms matching. As a temporary workaround I 
set minshouldmatch to 1 so I do get results. That causes other problems though, 
such as 'the' never being highlighted in the results. Am I doing something 
totally wrong?

Thanks,
Kallin Nagelberg


including 'the' dismax query kills results

2010-02-18 Thread Nagelberg, Kallin
I've noticed some peculiar behavior with the dismax search handler.

In my case I'm making the search "The British Open", and am getting 0 results. 
When I change it to "British Open" I get many hits. I looked at the query 
analyzer and it should be broken down to "british" and "open" tokens ('the' is 
a stopword). I imagine it is doing an 'and' type search, and by setting the 
'mm' parameter to 1 I once again get results for 'the british open'. I would 
like mm to be 100% however, but just not care about stopwords. Is there a way 
to do this?

Thanks,
-Kal


filter queries not fully filtering

2010-02-16 Thread Nagelberg, Kallin
Hi everyone,

I am attempting to implement a faceted drill down feature with Solr. I am 
having problems explaining some results of the fq parameter.

Let's say I have two fields, 'people' and 'category'. I do a search for 'dog' 
and ask to facet on the people and category fields.

I am told that there are 200 documents with people='bob' and 100 with 
category='news'.

I would expect that when I make the query q=dog, fq=category:news that the new 
faceting results should never show more than 100 entries. However this is not 
what I see. Instead I see facets on fields exceeding 100. How could that be 
when I just told it to filter and only show the 100 articles that contained 
category:news?

Thanks,
Kallin Nagelberg.


RE: filter queries not fully filtering

2010-02-16 Thread Nagelberg, Kallin
Problem solved. I wasn't quoting the value. Since I was using names such as 
'Gary Bettman', Solr must have been giving me all the Garys.
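
(In other words, the filter needed to be fq=people:"Gary Bettman"; without the
quotes, 'Bettman' gets parsed against the default field, which is why every
Gary came back.)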

-Original Message-
From: Nagelberg, Kallin [mailto:knagelb...@globeandmail.com] 
Sent: Tuesday, February 16, 2010 3:22 PM
To: 'solr-user@lucene.apache.org'
Subject: filter queries not fully filtering

Hi everyone,

I am attempting to implement a faceted drill down feature with Solr. I am 
having problems explaining some results of the fq parameter.

Let's say I have two fields, 'people' and 'category'. I do a search for 'dog' 
and ask to facet on the people and category fields.

I am told that there are 200 documents with people='bob' and 100 with 
category='news'.

I would expect that when I make the query q=dog, fq=category:news that the new 
faceting results should never show more than 100 entries. However this is not 
what I see. Instead I see facets on fields exceeding 100. How could that be 
when I just told it to filter and only show the 100 articles that contained 
category:news?

Thanks,
Kallin Nagelberg.


parabolic type function centered on a date

2010-02-11 Thread Nagelberg, Kallin
Hi everyone,

I'm trying to enhance a more like this search I'm conducting by boosting the 
documents that have a date close to the original. I would like to do something 
like a parabolic function centered on the date (would make tuning a little more 
effective), though a linear function would probably suffice. Has anyone 
attempted this? If so I'd love to hear your strategy and results!

Thanks,
Kallin Nagelberg


ord on TrieDateField always returning max

2010-01-06 Thread Nagelberg, Kallin
Hi everyone,

I've been trying to add a date based boost to my queries. I have a field like:

<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
<field name="datetime" type="tdate" indexed="true" stored="true" required="true" />

When I look at the datetime field in the solr schema browser I can see that 
there are 9051 distinct dates.

When I try to add the parameter to my query like: bf=ord(datetime) (on a dismax 
query) I always get 9051 as the result of the function. I see this in the debug 
data:


1698.6041 = (MATCH) FunctionQuery(top(ord(datetime))), product of:

9051.0 = 9051

1.0 = boost

0.18767032 = queryNorm



It is exactly the same for every result, even though each result has a 
different value for datetime.



Does anyone have any suggestions as to why this could be happening? I have done 
extensive googling with no luck.



Thanks,

Kallin Nagelberg.



RE: ord on TrieDateField always returning max

2010-01-06 Thread Nagelberg, Kallin
Thanks Yonik, I was just looking at that actually.
Trying something like recip(ms(NOW,datetime),3.16e-11,1,1)^10 now (3.16e-11 is roughly 1/(milliseconds in a year), so the boost falls to half for docs about a year old).
My 'inspiration' for the ord method was actually the Solr 1.4 Enterprise Search 
server book. Page 126 has a section 'using reciprocals and rord with dates'. 
You should let those guys know what's up!

Thanks,
Kallin.

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Wednesday, January 06, 2010 11:23 AM
To: solr-user@lucene.apache.org
Subject: Re: ord on TrieDateField always returning max

Besides using up a lot more memory, ord() isn't even going to work for
a field with multiple tokens indexed per value (like tdate).
I'd recommend using a function on the date value itself.
http://wiki.apache.org/solr/FunctionQuery#ms

-Yonik
http://www.lucidimagination.com



On Wed, Jan 6, 2010 at 10:52 AM, Nagelberg, Kallin
knagelb...@globeandmail.com wrote:
 Hi everyone,

 I've been trying to add a date based boost to my queries. I have a field like:

 fieldType name=tdate class=solr.TrieDateField omitNorms=true 
 precisionStep=6 positionIncrementGap=0/
 field name=datetime type=tdate indexed=true stored=true 
 required=true /

 When I look at the datetime field in the solr schema browser I can see that 
 there are 9051 distinct dates.

 When I try to add the parameter to my query like: bf=ord(datetime) (on a 
 dismax query) I always get 9051 as the result of the function. I see this in 
 the debug data:


 1698.6041 = (MATCH) FunctionQuery(top(ord(datetime))), product of:

    9051.0 = 9051

    1.0 = boost

    0.18767032 = queryNorm



 It is exactly the same for every result, even though each result has a 
 different value for datetime.



 Does anyone have any suggestions as to why this could be happening? I have 
 done extensive googling with no luck.



 Thanks,

 Kallin Nagelberg.