RE: Lock obtain timeout

2007-01-15 Thread Stephanie Belton
Thanks for that - I have made the following changes:

- optimize more often
- omitNorms on all non-fulltext fields
- useCompoundFile=true (will keep an eye on performance)

And that seems to have solved the problem.
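
For reference, a rough sketch of what the omitNorms change looks like in schema.xml (field names and types below are just placeholders, not our actual schema):

  <field name="price"      type="sfloat" indexed="true" stored="true" omitNorms="true"/>
  <field name="created_on" type="date"   indexed="true" stored="true" omitNorms="true"/>
  <field name="body"       type="text"   indexed="true" stored="true"/>

Norms are only needed for length normalization and index-time boosts, so non-fulltext fields can usually drop them.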

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: 13 January 2007 01:37
To: solr-user@lucene.apache.org
Subject: Re: Lock obtain timeout



: Are the two problems related? Looking through the mailing list it seems
: that changing the settings for useCompoundFile from false to true could
: help, but before I do that I would like to understand if there are
: undesirable side effects. Why isn't this param set to true by
: default?

Too Many Open Files can result from lots of different causes: one is that
you have so many indexed fields with norms that the number of files in
your index is too big -- that's the use case where useCompoundFile=true
can help you -- but it's not set that way by default because it can make
searching slower.
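
if you do decide to flip it, this is roughly where the setting lives in
solrconfig.xml (values here are just illustrative, check your own config):

  <indexDefaults>
    <useCompoundFile>true</useCompoundFile>
  </indexDefaults>
  <mainIndex>
    <useCompoundFile>true</useCompoundFile>
  </mainIndex>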

the other reason why you can have too many open files is if you are
getting more concurrent requests than you can handle -- or if the clients
initiating those requests aren't closing them properly (sockets count as
files too).

understanding why you are getting these errors requires that you look at
what your hard and soft file limits are (ulimit -aH and ulimit -aS on my
system) and what files are in use by Solr when these errors occur (lsof -p
_solrpid_).

to answer your earlier question: i *think* you may be getting the lock
timeout errors because Solr can't access the lock file -- because it can't
open any more files ... i'm not 100% sure.




-Hoss





Re: Does Solr support integration with the Compass framework?

2007-01-15 Thread Graham O'Regan

Hi Marios,

It can store the index in a database, but I wouldn't want to use that
route myself. Here is a quick link to the docs which provides an
overview of the transactional features:


http://www.opensymphony.com/compass/versions/1.1M3/html/core-searchengine.html#core-searchengine-transaction

HTH,

Graham

Marios Skounakis wrote:

Does compass store the lucene index in a database? If this is the
case, it is fairly straightforward to understand how this happens.

If the index is still in disk files, how does it provide transactional
semantics? Would you care to give a high-level overview?

TIA
Marios

On 1/15/07, Graham O'Regan [EMAIL PROTECTED] wrote:

compass provides a transaction manager for lucene indexes so you can
incorporate an index update and database update in a single transaction,
or roll back if either fails. that's why it would be interesting to see
the two working together.

Marios Skounakis wrote:
 Hi all,

 
 I am working on a hibernate-solr bridge that will behave like the
 compass Hibernate3GpsDevice.  It gets a callback from hibernate when
 an object is stored, checks if it is 'SolrDocumentable' and sends it
 to solr using the client library from:
   http://issues.apache.org/jira/browse/SOLR-20  (solr-client.zip)
 
 If you're interested, i can send you my initial version...  when i'm
 further along, i'll try to post it to solr/client/java

 That would be great - we're also facing the same issue of rolling our
 own code to keep a Solr index in sync with a MySQL DB that we access
 via Hibernate.

 I wonder whether people who try to keep a Solr (or Lucene) index in
 sync with a database are at all worried about index update failures.

 Propagating the update from the DB to the index is one thing, and
 relatively easy to implement. But how do you handle failures to update
 either the index or the DB since you cannot enforce transactional
 semantics over both updates? Or do index update failures occur so
 infrequently that you do not worry about it?

 Marios






Re: Does Solr support integration with the Compass framework?

2007-01-15 Thread Ryan McKinley

On 1/15/07, Lukas Vlcek [EMAIL PROTECTED] wrote:

Ryan,

Could you be more specific on your statement?

On 1/12/07, Ryan McKinley [EMAIL PROTECTED] wrote:

 I started using compass a few months back.  It is an amazing system:
 with almost no effort, it just works.  BUT the showstopper (for me)
 was that you could not easily update the index from multiple machines.
 Compass lets you put the lucene indexes in SQL with JDBC, but this
 felt wrong ... then i found solr, and it solves most things.



What exactly do you mean by "you could not easily update the index from
multiple machines"?



Consider a standard load balanced web setup with three machines:
DB1 - running mysql
WEB1 - webapp talking to DB1
WEB2 - webapp talking to DB1
...

In compass, the lucene index is stored on local disk - WEB1 writes its
lucene index on WEB1.  So WEB2 does not see the changes that WEB1 makes
(without adding more logic).

To solve this problem, compass is able to write its index into SQL.
WEB1 & WEB2 can write the lucene index in DB1.  But the performance is
not great and it seems to be something people discourage (though i have
not tried it).

Also, take a look at:
http://forums.opensymphony.com/thread.jspa?messageID=100071


Could you describe your problem in more detail (and a possible workaround
if you found any) please?



workaround?  I'm now using solr :)

Otherwise, consider:

* Try the JDBC store:
http://www.opensymphony.com/compass/versions/1.1M3/html/core-connection.html#core-connection-jdbc

* If you are ok with WEB1 & WEB2 being slightly out of sync for new
content, you could use them normally and periodically call index() on
the hibernate GPS device.  This will synchronize whatever is stored in
hibernate with the lucene index.


Re: XML querying

2007-01-15 Thread Thorsten Scherler
On Mon, 2007-01-15 at 12:23 +, Luis Neves wrote:
 Hello.
 What I do now to index XML documents is to use a Filter to strip the
 markup; this works, but it's impossible to know where in the document
 the match is located.
 What would it take to make it possible to specify a filter query that
 accepts xpath expressions?... something like:
 
 fq=xmlField:/book/content/text()
 
 This way only the /book/content/ element would be searched.
 
 Did I make sense? Is this possible?

AFAIK short answer: no.

The field is ALWAYS plain text. There is no xmlField type.

...but why don't you just add your text to multiple fields when indexing?

Instead of plainly stripping the markup, apply the above xpath to your
document and create different fields. Like:
<field name="content"><xsl:value-of select="/book/content/text()"/></field>
<field name="more"><xsl:value-of select="/book/more/text()"/></field>
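
and declare the corresponding fields in schema.xml, roughly like this
(assuming the text field type from the example schema):

  <field name="content" type="text" indexed="true" stored="true"/>
  <field name="more"    type="text" indexed="true" stored="true"/>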

Makes sense?

HTH

salu2

 
 --
 Luis Neves



Re: XML querying

2007-01-15 Thread Luis Neves


Hi!

Thorsten Scherler wrote:


On Mon, 2007-01-15 at 12:23 +, Luis Neves wrote:

Hello.
What I do now to index XML documents is to use a Filter to strip the markup;
this works, but it's impossible to know where in the document the match is located.
What would it take to make it possible to specify a filter query that accepts xpath
expressions?... something like:


fq=xmlField:/book/content/text()

This way only the /book/content/ element would be searched.

Did I make sense? Is this possible?


AFAIK short answer: no.

The field is ALWAYS plain text. There is no xmlField type.

...but why don't you just add your text to multiple fields when indexing?

Instead of plainly stripping the markup, apply the above xpath to your
document and create different fields. Like:
<field name="content"><xsl:value-of select="/book/content/text()"/></field>
<field name="more"><xsl:value-of select="/book/more/text()"/></field>

Makes sense?


Yes, but I have documents with different schemas in the same xml field; also,
that way I would have to know the schema of the documents being indexed (which
I don't).


The schema I use is something like:
<field name="DocumentType" type="string" indexed="true" stored="true"/>
<field name="Document" type="text" indexed="true" stored="true"/>

Where each distinct DocumentType has its own schema.

I could revise this approach to use a Solr instance for each DocumentType, but I
would have to find a way to merge results from the different instances, because
I also need to search across different DocumentTypes... I guess I'm SOL :-(



--
Luis Neves


Re: One item, multiple fields, and range queries

2007-01-15 Thread Jeff Rodenburg

Thanks Hoss.  Interesting approach, but the N bound could be well into the
hundreds, and N would be variable (some maximum number, but
different across events).

I've not yet used dynamic fields in this manner.  With that number range,
what limitations could I encounter?  Given the size of that, I would need
the solr engine to formulate that query, correct?  I can't imagine I could
pass that entire subquery statement in the http request, as the character
limit would likely be exceeded.

Some of my comments may not make sense, so I'll check into dynamic fields
and such in the meantime.

thanks,
j


On 1/14/07, Chris Hostetter [EMAIL PROTECTED] wrote:



: 2) use multivalued fields as correlated vectors, so the first start
: date corresponds
:to the first end date corresponds to the first lat and long value.
: You get them all back
:in a query though, so your app would need to do extra work to sort
: out which matched.

if you expect a bounded number of correlated events per item, you can
use dynamic fields, and build up N correlated subqueries where N is the
upper bound on the number of events you expect any item to have, i.e...

  (+lat1:[x TO y] +lon1:[w TO z] +time1:[a TO b])
   OR (+lat2:[x TO y] +lon2:[w TO z] +time2:[a TO b])
   OR (+lat3:[x TO y] +lon3:[w TO z] +time3:[a TO b])
   ...
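
the matching schema.xml declarations might look roughly like this (purely
illustrative -- the sortable numeric types here are the ones from the
example schema, pick whatever fits your data):

  <dynamicField name="lat*"  type="sfloat" indexed="true" stored="false" omitNorms="true"/>
  <dynamicField name="lon*"  type="sfloat" indexed="true" stored="false" omitNorms="true"/>
  <dynamicField name="time*" type="slong"  indexed="true" stored="false" omitNorms="true"/>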




-Hoss




Re: One item, multiple fields, and range queries

2007-01-15 Thread Chris Hostetter

: I've not yet used dynamic fields in this manner.  With that number range,
: what limitations could I encounter?  Given the size of that, I would need

very little. yonik recently listed the costs of dynamic fields...
http://www.nabble.com/Searching-multiple-indices-%28solr-newbie%29-tf2903899.html#a8245621
...as he points out, with omitNorms=true you can have thousands of
dynamic fields and not even notice.

: the solr engine to formulate that query, correct?  I can't imagine I could
: pass that entire subquery statement in the http request, as the character
: limit would likely be exceeded.

yeah ... if you wanted to try the approach i described, and your N
wasn't a single digit number, i would recommend putting the query
building code into a custom RequestHandler ... it could even inspect the
list of field names from the IndexReader and know exactly how big N is at
any given moment.  i have no idea how efficient this approach would be if
N really does get up into the hundreds.


A completely different approach you could take if you want to get into
Lucene Query internals would be to take advantage of something Doug
mentioned once that has stayed in the back of my mind for almost a year
now:  PhraseQuery artificially enforces that the Terms you add to it are
in the same field ... you could easily write a PhraseQuery-ish query that
takes Terms from different fields, and ensures that they appear near
each other in terms of their token sequence -- the context of that comment
was searching for instances of words with specific usage (i.e. house used
as a noun) by putting the usage type of each term in a separate parallel
field, but with identical token positions.

if you forget for a moment about the ranges you need to do, and imagine
instead that you store the quadrant number and hour of day for each
event, where e1q is the quadrant of event1 for an item, and e1h is the
hour of the day that event1 happened at, then for an item with multiple
events you could index the field/terms lists
quadrant:  e1q   e2q   e3q
hour:  e1h   e2h   e3h

and query for your input quadrant at a term position equal to the term
position of your input hour.
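
as a purely made-up illustration (field names invented; assumes a
whitespace-tokenized field type so the token positions line up), a doc
with three events might be indexed as:

  <add>
    <doc>
      <field name="id">item42</field>
      <field name="quadrant">3 1 4</field>
      <field name="hour">09 14 22</field>
    </doc>
  </add>

so event2 shows up as quadrant 1 at position 2 in one field, and hour 14
at position 2 in the other.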

if you got *that* working, you could conceivably change the query to take
in a range for each field -- using TermEnum to get the list of all
latitude Terms in your latitude range, then for each of those Terms get
the list of documents and the term position within that document, and then
look for the longitude terms in the same relative term position which are
in your longitude range, and time terms in the same relative term position
in your time range.

does that make any sense?

this is all purely theoretical, it just seems like it *should* be
possible, but i haven't thought through how it would be implemented.  if
you actually wanted to tackle it, i would start a discussion on
[EMAIL PROTECTED] first, so people smarter than me can tell you if i'm
smoking crack or not.

-Hoss



Re: Faceting question...

2007-01-15 Thread Chris Hostetter

: <fieldtype name="string" class="solr.StrField" sortMissingLast="true"
:   omitNorms="true"/>
: <field name="child_catname" type="text" indexed="true" stored="true"/>

your child_catname isn't using string as its field type -- it's using
text (which is most likely using TextField and being tokenized).
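
presumably the fix is just to declare it with the string type, something
like:

  <field name="child_catname" type="string" indexed="true" stored="true"/>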





-Hoss



Re: Faceting question...

2007-01-15 Thread escher2k

Thanks Chris. DUMB of me not to have noticed.


Chris Hostetter wrote:
 
 
 : <fieldtype name="string" class="solr.StrField" sortMissingLast="true"
 :   omitNorms="true"/>
 : <field name="child_catname" type="text" indexed="true" stored="true"/>
 
 your child_catname isn't using string as its field type -- it's using
 text (which is most likely using TextField and being tokenized).
 
 
 
 
 
 -Hoss
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Faceting-question...-tf3016974.html#a8379221
Sent from the Solr - User mailing list archive at Nabble.com.



Apostrophes in fields

2007-01-15 Thread Nick Jenkin

Hi
This is probably more of a lucene question, but:
I have an author field.

If I query author:Shelley Ohara - no results are returned.
If I query author:Shelley O'hara - many results are returned.

Is it possible to get solr to ignore apostrophes in queries like the one above?

e.g.:
<doc>
 <arr name="author"><str>Shelley O'Hara</str></arr>
 <bool name="available">true</bool>
 <str name="description">long description</str>
 <str name="ean">9780764559747</str>
 <str name="format">Paperback</str>
 <str name="publisher">IDGP</str>
 <str name="title">Kierkegaard Within Your Grasp</str>
 <str name="year">2004</str>
</doc>
Thanks
--
- Nick


Re: document support for file system crawling

2007-01-15 Thread Chris Hostetter

: In that respect I agree with the original posting that Solr lacks
: functionality with respect to desired functionality. One can argue that
: more or less random data should be structured by the user writing a
: decent application. However a more easy to use and configurable plugin
: architecture for different filtering and document parsing could make
: Solr more attractive. I think that many potential users would welcome
: such additions.

i don't think you'll get any argument about the benefits of supporting
more plugins to handle updates - both in terms of how the data is
expressed, and how the data is fetched; in fact, you'll find some rather
involved discussions on that very topic going on on the solr-dev list
right now.

the thread you cite was specifically asking about:
  a) crawling a filesystem
  b) detecting document types and indexing text portions accordingly.

I honestly can't imagine either of those things being supported out of the
box by Solr -- there's just no reason for Solr to duplicate what Nutch
already does very well.

What i see being far more likely are:

1) more documentation (and possibly some locking configuration options) on
how you can use Solr to access an index generated by the nutch crawler (i
think Thorsten has already done this) or by Compass, or any other system
that builds a Lucene index.

2) contrib code that runs as its own process to crawl documents and
send them to a Solr server. (maybe it parses them, or maybe it relies on
the next item...)

3) Stock update plugins that can each read a raw inputstream of some
widely used file format (PDF, RDF, HTML, XML of any schema) and have
configuration options telling them what fields in the schema each
part of their document type should go in.

4) easy hooks for people to write their own update plugins for less widely
used file formats.


-Hoss



Re: Trouble with data type in schema

2007-01-15 Thread Mike Klaas

On 1/15/07, Phil Rosen [EMAIL PROTECTED] wrote:

I am trying to construct a data type that, given the content ID-111, would
match on either ID or 111.

Text and string won't do this, any suggestions?


The text field as defined by the Solr example's schema.xml should
achieve this effect.  Have you looked at the analysis portion of the
solr admin ui (with 'verbose' checked) to investigate how your
strings are being tokenized?
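
For reference, the relevant bits of the example schema's text type are
roughly along these lines (simplified from memory -- check your own
schema.xml for the real definition):

  <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- WordDelimiterFilter is what splits ID-111 into ID and 111 -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" generateNumberParts="1"
              catenateWords="1" catenateNumbers="1" catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldtype>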

regards,
-Mike


separate log files

2007-01-15 Thread Ben Incani
Hi Solr users,

I'm running multiple instances of Solr, which all load from the same war
file.

Below is an example of the servlet context file used for each
application.

<Context path="/app1-solr" docBase="/var/usr/solr/solr-1.0.war"
  debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
    value="/var/local/app1" override="true" />
</Context>

Hence each application is using the same
WEB-INF/classes/logging.properties file to configure logging.

I would like each instance to log to separate log files, such as:
app1-solr.yyyy-mm-dd.log
app2-solr.yyyy-mm-dd.log
...

Is there an easy way to append the context path to
org.apache.juli.FileHandler.prefix?
E.g. 
org.apache.juli.FileHandler.prefix = ${catalina.context}-solr.
 
Or would this require a code change?

Regards

-Ben